
Node Details
- Name: playwrightWebScraper
- Type: Document
- Category: Document Loaders
- Version: 1.0
Input Parameters
- URL (required)
  - Type: string
  - Description: The URL of the webpage to scrape.
- Text Splitter (optional)
  - Type: TextSplitter
  - Description: A text splitter to process the scraped content.
- Get Relative Links Method (optional)
  - Type: options
  - Options:
    - Web Crawl: Crawl relative links from HTML URL
    - Scrape XML Sitemap: Scrape relative links from XML sitemap URL
  - Default: Web Crawl
  - Description: Method to retrieve relative links for multi-page scraping.
- Get Relative Links Limit (optional)
  - Type: number
  - Default: 10
  - Description: Limit the number of relative links to scrape. Set to 0 for no limit.
- Wait Until (optional)
  - Type: options
  - Options:
    - Load: Wait until the load event is fired.
    - DOM Content Loaded: Wait until the DOMContentLoaded event is fired.
    - Network Idle: Wait until there are no network connections for at least 500 ms.
    - Commit: Wait until the network response is received and the document has started loading.
  - Description: Specifies when to consider page navigation finished (see the sketch after this list).
- Wait for selector to load (optional)
  - Type: string
  - Description: CSS selector to wait for before scraping (e.g., ".div" or "#div").
- Additional Metadata (optional)
  - Type: json
  - Description: Additional metadata to be added to the extracted documents.
- Omit Metadata Keys (optional)
  - Type: string
  - Description: Comma-separated list of metadata keys to omit from the output. Use "*" to omit all default metadata.
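To make the parameter list concrete, here is a hypothetical inputs object. The field names (url, waitUntilGoToOption, and so on) are illustrative guesses rather than the node's documented internal keys, but the waitUntil values do match Playwright's own page.goto() options.

```typescript
// Hypothetical input shape mirroring the parameters above; the node's
// actual internal field names may differ.
interface PlaywrightScraperInputs {
  url: string;                                   // required
  textSplitter?: unknown;                        // optional TextSplitter instance
  relativeLinksMethod?: 'webCrawl' | 'scrapeXMLSitemap';
  relativeLinksLimit?: number;                   // 0 = no limit
  waitUntilGoToOption?: 'load' | 'domcontentloaded' | 'networkidle' | 'commit';
  waitForSelector?: string;                      // e.g. ".content" or "#main"
  metadata?: Record<string, unknown>;            // Additional Metadata
  omitMetadataKeys?: string;                     // comma-separated, or "*"
}

const inputs: PlaywrightScraperInputs = {
  url: 'https://example.com',
  relativeLinksMethod: 'webCrawl',
  relativeLinksLimit: 10,
  waitUntilGoToOption: 'networkidle',
  waitForSelector: '#main',
  metadata: { project: 'demo' },
  omitMetadataKeys: 'loc,changefreq',
};
```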
Output
The node outputs an array of IDocument objects, each representing a scraped webpage. These documents contain the page content and associated metadata.
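For orientation, here is a minimal sketch of one such document, assuming the langchain-style pageContent/metadata layout that Flowise document loaders generally follow; the concrete metadata key shown is illustrative.

```typescript
// Assumed document shape; the real IDocument interface may carry more fields.
interface IDocument {
  pageContent: string;                // extracted text of the scraped page
  metadata: Record<string, unknown>;  // e.g. { source: 'https://example.com' }
}
```
Functionality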
- Validates the input URL.
- Sets up Playwright with specified options (headless mode, wait conditions, etc.).
- Scrapes the specified URL or multiple URLs if crawling is enabled.
- Processes the scraped content with a text splitter if provided.
- Adds or modifies metadata as specified in the inputs.
- Returns the processed documents (see the sketch below).
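A condensed sketch of that flow, using Playwright's chromium API together with the hypothetical PlaywrightScraperInputs and IDocument shapes from the examples above; the node's actual implementation adds link crawling, text splitting, and fuller error handling.

```typescript
import { chromium } from 'playwright';

// Minimal single-page version of the steps listed above. Not the node's
// actual code: input names and metadata handling are assumptions.
async function scrape(inputs: PlaywrightScraperInputs): Promise<IDocument[]> {
  // 1. Validate the input URL (an invalid URL throws a TypeError).
  const target = new URL(inputs.url).toString();

  // 2. Set up Playwright in headless mode.
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();

    // 3. Navigate, honoring the Wait Until and selector options.
    await page.goto(target, { waitUntil: inputs.waitUntilGoToOption ?? 'load' });
    if (inputs.waitForSelector) {
      await page.waitForSelector(inputs.waitForSelector);
    }

    // 4./5. Extract the page text and merge in Additional Metadata.
    const pageContent = await page.innerText('body');
    const metadata: Record<string, unknown> = {
      source: target,               // assumed default metadata key
      ...(inputs.metadata ?? {}),
    };

    // Apply Omit Metadata Keys ("*" drops only the default keys).
    const omit = (inputs.omitMetadataKeys ?? '')
      .split(',')
      .map((k) => k.trim())
      .filter(Boolean);
    if (omit.includes('*')) delete metadata.source;
    else for (const key of omit) delete metadata[key];

    // 6. Return the processed document(s); a text splitter, if supplied,
    // would split this single document into chunks here.
    return [{ pageContent, metadata }];
  } finally {
    await browser.close();
  }
}
```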
Use Cases
- Web content extraction for analysis or processing
- Building training datasets from web content
- Automating data collection from websites
- Creating web archives or snapshots
Notes
- The node uses a headless browser, making it suitable for scraping dynamic, JavaScript-rendered content.
- It respects robots.txt by default and should be used responsibly.
- Error handling and logging are implemented for debugging purposes.