Node Details

  • Name: puppeteerWebScraper
  • Type: Document
  • Category: Document Loaders
  • Version: 1.0

Input Parameters

  1. URL (required)

    • Type: string
    • Description: The URL of the webpage to scrape.
  2. Text Splitter (optional)

    • Type: TextSplitter
    • Description: A text splitter to process the scraped content.
  3. Get Relative Links Method (optional)

    • Type: options
    • Options:
      • Web Crawl: Crawl relative links from HTML URL
      • Scrape XML Sitemap: Scrape relative links from XML sitemap URL
    • Default: Web Crawl
  4. Get Relative Links Limit (optional)

    • Type: number
    • Default: 10
    • Description: Limit the number of relative links to retrieve. Set to 0 for no limit.
  5. Wait Until (optional)

    • Type: options
    • Options:
      • Load: When the load event fires, i.e. the page and its resources (stylesheets, images, etc.) have finished loading
      • DOM Content Loaded: When the DOMContentLoaded event fires, i.e. the initial HTML document has been loaded and parsed, without waiting for stylesheets and images
      • Network Idle 0: Navigation is finished when there are no more than 0 network connections for at least 500 ms
      • Network Idle 2: Navigation is finished when there are no more than 2 network connections for at least 500 ms
  6. Wait for selector to load (optional)

    • Type: string
    • Description: CSS selector to wait for before scraping (e.g., “.div” for a class or “#div” for an ID)
  7. Additional Metadata (optional)

    • Type: json
    • Description: Additional metadata to be added to the extracted documents
  8. Omit Metadata Keys (optional)

    • Type: string
    • Description: Comma-separated list of metadata keys to omit from the final document. Use “*” to omit all default metadata.
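The Wait Until options correspond to the lifecycle-event strings that Puppeteer’s page.goto() accepts. A minimal sketch of that mapping (the waitUntilMap object is illustrative, not the node’s actual code):

```javascript
// Sketch: map the node's "Wait Until" option labels to Puppeteer's
// waitUntil values. The right-hand strings are the lifecycle events
// page.goto() accepts.
const waitUntilMap = {
  "Load": "load",
  "DOM Content Loaded": "domcontentloaded",
  "Network Idle 0": "networkidle0",
  "Network Idle 2": "networkidle2",
};

// Illustrative usage inside a scraper:
//   await page.goto(url, { waitUntil: waitUntilMap["Network Idle 2"] });
console.log(waitUntilMap["Network Idle 0"]); // "networkidle0"
```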
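How Additional Metadata and Omit Metadata Keys could combine can be sketched as below. The buildMetadata helper is hypothetical, not the node’s actual implementation; it only illustrates the comma-separated list and the “*” wildcard described above:

```javascript
// Sketch: merge default metadata with Additional Metadata, then drop the
// keys named in Omit Metadata Keys. `buildMetadata` is a hypothetical helper.
function buildMetadata(defaultMetadata, additionalMetadata, omitKeys) {
  const keysToOmit = (omitKeys || "").split(",").map((k) => k.trim());
  // "*" drops all default metadata, keeping only the additional metadata.
  const base = keysToOmit.includes("*") ? {} : { ...defaultMetadata };
  for (const key of keysToOmit) delete base[key];
  return { ...base, ...additionalMetadata };
}

const meta = buildMetadata(
  { source: "https://example.com", title: "Example" }, // default metadata
  { project: "research" },                             // Additional Metadata
  "title"                                              // Omit Metadata Keys
);
console.log(meta); // { source: "https://example.com", project: "research" }
```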

Output

The node outputs an array of IDocument objects, each representing a scraped webpage with its content and metadata.
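Assuming IDocument follows the common LangChain-style document convention of a pageContent string plus a metadata object, one element of the output array might look like this sketch:

```javascript
// Illustrative shape of a single output document. Field names assume the
// LangChain-style pageContent + metadata convention; treat this as a sketch,
// not the node's exact interface.
const doc = {
  pageContent: "Scraped page text...",
  metadata: {
    source: "https://example.com",
  },
};

console.log(Object.keys(doc)); // [ "pageContent", "metadata" ]
```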

Functionality

  1. Validates the input URL
  2. Configures Puppeteer options based on input parameters
  3. Scrapes the specified URL(s) using Puppeteer
  4. Processes the scraped content with a text splitter if provided
  5. Handles relative link crawling if specified
  6. Applies additional metadata and omits specified metadata keys
  7. Returns the processed documents
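Steps 1, 4, and 5 above can be sketched with pure helpers; step 3 is where the real node drives headless Chrome via Puppeteer. All three function names here are hypothetical stand-ins, and the fixed-size splitter is deliberately simpler than a real text splitter:

```javascript
// Step 1: validate the input URL (Node's global URL constructor throws
// on malformed input).
function isValidUrl(url) {
  try { new URL(url); return true; } catch { return false; }
}

// Step 5: apply the relative-links limit, where 0 means "no limit".
function limitLinks(links, limit) {
  return limit === 0 ? links : links.slice(0, limit);
}

// Step 4: a minimal fixed-size splitter; real text splitters also handle
// overlap and natural boundaries.
function splitText(text, chunkSize) {
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

console.log(isValidUrl("https://example.com")); // true
console.log(limitLinks(["/a", "/b", "/c"], 2)); // [ "/a", "/b" ]
console.log(splitText("abcdefghij", 4));        // [ "abcd", "efgh", "ij" ]
```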

Use Cases

  • Web content extraction for analysis or processing
  • Building training datasets from web content
  • Automating web research tasks
  • Creating web archives or snapshots

Notes

  • The node uses headless Chrome for scraping, which allows it to handle JavaScript-rendered content.
  • Be mindful of the website’s terms of service and robots.txt file when using this scraper.
  • Large-scale scraping may require additional considerations for rate limiting and respecting server resources.