Node Details

  • Name: puppeteerWebScraper
  • Type: Document
  • Category: Document Loaders
  • Version: 1.0

Input Parameters

  1. URL (required)

    • Type: string
    • Description: The URL of the webpage to scrape.
  2. Text Splitter (optional)

    • Type: TextSplitter
    • Description: A text splitter to process the scraped content.
  3. Get Relative Links Method (optional)

    • Type: options
    • Options:
      • Web Crawl: Crawl relative links from HTML URL
      • Scrape XML Sitemap: Scrape relative links from XML sitemap URL
    • Default: Web Crawl
  4. Get Relative Links Limit (optional)

    • Type: number
    • Default: 10
    • Description: Limit the number of relative links to retrieve. Set to 0 for no limit.
  5. Wait Until (optional)

    • Type: options
    • Options:
      • Load: When the load event fires, i.e. the page and its resources (stylesheets, images, etc.) have finished loading
      • DOM Content Loaded: When the DOMContentLoaded event fires, i.e. the initial HTML document has been loaded and parsed, without waiting for stylesheets and images
      • Network Idle 0: Navigation is finished when there are no more than 0 network connections for at least 500 ms
      • Network Idle 2: Navigation is finished when there are no more than 2 network connections for at least 500 ms
  6. Wait for selector to load (optional)

    • Type: string
    • Description: CSS selector to wait for before scraping (e.g., “.div” for a class or “#div” for an ID)
  7. Additional Metadata (optional)

    • Type: json
    • Description: Additional metadata to be added to the extracted documents
  8. Omit Metadata Keys (optional)

    • Type: string
    • Description: Comma-separated list of metadata keys to omit from the final document. Use “*” to omit all default metadata.
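The Wait Until options correspond to the lifecycle-event strings that Puppeteer’s page.goto() accepts. A minimal sketch of that mapping (the waitUntilMap object is illustrative, not the node’s actual code):

```javascript
// Sketch: map the node's "Wait Until" option labels to Puppeteer's
// waitUntil values. The right-hand strings are the lifecycle events
// page.goto() accepts.
const waitUntilMap = {
  "Load": "load",
  "DOM Content Loaded": "domcontentloaded",
  "Network Idle 0": "networkidle0",
  "Network Idle 2": "networkidle2",
};

// Illustrative usage inside a scraper:
//   await page.goto(url, { waitUntil: waitUntilMap["Network Idle 2"] });
console.log(waitUntilMap["Network Idle 0"]); // "networkidle0"
```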
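How Additional Metadata and Omit Metadata Keys could combine can be sketched as below. The buildMetadata helper is hypothetical, not the node’s actual implementation; it only illustrates the comma-separated list and the “*” wildcard described above:

```javascript
// Sketch: merge default metadata with Additional Metadata, then drop the
// keys named in Omit Metadata Keys. `buildMetadata` is a hypothetical helper.
function buildMetadata(defaultMetadata, additionalMetadata, omitKeys) {
  const keysToOmit = (omitKeys || "").split(",").map((k) => k.trim());
  // "*" drops all default metadata, keeping only the additional metadata.
  const base = keysToOmit.includes("*") ? {} : { ...defaultMetadata };
  for (const key of keysToOmit) delete base[key];
  return { ...base, ...additionalMetadata };
}

const meta = buildMetadata(
  { source: "https://example.com", title: "Example" }, // default metadata
  { project: "research" },                             // Additional Metadata
  "title"                                              // Omit Metadata Keys
);
console.log(meta); // { source: "https://example.com", project: "research" }
```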

Output

The node outputs an array of IDocument objects, each representing a scraped webpage with its content and metadata.
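Assuming IDocument follows the common LangChain-style document convention of a pageContent string plus a metadata object, one element of the output array might look like this sketch:

```javascript
// Illustrative shape of a single output document. Field names assume the
// LangChain-style pageContent + metadata convention; treat this as a sketch,
// not the node's exact interface.
const doc = {
  pageContent: "Scraped page text...",
  metadata: {
    source: "https://example.com",
  },
};

console.log(Object.keys(doc)); // [ "pageContent", "metadata" ]
```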

Functionality

  1. Validates the input URL
  2. Configures Puppeteer options based on input parameters
  3. Scrapes the specified URL(s) using Puppeteer
  4. Processes the scraped content with a text splitter if provided
  5. Handles relative link crawling if specified
  6. Applies additional metadata and omits specified metadata keys
  7. Returns the processed documents
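Steps 1, 4, and 5 above can be sketched with pure helpers; step 3 is where the real node drives headless Chrome via Puppeteer. All three function names here are hypothetical stand-ins, and the fixed-size splitter is deliberately simpler than a real text splitter:

```javascript
// Step 1: validate the input URL (Node's global URL constructor throws
// on malformed input).
function isValidUrl(url) {
  try { new URL(url); return true; } catch { return false; }
}

// Step 5: apply the relative-links limit, where 0 means "no limit".
function limitLinks(links, limit) {
  return limit === 0 ? links : links.slice(0, limit);
}

// Step 4: a minimal fixed-size splitter; real text splitters also handle
// overlap and natural boundaries.
function splitText(text, chunkSize) {
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

console.log(isValidUrl("https://example.com")); // true
console.log(limitLinks(["/a", "/b", "/c"], 2)); // [ "/a", "/b" ]
console.log(splitText("abcdefghij", 4));        // [ "abcd", "efgh", "ij" ]
```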

Use Cases

  • Web content extraction for analysis or processing
  • Building training datasets from web content
  • Automating web research tasks
  • Creating web archives or snapshots

Notes

  • The node uses headless Chrome for scraping, which allows it to handle JavaScript-rendered content.
  • Be mindful of the website’s terms of service and robots.txt file when using this scraper.
  • Large-scale scraping may require additional considerations for rate limiting and respecting server resources.