Node Details

  • Name: playwrightWebScraper
  • Type: Document
  • Category: Document Loaders
  • Version: 1.0

Input Parameters

  1. URL (required)

    • Type: string
    • Description: The URL of the webpage to scrape.
  2. Text Splitter (optional)

    • Type: TextSplitter
    • Description: A text splitter to process the scraped content; a usage sketch appears at the end of this page.
  3. Get Relative Links Method (optional)

    • Type: options
    • Options:
      • Web Crawl: Crawl relative links from HTML URL
      • Scrape XML Sitemap: Scrape relative links from XML sitemap URL
    • Default: Web Crawl
    • Description: Method to retrieve relative links for multi-page scraping.
  4. Get Relative Links Limit (optional)

    • Type: number
    • Default: 10
    • Description: Limit the number of relative links to scrape. Set to 0 for no limit.
  5. Wait Until (optional)

    • Type: options
    • Options:
      • Load: Wait until the load event is fired.
      • DOM Content Loaded: Wait until the DOMContentLoaded event is fired.
      • Network Idle: Wait until there are no more connections for at least 500 ms.
      • Commit: Wait until network response is received and the document started loading.
    • Description: Specifies when to consider the page navigation finished.
  6. Wait for selector to load (optional)

    • Type: string
    • Description: CSS selector to wait for before scraping (e.g., “.div” or “#div”).
  7. Additional Metadata (optional)

    • Type: json
    • Description: Additional metadata to be added to the extracted documents.
  8. Omit Metadata Keys (optional)

    • Type: string
    • Description: Comma-separated list of metadata keys to omit from the output. Use “*” to omit all default metadata.
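
For orientation, the sketch below shows how these inputs might be assembled in TypeScript. The interface and property names are illustrative assumptions for this page, not the node's exact internal field names.

    // Hypothetical input shape for the node; all names are assumptions.
    interface PlaywrightScraperInputs {
      url: string;                                            // 1. URL (required)
      textSplitter?: object;                                  // 2. Text Splitter instance
      relativeLinksMethod?: 'webCrawl' | 'scrapeXMLSitemap';  // 3. default: webCrawl
      relativeLinksLimit?: number;                            // 4. default: 10; 0 = no limit
      waitUntil?: 'load' | 'domcontentloaded' | 'networkidle' | 'commit'; // 5.
      waitForSelector?: string;                               // 6. e.g. '.div' or '#div'
      metadata?: Record<string, unknown>;                     // 7. Additional Metadata
      omitMetadataKeys?: string;                              // 8. comma-separated, or '*'
    }

    const inputs: PlaywrightScraperInputs = {
      url: 'https://example.com',
      relativeLinksMethod: 'webCrawl',
      relativeLinksLimit: 10,
      waitUntil: 'networkidle',
    };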

Output

The node outputs an array of IDocument objects, each representing a scraped webpage. These documents contain the page content and associated metadata.
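
In practice, each output document is assumed to follow the familiar LangChain-style shape, roughly:

    // Approximate shape of each output document (assumed to mirror
    // LangChain's Document interface).
    interface IDocument {
      pageContent: string;                 // the scraped page text
      metadata: Record<string, unknown>;   // e.g. { source: 'https://example.com' }
    }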

Functionality

  1. Validates the input URL.
  2. Sets up Playwright with specified options (headless mode, wait conditions, etc.).
  3. Scrapes the specified URL, plus any relative links discovered via the selected method when multi-page scraping is enabled.
  4. Processes the scraped content with a text splitter if provided.
  5. Adds or modifies metadata as specified in the inputs.
  6. Returns the processed documents.
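
Conceptually, steps 1-3 resemble the following sketch, which calls Playwright's public API directly; the node's actual implementation may differ in detail.

    import { chromium } from 'playwright';

    // Simplified single-page scrape: validate, launch headless, navigate, extract.
    async function scrapePage(
      url: string,
      waitUntil: 'load' | 'domcontentloaded' | 'networkidle' | 'commit' = 'networkidle',
      waitForSelector?: string,
    ): Promise<{ pageContent: string; metadata: Record<string, unknown> }> {
      new URL(url); // step 1: throws on an invalid URL

      const browser = await chromium.launch({ headless: true }); // step 2
      try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil }); // step 3: navigate and wait
        if (waitForSelector) {
          await page.waitForSelector(waitForSelector);
        }
        const pageContent = await page.evaluate(() => document.body.innerText);
        return { pageContent, metadata: { source: url } };
      } finally {
        await browser.close();
      }
    }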

Use Cases

  • Web content extraction for analysis or processing
  • Building training datasets from web content
  • Automating data collection from websites
  • Creating web archives or snapshots

Notes

  • The node uses a headless browser, making it suitable for scraping dynamic, JavaScript-rendered content.
  • Scraping may be restricted by a site's robots.txt and terms of service; use the node responsibly.
  • Errors during navigation and scraping are caught and logged to aid debugging.
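
As referenced in the Text Splitter parameter above, a supplied splitter chunks each scraped page before output. Downstream, that is assumed to look something like this sketch using LangChain's RecursiveCharacterTextSplitter; the import paths vary by langchain version, and the chunk sizes are illustrative.

    import { Document } from 'langchain/document';
    import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

    // Assumed post-processing: split scraped pages into overlapping chunks
    // suitable for embedding or indexing. All values here are illustrative.
    async function splitScrapedPages(
      pages: { pageContent: string; metadata: Record<string, unknown> }[],
    ): Promise<Document[]> {
      const docs = pages.map((p) => new Document(p));
      const splitter = new RecursiveCharacterTextSplitter({
        chunkSize: 1000,   // max characters per chunk
        chunkOverlap: 200, // overlap between consecutive chunks
      });
      return splitter.splitDocuments(docs);
    }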