
Node Details
- Name: playwrightWebScraper
- Type: Document
- Category: Document Loaders
- Version: 1.0
Input Parameters
- URL (required)
  - Type: string
  - Description: The URL of the webpage to scrape.
- Text Splitter (optional)
  - Type: TextSplitter
  - Description: A text splitter to process the scraped content.
- Get Relative Links Method (optional)
  - Type: options
  - Options:
    - Web Crawl: Crawl relative links from HTML URL
    - Scrape XML Sitemap: Scrape relative links from XML sitemap URL
  - Default: Web Crawl
  - Description: Method to retrieve relative links for multi-page scraping.
- Get Relative Links Limit (optional)
  - Type: number
  - Default: 10
  - Description: Limit the number of relative links to scrape. Set to 0 for no limit.
- Wait Until (optional)
  - Type: options
  - Options:
    - Load: Wait until the load event is fired.
    - DOM Content Loaded: Wait until the DOMContentLoaded event is fired.
    - Network Idle: Wait until there are no network connections for at least 500 ms.
    - Commit: Wait until the network response is received and the document has started loading.
  - Description: Specifies when to consider page navigation finished (see the sketch after this list).
- Wait for selector to load (optional)
  - Type: string
  - Description: CSS selector to wait for before scraping (e.g., ".div" or "#div").
- Additional Metadata (optional)
  - Type: json
  - Description: Additional metadata to be added to the extracted documents.
- Omit Metadata Keys (optional)
  - Type: string
  - Description: Comma-separated list of metadata keys to omit from the output. Use "*" to omit all default metadata.
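To make the parameter list concrete, here is a hypothetical inputs object. The field names (url, waitUntilGoToOption, and so on) are illustrative guesses rather than the node's documented internal keys, but the waitUntil values do match Playwright's own page.goto() options.

```typescript
// Hypothetical input shape mirroring the parameters above; the node's
// actual internal field names may differ.
interface PlaywrightScraperInputs {
  url: string;                                   // required
  textSplitter?: unknown;                        // optional TextSplitter instance
  relativeLinksMethod?: 'webCrawl' | 'scrapeXMLSitemap';
  relativeLinksLimit?: number;                   // 0 = no limit
  waitUntilGoToOption?: 'load' | 'domcontentloaded' | 'networkidle' | 'commit';
  waitForSelector?: string;                      // e.g. ".content" or "#main"
  metadata?: Record<string, unknown>;            // Additional Metadata
  omitMetadataKeys?: string;                     // comma-separated, or "*"
}

const inputs: PlaywrightScraperInputs = {
  url: 'https://example.com',
  relativeLinksMethod: 'webCrawl',
  relativeLinksLimit: 10,
  waitUntilGoToOption: 'networkidle',
  waitForSelector: '#main',
  metadata: { project: 'demo' },
  omitMetadataKeys: 'loc,changefreq',
};
```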
Output
The node outputs an array of IDocument objects, each representing a scraped webpage. These documents contain the page content and associated metadata.
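For orientation, here is a minimal sketch of one such document, assuming the langchain-style pageContent/metadata layout that Flowise document loaders generally follow; the concrete metadata key shown is illustrative.

```typescript
// Assumed document shape; the real IDocument interface may carry more fields.
interface IDocument {
  pageContent: string;                // extracted text of the scraped page
  metadata: Record<string, unknown>;  // e.g. { source: 'https://example.com' }
}
```
Functionality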
- Validates the input URL.
- Sets up Playwright with specified options (headless mode, wait conditions, etc.).
- Scrapes the specified URL or multiple URLs if crawling is enabled.
- Processes the scraped content with a text splitter if provided.
- Adds or modifies metadata as specified in the inputs.
- Returns the processed documents (see the sketch below).
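A condensed sketch of that flow, using Playwright's chromium API together with the hypothetical PlaywrightScraperInputs and IDocument shapes from the examples above; the node's actual implementation adds link crawling, text splitting, and fuller error handling.

```typescript
import { chromium } from 'playwright';

// Minimal single-page version of the steps listed above. Not the node's
// actual code: input names and metadata handling are assumptions.
async function scrape(inputs: PlaywrightScraperInputs): Promise<IDocument[]> {
  // 1. Validate the input URL (an invalid URL throws a TypeError).
  const target = new URL(inputs.url).toString();

  // 2. Set up Playwright in headless mode.
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();

    // 3. Navigate, honoring the Wait Until and selector options.
    await page.goto(target, { waitUntil: inputs.waitUntilGoToOption ?? 'load' });
    if (inputs.waitForSelector) {
      await page.waitForSelector(inputs.waitForSelector);
    }

    // 4./5. Extract the page text and merge in Additional Metadata.
    const pageContent = await page.innerText('body');
    const metadata: Record<string, unknown> = {
      source: target,               // assumed default metadata key
      ...(inputs.metadata ?? {}),
    };

    // Apply Omit Metadata Keys ("*" drops only the default keys).
    const omit = (inputs.omitMetadataKeys ?? '')
      .split(',')
      .map((k) => k.trim())
      .filter(Boolean);
    if (omit.includes('*')) delete metadata.source;
    else for (const key of omit) delete metadata[key];

    // 6. Return the processed document(s); a text splitter, if supplied,
    // would split this single document into chunks here.
    return [{ pageContent, metadata }];
  } finally {
    await browser.close();
  }
}
```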
Use Cases
- Web content extraction for analysis or processing
- Building training datasets from web content
- Automating data collection from websites
- Creating web archives or snapshots
Notes
- The node uses a headless browser, making it suitable for scraping dynamic, JavaScript-rendered content.
- It respects robots.txt by default and should be used responsibly.
- Error handling and logging are implemented for debugging purposes.