
Node Details
- Name: puppeteerWebScraper
- Type: Document
- Category: Document Loaders
- Version: 1.0
Input Parameters
- URL (required)
  - Type: string
  - Description: The URL of the webpage to scrape.
- Text Splitter (optional)
  - Type: TextSplitter
  - Description: A text splitter to process the scraped content.
- Get Relative Links Method (optional)
  - Type: options
  - Options:
    - Web Crawl: Crawl relative links from an HTML URL
    - Scrape XML Sitemap: Scrape relative links from an XML sitemap URL
  - Default: Web Crawl
- Get Relative Links Limit (optional)
  - Type: number
  - Default: 10
  - Description: Limit the number of relative links to retrieve. Set to 0 for no limit.
- Wait Until (optional; see the sketch after this list)
  - Type: options
  - Options:
    - Load: Navigation is finished when the load event fires, i.e., the page and its resources (stylesheets, images) have loaded
    - DOM Content Loaded: Navigation is finished when the DOMContentLoaded event fires, i.e., the initial HTML document has been loaded and parsed, without waiting for stylesheets or images
    - Network Idle 0: Navigation is finished when there are no more than 0 network connections for at least 500 ms
    - Network Idle 2: Navigation is finished when there are no more than 2 network connections for at least 500 ms
- Wait for selector to load (optional)
  - Type: string
  - Description: CSS selector to wait for before scraping (e.g., ".div" or "#div")
- Additional Metadata (optional)
  - Type: json
  - Description: Additional metadata to be added to the extracted documents
- Omit Metadata Keys (optional)
  - Type: string
  - Description: Comma-separated list of metadata keys to omit from the final document. Use "*" to omit all default metadata.
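How the navigation options map onto Puppeteer is easiest to see in code. A minimal sketch, assuming puppeteer is installed; scrapePage is a hypothetical helper for illustration, not the node's actual API:

```ts
import puppeteer, { PuppeteerLifeCycleEvent } from 'puppeteer';

// Hypothetical helper showing how "Wait Until" and "Wait for selector to load"
// translate into Puppeteer calls. The node's internal implementation may differ.
async function scrapePage(
  url: string,
  waitUntil: PuppeteerLifeCycleEvent = 'load', // 'load' | 'domcontentloaded' | 'networkidle0' | 'networkidle2'
  selector?: string
): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil });
    if (selector) {
      // Block until an element matching the CSS selector appears in the DOM
      await page.waitForSelector(selector);
    }
    return await page.content(); // Fully rendered HTML
  } finally {
    await browser.close();
  }
}
```

As a rule of thumb, the network-idle options suit pages that fetch content after the initial load, while Load is faster for static pages.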
Output
The node outputs an array of IDocument objects, each representing a scraped webpage with its content and metadata.
Functionality
- Validates the input URL
- Configures Puppeteer options based on input parameters
- Scrapes the specified URL(s) using Puppeteer
- Processes the scraped content with a text splitter if provided
- Handles relative link crawling if specified
- Applies additional metadata and omits specified metadata keys
- Returns the processed documents (a sketch of this pipeline follows the list)
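Taken together, the steps above amount to a small pipeline. A sketch under stated assumptions: DocumentLike stands in for IDocument, SplitterLike for the text splitter interface, and scrapePage is the hypothetical helper from the Input Parameters section.

```ts
// Hypothetical outline of the node's pipeline; DocumentLike, SplitterLike and
// processUrl are illustrative names, not the node's actual API.
interface DocumentLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface SplitterLike {
  splitText(text: string): Promise<string[]>;
}

async function processUrl(
  url: string,
  splitter?: SplitterLike,                          // "Text Splitter"
  additionalMetadata: Record<string, unknown> = {}, // "Additional Metadata"
  omitKeys: string[] = []                           // "Omit Metadata Keys"
): Promise<DocumentLike[]> {
  new URL(url); // Validate the input URL; throws on a malformed URL

  const html = await scrapePage(url); // Sketched under Input Parameters
  const chunks = splitter ? await splitter.splitText(html) : [html];

  return chunks.map((pageContent) => {
    const metadata: Record<string, unknown> = omitKeys.includes('*')
      ? { ...additionalMetadata }                 // "*" drops all default metadata
      : { source: url, ...additionalMetadata };   // source is an assumed default key
    for (const key of omitKeys) delete metadata[key];
    return { pageContent, metadata };
  });
}
```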
Use Cases
- Web content extraction for analysis or processing
- Building training datasets from web content
- Automating web research tasks
- Creating web archives or snapshots
Notes
- The node uses headless Chrome for scraping, which allows it to handle JavaScript-rendered content.
- Be mindful of the website’s terms of service and robots.txt file when using this scraper.
- Large-scale scraping may require additional considerations for rate limiting and respecting server resources.
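To make the rate-limiting note concrete, one simple approach is to scrape sequentially with a fixed pause between requests. A minimal throttling sketch reusing the hypothetical scrapePage helper above; the 1000 ms interval is an assumption to tune per site:

```ts
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Scrape a batch of URLs one at a time with a pause between requests,
// rather than opening parallel connections against the same server.
async function scrapeAll(urls: string[], delayMs = 1000): Promise<string[]> {
  const results: string[] = [];
  for (const url of urls) {
    results.push(await scrapePage(url));
    await sleep(delayMs); // Throttle between requests
  }
  return results;
}
```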