
Node Details
- Name: cheerioWebScraper
- Type: Document
- Version: 1.1
- Category: Document Loaders
Input Parameters
- URL (required)
  - Type: string
  - Description: The URL of the webpage to scrape or the starting point for crawling.
- Text Splitter (optional)
  - Type: TextSplitter
  - Description: A text splitter to process the extracted content.
- Get Relative Links Method (optional)
  - Type: options
  - Default: “webCrawl”
  - Options:
    - Web Crawl: Crawl relative links from HTML URL
    - Scrape XML Sitemap: Scrape relative links from XML sitemap URL
  - Description: Method to retrieve relative links for multi-page scraping.
- Get Relative Links Limit (optional)
  - Type: number
  - Default: 10
  - Description: Limits the number of relative links to process. Set to 0 for no limit.
- Selector (CSS) (optional)
  - Type: string
  - Description: CSS selector to target specific content on the page.
- Additional Metadata (optional)
  - Type: json
  - Description: Custom metadata to add to the extracted documents.
- Omit Metadata Keys (optional)
  - Type: string
  - Description: Comma-separated list of metadata keys to omit from the output. Use “*” to omit all default metadata.
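The parameters above can be pictured as a single options object. The following is a minimal sketch, not the node's actual source: the interface name, the field names, and the "scrapeXMLSitemap" identifier are assumptions made for illustration. Only the defaults (“webCrawl” and 10) and the "0 means no limit" rule come from the parameter descriptions.

```typescript
// Hypothetical identifiers for the two documented options. Only "webCrawl"
// appears in the parameter list; "scrapeXMLSitemap" is an assumed name.
type RelativeLinksMethod = "webCrawl" | "scrapeXMLSitemap";

// Hypothetical shape of the node inputs; field names are illustrative.
interface CheerioScraperInputs {
  url: string;                                  // URL (required)
  relativeLinksMethod?: RelativeLinksMethod;    // Get Relative Links Method
  relativeLinksLimit?: number;                  // Get Relative Links Limit
  selector?: string;                            // Selector (CSS)
  additionalMetadata?: Record<string, unknown>; // Additional Metadata
  omitMetadataKeys?: string;                    // Omit Metadata Keys
}

// Fill in the documented defaults and translate a limit of 0 into "no limit".
function resolveOptions(inputs: CheerioScraperInputs) {
  const limit = inputs.relativeLinksLimit ?? 10;
  return {
    ...inputs,
    relativeLinksMethod: inputs.relativeLinksMethod ?? "webCrawl",
    relativeLinksLimit: limit === 0 ? Number.POSITIVE_INFINITY : limit,
  };
}

// Example: crawl from a (placeholder) start URL with no link limit.
const resolved = resolveOptions({
  url: "https://example.com/docs",
  relativeLinksLimit: 0,
  selector: "main article",
});
```

Representing "no limit" as infinity keeps later comparisons such as `processed < limit` free of special cases; a real implementation may handle the zero value differently.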
Functionality
- Validates the input URL.
- Sets up the Cheerio loader with optional CSS selector.
- Handles single-page scraping or multi-page crawling based on the selected method.
- Applies text splitting if a TextSplitter is provided.
- Manages metadata, including adding custom metadata and omitting specified keys.
- Handles errors and provides debug logging.
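Two of the steps above, URL validation and metadata management, can be sketched without any scraping library. This is an illustrative sketch under stated assumptions, not the node's implementation: `ScrapedDocument`, `isValidHttpUrl`, and `applyMetadataOptions` are hypothetical names, and the choice to apply listed omit keys to the merged metadata (defaults plus custom) is an assumption. The “*” behaviour follows the parameter description: it drops all default metadata and keeps the custom metadata.

```typescript
// Hypothetical document shape: extracted text plus a metadata record.
interface ScrapedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Accept only strings that parse as absolute http(s) URLs.
function isValidHttpUrl(candidate: string): boolean {
  try {
    const parsed = new URL(candidate);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false; // new URL() throws on malformed input
  }
}

// Merge custom metadata into a document and drop the requested keys.
function applyMetadataOptions(
  doc: ScrapedDocument,
  additional: Record<string, unknown>,
  omitKeys: string
): ScrapedDocument {
  // Split the comma-separated list and ignore empty entries.
  const keys = omitKeys
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);
  const omitAll = keys.indexOf("*") !== -1;

  // "*" discards every default key; custom metadata is still added.
  const base = omitAll ? {} : { ...doc.metadata };
  const merged: Record<string, unknown> = { ...base, ...additional };
  if (!omitAll) {
    for (const key of keys) {
      delete merged[key]; // assumption: omission applies to the merged record
    }
  }
  return { pageContent: doc.pageContent, metadata: merged };
}

// Example with placeholder values.
const page: ScrapedDocument = {
  pageContent: "Hello",
  metadata: { source: "https://example.com", title: "Home" },
};
const trimmed = applyMetadataOptions(page, { team: "docs" }, "title");
const customOnly = applyMetadataOptions(page, { team: "docs" }, "*");
```

The function returns a new document rather than mutating its input, so the same scraped page can be processed with different metadata settings.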
Output
The node outputs an array of IDocument objects, each representing a scraped page or section, including the extracted content and associated metadata.
Use Cases
- Web content extraction for analysis or processing
- Building datasets from web sources
- Automating data collection from websites
- Creating custom web crawlers for specific domains
Notes
- The node includes error handling for invalid URLs and unsupported file types (e.g., PDFs).
- It respects rate limiting and responsible scraping practices.
- Debug logging is available when the DEBUG environment variable is set to ‘true’.
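
The debug-logging note can be illustrated with a small gate on the DEBUG variable. This is a sketch only: `isDebugEnabled`, `makeDebugLogger`, and the message prefix are hypothetical, and the environment is passed in as a plain object so the example does not depend on any particular runtime. In Node.js one would typically pass `process.env`.

```typescript
// Environment variables modelled as a plain string map.
type Env = Record<string, string | undefined>;

// Logging is on only when DEBUG is exactly the string "true".
function isDebugEnabled(env: Env): boolean {
  return env["DEBUG"] === "true";
}

// Build a logger that forwards messages to `sink` only in debug mode.
function makeDebugLogger(
  env: Env,
  sink: (message: string) => void = console.log
): (message: string) => void {
  const enabled = isDebugEnabled(env);
  return (message: string): void => {
    if (enabled) {
      sink(`[cheerioWebScraper] ${message}`);
    }
  };
}

// Example: collect messages in an array instead of printing them.
const captured: string[] = [];
const debug = makeDebugLogger({ DEBUG: "true" }, (m) => captured.push(m));
debug("scraping https://example.com");

const silent = makeDebugLogger({ DEBUG: "false" }, (m) => captured.push(m));
silent("this message is dropped");
```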