Node Details

  • Name: cheerioWebScraper
  • Type: Document
  • Version: 1.1
  • Category: Document Loaders

Input Parameters

  1. URL (required)

    • Type: string
    • Description: The URL of the webpage to scrape, or the starting point for crawling (see the example configuration after this list).
  2. Text Splitter (optional)

    • Type: TextSplitter
    • Description: A text splitter to process the extracted content.
  3. Get Relative Links Method (optional)

    • Type: options
    • Default: "webCrawl"
    • Options:
      • Web Crawl: Crawl relative links from HTML URL
      • Scrape XML Sitemap: Scrape relative links from XML sitemap URL
    • Description: Method to retrieve relative links for multi-page scraping.
  4. Get Relative Links Limit (optional)

    • Type: number
    • Default: 10
    • Description: Limits the number of relative links to process. Set to 0 for no limit.
  5. Selector (CSS) (optional)

    • Type: string
    • Description: CSS selector to target specific content on the page.
  6. Additional Metadata (optional)

    • Type: json
    • Description: Custom metadata to add to the extracted documents.
  7. Omit Metadata Keys (optional)

    • Type: string
    • Description: Comma-separated list of metadata keys to omit from the output. Use "*" to omit all default metadata.
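
An example configuration for the parameters above is sketched below. The property names (url, relativeLinksMethod, limit, selector, metadata, omitMetadataKeys) are illustrative placeholders, not necessarily the node's internal field names.

```typescript
// Hypothetical configuration object for the parameters listed above.
// Property names are illustrative; consult the node's schema for the exact keys.
const cheerioScraperConfig = {
  url: 'https://example.com/docs',      // required: page to scrape or crawl starting point
  relativeLinksMethod: 'webCrawl',      // or a sitemap-based option for 'Scrape XML Sitemap'
  limit: 10,                            // max relative links to follow; 0 means no limit
  selector: 'article.main-content',     // optional CSS selector to narrow the extracted content
  metadata: { source_type: 'docs' },    // Additional Metadata merged into each document
  omitMetadataKeys: 'loc, lastmod'      // or '*' to omit all default metadata
}
```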

Functionality

  1. Validates the input URL.
  2. Sets up the Cheerio loader with the optional CSS selector (see the sketch after this list).
  3. Handles single-page scraping or multi-page crawling based on the selected method.
  4. Applies text splitting if a TextSplitter is provided.
  5. Manages metadata, including adding custom metadata and omitting specified keys.
  6. Handles errors and provides debug logging.
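
A minimal sketch of how steps 1, 2, and 4 might fit together is shown below, assuming LangChain's CheerioWebBaseLoader and RecursiveCharacterTextSplitter (import paths and option names vary by LangChain version); it illustrates the flow rather than the node's actual implementation.

```typescript
import { CheerioWebBaseLoader } from 'langchain/document_loaders/web/cheerio'
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'
import type { Document } from 'langchain/document'

async function loadPage(
  url: string,
  selector?: string,
  splitter?: RecursiveCharacterTextSplitter
): Promise<Document[]> {
  // Step 1: validate the input URL (throws on malformed input)
  new URL(url)

  // Step 2: set up the Cheerio loader, scoping extraction to the CSS selector when provided
  const loader = new CheerioWebBaseLoader(url, selector ? { selector: selector as any } : {})

  // Step 3 (multi-page crawling) would call this function once per discovered relative link.

  // Step 4: apply text splitting if a splitter is provided, otherwise return whole-page documents
  return splitter ? loader.loadAndSplit(splitter) : loader.load()
}
```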

Output

The node outputs an array of IDocument objects, one per scraped page or section, each containing the extracted content and its associated metadata.
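
For illustration, a single output document typically has the shape below; the exact metadata keys depend on the loader defaults and the Additional Metadata / Omit Metadata Keys settings.

```typescript
// Illustrative shape of one output document; metadata keys vary with configuration.
const exampleDocument = {
  pageContent: 'Text extracted from the page or from the selected CSS region...',
  metadata: {
    source: 'https://example.com/docs', // URL the content was scraped from
    source_type: 'docs'                 // any Additional Metadata merged in
  }
}
```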

Use Cases

  • Web content extraction for analysis or processing
  • Building datasets from web sources
  • Automating data collection from websites
  • Creating custom web crawlers for specific domains

Notes

  • The node includes error handling for invalid URLs and unsupported file types (e.g., PDFs).
  • It follows responsible scraping practices, including rate limiting.
  • Debug logging is available when the DEBUG environment variable is set to 'true' (see the sketch below).
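
As a rough illustration of the last point, debug output of this kind is typically gated on the environment variable, along these lines (the actual logger and messages are implementation details):

```typescript
// Illustrative only: emit extra diagnostics when DEBUG is set to 'true'.
function debugLog(message: string): void {
  if (process.env.DEBUG === 'true') {
    console.debug(`[cheerioWebScraper] ${message}`)
  }
}

debugLog('Starting crawl of https://example.com')
```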