Node Details

  • Name: cheerioWebScraper
  • Type: Document
  • Version: 1.1
  • Category: Document Loaders

Input Parameters

  1. URL (required)

    • Type: string
    • Description: The URL of the webpage to scrape, or the starting point for crawling (see the example configuration after this list).
  2. Text Splitter (optional)

    • Type: TextSplitter
    • Description: A text splitter to process the extracted content.
  3. Get Relative Links Method (optional)

    • Type: options
    • Default: "webCrawl"
    • Options:
      • Web Crawl: Crawl relative links from HTML URL
      • Scrape XML Sitemap: Scrape relative links from XML sitemap URL
    • Description: Method to retrieve relative links for multi-page scraping.
  4. Get Relative Links Limit (optional)

    • Type: number
    • Default: 10
    • Description: Limits the number of relative links to process. Set to 0 for no limit.
  5. Selector (CSS) (optional)

    • Type: string
    • Description: CSS selector to target specific content on the page.
  6. Additional Metadata (optional)

    • Type: json
    • Description: Custom metadata to add to the extracted documents.
  7. Omit Metadata Keys (optional)

    • Type: string
    • Description: Comma-separated list of metadata keys to omit from the output. Use "*" to omit all default metadata.
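
An example configuration for the parameters above is sketched below. The property names (url, relativeLinksMethod, limit, selector, metadata, omitMetadataKeys) are illustrative placeholders, not necessarily the node's internal field names.

```typescript
// Hypothetical configuration object for the parameters listed above.
// Property names are illustrative; consult the node's schema for the exact keys.
const cheerioScraperConfig = {
  url: 'https://example.com/docs',      // required: page to scrape or crawl starting point
  relativeLinksMethod: 'webCrawl',      // or a sitemap-based option for 'Scrape XML Sitemap'
  limit: 10,                            // max relative links to follow; 0 means no limit
  selector: 'article.main-content',     // optional CSS selector to narrow the extracted content
  metadata: { source_type: 'docs' },    // Additional Metadata merged into each document
  omitMetadataKeys: 'loc, lastmod'      // or '*' to omit all default metadata
}
```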

Functionality

  1. Validates the input URL.
  2. Sets up the Cheerio loader with the optional CSS selector (see the sketch after this list).
  3. Handles single-page scraping or multi-page crawling based on the selected method.
  4. Applies text splitting if a TextSplitter is provided.
  5. Manages metadata, including adding custom metadata and omitting specified keys.
  6. Handles errors and provides debug logging.
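
A minimal sketch of how steps 1, 2, and 4 might fit together is shown below, assuming LangChain's CheerioWebBaseLoader and RecursiveCharacterTextSplitter (import paths and option names vary by LangChain version); it illustrates the flow rather than the node's actual implementation.

```typescript
import { CheerioWebBaseLoader } from 'langchain/document_loaders/web/cheerio'
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'
import type { Document } from 'langchain/document'

async function loadPage(
  url: string,
  selector?: string,
  splitter?: RecursiveCharacterTextSplitter
): Promise<Document[]> {
  // Step 1: validate the input URL (throws on malformed input)
  new URL(url)

  // Step 2: set up the Cheerio loader, scoping extraction to the CSS selector when provided
  const loader = new CheerioWebBaseLoader(url, selector ? { selector: selector as any } : {})

  // Step 3 (multi-page crawling) would call this function once per discovered relative link.

  // Step 4: apply text splitting if a splitter is provided, otherwise return whole-page documents
  return splitter ? loader.loadAndSplit(splitter) : loader.load()
}
```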

Output

The node outputs an array of IDocument objects, one per scraped page or section, each containing the extracted content and its associated metadata.
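
For illustration, a single output document typically has the shape below; the exact metadata keys depend on the loader defaults and the Additional Metadata / Omit Metadata Keys settings.

```typescript
// Illustrative shape of one output document; metadata keys vary with configuration.
const exampleDocument = {
  pageContent: 'Text extracted from the page or from the selected CSS region...',
  metadata: {
    source: 'https://example.com/docs', // URL the content was scraped from
    source_type: 'docs'                 // any Additional Metadata merged in
  }
}
```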

Use Cases

  • Web content extraction for analysis or processing
  • Building datasets from web sources
  • Automating data collection from websites
  • Creating custom web crawlers for specific domains

Notes

  • The node includes error handling for invalid URLs and unsupported file types (e.g., PDFs).
  • It follows responsible scraping practices, including rate limiting.
  • Debug logging is available when the DEBUG environment variable is set to 'true' (see the sketch below).
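
As a rough illustration of the last point, debug output of this kind is typically gated on the environment variable, along these lines (the actual logger and messages are implementation details):

```typescript
// Illustrative only: emit extra diagnostics when DEBUG is set to 'true'.
function debugLog(message: string): void {
  if (process.env.DEBUG === 'true') {
    console.debug(`[cheerioWebScraper] ${message}`)
  }
}

debugLog('Starting crawl of https://example.com')
```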