Node Details

  • Name: ApifyWebsiteContentCrawler_DocumentLoaders
  • Type: Document
  • Category: Document Loaders
  • Version: 2.0

Input Parameters

  1. Text Splitter (optional)

    • Type: TextSplitter
    • Description: A text splitter to process the extracted content.
  2. Start URLs (required)

    • Type: string
    • Description: One or more URLs at which the crawler starts.

  3. Crawler Type (required)

    • Type: options
    • Options:
      • Headless web browser (Chrome+Playwright)
      • Stealthy web browser (Firefox+Playwright)
      • Raw HTTP client (Cheerio)
      • Raw HTTP client with JavaScript execution (JSDOM) [experimental]
    • Default: Stealthy web browser (Firefox+Playwright)
    • Description: Select the crawling engine for the task.
  4. Max Crawling Depth (optional)

    • Type: number
    • Default: 1
    • Description: The maximum link depth the crawler follows from the start URLs.
  5. Max Crawl Pages (optional)

    • Type: number
    • Default: 3
    • Description: The maximum number of pages to crawl.
  6. Additional Input (optional)

    • Type: JSON
    • Default:
    • Description: Additional input options for the crawler. Refer to the Apify documentation for more details.
  7. Additional Metadata (optional)

    • Type: JSON
    • Description: Additional metadata to be added to the extracted documents.
  8. Omit Metadata Keys (optional)

    • Type: string
    • Description: A comma-separated list of metadata keys to omit from the final documents. Use * to omit all metadata keys except those specified in the Additional Metadata field.
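The "Omit Metadata Keys" rule above can be sketched as a small helper. This is an illustrative reimplementation of the described behavior, not the node's actual code; the function and parameter names are assumptions.

```python
def apply_omit_keys(metadata, omit_keys_str, additional_metadata=None):
    """Drop metadata keys per the 'Omit Metadata Keys' rule.

    omit_keys_str: a comma-separated list of keys to drop. If it contains
    '*', every key is dropped except those also present in
    additional_metadata (the 'Additional Metadata' field).
    Illustrative helper only, not the node's real implementation.
    """
    additional_metadata = additional_metadata or {}
    keys = [k.strip() for k in omit_keys_str.split(",") if k.strip()]
    if "*" in keys:
        # Wildcard: keep only keys that Additional Metadata explicitly sets.
        return {k: v for k, v in metadata.items() if k in additional_metadata}
    return {k: v for k, v in metadata.items() if k not in keys}
```

For example, passing `"title, lang"` strips those two keys, while `"*"` together with an Additional Metadata field of `{"source": ...}` keeps only `source`.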

Credential

  • Name: apifyApi
  • Type: API Key
  • Description: Apify API token for authentication.

Output

The node outputs an array of Document objects. Each Document contains:

  • pageContent: The extracted text content from the webpage.
  • metadata: An object containing metadata about the document, including the source URL and any additional metadata specified in the input.
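The output shape described above can be modeled minimally as follows. This is a stand-in sketch of the two-field Document structure; real implementations (e.g. LangChain's `Document`) carry the same content/metadata pair, and the field values here are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Minimal stand-in for the Document objects this node emits:
    # extracted text plus a metadata dict (source URL, extras, etc.).
    pageContent: str
    metadata: dict = field(default_factory=dict)


doc = Document(
    pageContent="Example page text extracted by the crawler.",
    metadata={"source": "https://example.com", "title": "Example"},
)
```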

Functionality

  1. The node initializes the Apify Website Content Crawler with the provided input parameters.
  2. It uses the ApifyDatasetLoader to load documents from the crawler’s output.
  3. If a text splitter is provided, it splits the documents accordingly.
  4. Additional metadata is added to each document if specified.
  5. Metadata keys are omitted based on the “Omit Metadata Keys” input.
  6. The processed documents are returned as the output.
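Steps 3 and 4 above (optional splitting, then merging additional metadata) can be sketched as a small pipeline. The function name, the callable-splitter interface, and the plain-dict document shape are assumptions for illustration, not the node's real API.

```python
def process_documents(docs, splitter=None, additional_metadata=None):
    """Sketch of the post-crawl steps: optionally split the documents,
    then merge any Additional Metadata into each one.

    docs: list of {"pageContent": str, "metadata": dict}
    splitter: any callable mapping a list of docs to a list of docs
    Illustrative only; the node's actual implementation may differ.
    """
    if splitter is not None:
        docs = splitter(docs)
    if additional_metadata:
        for d in docs:
            # Additional Metadata overrides crawler-provided keys on conflict.
            d["metadata"].update(additional_metadata)
    return docs
```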

Use Cases

  • Web scraping for content analysis
  • Building training datasets for language models
  • Extracting information from multiple web pages for research or data aggregation
  • Creating knowledge bases from web content

Notes

  • The crawler can be configured to respect robots.txt rules; always honor the target website's terms of service.
  • Be mindful of the number of pages and depth of crawling to avoid overloading target websites.
  • The Apify API token is required to use this node. Ensure you have a valid Apify account and API token.