Apify Website Content Crawler

Node Details

Text Splitter (optional)
- Type: TextSplitter
- Description: A text splitter to process the extracted content.
Start URLs (required)
- Type: string
- Description: One or more URLs where the crawler will start, separated by commas.
- Example: https://js.langchain.com/docs/
Crawler Type (required)
- Type: options
- Options:
  - Headless web browser (Chrome+Playwright)
  - Stealthy web browser (Firefox+Playwright)
  - Raw HTTP client (Cheerio)
  - Raw HTTP client with JavaScript execution (JSDOM) [experimental]
- Default: Stealthy web browser (Firefox+Playwright)
- Description: Select the crawling engine for the task.
Max Crawling Depth (optional)
- Type: number
- Default: 1
- Description: The maximum link depth to crawl, counted from the start URLs.
Max Crawl Pages (optional)
- Type: number
- Default: 3
- Description: The maximum number of pages to crawl.
Additional Input (optional)
- Type: JSON
- Default:
- Description: Additional input options for the crawler. Refer to the Apify documentation for more details.
Additional Metadata (optional)
- Type: JSON
- Description: Additional metadata to be added to the extracted documents.
Omit Metadata Keys (optional)
- Type: string
- Description: A comma-separated list of metadata keys to omit from the final documents. Use * to omit all metadata keys except those specified in the Additional Metadata field.
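
As a rough illustration of how the parameters above might be combined into a single crawler input, here is a minimal TypeScript sketch. The `NodeInputs` interface, the `buildCrawlerInput` helper, the output field names (`startUrls`, `crawlerType`, `maxCrawlDepth`, `maxCrawlPages`) and the identifier string `"playwright:firefox"` are assumptions made for this example, not confirmed details of the node's implementation. Whether Additional Input overrides the other fields is also an assumption.

```typescript
// Hypothetical shape of the node's form values; names are illustrative only.
interface NodeInputs {
  urls: string; // Start URLs as one comma-separated string
  crawlerType: string; // assumed identifier, e.g. "playwright:firefox"
  maxCrawlDepth: number;
  maxCrawlPages: number;
  additionalInput?: Record<string, unknown>;
}

// Build one input object for the crawler from the form values.
function buildCrawlerInput(inputs: NodeInputs): Record<string, unknown> {
  const startUrls = inputs.urls
    .split(",")
    .map((u) => u.trim())
    .filter((u) => u.length > 0) // ignore empty entries such as a trailing comma
    .map((url) => ({ url }));

  return {
    startUrls,
    crawlerType: inputs.crawlerType,
    maxCrawlDepth: inputs.maxCrawlDepth,
    maxCrawlPages: inputs.maxCrawlPages,
    // Assumption: Additional Input is spread last, so it can override the fields above.
    ...(inputs.additionalInput ?? {}),
  };
}

// Usage with the defaults listed in this section.
const exampleCrawlerInput = buildCrawlerInput({
  urls: "https://js.langchain.com/docs/",
  crawlerType: "playwright:firefox",
  maxCrawlDepth: 1,
  maxCrawlPages: 3,
});
console.log(JSON.stringify(exampleCrawlerInput, null, 2));
```

Trimming and filtering the split URLs keeps stray spaces or a trailing comma from producing empty start URLs.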

The node outputs an array of Document objects. Each Document contains:

- pageContent: The extracted text content from the webpage.
- metadata: An object containing metadata about the document, including the source URL and any additional metadata specified in the input.
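
The output shape described above can be sketched with a small TypeScript type. The names `CrawledDocument`, `CrawledItem` and `toDocument`, and the item fields `text` and `url`, are hypothetical and chosen only for illustration. The actual dataset item fields may differ.

```typescript
// Local stand-in for the Document shape described above.
interface CrawledDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical crawler dataset item; `url` and `text` are assumed field names.
interface CrawledItem {
  url: string;
  text?: string;
}

// Map one crawled item to a document that carries its source URL in the metadata.
function toDocument(item: CrawledItem): CrawledDocument {
  return {
    pageContent: item.text ?? "",
    metadata: { source: item.url },
  };
}

const exampleDocument = toDocument({
  url: "https://js.langchain.com/docs/",
  text: "Extracted page text",
});
console.log(exampleDocument);
```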

- The node initializes the Apify Website Content Crawler with the provided input parameters.
- It uses the ApifyDatasetLoader to load documents from the crawler’s output.
- If a text splitter is provided, the loaded documents are split with it.
- Additional metadata is added to each document if specified.
- Metadata keys listed in the “Omit Metadata Keys” input are removed.
- The processed documents are returned as the output.
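
The two metadata steps above (adding Additional Metadata, then applying Omit Metadata Keys) could look roughly like the following TypeScript sketch. It is one reading of the parameter descriptions, not the node's actual code. `CrawledDocument` and `applyMetadata` are hypothetical names, and loading and text splitting are left out.

```typescript
// Local stand-in for the Document shape; not imported from any library.
interface CrawledDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Merge additional metadata into each document, then drop the omitted keys.
// A "*" entry keeps only the keys supplied as additional metadata.
function applyMetadata(
  docs: CrawledDocument[],
  additional: Record<string, unknown>,
  omitKeys: string
): CrawledDocument[] {
  const keys = omitKeys
    .split(",")
    .map((k) => k.trim())
    .filter((k) => k.length > 0);
  const omitAll = keys.includes("*");

  return docs.map((doc) => {
    const metadata: Record<string, unknown> = omitAll
      ? { ...additional }
      : { ...doc.metadata, ...additional };
    if (!omitAll) {
      for (const key of keys) {
        delete metadata[key];
      }
    }
    return { pageContent: doc.pageContent, metadata };
  });
}

const sampleDocs: CrawledDocument[] = [
  {
    pageContent: "Hello",
    metadata: { source: "https://example.com/", title: "Home", lang: "en" },
  },
];

// Keeps `source`, adds `project`, drops `title` and `lang`.
console.log(applyMetadata(sampleDocs, { project: "demo" }, "title, lang"));
// Keeps only `project`.
console.log(applyMetadata(sampleDocs, { project: "demo" }, "*"));
```

The sketch returns new document objects instead of mutating the originals, so the same loaded documents can be processed again with different settings.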

The crawler respects robots.txt rules and website terms of service.
Be mindful of the number of pages and depth of crawling to avoid overloading target websites.
The Apify API token is required to use this node. Ensure you have a valid Apify account and API token.