Apify Website Content Crawler
The Apify Website Content Crawler node is a document loader that runs Apify’s Website Content Crawler against one or more websites. It crawls the pages, extracts their text content, and returns it as a collection of documents for use in natural language processing tasks.
Node Details
- Name: ApifyWebsiteContentCrawler_DocumentLoaders
- Type: Document
- Category: Document Loaders
- Version: 2.0
Input Parameters
- Text Splitter (optional)
  - Type: TextSplitter
  - Description: A text splitter to process the extracted content.
- Start URLs (required)
  - Type: string
  - Description: One or more URLs where the crawler will start, separated by commas.
  - Example: https://js.langchain.com/docs/
- Crawler Type (required)
  - Type: options
  - Options:
    - Headless web browser (Chrome+Playwright)
    - Stealthy web browser (Firefox+Playwright)
    - Raw HTTP client (Cheerio)
    - Raw HTTP client with JavaScript execution (JSDOM) [experimental]
  - Default: Stealthy web browser (Firefox+Playwright)
  - Description: Select the crawling engine for the task.
- Max Crawling Depth (optional)
  - Type: number
  - Default: 1
  - Description: The maximum depth of pages to crawl.
- Max Crawl Pages (optional)
  - Type: number
  - Default: 3
  - Description: The maximum number of pages to crawl.
- Additional Input (optional)
  - Type: JSON
  - Default:
  - Description: Additional input options for the crawler. Refer to the Apify documentation for more details.
- Additional Metadata (optional)
  - Type: JSON
  - Description: Additional metadata to be added to the extracted documents.
- Omit Metadata Keys (optional)
  - Type: string
  - Description: A comma-separated list of metadata keys to omit from the final documents. Use * to omit all metadata keys except those specified in the Additional Metadata field.
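As a rough illustration of how the parameters above might combine into a single crawler input object, here is a minimal TypeScript sketch. The function name `buildCrawlerInput`, the field names (`startUrls`, `crawlerType`, `maxCrawlDepth`, `maxCrawlPages`), the option identifiers such as `playwright:firefox`, and the choice to let Additional Input override the other fields are all assumptions made for illustration; this page does not confirm them.

```typescript
// Assumed internal identifiers for the four Crawler Type options (hypothetical).
type CrawlerType = "playwright:chrome" | "playwright:firefox" | "cheerio" | "jsdom";

// Builds a crawler input object from the node's parameters.
// Defaults mirror the ones listed above: Firefox crawler, depth 1, 3 pages.
function buildCrawlerInput(
  startUrls: string,
  crawlerType: CrawlerType = "playwright:firefox",
  maxCrawlDepth: number = 1,
  maxCrawlPages: number = 3,
  additionalInput: Record<string, unknown> = {}
): Record<string, unknown> {
  // Start URLs arrive as one comma-separated string; split and trim them.
  const urls = startUrls
    .split(",")
    .map((u) => u.trim())
    .filter((u) => u.length > 0);

  if (urls.length === 0) {
    throw new Error("At least one start URL is required");
  }

  return {
    startUrls: urls.map((url) => ({ url })),
    crawlerType,
    maxCrawlDepth,
    maxCrawlPages,
    // Assumption: Additional Input is spread last, so it can override the fields above.
    ...additionalInput,
  };
}
```

Calling `buildCrawlerInput("https://js.langchain.com/docs/")` with no other arguments would produce an input with one start URL and the default crawler type, depth, and page limit.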
Credential
- Name: apifyApi
- Type: API Key
- Description: Apify API token for authentication.
Output
The node outputs an array of Document objects. Each Document contains:
- pageContent: The extracted text content from the webpage.
- metadata: An object containing metadata about the document, including the source URL and any additional metadata specified in the input.
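To make the output shape concrete, the sketch below maps one crawler result item to a document with `pageContent` and `metadata`. The type names `CrawledDocument` and `CrawlerItem`, the helper `toDocument`, and the item fields `text` and `url` are assumptions for illustration; the node's actual mapping may differ.

```typescript
// Shape of one output document as described above.
interface CrawledDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Assumed shape of a single crawler result item (hypothetical field names).
interface CrawlerItem {
  url: string;
  text?: string;
}

// Maps a crawler result item to a document, storing the page URL as `source`.
function toDocument(item: CrawlerItem): CrawledDocument {
  return {
    // Fall back to an empty string when the page yielded no text.
    pageContent: item.text ?? "",
    metadata: { source: item.url },
  };
}
```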
Functionality
- The node initializes the Apify Website Content Crawler with the provided input parameters.
- It uses the ApifyDatasetLoader to load documents from the crawler’s output.
- If a text splitter is provided, it splits the documents accordingly.
- Additional metadata is added to each document if specified.
- Metadata keys are omitted based on the “Omit Metadata Keys” input.
- The processed documents are returned as the output.
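The metadata steps above can be sketched as a small post-processing function. This is a minimal sketch under stated assumptions, not the node's actual code: the names `applyMetadataOptions` and `CrawledDocument` are hypothetical, and it assumes that omitted keys are removed after the additional metadata has been merged in, so a key listed in both places ends up removed.

```typescript
interface CrawledDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Adds extra metadata to each document, then drops the keys listed in
// the comma-separated `omitMetadataKeys` string. A "*" entry keeps only
// the additional metadata, as described for Omit Metadata Keys.
function applyMetadataOptions(
  docs: CrawledDocument[],
  additionalMetadata: Record<string, unknown> = {},
  omitMetadataKeys: string = ""
): CrawledDocument[] {
  const omit = omitMetadataKeys
    .split(",")
    .map((k) => k.trim())
    .filter((k) => k.length > 0);

  return docs.map((doc) => {
    if (omit.indexOf("*") !== -1) {
      // Wildcard: discard all original keys, keep only the additional metadata.
      return { ...doc, metadata: { ...additionalMetadata } };
    }
    // Merge first, then remove omitted keys (assumed ordering).
    const merged: Record<string, unknown> = { ...doc.metadata, ...additionalMetadata };
    for (const key of omit) {
      delete merged[key];
    }
    return { ...doc, metadata: merged };
  });
}
```

The function returns new document objects rather than mutating its input, so the original crawler output stays available for inspection.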
Use Cases
- Web scraping for content analysis
- Building training datasets for language models
- Extracting information from multiple web pages for research or data aggregation
- Creating knowledge bases from web content
Notes
- Check the target website’s robots.txt rules and terms of service before crawling; staying within them is the responsibility of whoever configures the crawl.
- Be mindful of the number of pages and depth of crawling to avoid overloading target websites.
- The Apify API token is required to use this node. Ensure you have a valid Apify account and API token.