Spider Document Loaders
The Spider Document Loaders node scrapes and crawls web content using the Spider API. It extracts text content from web pages, either by scraping a single page or by crawling multiple pages within the same domain.
Node Details
- Name: spiderDocumentLoaders
- Type: Document
- Category: Document Loaders
- Version: 1.0
Parameters
Input Parameters
- Text Splitter (optional)
  - Type: TextSplitter
  - Description: A text splitter used to split the extracted content into chunks
- Mode
  - Type: options
  - Options:
    - Scrape: extract content from a single page
    - Crawl: extract content from multiple pages within the same domain
  - Default: scrape
- Web Page URL
  - Type: string
  - Description: The URL of the page to scrape, or the starting point for a crawl
- Limit
  - Type: number
  - Default: 25
  - Description: The maximum number of pages to crawl (crawl mode only)
- Additional Metadata (optional)
  - Type: JSON
  - Description: Additional metadata added to each extracted document
- Additional Parameters (optional)
  - Type: JSON
  - Description: Additional parameters passed to the Spider API (see the Spider API documentation)
- Omit Metadata Keys (optional)
  - Type: string
  - Description: Comma-separated list of metadata keys to omit from the output
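For reference, the inputs above map onto a configuration shape roughly like the following sketch. The interface and field names are illustrative, not the node's internal identifiers, and the Text Splitter is omitted because it is supplied as a node connection rather than a plain value:

```typescript
// Illustrative shape of the node's inputs; names are assumptions, not the
// actual identifiers used inside Flowise.
interface SpiderLoaderInputs {
  mode: 'scrape' | 'crawl';           // default: 'scrape'
  url: string;                        // page to scrape, or crawl starting point
  limit?: number;                     // max pages in crawl mode (default 25)
  metadata?: Record<string, unknown>; // Additional Metadata for each Document
  params?: Record<string, unknown>;   // Additional Parameters for the Spider API
  omitMetadataKeys?: string;          // comma-separated keys to drop
}

const inputs: SpiderLoaderInputs = {
  mode: 'crawl',
  url: 'https://example.com',
  limit: 25,
  metadata: { project: 'demo' },
};
```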
Credential
- Credential Name: spiderApi
- Required Parameter: spiderApiKey
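As a sketch of how the credential is typically consumed, assuming the Spider API uses Bearer-token authentication (confirm the exact scheme in the Spider API documentation):

```typescript
// Assumption: the Spider API authenticates with a Bearer token, as is common
// for hosted APIs; verify the exact scheme in the Spider API documentation.
const spiderApiKey = process.env.SPIDER_API_KEY ?? '';

const headers = {
  'Content-Type': 'application/json',
  Authorization: `Bearer ${spiderApiKey}`,
};
```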
Functionality
- The node initializes a SpiderLoader with the provided parameters.
- Depending on the selected mode (scrape or crawl), it calls the appropriate Spider API endpoint.
- The extracted content is processed and converted into Document objects.
- If a text splitter is provided, the content is split accordingly.
- Additional metadata is added, and specified metadata keys are omitted if requested.
- The resulting documents are returned as output.
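The sketch below illustrates this flow end to end. It is not Flowise's actual implementation: the endpoint URLs, request body fields, and response shape are assumptions to be checked against the Spider API documentation, and it reuses the illustrative SpiderLoaderInputs interface from the parameters section above.

```typescript
interface Document {
  pageContent: string;
  metadata: Record<string, unknown>;
}

async function loadWithSpider(
  inputs: SpiderLoaderInputs,
  apiKey: string
): Promise<Document[]> {
  // 1. Pick the endpoint that matches the selected mode (assumed URLs).
  const endpoint =
    inputs.mode === 'crawl'
      ? 'https://api.spider.cloud/crawl'
      : 'https://api.spider.cloud/scrape';

  const res = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      url: inputs.url,
      limit: inputs.mode === 'crawl' ? inputs.limit ?? 25 : undefined,
      return_format: 'markdown', // markdown is the node's default output format
      ...inputs.params,          // Additional Parameters pass through untouched
    }),
  });
  // Assumed response shape: one entry per scraped or crawled page.
  const pages: Array<{ content: string; url: string }> = await res.json();

  // 2. Convert each page into a Document, merging Additional Metadata and
  //    dropping any keys listed in Omit Metadata Keys.
  const omit = new Set(
    (inputs.omitMetadataKeys ?? '')
      .split(',')
      .map((k) => k.trim())
      .filter(Boolean)
  );
  const documents = pages.map((page) => {
    const metadata: Record<string, unknown> = {
      source: page.url, // default metadata
      ...inputs.metadata,
    };
    for (const key of omit) delete metadata[key];
    return { pageContent: page.content, metadata };
  });

  // 3. If a Text Splitter is connected, each Document would be split into
  //    chunks here before being returned.
  return documents;
}
```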
Output
An array of Document objects, each containing:
- pageContent: The extracted text content from the web page
- metadata: A combination of default metadata (e.g., the source URL) and any additional metadata provided
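For example, a single crawled page might come back as follows (values illustrative, using the Document shape from the sketch above):

```typescript
// Illustrative output only; actual metadata keys depend on the Spider API
// response and the node's configuration.
const output: Document[] = [
  {
    pageContent: '# Example Domain\n\nThis domain is for use in examples...',
    metadata: {
      source: 'https://example.com', // default metadata
      project: 'demo',               // from Additional Metadata, if provided
    },
  },
];
```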
Use Cases
- Web scraping for data collection
- Content aggregation from multiple web pages
- Preparing web content for further processing or analysis in a language model pipeline
Notes
- The Spider API key must be provided through the credential system.
- The node supports both single-page scraping and multi-page crawling.
- Users can customize the extraction process using additional parameters supported by the Spider API.
- The output format is set to markdown by default.
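Because the output defaults to markdown, a different format would be requested through Additional Parameters. The `return_format` field below is an assumption; verify the field name and accepted values against the Spider API documentation:

```typescript
// Hypothetical Additional Parameters value overriding the output format.
const additionalParams = {
  return_format: 'text', // e.g. plain text instead of the markdown default
};
```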