Spider Document Loaders
The Spider Document Loaders node scrapes and crawls web content using the Spider API. It extracts text content from web pages, either by scraping a single page or by crawling multiple pages within the same domain.
Node Details
- Name: spiderDocumentLoaders
- Type: Document
- Category: Document Loaders
- Version: 1.0
Parameters
Input Parameters
- Text Splitter (optional)
  - Type: TextSplitter
  - Description: A text splitter used to split the extracted content into chunks
- Mode
  - Type: options
  - Options:
    - Scrape: extract content from a single page
    - Crawl: extract content from multiple pages within the same domain
  - Default: scrape
- Web Page URL
  - Type: string
  - Description: The URL of the page to scrape, or the starting point for a crawl
- Limit
  - Type: number
  - Default: 25
  - Description: The maximum number of pages to crawl (crawl mode only)
- Additional Metadata (optional)
  - Type: JSON
  - Description: Additional metadata added to each extracted document
- Additional Parameters (optional)
  - Type: JSON
  - Description: Additional parameters passed to the Spider API (see the Spider API documentation)
- Omit Metadata Keys (optional)
  - Type: string
  - Description: Comma-separated list of metadata keys to omit from the output
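For reference, the inputs above map onto a configuration shape roughly like the following sketch. The interface and field names are illustrative, not the node's internal identifiers, and the Text Splitter is omitted because it is supplied as a node connection rather than a plain value:

```typescript
// Illustrative shape of the node's inputs; names are assumptions, not the
// actual identifiers used inside Flowise.
interface SpiderLoaderInputs {
  mode: 'scrape' | 'crawl';           // default: 'scrape'
  url: string;                        // page to scrape, or crawl starting point
  limit?: number;                     // max pages in crawl mode (default 25)
  metadata?: Record<string, unknown>; // Additional Metadata for each Document
  params?: Record<string, unknown>;   // Additional Parameters for the Spider API
  omitMetadataKeys?: string;          // comma-separated keys to drop
}

const inputs: SpiderLoaderInputs = {
  mode: 'crawl',
  url: 'https://example.com',
  limit: 25,
  metadata: { project: 'demo' },
};
```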
Credential
- Credential Name: spiderApi
- Required Parameter: spiderApiKey
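As a sketch of how the credential is typically consumed, assuming the Spider API uses Bearer-token authentication (confirm the exact scheme in the Spider API documentation):

```typescript
// Assumption: the Spider API authenticates with a Bearer token, as is common
// for hosted APIs; verify the exact scheme in the Spider API documentation.
const spiderApiKey = process.env.SPIDER_API_KEY ?? '';

const headers = {
  'Content-Type': 'application/json',
  Authorization: `Bearer ${spiderApiKey}`,
};
```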
Functionality
- The node initializes a SpiderLoader with the provided parameters.
- Depending on the selected mode (scrape or crawl), it calls the appropriate Spider API endpoint.
- The extracted content is processed and converted into Document objects.
- If a text splitter is provided, the content is split accordingly.
- Additional metadata is added, and specified metadata keys are omitted if requested.
- The resulting documents are returned as output.
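The sketch below illustrates this flow end to end. It is not Flowise's actual implementation: the endpoint URLs, request body fields, and response shape are assumptions to be checked against the Spider API documentation, and it reuses the illustrative SpiderLoaderInputs interface from the parameters section above.

```typescript
interface Document {
  pageContent: string;
  metadata: Record<string, unknown>;
}

async function loadWithSpider(
  inputs: SpiderLoaderInputs,
  apiKey: string
): Promise<Document[]> {
  // 1. Pick the endpoint that matches the selected mode (assumed URLs).
  const endpoint =
    inputs.mode === 'crawl'
      ? 'https://api.spider.cloud/crawl'
      : 'https://api.spider.cloud/scrape';

  const res = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      url: inputs.url,
      limit: inputs.mode === 'crawl' ? inputs.limit ?? 25 : undefined,
      return_format: 'markdown', // markdown is the node's default output format
      ...inputs.params,          // Additional Parameters pass through untouched
    }),
  });
  // Assumed response shape: one entry per scraped or crawled page.
  const pages: Array<{ content: string; url: string }> = await res.json();

  // 2. Convert each page into a Document, merging Additional Metadata and
  //    dropping any keys listed in Omit Metadata Keys.
  const omit = new Set(
    (inputs.omitMetadataKeys ?? '')
      .split(',')
      .map((k) => k.trim())
      .filter(Boolean)
  );
  const documents = pages.map((page) => {
    const metadata: Record<string, unknown> = {
      source: page.url, // default metadata
      ...inputs.metadata,
    };
    for (const key of omit) delete metadata[key];
    return { pageContent: page.content, metadata };
  });

  // 3. If a Text Splitter is connected, each Document would be split into
  //    chunks here before being returned.
  return documents;
}
```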
Output
An array of Document objects, each containing:
- pageContent: The extracted text content from the web page
- metadata: A combination of default metadata (e.g., the source URL) and any additional metadata provided
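For example, a single crawled page might come back as follows (values illustrative, using the Document shape from the sketch above):

```typescript
// Illustrative output only; actual metadata keys depend on the Spider API
// response and the node's configuration.
const output: Document[] = [
  {
    pageContent: '# Example Domain\n\nThis domain is for use in examples...',
    metadata: {
      source: 'https://example.com', // default metadata
      project: 'demo',               // from Additional Metadata, if provided
    },
  },
];
```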
Use Cases
- Web scraping for data collection
- Content aggregation from multiple web pages
- Preparing web content for further processing or analysis in a language model pipeline
Notes
- The Spider API key must be provided through the credential system.
- The node supports both single-page scraping and multi-page crawling.
- Users can customize the extraction process using additional parameters supported by the Spider API.
- The output format is set to markdown by default.
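Because the output defaults to markdown, a different format would be requested through Additional Parameters. The `return_format` field below is an assumption; verify the field name and accepted values against the Spider API documentation:

```typescript
// Hypothetical Additional Parameters value overriding the output format.
const additionalParams = {
  return_format: 'text', // e.g. plain text instead of the markdown default
};
```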