Node Details

  • Name: spiderDocumentLoaders
  • Type: Document
  • Category: Document Loaders
  • Version: 1.0

Parameters

Input Parameters

  1. Text Splitter (optional)

    • Type: TextSplitter
    • Description: A text splitter to process the extracted content
  2. Mode

    • Type: options
    • Options:
      • Scrape: Extract content from a single page
      • Crawl: Extract content from multiple pages within the same domain
    • Default: scrape
  3. Web Page URL

    • Type: string
    • Description: The URL of the web page to scrape or the starting point for crawling
  4. Limit

    • Type: number
    • Default: 25
    • Description: The maximum number of pages to crawl; applies only in crawl mode
  5. Additional Metadata (optional)

    • Type: JSON
    • Description: Additional metadata to be added to the extracted documents
  6. Additional Parameters (optional)

    • Type: JSON
    • Description: Additional parameters for the Spider API (refer to Spider API documentation)
  7. Omit Metadata Keys (optional)

    • Type: string
    • Description: Comma-separated list of metadata keys to omit from the output (a configuration sketch covering these inputs follows this list)
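
The sketch below shows how these inputs might be filled in for a small crawl. The property names (textSplitter, additionalMetadata, additionalParams, and so on) are illustrative labels rather than the node's internal identifiers, and the @langchain/textsplitters import is just one possible source for a splitter.

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Illustrative inputs for a crawl of up to 10 pages; scrape mode would only
// need the url plus any optional metadata/omit settings.
const spiderNodeInputs = {
  textSplitter: new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 100,
  }),
  mode: "crawl" as const,                          // or "scrape" for a single page
  url: "https://example.com/docs",                 // starting point for the crawl
  limit: 10,                                       // overrides the default of 25
  additionalMetadata: { project: "docs-ingest" },  // merged into every Document
  additionalParams: { return_format: "markdown" }, // passed through to the Spider API
  omitMetadataKeys: "description,keywords",        // dropped from the output metadata
};
```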

Credential

  • Credential Name: spiderApi
  • Required Parameter: spiderApiKey
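
A small hypothetical sketch of how the credential is consumed: spiderApiKey is resolved from the credential store (represented here by an environment variable for illustration) and attached to Spider API requests. The bearer-token header layout is an assumption and should be checked against the Spider API's authentication documentation.

```typescript
// spiderApiKey is supplied by the spiderApi credential; read from an
// environment variable here purely for illustration.
const spiderApiKey = process.env.SPIDER_API_KEY ?? "";

// Assumed header layout for authenticating against the Spider API.
const spiderAuthHeaders = {
  Authorization: `Bearer ${spiderApiKey}`,
  "Content-Type": "application/json",
};
```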

Functionality

  1. The node initializes a SpiderLoader with the provided parameters.
  2. Depending on the selected mode (scrape or crawl), it calls the appropriate Spider API endpoint.
  3. The extracted content is processed and converted into Document objects.
  4. If a text splitter is provided, the content is split accordingly.
  5. Additional metadata is added, and specified metadata keys are omitted if requested.
  6. The resulting documents are returned as output (the end-to-end flow is sketched in code below).
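
The following is a simplified sketch of that flow, not the node's actual implementation. The endpoint paths, request body fields (limit, return_format), and response shape are assumptions about the Spider API and should be verified against its documentation; the Document and TextSplitter types come from the LangChain JS packages.

```typescript
import { Document } from "@langchain/core/documents";
import type { TextSplitter } from "@langchain/textsplitters";

interface SpiderLoadOptions {
  apiKey: string;
  mode: "scrape" | "crawl";
  url: string;
  limit?: number;
  metadata?: Record<string, unknown>;
  omitMetadataKeys?: string;
  params?: Record<string, unknown>;
  textSplitter?: TextSplitter;
}

async function loadWithSpider(opts: SpiderLoadOptions): Promise<Document[]> {
  // Steps 1-2: choose the endpoint from the selected mode and call the
  // Spider API. The paths below are assumptions, not a confirmed contract.
  const endpoint =
    opts.mode === "crawl"
      ? "https://api.spider.cloud/crawl"
      : "https://api.spider.cloud/scrape";

  const response = await fetch(endpoint, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${opts.apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url: opts.url,
      limit: opts.mode === "crawl" ? opts.limit ?? 25 : undefined,
      return_format: "markdown", // markdown output by default
      ...opts.params,            // pass-through Additional Parameters
    }),
  });
  // Assumed response shape: one entry per extracted page.
  const pages = (await response.json()) as Array<{ content: string; url: string }>;

  // Step 5 (part 1): parse the Omit Metadata Keys list.
  const omit = (opts.omitMetadataKeys ?? "")
    .split(",")
    .map((key) => key.trim())
    .filter(Boolean);

  // Step 3: convert each extracted page into a Document object.
  const docs = pages.map((page) => {
    // Step 5 (part 2): merge default metadata with Additional Metadata,
    // then drop the omitted keys.
    const metadata: Record<string, unknown> = { source: page.url, ...opts.metadata };
    for (const key of omit) delete metadata[key];
    return new Document({ pageContent: page.content, metadata });
  });

  // Step 4: split the content if a text splitter was provided.
  // Step 6: return the resulting documents.
  return opts.textSplitter ? opts.textSplitter.splitDocuments(docs) : docs;
}
```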

Output

An array of Document objects, each containing:

  • pageContent: The extracted text content from the web page
  • metadata: A combination of default metadata (e.g., source URL) and any additional metadata provided (see the example below)
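
An illustrative example of one element of that array; the exact metadata fields depend on the Spider API response and on any Additional Metadata supplied to the node:

```typescript
// Illustrative output document, not a captured API response.
const exampleOutput = [
  {
    pageContent: "# Example Docs\n\nWelcome to the documentation...",
    metadata: {
      source: "https://example.com/docs", // default metadata (source URL)
      project: "docs-ingest",             // from Additional Metadata
    },
  },
];
```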

Use Cases

  • Web scraping for data collection
  • Content aggregation from multiple web pages
  • Preparing web content for further processing or analysis in a language model pipeline

Notes

  • The Spider API key must be provided through the credential system.
  • The node supports both single-page scraping and multi-page crawling.
  • Users can customize the extraction process using additional parameters supported by the Spider API.
  • The output format defaults to markdown; it can be overridden through Additional Parameters (see the hypothetical example below).
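
As a hypothetical illustration, an Additional Parameters value that overrides the default output format might look like the following; the key names belong to the Spider API and should be verified against its current documentation.

```typescript
// Assumed Spider API option names; confirm against the Spider API docs.
const additionalParams = {
  return_format: "raw", // override the default markdown output
  request: "smart",     // example of another Spider option (assumed name/value)
};
```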