Node Details

  • Name: UnstructuredFolder_DocumentLoaders
  • Type: Document
  • Version: 2.0
  • Category: Document Loaders

Input Parameters

Required Parameters

  1. Folder Path

    • Type: string
    • Description: The path to the folder containing the documents to be processed.
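  2. Unstructured API URL

    • Type: string
    • Description: The URL of the Unstructured API endpoint that processes the documents.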

Optional Parameters

  1. Strategy

    • Type: options (hi_res, fast, ocr_only, auto)
    • Default: “auto”
    • Description: The strategy to use for partitioning PDF/image documents.
  2. Encoding

    • Type: string
    • Default: “utf-8”
    • Description: The encoding method used to decode the text input.
  3. Skip Infer Table Types

    • Type: multiOptions
    • Default: [“pdf”, “jpg”, “png”]
    • Description: Document types for which table extraction is skipped.
  4. Hi-Res Model Name

    • Type: options (chipper, detectron2_onnx, yolox, yolox_quantized)
    • Default: “detectron2_onnx”
    • Description: The name of the inference model used when strategy is hi_res.
  5. Chunking Strategy

    • Type: options (None, by_title)
    • Default: “by_title”
    • Description: Strategy to chunk the returned elements.
  6. OCR Languages

    • Type: multiOptions
    • Description: Languages to use for OCR.
  7. Source ID Key

    • Type: string
    • Default: “source”
    • Description: Key used to get the true source of the document.
  8. Coordinates

    • Type: boolean
    • Default: false
    • Description: Whether to return coordinates for each element.
  9. Include Page Breaks

    • Type: boolean
    • Description: Whether to include page break elements when supported.
  10. XML Keep Tags

    • Type: boolean
    • Description: Whether to keep XML tags in the output.
  11. Multi-Page Sections

    • Type: boolean
    • Description: Whether a section may span multiple pages when a chunking strategy is set.
  12. Combine Under N Chars

    • Type: number
    • Description: Combine elements until a section reaches a specified length.
  13. New After N Chars

    • Type: number
    • Description: Cut off new sections after reaching a specified length (soft max).
  14. Max Characters

    • Type: number
    • Default: 500
    • Description: Cut off new sections after reaching a specified length (hard max).
  15. Additional Metadata

    • Type: json
    • Description: Additional metadata to be added to the extracted documents.
  16. Omit Metadata Keys

    • Type: string
    • Description: Comma-separated list of metadata keys to omit from the output.
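
The three size parameters (Combine Under N Chars, New After N Chars, Max Characters) interact, and that interaction is easier to follow in code. The following is a simplified sketch of size-bounded chunking, not the Unstructured implementation. The function name chunk_elements, the (is_title, text) element shape, and the newline joining rule are assumptions made for illustration; the real by_title strategy may apply further rules.

```python
from typing import List, Tuple

# (is_title, text): a hypothetical stand-in for the elements returned by partitioning.
Element = Tuple[bool, str]


def chunk_elements(
    elements: List[Element],
    combine_under_n_chars: int,
    new_after_n_chars: int,
    max_characters: int,
) -> List[str]:
    """Simplified illustration of the three size limits; not Unstructured's code."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")

    chunks: List[str] = []
    current = ""
    for is_title, text in elements:
        # Hard max: an oversized element is cut into pieces of at most max_characters.
        for start in range(0, len(text), max_characters):
            piece = text[start:start + max_characters]
            if current:
                # A title opens a new chunk unless the current one is still small.
                title_break = (
                    is_title and start == 0 and len(current) >= combine_under_n_chars
                )
                # Soft max: stop growing a chunk once it has reached new_after_n_chars.
                soft_break = len(current) >= new_after_n_chars
                # Hard max: joining (plus one newline) must not exceed max_characters.
                hard_break = len(current) + 1 + len(piece) > max_characters
                if title_break or soft_break or hard_break:
                    chunks.append(current)
                    current = ""
            current = f"{current}\n{piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, raising combine_under_n_chars lets short sections merge across a title, while max_characters is never exceeded regardless of the other two values.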

Output

The node outputs an array of document objects, each containing the extracted text content and associated metadata from the processed files.
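
To make the output shape concrete, here is a minimal sketch of how a document object might look once Additional Metadata and Omit Metadata Keys have been applied. The pageContent and metadata field names, the helper apply_metadata_options, the comma-separated format for omitted keys, and the order of the two steps are all assumptions for illustration; this document only states that each object carries text content and metadata.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]  # hypothetical shape: {"pageContent": str, "metadata": dict}


def apply_metadata_options(
    docs: List[Document],
    additional_metadata: Dict[str, Any],
    omit_metadata_keys: str,
) -> List[Document]:
    """Drop the omitted keys, then merge in the additional metadata (assumed order)."""
    omit = {key.strip() for key in omit_metadata_keys.split(",") if key.strip()}
    result: List[Document] = []
    for doc in docs:
        kept = {k: v for k, v in doc["metadata"].items() if k not in omit}
        result.append(
            {
                "pageContent": doc["pageContent"],
                "metadata": {**kept, **additional_metadata},
            }
        )
    return result


docs = [
    {
        "pageContent": "Quarterly report",
        "metadata": {"source": "report.pdf", "filetype": "application/pdf", "page_number": 1},
    }
]
shaped = apply_metadata_options(docs, {"project": "demo"}, "filetype, page_number")
```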

Usage

This node is particularly useful for:

  1. Bulk document processing from a directory
  2. Extracting structured data from various file formats
  3. Preparing documents for further NLP tasks or analysis
  4. Customizing document parsing and metadata extraction
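
For the bulk-processing case, the first step is collecting the files in the folder. The sketch below shows one way this could be done; it is not the node's own code. The helper name list_documents, the recursive search, and the choice of extensions passed by the caller are assumptions.

```python
from pathlib import Path
from typing import Iterable, List


def list_documents(folder_path: str, extensions: Iterable[str]) -> List[Path]:
    """Return the files under folder_path whose extension is in `extensions`."""
    # Normalise "pdf", ".pdf" and "PDF" to the same form.
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    root = Path(folder_path)
    return sorted(
        path
        for path in root.rglob("*")  # assumption: subfolders are searched too
        if path.is_file() and path.suffix.lower().lstrip(".") in wanted
    )
```

Each returned path would then be sent to the Unstructured API with the configured parameters.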

Notes

  • The node supports a wide range of file formats, including PDF, images, and various text-based formats.
  • It integrates with the Unstructured.io API, which may require separate setup and potentially a paid subscription for advanced features.
  • The node offers extensive customization options, allowing fine-tuning of the document processing pipeline to suit specific needs.
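
As a way to keep these options in one place, the settings can be mirrored in a small configuration object. The sketch below is a hypothetical Python mirror, not Flowise code: the class FolderLoaderConfig, its field names, and the example path and URL are assumptions. The option lists and defaults are the ones listed under Optional Parameters, and parameters with no documented default are left unset.

```python
from dataclasses import dataclass, field
from typing import List, Optional

STRATEGIES = ("hi_res", "fast", "ocr_only", "auto")
HI_RES_MODELS = ("chipper", "detectron2_onnx", "yolox", "yolox_quantized")
CHUNKING_STRATEGIES = ("None", "by_title")


@dataclass
class FolderLoaderConfig:
    # Required parameters.
    folder_path: str
    unstructured_api_url: str
    # Optional parameters with their documented defaults.
    strategy: str = "auto"
    encoding: str = "utf-8"
    skip_infer_table_types: List[str] = field(
        default_factory=lambda: ["pdf", "jpg", "png"]
    )
    hi_res_model_name: str = "detectron2_onnx"
    chunking_strategy: str = "by_title"
    source_id_key: str = "source"
    coordinates: bool = False
    max_characters: int = 500
    # No default is documented for these, so they stay unset here.
    combine_under_n_chars: Optional[int] = None
    new_after_n_chars: Optional[int] = None

    def validate(self) -> None:
        """Reject values outside the documented option lists."""
        if self.strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        if self.hi_res_model_name not in HI_RES_MODELS:
            raise ValueError(f"unknown hi-res model: {self.hi_res_model_name!r}")
        if self.chunking_strategy not in CHUNKING_STRATEGIES:
            raise ValueError(f"unknown chunking strategy: {self.chunking_strategy!r}")
        if self.max_characters <= 0:
            raise ValueError("max_characters must be positive")


# Example values only: the path and URL are placeholders, not documented defaults.
config = FolderLoaderConfig(
    folder_path="./docs",
    unstructured_api_url="http://localhost:8000/general/v0/general",
    strategy="hi_res",
)
config.validate()
```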