Unstructured Folder Loader

Node Details

Name: UnstructuredFolder_DocumentLoaders
Type: Document
Version: 2.0
Category: Document Loaders

Input Parameters

Required Parameters

Folder Path
- Type: string
- Description: The path to the folder containing the documents to be processed.
Unstructured API URL
- Type: string
- Default: "http://localhost:8000/general/v0/general"
- Description: The URL of the Unstructured API endpoint.
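
As a rough illustration of how the two required parameters fit together, the sketch below lists the files in a folder and pairs each one with the endpoint it would be sent to. The function names are hypothetical, and the choice to read only top-level files (skipping subfolders) is an assumption, not documented node behavior. No network request is made.

```python
from pathlib import Path

# Default endpoint taken from the parameter table above.
DEFAULT_API_URL = "http://localhost:8000/general/v0/general"


def list_folder_files(folder_path: str) -> list[Path]:
    """Return the regular files directly inside folder_path, sorted by path."""
    folder = Path(folder_path)
    if not folder.is_dir():
        raise NotADirectoryError(f"Not a folder: {folder_path}")
    # Assumption: only top-level files are considered; subfolders are skipped.
    return sorted(p for p in folder.iterdir() if p.is_file())


def plan_requests(folder_path: str, api_url: str = DEFAULT_API_URL) -> list[tuple[str, str]]:
    """Pair each file name with the endpoint it would be posted to."""
    return [(p.name, api_url) for p in list_folder_files(folder_path)]
```

Calling `plan_requests("/data/docs")` on a hypothetical folder would return one `(file name, URL)` pair per file, which is a convenient place to check the path and URL before any processing starts.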

Optional Parameters

Strategy
- Type: options (hi_res, fast, ocr_only, auto)
- Default: "auto"
- Description: The strategy to use for partitioning PDF/image documents.
Encoding
- Type: string
- Default: "utf-8"
- Description: The encoding method used to decode the text input.
Skip Infer Table Types
- Type: multiOptions
- Default: ["pdf", "jpg", "png"]
- Description: Document types for which table extraction is skipped.
Hi-Res Model Name
- Type: options (chipper, detectron2_onnx, yolox, yolox_quantized)
- Default: "detectron2_onnx"
- Description: The name of the inference model used when strategy is hi_res.
Chunking Strategy
- Type: options (None, by_title)
- Default: "by_title"
- Description: Strategy to chunk the returned elements.
OCR Languages
- Type: multiOptions
- Description: Languages to use for OCR.
Source ID Key
- Type: string
- Default: "source"
- Description: Metadata key used to identify the true source of the document.
Coordinates
- Type: boolean
- Default: false
- Description: Whether to return coordinates for each element.
Include Page Breaks
- Type: boolean
- Description: Whether to include page break elements when supported.
XML Keep Tags
- Type: boolean
- Description: Whether to keep XML tags in the output.
Multi-Page Sections
- Type: boolean
- Description: Whether chunked sections may span multiple pages. Applies when a chunking strategy is set.
Combine Under N Chars
- Type: number
- Description: Combine small elements until a section reaches the specified number of characters.
New After N Chars
- Type: number
- Description: Start a new section once the current one reaches the specified number of characters (soft maximum).
Max Characters
- Type: number
- Default: 500
- Description: Never let a section exceed the specified number of characters (hard maximum).
Additional Metadata
- Type: json
- Description: Additional metadata to be added to the extracted documents.
Omit Metadata Keys
- Type: string
- Description: Comma-separated list of metadata keys to omit from the output.
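
The three length parameters (Combine Under N Chars, New After N Chars, Max Characters) interact, which is easier to see in code. The sketch below is a deliberately simplified chunker written for illustration only. It is not the algorithm the Unstructured API runs, and the function name, the `(text, is_title)` input shape, and the single-space joiner are all assumptions.

```python
def chunk_by_title(
    elements: list[tuple[str, bool]],
    combine_under_n_chars: int,
    new_after_n_chars: int,
    max_characters: int = 500,  # default taken from the parameter table
) -> list[str]:
    """Group (text, is_title) elements into sections under three length rules."""
    chunks: list[str] = []
    current = ""

    def flush() -> None:
        nonlocal current
        if current:
            chunks.append(current)
            current = ""

    for text, is_title in elements:
        # Hard max: an oversized element is cut into max_characters pieces.
        pieces = [text[i:i + max_characters] for i in range(0, len(text), max_characters)]
        for index, piece in enumerate(pieces):
            starts_section = is_title and index == 0
            candidate_len = len(current) + (1 if current else 0) + len(piece)
            if current and (
                candidate_len > max_characters          # hard max would be exceeded
                or len(current) >= new_after_n_chars    # soft max already reached
                # A title opens a new section unless the current one is still small.
                or (starts_section and len(current) >= combine_under_n_chars)
            ):
                flush()
            current = f"{current} {piece}" if current else piece
    flush()
    return chunks
```

In this sketch, raising `combine_under_n_chars` lets short sections absorb the next title instead of ending, while `new_after_n_chars` closes a section early even though it would still fit under `max_characters`.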

Output

The node outputs an array of document objects, each containing the extracted text content and associated metadata from the processed files.
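
To make the output shape concrete, here is a minimal sketch of a document object and of how Additional Metadata and Omit Metadata Keys might be applied to it. The `Document` class, its field names (`page_content`, `metadata`), the `finalize_metadata` helper, and the order of operations (omit first, then merge the additional metadata) are assumptions made for illustration, not the node's confirmed implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Document:
    """Hypothetical output item: extracted text plus its metadata."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def finalize_metadata(
    doc: Document,
    additional_metadata: Optional[dict[str, Any]] = None,
    omit_metadata_keys: str = "",
) -> Document:
    """Drop omitted keys, then merge in the additional metadata."""
    # Assumption: keys to omit arrive as one comma-separated string.
    omit = {key.strip() for key in omit_metadata_keys.split(",") if key.strip()}
    kept = {key: value for key, value in doc.metadata.items() if key not in omit}
    # Assumption: additional metadata is merged last, so it wins on key clashes.
    kept.update(additional_metadata or {})
    return Document(doc.page_content, kept)


doc = Document("Hello", {"source": "a.pdf", "filetype": "application/pdf", "page_number": 1})
result = finalize_metadata(doc, {"project": "demo"}, "filetype, page_number")
# result.metadata == {"source": "a.pdf", "project": "demo"}
```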

Usage

This node is particularly useful for:

- Bulk document processing from a directory
- Extracting structured data from various file formats
- Preparing documents for further NLP tasks or analysis
- Customizing document parsing and metadata extraction

Notes

The node supports a wide range of file formats, including PDF, images, and various text-based formats.
It integrates with the Unstructured.io API, which may require separate setup and potentially a paid subscription for advanced features.
The node offers extensive customization options, allowing fine-tuning of the document processing pipeline to suit specific needs.
