Unstructured File Loader

Node Details

Name: UnstructuredFile_DocumentLoaders
Type: Document
Category: Document Loaders
Version: 3.0

Input Parameters

Main Parameters

File Path (optional)
- Type: string
- Description: Path to the file to be processed. This parameter will be deprecated in a future release; use Files Upload instead.
Files Upload
- Type: file
- Description: Files to be processed. Multiple files can be uploaded.
- Supported file types: .txt, .text, .pdf, .docx, .doc, .jpg, .jpeg, .eml, .html, .htm, .md, .pptx, .ppt, .msg, .rtf, .xlsx, .xls, .odt, .epub
Unstructured API URL
- Type: string
- Default: "http://localhost:8000/general/v0/general"
- Description: The URL for the Unstructured API.
Strategy
- Type: options
- Options: Hi-Res, Fast, OCR Only, Auto
- Default: "auto"
- Description: The strategy to use for partitioning PDF and image files.
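
As a rough illustration of how the main parameters fit together, the sketch below turns a Strategy option label and an API URL into request settings. The label-to-value mapping and the helper name `build_request_settings` are assumptions made for this example, not the node's actual implementation; only the default URL and the `auto` and `hi_res` values appear in this page.

```python
from typing import Dict, Optional

# Default taken from the "Unstructured API URL" parameter above.
DEFAULT_API_URL = "http://localhost:8000/general/v0/general"

# Assumed mapping from the option labels shown in the UI to API values.
STRATEGY_VALUES = {
    "Hi-Res": "hi_res",
    "Fast": "fast",
    "OCR Only": "ocr_only",
    "Auto": "auto",
}


def build_request_settings(
    strategy_label: str = "Auto", api_url: Optional[str] = None
) -> Dict[str, object]:
    """Return the URL and form fields for one partition request (hypothetical helper)."""
    if strategy_label not in STRATEGY_VALUES:
        raise ValueError(f"Unknown strategy: {strategy_label!r}")
    return {
        "url": api_url or DEFAULT_API_URL,
        "fields": {"strategy": STRATEGY_VALUES[strategy_label]},
    }


settings = build_request_settings("Hi-Res")
print(settings["url"], settings["fields"])
```

Rejecting unknown labels early keeps a typo in the strategy from silently falling back to a different partitioning mode.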

Additional Parameters

Encoding (optional)
- Type: string
- Default: "utf-8"
- Description: The encoding method used to decode the text input.
Skip Infer Table Types (optional)
- Type: multiOptions
- Default: ["pdf", "jpg", "png"]
- Description: Document types for which table extraction is skipped.
Hi-Res Model Name (optional)
- Type: options
- Options: chipper, detectron2_onnx, yolox, yolox_quantized
- Default: "detectron2_onnx"
- Description: The name of the inference model used when strategy is hi_res.
Chunking Strategy (optional)
- Type: options
- Options: None, By Title
- Default: "by_title"
- Description: Strategy to chunk the returned elements.
OCR Languages (optional)
- Type: multiOptions
- Description: The languages to use for OCR.
Source ID Key (optional)
- Type: string
- Default: "source"
- Description: Key used to get the true source of the document.
Coordinates (optional)
- Type: boolean
- Default: false
- Description: If true, return coordinates for each element.
XML Keep Tags (optional)
- Type: boolean
- Description: Whether to keep XML tags in the output.
Include Page Breaks (optional)
- Type: boolean
- Description: When true, include page break elements when the filetype supports it.
Multi-Page Sections (optional)
- Type: boolean
- Description: Whether to treat multi-page documents as separate sections.
Combine Under N Chars (optional)
- Type: number
- Description: Combine elements until a section reaches a length of n chars.
New After N Chars (optional)
- Type: number
- Description: Cut off new sections after reaching a length of n chars (soft max).
Max Characters (optional)
- Type: number
- Default: 500
- Description: Cut off new sections after reaching a length of n chars (hard max).
Additional Metadata (optional)
- Type: json
- Description: Additional metadata to be added to the extracted documents.
Omit Metadata Keys (optional)
- Type: string
- Description: List of metadata keys to omit from the output.
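
The interplay between Combine Under N Chars, New After N Chars (soft max), and Max Characters (hard max) can be easier to see in code. The following is a deliberately simplified model of title-based chunking, not the algorithm Unstructured actually runs: the function name `chunk_sections`, the `(category, text)` element shape, and the fallback values used when a parameter is unset are all assumptions for illustration.

```python
from typing import List, Optional, Tuple

Element = Tuple[str, str]  # (category, text), e.g. ("Title", "Introduction")


def chunk_sections(
    elements: List[Element],
    max_characters: int = 500,
    new_after_n_chars: Optional[int] = None,
    combine_under_n_chars: Optional[int] = None,
) -> List[str]:
    """Toy chunker showing how the three size limits interact."""
    # Assumption: the soft max never exceeds the hard max.
    soft_max = max_characters if new_after_n_chars is None else min(new_after_n_chars, max_characters)
    # Assumption: when unset, small sections are never merged across titles.
    combine_under = 0 if combine_under_n_chars is None else combine_under_n_chars

    chunks: List[str] = []
    current = ""

    def flush() -> None:
        nonlocal current
        if current:
            chunks.append(current)
            current = ""

    for category, text in elements:
        # A title opens a new section unless the current one is still "small".
        if category == "Title" and len(current) >= combine_under:
            flush()
        # Hard max: text longer than max_characters is cut into pieces.
        pieces = [text[i:i + max_characters] for i in range(0, len(text), max_characters)]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            # Close the chunk if it would break the hard max or already passed the soft max.
            if current and (len(candidate) > max_characters or len(current) >= soft_max):
                flush()
                candidate = piece
            current = candidate
    flush()
    return chunks


elements = [
    ("Title", "A"), ("NarrativeText", "x" * 10),
    ("Title", "B"), ("NarrativeText", "y" * 10),
]
print(len(chunk_sections(elements, max_characters=50)))
print(len(chunk_sections(elements, max_characters=50, combine_under_n_chars=20)))
```

In this model the hard max is the only limit that is never exceeded; the soft max only stops a chunk from growing further once it has been reached, and the combine threshold only decides whether a new title starts a new chunk.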

Output

The node outputs an array of IDocument objects, each representing a processed document with its content and metadata.
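
To make the output shape concrete, here is a small sketch of how Additional Metadata and Omit Metadata Keys might be applied to each returned document. The `pageContent` field name, the comma-separated format for omitted keys, and the order of operations (merge first, then omit) are assumptions for this example; the page only states that each document carries content and metadata.

```python
from typing import Any, Dict, List, Optional

Document = Dict[str, Any]  # assumed shape: {"pageContent": str, "metadata": dict}


def apply_metadata_options(
    documents: List[Document],
    additional_metadata: Optional[Dict[str, Any]] = None,
    omit_keys: str = "",
) -> List[Document]:
    """Merge extra metadata into each document, then drop the omitted keys."""
    # Assumption: omitted keys are given as one comma-separated string.
    omit = {key.strip() for key in omit_keys.split(",") if key.strip()}
    result: List[Document] = []
    for doc in documents:
        merged = {**doc["metadata"], **(additional_metadata or {})}
        kept = {key: value for key, value in merged.items() if key not in omit}
        result.append({"pageContent": doc["pageContent"], "metadata": kept})
    return result


docs = [{"pageContent": "Hello", "metadata": {"source": "a.pdf", "page_number": 1}}]
print(apply_metadata_options(docs, {"team": "docs"}, "page_number"))
```

The helper builds new dictionaries rather than mutating its input, so the original documents remain available for comparison.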

Credentials

Unstructured API Key (optional): API key for accessing the Unstructured.io API.
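
Because the key is optional, a client would typically attach it only when one is configured, for example when calling a hosted endpoint rather than a local instance. The sketch below assumes the key is sent in an `unstructured-api-key` request header; the helper name `build_headers` is hypothetical, and how this node forwards the credential internally is not described in this page.

```python
from typing import Dict, Optional


def build_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Return request headers, adding the API key only when one is set."""
    headers = {"accept": "application/json"}
    if api_key:
        # Assumed header name for the Unstructured API key.
        headers["unstructured-api-key"] = api_key
    return headers


print(build_headers())           # local instance, no key configured
print(build_headers("my-key"))   # placeholder key for illustration
```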

Use Cases

Extracting text and structure from complex document formats like PDFs or scanned images.
Processing multiple files of different types in a single workflow.
Preparing unstructured data for further NLP tasks such as summarization, question answering, or information extraction.

Notes

This node is particularly useful for handling documents with complex layouts or mixed content types.
The Unstructured.io API provides advanced document processing capabilities, including OCR for images and scanned documents.
Users can fine-tune the processing by

