
Node Details
- Name: UnstructuredFile_DocumentLoaders
- Type: Document
- Category: Document Loaders
- Version: 3.0
Input Parameters
Main Parameters
-
File Path (optional)
- Type: string
- Description: Path to the file to be processed. This parameter will be deprecated in a future release.
-
Files Upload
- Type: file
- Description: Files to be processed. Multiple files can be uploaded.
- Supported file types: .txt, .text, .pdf, .docx, .doc, .jpg, .jpeg, .eml, .html, .htm, .md, .pptx, .ppt, .msg, .rtf, .xlsx, .xls, .odt, .epub
-
Unstructured API URL
- Type: string
- Default: “http://localhost:8000/general/v0/general”
- Description: The URL for the Unstructured API.
-
Strategy
- Type: options
- Options: Hi-Res, Fast, OCR Only, Auto
- Default: “auto”
- Description: The strategy to use for partitioning PDF and image files.
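As a rough illustration of how the main parameters might translate into a request, the sketch below maps the Strategy labels to snake_case values and assembles the target URL and form fields. The label-to-value mapping, the `strategy` field name, and the `build_partition_fields` helper are assumptions made for illustration (only `auto` and `hi_res` appear on this page). It is not the node's actual implementation.

```python
# Default endpoint listed under "Unstructured API URL".
DEFAULT_API_URL = "http://localhost:8000/general/v0/general"

# Assumed mapping from the UI option labels to API values (hypothetical).
STRATEGY_VALUES = {
    "Hi-Res": "hi_res",
    "Fast": "fast",
    "OCR Only": "ocr_only",
    "Auto": "auto",
}


def build_partition_fields(strategy_label="Auto", api_url=DEFAULT_API_URL):
    """Return the URL and form fields for one partition request."""
    if strategy_label not in STRATEGY_VALUES:
        raise ValueError(f"Unknown strategy: {strategy_label!r}")
    # Using "strategy" as the form field name is an assumption about the API.
    fields = {"strategy": STRATEGY_VALUES[strategy_label]}
    return {"url": api_url, "fields": fields}


request = build_partition_fields("Hi-Res")
print(request["url"])
print(request["fields"])
```

An HTTP client could then send these fields as a multipart POST with the uploaded file attached; how the file part is named depends on the API and is not covered here.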
Additional Parameters
-
Encoding (optional)
- Type: string
- Default: “utf-8”
- Description: The encoding method used to decode the text input.
-
Skip Infer Table Types (optional)
- Type: multiOptions
- Default: [“pdf”, “jpg”, “png”]
- Description: Document types for which table extraction is skipped.
-
Hi-Res Model Name (optional)
- Type: options
- Options: chipper, detectron2_onnx, yolox, yolox_quantized
- Default: “detectron2_onnx”
- Description: The name of the inference model used when strategy is hi_res.
-
Chunking Strategy (optional)
- Type: options
- Options: None, By Title
- Default: “by_title”
- Description: Strategy to chunk the returned elements.
-
OCR Languages (optional)
- Type: multiOptions
- Description: The languages to use for OCR.
-
Source ID Key (optional)
- Type: string
- Default: “source”
- Description: Key used to get the true source of the document.
-
Coordinates (optional)
- Type: boolean
- Default: false
- Description: If true, return coordinates for each element.
-
XML Keep Tags (optional)
- Type: boolean
- Description: Whether to keep XML tags in the output.
-
Include Page Breaks (optional)
- Type: boolean
- Description: When true, include page break elements when the filetype supports it.
-
Multi-Page Sections (optional)
- Type: boolean
- Description: Whether to treat multi-page documents as separate sections.
-
Combine Under N Chars (optional)
- Type: number
- Description: Combine elements until a section reaches a length of n chars.
-
New After N Chars (optional)
- Type: number
- Description: Cut off new sections after reaching a length of n chars (soft max).
-
Max Characters (optional)
- Type: number
- Default: 500
- Description: Cut off new sections after reaching a length of n chars (hard max).
-
Additional Metadata (optional)
- Type: json
- Description: Additional metadata to be added to the extracted documents.
-
Omit Metadata Keys (optional)
- Type: string
- Description: List of metadata keys to omit from the output.
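To make the interplay of Combine Under N Chars, New After N Chars, and Max Characters more concrete, here is a simplified sketch of title-based chunking. It is an illustration only, not the algorithm the Unstructured API runs: the `chunk_by_title` function, the `(kind, text)` element tuples, and the exact merge rules are all assumptions.

```python
def chunk_by_title(elements, combine_under_n_chars, new_after_n_chars, max_characters):
    """Group (kind, text) elements into chunks using the three size limits."""
    chunks, current = [], ""
    for kind, text in elements:
        joined = f"{current}\n{text}" if current else text
        # A title opens a new section unless the current one is still small.
        starts_section = kind == "Title" and len(current) >= combine_under_n_chars
        # Soft max: close the section once it has grown past the threshold.
        past_soft_max = len(current) >= new_after_n_chars
        if current and (starts_section or past_soft_max or len(joined) > max_characters):
            chunks.append(current)
            current = text
        else:
            current = joined
    if current:
        chunks.append(current)
    # Hard max: split anything that is still too long.
    return [
        chunk[i:i + max_characters]
        for chunk in chunks
        for i in range(0, len(chunk), max_characters)
    ]


elements = [
    ("Title", "Intro"), ("NarrativeText", "a" * 30),
    ("Title", "Next"), ("NarrativeText", "b" * 30),
    ("Title", "Last"), ("NarrativeText", "c" * 90),
    ("NarrativeText", "d" * 150),
]
chunks = chunk_by_title(elements, combine_under_n_chars=50,
                        new_after_n_chars=80, max_characters=100)
print([len(chunk) for chunk in chunks])  # [72, 95, 100, 50]
```

In this run the short "Intro" section is merged with "Next" because it is under 50 characters, the "Last" section is closed once it passes the soft limit of 80, and the 150-character element is split at the hard limit of 100.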
Output
The node outputs an array of IDocument objects, each representing a processed document with its content and metadata.
Credentials
- Unstructured API Key (optional): API key for accessing the Unstructured.io API.
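Since the API key is optional, a client would typically attach it only when one is configured. The sketch below shows one way to do that with the Python standard library. The `unstructured-api-key` header name and the `build_request` helper are assumptions for illustration; check the API documentation for the exact header your deployment expects.

```python
import urllib.request


def build_request(api_url, api_key=None):
    """Prepare (but do not send) a POST request, adding the key if present."""
    headers = {"accept": "application/json"}
    if api_key:
        # Assumed header name for the Unstructured API key.
        headers["unstructured-api-key"] = api_key
    return urllib.request.Request(api_url, headers=headers, method="POST")


req = build_request("http://localhost:8000/general/v0/general", api_key="my-key")
print(req.get_method(), req.full_url)
```

A locally hosted API, such as the default `localhost` URL, may not require a key at all, in which case `api_key` can be left as `None`.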
Use Cases
- Extracting text and structure from complex document formats like PDFs or scanned images.
- Processing multiple files of different types in a single workflow.
- Preparing unstructured data for further NLP tasks such as summarization, question answering, or information extraction.
Notes
- This node is particularly useful for handling documents with complex layouts or mixed content types.
- The Unstructured.io API provides advanced document processing capabilities, including OCR for images and scanned documents.
- Users can fine-tune the processing by