Unstructured File Loader
The Unstructured File Loader is a document loader node that uses Unstructured.io to load and process data from various file types. It’s designed to extract structured information from unstructured documents, making it easier to work with complex file formats in natural language processing pipelines.
Node Details
- Name: UnstructuredFile_DocumentLoaders
- Type: Document
- Category: Document Loaders
- Version: 3.0
Input Parameters
Main Parameters
- File Path (optional)
  - Type: string
  - Description: Path to the file to be processed. This will be deprecated in future releases.
- Files Upload
  - Type: file
  - Description: Files to be processed. Multiple files can be uploaded.
  - Supported file types: .txt, .text, .pdf, .docx, .doc, .jpg, .jpeg, .eml, .html, .htm, .md, .pptx, .ppt, .msg, .rtf, .xlsx, .xls, .odt, .epub
- Unstructured API URL
  - Type: string
  - Default: “http://localhost:8000/general/v0/general”
  - Description: The URL for the Unstructured API.
- Strategy
  - Type: options
  - Options: Hi-Res, Fast, OCR Only, Auto
  - Default: “auto”
  - Description: The strategy to use for partitioning PDF and image files.
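As a rough illustration of how the main parameters might translate into a call to the Unstructured API, the sketch below maps the Strategy labels listed above to lowercase API values and assembles the form fields. The helper name `build_partition_request` and the label-to-value mapping are assumptions made for this example, not the node's actual implementation, and no request is sent.

```python
# Hypothetical mapping from the node's Strategy labels to the values
# assumed to be accepted by the Unstructured partition endpoint.
STRATEGY_VALUES = {
    "Hi-Res": "hi_res",
    "Fast": "fast",
    "OCR Only": "ocr_only",
    "Auto": "auto",
}

# Default URL taken from the parameter list above.
DEFAULT_API_URL = "http://localhost:8000/general/v0/general"


def build_partition_request(strategy_label="Auto", api_url=DEFAULT_API_URL):
    """Return the target URL and form fields for one partition call."""
    if strategy_label not in STRATEGY_VALUES:
        raise ValueError(f"Unknown strategy: {strategy_label!r}")
    return {
        "url": api_url,
        "data": {"strategy": STRATEGY_VALUES[strategy_label]},
    }


request = build_partition_request("Hi-Res")
print(request["url"])
print(request["data"])
```

Keeping the request assembly separate from the network call makes the parameter handling easy to inspect before any file is uploaded.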
Additional Parameters
- Encoding (optional)
  - Type: string
  - Default: “utf-8”
  - Description: The encoding method used to decode the text input.
- Skip Infer Table Types (optional)
  - Type: multiOptions
  - Default: [“pdf”, “jpg”, “png”]
  - Description: Document types for which table extraction is skipped.
- Hi-Res Model Name (optional)
  - Type: options
  - Options: chipper, detectron2_onnx, yolox, yolox_quantized
  - Default: “detectron2_onnx”
  - Description: The name of the inference model used when strategy is hi_res.
- Chunking Strategy (optional)
  - Type: options
  - Options: None, By Title
  - Default: “by_title”
  - Description: Strategy to chunk the returned elements.
- OCR Languages (optional)
  - Type: multiOptions
  - Description: The languages to use for OCR.
- Source ID Key (optional)
  - Type: string
  - Default: “source”
  - Description: Key used to get the true source of the document.
- Coordinates (optional)
  - Type: boolean
  - Default: false
  - Description: If true, return coordinates for each element.
- XML Keep Tags (optional)
  - Type: boolean
  - Description: Whether to keep XML tags in the output.
- Include Page Breaks (optional)
  - Type: boolean
  - Description: When true, include page break elements when the filetype supports it.
- Multi-Page Sections (optional)
  - Type: boolean
  - Description: Whether to treat multi-page documents as separate sections.
- Combine Under N Chars (optional)
  - Type: number
  - Description: Combine elements until a section reaches a length of n chars.
- New After N Chars (optional)
  - Type: number
  - Description: Cut off new sections after reaching a length of n chars (soft max).
- Max Characters (optional)
  - Type: number
  - Default: 500
  - Description: Cut off new sections after reaching a length of n chars (hard max).
- Additional Metadata (optional)
  - Type: json
  - Description: Additional metadata to be added to the extracted documents.
- Omit Metadata Keys (optional)
  - Type: string
  - Description: List of metadata keys to omit from the output.
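The difference between the soft max (New After N Chars) and the hard max (Max Characters) can be easier to see in code. The following is a simplified sketch, assuming plain-text elements and a greedy packing rule; the function `chunk_elements` is hypothetical and is not how Unstructured implements chunking. It does not model Combine Under N Chars or title-based section boundaries.

```python
def chunk_elements(elements, max_characters=500, new_after_n_chars=None):
    """Greedily pack text elements into chunks (simplified illustration)."""
    # The soft max can never exceed the hard max.
    soft_max = max_characters
    if new_after_n_chars is not None:
        soft_max = min(new_after_n_chars, max_characters)

    chunks = []
    current = ""
    for text in elements:
        # Hard max: split any single element that is too long on its own.
        pieces = [text[i:i + max_characters]
                  for i in range(0, len(text), max_characters)]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            reached_soft_max = len(current) >= soft_max
            exceeds_hard_max = len(candidate) > max_characters
            if current and (reached_soft_max or exceeds_hard_max):
                chunks.append(current)  # close the current chunk
                current = piece         # start a new one
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks


elements = ["a" * 300, "b" * 300, "c" * 100]
print([len(c) for c in chunk_elements(elements)])
print([len(c) for c in chunk_elements(elements, new_after_n_chars=250)])
```

In this sketch, lowering the soft max produces more, smaller chunks, while the hard max only comes into play when an element or a combined chunk would otherwise exceed it.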
Output
The node outputs an array of IDocument objects, each representing a processed document with its content and metadata.
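To make the output shape concrete, here is a minimal sketch of how the Additional Metadata and Omit Metadata Keys parameters might affect each returned document. The field names `pageContent` and `metadata`, the helper `apply_metadata_options`, and the order of operations (omit first, then merge) are assumptions for illustration rather than confirmed behaviour of the node.

```python
def apply_metadata_options(documents, additional_metadata=None, omit_keys=()):
    """Return new documents with unwanted keys dropped and extra metadata merged in."""
    additional_metadata = additional_metadata or {}
    result = []
    for doc in documents:
        # Drop the keys the user asked to omit (assumed to apply to loader metadata only).
        metadata = {k: v for k, v in doc["metadata"].items() if k not in omit_keys}
        # Merge user-supplied metadata; it overrides loader values on key clashes.
        metadata.update(additional_metadata)
        result.append({"pageContent": doc["pageContent"], "metadata": metadata})
    return result


docs = [
    {"pageContent": "Quarterly report", "metadata": {"source": "report.pdf", "page_number": 1}},
]
processed = apply_metadata_options(
    docs,
    additional_metadata={"project": "demo"},
    omit_keys=("page_number",),
)
print(processed[0]["metadata"])
```

The helper builds new dictionaries instead of mutating its input, so the original documents remain available for comparison.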
Credentials
- Unstructured API Key (optional): API key for accessing the Unstructured.io API.
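Because the API key is optional, a client would typically attach it only when one is configured, for example when calling the hosted Unstructured.io API rather than a local instance. The sketch below assumes the key is sent in an `unstructured-api-key` request header; the helper name `build_headers` is hypothetical.

```python
def build_headers(api_key=None):
    """Build request headers, adding the API key only when one is provided."""
    headers = {"accept": "application/json"}
    if api_key:
        # Assumed header name for authenticating against the Unstructured API.
        headers["unstructured-api-key"] = api_key
    return headers


print(build_headers())
print(build_headers("my-secret-key"))
```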
Use Cases
- Extracting text and structure from complex document formats like PDFs or scanned images.
- Processing multiple files of different types in a single workflow.
- Preparing unstructured data for further NLP tasks such as summarization, question answering, or information extraction.
Notes
- This node is particularly useful for handling documents with complex layouts or mixed content types.
- The Unstructured.io API provides advanced document processing capabilities, including OCR for images and scanned documents.
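- Users can fine-tune the processing by adjusting the parameters described above, such as the partitioning strategy, chunking options, and OCR languages.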