S3 Document Loader - Ardor Docs

Node Details

AWS Credential: Optional. If provided, it should contain an AWS access key ID and a secret access key.

Unstructured API KEY: (password) API key for the Unstructured service.
Strategy: (options) The strategy for partitioning PDF/image files (hi_res, fast, ocr_only, auto).
Encoding: (string) The character encoding used to decode text input (for example, utf-8).
Skip Infer Table Types: (multiOptions) Document types to skip table extraction for.
Hi-Res Model Name: (options) The inference model used for hi-res strategy.
Chunking Strategy: (options) Strategy for chunking returned elements.
OCR Languages: (multiOptions) Languages to use for OCR.
Source ID Key: (string) Metadata key whose value identifies the true source of the document.
Coordinates: (boolean) Whether to return coordinates for each element.
XML Keep Tags: (boolean) Whether to retain XML tags in the output.
Include Page Breaks: (boolean) Whether to include page break elements in the output.
Multi-Page Sections: (boolean) Whether a section is allowed to span multiple pages when chunking.
Combine Under N Chars: (number) When chunking, small elements are combined until the combined section reaches this many characters.
New After N Chars: (number) Soft limit: when chunking, a new section is started once the current one reaches this many characters.
Max Characters: (number) Hard limit on the number of characters in a section.
Additional Metadata: (json) Extra metadata to add to extracted documents.
Omit Metadata Keys: (string) List of metadata keys to omit from the output.
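
As a rough illustration of how these settings might translate into a request, the sketch below flattens a few of them into form fields for the Unstructured partition API. The field names used as dictionary keys (`strategy`, `ocr_languages`, `max_characters`, and so on) follow the Unstructured API's parameter naming as I understand it. The helper `build_partition_fields` and its serialization choices are assumptions made for illustration, not the node's actual implementation.

```python
from typing import Any


def build_partition_fields(settings: dict[str, Any]) -> list[tuple[str, str]]:
    """Flatten node-style settings into (name, value) form fields.

    Hypothetical helper: unset options are skipped, booleans become
    "true"/"false", and multi-value options repeat the field name.
    """
    fields: list[tuple[str, str]] = []
    for name, value in settings.items():
        if value is None or value == "" or value == []:
            continue  # leave unset options to the API's defaults
        if isinstance(value, bool):
            fields.append((name, "true" if value else "false"))
        elif isinstance(value, (list, tuple)):
            fields.extend((name, str(item)) for item in value)
        else:
            fields.append((name, str(value)))
    return fields


# Example values only; adjust to the options selected on the node.
settings = {
    "strategy": "hi_res",
    "hi_res_model_name": None,          # unset, so it is skipped
    "ocr_languages": ["eng", "deu"],    # multiOptions become repeated fields
    "coordinates": False,
    "chunking_strategy": "by_title",
    "combine_under_n_chars": 200,
    "new_after_n_chars": 1000,
    "max_characters": 1500,
}
fields = build_partition_fields(settings)
```

A list of pairs is used rather than a dictionary so that repeated fields such as `ocr_languages` survive intact.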

The node first authenticates with AWS using the provided credentials, if any.
It then downloads the specified file from the S3 bucket to a temporary local directory.
The downloaded file is processed using the Unstructured API with the specified parameters.
The processed content is returned as structured documents.
Additional metadata is added, and specified metadata keys are omitted if requested.
The temporary file is deleted after processing.
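
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the node's implementation: `load_from_s3`, `download`, and `partition` are hypothetical names, and the download and partition steps are passed in as callables so that the temporary-file lifecycle can be exercised without network access. `boto3_download` shows one way the download step might look, assuming boto3 is installed and AWS credentials are configured. The `pageContent`/`metadata` document shape is likewise assumed.

```python
import os
import shutil
import tempfile
from typing import Callable


def load_from_s3(
    bucket: str,
    key: str,
    download: Callable[[str, str, str], None],
    partition: Callable[[str], list],
) -> list:
    """Download one object to a temp directory, partition it, then clean up."""
    tmp_dir = tempfile.mkdtemp(prefix="s3-loader-")
    local_path = os.path.join(tmp_dir, os.path.basename(key))
    try:
        download(bucket, key, local_path)  # fetch the object to local disk
        return partition(local_path)       # extract structured documents
    finally:
        # Remove the temporary copy even if a step above raised.
        shutil.rmtree(tmp_dir, ignore_errors=True)


def boto3_download(bucket: str, key: str, local_path: str) -> None:
    """One possible download step (assumes boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3

    boto3.client("s3").download_file(bucket, key, local_path)


# Stand-ins for the real steps, used only to exercise the flow locally.
seen_paths = []


def fake_download(bucket: str, key: str, local_path: str) -> None:
    with open(local_path, "w", encoding="utf-8") as fh:
        fh.write("hello from " + bucket)


def fake_partition(local_path: str) -> list:
    seen_paths.append(local_path)
    with open(local_path, encoding="utf-8") as fh:
        # Assumed document shape: text content plus a metadata mapping.
        return [{"pageContent": fh.read(), "metadata": {}}]


docs = load_from_s3("example-bucket", "reports/q1.txt", fake_download, fake_partition)
```

Putting the cleanup in a `finally` block is a design choice here: it keeps temporary files from accumulating when the download or the partition call fails.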

The node outputs an array of structured documents, each containing the extracted content and associated metadata.
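
To make the output shape and the metadata handling more concrete, here is a small sketch. The `pageContent`/`metadata` field names, the comma-separated form of the omit list, and the order of operations (omit first, then merge the additional metadata) are all assumptions for illustration. The text above only states that each document carries extracted content and associated metadata.

```python
from typing import Any


def apply_metadata(
    docs: list[dict[str, Any]],
    additional: dict[str, Any],
    omit_keys: str,
) -> list[dict[str, Any]]:
    """Return new documents with omitted keys dropped and extra metadata merged.

    Hypothetical helper; `omit_keys` is assumed to be a comma-separated string.
    """
    omit = {k.strip() for k in omit_keys.split(",") if k.strip()}
    result = []
    for doc in docs:
        kept = {k: v for k, v in doc["metadata"].items() if k not in omit}
        result.append(
            {"pageContent": doc["pageContent"], "metadata": {**kept, **additional}}
        )
    return result


# Example values only; real metadata depends on the file and the settings.
docs = [
    {
        "pageContent": "Quarterly revenue rose.",
        "metadata": {
            "filename": "q1.pdf",
            "page_number": 1,
            "filetype": "application/pdf",
        },
    }
]
cleaned = apply_metadata(docs, {"team": "finance"}, "filetype, page_number")
```

The helper builds new dictionaries instead of editing the inputs, so the original documents remain available for comparison.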

Extracting text and structure from documents stored in S3 buckets
Processing various file types (PDF, images, Office documents, etc.) for NLP tasks
Preparing data for further analysis or machine learning pipelines

Ensure that the Unstructured API is accessible and properly configured.
Be mindful of AWS and Unstructured API usage costs when processing large volumes of documents.
The node supports a wide range of file formats, but performance may vary depending on the file type and chosen strategy.
