
Node Details
- Name: S3
- Type: Document
- Version: 3.0
- Category: Document Loaders
Credentials
- AWS Credential: Optional. If provided, it must contain an AWS access key ID and a secret access key.
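Because the credential is optional, the S3 client has to be set up in two ways. The sketch below is one possible approach, not the node's actual code: `build_s3_client_kwargs` is a hypothetical helper, and falling back to the SDK's default credential chain when no keys are given is an assumption about the node's behavior.

```python
from typing import Optional


def build_s3_client_kwargs(
    region: str,
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
) -> dict:
    """Build keyword arguments for an S3 client (hypothetical helper)."""
    kwargs = {"region_name": region}
    # Pass explicit keys only when both are present. Otherwise leave them out
    # so the SDK can resolve credentials itself (environment variables,
    # shared config files, instance roles). This fallback is an assumption.
    if access_key_id and secret_access_key:
        kwargs["aws_access_key_id"] = access_key_id
        kwargs["aws_secret_access_key"] = secret_access_key
    return kwargs


# With boto3 installed, the result could be used like this:
#   import boto3
#   s3 = boto3.client("s3", **build_s3_client_kwargs("us-east-1"))
```

Leaving the keys out rather than passing empty strings lets the SDK decide where credentials come from.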
Input Parameters
Required Parameters
- Bucket: (string) The name of the S3 bucket containing the document.
- Object Key: (string) The unique identifier for the object in the S3 bucket.
- Region: (string) The AWS region where the S3 bucket is located.
- Unstructured API URL: (string) The URL of the Unstructured API service.
Optional Parameters
- Unstructured API KEY: (password) API key for the Unstructured service.
- Strategy: (options) The strategy for partitioning PDF/image files (hi_res, fast, ocr_only, auto).
- Encoding: (string) The encoding method for decoding text input.
- Skip Infer Table Types: (multiOptions) Document types to skip table extraction for.
- Hi-Res Model Name: (options) The inference model used for hi-res strategy.
- Chunking Strategy: (options) Strategy for chunking returned elements.
- OCR Languages: (multiOptions) Languages to use for OCR.
- Source ID Key: (string) Key used to identify the true source of the document.
- Coordinates: (boolean) Whether to return coordinates for each element.
- XML Keep Tags: (boolean) Whether to retain XML tags in the output.
- Include Page Breaks: (boolean) Whether to include page break elements in the output.
- Multi-Page Sections: (boolean) Whether a section may span multiple pages when chunking.
- Combine Under N Chars: (number) When chunking, sections shorter than this many characters are combined with adjacent ones.
- New After N Chars: (number) Soft limit: when chunking, a new section is started once the current one reaches this many characters.
- Max Characters: (number) Hard limit on the number of characters in a section.
- Additional Metadata: (json) Extra metadata to add to extracted documents.
- Omit Metadata Keys: (string) Comma-separated list of metadata keys to omit from the output.
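The optional parameters above are eventually sent to the Unstructured service as request fields. The sketch below shows one way to turn the node's labels into such fields while skipping anything left empty. The snake_case field names in `FIELD_NAMES` and the helper `build_partition_options` are assumptions for illustration; check them against the Unstructured API version you are running.

```python
from typing import Any, Dict

# Hypothetical mapping from the node's parameter labels to snake_case request
# fields. The field names are assumptions, not confirmed by this page.
FIELD_NAMES = {
    "Strategy": "strategy",
    "Encoding": "encoding",
    "Skip Infer Table Types": "skip_infer_table_types",
    "Hi-Res Model Name": "hi_res_model_name",
    "Chunking Strategy": "chunking_strategy",
    "OCR Languages": "ocr_languages",
    "Coordinates": "coordinates",
    "XML Keep Tags": "xml_keep_tags",
    "Include Page Breaks": "include_page_breaks",
    "Multi-Page Sections": "multipage_sections",
    "Combine Under N Chars": "combine_under_n_chars",
    "New After N Chars": "new_after_n_chars",
    "Max Characters": "max_characters",
}


def build_partition_options(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Collect only the options the user actually set."""
    options: Dict[str, Any] = {}
    for label, field in FIELD_NAMES.items():
        value = inputs.get(label)
        # Skip unset values, but keep an explicit False or 0.
        if value is None or value == "" or value == []:
            continue
        options[field] = value
    return options


example = build_partition_options(
    {"Strategy": "hi_res", "Coordinates": False, "Encoding": ""}
)
```

Here `example` contains `strategy` and `coordinates` but not `encoding`, so the service's own defaults apply to every parameter that was left blank.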
Functionality
- The node first authenticates with AWS using the provided credentials, if any.
- It then downloads the specified file from the S3 bucket to a temporary local directory.
- The downloaded file is processed using the Unstructured API with the specified parameters.
- The processed content is returned as structured documents.
- Additional metadata is added, and specified metadata keys are omitted if requested.
- The temporary file is deleted after processing.
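The steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the node's implementation: `load_from_s3`, `apply_metadata`, and the `download` and `partition` callables are hypothetical names, documents are modeled as plain dicts, and Omit Metadata Keys is assumed to be a comma-separated string.

```python
import os
import shutil
import tempfile
from typing import Any, Callable, Dict, List, Optional

Document = Dict[str, Any]  # assumed shape: {"page_content": str, "metadata": dict}


def apply_metadata(
    docs: List[Document], additional: Dict[str, Any], omit_keys: str
) -> List[Document]:
    """Add extra metadata, then drop the keys listed in omit_keys."""
    omit = {key.strip() for key in omit_keys.split(",") if key.strip()}
    result = []
    for doc in docs:
        metadata = {**doc["metadata"], **additional}
        metadata = {k: v for k, v in metadata.items() if k not in omit}
        result.append({"page_content": doc["page_content"], "metadata": metadata})
    return result


def load_from_s3(
    bucket: str,
    key: str,
    download: Callable[[str, str, str], None],
    partition: Callable[[str], List[Document]],
    additional_metadata: Optional[Dict[str, Any]] = None,
    omit_metadata_keys: str = "",
) -> List[Document]:
    """Download to a temp directory, process, post-process, and clean up."""
    tmp_dir = tempfile.mkdtemp(prefix="s3-loader-")
    local_path = os.path.join(tmp_dir, os.path.basename(key) or "object")
    try:
        download(bucket, key, local_path)  # e.g. an S3 client's download call
        docs = partition(local_path)       # e.g. a call to the Unstructured API
        return apply_metadata(docs, additional_metadata or {}, omit_metadata_keys)
    finally:
        # Remove the temporary file even if downloading or processing fails.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Passing the download and processing steps in as callables keeps the sketch independent of any particular SDK. With boto3, for instance, `download` could wrap the client's `download_file(bucket, key, path)` method.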
Output
The node outputs an array of structured documents, each containing the extracted content and associated metadata.
Use Cases
- Extracting text and structure from documents stored in S3 buckets
- Processing various file types (PDF, images, Office documents, etc.) for NLP tasks
- Preparing data for further analysis or machine learning pipelines
Notes
- Ensure that the Unstructured API is accessible and properly configured.
- Be mindful of AWS and Unstructured API usage costs when processing large volumes of documents.
- The node supports a wide range of file formats, but performance may vary depending on the file type and chosen strategy.