Node Details

  • Name: S3
  • Type: Document
  • Version: 3.0
  • Category: Document Loaders

Credentials

  • AWS Credential: Optional. If provided, it should contain an AWS access key ID and a secret access key.
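
As a rough illustration of how an optional credential might be handled, the sketch below builds keyword arguments for an S3 client. It assumes a Python client such as boto3, and it assumes that the client falls back to its default credential chain when no keys are passed. The helper name s3_client_kwargs is hypothetical and is not part of the node.

```python
from typing import Optional


def s3_client_kwargs(region: str,
                     access_key_id: Optional[str] = None,
                     secret_access_key: Optional[str] = None) -> dict:
    """Build keyword arguments for an S3 client (hypothetical helper)."""
    kwargs = {"region_name": region}
    # Pass explicit keys only when both halves of the credential are present;
    # otherwise leave them out so the client can use its own defaults (assumed).
    if access_key_id and secret_access_key:
        kwargs["aws_access_key_id"] = access_key_id
        kwargs["aws_secret_access_key"] = secret_access_key
    return kwargs
```

With boto3 installed, the result could be passed along as boto3.client("s3", **s3_client_kwargs("us-east-1")).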

Input Parameters

Required Parameters

  1. Bucket: (string) The name of the S3 bucket containing the document.
  2. Object Key: (string) The unique identifier for the object in the S3 bucket.
  3. Region: (string) The AWS region where the S3 bucket is located.
  4. Unstructured API URL: (string) The URL of the Unstructured API service.

Optional Parameters

  1. Unstructured API KEY: (password) API key for the Unstructured service.
  2. Strategy: (options) The strategy for partitioning PDF/image files (hi_res, fast, ocr_only, auto).
  3. Encoding: (string) The encoding method for decoding text input.
  4. Skip Infer Table Types: (multiOptions) Document types to skip table extraction for.
  5. Hi-Res Model Name: (options) The inference model used for hi-res strategy.
  6. Chunking Strategy: (options) Strategy for chunking returned elements.
  7. OCR Languages: (multiOptions) Languages to use for OCR.
  8. Source ID Key: (string) Key used to identify the true source of the document.
  9. Coordinates: (boolean) Whether to return coordinates for each element.
  10. XML Keep Tags: (boolean) Whether to retain XML tags in the output.
  11. Include Page Breaks: (boolean) Whether to include page break elements in the output.
  12. Multi-Page Sections: (boolean) When chunking, whether a section may span multiple pages.
  13. Combine Under N Chars: (number) When chunking, combine small elements until a section reaches this many characters.
  14. New After N Chars: (number) When chunking, start a new section once the current one reaches this many characters (soft limit).
  15. Max Characters: (number) Maximum number of characters allowed in a section (hard limit).
  16. Additional Metadata: (json) Extra metadata to add to extracted documents.
  17. Omit Metadata Keys: (string) Comma-separated list of metadata keys to omit from the output.
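
To make the optional parameters more concrete, here is a minimal sketch of how they might be turned into form fields for a request to the Unstructured API. The snake_case field names in the usage lines (strategy, ocr_languages, coordinates, max_characters, and so on) follow Unstructured's documented parameter names as far as I know, so verify them against the API version you run. The helper build_partition_fields is hypothetical and does not reflect the node's actual implementation.

```python
from typing import Any, Dict


def build_partition_fields(**options: Any) -> Dict[str, Any]:
    """Drop unset options and stringify the rest for a multipart form."""
    fields: Dict[str, Any] = {}
    for name, value in options.items():
        # Skip anything the user left empty so the API applies its defaults.
        if value is None or value == "":
            continue
        if isinstance(value, (list, tuple)) and not value:
            continue
        if isinstance(value, bool):
            # Form fields carry booleans as lowercase strings.
            fields[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            fields[name] = [str(item) for item in value]
        else:
            fields[name] = str(value)
    return fields


fields = build_partition_fields(
    strategy="hi_res",
    encoding=None,                  # unset, so it is dropped
    ocr_languages=["eng", "deu"],
    coordinates=True,
    max_characters=500,
    skip_infer_table_types=[],      # empty, so it is dropped
)
```

Leaving unset options out of the request, rather than sending empty strings, lets the service fall back to its own defaults.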

Functionality

  1. The node first authenticates with AWS using the provided credentials (if any).
  2. It then downloads the specified file from the S3 bucket to a temporary local directory.
  3. The downloaded file is processed using the Unstructured API with the specified parameters.
  4. The API response is converted into structured documents.
  5. Any additional metadata is added to each document, and the specified metadata keys are omitted if requested.
  6. The temporary file is deleted after processing.
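
The download, process, and cleanup steps above can be sketched roughly as follows. This is not the node's source code. The download and partition callables are hypothetical stand-ins for the S3 download and the Unstructured API call, injected so the sketch runs without network access.

```python
import os
import shutil
import tempfile
from typing import Callable, List


def load_from_s3(object_key: str,
                 download: Callable[[str, str], None],
                 partition: Callable[[str], List[dict]]) -> List[dict]:
    """Download an object to a temp directory, process it, then clean up."""
    tmp_dir = tempfile.mkdtemp(prefix="s3-loader-")
    # Keep the original file name so the file type can be inferred from it.
    file_name = os.path.basename(object_key) or "object"
    local_path = os.path.join(tmp_dir, file_name)
    try:
        download(object_key, local_path)  # fetch the object from the bucket
        return partition(local_path)      # send the file to the API
    finally:
        # Remove the temporary copy even if download or processing fails.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Putting the cleanup in a finally block means a failed API call does not leave downloaded files behind.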

Output

The node outputs an array of structured documents, each containing the extracted content and associated metadata.
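
As an illustration of how the metadata options could shape each output document, the sketch below removes omitted keys and then merges in additional metadata. It makes three assumptions: each document is a dictionary with pageContent and metadata fields, the keys to omit arrive as a comma-separated string, and omission is applied before the additional metadata is merged. The function name apply_metadata_options is hypothetical.

```python
from typing import Dict, List, Optional


def apply_metadata_options(docs: List[dict],
                           additional: Optional[Dict[str, object]] = None,
                           omit_keys: str = "") -> List[dict]:
    """Return copies of docs with omitted keys removed and extra metadata added."""
    omit = {key.strip() for key in omit_keys.split(",") if key.strip()}
    result = []
    for doc in docs:
        kept = {k: v for k, v in doc["metadata"].items() if k not in omit}
        # Additional metadata is merged last, so it wins on key collisions.
        result.append({**doc, "metadata": {**kept, **(additional or {})}})
    return result


docs = [{"pageContent": "x", "metadata": {"filename": "a.pdf", "page_number": 1}}]
out = apply_metadata_options(docs, {"project": "demo"}, "page_number, filetype")
```

The input list is left unmodified, which keeps the sketch safe to call more than once on the same documents.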

Use Cases

  • Extracting text and structure from documents stored in S3 buckets
  • Processing various file types (PDF, images, Office documents, etc.) for NLP tasks
  • Preparing data for further analysis or machine learning pipelines

Notes

  • Ensure that the Unstructured API is accessible and properly configured.
  • Be mindful of AWS and Unstructured API usage costs when processing large volumes of documents.
  • The node supports a wide range of file formats, but performance may vary depending on the file type and chosen strategy.