S3 Directory Node
The S3 Directory node is a document loader that retrieves and processes files from an Amazon S3 bucket or an S3-compatible storage service. It handles the file formats listed under Supported File Formats and loads them as documents for further processing in a document pipeline.
Node Details
- Name: s3Directory
- Type: Document
- Category: Document Loaders
- Version: 3.0
Parameters
Credential (Optional)
- Type: credential
- Credential Names: awsApi
- Description: AWS API credentials for accessing the S3 bucket
Inputs
Text Splitter (Optional)
- Type: TextSplitter
- Description: A text splitter to process the loaded documents
Bucket
- Type: string
- Description: The name of the S3 bucket to load files from
Region
- Type: asyncOptions
- Default: “us-east-1”
- Description: AWS region where the S3 bucket is located
Server URL (Optional)
- Type: string
- Description: Custom endpoint URL for S3-compatible services
Prefix (Optional)
- Type: string
- Description: Limits the response to keys that begin with the specified prefix
Pdf Usage (Optional)
- Type: options
- Options:
  - One document per page
  - One document per file
- Default: “One document per page”
- Description: Determines whether each page of a PDF becomes its own document or the whole file becomes a single document
Additional Metadata (Optional)
- Type: json
- Description: Additional metadata to be added to the extracted documents
Omit Metadata Keys (Optional)
- Type: string
- Description: Comma-separated list of metadata keys to omit from the output
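The two metadata inputs above can be illustrated with a minimal Python sketch. It is not the node's implementation: the function name `apply_metadata_options`, the sample keys, and the choice to apply omissions after merging the additional metadata are all assumptions made for illustration.

```python
import json


def apply_metadata_options(metadata, additional_metadata="", omit_keys=""):
    """Return a copy of `metadata` with extra keys merged in and unwanted keys dropped."""
    merged = dict(metadata)
    if additional_metadata:
        # Additional Metadata is treated here as a JSON object supplied as text.
        merged.update(json.loads(additional_metadata))
    # Omit Metadata Keys is a comma-separated string; blanks and stray spaces are ignored.
    omit = {key.strip() for key in omit_keys.split(",") if key.strip()}
    # Assumption: omissions are applied after the additional metadata is merged.
    return {key: value for key, value in merged.items() if key not in omit}


# Hypothetical metadata for one loaded PDF page.
doc_metadata = {"source": "reports/q1.pdf", "pdf_version": "1.7", "page": 1}
result = apply_metadata_options(
    doc_metadata,
    additional_metadata='{"team": "finance"}',
    omit_keys="pdf_version, page",
)
print(result)
```

Returning a new dictionary rather than mutating the input keeps the original metadata available if the same document is processed more than once.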
Functionality
- Connects to the specified S3 bucket using the provided credentials
- Lists and downloads all files in the bucket, or only those whose keys begin with the specified prefix
- Processes each file based on its extension using appropriate loaders
- Applies text splitting if a text splitter is provided
- Adds any additional metadata to each document and removes the keys listed in Omit Metadata Keys
- Returns an array of processed documents
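The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the node's code: an in-memory dictionary stands in for the S3 bucket, and the `Document` class, the `load_directory` function, and the `source` metadata key are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_directory(objects, prefix="", splitter=None):
    """Turn an in-memory {key: text} mapping into a list of Document objects."""
    documents = []
    for key in sorted(objects):
        if not key.startswith(prefix):
            continue  # key lies outside the requested prefix
        if key.endswith("/"):
            continue  # folder placeholder, not a file
        text = objects[key]
        # Without a splitter, each file becomes exactly one document.
        chunks = splitter(text) if splitter else [text]
        for chunk in chunks:
            documents.append(Document(chunk, {"source": key}))
    return documents


# Stand-in for a bucket listing; a real loader would fetch these from S3.
fake_bucket = {
    "docs/": "",
    "docs/a.txt": "first paragraph\n\nsecond paragraph",
    "docs/nested/b.md": "# Title",
    "images/logo.txt": "ignored",
}
docs = load_directory(fake_bucket, prefix="docs/", splitter=lambda t: t.split("\n\n"))
for doc in docs:
    print(doc.metadata["source"], repr(doc.page_content))
```

Passing the splitter as a plain callable mirrors the optional Text Splitter input: when it is absent, files pass through unsplit.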
Use Cases
- Loading large datasets stored in S3 for natural language processing tasks
- Preprocessing documents from S3 for search indexing or analysis
- Integrating S3-stored documents into AI/ML pipelines
Supported File Formats
JSON, TXT, CSV, DOCX, PDF, ASPX, ASP, CPP, C, CS, CSS, GO, H, KT, JAVA, JS, LESS, TS, PHP, PROTO, PYTHON, PY, RST, RUBY, RB, RS, SCALA, SC, SCSS, SOL, SQL, SWIFT, MARKDOWN, MD, TEX, LTX, HTML, VB, XML
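A quick way to check whether an object key falls into this list is to compare its extension against the set of supported formats. The sketch below is illustrative only: the helper names `file_extension` and `is_supported` are hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
from pathlib import PurePosixPath

# Extensions copied from the list above, lower-cased for comparison.
SUPPORTED_EXTENSIONS = set(
    "json txt csv docx pdf aspx asp cpp c cs css go h kt java js less ts php "
    "proto python py rst ruby rb rs scala sc scss sol sql swift markdown md "
    "tex ltx html vb xml".split()
)


def file_extension(key):
    """Return the final extension of an S3 key, lower-cased and without the dot."""
    return PurePosixPath(key).suffix.lstrip(".").lower()


def is_supported(key):
    # Assumption: matching ignores case, so "Q1.PDF" counts as a PDF.
    return file_extension(key) in SUPPORTED_EXTENSIONS


for key in ["reports/Q1.PDF", "src/main.rs", "archive.tar.gz", "README"]:
    print(key, is_supported(key))
```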
Notes
- Temporary files are created locally and cleaned up after processing
- Handles nested directory structures within the S3 bucket
- Provides options for customizing metadata and PDF processing
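The first two notes describe a common pattern: recreate the bucket's folder layout in a temporary directory, process the files, then remove everything. A minimal Python sketch of that pattern follows. It assumes nothing about how the node itself does this; the function name `process_in_temp_dir`, the `s3-loader-` directory prefix, and the sample files are hypothetical.

```python
import tempfile
from pathlib import Path


def process_in_temp_dir(files):
    """Write {relative_path: text} to a temp dir, read it back, then clean up."""
    with tempfile.TemporaryDirectory(prefix="s3-loader-") as tmp:
        root = Path(tmp)
        for relative_path, text in files.items():
            target = root / relative_path
            # Recreate nested "folders" from the key before writing the file.
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(text, encoding="utf-8")
        # Walk the tree recursively so files in nested folders are included.
        contents = {
            path.relative_to(root).as_posix(): path.read_text(encoding="utf-8")
            for path in sorted(root.rglob("*"))
            if path.is_file()
        }
    # Leaving the `with` block deletes the directory and everything in it.
    return contents, root


files = {"a.txt": "alpha", "nested/dir/b.md": "beta"}
contents, root = process_in_temp_dir(files)
print(contents)
print("temp dir still exists:", root.exists())
```

Using a context manager ties cleanup to the end of processing, so the temporary files are removed even if a loader raises an error partway through.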