Available Components

Airtable Document Loader

Extracts data from Airtable bases and tables

API Loader

Loads documents from REST API endpoints

Apify Website Content Crawler

Crawls websites using Apify’s web scraping platform

Cheerio Web Scraper

Extracts content from web pages using Cheerio

Confluence Document Loader

Loads content from Confluence pages and spaces

Custom Document Loader

Create custom loaders for specialized document types

CSV File Node

Processes CSV files into structured documents

Document Store Loader

Loads documents from a previously configured document store

Docx File Node

Extracts content from Microsoft Word documents

Figma Document Loader

Retrieves design content from Figma files

File Loader Node

Generic loader for various file types

FireCrawl Document Loader

Crawls and scrapes websites using the FireCrawl service

Folder with Files Node

Processes multiple files within a directory

GitHub Document Loader

Extracts content from GitHub repositories

GitBook Document Loader

Loads content from GitBook documentation

JSON File Document Loader

Processes JSON files into documents

JSON Lines File Node

Handles JSONL format files

Notion Database Document Loader

Extracts content from Notion databases

Notion Folder Document Loader

Processes multiple Notion pages in a folder

Notion Page Document Loader

Loads content from individual Notion pages

PDF Document Loader

Extracts text from PDF files

Plain Text Document Loader

Processes plain text files

S3 Directory Node

Loads documents from AWS S3 directories

S3 Document Loader

Processes individual files from AWS S3

SerpAPI For Web Search

Retrieves search results as documents

Spider Document Loaders

Crawls websites to extract content

Text File Document Loader

Processes text files into documents

Unstructured File Loader

Handles various unstructured file formats

Unstructured Folder Loader

Processes folders of unstructured files

VectorStore To Document

Converts vector store entries to documents

Use Cases

Document Loaders support a range of use cases, including:

  1. Data Extraction: Pulling content from diverse sources like web pages, APIs, databases, and file systems.
  2. Text Processing: Converting different file formats (PDF, DOCX, CSV, JSON) into processable text.
  3. Web Scraping: Extracting data from websites and web applications.
  4. Knowledge Base Creation: Building structured datasets from unstructured or semi-structured sources.
  5. Content Aggregation: Collecting and organizing information from multiple sources.
  6. Data Preprocessing: Preparing data for natural language processing tasks or machine learning models.
  7. Document Analysis: Extracting and structuring information from complex document formats.
  8. API Integration: Fetching and processing data from various third-party APIs.
  9. Cloud Storage Access: Retrieving and processing documents stored in cloud services like S3.
  10. Version Control Integration: Extracting content from repositories hosted on platforms like GitHub.
  11. Design Tool Integration: Accessing and processing design data from tools like Figma.
  12. Collaborative Tool Integration: Extracting data from collaborative platforms like Notion and Confluence.

Together, these Document Loaders ingest data from web pages, APIs, files, cloud storage, and collaboration tools, which makes it easier to assemble varied datasets for AI and machine learning applications.
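
To make the text-processing and data-preprocessing use cases above more concrete, here is a minimal sketch of what a loader does conceptually: it reads a source and emits documents that pair text with metadata. The `Document` class, the `load_csv_documents` function, and the `team.csv` source name are hypothetical, introduced only for illustration; they are not the API of any loader listed on this page.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container: extracted text plus metadata about its origin."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_documents(csv_text: str, source: str) -> list[Document]:
    """Turn each CSV row into one Document (illustrative, not a real loader API)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader, start=1):
        # Render the row as "column: value" lines so the text is readable on its own.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(Document(content, {"source": source, "row": row_number}))
    return documents


# Assumed sample data; a real loader would read from a file or upload instead.
sample = "name,role\nAda,engineer\nGrace,admiral\n"
docs = load_csv_documents(sample, source="team.csv")
for doc in docs:
    print(doc.metadata, doc.page_content.replace("\n", " | "))
```

Keeping the source and row number in the metadata is a design choice in this sketch: it lets downstream steps trace each piece of text back to where it came from.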