
Node Information
- Name: tokenTextSplitter
- Type: TokenTextSplitter
- Category: Text Splitters
- Version: 1.0
Parameters
Encoding Name
- Type: Options
- Default: gpt2
- Available Options:
- gpt2
- r50k_base
- p50k_base
- p50k_edit
- cl100k_base
- Description: Specifies the encoding scheme to use for tokenization. Different models may use different encodings.
Chunk Size
- Type: Number
- Default: 1000
- Optional: Yes
- Description: The maximum number of tokens in each chunk. This determines the size of each text segment after splitting.
Chunk Overlap
- Type: Number
- Default: 200
- Optional: Yes
- Description: The number of tokens shared between consecutive chunks. This overlap helps maintain context across chunk boundaries.
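To illustrate how Chunk Size and Chunk Overlap interact: each new chunk starts `chunk_size - chunk_overlap` token positions after the previous one, so consecutive chunks share `chunk_overlap` tokens. The sketch below computes chunk boundaries in token positions under that assumed semantics (a simplification; the real node delegates this to LangChain):

```python
# Illustrative arithmetic for chunk boundaries, measured in token
# positions. Assumes consecutive chunks share chunk_overlap tokens.

def chunk_boundaries(total_tokens, chunk_size, chunk_overlap):
    """Return (start, end) token positions for each chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    bounds = []
    start = 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens)
        bounds.append((start, end))
        if end == total_tokens:
            break
        start += stride
    return bounds
```

With the defaults (chunk size 1000, overlap 200), a 2500-token document yields three chunks: tokens 0-1000, 800-1800, and 1600-2500.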
Input/Output
- Input: Raw text string
- Output: An array of text chunks
Usage
This node is particularly useful when you need to process large amounts of text with language models that have a maximum token limit. By splitting the text into smaller chunks, you can process each chunk separately and then combine the results. Common use cases include:
- Preparing text for summarization
- Breaking down large documents for question-answering systems
- Preprocessing text for semantic search or embeddings
Implementation Details
This node uses the TokenTextSplitter class from the LangChain library, which in turn relies on the TikToken library for tokenization. The splitting process ensures that text is split at token boundaries rather than at arbitrary character positions, which is more semantically meaningful for many NLP tasks.
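The overall splitting strategy can be sketched in a few lines of Python. This is a minimal illustration, not the LangChain implementation: a whitespace tokenizer stands in for TikToken, and joining words stands in for decoding token ids back to text. Chunk sizes here are measured in tokens, matching the parameters above.

```python
# Sketch of token-boundary splitting: encode the text into tokens,
# take windows of chunk_size tokens that advance by
# chunk_size - chunk_overlap positions, then decode each window.
# A whitespace tokenizer stands in for TikToken (assumption).

def split_by_tokens(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of at most chunk_size tokens,
    with chunk_overlap tokens shared between consecutive chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()                # stand-in for a real encoder
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))  # stand-in for decoding
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Because splitting happens on token positions, each chunk stays within the model's token budget regardless of how long individual words or characters are.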