Token Text Splitter
The Token Text Splitter is a component used for splitting text into smaller chunks based on token count. It utilizes the TikToken library for tokenization, which is commonly used in language models like GPT.
Node Information
- Name: tokenTextSplitter
- Type: TokenTextSplitter
- Category: Text Splitters
- Version: 1.0
Parameters
Encoding Name
- Type: Options
- Default: gpt2
- Available Options:
  - gpt2
  - r50k_base
  - p50k_base
  - p50k_edit
  - cl100k_base
- Description: Specifies the encoding scheme to use for tokenization. Different models may use different encodings.
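As a quick illustration of why the encoding choice matters, the sketch below (assuming the js-tiktoken package, which LangChain's splitter relies on) encodes the same sentence with several encodings and prints the resulting token counts.

```typescript
import { getEncoding } from "js-tiktoken";

const sample = "Token counts depend on the encoding you choose.";

// The same text can produce a different number of tokens under each encoding.
for (const name of ["gpt2", "p50k_base", "cl100k_base"] as const) {
  const enc = getEncoding(name);
  console.log(name, enc.encode(sample).length);
}
```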
Chunk Size
- Type: Number
- Default: 1000
- Optional: Yes
- Description: The target number of tokens in each chunk. This determines the maximum size of each text segment after splitting.
Chunk Overlap
- Type: Number
- Default: 200
- Optional: Yes
- Description: The number of tokens to overlap between consecutive chunks. This helps maintain context across chunk boundaries.
Input/Output
- Input: Raw text string
- Output: An array of text chunks
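For reference, here is a minimal sketch of configuring and calling the splitter through LangChain's JavaScript/TypeScript API; the import path may vary by LangChain version (older releases export it from langchain/text_splitter).

```typescript
import { TokenTextSplitter } from "@langchain/textsplitters";

const splitter = new TokenTextSplitter({
  encodingName: "gpt2", // default encoding
  chunkSize: 1000,      // target tokens per chunk
  chunkOverlap: 200,    // tokens shared between consecutive chunks
});

const text = "...a long document...";
// Raw text string in, array of text chunks out.
const chunks: string[] = await splitter.splitText(text);
console.log(chunks.length);
```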
Usage
This node is particularly useful in scenarios where you need to process large amounts of text with language models that have a maximum token limit. By splitting the text into smaller chunks, you can process each chunk separately and then combine the results.
Common use cases include:
- Preparing text for summarization
- Breaking down large documents for question-answering systems
- Preprocessing text for semantic search or embeddings
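As a rough sketch of the split-process-combine pattern described above, the example below splits a long text and processes each chunk separately. The summarizeChunk helper is hypothetical and stands in for whatever model call your application actually makes.

```typescript
import { TokenTextSplitter } from "@langchain/textsplitters";

// Hypothetical helper: replace with a real LLM call in your application.
async function summarizeChunk(chunk: string): Promise<string> {
  return chunk.slice(0, 100); // placeholder "summary"
}

async function summarizeLargeText(text: string): Promise<string> {
  const splitter = new TokenTextSplitter({ chunkSize: 1000, chunkOverlap: 200 });
  const chunks = await splitter.splitText(text);
  // Process each chunk separately, then combine the per-chunk results.
  const partials = await Promise.all(chunks.map(summarizeChunk));
  return partials.join("\n");
}
```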
Implementation Details
The node uses the TokenTextSplitter class from the LangChain library, which in turn uses the TikToken library for tokenization. The splitting process ensures that the text is split at token boundaries rather than at arbitrary character positions, which is often more semantically meaningful for NLP tasks.
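To make the token-boundary behaviour concrete, here is a simplified sketch of the general approach (not the library's exact code): encode the whole text, slide a fixed-size token window with overlap, and decode each window back to text.

```typescript
import { getEncoding } from "js-tiktoken";

// Simplified illustration of token-boundary splitting; assumes chunkOverlap < chunkSize.
function splitByTokens(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
  encodingName: "gpt2" | "cl100k_base" = "gpt2"
): string[] {
  const enc = getEncoding(encodingName);
  const tokens = enc.encode(text);       // full text -> token ids
  const step = chunkSize - chunkOverlap; // how far the window advances each iteration
  const chunks: string[] = [];

  for (let start = 0; start < tokens.length; start += step) {
    const window = tokens.slice(start, start + chunkSize);
    chunks.push(enc.decode(window));     // token ids -> text chunk
    if (start + chunkSize >= tokens.length) break; // last window reached the end
  }
  return chunks;
}
```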
Note
When using this splitter, be aware that the actual number of tokens in each chunk may vary slightly from the specified chunk size, as the splitter converts tokens back to text for the final output. The chunk size parameter is used as a target, but the exact size may differ to maintain token integrity.
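If you want to check this in practice, one way (again assuming the js-tiktoken package) is to re-encode each chunk and print its token count, which should hover around the configured chunk size:

```typescript
import { getEncoding } from "js-tiktoken";
import { TokenTextSplitter } from "@langchain/textsplitters";

const enc = getEncoding("gpt2");
const splitter = new TokenTextSplitter({ encodingName: "gpt2", chunkSize: 1000, chunkOverlap: 200 });

const longText = "...your document text...";
const chunks = await splitter.splitText(longText);
for (const chunk of chunks) {
  // Usually close to 1000, but may differ slightly as tokens are decoded back to text.
  console.log(enc.encode(chunk).length);
}
```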