Token chunker component

A component that splits texts into chunks, respecting a maximum token limit and using delimiters to find optimal breakpoints.


A Token chunker component is a text splitter that keeps each chunk within a recommended maximum token length and uses delimiters to place breakpoints at logical boundaries. It splits long texts into appropriately sized, semantically related chunks.

Scenario

The Token chunker component is optional and is usually placed immediately after a Parser or Title chunker component.

Configurations

Overlapped percent (%)

This defines the overlap percentage between chunks. An appropriate degree of overlap ensures semantic coherence without creating excessive, redundant tokens for the LLM.

  • Default: 0%

  • Maximum: 30%
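
To make the overlap setting concrete, here is a minimal sketch of one way an overlap percentage could be applied. It is an illustration, not the component's actual implementation: the function name `apply_overlap` is hypothetical, chunks are represented as plain lists of tokens, and the overlap is assumed to be taken from the end of the previous chunk.

```python
def apply_overlap(chunks, overlapped_percent):
    """Prefix each chunk with the trailing tokens of the previous chunk.

    Assumption: `chunks` is a list of token lists, and the overlap is a
    percentage of the previous chunk's length (hypothetical behavior).
    """
    if not 0 <= overlapped_percent <= 30:
        raise ValueError("overlapped_percent must be between 0 and 30")

    result = []
    previous = []
    for chunk in chunks:
        # Number of tokens to carry over from the previous original chunk.
        carry = len(previous) * overlapped_percent // 100
        result.append(previous[len(previous) - carry:] + list(chunk))
        # Track the original chunk so the overlap does not compound.
        previous = list(chunk)
    return result
```

With a 10-token chunk followed by a 2-token chunk and an overlap of 20, the second chunk in this sketch starts with the last 2 tokens of the first. With the default of 0, the chunks are returned unchanged.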

Delimiters

Defaults to \n (newline). To remove a delimiter, click the Recycle bin button to its right; to add one, click + Add.
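
The interaction between delimiters and the token limit can be sketched as follows. This is a simplified illustration under stated assumptions, not the component's actual code: `count_tokens` and `chunk_by_delimiters` are hypothetical names, whitespace-separated words stand in for real tokenizer output, and segments are rejoined with a single space.

```python
import re


def count_tokens(text):
    # Assumption: one whitespace-separated word counts as one token.
    return len(text.split())


def chunk_by_delimiters(text, max_tokens, delimiters=("\n",)):
    """Split text at delimiters, then pack segments into chunks."""
    if delimiters:
        # Try longer delimiters first so they are not shadowed by shorter ones.
        ordered = sorted(delimiters, key=len, reverse=True)
        pattern = "|".join(re.escape(d) for d in ordered)
        segments = re.split(pattern, text)
    else:
        segments = [text]
    segments = [s.strip() for s in segments if s.strip()]

    chunks, current, current_tokens = [], [], 0
    for segment in segments:
        n = count_tokens(segment)
        # Close the current chunk if adding this segment would exceed the limit.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(segment)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "one two three\nfour five\nsix seven eight nine\nten"
print(chunk_by_delimiters(text, max_tokens=5))
# ['one two three four five', 'six seven eight nine ten']
```

In this sketch, breakpoints fall only at delimiters, so a single segment longer than `max_tokens` becomes its own oversized chunk. That matches the idea of a recommended rather than a hard limit, but the real component may handle such segments differently.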

Output

The global variable name for the output of the Token chunker component, which can be referenced by subsequent components in the ingestion pipeline.

  • Default: chunks

  • Type: Array<Object>