Commit 7abe9a0

Add documentation for custom punctuation mark support in Text splitter
Signed-off-by: Ilayaperumal Gopinathan <ilayaperumal.gopinathan@broadcom.com>
1 parent c0e279a commit 7abe9a0

spring-ai-docs/src/main/antora/modules/ROOT/pages/api/etl-pipeline.adoc

Lines changed: 76 additions & 10 deletions
@@ -614,6 +614,8 @@ The `TokenTextSplitter` is an implementation of `TextSplitter` that splits text

==== Usage

===== Basic Usage

[source,java]
----
@Component
@@ -625,43 +627,103 @@ class MyTokenTextSplitter {
    }

    public List<Document> splitCustomized(List<Document> documents) {
        TokenTextSplitter splitter = new TokenTextSplitter(1000, 400, 10, 5000, true, List.of('.', '?', '!', '\n'));
        return splitter.apply(documents);
    }
}
----

===== Using the Builder Pattern

The recommended way to create a `TokenTextSplitter` is to use the builder pattern, which provides a more readable and flexible API:

[source,java]
----
@Component
class MyTokenTextSplitter {

    public List<Document> splitWithBuilder(List<Document> documents) {
        TokenTextSplitter splitter = TokenTextSplitter.builder()
            .withChunkSize(1000)
            .withMinChunkSizeChars(400)
            .withMinChunkLengthToEmbed(10)
            .withMaxNumChunks(5000)
            .withKeepSeparator(true)
            .build();

        return splitter.apply(documents);
    }
}
----

===== Custom Punctuation Marks

You can customize the punctuation marks used for splitting text into semantically meaningful chunks. This is particularly useful for internationalization:

[source,java]
----
@Component
class MyInternationalTextSplitter {

    public List<Document> splitChineseText(List<Document> documents) {
        // Use Chinese punctuation marks
        TokenTextSplitter splitter = TokenTextSplitter.builder()
            .withChunkSize(800)
            .withMinChunkSizeChars(350)
            .withPunctuationMarks(List.of('。', '?', '!', ';')) // Chinese punctuation
            .build();

        return splitter.apply(documents);
    }

    public List<Document> splitWithCustomMarks(List<Document> documents) {
        // Mix of English and other punctuation marks
        TokenTextSplitter splitter = TokenTextSplitter.builder()
            .withChunkSize(800)
            .withPunctuationMarks(List.of('.', '?', '!', '\n', ';', ':', '。'))
            .build();

        return splitter.apply(documents);
    }
}
----

==== Constructor Options

The `TokenTextSplitter` provides three constructor options:

1. `TokenTextSplitter()`: Creates a splitter with default settings.
2. `TokenTextSplitter(boolean keepSeparator)`: Creates a splitter with custom separator behavior.
3. `TokenTextSplitter(int chunkSize, int minChunkSizeChars, int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator, List<Character> punctuationMarks)`: Full constructor with all customization options.

NOTE: The builder pattern (shown above) is the recommended approach for creating instances with custom configurations.
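
For quick reference, a sketch of the three constructors side by side; the argument values are illustrative only and simply mirror the defaults and examples above:

[source,java]
----
// 1. Default settings (chunk size 800, punctuation marks '.', '?', '!', '\n')
TokenTextSplitter defaultSplitter = new TokenTextSplitter();

// 2. Only override the separator behavior
TokenTextSplitter noSeparatorSplitter = new TokenTextSplitter(false);

// 3. Full control, including the punctuation marks used as break points
TokenTextSplitter customSplitter = new TokenTextSplitter(1000, 400, 10, 5000, true, List.of('.', '?', '!', '\n'));
----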

==== Parameters

* `chunkSize`: The target size of each text chunk in tokens (default: 800).
* `minChunkSizeChars`: The minimum size of each text chunk in characters (default: 350).
* `minChunkLengthToEmbed`: The minimum length of a chunk to be included (default: 5).
* `maxNumChunks`: The maximum number of chunks to generate from a text (default: 10000).
* `keepSeparator`: Whether to keep separators (like newlines) in the chunks (default: true).
* `punctuationMarks`: List of characters to use as sentence boundaries for splitting (default: `.`, `?`, `!`, `\n`).
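
As a reference sketch, the builder call below sets every one of these parameters explicitly, using the default values listed above; in practice you would only set the ones you need to change:

[source,java]
----
TokenTextSplitter splitter = TokenTextSplitter.builder()
    .withChunkSize(800)                                 // target tokens per chunk
    .withMinChunkSizeChars(350)                         // minimum characters before a break point is considered
    .withMinChunkLengthToEmbed(5)                       // drop chunks shorter than this
    .withMaxNumChunks(10000)                            // upper bound on generated chunks
    .withKeepSeparator(true)                            // keep separators such as newlines
    .withPunctuationMarks(List.of('.', '?', '!', '\n')) // sentence boundaries for splitting
    .build();
----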

==== Behavior

The `TokenTextSplitter` processes text content as follows:

1. It encodes the input text into tokens using the CL100K_BASE encoding.
2. It splits the encoded text into chunks based on the `chunkSize`.
3. For each chunk:
a. It decodes the chunk back into text.
b. *Only if the total token count exceeds the chunk size*, it attempts to find a suitable break point (using the configured `punctuationMarks`) after the `minChunkSizeChars`.
c. If a break point is found, it truncates the chunk at that point.
d. It trims the chunk and optionally removes newline characters based on the `keepSeparator` setting.
e. If the resulting chunk is longer than `minChunkLengthToEmbed`, it's added to the output.
4. This process continues until all tokens are processed or `maxNumChunks` is reached.
5. Any remaining text is added as a final chunk if it's longer than `minChunkLengthToEmbed`.

IMPORTANT: Punctuation-based splitting only applies when the token count exceeds the chunk size. Text whose token count is at or below the chunk size is returned as a single chunk without punctuation-based truncation. This prevents unnecessary splitting of small texts.
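
To make steps 3b and 3c above concrete, here is a minimal, standalone sketch of the break-point idea, not the library's actual code: look for the last configured punctuation mark in the decoded chunk text and truncate there only when that index lies beyond `minChunkSizeChars`.

[source,java]
----
// Illustrative only; the real implementation may differ in details.
static String truncateAtBreakPoint(String chunkText, List<Character> punctuationMarks, int minChunkSizeChars) {
    int lastPunctuation = -1;
    for (char mark : punctuationMarks) {
        lastPunctuation = Math.max(lastPunctuation, chunkText.lastIndexOf(mark));
    }
    // Only cut when a configured mark was found after the minimum chunk size
    if (lastPunctuation != -1 && lastPunctuation > minChunkSizeChars) {
        return chunkText.substring(0, lastPunctuation + 1);
    }
    return chunkText;
}
----

If no configured mark appears after `minChunkSizeChars`, or the whole text already fits within the chunk size (see the IMPORTANT note above), the chunk is left as produced by the token-based split.
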
==== Example

[source,java]
@@ -688,6 +750,10 @@ for (Document doc : splitDocuments) {
* Metadata from the original documents is preserved and copied to all chunks derived from that document.
* The content formatter (if set) from the original document is also copied to the derived chunks if `copyContentFormatter` is set to `true` (default behavior).
* This splitter is particularly useful for preparing text for large language models that have token limits, ensuring that each chunk is within the model's processing capacity.
* *Custom Punctuation Marks*: The default punctuation marks (`.`, `?`, `!`, `\n`) work well for English text. For other languages or specialized content, customize the punctuation marks using the builder's `withPunctuationMarks()` method.
* *Performance Consideration*: While the splitter can handle any number of punctuation marks, it's recommended to keep the list reasonably small (under 20 characters) for optimal performance, as each mark is checked for every chunk.
* *Extensibility*: The `getLastPunctuationIndex(String)` method is `protected`, allowing subclasses to override the punctuation detection logic for specialized use cases (a sketch follows this list).
* *Small Text Handling*: As of version 2.0, small texts (with a token count at or below the chunk size) are no longer split at punctuation marks, preventing unnecessary fragmentation of content that already fits within the size limits.
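
As a rough illustration of that extension point, a subclass could override the punctuation lookup. This sketch assumes `getLastPunctuationIndex(String)` returns the index of the last boundary character (or -1 when none is found); treat that signature detail as an assumption rather than a guarantee:

[source,java]
----
// Hypothetical subclass; assumes the protected method returns an int index (-1 if no boundary is found).
class EllipsisAwareTextSplitter extends TokenTextSplitter {

    @Override
    protected int getLastPunctuationIndex(String text) {
        // Also treat the ellipsis character as a sentence boundary, in addition to the configured marks
        return Math.max(super.getLastPunctuationIndex(text), text.lastIndexOf('…'));
    }
}
----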

=== ContentFormatTransformer
Ensures uniform content formats across all documents.
