You can customize the punctuation marks used for splitting text into semantically meaningful chunks. This is particularly useful for internationalization:
+
+[source,java]
+----
+@Component
+class MyInternationalTextSplitter {
+
+    public List<Document> splitChineseText(List<Document> documents) {
-The `TokenTextSplitter` provides two constructor options:
+The `TokenTextSplitter` provides three constructor options:

 1. `TokenTextSplitter()`: Creates a splitter with default settings.
-2. `TokenTextSplitter(int defaultChunkSize, int minChunkSizeChars, int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator)`
+2. `TokenTextSplitter(boolean keepSeparator)`: Creates a splitter with custom separator behavior.
+3. `TokenTextSplitter(int chunkSize, int minChunkSizeChars, int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator, List<Character> punctuationMarks)`: Full constructor with all customization options.

+NOTE: The builder pattern (shown above) is the recommended approach for creating instances with custom configurations.

 ==== Parameters

-* `defaultChunkSize`: The target size of each text chunk in tokens (default: 800).
+* `chunkSize`: The target size of each text chunk in tokens (default: 800).
 * `minChunkSizeChars`: The minimum size of each text chunk in characters (default: 350).
 * `minChunkLengthToEmbed`: The minimum length of a chunk to be included (default: 5).
 * `maxNumChunks`: The maximum number of chunks to generate from a text (default: 10000).
 * `keepSeparator`: Whether to keep separators (like newlines) in the chunks (default: true).
+* `punctuationMarks`: List of characters to use as sentence boundaries for splitting (default: `.`, `?`, `!`, `\n`).

 ==== Behavior

 The `TokenTextSplitter` processes text content as follows:

 1. It encodes the input text into tokens using the CL100K_BASE encoding.
-2. It splits the encoded text into chunks based on the `defaultChunkSize`.
+2. It splits the encoded text into chunks based on the `chunkSize`.
 3. For each chunk:
-a. It decodes the chunk back into text.
-b. It attempts to find a suitable break point (period, question mark, exclamation mark, or newline) after the `minChunkSizeChars`.
-c. If a break point is found, it truncates the chunk at that point.
-d. It trims the chunk and optionally removes newline characters based on the `keepSeparator` setting.
-e. If the resulting chunk is longer than `minChunkLengthToEmbed`, it's added to the output.
+a. It decodes the chunk back into text.
+b. *Only if the total token count exceeds the chunk size*, it attempts to find a suitable break point (using the configured `punctuationMarks`) after the `minChunkSizeChars`.
+c. If a break point is found, it truncates the chunk at that point.
+d. It trims the chunk and optionally removes newline characters based on the `keepSeparator` setting.
+e. If the resulting chunk is longer than `minChunkLengthToEmbed`, it's added to the output.
 4. This process continues until all tokens are processed or `maxNumChunks` is reached.
 5. Any remaining text is added as a final chunk if it's longer than `minChunkLengthToEmbed`.

+IMPORTANT: Punctuation-based splitting only applies when the token count exceeds the chunk size. Text that exactly matches or is smaller than the chunk size is returned as a single chunk without punctuation-based truncation. This prevents unnecessary splitting of small texts.
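
The break-point rule described above can be illustrated with a minimal plain-Java sketch. It is not the Spring AI implementation: whitespace-separated words stand in for CL100K_BASE tokens, and the names `BreakPointSketch`, `lastPunctuationIndex`, and `firstChunk` are hypothetical. It only shows the two documented conditions: truncation is attempted only when the text exceeds the chunk size, and only at a punctuation mark found after `minChunkSizeChars`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Spring AI class.
// Assumption: one whitespace-separated word counts as one token.
class BreakPointSketch {

    // Returns the index of the last configured punctuation mark in text, or -1 if none occurs.
    static int lastPunctuationIndex(String text, List<Character> punctuationMarks) {
        int best = -1;
        for (char mark : punctuationMarks) {
            best = Math.max(best, text.lastIndexOf(mark));
        }
        return best;
    }

    // Produces the first chunk of text under the simplified token model.
    static String firstChunk(String text, int chunkSize, int minChunkSizeChars,
            List<Character> punctuationMarks) {
        String[] words = text.trim().split("\\s+");
        if (words.length <= chunkSize) {
            // At or below the chunk size: returned whole, no punctuation-based truncation.
            return text.trim();
        }
        // Take the first chunkSize "tokens" and decode them back into text.
        String chunk = String.join(" ", Arrays.copyOf(words, chunkSize));
        int breakAt = lastPunctuationIndex(chunk, punctuationMarks);
        if (breakAt > minChunkSizeChars) {
            // Keep the punctuation mark itself at the end of the chunk.
            chunk = chunk.substring(0, breakAt + 1);
        }
        return chunk.trim();
    }
}
```

With the marks `.`, `?`, `!`, `\n`, calling `firstChunk("One two. Three four five six", 4, 3, marks)` returns `"One two."`, while the four-word text `"One two. Three four"` with the same settings comes back unchanged because it does not exceed the chunk size.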
+
 ==== Example

 [source,java]
@@ -688,6 +750,10 @@ for (Document doc : splitDocuments) {
 * Metadata from the original documents is preserved and copied to all chunks derived from that document.
 * The content formatter (if set) from the original document is also copied to the derived chunks if `copyContentFormatter` is set to `true` (default behavior).
 * This splitter is particularly useful for preparing text for large language models that have token limits, ensuring that each chunk is within the model's processing capacity.
+* *Custom Punctuation Marks*: The default punctuation marks (`.`, `?`, `!`, `\n`) work well for English text. For other languages or specialized content, customize the punctuation marks using the builder's `withPunctuationMarks()` method.
+* *Performance Consideration*: While the splitter can handle any number of punctuation marks, it's recommended to keep the list reasonably small (under 20 characters) for optimal performance, as each mark is checked for every chunk.
+* *Extensibility*: The `getLastPunctuationIndex(String)` method is `protected`, allowing subclasses to override the punctuation detection logic for specialized use cases.
+* *Small Text Handling*: As of version 2.0, small texts (with token count at or below the chunk size) are no longer split at punctuation marks, preventing unnecessary fragmentation of content that already fits within the size limits.
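
To make the notes on custom punctuation marks and extensibility concrete, here is a small plain-Java sketch. `SketchSplitter` is a hypothetical stand-in, not the Spring AI class: it models only a `protected getLastPunctuationIndex(String)` hook with an assumed body, plus a hypothetical `truncateAtBreak` helper. `DecimalAwareSplitter` is likewise invented for illustration and shows one way a subclass might refine the detection logic, here by ignoring a period that sits between two digits.

```java
import java.util.List;

// Hypothetical stand-in for the splitter; only the punctuation hook is modeled.
class SketchSplitter {

    private final List<Character> punctuationMarks;

    SketchSplitter(List<Character> punctuationMarks) {
        this.punctuationMarks = punctuationMarks;
    }

    // Same name as the documented protected hook; this body is an assumption.
    protected int getLastPunctuationIndex(String text) {
        int best = -1;
        for (char mark : this.punctuationMarks) {
            best = Math.max(best, text.lastIndexOf(mark));
        }
        return best;
    }

    // Cuts the text after the last break point, or returns it unchanged if there is none.
    String truncateAtBreak(String text) {
        int index = getLastPunctuationIndex(text);
        return index >= 0 ? text.substring(0, index + 1) : text;
    }
}

// Hypothetical subclass: a period between two digits is treated as a decimal point, not a break.
class DecimalAwareSplitter extends SketchSplitter {

    DecimalAwareSplitter() {
        super(List.of('.', '?', '!', '\n'));
    }

    @Override
    protected int getLastPunctuationIndex(String text) {
        // Scan from the end so the first acceptable mark found is the last one in the text.
        for (int i = text.length() - 1; i >= 0; i--) {
            char c = text.charAt(i);
            if (c == '?' || c == '!' || c == '\n') {
                return i;
            }
            if (c == '.' && !isDecimalPoint(text, i)) {
                return i;
            }
        }
        return -1;
    }

    private static boolean isDecimalPoint(String text, int i) {
        return i > 0 && i < text.length() - 1
                && Character.isDigit(text.charAt(i - 1))
                && Character.isDigit(text.charAt(i + 1));
    }
}
```

A stand-in built with the full-width marks `'\u3002'`, `'\uFF1F'`, and `'\uFF01'` (the CJK full stop, question mark, and exclamation mark) finds break points in Chinese text that the default English marks miss. In the decimal-aware variant, `truncateAtBreak("Done. Pi is 3.14")` returns `"Done."`, whereas the base stand-in with the default marks would cut inside the number.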

 === ContentFormatTransformer
 Ensures uniform content formats across all documents.