From eb6ffda8edd00f7cd88fb82df79c7e86007be389 Mon Sep 17 00:00:00 2001 From: kela4 Date: Sat, 25 Apr 2026 20:36:08 +0200 Subject: [PATCH 1/2] docs: add token_transformers text splitter option and related env variables --- docs/reference/env-configuration.mdx | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/docs/reference/env-configuration.mdx b/docs/reference/env-configuration.mdx index b7bb6612c..efbdb1909 100644 --- a/docs/reference/env-configuration.mdx +++ b/docs/reference/env-configuration.mdx @@ -1975,6 +1975,11 @@ When configured, these custom schemes will be validated alongside `http` and `ht - Default: `False` - Description: Determines whether to allow custom models defined on the Hub in their own modeling files. +#### `RAG_TOKENIZER_MODEL_TRUST_REMOTE_CODE` + +- Type: `bool` +- Default: `False` +- Description: Determines whether to allow custom tokenizer models defined on the Hub in their own modeling files. #### `RAG_RERANKING_MODEL_TRUST_REMOTE_CODE` - Type: `bool` - Default: `False` - Description: Determines whether to allow custom models defined on the Hub in their own modeling files for reranking. @@ -1989,6 +1994,12 @@ modeling files for reranking. - Default: `True` - Description: Toggles automatic update of the Sentence-Transformer model. +#### `RAG_TOKENIZER_MODEL_AUTO_UPDATE` + +- Type: `bool` +- Default: `True` +- Description: Toggles automatic update of the tokenizer model. + #### `RAG_RERANKING_MODEL_AUTO_UPDATE` - Type: `bool` - Default: `True` @@ -3175,8 +3186,16 @@ Provide a clear and direct response to the user's query, including inline citati - Options: - `character` - `token` + - `token_transformers` - Default: `character` -- Description: Sets the text splitter for RAG models. Use `character` for RecursiveCharacterTextSplitter or `token` for TokenTextSplitter (Tiktoken-based). +- Description: Sets the text splitter for RAG models. Use `character` for RecursiveCharacterTextSplitter or `token` for TokenTextSplitter (Tiktoken-based). 
`token_transformers` uses RecursiveCharacterTextSplitter with the length function of the specified tokenizer; this requires `RAG_TOKENIZER_MODEL` when not using a local embedding model. +- Persistence: This environment variable is a `PersistentConfig` variable. + +#### `RAG_TOKENIZER_MODEL` + +- Type: `str` +- Default: empty +- Description: HuggingFace repo name or local model path of the tokenizer to use with the `token_transformers` text splitter. Leave empty when using a local embedding model; the embedding model's built-in tokenizer is used automatically. Required when using an external embedding API (e.g. Ollama, OpenAI-compatible) with `RAG_TEXT_SPLITTER=token_transformers`. - Persistence: This environment variable is a `PersistentConfig` variable. #### `ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER` From 75b1d1f944189d4832b5f29913fb57d184268a74 Mon Sep 17 00:00:00 2001 From: kela4 Date: Sat, 25 Apr 2026 20:36:18 +0200 Subject: [PATCH 2/2] docs: add explanation section for text splitter options --- docs/features/chat-conversations/rag/index.md | 21 +++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/docs/features/chat-conversations/rag/index.md b/docs/features/chat-conversations/rag/index.md index 06f6a2bf5..505b0cc3d 100644 --- a/docs/features/chat-conversations/rag/index.md +++ b/docs/features/chat-conversations/rag/index.md @@ -53,6 +53,27 @@ Use the **Chunk Min Size Target** setting (found in **Admin Panel > Settings > D ::: +## Text Splitter Options + +Open WebUI supports three text splitter modes, selectable via **Admin Panel > Settings > Documents > Text Splitter**: + +- **Character (default)**: Splits at natural text boundaries (paragraphs, sentences, whitespace) and measures chunk size in characters. Works well for most setups. Note that characters and tokens are not equivalent: for non-Latin scripts (e.g. 
Chinese, Japanese, Korean), the character-to-token ratio can approach 1:1, so a large character-based chunk size can silently exceed your embedding model's token limit, causing truncated embeddings or API errors. +- **Token (Tiktoken)**: Measures chunk size in tokens using OpenAI's Tiktoken encoding. Produces accurate token counts for OpenAI embedding models, but can be inaccurate for non-OpenAI models. +- **Token (Transformers)**: Measures chunk size using the exact tokenizer of your embedding model. This is the recommended choice when using a non-OpenAI embedding model (e.g. BGE, GTE, Qwen via Ollama or an external API). Tiktoken produces incorrect token counts for these models, which can cause chunks to silently exceed the model's maximum sequence length, resulting in truncated embeddings or API errors. + +### Token (Transformers) — Tokenizer Model + +When **Token (Transformers)** is selected, a **Tokenizer Model** field appears in the UI: + +- **Local embedding model**: leave the field empty. Open WebUI automatically uses the tokenizer bundled with the local embedding model. +- **External embedding API** (Ollama, OpenAI-compatible, etc.): the field is required. Enter the HuggingFace repo name of the model whose tokenizer you want to use (e.g. `BAAI/bge-large-en`). Click the download button next to the field to fetch the tokenizer. A local model snapshot path is also accepted. + +:::tip + +Set **Chunk Size** to at most the embedding model's maximum sequence length minus the special tokens it adds (e.g. 510 for a 512-token BERT-family model). Open WebUI will log a warning at startup and during ingestion if `CHUNK_SIZE` exceeds this effective limit. + +::: + ## Chunking Configuration Open WebUI allows you to fine-tune how documents are split into chunks for embedding. This is crucial for optimal retrieval performance.
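The variables documented in these two patches can be combined into a small environment sketch for the external-embedding-API case. The tokenizer repo and chunk size below are illustrative assumptions (the docs themselves only name `BAAI/bge-large-en` as an example), not shipped defaults:

```shell
# Hypothetical configuration sketch: token_transformers splitter with an
# external embedding API, so RAG_TOKENIZER_MODEL must be set explicitly.
export RAG_TEXT_SPLITTER=token_transformers
export RAG_TOKENIZER_MODEL="BAAI/bge-large-en"

# Per the tip above: 512-token BERT-family limit minus 2 special tokens = 510.
export CHUNK_SIZE=$((512 - 2))

echo "splitter=$RAG_TEXT_SPLITTER tokenizer=$RAG_TOKENIZER_MODEL chunk=$CHUNK_SIZE"
```

With a local embedding model, `RAG_TOKENIZER_MODEL` would instead be left unset and the embedding model's bundled tokenizer used automatically.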