21 changes: 21 additions & 0 deletions docs/features/chat-conversations/rag/index.md
@@ -53,6 +53,27 @@ Use the **Chunk Min Size Target** setting (found in **Admin Panel > Settings > D

:::

## Text Splitter Options

Open WebUI supports three text splitter modes, selectable via **Admin Panel > Settings > Documents > Text Splitter**:

- **Character (default)**: Splits at natural text boundaries (paragraphs, sentences, whitespace) and measures chunk size in characters. Works well for most setups. Note that characters and tokens are not equivalent — for non-Latin scripts (e.g. Chinese, Japanese, Korean) the character-to-token ratio can approach 1:1, so a large character-based chunk size can silently exceed your embedding model's token limit, causing truncated embeddings or API errors.
- **Token (Tiktoken)**: Measures chunk size in tokens using OpenAI's Tiktoken encoding. Produces accurate token counts for OpenAI embedding models, but can be inaccurate for non-OpenAI models.
- **Token (Transformers)**: Measures chunk size using the exact tokenizer of your embedding model. This is the recommended choice when using a non-OpenAI embedding model (e.g. BGE, GTE, Qwen via Ollama or an external API). Tiktoken produces incorrect token counts for these models, which can cause chunks to silently exceed the model's maximum sequence length, resulting in truncated embeddings or API errors.
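
The practical difference between the three modes is the length function the splitter uses to measure chunks. The following is a minimal, simplified sketch of a recursive splitter with a pluggable length function (this is not Open WebUI's actual implementation, and the whitespace "tokenizer" is a stand-in for a real one):

```python
def recursive_split(text, chunk_size, length_fn, separators=("\n\n", "\n", " ", "")):
    """Split text at the coarsest separator that yields pieces within chunk_size,
    where 'size' is whatever length_fn measures (characters, tokens, ...)."""
    if length_fn(text) <= chunk_size:
        return [text]
    sep = separators[0]
    rest = separators[1:] if len(separators) > 1 else separators
    parts = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for part in parts:
        candidate = current + sep + part if current else part
        if length_fn(candidate) <= chunk_size:
            current = candidate  # part still fits: keep accumulating
        else:
            if current:
                chunks.append(current)
            if length_fn(part) > chunk_size:
                # part alone is too big: retry with a finer separator
                chunks.extend(recursive_split(part, chunk_size, length_fn, rest))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks

char_len = len                         # "Character" mode: size in characters
token_len = lambda s: len(s.split())   # stand-in for a real tokenizer's count

text = "one two three four five six seven eight"
print(recursive_split(text, 20, char_len))   # → ['one two three four', 'five six seven eight']
print(recursive_split(text, 3, token_len))   # → ['one two three', 'four five six', 'seven eight']
```

The same text splits differently depending on how length is measured, which is exactly why a character-based chunk size can overshoot a token-based embedding limit.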

### Token (Transformers) — Tokenizer Model

When **Token (Transformers)** is selected, a **Tokenizer Model** field appears in the UI:

- **Local embedding model**: leave the field empty. Open WebUI automatically uses the tokenizer bundled with the local embedding model.
- **External embedding API** (Ollama, OpenAI-compatible, etc.): the field is required. Enter the HuggingFace repo name of the model whose tokenizer you want to use (e.g. `BAAI/bge-large-en`). Click the download button next to the field to fetch the tokenizer. A local model snapshot path is also accepted.

:::tip

Set **Chunk Size** to at most the embedding model's maximum sequence length minus the number of special tokens it adds (e.g. 510 for a 512-token BERT-family model). Open WebUI will log a warning at startup and during ingestion if `CHUNK_SIZE` exceeds this effective limit.

:::
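
The arithmetic in the tip can be sketched as a small validator. This is illustrative only, not Open WebUI's actual check; the default of 2 special tokens matches BERT-family models, which add `[CLS]` and `[SEP]`:

```python
import logging

def effective_chunk_limit(max_seq_len: int, num_special_tokens: int = 2) -> int:
    """Largest safe chunk size: room must be left for special tokens."""
    return max_seq_len - num_special_tokens

def check_chunk_size(chunk_size: int, max_seq_len: int, num_special_tokens: int = 2) -> bool:
    """Warn when the configured chunk size exceeds the effective limit."""
    limit = effective_chunk_limit(max_seq_len, num_special_tokens)
    if chunk_size > limit:
        logging.warning("CHUNK_SIZE %d exceeds effective limit %d", chunk_size, limit)
        return False
    return True

print(effective_chunk_limit(512))   # → 510
print(check_chunk_size(510, 512))   # → True
print(check_chunk_size(512, 512))   # → False (and logs a warning)
```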

## Chunking Configuration

Open WebUI allows you to fine-tune how documents are split into chunks for embedding. This is crucial for optimal retrieval performance.
21 changes: 20 additions & 1 deletion docs/reference/env-configuration.mdx
@@ -1975,6 +1975,11 @@ When configured, these custom schemes will be validated alongside `http` and `ht
- Default: `False`
- Description: Determines whether to allow custom models defined on the Hub in their own modeling files.

#### `RAG_TOKENIZER_MODEL_TRUST_REMOTE_CODE`

- Type: `bool`
- Default: `False`
- Description: Determines whether to allow custom tokenizers defined on the Hub in their own modeling files.

#### `RAG_RERANKING_MODEL_TRUST_REMOTE_CODE`

@@ -1989,6 +1994,12 @@ modeling files for reranking.
- Default: `True`
- Description: Toggles automatic update of the Sentence-Transformer model.

#### `RAG_TOKENIZER_MODEL_AUTO_UPDATE`

- Type: `bool`
- Default: `True`
- Description: Toggles automatic update of the tokenizer model.

#### `RAG_RERANKING_MODEL_AUTO_UPDATE`

- Type: `bool`
@@ -3175,8 +3186,16 @@ Provide a clear and direct response to the user's query, including inline citati
- Options:
- `character`
- `token`
- `token_transformers`
- Default: `character`
- Description: Sets the text splitter for RAG models. Use `character` for RecursiveCharacterTextSplitter or `token` for TokenTextSplitter (Tiktoken-based).
- Description: Sets the text splitter for RAG models. Use `character` for RecursiveCharacterTextSplitter, `token` for TokenTextSplitter (Tiktoken-based), or `token_transformers` for RecursiveCharacterTextSplitter with a length function based on the specified tokenizer; the latter requires `RAG_TOKENIZER_MODEL` when not using a local embedding model.
- Persistence: This environment variable is a `PersistentConfig` variable.

#### `RAG_TOKENIZER_MODEL`

- Type: `str`
- Default: empty
- Description: HuggingFace repo name or local model path of the tokenizer to use with the `token_transformers` text splitter. Leave empty when using a local embedding model — the embedding model's built-in tokenizer is used automatically. Required when using an external embedding API (e.g. Ollama, OpenAI-compatible) with `RAG_TEXT_SPLITTER=token_transformers`.
- Persistence: This environment variable is a `PersistentConfig` variable.
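
The two variables are typically set together in container deployments. A hypothetical `docker run` invocation (the image tag, port mapping, and tokenizer model name are examples; adjust them to your setup):

```shell
docker run -d -p 3000:8080 \
  -e RAG_TEXT_SPLITTER="token_transformers" \
  -e RAG_TOKENIZER_MODEL="BAAI/bge-large-en" \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```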

#### `ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER`