Commit 8bc8b51 (parent 24c97fa)

fix: Qwen3 thinking model output leak — correct sampling, think filter, search thinking block

- Replace greedy decoding with Qwen3 model card sampling params (temp=0.6/top_p=0.95 for thinking)
- Filter thinking content in ai-worker via skip_special_tokens:false + state machine
- Increase thinking model token limit from 1024 to 4096
- Add I'll/I'm contraction patterns + trailing cleanup to cleanThinkingArtifacts
- Search results shown in collapsible thinking block before AI response
- Move changelog to changelogs/ directory

7 files changed · 248 additions, 52 deletions

ai-worker-common.js
Lines changed: 3 additions & 3 deletions

```diff
@@ -37,7 +37,7 @@ const SYSTEM_PROMPTS = {
   autocomplete:
     'You are a helpful writing assistant. Continue writing the text naturally. Only output the continuation, do not repeat the existing text. Write 1-2 sentences.',
   generate:
-    'You are a helpful content generation assistant. Generate content based on the user\'s request. Output in well-formatted markdown.',
+    'You are a helpful content generation assistant. Generate content based on the user\'s request. Output in well-formatted markdown. Do NOT use LaTeX $...$ or $$...$$ notation for math — use plain text or Unicode instead (e.g. write "x²" not "$x^2$"). Do NOT include any internal thinking, reasoning process, mental notes, or meta-commentary. Output ONLY the final answer.',
   markdown:
     'You are a markdown expert. Generate well-formatted markdown content based on the user\'s request. Use headings, lists, tables, code blocks, and other markdown features as appropriate.',
   explain:
@@ -52,8 +52,8 @@ const SYSTEM_PROMPTS = {
     'You are a helpful writing assistant. Elaborate on the following text by adding more details, examples, and explanations to make it more comprehensive. Output in markdown format.',
   shorten:
     'You are a concise writing editor. Shorten the following text while preserving all key information. Remove redundancy and use fewer words. Only output the shortened text.',
-  qa: 'You are a helpful assistant. Answer the user\'s question based on the provided document context. Be concise. If the answer cannot be found in the context, say so.',
-  chat: 'You are a helpful AI assistant integrated into a Markdown editor. Help the user with writing, editing, and formatting tasks. Be concise. Output in markdown format.',
+  qa: 'You are a helpful assistant. The user may have document context open in their editor. If the question relates to the provided context, use it to answer. If the question is unrelated to the context, answer directly from your knowledge. Be concise. Do NOT use LaTeX $...$ or $$...$$ notation — use plain text or Unicode for math. Do NOT include any internal reasoning, thinking process, or meta-commentary. Output in markdown format.',
+  chat: 'You are a helpful AI assistant integrated into a Markdown editor. Help the user with writing, editing, and formatting tasks. Be concise. Output in markdown format. Do NOT use LaTeX $...$ or $$...$$ notation for math — use plain text or Unicode instead. Do NOT include any internal thinking, reasoning steps, drafting notes, or meta-commentary. Output ONLY the final polished answer.',
 };

 /**
```

ai-worker-gemini.js
Lines changed: 3 additions & 3 deletions

```diff
@@ -146,16 +146,16 @@ function buildMessages(taskType, context, userPrompt) {
   rephrase: 'You are a helpful writing assistant. Rephrase the following text to improve clarity and readability while preserving the meaning. Output in markdown format.',
   grammar: 'You are a helpful writing assistant. Fix any grammar, spelling, and punctuation errors in the following text. Only output the corrected text, nothing else.',
   autocomplete: 'You are a helpful writing assistant. Continue writing the text naturally. Only output the continuation, do not repeat the existing text. Write 1-2 sentences.',
-  generate: 'You are a helpful content generation assistant. Generate content based on the user\'s request. Output in well-formatted markdown.',
+  generate: 'You are a helpful content generation assistant. Generate content based on the user\'s request. Output in well-formatted markdown. Do NOT use LaTeX $...$ or $$...$$ notation for math — use plain text or Unicode instead (e.g. write "x²" not "$x^2$"). Do NOT include any internal thinking, reasoning process, mental notes, or meta-commentary. Output ONLY the final answer.',
   markdown: 'You are a markdown expert. Generate well-formatted markdown content based on the user\'s request. Use headings, lists, tables, code blocks, and other markdown features as appropriate.',
   explain: 'You are a helpful assistant. Explain the following text in simple, easy-to-understand terms. Be concise. Output in markdown format.',
   simplify: 'You are a helpful writing assistant. Simplify the following text to make it easier to understand. Use shorter sentences and simpler words. Output in markdown format.',
   polish: 'You are a skilled writing editor. Polish the following text to improve flow, word choice, and overall quality while preserving the meaning and tone. Only output the polished text.',
   formalize: 'You are a professional writing assistant. Rewrite the following text in a more formal, professional tone suitable for business or academic contexts. Only output the formalized text.',
   elaborate: 'You are a helpful writing assistant. Elaborate on the following text by adding more details, examples, and explanations to make it more comprehensive. Output in markdown format.',
   shorten: 'You are a concise writing editor. Shorten the following text while preserving all key information. Remove redundancy and use fewer words. Only output the shortened text.',
-  qa: 'You are a helpful assistant. Answer the user\'s question based on the provided document context. Be concise. If the answer cannot be found in the context, say so.',
-  chat: 'You are a helpful AI assistant integrated into a Markdown editor. Help the user with writing, editing, and formatting tasks. Be concise. Output in markdown format.',
+  qa: 'You are a helpful assistant. The user may have document context open in their editor. If the question relates to the provided context, use it to answer. If the question is unrelated to the context, answer directly from your knowledge. Be concise. Do NOT use LaTeX $...$ or $$...$$ notation — use plain text or Unicode for math. Do NOT include any internal reasoning, thinking process, or meta-commentary. Output in markdown format.',
+  chat: 'You are a helpful AI assistant integrated into a Markdown editor. Help the user with writing, editing, and formatting tasks. Be concise. Output in markdown format. Do NOT use LaTeX $...$ or $$...$$ notation for math — use plain text or Unicode instead. Do NOT include any internal thinking, reasoning steps, drafting notes, or meta-commentary. Output ONLY the final polished answer.',
 };
 const systemMessage = systemPrompts[taskType] || systemPrompts.chat;
 const messages = [{ role: 'system', content: systemMessage }];
```

ai-worker.js
Lines changed: 112 additions & 22 deletions

```diff
@@ -259,27 +259,62 @@ async function generate(taskType, context, userPrompt, messageId, enableThinking
   // Process text + image together
   const inputs = await processor(prompt, rawImage);

-  // Collect streamed text
+  // Collect streamed text — filter thinking content
+  // Use skip_special_tokens:false when thinking is on so we see <think>/</think> markers
   let fullText = '';
+  let inThinkingPhase = !!enableThinking;
+  let thinkingBuffer = '';
   const streamer = new TextStreamer(processor.tokenizer, {
     skip_prompt: true,
-    skip_special_tokens: true,
+    skip_special_tokens: !enableThinking,
     callback_function: (token) => {
-      fullText += token;
+      if (!enableThinking) {
+        fullText += token;
+        self.postMessage({ type: "token", token, messageId });
+        return;
+      }
+      if (inThinkingPhase) {
+        thinkingBuffer += token;
+        if (thinkingBuffer.includes('</think>')) {
+          inThinkingPhase = false;
+          const afterThink = thinkingBuffer.substring(
+            thinkingBuffer.indexOf('</think>') + '</think>'.length
+          );
+          const cleaned = afterThink.replace(/<\|[^|]*\|>/g, '').replace(/<\/?(?:think|thinking|thought)>/gi, '');
+          if (cleaned.trim()) {
+            fullText += cleaned;
+            self.postMessage({ type: "token", token: cleaned, messageId });
+          }
+        }
+        return;
+      }
+      const cleaned = token.replace(/<\|[^|]*\|>/g, '').replace(/<\/?(?:think|thinking|thought)>/gi, '');
+      if (cleaned) {
+        fullText += cleaned;
+        self.postMessage({ type: "token", token: cleaned, messageId });
+      }
     },
   });

-  // Generate
-  await model.generate({
-    ...inputs,
-    do_sample: true,
-    max_new_tokens: maxTokens,
-    streamer,
-  });
+  // Generate — Qwen3 model card: use sampling, NOT greedy, for thinking mode
+  const genConfig = enableThinking
+    ? { do_sample: true, temperature: 0.6, top_p: 0.95, top_k: 20, max_new_tokens: Math.max(maxTokens, 4096) }
+    : { do_sample: true, temperature: 0.7, top_p: 0.8, top_k: 20, max_new_tokens: maxTokens };
+  await model.generate({ ...inputs, ...genConfig, streamer });
+
+  // Final cleanup — strip any remaining think tags or special tokens
+  let cleanedText = fullText.trim();
+  cleanedText = cleanedText.replace(/<(?:think|thinking|thought)>[\s\S]*?<\/(?:think|thinking|thought)>/gi, '');
+  cleanedText = cleanedText.replace(/<(?:think|thinking|thought)>[\s\S]*$/gi, '');
+  const closeMatch = cleanedText.match(/<\/(?:think|thinking|thought)>/i);
+  if (closeMatch) {
+    cleanedText = cleanedText.substring(cleanedText.indexOf(closeMatch[0]) + closeMatch[0].length);
+  }
+  cleanedText = cleanedText.replace(/<\|[^|]*\|>/g, '').trim();

   self.postMessage({
     type: "complete",
-    text: fullText.trim(),
+    text: cleanedText.trim(),
     messageId,
   });
 } else {
@@ -294,27 +329,82 @@ async function generate(taskType, context, userPrompt, messageId, enableThinking
     return_tensors: "pt",
   });

-  // Collect streamed text
+  // --- Thinking-aware streaming ---
+  // When enableThinking is on, the model generates:
+  //   <think>...thinking content...</think>\n\nactual response
+  //
+  // Problem: skip_special_tokens:true strips <think> and </think> markers,
+  // making it impossible to detect where thinking ends.
+  // Solution: use skip_special_tokens:false so we see the markers,
+  // then manually filter thinking content and strip special tokens.
   let fullText = "";
+  let inThinkingPhase = !!enableThinking;
+  let thinkingBuffer = ""; // buffer thinking content (not forwarded)
+
   const streamer = new TextStreamer(processor.tokenizer, {
     skip_prompt: true,
-    skip_special_tokens: true,
+    skip_special_tokens: !enableThinking, // false when thinking, so we see markers
     callback_function: (token) => {
-      fullText += token;
+      if (!enableThinking) {
+        // Normal mode: forward everything
+        fullText += token;
+        self.postMessage({ type: "token", token, messageId });
+        return;
+      }
+
+      // Thinking mode: track <think>...</think> boundary
+      if (inThinkingPhase) {
+        thinkingBuffer += token;
+        // Check if we've seen the </think> closing marker
+        if (thinkingBuffer.includes('</think>')) {
+          inThinkingPhase = false;
+          // Extract anything after </think> (there might be content)
+          const afterThink = thinkingBuffer.substring(
+            thinkingBuffer.indexOf('</think>') + '</think>'.length
+          );
+          // Clean special tokens from the after-think content
+          const cleaned = afterThink
+            .replace(/<\|[^|]*\|>/g, '') // strip <|im_start|>, <|im_end|>, etc.
+            .replace(/<\/?(?:think|thinking|thought)>/gi, '');
+          if (cleaned.trim()) {
+            fullText += cleaned;
+            self.postMessage({ type: "token", token: cleaned, messageId });
+          }
+        }
+        return; // don't forward thinking tokens
+      }
+
+      // Post-thinking: forward real content, strip any special tokens
+      const cleaned = token
+        .replace(/<\|[^|]*\|>/g, '')
+        .replace(/<\/?(?:think|thinking|thought)>/gi, '');
+      if (cleaned) {
+        fullText += cleaned;
+        self.postMessage({ type: "token", token: cleaned, messageId });
+      }
     },
   });

-  // Generate
-  await model.generate({
-    ...inputs,
-    do_sample: false,
-    max_new_tokens: maxTokens,
-    streamer,
-  });
+  // Generate — Qwen3 model card: use sampling, NOT greedy, for thinking mode
+  // Thinking: temp=0.6, top_p=0.95, top_k=20 | Non-thinking: temp=0.7, top_p=0.8, top_k=20
+  const genConfig = enableThinking
+    ? { do_sample: true, temperature: 0.6, top_p: 0.95, top_k: 20, max_new_tokens: Math.max(maxTokens, 4096) }
+    : { do_sample: true, temperature: 0.7, top_p: 0.8, top_k: 20, max_new_tokens: maxTokens };
+  await model.generate({ ...inputs, ...genConfig, streamer });
+
+  // Final cleanup: strip any remaining think tags or reasoning artifacts
+  let cleanedText = fullText.trim();
+  cleanedText = cleanedText.replace(/<(?:think|thinking|thought)>[\s\S]*?<\/(?:think|thinking|thought)>/gi, '');
+  cleanedText = cleanedText.replace(/<(?:think|thinking|thought)>[\s\S]*$/gi, '');
+  const closeMatch = cleanedText.match(/<\/(?:think|thinking|thought)>/i);
+  if (closeMatch) {
+    cleanedText = cleanedText.substring(cleanedText.indexOf(closeMatch[0]) + closeMatch[0].length);
+  }
+  cleanedText = cleanedText.replace(/<\|[^|]*\|>/g, '').trim();

   self.postMessage({
     type: "complete",
-    text: fullText.trim(),
+    text: cleanedText.trim(),
     messageId,
   });
 }
```
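The streamer callback in the diff above mixes the filtering logic with worker messaging, which makes it hard to test in isolation. The following is a minimal standalone sketch of the same state machine, extracted for illustration (the factory name `createThinkFilter` is ours, not part of the commit): it suppresses tokens until `</think>` appears, forwards the remainder of the boundary-crossing token, and strips special tokens from everything after.

```javascript
// Standalone sketch of the worker's thinking-filter state machine.
// Returns a function mapping a streamed token to the text that should
// be forwarded to the UI ('' while still inside the <think> block).
function createThinkFilter(enableThinking) {
  let inThinkingPhase = !!enableThinking;
  let thinkingBuffer = '';
  const stripSpecial = (s) =>
    s.replace(/<\|[^|]*\|>/g, '')                     // <|im_start|>, <|im_end|>, ...
     .replace(/<\/?(?:think|thinking|thought)>/gi, ''); // leftover think tags

  return function filter(token) {
    if (!enableThinking) return token; // normal mode: pass through
    if (inThinkingPhase) {
      thinkingBuffer += token;
      if (!thinkingBuffer.includes('</think>')) return ''; // still thinking
      inThinkingPhase = false;
      // The token that closes </think> may also carry the first answer text.
      const afterThink = thinkingBuffer.slice(
        thinkingBuffer.indexOf('</think>') + '</think>'.length
      );
      const cleaned = stripSpecial(afterThink);
      return cleaned.trim() ? cleaned : '';
    }
    return stripSpecial(token); // post-thinking: forward, minus special tokens
  };
}
```

Note the `</think>` marker can be split across tokens, which is why the buffer accumulates rather than checking each token independently.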

CHANGELOG-search-thinking-block.md renamed to changelogs/CHANGELOG-search-thinking-block.md
Lines changed: 19 additions & 2 deletions

```diff
@@ -38,12 +38,29 @@ Refactors the AI chat search flow to show web search results in a collapsible "t
 **What:** Added `.ai-thinking-block` container with green-accented border and fade-in animation, `.ai-thinking-spin` rotation keyframe for the search spinner, `.ai-thinking-searching` for the loading state, and `.ai-thinking-no-results` for the empty state with amber info icon. Dark mode variants included.
 **Impact:** Consistent, polished visual treatment matching the existing AI panel design.

+## 5. Qwen3 Thinking Model — Correct Sampling Parameters
+**Files:** `ai-worker.js`
+**What:** Replaced greedy decoding (`do_sample: false`) with sampling using Qwen3 model card recommended parameters: `temperature=0.6, top_p=0.95, top_k=20` for thinking mode and `temperature=0.7, top_p=0.8, top_k=20` for non-thinking mode. Greedy decoding causes "performance degradation and endless repetitions" per Qwen3 docs. Increased max tokens from 1024 to 4096 for thinking mode.
+**Impact:** The thinking model no longer gets stuck in an infinite thinking loop and actually produces the answer.
+
+## 6. Thinking Content Filter — Worker-level `<think>` Tag Stripping
+**Files:** `ai-worker.js`
+**What:** When `enableThinking` is true, set `skip_special_tokens: false` so `<think>`/`</think>` markers remain visible in the TextStreamer callback. Added a state machine that buffers thinking tokens and only forwards content after `</think>`. Strips leftover special tokens (`<|im_start|>`, etc.) from forwarded content. Applied to both the text-only and vision generation paths.
+**Impact:** Raw thinking content (planning bullets, reasoning monologue) no longer leaks into the chat response.
+
+## 7. Improved `cleanThinkingArtifacts` — Contraction Patterns & Trailing Cleanup
+**Files:** `js/ai-chat.js`
+**What:** Added `I'll/I'm/I've/I'd` contraction patterns to the reasoning detector (previously it only matched `I 'll` with a space). Added trailing cleanup that strips planning outlines (`1. What the Black-Scholes equation is...`) and bare numbered items (`4.`) from the end of responses.
+**Impact:** Catches residual reasoning that appears after `</think>` in the model's actual response content.
+
 ---

-## Files Changed (3 total)
+## Files Changed (5 total)

 | File | Lines Changed | Type |
 |------|:---:|------|
-| `js/ai-chat.js` | +217 −48 | Two-phase thinking block, removed inline search duplication |
+| `js/ai-chat.js` | +217 −48 | Two-phase thinking block, reasoning cleanup, removed inline search duplication |
 | `js/ai-assistant.js` | +4 −3 | Fixed user message dedup check |
 | `css/ai-panel.css` | +194 −0 | Thinking block styles, spinner, no-results state |
+| `ai-worker.js` | ~+80 −30 | Correct sampling params, thinking content filter |
+| `ai-worker-common.js` | modified | Supporting worker changes |
```
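The `js/ai-chat.js` changes described in item 7 are not included in this page's diffs, so the following is only a hypothetical sketch of the kind of contraction-aware reasoning detector the changelog describes. The regex, the function name `looksLikeLeakedReasoning`, and the chosen opener phrases are all assumptions for illustration, not the actual implementation.

```javascript
// Hypothetical sketch: detect lines that read like leaked chain-of-thought.
// Matches first-person planning openers, including contractions such as
// I'll / I'm / I've / I'd (the bug fixed in item 7 was missing these).
const reasoningOpeners = /^(?:I(?:'ll|'m|'ve|'d)?|Let me|First,|Okay,)\b/;

function looksLikeLeakedReasoning(line) {
  return reasoningOpeners.test(line.trim());
}
```

A real implementation would pair a detector like this with the trailing cleanup (stripping leftover planning outlines and bare numbered items) that the changelog also mentions.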

js/ai-docgen-generate.js
Lines changed: 4 additions & 1 deletion

```diff
@@ -214,7 +214,10 @@
 function cleanGeneratedOutput(text) {
   if (!text) return text;

-  text = text.replace(/<thinking>[\s\S]*?<\/thinking>/gi, '');
+  text = text.replace(/<(?:think|thinking|thought)>[\s\S]*?<\/(?:think|thinking|thought)>/gi, '');
+  text = text.replace(/<(?:think|thinking|thought)>[\s\S]*$/gi, '');
+  var closeMatch = text.match(/<\/(?:think|thinking|thought)>/i);
+  if (closeMatch) { text = text.substring(text.indexOf(closeMatch[0]) + closeMatch[0].length); }

   var thinkingPatterns = [
     /^[\s\S]*?Thinking Process:[\s\S]*?(?=^#|\n#)/m,
```
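The three replace steps above handle three distinct leak shapes: a fully closed `<think>...</think>` block, an opening tag that was never closed (generation cut off mid-thought), and an orphan closing tag whose opener was already consumed by the streamer. A standalone sketch of just that stripping logic (the function name `stripThinkTags` is ours, for illustration):

```javascript
// Sketch of the tag-stripping steps from the diff above, shown in isolation.
function stripThinkTags(text) {
  if (!text) return text;
  // 1. Remove complete <think>...</think> blocks (and thinking/thought aliases).
  text = text.replace(/<(?:think|thinking|thought)>[\s\S]*?<\/(?:think|thinking|thought)>/gi, '');
  // 2. Remove an unclosed opening tag and everything after it.
  text = text.replace(/<(?:think|thinking|thought)>[\s\S]*$/gi, '');
  // 3. If only a closing tag leaked through, keep what follows it.
  var closeMatch = text.match(/<\/(?:think|thinking|thought)>/i);
  if (closeMatch) {
    text = text.substring(text.indexOf(closeMatch[0]) + closeMatch[0].length);
  }
  return text;
}
```

Step order matters: the lazy pair-match must run before the unclosed-tag sweep, or a closed block's opener would greedily consume the real answer after it.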
