Commit ba8a615

unamedkr and claude committed
fix(qwen35): suppress <think> token — Qwen3.5-4B short prompts now work (#95)

Root cause: NOT a DeltaNet implementation bug. Qwen3.5 defaults to thinking
mode (<think>...</think>), consuming the entire max_tokens budget on internal
reasoning before the actual answer. "What is 2+2?" generated
"<think>\n\n2+2=4\n\n</think>\n\n4" — the "4" was at token ~15, beyond
max_tokens=8.

Three fixes in tq_generate:

1. Suppress the <think> logit to -1e30 before sampling (prevents entering
   thinking mode)
2. Strip leading whitespace tokens (catches the residual \n\n)
3. Skipped tokens do not count toward the max_tokens budget

Results:

Before: "What is 2+2?" → "The answer to **" (FAIL)
After:  "What is 2+2?" → "4" (PASS)
Document QA: still works (no regression)

Closes #95

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 53b3323 commit ba8a615

File tree

1 file changed: +33 -1 lines changed


quant.h

Lines changed: 33 additions & 1 deletion
@@ -16175,6 +16175,15 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
         }
     }
 
+    /* Suppress <think> token to disable thinking/reasoning mode.
+     * Qwen3.5 models default to thinking mode which adds many tokens
+     * of internal reasoning before the actual answer. By suppressing
+     * the <think> special token, the model goes directly to answering. */
+    int think_token_id = tokenizer ? str_lookup(tokenizer, "<think>") : -1;
+    if (think_token_id >= 0 && think_token_id < vocab_size) {
+        state->logits[think_token_id] = -1e30f;
+    }
+
     /* Sample first generated token. The seed is configurable via
      * config->rng_seed (default 42); 0 falls back to 42 so existing
      * callers that never set rng_seed get bit-identical behaviour. */
@@ -16191,6 +16200,7 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
     int generated = 0;
     int output_pos = 0;
     int prev_token = prompt_tokens[n_prompt - 1];
+    int seen_nonwhitespace = 0; /* track whether we've emitted non-whitespace yet */
 
     /* EOS token IDs — check common values across model families.
      * Qwen3.5: eos = 248044 (<|endoftext|>), 248046 (<|im_end|>)
@@ -16286,6 +16296,19 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
             strstr(piece, "<1st>") || strstr(piece, "<2nd>") || strstr(piece, "<3rd>")) {
             piece = "";
         }
+        /* Skip leading whitespace-only tokens (Qwen3.5 thinking mode
+         * produces <think>...</think> which gets filtered, but the
+         * surrounding newlines remain as plain text tokens).
+         * Only skip before any non-whitespace content has been emitted. */
+        if (!seen_nonwhitespace && piece[0] != '\0') {
+            const char* p = piece;
+            while (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t') p++;
+            if (*p == '\0') {
+                piece = ""; /* all whitespace — skip */
+            } else {
+                seen_nonwhitespace = 1;
+            }
+        }
     }
     if (should_stop) break;
 
@@ -16307,7 +16330,11 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
         prev_token = next_token;
         tq_forward(model, state, next_token, pos);
         pos++;
-        generated++;
+        /* Only count tokens that produced visible output toward the limit.
+         * Leading whitespace from thinking mode should not consume the budget. */
+        if (seen_nonwhitespace) {
+            generated++;
+        }
 
         /* Apply repetition penalty before sampling */
         if (rep_penalty > 1.0f) {
@@ -16325,6 +16352,11 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
             }
         }
 
+        /* Suppress <think> token to prevent entering thinking mode */
+        if (think_token_id >= 0 && think_token_id < vocab_size) {
+            state->logits[think_token_id] = -1e30f;
+        }
+
         /* Sample next token */
         next_token = tq_sample_topp(state->logits, vocab_size,
                                     config->temperature, config->top_p,
