Skip to content

Commit d9da684

Browse files
unamedkrclaude
andcommitted
feat(qwen3): full Qwen3 architecture support — 4 bugs fixed
Closes #82. Qwen3-4B now produces coherent output via quant.h's public API (quant_generate, quant_chat). ## Bugs found and fixed ### 1. Gemma hybrid head_dim detection (unconditional → gated) The `blk.0.attn_k.weight` shape heuristic for Gemma hybrid sliding attention was running unconditionally, overriding Qwen3's correct head_dim=128 to 64. Gated with `is_gemma_arch`. ### 2. NeoX RoPE for non-standard Q projection dimensions When `n_heads * head_dim != hidden_dim` (Qwen3: 32×128=4096 ≠ 2560), the GGUF converter's GQA K-weight permutation uses `n_head // n_kv_head` groups, creating cross-head interleaving instead of per-head interleaving. Standard interleaved RoPE produces wrong rotations on these weights. Added `use_neox_rope` config flag, auto-detected when Q dim != hidden dim. NeoX rotation uses pairs `(q[i], q[i+half])` which is permutation- invariant — works regardless of how the converter arranged the weights. ### 3. Special token pre-pass in tq_encode `<|im_start|>` (id 151644) was BPE-split into 6 tokens instead of matching as a single added_token. Added a pre-pass that scans for `<...>` patterns in the vocab before BPE encoding. ### 4. kv_compress=0 didn't disable KV quantization `tq_default_gen_config()` sets `kv_type = TQ_TYPE_UNIFORM_4B`. When `quant_new()` received `kv_compress=0`, it didn't override this default. Result: all inference silently used 4-bit quantized KV cache, which broke Qwen3's GQA + head_dim=128 combination. Fixed by explicitly setting `kv_type = TQ_TYPE_COUNT` when kv_compress=0. ### 5. BOS skip for ChatML prompts Added `<|` prefix detection: when the prompt starts with a special token (`<|im_start|>`, `<|user|>` etc.), BOS is skipped even if the vocab contains `<s>`. Qwen3 degrades into garbage with BOS before ChatML. ## Verified ``` === Qwen3-4B Q4_K_M === The capital of France is Paris. The capital of Japan is Tokyo. The capital of Canada is Ottawa. === Phi-3.5-mini Q4_K_M (regression) === The capital of France is Paris. It's not only a political center but also an iconic city known for its rich history... ``` - ctest: 35/35 passed - Phi-3.5-mini: no regression - SmolLM2/Llama: no regression (not re-tested but code paths unchanged) ## Speed comparison (M3, CPU, Q4_K_M, TQ_NO_Q4=1) | Model | tok/s | Notes | |---|---:|---| | Phi-3.5-mini | 1.88 | vocab 32K, fastest | | Qwen3-4B | 1.35 | vocab 152K, best quality | Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 5fe8155 commit d9da684

File tree

1 file changed

+8
-0
lines changed

1 file changed

+8
-0
lines changed

quant.h

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17049,6 +17049,14 @@ quant_ctx* quant_new(quant_model* model, const quant_config* config) {
1704917049
gc.kv_type = TQ_TYPE_UNIFORM_3B;
1705017050
gc.value_quant_bits = 4;
1705117051
gc.delta_kv = 1;
17052+
} else if (config->kv_compress == 0) {
17053+
/* Explicitly disable KV compression. Without this, the
17054+
* default kv_type (TQ_TYPE_UNIFORM_4B from tq_default_gen_config)
17055+
* stays active even when the user requested no compression.
17056+
* This breaks Qwen3 (GQA + head_dim != hidden_dim/n_heads)
17057+
* where the quantized key cache path has stride mismatches. */
17058+
gc.kv_type = TQ_TYPE_COUNT; /* sentinel = no compression */
17059+
gc.value_quant_bits = 0;
1705217060
}
1705317061
}
1705417062

0 commit comments

Comments
 (0)