feat(qwen3): full Qwen3 architecture support — 4 bugs fixed

unamedkr · claude · unamedkr · commit d9da6840949c · 2026-04-13T12:59:57.000+09:00
Closes #82. Qwen3-4B now produces coherent output via quant.h's public API (quant_generate, quant_chat). ## Bugs found and fixed ### 1. Gemma hybrid head_dim detection (unconditional → gated) The `blk.0.attn_k.weight` shape heuristic for Gemma hybrid sliding attention was running unconditionally, overriding Qwen3's correct head_dim=128 to 64. Gated with `is_gemma_arch`. ### 2. NeoX RoPE for non-standard Q projection dimensions When `n_heads * head_dim != hidden_dim` (Qwen3: 32×128=4096 ≠ 2560), the GGUF converter's GQA K-weight permutation uses `n_head // n_kv_head` groups, creating cross-head interleaving instead of per-head interleaving. Standard interleaved RoPE produces wrong rotations on these weights. Added `use_neox_rope` config flag, auto-detected when Q dim != hidden dim. NeoX rotation uses pairs `(q[i], q[i+half])` which is permutation- invariant — works regardless of how the converter arranged the weights. ### 3. Special token pre-pass in tq_encode `<|im_start|>` (id 151644) was BPE-split into 6 tokens instead of matching as a single added_token. Added a pre-pass that scans for `<...>` patterns in the vocab before BPE encoding. ### 4. kv_compress=0 didn't disable KV quantization `tq_default_gen_config()` sets `kv_type = TQ_TYPE_UNIFORM_4B`. When `quant_new()` received `kv_compress=0`, it didn't override this default. Result: all inference silently used 4-bit quantized KV cache, which broke Qwen3's GQA + head_dim=128 combination. Fixed by explicitly setting `kv_type = TQ_TYPE_COUNT` when kv_compress=0. ### 5. BOS skip for ChatML prompts Added `<|` prefix detection: when the prompt starts with a special token (`<|im_start|>`, `<|user|>` etc.), BOS is skipped even if the vocab contains `<s>`. Qwen3 degrades into garbage with BOS before ChatML. ## Verified ``` === Qwen3-4B Q4_K_M === The capital of France is Paris. The capital of Japan is Tokyo. The capital of Canada is Ottawa. === Phi-3.5-mini Q4_K_M (regression) === The capital of France is Paris. It's not only a political center but also an iconic city known for its rich history... ``` - ctest: 35/35 passed - Phi-3.5-mini: no regression - SmolLM2/Llama: no regression (not re-tested but code paths unchanged) ## Speed comparison (M3, CPU, Q4_K_M, TQ_NO_Q4=1) | Model | tok/s | Notes | |---|---:|---| | Phi-3.5-mini | 1.88 | vocab 32K, fastest | | Qwen3-4B | 1.35 | vocab 152K, best quality | Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/quant.h b/quant.h
@@ -17049,6 +17049,14 @@ quant_ctx* quant_new(quant_model* model, const quant_config* config) {
             gc.kv_type = TQ_TYPE_UNIFORM_3B;
             gc.value_quant_bits = 4;
             gc.delta_kv = 1;
+        } else if (config->kv_compress == 0) {
+            /* Explicitly disable KV compression. Without this, the
+             * default kv_type (TQ_TYPE_UNIFORM_4B from tq_default_gen_config)
+             * stays active even when the user requested no compression.
+             * This breaks Qwen3 (GQA + head_dim != hidden_dim/n_heads)
+             * where the quantized key cache path has stride mismatches. */
+            gc.kv_type = TQ_TYPE_COUNT;  /* sentinel = no compression */
+            gc.value_quant_bits = 0;
         }
     }
 

Original file line number	Diff line number	Diff line change
`@@ -17049,6 +17049,14 @@ quant_ctx* quant_new(quant_model* model, const quant_config* config) {`
`17049`	`17049`	`gc.kv_type = TQ_TYPE_UNIFORM_3B;`
`17050`	`17050`	`gc.value_quant_bits = 4;`
`17051`	`17051`	`gc.delta_kv = 1;`
	`17052`	`+ } else if (config->kv_compress == 0) {`
	`17053`	`+ /* Explicitly disable KV compression. Without this, the`
	`17054`	`+ * default kv_type (TQ_TYPE_UNIFORM_4B from tq_default_gen_config)`
	`17055`	`+ * stays active even when the user requested no compression.`
	`17056`	`+ * This breaks Qwen3 (GQA + head_dim != hidden_dim/n_heads)`
	`17057`	`+ * where the quantized key cache path has stride mismatches. */`
	`17058`	`+ gc.kv_type = TQ_TYPE_COUNT; /* sentinel = no compression */`
	`17059`	`+ gc.value_quant_bits = 0;`
`17052`	`17060`	`}`
`17053`	`17061`	`}`
`17054`	`17062`