## Description

Qwen3.5-4B loads and generates coherent output, but inference is extremely slow at ~0.7 tok/s on Apple M3. The main bottleneck is the FP32 dequantization of the DeltaNet attention layers at load time.
## Benchmark

| Model | Params | Vocab | tok/s | Notes |
|---|---|---|---|---|
| Phi-3.5-mini (Q8) | 3.8B | 32K | ~8 | Fast |
| SmolLM2-1.7B (Q8) | 1.7B | 49K | ~12.5 | Fastest |
| Qwen3.5-4B (Q4) | 4B | 248K | ~0.7 | ~11x slower than Phi-3.5 |
## Root Cause

The server log shows all 24 DeltaNet layers being dequantized to FP32:

```
tq_load_gguf: layer 0 attn_qkv dequant to FP32 (was type 13)
tq_load_gguf: layer 1 attn_qkv dequant to FP32 (was type 13)
...
tq_load_gguf: layer 30 attn_qkv dequant to FP32 (was type 13)
```
Two bottlenecks:

- DeltaNet FP32 dequant — 24 layers × full QKV tensors converted to FP32 at load time, consuming far more memory and negating the speed benefit of quantization
- 248K vocab output projection — every token requires a 2560 × 248K matmul to compute logits; the vocab is 7.75x larger than Phi-3.5's 32K, and the matmul roughly 6.5x larger than Phi-3.5's (3072 × 32K). See the cost sketch after this list.
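For scale, a back-of-the-envelope cost model (a standalone sketch; the dimensions come from this report, while ~4.5 bits/weight for Q4_K_M is an assumption, not a measured figure):

```c
#include <stdio.h>

/* Rough numbers for the two bottlenecks above. Dimensions come from
 * this report; ~4.5 bits/weight for Q4_K_M is an assumption. */
int main(void) {
    /* Bottleneck 1: dequantizing Q4 weights to FP32 inflates memory */
    printf("dequant memory inflation: %.1fx\n", 32.0 / 4.5);     /* ~7.1x */

    /* Bottleneck 2: per-token logit matmul is hidden_dim x vocab */
    double qwen = 2560.0 * 248000.0;  /* ~635M multiply-adds/token */
    double phi  = 3072.0 *  32000.0;  /* ~98M multiply-adds/token  */
    printf("qwen logits: %.0fM MACs/token\n", qwen / 1e6);
    printf("phi  logits: %.0fM MACs/token\n",  phi / 1e6);
    printf("ratio: %.1fx\n", qwen / phi);                        /* ~6.5x */
    return 0;
}
```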
## Impact

At 0.7 tok/s, generating 80 tokens takes ~114 seconds — unusable for interactive chat. Despite Qwen3.5-4B having the best output quality among the tested models, the speed makes it impractical.
## Suggested Optimizations

- Keep DeltaNet layers in quantized format — run Q4/Q8 matmuls directly instead of dequantizing to FP32 (see the dot-product sketch after this list)
- Optimize the vocab projection — for large-vocab models, consider top-k logit computation or speculative sampling
- DeltaNet-specific kernel — linear attention keeps a constant-size recurrent state instead of a growing KV cache; exploit this for speed (see the delta-rule sketch below)
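A minimal sketch of the first suggestion, assuming a GGML-style Q8_0 block layout (32 int8 weights sharing one FP32 scale). The type and function names are hypothetical, not quant.h's actual API:

```c
#include <stdint.h>

/* Hypothetical Q8_0-style block: 32 int8 weights + one fp32 scale.
 * quant.h's real block types may differ; this only illustrates the idea. */
#define QK 32
typedef struct {
    float  scale;
    int8_t q[QK];
} block_q8;

/* Dot product of one quantized weight row with an fp32 activation vector,
 * without ever materializing the row as FP32 in memory: weights are
 * expanded on the fly, one 32-wide block at a time, in registers. */
static float dot_q8_f32(const block_q8 *row, const float *x, int n_blocks) {
    float acc = 0.0f;
    for (int b = 0; b < n_blocks; b++) {
        float sum = 0.0f;
        for (int i = 0; i < QK; i++)
            sum += (float)row[b].q[i] * x[b * QK + i];
        acc += row[b].scale * sum;  /* per-block scale applied once */
    }
    return acc;
}
```

Memory traffic then stays at the quantized size rather than a ~7x larger FP32 copy, and the inner loop vectorizes cleanly with NEON on M-series chips.

For the third suggestion, the delta-rule recurrence replaces the KV cache with a constant-size d_v × d_k state matrix, so per-token cost is independent of context length: S_t = S_{t-1} + β(v_t − S_{t-1}k_t)k_tᵀ, with output o_t = S_t q_t. A plain-C sketch of one decode step (illustrative only, not existing quant.h code):

```c
/* One DeltaNet (delta-rule) decode step. State S is d_v x d_k, updated
 * in place; nothing grows with sequence length. Illustrative only. */
static void deltanet_step(float *S, const float *q, const float *k,
                          const float *v, float beta,
                          float *out, int d_k, int d_v) {
    for (int i = 0; i < d_v; i++) {
        /* prediction error for this value dim: v_i - (S k)_i */
        float sk = 0.0f;
        for (int j = 0; j < d_k; j++) sk += S[i * d_k + j] * k[j];
        float err = beta * (v[i] - sk);
        /* rank-1 state update: row i of S += err * k^T */
        for (int j = 0; j < d_k; j++) S[i * d_k + j] += err * k[j];
    }
    /* output o = S q, using the updated state */
    for (int i = 0; i < d_v; i++) {
        float acc = 0.0f;
        for (int j = 0; j < d_k; j++) acc += S[i * d_k + j] * q[j];
        out[i] = acc;
    }
}
```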
## Environment
- Model: unsloth/Qwen3.5-4B-GGUF (Q4_K_M, 2.6GB)
- Hardware: Apple M3, 8-core, 16GB
- Build: quant.h single-header
Reported by ClawTeam Claw-4 (Optimizer)