## Description

Qwen3.5-4B loads and generates coherent output, but inference is extremely slow at ~0.7 tok/s on Apple M3. The main bottleneck is the FP32 dequantization of the DeltaNet attention layers at load time.
## Benchmark

| Model | Params | Vocab | tok/s | Notes |
|---|---|---|---|---|
| Phi-3.5-mini (Q8) | 3.8B | 32K | ~8 | Fast |
| SmolLM2-1.7B (Q8) | 1.7B | 49K | ~12.5 | Fastest |
| Qwen3.5-4B (Q4) | 4B | 248K | ~0.7 | ~11x slower than Phi-3.5 |
## Root Cause

The server log shows all 24 DeltaNet layers being dequantized to FP32:

```
tq_load_gguf: layer 0 attn_qkv dequant to FP32 (was type 13)
tq_load_gguf: layer 1 attn_qkv dequant to FP32 (was type 13)
...
tq_load_gguf: layer 30 attn_qkv dequant to FP32 (was type 13)
```
Two bottlenecks:

- DeltaNet FP32 dequant — 24 layers × full QKV tensors converted to FP32 at load time, consuming far more memory and negating the speed benefit of quantization
- 248K vocab output projection — every token requires a 2560 × 248K matmul to compute logits; the vocab is 7.75x larger than Phi-3.5's 32K, and the matmul roughly 6.5x larger than Phi-3.5's (3072 × 32K). See the cost sketch after this list.
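For scale, a back-of-the-envelope cost model (a standalone sketch; the dimensions come from this report, while ~4.5 bits/weight for Q4_K_M is an assumption, not a measured figure):

```c
#include <stdio.h>

/* Rough numbers for the two bottlenecks above. Dimensions come from
 * this report; ~4.5 bits/weight for Q4_K_M is an assumption. */
int main(void) {
    /* Bottleneck 1: dequantizing Q4 weights to FP32 inflates memory */
    printf("dequant memory inflation: %.1fx\n", 32.0 / 4.5);     /* ~7.1x */

    /* Bottleneck 2: per-token logit matmul is hidden_dim x vocab */
    double qwen = 2560.0 * 248000.0;  /* ~635M multiply-adds/token */
    double phi  = 3072.0 *  32000.0;  /* ~98M multiply-adds/token  */
    printf("qwen logits: %.0fM MACs/token\n", qwen / 1e6);
    printf("phi  logits: %.0fM MACs/token\n",  phi / 1e6);
    printf("ratio: %.1fx\n", qwen / phi);                        /* ~6.5x */
    return 0;
}
```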
## Impact

At 0.7 tok/s, generating 80 tokens takes ~114 seconds — unusable for interactive chat. Despite Qwen3.5-4B having the best output quality among the tested models, the speed makes it impractical.
## Suggested Optimizations

- Keep DeltaNet layers in quantized format — run Q4/Q8 matmuls directly instead of dequantizing to FP32 (see the dot-product sketch after this list)
- Optimize the vocab projection — for large-vocab models, consider top-k logit computation or speculative sampling
- DeltaNet-specific kernel — linear attention keeps a constant-size recurrent state instead of a growing KV cache; exploit this for speed (see the delta-rule sketch below)
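A minimal sketch of the first suggestion, assuming a GGML-style Q8_0 block layout (32 int8 weights sharing one FP32 scale). The type and function names are hypothetical, not quant.h's actual API:

```c
#include <stdint.h>

/* Hypothetical Q8_0-style block: 32 int8 weights + one fp32 scale.
 * quant.h's real block types may differ; this only illustrates the idea. */
#define QK 32
typedef struct {
    float  scale;
    int8_t q[QK];
} block_q8;

/* Dot product of one quantized weight row with an fp32 activation vector,
 * without ever materializing the row as FP32 in memory: weights are
 * expanded on the fly, one 32-wide block at a time, in registers. */
static float dot_q8_f32(const block_q8 *row, const float *x, int n_blocks) {
    float acc = 0.0f;
    for (int b = 0; b < n_blocks; b++) {
        float sum = 0.0f;
        for (int i = 0; i < QK; i++)
            sum += (float)row[b].q[i] * x[b * QK + i];
        acc += row[b].scale * sum;  /* per-block scale applied once */
    }
    return acc;
}
```

Memory traffic then stays at the quantized size rather than a ~7x larger FP32 copy, and the inner loop vectorizes cleanly with NEON on M-series chips.

For the third suggestion, the delta-rule recurrence replaces the KV cache with a constant-size d_v × d_k state matrix, so per-token cost is independent of context length: S_t = S_{t-1} + β(v_t − S_{t-1}k_t)k_tᵀ, with output o_t = S_t q_t. A plain-C sketch of one decode step (illustrative only, not existing quant.h code):

```c
/* One DeltaNet (delta-rule) decode step. State S is d_v x d_k, updated
 * in place; nothing grows with sequence length. Illustrative only. */
static void deltanet_step(float *S, const float *q, const float *k,
                          const float *v, float beta,
                          float *out, int d_k, int d_v) {
    for (int i = 0; i < d_v; i++) {
        /* prediction error for this value dim: v_i - (S k)_i */
        float sk = 0.0f;
        for (int j = 0; j < d_k; j++) sk += S[i * d_k + j] * k[j];
        float err = beta * (v[i] - sk);
        /* rank-1 state update: row i of S += err * k^T */
        for (int j = 0; j < d_k; j++) S[i * d_k + j] += err * k[j];
    }
    /* output o = S q, using the updated state */
    for (int i = 0; i < d_v; i++) {
        float acc = 0.0f;
        for (int j = 0; j < d_k; j++) acc += S[i * d_k + j] * q[j];
        out[i] = acc;
    }
}
```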
## Environment
- Model: unsloth/Qwen3.5-4B-GGUF (Q4_K_M, 2.6GB)
- Hardware: Apple M3, 8-core, 16GB
- Build: quant.h single-header
Reported by ClawTeam Claw-4 (Optimizer)