Highlights
Phi-3 / Phi-3.5 fully supported — the model with the best speed/quality combo is now the default.
pip install quantcpp
quantcpp # downloads Phi-3.5-mini Q8_0, starts chatWhat's new
- Phi-3 architecture support — fused QKV, fused gate+up FFN, LongRoPE with NeoX rotation
- Phi-3.5-mini Q8_0 as default — 3.0 tok/s on Apple M3 (2x faster than Q4_K_M)
- quant-server-unified — server built directly on
quant.h, no libturboquant sync issues - ChatML marker filter — 32-byte lookahead catches BPE-split stop tokens across boundaries
- 16 chat-cache bugs eliminated — two audit passes found hidden bugs in KV prefix matching, text accumulation, server sessions, WASM state
- ChatContextOverflow — Python
Model.chat()raises a typed exception on context overflow - Architecture compatibility — llama, phi3, gemma, qwen fully or partially supported. Unsupported architectures now fail fast with a clear error.
Performance
| Model | Quant | tok/s (M3) |
|---|---|---|
| Phi-3.5-mini | Q8_0 | 3.0 |
| SmolLM2-1.7B | Q8_0 | 12.5 |
| Llama-3.2-1B | Q4_K_M | 2.3 |
Fixed
- Chat KV cache: sliding-window truncation, realloc failure handling, server session kv_type mismatch, WASM state leaks, rep_penalty in fast path
- Server: HTTP 500 on generation errors (was silent 200), streaming
finish_reason: "error" - BOS token for Phi-3/Llama (
<s>added to lookup chain) - CLI overflow handling with automatic turn trimming
Install
pip install quantcpp==0.13.0Or build from source:
cmake -B build -DTQ_BUILD_SERVER=ON
cmake --build build -j$(nproc)See docs/RELEASE_NOTES.md for the full changelog.
What's Changed
- feat: chat-mode KV cache reuse — O(N²) → O(new tokens) per turn by @unamedkr in #48
- feat: chat KV cache hardening — multi-session + overflow safety + metrics by @unamedkr in #49
- feat: text-prefix chat cache + json_find_key bugfix by @unamedkr in #50
- feat(wasm): chat KV cache reuse — instant turn N+1 in browser by @unamedkr in #51
- fix(chat-cache): comprehensive audit — 7 hidden bugs eliminated by @unamedkr in #52
- fix(chat-cache): second audit pass — 9 more hidden bugs eliminated by @unamedkr in #53
- feat(feedback): Quick Wins from 2026-04-12 external user report by @unamedkr in #59
- feat(phi3): end-to-end Phi-3 / Phi-3.5 architecture support by @unamedkr in #65
- feat(default): promote Phi-3.5-mini to recommended default model by @unamedkr in #66
- Port Phi-3 architecture support to libturboquant + Qwen3.5 issues by @unamedkr in #71
- fix(python): model aliases in API + graceful port conflict handling by @unamedkr in #76
- feat: quant-server-unified — server built directly on quant.h by @unamedkr in #79
- fix: Phi-3 Q8_0 default + unified server in CLI/CMake by @unamedkr in #80
- chore: release v0.13.0 — Phi-3 support + unified server by @unamedkr in #81
Full Changelog: v0.12.1...v0.13.0