Release v0.13.0 — Phi-3 Support + Unified Server · quantumaikr/quant.cpp

Highlights

Phi-3 / Phi-3.5 fully supported — the model with the best speed/quality combo is now the default.

pip install quantcpp
quantcpp                  # downloads Phi-3.5-mini Q8_0, starts chat

Phi-3 architecture support — fused QKV, fused gate+up FFN, LongRoPE with NeoX rotation
Phi-3.5-mini Q8_0 as default — 3.0 tok/s on Apple M3 (2x faster than Q4_K_M)
quant-server-unified — server built directly on quant.h, no libturboquant sync issues
ChatML marker filter — 32-byte lookahead catches BPE-split stop tokens across boundaries
16 chat-cache bugs eliminated — two audit passes found hidden bugs in KV prefix matching, text accumulation, server sessions, WASM state
ChatContextOverflow — Python Model.chat() raises a typed exception on context overflow
Architecture compatibility — llama, phi3, gemma, qwen fully or partially supported. Unsupported architectures now fail fast with a clear error.

Chat KV cache: sliding-window truncation, realloc failure handling, server session kv_type mismatch, WASM state leaks, rep_penalty in fast path
Server: HTTP 500 on generation errors (was silent 200), streaming finish_reason: "error"
BOS token for Phi-3/Llama (<s> added to lookup chain)
CLI overflow handling with automatic turn trimming

pip install quantcpp==0.13.0

Or build from source:

cmake -B build -DTQ_BUILD_SERVER=ON
cmake --build build -j$(nproc)

See docs/RELEASE_NOTES.md for the full changelog.

feat: chat-mode KV cache reuse — O(N²) → O(new tokens) per turn by @unamedkr in #48
feat: chat KV cache hardening — multi-session + overflow safety + metrics by @unamedkr in #49
feat: text-prefix chat cache + json_find_key bugfix by @unamedkr in #50
feat(wasm): chat KV cache reuse — instant turn N+1 in browser by @unamedkr in #51
fix(chat-cache): comprehensive audit — 7 hidden bugs eliminated by @unamedkr in #52
fix(chat-cache): second audit pass — 9 more hidden bugs eliminated by @unamedkr in #53
feat(feedback): Quick Wins from 2026-04-12 external user report by @unamedkr in #59
feat(phi3): end-to-end Phi-3 / Phi-3.5 architecture support by @unamedkr in #65
feat(default): promote Phi-3.5-mini to recommended default model by @unamedkr in #66
Port Phi-3 architecture support to libturboquant + Qwen3.5 issues by @unamedkr in #71
fix(python): model aliases in API + graceful port conflict handling by @unamedkr in #76
feat: quant-server-unified — server built directly on quant.h by @unamedkr in #79
fix: Phi-3 Q8_0 default + unified server in CLI/CMake by @unamedkr in #80
chore: release v0.13.0 — Phi-3 support + unified server by @unamedkr in #81

Full Changelog: v0.12.1...v0.13.0