Skip to content

v0.13.0 — Phi-3 Support + Unified Server

Latest

Choose a tag to compare

@unamedkr unamedkr released this 12 Apr 10:54
· 97 commits to main since this release
b60ce4e

Highlights

Phi-3 / Phi-3.5 fully supported — the model with the best speed/quality combo is now the default.

pip install quantcpp
quantcpp                  # downloads Phi-3.5-mini Q8_0, starts chat

What's new

  • Phi-3 architecture support — fused QKV, fused gate+up FFN, LongRoPE with NeoX rotation
  • Phi-3.5-mini Q8_0 as default — 3.0 tok/s on Apple M3 (2x faster than Q4_K_M)
  • quant-server-unified — server built directly on quant.h, no libturboquant sync issues
  • ChatML marker filter — 32-byte lookahead catches BPE-split stop tokens across boundaries
  • 16 chat-cache bugs eliminated — two audit passes found hidden bugs in KV prefix matching, text accumulation, server sessions, WASM state
  • ChatContextOverflow — Python Model.chat() raises a typed exception on context overflow
  • Architecture compatibility — llama, phi3, gemma, qwen fully or partially supported. Unsupported architectures now fail fast with a clear error.

Performance

Model Quant tok/s (M3)
Phi-3.5-mini Q8_0 3.0
SmolLM2-1.7B Q8_0 12.5
Llama-3.2-1B Q4_K_M 2.3

Fixed

  • Chat KV cache: sliding-window truncation, realloc failure handling, server session kv_type mismatch, WASM state leaks, rep_penalty in fast path
  • Server: HTTP 500 on generation errors (was silent 200), streaming finish_reason: "error"
  • BOS token for Phi-3/Llama (<s> added to lookup chain)
  • CLI overflow handling with automatic turn trimming

Install

pip install quantcpp==0.13.0

Or build from source:

cmake -B build -DTQ_BUILD_SERVER=ON
cmake --build build -j$(nproc)

See docs/RELEASE_NOTES.md for the full changelog.

What's Changed

  • feat: chat-mode KV cache reuse — O(N²) → O(new tokens) per turn by @unamedkr in #48
  • feat: chat KV cache hardening — multi-session + overflow safety + metrics by @unamedkr in #49
  • feat: text-prefix chat cache + json_find_key bugfix by @unamedkr in #50
  • feat(wasm): chat KV cache reuse — instant turn N+1 in browser by @unamedkr in #51
  • fix(chat-cache): comprehensive audit — 7 hidden bugs eliminated by @unamedkr in #52
  • fix(chat-cache): second audit pass — 9 more hidden bugs eliminated by @unamedkr in #53
  • feat(feedback): Quick Wins from 2026-04-12 external user report by @unamedkr in #59
  • feat(phi3): end-to-end Phi-3 / Phi-3.5 architecture support by @unamedkr in #65
  • feat(default): promote Phi-3.5-mini to recommended default model by @unamedkr in #66
  • Port Phi-3 architecture support to libturboquant + Qwen3.5 issues by @unamedkr in #71
  • fix(python): model aliases in API + graceful port conflict handling by @unamedkr in #76
  • feat: quant-server-unified — server built directly on quant.h by @unamedkr in #79
  • fix: Phi-3 Q8_0 default + unified server in CLI/CMake by @unamedkr in #80
  • chore: release v0.13.0 — Phi-3 support + unified server by @unamedkr in #81

Full Changelog: v0.12.1...v0.13.0