Commit ffe668f
fix: load_context resets chat state + KV cache saves 57 tokens (#83)
Two fixes:
1. quant_load_context now resets cached_text/cached_tokens to prevent
stale text-prefix matching in quant_chat after context restore.
2. KV cache pre-build now uses quant_chat() for prefill, so the saved cache
   captures all 57 prompt tokens (vs only 1 token with quant_generate).
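Fix 1 can be sketched as follows. This is a minimal Python sketch, not the project's actual implementation: only quant_load_context, cached_text, and cached_tokens come from the commit notes; the QuantContext class and kv_state field are illustrative assumptions.

```python
# Sketch of fix 1: restoring a saved KV cache must also reset the
# chat-prefix tracking, or quant_chat will wrongly match the stale prefix.
# QuantContext and kv_state are hypothetical names for illustration.

class QuantContext:
    def __init__(self):
        self.cached_text = ""     # text prefix already prefilled into the KV cache
        self.cached_tokens = []   # tokens backing that prefix
        self.kv_state = None      # opaque saved KV-cache blob

def quant_load_context(ctx: QuantContext, saved_kv_state) -> None:
    """Restore a saved KV cache and reset chat state.

    Without the reset, quant_chat would see cached_text from the previous
    conversation and take its prefix-match fast path against stale text.
    """
    ctx.kv_state = saved_kv_state
    ctx.cached_text = ""    # the fix: forget the old chat prefix
    ctx.cached_tokens = []

# After restoring, prefix matching starts from a clean slate:
ctx = QuantContext()
ctx.cached_text = "stale previous conversation"
ctx.cached_tokens = [1, 2, 3]
quant_load_context(ctx, saved_kv_state=b"...blob...")
assert ctx.cached_text == "" and ctx.cached_tokens == []
```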
Status: KV save/load works correctly (57 tokens round-trip verified).
Speed: 4.5s cached lookup vs 15s regular (3.3x faster).
Remaining issue: asking a new question against a loaded context produces
inaccurate answers. Root cause: quant_chat's slow path re-prefills the entire
new prompt from position 0, overwriting the loaded KV cache. Fixing this needs
a new API (quant_continue_from_cache) that appends tokens starting at position
n_ctx_tokens instead of 0.
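The proposed API could look like the sketch below. Only quant_continue_from_cache and n_ctx_tokens come from the commit notes; the KVCache class, decode_token stand-in, and quant_chat_prefill are illustrative assumptions showing why appending at n_ctx_tokens preserves the restored cache while prefilling from 0 clobbers it.

```python
# Hypothetical sketch of the proposed quant_continue_from_cache API.
# Assumes the loaded KV cache already covers positions [0, n_ctx_tokens).
# KVCache and decode_token are stand-ins for illustration only.

class KVCache:
    def __init__(self, n_ctx_tokens):
        self.n_ctx_tokens = n_ctx_tokens   # positions already filled by the loaded cache
        self.positions_written = []        # trace of where new tokens are decoded

def decode_token(ctx, token, pos):
    # Stand-in for the real per-token decode; records the target position.
    ctx.positions_written.append(pos)

def quant_chat_prefill(ctx, new_tokens):
    # Current slow path: prefills from position 0, overwriting the
    # KV entries that quant_load_context just restored.
    for pos, tok in enumerate(new_tokens):
        decode_token(ctx, tok, pos)

def quant_continue_from_cache(ctx, new_tokens):
    # Proposed behavior: append after the restored context instead.
    start = ctx.n_ctx_tokens
    for offset, tok in enumerate(new_tokens):
        decode_token(ctx, tok, start + offset)
    ctx.n_ctx_tokens += len(new_tokens)

# With 57 restored tokens, a 3-token question lands at positions 57..59
# instead of 0..2, leaving the loaded cache intact:
ctx = KVCache(n_ctx_tokens=57)
quant_continue_from_cache(ctx, ["What", "is", "X"])
assert ctx.positions_written == [57, 58, 59]
assert ctx.n_ctx_tokens == 60
```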
Refs #83
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 file changed, +11 −0 lines changed