Commit ffe668f
fix: load_context resets chat state + KV cache saves 57 tokens (#83)
Two fixes:
1. quant_load_context now resets cached_text/cached_tokens to prevent
stale text-prefix matching in quant_chat after context restore.
2. KV cache pre-build now uses quant_chat() for prefill, so the saved cache
   captures all 57 prompt tokens (vs only 1 token with quant_generate).
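Fix 1 can be sketched as follows. This is a minimal Python sketch, not the project's actual implementation: only quant_load_context, cached_text, and cached_tokens come from the commit notes; the QuantContext class and kv_state field are illustrative assumptions.

```python
# Sketch of fix 1: restoring a saved KV cache must also reset the
# chat-prefix tracking, or quant_chat will wrongly match the stale prefix.
# QuantContext and kv_state are hypothetical names for illustration.

class QuantContext:
    def __init__(self):
        self.cached_text = ""     # text prefix already prefilled into the KV cache
        self.cached_tokens = []   # tokens backing that prefix
        self.kv_state = None      # opaque saved KV-cache blob

def quant_load_context(ctx: QuantContext, saved_kv_state) -> None:
    """Restore a saved KV cache and reset chat state.

    Without the reset, quant_chat would see cached_text from the previous
    conversation and take its prefix-match fast path against stale text.
    """
    ctx.kv_state = saved_kv_state
    ctx.cached_text = ""    # the fix: forget the old chat prefix
    ctx.cached_tokens = []

# After restoring, prefix matching starts from a clean slate:
ctx = QuantContext()
ctx.cached_text = "stale previous conversation"
ctx.cached_tokens = [1, 2, 3]
quant_load_context(ctx, saved_kv_state=b"...blob...")
assert ctx.cached_text == "" and ctx.cached_tokens == []
```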
Status: KV save/load works correctly (57 tokens round-trip verified).
Speed: 4.5s cached lookup vs 15s regular (3.3x faster).
Remaining issue: asking a new question against a loaded context produces
inaccurate answers. Root cause: quant_chat's slow path re-prefills the entire
new prompt from position 0, overwriting the loaded KV cache. Fixing this needs
a new API (quant_continue_from_cache) that appends tokens starting at position
n_ctx_tokens instead of 0.
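The proposed API could look like the sketch below. Only quant_continue_from_cache and n_ctx_tokens come from the commit notes; the KVCache class, decode_token stand-in, and quant_chat_prefill are illustrative assumptions showing why appending at n_ctx_tokens preserves the restored cache while prefilling from 0 clobbers it.

```python
# Hypothetical sketch of the proposed quant_continue_from_cache API.
# Assumes the loaded KV cache already covers positions [0, n_ctx_tokens).
# KVCache and decode_token are stand-ins for illustration only.

class KVCache:
    def __init__(self, n_ctx_tokens):
        self.n_ctx_tokens = n_ctx_tokens   # positions already filled by the loaded cache
        self.positions_written = []        # trace of where new tokens are decoded

def decode_token(ctx, token, pos):
    # Stand-in for the real per-token decode; records the target position.
    ctx.positions_written.append(pos)

def quant_chat_prefill(ctx, new_tokens):
    # Current slow path: prefills from position 0, overwriting the
    # KV entries that quant_load_context just restored.
    for pos, tok in enumerate(new_tokens):
        decode_token(ctx, tok, pos)

def quant_continue_from_cache(ctx, new_tokens):
    # Proposed behavior: append after the restored context instead.
    start = ctx.n_ctx_tokens
    for offset, tok in enumerate(new_tokens):
        decode_token(ctx, tok, start + offset)
    ctx.n_ctx_tokens += len(new_tokens)

# With 57 restored tokens, a 3-token question lands at positions 57..59
# instead of 0..2, leaving the loaded cache intact:
ctx = KVCache(n_ctx_tokens=57)
quant_continue_from_cache(ctx, ["What", "is", "X"])
assert ctx.positions_written == [57, 58, 59]
assert ctx.n_ctx_tokens == 60
```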
Refs #83
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 file changed, +11 −0 lines changed