## Summary
Pre-compute and persist KV caches for each document chunk during indexing, eliminating prefill overhead at query time. This is the single highest-impact speed optimization for RLV.
## Current State
Each RLV question requires:

- Locator (BM25): 0.01s ← fast
- Lookup (LLM): 15–20s ← prefill ~8s + generate ~7s
- Verifier: 0–8s
- Total: **~35s/question**
The 8s prefill is spent re-reading the same chunk text every time. For 20 questions on the same document, we prefill the same chunks ~20 times.
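A quick back-of-the-envelope sketch of that redundancy, using the figures above (~8s prefill, 20 questions) and simplifying to the case where every question lands on an already-prefilled chunk:

```python
# Redundant prefill estimate from the document's figures (~8s prefill,
# 20 questions). Simplifying assumption: all questions hit chunks that
# were already prefilled, so only the first prefill does useful work.
PREFILL_S = 8
QUESTIONS = 20

total_prefill = PREFILL_S * QUESTIONS  # prefill repeated every question
with_cache = PREFILL_S                 # prefill once, then reuse the saved cache
redundant = total_prefill - with_cache

print(total_prefill, with_cache, redundant)  # 160 8 152
```

Roughly 152 of the 160 prefill seconds in a 20-question run are spent re-reading text the model has already seen.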
## Proposed Solution
```bash
# One-time indexing (slow, ~5 min for a 1.3 MB document)
quantcpp index document.txt --output document.kv/

# Per-question (fast, ~5s)
quantcpp rlv --index document.kv/ "Who directed Mercury Fur?"
```
Implementation:

```python
# During indexing: prefill each chunk once and persist its KV cache
for chunk in gist.chunks:
    ctx = quant_new(model, config)
    quant_generate(ctx, chunk.text, null_callback, None)  # prefill only, no tokens emitted
    quant_save_context(ctx, f"document.kv/chunk_{chunk.id}.kv")

# During query: restore the saved state instead of re-reading the chunk
ctx = quant_new(model, config)
quant_load_context(ctx, f"document.kv/chunk_{best_id}.kv")  # instant, skips prefill
quant_generate(ctx, question, on_token, data)  # generate only (~5s)
```
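Persisting one `.kv` file per chunk implies some on-disk bookkeeping so the locator's best chunk can be mapped to its cache file. A minimal sketch of one possible `document.kv/` layout — the manifest format and helper names here are illustrative assumptions, not part of quant.cpp:

```python
# Hypothetical index layout (names are illustrative, not quant.cpp API):
# document.kv/manifest.json maps each chunk id to its text and its saved
# KV-cache file, so the query path can go chunk id -> .kv path directly.
import json
import os
import tempfile

def write_index(index_dir, chunks):
    """chunks: iterable of (chunk_id, text) pairs."""
    os.makedirs(index_dir, exist_ok=True)
    manifest = {
        str(cid): {"text": text, "kv": f"chunk_{cid}.kv"}
        for cid, text in chunks
    }
    with open(os.path.join(index_dir, "manifest.json"), "w") as f:
        json.dump(manifest, f)

def kv_path_for(index_dir, chunk_id):
    """Resolve the saved KV-cache file for the locator's best chunk."""
    with open(os.path.join(index_dir, "manifest.json")) as f:
        manifest = json.load(f)
    return os.path.join(index_dir, manifest[str(chunk_id)]["kv"])

# Usage sketch
with tempfile.TemporaryDirectory() as d:
    write_index(d, [(0, "Mercury Fur is a play by Philip Ridley."), (1, "...")])
    print(kv_path_for(d, 0).endswith("chunk_0.kv"))  # True
```

Keeping the chunk text in the manifest also lets the verifier re-check the quoted span without touching the original document.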
## Impact
| Metric | Before | After |
|---|---|---|
| Per-question latency | 35s | ~5s |
| 20-question benchmark | 12min | ~2min |
| First-question latency | 35s | 35s (indexing amortized) |
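The break-even point implied by these numbers can be checked with a short calculation, assuming the ~5 min one-time indexing cost and the per-question latencies from the table:

```python
# Amortization check using the table's figures: 35s/question before,
# ~5s/question after, plus a one-time ~300s (5 min) indexing cost.
INDEX_S = 300
BEFORE_S = 35
AFTER_S = 5

def total_before(n):
    return BEFORE_S * n            # pay full prefill every question

def total_after(n):
    return INDEX_S + AFTER_S * n   # one-time indexing, then cheap questions

# Break-even: first question count where indexing pays for itself
n = next(n for n in range(1, 1000) if total_after(n) < total_before(n))
print(n, total_before(20), total_after(20))  # 11 700 400
```

Indexing pays for itself by the 11th question; at 20 questions the end-to-end run (including indexing) is 400s vs 700s. The table's "~2min" benchmark figure counts only query time, treating the one-time indexing as amortized.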
## quant.cpp Advantage
`save_context`/`load_context` is unique to quant.cpp — no other inference engine provides this. Combined with 6.4× KV compression, each chunk's cache is only a few hundred KB on disk.
## Priority: P0
This is the difference between "demo" and "usable product". 35s/question is a demo; 5s/question is a tool people actually use.
*Proposed by ClawTeam based on RLV Day 5 benchmarking*