## Summary
Pre-compute and persist KV caches for each document chunk during indexing, eliminating prefill overhead at query time. This is the single highest-impact speed optimization for RLV.
## Current State
Each RLV question requires:

- Locator (BM25): 0.01s ← fast
- Lookup (LLM): 15–20s ← prefill ~8s + generate ~7s
- Verifier: 0–8s
- Total: **~35s/question**
The 8s prefill is spent re-reading the same chunk text every time. For 20 questions on the same document, we prefill the same chunks ~20 times.
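A quick back-of-the-envelope sketch of that redundancy, using the figures above (~8s prefill, 20 questions) and simplifying to the case where every question lands on an already-prefilled chunk:

```python
# Redundant prefill estimate from the document's figures (~8s prefill,
# 20 questions). Simplifying assumption: all questions hit chunks that
# were already prefilled, so only the first prefill does useful work.
PREFILL_S = 8
QUESTIONS = 20

total_prefill = PREFILL_S * QUESTIONS  # prefill repeated every question
with_cache = PREFILL_S                 # prefill once, then reuse the saved cache
redundant = total_prefill - with_cache

print(total_prefill, with_cache, redundant)  # 160 8 152
```

Roughly 152 of the 160 prefill seconds in a 20-question run are spent re-reading text the model has already seen.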
## Proposed Solution
```bash
# One-time indexing (slow, ~5 min for a 1.3 MB document)
quantcpp index document.txt --output document.kv/

# Per-question (fast, ~5s)
quantcpp rlv --index document.kv/ "Who directed Mercury Fur?"
```
Implementation:

```python
# During indexing: prefill each chunk once and persist its KV cache
for chunk in gist.chunks:
    ctx = quant_new(model, config)
    quant_generate(ctx, chunk.text, null_callback, None)  # prefill only, no tokens emitted
    quant_save_context(ctx, f"document.kv/chunk_{chunk.id}.kv")

# During query: restore the saved state instead of re-reading the chunk
ctx = quant_new(model, config)
quant_load_context(ctx, f"document.kv/chunk_{best_id}.kv")  # instant, skips prefill
quant_generate(ctx, question, on_token, data)  # generate only (~5s)
```
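Persisting one `.kv` file per chunk implies some on-disk bookkeeping so the locator's best chunk can be mapped to its cache file. A minimal sketch of one possible `document.kv/` layout — the manifest format and helper names here are illustrative assumptions, not part of quant.cpp:

```python
# Hypothetical index layout (names are illustrative, not quant.cpp API):
# document.kv/manifest.json maps each chunk id to its text and its saved
# KV-cache file, so the query path can go chunk id -> .kv path directly.
import json
import os
import tempfile

def write_index(index_dir, chunks):
    """chunks: iterable of (chunk_id, text) pairs."""
    os.makedirs(index_dir, exist_ok=True)
    manifest = {
        str(cid): {"text": text, "kv": f"chunk_{cid}.kv"}
        for cid, text in chunks
    }
    with open(os.path.join(index_dir, "manifest.json"), "w") as f:
        json.dump(manifest, f)

def kv_path_for(index_dir, chunk_id):
    """Resolve the saved KV-cache file for the locator's best chunk."""
    with open(os.path.join(index_dir, "manifest.json")) as f:
        manifest = json.load(f)
    return os.path.join(index_dir, manifest[str(chunk_id)]["kv"])

# Usage sketch
with tempfile.TemporaryDirectory() as d:
    write_index(d, [(0, "Mercury Fur is a play by Philip Ridley."), (1, "...")])
    print(kv_path_for(d, 0).endswith("chunk_0.kv"))  # True
```

Keeping the chunk text in the manifest also lets the verifier re-check the quoted span without touching the original document.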
## Impact
| Metric | Before | After |
|---|---|---|
| Per-question latency | 35s | ~5s |
| 20-question benchmark | 12min | ~2min |
| First-question latency | 35s | 35s (indexing amortized) |
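The break-even point implied by these numbers can be checked with a short calculation, assuming the ~5 min one-time indexing cost and the per-question latencies from the table:

```python
# Amortization check using the table's figures: 35s/question before,
# ~5s/question after, plus a one-time ~300s (5 min) indexing cost.
INDEX_S = 300
BEFORE_S = 35
AFTER_S = 5

def total_before(n):
    return BEFORE_S * n            # pay full prefill every question

def total_after(n):
    return INDEX_S + AFTER_S * n   # one-time indexing, then cheap questions

# Break-even: first question count where indexing pays for itself
n = next(n for n in range(1, 1000) if total_after(n) < total_before(n))
print(n, total_before(20), total_after(20))  # 11 700 400
```

Indexing pays for itself by the 11th question; at 20 questions the end-to-end run (including indexing) is 400s vs 700s. The table's "~2min" benchmark figure counts only query time, treating the one-time indexing as amortized.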
## quant.cpp Advantage
`save_context`/`load_context` is unique to quant.cpp — no other inference engine provides this. Combined with 6.4× KV compression, each chunk's cache is only a few hundred KB on disk.
## Priority: P0
This is the difference between "demo" and "usable product". 35s/question is a demo; 5s/question is a tool people actually use.
*Proposed by ClawTeam based on RLV Day 5 benchmarking*