docs: v0.12 — Document-Level RAG verified, 7/7 vs 0/7

unamedkr · claude · unamedkr · commit a40f32df2b6b · 2026-04-11T08:19:39.000+09:00
Update README, Korean README, guide page, CHANGELOG with verified
Document-Level RAG benchmark results:

- Hero stat updated: "7/7 vs 0/7 Doc-QA vs chunk-RAG"
- New section: hallucination examples (John Smith, $1M, 15%)
- Multi-hop reasoning verified (Kyoto, currency risk)
- KV compression preserves QA accuracy (FP32 = 6.4x)

Guide page: new "Verified Result" section with measured chart,
hallucination examples, and 3 cards explaining the implications.

CHANGELOG: v0.12.0 entry with full feature summary including
S2 (K/V asymmetric), S3 (H2O), S1 (PyramidKV), save/load KV,
and Document-Level RAG verification.

New: docs/promotion-strategy-v0.12.md
- Three-tier audience strategy (r/LocalLLaMA, HN, Twitter)
- Anticipated criticism + responses
- Honest self-assessment of strengths/weaknesses
- Success metrics + post-launch action plan
- "What would make this a real paradigm shift" roadmap

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,77 @@
 # Changelog
 
+## [0.12.0] — 2026-04-11 — Document-Level RAG Verified
+
+### Headline: 7/7 vs 0/7
+
+**Direct comparison on Llama 3.2 3B Q8_0:**
+
+| Method | Accuracy | Behavior on failure |
+|---|---:|---|
+| Chunk-RAG (wrong section) | **0/7** | Hallucinates plausible lies |
+| Full Document FP32 KV | **7/7** | Correct |
+| **Full Document 6.4x compressed KV** | **7/7** | **Correct — zero quality loss** |
+
+When chunk-RAG retrieved the wrong section, the model fabricated answers like "John Smith" for CTO (truth: Maria Santos) and "$1,000,000" for revenue (truth: 847M). Loading the full document via 6.4x KV compression produced 100% accuracy including multi-hop reasoning across sections.
+
+**Why this matters:** RAG's fundamental assumption is "retrieval is reliable." When it fails, models silently hallucinate. KV compression eliminates this failure mode by making it practical to load full documents into context on consumer hardware.
+
+Full benchmark: [bench/results/document_level_rag_breakthrough.md](bench/results/document_level_rag_breakthrough.md)
+
+### Major: K/V Asymmetric Compression — 6.4x at +3% PPL
+
+KIVI-style asymmetric quantization: K=4bit + V=Q4 + k128 progressive window.
+- **2.9x → 6.4x compression** (+121%)
+- PPL cost: +1.3% → +3.0% (+1.7pp)
+- Verified at both 2082 and 4095 tokens
+- 128K context Llama 3B fits in 9.5 GB (vs ~30 GB FP32)
+
+### Major: H2O Token Eviction + PyramidKV
+
+- H2O eviction: heavy-hitter detection with sink + recent window preservation
+- PyramidKV: per-layer KV budget allocation based on attention entropy
+- **Attention cost: 4.1ms → 1.7ms/tok at budget=128 (-59%)**
+- Llama 1B layer entropy measured: Layer 1 = 6.29 bits, Layer 11 = 1.84 bits
+- Output quality preserved (identical text vs no eviction)
+
+### Major: --save-kv / --load-kv CLI
+
+"Read Once, Query Forever" pattern:
+```bash
+./quant model.gguf -p "long doc..." --save-kv doc.kv  # process once
+./quant model.gguf -p "question?"  --load-kv doc.kv  # query instantly
+```
+
+Per-layer strided save/load. Verified: 3B model recalls "PHOENIX" from saved context.
+
+### Refactoring (R1, R3)
+
+- DISPATCH_MATMUL macros: 4 dispatch chains consolidated
+- Magic numbers replaced with TQ_MAX_HEAD_DIM / TQ_MAX_KV_DIM constants
+- Zero warnings, 35/35 tests pass
+
+### Bug Fixes
+
+- Qwen3.5 text collapse at ~530 tokens — root cause: T=0 greedy entering repetition loop, KV quant error compounds. Added n-gram loop detection (4-gram × 3 repeats → stop).
+- Qwen3.5 head_dim=256 multi-block dequant for KV cache
+- Gemma 4 false attn_output_gate detection on ISWA hybrid attention
+
+### Documentation
+
+- New gh-pages educational guide site (Korean + English with toggle)
+- "Beyond RAG: Document-Level Context" section in README + guide
+- Document-Level RAG benchmark report
+- Open Graph social preview image
+
+### Honest Limitations
+
+- Q4 weight artifacts in fact extraction: "Santos" → "SanSannt", semantically correct but visually noisy
+- 1B model instruction-following limited; 3B+ recommended for QA
+- 7B+ models constrained by 16GB Mac memory (model + KV pressure)
+- Q4_K_M GGUF on-the-fly dequant has bug with TQ_NO_Q4 (workaround: default auto Q4 path works)
+
+---
+
 ## [0.10.1] — 2026-04-10
 
 ### Progressive KV compression — FP32 quality at 3x compression
diff --git a/README.ko.md b/README.ko.md
@@ -39,6 +39,27 @@ API 키 없음. GPU 없음. 설정 없음. [브라우저에서 바로 체험 →
 
 ---
 
+## 핵심 발견: Document-QA 7/7 vs Chunk-RAG 0/7
+
+Llama 3.2 3B Q8_0로 측정한 직접 비교 결과:
+
+| 방법 | 정확도 | 할루시네이션 |
+|---|---:|---|
+| Chunk-RAG (잘못된 청크 검색) | **0/7** | 7개 모두 |
+| 전체 문서 (FP32 KV) | **7/7** | 없음 |
+| **전체 문서 (6.4x 압축 KV)** | **7/7** | **없음 — 품질 손실 0** |
+
+Chunk-RAG가 잘못된 섹션을 검색하면, 모델은 **"모른다"고 하지 않고 그럴듯한 거짓을 생성**합니다:
+- "CTO는 누구인가?" → **"John Smith"** (실제: Maria Santos)
+- "매출은?" → **"$1,000,000"** (실제: 8억 4700만)
+- "R&D 비율은?" → **"순이익의 15%"** (실제: 매출의 14%)
+
+6.4x KV 압축으로 전체 문서를 한 번에 로드하면, 모델은 **multi-hop 추론**까지 정확히 답합니다 (예: "성장 지역에 영향을 주는 위험은?" → 환율 변동, Section 3 + Section 5 정보 연결 필요).
+
+**핵심**: KV 압축은 단순한 메모리 절감이 아니라 **근본적으로 다른 RAG 접근**을 가능하게 합니다. RAG는 "어떤 문서를 볼지" 결정하고, long-context는 "그 문서를 얼마나 깊이 이해할지" 결정합니다. 전체 결과: [bench/results/document_level_rag_breakthrough.md](bench/results/document_level_rag_breakthrough.md)
+
+---
+
 ## 왜 quant.cpp인가?
 
 AI 모델이 대화를 기억하려면 **KV 캐시**라는 메모리가 필요합니다. 대화가 길어질수록 이 메모리가 빠르게 커져서, 모델 자체보다 더 많은 메모리를 차지합니다.
diff --git a/README.md b/README.md
@@ -12,8 +12,8 @@
 
 <table align="center">
 <tr>
-<td align="center"><b>3x less memory</b><br>same quality</td>
-<td align="center"><b>13% faster</b><br>than uncompressed</td>
+<td align="center"><b>6.4x compression</b><br>+3% PPL</td>
+<td align="center"><b>7/7 vs 0/7</b><br>Doc-QA vs chunk-RAG</td>
 <td align="center"><b>128K context</b><br>on 16GB Mac</td>
 <td align="center"><b>16K LOC</b><br>zero deps</td>
 </tr>
@@ -101,6 +101,27 @@ m = Model("llama-3b.gguf", aggressive=True, context_length=131072)  # 128K in 9.
 
 ---
 
+## Document QA: 7/7 vs Chunk-RAG 0/7 — Measured
+
+A direct comparison of three approaches to document question-answering with **Llama 3.2 3B Q8_0**:
+
+| Method | Accuracy | Hallucinations |
+|---|---:|---|
+| Chunk-RAG (wrong chunk retrieved) | **0/7** | All 7 questions |
+| Full Document (FP32 KV) | **7/7** | None |
+| **Full Document (6.4x compressed KV)** | **7/7** | **None — zero quality loss** |
+
+When chunk-RAG retrieves the wrong section, the model **doesn't say "I don't know"** — it generates plausible-sounding lies:
+- "Who is the CTO?" → **"John Smith"** (truth: Maria Santos)
+- "Revenue?" → **"$1,000,000"** (truth: 847 million)
+- "R&D %?" → **"15% of net income"** (truth: 14% of revenue)
+
+With the full document loaded via 6.4x KV compression, the model correctly answers all 7 questions including **multi-hop reasoning** that requires connecting information across sections (e.g., "What risk affects the growth region?" → currency fluctuations, requiring linking Section 3 + Section 5).
+
+**The takeaway:** KV compression isn't just memory savings — it enables a **fundamentally different RAG approach**. RAG decides *which documents* to look at; long-context decides *how deeply* to understand them. See [bench/results/document_level_rag_breakthrough.md](bench/results/document_level_rag_breakthrough.md) for the full benchmark.
+
+---
+
 ## More Features
 
 **Bring your own model** — any GGUF file works:
diff --git a/docs/promotion-strategy-v0.12.md b/docs/promotion-strategy-v0.12.md
@@ -0,0 +1,193 @@
+# Promotion Strategy: v0.12 Document-Level RAG Breakthrough
+
+## Date: 2026-04-11
+
+## The Story (One Sentence)
+
+> **"Chunk-RAG hallucinated 7/7 questions. Loading the full document with 6.4x KV compression got 7/7 correct — on a 16GB Mac."**
+
+## Why This Story Resonates
+
+1. **Concrete numbers**: 7/7 vs 0/7 is impossible to misread
+2. **Real fear**: Hallucination is the #1 production RAG concern
+3. **Counter-intuitive**: KV compression wasn't expected to enable this
+4. **Reproducible**: Single-file benchmark, anyone can run it
+5. **Actionable**: `pip install quantcpp` works today
+
+## Three-Tier Audience Strategy
+
+### Tier 1: r/LocalLLaMA (highest priority)
+
+**Why first**: Our existing community, tech-savvy, RAG fatigue is high.
+
+**Title options** (A/B test mentally):
+- **A** (concrete): "We measured chunk-RAG vs full-document on a 3B model — 0/7 vs 7/7"
+- **B** (provocative): "Your RAG hallucinates when retrieval fails. Here's the data."
+- **C** (technical): "6.4x KV compression makes 'Document-Level RAG' practical on 16GB Macs"
+
+**Recommend A** — concrete data wins on r/LocalLLaMA.
+
+**Post structure**:
+1. Hook: 7/7 vs 0/7 table
+2. The hallucination examples (John Smith, $1M, 15%)
+3. Methodology (Llama 3.2 3B, 7 questions, 3 methods)
+4. Why it matters: chunking is the bug, not the model
+5. CTA: `pip install quantcpp`, GitHub link, benchmark file
+6. Honest disclaimer: "single synthetic doc, needs scale validation"
+
+**Timing**: Tuesday or Wednesday, 9 AM ET (peak r/LocalLLaMA traffic)
+
+**Avoid**:
+- "Patent pending", "revolutionary", "patent us"
+- Comparing to llama.cpp (we already covered this)
+- Hiding limitations (community will dig them out anyway)
+
+### Tier 2: HackerNews
+
+**Why second**: Broader tech audience, RAG/AI is hot topic.
+
+**Title** (HN style — concrete + intriguing):
+- "Show HN: We compared chunk-RAG vs full-document QA — 0/7 vs 7/7"
+
+**Post structure**:
+1. Lead with the benchmark table
+2. Brief on quant.cpp (16K LOC C, single header, KV compression)
+3. The Document-Level RAG concept
+4. Why this matters for production RAG
+5. Repo link
+
+**Avoid**:
+- Marketing speak
+- Vague claims
+- Anything that sounds like a startup pitch
+
+### Tier 3: Twitter/X
+
+**Why third**: Amplification + screenshots.
+
+**Thread structure** (5-7 tweets):
+
+```
+1/ Chunk-RAG: 0/7 ❌ (all hallucinated)
+   Full Document: 7/7 ✅
+   Same model. Same questions. Just different context approach.
+   
+   We measured this on Llama 3.2 3B Q8_0:
+   [screenshot of benchmark table]
+
+2/ When chunk-RAG retrieves the wrong section, the model doesn't say "I don't know."
+   It generates plausible lies:
+   • "CTO?" → "John Smith" (actually: Maria Santos)  
+   • "Revenue?" → "$1M" (actually: $847M)
+   • "R&D?" → "15% of net income" (actually: 14% of revenue)
+
+3/ The fix isn't a smarter retriever. It's loading the full document.
+   But that needs context windows that don't fit on consumer hardware.
+   Until now.
+
+4/ quant.cpp's 6.4x KV compression makes 128K context fit in 9.5 GB on a 16GB Mac.
+   With Llama 3.2 3B, the full document fits → 7/7 accuracy → zero hallucinations.
+   
+   FP32 KV: 7/7
+   6.4x compressed KV: 7/7 (zero quality loss)
+
+5/ This isn't "RAG is dead." It's "chunking RAG is dangerous."
+   RAG decides which documents to look at.
+   Long-context decides how deeply to understand them.
+   They complement each other.
+
+6/ Open source, MIT-style: pip install quantcpp
+   Single C header, 16K LOC, runs anywhere.
+   
+   Benchmark: github.com/quantumaikr/quant.cpp/blob/main/bench/results/document_level_rag_breakthrough.md
+   
+   /end
+```
+
+### Bonus: LinkedIn (selective)
+
+**Audience**: Enterprise AI leads, ML engineers at companies with internal RAG.
+
+**Tone**: Professional, focus on production risk.
+
+**Key message**: "Your production RAG might be hallucinating without you knowing. Here's a measurable benchmark to find out."
+
+## Defensive Preparation
+
+### Anticipated criticism + responses
+
+**Q: "5 sections, 7 questions, single model — that's not a benchmark."**
+A: "Correct. This is a proof-of-concept, not a paper. Reproduce it in 5 minutes; we'd love to see results on LongBench/NIAH next."
+
+**Q: "Of course full context beats wrong-chunk retrieval. Your retriever sucks."**
+A: "Actually that's the point. Real production RAG fails silently when retrieval misses — and we showed exactly what 'silent failure' looks like (hallucination, not 'I don't know')."
+
+**Q: "Why not just use Gemini 1.5 Pro / Claude 3 with native 1M context?"**
+A: "Those run in cloud at $X/M tokens. quant.cpp runs locally for free on your laptop. Different use case (privacy, offline, cost)."
+
+**Q: "Your model output has 'SanSannt' instead of 'Santos'. That's broken."**
+A: "Q4 weight quantization artifact — semantically correct but visually noisy. For exact-string output use Q8 weights. Documented honestly in the report."
+
+**Q: "Chunking has been a known problem for years. What's new?"**
+A: "Two things: (1) We measured the failure mode quantitatively (silent hallucination). (2) We made the alternative practical on consumer hardware via KV compression."
+
+## Honest Self-Assessment
+
+**Strengths to lean into:**
+- Concrete numbers, not vague claims
+- Open source benchmark, instantly reproducible
+- 11 prior self-corrections (track record of honesty)
+- Real measurement on real hardware
+
+**Weaknesses to acknowledge upfront:**
+- Synthetic document (not real-world corpus)
+- Single model size (3B)
+- Single language (English)
+- Q4 weight artifacts in output
+
+**Don't lean into:**
+- "Paradigm shift" language (premature, see paradigm-shift discussion)
+- "RAG is dead" claims
+- Comparing to specific commercial RAG products
+- Anything that sounds like a startup pitch
+
+## Success Metrics
+
+**Tier 1 (r/LocalLLaMA)**:
+- 200+ upvotes = good
+- 500+ upvotes = great
+- 1000+ upvotes = breakthrough
+- Comments to track: meaningful technical discussion, reproductions, criticism
+
+**Tier 2 (HN)**:
+- Front page = good
+- 100+ comments = great
+- Thread depth (technical replies) > vote count
+
+**Tier 3 (Twitter/X)**:
+- 100+ retweets on lead tweet = good
+- ML researcher engagement (Karpathy, Mikolov, etc.) = great
+
+## Post-Launch Actions
+
+**Day 1-3**: Active comment engagement, answer questions, fix typos found by community.
+
+**Week 1**: Aggregate feedback into a follow-up "what we learned" post. Address top criticism transparently.
+
+**Week 2-4**: Run the benchmark on:
+- LongBench subset (real questions)
+- 7B model (better instruction-following)
+- 2-3 different document types (code, legal, novel)
+
+**Month 2**: Write a more rigorous benchmark report based on what survives scrutiny. This becomes the "v2 evidence" for any future paradigm-shift claims.
+
+## What Would Make This a Real Paradigm Shift (Future Work)
+
+To upgrade from "interesting result" to "paradigm shift":
+1. ✅ 0/7 vs 7/7 on synthetic data — done
+2. ⏳ Same result on LongBench / NIAH (1000+ questions)
+3. ⏳ Reproduced by independent team
+4. ⏳ Featured in HuggingFace blog or paper citation
+5. ⏳ Adopted by 1+ production system
+
+We're at step 1. Steps 2-5 need months. The promotion now should reflect this honestly.
diff --git a/site/index.html b/site/index.html