Skip to content

Commit a40f32d

Browse files
unamedkrclaude
andcommitted
docs: v0.12 — Document-Level RAG verified, 7/7 vs 0/7
Update README, Korean README, guide page, CHANGELOG with verified Document-Level RAG benchmark results: - Hero stat updated: "7/7 vs 0/7 Doc-QA vs chunk-RAG" - New section: hallucination examples (John Smith, $1M, 15%) - Multi-hop reasoning verified (Kyoto, currency risk) - KV compression preserves QA accuracy (FP32 = 6.4x) Guide page: new "Verified Result" section with measured chart, hallucination examples, and 3 cards explaining the implications. CHANGELOG: v0.12.0 entry with full feature summary including S2 (K/V asymmetric), S3 (H2O), S1 (PyramidKV), save/load KV, and Document-Level RAG verification. New: docs/promotion-strategy-v0.12.md - Three-tier audience strategy (r/LocalLLaMA, HN, Twitter) - Anticipated criticism + responses - Honest self-assessment of strengths/weaknesses - Success metrics + post-launch action plan - "What would make this a real paradigm shift" roadmap Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ad4f495 commit a40f32d

File tree

5 files changed

+373
-2
lines changed

5 files changed

+373
-2
lines changed

CHANGELOG.md

Lines changed: 72 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,77 @@
11
# Changelog
22

3+
## [0.12.0] — 2026-04-11 — Document-Level RAG Verified
4+
5+
### Headline: 7/7 vs 0/7
6+
7+
**Direct comparison on Llama 3.2 3B Q8_0:**
8+
9+
| Method | Accuracy | Behavior on failure |
10+
|---|---:|---|
11+
| Chunk-RAG (wrong section) | **0/7** | Hallucinates plausible lies |
12+
| Full Document FP32 KV | **7/7** | Correct |
13+
| **Full Document 6.4x compressed KV** | **7/7** | **Correct — zero quality loss** |
14+
15+
When chunk-RAG retrieved the wrong section, the model fabricated answers like "John Smith" for CTO (truth: Maria Santos) and "$1,000,000" for revenue (truth: 847M). Loading the full document via 6.4x KV compression produced 100% accuracy including multi-hop reasoning across sections.
16+
17+
**Why this matters:** RAG's fundamental assumption is "retrieval is reliable." When it fails, models silently hallucinate. KV compression eliminates this failure mode by making it practical to load full documents into context on consumer hardware.
18+
19+
Full benchmark: [bench/results/document_level_rag_breakthrough.md](bench/results/document_level_rag_breakthrough.md)
20+
21+
### Major: K/V Asymmetric Compression — 6.4x at +3% PPL
22+
23+
KIVI-style asymmetric quantization: K=4bit + V=Q4 + k128 progressive window.
24+
- **2.9x → 6.4x compression** (+121%)
25+
- PPL cost: +1.3% → +3.0% (+1.7pp)
26+
- Verified at both 2082 and 4095 tokens
27+
- 128K context Llama 3B fits in 9.5 GB (vs ~30 GB FP32)
28+
29+
### Major: H2O Token Eviction + PyramidKV
30+
31+
- H2O eviction: heavy-hitter detection with sink + recent window preservation
32+
- PyramidKV: per-layer KV budget allocation based on attention entropy
33+
- **Attention cost: 4.1ms → 1.7ms/tok at budget=128 (-59%)**
34+
- Llama 1B layer entropy measured: Layer 1 = 6.29 bits, Layer 11 = 1.84 bits
35+
- Output quality preserved (identical text vs no eviction)
36+
37+
### Major: --save-kv / --load-kv CLI
38+
39+
"Read Once, Query Forever" pattern:
40+
```bash
41+
./quant model.gguf -p "long doc..." --save-kv doc.kv # process once
42+
./quant model.gguf -p "question?" --load-kv doc.kv # query instantly
43+
```
44+
45+
Per-layer strided save/load. Verified: 3B model recalls "PHOENIX" from saved context.
46+
47+
### Refactoring (R1, R3)
48+
49+
- DISPATCH_MATMUL macros: 4 dispatch chains consolidated
50+
- Magic numbers replaced with TQ_MAX_HEAD_DIM / TQ_MAX_KV_DIM constants
51+
- Zero warnings, 35/35 tests pass
52+
53+
### Bug Fixes
54+
55+
- Qwen3.5 text collapse at ~530 tokens — root cause: T=0 greedy entering repetition loop, KV quant error compounds. Added n-gram loop detection (4-gram × 3 repeats → stop).
56+
- Qwen3.5 head_dim=256 multi-block dequant for KV cache
57+
- Gemma 4 false attn_output_gate detection on ISWA hybrid attention
58+
59+
### Documentation
60+
61+
- New gh-pages educational guide site (Korean + English with toggle)
62+
- "Beyond RAG: Document-Level Context" section in README + guide
63+
- Document-Level RAG benchmark report
64+
- Open Graph social preview image
65+
66+
### Honest Limitations
67+
68+
- Q4 weight artifacts in fact extraction: "Santos" → "SanSannt", semantically correct but visually noisy
69+
- 1B model instruction-following limited; 3B+ recommended for QA
70+
- 7B+ models constrained by 16GB Mac memory (model + KV pressure)
71+
- Q4_K_M GGUF on-the-fly dequant has bug with TQ_NO_Q4 (workaround: default auto Q4 path works)
72+
73+
---
74+
375
## [0.10.1] — 2026-04-10
476

577
### Progressive KV compression — FP32 quality at 3x compression

README.ko.md

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,27 @@ API 키 없음. GPU 없음. 설정 없음. [브라우저에서 바로 체험 →
3939

4040
---
4141

42+
## 핵심 발견: Document-QA 7/7 vs Chunk-RAG 0/7
43+
44+
Llama 3.2 3B Q8_0로 측정한 직접 비교 결과:
45+
46+
| 방법 | 정확도 | 할루시네이션 |
47+
|---|---:|---|
48+
| Chunk-RAG (잘못된 청크 검색) | **0/7** | 7개 모두 |
49+
| 전체 문서 (FP32 KV) | **7/7** | 없음 |
50+
| **전체 문서 (6.4x 압축 KV)** | **7/7** | **없음 — 품질 손실 0** |
51+
52+
Chunk-RAG가 잘못된 섹션을 검색하면, 모델은 **"모른다"고 하지 않고 그럴듯한 거짓을 생성**합니다:
53+
- "CTO는 누구인가?" → **"John Smith"** (실제: Maria Santos)
54+
- "매출은?" → **"$1,000,000"** (실제: 8억 4700만)
55+
- "R&D 비율은?" → **"순이익의 15%"** (실제: 매출의 14%)
56+
57+
6.4x KV 압축으로 전체 문서를 한 번에 로드하면, 모델은 **multi-hop 추론**까지 정확히 답합니다 (예: "성장 지역에 영향을 주는 위험은?" → 환율 변동, Section 3 + Section 5 정보 연결 필요).
58+
59+
**핵심**: KV 압축은 단순한 메모리 절감이 아니라 **근본적으로 다른 RAG 접근**을 가능하게 합니다. RAG는 "어떤 문서를 볼지" 결정하고, long-context는 "그 문서를 얼마나 깊이 이해할지" 결정합니다. 전체 결과: [bench/results/document_level_rag_breakthrough.md](bench/results/document_level_rag_breakthrough.md)
60+
61+
---
62+
4263
## 왜 quant.cpp인가?
4364

4465
AI 모델이 대화를 기억하려면 **KV 캐시**라는 메모리가 필요합니다. 대화가 길어질수록 이 메모리가 빠르게 커져서, 모델 자체보다 더 많은 메모리를 차지합니다.

README.md

Lines changed: 23 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,8 +12,8 @@
1212

1313
<table align="center">
1414
<tr>
15-
<td align="center"><b>3x less memory</b><br>same quality</td>
16-
<td align="center"><b>13% faster</b><br>than uncompressed</td>
15+
<td align="center"><b>6.4x compression</b><br>+3% PPL</td>
16+
<td align="center"><b>7/7 vs 0/7</b><br>Doc-QA vs chunk-RAG</td>
1717
<td align="center"><b>128K context</b><br>on 16GB Mac</td>
1818
<td align="center"><b>16K LOC</b><br>zero deps</td>
1919
</tr>
@@ -101,6 +101,27 @@ m = Model("llama-3b.gguf", aggressive=True, context_length=131072) # 128K in 9.
101101

102102
---
103103

104+
## Document QA: 7/7 vs Chunk-RAG 0/7 — Measured
105+
106+
A direct comparison of three approaches to document question-answering with **Llama 3.2 3B Q8_0**:
107+
108+
| Method | Accuracy | Hallucinations |
109+
|---|---:|---|
110+
| Chunk-RAG (wrong chunk retrieved) | **0/7** | All 7 questions |
111+
| Full Document (FP32 KV) | **7/7** | None |
112+
| **Full Document (6.4x compressed KV)** | **7/7** | **None — zero quality loss** |
113+
114+
When chunk-RAG retrieves the wrong section, the model **doesn't say "I don't know"** — it generates plausible-sounding lies:
115+
- "Who is the CTO?" → **"John Smith"** (truth: Maria Santos)
116+
- "Revenue?" → **"$1,000,000"** (truth: 847 million)
117+
- "R&D %?" → **"15% of net income"** (truth: 14% of revenue)
118+
119+
With the full document loaded via 6.4x KV compression, the model correctly answers all 7 questions including **multi-hop reasoning** that requires connecting information across sections (e.g., "What risk affects the growth region?" → currency fluctuations, requiring linking Section 3 + Section 5).
120+
121+
**The takeaway:** KV compression isn't just memory savings — it enables a **fundamentally different RAG approach**. RAG decides *which documents* to look at; long-context decides *how deeply* to understand them. See [bench/results/document_level_rag_breakthrough.md](bench/results/document_level_rag_breakthrough.md) for the full benchmark.
122+
123+
---
124+
104125
## More Features
105126

106127
**Bring your own model** — any GGUF file works:

docs/promotion-strategy-v0.12.md

Lines changed: 193 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,193 @@
1+
# Promotion Strategy: v0.12 Document-Level RAG Breakthrough
2+
3+
## Date: 2026-04-11
4+
5+
## The Story (One Sentence)
6+
7+
> **"Chunk-RAG hallucinated 7/7 questions. Loading the full document with 6.4x KV compression got 7/7 correct — on a 16GB Mac."**
8+
9+
## Why This Story Resonates
10+
11+
1. **Concrete numbers**: 7/7 vs 0/7 is impossible to misread
12+
2. **Real fear**: Hallucination is the #1 production RAG concern
13+
3. **Counter-intuitive**: KV compression wasn't expected to enable this
14+
4. **Reproducible**: Single-file benchmark, anyone can run it
15+
5. **Actionable**: `pip install quantcpp` works today
16+
17+
## Three-Tier Audience Strategy
18+
19+
### Tier 1: r/LocalLLaMA (highest priority)
20+
21+
**Why first**: Our existing community, tech-savvy, RAG fatigue is high.
22+
23+
**Title options** (A/B test mentally):
24+
- **A** (concrete): "We measured chunk-RAG vs full-document on a 3B model — 0/7 vs 7/7"
25+
- **B** (provocative): "Your RAG hallucinates when retrieval fails. Here's the data."
26+
- **C** (technical): "6.4x KV compression makes 'Document-Level RAG' practical on 16GB Macs"
27+
28+
**Recommend A** — concrete data wins on r/LocalLLaMA.
29+
30+
**Post structure**:
31+
1. Hook: 7/7 vs 0/7 table
32+
2. The hallucination examples (John Smith, $1M, 15%)
33+
3. Methodology (Llama 3.2 3B, 7 questions, 3 methods)
34+
4. Why it matters: chunking is the bug, not the model
35+
5. CTA: `pip install quantcpp`, GitHub link, benchmark file
36+
6. Honest disclaimer: "single synthetic doc, needs scale validation"
37+
38+
**Timing**: Tuesday or Wednesday, 9 AM ET (peak r/LocalLLaMA traffic)
39+
40+
**Avoid**:
41+
- "Patent pending", "revolutionary", "patent us"
42+
- Comparing to llama.cpp (we already covered this)
43+
- Hiding limitations (community will dig them out anyway)
44+
45+
### Tier 2: HackerNews
46+
47+
**Why second**: Broader tech audience, RAG/AI is hot topic.
48+
49+
**Title** (HN style — concrete + intriguing):
50+
- "Show HN: We compared chunk-RAG vs full-document QA — 0/7 vs 7/7"
51+
52+
**Post structure**:
53+
1. Lead with the benchmark table
54+
2. Brief on quant.cpp (16K LOC C, single header, KV compression)
55+
3. The Document-Level RAG concept
56+
4. Why this matters for production RAG
57+
5. Repo link
58+
59+
**Avoid**:
60+
- Marketing speak
61+
- Vague claims
62+
- Anything that sounds like a startup pitch
63+
64+
### Tier 3: Twitter/X
65+
66+
**Why third**: Amplification + screenshots.
67+
68+
**Thread structure** (5-7 tweets):
69+
70+
```
71+
1/ Chunk-RAG: 0/7 ❌ (all hallucinated)
72+
Full Document: 7/7 ✅
73+
Same model. Same questions. Just different context approach.
74+
75+
We measured this on Llama 3.2 3B Q8_0:
76+
[screenshot of benchmark table]
77+
78+
2/ When chunk-RAG retrieves the wrong section, the model doesn't say "I don't know."
79+
It generates plausible lies:
80+
• "CTO?" → "John Smith" (actually: Maria Santos)
81+
• "Revenue?" → "$1M" (actually: $847M)
82+
• "R&D?" → "15% of net income" (actually: 14% of revenue)
83+
84+
3/ The fix isn't a smarter retriever. It's loading the full document.
85+
But that needs context windows that don't fit on consumer hardware.
86+
Until now.
87+
88+
4/ quant.cpp's 6.4x KV compression makes 128K context fit in 9.5 GB on a 16GB Mac.
89+
With Llama 3.2 3B, the full document fits → 7/7 accuracy → zero hallucinations.
90+
91+
FP32 KV: 7/7
92+
6.4x compressed KV: 7/7 (zero quality loss)
93+
94+
5/ This isn't "RAG is dead." It's "chunking RAG is dangerous."
95+
RAG decides which documents to look at.
96+
Long-context decides how deeply to understand them.
97+
They complement each other.
98+
99+
6/ Open source, MIT-style: pip install quantcpp
100+
Single C header, 16K LOC, runs anywhere.
101+
102+
Benchmark: github.com/quantumaikr/quant.cpp/blob/main/bench/results/document_level_rag_breakthrough.md
103+
104+
/end
105+
```
106+
107+
### Bonus: LinkedIn (selective)
108+
109+
**Audience**: Enterprise AI leads, ML engineers at companies with internal RAG.
110+
111+
**Tone**: Professional, focus on production risk.
112+
113+
**Key message**: "Your production RAG might be hallucinating without you knowing. Here's a measurable benchmark to find out."
114+
115+
## Defensive Preparation
116+
117+
### Anticipated criticism + responses
118+
119+
**Q: "5 sections, 7 questions, single model — that's not a benchmark."**
120+
A: "Correct. This is a proof-of-concept, not a paper. Reproduce it in 5 minutes; we'd love to see results on LongBench/NIAH next."
121+
122+
**Q: "Of course full context beats wrong-chunk retrieval. Your retriever sucks."**
123+
A: "Actually that's the point. Real production RAG fails silently when retrieval misses — and we showed exactly what 'silent failure' looks like (hallucination, not 'I don't know')."
124+
125+
**Q: "Why not just use Gemini 1.5 Pro / Claude 3 with native 1M context?"**
126+
A: "Those run in cloud at $X/M tokens. quant.cpp runs locally for free on your laptop. Different use case (privacy, offline, cost)."
127+
128+
**Q: "Your model output has 'SanSannt' instead of 'Santos'. That's broken."**
129+
A: "Q4 weight quantization artifact — semantically correct but visually noisy. For exact-string output use Q8 weights. Documented honestly in the report."
130+
131+
**Q: "Chunking has been a known problem for years. What's new?"**
132+
A: "Two things: (1) We measured the failure mode quantitatively (silent hallucination). (2) We made the alternative practical on consumer hardware via KV compression."
133+
134+
## Honest Self-Assessment
135+
136+
**Strengths to lean into:**
137+
- Concrete numbers, not vague claims
138+
- Open source benchmark, instantly reproducible
139+
- 11 prior self-corrections (track record of honesty)
140+
- Real measurement on real hardware
141+
142+
**Weaknesses to acknowledge upfront:**
143+
- Synthetic document (not real-world corpus)
144+
- Single model size (3B)
145+
- Single language (English)
146+
- Q4 weight artifacts in output
147+
148+
**Don't lean into:**
149+
- "Paradigm shift" language (premature, see paradigm-shift discussion)
150+
- "RAG is dead" claims
151+
- Comparing to specific commercial RAG products
152+
- Anything that sounds like a startup pitch
153+
154+
## Success Metrics
155+
156+
**Tier 1 (r/LocalLLaMA)**:
157+
- 200+ upvotes = good
158+
- 500+ upvotes = great
159+
- 1000+ upvotes = breakthrough
160+
- Comments to track: meaningful technical discussion, reproductions, criticism
161+
162+
**Tier 2 (HN)**:
163+
- Front page = good
164+
- 100+ comments = great
165+
- Thread depth (technical replies) > vote count
166+
167+
**Tier 3 (Twitter/X)**:
168+
- 100+ retweets on lead tweet = good
169+
- ML researcher engagement (Karpathy, Mikolov, etc.) = great
170+
171+
## Post-Launch Actions
172+
173+
**Day 1-3**: Active comment engagement, answer questions, fix typos found by community.
174+
175+
**Week 1**: Aggregate feedback into a follow-up "what we learned" post. Address top criticism transparently.
176+
177+
**Week 2-4**: Run the benchmark on:
178+
- LongBench subset (real questions)
179+
- 7B model (better instruction-following)
180+
- 2-3 different document types (code, legal, novel)
181+
182+
**Month 2**: Write a more rigorous benchmark report based on what survives scrutiny. This becomes the "v2 evidence" for any future paradigm-shift claims.
183+
184+
## What Would Make This a Real Paradigm Shift (Future Work)
185+
186+
To upgrade from "interesting result" to "paradigm shift":
187+
1. ✅ 0/7 vs 7/7 on synthetic data — done
188+
2. ⏳ Same result on LongBench / NIAH (1000+ questions)
189+
3. ⏳ Reproduced by independent team
190+
4. ⏳ Featured in HuggingFace blog or paper citation
191+
5. ⏳ Adopted by 1+ production system
192+
193+
We're at step 1. Steps 2-5 need months. The promotion now should reflect this honestly.

0 commit comments

Comments
 (0)