# Beyond RAG: A Manifesto

## TL;DR

> **Chunking RAG was a workaround for small context windows. The workaround became dogma. Now context windows are big enough that we don't need the workaround. Welcome to Beyond RAG.**

## Where We Are

In 2023, every "production AI" stack looked like this:

```
[document] → [chunker] → [embedder] → [vector DB]
                                           ↓
[user query] → [embedder] → [retriever] → [reranker] → [LLM] → [answer]
```

Six moving parts. Four of them exist solely because the LLM at the end couldn't fit your whole document in its context window.
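To make those parts concrete, here is a minimal, self-contained sketch of the pipeline with toy stand-ins for the embedder and vector store (`manual.txt`, the bag-of-words `embed()`, and the in-memory index are illustrative placeholders, not any library's API). Note what happens when `retrieve()` picks the wrong chunks: the prompt is still well-formed, the LLM still answers, and nothing downstream notices.

```python
import math
from collections import Counter

def chunk(document: str, size: int = 200) -> list[str]:
    # Split the document into fixed-size word chunks.
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use a neural encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, index: list[tuple[Counter, str]], k: int = 3) -> list[str]:
    # Rank stored chunks by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Indexing: document -> chunker -> embedder -> "vector DB" (here, a plain list)
document = open("manual.txt").read()
index = [(embed(c), c) for c in chunk(document)]

# Query time: embed -> retrieve -> prompt the LLM with fragments only
question = "Who is the CTO?"
context = "\n---\n".join(retrieve(question, index))
prompt = f"Answer from the context below.\n\n{context}\n\nQuestion: {question}"
# answer = llm(prompt)  # the model never sees the rest of the document
```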
This was a reasonable engineering compromise. Llama 1 had a 2K context. GPT-3.5 had 4K. You had to chunk.

Then context windows grew. Llama 3.2 has 128K. Claude 3 has 200K. Gemini 1.5 has 2M. The compromise should have started disappearing.

It didn't. The infrastructure became dogma. Vector DB companies reached billion-dollar valuations. The "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.

## What We Measured

We tested chunk-RAG vs full-document context on a 5-section synthetic document with 7 questions, using Llama 3.2 3B Q8_0:

| Method | Accuracy |
|---|---:|
| Chunk-RAG (wrong section retrieved) | **0/7** |
| Full Document (FP32 KV) | **7/7** |
| Full Document (6.4× compressed KV) | **7/7** |

When chunk-RAG retrieved the wrong section, the model didn't say "I don't know." It made up answers:

| Question | Chunk-RAG hallucination | Truth |
|---|---|---|
| Who is the CTO? | "John Smith" | Maria Santos |
| What is the revenue? | "$1,000,000" | 847 million |
| What % is R&D? | "15% of net income" | 14% of revenue |

This is the failure mode no one is monitoring. **Your dashboards show 100% uptime. Your users get plausible-sounding lies.**

## Why This Happens

When you give an LLM a partial context and ask a question whose answer isn't in that context, two things can happen:

1. The model says "I don't know based on the provided context."
2. The model fills in the gap with the most likely-sounding answer.

Modern instruction-tuned models do **#2** by default. Their training rewards "give a confident answer" more than "admit uncertainty." Combined with RAG's silent retrieval failures, this creates a system that confidently lies whenever its retriever misses.

You can mitigate with prompt engineering, confidence thresholds, or fine-tuning. None of them fix the root cause: **the LLM only sees a fragment**.
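For illustration, the prompt-engineering mitigation usually looks something like the sketch below (the wording is a generic example, not a template from any particular vendor); it reduces confident guessing but cannot restore information the retriever never supplied.

```python
# Illustrative grounding prompt: ask the model to refuse when the answer is
# missing from the retrieved fragments. Helps, but the model still only sees
# whatever the retriever happened to return.
GUARDED_PROMPT = """Answer ONLY using the context below.
If the context does not contain the answer, reply exactly: "I don't know."

Context:
{context}

Question: {question}"""

def build_prompt(context: str, question: str) -> str:
    return GUARDED_PROMPT.format(context=context, question=question)
```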
## The Beyond RAG Pattern

When the document fits in the context window, the entire stack collapses:

```
[document] ───────────────────────────────────────→ [LLM] → [answer]
                     (full context)
```

Three steps become one. The hallucination failure mode disappears because **the LLM has all the information**. There's nothing for it to hallucinate.

This isn't theoretical. It's just engineering: you need a context window big enough to hold your document, and you need that context to fit in memory on the hardware you have.

That's where KV cache compression comes in. quant.cpp's 6.4× compression means a 128K-token context for a 3B model fits in **9.5 GB on a 16GB Mac**. Llama 3.2 3B + your full company manual + the user's question, all running locally: no cloud, no vector DB, no retriever to fail silently.

## When Beyond RAG Wins

| Use case | Best approach |
|---|---|
| Chat with one document (manual, paper, novel) | **Beyond RAG** |
| Codebase analysis (single repo) | **Beyond RAG** |
| Customer support over a product manual | **Beyond RAG** |
| Long conversation memory | **Beyond RAG** |
| Search across 100K product reviews | RAG (still) |
| Search across all of Wikipedia | RAG (still) |
| Multi-tenant systems with millions of docs | Hybrid: RAG + Beyond RAG |

The right question isn't "RAG or no RAG." It's **"is my entire context small enough to fit?"** If yes, skip the chunker. If no, use RAG to narrow the candidates, then load the survivors fully.

This is **document-level RAG**: retrieval at the document level, not the chunk level. You still get the recall of search. You still get the precision of full context. You don't get the hallucination from chunking.
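Here is a minimal sketch of that decision, reusing the toy `embed()` and `cosine()` helpers from the pipeline sketch above; `CONTEXT_BUDGET`, the whitespace token count, and the commented-out `llm` call are placeholders for whatever tokenizer and local model you actually run.

```python
CONTEXT_BUDGET = 100_000  # tokens; leave headroom below the model's window

def count_tokens(text: str) -> int:
    return len(text.split())  # crude estimate; use your model's tokenizer in practice

def build_context(question: str, docs: list[str]) -> str:
    everything = "\n\n".join(docs)
    if count_tokens(everything) <= CONTEXT_BUDGET:
        return everything  # Beyond RAG: the whole corpus fits, skip retrieval
    # Document-level RAG: rank whole documents, then load the survivors in full.
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    survivors, used = [], 0
    for doc in ranked:
        cost = count_tokens(doc)
        if used + cost > CONTEXT_BUDGET:
            break
        survivors.append(doc)  # whole documents only, never fragments
        used += cost
    return "\n\n".join(survivors)

# prompt = build_context(question, docs) + "\n\nQuestion: " + question
# answer = llm(prompt)  # placeholder for your local model call
```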
## What Beyond RAG Is Not

- **Not "RAG is dead."** RAG is essential when your corpus exceeds context. We're saying: stop pretending it's the only tool.
- **Not "use Gemini 1.5 Pro for everything."** Cloud LLMs cost money per token, leak data, and require internet. Beyond RAG runs locally.
- **Not "vector DBs are obsolete."** They're great for what they are. They're just often misused as a hammer for non-nail problems.
- **Not a finished idea.** This is v1. We measured 7 questions on 1 model. Real validation needs LongBench, NIAH, multiple models, real corpora. We're going there.

## What We're Asking

If you're building production RAG, run our 5-minute benchmark on your own data:

```bash
git clone https://github.com/quantumaikr/quant.cpp
cd quant.cpp
# Adapt bench/document_level_rag_test.sh to your document + questions
bash bench/document_level_rag_test.sh
```

When chunk-RAG fails on your data, see what your users would have seen.

If the hallucinations bother you — and they should — try the alternative:

```bash
pip install quantcpp
```

```python
from quantcpp import Model
m = Model.from_pretrained("Llama-3.2-3B", aggressive=True)
m.ask(open("your_document.txt").read() + "\n\nQuestion: ...")
```

No vector DB. No chunker. No retriever. No silent failure.

Just the model and the document.

## The Goal

Five years from now, "RAG" should mean "retrieve documents to load into context" — the way we use the word "search" today. It shouldn't mean "chunk-and-embed-and-pray."

We're not the only ones thinking this. Anthropic's contextual retrieval, Gemini's 2M context, the long-context benchmark community — everyone is moving toward the same insight from different directions.

quant.cpp is one tool: the one that makes Beyond RAG practical on consumer hardware. There will be others. Together, we move past the workaround.

> **Welcome to Beyond RAG. Bring your documents.**

---

## Honest Disclaimers

- This is a v1 finding. 5 sections, 7 questions, 1 model. We're not claiming a paper. We're starting a conversation.
- Q4 weight quantization produces visual artifacts ("Santos" → "SanSannt"). Semantically correct, visually noisy. Use Q8 weights for production.
- 1B models lack reliable instruction-following for QA. Use 3B+.
- Beyond RAG only works when the document fits. For larger corpora, a hybrid is needed.
- We'll update this manifesto with v2 evidence (LongBench, real corpora) when it's ready.

## Track Record

quant.cpp has **11 self-found, publicly corrected claims** in its honest-corrections track record. We don't ship vibes; we ship measurements. When this manifesto is wrong, we'll correct it and tell you what we got wrong.

## Sign On

If you've shipped a RAG system that hallucinated in production, we'd love to hear what failure mode it was. Open an issue or DM the maintainers. Real-world data > synthetic benchmarks.

If you want to validate Beyond RAG on a real benchmark, we'd love a PR.

If you think this manifesto is wrong, even better. Tell us why.

> *Written 2026-04-11. v1. We'll be wrong about something. We'll fix it in public.*