# Beyond RAG: A Manifesto

## TL;DR

> **Chunking RAG was a workaround for small context windows. The workaround became dogma. Now context windows are big enough that we don't need the workaround. Welcome to Beyond RAG.**

## Where We Are

In 2023, every "production AI" stack looked like this:

```
[document] → [chunker] → [embedder] → [vector DB]
                                           ↓
[user query] → [embedder] → [retriever] → [reranker] → [LLM] → [answer]
```

Six moving parts. Four of them exist solely because the LLM at the end couldn't fit your whole document in its context window.
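To make those parts concrete, here is a minimal, self-contained sketch of the pipeline with toy stand-ins for the embedder and vector store (`manual.txt`, the bag-of-words `embed()`, and the in-memory index are illustrative placeholders, not any library's API). Note what happens when `retrieve()` picks the wrong chunks: the prompt is still well-formed, the LLM still answers, and nothing downstream notices.

```python
import math
from collections import Counter

def chunk(document: str, size: int = 200) -> list[str]:
    # Split the document into fixed-size word chunks.
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use a neural encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, index: list[tuple[Counter, str]], k: int = 3) -> list[str]:
    # Rank stored chunks by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Indexing: document -> chunker -> embedder -> "vector DB" (here, a plain list)
document = open("manual.txt").read()
index = [(embed(c), c) for c in chunk(document)]

# Query time: embed -> retrieve -> prompt the LLM with fragments only
question = "Who is the CTO?"
context = "\n---\n".join(retrieve(question, index))
prompt = f"Answer from the context below.\n\n{context}\n\nQuestion: {question}"
# answer = llm(prompt)  # the model never sees the rest of the document
```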
This was a reasonable engineering compromise. Llama 1 had a 2K context. GPT-3.5 had 4K. You had to chunk.

Then context windows grew. Llama 3.2 has 128K. Claude 3 has 200K. Gemini 1.5 has 2M. The compromise should have started disappearing.

It didn't. The infrastructure became dogma. Vector DB companies reached billion-dollar valuations. The "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.

## What We Measured

We tested chunk-RAG vs full-document context on a 5-section synthetic document with 7 questions, using Llama 3.2 3B Q8_0:

| Method | Accuracy |
|---|---:|
| Chunk-RAG (wrong section retrieved) | **0/7** |
| Full Document (FP32 KV) | **7/7** |
| Full Document (6.4× compressed KV) | **7/7** |

When chunk-RAG retrieved the wrong section, the model didn't say "I don't know." It made up answers:

| Question | Chunk-RAG hallucination | Truth |
|---|---|---|
| Who is the CTO? | "John Smith" | Maria Santos |
| What is the revenue? | "$1,000,000" | 847 million |
| What % is R&D? | "15% of net income" | 14% of revenue |

This is the failure mode no one is monitoring. **Your dashboards show 100% uptime. Your users get plausible-sounding lies.**

## Why This Happens

When you give an LLM a partial context and ask a question whose answer isn't in that context, two things can happen:

1. The model says "I don't know based on the provided context."
2. The model fills in the gap with the most likely-sounding answer.

Modern instruction-tuned models do **#2** by default. Their training rewards "give a confident answer" more than "admit uncertainty." Combined with RAG's silent retrieval failures, this creates a system that confidently lies whenever its retriever misses.

You can mitigate with prompt engineering, confidence thresholds, or fine-tuning. None of them fix the root cause: **the LLM only sees a fragment**.
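For illustration, the prompt-engineering mitigation usually looks something like the sketch below (the wording is a generic example, not a template from any particular vendor); it reduces confident guessing but cannot restore information the retriever never supplied.

```python
# Illustrative grounding prompt: ask the model to refuse when the answer is
# missing from the retrieved fragments. Helps, but the model still only sees
# whatever the retriever happened to return.
GUARDED_PROMPT = """Answer ONLY using the context below.
If the context does not contain the answer, reply exactly: "I don't know."

Context:
{context}

Question: {question}"""

def build_prompt(context: str, question: str) -> str:
    return GUARDED_PROMPT.format(context=context, question=question)
```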
## The Beyond RAG Pattern

When the document fits in the context window, the entire stack collapses:

```
[document] ───────────────────────────────────────→ [LLM] → [answer]
                     (full context)
```

Three steps become one. The hallucination failure mode disappears because **the LLM has all the information**. There's nothing for it to hallucinate.

This isn't theoretical. It's just engineering: you need a context window big enough to hold your document, and you need that context to fit in memory on the hardware you have.

That's where KV cache compression comes in. quant.cpp's 6.4× compression means a 128K-token context for a 3B model fits in **9.5 GB on a 16GB Mac**. Llama 3.2 3B + your full company manual + the user's question, all running locally: no cloud, no vector DB, no retriever to fail silently.

## When Beyond RAG Wins

| Use case | Best approach |
|---|---|
| Chat with one document (manual, paper, novel) | **Beyond RAG** |
| Codebase analysis (single repo) | **Beyond RAG** |
| Customer support over a product manual | **Beyond RAG** |
| Long conversation memory | **Beyond RAG** |
| Search across 100K product reviews | RAG (still) |
| Search across all of Wikipedia | RAG (still) |
| Multi-tenant systems with millions of docs | Hybrid: RAG + Beyond RAG |

The right question isn't "RAG or no RAG." It's **"is my entire context small enough to fit?"** If yes, skip the chunker. If no, use RAG to narrow the candidates, then load the survivors fully.

This is **document-level RAG**: retrieval at the document level, not the chunk level. You still get the recall of search. You still get the precision of full context. You don't get the hallucination from chunking.
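Here is a minimal sketch of that decision, reusing the toy `embed()` and `cosine()` helpers from the pipeline sketch above; `CONTEXT_BUDGET`, the whitespace token count, and the commented-out `llm` call are placeholders for whatever tokenizer and local model you actually run.

```python
CONTEXT_BUDGET = 100_000  # tokens; leave headroom below the model's window

def count_tokens(text: str) -> int:
    return len(text.split())  # crude estimate; use your model's tokenizer in practice

def build_context(question: str, docs: list[str]) -> str:
    everything = "\n\n".join(docs)
    if count_tokens(everything) <= CONTEXT_BUDGET:
        return everything  # Beyond RAG: the whole corpus fits, skip retrieval
    # Document-level RAG: rank whole documents, then load the survivors in full.
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    survivors, used = [], 0
    for doc in ranked:
        cost = count_tokens(doc)
        if used + cost > CONTEXT_BUDGET:
            break
        survivors.append(doc)  # whole documents only, never fragments
        used += cost
    return "\n\n".join(survivors)

# prompt = build_context(question, docs) + "\n\nQuestion: " + question
# answer = llm(prompt)  # placeholder for your local model call
```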
## What Beyond RAG Is Not

- **Not "RAG is dead."** RAG is essential when your corpus exceeds context. We're saying: stop pretending it's the only tool.
- **Not "use Gemini 1.5 Pro for everything."** Cloud LLMs cost money per token, leak data, and require internet. Beyond RAG runs locally.
- **Not "vector DBs are obsolete."** They're great for what they are. They're just often misused as a hammer for non-nail problems.
- **Not a finished idea.** This is v1. We measured 7 questions on 1 model. Real validation needs LongBench, NIAH, multiple models, real corpora. We're going there.

## What We're Asking

If you're building production RAG, run our 5-minute benchmark on your own data:

```bash
git clone https://github.com/quantumaikr/quant.cpp
cd quant.cpp
# Adapt bench/document_level_rag_test.sh to your document + questions
bash bench/document_level_rag_test.sh
```

When chunk-RAG fails on your data, see what your users would have seen.

If the hallucinations bother you — and they should — try the alternative:

```bash
pip install quantcpp
```

```python
from quantcpp import Model
m = Model.from_pretrained("Llama-3.2-3B", aggressive=True)
m.ask(open("your_document.txt").read() + "\n\nQuestion: ...")
```

No vector DB. No chunker. No retriever. No silent failure.

Just the model and the document.

## The Goal

Five years from now, "RAG" should mean "retrieve documents to load into context" — the way we use the word "search" today. It shouldn't mean "chunk-and-embed-and-pray."

We're not the only ones thinking this. Anthropic's contextual retrieval, Gemini's 2M context, the long-context benchmark community — everyone is moving toward the same insight from different directions.

quant.cpp is one tool: the one that makes Beyond RAG practical on consumer hardware. There will be others. Together, we move past the workaround.

> **Welcome to Beyond RAG. Bring your documents.**

---

## Honest Disclaimers

- This is a v1 finding. 5 sections, 7 questions, 1 model. We're not claiming a paper. We're starting a conversation.
- Q4 weight quantization produces visual artifacts ("Santos" → "SanSannt"). Semantically correct, visually noisy. Use Q8 weights for production.
- 1B models lack reliable instruction-following for QA. Use 3B+.
- Beyond RAG only works when the document fits. For larger corpora, a hybrid is needed.
- We'll update this manifesto with v2 evidence (LongBench, real corpora) when it's ready.

## Track Record

quant.cpp has **11 self-found, publicly corrected claims** in its honest-corrections track record. We don't ship vibes; we ship measurements. When this manifesto is wrong, we'll correct it and tell you what we got wrong.

## Sign On

If you've shipped a RAG system that hallucinated in production, we'd love to hear what failure mode it was. Open an issue or DM the maintainers. Real-world data > synthetic benchmarks.

If you want to validate Beyond RAG on a real benchmark, we'd love a PR.

If you think this manifesto is wrong, even better. Tell us why.

> *Written 2026-04-11. v1. We'll be wrong about something. We'll fix it in public.*