# Promotion Strategy: v0.12 Document-Level RAG Breakthrough

## Date: 2026-04-11

## The Story (One Sentence)

> **"Chunk-RAG hallucinated 7/7 questions. Loading the full document with 6.4x KV compression got 7/7 correct — on a 16GB Mac."**
## Why This Story Resonates

1. **Concrete numbers**: 7/7 vs 0/7 is impossible to misread
2. **Real fear**: Hallucination is the #1 production RAG concern
3. **Counter-intuitive**: KV compression wasn't expected to enable this
4. **Reproducible**: Single-file benchmark, anyone can run it
5. **Actionable**: `pip install quantcpp` works today
## Three-Tier Audience Strategy

### Tier 1: r/LocalLLaMA (highest priority)

**Why first**: Our existing community, tech-savvy, RAG fatigue is high.

**Title options** (A/B test mentally):
- **A** (concrete): "We measured chunk-RAG vs full-document on a 3B model — 0/7 vs 7/7"
- **B** (provocative): "Your RAG hallucinates when retrieval fails. Here's the data."
- **C** (technical): "6.4x KV compression makes 'Document-Level RAG' practical on 16GB Macs"

**Recommend A** — concrete data wins on r/LocalLLaMA.

**Post structure**:
1. Hook: 7/7 vs 0/7 table
2. The hallucination examples (John Smith, $1M, 15%)
3. Methodology (Llama 3.2 3B, 7 questions, 3 methods)
4. Why it matters: chunking is the bug, not the model
5. CTA: `pip install quantcpp`, GitHub link, benchmark file
6. Honest disclaimer: "single synthetic doc, needs scale validation"
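The scoring logic behind the 7/7 vs 0/7 headline can be sketched as a minimal exact-match harness. Everything below is illustrative: the question wording, gold answers, and grading rule are assumptions for this sketch; the actual single-file benchmark lives in the repo.

```python
# Hypothetical mini-version of the benchmark scoring: exact-match grading
# of gold answers against each method's predictions.
GOLD = {
    "Who is the CTO?": "Maria Santos",
    "What was annual revenue?": "$847M",
    "What share goes to R&D?": "14% of revenue",
}

def score(predictions: dict) -> str:
    """Return 'correct/total' under exact-match grading."""
    correct = sum(predictions.get(q, "") == a for q, a in GOLD.items())
    return f"{correct}/{len(GOLD)}"

# Chunk-RAG answers after a retrieval miss: plausible but wrong.
chunk_rag = {
    "Who is the CTO?": "John Smith",
    "What was annual revenue?": "$1M",
    "What share goes to R&D?": "15% of net income",
}
full_doc = dict(GOLD)  # full-document QA answered all three correctly

print(score(chunk_rag))  # 0/3
print(score(full_doc))   # 3/3
```

Exact match is the strictest possible grader; a fuzzier normalized-match rule would only help the full-document side here, since the chunk-RAG answers are substantively wrong, not just formatted differently.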
**Timing**: Tuesday or Wednesday, 9 AM ET (peak r/LocalLLaMA traffic)

**Avoid**:
- "Patent pending", "revolutionary", "patent us"
- Comparing to llama.cpp (we already covered this)
- Hiding limitations (the community will dig them out anyway)
### Tier 2: Hacker News

**Why second**: Broader tech audience, RAG/AI is a hot topic.

**Title** (HN style — concrete + intriguing):
- "Show HN: We compared chunk-RAG vs full-document QA — 0/7 vs 7/7"

**Post structure**:
1. Lead with the benchmark table
2. Brief on quant.cpp (16K LOC C, single header, KV compression)
3. The Document-Level RAG concept
4. Why this matters for production RAG
5. Repo link

**Avoid**:
- Marketing speak
- Vague claims
- Anything that sounds like a startup pitch
### Tier 3: Twitter/X

**Why third**: Amplification + screenshots.

**Thread structure** (5-7 tweets):

```
1/ Chunk-RAG: 0/7 ❌ (all hallucinated)
   Full Document: 7/7 ✅
   Same model. Same questions. Just different context approach.

   We measured this on Llama 3.2 3B Q8_0:
   [screenshot of benchmark table]

2/ When chunk-RAG retrieves the wrong section, the model doesn't say "I don't know."
   It generates plausible lies:
   • "CTO?" → "John Smith" (actually: Maria Santos)
   • "Revenue?" → "$1M" (actually: $847M)
   • "R&D?" → "15% of net income" (actually: 14% of revenue)

3/ The fix isn't a smarter retriever. It's loading the full document.
   But that needs context windows that don't fit on consumer hardware.
   Until now.

4/ quant.cpp's 6.4x KV compression makes 128K context fit in 9.5 GB on a 16GB Mac.
   With Llama 3.2 3B, the full document fits → 7/7 accuracy → zero hallucinations.

   FP32 KV: 7/7
   6.4x compressed KV: 7/7 (zero quality loss)

5/ This isn't "RAG is dead." It's "chunking RAG is dangerous."
   RAG decides which documents to look at.
   Long-context decides how deeply to understand them.
   They complement each other.

6/ Open source, MIT-style: pip install quantcpp
   Single C header, 16K LOC, runs anywhere.

   Benchmark: github.com/quantumaikr/quant.cpp/blob/main/bench/results/document_level_rag_breakthrough.md

   /end
```
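Tweet 4's memory claim reduces to KV-cache arithmetic, which a short calculator makes explicit. Caveats: the model dimensions below are illustrative assumptions, not quant.cpp's confirmed Llama 3.2 3B config, and the thread's 9.5 GB figure presumably includes model weights as well, which this sketch omits.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Size of the KV cache: 2 tensors (K and V) per layer,
    each of shape [n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative dimensions only (grouped-query attention assumed):
# these are NOT verified quant.cpp / Llama 3.2 3B values.
fp32 = kv_cache_bytes(n_layers=28, n_kv_heads=8, head_dim=128,
                      seq_len=128 * 1024, bytes_per_elem=4)
compressed = fp32 / 6.4  # the claimed 6.4x KV compression ratio

print(f"FP32 KV cache:      {fp32 / 2**30:.1f} GiB")       # 28.0 GiB
print(f"6.4x compressed KV: {compressed / 2**30:.1f} GiB")  # 4.4 GiB
```

The useful property of the formula is linearity in `seq_len`: whatever the true per-token cost is, a 6.4x compression ratio buys a 6.4x longer context in the same memory budget.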

### Bonus: LinkedIn (selective)

**Audience**: Enterprise AI leads, ML engineers at companies with internal RAG.

**Tone**: Professional, focus on production risk.

**Key message**: "Your production RAG might be hallucinating without you knowing. Here's a measurable benchmark to find out."
## Defensive Preparation

### Anticipated criticism + responses

**Q: "5 sections, 7 questions, single model — that's not a benchmark."**
A: "Correct. This is a proof of concept, not a paper. Reproduce it in 5 minutes; we'd love to see results on LongBench/NIAH next."

**Q: "Of course full context beats wrong-chunk retrieval. Your retriever sucks."**
A: "Actually, that's the point. Real production RAG fails silently when retrieval misses — and we showed exactly what 'silent failure' looks like (hallucination, not 'I don't know')."

**Q: "Why not just use Gemini 1.5 Pro / Claude 3 with native 1M context?"**
A: "Those run in the cloud at $X/M tokens. quant.cpp runs locally for free on your laptop. Different use case (privacy, offline, cost)."

**Q: "Your model output has 'SanSannt' instead of 'Santos'. That's broken."**
A: "Q4 weight quantization artifact — semantically correct but visually noisy. For exact-string output, use Q8 weights. Documented honestly in the report."
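For commenters who want the Q4-vs-Q8 artifact answer made concrete, a generic absmax block-quantization round-trip shows why fewer bits means noisier output. This is a sketch of the general technique only; it is not quant.cpp's actual Q4_0/Q8_0 kernel or storage format.

```python
import numpy as np

def quantize_dequantize(x, bits, block=32):
    """Round-trip x through absmax block quantization (generic sketch,
    not quant.cpp's real formats): one scale per block of 32 values."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0             # avoid division by zero on all-zero blocks
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return (q * scale).reshape(-1)

rng = rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
for bits in (8, 4):
    err = np.abs(quantize_dequantize(w, bits) - w).mean()
    print(f"Q{bits} mean abs error: {err:.4f}")
```

The Q4 reconstruction error is roughly an order of magnitude larger than Q8's, which is the quantitative version of "semantically correct but visually noisy": small per-weight perturbations occasionally flip individual output tokens without changing the answer's substance.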

**Q: "Chunking has been a known problem for years. What's new?"**
A: "Two things: (1) We measured the failure mode quantitatively (silent hallucination). (2) We made the alternative practical on consumer hardware via KV compression."
## Honest Self-Assessment

**Strengths to lean into:**
- Concrete numbers, not vague claims
- Open-source benchmark, instantly reproducible
- 11 prior self-corrections (a track record of honesty)
- Real measurement on real hardware

**Weaknesses to acknowledge upfront:**
- Synthetic document (not a real-world corpus)
- Single model size (3B)
- Single language (English)
- Q4 weight artifacts in output

**Don't lean into:**
- "Paradigm shift" language (premature; see the paradigm-shift discussion)
- "RAG is dead" claims
- Comparing to specific commercial RAG products
- Anything that sounds like a startup pitch
## Success Metrics

**Tier 1 (r/LocalLLaMA)**:
- 200+ upvotes = good
- 500+ upvotes = great
- 1000+ upvotes = breakthrough
- Comments to track: meaningful technical discussion, reproductions, criticism

**Tier 2 (HN)**:
- Front page = good
- 100+ comments = great
- Thread depth (technical replies) > vote count

**Tier 3 (Twitter/X)**:
- 100+ retweets on the lead tweet = good
- ML researcher engagement (Karpathy, Mikolov, etc.) = great
## Post-Launch Actions

**Day 1-3**: Active comment engagement, answer questions, fix typos found by the community.

**Week 1**: Aggregate feedback into a follow-up "what we learned" post. Address the top criticism transparently.

**Week 2-4**: Run the benchmark on:
- A LongBench subset (real questions)
- A 7B model (better instruction-following)
- 2-3 different document types (code, legal, novel)

**Month 2**: Write a more rigorous benchmark report based on what survives scrutiny. This becomes the "v2 evidence" for any future paradigm-shift claims.
## What Would Make This a Real Paradigm Shift (Future Work)

To upgrade from "interesting result" to "paradigm shift":
1. ✅ 0/7 vs 7/7 on synthetic data — done
2. ⏳ Same result on LongBench / NIAH (1000+ questions)
3. ⏳ Reproduced by an independent team
4. ⏳ Featured in a HuggingFace blog post or cited in a paper
5. ⏳ Adopted by 1+ production system

We're at step 1. Steps 2-5 will take months. The promotion now should reflect this honestly.