Add previously missing data-i18n attributes on:
- Beyond RAG blockquote (rag.quote)
- Beyond RAG "It didn't..." paragraph (rag.para2)
- Verification section (entire): section label, title, intro,
  all 3 bar labels, hallucination problem heading/description/
  examples/summary, 3 info cards, CTA button (14 new keys)
- Footer (footer.text)

All 185 HTML keys now match the 185 EN keys and the 185 KO keys exactly.
The language toggle (EN ↔ KO) now swaps every visible string.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
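The "185 = 185 = 185" invariant above can be checked mechanically. A minimal, hypothetical sketch (the markup snippet and dictionary contents are invented for illustration; the real page has 185 keys): collect every key referenced via `data-i18n` / `data-i18n-html` and report the ones a translation table does not define.

```javascript
// Sample inputs, abbreviated from the page for illustration only.
const sampleHtml = `
  <p data-i18n="rag.para2">It didn't.</p>
  <a data-i18n-html="verify.cta">Read the Beyond RAG Manifesto →</a>
`;

const translations = {
  en: { "rag.para2": "It didn't.", "verify.cta": "Read the Beyond RAG Manifesto →" },
  ko: { "rag.para2": "사라지지 않았습니다.", "verify.cta": "Beyond RAG 선언문 읽기 →" },
};

// Collect every data-i18n / data-i18n-html key used in the markup,
// then return the ones the given dictionary does not define.
function missingKeys(html, dict) {
  const used = [...html.matchAll(/data-i18n(?:-html)?="([^"]+)"/g)].map(m => m[1]);
  return [...new Set(used)].filter(k => !(k in dict));
}

console.log(missingKeys(sampleHtml, translations.en)); // []
console.log(missingKeys(sampleHtml, translations.ko)); // []
```

Running such a check in CI would catch a key drifting out of sync before a half-translated page ships.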
  <p class="reveal" data-i18n-html="rag.intro">Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. <strong>Now they have 128K. The compromise should have started disappearing.</strong></p>
- <p class="reveal">It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.</p>
+ <p class="reveal" data-i18n="rag.para2">It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.</p>
  <div class="viz reveal">
    <div class="viz-title" data-i18n="rag.viz.title">Chunk-Level RAG vs Document-Level RAG</div>
- <p class="reveal">We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with <strong>Llama 3.2 3B Q8_0</strong>:</p>
+ <h2 class="reveal" data-i18n="verify.title">7/7 vs 0/7 — Verified</h2>
+ <p class="reveal" data-i18n-html="verify.intro">We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with <strong>Llama 3.2 3B Q8_0</strong>:</p>
+ <div class="mem-bar"><div class="mem-bar-fill bar-aggr" style="--w:100%" data-i18n="verify.bar3.inner">100% — same as FP32</div></div>
  </div>
  </div>
- <h3 class="reveal">The Hallucination Problem</h3>
+ <h3 class="reveal" data-i18n="verify.halluc.title">The Hallucination Problem</h3>
- <p class="reveal">When chunk-RAG retrieved the wrong section, the model didn't say "I don't know" — it generated <strong>plausible-sounding lies</strong>:</p>
+ <p class="reveal" data-i18n-html="verify.halluc.desc">When chunk-RAG retrieved the wrong section, the model didn't say "I don't know" — it generated <strong>plausible-sounding lies</strong>:</p>
- <p class="reveal" style="color:var(--text);font-weight:500;font-size:1.1rem">This is the fundamental danger of chunk-RAG: <strong>retrieval failure becomes silent hallucination</strong>. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.</p>
+ <p class="reveal" style="color:var(--text);font-weight:500;font-size:1.1rem" data-i18n-html="verify.halluc.summary">This is the fundamental danger of chunk-RAG: <strong>retrieval failure becomes silent hallucination</strong>. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.</p>
- <p>Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.</p>
+ <h4 data-i18n="verify.card3.t">Runs on 16GB Mac</h4>
+ <p data-i18n="verify.card3.d">Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.</p>
  </div>
  </div>
  <div style="text-align:center;margin-top:3rem">
- <a href="https://github.com/quantumaikr/quant.cpp/blob/main/docs/beyond-rag-manifesto.md" class="cta-btn cta-primary" style="font-size:.95rem">Read the Beyond RAG Manifesto →</a>
+ <a href="https://github.com/quantumaikr/quant.cpp/blob/main/docs/beyond-rag-manifesto.md" class="cta-btn cta-primary" style="font-size:.95rem" data-i18n-html="verify.cta">Read the Beyond RAG Manifesto →</a>
  </div>
  </div>
  </section>
@@ -743,7 +743,7 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
  <!-- ===== Footer ===== -->
  <footer>
  <div class="container">
- <p>quant.cpp · Apache 2.0 · <a href="https://github.com/quantumaikr/quant.cpp">GitHub</a> · Made by <a href="https://github.com/quantumaikr">quantumaikr</a></p>
+ <p data-i18n-html="footer.text">quant.cpp · Apache 2.0 · <a href="https://github.com/quantumaikr/quant.cpp">GitHub</a> · Made by <a href="https://github.com/quantumaikr">quantumaikr</a></p>
  </div>
  </footer>
@@ -913,7 +913,30 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
  "rag.card2.d": "Can't fit 100K documents in context. Prefill is slow. RAG narrows the search to 2-3 relevant documents that DO fit.",
  "rag.card3.t": "Read Once, Query Forever",
  "rag.card3.d": "Pre-process documents into .kv files (GPU, once). Load instantly on any laptop (0.5s). Query offline, unlimited, private.",
+ "rag.quote": "<strong>Chunking RAG was a workaround for small context windows.</strong><br>The workaround became dogma.<br>Now context windows are big enough that we don't need the workaround.<br><em style=\"color:var(--accent2)\">— Welcome to Beyond RAG.</em>",
+ "rag.para2": "It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. \"RAG pipeline\" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.",
+ "verify.label": "Measured Result",
+ "verify.title": "7/7 vs 0/7 — Verified",
+ "verify.intro": "We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with <strong>Llama 3.2 3B Q8_0</strong>:",
+ "verify.halluc.title": "The Hallucination Problem",
+ "verify.halluc.desc": "When chunk-RAG retrieved the wrong section, the model didn't say \"I don't know\" — it generated <strong>plausible-sounding lies</strong>:",
+ "verify.halluc.examples": "<div><span style=\"color:var(--accent2)\">Q:</span> Who is the CTO?</div><div><span style=\"color:var(--red)\">Chunk-RAG:</span> \"John Smith\"   <span style=\"color:var(--text3)\">→ truth: Maria Santos</span></div><br><div><span style=\"color:var(--accent2)\">Q:</span> What is the revenue?</div><div><span style=\"color:var(--red)\">Chunk-RAG:</span> \"$1,000,000\"   <span style=\"color:var(--text3)\">→ truth: 847 million</span></div><br><div><span style=\"color:var(--accent2)\">Q:</span> What percent is R&D?</div><div><span style=\"color:var(--red)\">Chunk-RAG:</span> \"15% of net income\"   <span style=\"color:var(--text3)\">→ truth: 14% of revenue</span></div>",
+ "verify.halluc.summary": "This is the fundamental danger of chunk-RAG: <strong>retrieval failure becomes silent hallucination</strong>. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.",
+ "verify.card1.t": "KV Compression = Zero Quality Loss",
+ "verify.card1.d": "FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.",
+ "verify.card3.d": "Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.",
+ "verify.cta": "Read the Beyond RAG Manifesto →",
+ "footer.text": "quant.cpp · Apache 2.0 · <a href=\"https://github.com/quantumaikr/quant.cpp\">GitHub</a> · Made by <a href=\"https://github.com/quantumaikr\">quantumaikr</a>"
  },
  ko: {
  "nav.problem": "\uBB38\uC81C\uC810",
@@ -1077,7 +1100,30 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
  "rag.card2.d": "100K 문서를 한 번에 컨텍스트에 넣을 수 없습니다. Prefill이 느립니다. RAG는 검색을 2-3개 관련 문서로 좁혀줍니다.",
  "rag.card3.t": "한 번 읽고, 영원히 질문",
  "rag.card3.d": "문서를 .kv 파일로 사전 처리 (GPU, 1회). 어떤 노트북에서든 즉시 로드 (0.5초). 오프라인, 무제한, 프라이빗 질문.",
- "rag.pipeline.title": "사전 계산된 KV 라이브러리 패턴"
+ "rag.pipeline.title": "사전 계산된 KV 라이브러리 패턴",
+ "rag.quote": "<strong>청킹 RAG는 작은 컨텍스트 윈도우에 대한 임시방편이었습니다.</strong><br>그 임시방편이 정설이 됐습니다.<br>이제 컨텍스트 윈도우가 충분히 커져서 임시방편이 필요 없습니다.<br><em style=\"color:var(--accent2)\">— Beyond RAG에 오신 것을 환영합니다.</em>",
+ "rag.para2": "사라지지 않았습니다. 인프라가 정설이 됐습니다. 벡터 DB는 수십억 달러 기업이 됐습니다. \"RAG 파이프라인\"은 실제 용도가 필요하든 아니든 모든 AI 엔지니어가 구축해야 할 무언가가 됐습니다.",
+ "verify.label": "측정 결과",
+ "verify.title": "7/7 vs 0/7 — 검증됨",
+ "verify.intro": "5개 섹션의 합성 문서와 7개 질문(4개 단일-hop, 3개 multi-hop)으로 세 가지 접근법을 비교했습니다. <strong>Llama 3.2 3B Q8_0</strong>으로 테스트:",
+ "verify.viz.title": "사실 추출 정확도",
+ "verify.bar1.label": "Chunk-RAG (잘못된 섹션 검색)",
+ "verify.bar1.val": "0/7 — 전부 환각",
+ "verify.bar2.label": "전체 문서 (FP32 KV)",
+ "verify.bar3.label": "<strong>전체 문서 (6.4배 KV 압축)</strong>",
+ "verify.bar3.inner": "100% — FP32와 동일",
+ "verify.halluc.title": "환각 문제",
+ "verify.halluc.desc": "Chunk-RAG가 잘못된 섹션을 검색했을 때, 모델은 \"모르겠습니다\"라고 말하지 않고 <strong>그럴듯한 거짓말</strong>을 생성했습니다:",
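The split between `data-i18n` and `data-i18n-html` in this diff matters because keys like footer.text and rag.quote carry markup: a toggle would typically write the former via `textContent` and the latter via `innerHTML`. A minimal, hypothetical sketch of the lookup side of such a toggle (dictionaries abbreviated and the missing KO entry contrived for illustration; the page's real implementation may differ):

```javascript
// Abbreviated translation tables; "verify.label" is deliberately left out
// of ko here to demonstrate the fallback path.
const i18n = {
  en: { "verify.title": "7/7 vs 0/7 — Verified", "verify.label": "Measured Result" },
  ko: { "verify.title": "7/7 vs 0/7 — 검증됨" },
};

// Resolve a key in the active language, falling back to English,
// then to the raw key, when a translation is missing.
function t(key, lang) {
  return (i18n[lang] && i18n[lang][key]) ?? i18n.en[key] ?? key;
}

console.log(t("verify.title", "ko")); // "7/7 vs 0/7 — 검증됨"
console.log(t("verify.label", "ko")); // falls back to "Measured Result"
console.log(t("unknown.key", "ko")); // falls back to "unknown.key"
```

Applying it would then be a single pass over `document.querySelectorAll("[data-i18n],[data-i18n-html]")`, assigning `t(key, lang)` to `textContent` or `innerHTML` according to which attribute is present.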