
Commit 5c79c8d (parent: b32d597)
Authored by unamedkr and claude

docs(guide): add 'When to use which?' scenario table + C code in CTA (#39)

Addresses Reddit feedback: the guide only showed KV-compression benchmarks against llama.cpp but never explained when to use quant.cpp versus llama.cpp.

Changes:
1. Added a "When to use which?" table after the PPL comparison, with concrete scenarios (192 KB WASM binary, microcontrollers, game engines, teaching) and an explicit acknowledgment of llama.cpp's strengths (GPU throughput, model coverage).
2. The CTA now shows the Python and C single-header snippets side by side, reinforcing the "one file" value proposition.
3. Updated the i18n strings for EN and KO.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 file changed: site/index.html (+38 additions, −5 deletions)
@@ -480,7 +480,7 @@ <h3 class="reveal">Compression vs Quality</h3>
 </table>
 </div>
 
-<h3 class="reveal">vs llama.cpp</h3>
+<h3 class="reveal">vs llama.cpp KV compression</h3>
 <p class="reveal">Same 4-bit budget, 3.5x less quality degradation:</p>
 <div class="viz reveal">
 <div class="viz-title">PPL Degradation at 4-bit (lower is better)</div>
@@ -494,6 +494,23 @@ <h3 class="reveal">vs llama.cpp</h3>
 </div>
 </div>
 
+<h3 class="reveal" style="margin-top:3rem">When to use which?</h3>
+<p class="reveal" style="color:var(--text2);margin-bottom:1rem">llama.cpp is excellent. The difference is integration scope, not capability:</p>
+<div class="reveal" style="overflow-x:auto">
+<table>
+<thead><tr><th>Scenario</th><th>quant.cpp</th><th>llama.cpp</th></tr></thead>
+<tbody>
+<tr><td>WASM browser demo</td><td style="color:var(--green)">192 KB binary</td><td style="color:var(--text2)">Tensor graph too large</td></tr>
+<tr><td>Microcontroller / RTOS</td><td style="color:var(--green)">#include only</td><td style="color:var(--text2)">Needs build system</td></tr>
+<tr><td>Game engine plugin</td><td style="color:var(--green)">Drop one .h file</td><td style="color:var(--text2)">250K LOC build</td></tr>
+<tr><td>Learn in an afternoon</td><td style="color:var(--green)">16K LOC</td><td style="color:var(--text2)">250K+ LOC</td></tr>
+<tr><td>GPU throughput</td><td style="color:var(--text2)">Basic</td><td style="color:var(--green)">Full Metal/CUDA</td></tr>
+<tr><td>Model coverage</td><td style="color:var(--text2)">7 architectures</td><td style="color:var(--green)">100+</td></tr>
+</tbody>
+</table>
+</div>
+<p class="reveal" style="color:var(--text2);font-size:0.85rem;margin-top:0.5rem">Use llama.cpp for speed on a workstation. Use quant.cpp when you need to ship LLM inference <em>inside</em> something.</p>
+
 <h3 class="reveal">Context Length on 8GB Mac</h3>
 <div class="reveal">
 <table>
@@ -572,12 +589,28 @@ <h2 class="reveal" data-i18n="gl.title">Glossary</h2>
 <section class="cta" style="background:var(--bg2)">
 <div class="container reveal">
 <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
-<p style="color:var(--text2);margin-bottom:2rem;max-width:500px;margin-left:auto;margin-right:auto" data-i18n="cta.desc">Three lines of Python. No GPU, no API key, no setup.</p>
-<pre style="text-align:left;display:inline-block;margin-bottom:2rem"><code>pip install quantcpp
+<p style="color:var(--text2);margin-bottom:2rem;max-width:560px;margin-left:auto;margin-right:auto" data-i18n="cta.desc">Python one-liner or C single-header. No GPU, no API key, no setup.</p>
+<div style="display:flex;gap:1.5rem;flex-wrap:wrap;justify-content:center;margin-bottom:2rem;text-align:left">
+<div>
+<div style="font-size:0.75rem;color:var(--text2);margin-bottom:0.3rem;font-weight:600">Python</div>
+<pre style="margin:0"><code>pip install quantcpp
 
 from quantcpp import Model
 m = Model.from_pretrained("Llama-3.2-1B")
 print(m.ask("What is gravity?"))</code></pre>
+</div>
+<div>
+<div style="font-size:0.75rem;color:var(--text2);margin-bottom:0.3rem;font-weight:600">C (single header)</div>
+<pre style="margin:0"><code>#include "quant.h"
+
+int main() {
+  quant_model* m = quant_load("model.gguf");
+  quant_generate(quant_new(m, NULL),
+    "Hello!", print_token, NULL);
+}
+// cc app.c -lm -lpthread</code></pre>
+</div>
+</div>
 <br>
 <a href="https://github.com/quantumaikr/quant.cpp" class="cta-btn cta-primary">GitHub</a>
 <a href="https://pypi.org/project/quantcpp/" class="cta-btn cta-secondary">PyPI</a>
@@ -715,7 +748,7 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
 'ch5.label':'Chapter 5','ch5.title':'Benchmarks','ch5.desc':'All measurements on Llama 3.2 1B Instruct (Q8_0 GGUF), Apple M1 Pro, 8 threads.',
 'ch6.label':'Chapter 6','ch6.title':'Research Foundations','ch6.desc':'Each technique in quant.cpp is grounded in peer-reviewed research:',
 'gl.label':'Reference','gl.title':'Glossary',
-'cta.title':'Try It Yourself','cta.desc':'Three lines of Python. No GPU, no API key, no setup.',
+'cta.title':'Try It Yourself','cta.desc':'Python one-liner or C single-header. No GPU, no API key, no setup.',
 },
 ko: {
 'nav.problem':'문제점','nav.solution':'핵심 발견','nav.techniques':'4가지 기술',
@@ -748,7 +781,7 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
 'ch5.label':'챕터 5','ch5.title':'벤치마크','ch5.desc':'모든 측정: Llama 3.2 1B Instruct (Q8_0 GGUF), Apple M1 Pro, 8 스레드.',
 'ch6.label':'챕터 6','ch6.title':'연구 기반','ch6.desc':'quant.cpp의 각 기술은 동료 심사를 거친 연구에 기반합니다:',
 'gl.label':'참조','gl.title':'용어집',
-'cta.title':'직접 해보기','cta.desc':'Python 3줄. GPU도, API 키도, 설정도 필요 없습니다.',
+'cta.title':'직접 해보기','cta.desc':'Python 한 줄 또는 C 헤더 하나. GPU도, API 키도, 설정도 필요 없습니다.',
 }
 };