Merge pull request #98 from AdaWorldAPI/claude/setup-rust-smart-home-SOPAY

AdaWorldAPI · web-flow · commit d4d69e0266bc · 2026-04-13T13:57:38.000+02:00
Claude/setup rust smart home sopay
diff --git a/README-DE.md b/README-DE.md
@@ -18,9 +18,9 @@ Ein vollstaendiger Hochleistungs-Numerik-Stack auf Basis von [rust-ndarray/ndarr
 | FAISS CPU (Flat) | AVX2 FP32 Dot | ~50M/s | ~20 ns | i7 | 65W |
 | FAISS CPU (IVF-PQ) | AVX2 quantisiert | ~100-200M/s | ~5-10 ns | i7 | 65W |
 
-Ein 35-EUR Raspberry Pi 4 bei 5 Watt erreicht oder schlaegt eine 350-EUR RTX 3060 bei 170 Watt. Ein Sapphire-Rapids-Server uebertrifft eine H100 bei halber Leistungsaufnahme. Ein 15-EUR Pi Zero 2W bei 2 Watt schlaegt FAISS CPU Flat noch um 60%.
+> **Zur Methodik:** Alle Zahlen sind pro *vollstaendiger Query* (ein Vektor rein -> ein Aehnlichkeitswert raus). Unser Palette-System quantisiert Vektoren offline auf 256 Archetypes; FAISS IVF-PQ trainiert offline einen Inverted-File-Index. Beides erfordert einmalige Vorbereitung. Der Kernunterschied: Unser Lookup ist ein einziger u8-Tabellenlesevorgang aus einer 64KB-Tabelle im L1-Cache (0 FLOPs, kein Fliesskomma); FAISS PQ dekodiert 8 Subspaces pro Query (~16 Ops + Addition). FAISS Flat berechnet ein volles 768-dim FP32-Skalarprodukt (~1.536 FLOPs). Unser Fehler beim Foveal-Tier (1/40 sigma) betraegt 0,4% — vergleichbar mit PQs 5-10% bei hoeherem Durchsatz und null Hardwarekosten.
 
-Der Trick: GPU muss FP32-multiplizieren, FP32-dividieren und ueber PCIe transferieren. Wir lesen einen u8 aus einer 64KB Tabelle die im L1-Cache liegt. Kein Transfer, kein Kernel-Launch, kein Fliesskomma.
+Ein 35-EUR Raspberry Pi 4 bei 5 Watt erreicht oder schlaegt eine 350-EUR RTX 3060 bei 170 Watt. Ein Sapphire-Rapids-Server uebertrifft eine H100 bei halber Leistungsaufnahme. Ein 15-EUR Pi Zero 2W bei 2 Watt schlaegt FAISS CPU Flat noch um 60%.
 
 ## Upstream vs. Fork — Feature fuer Feature
 
diff --git a/README.md b/README.md
@@ -18,9 +18,9 @@ A complete high-performance numerical computing stack built on top of [rust-ndar
 | FAISS CPU (Flat) | AVX2 FP32 dot | ~50M/s | ~20 ns | i7 | 65W |
 | FAISS CPU (IVF-PQ) | AVX2 quantized | ~100–200M/s | ~5–10 ns | i7 | 65W |
 
-A $35 Raspberry Pi 4 at 5 watts matches or beats a $350 RTX 3060 at 170 watts. A Sapphire Rapids server outperforms an H100 at half the power. A $15 Pi Zero 2W at 2 watts still beats FAISS CPU Flat by 60%.
+> **Methodology note:** All numbers are per *complete query* (one vector in → one similarity score out). Our palette system pre-quantizes vectors to 256 archetypes offline; FAISS IVF-PQ pre-trains an inverted file index offline. Both require one-time preparation. The key difference: our lookup is a single u8 table read from a 64KB table in L1 cache (0 FLOPs, no floating point); FAISS PQ decodes 8 subspaces per query (~16 ops + addition). FAISS Flat computes a full 768-dim FP32 dot product (~1,536 FLOPs). Our error at the Foveal tier (1/40σ) is 0.4% — comparable to PQ's 5–10% at higher throughput and zero hardware cost.
 
-The trick: GPU must FP32-multiply, FP32-divide, and transfer over PCIe. We read one u8 from a 64KB table that lives in L1 cache. No transfer, no kernel launch, no floating point.
+A $35 Raspberry Pi 4 at 5 watts matches or beats a $350 RTX 3060 at 170 watts. A Sapphire Rapids server outperforms an H100 at half the power. A $15 Pi Zero 2W at 2 watts still beats FAISS CPU Flat by 60%.
 
 ## Core Architecture