A complete high-performance numerical computing stack built on top of the [rust-ndarray/ndarray](https://github.com/rust-ndarray/ndarray) foundation. This fork adds 55 HPC modules with 880 tests, covering BLAS L1-L3, LAPACK, FFT, vector math, quantized inference, and hardware-specific SIMD kernels spanning Intel AMX through Raspberry Pi NEON — all on **stable Rust 1.94**, zero nightly features.
4
-
5
-
The upstream ndarray provides excellent n-dimensional array abstractions. We keep all of that and add what it was never designed to do: compete with NumPy's OpenBLAS on GEMM, run codebook inference on a 5-watt Pi 4, and handle half-precision floats that Rust doesn't even have a stable type for yet.
6
-
7
-
[Deutsche Version / German Version](README-DE.md)
8
-
9
-
## Upstream vs. Fork — Feature by Feature
10
-
11
-
### ISA Coverage (Instruction Set Architecture)
12
-
13
-
| ISA / Feature | Upstream ndarray |**AdaWorldAPI Fork**| Speedup vs. Upstream |
| CPU detection | Per-call runtime | Once via LazyLock, then pointer deref only |
3
+
A complete high-performance numerical computing stack built on top of [rust-ndarray/ndarray](https://github.com/rust-ndarray/ndarray). 55 HPC modules, 880 tests, BLAS L1-L3, LAPACK, FFT, quantized inference, SIMD kernels from Intel AMX to Raspberry Pi NEON — **stable Rust 1.94**, zero nightly.
63
4
64
-
### What Upstream Does on Each Target
65
-
66
-
```
67
-
Upstream on x86_64: → matrixmultiply crate (external, AVX2 if available)
68
-
Upstream on aarch64: → Scalar (no NEON, no intrinsics)
```
|**Rust stable**|**Yes** (1.94) | CUDA toolkit | CUDA toolkit | Python |
87
19
88
-
Upstream hits a cache cliff at 1024×1024: no tiling, no threading, no microkernel. Our Goto implementation eliminates this entirely. At 1024×1024 we deliver **10.5× the throughput of upstream** and match NumPy's decades-old OpenBLAS within measurement noise.
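The "no tiling" point can be illustrated with a minimal sketch of loop tiling on a row-major GEMM. This is not the fork's Goto implementation: it has no packing, no microkernel, and no threading, and the function name `gemm_tiled` and the `tile` parameter are hypothetical. It only shows how blocking the three loops keeps a small working set hot in cache.

```rust
/// Computes C += A * B for row-major matrices using loop tiling.
/// A is m x k, B is k x n, C is m x n.
/// `tile` is a hypothetical block edge length chosen by the caller.
pub fn gemm_tiled(
    m: usize,
    k: usize,
    n: usize,
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
    tile: usize,
) {
    assert!(tile > 0, "tile size must be non-zero");
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);

    // Walk the iteration space block by block so each block of A, B and C
    // is reused while it is still resident in cache.
    for i0 in (0..m).step_by(tile) {
        let i_end = (i0 + tile).min(m);
        for p0 in (0..k).step_by(tile) {
            let p_end = (p0 + tile).min(k);
            for j0 in (0..n).step_by(tile) {
                let j_end = (j0 + tile).min(n);
                for i in i0..i_end {
                    for p in p0..p_end {
                        let a_ip = a[i * k + p];
                        let b_row = &b[p * n + j0..p * n + j_end];
                        let c_row = &mut c[i * n + j0..i * n + j_end];
                        // Contiguous inner loop over one row segment.
                        for (c_ij, b_pj) in c_row.iter_mut().zip(b_row) {
                            *c_ij += a_ip * *b_pj;
                        }
                    }
                }
            }
        }
    }
}
```

A production kernel would additionally pack the blocks into contiguous buffers and hand the innermost update to a register-blocked SIMD microkernel.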
20
+
GPUs win at large dense GEMM. We win **everywhere else**: similarity search, latency-sensitive inference, edge deployment, energy efficiency, and cost. A $35 Raspberry Pi 4 at 5 watts outperforms a $350 GPU at 170 watts for codebook inference, because table lookups don't need floating-point hardware.
89
21
90
-
### Codebook Inference (Token Generation)
22
+
## Core Architecture
91
23
92
-
Not matrix multiplication — O(1) table lookup per token. No GPU required.
24
+
Five layers built on top of upstream ndarray's array primitives:
93
25
94
-
| Hardware | ISA | tok/s | 50-Token Latency | Power |
| Raspberry Pi 5 | NEON + dotprod |**2,000–5,000**| 10–25 ms | 5W |
100
-
| Raspberry Pi 4 | NEON (dual pipeline) |**500–2,000**| 25–100 ms | 5W |
101
-
| Pi Zero 2W | NEON (single pipeline) |**50–500**| 100–1000 ms | 2W |
26
+
**SIMD Polyfill** (`simd.rs`, `simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`) — `std::simd`-compatible types (`F32x16`, `F64x8`, `U8x64`, `I32x16`) on stable Rust via `core::arch`. Detection once via `LazyLock<SimdCaps>`, dispatch via frozen function pointer table (0.3ns per call).
102
27
103
-
At 5 watts, a Pi 4 generates a 50-token voice assistant response in under 100 milliseconds.
**Burn Integration** (`crates/burn/`) — SIMD-augmented burn-ndarray backend wiring `F32x16` into tensor ops and activations.
115
35
116
-
**611 million cosine-equivalent comparisons per second using only integer operations** — 12× faster than SIMD f32 dot product. The 256×256 table (64KB) fits entirely in L1 cache.
36
+
## Upstream vs. Fork
117
37
118
-
### Half-Precision Weight Transcoding
38
+
### ISA Coverage
119
39
120
-
Tested on a 15M-parameter model (Piper TTS scale):
40
+
| ISA | Upstream ndarray |**This Fork**| Speedup |
`std::simd` has been nightly-only for years. We implement the same type surface using stable `core::arch` intrinsics. The dispatch is a `LazyLock<SimdCaps>` singleton: one CPUID call, frozen forever, zero per-call overhead.
65
+
## Performance
134
66
135
-
### 2. Half-Precision Types Without Nightly
67
+
### GEMM
136
68
137
-
Rust's `f16` type is nightly-only. We store half-precision values in a `u16` carrier and convert with hardware instructions behind stable `#[target_feature]` (F16C on x86; `FCVTL`/`FCVTN` via inline `asm!()` on ARM). Conversions are IEEE 754 bit-exact and run at hardware speed.
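To make the `u16` carrier idea concrete, here is a portable scalar sketch of the f16 to f32 decode. It is not the fork's hardware path (no F16C, no `asm!()`); the function name `f16_bits_to_f32` is hypothetical. A reference like this is the kind of thing one could use to check a hardware conversion bit for bit.

```rust
/// Decodes IEEE 754 binary16 bits (stored in a `u16`) into an `f32`.
/// Portable scalar reference; every f16 value is exactly representable in f32.
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32; // 5-bit biased exponent (bias 15)
    let frac = (h & 0x03ff) as u32; // 10-bit fraction

    let bits = match (exp, frac) {
        // Signed zero.
        (0, 0) => sign,
        // Subnormal: value is frac * 2^-24, so renormalize for f32.
        (0, _) => {
            // Shift so the highest set bit lands on the implicit-one position.
            let shift = frac.leading_zeros() - 21;
            let frac_norm = (frac << shift) & 0x03ff;
            let exp32 = 113 - shift; // 127 - 15 + 1 - shift
            sign | (exp32 << 23) | (frac_norm << 13)
        }
        // Infinity.
        (0x1f, 0) => sign | 0x7f80_0000,
        // NaN: keep the payload bits.
        (0x1f, _) => sign | 0x7f80_0000 | (frac << 13),
        // Normal: rebias the exponent from 15 to 127.
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits)
}
```

Because decoding is exact, any mismatch against a hardware conversion would point at a bug rather than a rounding difference.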
**10.5× over upstream** at 1024×1024 — matches NumPy OpenBLAS.
140
76
141
-
Intel AMX intrinsics are nightly-only. We emit the instructions as raw encodings via `asm!(".byte ...")`, giving 256 MACs per instruction, verified on stable Rust 1.94. This cuts the distance table build from 24–48 hours to about 80 minutes.
77
+
### Codebook Inference
142
78
143
-
### 4. Tiered ARM NEON for Single-Board Computers
79
+
| Hardware | ISA | tok/s | 50-tok Latency | Power |
Function pointer table, not per-call branching. `LazyLock<SimdDispatch>` → one indirect call, no atomic, no branch prediction miss.
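A minimal sketch of the detect-once, dispatch-through-a-table pattern follows. The field layout of `SimdCaps` and `SimdDispatch` is an assumption, not the fork's actual definition, and `dot_chunked` is only a stand-in for a real `core::arch` kernel. Only standard library items (`LazyLock`, `is_x86_feature_detected!`, `is_aarch64_feature_detected!`) are used.

```rust
use std::sync::LazyLock;

/// Hypothetical capability flags; the real `SimdCaps` may carry more fields.
#[derive(Debug, Clone, Copy)]
pub struct SimdCaps {
    pub avx2: bool,
    pub avx512f: bool,
    pub neon: bool,
}

#[cfg(target_arch = "x86_64")]
fn detect() -> SimdCaps {
    SimdCaps {
        avx2: std::arch::is_x86_feature_detected!("avx2"),
        avx512f: std::arch::is_x86_feature_detected!("avx512f"),
        neon: false,
    }
}

#[cfg(target_arch = "aarch64")]
fn detect() -> SimdCaps {
    SimdCaps {
        avx2: false,
        avx512f: false,
        neon: std::arch::is_aarch64_feature_detected!("neon"),
    }
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn detect() -> SimdCaps {
    SimdCaps { avx2: false, avx512f: false, neon: false }
}

type DotFn = fn(&[f32], &[f32]) -> f32;

fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Stand-in for a SIMD kernel: eight independent accumulators, plain Rust.
fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    let (ca, cb) = (a.chunks_exact(8), b.chunks_exact(8));
    let tail: f32 = ca.remainder().iter().zip(cb.remainder()).map(|(x, y)| x * y).sum();
    let mut acc = [0.0f32; 8];
    for (xa, xb) in ca.zip(cb) {
        for i in 0..8 {
            acc[i] += xa[i] * xb[i];
        }
    }
    acc.iter().sum::<f32>() + tail
}

/// Hypothetical dispatch table: one function pointer per operation.
pub struct SimdDispatch {
    pub dot: DotFn,
}

/// Feature detection runs once, on first access.
pub static CAPS: LazyLock<SimdCaps> = LazyLock::new(detect);

/// The table is filled once from `CAPS` and never changes afterwards.
pub static DISPATCH: LazyLock<SimdDispatch> = LazyLock::new(|| {
    let caps = *CAPS;
    let dot: DotFn = if caps.avx512f || caps.avx2 || caps.neon {
        dot_chunked
    } else {
        dot_scalar
    };
    SimdDispatch { dot }
});

/// Public entry point: one indirect call through the frozen table.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    (DISPATCH.dot)(a, b)
}
```

Callers only ever see `dot`; which kernel sits behind the pointer is decided the first time `DISPATCH` is touched.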
96
+
611M cosine-equivalent lookups/sec using only integer operations. The 256×256 table (64KB) lives in L1 cache — no FP division, no multiplication, no PCIe transfer.
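The lookup idea can be sketched as a flat 256×256 table of `u8` scores indexed by two code bytes. The type `SimTable`, its methods, and the way the table is filled are hypothetical; the source does not show how the real table is built or what its entries encode. The sketch only demonstrates that scoring two code sequences needs integer loads and adds, with no floating-point work.

```rust
/// Number of codebook entries; 256 * 256 one-byte cells is 64 KB.
pub const K: usize = 256;

/// Hypothetical precomputed similarity table between codebook entries.
pub struct SimTable {
    cells: Vec<u8>,
}

impl SimTable {
    /// Builds the table from any pairwise scoring function (done once, offline).
    pub fn from_fn(f: impl Fn(u8, u8) -> u8) -> Self {
        let mut cells = vec![0u8; K * K];
        for a in 0..K {
            for b in 0..K {
                cells[a * K + b] = f(a as u8, b as u8);
            }
        }
        Self { cells }
    }

    /// Single lookup: one index computation and one byte load.
    #[inline]
    pub fn get(&self, a: u8, b: u8) -> u8 {
        self.cells[(a as usize) * K + b as usize]
    }

    /// Scores two equal-length code sequences by summing table entries.
    pub fn score(&self, xs: &[u8], ys: &[u8]) -> u32 {
        xs.iter().zip(ys).map(|(&a, &b)| self.get(a, b) as u32).sum()
    }
}
```

For example, with a toy table defined as `255 - |a - b|`, identical codes score 255 and the most distant pair scores 0.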
150
97
151
-
### 6. BF16 RNE Bit-Exact with Hardware
98
+
### f16 Weight Transcoding
152
99
153
-
Pure AVX-512-F emulation of `VCVTNEPS2BF16`, verified bit-for-bit on 1M+ inputs including subnormals, Inf, NaN, and halfway ties.
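The rounding step itself can be shown in scalar form. This is a sketch of round-to-nearest-even truncation from f32 to bf16 bits, not the fork's AVX-512-F code, and the name `f32_to_bf16_rne` is hypothetical. It does not model any instruction-specific special cases (for example how the hardware treats subnormals), so it should not be read as bit-exact with `VCVTNEPS2BF16`.

```rust
/// Converts an `f32` to bf16 bits using round-to-nearest, ties-to-even.
/// Scalar sketch of the rounding rule only; hardware special cases
/// (such as subnormal handling) are intentionally not modeled.
pub fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep sign and upper payload, and force the quiet bit so the
        // result cannot collapse into an infinity.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // bf16 keeps the top 16 bits. Adding 0x7fff plus the current lowest
    // kept bit rounds to nearest and breaks exact ties toward even.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7fff + lsb);
    (rounded >> 16) as u16
}
```

The halfway cases are the interesting ones: `0x3F80_8000` rounds down to `0x3F80` (already even), while `0x3F81_8000` rounds up to `0x3F82`.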
100
+
| Format | Size | Max Error | Speed |
101
+
|--------|------|-----------|-------|
102
+
| f32 | 60 MB | — | — |
103
+
|**f16**|**30 MB**| 7.3e-6 | 94M/s |
104
+
|**Scaled-f16**|**30 MB**| 4.9e-6 | 91M/s |
105
+
|**Double-f16**| 60 MB | 5.7e-8 | 42M/s |
154
106
155
-
### 7. Cognitive Codec Stack
107
+
## What We Build That Nobody Else Does
156
108
157
-
Fingerprint<256>, Base17 VSA, CAM-PQ, Palette Semiring, bgz7/bgz17 — compressed model weights (201GB → 685MB) with O(1) inference.
109
+
1. **SIMD Polyfill on Stable**: `F32x16`/`F64x8`/`U8x64` via `core::arch`, not nightly `std::simd`
110
+
2. **f16 Without Nightly**: `u16` carrier + F16C hardware / ARM `FCVTL` via `asm!()`
111
+
3. **AMX on Stable**: `asm!(".byte ...")` encoding, 256 MACs/instruction
112
+
4. **Tiered ARM NEON**: A53/A72/A76 with pipeline + big.LITTLE awareness
113
+
5. **0.3ns Dispatch**: LazyLock frozen fn-pointer table, no per-call branching
114
+
6. **BF16 RNE Bit-Exact**: Pure AVX-512-F emulates `VCVTNEPS2BF16` bit-for-bit