rebase Scout shard runner to BF16-direct + F64x8 SIMD path
run_llama4_shard() now uses stream_index_gguf_bf16() instead of
stream_index_gguf(). Changes:
- BF16-direct: no f32 intermediate allocation (saves 283 MB/tensor)
- F64x8 SIMD: 8 rows projected in parallel per zmm register
- Strided octave (stride=16): 97% fewer BF16→f64 conversions
- Halftone drop: 9 of 17 golden positions, odd bins interpolated
- Exact shard sizes: SCOUT_SHARD_SIZES const replaces 44 GB estimate
- Reusable u16 buffer inside indexer (no per-tensor alloc)
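
As a rough illustration of the BF16-direct and strided ideas above (not the actual indexer code): BF16 is the upper 16 bits of an IEEE-754 f32, so a raw u16 can be widened straight to f64 without materializing an f32 tensor. The helper names `bf16_to_f64` and `sample_row_strided` are hypothetical, and the sketch is scalar; it does not attempt the F64x8 SIMD projection or the halftone drop.

```rust
/// Widen one BF16 value (raw u16 bits) to f64.
/// Shifting the bits into the top half of a u32 reproduces the f32 exactly,
/// and f32 -> f64 widening is lossless.
fn bf16_to_f64(bits: u16) -> f64 {
    f32::from_bits((bits as u32) << 16) as f64
}

/// Hypothetical helper: convert only every `stride`-th element of a BF16 row.
/// `out` is cleared and refilled so the caller can reuse one buffer across rows.
/// `stride` must be non-zero (`step_by` panics on 0).
fn sample_row_strided(row: &[u16], stride: usize, out: &mut Vec<f64>) {
    out.clear();
    out.extend(row.iter().step_by(stride).map(|&b| bf16_to_f64(b)));
}
```

With stride=16, a 64-element row yields 4 converted values, and the caller's `out` buffer keeps its allocation between calls.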
Both Scout shard tests and the Maverick test now use the same
BF16-direct pipeline. The old f32 path remains for non-BF16
formats (IQ1_S, Q8_0, etc.).