feat(simd): re-export f32_to_bf16_batch_rne / f32_to_bf16_scalar_rne

claude · claude · commit 7caefe9fa7fa · 2026-04-11T18:42:59.000Z
Makes the pure AVX-512-F RNE routines from commit c489d31 reachable as `ndarray::simd::f32_to_bf16_batch_rne` and `ndarray::simd::f32_to_bf16_scalar_rne` for consumer code in lance-graph. Without this re-export, callers would have to reach into the private `simd_avx512` module path, which is not `pub mod` in `lib.rs`. Doc comment on the re-export explicitly pins the workspace-wide "never scalar ever" rule for F32→BF16: consumer hot loops use `f32_to_bf16_batch_rne` exclusively (500-20,000× faster than scalar via AMX/AVX-512-BF16 tiles), and `f32_to_bf16_scalar_rne` is exposed only as a unit-test reference implementation. Cross-references the Certification Process section in `lance-graph/CLAUDE.md`. Companion commit in lance-graph updates `seven_lane_encoder.rs` Lane 6 to call the batch primitive instead of its previous element-wise truncation loop. https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
diff --git a/src/simd.rs b/src/simd.rs
@@ -105,6 +105,20 @@ pub use crate::simd_avx512::{
     bf16_to_f32_scalar, f32_to_bf16_scalar,
     bf16_to_f32_batch, f32_to_bf16_batch,
 };
+
+// BF16 RNE (round-to-nearest-even) path — pure AVX-512-F, byte-exact vs
+// hardware `_mm512_cvtneps_pbh` on Sapphire Rapids+ (verified on 1M inputs
+// in ndarray::simd_avx512::tests). Consumer code should call
+// `f32_to_bf16_batch_rne` in hot loops (500-20000× faster than the scalar
+// path via AMX / AVX-512 tiles); `f32_to_bf16_scalar_rne` is exposed only
+// as a unit-test reference implementation and MUST NOT be called in hot
+// loops per the workspace-wide "never scalar ever" rule for F32→BF16.
+// See lance-graph/CLAUDE.md § Certification Process.
+#[cfg(target_arch = "x86_64")]
+pub use crate::simd_avx512::{
+    f32_to_bf16_scalar_rne,
+    f32_to_bf16_batch_rne,
+};
 // BF16 SIMD types only available when avx512bf16 is enabled at compile time
 #[cfg(all(target_arch = "x86_64", target_feature = "avx512bf16"))]
 pub use crate::simd_avx512::{BF16x16, BF16x8};