Parallel in-place deinterleaving from #95 only gets us roughly on par with RustFFT at large sizes.
Here are some ideas on how we could go faster.
Parallel out-of-place interleave/deinterleave
Instead of .zip()/.unzip(), use Rayon's .unzip_into_vecs() or persistent scratch buffers for deinterleaving, and plain old .par_iter() for interleaving. #100
Complexity: Needs more explicit SIMD instead of iterator autovectorization, but quite straightforward. Needs heuristics for switching over to the parallel impl.
Trade-offs: Out-of-place, higher memory use.
Fuse deinterleaving with BRAVO and interleaving with last step of FFT
We're loading/storing data in there anyway, so might as well also do the interleaving transformations while we're there.
Complexity: An additional variant for BRAVO and for one FFT pass. The FFT pass adaptation is kinda awkward for DiT because the final pass is the one that spans both halves. In DiF it would be the butterflies of size 2 instead, which are way easier to parallelize. We can probably make it work with the plan-and-execute trick like in #95/#97 with some modest overhead.
Trade-offs: Out-of-place, higher memory use. No more 2-thread BRAVO, which wouldn't matter if #97 pans out. Might require unsafe non-temporal stores to go fast on a single thread on x86.
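For the DiF direction, the fused final pass could look roughly like this: a scalar, single-threaded sketch with hypothetical names, exploiting the fact that the last radix-2 DiF pass has unity twiddles (the real version would be SIMD, chunked across threads, and would use the non-temporal stores mentioned above):

```rust
/// Final DiF pass (size-2 butterflies, unity twiddles) fused with
/// re-interleaving: reads split planes, writes interleaved complex.
fn final_pass_fused(re: &[f32], im: &[f32], dst: &mut [f32]) {
    for i in (0..re.len()).step_by(2) {
        let (ar, ai) = (re[i], im[i]);
        let (br, bi) = (re[i + 1], im[i + 1]);
        // butterfly outputs a+b and a-b, stored as adjacent complex pairs
        dst[2 * i] = ar + br;
        dst[2 * i + 1] = ai + bi;
        dst[2 * i + 2] = ar - br;
        dst[2 * i + 3] = ai - bi;
    }
}

fn main() {
    let mut dst = [0.0f32; 4];
    // (1 + 3i) and (2 + 4i): sum = 3 + 7i, difference = -1 - 1i
    final_pass_fused(&[1.0, 2.0], &[3.0, 4.0], &mut dst);
    assert_eq!(dst, [3.0, 7.0, -1.0, -1.0]);
}
```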
Deinterleave/interleave in each FFT kernel
If we're in an ALU-rich, bandwidth-poor regime, then the extra math in each kernel might not be that bad.
Trade-offs: More pervasive compute cost. No more 2-thread BRAVO, which wouldn't matter if #97 pans out. Needs a dedicated BRAVO variant, which should be straightforward.
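At the kernel level this option means something like the following (a scalar stand-in for the SIMD shuffles that would do the deinterleave/interleave in practice; everything here is hypothetical):

```rust
/// Size-2 butterfly on interleaved storage: deinterleave on load,
/// do the math in split form, re-interleave on store.
fn butterfly2(buf: &mut [f32], i: usize, j: usize) {
    // load + deinterleave (a SIMD kernel would use shuffles here)
    let (ar, ai) = (buf[2 * i], buf[2 * i + 1]);
    let (br, bi) = (buf[2 * j], buf[2 * j + 1]);
    // split-form butterfly
    let (sr, si) = (ar + br, ai + bi);
    let (dr, di) = (ar - br, ai - bi);
    // interleave + store
    buf[2 * i] = sr;
    buf[2 * i + 1] = si;
    buf[2 * j] = dr;
    buf[2 * j + 1] = di;
}

fn main() {
    let mut buf = [1.0f32, 3.0, 2.0, 4.0];
    butterfly2(&mut buf, 0, 1);
    assert_eq!(buf, [3.0, 7.0, -1.0, -1.0]);
}
```

The "more pervasive compute cost" is exactly those shuffle steps, paid on every load and store in every pass instead of once up front.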
Specialized interleaved kernels
Instead of deinterleaving before and interleaving after in each kernel, adapt the math to operate on interleaved values directly. This is what other FFT implementations appear to do.
Complexity: Writing an entirely new set of kernels.
Trade-offs: Same as the previous option, maybe slightly cheaper.