Parallel in-place deinterleaving from #95 only gets us roughly on par with RustFFT at large sizes.
Here are some ideas on how we could go faster.
Parallel out-of-place interleave/deinterleave
Instead of .zip()/.unzip(), use Rayon's .unzip_into_vecs() or persistent scratch buffers for deinterleaving, and plain old .par_iter() for interleaving. #100
Complexity: Needs more explicit SIMD instead of iterator autovectorization, but quite straightforward. Needs heuristics for switching over to the parallel impl.
Trade-offs: Out-of-place, higher memory use.
Fuse deinterleaving with BRAVO and interleaving with last step of FFT
We're loading/storing data in there anyway, so might as well also do the interleaving transformations while we're there.
Complexity: An additional variant for BRAVO and for one FFT pass. The FFT pass adaptation is kinda awkward for DiT because the final pass is the one that spans both halves. In DiF it would be the butterflies of size 2 instead, which are way easier to parallelize. We can probably make it work with the plan-and-execute trick like in #95/#97 with some modest overhead.
Trade-offs: Out-of-place, higher memory use. No more 2-thread BRAVO, which wouldn't matter if #97 pans out. Might require unsafe non-temporal stores to go fast on a single thread on x86.
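For the DiF direction, the fused final pass could look roughly like this: a scalar, single-threaded sketch with hypothetical names, exploiting the fact that the last radix-2 DiF pass has unity twiddles (the real version would be SIMD, chunked across threads, and would use the non-temporal stores mentioned above):

```rust
/// Final DiF pass (size-2 butterflies, unity twiddles) fused with
/// re-interleaving: reads split planes, writes interleaved complex.
fn final_pass_fused(re: &[f32], im: &[f32], dst: &mut [f32]) {
    for i in (0..re.len()).step_by(2) {
        let (ar, ai) = (re[i], im[i]);
        let (br, bi) = (re[i + 1], im[i + 1]);
        // butterfly outputs a+b and a-b, stored as adjacent complex pairs
        dst[2 * i] = ar + br;
        dst[2 * i + 1] = ai + bi;
        dst[2 * i + 2] = ar - br;
        dst[2 * i + 3] = ai - bi;
    }
}

fn main() {
    let mut dst = [0.0f32; 4];
    // (1 + 3i) and (2 + 4i): sum = 3 + 7i, difference = -1 - 1i
    final_pass_fused(&[1.0, 2.0], &[3.0, 4.0], &mut dst);
    assert_eq!(dst, [3.0, 7.0, -1.0, -1.0]);
}
```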
Deinterleave/interleave in each FFT kernel
If we're in an ALU-rich, bandwidth-poor regime, then the extra math in each kernel might not be that bad.
Trade-offs: More pervasive compute cost. No more 2-thread BRAVO, which wouldn't matter if #97 pans out. Needs a dedicated BRAVO variant, which should be straightforward.
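At the kernel level this option means something like the following (a scalar stand-in for the SIMD shuffles that would do the deinterleave/interleave in practice; everything here is hypothetical):

```rust
/// Size-2 butterfly on interleaved storage: deinterleave on load,
/// do the math in split form, re-interleave on store.
fn butterfly2(buf: &mut [f32], i: usize, j: usize) {
    // load + deinterleave (a SIMD kernel would use shuffles here)
    let (ar, ai) = (buf[2 * i], buf[2 * i + 1]);
    let (br, bi) = (buf[2 * j], buf[2 * j + 1]);
    // split-form butterfly
    let (sr, si) = (ar + br, ai + bi);
    let (dr, di) = (ar - br, ai - bi);
    // interleave + store
    buf[2 * i] = sr;
    buf[2 * i + 1] = si;
    buf[2 * j] = dr;
    buf[2 * j + 1] = di;
}

fn main() {
    let mut buf = [1.0f32, 3.0, 2.0, 4.0];
    butterfly2(&mut buf, 0, 1);
    assert_eq!(buf, [3.0, 7.0, -1.0, -1.0]);
}
```

The "more pervasive compute cost" is exactly those shuffle steps, paid on every load and store in every pass instead of once up front.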
Specialized interleaved kernels
Instead of deinterleaving before and interleaving after in each kernel, adapt the math to operate on interleaved values directly. This is what other FFT implementations appear to do.
Complexity: Writing an entirely new set of kernels.
Trade-offs: Same as the previous option, maybe slightly cheaper.