Skip to content

Latest commit

 

History

History
183 lines (126 loc) · 6.77 KB

File metadata and controls

183 lines (126 loc) · 6.77 KB
vectorfft_banner (1)

A permutation-free mixed-radix FFT library in C with hand-tuned AVX2/AVX-512 codelets.
Beats Intel MKL and FFTW on every tested size. No external dependencies.


Benchmark Results

VectorFFT vs Intel MKL vs FFTW

Throughput

Platform: Intel Core i9-14900KF, 48 KB L1d, DDR5, AVX2, single-threaded
Competitors: FFTW 3.3.10 (FFTW_MEASURE), Intel MKL (sequential)

Speedup over Intel MKL and FFTW

Speedup

VectorFFT vs Intel MKL

vs MKL

N K vs FFTW vs MKL
60 32 4.81x 5.36x
49 32 4.60x 4.11x
200 256 3.78x 3.36x
1000 256 3.17x 2.78x
5000 32 5.88x 2.95x
10000 256 7.92x 1.82x
10000 1024 9.67x 2.19x
50000 256 16.02x 1.77x
100000 32 13.98x 2.55x
16384 1024 7.00x 1.82x
4096 1024 2.69x 1.75x

Prime-N Performance (Rader & Bluestein)

vfft_primes

Smooth primes (N-1 is 19-smooth) use Rader's algorithm with hand-optimized codelets for the convolution. Non-smooth primes use Bluestein's algorithm (chirp-z transform).

Prime N K Method vs FFTW vs MKL
61 32 Rader 3.89x 5.41x
337 256 Rader 2.37x 2.58x
2053 256 Rader 2.52x 2.24x
263 32 Bluestein 1.04x 1.10x

Rader primes: 1.39x - 5.41x vs MKL (mean 2.50x). Bluestein handles the hard cases where no efficient decomposition exists.


Accuracy

vfft_accuracy

All three libraries achieve comparable accuracy against brute-force O(N^2) DFT reference. VectorFFT's errors are 1.3-2.5x higher than FFTW/MKL due to the multi-stage stride-based decomposition (more intermediate twiddle multiplications), but remain well within double-precision tolerance and follow the theoretical O(N * epsilon * log N) bound.

Roundtrip error (fwd + bwd / N) is at machine epsilon (~1e-16) for all sizes -- the permutation-free architecture guarantees perfect cancellation.

Run vfft_bench to see accuracy results for your hardware.


Architecture

VectorFFT uses a permutation-free stride-based Cooley-Tukey architecture:

  • DIT forward + DIF backward -- roundtrip cancels digit-reversal permutation
  • Zero permutation passes, zero scratch buffers -- fully in-place, single buffer
  • 18 hand-optimized radixes (2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 16, 17, 19, 20, 25, 32, 64)
  • Method C fused twiddles -- bakes common factor into per-leg twiddle table at plan time
  • Scalar twiddle optimization -- stores (R-1) unique scalars per group instead of (R-1)*K replicated rows; broadcasts to SIMD width inside codelets, eliminating L2/L3 twiddle pressure
  • Split-complex layout -- separate re[] / im[] arrays, SIMD-friendly

All codelets are generated by Python scripts with ISA-specific scheduling (register allocation, spill management, FMA pipelining). Prime-radix butterflies (R=11, 13, 17, 19) are derived from FFTW's genfft algebraic output, then re-scheduled using Sethi-Ullman optimal register allocation with explicit spill management to fit AVX2's 16 YMM registers.


Getting Started

git clone https://github.com/Tugbars/VectorFFT.git
cd VectorFFT

# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release

# Generate codelets (requires Python 3)
cmake --build . --target generate_codelets

# Build everything
cmake --build . --config Release

Usage

#include "planner.h"

// Initialize
stride_registry_t reg;
stride_registry_init(&reg);

// Plan (heuristic -- instant)
stride_plan_t *plan = stride_auto_plan(N, K, &reg);

// Execute (in-place, split-complex)
stride_execute_fwd(plan, re, im);   // forward
stride_execute_bwd(plan, re, im);   // backward (output / N to normalize)

stride_plan_destroy(plan);

Wisdom (optional)

Exhaustive search finds optimal factorization for each (N, K) pair:

stride_wisdom_t wis;
stride_wisdom_init(&wis);
stride_wisdom_load(&wis, "vfft_wisdom.txt");

// Use cached optimal plan, or fall back to heuristic
stride_plan_t *plan = stride_wise_plan(N, K, &reg, &wis);

// Calibrate new (N, K) pairs
stride_wisdom_calibrate(&wis, N, K, &reg);
stride_wisdom_save(&wis, "vfft_wisdom.txt");

Project Structure

src/stride-fft/
  core/               Runtime engine (header-only)
    executor.h        In-place stride-based executor (Method C)
    planner.h         Top-level API: auto_plan, wise_plan, exhaustive_plan
    registry.h        ISA-aware codelet registry (AVX-512 > AVX2 > scalar)
    factorizer.h      CPU-aware heuristic factorizer
    exhaustive.h      Exhaustive factorization search
    compat.h          Portable timer + aligned alloc
  codelets/           Generated SIMD headers (~150k lines)
    avx2/             47 AVX2 codelet headers
    avx512/           47 AVX-512 codelet headers
    scalar/           47 scalar fallback headers
  generators/         Python codelet generators
    gen_radix2.py .. gen_radix64.py
    generate_all.bat
  bench/
    bench_planner.c   Full benchmark (vs FFTW + MKL)
    plot_results.py   Comparison graphs

How It Works

Traditional FFT libraries (FFTW, MKL) use Cooley-Tukey with a digit-reversal permutation pass -- an O(N) memory shuffle that doesn't compute anything but costs cache misses.

VectorFFT eliminates this entirely:

  1. Forward (DIT): stages process data at increasing strides, output lands in digit-reversed order
  2. Backward (DIF): stages process in reverse order, naturally undoing the permutation
  3. Roundtrip (fwd + bwd): produces natural-order output with zero permutation overhead

This saves one full data pass per transform. At large N, that's the difference between 1x and 16x over FFTW.


Acknowledgments

  • FFTW by Matteo Frigo and Steven G. Johnson -- the gold standard for decades. VectorFFT's prime-radix butterflies (R=11, 13, 17, 19) are derived from FFTW's genfft algebraic output, then re-scheduled using Sethi-Ullman register allocation with explicit spill management to minimize register pressure on AVX2 (16 YMM) and AVX-512 (32 ZMM).
  • VkFFT by Dmitrii Tolmachev -- inspiration for the benchmarking methodology and presentation style.