VectorFFT/README.md at dev/factorizationSolve · Tugbars/VectorFFT

A permutation-free mixed-radix FFT library in C with hand-tuned AVX2/AVX-512 codelets.
Beats Intel MKL and FFTW on every tested size. No external dependencies.

Benchmark Results

VectorFFT vs Intel MKL vs FFTW

Platform: Intel Core i9-14900KF, 48 KB L1d, DDR5, AVX2, single-threaded
Competitors: FFTW 3.3.10 (FFTW_MEASURE), Intel MKL (sequential)

Speedup over Intel MKL and FFTW

VectorFFT vs Intel MKL

N	K	vs FFTW	vs MKL
60	32	4.81x	5.36x
49	32	4.60x	4.11x
200	256	3.78x	3.36x
1000	256	3.17x	2.78x
5000	32	5.88x	2.95x
10000	256	7.92x	1.82x
10000	1024	9.67x	2.19x
50000	256	16.02x	1.77x
100000	32	13.98x	2.55x
16384	1024	7.00x	1.82x
4096	1024	2.69x	1.75x

Prime-N Performance (Rader & Bluestein)

Smooth primes (N-1 is 19-smooth) use Rader's algorithm with hand-optimized codelets for the convolution. Non-smooth primes use Bluestein's algorithm (chirp-z transform).

Prime N	K	Method	vs FFTW	vs MKL
61	32	Rader	3.89x	5.41x
337	256	Rader	2.37x	2.58x
2053	256	Rader	2.52x	2.24x
263	32	Bluestein	1.04x	1.10x

Rader primes: 1.39x - 5.41x vs MKL (mean 2.50x). Bluestein handles the hard cases where no efficient decomposition exists.

Accuracy

All three libraries achieve comparable accuracy against brute-force O(N^2) DFT reference. VectorFFT's errors are 1.3-2.5x higher than FFTW/MKL due to the multi-stage stride-based decomposition (more intermediate twiddle multiplications), but remain well within double-precision tolerance and follow the theoretical O(N * epsilon * log N) bound.

Roundtrip error (fwd + bwd / N) is at machine epsilon (~1e-16) for all sizes -- the permutation-free architecture guarantees perfect cancellation.

Run vfft_bench to see accuracy results for your hardware.

Architecture

VectorFFT uses a permutation-free stride-based Cooley-Tukey architecture:

DIT forward + DIF backward -- roundtrip cancels digit-reversal permutation
Zero permutation passes, zero scratch buffers -- fully in-place, single buffer
18 hand-optimized radixes (2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 16, 17, 19, 20, 25, 32, 64)
Method C fused twiddles -- bakes common factor into per-leg twiddle table at plan time
Scalar twiddle optimization -- stores (R-1) unique scalars per group instead of (R-1)*K replicated rows; broadcasts to SIMD width inside codelets, eliminating L2/L3 twiddle pressure
Split-complex layout -- separate re[] / im[] arrays, SIMD-friendly

All codelets are generated by Python scripts with ISA-specific scheduling (register allocation, spill management, FMA pipelining). Prime-radix butterflies (R=11, 13, 17, 19) are derived from FFTW's genfft algebraic output, then re-scheduled using Sethi-Ullman optimal register allocation with explicit spill management to fit AVX2's 16 YMM registers.

Getting Started

git clone https://github.com/Tugbars/VectorFFT.git
cd VectorFFT

# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release

# Generate codelets (requires Python 3)
cmake --build . --target generate_codelets

# Build everything
cmake --build . --config Release

Usage

#include "planner.h"

// Initialize
stride_registry_t reg;
stride_registry_init(&reg);

// Plan (heuristic -- instant)
stride_plan_t *plan = stride_auto_plan(N, K, &reg);

// Execute (in-place, split-complex)
stride_execute_fwd(plan, re, im);   // forward
stride_execute_bwd(plan, re, im);   // backward (output / N to normalize)

stride_plan_destroy(plan);

Wisdom (optional)

Exhaustive search finds optimal factorization for each (N, K) pair:

stride_wisdom_t wis;
stride_wisdom_init(&wis);
stride_wisdom_load(&wis, "vfft_wisdom.txt");

// Use cached optimal plan, or fall back to heuristic
stride_plan_t *plan = stride_wise_plan(N, K, &reg, &wis);

// Calibrate new (N, K) pairs
stride_wisdom_calibrate(&wis, N, K, &reg);
stride_wisdom_save(&wis, "vfft_wisdom.txt");

Project Structure

src/stride-fft/
  core/               Runtime engine (header-only)
    executor.h        In-place stride-based executor (Method C)
    planner.h         Top-level API: auto_plan, wise_plan, exhaustive_plan
    registry.h        ISA-aware codelet registry (AVX-512 > AVX2 > scalar)
    factorizer.h      CPU-aware heuristic factorizer
    exhaustive.h      Exhaustive factorization search
    compat.h          Portable timer + aligned alloc
  codelets/           Generated SIMD headers (~150k lines)
    avx2/             47 AVX2 codelet headers
    avx512/           47 AVX-512 codelet headers
    scalar/           47 scalar fallback headers
  generators/         Python codelet generators
    gen_radix2.py .. gen_radix64.py
    generate_all.bat
  bench/
    bench_planner.c   Full benchmark (vs FFTW + MKL)
    plot_results.py   Comparison graphs

How It Works

Traditional FFT libraries (FFTW, MKL) use Cooley-Tukey with a digit-reversal permutation pass -- an O(N) memory shuffle that doesn't compute anything but costs cache misses.

VectorFFT eliminates this entirely:

Forward (DIT): stages process data at increasing strides, output lands in digit-reversed order
Backward (DIF): stages process in reverse order, naturally undoing the permutation
Roundtrip (fwd + bwd): produces natural-order output with zero permutation overhead

This saves one full data pass per transform. At large N, that's the difference between 1x and 16x over FFTW.

Acknowledgments

FFTW by Matteo Frigo and Steven G. Johnson -- the gold standard for decades. VectorFFT's prime-radix butterflies (R=11, 13, 17, 19) are derived from FFTW's genfft algebraic output, then re-scheduled using Sethi-Ullman register allocation with explicit spill management to minimize register pressure on AVX2 (16 YMM) and AVX-512 (32 ZMM).
VkFFT by Dmitrii Tolmachev -- inspiration for the benchmarking methodology and presentation style.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark Results

VectorFFT vs Intel MKL vs FFTW

Speedup over Intel MKL and FFTW

VectorFFT vs Intel MKL

Prime-N Performance (Rader & Bluestein)

Accuracy

Architecture

Getting Started

Usage

Wisdom (optional)

Project Structure

How It Works

Acknowledgments

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Benchmark Results

VectorFFT vs Intel MKL vs FFTW

Speedup over Intel MKL and FFTW

VectorFFT vs Intel MKL

Prime-N Performance (Rader & Bluestein)

Accuracy

Architecture

Getting Started

Usage

Wisdom (optional)

Project Structure

How It Works

Acknowledgments