FIDESlib is a full CKKS (Cheon-Kim-Kim-Son) homomorphic encryption library originally written for NVIDIA CUDA. This directory contains the direct Metal port for macOS only, targeting Apple Silicon GPUs with full CKKS support (NTT, modular arithmetic, rotations, bootstrap).
Key Characteristics:
- Direct Metal port (no abstraction layer, no CUDA fallback)
- macOS only (Apple Silicon and Intel Mac with Metal support)
- Full CKKS support for homomorphic encryption operations
- All 7 Metal shaders compile successfully

| File | Purpose |
|---|---|
| `MetalDevice.hpp/cpp` | Device enumeration, MTLDevice setup |
| `MetalBuffer.hpp/cpp` | GPU memory wrapper (MTLBuffer) |
| `MetalMemoryPool.hpp/cpp` | Async memory pool |
| `MetalStream.hpp/cpp` | Command queue/buffer abstraction |
| `MetalLauncher.hpp/cpp` | Kernel launch helpers with pipeline caching |
| `MetalMath.hpp` | Intrinsic wrappers (bit-reverse, clz, umulhi) |

| File | Purpose | Status |
|---|---|---|
| `MetalAddSub.metal` | Add/subtract kernels | Compiles |
| `MetalModMult.metal` | Barrett/Shoup/Neal multiplication | Compiles |
| `MetalNTT.metal` | NTT/INTT 2D CT/GS butterflies | Compiles |
| `MetalNTTFusion.metal` | Fused NTT templates | Compiles |
| `MetalConv.metal` | Convolution/modulus switching | Compiles |
| `MetalRotation.metal` | Automorphism/rotation kernels | Compiles |
| `MetalElemWise.metal` | Batch element-wise operations | Compiles |

| File | Purpose |
|---|---|
| `MetalMathTest.cpp` | CPU-based unit tests for math helpers |
| `MetalSample.cpp` | Sample program demonstrating Metal operations |
| `MetalTest.cpp` | Full Metal integration tests (requires device) |

- macOS with Xcode command line tools
- Metal-compatible GPU (Apple Silicon or Intel Mac with Metal)
- CMake 3.10+
mkdir -p build/metal
for f in src/Metal/*.metal; do
  xcrun metal -c "$f" -o "build/metal/$(basename "$f" .metal).air"
done
xcrun metallib build/metal/*.air -o build/metal/fideslib_metal.metallib

Add to your CMake configuration:
set(FIDES_USE_METAL ON CACHE BOOL "Enable Metal GPU backend")
include(${CMAKE_SOURCE_DIR}/cmake/Metal.cmake)

The CMake module will:
- Check for xcrun toolchain
- Compile all `.metal` files to `.air` files
- Combine `.air` files into `fideslib_metal.metallib`
- Install the metallib to the library directory

The Metal backend provides software implementations for intrinsics not available in Metal:
// Bit reversal (Metal has no __brev)
uint32_t metal_brev(uint32_t x); // 32-bit bit reverse
uint64_t metal_brev64(uint64_t x); // 64-bit bit reverse
// Count leading zeros
uint32_t metal_clz(uint32_t x); // clz for 32-bit
uint32_t metal_clzll(uint64_t x); // clz for 64-bit
// Unsigned multiply high
uint32_t metal_umulhi(uint32_t a, uint32_t b); // 32-bit upper product
uint64_t metal_umul64hi(uint64_t a, uint64_t b); // 64-bit upper product
// Popcount
uint32_t metal_popc(uint32_t x); // Population count

// Barrett reduction - see the limitation note below
uint64_t metal_modmult_barrett(uint64_t a, uint64_t b, uint64_t mod);
uint64_t metal_barrett_mu(uint64_t mod); // Precompute mu = floor(2^64 / mod)
// Shoup's algorithm - faster when b is reused
uint64_t metal_shoup_precompute(uint64_t b, uint64_t mod);
uint64_t metal_modmult_shoup(uint64_t a, uint64_t b, uint64_t mod, uint64_t shoup_b);
// Simple modular operations
uint64_t metal_modadd(uint64_t a, uint64_t b, uint64_t mod);
uint64_t metal_modsub(uint64_t a, uint64_t b, uint64_t mod);

Barrett Reduction Limitation:
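With mu = floor(2^64 / mod), the quotient estimate q = floor(a * b * mu / 2^64) falls short of the true quotient floor(a * b / mod) by less than a * b / 2^64 + 1. A single conditional subtraction after r = a * b - q * mod therefore gives the correct residue only when a * b < 2^64. When a * b < mod, q is 0 and no reduction is needed.
In CKKS context, inputs lie in [0, mod), so a * b < mod^2 and the bound is guaranteed for moduli below 2^32. When it cannot be guaranteed, use Shoup or direct modular multiplication instead.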
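
As an illustration of the two multiplication strategies listed above, here is a minimal CPU-side sketch. The `sketch_*` names and signatures are hypothetical and are not the library's API; the Barrett variant assumes the product `a * b` fits in 64 bits, the Shoup variant assumes `b < mod < 2^63`, and `unsigned __int128` is a GCC/Clang extension used here only for the precomputation and the high-word product.

```cpp
#include <cstdint>

using u64  = std::uint64_t;
using u128 = unsigned __int128;  // GCC/Clang extension (assumption of this sketch)

// mu = floor(2^64 / mod), for mod >= 2.
inline u64 sketch_barrett_mu(u64 mod) {
    return static_cast<u64>((static_cast<u128>(1) << 64) / mod);
}

// Barrett multiplication, valid only while a * b < 2^64 (e.g. mod < 2^32).
inline u64 sketch_modmult_barrett(u64 a, u64 b, u64 mod, u64 mu) {
    const u64 x = a * b;                                             // assumed not to overflow
    const u64 q = static_cast<u64>((static_cast<u128>(x) * mu) >> 64);  // quotient estimate
    const u64 r = x - q * mod;                                       // lies in [0, 2 * mod)
    return r >= mod ? r - mod : r;
}

// Shoup precomputation: floor(b * 2^64 / mod), requires b < mod.
inline u64 sketch_shoup_precompute(u64 b, u64 mod) {
    return static_cast<u64>((static_cast<u128>(b) << 64) / mod);
}

// Shoup multiplication: a * b mod `mod`, for b < mod < 2^63 and b_shoup from above.
inline u64 sketch_modmult_shoup(u64 a, u64 b, u64 mod, u64 b_shoup) {
    const u64 q = static_cast<u64>((static_cast<u128>(a) * b_shoup) >> 64);
    const u64 r = a * b - q * mod;  // wraps modulo 2^64; the true value lies in [0, 2 * mod)
    return r >= mod ? r - mod : r;
}
```

Shoup's form pays for one division per constant `b` up front, which is why it suits operands that are reused, such as twiddle factors.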
Metal grids can also be 3D (MTLSize carries width, height, and depth), but this port flattens 3D CUDA grids to two dimensions:
// CUDA
dim3 grid(N/2, numLimbs, parts);
// Metal (flattened)
MTLSize gridSize = MTLSizeMake(N/2 * parts, numLimbs, 1);

- CUDA: up to 48KB shared memory per block
- Metal: ~32KB threadgroup memory per threadgroup
- Fix: Reduce threadgroup size from 1024 to 512 threads

- CUDA: max 1024 threads/block
- Metal: max 512 threads/threadgroup recommended
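
The flattened grid above packs the CUDA `x` and `z` dimensions into one Metal `x` coordinate, so a kernel has to split that coordinate again. The sketch below shows the index arithmetic on the CPU; the struct and function names are hypothetical, and it assumes the original `x` index varies fastest inside the flattened coordinate, which the source does not state.

```cpp
#include <cstdint>

// Hypothetical view of the original 3D CUDA index.
struct GridIndex3D {
    std::uint32_t x;     // in [0, n_half), was the CUDA x index
    std::uint32_t limb;  // in [0, numLimbs), unchanged by the flattening
    std::uint32_t part;  // in [0, parts), was the CUDA z index
};

// Pack (x, part) into the flattened x coordinate (assumed layout: x fastest).
inline std::uint32_t flatten_x(std::uint32_t x, std::uint32_t part, std::uint32_t n_half) {
    return part * n_half + x;
}

// Recover the 3D index from the 2D Metal thread position (flat_x, y).
inline GridIndex3D unflatten(std::uint32_t flat_x, std::uint32_t y, std::uint32_t n_half) {
    return GridIndex3D{flat_x % n_half, y, flat_x / n_half};
}
```

When `n_half` is a power of two, as it is for `N/2` with power-of-two ring dimension, the modulo and division reduce to a mask and a shift.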

| CUDA | Metal |
|---|---|
| `__syncthreads()` | `threadgroup_barrier(mem_flags::mem_device)` |
| `__syncwarp()` | `simdgroup_barrier()` |
| `__brev(x)` | `metal_brev(x)` (software, ~10x slower) |
| `__clz(x)` | `metal_clz(x)` |
| `__umulhi(a,b)` | `metal_umulhi(a, b)` |
| `__umul64hi(a,b)` | `metal_umul64hi(a, b)` |
| `atomicAdd` | `atomic_fetch_add_explicit` |
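
To make the "software" entries in the table concrete, the following is one possible way to write such replacements using only shifts, masks, and 64-bit multiplies. The `sketch_*` functions are hypothetical stand-ins written for this example; they are not taken from `MetalMath.hpp` and the real helpers may be implemented differently.

```cpp
#include <cstdint>

// 32-bit bit reversal: swap neighbouring bits, then pairs, nibbles, bytes, halves.
inline std::uint32_t sketch_brev(std::uint32_t x) {
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

// Count leading zeros by binary search; returns 32 for x == 0.
inline std::uint32_t sketch_clz(std::uint32_t x) {
    if (x == 0) return 32;
    std::uint32_t n = 0;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8; }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4; }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2; }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}

// Upper 64 bits of a 64x64-bit product, built from four 32x32-bit partial products.
inline std::uint64_t sketch_umul64hi(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = a & 0xFFFFFFFFull, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFull, b_hi = b >> 32;
    const std::uint64_t lo_lo = a_lo * b_lo;
    const std::uint64_t hi_lo = a_hi * b_lo;
    const std::uint64_t lo_hi = a_lo * b_hi;
    const std::uint64_t hi_hi = a_hi * b_hi;
    // Sum of the middle 32-bit column; its upper half is the carry into the high word.
    const std::uint64_t mid = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFull) + (lo_hi & 0xFFFFFFFFull);
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
}
```

The multiply-high routine is the building block that Barrett and Shoup reductions need when no 128-bit integer type is available.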
clang++ -std=c++20 -o build/metal_math_test src/Metal/MetalMathTest.cpp
./build/metal_math_test

Tests cover:
- Modular addition/subtraction
- Barrett/Shoup modular multiplication
- Bit reversal (32-bit and 64-bit)
- Count leading zeros
- Population count
- Unsigned multiply high
Results: 53 tests pass
# Build all samples
clang++ -std=c++20 -o build/ntt_example src/Metal/samples/ntt_example.cpp
clang++ -std=c++20 -o build/ckks_basics src/Metal/samples/ckks_basics.cpp
clang++ -std=c++20 -o build/rotation_example src/Metal/samples/rotation_example.cpp
clang++ -std=c++20 -o build/modular_arithmetic src/Metal/samples/modular_arithmetic.cpp
clang++ -std=c++20 -o build/bmi_calculator src/Metal/samples/bmi_calculator.cpp
# Run samples
./build/ntt_example # Flat buffer layout, butterfly operations
./build/ckks_basics # CKKS encoding/encryption/decryption
./build/rotation_example # Rotation/automorphism concepts
./build/modular_arithmetic # Barrett/Shoup multiplication
./build/bmi_calculator # Privacy-preserving BMI example

# Requires Objective-C++ and Metal framework
clang++ -std=c++20 -framework Metal -fobjc-arc \
src/Metal/MetalTest.cpp src/Metal/MetalDevice.mm \
src/Metal/MetalBuffer.mm src/Metal/MetalStream.mm \
-o build/metal_test
./build/metal_test

- `mult1Add2` - multiply then add
- `Mult` - batch multiplication
- `Add`, `Sub` - batch addition/subtraction
- `square` - batch squaring
- `binomialMult` - binomial multiplication
- `dotProductPt` - dot product
- `eval_linear_w_sum` - linear sum evaluation

- `automorph` - automorphism transformation
- `conjugate` - complex conjugation
- `teleswap` - telescopic rotation patterns
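
For readers unfamiliar with the operation behind these kernels, the sketch below applies the ring automorphism X -> X^k to a polynomial in Z_q[X]/(X^N + 1) in coefficient form. It is a CPU illustration under stated assumptions, not the Metal kernel: the function name is hypothetical, N is assumed to be a power of two, k is assumed odd, and the real kernels may work on NTT-domain data with a different index map.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Coefficient-domain automorphism X -> X^k on Z_q[X]/(X^N + 1).
// Assumes a.size() = N is a power of two, k is odd, and i * k does not overflow.
inline std::vector<std::uint64_t> sketch_automorph(const std::vector<std::uint64_t>& a,
                                                   std::uint64_t k, std::uint64_t q) {
    const std::size_t n = a.size();
    std::vector<std::uint64_t> out(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint64_t e = (i * k) % (2 * n);  // exponent reduced using X^(2N) = 1
        const std::uint64_t c = a[i] % q;
        if (e < n) {
            out[e] = c;                             // X^e keeps its sign
        } else {
            out[e - n] = (c == 0) ? 0 : q - c;      // X^N = -1 flips the sign
        }
    }
    return out;
}
```

With this convention, conjugation corresponds to k = 2N - 1, and slot rotations correspond to powers of 5 modulo 2N.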

- `mult_and_save` - multiply and save
- `square_and_save` - square and save
- `rescale` - rescaling operation
- `moddown` - modulus switching down
- `ksk_dot` - key switching dot product

- `ModDown2` - modulus switching
- `DecompAndModUpConv` - decomposition and mod-up
- `ModUpDiag` - modulus up diagonal

Bootstrap kernels are mostly orchestration/CPU code that calls the GPU kernels already ported. The GPU parts are in MetalBootstrap.metal.
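- **No `device void**`** - Metal doesn't support `device void**`. Restructure buffers to flat layouts.
- **No double precision** - Metal Shading Language has no `double` type. Use `float` for CKKS operations on the GPU (the homomorphic encryption kernels don't require doubles).
- **Threadgroup barriers** - `threadgroup_barrier(mem_flags::mem_device)` orders device-memory accesses among the threads of one threadgroup; it does not synchronize across threadgroups. Use `mem_flags::mem_threadgroup` when the shared data lives in threadgroup memory.
- **constexpr limitations** - Metal rejects a bare `constexpr` variable at program scope, because program-scope variables must live in the `constant` address space. Use `#define` instead, or declare the value as `constant`.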
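
The flat-layout advice in the first note can be illustrated on the host side. The sketch below copies per-limb arrays into one contiguous buffer plus an offset table, which is the shape a single Metal buffer can hold in place of an array of pointers. `FlatLimbs`, `flatten_limbs`, and `limb_at` are hypothetical names for this example only and do not describe how the port actually stores its data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one contiguous allocation plus per-limb start offsets.
struct FlatLimbs {
    std::vector<std::uint64_t> data;     // all limbs stored back to back
    std::vector<std::size_t>   offsets;  // offsets[i] = start of limb i; last entry = total size
};

// Copy a vector of limbs into the flat representation.
inline FlatLimbs flatten_limbs(const std::vector<std::vector<std::uint64_t>>& limbs) {
    FlatLimbs flat;
    flat.offsets.reserve(limbs.size() + 1);
    flat.offsets.push_back(0);
    for (const auto& limb : limbs) {
        flat.data.insert(flat.data.end(), limb.begin(), limb.end());
        flat.offsets.push_back(flat.data.size());
    }
    return flat;
}

// Element j of limb i; a kernel would apply the same offset arithmetic to one buffer.
inline std::uint64_t limb_at(const FlatLimbs& flat, std::size_t limb, std::size_t j) {
    return flat.data[flat.offsets[limb] + j];
}
```

If every limb has the same length N, the offset table can be dropped and the index becomes `limb * N + j`.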

- Bit reversal is software (~10x slower than hardware `__brev`). Consider precomputing lookup tables for small N.
- Threadgroup memory is limited to ~32KB. Restructure kernels to use registers where possible.
- 512 threads per threadgroup (not 1024 like CUDA). Adjust grid dimensions accordingly.
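
One way to follow the lookup-table suggestion is to build the bit-reversal permutation once on the host and upload it as a read-only buffer. The sketch below is a hypothetical helper, not part of the library; it assumes `log2n` is below 32 and that the table has N = 2^log2n entries.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// table[i] = the low `log2n` bits of i in reversed order, for i in [0, 2^log2n).
inline std::vector<std::uint32_t> make_bitrev_table(unsigned log2n) {
    const std::size_t n = std::size_t{1} << log2n;
    std::vector<std::uint32_t> table(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        // Reuse the entry for i / 2: shift it down by one and put i's lowest bit on top.
        const std::uint32_t top = static_cast<std::uint32_t>((i & 1u) << (log2n - 1));
        table[i] = (table[i >> 1] >> 1) | top;
    }
    return table;
}
```

The table costs 4 * N bytes, so it only pays off while it stays small enough to be cached well, which matches the "small N" caveat above.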
The cmake/Metal.cmake module handles:
# Find Metal toolchain
find_program(XCRUN_EXECUTABLE xcrun PATHS /usr/bin /usr/local/bin)
# Metal source files
file(GLOB METAL_SHADER_FILES "${METAL_SOURCE_DIR}/*.metal")
# Custom target for compilation
add_custom_target(metal_lib ALL
  COMMAND "${COMPILE_SCRIPT}" ...
  DEPENDS ${METAL_SHADER_FILES}
)

After compilation, all shaders are combined into:
build/metal/fideslib_metal.metallib (292KB)
This single metallib contains all kernels from all 7 shader files.
- Context Integration - Replace CUDA calls with Metal in CKKS Context
- Testing Infrastructure - Build complete test suite with Metal device
- Bootstrap Validation - Verify bootstrap produces correct results vs CPU reference
"No Metal devices found"
- Ensure you are running on macOS with a Metal-compatible GPU
- Check that Xcode command line tools are installed

"xcrun not found"
- Install Xcode command line tools: `xcode-select --install`

Shader compilation errors
- Ensure you are using the correct Metal SDK: `xcrun --version`
- Check for deprecated Metal 2.0 features

Linker errors with metallib
- Ensure all `.air` files are included: `xcrun metallib build/metal/*.air`