This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is the MAX Kernels directory containing high-performance compute kernels written in Mojo. These kernels serve as building blocks for numerical, machine learning, and other performance-critical workloads. The repository is part of Modular AI's larger codebase and uses Bazel as its build system.
This project uses Bazel for building. Commands should be run through the `./bazelw` wrapper script from the main Modular repository root.
```bash
# Build all kernels
./bazelw build //max/kernels/...

# Build a specific module
./bazelw build //max/kernels/src/linalg:linalg

# Build a specific benchmark
./bazelw build //max/kernels/benchmarks/gpu:bench_matmul

# Build and run a benchmark
./bazelw run //max/kernels/benchmarks/gpu:bench_matmul

# Run a specific test
./bazelw test //max/kernels/test/linalg:test_matmul

# Run all tests in a directory
./bazelw test //max/kernels/test/linalg/...

# Run GPU tests with specific hardware
./bazelw test --config=remote-a10 //max/kernels/test/gpu/...    # For A10 GPU
./bazelw test --config=remote-h100 //max/kernels/test/gpu/...   # For H100 GPU
./bazelw test --config=remote-mi300 //max/kernels/test/gpu/...  # For MI300 GPU
```

```bash
# Run a Mojo file
./bazelw run //KGEN/tools/mojo -- /path/to/file.mojo

# Or use the bmojo alias (after sourcing start-modular.sh)
bmojo /path/to/file.mojo

# Debug a Mojo file
bd //KGEN/tools/mojo -- /path/to/file.mojo
```

Directory structure:

- `src/`: Core kernel implementations
  - `linalg/`: Linear algebra operations (GEMM, GEMV, etc.)
  - `nn/`: Neural network operations (convolution, attention, pooling)
  - `quantization/`: Quantized operations
  - `layout/`: Memory layout utilities and tensor operations
  - `internal_utils/`: Internal utilities and helpers
  - `kv_cache/`: Key-value cache implementations
  - `Mogg/`: MOGG (Modular Graph Generator) related code
  - `register/`: Register-level operations
- `test/`: Unit tests mirroring source structure
  - Tests are organized by functionality (linalg, nn, gpu, etc.)
- `benchmarks/`: Performance benchmarks
  - `gpu/`: GPU-specific benchmarks with YAML configurations
  - `linalg/`: Linear algebra benchmarks
  - `nn/`: Neural network operation benchmarks
  - `autotune/`: Auto-tuning utilities and benchmarking tools
Kernel characteristics:

- Kernels are written using Mojo's systems programming capabilities
- Fine-grained control over memory layout and parallelism
- Hardware-specific optimizations (CPU SIMD, GPU tensor cores)
- Vendor library integration (cuBLAS, Apple Accelerate)
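As a rough illustration of the SIMD style (a minimal hypothetical sketch, not a kernel from this repository), Mojo exposes the hardware vector width and fused multiply-add directly:

```mojo
from sys.info import simdwidthof

# Hardware vector width for float32 on the target CPU, resolved at compile time.
alias width = simdwidthof[DType.float32]()

# Hypothetical helper: fused multiply-add over one vector-width chunk.
fn fma_chunk(
    a: SIMD[DType.float32, width],
    b: SIMD[DType.float32, width],
    acc: SIMD[DType.float32, width],
) -> SIMD[DType.float32, width]:
    return a.fma(b, acc)

fn main():
    var ones = SIMD[DType.float32, width](1.0)
    print(fma_chunk(ones, ones, ones))  # vector of 2.0s
```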
Common imports:

```mojo
from linalg.matmul import matmul
from internal_utils import DeviceNDBuffer, HostNDBuffer
from gpu.host import DeviceContext
```

Test conventions:

- Test files have a corresponding `.mojo` file in the test directory
- GPU tests are in the `test/gpu/` subdirectory
- Tests use assertions from the `testing` module
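A minimal sketch of what a test file in this style might look like (hypothetical test, using only standard-library `testing` assertions):

```mojo
from testing import assert_equal, assert_true

def test_basic():
    # Illustrative assertions only; real tests exercise kernel entry points.
    assert_equal(2 + 2, 4)
    assert_true(Float32(0.5) < Float32(1.0))

def main():
    test_basic()
```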
```bash
# Run a specific test
./bazelw test //max/kernels/test/linalg:test_matmul

# Run tests with specific configurations
./bazelw test --config=asan //max/kernels/test/...           # With AddressSanitizer
./bazelw test --config=debug-modular //max/kernels/test/...  # Debug build
./bazelw test --runs_per_test=10 //max/kernels/test/...      # Multiple runs
```

```bash
# Run benchmarks using the benchmarking framework
./bazelw run //max/kernels/benchmarks/gpu:bench_matmul

# Run benchmarks with environment variables
./bazelw run //max/kernels/benchmarks/gpu:bench_matmul -- \
    env_get_int[M]=1024 env_get_int[N]=1024 env_get_int[K]=1024

# Use autotune tools for performance analysis
python benchmarks/autotune/kbench.py benchmarks/gpu/bench_matmul.yaml
```

```bash
# Format Mojo code
mojo format ./

# Run formatting through Bazel
./bazelw run //:format
```

GPU support:

- NVIDIA GPU support through CUDA/PTX
- AMD GPU support through ROCm
- Tests can be run on specific hardware using remote configs
- GPU kernels use device contexts and memory management
CPU support:

- Intel AMX support
- Apple AMX and Accelerate framework
- ARM NEON intrinsics
- x86 AVX/VNNI instructions
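CPU feature selection typically happens at compile time. A minimal hypothetical sketch using standard-library feature queries (not the repository's actual dispatch logic):

```mojo
from sys.info import has_avx512f, has_neon

# Hypothetical helper: choose a tile width for the CPU this binary targets.
# The branch is folded away at compile time via @parameter if.
fn pick_tile_width() -> Int:
    @parameter
    if has_avx512f():
        return 64
    elif has_neon():
        return 16
    else:
        return 8

fn main():
    print("tile width:", pick_tile_width())
```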
Many benchmarks and tests use environment variables for configuration:
- `env_get_int[]`: Get integer values
- `env_get_bool[]`: Get boolean flags
- `env_get_dtype[]`: Get data type specifications
Example:

```bash
./bazelw run //max/kernels/benchmarks/gpu:bench_matmul -- \
    env_get_int[M]=512 env_get_bool[transpose_b]=true \
    env_get_dtype[type]=float16
```
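On the Mojo side these values are read as compile-time parameters. A minimal sketch, assuming the standard-library `sys.param_env` helpers (`env_get_dtype` is a repo-specific helper and is not shown):

```mojo
from sys.param_env import env_get_bool, env_get_int

# Hypothetical benchmark parameters; the defaults apply when a value
# is not supplied on the command line.
alias M = env_get_int["M", 1024]()
alias N = env_get_int["N", 1024]()
alias transpose_b = env_get_bool["transpose_b", False]()

fn main():
    print("M =", M, "N =", N, "transpose_b =", transpose_b)
```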
```bash
# Debug with bazel
bd //max/kernels/benchmarks/gpu:bench_matmul

# Debug in VSCode
bd --vscode //max/kernels/benchmarks/gpu:bench_matmul
```

Debugging tips:

- Use `print()` for debugging values
- Enable assertions with `--enable_assertions`
- Use `--test_output=streamed` for immediate test output
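For example, `debug_assert` checks only run when assertions are enabled, so bounds checks like the hypothetical helper below cost nothing in normal builds:

```mojo
# Hypothetical helper: a bounds-checked load.
fn checked_load(values: List[Float32], i: Int) -> Float32:
    # Active only when assertions are enabled; a no-op otherwise.
    debug_assert(i >= 0 and i < len(values), "index out of range")
    return values[i]

fn main():
    var values = List[Float32](1.0, 2.0, 3.0)
    print(checked_load(values, 1))
```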
The `benchmarks/autotune/` directory contains tools for:

- Running parameterized benchmarks (`kbench.py`)
- Comparing performance (`kdiff.py`)
- Plotting results (`kplot.py`)
- Profiling kernels (`kprofile.py`)
Platform-specific optimizations are selected through dispatch tables:
- `dispatch_table_a100_gpu.mojo`: NVIDIA A100 optimizations
- `dispatch_table_amd.mojo`: AMD GPU optimizations
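As a rough, hypothetical illustration of the pattern (a compile-time parameter selecting a tuned configuration, keyed here on element type rather than GPU model):

```mojo
# Hypothetical sketch of dispatch-table-style selection: each case maps to a
# tuned block size, with a generic fallback. Resolved at compile time.
fn block_size_for[dtype: DType]() -> Int:
    @parameter
    if dtype == DType.bfloat16:
        return 128
    elif dtype == DType.float32:
        return 64
    else:
        return 32

fn main():
    print(block_size_for[DType.bfloat16]())
```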
Currently, external contributions are not being accepted, but you can:
- Report bugs through GitHub Issues
- Test kernels and provide feedback
- Stay updated through the Modular forum