Progressive GPU kernel optimizations (K0–K4), inspired by Cedric Nugteren's OpenCL SGEMM tutorial. All kernels compute C = A * B and are parametrized by element type and element-wise operations.
| Kernel | Description |
|---|---|
| K0 | Naive: each thread computes one output element, adding each pairwise product directly to the global memory cell of the result matrix |
| K1 | Local accumulator: each thread computes one output element using a mutable local register before writing to global memory once |
| K2 | Local memory tiling: tiles of both input matrices are loaded into local memory for reuse, each thread computes one output element |
| K3 | Increased work per thread: each thread computes WPT output elements from tiles in local memory |
| K4 | 2D register blocking: each thread computes a TTS × TTS tile of the output for maximal data reuse |
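To make the progression concrete, below is a minimal CUDA-style sketch of the first three stages. It is illustrative only: the kernel names, the `float` element type, the fixed `TILE` size, and the assumption that the matrix size is a multiple of the tile size are all choices made for this sketch, not the project's parametrized kernels.

```cuda
// Illustrative sketch of the K0 -> K1 -> K2 progression (not the project's kernels).
// Assumes square n x n row-major matrices with n a multiple of TILE,
// launched with blockDim = (TILE, TILE) and gridDim = (n / TILE, n / TILE).
#define TILE 16

// K0: every partial product is accumulated straight into global memory.
__global__ void k0_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    C[row * n + col] = 0.0f;
    for (int k = 0; k < n; ++k)
        C[row * n + col] += A[row * n + k] * B[k * n + col];  // global read-modify-write every step

// K1: accumulate in a register, write the result to global memory once.
}
__global__ void k1_register(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;  // single global write per output element
}

// K2: stage TILE x TILE blocks of A and B in local (shared) memory so that
// each loaded element is reused TILE times by the work-group.
__global__ void k2_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float Asub[TILE][TILE];
    __shared__ float Bsub[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        Asub[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bsub[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // tile fully loaded before use
        for (int k = 0; k < TILE; ++k)
            acc += Asub[threadIdx.y][k] * Bsub[k][threadIdx.x];
        __syncthreads();  // tile fully consumed before it is overwritten
    }
    C[row * n + col] = acc;
}
```

K3 and K4 keep the same local-memory staging but let each thread compute several outputs from the staged tiles; a sketch of that idea follows the benchmark class table below.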
The `benchmarks/MatrixMultiplication.Benchmarks/` project uses BenchmarkDotNet to measure GPU kernel execution times for all 5 matrix multiplication kernels (K0–K4) across matrix sizes 256–2048, various work-group configurations, and all OpenCL platforms (POCL, Nvidia, Intel GPU).
Because the benchmarks iterate over every supported configuration, they can also serve as a tuner for choosing the optimal kernel configuration for a particular device.
| Class | Extra params | Kernel |
|---|---|---|
| `K0Benchmark` | — | `multiplyKernel0` |
| `K1Benchmark` | — | `multiplyKernel1` |
| `K2Benchmark` | — | `multiplyKernel2` |
| `K3Benchmark` | WPT: 1, 2, 4, 8 | `multiplyKernel3` with `workPerThread` |
| `K4Benchmark` | TTS: 1, 2, 4, 8 | `multiplyKernel4` with `threadTileSize` |
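WPT and TTS control how many output elements a single thread computes. As a hedged sketch of the WPT idea only (again not the project's `multiplyKernel3`; the block shape, the fixed `TILE`, and the divisibility assumption below are choices of this sketch), each thread accumulates WPT results from the same shared-memory tiles, amortizing the tile loads. K4's TTS generalizes this to a TTS × TTS register tile per thread.

```cuda
// Illustrative sketch of the K3 "work per thread" idea (not the project's kernel).
// Assumes n is a multiple of TILE and the kernel is launched with
// blockDim = (TILE / WPT, TILE) and gridDim = (n / TILE, n / TILE).
#define TILE 16
#define WPT  4  // work per thread; the benchmarks sweep 1, 2, 4, 8

__global__ void k3_work_per_thread(const float* A, const float* B, float* C, int n) {
    __shared__ float Asub[TILE][TILE];
    __shared__ float Bsub[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int colBase = blockIdx.x * TILE + threadIdx.x * WPT;
    float acc[WPT] = {0.0f};  // WPT accumulators held in registers

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads WPT elements of each tile.
        for (int w = 0; w < WPT; ++w) {
            int c = threadIdx.x * WPT + w;
            Asub[threadIdx.y][c] = A[row * n + t * TILE + c];
            Bsub[threadIdx.y][c] = B[(t * TILE + threadIdx.y) * n + blockIdx.x * TILE + c];
        }
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            for (int w = 0; w < WPT; ++w)
                acc[w] += Asub[threadIdx.y][k] * Bsub[k][threadIdx.x * WPT + w];
        __syncthreads();
    }
    for (int w = 0; w < WPT; ++w)
        C[row * n + colBase + w] = acc[w];  // WPT outputs per thread
}
```

Larger WPT or TTS values amortize tile loads and index arithmetic over more results per thread but consume more registers, so the best setting is device-dependent; sweeping 1, 2, 4, 8 is what lets the benchmarks double as a tuner.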
Common parameters across all classes:
- N — matrix size: 256, 512, 1024, 2048
- LWS — local work size: 8, 16, 32, 64, 128, 256 (device-dependent, some values may be invalid)
- Device — OpenCL platform: POCL, Nvidia, IntelGPU (iterated by BenchmarkDotNet via `[Params]`)
- Measurement: posts the kernel command (asynchronously via `MailboxProcessor`), then synchronizes with `CreateToHostMsg` on a 1-element buffer, so wall-clock GPU execution time is measured
- Data transfer excluded: buffers are allocated and filled with random data in `[GlobalSetup]`, outside the timed portion
- Cleanup: `CreateFreeMsg` on all `ClArray` buffers in `[GlobalCleanup]`
- Invalid configs: fail in `[GlobalSetup]` with a descriptive message → BenchmarkDotNet marks them as NA and continues
```bash
# Full run: all kernels on all devices
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks

# Quick smoke test (ShortRun, single kernel)
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --job short --filter '*K0Benchmark*'

# Run only a selected kernel
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --filter '*K3Benchmark*'
```

Arguments after `--` (such as `--filter`, `--job`, `--stopOnFirstError`) are passed through to BenchmarkDotNet's own command-line parser. Results are exported as CSV, Markdown, and HTML to `BenchmarkDotNet.Artifacts/results/`.
The Python script `benchmarks/analyze_benchmarks.py` with `--mode mxm` reads all 5 kernel CSVs and generates 6 comparison plots:
- Best config per kernel (2×3 grid): for each kernel K0–K4, bars show the minimum execution time per (N, Device) pair, annotated with the configuration that achieved that minimum (LWS for K0–K2; LWS + WPT for K3; LWS + TTS for K4)
- Per-device all configurations (1×3 grid): for each device, all valid configs as individual bars grouped by N then by kernel
- HPC-style scaling overview (2×3 grid): log-log lines, one subplot per kernel, 3 device lines annotated with best config
- Speedup over K0 (2×2 grid): bars showing ratio relative to K0 baseline for K1–K4
- Speedup stepwise (2×2 grid): successive speedup K1/K0 → K4/K3
- Config sensitivity (2×3 grid): worst/best time ratio per (N, Device), composite best/worst label inside each bar
```bash
python benchmarks/analyze_benchmarks.py --mode mxm
```

Examples of plots: