Progressive GPU kernel optimizations (K0–K4), inspired by Cedric Nugteren's OpenCL SGEMM tutorial. All kernels compute C = A * B and are parametrized by element type and element-wise operations.
| Kernel | Description |
|---|---|
| K0 | Naive: each thread computes one output element, adding each pairwise product directly to the global memory cell of the result matrix |
| K1 | Local accumulator: each thread computes one output element using a mutable local register before writing to global memory once |
| K2 | Local memory tiling: tiles of both input matrices are loaded into local memory for reuse, each thread computes one output element |
| K3 | Increased work per thread: each thread computes WPT output elements from tiles in local memory |
| K4 | 2D register blocking: each thread computes a TTS × TTS tile of the output for maximal data reuse |
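To make the progression concrete, below is a minimal CUDA-style sketch of the first three stages. It is illustrative only: the kernel names, the `float` element type, the fixed `TILE` size, and the assumption that the matrix size is a multiple of the tile size are all choices made for this sketch, not the project's parametrized kernels.

```cuda
// Illustrative sketch of the K0 -> K1 -> K2 progression (not the project's kernels).
// Assumes square n x n row-major matrices with n a multiple of TILE,
// launched with blockDim = (TILE, TILE) and gridDim = (n / TILE, n / TILE).
#define TILE 16

// K0: every partial product is accumulated straight into global memory.
__global__ void k0_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    C[row * n + col] = 0.0f;
    for (int k = 0; k < n; ++k)
        C[row * n + col] += A[row * n + k] * B[k * n + col];  // global read-modify-write every step

// K1: accumulate in a register, write the result to global memory once.
}
__global__ void k1_register(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;  // single global write per output element
}

// K2: stage TILE x TILE blocks of A and B in local (shared) memory so that
// each loaded element is reused TILE times by the work-group.
__global__ void k2_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float Asub[TILE][TILE];
    __shared__ float Bsub[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        Asub[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bsub[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // tile fully loaded before use
        for (int k = 0; k < TILE; ++k)
            acc += Asub[threadIdx.y][k] * Bsub[k][threadIdx.x];
        __syncthreads();  // tile fully consumed before it is overwritten
    }
    C[row * n + col] = acc;
}
```

K3 and K4 keep the same local-memory staging but let each thread compute several outputs from the staged tiles; a sketch of that idea follows the benchmark class table below.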
The `benchmarks/MatrixMultiplication.Benchmarks/` project uses BenchmarkDotNet to measure GPU kernel execution times for all 5 matrix multiplication kernels (K0–K4) across matrix sizes 256–2048, various work-group configurations, and all OpenCL platforms (POCL, Nvidia, Intel GPU).
Because the benchmarks iterate over every supported configuration, they can also serve as a tuner for choosing the optimal kernel configuration for a particular device.
| Class | Extra params | Kernel |
|---|---|---|
| `K0Benchmark` | — | `multiplyKernel0` |
| `K1Benchmark` | — | `multiplyKernel1` |
| `K2Benchmark` | — | `multiplyKernel2` |
| `K3Benchmark` | WPT: 1, 2, 4, 8 | `multiplyKernel3` with `workPerThread` |
| `K4Benchmark` | TTS: 1, 2, 4, 8 | `multiplyKernel4` with `threadTileSize` |
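WPT and TTS control how many output elements a single thread computes. As a hedged sketch of the WPT idea only (again not the project's `multiplyKernel3`; the block shape, the fixed `TILE`, and the divisibility assumption below are choices of this sketch), each thread accumulates WPT results from the same shared-memory tiles, amortizing the tile loads. K4's TTS generalizes this to a TTS × TTS register tile per thread.

```cuda
// Illustrative sketch of the K3 "work per thread" idea (not the project's kernel).
// Assumes n is a multiple of TILE and the kernel is launched with
// blockDim = (TILE / WPT, TILE) and gridDim = (n / TILE, n / TILE).
#define TILE 16
#define WPT  4  // work per thread; the benchmarks sweep 1, 2, 4, 8

__global__ void k3_work_per_thread(const float* A, const float* B, float* C, int n) {
    __shared__ float Asub[TILE][TILE];
    __shared__ float Bsub[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int colBase = blockIdx.x * TILE + threadIdx.x * WPT;
    float acc[WPT] = {0.0f};  // WPT accumulators held in registers

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads WPT elements of each tile.
        for (int w = 0; w < WPT; ++w) {
            int c = threadIdx.x * WPT + w;
            Asub[threadIdx.y][c] = A[row * n + t * TILE + c];
            Bsub[threadIdx.y][c] = B[(t * TILE + threadIdx.y) * n + blockIdx.x * TILE + c];
        }
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            for (int w = 0; w < WPT; ++w)
                acc[w] += Asub[threadIdx.y][k] * Bsub[k][threadIdx.x * WPT + w];
        __syncthreads();
    }
    for (int w = 0; w < WPT; ++w)
        C[row * n + colBase + w] = acc[w];  // WPT outputs per thread
}
```

Larger WPT or TTS values amortize tile loads and index arithmetic over more results per thread but consume more registers, so the best setting is device-dependent; sweeping 1, 2, 4, 8 is what lets the benchmarks double as a tuner.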
Common parameters across all classes:
- N — matrix size: 256, 512, 1024, 2048
- LWS — local work size: 8, 16, 32, 64, 128, 256 (device-dependent, some values may be invalid)
- Device — OpenCL platform: POCL, Nvidia, IntelGPU (iterated by BenchmarkDotNet via `[Params]`)
- Measurement: posts the kernel command (asynchronously via `MailboxProcessor`), then synchronizes with `CreateToHostMsg` on a 1-element buffer, so wall-clock GPU execution time is measured
- Data transfer excluded: buffers are allocated and filled with random data in `[GlobalSetup]`, outside the timed portion
- Cleanup: `CreateFreeMsg` on all `ClArray` buffers in `[GlobalCleanup]`
- Invalid configs: fail in `[GlobalSetup]` with a descriptive message → BenchmarkDotNet marks them as NA and continues
```bash
# Full run: all kernels on all devices
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks

# Quick smoke test (ShortRun, single kernel)
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --job short --filter '*K0Benchmark*'

# Run only a selected kernel
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --filter '*K3Benchmark*'
```

Arguments after `--` (such as `--filter`, `--job`, `--stopOnFirstError`) are passed through to BenchmarkDotNet's own command-line parser. Results are exported as CSV, Markdown, and HTML to `BenchmarkDotNet.Artifacts/results/`.
The Python script `benchmarks/analyze_benchmarks.py` with `--mode mxm` reads all 5 kernel CSVs and generates 6 comparison plots:
- Best config per kernel (2×3 grid): for each kernel K0–K4, bars show the minimum execution time per (N, Device) pair, annotated with the configuration that achieved that minimum (LWS for K0–K2; LWS + WPT for K3; LWS + TTS for K4)
- Per-device all configurations (1×3 grid): for each device, all valid configs as individual bars grouped by N then by kernel
- HPC-style scaling overview (2×3 grid): log-log lines, one subplot per kernel, 3 device lines annotated with best config
- Speedup over K0 (2×2 grid): bars showing ratio relative to K0 baseline for K1–K4
- Speedup stepwise (2×2 grid): successive speedup K1/K0 → K4/K3
- Config sensitivity (2×3 grid): worst/best time ratio per (N, Device), composite best/worst label inside each bar
```bash
python benchmarks/analyze_benchmarks.py --mode mxm
```

Examples of plots: