examples : add llama-profiler-cpu/gpu for op roofline measurement #22495

Open

aukarande wants to merge 1 commit into ggml-org:master from aukarande:pshard/profiling

Conversation

@aukarande aukarande commented Apr 29, 2026

Overview

Pipelined sharding (pshard) is a CPU/GPU scheduling approach for VRAM-constrained inference. It combines prioritized VRAM placement, sub-layer sharding, CPU offload, and pipelined copy/compute. It has three stages: profiling, planning (#22691), and inference (#22692). First, we benchmark representative CPU and GPU kernels on the target machine. Then the planner uses those measurements to compare placement strategies within each token tier and writes the chosen schedules to a registry. At inference time, the runtime selects the smallest token tier that can cover the current n_tokens and uses that tier's schedule.
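For illustration, here is a minimal sketch of that tier-selection step with hypothetical names; the actual registry format and lookup are part of the planning and inference PRs:

// Hypothetical sketch of "pick the smallest token tier that covers n_tokens".
// Names are illustrative, not the actual pshard data structures.
#include <cstdint>
#include <vector>

struct tier_schedule {
    int64_t max_tokens;   // largest n_tokens this tier was planned for
    int     schedule_id;  // index into the registry written by the planner
};

// Returns the smallest tier whose capacity covers the current n_tokens,
// or nullptr if no tier is large enough.
static const tier_schedule * select_tier(const std::vector<tier_schedule> & tiers, int64_t n_tokens) {
    const tier_schedule * best = nullptr;
    for (const auto & t : tiers) {
        if (t.max_tokens >= n_tokens && (best == nullptr || t.max_tokens < best->max_tokens)) {
            best = &t;
        }
    }
    return best;
}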

This is meant to work without manual tuning. The runtime can discover an optimal ubatch size, adapt across context processing, decode, and multi-user decode, and choose schedules based on the current VRAM budget, CPU thread count, and PCIe bandwidth, without requiring manual tensor overrides. The full set of TTFT, TPS, and end-to-end results for a variety of models is included in the inference PR (#22692).

This work was proposed in a recent meeting with @ggerganov and @JohannesGaessler, and more details can be found in our paper: Efficient, VRAM-Constrained xLM Inference on Clients.

This PR adds the first phase: profiling. It adds two example binaries, llama-profiler-cpu and llama-profiler-gpu, to measure the FLOPs and effective bandwidth of the ggml ops that matter most for inference: MUL_MAT, MUL_MAT_ID, and FLASH_ATTN_EXT.
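As a rough sketch of what one such measurement looks like through the public ggml / ggml-backend API, here is a single F32 MUL_MAT timed on the CPU backend. The shape is illustrative, warmup/repeat runs and error handling are omitted, and the exact headers may differ across ggml versions; the real profiler sweeps the tables in profiler-common.h:

// Minimal sketch: build one MUL_MAT, run it on the CPU backend, report GFLOPS.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"
#include <chrono>
#include <cstdio>

int main() {
    const int64_t M = 4096, N = 512, K = 4096; // illustrative shape

    ggml_init_params ip = { ggml_tensor_overhead()*8 + ggml_graph_overhead(), nullptr, /*no_alloc =*/ true };
    ggml_context * ctx = ggml_init(ip);

    // tensors are left uninitialized here; a real benchmark fills them first
    ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, M); // "weights"
    ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, N); // "activations"
    ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_gallocr_t galloc  = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    const auto t0 = std::chrono::steady_clock::now();
    ggml_backend_graph_compute(backend, gf);
    const auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("MUL_MAT %lldx%lldx%lld: %.2f GFLOPS\n",
           (long long) M, (long long) N, (long long) K, 2.0*M*N*K/sec/1e9);

    ggml_gallocr_free(galloc);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}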

We sweep a range of shapes, quant types, and MoE/attention configs, and write out roofline-related metrics for each run, including arithmetic intensity, effective bandwidth, effective GFLOPs, and a ridge point for each (op, dtype) pair. We also re-measure CPU compute throughput while PCIe traffic is stressed in parallel, to capture contention effects. On GPU, the output header also includes estimated peak bandwidth and peak compute.
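For reference, the per-run metrics follow the standard roofline definitions; a small sketch with illustrative names:

// Roofline bookkeeping (standard definitions; names are illustrative).
struct roofline_sample {
    double flops;    // floating-point ops performed by the op
    double bytes;    // bytes read + written by the op
    double seconds;  // measured wall time
};

static double intensity(const roofline_sample & s)    { return s.flops / s.bytes;          } // FLOP/byte
static double gflops(const roofline_sample & s)       { return s.flops / s.seconds / 1e9;  }
static double bandwidth_gbs(const roofline_sample & s){ return s.bytes / s.seconds / 1e9;  } // GB/s

// Ridge point for an (op, dtype) pair: the arithmetic intensity at which the
// machine moves from bandwidth-bound to compute-bound.
static double ridge_point(double peak_gflops, double peak_bandwidth_gbs) {
    return peak_gflops / peak_bandwidth_gbs; // FLOP/byte
}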

This PR does not change any core or library code. Both binaries are standalone examples built on top of the public ggml / ggml-backend API and live under examples/.

Additional information

What gets measured

CPU (llama-profiler-cpu)

  • DRAM streaming bandwidth
  • Host-to-device PCIe bandwidth in isolation
  • CPU compute throughput re-measured while PCIe traffic is stressed in parallel, to capture contention effects (see the sketch after this list)
  • MUL_MAT and MUL_MAT_ID across the matrix-size and quantization tables in profiler-common.h
  • FLASH_ATTN_EXT across the attention configs from get_attn_configs() (KV is F16 only)
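The concurrent PCIe-stress measurement is conceptually simple; here is a hedged sketch of the idea (not the PR's actual code) using the ggml-backend copy API, with the stress-loop and benchmark names being illustrative:

// Sketch of the "CPU compute while PCIe is stressed" measurement: a background
// thread streams a host buffer into a device-resident tensor in a loop while
// the main thread re-runs the CPU MUL_MAT sweep.
#include "ggml-backend.h"
#include <atomic>
#include <thread>

// dev_tensor is assumed to be allocated in a GPU backend buffer beforehand
static void pcie_stress_loop(ggml_tensor * dev_tensor, const void * host_buf,
                             size_t nbytes, std::atomic<bool> & stop) {
    while (!stop.load(std::memory_order_relaxed)) {
        // host -> device copy through the backend API; keeps the PCIe link busy
        ggml_backend_tensor_set(dev_tensor, host_buf, 0, nbytes);
    }
}

// usage (sketch):
//   std::atomic<bool> stop{false};
//   std::thread stress(pcie_stress_loop, dev_tensor, host_buf, nbytes, std::ref(stop));
//   run_cpu_mul_mat_sweep();   // hypothetical: the same sweep as the isolated run
//   stop = true;
//   stress.join();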

GPU (llama-profiler-gpu)

  • MUL_MAT, MUL_MAT_ID, and FLASH_ATTN_EXT across the same tables (KV is F16 only)
  • Configurations that exceed available device memory or whose output size exceeds INT_MAX are skipped (see the sketch after this list)
  • Ridge point is reported for each (op, dtype) pair
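A sketch of the skip check described above; the helper and variable names are assumptions, not the profiler's actual code:

// Skip configurations that either do not fit in free device memory or whose
// output element count would overflow INT_MAX.
#include <climits>
#include <cstdint>

static bool should_skip(size_t required_bytes, size_t free_device_bytes,
                        int64_t out_rows, int64_t out_cols) {
    if (required_bytes > free_device_bytes) {
        return true; // tensors + workspace would not fit in VRAM
    }
    if (out_rows * out_cols > (int64_t) INT_MAX) {
        return true; // output element count exceeds INT_MAX
    }
    return false;
}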

Both binaries write a plain-text results file (cpu_profile.txt / gpu_profile.txt by default, overridable with --output).

Usage

./build/bin/llama-profiler-gpu                     # fast mode, common shapes only (default: --fast)
./build/bin/llama-profiler-gpu --full              # full sweep
./build/bin/llama-profiler-cpu --threads 16

Requirements

  • I have read and agree with the contributing guidelines.
  • AI usage disclosure: Yes. I used AI for some cleanup work (formatting, comment polish) and while debugging. The design, op selection, configuration tables in profiler-common.h, and the concurrent PCIe-stress measurement approach are ours. I manually reviewed the code and tested both binaries on multiple machines.

ggml-gh-bot Bot commented Apr 29, 2026

Hi @aukarande, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
