examples : add llama-profiler-cpu/gpu for op roofline measurement #22495

Open

aukarande wants to merge 1 commit into ggml-org:master from aukarande:pshard/profiling

Conversation

@aukarande aukarande commented Apr 29, 2026

Overview

Pipelined sharding (pshard) is a CPU/GPU scheduling approach for VRAM-constrained inference. It combines prioritized VRAM placement, sub-layer sharding, CPU offload, and pipelined copy/compute. It has three stages: profiling, planning (#22691), and inference (#22692). First, we benchmark representative CPU and GPU kernels on the target machine. Then the planner uses those measurements to compare placement strategies within each token tier and writes the chosen schedules to a registry. At inference time, the runtime selects the smallest token tier that can cover the current n_tokens and uses that tier's schedule.
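For illustration, here is a minimal sketch of that tier-selection step with hypothetical names; the actual registry format and lookup are part of the planning and inference PRs:

// Hypothetical sketch of "pick the smallest token tier that covers n_tokens".
// Names are illustrative, not the actual pshard data structures.
#include <cstdint>
#include <vector>

struct tier_schedule {
    int64_t max_tokens;   // largest n_tokens this tier was planned for
    int     schedule_id;  // index into the registry written by the planner
};

// Returns the smallest tier whose capacity covers the current n_tokens,
// or nullptr if no tier is large enough.
static const tier_schedule * select_tier(const std::vector<tier_schedule> & tiers, int64_t n_tokens) {
    const tier_schedule * best = nullptr;
    for (const auto & t : tiers) {
        if (t.max_tokens >= n_tokens && (best == nullptr || t.max_tokens < best->max_tokens)) {
            best = &t;
        }
    }
    return best;
}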

This is meant to work without manual tuning. The runtime can discover an optimal ubatch size, adapt across context processing, decode, and multi-user decode, and choose schedules based on the current VRAM budget, CPU thread count, and PCIe bandwidth, without requiring manual tensor overrides. The full set of TTFT, TPS, and end-to-end results for a variety of models is included in the inference PR (#22692).

This work was proposed in a recent meeting with @ggerganov and @JohannesGaessler, and more details can be found in our paper: Efficient, VRAM-Constrained xLM Inference on Clients.

This PR adds the first phase: profiling. It adds two example binaries, llama-profiler-cpu and llama-profiler-gpu, to measure the FLOPs and effective bandwidth of the ggml ops that matter most for inference: MUL_MAT, MUL_MAT_ID, and FLASH_ATTN_EXT.
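As a rough sketch of what one such measurement looks like through the public ggml / ggml-backend API, here is a single F32 MUL_MAT timed on the CPU backend. The shape is illustrative, warmup/repeat runs and error handling are omitted, and the exact headers may differ across ggml versions; the real profiler sweeps the tables in profiler-common.h:

// Minimal sketch: build one MUL_MAT, run it on the CPU backend, report GFLOPS.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"
#include <chrono>
#include <cstdio>

int main() {
    const int64_t M = 4096, N = 512, K = 4096; // illustrative shape

    ggml_init_params ip = { ggml_tensor_overhead()*8 + ggml_graph_overhead(), nullptr, /*no_alloc =*/ true };
    ggml_context * ctx = ggml_init(ip);

    // tensors are left uninitialized here; a real benchmark fills them first
    ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, M); // "weights"
    ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, N); // "activations"
    ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_gallocr_t galloc  = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    const auto t0 = std::chrono::steady_clock::now();
    ggml_backend_graph_compute(backend, gf);
    const auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("MUL_MAT %lldx%lldx%lld: %.2f GFLOPS\n",
           (long long) M, (long long) N, (long long) K, 2.0*M*N*K/sec/1e9);

    ggml_gallocr_free(galloc);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}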

We sweep a range of shapes, quant types, and MoE/attention configs, and write out roofline-related metrics for each run, including arithmetic intensity, effective bandwidth, effective GFLOPs, and a ridge point for each (op, dtype) pair. We also re-measure CPU compute throughput while PCIe traffic is stressed in parallel, to capture contention effects. On GPU, the output header also includes estimated peak bandwidth and peak compute.
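For reference, the per-run metrics follow the standard roofline definitions; a small sketch with illustrative names:

// Roofline bookkeeping (standard definitions; names are illustrative).
struct roofline_sample {
    double flops;    // floating-point ops performed by the op
    double bytes;    // bytes read + written by the op
    double seconds;  // measured wall time
};

static double intensity(const roofline_sample & s)    { return s.flops / s.bytes;          } // FLOP/byte
static double gflops(const roofline_sample & s)       { return s.flops / s.seconds / 1e9;  }
static double bandwidth_gbs(const roofline_sample & s){ return s.bytes / s.seconds / 1e9;  } // GB/s

// Ridge point for an (op, dtype) pair: the arithmetic intensity at which the
// machine moves from bandwidth-bound to compute-bound.
static double ridge_point(double peak_gflops, double peak_bandwidth_gbs) {
    return peak_gflops / peak_bandwidth_gbs; // FLOP/byte
}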

This PR does not change any core or library code. Both binaries are standalone examples built on top of the public ggml / ggml-backend API and live under examples/.

Additional information

What gets measured

CPU (llama-profiler-cpu)

  • DRAM streaming bandwidth
  • Host-to-device PCIe bandwidth in isolation
  • CPU compute throughput re-measured while PCIe traffic is stressed in parallel, to capture contention effects (see the sketch after this list)
  • MUL_MAT and MUL_MAT_ID across the matrix-size and quantization tables in profiler-common.h
  • FLASH_ATTN_EXT across the attention configs from get_attn_configs() (KV is F16 only)
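The concurrent PCIe-stress measurement is conceptually simple; here is a hedged sketch of the idea (not the PR's actual code) using the ggml-backend copy API, with the stress-loop and benchmark names being illustrative:

// Sketch of the "CPU compute while PCIe is stressed" measurement: a background
// thread streams a host buffer into a device-resident tensor in a loop while
// the main thread re-runs the CPU MUL_MAT sweep.
#include "ggml-backend.h"
#include <atomic>
#include <thread>

// dev_tensor is assumed to be allocated in a GPU backend buffer beforehand
static void pcie_stress_loop(ggml_tensor * dev_tensor, const void * host_buf,
                             size_t nbytes, std::atomic<bool> & stop) {
    while (!stop.load(std::memory_order_relaxed)) {
        // host -> device copy through the backend API; keeps the PCIe link busy
        ggml_backend_tensor_set(dev_tensor, host_buf, 0, nbytes);
    }
}

// usage (sketch):
//   std::atomic<bool> stop{false};
//   std::thread stress(pcie_stress_loop, dev_tensor, host_buf, nbytes, std::ref(stop));
//   run_cpu_mul_mat_sweep();   // hypothetical: the same sweep as the isolated run
//   stop = true;
//   stress.join();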

GPU (llama-profiler-gpu)

  • MUL_MAT, MUL_MAT_ID, and FLASH_ATTN_EXT across the same tables (KV is F16 only)
  • Configurations that exceed available device memory or whose output size exceeds INT_MAX are skipped (see the sketch after this list)
  • Ridge point is reported for each (op, dtype) pair
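A sketch of the skip check described above; the helper and variable names are assumptions, not the profiler's actual code:

// Skip configurations that either do not fit in free device memory or whose
// output element count would overflow INT_MAX.
#include <climits>
#include <cstdint>

static bool should_skip(size_t required_bytes, size_t free_device_bytes,
                        int64_t out_rows, int64_t out_cols) {
    if (required_bytes > free_device_bytes) {
        return true; // tensors + workspace would not fit in VRAM
    }
    if (out_rows * out_cols > (int64_t) INT_MAX) {
        return true; // output element count exceeds INT_MAX
    }
    return false;
}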

Both binaries write a plain-text results file (cpu_profile.txt / gpu_profile.txt by default, overridable with --output).

Usage

./build/bin/llama-profiler-gpu                     # fast mode, common shapes only (default: --fast)
./build/bin/llama-profiler-gpu --full              # full sweep
./build/bin/llama-profiler-cpu --threads 16

Requirements

  • I have read and agree with the contributing guidelines.
  • AI usage disclosure: Yes. I used AI for some cleanup work (formatting, comment polish) and while debugging. The design, op selection, configuration tables in profiler-common.h, and the concurrent PCIe-stress measurement approach are ours. I manually reviewed the code and tested both binaries on multiple machines.

ggml-gh-bot Bot commented Apr 29, 2026

Hi @aukarande, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
