examples : add llama-profiler-cpu/gpu for op roofline measurement #22495
Open
aukarande wants to merge 1 commit into ggml-org:master
Conversation
Hi @aukarande, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Overview
Pipelined sharding (`pshard`) is a CPU/GPU scheduling approach for VRAM-constrained inference. It combines prioritized VRAM placement, sub-layer sharding, CPU offload, and pipelined copy/compute. It has three stages: profiling, planning (#22691), and inference (#22692). First, we benchmark representative CPU and GPU kernels on the target machine. Then the planner uses those measurements to compare placement strategies within each token tier and writes the chosen schedules to a registry. At inference time, the runtime selects the smallest token tier that can cover the current `n_tokens` and uses that tier's schedule (a minimal sketch of this lookup appears at the end of this overview).

This is meant to work without manual tuning. The runtime can discover an optimal `ubatch` size, adapt across context, decode, and multi-user decode, and choose schedules based on the current VRAM budget, CPU thread count, and PCIe bandwidth, without requiring manual tensor overrides. The full set of TTFT, TPS, and end-to-end results for a variety of models is included in the inference PR (#22692).

This work was proposed in a recent meeting with @ggerganov and @JohannesGaessler, and more details can be found in our paper: Efficient, VRAM-Constrained xLM Inference on Clients.
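To make the tier mechanism concrete, here is a minimal sketch of the lookup described above. All names (`tier_schedule`, `pick_tier`, `n_tokens_max`) are invented for illustration and are not the API introduced by this PR series:

```cpp
// Illustrative sketch only: pick the smallest profiled token tier whose
// planned batch size still covers the current n_tokens.
#include <cstdint>
#include <vector>

struct tier_schedule {
    int32_t n_tokens_max; // largest batch this tier's schedule was planned for
    // ... placement and pipelining decisions emitted by the planner
};

// expects `tiers` sorted by ascending n_tokens_max
static const tier_schedule * pick_tier(const std::vector<tier_schedule> & tiers, int32_t n_tokens) {
    for (const auto & t : tiers) {
        if (t.n_tokens_max >= n_tokens) {
            return &t; // smallest tier that covers the current batch
        }
    }
    // batch larger than any profiled tier: fall back to the largest one
    return tiers.empty() ? nullptr : &tiers.back();
}
```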
This PR adds the first phase: profiling. It adds two example binaries, `llama-profiler-cpu` and `llama-profiler-gpu`, to measure the FLOPs and effective bandwidth of the ggml ops that matter most for inference: `MUL_MAT`, `MUL_MAT_ID`, and `FLASH_ATTN_EXT`.

We sweep a range of shapes, quant types, and MoE/attention configs, and write out roofline-related metrics for each run, including arithmetic intensity, effective bandwidth, effective GFLOPs, and a ridge point for each `(op, dtype)` pair. We also re-measure CPU compute throughput while PCIe traffic is stressed in parallel, to capture contention effects. On GPU, the output header also includes estimated peak bandwidth and peak compute. The worked example below shows how these roofline quantities relate.
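All numbers in this example are invented for illustration, not profiler output:

```cpp
// Worked roofline example with made-up numbers.
#include <cstdio>

int main() {
    // One MUL_MAT of M x K by K x N costs ~2*M*N*K FLOPs.
    const double flops = 2.0 * 4096 * 4096 * 512;
    const double bytes = 1.0e8;   // assumed bytes moved (weights + activations + output)
    const double t_sec = 2.0e-2;  // assumed measured wall time

    const double intensity  = flops / bytes;        // arithmetic intensity, FLOP/byte
    const double eff_gflops = flops / t_sec / 1e9;  // effective compute throughput
    const double eff_bw_gbs = bytes / t_sec / 1e9;  // effective bandwidth, GB/s

    // Ridge point: the intensity above which a device stops being bandwidth-bound.
    const double peak_gflops = 1000.0; // assumed device peak compute
    const double peak_bw_gbs = 50.0;   // assumed device peak bandwidth
    const double ridge       = peak_gflops / peak_bw_gbs;

    printf("AI = %.1f FLOP/B, eff = %.1f GFLOPS, bw = %.1f GB/s, ridge = %.1f FLOP/B\n",
           intensity, eff_gflops, eff_bw_gbs, ridge);
    printf("this run is %s-bound\n", intensity < ridge ? "bandwidth" : "compute");
    return 0;
}
```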
This PR does not change any core or library code. Both binaries are standalone examples built on top of the public `ggml`/`ggml-backend` API and live under `examples/`.

Additional information
What gets measured
CPU (`llama-profiler-cpu`):

- `MUL_MAT` and `MUL_MAT_ID` across the matrix-size and quantization tables in `profiler-common.h`
- `FLASH_ATTN_EXT` across the attention configs from `get_attn_configs()` (`KV` is F16 only)

GPU (`llama-profiler-gpu`):

- `MUL_MAT`, `MUL_MAT_ID`, and `FLASH_ATTN_EXT` across the same tables (`KV` is F16 only)
- Cases whose output size would exceed `INT_MAX` are skipped
- Roofline metrics and a ridge point are reported per `(op, dtype)` pair

A hedged sketch of this kind of single-op measurement harness follows the lists.
Both binaries write a plain-text results file (`cpu_profile.txt` / `gpu_profile.txt` by default, overridable with `--output`).

Usage
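Assuming the binaries build like other llama.cpp examples, a minimal invocation uses only the `--output` override documented above (any other flags are not covered by this text), e.g. `llama-profiler-cpu --output cpu_profile.txt` or `llama-profiler-gpu --output gpu_profile.txt`.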
Requirements
`profiler-common.h`, and the concurrent PCIe-stress measurement approach are ours. I manually reviewed the code and tested both binaries on multiple machines.