tools: add llama-pshard-plan-params for token-tiered placement planning #22691

aukarande wants to merge 1 commit into ggml-org:master
Conversation
Overview
Pipelined sharding (`pshard`) is a CPU/GPU scheduling approach for VRAM-constrained inference. It combines prioritized VRAM placement, sub-layer sharding, CPU offload, and pipelined copy/compute. It has three stages: profiling (#22495), planning, and inference (#22692). First, we benchmark representative CPU and GPU kernels on the target machine. Then the planner uses those measurements to compare placement strategies within each token tier and writes the chosen schedules to a registry. At inference time, the runtime selects the smallest token tier that can cover the current `n_tokens` and uses that tier's schedule.

This is meant to work without manual tuning. The runtime can discover an optimal `ubatch` size, adapt across context processing, decode, and multi-user decode, and choose schedules based on the current VRAM budget, CPU thread count, and PCIe bandwidth, without requiring manual tensor overrides. The full set of TTFT, TPS, and end-to-end results for a variety of models is included in the inference PR (#22692).

This work was proposed in a recent meeting with @ggerganov and @JohannesGaessler, and more details can be found in our paper: Efficient, VRAM-Constrained xLM Inference on Clients.
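To make the tier rule concrete, here is a minimal sketch of the selection described above: pick the smallest tier whose batch size covers the current `n_tokens`. The function name, tier-list representation, and fallback behavior are illustrative assumptions, not code from this PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: given the planned tiers (non-empty, sorted ascending,
// e.g. {1, 16, 512, ...}), return the smallest tier that can cover n_tokens.
// Falling back to the largest tier when n_tokens exceeds all of them is an
// assumption, not specified by the PR.
static int32_t pshard_select_tier(const std::vector<int32_t> & tiers, int32_t n_tokens) {
    for (const int32_t tier : tiers) {
        if (tier >= n_tokens) {
            return tier;
        }
    }
    return tiers.back();
}
```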
This PR adds the second phase: planning. Given a model, target VRAM budget, token tiers, and roofline profiles from `llama-profiler-{cpu,gpu}` (#22495), the planner predicts TPS for each candidate placement and writes the selected plans to a plain-text registry next to the model.

The implementation in this PR builds on @JohannesGaessler's llama-fit-params (#16653) and @am17an's tensor-override prefetching work (#21067). Compared to our original from-scratch implementation for the paper, this keeps the tool elegant and vendor-agnostic.
Additional information
User Interface
New tool:

```
./build/bin/llama-pshard-plan-params \
    -m /path/to/model.gguf \
    -c 8192 \
    -fa on \
    -mva 12000
```
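This example plans against an 8192-token context (`-c 8192`) with Flash Attention enabled (`-fa on`) and a 12000 MiB VRAM planning budget (`-mva 12000`).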
Useful options:

- `-mva, --max-vram-alloc <MiB>`: planning budget. `0` or omitted uses currently free VRAM.
- `--pshard-tier-max <N>`: largest batch-size tier to probe. The default is `min(max(n_batch, 16384), n_ctx)`.
- `PSHARD_STRATEGY`: force one strategy (`STATIC_FITPARAMS_DENSEPRIO_MOEONLY`, `STATIC_ATTNPRIO_ALLMODELS`, `DYNAMIC_FFNCPU_ATTNSTREAM`, `GPUONLY_LAYERPIN_LAYERSTREAM`, or `GPUONLY_ATTNPIN_FFNSTREAM`).
- `PSHARD_CPU_PROFILE` / `PSHARD_GPU_PROFILE`: override the profiler inputs used by the throughput predictor.

The registry is plain text. Each tier stores the chosen strategy, predicted TPS, VRAM requirement, and an extended tensor override list.
A `backend_id` in `ot=` entries lets the planner target a specific scheduler backend instead of only a buffer type. In the current pshard layout:

- `0`: GPU pinned compute
- `1`, `2`: shard compute lanes used for overlap
- `3`: CPU
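As a reading aid, the lane layout above could be written as the following enum. The identifier names are illustrative assumptions; only the numeric IDs and their roles come from this PR.

```cpp
#include <cstdint>

// Assumed names for the scheduler backend IDs used in ot= override entries.
enum pshard_backend_id : int32_t {
    PSHARD_BACKEND_GPU_PINNED = 0, // GPU pinned compute
    PSHARD_BACKEND_SHARD_A    = 1, // shard compute lane used for overlap
    PSHARD_BACKEND_SHARD_B    = 2, // shard compute lane used for overlap
    PSHARD_BACKEND_CPU        = 3, // CPU
};
```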
Implementation Details

- The planner entry point is `llama_params_fit_pshard()`. It uses no-alloc probe models and contexts to estimate model, cache, and compute memory without loading weights for inference.
- The planner probes a set of token tiers (`bs=1`, `bs=16`, `bs=512`, ...) and picks the best placement for each tier.
- `ATTN` refers to the attention/dense side of the layer, as opposed to FFN/MoE weights.
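The overall planning flow, as described in this PR, looks roughly like the loop below. Every identifier is a stand-in for illustration; only the five strategy names and the per-tier structure come from the PR text.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// The five candidate strategies this PR defines.
enum pshard_strategy {
    STATIC_FITPARAMS_DENSEPRIO_MOEONLY,
    STATIC_ATTNPRIO_ALLMODELS,
    DYNAMIC_FFNCPU_ATTNSTREAM,
    GPUONLY_LAYERPIN_LAYERSTREAM,
    GPUONLY_ATTNPIN_FFNSTREAM,
};

struct pshard_plan {            // hypothetical shape of a candidate plan
    pshard_strategy strategy;   // which schedule produced this plan
    double          tps;        // predicted tokens/s (0 when no profiles)
    int64_t         vram_bytes; // measured VRAM requirement
    bool            viable;     // fits the VRAM budget
};

// Assumed helpers: probe one strategy at one tier using no-alloc probe
// models/contexts, and pick the best viable candidate (definitions elided).
pshard_plan probe_strategy(pshard_strategy s, int32_t bs, int64_t vram_budget);
const pshard_plan * select_best(const std::vector<pshard_plan> & candidates);

// For each token tier, evaluate every strategy and record the winner.
std::map<int32_t, pshard_plan> plan_all_tiers(const std::vector<int32_t> & tiers, int64_t vram_budget) {
    std::map<int32_t, pshard_plan> plans;
    for (const int32_t bs : tiers) {
        std::vector<pshard_plan> candidates;
        for (int s = STATIC_FITPARAMS_DENSEPRIO_MOEONLY; s <= GPUONLY_ATTNPIN_FFNSTREAM; ++s) {
            candidates.push_back(probe_strategy((pshard_strategy) s, bs, vram_budget));
        }
        if (const pshard_plan * best = select_best(candidates)) {
            plans.emplace(bs, *best);
        }
    }
    return plans;
}
```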
Static schedules: GPU-resident tensors run on GPU, CPU-resident tensors run on CPU, with no streamed GPU execution for host-resident weights.

- `STATIC_FITPARAMS_DENSEPRIO_MOEONLY`: static `llama_params_fit` placement with front-to-back fill mode enabled. This is the baseline schedule: MoE models first keep dense-only parts on GPU with expert tensors on CPU, then convert dense-only layers to full layers as budget allows. Dense models get full layers.
- `STATIC_ATTNPRIO_ALLMODELS`: static attention-priority placement. It pins the attention/dense side across as many layers as fit, then uses the remaining budget to pin full layers. FFN/MoE that does not fit remains on CPU. Unlike `llama_params_fit`, this attention-priority placement applies to dense models too.
Dynamic schedules: split the layer between CPU and GPU execution. Some host-resident tensors are streamed to GPU scratch for execution, while other parts of the layer remain on CPU.

- `DYNAMIC_FFNCPU_ATTNSTREAM`: pin as many full layers as fit, keep FFN/MoE on CPU in the remaining layers, and stream the attention/dense side for GPU execution (sketched below).
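A minimal sketch of the per-layer split that `DYNAMIC_FFNCPU_ATTNSTREAM` implies, under the assumption that full layers are pinned front-to-back up to some `n_pinned`; the enums and function are illustrative, not this PR's code.

```cpp
#include <cstdint>

enum class tensor_group { ATTN, FFN_MOE };                // attention/dense vs FFN/MoE side
enum class placement    { GPU_PINNED, CPU, GPU_STREAMED };

// Layers below n_pinned are fully GPU-resident; in the remaining layers,
// FFN/MoE weights execute on CPU while the attention/dense side is streamed
// to GPU scratch for execution.
static placement place(tensor_group group, int32_t il, int32_t n_pinned) {
    if (il < n_pinned) {
        return placement::GPU_PINNED;
    }
    return group == tensor_group::FFN_MOE ? placement::CPU : placement::GPU_STREAMED;
}
```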
GPU-only schedules: execute repeating-layer compute on GPU. Weights that do not fit in VRAM stay resident in host memory and are streamed to GPU scratch before use.

- `GPUONLY_LAYERPIN_LAYERSTREAM`: pin as many full layers as fit, then stream the remaining layers for GPU execution.
- `GPUONLY_ATTNPIN_FFNSTREAM`: pin the attention/dense side for all layers, pin as many complete layers as the remaining budget allows, and stream FFN/MoE weights for GPU execution.

Other implementation notes (see the plan-ranking sketch after this list):

- Layer pinning maximizes `n_pinned` under the VRAM budget, then tries one-layer fractional overflow in the same priority order as `llama_params_fit`: attention, up, gate, then MoE tensors.
- `llama_benchmark_predictor` ranks viable plans by predicted TPS. Without profiles, `tps` remains unset and the planner picks the viable plan with the most pinned layers under the VRAM budget, using measured VRAM as the final tie-breaker.
- If the baseline placement (`llama_params_fit`) already fits in VRAM, the registry records `pshard_disabled=1 baseline_vram=<MiB>` so the runtime can skip pshard setup when that baseline fits the current budget.
- Registry sections are keyed by `n_ctx`, `n_seq_max`, thread count, Flash Attention mode, KV cache types, GGUF file size, and forced `PSHARD_STRATEGY`. Budget and cache sizing are stored as `[variant budget=... cache_ubatch=...]`, so multiple budget/cache variants can share one model/context section.
- Planning is skipped when manual placement options are set (`SPLIT_MODE_TENSOR`, `SPLIT_MODE_ROW`, `n_gpu_layers`, `tensor_split`, `tensor_buft_overrides`) or when KV/KQV offload is disabled.
- Tensor overrides carry a `backend_id`, and `llama_model` records tensor/layer backend maps so probe graphs can use the selected placement.
- `llama_memory_pshard` and `llama_memory_pipe_shard_i` let probe contexts size pinned versus streamed KV/RS memory.
- `token_embd` and `output.weight` may receive different placements when pshard needs them on different backends.
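For the no-profile fallback described above, the selection rule could look like the sketch below: prefer predicted TPS when both plans have one, otherwise more pinned layers, with measured VRAM as the final tie-breaker. The struct and function names are assumptions, not this PR's API.

```cpp
#include <cstdint>
#include <optional>

struct pshard_plan_info {          // hypothetical shape of a candidate plan
    std::optional<double> tps;     // predicted TPS; unset without profiles
    int32_t               n_pinned;   // fully pinned layers
    int64_t               vram_bytes; // measured VRAM requirement
};

// Return the preferred of two viable plans under the ranking described above.
static const pshard_plan_info & prefer(const pshard_plan_info & a, const pshard_plan_info & b) {
    if (a.tps && b.tps) {
        return *a.tps >= *b.tps ? a : b;        // rank by predicted TPS when available
    }
    if (a.n_pinned != b.n_pinned) {
        return a.n_pinned > b.n_pinned ? a : b; // otherwise: most pinned layers
    }
    return a.vram_bytes <= b.vram_bytes ? a : b; // tie-break: smaller measured VRAM
}
```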
Backend Changes

- `ggml_backend_sched_get_split_info()` exposes per-split input, activation, and writeback bytes for the planner cost model.
- `GGML_TENSOR_FLAG_WRITEBACK` marks KV/RS staging tensors that must stay alive until post-compute writeback.
- `copy_stream` is added as a backend capability. The scheduler only creates copy backends and prefetch reservations for devices that report a separate copy stream.

Requirements