llama: add pshard runtime for plan switching and streamed weights #22692

aukarande wants to merge 1 commit into ggml-org:master
Conversation
Hi @aukarande, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Full benchmark tables used for the summary in the PR description:

- Dense Models (Single-User): single-user runs (OSL=200) with the rest of the context filled by the input prompt.
- Dense Models (Batched): batched runs with ISL/OSL=4K/200 and 4 to 64 concurrent users.
- MoE/Hybrid (16 GB VRAM Budget): single-user runs (OSL=200) with the rest of the context filled by the input prompt.
- MoE/Hybrid (8 GB VRAM Budget): single-user runs (OSL=200) with the rest of the context filled by the input prompt.
In a quick test I first checked out the planning branch, since to my understanding the planning code is not in this PR. I ran
I forgot: the RTX 4090 is connected to my server with an EPYC 7742 CPU and 3200 "MHz" octa-channel RAM.
Overview
Pipelined sharding (`pshard`) is a CPU/GPU scheduling approach for VRAM-constrained inference. It combines prioritized VRAM placement, sub-layer sharding, CPU offload, and pipelined copy/compute. It has three stages: profiling (#22495), planning (#22691), and inference. First, we benchmark representative CPU and GPU kernels on the target machine. Then the planner uses those measurements to compare placement strategies within each token tier and writes the chosen schedules to a registry. At inference time, the runtime selects the smallest token tier that can cover the current `n_tokens` and uses that tier's schedule.

This is meant to work without manual tuning. The runtime can discover an optimal `ubatch` size; adapt across context processing, decode, and multi-user decode; and choose schedules based on the current VRAM budget, CPU thread count, and PCIe bandwidth, all without requiring manual tensor overrides.

This work was proposed in a recent meeting with @ggerganov and @JohannesGaessler, and more details can be found in our paper: Efficient, VRAM-Constrained xLM Inference on Clients.
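As a toy illustration of the tier rule, a minimal Python model (the `select_tier` helper here is hypothetical, not the PR's actual C++ code) of picking the smallest token tier that covers the current `n_tokens`:

```python
# Illustrative model of pshard tier selection, NOT the PR's implementation:
# the runtime picks the smallest planned tier that covers n_tokens.

def select_tier(tiers, n_tokens):
    """tiers: planned tier sizes, e.g. [1, 16, 512].
    Returns the smallest tier >= n_tokens, or the largest tier if none covers it."""
    covering = [t for t in tiers if t >= n_tokens]
    return min(covering) if covering else max(tiers)

tiers = [1, 16, 512]            # bs=1, bs=16, bs=512 as in the PR description
print(select_tier(tiers, 1))    # single-user decode -> 1
print(select_tier(tiers, 8))    # small batched decode -> 16
print(select_tier(tiers, 300))  # prefill chunk -> 512
```

Because decode and prefill typically request very different `n_tokens`, this rule naturally lets them land on different tiers with different schedules.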
This PR adds the third phase: inference. It loads the registry written by the planner, picks the smallest tier that covers the current `n_tokens`, and applies that tier's schedule for decode.

The implementation in this PR leverages @JohannesGaessler's llama-fit-params (#16653) and @am17an's tensor-override prefetching work (#21067), which keeps our changes compact compared to the more extensive original prototype from the paper.
Additional information
Usage
New runtime flag:
```shell
./build/bin/llama-cli \
  -m /path/to/model.gguf \
  -pshard \
  --no-mmap \
  -mva 12000 \
  -c 8192 \
  -fa on
```

Useful options and behavior:
- `-pshard`: enable pipelined sharding in the common tools.
- `-mva, --max-vram-alloc <MiB>`: runtime device-memory budget. `0` or omitted uses currently free VRAM.
- `--no-mmap`: load all model weights into backend-managed host buffers, enabling copy/compute overlap.
- The runtime reads `<model>.gguf.tensor_overrides.pshard_registry`, checks the fingerprint, and loads the best budget variant for the current budget.

Example log:
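A minimal sketch of the registry lookup described above, assuming a hypothetical in-memory registry shape (the real on-disk format is defined by the planner): verify the fingerprint, then pick the largest planned budget that fits under the current one.

```python
# Toy model of the pshard registry lookup, NOT the actual file format:
# a registry holds a machine fingerprint and several planned VRAM budgets.

def pick_budget_variant(registry, fingerprint, budget_mib):
    """Return the largest planned budget <= budget_mib,
    or None if the fingerprint is stale or nothing fits."""
    if registry["fingerprint"] != fingerprint:
        return None  # stale plan: profiling/planning must be re-run
    fitting = [b for b in registry["budgets"] if b <= budget_mib]
    return max(fitting) if fitting else None

registry = {"fingerprint": "abc123", "budgets": [8192, 12288, 16384]}
print(pick_budget_variant(registry, "abc123", 12000))  # -> 8192
print(pick_budget_variant(registry, "other",  12000))  # -> None (fingerprint mismatch)
```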
Implementation Details
- `common_init_result` creates a pshard registry, calls `llama_params_fit_pshard()`, and disables pshard if the plan is unavailable or marked unnecessary.
- VRAM is laid out as `[ pinned weights | compute scratch / streamed tensors | pinned KV/RS cache ]`. Pinned cache is packed from the right side so plan switches can resize the scratch range.
- Schedules are organized into token tiers (`bs=1`, `bs=16`, `bs=512`, ...). Decode and prefill can use different tiers because they have different scratch and cache pressure.
- `pshard_apply_plan()` converts the selected tier into tensor and layer backend maps, applies external cache addresses, sets gallocr allocation ranges, and restores cached allocator/backend state when possible.
- `pshard_warmup_plans()` reserves and saves gallocr state for viable tiers before generation starts. Later switches can reuse the saved state instead of doing a full reserve again.
- `pshard_switch_plan()` downloads KV/RS that are being unpinned, applies the new tier, uploads newly pinned layers, and logs the download/upload counts.
- `llama_memory_pshard` handles KV, SWA KV, recurrent state, and hybrid memory. It keeps the CPU copy authoritative for host operations and syncs GPU copies when a tier pins those layers.

Backend Changes
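The `[ pinned weights | compute scratch / streamed tensors | pinned KV/RS cache ]` layout can be modeled with simple offset arithmetic (a sketch with assumed byte units, not the allocator itself). Packing the cache from the right means a plan switch that pins a different amount of cache only moves the scratch boundary, leaving pinned weights in place:

```python
# Toy model of the pshard VRAM layout, NOT the real allocator:
# weights grow from the left, cache is packed from the right,
# and scratch is whatever remains in between.

def vram_layout(budget, weights_bytes, cache_bytes):
    """Return (scratch_offset, scratch_size, cache_offset) for the layout
    [ pinned weights | scratch / streamed tensors | pinned KV/RS cache ]."""
    if weights_bytes + cache_bytes > budget:
        raise ValueError("plan does not fit in the VRAM budget")
    cache_offset   = budget - cache_bytes   # cache packed from the right
    scratch_offset = weights_bytes          # scratch starts after weights
    scratch_size   = cache_offset - scratch_offset
    return scratch_offset, scratch_size, cache_offset

# Pinning more cache only shrinks the scratch range:
print(vram_layout(16_000, 10_000, 2_000))  # (10000, 4000, 14000)
print(vram_layout(16_000, 10_000, 4_000))  # (10000, 2000, 12000)
```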
- gallocr gains `ggml_gallocr_set_buffer()` and can constrain allocations to a subrange through `ggml_gallocr_set_alloc_range()`.
- Backends can advertise `copy_stream` support. CUDA wires this capability for copy/compute overlap.
- `ggml_backend_sched_add_writeback()` and `GGML_TENSOR_FLAG_WRITEBACK` mark KV/RS staging tensors that are written during a split and must stay alive until post-compute writeback.

Results
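To illustrate the allocation-range idea, here is a toy bump allocator in Python (hypothetical names, not the gallocr implementation): restricting allocations to a subrange means any request that would cross the upper bound fails instead of spilling into the pinned cache region.

```python
# Toy bump allocator constrained to [lo, hi) of a larger buffer,
# modeling the idea behind constraining gallocr to the scratch subrange.

class RangeAllocator:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.cur = lo

    def alloc(self, size, align=256):
        # round the cursor up to the alignment boundary
        off = (self.cur + align - 1) // align * align
        if off + size > self.hi:
            return None  # would spill outside the permitted subrange
        self.cur = off + size
        return off

a = RangeAllocator(10_000, 14_000)  # scratch range between weights and cache
print(a.alloc(1024))  # 10240: first aligned offset inside the range
print(a.alloc(4096))  # None: request would cross the cache boundary
```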
Tested on RTX 5080 - 16 GB (PCIe Gen5), Intel Xeon w5-3425 (12 cores), 512 GB system RAM (peak ~90 GB/s). The baseline is llama-fit-params (#16653).

Single-user runs use 16K-256K context, OSL=200, and ISL = context - OSL.
Summary (full result set in comments):
Models used:
Qwen3.5-397B-A17B-Q4 (227 GB)
Single-user runs (OSL=200) with the rest of the context filled by the input prompt.
Requirements