Feat/qlora training #22705
Conversation
Adds a new ggml op GGML_OP_OUT_PROD_ID — scattered outer-product for
the MUL_MAT_ID backward pass (MoE LoRA training).
result[:,:,e] += sum_{(i,t): ids[i,t]==e} a[:,i,t] ⊗ b[:,i,t]
This computes the gradient w.r.t. expert weight matrices (src0) of
MUL_MAT_ID, accumulating per-expert gradients from dispatched tokens.
- ggml.h: enum GGML_OP_OUT_PROD_ID, ggml_out_prod_id() declaration
- ggml.c: op name/symbol, GGML_OP_COUNT 96→97, ggml_out_prod_id() impl
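For reference, a minimal scalar sketch of the op's semantics from the formula above (illustrative only; names, the memory layout, and the contiguous-stride assumption are mine, not the actual ggml kernel):

```c
#include <stdint.h>

// Illustrative scalar reference for GGML_OP_OUT_PROD_ID (not the real kernel).
// Assumed layout, first dim fastest:
//   a:   [ne0, n_ids, n_tokens] (F32)    b:      [ne1, n_ids, n_tokens] (F32)
//   ids: [n_ids, n_tokens]      (I32)    result: [ne0, ne1, n_experts]  (F32)
static void out_prod_id_ref(float * result, const float * a, const float * b,
                            const int32_t * ids,
                            int64_t ne0, int64_t ne1,
                            int64_t n_ids, int64_t n_tokens) {
    for (int64_t t = 0; t < n_tokens; ++t) {
        for (int64_t i = 0; i < n_ids; ++i) {
            const int32_t e  = ids[t*n_ids + i];       // expert picked for slot (i, t)
            const float * av = a + (t*n_ids + i)*ne0;  // column a[:, i, t]
            const float * bv = b + (t*n_ids + i)*ne1;  // column b[:, i, t]
            float       * r  = result + (int64_t) e*ne0*ne1; // slice result[:, :, e]
            for (int64_t j = 0; j < ne1; ++j) {        // scattered outer product:
                for (int64_t k = 0; k < ne0; ++k) {    // r[k, j] += av[k] * bv[j]
                    r[j*ne0 + k] += av[k]*bv[j];
                }
            }
        }
    }
}
```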
Enables gradient computation through MUL_MAT_ID (the expert-dispatched matrix multiply used in MoE LoRA training). Backward w.r.t. activations (src1): MUL_MAT_ID with transposed experts. Backward w.r.t. expert weights (src0, F32 only): the OUT_PROD_ID scattered outer-product accumulates per-expert gradients from dispatched tokens. Quantized src0 is treated as frozen (stop-gradient).

Also:
- ggml_get_rows_back: support 3D output (needed for 3D grad accumulation)
- sigmoid backward: d/dx = sigmoid(x)*(1-sigmoid(x)) = tensor - tensor^2 (see the sketch below)
- ggml_build_backward_expand: ignore integer/frozen srcs for MUL_MAT_ID, SET_ROWS, SSM_CONV, SSM_SCAN, FLASH_ATTN_EXT; replace the ASSERT on unsupported inplace ops with a warning + skip to avoid crashes
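The sigmoid rule in the list above avoids re-evaluating the activation in the backward pass. A minimal sketch in ggml terms, assuming the forward output `s = sigmoid(x)` is at hand (the helper name is mine, not the PR's literal code):

```c
#include "ggml.h"

// Sketch only: since d/dx sigmoid(x) = s*(1 - s) = s - s^2, the incoming
// gradient is scaled elementwise using the saved forward activation `s`
// instead of recomputing sigmoid(x).
static struct ggml_tensor * sigmoid_backward(struct ggml_context * ctx,
                                             struct ggml_tensor  * s,      // sigmoid(x) from the forward pass
                                             struct ggml_tensor  * grad) { // dL/d(sigmoid(x))
    return ggml_mul(ctx, grad, ggml_sub(ctx, s, ggml_sqr(ctx, s)));
}
```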
Adds ggml_cuda_out_prod_id(), the CUDA kernel for OUT_PROD_ID, which computes scattered outer-products for the MUL_MAT_ID backward pass (gradient w.r.t. MoE expert weight matrices). For each expert e the kernel gathers the dispatched token vectors into contiguous GPU buffers and calls cublasSgemm (beta=1 accumulation) to compute grad_weight[:,:,e] += sum_t a[:,t] ⊗ b[:,t].

Also extends ggml_cuda_out_prod() to accept quantized src0 by dequantizing to a temporary F32 buffer before the Sgemm call, allowing it to serve the transposed-expert gradient path of the MUL_MAT_ID backward for setups that mix a quantized base model with F32 LoRA adapters.
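Per the description above, each expert's accumulation reduces to a single Sgemm over its gathered tokens. A host-side sketch (the gather kernel and buffer management are elided; all names here are illustrative, not the PR's code):

```c
#include <cublas_v2.h>

// Illustrative only. After a gather step packs expert e's dispatched token
// columns into contiguous column-major buffers a_e [ne0 x T_e] and
// b_e [ne1 x T_e], one Sgemm with beta = 1 accumulates all T_e outer
// products into grad_w_e = grad_weight[:,:,e] ([ne0 x ne1]):
static void accumulate_expert_grad(cublasHandle_t handle,
                                   const float * a_e, const float * b_e,
                                   float * grad_w_e,
                                   int ne0, int ne1, int T_e) {
    const float alpha = 1.0f;
    const float beta  = 1.0f; // accumulate on top of existing gradients
    // grad_w_e += a_e * b_e^T, i.e. r[k][j] += a[k]*b[j] for each token
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                ne0, ne1, T_e,
                &alpha, a_e, ne0, b_e, ne1,
                &beta,  grad_w_e, ne0);
}
```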
Extends the AdamW optimizer with an element-wise gradient clipping parameter (gclip, pars[7]). When gclip > 0 each gradient element is clamped to [-gclip, gclip] before the first/second moment update, preventing outlier gradients from corrupting the momentum state. Set gclip = 0 (default) to disable; existing behavior is preserved.

Also adds two training-infrastructure improvements to ggml-opt:
- Moment tensors (m, v) are now allocated on the same backend buffer type as their param tensor, avoiding cross-device mismatches when LoRA adapters span CPU and GPU (partial-offload scenarios).
- Non-static graphs: gradient accumulators are explicitly zeroed at the start of each accumulation cycle, fixing stale-gradient carry-over that occurred because ggml_graph_reset is not called between evals.

And gradient checkpointing support: ggml_opt_params.grad_checkpoint_interval marks every Nth forward node as OUTPUT so the allocator keeps those activations alive through the backward pass, trading compute for a VRAM reduction.
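A scalar reference for the clipping semantics described above (illustrative only; the real update runs inside the backend AdamW kernels, and the helper name is an assumption):

```c
#include <math.h>

// Sketch of the gclip behavior: gclip == 0 leaves the gradient untouched,
// so existing behavior is preserved.
static inline void adamw_moment_update(float g, float gclip,
                                       float beta1, float beta2,
                                       float * m, float * v) {
    if (gclip > 0.0f) {
        g = fminf(fmaxf(g, -gclip), gclip); // clamp outliers before they enter the moments
    }
    *m = beta1*(*m) + (1.0f - beta1)*g;     // first moment
    *v = beta2*(*v) + (1.0f - beta2)*g*g;   // second moment
}
```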
Reads pars[7] (gclip) from the AdamW parameter tensor and clamps each gradient element to [-gclip, gclip] before the moment update. gclip == 0 disables clipping (preserves existing behavior).
Reads params[8] (gclip) from the AdamW parameter buffer in the GLSL shader and clamps each gradient element to [-gclip, gclip] before the moment update. Also extends the params array declaration from [7] to [8] and updates the matching assert in ggml-vulkan.cpp. gclip == 0 disables clipping (preserves existing behavior).
…feat/backward-mul-mat-id
- llama_opt_params: add grad_checkpoint_interval (forwarded to ggml-opt)
- llama_opt_set_reward_weights(): thread-local reward-weight array for reward-weighted SFT (GRPO); passed as reward_scale into opt_epoch_iter so each data window can have a different cross-entropy weight
- llama_opt_epoch(): add a shuffle param that calls ggml_opt_dataset_shuffle on the training split at the start of each epoch
- llama_context::opt_init(): inflate the scheduler and graph-result capacity before creating opt_ctx, sized from the actual measured training forward graph (4x multiplier for gf+gb_grad+gb_opt)
- llama_context::opt_epoch_iter(): allocate ctx_compute_opt once per batch window instead of per ubatch (reset instead of free+alloc); add per-ubatch timing logs; support the -1 sentinel label (masked position)
- llama-adapter: prefer the device-native buffer type over the CPU fallback when a LoRA tensor's repack buft is not usable (keeps LoRA on the GPU)
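A hedged usage sketch of these additions, wrapped as a hypothetical helper. Field and function names follow the PR description; the exact signatures (argument order, where the new shuffle flag sits) are assumptions, not the merged API:

```c
#include "llama.h"

// Sketch only: grad_checkpoint_interval, llama_opt_set_reward_weights(), and
// the shuffle argument are this PR's additions; their exact shapes are assumed.
static void train_one_epoch(struct llama_context * lctx, struct llama_model * model,
                            ggml_opt_dataset_t dataset,
                            ggml_opt_result_t result_train, ggml_opt_result_t result_eval,
                            int64_t idata_split,
                            ggml_opt_epoch_callback callback_train,
                            ggml_opt_epoch_callback callback_eval) {
    struct llama_opt_params oparams = {0};  // other fields elided for brevity
    oparams.grad_checkpoint_interval = 4;   // new field: checkpoint every 4th forward node

    llama_opt_init(lctx, model, oparams);

    // new: one cross-entropy weight per data window (reward-weighted SFT / GRPO),
    // stored thread-locally per the PR text (assumed signature)
    float rewards[3] = {1.0f, 0.25f, 0.8f};
    llama_opt_set_reward_weights(rewards, 3);

    // new trailing shuffle argument: re-shuffle the training split each epoch
    llama_opt_epoch(lctx, dataset, result_train, result_eval,
                    idata_split, callback_train, callback_eval, /*shuffle=*/true);
}
```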
Adds command-line arguments for the qlora_training example: gradient checkpointing interval and reward weights path.
Adds a new training example for QLoRA fine-tuning of MoE models (Mixtral, Qwen-MoE, DeepSeek-MoE). Uses the new MUL_MAT_ID backward pass to train LoRA adapters on the expert weight matrices, with support for reward-weighted SFT (GRPO-style training).

Files:
- finetune_qlora.cpp: main training binary
- grpo_example.py: end-to-end GRPO training script
- check_lora_norms.py: diagnostic to inspect LoRA weight magnitudes
- sample_data.jsonl, sample_rwsft_data.jsonl: example datasets
- CMakeLists.txt: build target registration
- README.md: usage instructions

Also updates examples/training/finetune.cpp to pass the new shuffle argument to llama_opt_epoch().
…79/llama.cpp into feat/qlora-training-v2
Hi @srossitto79, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Overview
Adds a QLoRA fine-tuning example for Mixture-of-Experts models (Mixtral, Qwen-MoE, DeepSeek-MoE), built on the `MUL_MAT_ID` backward pass. Supports both standard SFT and reward-weighted SFT (GRPO-style training) with gradient checkpointing to reduce activation VRAM.
Additional information
- New example: `examples/qlora_training/finetune_qlora.cpp`
- Reward-weighted SFT (`grpo_example.py`)

The example includes diagnostic tooling (`check_lora_norms.py`) and sample datasets to help users get started quickly.
Requirements
Depends on the `feat/backward-mul-mat-id` branch (or equivalent) for the `MUL_MAT_ID` backward pass, the `OUT_PROD_ID` op, and the `gclip` AdamW parameter.