Feat/qlora training #22705
Conversation
Adds a new ggml op GGML_OP_OUT_PROD_ID — scattered outer-product for
the MUL_MAT_ID backward pass (MoE LoRA training).
result[:,:,e] += sum_{(i,t): ids[i,t]==e} a[:,i,t] ⊗ b[:,i,t]
This computes the gradient w.r.t. expert weight matrices (src0) of
MUL_MAT_ID, accumulating per-expert gradients from dispatched tokens.
- ggml.h: enum GGML_OP_OUT_PROD_ID, ggml_out_prod_id() declaration
- ggml.c: op name/symbol, GGML_OP_COUNT 96→97, ggml_out_prod_id() impl
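For reference, a minimal scalar sketch of the op's semantics from the formula above (illustrative only; names, the memory layout, and the contiguous-stride assumption are mine, not the actual ggml kernel):

```c
#include <stdint.h>

// Illustrative scalar reference for GGML_OP_OUT_PROD_ID (not the real kernel).
// Assumed layout, first dim fastest:
//   a:   [ne0, n_ids, n_tokens] (F32)    b:      [ne1, n_ids, n_tokens] (F32)
//   ids: [n_ids, n_tokens]      (I32)    result: [ne0, ne1, n_experts]  (F32)
static void out_prod_id_ref(float * result, const float * a, const float * b,
                            const int32_t * ids,
                            int64_t ne0, int64_t ne1,
                            int64_t n_ids, int64_t n_tokens) {
    for (int64_t t = 0; t < n_tokens; ++t) {
        for (int64_t i = 0; i < n_ids; ++i) {
            const int32_t e  = ids[t*n_ids + i];       // expert picked for slot (i, t)
            const float * av = a + (t*n_ids + i)*ne0;  // column a[:, i, t]
            const float * bv = b + (t*n_ids + i)*ne1;  // column b[:, i, t]
            float       * r  = result + (int64_t) e*ne0*ne1; // slice result[:, :, e]
            for (int64_t j = 0; j < ne1; ++j) {        // scattered outer product:
                for (int64_t k = 0; k < ne0; ++k) {    // r[k, j] += av[k] * bv[j]
                    r[j*ne0 + k] += av[k]*bv[j];
                }
            }
        }
    }
}
```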
Enables gradient computation through MUL_MAT_ID (the expert-dispatched matrix multiply used in MoE LoRA training). Backward w.r.t. activations (src1): MUL_MAT_ID with transposed experts. Backward w.r.t. expert weights (src0, F32 only): the OUT_PROD_ID scattered outer-product accumulates per-expert gradients from dispatched tokens. Quantized src0 is treated as frozen (stop-gradient).

Also:
- ggml_get_rows_back: support 3D output (needed for 3D grad accumulation)
- sigmoid backward: d/dx = sigmoid(x)*(1-sigmoid(x)) = tensor - tensor^2 (see the sketch below)
- ggml_build_backward_expand: ignore integer/frozen srcs for MUL_MAT_ID, SET_ROWS, SSM_CONV, SSM_SCAN, FLASH_ATTN_EXT; replace the ASSERT on unsupported inplace ops with a warning + skip to avoid crashes
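The sigmoid rule in the list above avoids re-evaluating the activation in the backward pass. A minimal sketch in ggml terms, assuming the forward output `s = sigmoid(x)` is at hand (the helper name is mine, not the PR's literal code):

```c
#include "ggml.h"

// Sketch only: since d/dx sigmoid(x) = s*(1 - s) = s - s^2, the incoming
// gradient is scaled elementwise using the saved forward activation `s`
// instead of recomputing sigmoid(x).
static struct ggml_tensor * sigmoid_backward(struct ggml_context * ctx,
                                             struct ggml_tensor  * s,      // sigmoid(x) from the forward pass
                                             struct ggml_tensor  * grad) { // dL/d(sigmoid(x))
    return ggml_mul(ctx, grad, ggml_sub(ctx, s, ggml_sqr(ctx, s)));
}
```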
Adds ggml_cuda_out_prod_id(), the CUDA kernel for OUT_PROD_ID, which computes scattered outer-products for the MUL_MAT_ID backward pass (gradient w.r.t. MoE expert weight matrices). For each expert e the kernel gathers the dispatched token vectors into contiguous GPU buffers and calls cublasSgemm (beta=1 accumulation) to compute grad_weight[:,:,e] += sum_t a[:,t] ⊗ b[:,t].

Also extends ggml_cuda_out_prod() to accept quantized src0 by dequantizing to a temporary F32 buffer before the Sgemm call, allowing it to serve the transposed-expert gradient path of the MUL_MAT_ID backward for setups that mix a quantized base model with F32 LoRA adapters.
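Per the description above, each expert's accumulation reduces to a single Sgemm over its gathered tokens. A host-side sketch (the gather kernel and buffer management are elided; all names here are illustrative, not the PR's code):

```c
#include <cublas_v2.h>

// Illustrative only. After a gather step packs expert e's dispatched token
// columns into contiguous column-major buffers a_e [ne0 x T_e] and
// b_e [ne1 x T_e], one Sgemm with beta = 1 accumulates all T_e outer
// products into grad_w_e = grad_weight[:,:,e] ([ne0 x ne1]):
static void accumulate_expert_grad(cublasHandle_t handle,
                                   const float * a_e, const float * b_e,
                                   float * grad_w_e,
                                   int ne0, int ne1, int T_e) {
    const float alpha = 1.0f;
    const float beta  = 1.0f; // accumulate on top of existing gradients
    // grad_w_e += a_e * b_e^T, i.e. r[k][j] += a[k]*b[j] for each token
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                ne0, ne1, T_e,
                &alpha, a_e, ne0, b_e, ne1,
                &beta,  grad_w_e, ne0);
}
```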
Extends the AdamW optimizer with an element-wise gradient clipping parameter (gclip, pars[7]). When gclip > 0 each gradient element is clamped to [-gclip, gclip] before the first/second moment update, preventing outlier gradients from corrupting the momentum state. Set gclip = 0 (default) to disable; existing behavior is preserved.

Also adds two training-infrastructure improvements to ggml-opt:
- Moment tensors (m, v) are now allocated on the same backend buffer type as their param tensor, avoiding cross-device mismatches when LoRA adapters span CPU and GPU (partial-offload scenarios).
- Non-static graphs: gradient accumulators are explicitly zeroed at the start of each accumulation cycle, fixing stale-gradient carry-over that occurred because ggml_graph_reset is not called between evals.

And gradient checkpointing support: ggml_opt_params.grad_checkpoint_interval marks every Nth forward node as OUTPUT so the allocator keeps those activations alive through the backward pass, trading compute for a VRAM reduction.
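A scalar reference for the clipping semantics described above (illustrative only; the real update runs inside the backend AdamW kernels, and the helper name is an assumption):

```c
#include <math.h>

// Sketch of the gclip behavior: gclip == 0 leaves the gradient untouched,
// so existing behavior is preserved.
static inline void adamw_moment_update(float g, float gclip,
                                       float beta1, float beta2,
                                       float * m, float * v) {
    if (gclip > 0.0f) {
        g = fminf(fmaxf(g, -gclip), gclip); // clamp outliers before they enter the moments
    }
    *m = beta1*(*m) + (1.0f - beta1)*g;     // first moment
    *v = beta2*(*v) + (1.0f - beta2)*g*g;   // second moment
}
```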
Reads pars[7] (gclip) from the AdamW parameter tensor and clamps each gradient element to [-gclip, gclip] before the moment update. gclip == 0 disables clipping (preserves existing behavior).
Reads params[8] (gclip) from the AdamW parameter buffer in the GLSL shader and clamps each gradient element to [-gclip, gclip] before the moment update. Also extends the params array declaration from [7] to [8] and updates the matching assert in ggml-vulkan.cpp. gclip == 0 disables clipping (preserves existing behavior).
…feat/backward-mul-mat-id
- llama_opt_params: add grad_checkpoint_interval (forwarded to ggml-opt)
- llama_opt_set_reward_weights(): thread-local reward-weight array for reward-weighted SFT (GRPO); passed as reward_scale into opt_epoch_iter so each data window can have a different cross-entropy weight
- llama_opt_epoch(): add a shuffle param that calls ggml_opt_dataset_shuffle on the training split at the start of each epoch
- llama_context::opt_init(): inflate the scheduler and graph-result capacity before creating opt_ctx, sized from the actual measured training forward graph (4x multiplier for gf+gb_grad+gb_opt)
- llama_context::opt_epoch_iter(): allocate ctx_compute_opt once per batch window instead of per ubatch (reset instead of free+alloc); add per-ubatch timing logs; support the -1 sentinel label (masked position)
- llama-adapter: prefer the device-native buffer type over the CPU fallback when a LoRA tensor's repack buft is not usable (keeps LoRA on the GPU)
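A hedged usage sketch of these additions, wrapped as a hypothetical helper. Field and function names follow the PR description; the exact signatures (argument order, where the new shuffle flag sits) are assumptions, not the merged API:

```c
#include "llama.h"

// Sketch only: grad_checkpoint_interval, llama_opt_set_reward_weights(), and
// the shuffle argument are this PR's additions; their exact shapes are assumed.
static void train_one_epoch(struct llama_context * lctx, struct llama_model * model,
                            ggml_opt_dataset_t dataset,
                            ggml_opt_result_t result_train, ggml_opt_result_t result_eval,
                            int64_t idata_split,
                            ggml_opt_epoch_callback callback_train,
                            ggml_opt_epoch_callback callback_eval) {
    struct llama_opt_params oparams = {0};  // other fields elided for brevity
    oparams.grad_checkpoint_interval = 4;   // new field: checkpoint every 4th forward node

    llama_opt_init(lctx, model, oparams);

    // new: one cross-entropy weight per data window (reward-weighted SFT / GRPO),
    // stored thread-locally per the PR text (assumed signature)
    float rewards[3] = {1.0f, 0.25f, 0.8f};
    llama_opt_set_reward_weights(rewards, 3);

    // new trailing shuffle argument: re-shuffle the training split each epoch
    llama_opt_epoch(lctx, dataset, result_train, result_eval,
                    idata_split, callback_train, callback_eval, /*shuffle=*/true);
}
```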
Adds command-line arguments for the qlora_training example: gradient checkpointing interval and reward weights path.
Adds a new training example for QLoRA fine-tuning of MoE models (Mixtral, Qwen-MoE, DeepSeek-MoE). Uses the new MUL_MAT_ID backward pass to train LoRA adapters on the expert weight matrices, with support for reward-weighted SFT (GRPO-style training).

Files:
- finetune_qlora.cpp: main training binary
- grpo_example.py: end-to-end GRPO training script
- check_lora_norms.py: diagnostic to inspect LoRA weight magnitudes
- sample_data.jsonl, sample_rwsft_data.jsonl: example datasets
- CMakeLists.txt: build target registration
- README.md: usage instructions

Also updates examples/training/finetune.cpp to pass the new shuffle argument to llama_opt_epoch().
…79/llama.cpp into feat/qlora-training-v2
Hi @srossitto79, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Overview
Adds a QLoRA fine-tuning example for Mixture-of-Experts models (Mixtral, Qwen-MoE, DeepSeek-MoE), built on the `MUL_MAT_ID` backward pass. Supports both standard SFT and reward-weighted SFT (GRPO-style training) with gradient checkpointing to reduce activation VRAM.
Additional information
- New example: `examples/qlora_training/finetune_qlora.cpp`
- Reward-weighted SFT (`grpo_example.py`)

The example includes diagnostic tooling (`check_lora_norms.py`) and sample datasets to help users get started quickly.
Requirements
Depends on the `feat/backward-mul-mat-id` branch (or equivalent) for the `MUL_MAT_ID` backward pass, the `OUT_PROD_ID` op, and the `gclip` AdamW parameter.