Fuse rms_norm, mul, quantize_q8_1#22710

Open
lnigam wants to merge 2 commits into ggml-org:master from lnigam:fuse_rms_norm_mul_qunatize_q8_1

Conversation

Contributor

@lnigam lnigam commented May 5, 2026

Overview

Fuse rms_norm + mul + quantize_q8_1

Additional information

Tested on the Qwen3.6-35B-A3B-Q4_K_M model: out of 81 RMS norms, 41 are fused. The remaining 40 stay unfused because they feed the MoE gate input, which requires an F32 input.

On an RTX 5090, it gives around a 3-5% perf boost.
Command: `./llama-bench -m ~/trust/Qwen3.6-35B-A3B-Q4_K_M.gguf -n 32 -ngl 500 -r 5 -fa 1 -d 0,100,512,4096 -p 0`

Without change:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 | 220.76 ± 8.60 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d100 | 248.71 ± 11.13 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d512 | 245.73 ± 10.62 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d4096 | 242.65 ± 10.62 |

With change:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 | 232.91 ± 9.05 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d100 | 259.02 ± 10.00 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d512 | 255.50 ± 10.40 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d4096 | 251.02 ± 9.78 |

@lnigam lnigam requested review from a team and ggerganov as code owners May 5, 2026 11:08
@lnigam lnigam changed the title Fuse rms norm mul qunatize q8 1 Fuse rms norm mul quantize q8 1 May 5, 2026
@lnigam lnigam changed the title Fuse rms norm mul quantize q8 1 Fuse rms_norm, mul, quantize_q8_1 May 5, 2026
```cpp
}

// Patch mul_node type and strides so downstream MUL_MAT sees Q8_1
mul_node->type = GGML_TYPE_Q8_1;
```

Changing the node like this is going to cause issues; we should find another way.

@github-actions github-actions Bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 5, 2026