Fuse rms_norm, mul, quantize_q8_1#22710

Open
lnigam wants to merge 2 commits into ggml-org:master from lnigam:fuse_rms_norm_mul_qunatize_q8_1

Conversation

Contributor

@lnigam lnigam commented May 5, 2026

Overview

Fuse rms_norm + mul + quantize_q8_1

Additional information

Tested on the Qwen3.6-35B-A3B-Q4_K_M model: out of 81 RMS norms, 41 are fused. The remaining 40 stay unfused because they feed the MoE gate input, which requires an F32 input.

On an RTX 5090, it gives around a 3-5% perf boost.
Command: `./llama-bench -m ~/trust/Qwen3.6-35B-A3B-Q4_K_M.gguf -n 32 -ngl 500 -r 5 -fa 1 -d 0,100,512,4096 -p 0`

Without change:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 | 220.76 ± 8.60 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d100 | 248.71 ± 11.13 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d512 | 245.73 ± 10.62 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d4096 | 242.65 ± 10.62 |

With change:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 | 232.91 ± 9.05 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d100 | 259.02 ± 10.00 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d512 | 255.50 ± 10.40 |
| qwen35moe 35B.A3B Q4_K - Medium | 19.82 GiB | 34.66 B | CUDA | 500 | 1 | tg32 @ d4096 | 251.02 ± 9.78 |

@lnigam lnigam requested review from a team and ggerganov as code owners May 5, 2026 11:08
@lnigam lnigam changed the title Fuse rms norm mul qunatize q8 1 Fuse rms norm mul quantize q8 1 May 5, 2026
@lnigam lnigam changed the title Fuse rms norm mul quantize q8 1 Fuse rms_norm, mul, quantize_q8_1 May 5, 2026
```cpp
}

// Patch mul_node type and strides so downstream MUL_MAT sees Q8_1
mul_node->type = GGML_TYPE_Q8_1;
```

Changing the node like this is going to cause issues; we should find another way.

@github-actions github-actions Bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 5, 2026