Muon torch submission by Niccolo-Ajroldi · Pull Request #15 · mlcommons/submissions_algorithms

Niccolo-Ajroldi · 2025-09-30T18:54:29Z

New Submission: Muon

Submission Information

submission_name: "MuonTorch"
submission_folder: "submissions/self_tuning/muon_torch/"
authors: "Niccolò Ajroldi*"
affiliations: "ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems"
version: "1.0"
ruleset: "self-tuning"
framework: "PyTorch"
description: "Muon DDP implementation in PyTorch."

*Credit to the original Muon implementation by Keller Jordan.

Evidence for the Submission's Performance

Muon original blogpost
Muon is Scalable for LLM Training
Modded-nanogpt

Submission Development

The Muon optimizer is primarily based on the approximate orthogonalization of the update applied to each weight matrix.
The Newton–Schulz algorithm used for this, despite running for only 5 iterations, introduces a significant slowdown compared to optimizers that update parameters element-wise, such as SGD or Adam.
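As a rough illustration of where the cost comes from, the sketch below runs a quintic Newton–Schulz iteration in plain Python on a matrix stored as a list of rows. The coefficients are the ones used in Keller Jordan's public Muon code; the helper names (`matmul`, `transpose`, `newton_schulz`) and the pure-Python formulation are assumptions made for readability and are not the submission's PyTorch code. Each iteration needs three matrix products, which is what element-wise optimizers avoid.

```python
import math

# Quintic Newton-Schulz coefficients from Keller Jordan's public Muon code.
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g (illustrative sketch, not the submission code)."""
    a, b, c = NS_COEFFS
    tall = len(g) > len(g[0])
    # Work on the orientation with fewer rows so that X X^T is the small Gram matrix.
    x = transpose(g) if tall else [row[:] for row in g]
    # Dividing by the Frobenius norm bounds the spectral norm by 1.
    norm = math.sqrt(sum(v * v for row in x for v in row))
    x = [[v / (norm + eps) for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))   # A = X X^T
        gram2 = matmul(gram, gram)       # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]          # b A + c A^2
        px = matmul(poly, x)
        x = [[a * v + w for v, w in zip(r1, r2)]
             for r1, r2 in zip(x, px)]                   # a X + (b A + c A^2) X
    return transpose(x) if tall else x
```

With these coefficients the singular values are pushed toward a band around 1 rather than to exactly 1, so the result is only approximately orthogonal.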

For example, when updating an N×N matrix in PyTorch on 1×A100-40GB, we observe that Muon can be up to ~30× slower than Adam:

+-------+---------+----------+----------+
|    N  | SGD (s) | Adam (s) | Muon (s) |
+-------+---------+----------+----------+
|    10 | 0.00116 |  0.00088 |  0.00662 |
|   100 | 0.00032 |  0.00044 |  0.00377 |
|  1000 | 0.00034 |  0.00045 |  0.00990 |
| 10000 | 0.00201 |  0.00480 |  0.14230 |
+-------+---------+----------+----------+
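The "~30×" figure can be checked directly from the table. The short script below only divides the Muon column by the Adam column; the timings are copied from the table and the function name `muon_over_adam` is a hypothetical helper.

```python
# Timings in seconds copied from the table above: N -> (SGD, Adam, Muon).
timings = {
    10: (0.00116, 0.00088, 0.00662),
    100: (0.00032, 0.00044, 0.00377),
    1000: (0.00034, 0.00045, 0.00990),
    10000: (0.00201, 0.00480, 0.14230),
}


def muon_over_adam(table):
    """Return the Muon/Adam time ratio for each matrix size N."""
    return {n: muon / adam for n, (_sgd, adam, muon) in table.items()}


ratios = muon_over_adam(timings)
for n, ratio in ratios.items():
    print(f"N={n:>5}: Muon takes {ratio:.1f}x the Adam step time")
```

The ratio grows with N and peaks at about 29.6 for N = 10000, which is the source of the "up to ~30×" statement.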

Therefore, when using multiple devices, the orthogonalization workload should be distributed across devices rather than simply replicated on each of them.
To test the effectiveness of this paradigm, we compare two implementations on AlgoPerf:

- MuonVanilla: the optimization algorithm is replicated identically across devices, so each rank orthogonalizes all parameters.
- MuonDataParallel: this design is based on the Muon GitHub repo. Each device locally updates a distinct subset of parameters, which are later all-gathered. This leverages PyTorch 2.5 support for all-gathering tensors of different shapes (docs).

Compared to the original implementation, we sort parameters according to their Newton–Schulz computational complexity rather than parameter size.
We further replace the default PyTorch gradient all-reduce with a custom reduce-scatter. Since different devices update different parameter subsets, each device only requires the reduced gradients corresponding to the parameters it owns. Crucially, the scatter operation follows the block structure of the distributed Muon update. Finally, we make the all-gather operation asynchronous, enabling efficient overlap between communication and computation.
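To illustrate why sorting by Newton–Schulz cost can differ from sorting by parameter size, here is a minimal sketch. The cost model (multiply-adds of the three matrix products per iteration), the round-robin assignment, and the parameter names and shapes are all assumptions for illustration; the submission's actual partitioning logic may differ.

```python
def ns_cost(shape, steps=5):
    """Rough multiply-add count of Newton-Schulz on a 2D shape (assumed model).

    With m = min(shape) and n = max(shape), one iteration computes
    X X^T (m*m*n), A @ A (m**3) and (b A + c A^2) @ X (m*m*n).
    """
    m, n = min(shape), max(shape)
    return steps * (2 * m * m * n + m ** 3)


def assign_to_ranks(shapes, world_size):
    """Sort parameters by decreasing NS cost, then deal them round-robin."""
    order = sorted(shapes, key=lambda name: ns_cost(shapes[name]), reverse=True)
    owners = {rank: [] for rank in range(world_size)}
    for i, name in enumerate(order):
        owners[i % world_size].append(name)
    return owners


# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "wide": (64, 4096),     # 262144 elements
    "square": (512, 512),   # 262144 elements, but a much larger Gram matrix
    "small": (32, 32),
    "tall": (2048, 128),
}
owners = assign_to_ranks(shapes, world_size=2)
print(owners)
```

In this example "wide" and "square" hold the same number of elements, yet the estimated cost of "square" is roughly an order of magnitude higher, so a size-based ordering would treat them as equal while a cost-based ordering does not.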

Note that several other orthogonalization strategies are possible; we give an overview of some of them in this diagram and in a dedicated wandb report.

Choosing a single implementation:

We benchmark both Muon implementations, and AdamW for comparison, across the AlgoPerf workloads on 4×A100-40GB, using the batch sizes from the NadamW baseline submission. The FineWebEdu workload was not available at the time of this analysis.
Each run trains for 5% of the workload step_hint, and the first 100 steps are excluded as burn-in.

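+------------------+---------------------------------+-----------------------------+
| optim_name       | accumulated_submission_time_min | time_saved_over_vanilla_min |
+------------------+---------------------------------+-----------------------------+
| MuonVanilla      |                          352.18 |                        0.00 |
| MuonDataParallel |                          334.76 |                       17.42 |
| AdamW            |                          337.06 |                       15.12 |
+------------------+---------------------------------+-----------------------------+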

Distributing the orthogonalization burden across devices reduces Muon's accumulated submission time by 17.42 minutes (about 5%), bringing its runtime in line with AdamW. MuonDataParallel is slower than MuonVanilla only on criteo1tb, and shows a significant advantage on wmt.

Submission Details

Backup optimizer.
We use AdamW as the backup optimizer, optimizing the following parameters with it:
- 1D params (biases, layernorm, batchnorm)
- Embeddings of wmt, criteo1tb, finewebedu_lm
We attach txt files with the resulting parameter split for each workload for ease of inspection.
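The split rule above can be sketched as a small routing function. Detecting embeddings through the substring "embed" in the parameter name is a hypothetical convention chosen for this example, and the parameter names are made up; the attached txt files remain the reference for the actual split.

```python
# Workloads whose embeddings are optimized with AdamW, as listed above.
ADAMW_EMBEDDING_WORKLOADS = {"wmt", "criteo1tb", "finewebedu_lm"}


def choose_optimizer(name, shape, workload):
    """Return "adamw" or "muon" for one parameter (illustrative sketch).

    The "embed" substring check is an assumed naming convention, not taken
    from the submission.
    """
    if len(shape) <= 1:
        return "adamw"  # biases, layernorm and batchnorm parameters
    if workload in ADAMW_EMBEDDING_WORKLOADS and "embed" in name.lower():
        return "adamw"
    return "muon"


# Hypothetical parameter names and shapes.
print(choose_optimizer("encoder.layer0.bias", (512,), "wmt"))
print(choose_optimizer("shared_embedding.weight", (32000, 512), "wmt"))
print(choose_optimizer("encoder.layer0.mlp.weight", (512, 2048), "wmt"))
```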
Momentum implementation.
We follow the Adam-style EMA implementation of momentum, as is also done in the official Muon repo and in modded-nanogpt. Note, however, that the original formulation of Muon uses PyTorch-SGD-style momentum, and a similar implementation is followed by Moonshot AI.
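The difference between the two momentum conventions can be seen on scalars. The sketch below assumes zero dampening for the SGD-style variant and uses a constant gradient with hypothetical values; it is meant only to show how the buffers differ in scale, not to reproduce the submission's update.

```python
def ema_momentum(buf, grad, beta):
    """Adam-style EMA: the new gradient is weighted by (1 - beta)."""
    return beta * buf + (1.0 - beta) * grad


def sgd_momentum(buf, grad, beta):
    """PyTorch-SGD-style momentum with zero dampening: gradient added unscaled."""
    return beta * buf + grad


beta, grad = 0.9, 1.0  # hypothetical values; constant gradient for illustration
ema_buf = sgd_buf = 0.0
for _ in range(200):
    ema_buf = ema_momentum(ema_buf, grad, beta)
    sgd_buf = sgd_momentum(sgd_buf, grad, beta)

print(ema_buf, sgd_buf)
```

Under a constant gradient the EMA buffer converges to the gradient itself, while the SGD-style buffer converges to the gradient scaled by 1 / (1 - beta), here a factor of 10.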
3D and 4D parameters.
3D parameters are flattened on the trailing dimensions and NS orthogonalization is applied. No 4D parameters are present in AlgoPerf at the time of this analysis.
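As a small illustration of the flattening step, the helper below computes the 2D shape that Newton–Schulz would see. It assumes that "flattened on the trailing dimensions" means keeping the first dimension and merging all remaining ones; the function name is hypothetical.

```python
import math


def flattened_shape(shape):
    """2D shape passed to Newton-Schulz: keep dim 0, merge the rest (assumed)."""
    if len(shape) < 2:
        raise ValueError("1D parameters are handled by the backup optimizer")
    return (shape[0], math.prod(shape[1:]))


print(flattened_shape((8, 4, 16)))   # a 3D parameter becomes 8 x 64
print(flattened_shape((512, 256)))   # 2D parameters are left unchanged
```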

Next steps

github-actions · 2025-09-30T18:54:40Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

muon torch vanilla DP

1a4a873

Niccolo-Ajroldi requested a review from a team as a code owner September 30, 2025 18:54

Niccolo-Ajroldi changed the title ~~muon torch vanilla DP~~ Muon torch submission (vanilla DP) Sep 30, 2025

Niccolo-Ajroldi added 4 commits October 3, 2025 10:38

add muon dampening hyperparam

4462202

cleanup: split muon-adam

87c64dd

rename MuonVanilla, add MuonKJ, add MuonBucketed, custom reduce scatter

81f66b7

add tests

b2326ff

Niccolo-Ajroldi changed the title ~~Muon torch submission (vanilla DP)~~ Muon torch submission Oct 6, 2025

Niccolo-Ajroldi changed the title ~~Muon torch submission~~ Muon torch submission [WIP] Oct 6, 2025

Niccolo-Ajroldi and others added 20 commits October 7, 2025 13:21

add diagrams

5e914e0

Add files via upload

98c5585

Add files via upload

2dd75e1

add utils, polished, add param aplitting

cccffe0

moved diagrams

9ff8184

enable AdamW fused optim

3926947

Add files via upload

ec5d1a7

Add files via upload

3798e2e

Add files via upload

8895419

cleaned diagrams

4b291ed

Add files via upload

66c0385

fix ReduceScatter

231c5dd

Merge branch 'muon_torch' of github.com:Niccolo-Ajroldi/submissions_a…

9c865e5

…lgorithms into muon_torch

Add files via upload

6c61acc

removed dampening, use adam style ema

4618f38

separate lr and wd for muon adam

eb0f779

separate lr and wd for muon adam

3ad574f

format

8649960

Merge branch 'muon_torch' of github.com:Niccolo-Ajroldi/submissions_a…

d28b9b6

…lgorithms into muon_torch

clean

1d5ff98

Niccolo-Ajroldi and others added 7 commits December 11, 2025 15:09

cleanup, def muon submission

23b198b

cleanup

fd6584d

Add finewebedu_lm.txt with model parameters

b87c93d

Added finewebedu_lm.txt with model parameters for Muon and Adam.

Update finewebedu_lm.txt

0b9935e

Update finewebedu_lm.txt

a69c9d4

update for self tuning submission

2812aaa

merge fix

015f68d

Niccolo-Ajroldi changed the title ~~Muon torch submission [WIP]~~ Muon torch submission May 18, 2026

Niccolo-Ajroldi assigned Niccolo-Ajroldi and unassigned Niccolo-Ajroldi May 18, 2026

Niccolo-Ajroldi force-pushed the muon_torch branch from 15cd68e to 015f68d Compare May 18, 2026 14:36

priyakasimbeg approved these changes May 19, 2026

priyakasimbeg merged commit eaa6195 into mlcommons:main May 19, 2026
9 checks passed

github-actions Bot locked and limited conversation to collaborators May 19, 2026

priyakasimbeg merged 32 commits into mlcommons:main from Niccolo-Ajroldi:muon_torch