Zach Anderson
Apr 22, 2026 20:41
NVIDIA integrates Muon and other advanced optimizers into Megatron Core to boost large-scale LLM training with near-parity throughput to AdamW.
NVIDIA is pushing the boundaries of large language model (LLM) training with its integration of advanced optimizers like Muon into the Megatron Core framework. According to NVIDIA's April 22, 2026 blog post, the Muon optimizer, based on higher-order mathematical methods, has achieved near-parity training throughput with the widely used AdamW optimizer while improving model performance on large-scale systems such as the NVIDIA GB300 NVL72.
Muon, short for MomentUm Orthogonalized by Newton-Schulz, is a higher-order optimization algorithm. It has been instrumental in training leading open-source models such as Kimi K2 and GLM-5. By leveraging advanced preconditioning techniques, the optimizer achieves higher FLOPS utilization (floating-point operations per second), a critical metric for maximizing computational efficiency in LLM training.
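At the heart of Muon is a repeated Newton-Schulz iteration that approximately orthogonalizes each weight matrix's momentum before it is applied as an update. The sketch below is a minimal illustration following the publicly circulated Muon reference implementation (quintic iteration coefficients, bfloat16 matmuls); the actual Megatron Core kernels will differ, and the update scaling is deliberately simplified.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D momentum matrix G via Newton-Schulz iteration.

    Coefficients follow the widely circulated Muon reference implementation;
    production kernels (e.g., inside Megatron Core) may differ.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    transposed = X.size(0) > X.size(1)
    if transposed:                      # work on the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)           # Frobenius norm bound keeps the spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T                     # Gram matrix (a SYRK-style product dominates the cost)
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

def muon_like_update(weight, grad, momentum_buf, lr=0.02, momentum=0.95):
    """One Muon-style step: momentum accumulation, orthogonalization, parameter update.

    Real implementations also apply a shape-dependent scale to the update;
    that convention varies, so it is omitted here.
    """
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    weight.add_(update, alpha=-lr)
```

In practice, Muon is applied only to 2D weight matrices, while embeddings, normalization parameters, and biases are typically still handled by AdamW.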
Performance Metrics: Muon vs. AdamW
Table 1 from NVIDIA's report shows that Muon delivers comparable throughput to AdamW on the GB300 NVL72 system. For example, the Kimi K2 model achieved 1,080 TFLOPs/s/GPU with Muon, slightly surpassing AdamW's 1,051 TFLOPs/s/GPU. Similarly, the Qwen3 30B model reached 721 TFLOPs/s/GPU with Muon compared with 713 TFLOPs/s/GPU with AdamW.
These results were obtained using NVIDIA NeMo Megatron Bridge 26.02, a PyTorch-native library designed for pretraining and fine-tuning LLMs. The benchmarks highlight Muon's ability to handle the computational demands of modern AI workloads without sacrificing efficiency.
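For context, per-GPU throughput figures of this kind are usually derived from the model's FLOPs per training step and the measured step time. The back-of-the-envelope sketch below uses made-up numbers and is not NVIDIA's exact accounting.

```python
def achieved_tflops_per_gpu(flops_per_step: float, step_time_s: float, num_gpus: int) -> float:
    """Rough throughput estimate: training FLOPs per step divided by wall time and GPU count."""
    return flops_per_step / step_time_s / num_gpus / 1e12

# Illustrative numbers only: a step costing 3.0e18 FLOPs finishing in 39 s across 72 GPUs.
print(achieved_tflops_per_gpu(3.0e18, 39.0, 72))  # ~1068 TFLOPs/s/GPU
```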
Technological Innovations
Scaling Muon to thousands of GPUs presents challenges, including increased computational and memory costs during preconditioning, as well as communication bottlenecks in distributed systems. NVIDIA addresses these hurdles through several innovations:
Layer-Wise Distributed Optimizer: Full layers of model parameters are distributed across GPUs, enabling efficient preconditioning without excessive communication overhead.
Distributed Newton-Schulz: Two modes, duplicated and distributed, allow flexible handling of momentum updates. The duplicated mode minimizes latency, while the distributed mode optimizes computational efficiency.
Communication Hiding and SYRK Fusion: Techniques such as overlapping parameter updates with computation and fusing SYRK operations with communication significantly reduce latency, boosting overall throughput (illustrated in the sketch after this list).
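To illustrate the communication-hiding idea in general terms, the sketch below overlaps an asynchronous all-gather of one layer's freshly updated parameter shard with the optimizer math for the next layer. This is a generic PyTorch pattern under assumed helper names (shards, full_params, apply_update), not the fused SYRK/communication kernels NVIDIA describes.

```python
import torch.distributed as dist

def update_layers_with_comm_overlap(shards, full_params, apply_update):
    """Overlap per-layer parameter all-gathers with per-layer optimizer compute.

    shards: this rank's parameter shards, one per layer.
    full_params: preallocated full-size tensors to gather each layer into.
    apply_update: callable that runs the (expensive) optimizer math on a shard in place.

    Assumes torch.distributed has already been initialized (e.g., init_process_group).
    """
    pending = []
    for i, shard in enumerate(shards):
        apply_update(shard)  # e.g., Newton-Schulz preconditioning for this layer's shard
        # Launch the all-gather without blocking, so the next layer's update
        # runs while this layer's parameters move over the network.
        work = dist.all_gather_into_tensor(full_params[i], shard, async_op=True)
        pending.append(work)
    for work in pending:
        work.wait()  # drain outstanding communication before the next forward pass
```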
Implications and Future Developments
By integrating Muon into Megatron Core, NVIDIA is equipping researchers and developers with tools to improve LLM training at scale. The near-parity performance with AdamW makes Muon an attractive choice, especially as upcoming updates promise further efficiency gains, including enhanced load balancing, better communication strategies, and advanced kernel optimizations for SYRK operations.
For those eager to explore these technologies, NVIDIA has made tools and performance recipes available through its Megatron Bridge GitHub repository. With these resources, researchers can implement and benchmark emerging optimizers like Muon in their own LLM projects.
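As a minimal starting point for such benchmarking, independent of the Megatron Bridge recipes and using only stock PyTorch, one can compare per-step wall time of candidate optimizers on a small proxy model before scaling up:

```python
import time
import torch
import torch.nn as nn

def time_optimizer_steps(make_optimizer, steps: int = 50, d: int = 1024) -> float:
    """Average wall time per training step for a given optimizer factory on a toy MLP."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    opt = make_optimizer(model.parameters())
    x = torch.randn(32, d)
    start = time.perf_counter()
    for _ in range(steps):
        loss = model(x).pow(2).mean()
        loss.backward()
        opt.step()
        opt.zero_grad()
    return (time.perf_counter() - start) / steps

# Baseline AdamW; swap in a Muon implementation from your own stack to compare.
print("AdamW s/step:", time_optimizer_steps(lambda p: torch.optim.AdamW(p, lr=1e-3)))
```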
Image source: Shutterstock






