I just published the first part of a tutorial series on the Muon Optimizer.
Muon (Momentum Orthogonalized by Newton-Schulz) is quickly becoming the go-to optimizer for large-scale training. It’s already powering trillion-parameter frontier models like Kimi-2 (MuonClip) and was critical for the ATLAS paper, where first-order optimizers failed.
In this series, I’m breaking Muon down step by step: intuition, pseudocode, PyTorch implementation, and practical guidance on when/where to use it.
Also — I’d really like to contribute this as a guest article to the Hugging Face blog. I know the blog is managed by a group, but it looks like external contributors can’t directly join. If anyone here has advice or connections on how to submit contributions, I’d love to hear it ![]()
Muon deserves more attention in the open-source community, and I’d be excited to help bridge that gap.
