Sharing field notes from a small-scale GRPO + Muon experiment.
I was curious whether Muon’s high-velocity updates are compatible with RL-style training (vs pretraining). This write-up documents four failure modes I hit, what signals turned out to be misleading, and the narrow stability pocket where learning actually emerged.
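For readers who haven't looked inside Muon: the "high-velocity" behaviour comes from orthogonalizing the momentum buffer, so every direction of a 2-D weight update lands at roughly unit singular value. Here's a minimal NumPy sketch of that mechanism — the quintic Newton-Schulz coefficients are the commonly published ones, but `muon_step` is a simplified stand-in (real implementations also apply a shape-dependent scale factor), not the optimizer used in these runs:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix with the quintic
    Newton-Schulz iteration used by public Muon implementations.
    Singular values are driven toward ~1 (approximate, by design)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # Frobenius-normalize so all s.v. <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical Muon-style step: accumulate momentum, then
    orthogonalize it before applying, so the update has near-uniform
    magnitude in every singular direction."""
    momentum = beta * momentum + grad
    param = param - lr * newton_schulz_orthogonalize(momentum)
    return param, momentum
```

The point of the sketch: unlike Adam, the update magnitude is nearly independent of the raw gradient scale, which is exactly why RL-style reward signals might interact with it differently than pretraining losses do.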
I also included a "Four Horsemen" section covering the failure modes. I don't see many people talk about failure modes openly, so I hope that section will be helpful for some of you.
Would love feedback from folks who’ve experimented with Muon / non-Adam optimizers in RL.
Blog post: Field Notes: The Dilemma of Training Reasoning with Muon
-Jen
Update / Follow-up:
After publishing these field notes, I ran one more constrained experiment with a lower LR and more conservative scaling.
Interestingly, the model recovered formatting consistency (\boxed{}) and produced correct, step-by-step solutions on a small evaluation set. This suggests there are narrow stability pockets where Muon + GRPO can hold both structure and correctness — though the basin still seems fragile.
This doesn’t contradict the failure modes above, but adds nuance: the issue seems to be retention and damping, not the absence of learning. Sharing in case this is useful to others experimenting in this space.
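For context on the GRPO side of the setup, the signal the optimizer receives is the standard group-relative advantage: each sampled completion's reward is normalized by the mean and std of its own prompt's group. A minimal sketch of that formulation (generic GRPO, not the exact training code from these runs):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages as in standard GRPO: normalize each
    completion's reward against its own group's statistics, so the
    policy gradient only sees within-group reward differences."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Note that when all completions in a group tie (all correct or all wrong), the advantages collapse to ~0 — one reason small evaluation sets can make the learning signal look sparser than it is.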
Update: In the second instalment of my field notes, I documented a very puzzling observation: Muon struggles to stabilize despite very low KL divergence. This instalment focuses on entropy dynamics, format-reward failure modes, and a hypothesis about orthogonality vs. variance states (with an ablation plan using SGD as a bookend).
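For concreteness, "entropy" here means the usual mean per-token policy entropy — a falling value signals the policy collapsing onto a few tokens. A generic sketch of how it's computed (illustrative, not the actual instrumentation behind the post):

```python
import math

def mean_token_entropy(logits_per_token):
    """Mean Shannon entropy (in nats) of the softmax distribution at
    each generation step; near-zero means a near-deterministic policy."""
    total = 0.0
    for logits in logits_per_token:
        m = max(logits)                            # stabilize the softmax
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        total += -sum(p * math.log(p) for p in probs if p > 0)
    return total / len(logits_per_token)
```

A uniform distribution over a vocabulary of size V gives the maximum value log(V), which is a handy sanity check when wiring this into logging.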
Sharing in case it’s useful to anyone else poking at optimizer geometry in RL.
Feedback welcome.
Field Notes: Why Muon "Hollows Out" in RL (and What We Plan To Do Next)