Sharing field notes from a small-scale GRPO + Muon experiment

I was curious whether Muon’s high-velocity updates are compatible with RL-style training (vs pretraining). This write-up documents four failure modes I hit, what signals turned out to be misleading, and the narrow stability pocket where learning actually emerged.
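For readers unfamiliar with Muon: its update replaces the raw momentum buffer with an approximately orthogonalized version, computed via a quintic Newton–Schulz iteration. A minimal numpy sketch, using the coefficients from the reference implementation — treat the learning rate, momentum, and scaling choices as illustrative assumptions, not the exact settings from my runs:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately map G to the nearest semi-orthogonal matrix
    (the U V^T factor of its SVD) via quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference Muon code
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so all singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate in the wide orientation so X @ X.T is the small Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=0.02, momentum=0.95):
    """One illustrative Muon update: momentum accumulation, then an
    orthogonalized step (hyperparameters here are placeholders)."""
    M = momentum * M + G
    W = W - lr * newton_schulz_orthogonalize(M)
    return W, M
```

The orthogonalization is what I mean by "high-velocity": every singular direction of the momentum gets pushed toward unit scale, so small gradient directions move as fast as large ones.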

I also included a “Four Horsemen” section on failure modes. People rarely discuss failure modes in detail, so I hope that section is helpful for some of you.

Would love feedback from folks who’ve experimented with Muon / non-Adam optimizers in RL.

Blog post 👉 Field Notes: The Dilemma of Training Reasoning with Muon

-Jen

Update / Follow-up:

After publishing these field notes, I ran one more constrained experiment with a lower LR and more conservative scaling.

Interestingly, the model recovered formatting consistency (\boxed{}) and produced correct, step-by-step solutions on a small evaluation set. This suggests there are narrow stability pockets where Muon + GRPO can hold both structure and correctness — though the basin still seems fragile.
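For context on the formatting signal: this is the kind of consistency a simple format reward can measure. A minimal sketch, assuming a regex check for a final \boxed{} answer — the actual reward function used in these runs may differ:

```python
import re

def format_reward(completion: str) -> float:
    """Hypothetical GRPO-style format reward: 1.0 if the completion
    contains a \\boxed{...} answer, else 0.0. Illustrative only."""
    return 1.0 if re.search(r"\\boxed\{[^{}]+\}", completion) else 0.0
```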

This doesn’t contradict the failure modes above, but it adds nuance: the issue seems to be retention and damping, not an absence of learning. Sharing in case it’s useful to others experimenting in this space.

Update: In the second instalment of my field notes, I documented a very puzzling observation: Muon struggles to stabilize despite very low KL divergence. This instalment focuses on entropy dynamics, format-reward failure modes, and a hypothesis about orthogonality vs. variance states (with an ablation plan using SGD as a bookend).
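One concrete signal behind the entropy dynamics I track: mean per-token entropy of the policy's next-token distribution, which tends to collapse when the model "hollows out". A minimal numpy sketch (the actual logging in my runs may differ):

```python
import numpy as np

def mean_token_entropy(logits):
    """Mean per-token entropy (nats) of the softmax distribution over a
    [num_tokens, vocab] logit array -- a common collapse diagnostic in RL."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return entropy.mean()
```

Uniform logits give entropy log(vocab_size); a sharply peaked distribution gives entropy near zero, which is the "hollowed out" regime.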

Sharing in case it’s useful to anyone else poking at optimizer geometry in RL.

Feedback welcome.

👉 Field Notes: Why Muon "Hollows Out" in RL (and How We Plan To DO Next)
