Sharing field notes from a small-scale GRPO + Muon experiment.
I was curious whether Muon’s high-velocity updates are compatible with RL-style training (vs pretraining). This write-up documents four failure modes I hit, what signals turned out to be misleading, and the narrow stability pocket where learning actually emerged.
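For readers who haven't looked inside Muon: the "high-velocity" behaviour comes from orthogonalizing the momentum buffer, so every direction of a 2-D weight update lands at roughly unit singular value. Here's a minimal NumPy sketch of that mechanism — the quintic Newton-Schulz coefficients are the commonly published ones, but `muon_step` is a simplified stand-in (real implementations also apply a shape-dependent scale factor), not the optimizer used in these runs:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix with the quintic
    Newton-Schulz iteration used by public Muon implementations.
    Singular values are driven toward ~1 (approximate, by design)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # Frobenius-normalize so all s.v. <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical Muon-style step: accumulate momentum, then
    orthogonalize it before applying, so the update has near-uniform
    magnitude in every singular direction."""
    momentum = beta * momentum + grad
    param = param - lr * newton_schulz_orthogonalize(momentum)
    return param, momentum
```

The point of the sketch: unlike Adam, the update magnitude is nearly independent of the raw gradient scale, which is exactly why RL-style reward signals might interact with it differently than pretraining losses do.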
I also included a "Four Horsemen" section covering the failure modes. I don't see many people talk about failure modes openly, so I hope that section will be helpful for some of you.
Would love feedback from folks who’ve experimented with Muon / non-Adam optimizers in RL.
Blog post: Field Notes: The Dilemma of Training Reasoning with Muon
-Jen
Update / Follow-up:
After publishing these field notes, I ran one more constrained experiment with a lower LR and more conservative scaling.
Interestingly, the model recovered formatting consistency (\boxed{}) and produced correct, step-by-step solutions on a small evaluation set. This suggests there are narrow stability pockets where Muon + GRPO can hold both structure and correctness — though the basin still seems fragile.
This doesn’t contradict the failure modes above, but adds nuance: the issue seems to be retention and damping, not the absence of learning. Sharing in case this is useful to others experimenting in this space.
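For context on the GRPO side of the setup, the signal the optimizer receives is the standard group-relative advantage: each sampled completion's reward is normalized by the mean and std of its own prompt's group. A minimal sketch of that formulation (generic GRPO, not the exact training code from these runs):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages as in standard GRPO: normalize each
    completion's reward against its own group's statistics, so the
    policy gradient only sees within-group reward differences."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Note that when all completions in a group tie (all correct or all wrong), the advantages collapse to ~0 — one reason small evaluation sets can make the learning signal look sparser than it is.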
Update: In the second instalment of my field notes, I documented a very puzzling observation: Muon struggles to stabilize despite very low KL divergence. This instalment focuses on entropy dynamics, format-reward failure modes, and a hypothesis about orthogonality vs. variance states (with an ablation plan using SGD as a bookend).
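For concreteness, "entropy" here means the usual mean per-token policy entropy — a falling value signals the policy collapsing onto a few tokens. A generic sketch of how it's computed (illustrative, not the actual instrumentation behind the post):

```python
import math

def mean_token_entropy(logits_per_token):
    """Mean Shannon entropy (in nats) of the softmax distribution at
    each generation step; near-zero means a near-deterministic policy."""
    total = 0.0
    for logits in logits_per_token:
        m = max(logits)                            # stabilize the softmax
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        total += -sum(p * math.log(p) for p in probs if p > 0)
    return total / len(logits_per_token)
```

A uniform distribution over a vocabulary of size V gives the maximum value log(V), which is a handy sanity check when wiring this into logging.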
Sharing in case it’s useful to anyone else poking at optimizer geometry in RL.
Feedback welcome.
Field Notes: Why Muon "Hollows Out" in RL (and What We Plan To Do Next)