I ran experiments in Colab, within the limits of my understanding, to narrow down the problem:
Why this plateaus (and why “reward → 0.98 but exact-match stays flat” is expected)
1) Exact-match is a brittle objective
In output-prediction benchmarks like CRUXEval, the task is to produce the exact output string for a Python function and a given input. Small surface-form differences (quotes, whitespace/newlines, float formatting, container repr details) turn a “correct value” into a strict failure. CRUXEval shows even strong systems are far from perfect on this kind of execution-style output prediction. (arXiv)
2) GRPO’s learning signal collapses on the “hard tail”
GRPO is group-relative: for each prompt you sample a group of rollouts and compute advantages relative to that group. When a tail prompt yields rollouts that all get the same reward (often all 0 under sparse exact-match), the within-group variance goes to ~0, so the effective gradient becomes tiny/zero on exactly the examples you need to improve.
TRL exposes this failure mode directly via frac_reward_zero_std (“fraction of samples … with a reward std of zero, implying there is little diversity for that prompt”). (Hugging Face)
This is also the core motivation of replay-based fixes like RePO (Replay-Enhanced Policy Optimization): when groups are homogeneous, you get “ineffective steps,” so you need mechanisms that bring back variance / usable comparisons. (arXiv)
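Why a homogeneous group yields no gradient falls directly out of the advantage computation. A minimal sketch of the usual mean/std group normalization (TRL's exact computation depends on its `scale_rewards` mode; this is just the common form):

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: each reward normalized against its own group."""
    std = pstdev(rewards)
    if std < eps:
        # All rollouts scored the same (e.g. all 0 under sparse exact-match):
        # every advantage is zero and the prompt contributes no gradient.
        return [0.0] * len(rewards)
    mu = mean(rewards)
    return [(r - mu) / std for r in rewards]

# A "dead" group on the hard tail: identical rewards -> zero learning signal.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # all zeros
# A mixed group: one exact hit produces usable comparisons.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))
```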
3) Dense rewards saturate before they force exact strings
Dense similarity-style rewards can approach 1.0 without “seeing” the final 1–2 character differences that your verifier cares about. So average rollout reward rises smoothly while exact-match hits become rarer and rarer.
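A toy illustration of the saturation, using difflib similarity as a stand-in for whatever dense reward you actually use:

```python
from difflib import SequenceMatcher

def dense_reward(pred, target):
    """Stand-in dense similarity reward in [0, 1]."""
    return SequenceMatcher(None, pred, target).ratio()

def exact_reward(pred, target):
    return 1.0 if pred == target else 0.0

target = "[1, 2, 3, 4.0]"
pred = "[1, 2, 3, 4]"   # plausibly "the right value", but not the exact repr

print(dense_reward(pred, target))  # ~0.92: reward looks nearly saturated
print(exact_reward(pred, target))  # 0.0: the verifier sees a strict failure
```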
4) RLVR often reallocates probability mass rather than inventing new capability
A common pattern in RL with verifiable rewards is that training improves the likelihood of certain solution paths (and can reduce diversity), so you may see better top-1 behavior on some slice but not necessarily improved Pass@K or tail breakthroughs unless you explicitly address exploration/variance and probability-mass concentration. (OpenReview)
The fastest way to know what’s actually blocking you (diagnostics that decide the fix)
Run these on the tail slice (prompts where greedy is wrong), and track them over training:
A) Top-1 vs Pass@K (or Pass@G)
- If Pass@G_exact is meaningfully higher than Top-1_exact, the model can hit exact outputs but assigns low probability mass to them → prioritize distillation / likelihood shaping.
- If Pass@G_exact is near zero on the tail, it’s more of a capability / protocol / search problem → prioritize replay/scaffolding/constraints.
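Both numbers come cheaply from the same rollouts. A sketch, assuming you have per-prompt greedy and sampled exact-match booleans (the dict layout here is hypothetical; adapt to however you log rollouts):

```python
def tail_diagnostics(results):
    """results: {prompt_id: {"greedy_exact": bool, "rollout_exact": [bool, ...]}}"""
    n = len(results)
    top1 = sum(r["greedy_exact"] for r in results.values()) / n
    pass_g = sum(any(r["rollout_exact"]) for r in results.values()) / n
    return {"top1_exact": top1, "pass_g_exact": pass_g, "gap": pass_g - top1}

demo = {
    "p1": {"greedy_exact": False, "rollout_exact": [False, True, False, False]},
    "p2": {"greedy_exact": False, "rollout_exact": [False, False, False, False]},
}
print(tail_diagnostics(demo))  # gap > 0 -> distillation candidate
```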
B) Dead-group breakdown (per prompt, across G rollouts)
Track:
- dead_all_wrong (all rollouts reward 0)
- dead_mixed (some 0, some 1)
- dead_all_correct (all rollouts reward 1)
If dead_all_wrong dominates the tail, sparse exact-match GRPO will stall unless you change the algorithmic situation (replay, scaffolding, bigger groups, etc.). TRL’s frac_reward_zero_std is a good global proxy, but the tail-only breakdown is what you want. (Hugging Face)
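The tail-only breakdown is a few lines over your logged per-group rewards. A sketch, assuming binary exact-match rewards:

```python
from collections import Counter

def dead_group_breakdown(groups):
    """groups: per-prompt lists of binary exact-match rewards, one per rollout."""
    counts = Counter()
    for rewards in groups:
        if all(r == 0 for r in rewards):
            counts["dead_all_wrong"] += 1
        elif all(r == 1 for r in rewards):
            counts["dead_all_correct"] += 1
        else:
            counts["dead_mixed"] += 1   # these still carry within-group signal
    return counts

groups = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
print(dead_group_breakdown(groups))  # one group in each bucket
```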
C) Error taxonomy (cheap wins vs hard wins)
Bucket each completion into:
- (1) Non-literal / extra text (explanations, prompt echo, multiple lines)
- (2) Literal but wrong value (semantic execution error)
- (3) Right value, wrong repr (formatting/protocol mismatch)
If bucket (3) is large, you’re mostly fighting last-mile formatting, which is very fixable with constraints + last-mile shaping.
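When the target is a Python literal, `ast.literal_eval` gives you this bucketing almost for free. A sketch (the bucket names are mine; `target_repr` is the canonical exact string your verifier expects):

```python
import ast

def bucket_completion(completion, target_value, target_repr):
    text = completion.strip()
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return "non_literal"   # explanations, prompt echo, multiple lines
    if value != target_value:
        return "wrong_value"   # semantic execution error
    if text != target_repr:
        return "wrong_repr"    # right value, formatting/protocol mismatch
    return "exact"

print(bucket_completion("The answer is [1, 2]", [1, 2], "[1, 2]"))  # non_literal
print(bucket_completion("[1,2]", [1, 2], "[1, 2]"))                 # wrong_repr
```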
Fixes that reliably break the wall (ordered by ROI)
1) Make dead groups learnable again: replay-based GRPO
(a) TRL “GRPO with replay buffer” (practical drop-in)
TRL provides an experimental GRPO trainer that replaces groups whose reward standard deviation is 0 with replayed higher-reward/higher-std groups from previous batches. (Hugging Face)
(b) RePO-style replay (principled generalization)
RePO formalizes replay/off-policy retrieval for GRPO-like training to increase “effective optimization steps” and reduce data inefficiency. (arXiv)
What this targets: your tail’s dead_all_wrong / zero-std collapse.
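To make the mechanism concrete, here is a minimal, library-agnostic sketch of the replay idea. This is not TRL's or RePO's actual implementation, just the shape of "swap a zero-std group for a stored high-variance one":

```python
import random

class GroupReplayBuffer:
    """Store recent high-variance groups; substitute them for dead ones.
    A group is a list of (completion, reward) pairs for one prompt."""

    def __init__(self, capacity=256):
        self.buffer, self.capacity = [], capacity

    def maybe_store(self, group):
        rewards = [r for _, r in group]
        if max(rewards) > min(rewards):        # only usable comparisons
            self.buffer.append(group)
            self.buffer = self.buffer[-self.capacity:]

    def patch(self, group):
        rewards = [r for _, r in group]
        if max(rewards) == min(rewards) and self.buffer:
            return random.choice(self.buffer)  # replay instead of a dead group
        return group
```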
2) Verify reward scaling/normalization isn’t silently killing your signal
In TRL GRPO, reward scaling mode affects how advantages are normalized; TRL documents how reward_std is computed under scale_rewards modes and logs frac_reward_zero_std. (Hugging Face)
Also note there has been a reported issue where scale_rewards="batch" behaved unexpectedly in some revisions. Treat this as “must regression-test” rather than an assumption. (GitHub)
Actionable: add a tiny synthetic unit test that feeds known rewards and asserts the std/advantages match your intended semantics before you trust any plateau conclusions.
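A sketch of such a regression test. `check_advantage_semantics` and the reference `expected_advantages` are hypothetical names I'm introducing here; the point is to pin down the semantics you intend and assert that your trainer's actual advantage hook matches them:

```python
from statistics import mean, pstdev

def expected_advantages(rewards, scale=True, eps=1e-4):
    """What *you* intend the advantages to be under your config
    (align this with your scale_rewards setting before trusting it)."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not scale:
        return centered
    std = pstdev(rewards)
    return [c / (std + eps) for c in centered]

def check_advantage_semantics(compute_advantages):
    """compute_advantages: your trainer's actual advantage function (hook it up)."""
    cases = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.5, 0.5, 1.0, 0.0]]
    for rewards in cases:
        got = compute_advantages(rewards)
        want = expected_advantages(rewards)
        assert all(abs(g - w) < 1e-6 for g, w in zip(got, want)), (rewards, got, want)

check_advantage_semantics(expected_advantages)  # trivially passes; plug in TRL's
```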
3) If you combine multiple dense rewards: consider GDPO (multi-reward normalization fix)
If your 72% comes from mixing dense rewards, you may be hitting a multi-reward resolution collapse: GDPO shows that applying GRPO-style normalization directly to combined multi-reward signals can collapse distinct reward combinations into similar advantages, harming convergence/stability. GDPO decouples normalization per reward component as a drop-in replacement for GRPO in multi-reward RL. (arXiv)
What this targets: “reward looks high and smooth” but doesn’t translate into the exact-match metric because the effective advantage signal loses resolution.
4) Constrain decoding so the model stays in the “single-literal” manifold
If a noticeable fraction of failures are “extra text / not a literal,” you should treat it as a decoding/protocol problem, not an RL problem.
(a) Stop strings
Transformers supports stop_strings to terminate generation when a string is produced. (Hugging Face)
In practice, it often requires passing a tokenizer (tokenizer=...) alongside stop_strings. (GitHub)
Be aware there are open issues where stop_strings can behave unexpectedly in some setups; test on your exact stack. (GitHub)
(b) Prefix-constrained decoding
Use prefix_allowed_tokens_fn (or a trie constraint) to restrict generation to tokens consistent with a Python literal (numbers, brackets, quotes, commas, True/False/None, etc.). This prevents drifting into explanations.
There are also known edge cases/bugs when constraints return degenerate allowed sets; test thoroughly. (GitHub)
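As a concrete starting point, you can derive a static allowed-token set from the vocabulary and feed it through `prefix_allowed_tokens_fn`. This character-whitelist filter is a deliberately crude sketch (a production constraint should be a stateful trie over the literal grammar), and the toy vocab is hypothetical:

```python
# Crude whitelist of characters that can appear in a Python literal
# (numbers, strings, containers, True/False/None). Includes a space.
LITERAL_CHARS = set("0123456789.+-eEjJ'\"[](){},: TrueFalseNon\\")

def literal_token_ids(vocab):
    """vocab: {token_string: token_id}. Keep tokens made only of literal chars."""
    return sorted(
        tid for tok, tid in vocab.items()
        if tok and all(c in LITERAL_CHARS for c in tok)
    )

toy_vocab = {"[1": 0, ", 2]": 1, "hello": 2, "True": 3, "the ": 4}
allowed = literal_token_ids(toy_vocab)
print(allowed)  # [0, 1, 3]: "hello" and "the " are filtered out
```

Since the set is static, wiring it up is just `model.generate(..., prefix_allowed_tokens_fn=lambda batch_id, input_ids: allowed)`; a real setup would also handle special tokens and subword prefixes.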
What this targets: it converts many “formatting fails” into “valid literal but wrong value/repr,” which makes reward shaping far more effective.
5) Use a protocol-gated last-mile shaping ladder (dense only where it matters)
A robust pattern for exact-match tasks:
- (1) Hard gate: reward = 0 unless the output is exactly one parseable literal (no extra text).
- (2) Value reward: once parseable, reward correctness of the parsed value (semantic).
- (3) Last-mile string shaping: only after (1), and ideally only once the value is correct, add shaping on prefix match / edit distance / exact repr features.
This avoids the main failure mode of dense rewards: rewarding “close-looking” text that never becomes an exact canonical string.
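A sketch of the ladder as a single reward function; the 0.2 / 0.6 / 0.4 weights are illustrative, not tuned:

```python
import ast
from difflib import SequenceMatcher

def ladder_reward(completion, target_value, target_repr):
    """Protocol-gated shaping: hard gate -> value reward -> last-mile shaping."""
    text = completion.strip()
    # (1) Hard gate: must be exactly one parseable single-line literal.
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return 0.0
    if "\n" in text:
        return 0.0
    # (2) Value reward: semantics of the parsed value.
    if value != target_value:
        return 0.2   # parsed but wrong value: small protocol credit
    # (3) Last-mile string shaping toward the exact canonical repr.
    sim = SequenceMatcher(None, text, target_repr).ratio()
    return 0.6 + 0.4 * sim   # reaches 1.0 only at the exact string

print(ladder_reward("[1, 2, 3]", [1, 2, 3], "[1, 2, 3]"))  # 1.0
print(ladder_reward("[1,2,3]", [1, 2, 3], "[1, 2, 3]"))    # between 0.6 and 1.0
print(ladder_reward("the list is [1, 2, 3]", [1, 2, 3], "[1, 2, 3]"))  # 0.0
```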
6) If Pass@G_exact exists but Top-1_exact stalls: RL → distill → RL
When the model already occasionally hits exact outputs on tail prompts, the fastest way to push top-1 exact is often:
- sample rollouts (temperature mix / best-of-G) and harvest exact hits
- do a short supervised distillation pass to concentrate probability mass on the exact output format
- resume RL on the newly defined tail
This aligns with RLVR analyses that emphasize probability-mass reallocation and early emergence of “correct path” incentives. (OpenReview)
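Harvesting the exact hits into an SFT set is mechanical. A sketch, assuming you keep raw completions per prompt and have an exact-match checker (here the shortest hit is kept as the distillation target; other selection rules work too):

```python
def harvest_exact_hits(rollouts, exact_checker):
    """rollouts: {prompt: [completion, ...]} from a temperature mix / best-of-G.
    Keep one exact hit per prompt as a supervised distillation target."""
    sft_pairs = []
    for prompt, completions in rollouts.items():
        hits = [c for c in completions if exact_checker(prompt, c)]
        if hits:
            sft_pairs.append({"prompt": prompt, "completion": min(hits, key=len)})
    return sft_pairs

demo = {"f(2)?": ["4", "four", "4 "], "g(1)?": ["oops", "nope"]}
targets = {"f(2)?": "4", "g(1)?": "2"}
pairs = harvest_exact_hits(demo, lambda p, c: c == targets[p])
print(pairs)  # only the prompt with at least one exact hit survives
```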
7) If the tail is truly beyond capability: scaffold the tail (Scaf-GRPO idea, adapted)
Scaf-GRPO targets the “learning cliff” where hard problems stay at persistent zero reward, making them invisible to GRPO gradients; it injects minimal hints only when learning stagnates. (arXiv)
For code-execution output prediction, a “scaffold” can be lightweight and verifier-compatible, e.g.:
- provide an intermediate trace format only for examples that are repeatedly dead-all-wrong
- add a two-stage prompt (first compute value internally, then output only the literal) while still verifying only the final literal
- curriculum on output types (ints → lists → nested → floats/strings/edge reprs)
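One way to operationalize the output-type curriculum is a crude difficulty key over target values. The ordering below mirrors the ints → lists → nested → floats/strings progression and is an assumption to adjust against your own failure taxonomy:

```python
def type_difficulty(value):
    """Crude curriculum key over target output values (assumed ordering)."""
    if value is None or isinstance(value, (bool, int)):
        return 0                      # ints and simple atoms first
    if isinstance(value, (list, tuple, dict, set)):
        nested = isinstance(value, (list, tuple)) and any(
            isinstance(v, (list, tuple, dict, set)) for v in value
        )
        return 2 if nested else 1     # flat containers, then nested ones
    return 3                          # floats / strings / edge reprs last

examples = [3.14, [1, [2]], 7, "ab", [1, 2]]
print(sorted(examples, key=type_difficulty))  # ints first, floats/strings last
```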
A concrete ablation plan that usually identifies the bottleneck in 2–3 runs
On a fixed tail slice, compare these variants with the same total compute:
- Baseline (current best) + log: Top-1_exact, Pass@G_exact, dead_all_wrong rate, non-literal rate
- + Replay-buffer GRPO (or RePO-style replay) (Hugging Face)
- + Protocol-constrained decoding (stop strings and/or prefix constraints) (Hugging Face)
- Multi-reward: swap GRPO normalization for GDPO-style decoupling (if multi-reward) (arXiv)
- If Pass@G gap exists: harvest exact hits → distill → resume RL (OpenReview)
The key is: do not judge progress by “average reward” alone once you hit the plateau. Judge it by (a) tail dead-group rate, (b) non-literal rate, and (c) Pass@G vs Top-1. Those three numbers will tell you which intervention will actually move strict exact-match.