I ran experiments in Colab, within the limits of my understanding, to narrow down the problem:
Why this plateaus (and why “reward → 0.98 but exact-match stays flat” is expected)
1) Exact-match is a brittle objective
In output-prediction benchmarks like CRUXEval, the task is to produce the exact output string for a Python function and a given input. Small surface-form differences (quotes, whitespace/newlines, float formatting, container repr details) turn a “correct value” into a strict failure. CRUXEval shows even strong systems are far from perfect on this kind of execution-style output prediction. (arXiv)
2) GRPO’s learning signal collapses on the “hard tail”
GRPO is group-relative: for each prompt you sample a group of rollouts and compute advantages relative to that group. When a tail prompt yields rollouts that all get the same reward (often all 0 under sparse exact-match), the within-group variance goes to ~0, so the effective gradient becomes tiny/zero on exactly the examples you need to improve.
TRL exposes this failure mode directly via frac_reward_zero_std (“fraction of samples … with a reward std of zero, implying there is little diversity for that prompt”). (Hugging Face)
This is also the core motivation of replay-based fixes like RePO (Replay-Enhanced Policy Optimization): when groups are homogeneous, you get “ineffective steps,” so you need mechanisms that bring back variance / usable comparisons. (arXiv)
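Why a homogeneous group yields no gradient falls directly out of the advantage computation. A minimal sketch of the usual mean/std group normalization (TRL's exact computation depends on its `scale_rewards` mode; this is just the common form):

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: each reward normalized against its own group."""
    std = pstdev(rewards)
    if std < eps:
        # All rollouts scored the same (e.g. all 0 under sparse exact-match):
        # every advantage is zero and the prompt contributes no gradient.
        return [0.0] * len(rewards)
    mu = mean(rewards)
    return [(r - mu) / std for r in rewards]

# A "dead" group on the hard tail: identical rewards -> zero learning signal.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # all zeros
# A mixed group: one exact hit produces usable comparisons.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))
```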
3) Dense rewards saturate before they force exact strings
Dense similarity-style rewards can approach 1.0 without “seeing” the final 1–2 character differences that your verifier cares about. So average rollout reward rises smoothly while exact-match hits become rarer and rarer.
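A toy illustration of the saturation, using difflib similarity as a stand-in for whatever dense reward you actually use:

```python
from difflib import SequenceMatcher

def dense_reward(pred, target):
    """Stand-in dense similarity reward in [0, 1]."""
    return SequenceMatcher(None, pred, target).ratio()

def exact_reward(pred, target):
    return 1.0 if pred == target else 0.0

target = "[1, 2, 3, 4.0]"
pred = "[1, 2, 3, 4]"   # plausibly "the right value", but not the exact repr

print(dense_reward(pred, target))  # ~0.92: reward looks nearly saturated
print(exact_reward(pred, target))  # 0.0: the verifier sees a strict failure
```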
4) RLVR often reallocates probability mass rather than inventing new capability
A common pattern in RL with verifiable rewards is that training improves the likelihood of certain solution paths (and can reduce diversity), so you may see better top-1 behavior on some slice but not necessarily improved Pass@K or tail breakthroughs unless you explicitly address exploration/variance and probability-mass concentration. (OpenReview)
The fastest way to know what’s actually blocking you (diagnostics that decide the fix)
Run these on the tail slice (prompts where greedy is wrong), and track them over training:
A) Top-1 vs Pass@K (or Pass@G)
- If Pass@G_exact is meaningfully higher than Top-1_exact, the model can hit exact outputs but assigns low probability mass to them → prioritize distillation / likelihood shaping.
- If Pass@G_exact is near zero on the tail, it’s more of a capability / protocol / search problem → prioritize replay/scaffolding/constraints.
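Both numbers come cheaply from the same rollouts. A sketch, assuming you have per-prompt greedy and sampled exact-match booleans (the dict layout here is hypothetical; adapt to however you log rollouts):

```python
def tail_diagnostics(results):
    """results: {prompt_id: {"greedy_exact": bool, "rollout_exact": [bool, ...]}}"""
    n = len(results)
    top1 = sum(r["greedy_exact"] for r in results.values()) / n
    pass_g = sum(any(r["rollout_exact"]) for r in results.values()) / n
    return {"top1_exact": top1, "pass_g_exact": pass_g, "gap": pass_g - top1}

demo = {
    "p1": {"greedy_exact": False, "rollout_exact": [False, True, False, False]},
    "p2": {"greedy_exact": False, "rollout_exact": [False, False, False, False]},
}
print(tail_diagnostics(demo))  # gap > 0 -> distillation candidate
```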
B) Dead-group breakdown (per prompt, across G rollouts)
Track:
- dead_all_wrong (all rollouts reward 0)
- dead_mixed (some 0, some 1)
- dead_all_correct (all rollouts reward 1)
If dead_all_wrong dominates the tail, sparse exact-match GRPO will stall unless you change the algorithmic situation (replay, scaffolding, bigger groups, etc.). TRL’s frac_reward_zero_std is a good global proxy, but the tail-only breakdown is what you want. (Hugging Face)
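The tail-only breakdown is a few lines over your logged per-group rewards. A sketch, assuming binary exact-match rewards:

```python
from collections import Counter

def dead_group_breakdown(groups):
    """groups: per-prompt lists of binary exact-match rewards, one per rollout."""
    counts = Counter()
    for rewards in groups:
        if all(r == 0 for r in rewards):
            counts["dead_all_wrong"] += 1
        elif all(r == 1 for r in rewards):
            counts["dead_all_correct"] += 1
        else:
            counts["dead_mixed"] += 1   # these still carry within-group signal
    return counts

groups = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
print(dead_group_breakdown(groups))  # one group in each bucket
```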
C) Error taxonomy (cheap wins vs hard wins)
Bucket each completion into:
- (1) Non-literal / extra text (explanations, prompt echo, multiple lines)
- (2) Literal but wrong value (semantic execution error)
- (3) Right value, wrong repr (formatting/protocol mismatch)
If bucket (3) is large, you’re mostly fighting last-mile formatting, which is very fixable with constraints + last-mile shaping.
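When the target is a Python literal, `ast.literal_eval` gives you this bucketing almost for free. A sketch (the bucket names are mine; `target_repr` is the canonical exact string your verifier expects):

```python
import ast

def bucket_completion(completion, target_value, target_repr):
    text = completion.strip()
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return "non_literal"   # explanations, prompt echo, multiple lines
    if value != target_value:
        return "wrong_value"   # semantic execution error
    if text != target_repr:
        return "wrong_repr"    # right value, formatting/protocol mismatch
    return "exact"

print(bucket_completion("The answer is [1, 2]", [1, 2], "[1, 2]"))  # non_literal
print(bucket_completion("[1,2]", [1, 2], "[1, 2]"))                 # wrong_repr
```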
Fixes that reliably break the wall (ordered by ROI)
1) Make dead groups learnable again: replay-based GRPO
(a) TRL “GRPO with replay buffer” (practical drop-in)
TRL provides an experimental GRPO trainer that replaces groups whose reward standard deviation is 0 with replayed higher-reward/higher-std groups from previous batches. (Hugging Face)
(b) RePO-style replay (principled generalization)
RePO formalizes replay/off-policy retrieval for GRPO-like training to increase “effective optimization steps” and reduce data inefficiency. (arXiv)
What this targets: your tail’s dead_all_wrong / zero-std collapse.
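To make the mechanism concrete, here is a minimal, library-agnostic sketch of the replay idea. This is not TRL's or RePO's actual implementation, just the shape of "swap a zero-std group for a stored high-variance one":

```python
import random

class GroupReplayBuffer:
    """Store recent high-variance groups; substitute them for dead ones.
    A group is a list of (completion, reward) pairs for one prompt."""

    def __init__(self, capacity=256):
        self.buffer, self.capacity = [], capacity

    def maybe_store(self, group):
        rewards = [r for _, r in group]
        if max(rewards) > min(rewards):        # only usable comparisons
            self.buffer.append(group)
            self.buffer = self.buffer[-self.capacity:]

    def patch(self, group):
        rewards = [r for _, r in group]
        if max(rewards) == min(rewards) and self.buffer:
            return random.choice(self.buffer)  # replay instead of a dead group
        return group
```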
2) Verify reward scaling/normalization isn’t silently killing your signal
In TRL GRPO, reward scaling mode affects how advantages are normalized; TRL documents how reward_std is computed under scale_rewards modes and logs frac_reward_zero_std. (Hugging Face)
Also note there has been a reported issue where scale_rewards="batch" behaved unexpectedly in some revisions. Treat this as “must regression-test” rather than an assumption. (GitHub)
Actionable: add a tiny synthetic unit test that feeds known rewards and asserts the std/advantages match your intended semantics before you trust any plateau conclusions.
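A sketch of such a regression test. `check_advantage_semantics` and the reference `expected_advantages` are hypothetical names I'm introducing here; the point is to pin down the semantics you intend and assert that your trainer's actual advantage hook matches them:

```python
from statistics import mean, pstdev

def expected_advantages(rewards, scale=True, eps=1e-4):
    """What *you* intend the advantages to be under your config
    (align this with your scale_rewards setting before trusting it)."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not scale:
        return centered
    std = pstdev(rewards)
    return [c / (std + eps) for c in centered]

def check_advantage_semantics(compute_advantages):
    """compute_advantages: your trainer's actual advantage function (hook it up)."""
    cases = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.5, 0.5, 1.0, 0.0]]
    for rewards in cases:
        got = compute_advantages(rewards)
        want = expected_advantages(rewards)
        assert all(abs(g - w) < 1e-6 for g, w in zip(got, want)), (rewards, got, want)

check_advantage_semantics(expected_advantages)  # trivially passes; plug in TRL's
```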
3) If you combine multiple dense rewards: consider GDPO (multi-reward normalization fix)
If your 72% comes from mixing dense rewards, you may be hitting a multi-reward resolution collapse: GDPO shows that applying GRPO-style normalization directly to combined multi-reward signals can collapse distinct reward combinations into similar advantages, harming convergence/stability. GDPO decouples normalization per reward component as a drop-in replacement for GRPO in multi-reward RL. (arXiv)
What this targets: “reward looks high and smooth” but doesn’t translate into the exact-match metric because the effective advantage signal loses resolution.
4) Constrain decoding so the model stays in the “single-literal” manifold
If a noticeable fraction of failures are “extra text / not a literal,” you should treat it as a decoding/protocol problem, not an RL problem.
(a) Stop strings
Transformers supports stop_strings to terminate generation when a string is produced. (Hugging Face)
In practice, it often requires passing a tokenizer (tokenizer=...) alongside stop_strings. (GitHub)
Be aware there are open issues where stop_strings can behave unexpectedly in some setups; test on your exact stack. (GitHub)
(b) Prefix-constrained decoding
Use prefix_allowed_tokens_fn (or a trie constraint) to restrict generation to tokens consistent with a Python literal (numbers, brackets, quotes, commas, True/False/None, etc.). This prevents drifting into explanations.
There are also known edge cases/bugs when constraints return degenerate allowed sets; test thoroughly. (GitHub)
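As a concrete starting point, you can derive a static allowed-token set from the vocabulary and feed it through `prefix_allowed_tokens_fn`. This character-whitelist filter is a deliberately crude sketch (a production constraint should be a stateful trie over the literal grammar), and the toy vocab is hypothetical:

```python
# Crude whitelist of characters that can appear in a Python literal
# (numbers, strings, containers, True/False/None). Includes a space.
LITERAL_CHARS = set("0123456789.+-eEjJ'\"[](){},: TrueFalseNon\\")

def literal_token_ids(vocab):
    """vocab: {token_string: token_id}. Keep tokens made only of literal chars."""
    return sorted(
        tid for tok, tid in vocab.items()
        if tok and all(c in LITERAL_CHARS for c in tok)
    )

toy_vocab = {"[1": 0, ", 2]": 1, "hello": 2, "True": 3, "the ": 4}
allowed = literal_token_ids(toy_vocab)
print(allowed)  # [0, 1, 3]: "hello" and "the " are filtered out
```

Since the set is static, wiring it up is just `model.generate(..., prefix_allowed_tokens_fn=lambda batch_id, input_ids: allowed)`; a real setup would also handle special tokens and subword prefixes.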
What this targets: it converts many “formatting fails” into “valid literal but wrong value/repr,” which makes reward shaping far more effective.
5) Use a protocol-gated last-mile shaping ladder (dense only where it matters)
A robust pattern for exact-match tasks:
- (1) Hard gate: reward = 0 unless the output is exactly one parseable literal (no extra text).
- (2) Value reward: once parseable, reward correctness of the parsed value (semantic).
- (3) Last-mile string shaping: only after (1), and ideally only once the value is correct, add shaping on prefix match / edit distance / exact repr features.
This avoids the main failure mode of dense rewards: rewarding “close-looking” text that never becomes an exact canonical string.
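A sketch of the ladder as a single reward function; the 0.2 / 0.6 / 0.4 weights are illustrative, not tuned:

```python
import ast
from difflib import SequenceMatcher

def ladder_reward(completion, target_value, target_repr):
    """Protocol-gated shaping: hard gate -> value reward -> last-mile shaping."""
    text = completion.strip()
    # (1) Hard gate: must be exactly one parseable single-line literal.
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return 0.0
    if "\n" in text:
        return 0.0
    # (2) Value reward: semantics of the parsed value.
    if value != target_value:
        return 0.2   # parsed but wrong value: small protocol credit
    # (3) Last-mile string shaping toward the exact canonical repr.
    sim = SequenceMatcher(None, text, target_repr).ratio()
    return 0.6 + 0.4 * sim   # reaches 1.0 only at the exact string

print(ladder_reward("[1, 2, 3]", [1, 2, 3], "[1, 2, 3]"))  # 1.0
print(ladder_reward("[1,2,3]", [1, 2, 3], "[1, 2, 3]"))    # between 0.6 and 1.0
print(ladder_reward("the list is [1, 2, 3]", [1, 2, 3], "[1, 2, 3]"))  # 0.0
```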
6) If Pass@G_exact exists but Top-1_exact stalls: RL → distill → RL
When the model already occasionally hits exact outputs on tail prompts, the fastest way to push top-1 exact is often:
- sample rollouts (temperature mix / best-of-G) and harvest exact hits
- do a short supervised distillation pass to concentrate probability mass on the exact output format
- resume RL on the newly defined tail
This aligns with RLVR analyses that emphasize probability-mass reallocation and early emergence of “correct path” incentives. (OpenReview)
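Harvesting the exact hits into an SFT set is mechanical. A sketch, assuming you keep raw completions per prompt and have an exact-match checker (here the shortest hit is kept as the distillation target; other selection rules work too):

```python
def harvest_exact_hits(rollouts, exact_checker):
    """rollouts: {prompt: [completion, ...]} from a temperature mix / best-of-G.
    Keep one exact hit per prompt as a supervised distillation target."""
    sft_pairs = []
    for prompt, completions in rollouts.items():
        hits = [c for c in completions if exact_checker(prompt, c)]
        if hits:
            sft_pairs.append({"prompt": prompt, "completion": min(hits, key=len)})
    return sft_pairs

demo = {"f(2)?": ["4", "four", "4 "], "g(1)?": ["oops", "nope"]}
targets = {"f(2)?": "4", "g(1)?": "2"}
pairs = harvest_exact_hits(demo, lambda p, c: c == targets[p])
print(pairs)  # only the prompt with at least one exact hit survives
```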
7) If the tail is truly beyond capability: scaffold the tail (Scaf-GRPO idea, adapted)
Scaf-GRPO targets the “learning cliff” where hard problems stay at persistent zero reward, making them invisible to GRPO gradients; it injects minimal hints only when learning stagnates. (arXiv)
For code-execution output prediction, a “scaffold” can be lightweight and verifier-compatible, e.g.:
- provide an intermediate trace format only for examples that are repeatedly dead-all-wrong
- add a two-stage prompt (first compute value internally, then output only the literal) while still verifying only the final literal
- curriculum on output types (ints → lists → nested → floats/strings/edge reprs)
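One way to operationalize the output-type curriculum is a crude difficulty key over target values. The ordering below mirrors the ints → lists → nested → floats/strings progression and is an assumption to adjust against your own failure taxonomy:

```python
def type_difficulty(value):
    """Crude curriculum key over target output values (assumed ordering)."""
    if value is None or isinstance(value, (bool, int)):
        return 0                      # ints and simple atoms first
    if isinstance(value, (list, tuple, dict, set)):
        nested = isinstance(value, (list, tuple)) and any(
            isinstance(v, (list, tuple, dict, set)) for v in value
        )
        return 2 if nested else 1     # flat containers, then nested ones
    return 3                          # floats / strings / edge reprs last

examples = [3.14, [1, [2]], 7, "ab", [1, 2]]
print(sorted(examples, key=type_difficulty))  # ints first, floats/strings last
```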
A concrete ablation plan that usually identifies the bottleneck in 2–3 runs
On a fixed tail slice, compare these variants with the same total compute:
- Baseline (current best) + log: Top-1_exact, Pass@G_exact, dead_all_wrong rate, non-literal rate
- + Replay-buffer GRPO (or RePO-style replay) (Hugging Face)
- + Protocol-constrained decoding (stop strings and/or prefix constraints) (Hugging Face)
- Multi-reward: swap GRPO normalization for GDPO-style decoupling (if multi-reward) (arXiv)
- If Pass@G gap exists: harvest exact hits → distill → resume RL (OpenReview)
The key is: do not judge progress by “average reward” alone once you hit the plateau. Judge it by (a) tail dead-group rate, (b) non-literal rate, and (c) Pass@G vs Top-1. Those three numbers will tell you which intervention will actually move strict exact-match.