ReMix: Reincarnating Mix-policy Proximal Policy Gradient

🧽 Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Project Page arXiv Tweets GitHub

ReMix (Reincarnating Mix-policy Proximal Policy Gradient) is a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data.
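
The sketch below is a minimal, illustrative view of the idea, not the exact ReMix objective from the paper: a PPO-style clipped update applied to a batch that mixes freshly generated on-policy rollouts with replayed off-policy rollouts. The replay-buffer interface, the batch layout, and the `mix_ratio` parameter are assumptions made purely for illustration.

```python
import torch

def ppo_clip_loss(logp_new, logp_behavior, advantages, clip_eps=0.2):
    # PPO-style clipped surrogate. The importance ratio uses the log-probs
    # recorded when each rollout was generated, so the same formula covers
    # fresh on-policy samples (ratio near 1) and replayed off-policy samples
    # (ratio can drift further from 1 and gets clipped).
    ratio = torch.exp(logp_new - logp_behavior)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def build_mixed_batch(fresh, replay_buffer, mix_ratio=0.5):
    # Concatenate freshly generated rollouts with samples drawn from a replay
    # buffer. `fresh` and the buffer samples are assumed to be dicts of tensors
    # keyed by 'logp_behavior' and 'advantages' (an illustrative layout).
    n_replay = int(fresh["advantages"].shape[0] * mix_ratio)
    old = replay_buffer.sample(n_replay)
    return {k: torch.cat([fresh[k], old[k]]) for k in fresh}
```

In practice, the current policy's token log-probabilities would be recomputed on the mixed batch before the clipped loss is applied.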

📰 News

[2025/08/30] We release Remix-R1-Distilled-Qwen-1.5B and Remix-R1-Distilled-Qwen-7B, the ReMix-PPO checkpoints reported in our paper. We also release ready-to-use [evaluation scripts] that reproduce all benchmark results.

[2025/07/11] We release the paper Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model, presenting ReMix, a simple yet effective approach that equips on-policy proximal policy gradient methods (e.g., PPO and GRPO) with off-policy replay to slash LLM reasoning-finetuning costs, while setting new SOTA math performance and, for the first time, exposing how off-policy RL shapes the emergence of reasoning behaviors.

🌠 Key Results

Below are the pass@1 results of our models versus all baselines on five widely used benchmarks. For each entry we report two figures in the format “greedy / temperature-0.1”. On the two small, high-variance datasets (AIME and AMC), we average 32 runs at temperature 0.1.
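
For concreteness, the following sketch shows one way the two reported figures could be computed; `run_benchmark` is a hypothetical callable that decodes one completion per problem and returns per-problem correctness flags, and is not part of our released evaluation scripts.

```python
import statistics
from typing import Callable, List, Tuple

def pass_at_1(correct: List[bool]) -> float:
    # pass@1 in percent: fraction of problems solved with a single completion each.
    return 100.0 * sum(correct) / len(correct)

def report_figures(run_benchmark: Callable[[float, int], List[bool]],
                   n_runs: int = 32) -> Tuple[float, float]:
    # "greedy" figure: one deterministic (temperature 0) pass over the benchmark.
    greedy = pass_at_1(run_benchmark(0.0, 0))
    # "temperature-0.1" figure: on small, high-variance sets (AIME, AMC),
    # average pass@1 over 32 sampled runs with different seeds.
    sampled = statistics.mean(pass_at_1(run_benchmark(0.1, seed))
                              for seed in range(n_runs))
    return greedy, sampled
```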

In the table, “Cost” refers to the rollout data volume, i.e., the total number of rollouts generated by the model during training. It reflects the total amount of model inference, which is usually **the dominant source** of computational cost during training (a small worked example follows the table).

| Model | AIME 24 | AMC 23 | MATH 500 | Minerva | OlympiadBench | Avg. | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R1-Distill-Qwen-7B (Base) | 33.33/37.53 | 68.68/66.55 | 83.80/84.80 | 30.15/32.72 | 44.44/43.41 | 52.08/53.00 | N/A |
| ReasonFlux-F1 | 20.00/20.19 | 54.22/53.07 | 77.20/79.60 | 29.04/31.99 | 37.04/38.81 | 43.50/44.77 | - |
| Light-R1 | 30.00/40.00 | 66.27/66.73 | 87.00/86.80 | 34.56/31.62 | 47.56/48.30 | 53.08/54.69 | - |
| Skywork-OR1-Preview | 43.33/36.31 | 63.86/61.60 | 84.40/83.40 | 29.41/31.25 | 46.22/43.85 | 53.44/51.28 | >8.192M |
| Polaris | 40.00/39.71 | 63.86/67.40 | 87.60/86.40 | 36.40/34.19 | 48.00/48.30 | 55.17/55.20 | - |
| AdaptThink | 46.67/47.62 | 75.90/74.20 | 87.60/87.00 | 33.46/34.56 | 50.22/50.52 | 58.77/58.78 | 0.307M |
| AceReason-Nemotron | 60.00/50.00 | 80.72/77.48 | 89.00/89.60 | 36.40/35.29 | 50.07/53.85 | 63.24/60.99 | >3.584M |
| ReMix-PPO | 63.33/50.31 | 78.31/77.78 | 90.20/91.00 | 37.50/37.87 | 52.59/52.89 | 64.39/61.97 | 0.011M |
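
As a rough illustration of how the Cost column is counted: rollout data volume is the number of training steps times the prompts per step times the rollouts generated per prompt. The formula and the numbers below are illustrative assumptions, not figures from the paper.

```python
def rollout_data_volume(train_steps: int, prompts_per_step: int,
                        rollouts_per_prompt: int) -> int:
    # Total rollouts generated during training; a proxy for inference-side
    # compute, which usually dominates RFT training cost.
    return train_steps * prompts_per_step * rollouts_per_prompt

# Illustrative only: 100 steps x 64 prompts x 8 rollouts = 51,200 rollouts (~0.051M).
print(rollout_data_volume(100, 64, 8))
```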

📎 Citation

If you find ReMix helpful, please cite us.

@article{liang2025squeezesoakedspongeefficient,
      title={Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model}, 
      author={Jing Liang and Hongyao Tang and Yi Ma and Jinyi Liu and Yan Zheng and Shuyue Hu and Lei Bai and Jianye Hao},
      journal={arXiv preprint arXiv:2507.06892},
      url={https://arxiv.org/abs/2507.06892}, 
      year={2025}
}