RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
Abstract
RePro is a reinforcement learning-based method that trains a small language model to generate high-quality, faithful rephrasings of pretraining data, improving the data efficiency and downstream accuracy of large language models.
High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiments, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over the organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4× larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3×. Individual and distributional analyses validate that RePro preserves more critical information and more faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.
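The abstract names one quality reward and three faithfulness rewards but does not spell out their definitions here. As a rough illustration only, the Python sketch below shows how one quality term and three faithfulness terms might be combined into a single scalar reward for RL fine-tuning of a rephraser. Every scorer (`quality_score`, `semantic_overlap`, `length_ratio`, `structure_match`) and the `RewardWeights` container are hypothetical stubs of my own, not the paper's actual reward functions.

```python
# Hypothetical sketch: fold one quality term and three faithfulness terms
# into a scalar reward. All scorers below are illustrative stubs, NOT the
# reward functions used in the RePro paper.

from dataclasses import dataclass


@dataclass
class RewardWeights:
    quality: float = 1.0
    faithfulness: float = 1.0  # shared weight over the three faithfulness terms


def quality_score(rephrasing: str) -> float:
    """Stub quality term: stand-in for, e.g., a data-quality classifier in [0, 1]."""
    words = rephrasing.split()
    return min(len(set(words)) / max(len(words), 1), 1.0)


def semantic_overlap(original: str, rephrasing: str) -> float:
    """Stub faithfulness term: fraction of the original's words that survive."""
    orig, reph = set(original.lower().split()), set(rephrasing.lower().split())
    return len(orig & reph) / max(len(orig), 1)


def length_ratio(original: str, rephrasing: str) -> float:
    """Stub faithfulness term: penalize heavy compression or expansion."""
    ratio = len(rephrasing.split()) / max(len(original.split()), 1)
    return max(0.0, 1.0 - abs(1.0 - ratio))


def structure_match(original: str, rephrasing: str) -> float:
    """Stub faithfulness term: reward preserving the paragraph structure."""
    return 1.0 if original.count("\n") == rephrasing.count("\n") else 0.5


def combined_reward(original: str, rephrasing: str, w: RewardWeights) -> float:
    """Scalar reward = weighted quality + mean of three faithfulness terms."""
    faith = (semantic_overlap(original, rephrasing)
             + length_ratio(original, rephrasing)
             + structure_match(original, rephrasing)) / 3.0
    return w.quality * quality_score(rephrasing) + w.faithfulness * faith


if __name__ == "__main__":
    doc = "The web page explains how solar panels convert sunlight into electricity."
    reph = "This page describes how solar panels turn sunlight into electrical power."
    print(round(combined_reward(doc, reph, RewardWeights()), 3))
```

In an RL setup like this, the faithfulness terms act as a soft constraint against reward hacking: a rephraser that maximizes the quality term alone could drift arbitrarily far from the source document, so the combined objective trades quality off against preserving the organic data's semantics and structure.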
Community
We propose RePro, a novel web recycling method that trains a language model with RL to perform effective and faithful rephrasing. It outperforms the state-of-the-art recycling method, which prompts a 17× larger rephraser, and improves organic data efficiency by 2-3×.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels (2025)
- Aligning Large Language Models via Fully Self-Synthetic Data (2025)
- Jointly Reinforcing Diversity and Quality in Language Model Generations (2025)
- Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization (2025)
- Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models (2025)
- CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning (2025)
- PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch (2025)