Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data
Abstract
Introducing reasoning data during pretraining significantly enhances LLM performance compared to post-training, with pretraining benefiting more from diverse data patterns while SFT benefits more from high-quality data.
The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly incorporated during the mid-training stage as well (a practice that is relatively more proprietary and less openly characterized), the role of such data in pretraining remains unclear. In particular, due to the opaqueness of pretraining corpora in most frontier models, the effect of reasoning data introduced at different phases of pre- and/or post-training is rarely reported in the scientific literature. This raises several important questions: Is adding reasoning data earlier during pretraining any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% avg gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% avg gain), while SFT is more sensitive to data quality (15% avg gain). We show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.
Community
This work investigates the underexplored role of reasoning data in the pretraining phase of LLM development. Through controlled studies varying data scale, diversity, and quality, we find that front-loading reasoning data during pretraining yields lasting improvements—up to 19% average gains—that cannot be recovered through later-stage SFT. We uncover a clear asymmetry: diverse reasoning patterns most benefit pretraining, while high-quality data drives post-training success. Moreover, high-quality pretraining data exhibits latent effects, activated only during fine-tuning, whereas naïvely scaling SFT data can erode prior gains. These findings challenge the conventional separation of language modeling and reasoning, providing a principled framework for allocating reasoning data across training stages.
When should an LLM learn to reason? 🤔 Early in pretraining or late in fine-tuning?
Our new work, "Front-Loading Reasoning," challenges the "save it for later" approach. We show that injecting reasoning data into pretraining is critical for building models that reach the frontier.
The key? An asymmetric data strategy.
📝 Blog: https://research.nvidia.com/labs/adlr/Synergy/
🔗 Paper: https://tinyurl.com/3tzkemtp
We find that "front-loading" reasoning data into pretraining creates a durable, compounding advantage.
📈 Stage 1 (Pretraining): +16% avg. gain out of the gate.
📈 Stage 2 (SFT): Advantage grows to +9.3% after fine-tuning.
📈 Stage 3 (RL): Finishes with a massive +19% lead on expert benchmarks.
SFT & RL amplify a strong foundation; they can't create one.
The optimal data strategy is phase-dependent:
🧠 Pretraining thrives on DIVERSITY & SCALE. A broad mix of reasoning patterns builds a robust foundation, giving an +11% boost over using only narrow, high-quality data at this stage.
🎯 SFT demands QUALITY. Fine-tuning on a small, high-quality dataset is far more effective, boosting performance by +15% over a large, mixed-quality one.
High-quality data has a surprising latent effect.
Adding a small, high-quality dataset to a diverse pretraining mix showed minimal immediate gains. But after SFT, its value was "unlocked," providing an additional +4% boost.
A deep synergy exists: pretraining can instill the potential that alignment activates.
Can a model with no reasoning in its pretraining "catch up" by getting more SFT data? No.
We doubled the SFT data for our baseline model. While it improved, it still couldn't match the performance of even the weakest reasoning-pretrained model.
A strong start is irreplaceable.
Is more data always better in SFT? No.
Our ablations show that blindly scaling SFT with mixed-quality data is actively HARMFUL.
❌ Doubling the SFT data dropped math reasoning scores by 5%.
✅ Scaling with a small amount of high-quality data provides consistent gains.
SFT is for targeted refinement, not brute-force scaling.
Our work provides a principled guide for training reasoning-centric LLMs:
Don't wait: Inject reasoning data into pretraining.
Be strategic: Use DIVERSE data for pretraining, emphasize HIGH-QUALITY data for SFT.
Be careful: Avoid polluting your SFT with low-quality data.
This moves us from "more data" to a smarter, phase-aware approach.
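The phase-aware guidance above can be sketched as two different selection rules applied to the same pool of reasoning data. This is only an illustration: the `Example` record, the `quality` score, and the `min_quality` threshold are hypothetical names and values, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    text: str
    source: str     # which reasoning dataset the example came from
    quality: float  # hypothetical quality score in [0, 1]


def select_for_pretraining(pool: List[Example]) -> List[Example]:
    """Keep the whole pool: breadth of reasoning patterns is the priority."""
    return list(pool)


def select_for_sft(pool: List[Example], min_quality: float = 0.8) -> List[Example]:
    """Keep only examples at or above an assumed quality threshold."""
    return [ex for ex in pool if ex.quality >= min_quality]
```

In practice the quality signal and the cutoff would have to come from whatever filtering a given pipeline already uses; the point of the sketch is only that the two stages apply different criteria to the same data.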
Hi @SieraL! Thank you for publishing these insights and learnings in such a digestible manner!
I would like to ask about the formatting of the high-quality data added to the pretraining mixture.
For the experiments done, was any special care taken with the tokenization and packing of the high-quality datasets?
For example, do they need to be untruncated and tokenized with special tokens like what is commonly done for post-training, or were they simply truncated randomly and tokenized without any special tokens other than bos and eos?
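To make the two options in this question concrete, here is a minimal sketch of both packing styles. It does not describe what the authors did; the token ids for `BOS`, `EOS`, and `PAD` and both function names are assumptions chosen for illustration, and documents are represented as already-tokenized lists of integers.

```python
from typing import List

BOS, EOS, PAD = 1, 2, 0  # hypothetical special-token ids


def pack_concat_chunk(docs: List[List[int]], seq_len: int) -> List[List[int]]:
    """Pretraining-style packing: wrap each document in BOS/EOS, concatenate
    everything into one stream, then cut fixed-length windows. Documents may
    be split across window boundaries; a trailing partial window is dropped."""
    stream: List[int] = []
    for doc in docs:
        stream.extend([BOS, *doc, EOS])
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def pack_whole_examples(docs: List[List[int]], seq_len: int) -> List[List[int]]:
    """Post-training-style packing: every example stays intact. Examples are
    placed greedily into a window, which is padded rather than split.
    Examples longer than seq_len are skipped instead of truncated."""
    windows: List[List[int]] = []
    current: List[int] = []
    for doc in docs:
        example = [BOS, *doc, EOS]
        if len(example) > seq_len:
            continue  # cannot fit without truncation
        if len(current) + len(example) > seq_len:
            windows.append(current + [PAD] * (seq_len - len(current)))
            current = []
        current.extend(example)
    if current:
        windows.append(current + [PAD] * (seq_len - len(current)))
    return windows
```

The first style wastes no tokens but can cut a reasoning trace in the middle; the second preserves each trace at the cost of padding and of dropping over-length examples.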