NusaCrowd: Open Source Initiative for Indonesian NLP Resources Paper • 2212.09648 • Published Dec 19, 2022
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset Paper • 2412.02595 • Published Dec 3, 2024
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models Paper • 2504.03624 • Published Apr 4, 2025
Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining Paper • 2412.15285 • Published Dec 18, 2024
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training Paper • 2504.13161 • Published Apr 17, 2025