Essential-Web v1.0: 24T tokens of organized web data • Paper • 2506.14111 • Published Jun 17, 2025 • 46
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning • Paper • 2507.16812 • Published Jul 22, 2025 • 63
RAVine: Reality-Aligned Evaluation for Agentic Search • Paper • 2507.16725 • Published Jul 22, 2025 • 29
MiniCPM4 • Collection • MiniCPM4: Ultra-Efficient LLMs on End Devices • 29 items • Updated Sep 8, 2025 • 82
Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data • Paper • 2505.05427 • Published May 8, 2025 • 4