Stalin16's Collections
Data and other things
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper • 2412.14475 • Published • 55
How to Synthesize Text Data without Model Collapse?
Paper • 2412.14689 • Published • 52
Token-Budget-Aware LLM Reasoning
Paper • 2412.18547 • Published • 46
WavePulse: Real-time Content Analytics of Radio Livestreams
Paper • 2412.17998 • Published • 11
Bridging the Data Provenance Gap Across Text, Speech and Video
Paper • 2412.17847 • Published • 10
No More Adam: Learning Rate Scaling at Initialization is All You Need
Paper • 2412.11768 • Published • 43
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Paper • 2501.04686 • Published • 53
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 31
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Paper • 2501.09751 • Published • 48
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Paper • 2501.18511 • Published • 20
LIMO: Less is More for Reasoning
Paper • 2502.03387 • Published • 62
Scaling Pre-training to One Hundred Billion Data for Vision Language Models
Paper • 2502.07617 • Published • 29
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Paper • 2502.05003 • Published • 43
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Paper • 2502.07870 • Published • 44
Jailbreaking to Jailbreak
Paper • 2502.09638 • Published • 5
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Paper • 2502.14846 • Published • 14
Paper • 2503.08507 • Published • 7
"Principal Components" Enable A New Language of Images
Paper • 2503.08685 • Published • 12
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Paper • 2503.08638 • Published • 70
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Paper • 2503.07920 • Published • 101
Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation
Paper • 2503.24379 • Published • 76
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Paper • 2504.00072 • Published • 6
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Paper • 2504.01990 • Published • 300
URECA: Unique Region Caption Anything
Paper • 2504.05305 • Published • 35
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Paper • 2504.09081 • Published • 16
BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation
Paper • 2504.14538 • Published • 30
Towards Understanding Camera Motions in Any Video
Paper • 2504.15376 • Published • 158
Alchemist: Turning Public Text-to-Image Data into Generative Gold
Paper • 2505.19297 • Published • 84
PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models
Paper • 2505.22523 • Published • 7
Large Language Models for Data Synthesis
Paper • 2505.14752 • Published • 49
HardTests: Synthesizing High-Quality Test Cases for LLM Coding
Paper • 2505.24098 • Published • 43
OpenThoughts: Data Recipes for Reasoning Models
Paper • 2506.04178 • Published • 48
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Paper • 2506.02096 • Published • 52
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL
Paper • 2506.02338 • Published • 4
Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery
Paper • 2506.05673 • Published • 10
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability
Paper • 2506.08300 • Published • 8
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Paper • 2506.10857 • Published • 30
Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Paper • 2506.10952 • Published • 22
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
Paper • 2506.18095 • Published • 65
Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
Paper • 2506.19290 • Published • 52
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
Paper • 2507.14119 • Published • 58
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
Paper • 2507.16812 • Published • 63
PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation
Paper • 2507.16116 • Published • 10
GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
Paper • 2507.21033 • Published • 20
HPSv3: Towards Wide-Spectrum Human Preference Score
Paper • 2508.03789 • Published • 18
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
Paper • 2508.09987 • Published • 25
Open Data Synthesis For Deep Research
Paper • 2509.00375 • Published • 68
IntrEx: A Dataset for Modeling Engagement in Educational Conversations
Paper • 2509.06652 • Published • 24
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Paper • 2509.09676 • Published • 31
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Paper • 2509.15221 • Published • 107
AutoIntent: AutoML for Text Classification
Paper • 2509.21138 • Published • 32
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
Paper • 2509.24900 • Published • 53
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
Paper • 2510.06499 • Published • 31
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Paper • 2510.13795 • Published • 49
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Paper • 2510.15742 • Published • 47
FineVision: Open Data Is All You Need
Paper • 2510.17269 • Published • 52