From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model Paper • 2510.19871 • Published 7 days ago • 28
Generative Universal Verifier as Multimodal Meta-Reasoner Paper • 2510.13804 • Published 14 days ago • 24
LongLive: Real-time Interactive Long Video Generation Paper • 2509.22622 • Published Sep 26 • 177
Reconstruction Alignment Improves Unified Multimodal Models Paper • 2509.07295 • Published Sep 8 • 40
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning Paper • 2507.12841 • Published Jul 17 • 41
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation Paper • 2507.09862 • Published Jul 14 • 49
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening Paper • 2502.12146 • Published Feb 17 • 16
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation Paper • 2502.12148 • Published Feb 17 • 17
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation Paper • 2501.10687 • Published Jan 18 • 14
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis Paper • 2412.04431 • Published Dec 5, 2024 • 18
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation Paper • 2410.07171 • Published Oct 9, 2024 • 43
A Survey on the Honesty of Large Language Models Paper • 2409.18786 • Published Sep 27, 2024 • 32
RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models Paper • 2402.12908 • Published Feb 20, 2024 • 10