Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning • Paper • 2510.11027 • Published 13 days ago • 19
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning • Paper • 2510.10518 • Published 14 days ago • 17
Diffusion Transformers with Representation Autoencoders • Paper • 2510.11690 • Published 12 days ago • 157
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale • Paper • 2510.14979 • Published 9 days ago • 64
CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving • Paper • 2510.07944 • Published 17 days ago • 24
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue • Paper • 2510.13747 • Published 10 days ago • 28
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints • Paper • 2510.08565 • Published 16 days ago • 19
Paper2Video: Automatic Video Generation from Scientific Papers • Paper • 2510.05096 • Published 19 days ago • 106
A Survey of Reinforcement Learning for Large Reasoning Models • Paper • 2509.08827 • Published Sep 10 • 183
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data • Paper • 2509.15221 • Published Sep 18 • 109
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency • Paper • 2508.18265 • Published Aug 25 • 201
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding • Paper • 2508.21496 • Published Aug 29 • 54
Has GPT-5 Achieved Spatial Intelligence? An Empirical Study • Paper • 2508.13142 • Published Aug 18 • 34