VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos Paper • 2510.19488 • Published 13 days ago • 19
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities Paper • 2510.08759 • Published 26 days ago • 45
LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions Paper • 2510.08211 • Published 26 days ago • 22
Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step Paper • 2509.23924 • Published Sep 28 • 7
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective Paper • 2509.18905 • Published Sep 23 • 28
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning Paper • 2506.09049 • Published Jun 10 • 36
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics Paper • 2506.04308 • Published Jun 4 • 43
Position: Interactive Generative Video as Next-Generation Game Engine Paper • 2503.17359 • Published Mar 21 • 61
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection Paper • 2412.04455 • Published Dec 5, 2024 • 38
VLSBench: Unveiling Visual Leakage in Multimodal Safety Paper • 2411.19939 • Published Nov 29, 2024 • 10
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models Paper • 2402.05044 • Published Feb 7, 2024 • 2
FLAME: Factuality-Aware Alignment for Large Language Models Paper • 2405.01525 • Published May 2, 2024 • 28
InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions Paper • 2402.03040 • Published Feb 5, 2024 • 18