From Pixels to Words -- Towards Native Vision-Language Primitives at Scale • Paper • 2510.14979 • Published 12 days ago • 65
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding • Paper • 2508.21496 • Published Aug 29 • 54
Has GPT-5 Achieved Spatial Intelligence? An Empirical Study • Paper • 2508.13142 • Published Aug 18 • 34
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters • Paper • 2412.00174 • Published Nov 29, 2024 • 23
Trajectory Attention for Fine-grained Video Motion Control • Paper • 2411.19324 • Published Nov 28, 2024 • 13