Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
Abstract
Ego3D-Bench evaluates VLMs on ego-centric, multi-view outdoor data, revealing a clear gap to human-level performance, and Ego3D-VLM narrows it by enhancing 3D spatial reasoning through cognitive-map generation.
Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To address this gap, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini 1.5 Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, yielding a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
Community
Key Highlights of this paper:
- Ego3D-Bench: A benchmark of 8,600+ human-verified QA pairs for evaluating VLMs in ego-centric, multi-view outdoor environments.
- Ego3D-VLM: A post-training framework that builds cognitive maps from estimated global 3D coordinates, achieving a +12% average gain on multi-choice QA and a +56% average gain on absolute distance estimation (see the sketch after this list).
- Impact: Together, Ego3D-Bench and Ego3D-VLM move VLMs closer to human-level 3D spatial understanding in real-world settings.
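To make the cognitive-map idea concrete, here is a minimal sketch, not the authors' implementation, of how estimated global 3D object coordinates could be rendered into a text map and prepended to a VLM prompt. The function name `build_cognitive_map`, the grid size, the cell resolution, and the example objects are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): render estimated global 3D
# object coordinates as a top-down text "cognitive map" for a VLM prompt.
from typing import Dict, Tuple

def build_cognitive_map(
    objects: Dict[str, Tuple[float, float, float]],  # name -> (x, y, z) in meters, ego at origin
    grid_size: int = 11,        # odd so the ego occupies the center cell (assumed value)
    cell_meters: float = 5.0,   # ground area covered by one grid cell (assumed value)
) -> str:
    """Quantize estimated (x, y) positions onto a top-down grid and render it as text."""
    half = grid_size // 2
    grid = [["." for _ in range(grid_size)] for _ in range(grid_size)]
    grid[half][half] = "E"  # ego agent at the map center

    legend = []
    for idx, (name, (x, y, _z)) in enumerate(sorted(objects.items()), start=1):
        col = half + int(round(x / cell_meters))  # +x -> right of ego
        row = half - int(round(y / cell_meters))  # +y -> ahead of ego (up on the map)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            grid[row][col] = str(idx % 10)
            legend.append(f"{idx}: {name} at ({x:.1f}, {y:.1f}) m")

    map_text = "\n".join(" ".join(row) for row in grid)
    return (
        f"Top-down cognitive map ({cell_meters:.0f} m per cell, E = ego):\n"
        f"{map_text}\n" + "\n".join(legend)
    )

# Example usage: in practice the coordinates would come from the 3D estimation
# step over the ego-centric multi-view images; these objects are hypothetical.
print(build_cognitive_map({"pedestrian": (3.0, 12.0, 0.0), "truck": (-8.0, 20.0, 0.0)}))
```

The map text (or an equivalent rendering) would be prepended to the question before querying the VLM; the exact map format used by Ego3D-VLM may differ.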
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes (2025)
- VLM4D: Towards Spatiotemporal Awareness in Vision Language Models (2025)
- EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs (2025)
- SIFThinker: Spatially-Aware Image Focus for Visual Reasoning (2025)
- Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions (2025)
- RynnEC: Bringing MLLMs into Embodied World (2025)
- Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment (2025)