arxiv:2509.06266

Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Published on Sep 8 · Submitted by Mohammad Akbari on Sep 12

AI-generated summary

Ego3D-Bench evaluates VLMs on ego-centric, multi-view outdoor data, revealing performance gaps, and Ego3D-VLM enhances 3D spatial reasoning through cognitive map generation.

Abstract

Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To address this gap, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini-1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, yielding a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
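
To make the cognitive-map idea concrete, here is a minimal sketch (not the authors' implementation) of turning per-object global 3D coordinate estimates into a textual map that is prepended to a spatial QA prompt; the object labels, the meters-from-ego coordinate convention, and the map format are assumptions.

```python
# Hypothetical sketch: serialize estimated global 3D coordinates into a
# textual "cognitive map" a VLM can condition on. Labels, units, and the
# map layout are illustrative assumptions, not the paper's exact design.
import math

def build_cognitive_map(objects):
    """objects: list of (label, (x, y, z)) in ego-centric coordinates (meters)."""
    lines = []
    for label, (x, y, z) in objects:
        dist = math.sqrt(x * x + y * y + z * z)    # straight-line distance to the ego
        bearing = math.degrees(math.atan2(y, x))   # heading angle in the ground plane
        lines.append(f"{label}: pos=({x:.1f}, {y:.1f}, {z:.1f}) m, "
                     f"dist={dist:.1f} m, bearing={bearing:.0f} deg")
    return "Cognitive map (ego at origin):\n" + "\n".join(lines)

# Usage: prepend the map to a spatial QA question before querying a VLM.
scene = [("pedestrian", (3.2, -1.0, 0.0)), ("bus", (12.5, 4.8, 0.0))]
prompt = build_cognitive_map(scene) + "\nQuestion: Which object is closer to the ego vehicle?"
print(prompt)
```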

Community

Paper author and submitter:

Key highlights of this paper:
📊 Ego3D-Bench: a benchmark of 8,600+ human-verified QA pairs for evaluating VLMs in ego-centric, multi-view outdoor environments (a minimal scoring sketch follows this list).
🧠 Ego3D-VLM: a post-training framework that builds cognitive maps from global 3D coordinates, achieving a +12% improvement in multi-choice QA accuracy and a +56% improvement in absolute distance estimation.
🚀 Impact: together, Ego3D-Bench and Ego3D-VLM move VLMs closer to human-level 3D spatial understanding in real-world settings.
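
For context, a minimal scoring loop over QA pairs like those in Ego3D-Bench could look like the sketch below; the dataset field names, the `model` callable, and the metric definitions (multi-choice accuracy, mean relative distance error) are illustrative assumptions rather than the benchmark's released evaluation code.

```python
# Hypothetical evaluation sketch: field names ("images", "question",
# "choices", "answer", "gt_distance") and the `model` callable are assumed.
def evaluate(model, qa_pairs):
    correct, num_mc, rel_errors = 0, 0, []
    for item in qa_pairs:
        pred = model(item["images"], item["question"], item.get("choices"))
        if item["type"] == "multi_choice":
            num_mc += 1
            correct += int(pred == item["answer"])
        elif item["type"] == "distance":
            # Relative error between predicted and ground-truth distance (meters).
            rel_errors.append(abs(float(pred) - item["gt_distance"]) / item["gt_distance"])
    return {
        "mc_accuracy": correct / max(1, num_mc),
        "distance_mean_rel_error": sum(rel_errors) / max(1, len(rel_errors)),
    }
```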


Models citing this paper: 0
Datasets citing this paper: 1
Spaces citing this paper: 0
Collections including this paper: 2