Unifying 2D and 3D Vision-Language Understanding
Abstract
UniVLG, a unified architecture for 2D and 3D vision-language understanding, leverages pre-trained 2D models and incorporates novel language-conditioned mask decoders and 2D-to-3D lifting strategies to achieve state-of-the-art performance in 3D tasks while maintaining 2D capabilities.
Progress in 3D vision-language learning has been hindered by the scarcity of large-scale 3D datasets. We introduce UniVLG, a unified architecture for 2D and 3D vision-language understanding that bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems. Our approach initializes most model weights from pre-trained 2D models and trains on both 2D and 3D vision-language data. We propose a novel language-conditioned mask decoder shared across 2D and 3D modalities to ground objects effectively in both RGB and RGB-D images, outperforming box-based approaches. To further reduce the domain gap between 2D and 3D, we incorporate 2D-to-3D lifting strategies, enabling UniVLG to utilize 2D data to enhance 3D performance. With these innovations, our model achieves state-of-the-art performance across multiple 3D vision-language grounding tasks, demonstrating the potential of transferring advances from 2D vision-language learning to the data-constrained 3D domain. Furthermore, co-training on both 2D and 3D data enhances performance across modalities without sacrificing 2D capabilities. By removing the reliance on 3D mesh reconstruction and ground-truth object proposals, UniVLG sets a new standard for realistic, embodied-aligned evaluation. Code and additional visualizations are available at https://univlg.github.io.
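The abstract does not spell out how RGB-D pixels are placed in 3D or how 2D data is lifted. As a rough illustration of the general idea only, the sketch below back-projects a depth map into camera-frame 3D points under a standard pinhole camera model. The function name `unproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and not taken from the paper or its code.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H, W, 3) array of
    camera-frame points, assuming a pinhole camera (illustrative only)."""
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Toy example: a flat surface 2 units in front of the camera.
depth = np.full((4, 6), 2.0)
points = unproject_depth(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
```

For a plain 2D image with no depth sensor, the depth map would have to come from another source, such as a monocular depth estimator. Whether UniVLG's lifting strategies work this way is not stated in the abstract.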