# VLM-FO1: Qwen2.5-VL-3B-v01
This repository contains the VLM-FO1_Qwen2.5-VL-3B-v01 model, an implementation of the VLM-FO1 framework built on the Qwen/Qwen2.5-VL-3B-Instruct base model.

VLM-FO1 is a novel plug-and-play framework designed to bridge the gap between the high-level reasoning of Vision-Language Models (VLMs) and the fine-grained visual perception that such models typically lack.
## Model Details

### Model Description
VLM-FO1 endows pre-trained VLMs with superior fine-grained perception without compromising their inherent high-level reasoning and general understanding capabilities. It operates as a plug-and-play module that can be integrated with any existing VLM, establishing an effective and flexible paradigm for building the next generation of perception-aware models.
VLM-FO1 excels at a wide range of fine-grained perception tasks, including Object Grounding, Region Generative Understanding, Visual Region Reasoning, and more.
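
As a quick illustration, the snippet below shows how such a checkpoint might be loaded and queried for an object-grounding task. It is a minimal sketch, assuming the model loads through the standard transformers Auto classes with `trust_remote_code=True` and accepts Qwen2.5-VL-style chat messages; the actual entry point and region-prompt format may differ, so consult the GitHub repository for the official inference code.

```python
# Minimal sketch, NOT the official inference code. Assumes the checkpoint
# loads via the standard transformers Auto classes with trust_remote_code=True
# and accepts Qwen2.5-VL-style chat messages; see the GitHub repository for
# the supported API and region-prompt format.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "omlab/VLM-FO1_Qwen2.5-VL-3B-v01"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("example.jpg")  # any local test image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Locate every person in the image and output their bounding boxes."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, dropping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```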
- 🧩 **Plug-and-Play Modularity**: Our framework is designed as a set of enhancement modules that can be seamlessly integrated with any pre-trained VLM, preserving its original weights and capabilities.
- 🧠 **Hybrid Fine-grained Region Encoder (HFRE)**: We introduce a novel dual-vision-encoder architecture that fuses semantic-rich features with perception-enhanced features, creating powerful region tokens that capture both high-level meaning and fine-grained spatial detail (see the sketch after this list).
- 🎯 **State-of-the-Art Performance**: VLM-FO1 achieves state-of-the-art results across a diverse suite of fine-grained perception benchmarks.
- ✅ **Preserves General Abilities**: Our two-stage training strategy ensures that fine-grained perception is gained without causing catastrophic forgetting of the base model's powerful general visual understanding abilities.
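
To make the HFRE idea concrete, here is a conceptual sketch of dual-encoder region-token fusion. This is not the authors' implementation: the module name `HybridRegionEncoderSketch`, the ROIAlign pooling, the concatenation fusion, and the linear projection are all illustrative assumptions about how semantic and perception features could be combined into region tokens for the language model; the real module's layout and dimensions are defined in the paper and the released code.

```python
# Conceptual sketch of dual-encoder region-token fusion, NOT the authors'
# implementation. Assumes ROIAlign pooling over two feature maps
# (semantic-rich and perception-enhanced), concatenation, and a linear
# projection into the LLM embedding space.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class HybridRegionEncoderSketch(nn.Module):
    def __init__(self, sem_dim: int, per_dim: int, llm_dim: int, pool: int = 7):
        super().__init__()
        self.pool = pool
        # Project the concatenated, pooled region features to the LLM token width.
        self.proj = nn.Linear((sem_dim + per_dim) * pool * pool, llm_dim)

    def forward(self, sem_feats, per_feats, boxes, sem_stride: float, per_stride: float):
        # sem_feats: (B, C_s, H_s, W_s); per_feats: (B, C_p, H_p, W_p)
        # boxes: list of (N_i, 4) tensors in image coordinates (x1, y1, x2, y2).
        sem_roi = roi_align(sem_feats, boxes, self.pool, spatial_scale=1.0 / sem_stride)
        per_roi = roi_align(per_feats, boxes, self.pool, spatial_scale=1.0 / per_stride)
        fused = torch.cat([sem_roi, per_roi], dim=1)  # (N, C_s + C_p, pool, pool)
        return self.proj(fused.flatten(1))            # (N, llm_dim) region tokens
```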
### Model Sources
- Repository: https://github.com/om-ai-lab/VLM-FO1
- Paper: https://arxiv.org/pdf/2509.25916
## Citation

```bibtex
@article{liu2025vlm,
  title   = {VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs},
  author  = {Liu, Peng and Shen, Haozhan and Fang, Chunxin and Sun, Zhicheng and Liao, Jiajia and Zhao, Tiancheng},
  journal = {arXiv preprint arXiv:2509.25916},
  year    = {2025}
}
```