VLM-FO1: Qwen2.5-VL-3B-v01

This repository contains the VLM-FO1_Qwen2.5-VL-3B-v01 model, an implementation of the VLM-FO1 framework built on the Qwen2.5-VL-3B base model.

VLM-FO1 is a novel plug-and-play framework designed to bridge the gap between the high-level reasoning of Vision-Language Models (VLMs) and the need for fine-grained visual perception.

Model Details

Model Description

VLM-FO1 endows pre-trained VLMs with superior fine-grained perception without compromising their inherent high-level reasoning and general understanding capabilities. It operates as a plug-and-play module that can be integrated with any existing VLM, establishing an effective and flexible paradigm for building the next generation of perception-aware models.

VLM-FO1 excels at a wide range of fine-grained perception tasks, including Object Grounding, Region Generative Understanding, Visual Region Reasoning, and more.
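As a usage illustration, the snippet below sketches how one might load the checkpoint with the Hugging Face `transformers` Auto classes and format a region-grounded prompt. This is a minimal sketch, not documented usage: the `build_region_prompt` helper and its `<box>` tag syntax are hypothetical placeholders (the real prompt schema should be taken from the repository's own examples), and `trust_remote_code=True` is assumed because VLM-FO1 adds custom modules on top of the base model.

```python
MODEL_ID = "omlab/VLM-FO1_Qwen2.5-VL-3B-v01"


def build_region_prompt(question: str, boxes: list) -> str:
    """Format a question plus [x1, y1, x2, y2] region boxes.

    NOTE: the <box>...</box> syntax here is an illustrative placeholder,
    not the model's documented region-token format.
    """
    regions = " ".join(
        "<box>" + ",".join(str(c) for c in box) + "</box>" for box in boxes
    )
    return f"{question} Regions: {regions}"


def load_model():
    """Load processor and weights; assumes the standard Auto-class API works
    for this checkpoint and that custom code must be trusted."""
    from transformers import AutoModelForCausalLM, AutoProcessor  # requires `pip install transformers`

    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="bfloat16", trust_remote_code=True
    )
    return processor, model


if __name__ == "__main__":
    print(build_region_prompt("Describe this region.", [[0.1, 0.2, 0.5, 0.6]]))
```

The weights are BF16, so a GPU with bfloat16 support (or a float32 fallback) is advisable for inference.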

🧩 Plug-and-Play Modularity: Our framework is designed as a set of enhancement modules that can be seamlessly integrated with any pre-trained VLM, preserving its original weights and capabilities.

🧠 Hybrid Fine-grained Region Encoder (HFRE): We introduce a novel Dual-Vision Encoder architecture that fuses semantic-rich features with perception-enhanced features, creating powerful region tokens that capture both high-level meaning and fine-grained spatial detail.

🎯 State-of-the-Art Performance: VLM-FO1 achieves SOTA results across a diverse suite of benchmarks.

✅ Preserves General Abilities: Our two-stage training strategy ensures that fine-grained perception is gained without causing catastrophic forgetting of the base model's powerful general visual understanding abilities.
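To make the HFRE idea above concrete, here is a minimal, framework-free sketch of the fusion step: a region's semantic features and its perception-enhanced features are concatenated into a single region token, which a learned projection then maps to the language model's embedding width. All names and the toy projection are illustrative assumptions, not the paper's implementation.

```python
from typing import List


def fuse_region_features(semantic: List[float], perception: List[float]) -> List[float]:
    """Concatenate the two feature views of one region into a fused token.

    Illustrative only: the paper's HFRE fuses features from a dual-vision
    encoder; here we just model the concatenation conceptually.
    """
    return semantic + perception


def project(token: List[float], weights: List[List[float]]) -> List[float]:
    """Toy learned projection (W @ token) mapping the fused region token
    to the LLM embedding width. Weights would be trained in practice."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]
```

In the real system, one such region token would be produced per candidate region and interleaved with the text tokens fed to the language model.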

Model Sources

Paper: arXiv:2509.25916

Citation

@article{liu2025vlm,
  title={VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs},
  author={Liu, Peng and Shen, Haozhan and Fang, Chunxin and Sun, Zhicheng and Liao, Jiajia and Zhao, Tiancheng},
  journal={arXiv preprint arXiv:2509.25916},
  year={2025}
}
Model size: ~4B parameters (BF16, Safetensors format)
