Senqiao
/

VisionThink-General

Safetensors

qwen2_5_vl

Model card Files Files and versions

xet

Community

Improve model card: Add pipeline tag, library_name, paper, code, usage, and additional tags

by nielsr HF Staff - opened Jul 19

base: refs/heads/main

←

from: refs/pr/1

Discussion Files changed

+115

-4

Files changed (1) hide show

README.md +115 -4

README.md CHANGED Viewed

@@ -1,8 +1,119 @@
 ---
-license: apache-2.0
 datasets:
 - Senqiao/VisionThink-General-Train
 - Senqiao/VisionThink-General-Val
-base_model:
-- Qwen/Qwen2.5-VL-7B-Instruct
----

 ---
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
 datasets:
 - Senqiao/VisionThink-General-Train
 - Senqiao/VisionThink-General-Val
+license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
+tags:
+- vision-language-model
+- multimodal
+- qwen
+---
+<p align="center" width="100%">
+<img src="https://raw.githubusercontent.com/dvlab-research/VisionThink/main/files/VisionThink.jpg" alt="VisionThink" style="width: 100%; min-width: 300px; display: block; margin: auto;">
+</p>
+# VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
+This repository contains the `VisionThink-General` model, a smart and efficient vision-language model. VisionThink introduces a new paradigm for visual token compression in Vision-Language Models (VLMs). It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Unlike existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case.
+The model was presented in the paper [**VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning**](https://huggingface.co/papers/2507.13348).
+The official code and more details can be found on the [**VisionThink GitHub repository**](https://github.com/dvlab-research/VisionThink).
+## Highlights
+<p align="center" width="80%">
+<img src="https://raw.githubusercontent.com/dvlab-research/VisionThink/main/files/Framework.jpg" alt="VisionThink Framework" style="width: 80%; min-width: 300px; display: block; margin: auto;">
+</p>
+1. Our VisionThink leverages reinforcement learning to **autonomously** learn whether to reduce visual tokens. Compared to traditional efficient VLM approaches, our method achieves significant improvements on **fine-grained** benchmarks, such as those involving OCR-related tasks.
+2. VisionThink improves performance on **General VQA** tasks while reducing visual tokens by **50%**, achieving **102%** of the original model’s performance across nine benchmarks.
+3. VisionThink achieves strong performance and efficiency by simply resizing input images to reduce visual tokens. We hope this inspires further research into **Efficient Reasoning Vision Language Models**.
+## Installation
+The environment follows the [Verl](https://github.com/volcengine/verl).
+```bash
+git clone https://github.com/dvlab-research/VisionThink.git
+conda create -n visionthink python=3.11 -y
+conda activate visionthink
+# veRL
+pip3 install -e .
+# flash-attn
+pip3 install flash-attn --no-build-isolation
+```
+If you want to use the Qwen3 as the Judge Model.
+```bash
+pip install -U tensordict
+pip install transformers==4.51.0
+```
+## Usage
+You can easily load and use VisionThink with the Hugging Face `transformers` library. Below is a quick example demonstrating how to load the `VisionThink-General` model and perform inference.
+```python
+from transformers import AutoProcessor, AutoModelForCausalLM
+from PIL import Image
+# Load model and processor
+model_id = "Senqiao/VisionThink-General" # Or "Senqiao/VisionThink-Efficient"
+processor = AutoProcessor.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype="auto",
+    device_map="auto",
+    trust_remote_code=True
+)
+# Prepare input image and text
+# Replace with your image path
+image = Image.open("./path/to/your/image.jpg").convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": image},
+            {"type": "text", "text": "Describe this image in detail."},
+        ],
+    }
+]
+# Apply chat template and process inputs
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+inputs = processor(text=text, images=image, return_tensors="pt")
+inputs = inputs.to(model.device)
+# Generate response
+generated_ids = model.generate(**inputs, max_new_tokens=512)
+# Decode and print the output
+generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
+response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(response)
+```
+## Citation
+If you find this project useful in your research, please consider citing:
+```bibtex
+@article{yang2025visionthink,
+  title={VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning},
+  author={Yang, Senqiao and Li, Junyi and Lai, Xin and Yu, Bei and Zhao, Hengshuang and Jia, Jiaya},
+  journal={arXiv preprint arXiv:2507.13348},
+  year={2025},
+}
+```
+## Acknowledgement
+- This work is built upon [Verl](https://github.com/volcengine/verl), [EasyR1](https://github.com/hiyouga/EasyR1), [Lmms-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), and [MMSearch-R1](https://github.com/EvolvingLMMs-Lab/multimodal-search-r1). We thank them for their excellent open-source contributions.
+- We also thank [Qwen](https://github.com/QwenLM/Qwen2.5-VL), [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1), [VisionZip](https://github.com/dvlab-research/VisionZip), [FastV](https://github.com/pkunlp-icler/FastV), [SparseVLM](https://github.com/Gumpest/SparseVLMs), and others for their contributions, which have provided valuable insights.