
Improve model card: Add pipeline tag, library_name, paper, code, usage, and additional tags

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +115 -4
README.md CHANGED
@@ -1,8 +1,119 @@
  ---
- license: apache-2.0
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
  datasets:
  - Senqiao/VisionThink-General-Train
  - Senqiao/VisionThink-General-Val
- base_model:
- - Qwen/Qwen2.5-VL-7B-Instruct
- ---
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - vision-language-model
+ - multimodal
+ - qwen
+ ---
+
+ <p align="center" width="100%">
+ <img src="https://raw.githubusercontent.com/dvlab-research/VisionThink/main/files/VisionThink.jpg" alt="VisionThink" style="width: 100%; min-width: 300px; display: block; margin: auto;">
+ </p>
+
+ # VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
+
+ This repository contains the `VisionThink-General` model, a smart and efficient vision-language model. VisionThink introduces a new paradigm for visual token compression in Vision-Language Models (VLMs): it starts from a downsampled image and decides whether that resolution is sufficient to solve the problem; if not, it outputs a special token to request the higher-resolution image. Unlike existing efficient-VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens on a case-by-case basis.
+
+ The model was presented in the paper [**VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning**](https://huggingface.co/papers/2507.13348).
+
+ The official code and more details can be found in the [**VisionThink GitHub repository**](https://github.com/dvlab-research/VisionThink).
+
+ ## Highlights
+ <p align="center" width="80%">
+ <img src="https://raw.githubusercontent.com/dvlab-research/VisionThink/main/files/Framework.jpg" alt="VisionThink Framework" style="width: 80%; min-width: 300px; display: block; margin: auto;">
+ </p>
+
+ 1. Our VisionThink leverages reinforcement learning to **autonomously** learn whether to reduce visual tokens. Compared to traditional efficient-VLM approaches, our method achieves significant improvements on **fine-grained** benchmarks, such as those involving OCR-related tasks.
+
+ 2. VisionThink improves performance on **General VQA** tasks while reducing visual tokens by **50%**, achieving **102%** of the original model’s performance across nine benchmarks.
+
+ 3. VisionThink achieves strong performance and efficiency by simply resizing input images to reduce visual tokens (see the sketch below). We hope this inspires further research into **Efficient Reasoning Vision Language Models**.
+
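+ The snippet below is a minimal, illustrative sketch of why resizing reduces visual tokens for a Qwen2.5-VL-style processor: the token count scales with image area (roughly one token per 28x28 pixel block after 2x2 patch merging), so halving each side cuts it to about a quarter. The block size and the 2x downsample factor here are approximations for illustration, not the exact settings used to train VisionThink; since the model can still request the full-resolution image when needed, the average saving across a benchmark is smaller than the per-image factor.
+
+ ```python
+ from PIL import Image
+
+ # Approximation: Qwen2.5-VL-style encoders use 14x14 patches merged 2x2,
+ # i.e. roughly one visual token per 28x28 pixel block.
+ TOKEN_BLOCK = 28
+
+ def approx_visual_tokens(width: int, height: int) -> int:
+     """Rough visual-token count for an image of the given size."""
+     return max(1, width // TOKEN_BLOCK) * max(1, height // TOKEN_BLOCK)
+
+ image = Image.open("./path/to/your/image.jpg").convert("RGB")
+
+ # Downsample by 2x per side (illustrative factor): roughly 4x fewer visual tokens.
+ small = image.resize((image.width // 2, image.height // 2))
+
+ print("full-resolution tokens ~", approx_visual_tokens(image.width, image.height))
+ print("downsampled tokens     ~", approx_visual_tokens(small.width, small.height))
+ ```
+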
+ ## Installation
+
+ The environment setup follows [Verl](https://github.com/volcengine/verl).
+ ```bash
+ git clone https://github.com/dvlab-research/VisionThink.git
+ conda create -n visionthink python=3.11 -y
+ conda activate visionthink
+ # veRL
+ pip3 install -e .
+ # flash-attn
+ pip3 install flash-attn --no-build-isolation
+ ```
+ If you want to use Qwen3 as the judge model, additionally install:
+ ```bash
+ pip install -U tensordict
+ pip install transformers==4.51.0
+ ```
+
+ ## Usage
+
+ You can load and use VisionThink with the Hugging Face `transformers` library. Below is a quick example that loads the `VisionThink-General` model and runs inference.
+
+ ```python
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+ from PIL import Image
+
+ # Load the model and its processor
+ model_id = "Senqiao/VisionThink-General"  # Or "Senqiao/VisionThink-Efficient"
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype="auto",
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Prepare the input image and text
+ # Replace with your image path
+ image = Image.open("./path/to/your/image.jpg").convert("RGB")
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": "Describe this image in detail."},
+         ],
+     }
+ ]
+
+ # Apply the chat template and process the inputs
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ inputs = processor(text=text, images=image, return_tensors="pt")
+ inputs = inputs.to(model.device)
+
+ # Generate a response
+ generated_ids = model.generate(**inputs, max_new_tokens=512)
+
+ # Decode only the newly generated tokens and print the output
+ generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
+ response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
+
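+ Because VisionThink can either answer from the downsampled image or ask for more pixels, inference is naturally a two-pass loop: generate once on the downsampled image and rerun on the full-resolution image if the output signals a resize request. The sketch below illustrates only this control flow, reusing the `processor` and `model` loaded above; the `REQUEST_MARKER` string and the 2x downsample factor are placeholders (assumptions), not the exact special token or ratio used by the released model. See the GitHub repository for the official inference code.
+
+ ```python
+ from PIL import Image
+
+ def generate_once(img: Image.Image, question: str) -> str:
+     """One generation pass, reusing `processor` and `model` loaded in the example above."""
+     msgs = [{
+         "role": "user",
+         "content": [
+             {"type": "image", "image": img},
+             {"type": "text", "text": question},
+         ],
+     }]
+     prompt = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
+     batch = processor(text=prompt, images=img, return_tensors="pt").to(model.device)
+     out = model.generate(**batch, max_new_tokens=512)
+     # Keep special tokens so a resize-request marker (if emitted) remains visible.
+     return processor.batch_decode(out[:, batch["input_ids"].shape[1]:], skip_special_tokens=False)[0]
+
+ question = "What does the small text in the corner say?"
+ full = Image.open("./path/to/your/image.jpg").convert("RGB")
+
+ # First pass on a downsampled image (2x per side is an assumed, illustrative factor).
+ small = full.resize((full.width // 2, full.height // 2))
+ answer = generate_once(small, question)
+
+ # Hypothetical marker for "please provide the high-resolution image"; the actual special
+ # token is defined by the VisionThink training setup (see the GitHub repository).
+ REQUEST_MARKER = "<request_high_resolution>"
+ if REQUEST_MARKER in answer:
+     # Second pass with the original, full-resolution image.
+     answer = generate_once(full, question)
+
+ print(answer)
+ ```
+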
+ ## Citation
+ If you find this project useful in your research, please consider citing:
+
+ ```bibtex
+ @article{yang2025visionthink,
+   title={VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning},
+   author={Yang, Senqiao and Li, Junyi and Lai, Xin and Yu, Bei and Zhao, Hengshuang and Jia, Jiaya},
+   journal={arXiv preprint arXiv:2507.13348},
+   year={2025},
+ }
+ ```
+
+ ## Acknowledgement
+ - This work is built upon [Verl](https://github.com/volcengine/verl), [EasyR1](https://github.com/hiyouga/EasyR1), [Lmms-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), and [MMSearch-R1](https://github.com/EvolvingLMMs-Lab/multimodal-search-r1). We thank them for their excellent open-source contributions.
+
+ - We also thank [Qwen](https://github.com/QwenLM/Qwen2.5-VL), [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1), [VisionZip](https://github.com/dvlab-research/VisionZip), [FastV](https://github.com/pkunlp-icler/FastV), [SparseVLM](https://github.com/Gumpest/SparseVLMs), and others for their contributions, which have provided valuable insights.