---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- BAAI/Infinity-MM
language:
- en
- zh
base_model:
- google/siglip2-so400m-patch16-512
- Qwen/Qwen2-1.5B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# FlashVL-2B-Static-GRPO
[\[📜 FlashVL\]](https://www.arxiv.org/abs/2505.09498)

![image/png](https://s3plus.meituan.net/automl-datasets/mlm/logo.jpg)

## Introduction

We are excited to introduce **Flash-VL 2B**, an approach to optimizing Vision-Language Models (VLMs) for real-time applications that targets ultra-low latency and high throughput without sacrificing accuracy. Flash-VL 2B combines tailored architectural choices, token compression mechanisms, data curation, carefully designed training schemes, and a novel image-processing technique called implicit semantic stitching, which balances computational load against model performance. Extensive evaluations on 11 standard VLM benchmarks show that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a strong candidate for deployment in resource-constrained environments and large-scale real-time applications.


### Environment Setup

```bash
pip install torch==2.1.2
pip install transformers==4.50.0.dev0
```
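
Note that `4.50.0.dev0` is a development build of transformers and may not be available on PyPI. If the pinned version fails to resolve, installing from source should give an equivalent build:

```bash
# Fallback: install the development version of transformers from source
pip install git+https://github.com/huggingface/transformers
```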


### How to use it?

```python
import torch
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model; the remote code defines the FlashVL architecture and its chat() helper
model_path = "Flash-VL/FlashVL-2B-Static-GRPO"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map='cuda',
)
model.tokenizer = AutoTokenizer.from_pretrained(model_path)
model.im_trans = SiglipProcessor.from_pretrained(model_path).image_processor

# Fetch the demo image (a vegetable price board)
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/3FF4.png"
response = requests.get(image_url)
pil_image = Image.open(BytesIO(response.content)).convert('RGB')

# Ask, in Chinese: "What vegetable is in the first row, second column,
# and how much does one jin (500 g) cost?"
messages = [{'role': 'user', 'content': "说说图中第一行第二列是什么蔬菜,买一斤多少钱"}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)
# 图片中第一行第二列的蔬菜是**荷兰豆**,买一斤的价格是**¥16.8**。
# ("The vegetable in the first row, second column is snow peas; one jin costs ¥16.8.")
```
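
Since `chat()` takes the full message list, a follow-up turn can plausibly be issued by appending the previous answer and a new question. This is only a sketch: the card does not document whether `chat()` accepts prior assistant turns, so check the remote code before relying on it.

```python
# Sketch of a multi-turn follow-up. Assumes (not documented on this card)
# that model.chat() accepts earlier assistant turns in `messages`.
messages.append({'role': 'assistant', 'content': answer})
messages.append({'role': 'user', 'content': "那最贵的蔬菜是哪种?"})  # "Which vegetable is the most expensive?"
follow_up = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(follow_up)
```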

### Evaluation

| Method / Model          | Average       | DynaMath | MathVision | MathVerse | MMMU Pro | WeMath |
| :---------------------: | :-----------: | :------: | :--------: | :-------: | :------: | :----: |
| Flash-VL-2B<sub>s</sub> | 23.80         | 23.19    | 26.72      | 16.84     | 16.24    | 36.03  |
| InternVL3-2B            | 27.03         | 32.55    | 26.49      | 17.00     | 22.56    | 36.55  |
| + SFT                   | 26.08 (+2.28) | 28.28    | 31.06      | 16.97     | 15.95    | 38.16  |
| + RL                    | 27.23 (+3.43) | 26.94    | 27.94      | 17.73     | 16.99    | 46.55  |
| FlashVL-2B-Static-GRPO  | 29.05 (+5.25) | 30.61    | 32.48      | 18.45     | 16.53    | 47.18  |

Note: the "+ SFT" and "+ RL" rows build on Flash-VL-2B<sub>s</sub>, and the deltas in parentheses are relative to its average (23.80). FlashVL-2B-Static-GRPO applies both SFT and RL.


We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to run the evaluations above.
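
For reference, VLMEvalKit is driven through its `run.py` entry point with `--data` and `--model` flags. The model identifier below is a placeholder, since this card does not state the name under which FlashVL is registered in VLMEvalKit:

```bash
# Hypothetical invocation: <flashvl_model_name> is a placeholder, and the
# dataset keys must match those defined by your VLMEvalKit version.
python run.py --data MathVision MathVerse --model <flashvl_model_name> --verbose
```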



## Citation
If you find this project useful in your research, please consider citing:

```BibTeX
@misc{zhang2025flashvl2boptimizingvisionlanguage,
      title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput}, 
      author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma},
      year={2025},
      eprint={2505.09498},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.09498}, 
}
```