---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Qwen2.5-VL
- Qwen2.5-VL-7B-Instruct
- Int8
- VLM
---

# Qwen2.5-VL-7B-Instruct

This version of Qwen2.5-VL-7B-Instruct has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 3.4

## Convert tools links:

For those interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera/tree/main)

[AXera NPU AXCL LLM Runtime](https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera/tree/axcl)

## Support Platform

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

**Image Process**

|Chips| input size | image num | image encoder | ttft (320 tokens) | decode (w8a16) | DDR | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 448*448 | 1 | 760 ms | 3500 ms | 2.0 tokens/sec | 10.0 GiB | 9.8 GiB |

**Video Process**

|Chips| input size | image num | image encoder | ttft (512 tokens) | decode (w8a16) | DDR | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 308*308 | 8 | 1500 ms | 5080 ms | 2.0 tokens/sec | 10.0 GiB | 9.8 GiB |

The DDR column is the CMM memory the model consumes at runtime. Make sure the CMM memory allocated on the development board is larger than this value.

## How to use

Download all files from this repository to the device.

**If you are using the AX650 Board**

```
(base) axera@dell:~/lhj/Qwen2.5-VL-7B-Instruct$ tree -L 2
.
├── images
├── main_axcl_x86
├── post_config.json
├── Qwen2.5-VL-7B-Instruct-AX650-chunk_prefill_1280
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── Qwen2.5-VL-7B-Instruct_vision.axmodel
│   ├── qwen2_5_vl_p128_l0_together.axmodel
│   ......
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5_vl_7b_tokenizer
├── qwen2_tokenizer_images.py
├── qwen2_tokenizer_video_308.py
├── README.md
├── run_qwen2_5vl_image.sh
├── run_qwen2_5vl_video.sh
└── video
```

### Prepare tokenizer server

#### Install transformers

```
pip install transformers==4.55.2 jinja2
```

### Demo Run

#### Image understanding demo

##### Start the tokenizer server for the image understanding demo

```
python3 qwen2_tokenizer_images.py --port 12345
```

##### Run the image understanding demo

- input text

```
What are these attractions? Please give their names in Chinese and English
```

- input image

![](./images/attractions)

```
(base) axera@dell:~/lhj/Qwen2.5-VL-7B-Instruct$ bash run_qwen2_5vl_image.sh
[I][ Init][ 162]: LLM init start
[I][ Init][ 267]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][ Init][ 328]: image encoder output float32
[I][ Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> What are these attractions?
Please give their names in Chinese and English
image >> images/attractions
images/attractions/recoAll_attractions_1.jpg
images/attractions/recoAll_attractions_2.jpg
images/attractions/recoAll_attractions_3.jpg
images/attractions/recoAll_attractions_4.jpg
[I][ Encode][ 552]: image encode time : 3014.224121 ms, size : 4
[I][ Encode][ 594]: input_ids size:1064
[I][ Encode][ 602]: offset 15
[I][ Encode][ 602]: offset 273
[I][ Encode][ 602]: offset 531
[I][ Encode][ 602]: offset 789
[I][ Encode][ 624]: out_embed size:3813376
[I][ Encode][ 626]: position_ids size:7982
[I][ Run][ 645]: input token num : 1064, prefill_split_num : 9
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:40
[I][ Run][ 816]: ttft: 15817.47 ms

1. **金字塔 (Pyramids)**
   - **英文**: Pyramids
   - **位置**: 埃及 (Egypt)

2. **长城 (Great Wall of China)**
   - **英文**: Great Wall of China
   - **位置**: 中国 (China)

3. **自由女神像 (Statue of Liberty)**
   - **英文**: Statue of Liberty
   - **位置**: 美国 (United States)

4.
**兵马俑 (Terracotta Army)**
   - **英文**: Terracotta Army
   - **位置**: 中国 (China)

[N][ Run][ 969]: hit eos,avg 2.05 token/s
```

#### Video understanding demo

Pre-process the frames of the video file into 308x308 images beforehand.

##### Start the tokenizer server for the video understanding demo

```
python qwen2_tokenizer_video_308.py --port 12345
```

##### Run the video understanding demo

```
(base) axera@dell:~/lhj/Qwen2.5-VL-7B-Instruct$ bash run_qwen2_5vl_video.sh
[I][ Init][ 162]: LLM init start
[I][ Init][ 267]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][ Init][ 328]: image encoder output float32
[I][ Init][ 340]: max_token_len : 2047
[I][ Init][ 343]: kv_cache_size : 512, kv_cache_num: 2047
[I][ Init][ 351]: prefill_token_num : 128
[I][ Init][ 355]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 355]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 355]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 355]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 355]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 355]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 355]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 355]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 355]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 355]: grp: 10, prefill_max_token_num : 1152
[I][ Init][ 355]: grp: 11, prefill_max_token_num : 1280
[I][ Init][ 359]: prefill_max_token_num : 1280
[I][ load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 30,
    "repetition_penalty": 2,
    "temperature": 0.1,
    "top_k": 10,
    "top_p": 0.8
}
[I][ Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频的内容
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][ Encode][ 528]: pixel_values,size:4
[I][ Encode][ 554]: image encode
time : 1546.058960 ms, size : 4
[I][ Encode][ 596]: input_ids size:509
[I][ Encode][ 604]: offset 15
[I][ Encode][ 620]: img_embed.size:4, 433664
[I][ Encode][ 625]: offset:136
[I][ Encode][ 625]: offset:257
[I][ Encode][ 625]: offset:378
[I][ Encode][ 634]: out_embed size:1824256
[I][ Encode][ 636]: position_ids size:509
[I][ Run][ 655]: input token num : 509, prefill_split_num : 4
[I][ Run][ 689]: input_num_token:128
[I][ Run][ 689]: input_num_token:128
[I][ Run][ 689]: input_num_token:128
[I][ Run][ 689]: input_num_token:125
[I][ Run][ 826]: ttft: 5081.97 ms
这张图片展示了两只土拨鼠在户外的山地环境中进行互动。它们似乎在进行一种类似打斗的行为,可能是在争夺领地或展示攻击性。背景是蓝天和山脉,环境看起来非常自然和开阔。土拨鼠的毛色主要是棕色和灰色,带有白色的斑纹。它们的姿势和动作显示出它们正在积极地互动。

[N][ Run][ 979]: hit eos,avg 2.08 token/s
```
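The video demo above expects the frames under `video/` to already be 308x308 JPEGs named `frame_0000.jpg`, `frame_0008.jpg`, … (every 8th frame). A minimal sketch of that pre-processing with Pillow; the helper name, input directory layout, and sampling stride are assumptions, so adjust them to how you extract frames from your clip:

```python
# Hypothetical pre-processing helper: resize every `step`-th extracted frame
# to size x size and save it with the frame_XXXX.jpg naming the demo uses.
import glob
import os
from PIL import Image

def preprocess_frames(src_dir, dst_dir, size=308, step=8):
    """Resize every `step`-th JPEG in src_dir to size x size, save to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    frames = sorted(glob.glob(os.path.join(src_dir, "*.jpg")))
    written = []
    for i, path in enumerate(frames[::step]):
        img = Image.open(path).convert("RGB").resize((size, size))
        out = os.path.join(dst_dir, f"frame_{i * step:04d}.jpg")
        img.save(out)
        written.append(out)
    return written
```

For example, `preprocess_frames("raw_frames", "video")` would produce `video/frame_0000.jpg`, `video/frame_0008.jpg`, and so on, ready for `run_qwen2_5vl_video.sh`.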
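The `load_config` log shows the sampler settings from `post_config.json`: top-k sampling enabled with `top_k: 10` and a low `temperature: 0.1`, top-p and repetition penalty disabled. For reference, this is roughly what temperature-scaled top-k sampling computes at each decode step; a numpy sketch for illustration, not the runtime's actual code:

```python
# Illustrative top-k + temperature sampling (as configured in post_config.json:
# top_k=10, temperature=0.1). Low temperature sharpens the distribution, so
# decoding is close to greedy while still allowing some variation.
import numpy as np

def sample_top_k(logits, temperature=0.1, top_k=10, rng=None):
    """Keep the top_k largest logits, apply temperature, sample one token id."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    top = np.argpartition(scaled, -top_k)[-top_k:]   # ids of the top_k logits
    probs = np.exp(scaled[top] - scaled[top].max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```

With `temperature: 0.1`, a logit gap of a few points becomes a gap of tens after scaling, so the most likely token is chosen almost every step, which suits factual image/video description.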