---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Qwen2.5-VL
- Qwen2.5-VL-7B-Instruct
- Int8
- VLM
---

# Qwen2.5-VL-7B-Instruct

This version of Qwen2.5-VL-7B-Instruct has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 3.4

## Convert tools links:

For those interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera/tree/main)

[AXera NPU AXCL LLM Runtime](https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera/tree/axcl)

## Support Platform

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

**Image Process**

|Chips| input size | image num | image encoder | ttft (320 tokens) | decode (w8a16) | DDR | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 448*448 | 1 | 760 ms | 3500 ms | 2.0 tokens/sec | 10.0 GiB | 9.8 GiB |

**Video Process**

|Chips| input size | image num | image encoder | ttft (512 tokens) | decode (w8a16) | DDR | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 308*308 | 8 | 1500 ms | 5080 ms | 2.0 tokens/sec | 10.0 GiB | 9.8 GiB |

The DDR column is the CMM memory the model consumes at runtime. Make sure the CMM memory allocated on the development board is larger than this value.

## How to use

Download all files from this repository to the device.

**If you are using the AX650 Board**

```
(base) axera@dell:~/lhj/Qwen2.5-VL-7B-Instruct$ tree -L 2
.
├── images
├── main_axcl_x86
├── post_config.json
├── Qwen2.5-VL-7B-Instruct-AX650-chunk_prefill_1280
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── Qwen2.5-VL-7B-Instruct_vision.axmodel
│   ├── qwen2_5_vl_p128_l0_together.axmodel
│   ......
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5_vl_7b_tokenizer
├── qwen2_tokenizer_images.py
├── qwen2_tokenizer_video_308.py
├── README.md
├── run_qwen2_5vl_image.sh
├── run_qwen2_5vl_video.sh
└── video
```

### Prepare tokenizer server

#### Install transformers

```
pip install transformers==4.55.2 jinja2
```

### Demo Run

#### Image understanding demo

##### Start the tokenizer server for the image understanding demo

```
python3 qwen2_tokenizer_images.py --port 12345
```

##### Run the image understanding demo

- input text

```
What are these attractions? Please give their names in Chinese and English
```

- input image

![](./images/attractions)

```
(base) axera@dell:~/lhj/Qwen2.5-VL-7B-Instruct$ bash run_qwen2_5vl_image.sh
[I][ Init][ 162]: LLM init start
[I][ Init][ 267]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][ Init][ 328]: image encoder output float32
[I][ Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> What are these attractions?
Please give their names in Chinese and English
image >> images/attractions
images/attractions/recoAll_attractions_1.jpg
images/attractions/recoAll_attractions_2.jpg
images/attractions/recoAll_attractions_3.jpg
images/attractions/recoAll_attractions_4.jpg
[I][ Encode][ 552]: image encode time : 3014.224121 ms, size : 4
[I][ Encode][ 594]: input_ids size:1064
[I][ Encode][ 602]: offset 15
[I][ Encode][ 602]: offset 273
[I][ Encode][ 602]: offset 531
[I][ Encode][ 602]: offset 789
[I][ Encode][ 624]: out_embed size:3813376
[I][ Encode][ 626]: position_ids size:7982
[I][ Run][ 645]: input token num : 1064, prefill_split_num : 9
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:128
[I][ Run][ 679]: input_num_token:40
[I][ Run][ 816]: ttft: 15817.47 ms

1. **金字塔 (Pyramids)**
   - **英文**: Pyramids
   - **位置**: 埃及 (Egypt)

2. **长城 (Great Wall of China)**
   - **英文**: Great Wall of China
   - **位置**: 中国 (China)

3. **自由女神像 (Statue of Liberty)**
   - **英文**: Statue of Liberty
   - **位置**: 美国 (United States)

4.
**兵马俑 (Terracotta Army)**
   - **英文**: Terracotta Army
   - **位置**: 中国 (China)

[N][ Run][ 969]: hit eos,avg 2.05 token/s
```

#### Video understanding demo

Pre-process the frames of the video file into 308x308 images beforehand.

##### Start the tokenizer server for the video understanding demo

```
python qwen2_tokenizer_video_308.py --port 12345
```

##### Run the video understanding demo

```
(base) axera@dell:~/lhj/Qwen2.5-VL-7B-Instruct$ bash run_qwen2_5vl_video.sh
[I][ Init][ 162]: LLM init start
[I][ Init][ 267]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][ Init][ 328]: image encoder output float32
[I][ Init][ 340]: max_token_len : 2047
[I][ Init][ 343]: kv_cache_size : 512, kv_cache_num: 2047
[I][ Init][ 351]: prefill_token_num : 128
[I][ Init][ 355]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 355]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 355]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 355]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 355]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 355]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 355]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 355]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 355]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 355]: grp: 10, prefill_max_token_num : 1152
[I][ Init][ 355]: grp: 11, prefill_max_token_num : 1280
[I][ Init][ 359]: prefill_max_token_num : 1280
[I][ load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 30,
    "repetition_penalty": 2,
    "temperature": 0.1,
    "top_k": 10,
    "top_p": 0.8
}
[I][ Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频的内容
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][ Encode][ 528]: pixel_values,size:4
[I][ Encode][ 554]: image encode
time : 1546.058960 ms, size : 4
[I][ Encode][ 596]: input_ids size:509
[I][ Encode][ 604]: offset 15
[I][ Encode][ 620]: img_embed.size:4, 433664
[I][ Encode][ 625]: offset:136
[I][ Encode][ 625]: offset:257
[I][ Encode][ 625]: offset:378
[I][ Encode][ 634]: out_embed size:1824256
[I][ Encode][ 636]: position_ids size:509
[I][ Run][ 655]: input token num : 509, prefill_split_num : 4
[I][ Run][ 689]: input_num_token:128
[I][ Run][ 689]: input_num_token:128
[I][ Run][ 689]: input_num_token:128
[I][ Run][ 689]: input_num_token:125
[I][ Run][ 826]: ttft: 5081.97 ms
这张图片展示了两只土拨鼠在户外的山地环境中进行互动。它们似乎在进行一种类似打斗的行为,可能是在争夺领地或展示攻击性。背景是蓝天和山脉,环境看起来非常自然和开阔。土拨鼠的毛色主要是棕色和灰色,带有白色的斑纹。它们的姿势和动作显示出它们正在积极地互动。

[N][ Run][ 979]: hit eos,avg 2.08 token/s
```
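The video demo above expects the frames under `video/` to already be 308x308 JPEGs named `frame_0000.jpg`, `frame_0008.jpg`, … (every 8th frame). A minimal sketch of that pre-processing with Pillow; the helper name, input directory layout, and sampling stride are assumptions, so adjust them to how you extract frames from your clip:

```python
# Hypothetical pre-processing helper: resize every `step`-th extracted frame
# to size x size and save it with the frame_XXXX.jpg naming the demo uses.
import glob
import os
from PIL import Image

def preprocess_frames(src_dir, dst_dir, size=308, step=8):
    """Resize every `step`-th JPEG in src_dir to size x size, save to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    frames = sorted(glob.glob(os.path.join(src_dir, "*.jpg")))
    written = []
    for i, path in enumerate(frames[::step]):
        img = Image.open(path).convert("RGB").resize((size, size))
        out = os.path.join(dst_dir, f"frame_{i * step:04d}.jpg")
        img.save(out)
        written.append(out)
    return written
```

For example, `preprocess_frames("raw_frames", "video")` would produce `video/frame_0000.jpg`, `video/frame_0008.jpg`, and so on, ready for `run_qwen2_5vl_video.sh`.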
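The `load_config` log shows the sampler settings from `post_config.json`: top-k sampling enabled with `top_k: 10` and a low `temperature: 0.1`, top-p and repetition penalty disabled. For reference, this is roughly what temperature-scaled top-k sampling computes at each decode step; a numpy sketch for illustration, not the runtime's actual code:

```python
# Illustrative top-k + temperature sampling (as configured in post_config.json:
# top_k=10, temperature=0.1). Low temperature sharpens the distribution, so
# decoding is close to greedy while still allowing some variation.
import numpy as np

def sample_top_k(logits, temperature=0.1, top_k=10, rng=None):
    """Keep the top_k largest logits, apply temperature, sample one token id."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    top = np.argpartition(scaled, -top_k)[-top_k:]   # ids of the top_k logits
    probs = np.exp(scaled[top] - scaled[top].max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```

With `temperature: 0.1`, a logit gap of a few points becomes a gap of tens after scaling, so the most likely token is chosen almost every step, which suits factual image/video description.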