Ming-UniVision
📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope | 💾 GitHub
Figure 1: Conceptual comparison and qualitative examples of Ming-UniVision built upon MingTok.
Figure 2: Multi-round image understanding, generation, and editing architecture of Ming-UniVision, powered by MingTok.
Quickstart (inference with `MingUniVisionInfer`):

```python
from mingunivisioninfer import MingUniVisionInfer

model = MingUniVisionInfer("inclusionAI/Ming-UniVision-16B-A3B")

# Single-round image generation
image_gen_prompt = "Please generate the corresponding image based on the description. A cute girl."
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": image_gen_prompt},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, output_image_prefix="a_cute_girl")
model.reset_inner_state()

# Single-round image understanding
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "Please describe the picture in detail."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()

# Multi-round editing: first ask the model to identify the editing region,
# then apply edits in later rounds, each with a unique output_image_prefix.
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "Given the edit instruction: Change the color of her cloth to red, please identify the editing region"},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_0")

messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "Change the color of her cloth to red."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_1")

messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "Refine the image for better clarity."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_2")
model.reset_inner_state()

# Single-round text-only conversation
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "请详细介绍鹦鹉的习性。"},  # "Please describe the habits of parrots in detail."
    ],
}]
output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()
```
📌 Tips:
- Set `output_image_prefix` to control where generated images are saved.
- For multi-round editing, chain `model.generate(..., for_edit=True)` calls, giving each round a unique `output_image_prefix`.
- Call `model.reset_inner_state()` to reset the conversation state between independent sessions.
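The multi-round editing pattern above generalizes to longer sessions. The loop below is a minimal sketch built only on the `generate` and `reset_inner_state` calls from the quickstart; the `edit_instructions` list and the prefix naming scheme are illustrative, not part of the released API.

```python
# Minimal sketch of a longer editing session, assuming only the
# MingUniVisionInfer API used above (generate / reset_inner_state).
# The edit_instructions list and prefix scheme are illustrative.
edit_instructions = [
    "Change the color of her cloth to red.",
    "Refine the image for better clarity.",
]

# Round 0: ask the model to identify the editing region first.
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": (
            f"Given the edit instruction: {edit_instructions[0]}, "
            "please identify the editing region"
        )},
    ],
}]
model.generate(messages, max_new_tokens=512, for_edit=True,
               output_image_prefix="edit_round_0")

# Later rounds rely on the model's inner state, so each call needs
# only the new instruction and a unique output_image_prefix.
for i, instruction in enumerate(edit_instructions, start=1):
    messages = [{
        "role": "HUMAN",
        "content": [{"type": "text", "text": instruction}],
    }]
    model.generate(messages, max_new_tokens=512, for_edit=True,
                   output_image_prefix=f"edit_round_{i}")

# Reset the inner state before starting an unrelated conversation.
model.reset_inner_state()
```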
📝 Note (Model Limitations):
Image understanding benchmarks (MMB: MMBench; MMS: MMStar; MathV: MathVista; Hall: HallusionBench):

Model | MMB ↑ | MMS ↑ | MMMU ↑ | MathV ↑ | Hall ↑ | AI2D ↑ | MM-Vet ↑ | OCRBench ↑ | MME ↑ |
---|---|---|---|---|---|---|---|---|---|
Understanding Only | |||||||||
Emu3-Chat | 58.5 | - | 31.6 | - | - | - | 37.2 | 687 | - |
Qwen2.5-VL-3B | 79.1 | 55.9 | 53.1 | 62.3 | 46.3 | 81.6 | - | 797 | 2157 |
Qwen2.5-VL-7B | 83.5 | 63.9 | 58.6 | 68.2 | 52.9 | 83.9 | 67.1 | 864 | 2347 |
InternVL2.5-4B | 81.1 | 58.3 | 52.3 | 60.5 | 46.3 | 81.4 | 60.6 | 828 | 2338 |
InternVL2.5-8B | 84.6 | 62.8 | 56.0 | 64.4 | 50.1 | 84.5 | 62.8 | 822 | 2344 |
DeepSeek-VL2 | 79.6 | 61.3 | 51.1 | 62.8 | - | 81.4 | - | 811 | 2253 |
Unified model, Separate representation | |||||||||
Janus-Pro-7B | 79.2 | - | 41.0 | - | - | - | 50.0 | - | - |
LMFusion | - | - | 41.7 | - | - | - | - | - | 1603 |
MetaQuery-L | 78.6 | - | 53.1 | - | - | - | 63.2 | - | - |
Show-o2-7B | 79.3 | 56.6 | 48.9 | - | - | 78.6 | - | - | - |
BLIP3-o 4B | 78.6 | - | 46.6 | - | - | - | 60.1 | - | 2161 |
BAGEL | 85.0 | - | 55.3 | 73.1 | - | - | 67.2 | - | 2388 |
Unified model, Unified representation | |||||||||
VILA-U | - | - | - | - | - | - | 33.5 | - | 1402 |
TokenFlow-XL | 76.8 | - | 43.2 | - | - | - | 48.2 | - | 1922 |
UniTok | - | - | - | - | - | - | 33.9 | - | 1448 |
Harmon-1.5B | 65.5 | - | 38.9 | - | - | - | - | - | 1476 |
TokLIP | 67.6 | - | 43.1 | - | - | - | 29.8 | - | - |
Ming-UniVision-16B-A3B (Ours) | 78.5 | 63.7 | 40.3 | 66.6 | 47.8 | 82.8 | 64.2 | 724 | 2023 |
Text-to-image generation on GenEval (per-category scores and Overall) and DPG-Bench:

Method | Single Obj. ↑ | Two Obj. ↑ | Counting ↑ | Colors ↑ | Position ↑ | Color Attri. ↑ | Overall ↑ | DPG-Bench ↑ |
---|---|---|---|---|---|---|---|---|
Generation Only | ||||||||
LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 | - |
PixArt-α | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | - |
SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | - |
DALL-E 2 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 | - |
Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 |
SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 |
DALL-E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 | 83.50 |
SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 | 84.08 |
Unified model, Separate representation | ||||||||
Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 | - |
Ming-Lite-Uni | 0.99 | 0.76 | 0.53 | 0.87 | 0.26 | 0.30 | 0.62 | - |
Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 | 82.63 |
Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 84.19 |
Show-o2-7B | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 | 86.14 |
MetaQuery-L† | - | - | - | - | - | - | 0.78 | 81.10 |
BLIP3-o 4B | - | - | - | - | - | - | 0.81 | 79.36 |
BAGEL | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 | - |
Unified model, Unified representation | ||||||||
Harmon-1.5B | 0.99 | 0.86 | 0.66 | 0.85 | 0.74 | 0.48 | 0.79 | - |
TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 | 73.38 |
Ming-UniVision-16B-A3B (Ours) | 1.00 | 0.93 | 0.59 | 0.93 | 0.92 | 0.70 | 0.85 | 82.12 |
If you find Ming-UniVision useful, please cite:

```bibtex
@article{huang2025mingunivision,
  title={Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer},
  author={Huang, Ziyuan and Zheng, DanDan and Zou, Cheng and Liu, Rui and Wang, Xiaolong and Ji, Kaixiang and Chai, Weilong and Sun, Jianxin and Wang, Libin and Lv, Yongjie and Huang, Taozhi and Liu, Jiajia and Guo, Qingpei and Yang, Ming and Chen, Jingdong and Zhou, Jun},
  journal={arXiv preprint arXiv:2510.06590},
  year={2025}
}
```