Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope | 💾 GitHub

Key Features

  • 🌐 First Unified Autoregressive MLLM with Continuous Vision Tokens: Ming-UniVision is the first multimodal large language model that natively integrates continuous visual representations from MingTok into a next-token prediction (NTP) framework—unifying vision and language under a single autoregressive paradigm without discrete quantization or modality-specific heads.
  • 3.5× Faster Convergence in Joint Vision-Language Training: The coherent representational space between understanding and generation—enabled by MingTok—reduces optimization conflicts across tasks, leading to dramatically faster convergence during end-to-end multimodal pretraining.
  • 🔄 Multi-Round In-Context Vision Tasks: Ming-UniVision supports iterative understanding, generation, and editing entirely within the continuous latent space, without decoding intermediate states into images, which enables efficient and coherent multimodal reasoning. Users can alternate between asking questions and requesting edits, just like conversing with a human.
Figure 1: Conceptual comparison and qualitative examples of Ming-UniVision built upon MingTok.

Model Architecture

Figure 2: Multi-round image understanding, generation, and editing architecture of Ming-UniVision, powered by MingTok.
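To make the unified next-token-prediction idea concrete, here is a minimal, schematic sketch (assumed PyTorch; the class and variable names are made up for illustration and this is not the Ming-UniVision or MingTok implementation). One causal transformer processes a single interleaved sequence in which text positions carry embedded discrete tokens and image positions carry continuous latents, so understanding and generation share the same autoregressive pass.

import torch
import torch.nn as nn

class ToyUnifiedNTP(nn.Module):
    """Schematic only: one backbone, one sequence, two input modalities."""
    def __init__(self, vocab_size=1000, d_model=256, latent_dim=16):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)   # discrete text tokens
        self.vis_proj = nn.Linear(latent_dim, d_model)       # continuous vision latents
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, vision_latents):
        # The interleaving order is arbitrary here; the point is a single causal stream.
        seq = torch.cat([self.text_emb(text_ids), self.vis_proj(vision_latents)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        return self.backbone(seq, mask=mask)                 # hidden states for next-token prediction

toy = ToyUnifiedNTP()
hidden = toy(torch.randint(0, 1000, (1, 8)), torch.randn(1, 32, 16))
print(hidden.shape)  # torch.Size([1, 40, 256]): one stream covering both modalities

The property mirrored here is that both modalities flow through one next-token-prediction pass; in the real system, per the key features above, continuous latents only need to be decoded to pixels when an output image is actually requested.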

Usage

from mingunivisioninfer import MingUniVisionInfer
model = MingUniVisionInfer("inclusionAI/Ming-UniVision-16B-A3B")

# single-round generation
image_gen_prompt = "Please generate the corresponding image based on the description. A cute girl."
messages = [{
  "role": "HUMAN",
  "content": [{"type": "text", "text": image_gen_prompt},],
}]
output_text = model.generate(messages, max_new_tokens=512, output_image_prefix="a_cute_girl")
model.reset_inner_state()

# single-round understanding (loads the image generated above as "a_cute_girl.png")
messages = [{
  "role": "HUMAN",
  "content": [
    {"type": "image", "image": "a_cute_girl.png"},
    {"type": "text", "text": "Please describe the picture in detail."},
  ],
}]
output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()

# multi-round editing
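# Each for_edit call builds on the internal state left by the previous round, so
# later instructions modify the latest edited result rather than the original
# image; each round saves its result under the given output_image_prefix.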
messages = [{
  "role": "HUMAN",
  "content": [
    {"type": "image", "image": "a_cute_girl.png"},
    {"type": "text", "text": "Given the edit instruction: Change the color of her cloth to red, please identify the editing region"},
  ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_0")

messages = [{
  "role": "HUMAN",
  "content": [
    {"type": "text", "text": "Change the color of her cloth to red."},
  ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_1")

messages = [{
  "role": "HUMAN",
  "content": [
    {"type": "text", "text": "Refine the image for better clarity."},
  ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_2")

model.reset_inner_state()

# single-round text-based conversation
messages = [{
  "role": "HUMAN",
  "content": [
    {"type": "text", "text": "请详细介绍鹦鹉的习性。"},
  ],
}]

output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()

📌 Tips:

  • Image generation: Use descriptive prompts + output_image_prefix to save output.
  • Image understanding: Include "image" and "text" in the same message.
  • Image editing: Chain multiple generate(..., for_edit=True) calls with unique output_image_prefix names; a loop-based sketch follows this list.
  • Multi-turn interactions are supported via internal state — call model.reset_inner_state() to reset.
  • Input types: "text" and "image" — flexible order, mixed inputs allowed.
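
As referenced in the editing tip above, the sketch below chains the same edit rounds from the Usage section in a simple loop; the file prefixes are illustrative, and the behavior across rounds follows the internal-state description in the tips rather than any additional API guarantees.

from mingunivisioninfer import MingUniVisionInfer

model = MingUniVisionInfer("inclusionAI/Ming-UniVision-16B-A3B")

# Round 0 localizes the edit region from the source image; later rounds apply
# and refine the edit on top of the internal state left by the previous round.
edit_rounds = [
    [{"type": "image", "image": "a_cute_girl.png"},
     {"type": "text", "text": "Given the edit instruction: Change the color of her cloth to red, please identify the editing region"}],
    [{"type": "text", "text": "Change the color of her cloth to red."}],
    [{"type": "text", "text": "Refine the image for better clarity."}],
]
for i, content in enumerate(edit_rounds):
    messages = [{"role": "HUMAN", "content": content}]
    model.generate(messages, max_new_tokens=512, for_edit=True,
                   output_image_prefix=f"edit_round_{i}")  # unique prefix per round
model.reset_inner_state()  # clear the session before starting unrelated requests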

📝 Note (Model Limitations):

  • The current model was trained on only two-turn conversations and has not been optimized for alternating rounds of image understanding and generation, although it may generalize to more than two turns at inference time. As a result, performance may be limited in complex multimodal dialogue scenarios that require deep contextual reasoning across turns.
  • This open-sourced version was trained using mixed-resolution strategies: high resolution for image understanding, but lower resolution for image editing and generation. Additionally, large-scale interleaved image-text data was not included during pretraining.
  • Due to these factors, image editing quality and consistency may be suboptimal compared to fully end-to-end, high-resolution multimodal models. We are actively working on improved versions with unified resolution training and richer interleaved data.

Performance

Image Reconstruction

Quantitative Evaluations on Multimodal Benchmarks
Table 1. Quantitative evaluations on MMBench, MMStar, MMMU, MathVista, HallusionBench, AI2D, MM-Vet, OCRBench, and MME.
| Model | MMB | MMS | MMMU | MathV | Hall | AI2D | MM-Vet | OCRBench | MME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Understanding Only | | | | | | | | | |
| Emu3-Chat | 58.5 | - | 31.6 | - | - | - | 37.2 | 687 | - |
| Qwen2.5-VL-3B | 79.1 | 55.9 | 53.1 | 62.3 | 46.3 | 81.6 | - | 797 | 2157 |
| Qwen2.5-VL-7B | 83.5 | 63.9 | 58.6 | 68.2 | 52.9 | 83.9 | 67.1 | 864 | 2347 |
| InternVL2.5-4B | 81.1 | 58.3 | 52.3 | 60.5 | 46.3 | 81.4 | 60.6 | 828 | 2338 |
| InternVL2.5-8B | 84.6 | 62.8 | 56.0 | 64.4 | 50.1 | 84.5 | 62.8 | 822 | 2344 |
| DeepSeek-VL2 | 79.6 | 61.3 | 51.1 | 62.8 | - | 81.4 | - | 811 | 2253 |
| Unified model, Separate representation | | | | | | | | | |
| Janus-Pro-7B | 79.2 | - | 41.0 | - | - | - | 50.0 | - | - |
| LMFusion | - | - | 41.7 | - | - | - | - | - | 1603 |
| MetaQuery-L | 78.6 | - | 53.1 | - | - | - | 63.2 | - | - |
| Show-o2-7B | 79.3 | 56.6 | 48.9 | - | - | 78.6 | - | - | - |
| BLIP3-o 4B | 78.6 | - | 46.6 | - | - | - | 60.1 | - | 2161 |
| BAGEL | 85.0 | - | 55.3 | 73.1 | - | - | 67.2 | - | 2388 |
| Unified model, Unified representation | | | | | | | | | |
| VILA-U | - | - | - | - | - | - | 33.5 | - | 1402 |
| TokenFlow-XL | 76.8 | - | 43.2 | - | - | - | 48.2 | - | 1922 |
| UniTok | - | - | - | - | - | - | 33.9 | - | 1448 |
| Harmon-1.5B | 65.5 | - | 38.9 | - | - | - | - | - | 1476 |
| TokLIP | 67.6 | - | 43.1 | - | - | - | 29.8 | - | - |
| Ming-UniVision-16B-A3B (Ours) | 78.5 | 63.7 | 40.3 | 66.6 | 47.8 | 82.8 | 64.2 | 724 | 2023 |
Text-to-Image Generation Evaluation
Table 2. Evaluation of text-to-image generation ability on GenEval and DPG-Bench. † denotes performance obtained by rewritten prompts.
| Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall | DPG-Bench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Generation Only | | | | | | | | |
| LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 | - |
| PixArt-α | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | - |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | - |
| DALL-E 2 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 | - |
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 |
| DALL-E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 | 83.50 |
| SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 | 84.08 |
| Unified model, Separate representation | | | | | | | | |
| Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 | - |
| Ming-Lite-Uni | 0.99 | 0.76 | 0.53 | 0.87 | 0.26 | 0.30 | 0.62 | - |
| Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 | 82.63 |
| Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 84.19 |
| Show-o2-7B | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 | 86.14 |
| MetaQuery-L† | - | - | - | - | - | - | 0.78 | 81.10 |
| BLIP3-o 4B | - | - | - | - | - | - | 0.81 | 79.36 |
| BAGEL | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 | - |
| Unified model, Unified representation | | | | | | | | |
| Harmon-1.5B | 0.99 | 0.86 | 0.66 | 0.85 | 0.74 | 0.48 | 0.79 | - |
| TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 | 73.38 |
| Ming-UniVision-16B-A3B (Ours) | 1.00 | 0.93 | 0.59 | 0.93 | 0.92 | 0.70 | 0.85 | 82.12 |

Reference

@article{huang2025mingunivision,
  title={Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer},
  author={Huang, Ziyuan and Zheng, DanDan and Zou, Cheng and Liu, Rui and Wang, Xiaolong and Ji, Kaixiang and Chai, Weilong and Sun, Jianxin and Wang, Libin and Lv, Yongjie and Huang, Taozhi and Liu, Jiajia and Guo, Qingpei and Yang, Ming and Chen, Jingdong and Zhou, Jun},
  journal={arXiv preprint arXiv:2510.06590},
  year={2025}
}