arxiv:2505.23606

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Published on May 29
Submitted by Jinbin Bai on May 30

Abstract

Muddit, a unified discrete diffusion transformer, achieves fast and high-quality generation across text and image modalities by integrating pretrained visual priors with a lightweight text decoder.

AI-generated summary

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

Community

Paper author Paper submitter

🚀 Diffusion for text generation is booming, and we're pushing it further.

While recent works explore unified generation via diffusion for faster decoding, they mostly rely on language priors.

We introduce Muddit, a next-generation foundation model in the Meissonic family, built upon discrete diffusion for unified and efficient multimodal generation.

Unlike traditional autoregressive methods, Muddit leverages discrete diffusion (a.k.a. MaskGIT-style masking) as its core mechanism, enabling fast, parallel decoding across modalities.
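For readers unfamiliar with MaskGIT-style decoding, here is a rough sketch of the general idea, not Muddit's actual code. Every name in it (`MASK`, `toy_predictor`, `maskgit_decode`) is hypothetical, the random "predictor" is a stand-in for a real transformer, and the cosine schedule is an assumption borrowed from MaskGIT. The point is only that all positions are predicted in parallel at each step, and the most confident ones are committed while the rest stay masked for the next round.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption, not Muddit's value)


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the transformer: one (token, confidence) guess per position.

    A real model would condition on the unmasked tokens; this one is random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def maskgit_decode(seq_len, vocab_size, num_steps, seed=0):
    """Iteratively unmask a fully masked sequence in at most num_steps passes."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    steps_used = 0
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        steps_used = step
        # One parallel forward pass predicts every position at once.
        preds = toy_predictor(tokens, vocab_size, rng)
        # Cosine schedule: how many positions should still be masked afterwards.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * step / num_steps))
        # Always commit at least one token so the loop makes progress.
        n_unmask = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_unmask]:
            tokens[i] = preds[i][0]
    return tokens, steps_used


if __name__ == "__main__":
    out, steps = maskgit_decode(seq_len=16, vocab_size=100, num_steps=4)
    print(f"decoded {len(out)} tokens in {steps} parallel steps")
```

An autoregressive decoder would need one forward pass per token (16 here), whereas this loop finishes in `num_steps` passes regardless of sequence length, which is where the speedup comes from.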

While most unified models are still rooted in language priors, Muddit is developed from a visual-first perspective for scalable and flexible generation. It supports fast text-to-image (T2I), image-to-text (I2T), and visual question answering (VQA).

The code and model are released at https://github.com/M-E-AGI-Lab/Muddit.


