Reconstruction Alignment Improves Unified Multimodal Models
The model was presented in the paper [Reconstruction Alignment Improves Unified Multimodal Models](https://arxiv.org/abs/2509.07295).
Abstract: Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
🔧 Method
RecA post-trains a unified multimodal model by conditioning it on the embeddings produced by its own visual understanding encoder, which serve as dense "text prompts," and optimizing it to reconstruct the input image with a self-supervised reconstruction loss. This realigns the generation pathway with the understanding pathway without requiring any captions.
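The sketch below illustrates this objective as a single PyTorch training step. It is a minimal sketch under an assumed interface: `understanding_encoder`, `project_to_prompt`, and `generation_loss` are hypothetical names, and freezing the understanding encoder is an assumption of this sketch, not a statement of the authors' exact implementation.

```python
# Minimal sketch of one RecA post-training step (illustrative, not the
# authors' actual code). Assumed interface:
#   umm.understanding_encoder(images) -> dense semantic embeddings
#   umm.project_to_prompt(embeds)     -> embeddings in the text-prompt space
#   umm.generation_loss(target, prompt_embeds) -> the UMM's native generation
#       loss (autoregressive, masked-autoregressive, or diffusion)
import torch


def reca_step(umm, images: torch.Tensor, optimizer: torch.optim.Optimizer) -> float:
    # 1. Encode the input image with the model's own visual understanding
    #    encoder; the embeddings act as a dense "text prompt" (no caption).
    #    Keeping the encoder frozen here is an assumption of this sketch.
    with torch.no_grad():
        embeds = umm.understanding_encoder(images)
    prompt_embeds = umm.project_to_prompt(embeds)

    # 2. Condition the generation branch on those embeddings and optimize it
    #    to reconstruct the same image (self-supervised reconstruction loss).
    loss = umm.generation_loss(target=images, prompt_embeds=prompt_embeds)

    # 3. Ordinary optimizer step; this realigns generation with understanding.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```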
🔥 News
- 2025.9.10: BAGEL training code is released! Harmon training code will be released soon.
- 2025.9.9: Our finetuned weights and arXiv paper are available! We expect to release the training code tomorrow.
🎭 Results
RecA delivers state-of-the-art generation performance with remarkable efficiency. With only 1.5B parameters, the RecA-tuned Harmon model surpasses models with 7B-24B parameters, reaching 0.86 on GenEval and 87.21 on DPGBench without GPT-4o distillation data or reinforcement learning. RecA also significantly improves BAGEL's editing performance across all categories. A further two-stage fine-tuning with GPT-4o-Image distillation data raises the scores to 0.90 (GenEval) and 88.15 (DPGBench).
We've tested RecA on various base architectures, including Show-o, OpenUni, Harmon, and BAGEL, consistently observing significant performance improvements across all models and benchmarks.
📚 Model Zoo
A collection of RecA models on Hugging Face with benchmark performance; numbers in parentheses are the gains over the corresponding base model:
| Model Name | Parameters | GenEval | DPGBench | ImgEdit | GEdit |
|------------|------------|---------|----------|---------|-------|
| BAGEL-RecA | 14B | 82.4 (+3.6) | 85.29 (+1.26) | 3.75 (+0.37) | 7.27 (+0.33) |
| Harmon-0.5B-RecA | 0.5B | 78.7 (+11.1) | 84.67 (+4.55) | - | - |
| Harmon-1.5B-RecA | 1.5B | 85.7 (+12.8) | 87.21 (+6.28) | - | - |
| Show-o-RecA | 1.3B | 61.9 (+5.3) | 75.70 (+5.05) | - | - |
| Show-o-512x512-RecA | 1.3B | 72.3 (+6.1) | 84.94 (+2.73) | - | - |
| Harmon-1.5B-RecA-plus | 1.5B | 90.0 | 88.15 | - | - |
| OpenUni-RecA | 3.6B | 74.1 (+12.2) | 82.75 (+3.73) | - | - |
✨ Getting Started
For detailed instructions on installation, training, and evaluation, please refer to the respective repository READMEs:
- BAGEL Training Guide: Complete guide for BAGEL model training and evaluation.
- Benchmark Evaluation Guide: Multi-benchmark evaluation scripts and setup instructions.
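As a quick start, a checkpoint from the model zoo above can be fetched with `huggingface_hub`. This is a minimal sketch; the repo id below is a placeholder to replace with the actual Hugging Face id of the checkpoint you want.

```python
# Fetch a RecA checkpoint from Hugging Face (the repo id is a placeholder;
# substitute the actual id listed in the model zoo above).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org>/BAGEL-RecA")
print(f"Checkpoint files downloaded to: {local_dir}")
```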
🚧 TODO
- [x] Release our model weights on Hugging Face.
- [x] Release BAGEL training code.
- [ ] Release Harmon training code.
- [ ] Release Show-o and OpenUni training code.
- [ ] Further scale up BAGEL training.
- [ ] Add support for new UMM architectures such as Show-o2.
📮 Contact
For questions, feedback, or collaboration opportunities, feel free to reach out!
✏️ Citation
If you find RecA useful for your research, please consider citing:
@misc{xie2025reconstructionalignmentimprovesunified,
      title={Reconstruction Alignment Improves Unified Multimodal Models},
      author={Ji Xie and Trevor Darrell and Luke Zettlemoyer and XuDong Wang},
      year={2025},
      eprint={2509.07295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.07295},
}