---
license: mit
pipeline_tag: image-to-image
library_name: diffusers
---

# REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

Xingjian Leng¹\* · Jaskirat Singh¹\* · Yunzhong Hou¹ · Zhenchang Xing² · Saining Xie³ · Liang Zheng¹

¹Australian National University   ²Data61-CSIRO   ³New York University

\*Project Leads

🌐 Project Page · 🤗 Models · 📃 Paper

---

We address a fundamental question: ***Can latent diffusion models and their VAE tokenizer be trained end-to-end?*** While training both components jointly with the standard diffusion loss is observed to be ineffective, often degrading final performance, we show that this limitation can be overcome using a simple representation-alignment (REPA) loss. Our proposed method, **REPA-E**, enables stable and effective joint training of both the VAE and the diffusion model.
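
For intuition, the sketch below shows one way such a joint objective can be expressed with standard PyTorch and `diffusers` components: a plain noise-prediction loss plus a REPA term that aligns intermediate diffusion features with a frozen pretrained vision encoder. The module interfaces, the projection head, the loss weight, and the exact gradient routing between the two terms (e.g., any stop-gradients) are illustrative assumptions, not the official training code; see the paper and the REPA-E repository for the actual recipe.

```python
# Illustrative sketch only; not the official REPA-E training code.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler()  # any standard noise schedule, assumed here for illustration


def joint_loss(vae, diffusion_model, frozen_encoder, proj_head, images, lambda_repa=0.5):
    """Combined diffusion + representation-alignment (REPA) loss (sketch)."""
    # Encode images with the trainable VAE (its parameters receive gradients too).
    latents = vae.encode(images).latent_dist.sample()

    # Standard latent-diffusion noise-prediction loss.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    noise_pred, hidden = diffusion_model(noisy, t)  # assumed to also return intermediate features
    loss_diff = F.mse_loss(noise_pred, noise)

    # REPA term: align intermediate features with a frozen vision encoder (e.g. DINOv2)
    # through a small trainable projection head.
    with torch.no_grad():
        target = frozen_encoder(images)
    loss_repa = 1.0 - F.cosine_similarity(proj_head(hidden), target, dim=-1).mean()

    # The combined objective is what makes joint VAE + diffusion training stable.
    return loss_diff + lambda_repa * loss_repa
```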

*(Teaser figure)*

**REPA-E** significantly accelerates training, achieving over a **17×** speedup compared to REPA and **45×** over the vanilla training recipe. Interestingly, end-to-end tuning also improves the VAE itself: the resulting **E2E-VAE** provides better latent structure and serves as a **drop-in replacement** for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures. Our method achieves state-of-the-art FID scores on ImageNet 256×256: **1.12** with CFG and **1.69** without CFG.
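
Because the released E2E-VAE exposes the standard `AutoencoderKL` interface, swapping it in for SD-VAE when training a new latent diffusion model only changes how images are mapped to latents. A minimal sketch is shown below; the helper name and the use of `vae.config.scaling_factor` for latent scaling are assumptions to check against your own training setup.

```python
# Sketch: using the E2E-VAE in place of SD-VAE to prepare latents for LDM training.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to("cuda").eval()


@torch.no_grad()
def images_to_latents(pixel_batch: torch.Tensor) -> torch.Tensor:
    """pixel_batch: float tensor in [-1, 1] with shape (B, 3, H, W)."""
    posterior = vae.encode(pixel_batch.to("cuda")).latent_dist
    # Scale latents the same way your LDM code scales SD-VAE latents; whether this
    # checkpoint ships a tuned scaling_factor is an assumption worth verifying.
    return posterior.sample() * vae.config.scaling_factor
```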

## 🆕 AutoencoderKL-Compatible Release

> **New in this release:** We are releasing the **REPA-E E2E-VAE** as a fully **Hugging Face AutoencoderKL** checkpoint, ready to use with `diffusers` out of the box. We previously released the REPA-E VAE checkpoint, which required loading through the model class in our REPA-E repository. This new version is a **Hugging Face-compatible AutoencoderKL** checkpoint that can be loaded directly via the `diffusers` API, with no extra code or custom wrapper needed. It offers **plug-and-play compatibility** with diffusion pipelines and can be seamlessly used to build or train new diffusion models.

## ⚡️ Quickstart

```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to("cuda")
```

> Use `vae.encode(...)` / `vae.decode(...)` in your pipeline. (A full example is provided below.)

## 📦 Requirements

The following packages are required to load and run the REPA-E VAEs with the `diffusers` library:

```bash
pip install "diffusers>=0.33.0"
pip install "torch>=2.3.1"
```

## 🚀 Example Usage

Below is a minimal example showing how to load and use the REPA-E end-to-end trained IN-VAE with `diffusers`:

```python
from io import BytesIO

import numpy as np
import requests
import torch
from diffusers import AutoencoderKL
from PIL import Image

device = "cuda"

# Download a sample image.
response = requests.get("https://s3.amazonaws.com/masters.galleries.prod.dpreview.com/2935392.jpg?X-Amz-Expires=3600&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUIXIAMA3N436PSEA/20251019/us-east-1/s3/aws4_request&X-Amz-Date=20251019T103721Z&X-Amz-SignedHeaders=host&X-Amz-Signature=219dc5f98e5c2e5f3b72587716f75889b8f45b0a01f1bd08dbbc44106e484144")

# Resize to 512x512 and normalize from [0, 255] to [-1, 1], shape (1, 3, 512, 512).
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

# Load the end-to-end trained VAE.
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to(device)

# Encode to latents, then decode back to pixel space.
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
```

## 📚 Citation

```bibtex
@article{leng2025repae,
  title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  year={2025},
  journal={arXiv preprint arXiv:2504.10483},
}
```