REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Xingjian Leng1*   ·   Jaskirat Singh1*   ·   Yunzhong Hou1   ·   Zhenchang Xing2   ·   Saining Xie3   ·   Liang Zheng1

1 Australian National University   2Data61-CSIRO   3New York University  
*Project Leads 

🌐 Project Page   🤗 Models   📃 Paper


We address a fundamental question: Can latent diffusion models and their VAE tokenizer be trained end-to-end? While training both components jointly with the standard diffusion loss is observed to be ineffective (often degrading final performance), we show that this limitation can be overcome using a simple representation-alignment (REPA) loss. Our proposed method, REPA-E, enables stable and effective joint training of both the VAE and the diffusion model.

(Teaser figure)

REPA-E significantly accelerates training, achieving over 17× speedup compared to REPA and 45× over the vanilla training recipe. Interestingly, end-to-end tuning also improves the VAE itself: the resulting E2E-VAE provides better latent structure and serves as a drop-in replacement for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures. Our method achieves state-of-the-art FID scores on ImageNet 256×256: 1.12 with CFG and 1.69 without CFG.

🆕 AutoencoderKL-Compatible Release

New in this release: We are releasing the REPA-E E2E-VAE as a fully Hugging Face-compatible AutoencoderKL checkpoint, ready to use with diffusers out of the box.

We previously released the REPA-E VAE checkpoint, which required loading through the model class in our REPA-E repository.
This new version provides a Hugging Face-compatible AutoencoderKL checkpoint that can be loaded directly via the diffusers API, with no extra code or custom wrapper needed.

It offers plug-and-play compatibility with diffusion pipelines and can be seamlessly used to build or train new diffusion models.

⚡️ Quickstart

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to("cuda")

Use vae.encode(...) / vae.decode(...) in your pipeline. (A full example is provided below.)
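
As noted above, the checkpoint can also be used when building or training new latent diffusion models. Below is a minimal sketch of how the encoder could slot into a standard latent-diffusion training step; the DDPM noise schedule, the dummy image batch, and the scaling-factor fallback are illustrative assumptions rather than part of this release, and the diffusion transformer itself is only indicated in a comment.

import torch
from diffusers import AutoencoderKL, DDPMScheduler

device = "cuda"

# Load the E2E-VAE and freeze it when training only the latent diffusion model.
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to(device)
vae.requires_grad_(False)

# Illustrative noise schedule; use whatever scheduler your LDM is built around.
scheduler = DDPMScheduler(num_train_timesteps=1000)

# Latent scaling factor from the checkpoint config (falls back to 1.0 if unset).
scaling = getattr(vae.config, "scaling_factor", 1.0)

# Dummy batch standing in for real training images in [-1, 1].
images = torch.rand(4, 3, 256, 256, device=device) * 2 - 1

with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample() * scaling

# Standard diffusion training target: noisy latents at random timesteps.
noise = torch.randn_like(latents)
timesteps = torch.randint(
    0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=device
)
noisy_latents = scheduler.add_noise(latents, noise, timesteps)

# A diffusion transformer (not included here) would now predict `noise` from
# (noisy_latents, timesteps) and be optimized with an MSE loss.

Note that this sketch keeps the VAE frozen, as in standard two-stage training; REPA-E itself trains the VAE and the diffusion model jointly using the representation-alignment loss described in the paper.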

📦 Requirements

The following packages are required to load and run the REPA-E VAEs with the diffusers library:

pip install "diffusers>=0.33.0"
pip install "torch>=2.3.1"

🚀 Example Usage

Below is a minimal example showing how to load and use the REPA-E end-to-end trained VA-VAE with diffusers:

from io import BytesIO
import requests

from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image


# Download an example image (this signed URL may expire; substitute any accessible image URL)
response = requests.get("https://s3.amazonaws.com/masters.galleries.prod.dpreview.com/2935392.jpg?X-Amz-Expires=3600&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUIXIAMA3N436PSEA/20251019/us-east-1/s3/aws4_request&X-Amz-Date=20251019T103721Z&X-Amz-SignedHeaders=host&X-Amz-Signature=219dc5f98e5c2e5f3b72587716f75889b8f45b0a01f1bd08dbbc44106e484144")
device = "cuda"

# Convert to an RGB tensor of shape (1, 3, 512, 512), normalized to [-1, 1]
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).convert("RGB").resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

# Load the end-to-end trained VAE as a standard AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # encode image to latents
    reconstructed = vae.decode(latents).sample  # decode latents back to pixel space
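
To inspect the result, the reconstruction can be mapped back from [-1, 1] to an 8-bit image and saved locally; a short continuation of the example (the output filename is arbitrary):

# Convert the reconstruction back to a uint8 image and write it to disk
recon = ((reconstructed.clamp(-1, 1) + 1) * 127.5).round().to(torch.uint8)
Image.fromarray(recon[0].permute(1, 2, 0).cpu().numpy()).save("reconstruction.png")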

📚 Citation

@article{leng2025repae,
  title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  year={2025},
  journal={arXiv preprint arXiv:2504.10483},
}