Overcooked and refactored

I cooked an incorrect cross-attention-based version in which clip_l was learning from clip_g. The next version will have correctly decoupled behaviors and the correct Cantor cross-attention formula. The practical outcome here is latents modified by invalid projections - not even Cantor - and representations that carry the incorrect behavior toward t5_xl. This version effectively weighted reconstruction accuracy at roughly 20/60/5% for L/G/T5, which is why it failed: the T5 reconstruction is supposed to come out different, but the L and G reconstructions are supposed to be useful.

Apologies for the incorrect formulas.

VAE Lyra 🎡 - SDXL Edition

Multi-modal Variational Autoencoder for SDXL text embedding transformation using geometric fusion. Fuses CLIP-L, CLIP-G, and T5-XL into a unified latent space.

Model Details

  • Fusion Strategy: cantor
  • Latent Dimension: 2048
  • Training Steps: 15,634
  • Best Loss: 0.3316
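
For context on the name: the classical Cantor pairing function reversibly maps a pair of natural numbers to a single natural number, and the "cantor" fusion strategy presumably borrows its name from that idea of interleaving two streams into one. The exact fusion formula used by this checkpoint is not documented here (and, per the note above, the one in this release is incorrect), so the snippet below only illustrates the pairing function itself, not the model's fusion step.

def cantor_pair(x, y):
    # Classical Cantor pairing: a reversible map from a pair of naturals to one natural.
    return (x + y) * (x + y + 1) // 2 + y

def cantor_unpair(z):
    # Invert the pairing: recover (x, y) from z.
    w = int(((8 * z + 1) ** 0.5 - 1) // 2)
    t = w * (w + 1) // 2
    y = z - t
    return w - y, y

assert cantor_unpair(cantor_pair(3, 5)) == (3, 5)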

Architecture

  • Modalities:
    • CLIP-L (768d) - SDXL text_encoder
    • CLIP-G (1280d) - SDXL text_encoder_2
    • T5-XL (2048d) - Additional conditioning
  • Encoder Layers: 3
  • Decoder Layers: 3
  • Hidden Dimension: 1024
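
To make the numbers above concrete, here is a minimal, hypothetical sketch of what a single modality branch could look like: a 3-layer encoder with hidden dimension 1024 producing a 2048-d mean and log-variance, mirrored by a 3-layer decoder. This is an illustration of the listed hyperparameters, not the actual MultiModalVAE implementation, and it omits the cross-modality fusion entirely.

import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    # Hypothetical single-modality branch matching the listed hyperparameters.
    # The real MultiModalVAE also fuses the per-modality latents
    # (fusion_strategy="cantor"), which is not reproduced here.
    def __init__(self, input_dim, hidden_dim=1024, latent_dim=2048):
        super().__init__()
        self.encoder = nn.Sequential(                 # 3 encoder layers
            nn.Linear(input_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(                 # 3 decoder layers
            nn.Linear(latent_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

branch = ModalityBranch(input_dim=768)                # e.g. the CLIP-L branch
recon, mu, logvar = branch(torch.randn(1, 77, 768))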

SDXL Compatibility

This model outputs both CLIP embeddings needed for SDXL:

  • clip_l: [batch, 77, 768] → text_encoder output
  • clip_g: [batch, 77, 1280] → text_encoder_2 output

T5-XL information is encoded into the latent space but is not directly output.
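
For reference, SDXL conditions its UNet on the two CLIP streams concatenated along the feature dimension (768 + 1280 = 2048), so the two reconstructed tensors line up with the usual prompt_embeds shape. A quick shape check with dummy tensors of the sizes listed above:

import torch

# Dummy tensors with the shapes this model outputs.
clip_l = torch.randn(1, 77, 768)      # text_encoder-style reconstruction
clip_g = torch.randn(1, 77, 1280)     # text_encoder_2-style reconstruction

# SDXL conditions its UNet on the two CLIP streams concatenated per token.
prompt_embeds = torch.cat([clip_l, clip_g], dim=-1)
print(prompt_embeds.shape)            # torch.Size([1, 77, 2048])

Note that SDXL additionally needs pooled text embeddings from text_encoder_2, which this VAE does not produce.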

Usage

from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra-sdxl-t5xl",
    filename="model.pt"
)

# Load checkpoint
checkpoint = torch.load(model_path, map_location="cpu")

# Create model
config = MultiModalVAEConfig(
    modality_dims={"clip_l": 768, "clip_g": 1280, "t5_xl": 2048},
    latent_dim=2048,
    fusion_strategy="cantor"
)

model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Prepare inputs - the VAE was trained on all three modalities
inputs = {
    "clip_l": clip_l_embeddings,   # [batch, 77, 768]  from SDXL text_encoder
    "clip_g": clip_g_embeddings,   # [batch, 77, 1280] from SDXL text_encoder_2
    "t5_xl": t5_xl_embeddings      # [batch, 77, 2048] from a T5-XL encoder
}

# For SDXL inference - only decode the CLIP outputs
with torch.no_grad():
    recons, mu, logvar = model(inputs, target_modalities=["clip_l", "clip_g"])

# Use recons["clip_l"] and recons["clip_g"] with SDXL
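
If you want to feed these reconstructions into a diffusers SDXL pipeline, something like the following sketch should work. It is untested and makes assumptions: pooled_embeds is a placeholder for the [batch, 1280] pooled output of SDXL's text_encoder_2 (this VAE does not reconstruct pooled vectors), and zero tensors are used as a crude negative prompt.

# Hypothetical wiring into diffusers (untested sketch).
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt_embeds = torch.cat([recons["clip_l"], recons["clip_g"]], dim=-1).half().to("cuda")
pooled = pooled_embeds.half().to("cuda")   # placeholder: pooled output of text_encoder_2

image = pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled,
    negative_prompt_embeds=torch.zeros_like(prompt_embeds),
    negative_pooled_prompt_embeds=torch.zeros_like(pooled),
).images[0]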

Training Details

  • Trained on 50,000 diverse prompts
  • Mix of LAION flavors (95%) and synthetic prompts (5%)
  • KL Annealing: True
  • Learning Rate: 0.0001
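
"KL Annealing: True" refers to the usual trick of ramping the KL weight up during training so the encoder is not pushed to the prior before it learns useful reconstructions. A generic sketch of that objective, assuming an MSE reconstruction term and a linear ramp (the exact schedule and anneal length used for this checkpoint are not documented here):

import torch
import torch.nn.functional as F

def vae_loss(recons, targets, mu, logvar, step, anneal_steps=10_000):
    # Reconstruction term: MSE against the original embeddings, summed over modalities.
    recon = sum(F.mse_loss(recons[k], targets[k]) for k in recons)
    # KL divergence of N(mu, sigma^2) from the unit Gaussian prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Linear KL annealing: ramp the KL weight from 0 to 1 over anneal_steps.
    beta = min(1.0, step / anneal_steps)
    return recon + beta * kl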

Citation

@software{vae_lyra_sdxl_2025,
  author = {AbstractPhil},
  title = {VAE Lyra SDXL: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra-sdxl-t5xl}
}