UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

πŸ€— UniVerse-1 Models   |   πŸ€— Verse-Bench   |    πŸ“‘ Tech Report    |    πŸ“‘ Project Page    πŸ’» Code   

This is the official inference code of UniVerse-1.

Paper Abstract

We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching-of-experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment of both ambient sounds and speech with the video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after fine-tuning on approximately 7,600 hours of audio-video data, produces well-coordinated audio and visuals for ambient-sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: this https URL.

πŸ”₯πŸ”₯πŸ”₯ News!!

  • Sep 28, 2025: πŸ‘‹ We release the Verse-Bench metric tools: Verse-Bench tools.
  • Sep 09, 2025: πŸ‘‹ We release the technical report of UniVerse-1.
  • Sep 08, 2025: πŸ‘‹ We release the Verse-Bench dataset: Verse-Bench Dataset.
  • Sep 08, 2025: πŸ‘‹ We release the model weights of UniVerse-1.
  • Sep 08, 2025: πŸ‘‹ We release the inference code of UniVerse-1.
  • Sep 03, 2025: πŸ‘‹ We release the project page of UniVerse-1.

Introduction

UniVerse-1 is a unified, Veo-3-like model that simultaneously generates synchronized audio and video from a reference image and a text prompt.

  • Unified audio-video synthesis: Generates audio and video in tandem, interpreting the input prompt to produce a synchronized audio-visual output.

  • Speech audio generation: The model can generate fluent speech directly from a text prompt, demonstrating a built-in text-to-speech (TTS) ability. Crucially, it tailors the voice timbre to match the specific character being generated.

  • Musical instrument playing sound generation: The model is also proficient at generating the sounds of musical instruments. It additionally offers some capability for "singing while playing," generating vocal and instrumental tracks concurrently.

  • Ambient sound generation: The model can generate ambient sounds, producing background audio that matches the visual environment of the video.

  • The first open-sourced DiT-based audio-video joint method: We are the first to open-source a DiT-based, Veo-3-like model for joint audio-visual generation.

Model Download

| Models | πŸ€— Hugging Face |
| --- | --- |
| UniVerse-1 Base | UniVerse-1 |

Download the pretrained model into `./checkpoints/UniVerse-1-base/`.
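As a sketch, the weights can be fetched with the Hugging Face CLI into the directory above. The repo id below is a placeholder, not confirmed by this README; substitute the actual model id from the Hugging Face link.

```shell
# <repo-id> is a placeholder -- replace it with the UniVerse-1 model id
# from the Hugging Face link above.
huggingface-cli download <repo-id> --local-dir ./checkpoints/UniVerse-1-base
```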

Model Usage

πŸ”§ Dependencies and Installation

git clone https://github.com/Dorniwang/UniVerse-1-code/
cd UniVerse-1-code

conda create -n universe python=3.10
conda activate universe
pip install torch==2.5.0 torchaudio==2.5.0 torchvision --index-url https://download.pytorch.org/whl/cu121
pip install packaging ninja && pip install flash-attn==2.7.0.post2 --no-build-isolation
pip install -r requirements-lint.txt
pip install -e .

πŸš€ Inference Scripts

bash scripts/inference/inference_universe.sh
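Before launching the script above, it can help to confirm that the checkpoint directory from the Model Download section is actually populated. The `checkpoint_ready` helper below is a hypothetical convenience sketch, not part of the released code.

```python
from pathlib import Path

# Checkpoint location given in the Model Download section.
CKPT_DIR = Path("./checkpoints/UniVerse-1-base")

def checkpoint_ready(ckpt_dir: Path) -> bool:
    """Return True if the checkpoint directory exists and contains at least one entry."""
    return ckpt_dir.is_dir() and any(ckpt_dir.iterdir())

if __name__ == "__main__":
    if not checkpoint_ready(CKPT_DIR):
        print(f"Checkpoint dir missing or empty: {CKPT_DIR} -- download the weights first.")
```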

Acknowledgements

Part of the code for this project comes from other open-source projects.

Thank you to all the open-source projects for their contributions to this project!

License

The code in this repository is licensed under the Apache 2.0 License.

Citation

@article{wang2025universe,
  title={UniVerse-1: Unified Audio-Video Generation via Stitching of Experts},
  author={Wang, Duomin and Zuo, Wei and Li, Aojie and Chen, Ling-Hao and Liao, Xinyao and Zhou, Deyu and Yin, Zixin and Dai, Xili and Jiang, Daxin and Yu, Gang},
  journal={arXiv preprint arXiv:2509.06155},
  year={2025}
}

Star History

Star History Chart
