---
license: apache-2.0
pipeline_tag: text-to-video
library_name: diffusers
---
# ContentV: Efficient Training of Video Generation Models with Limited Compute
<div align="center">
<p align="center">
<a href="https://contentv.github.io">
<img
src="https://img.shields.io/badge/Gallery-Project Page-0A66C2?logo=googlechrome&logoColor=blue"
alt="Project Page"
/>
</a>
<a href='https://arxiv.org/abs/2506.05343'>
<img
src="https://img.shields.io/badge/Tech Report-ArXiv-red?logo=arxiv&logoColor=red"
alt="Tech Report"
/>
</a>
<a href="https://huggingface.co/ByteDance/ContentV-8B">
<img
src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface&logoColor=yellow"
alt="Model"
/>
</a>
<a href="https://github.com/bytedance/ContentV">
<img
src="https://img.shields.io/badge/Code-GitHub-orange?logo=github&logoColor=white"
alt="Code"
/>
</a>
<a href="https://www.apache.org/licenses/LICENSE-2.0">
<img
src="https://img.shields.io/badge/License-Apache 2.0-5865F2?logo=apache&logoColor=purple"
alt="License"
/>
</a>
</p>
</div>
This project presents *ContentV*, an efficient framework for accelerating the training of DiT-based video generation models through three key innovations:
- A minimalist architecture that maximizes reuse of pre-trained image generation models for video synthesis
- A systematic multi-stage training strategy that leverages flow matching for enhanced efficiency (see the conceptual sketch below)
- A cost-effective reinforcement learning from human feedback (RLHF) framework that improves generation quality without requiring additional human annotations

Our open-source 8B model (based on Stable Diffusion 3.5 Large and Wan-VAE) achieves state-of-the-art results (85.14 on VBench) after only 4 weeks of training with 256×64GB NPUs.
<div align="center">
<img src="https://raw.githubusercontent.com/bytedance/ContentV/refs/heads/main/assets/demo.jpg" width="100%">
<img src="https://raw.githubusercontent.com/bytedance/ContentV/refs/heads/main/assets/arch.jpg" width="100%">
</div>
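
The multi-stage training stages optimize a flow-matching objective. The snippet below is a minimal conceptual sketch of that objective in PyTorch, for illustration only; the `model` signature, tensor shapes, and conditioning interface are hypothetical, and the actual training setup is described in the tech report.

```python
# Conceptual flow-matching (rectified-flow) objective, shown for illustration.
# The model signature, tensor shapes, and conditioning interface are hypothetical;
# see the tech report and GitHub repository for the actual training setup.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, text_emb):
    """x1: clean video latents [B, C, T, H, W]; text_emb: text conditioning."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                    # sample from the Gaussian prior
    t = torch.rand(b, device=x1.device)          # uniform timesteps in [0, 1]
    t_ = t.view(b, *([1] * (x1.dim() - 1)))
    xt = (1.0 - t_) * x0 + t_ * x1               # point on the straight-line path
    target = x1 - x0                             # constant velocity along the path
    pred = model(xt, t, text_emb)                # model predicts the velocity field
    return F.mse_loss(pred, target)
```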
## ⚡ Quickstart
#### Recommended PyTorch Version
- GPU: torch >= 2.3.1 (CUDA >= 12.2)
#### Installation
```bash
git clone https://github.com/bytedance/ContentV.git
cd ContentV
pip3 install -r requirements.txt
```
#### T2V Generation
```bash
## For GPU
python3 demo.py
```
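
For loading the checkpoint directly with `diffusers`, a minimal sketch is shown below. It assumes the generic `DiffusionPipeline` entry point and the usual `.frames` output format work for this checkpoint; `demo.py` above is the supported path.

```python
# Hypothetical diffusers usage sketch; demo.py in the repository is the
# supported entry point, and the pipeline class/output format are assumptions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "ByteDance/ContentV-8B", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "A close-up of a hummingbird hovering over a red flower, slow motion."
video = pipe(prompt=prompt, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=24)
```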
## 📊 VBench
| Model | Total Score | Quality Score | Semantic Score | Human Action | Scene | Dynamic Degree | Multiple Objects | Appear. Style |
|----------------------|--------|-------|-------|-------|-------|-------|-------|-------|
| Wan2.1-14B | 86.22 | 86.67 | 84.44 | 99.20 | 61.24 | 94.26 | 86.59 | 21.59 |
| **ContentV (Long)** | 85.14 | 86.64 | 79.12 | 96.80 | 57.38 | 83.05 | 71.41 | 23.02 |
| Goku† | 84.85 | 85.60 | 81.87 | 97.60 | 57.08 | 76.11 | 79.48 | 23.08 |
| Open-Sora 2.0 | 84.34 | 85.40 | 80.12 | 95.40 | 52.71 | 71.39 | 77.72 | 22.98 |
| Sora† | 84.28 | 85.51 | 79.35 | 98.20 | 56.95 | 79.91 | 70.85 | 24.76 |
| **ContentV (Short)** | 84.11 | 86.23 | 75.61 | 89.60 | 44.02 | 79.26 | 74.58 | 21.21 |
| EasyAnimate 5.1 | 83.42 | 85.03 | 77.01 | 95.60 | 54.31 | 57.15 | 66.85 | 23.06 |
| Kling 1.6† | 83.40 | 85.00 | 76.99 | 96.20 | 55.57 | 62.22 | 63.99 | 20.75 |
| HunyuanVideo | 83.24 | 85.09 | 75.82 | 94.40 | 53.88 | 70.83 | 68.55 | 19.80 |
| CogVideoX-5B | 81.61 | 82.75 | 77.04 | 99.40 | 53.20 | 70.97 | 62.11 | 24.91 |
| Pika-1.0† | 80.69 | 82.92 | 71.77 | 86.20 | 49.83 | 47.50 | 43.08 | 22.26 |
| VideoCrafter-2.0 | 80.44 | 82.20 | 73.42 | 95.00 | 55.29 | 42.50 | 40.66 | 25.13 |
| AnimateDiff-V2 | 80.27 | 82.90 | 69.75 | 92.60 | 50.19 | 40.83 | 36.88 | 22.42 |
| OpenSora 1.2 | 79.23 | 80.71 | 73.30 | 85.80 | 42.47 | 47.22 | 58.41 | 23.89 |
## ✅ Todo List
- [x] Inference code and checkpoints
- [ ] Training code of RLHF
## 🧾 License
This code repository and part of the model weights are licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Please note that:
- The MMDiT is derived from [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large) and further trained on video samples. This Stability AI model is licensed under the [Stability AI Community License](https://stability.ai/community-license-agreement), Copyright © Stability AI Ltd. All Rights Reserved.
- The video VAE from [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) is licensed under the [Apache 2.0 License](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B/blob/main/LICENSE.txt).
## ❤️ Acknowledgement
* [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large)
* [Wan2.1](https://github.com/Wan-Video/Wan2.1)
* [Diffusers](https://github.com/huggingface/diffusers)
* [HuggingFace](https://huggingface.co)
## 🔗 Citation
```bibtex
@article{contentv2025,
  title   = {ContentV: Efficient Training of Video Generation Models with Limited Compute},
  author  = {Bytedance Douyin Content Team},
  journal = {arXiv preprint arXiv:2506.05343},
  year    = {2025}
}
```