---
license: apache-2.0
pipeline_tag: any-to-any
library_name: diffusers
---

# Show-o2: Improved Native Unified Multimodal Models

<div align="center">
<br>

<sup>1</sup> [Show Lab](https://sites.google.com/view/showlab/home?authuser=0), National University of Singapore&nbsp; <sup>2</sup> Bytedance&nbsp;

[![ArXiv](https://img.shields.io/badge/Arxiv-<2506.15564>-<COLOR>.svg)](https://arxiv.org/abs/2506.15564) [![Code](https://img.shields.io/badge/Code-<GitHub_Repository>-<COLOR>.svg)](https://github.com/showlab/Show-o/tree/main/show-o2) [![WeChat badge](https://img.shields.io/badge/微信-加入-green?logo=wechat&amp)](https://github.com/showlab/Show-o/blob/main/docs/wechat_qa_3.jpg)
</div>

This model is part of the **Show-o2** family of improved native unified multimodal models.

**Paper:** [Show-o2: Improved Native Unified Multimodal Models](https://huggingface.co/papers/2506.15564)
**Code:** [https://github.com/showlab/Show-o/tree/main/show-o2](https://github.com/showlab/Show-o/tree/main/show-o2)

## Abstract

This paper presents improved native unified multimodal models, *i.e.*, Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at [https://github.com/showlab/Show-o/tree/main/show-o2](https://github.com/showlab/Show-o/tree/main/show-o2).

## What's new about Show-o2?
We perform unified learning of multimodal understanding and generation on text tokens and the **3D causal VAE space**, which scales across **text, image, and video modalities**. A dual path of spatial(-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation. Dedicated heads with **autoregressive modeling and flow matching** handle the unified learning of **multimodal understanding, image/video generation, and mixed-modality generation**; a conceptual sketch of this dual-head design follows the overview figure below.
<img src="overview.png" width="1000">
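
For intuition, here is a minimal, self-contained PyTorch sketch of the dual-head idea: a shared backbone whose language head is trained with next-token cross-entropy on text tokens and whose flow head is trained with a flow-matching (velocity-prediction) loss on visual latents. All module names, dimensions, and the attention layout here are illustrative assumptions, not the actual Show-o2 implementation; see the code repository for the real model.

```python
# Conceptual sketch only (NOT the official Show-o2 code): a shared backbone with an
# autoregressive language head (cross-entropy on text tokens) and a flow head
# (flow-matching velocity prediction on VAE-space visual latents).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUnifiedModel(nn.Module):
    def __init__(self, vocab_size=32000, hidden=512, latent_dim=16, num_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.latent_proj = nn.Linear(latent_dim, hidden)   # lift VAE latents into the LM width
        self.time_embed = nn.Linear(1, hidden)             # flow-matching timestep conditioning
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.language_head = nn.Linear(hidden, vocab_size) # autoregressive text prediction
        self.flow_head = nn.Linear(hidden, latent_dim)     # velocity prediction for generation

    def forward(self, text_ids, noisy_latents, t):
        txt = self.text_embed(text_ids)                                            # (B, Lt, H)
        vis = self.latent_proj(noisy_latents) + self.time_embed(t[:, None, None])  # (B, Lv, H)
        seq = torch.cat([txt, vis], dim=1)
        # A plain causal mask over the whole sequence keeps the sketch simple.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.backbone(seq, mask=mask)
        h_txt, h_vis = h[:, : txt.size(1)], h[:, txt.size(1):]
        return self.language_head(h_txt), self.flow_head(h_vis)

def unified_loss(model, text_ids, clean_latents):
    # Flow matching: interpolate x_t between noise x0 (t=0) and data x1 (t=1);
    # the regression target for the flow head is the constant velocity x1 - x0.
    x0 = torch.randn_like(clean_latents)
    t = torch.rand(text_ids.size(0), device=clean_latents.device)
    xt = (1 - t)[:, None, None] * x0 + t[:, None, None] * clean_latents
    logits, velocity = model(text_ids[:, :-1], xt, t)
    ar_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1))
    fm_loss = F.mse_loss(velocity, clean_latents - x0)
    return ar_loss + fm_loss

model = ToyUnifiedModel()
text_ids = torch.randint(0, 32000, (2, 12))   # dummy text tokens
latents = torch.randn(2, 64, 16)              # dummy latent "tokens" from a 3D VAE
print(unified_loss(model, text_ids, latents).item())
```

The real training additionally uses the dual-path spatial(-temporal) fusion and the two-stage recipe described in the paper, which this sketch omits.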
 
## Pre-trained Model Weights
The Show-o2 checkpoints can be found on Hugging Face (a download snippet follows the list):
* [showlab/show-o2-1.5B](https://huggingface.co/showlab/show-o2-1.5B)
* [showlab/show-o2-1.5B-HQ](https://huggingface.co/showlab/show-o2-1.5B-HQ)
* [showlab/show-o2-7B](https://huggingface.co/showlab/show-o2-7B)
* [showlab/show-o2-1.5B-w-video-und](https://huggingface.co/showlab/show-o2-1.5B-w-video-und) (further unified fine-tuning on video understanding data)
* [showlab/show-o2-7B-w-video-und](https://huggingface.co/showlab/show-o2-7B-w-video-und) (further unified fine-tuning on video understanding data)
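
If you prefer to fetch a checkpoint programmatically rather than cloning it, a `huggingface_hub` call like the following works; substitute any of the repo ids listed above.

```python
# Download a Show-o2 checkpoint from the Hugging Face Hub into the local cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="showlab/show-o2-1.5B")
print(local_dir)  # path of the downloaded snapshot
```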
 
## Getting Started
First, set up the environment and log in to wandb with `wandb login <your wandb keys>`; see the [code repository](https://github.com/showlab/Show-o/tree/main/show-o2) for the full environment setup instructions.

Download the Wan2.1 3D causal VAE weights [here](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B/blob/main/Wan2.1_VAE.pth) and put them in the current directory, for example with the snippet below.

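One way to do this with `huggingface_hub` (the repo id and filename are taken from the link above):

```python
# Fetch only the Wan2.1 3D causal VAE weight file into the current directory.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Wan-AI/Wan2.1-T2V-14B",
    filename="Wan2.1_VAE.pth",
    local_dir=".",
)
print(path)  # path to Wan2.1_VAE.pth in the current directory
```
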
Demo for **Multimodal Understanding**; the results can be viewed on wandb.

```
# image-level
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-jane-pham-727419-1571673.jpg question='Describe the image in detail.'

python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-taryn-elliott-4144459.jpg question='How many avocados (including the halved) are in this image? Tell me how to make an avocado milkshake in detail.'

# video
python3 inference_mmu_vid.py config=configs/showo2_7b_demo_video_understanding.yaml \
mmu_video_path='./docs/videos/' question="Describe the video." \
num_video_frames_mmu=32

python3 inference_mmu_vid.py config=configs/showo2_1.5b_demo_video_understanding.yaml \
mmu_video_path='./docs/videos/' question="Describe the video." \
num_video_frames_mmu=32
```
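
Image and video generation in Show-o2 is driven by the flow head via flow matching; at inference time this amounts to integrating the predicted velocity field from noise (t = 0) to data (t = 1). The generic Euler integration below only illustrates that idea; the dummy velocity function is a stand-in for the conditioned flow-head forward pass, and the real logic lives in the inference scripts.

```python
# Illustration of flow-matching sampling: integrate dx/dt = v(x, t) from noise
# (t = 0) to data (t = 1) with simple Euler steps. This is NOT the project's
# sampler; inference_t2i.py handles conditioning, guidance, and VAE decoding.
import torch

@torch.no_grad()
def euler_sample(velocity_fn, shape, steps=50):
    x = torch.randn(shape)                      # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * velocity_fn(x, t)          # follow the predicted velocity field
    return x                                    # latents, to be decoded by the 3D VAE

# Dummy velocity field standing in for the flow head; it just pulls x toward zero.
latents = euler_sample(lambda x, t: -x, shape=(1, 64, 16))
print(latents.shape)  # torch.Size([1, 64, 16])
```
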
Demo for **Text-to-Image Generation**; the results can be viewed on wandb.

```
  python3 inference_t2i.py config=configs/showo2_1.5b_demo_1024x1024.yaml \