Update README.md
README.md

---
license: apache-2.0
pipeline_tag: any-to-any
library_name: diffusers
---

# Show-o2: Improved Native Unified Multimodal Models

<div align="center">
<br>

<sup>1</sup> [Show Lab](https://sites.google.com/view/showlab/home?authuser=0), National University of Singapore <sup>2</sup> Bytedance

[](https://arxiv.org/abs/2506.15564) [](https://github.com/showlab/Show-o/tree/main/show-o2) [](https://github.com/showlab/Show-o/blob/main/docs/wechat_qa_3.jpg)
</div>

This model is part of the **Show-o2** family of improved native unified multimodal models.

**Paper:** [Show-o2: Improved Native Unified Multimodal Models](https://huggingface.co/papers/2506.15564)

**Code:** [https://github.com/showlab/Show-o/tree/main/show-o2](https://github.com/showlab/Show-o/tree/main/show-o2)

## Abstract

This paper presents improved native unified multimodal models, *i.e.*, Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at [https://github.com/showlab/Show-o/tree/main/show-o2](https://github.com/showlab/Show-o/tree/main/show-o2).

## What's new about Show-o2?

We perform unified learning of multimodal understanding and generation on text tokens and the **3D Causal VAE space**, which is scalable across **text, image, and video modalities**. A dual path of spatial(-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation. We employ dedicated heads with **autoregressive modeling and flow matching** for the overall unified learning of **multimodal understanding, image/video generation, and mixed-modality generation**.
<img src="overview.png" width="1000">
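
To make the two-head objective concrete, here is a minimal, self-contained sketch. It is illustrative only, not the Show-o2 implementation: the module names, sizes, and equal loss weighting are assumptions, and the shared transformer backbone, dual-path fusion, and 3D causal VAE encoding are omitted. It pairs a language head trained with next-token cross-entropy against a flow head trained with a velocity-prediction flow-matching objective on visual latents:

```
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyUnifiedHeads(nn.Module):
    """Illustrative only: shared hidden states feed an autoregressive language
    head (text token prediction) and a flow head (flow matching on visual latents)."""

    def __init__(self, hidden=512, vocab=32000, latent_dim=16):
        super().__init__()
        self.lm_head = nn.Linear(hidden, vocab)          # next-token logits
        self.flow_head = nn.Sequential(                  # velocity prediction
            nn.Linear(hidden + latent_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, h_text, text_targets, h_vis, x1):
        # Autoregressive objective: cross-entropy on next-token prediction.
        lm_loss = F.cross_entropy(self.lm_head(h_text).transpose(1, 2), text_targets)

        # Flow-matching objective: interpolate noise x0 toward clean latents x1
        # at a random time t and regress the constant velocity (x1 - x0).
        x0 = torch.randn_like(x1)
        t = torch.rand(x1.size(0), 1, 1, device=x1.device)
        xt = (1.0 - t) * x0 + t * x1
        v_pred = self.flow_head(
            torch.cat([h_vis, xt, t.expand(-1, x1.size(1), 1)], dim=-1)
        )
        flow_loss = F.mse_loss(v_pred, x1 - x0)
        return lm_loss + flow_loss


# Example usage with random tensors (batch=2, 8 text tokens, 16 visual latents).
heads = ToyUnifiedHeads()
loss = heads(
    torch.randn(2, 8, 512),            # hidden states at text positions
    torch.randint(0, 32000, (2, 8)),   # target text tokens
    torch.randn(2, 16, 512),           # hidden states at visual positions
    torch.randn(2, 16, 16),            # clean visual latents from the 3D VAE
)
print(loss.item())
```

In the actual models, both heads sit on top of the same language-model backbone, so understanding and generation are learned jointly.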

## Pre-trained Model Weights

The Show-o2 checkpoints can be found on Hugging Face (a minimal download sketch follows the list):
* [showlab/show-o2-1.5B](https://huggingface.co/showlab/show-o2-1.5B)
* [showlab/show-o2-1.5B-HQ](https://huggingface.co/showlab/show-o2-1.5B-HQ)
* [showlab/show-o2-7B](https://huggingface.co/showlab/show-o2-7B)
* [showlab/show-o2-1.5B-w-video-und](https://huggingface.co/showlab/show-o2-1.5B-w-video-und) (further unified fine-tuning on video understanding data)
* [showlab/show-o2-7B-w-video-und](https://huggingface.co/showlab/show-o2-7B-w-video-und) (further unified fine-tuning on video understanding data)
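
A minimal sketch for fetching one of the checkpoints with `huggingface_hub`; the repo id can be any from the list above, and the local directory is just an example:

```
# Download sketch (not an official script) using huggingface_hub.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="showlab/show-o2-7B",   # or any other checkpoint listed above
    local_dir="./show-o2-7B",       # example target directory
)
print(f"Checkpoint downloaded to: {local_path}")
```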

## Getting Started
First, set up the environment:

Log in to wandb with `wandb login <your wandb keys>` so that the demo results below are logged to your account.

Download the Wan2.1 3D causal VAE model weight [here](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B/blob/main/Wan2.1_VAE.pth) and put it in the current directory.
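
If you prefer to script that step, a small sketch with `huggingface_hub` (an assumption, not part of the official instructions) that places `Wan2.1_VAE.pth` in the current directory:

```
# Sketch: fetch the Wan2.1 3D causal VAE weight into the current directory.
from huggingface_hub import hf_hub_download

vae_path = hf_hub_download(
    repo_id="Wan-AI/Wan2.1-T2V-14B",
    filename="Wan2.1_VAE.pth",
    local_dir=".",  # the demos expect the weight in the current directory
)
print(f"VAE weight saved at: {vae_path}")
```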

Demo for **Multimodal Understanding**; you can find the results on wandb.

```
# image-level
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-jane-pham-727419-1571673.jpg question='Describe the image in detail.'

python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-taryn-elliott-4144459.jpg question='How many avocados (including the halved) are in this image? Tell me how to make an avocado milkshake in detail.'

# video
python3 inference_mmu_vid.py config=configs/showo2_7b_demo_video_understanding.yaml \
mmu_video_path='./docs/videos/' question="Describe the video." \
num_video_frames_mmu=32

python3 inference_mmu_vid.py config=configs/showo2_1.5b_demo_video_understanding.yaml \
mmu_video_path='./docs/videos/' question="Describe the video." \
num_video_frames_mmu=32
```

Demo for **Text-to-Image Generation**; you can find the results on wandb.
```
python3 inference_t2i.py config=configs/showo2_1.5b_demo_1024x1024.yaml \