Commit 1d11dd5 · Parent: 6e8b570
Update README.md

README.md CHANGED
@@ -20,7 +20,20 @@ tags:

Amused is a lightweight text-to-image model based on the [muse](https://arxiv.org/pdf/2301.00704.pdf) architecture. Amused is particularly useful in applications that require a lightweight and fast model, such as generating many images quickly at once.

(figure: aMUSEd training and inference pipeline)

*The diagram shows the training and inference pipelines for aMUSEd. aMUSEd consists of three separately trained components: a pre-trained CLIP-L/14 text encoder, a VQ-GAN, and a U-ViT. During training, the VQ-GAN encoder maps images to a 16x smaller latent resolution. The proportion of masked latent tokens is sampled from a cosine masking schedule, e.g. cos(r · π/2) with r ∼ Uniform(0, 1). The model is trained via cross-entropy loss to predict the masked tokens. After the model is trained on 256x256 images, downsampling and upsampling layers are added, and training is continued on 512x512 images. During inference, the U-ViT is conditioned on the text encoder's hidden states and iteratively predicts values for all masked tokens. The cosine masking schedule determines a percentage of the most confident token predictions to be fixed after every iteration. After 12 iterations, all tokens have been predicted and are decoded by the VQ-GAN into image pixels.*
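The cosine schedule in the caption is compact enough to state in code. The following is a minimal, illustrative sketch, not the released training or sampling code: the function names, the 16x16 = 256-token grid implied by a 256x256 image at 16x downsampling, and the exact way the curve is swept over the 12 inference steps are all assumptions based only on the caption.

```python
import math
import random

def training_mask_proportion() -> float:
    # Sample r ~ Uniform(0, 1) and mask cos(r * pi / 2) of the latent
    # tokens, as described in the caption above.
    r = random.random()
    return math.cos(r * math.pi / 2)

def tokens_still_masked(step: int, total_steps: int = 12, num_tokens: int = 256) -> int:
    # Hypothetical inference-side reading of the same cosine schedule:
    # sweep the curve from 1 down to 0 over `total_steps` iterations and
    # keep that fraction of tokens masked; the most confident U-ViT
    # predictions are fixed to fill in the remainder at each step.
    frac = math.cos((step / total_steps) * math.pi / 2)
    return math.ceil(frac * num_tokens)

# A 256x256 image maps to a 16x16 = 256-token latent grid (16x smaller).
# After step 12 nothing remains masked and the VQ-GAN decodes the tokens.
print([tokens_still_masked(s) for s in range(1, 13)])
```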
## 1. Usage
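As a quick orientation, text-to-image generation with this model looks roughly like the following minimal sketch. It assumes the `AmusedPipeline` class from the diffusers library and the `amused/amused-256` checkpoint name, both of which should be checked against the actual release.

```python
import torch
from diffusers import AmusedPipeline

# Load the 256x256 checkpoint; fp16 keeps the lightweight model fast.
pipe = AmusedPipeline.from_pretrained(
    "amused/amused-256", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut.png")
```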