Create README.md
README.md
ADDED
@@ -0,0 +1,62 @@
---
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
- openai/clip-vit-large-patch14
- stabilityai/stable-diffusion-2-1
tags:
- Unified-models
license: apache-2.0
---

Code repository: https://github.com/FolSpark/DreamLLM-Qwen2.5

This repository collects several self-trained DreamLLM variants.

# Model performance

The DPO variant is trained on the [MM-RLHF](https://huggingface.co/datasets/yifanzhang114/MM-RLHF) dataset.

\* indicates that only the LLaVA-1.5 comprehension data is used in the model's third training stage.

Results for Vicuna-CLIP-SD2.1 and Vicuna-CLIP-SD2.1\* are taken from the [DreamLLM paper](https://openreview.net/forum?id=y01KGvd9Bw).

## Multimodal Comprehension Assessment

| Method | Captioning: COCO | Captioning: I2Paragraph | VQA: VQAv2 | VQA: OKVQA | VQA: VizWiz | VQA: TextVQA | Comprehensive: MM-Vet |
|--------|------------------|-------------------------|------------|------------|-------------|--------------|-----------------------|
| [**Qwen-InternViT-SD3.5**](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-InternViT-SD3.5) | 106.4 | 10.7 | 73.9 | **54.2** | 49.1 | 54.8 | 44.0 |
| [**Qwen-InternViT-SD3.5\***](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-InternViT-SD3.5-CompreOnly) | 102.1 | 10.9 | 73.0 | 53.6 | 48.6 | 55.2 | **45.7** |
| [**Qwen-InternViT-SD3.5-DPO**](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-InternViT-SD3.5-DPO) | 64.6 | 11.6 | **74.2** | 50.9 | 48.9 | **55.8** | 44.7 |
| | | | | | | | |
| [Qwen-CLIP-SD3.5](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-CLIP-SD3.5) | 99.9 | 9.7 | 72.9 | 52.3 | 49.0 | 44.0 | 39.8 |
| [Qwen-CLIP-SD3.5\*](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-CLIP-SD3.5-CompreOnly) | 99.1 | 10.2 | 72.7 | 51.1 | 49.1 | 43.9 | 42.1 |
| | | | | | | | |
| [Qwen-CLIP-SD2.1](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-CLIP-SD2.1) | 82.8 | 9.1 | 72.5 | 52.4 | 49.4 | 43.6 | 42.0 |
| [Qwen-CLIP-SD2.1\*](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-CLIP-SD2.1-CompreOnly) | 97.3 | 10.8 | 72.4 | 50.4 | **49.9** | 43.2 | 39.0 |
| | | | | | | | |
| Vicuna-CLIP-SD2.1 | **115.4** | **17.4** | 56.6 | 44.3 | 45.8 | 34.9 | 35.9 |
| Vicuna-CLIP-SD2.1\* | 103.7 | 8.4 | 72.9 | 52.2 | 49.3 | 41.8 | 36.6 |

## Image Generation Evaluation

| Method | MS-COCO |
|--------|---------|
| Qwen-InternViT-SD3.5-Stage1 | 11.72 |
| Qwen-InternViT-SD3.5 | **11.11** |
| Qwen-InternViT-SD3.5-DPO | 11.33 |
| | |
| Qwen-CLIP-SD3.5-Stage1 | 11.72 |
| Qwen-CLIP-SD3.5 | 11.61 |
| | |
| Qwen-CLIP-SD2.1-Stage1 | 13.94 |
| Qwen-CLIP-SD2.1 | 12.26 |
| | |
| Vicuna-CLIP-SD2.1-Stage1 | 8.76 (+~2) |
| Vicuna-CLIP-SD2.1 | 8.46 (+~2) |

In the original DreamLLM paper, Vicuna-CLIP-SD2.1-Stage1 and Vicuna-CLIP-SD2.1 were run 8 times, and the best of the 8 generated images was kept in each case. All of my models were tested only once, so their scores carry an approximate error of 2 to 3 relative to that protocol.
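
To make the effect of the two protocols concrete, here is a minimal toy sketch. It is not the evaluation code used for the tables. It assumes a hypothetical per-image score where lower is better and draws synthetic values from a Gaussian. The function name `simulate`, its parameters, and the score distribution are all illustrative assumptions. It only shows why keeping the best of 8 candidates cannot score worse than a single generation.

```python
import random


def simulate(num_prompts=1000, candidates_per_prompt=8, seed=0):
    """Toy comparison of single-sample vs. best-of-N scoring (lower is better).

    All scores are synthetic; a real evaluation would compute a metric
    on generated images instead of sampling random numbers.
    """
    rng = random.Random(seed)
    single_total = 0.0
    best_total = 0.0
    for _ in range(num_prompts):
        # Hypothetical per-image scores for one prompt (assumption: Gaussian).
        scores = [rng.gauss(10.0, 2.0) for _ in range(candidates_per_prompt)]
        single_total += scores[0]   # protocol A: one generation per prompt
        best_total += min(scores)   # protocol B: keep the best of N generations
    return single_total / num_prompts, best_total / num_prompts


single_mean, best_mean = simulate()
print(f"single run: {single_mean:.2f}  best-of-8: {best_mean:.2f}")
```

Because the best-of-N score for each prompt is the minimum over a set that includes the single-run score, the best-of-N average is never higher. The size of the gap in this toy depends entirely on the assumed spread of the synthetic scores, so it says nothing about the actual 2 to 3 point figure quoted above.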