FolSpark
/

DreamLLM-Qwen2.5-CLIP-SD2.1-CompreOnly

+---
+language:
+- en
+base_model:
+- Qwen/Qwen2.5-7B-Instruct
+- openai/clip-vit-large-patch14
+- stabilityai/stable-diffusion-2-1
+tags:
+- Unified-models
+license: apache-2.0
+---
+Code: https://github.com/FolSpark/DreamLLM-Qwen2.5
+
+This model is one of several self-trained DreamLLM variants.
+# Model performance
+The DPO variant is trained on the [MM-RLHF](https://huggingface.co/datasets/yifanzhang114/MM-RLHF) dataset.
+\* indicates that only the LLaVA-1.5 comprehension data was used in the model's third training stage.
+Results for Vicuna-CLIP-SD2.1 and Vicuna-CLIP-SD2.1\* are taken from the [original paper](https://openreview.net/forum?id=y01KGvd9Bw).
+## Multimodal Comprehension Assessment
+| Method               | Captioning |          | VQA       |          |          |          | Comprehensive |
+|----------------------|------------|----------|-----------|----------|----------|----------|---------------|
+|                      | COCO       | I2Paragraph | VQAv2  | OKVQA    | VizWiz   | TextVQA  | MM-Vet        |
+|                      |            |          |           |          |          |          |               |
+| [**Qwen-InternViT-SD3.5**](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-InternViT-SD3.5) | 106.4      | 10.7     | 73.9      | **54.2**     | 49.1     | 54.8     | 44.0          |
+| [**Qwen-InternViT-SD3.5***](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-InternViT-SD3.5-CompreOnly)| 102.1      | 10.9     | 73.0      | 53.6     | 48.6     | 55.2     | **45.7**          |
+| [**Qwen-InternViT-SD3.5-DPO**](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-InternViT-SD3.5-DPO) | 64.6      | 11.6     | **74.2**      | 50.9     | 48.9     | **55.8**     | 44.7          |
+|                      |            |          |           |          |          |          |               |
+| [Qwen-CLIP-SD3.5](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-CLIP-SD3.5)      | 99.9       | 9.7      | 72.9      | 52.3     | 49.0     | 44.0     | 39.8          |
+| [Qwen-CLIP-SD3.5*](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-CLIP-SD3.5-CompreOnly)     | 99.1       | 10.2     | 72.7      | 51.1     | 49.1     | 43.9     | 42.1          |
+|                      |            |          |           |          |          |          |               |
+| [Qwen-CLIP-SD2.1](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-CLIP-SD2.1)      | 82.8       | 9.1      | 72.5      | 52.4     | 49.4     | 43.6     | 42            |
+| [Qwen-CLIP-SD2.1*](https://huggingface.co/FolSpark/DreamLLM-Qwen2.5-CLIP-SD2.1-CompreOnly)     | 97.3       | 10.8     | 72.4      | 50.4     | **49.9**     | 43.2     | 39.0          |
+|                      |            |          |           |          |          |          |               |
+| Vicuna-CLIP-SD2.1    | **115.4**      | **17.4**     | 56.6      | 44.3     | 45.8     | 34.9     | 35.9          |
+| Vicuna-CLIP-SD2.1*   | 103.7      | 8.4      | 72.9      | 52.2     | 49.3     | 41.8     | 36.6          |
+## Image Generation Evaluation
+| Method                   | MS-COCO FID (lower is better) |
+|--------------------------|---------|
+|                          |         |
+| Qwen-InternViT-SD3.5-Stage1 | 11.72   |
+| Qwen-InternViT-SD3.5     | **11.11**   |
+| Qwen-InternViT-SD3.5-DPO     | 11.33   |
+|                          |         |
+| Qwen-CLIP-SD3.5-Stage1   | 11.72   |
+| Qwen-CLIP-SD3.5          | 11.61   |
+|                          |         |
+| Qwen-CLIP-SD2.1-Stage1   | 13.94   |
+| Qwen-CLIP-SD2.1          | 12.26   |
+|                          |        |
+| Vicuna-CLIP-SD2.1-Stage1 | 8.76(+~2)   |
+| Vicuna-CLIP-SD2.1        | 8.46(+~2)   |
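+
+In the original DreamLLM paper, the Vicuna-CLIP-SD2.1-Stage1 and Vicuna-CLIP-SD2.1 results were obtained by generating 8 images for each sample and keeping the best one. All of my models were evaluated with a single generation per sample, which shifts the score by roughly 2 to 3 points; the `(+~2)` next to the Vicuna numbers marks this gap.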
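
The difference between the two evaluation protocols (best of 8 generations versus a single generation) can be illustrated with a minimal sketch. This is not the evaluation code of either the paper or this repository: `best_of_n`, `select_for_eval`, and the `score` callable are hypothetical names, and this card does not say which criterion was used to decide which of the 8 images is "best".

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the highest-scoring candidate (ties keep the earliest one).

    `score` is a placeholder for whatever quality measure is used to rank
    generations; it is an assumption, not something stated in this card.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def select_for_eval(
    samples: Sequence[Sequence[T]], score: Callable[[T], float]
) -> List[T]:
    """Pick one generation per sample before computing the final metric.

    With 8 candidates per sample this mimics a best-of-8 protocol;
    with 1 candidate per sample it reduces to single-run evaluation.
    """
    return [best_of_n(candidates, score) for candidates in samples]


if __name__ == "__main__":
    # Toy data: each "image" is a (name, quality) pair standing in for a real image.
    toy = [[("a", 0.2), ("b", 0.9)], [("c", 0.5)]]
    print(select_for_eval(toy, score=lambda item: item[1]))
```

Because best-of-N keeps only the strongest candidate per sample, it tends to produce better scores than a single run, which is why the two sets of numbers in the table are not directly comparable.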