Shanshan Wang committed on
Commit · 3ff7e45
1 Parent(s): 8f8b827
updated readme
README.md
CHANGED
@@ -13,6 +13,12 @@ thumbnail: >-
pipeline_tag: text-generation
---
# Model Card

The H2OVL-Mississippi-2B is a high-performing, general-purpose vision-language model developed by H2O.ai to handle a wide range of multimodal tasks. This model, with 2 billion parameters, excels in tasks such as image captioning, visual question answering (VQA), and document understanding, while maintaining efficiency for real-world applications.

The Mississippi-2B model builds on the strong foundations of our H2O-Danube language models, now extended to integrate vision and language tasks. It competes with larger models across various benchmarks, offering a versatile and scalable solution for document AI, OCR, and multimodal reasoning.
@@ -30,7 +36,29 @@ The Mississippi-2B model builds on the strong foundations of our H2O-Danube lang
- Optimized for Vision-Language Tasks: Achieves high performance across a wide range of applications, including document AI, OCR, and multimodal reasoning.
- Comprehensive Dataset: Trained on 17M image-text pairs, ensuring broad coverage and strong task generalization.

-

### Install dependencies:
```bash
@@ -42,7 +70,7 @@ If you have ampere GPUs, install flash-attention to speed up inference:
pip install flash_attn
```

- ###

```python
import torch
@@ -86,23 +114,6 @@ print(f'User: {question}\nAssistant: {response}')

```

- ## Benchmarks
-
- ### Performance Comparison of Similar Sized Models Across Multiple Benchmarks - OpenVLM Leaderboard
-
- | **Models** | **Params (B)** | **Avg. Score** | **MMBench** | **MMStar** | **MMMU<sub>VAL</sub>** | **Math Vista** | **Hallusion** | **AI2D<sub>TEST</sub>** | **OCRBench** | **MMVet** |
- |----------------------------|----------------|----------------|-------------|------------|-----------------------|----------------|---------------|-------------------------|--------------|-----------|
- | Qwen2-VL-2B | 2.1 | **57.2** | **72.2** | 47.5 | 42.2 | 47.8 | **42.4** | 74.7 | **797** | **51.5** |
- | **H2OVL-Mississippi-2B** | 2.1 | 54.4 | 64.8 | 49.6 | 35.2 | **56.8** | 36.4 | 69.9 | 782 | 44.7 |
- | InternVL2-2B | 2.1 | 53.9 | 69.6 | **49.8** | 36.3 | 46.0 | 38.0 | 74.1 | 781 | 39.7 |
- | Phi-3-Vision | 4.2 | 53.6 | 65.2 | 47.7 | **46.1** | 44.6 | 39.0 | **78.4** | 637 | 44.1 |
- | MiniMonkey | 2.2 | 52.7 | 68.9 | 48.1 | 35.7 | 45.3 | 30.9 | 73.7 | **794** | 39.8 |
- | MiniCPM-V-2 | 2.8 | 47.9 | 65.8 | 39.1 | 38.2 | 39.8 | 36.1 | 62.9 | 605 | 41.0 |
- | InternVL2-1B | 0.8 | 48.3 | 59.7 | 45.6 | 36.7 | 39.4 | 34.3 | 63.8 | 755 | 31.5 |
- | PaliGemma-3B-mix-448 | 2.9 | 46.5 | 65.6 | 48.3 | 34.9 | 28.7 | 32.2 | 68.3 | 614 | 33.1 |
- | **H2OVL-Mississippi-0.8B** | 0.8 | 43.5 | 47.7 | 39.1 | 34.0 | 39.0 | 29.6 | 53.6 | 751 | 30.0 |
- | DeepSeek-VL-1.3B | 2.0 | 39.6 | 63.8 | 39.9 | 33.8 | 29.8 | 27.6 | 51.5 | 413 | 29.2 |
-

## Prompt Engineering for JSON Extraction
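One way to read the Avg. Score column in the leaderboard table above: it appears to be the plain mean of the eight benchmark scores, with OCRBench (reported on a 0-1000 scale) divided by 10 first. That is the usual OpenVLM-leaderboard convention, assumed here rather than stated in the card; a quick check against three rows:

```python
# Sanity check of the "Avg. Score" column in the table above.
# Assumption (OpenVLM convention, not stated in the card): the average is the
# plain mean of the eight benchmarks, with OCRBench divided by 10 to match
# the other 0-100 metrics.
rows = {
    # name: ([MMBench, MMStar, MMMU, MathVista, Hallusion, AI2D, MMVet], OCRBench, reported avg)
    "Qwen2-VL-2B":          ([72.2, 47.5, 42.2, 47.8, 42.4, 74.7, 51.5], 797, 57.2),
    "H2OVL-Mississippi-2B": ([64.8, 49.6, 35.2, 56.8, 36.4, 69.9, 44.7], 782, 54.4),
    "InternVL2-2B":         ([69.6, 49.8, 36.3, 46.0, 38.0, 74.1, 39.7], 781, 53.9),
}

for name, (scores, ocrbench, reported) in rows.items():
    avg = (sum(scores) + ocrbench / 10) / 8
    assert abs(avg - reported) <= 0.06, (name, avg)  # agrees within rounding
    print(f"{name}: computed {avg:.2f}, reported {reported}")
```

All three computed averages match the reported values to within rounding, which supports the assumed convention.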
pipeline_tag: text-generation
---
# Model Card
+ [\[📜 H2OVL-Mississippi Paper\]](https://arxiv.org/abs/2410.13611)
+ [\[🤗 HF Demo\]](https://huggingface.co/spaces/h2oai/h2ovl-mississippi)
+ [\[🚀 Quick Start\]](#quick-start)
+
+
+
The H2OVL-Mississippi-2B is a high-performing, general-purpose vision-language model developed by H2O.ai to handle a wide range of multimodal tasks. This model, with 2 billion parameters, excels in tasks such as image captioning, visual question answering (VQA), and document understanding, while maintaining efficiency for real-world applications.

The Mississippi-2B model builds on the strong foundations of our H2O-Danube language models, now extended to integrate vision and language tasks. It competes with larger models across various benchmarks, offering a versatile and scalable solution for document AI, OCR, and multimodal reasoning.
- Optimized for Vision-Language Tasks: Achieves high performance across a wide range of applications, including document AI, OCR, and multimodal reasoning.
- Comprehensive Dataset: Trained on 17M image-text pairs, ensuring broad coverage and strong task generalization.

+
+ ## Benchmarks
+
+ ### Performance Comparison of Similar Sized Models Across Multiple Benchmarks - OpenVLM Leaderboard
+
+ | **Models** | **Params (B)** | **Avg. Score** | **MMBench** | **MMStar** | **MMMU<sub>VAL</sub>** | **Math Vista** | **Hallusion** | **AI2D<sub>TEST</sub>** | **OCRBench** | **MMVet** |
+ |----------------------------|----------------|----------------|-------------|------------|-----------------------|----------------|---------------|-------------------------|--------------|-----------|
+ | Qwen2-VL-2B | 2.1 | **57.2** | **72.2** | 47.5 | 42.2 | 47.8 | **42.4** | 74.7 | **797** | **51.5** |
+ | **H2OVL-Mississippi-2B** | 2.1 | 54.4 | 64.8 | 49.6 | 35.2 | **56.8** | 36.4 | 69.9 | 782 | 44.7 |
+ | InternVL2-2B | 2.1 | 53.9 | 69.6 | **49.8** | 36.3 | 46.0 | 38.0 | 74.1 | 781 | 39.7 |
+ | Phi-3-Vision | 4.2 | 53.6 | 65.2 | 47.7 | **46.1** | 44.6 | 39.0 | **78.4** | 637 | 44.1 |
+ | MiniMonkey | 2.2 | 52.7 | 68.9 | 48.1 | 35.7 | 45.3 | 30.9 | 73.7 | **794** | 39.8 |
+ | MiniCPM-V-2 | 2.8 | 47.9 | 65.8 | 39.1 | 38.2 | 39.8 | 36.1 | 62.9 | 605 | 41.0 |
+ | InternVL2-1B | 0.8 | 48.3 | 59.7 | 45.6 | 36.7 | 39.4 | 34.3 | 63.8 | 755 | 31.5 |
+ | PaliGemma-3B-mix-448 | 2.9 | 46.5 | 65.6 | 48.3 | 34.9 | 28.7 | 32.2 | 68.3 | 614 | 33.1 |
+ | **H2OVL-Mississippi-0.8B** | 0.8 | 43.5 | 47.7 | 39.1 | 34.0 | 39.0 | 29.6 | 53.6 | 751 | 30.0 |
+ | DeepSeek-VL-1.3B | 2.0 | 39.6 | 63.8 | 39.9 | 33.8 | 29.8 | 27.6 | 51.5 | 413 | 29.2 |
+
+
+
+ ## Quick Start
+
+ We provide example code to run h2ovl-mississippi-2b using `transformers`.

### Install dependencies:
```bash
pip install flash_attn
```

+ ### Inference with Transformers:

```python
import torch

```

## Prompt Engineering for JSON Extraction
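The prompt-engineering section above asks the model to answer with a JSON object. Independent of the model, such replies still need defensive parsing, since vision-language models often wrap JSON in markdown code fences; a minimal sketch (the `extract_json` helper and the sample reply are illustrative, not part of the model card):

```python
import json
import re

FENCE = "`" * 3  # literal triple-backtick, built up to keep this snippet fence-safe

def extract_json(reply: str):
    """Best-effort parse of a model reply expected to contain one JSON object."""
    # Models often wrap JSON in a markdown code fence; unwrap it first.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Fall back to the outermost {...} span, then parse.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

# Hypothetical reply, as a VLM might produce for a JSON-extraction prompt:
reply = FENCE + 'json\n{"invoice_number": "INV-001", "total": 99.5}\n' + FENCE
print(extract_json(reply))  # -> {'invoice_number': 'INV-001', 'total': 99.5}
```

Returning `None` on malformed output (rather than raising) makes it easy to retry the request with a stricter prompt.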