Create README.md
README.md
ADDED
@@ -0,0 +1,102 @@
---
license: mit
base_model:
- openbmb/MiniCPM-Llama3-V-2_5
---

# M-STAR

<p align="center">
<img src="./assets/mstar-logo.png" width="500">
</p>

<p align="center">
<a href="https://mstar-lmm.github.io/">Project Page</a>
</p>

M-STAR is a framework for improving the **Multimodal Reasoning** ability of Large Multimodal Models (LMMs) via **Self-Evolving Training**.

Unlike traditional **Self-Evolving Training**, M-STAR supports **Large Multimodal Models**, **Training with Multimodal Process Reward Models (MPRM)**, and **Adaptive Exploration during Training**.

This is the M-STAR-MiniCPM-V-2.5 model, based on [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5) and trained with M-STAR, our self-evolving training framework for multimodal reasoning.

- **M-STAR Resources**:

| **Component** | **Description** |
|------------------------------|---------------------------------------------------------------------------------------------------------------------|
| **M-STAR Model** | A strong LMM for multimodal reasoning, scoring **59.5** on MathVista, based on [MiniCPM-V-2.5](https://github.com/OpenBMB/MiniCPM-V) with 8B parameters. |
| **M-STAR PRM** | A Multimodal Process Reward Model (MPRM) that evaluates the quality of multimodal reasoning data at the step level. |
| **M-STAR CoT Dataset** | A collection of 100K generated multimodal reasoning samples with CoT, where the queries are sourced from [MathV360K](https://huggingface.co/datasets/Zhiqiang007/MathV360K). |
| **M-STAR MPRM Training Dataset** | A set of 50K multimodal reasoning samples designed for training the MPRM. |


## Performance

### Main Results

<div align="center">

| Model                      | MathVista   | FQA         | GPS        | MWP          | TQA         | VQA         |
|----------------------------|-------------|-------------|------------|--------------|-------------|-------------|
| **Baselines**              |             |             |            |              |             |             |
| MiniCPM-V-2.5              | 52.4        | 59.2        | 44.7       | 50.5         | 53.8        | 48.0        |
| + warmup                   | 52.6        | 58.4        | 47.1       | 57.0         | 53.8        | 45.8        |
| SFT                        | 54.8        | 58.7        | 50.5       | 56.5         | 55.7        | 50.8        |
| ReST<sup>EM</sup>          | 55.1        | 59.1        | 49.5       | 65.6         | 55.1        | 48.0        |
| Iterative RFT              | 55.7        | 59.1        | 49.5       | 64.5         | 55.1        | 47.5        |
| **Static components only** |             |             |            |              |             |             |
| Cont. Self-Evolving        | 57.2        | 57.6        | 56.3       | 65.1         | 57.0        | 49.7        |
| + PRM Re-Rank              | 59.2        | 59.1 (+0.7) | 61.1 (+14) | 68.3 (+11.3) | 55.1 (+1.3) | 51.4 (+5.6) |
| **Automatically tuning the temperature T** | | |            |              |             |             |
| M-STAR (Reward-Pass@2)     | 59.5 (+6.9) | 59.5 (+1.1) | 59.1 (+12) | 65.6 (+8.6)  | 58.9 (+5.1) | 54.2 (+8.4) |
| **Reference**              |             |             |            |              |             |             |
| GPT-4o                     | 63.8        | -           | -          | -            | -           | -           |
| Gemini 1.5 Flash           | 58.4        | -           | -          | -            | -           | -           |
| GPT-4T 2024-04-09          | 58.1        | -           | -          | -            | -           | -           |
| Pixtral 12B                | 58.0        | -           | -          | -            | -           | -           |
| InternLM-XComposer2-VL-7B  | 57.6        | 55.0        | 63.0       | 73.7         | 56.3        | 39.7        |
| Math-LLaVA-13B             | 46.6        | 37.2        | 57.7       | 56.5         | 51.3        | 33.5        |
| LLaVA-NeXT-34B             | 46.5        | -           | -          | -            | -           | -           |

</div>

Values in parentheses are changes relative to the `+ warmup` row.

<div align="center">

| Model          | MathVista   | M3CoT       | MMStar-R    | MMBench-R   | AI2D        | Average     |
|----------------|-------------|-------------|-------------|-------------|-------------|-------------|
| MiniCPM-V-2.5  | 52.4        | 41.2        | 44.6        | 72.6        | 64.4        | 55.0        |
| + warmup       | 52.6        | 47.8        | 45.1        | 76.9        | 65.9        | 57.7        |
| M-STAR         | 59.5 (+6.9) | 48.7 (+0.9) | 50.7 (+5.6) | 79.9 (+3)   | 69.1 (+3.2) | 61.6 (+3.9) |
| Phi-3.5-vision | 46.5        | 39.4        | 42.5        | 56.8        | 47.5        | 46.5        |
| + warmup       | 49.3        | 46.5        | 44.2        | 70.9        | 65.5        | 55.3        |
| M-STAR         | 54.5 (+5.2) | 51.3 (+4.8) | 48.8 (+4.6) | 73.6 (+2.7) | 67.9 (+2.4) | 59.2 (+3.9) |
| InternVL2-2B   | 46.4        | 16.7        | 20.0        | 14.2        | 33.5        | 26.2        |
| + warmup       | 47.6        | 45.6        | 41.8        | 68.8        | 60.0        | 52.8        |
| M-STAR         | 50.3 (+2.7) | 47.1 (+1.5) | 42.0 (+0.2) | 67.3 (-1.5) | 59.7 (-0.3) | 53.3 (+0.5) |

</div>

Values in parentheses are changes relative to the corresponding `+ warmup` row.
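
The Average column appears to be the unweighted mean of the five benchmark scores, rounded to one decimal place. This is an inference from the numbers rather than something stated here, so treat the sketch below as a quick consistency check on the MiniCPM-V-2.5 rows, not as the evaluation code. The names `scores`, `average`, and `gain_over_warmup` are illustrative only.

```python
# Scores copied from the MiniCPM-V-2.5 rows of the table, in the order
# MathVista, M3CoT, MMStar-R, MMBench-R, AI2D.
scores = {
    "MiniCPM-V-2.5": [52.4, 41.2, 44.6, 72.6, 64.4],
    "+ warmup": [52.6, 47.8, 45.1, 76.9, 65.9],
    "M-STAR": [59.5, 48.7, 50.7, 79.9, 69.1],
}


def average(values):
    """Unweighted mean rounded to one decimal (assumed aggregation rule)."""
    return round(sum(values) / len(values), 1)


averages = {name: average(vals) for name, vals in scores.items()}

# Change of the M-STAR average relative to the "+ warmup" average.
gain_over_warmup = round(averages["M-STAR"] - averages["+ warmup"], 1)

print(averages)
print(gain_over_warmup)
```

Under this assumption the computed averages match the reported Average column for these three rows.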

### Effectiveness of Adaptively Adjusting Exploration

<p align="center">
<img src="./assets/dynamic.png" width="500">
</p>

We evaluate the effectiveness of adaptively adjusting exploration with the following metric:

- **Reward-Pass@2**: The percentage of samples for which at least one correct response appears among the top 2 responses ranked by the reward model. This metric directly reflects the exploitation efficacy of the reward model for the current policy. We choose Pass@2 because our training strategy selects the top 2 responses using the reward model.

"Static" refers to models trained without adaptive exploration, while "Dynamic" refers to those trained with it. All models shown were trained using the M-STAR framework with the optimized components explored in our paper.
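
The Reward-Pass@2 definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the M-STAR implementation: it assumes each query has a list of sampled responses, each with a scalar reward-model score and a correctness flag. The function name `reward_pass_at_k` and the toy data are hypothetical.

```python
def reward_pass_at_k(rewards, correct, k=2):
    """Percentage of queries with a correct response among the top-k by reward.

    rewards: list of per-query lists of reward-model scores (assumed scalar).
    correct: list of per-query lists of booleans, aligned with `rewards`.
    """
    if not rewards or len(rewards) != len(correct):
        raise ValueError("rewards and correct must be non-empty and aligned")
    hits = 0
    for query_rewards, query_correct in zip(rewards, correct):
        # Indices of this query's responses, highest reward first.
        ranked = sorted(
            range(len(query_rewards)),
            key=lambda i: query_rewards[i],
            reverse=True,
        )
        # A query counts as a hit if any of its top-k responses is correct.
        hits += any(query_correct[i] for i in ranked[:k])
    return 100.0 * hits / len(rewards)


# Toy data: three queries with three sampled responses each.
rewards = [[0.9, 0.2, 0.5], [0.1, 0.8, 0.3], [0.4, 0.6, 0.7]]
correct = [[False, True, True], [True, False, False], [False, False, False]]

score_at_2 = reward_pass_at_k(rewards, correct, k=2)
score_at_3 = reward_pass_at_k(rewards, correct, k=3)
print(score_at_2, score_at_3)
```

In the toy data, the second query has a correct response that the reward model ranks last, so it counts at k=3 but not at k=2. This is the kind of gap between what the policy samples and what the reward model selects that the metric is meant to expose.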

## M-STAR Resources

<div align="center">

| Resource                         | Link | License |
|----------------------------------|------|---------|
| **M-STAR Datasets**              |      |         |
| **M-STAR CoT Dataset**           | ...  | [MIT License](https://opensource.org/license/mit) |
| **M-STAR MPRM Training Dataset** | ...  | [MIT License](https://opensource.org/license/mit) |
| **M-STAR Models**                |      |         |
| M-STAR-8B-v1.0                   | ...  | [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md) |
| M-STAR-PRM-8B-v1.0               | ...  | [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md) |

</div>