---
base_model:
- Qwen/Qwen3-4B
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

[[πŸ“š Arxiv Paper](https://arxiv.org/pdf/2508.21113)] [[πŸ€— Hugging Face](https://huggingface.co/YannQi/R-4B)]  [[πŸ€–οΈ ModelScope](https://huggingface.co/YannQi/R-4B)] [[πŸ’» Code](https://github.com/yannqi/R-4B)]

<div align="center">
<img src="asset/logo_R_4B.png" alt="logo" width="38" /> 
</div>

<div align="center">
  <img src="asset/R-4B.png" width="100%" alt="R-4B Performance">
</div>

## ⭐️ Introduction

In this repo, we present **R-4B**, a multimodal large language model designed for general-purpose auto-thinking: it autonomously switches between step-by-step thinking and direct response generation based on task complexity. This capability lets R-4B deliver high-quality responses while significantly improving inference efficiency and reducing computational cost.

The development of R-4B follows a two-stage training paradigm:

1. **Bi-mode Annealing**, which establishes both thinking and non-thinking capabilities for VQA; and
2. **Bi-mode Policy Optimization (BPO)**, which enables the model to adaptively switch between thinking and non-thinking modes based on input demands.

## πŸš€ Key Features

- 🧠 **Think Smart, Act Fast: Adaptive & Controllable Thinking!**
  Our model provides three-mode control over the response process.

  - **Auto-thinking Mode:** Unleash **auto-thinking** that works across general topics, from simple Q&A to complex scientific analysis. It saves time and computation by thinking only when it matters.
  - **Manual Control:** Explicitly instruct the model to use its `thinking` or `non-thinking` mode, giving you full control over every task.
- πŸ†  **Strong Performance, Open for Everyone!**
  Our model is now **fully open-source**. It achieves **state-of-the-art performance** among models of comparable size.

## πŸ“’ News

- **[2025.08.20]** πŸš€ **vLLM Support is Here!** Our R-4B model is now fully compatible with [vLLM](https://github.com/vllm-project/vllm) for high-performance inference.
- **[2025.08.18]** πŸ† **Top Rank Achieved!** We are thrilled to announce that R-4B is now ranked #1 among all open-source models on the [OpenCompass Multi-modal Reasoning Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal-reasoning/?m=REALTIME)!
- **[2025.08.11]** πŸ₯‡ **Rank #1!** R-4B ranks first under 20B parameters on the [OpenCompass Multi-modal Academic Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal/?m=REALTIME)!
- **[2025.08.05]** πŸŽ‰ **R-4B is Released!** Our model is now publicly available. You can download it from [Hugging Face](https://huggingface.co/YannQi/R-4B).

## πŸ”₯ Quickstart

Below, we provide simple examples showing how to use R-4B with πŸ€— Transformers and vLLM.

### Using πŸ€— Transformers to Chat

> [!NOTE]
> Users can control the model's response by selecting one of three modes (`auto-thinking`, `thinking`, or `non-thinking`) via the `thinking_mode` argument: `thinking_mode=auto` for `auto-thinking` mode, `thinking_mode=long` for `thinking` mode, and `thinking_mode=short` for `non-thinking` mode.
> The default is `auto-thinking`.

```python
import requests
from PIL import Image
import torch
from transformers import AutoModel, AutoProcessor

model_path = "YannQi/R-4B"

# Load model
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.float32,
    trust_remote_code=True,
).to("cuda")

# Load processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Define conversation messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Apply chat template
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    thinking_mode="auto"
)

# Load image
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Process inputs
inputs = processor(
    images=image,
    text=text,
    return_tensors="pt"
).to("cuda")

# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=16384)
output_ids = generated_ids[0][len(inputs.input_ids[0]):]

# Decode output
output_text = processor.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

# Print result
print("Auto-Thinking Output:", output_text)
```
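
If you want to force a specific mode rather than rely on auto-thinking, pass `thinking_mode="long"` or `thinking_mode="short"` to `apply_chat_template`, as described in the note above. A minimal sketch, reusing `processor`, `model`, `messages`, and `image` from the example:

```python
# Force "thinking" mode (thinking_mode="long") or "non-thinking" mode
# (thinking_mode="short") instead of letting the model decide.
for mode, label in [("long", "Thinking"), ("short", "Non-Thinking")]:
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        thinking_mode=mode,
    )
    inputs = processor(images=image, text=text, return_tensors="pt").to("cuda")
    generated_ids = model.generate(**inputs, max_new_tokens=16384)
    output_ids = generated_ids[0][len(inputs.input_ids[0]):]
    print(f"{label} Output:", processor.decode(output_ids, skip_special_tokens=True))
```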


### Using vLLM for Deployment and Inference

We recommend vLLM for fast R-4B deployment and inference.

#### Install

R-4B currently requires the latest vLLM. Please install it from source:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
```
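
After installing, you can sanity-check that the editable build is importable (the exact version string will depend on the commit you built):

```python
import vllm

# Should print the development version of your local vLLM build.
print(vllm.__version__)
```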

#### Online Serving

> [!TIP]
> The `thinking_mode` switch is also available in APIs created by [vLLM](https://github.com/vllm-project/vllm). 
> Default is `auto-thinking`.

- Serve

```bash
vllm serve \
    yannqi/R-4B \
    --served-model-name r4b \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code
```
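
Once the server is up, you can confirm it is exposing the model under the served name `r4b` by querying the OpenAI-compatible `/v1/models` endpoint (assuming the host and port used above):

```python
import requests

# List the models exposed by the vLLM OpenAI-compatible server.
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # expect ["r4b"]
```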

- OpenAI Chat Completions Client

```python
from openai import OpenAI


# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# image url
image_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "http://images.cocodataset.org/val2017/000000039769.jpg"
                },
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

chat_response = client.chat.completions.create(
    model="r4b",
    messages=image_messages,
    max_tokens=16384,
    extra_body={
        "chat_template_kwargs": {"thinking_mode": "auto"},
    },
)
print("Chat response:", chat_response)
```
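
The full response object is verbose; in practice you usually want just the message text. You can also override the mode per request through `chat_template_kwargs`, e.g. forcing `non-thinking` mode with `thinking_mode: "short"`. A sketch under the same server setup:

```python
# Extract only the generated text from the response above.
print(chat_response.choices[0].message.content)

# Force non-thinking mode for a single request.
fast_response = client.chat.completions.create(
    model="r4b",
    messages=image_messages,
    max_tokens=16384,
    extra_body={
        "chat_template_kwargs": {"thinking_mode": "short"},
    },
)
print("Non-thinking response:", fast_response.choices[0].message.content)
```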

## πŸ“ˆ Experimental Results

<div align="center">
  <img src="asset/performance.png" width="100%" alt="R-4B Performance">
</div>

1. R-4B delivers powerful, state-of-the-art perceptual abilities that are competitive with larger models.
2. On benchmarks that require complex logical reasoning and mathematical problem-solving, such as WeMath, MathVerse, and LogicVista, R-4B performs strongly. This highlights its adaptive thinking capacity for logical deduction and complex quantitative problem-solving.

## βœ’οΈ Citation

```bibtex
@misc{yang2025r4bincentivizinggeneralpurposeautothinking,
      title={R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning}, 
      author={Qi Yang and Bolin Ni and Shiming Xiang and Han Hu and Houwen Peng and Jie Jiang},
      year={2025},
      eprint={2508.21113},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.21113}, 
}
```

## Acknowledgements

R-4B is developed based on the codebases of the following projects: [LLaVA-Next](https://github.com/LLaVA-VL/LLaVA-NeXT), [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384), [Qwen3](https://github.com/QwenLM/Qwen3), [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). We sincerely thank these projects for their outstanding work.