Update README.md
README.md CHANGED
@@ -45,7 +45,7 @@ Baichuan-M2 incorporates three core technical innovations: First, through the **
 
 ### General Performance
 
-| Benchmark | Baichuan-M2-32B | Qwen3-32B |
+| Benchmark | Baichuan-M2-32B | Qwen3-32B (Thinking) |
 |-----------|-----------------|-----------|
 | AIME24 | 83.4 | 81.4 |
 | AIME25 | 72.9 | 72.9 |
@@ -75,10 +75,19 @@ For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.9.0` or to create
 ```shell
 python -m sglang.launch_server --model-path baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3
 ```
+To turn on kv cache FP8 quantization:
+```shell
+python -m sglang.launch_server --model-path baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3 --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer
+```
+
 - vLLM:
 ```shell
 vllm serve baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3
 ```
+To turn on kv cache FP8 quantization:
+```shell
+vllm serve baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3 --kv_cache_dtype fp8_e4m3
+```
 
 ## MTP inference with SGLang
 
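Once the SGLang server above is running, it exposes an OpenAI-compatible API (by default on port 30000; change it with `--port`). A minimal smoke test against the chat completions route; the prompt and `max_tokens` value here are illustrative, not from the README:

```shell
# Query the SGLang server's OpenAI-compatible chat completions route.
# Port 30000 is SGLang's default; adjust if you launched with --port.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4",
    "messages": [{"role": "user", "content": "What are common causes of chest pain?"}],
    "max_tokens": 512
  }'
```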
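The vLLM server can be exercised the same way on its default port 8000 (`--port` to override). The `model` field should match the served model name, which defaults to the path given to `vllm serve`; again, the prompt is illustrative:

```shell
# Same smoke test against vLLM's OpenAI-compatible endpoint (default port 8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4",
    "messages": [{"role": "user", "content": "What are common causes of chest pain?"}],
    "max_tokens": 512
  }'
```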