---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- vLLM
- AWQ
base_model:
- cerebras/MiniMax-M2-REAP-162B-A10B
base_model_relation: quantized
---

# MiniMax-M2-REAP-162B-A10B-AWQ

Base model: [cerebras/MiniMax-M2-REAP-162B-A10B](https://www.modelscope.cn/models/cerebras/MiniMax-M2-REAP-162B-A10B)

```
Note: Some attention layers are left unquantized to preserve output coherence and consistency;
as a result, the file size is reduced by about 24%, rather than the ~30% we might otherwise expect.
```
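
To check exactly which modules were kept unquantized, one option is to inspect the checkpoint's quantization config. The snippet below is a minimal sketch, assuming an AWQ-style `quantization_config` block in `config.json` with a `modules_to_not_convert` list (the exact field names may differ for this checkpoint):

```python
# Minimal sketch: inspect the quantization config shipped with the checkpoint.
# Assumes an AWQ-style "quantization_config" entry in config.json that lists
# excluded modules under "modules_to_not_convert"; field names may differ.
import json

with open("your_local_path/config.json") as f:  # adjust to wherever the checkpoint was downloaded
    cfg = json.load(f)

quant_cfg = cfg.get("quantization_config", {})
print("Quant method:       ", quant_cfg.get("quant_method"))
print("Bits:               ", quant_cfg.get("bits"))
print("Unquantized modules:", quant_cfg.get("modules_to_not_convert"))
```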

### 【Dependencies / Installation】

<i>Same as the original `MiniMax-M2`.</i>

As of **2025-11-19**, create a fresh Python environment and run:

```bash
uv venv
source .venv/bin/activate
uv pip install 'triton-kernels @ git+https://github.com/triton-lang/[email protected]#subdirectory=python/triton_kernels' \
    vllm --extra-index-url https://wheels.vllm.ai/nightly --prerelease=allow
```
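
As an optional sanity check (not part of the official guide), you can confirm that the nightly build imports cleanly in the new environment:

```python
# Optional sanity check: make sure the nightly vLLM build is importable
# and print its version.
import vllm
print(vllm.__version__)
```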

[vLLM Official Guide](https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html)

### 【vLLM Startup Command】

<i>Note: When launching with TP=8, include `--enable-expert-parallel`;
otherwise the expert tensors will not be sharded evenly across the GPUs.</i>

```bash
CONTEXT_LENGTH=32768

vllm serve \
    tclf90/MiniMax-M2-REAP-162B-A10B-AWQ \
    --served-model-name MY_MODEL \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --swap-space 8 \
    --max-num-seqs 32 \
    --max-model-len $CONTEXT_LENGTH \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --trust-remote-code \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port 8000
```
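
Once the server is running, it exposes the standard OpenAI-compatible API. Below is a minimal request sketch, assuming the command above is reachable at `http://localhost:8000` and was launched with `--served-model-name MY_MODEL`:

```python
# Minimal sketch: send a chat completion request to the vLLM server started above.
# Assumes it runs locally on port 8000 with served model name "MY_MODEL".
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "MY_MODEL",
        "messages": [
            {"role": "user", "content": "Write a Python function that reverses a string."}
        ],
        "max_tokens": 512,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```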

### 【Logs】

```
2025-11-19
1. Initial commit
```

### 【Model Files】

| File Size | Last Updated |
|-----------|--------------|
| `86GiB`   | `2025-11-19` |

### 【Model Download】

```python
from modelscope import snapshot_download
snapshot_download('tclf90/MiniMax-M2-REAP-162B-A10B-AWQ', cache_dir="your_local_path")
```
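
As a small usage note (a sketch; the cache path is a placeholder), `snapshot_download` returns the local directory of the checkpoint, which can then be passed to `vllm serve` in place of the model ID:

```python
# Sketch: capture the local checkpoint directory returned by snapshot_download,
# then point vLLM at it instead of the remote model ID.
from modelscope import snapshot_download

model_dir = snapshot_download('tclf90/MiniMax-M2-REAP-162B-A10B-AWQ', cache_dir="your_local_path")
print(model_dir)  # pass this path to `vllm serve <model_dir> ...`
```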

### 【Overview】

<p align="center">
<em>𓌳 <strong>REAP</strong>𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
<img src="https://i.imgur.com/rmzG3gg.png" alt="REAP" width="75%">
</p>

# MiniMax-M2-REAP-162B-A10B

## ✨ Highlights

Introducing **MiniMax-M2-REAP-162B-A10B**, a **memory-efficient compressed variant** of MiniMax-M2 that maintains near-identical performance while being **30% lighter**.

This model was created using **REAP (Router-weighted Expert Activation Pruning)**, a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over the remaining experts. Key features include:

- **Near-Lossless Performance**: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the full 230B model
- **30% Memory Reduction**: Compressed from 230B to 162B parameters, significantly lowering deployment costs and memory requirements
- **Preserved Capabilities**: Retains all core functionalities, including code generation, math & reasoning, and tool calling
- **Drop-in Compatibility**: Works with vanilla vLLM; no source modifications or custom patches required
- **Optimized for Real-World Use**: Particularly effective for resource-constrained environments, local deployments, and academic research

---

## 📋 Model Overview

**MiniMax-M2-REAP-162B-A10B** has the following specifications:

- **Base Model**: MiniMax-M2
- **Compression Method**: REAP (Router-weighted Expert Activation Pruning)
- **Compression Ratio**: 30% expert pruning
- **Type**: Sparse Mixture-of-Experts (SMoE) Causal Language Model
- **Number of Parameters**: 162B total, 10B activated per token
- **Number of Layers**: 62
- **Number of Attention Heads**: 48
- **Number of Experts**: 180 (uniformly pruned from 256)
- **Number of Activated Experts**: 8 per token
- **Context Length**: 196,608 tokens
- **License**: Modified MIT

---

## 📊 Evaluations

<table>
<thead>
<tr>
<th align="left">Benchmark</th>
<th align="center">MiniMax-M2</th>
<th align="center"><a href="https://huggingface.co/cerebras/MiniMax-M2-REAP-172B-A10B">MiniMax-M2-REAP-172B-A10B</a></th>
<th align="center"><a href="https://huggingface.co/cerebras/MiniMax-M2-REAP-162B-A10B">MiniMax-M2-REAP-162B-A10B</a></th>
<th align="center"><a href="https://huggingface.co/cerebras/MiniMax-M2-REAP-139B-A10B">MiniMax-M2-REAP-139B-A10B</a></th>
</tr>
</thead>
<tbody>
<tr><td><strong>Compression</strong></td><td align="center">—</td><td align="center">25%</td><td align="center">30%</td><td align="center">40%</td></tr>
<tr><td colspan="5" align="center"><strong>Coding</strong></td></tr>
<tr><td><strong>HumanEval</strong></td><td align="center">93.9</td><td align="center">93.9</td><td align="center">93.3</td><td align="center">91.5</td></tr>
<tr><td><strong>HumanEval+</strong></td><td align="center">89.0</td><td align="center">86.6</td><td align="center">86.6</td><td align="center">83.5</td></tr>
<tr><td><strong>MBPP</strong></td><td align="center">87.6</td><td align="center">88.1</td><td align="center">86.5</td><td align="center">85.2</td></tr>
<tr><td><strong>MBPP+</strong></td><td align="center">73.0</td><td align="center">74.9</td><td align="center">73.0</td><td align="center">71.4</td></tr>
<tr><td colspan="5" align="center"><strong>Reasoning</strong></td></tr>
<tr><td><strong>AIME25</strong></td><td align="center">76.7</td><td align="center">83.3</td><td align="center">73.3</td><td align="center">73.3</td></tr>
<tr><td><strong>MATH-500</strong></td><td align="center">91.6</td><td align="center">89.4</td><td align="center">89.4</td><td align="center">93.8</td></tr>
<tr><td colspan="5" align="center"><strong>Agentic / tool calling</strong></td></tr>
<tr><td><strong>𝜏²-bench (Telecom, discard think traces)</strong></td><td align="center">59.1</td><td align="center">57.6</td><td align="center">59.1</td><td align="center">55.3</td></tr>
<tr><td><strong>BFCLv3 (discard think traces)</strong></td><td align="center">62.6</td><td align="center">61.5</td><td align="center">59.9</td><td align="center">57.9</td></tr>
</tbody>
</table>

🟩 *This checkpoint maintains almost identical performance while being 30% lighter.*

For more details on the evaluation setup, refer to the [REAP arXiv preprint](https://arxiv.org/abs/2510.13999).

---

## 🚀 Deployment

You can deploy the model directly with the **latest vLLM** (any release that supports MiniMax-M2); no source modifications or custom patches are required.

```bash
vllm serve cerebras/MiniMax-M2-REAP-162B-A10B \
    --tensor-parallel-size 8 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --enable-expert-parallel \
    --enable-auto-tool-choice
```

If you run out of GPU memory with this model, you may need to set a lower value for the `--max-num-seqs` flag (e.g., 64). For more information, refer to the [official vLLM deployment guide](https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/docs/vllm_deploy_guide.md).

## 🧩 Model Creation

This checkpoint was created by applying the **REAP (Router-weighted Expert Activation Pruning)** method uniformly across all Mixture-of-Experts (MoE) blocks of **MiniMax-M2**, with a **30% pruning rate**.

### How REAP Works

REAP selects experts to prune based on a novel **saliency criterion** that considers both:

- **Router gate values**: How frequently and strongly the router activates each expert
- **Expert activation norms**: The magnitude of each expert's output contributions

This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while those that play critical roles in the model's computations are preserved.
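
As an illustration only (not the reference implementation), the sketch below scores each expert by averaging, over the tokens routed to it, the product of its router gate value and the norm of its output, then keeps the highest-scoring experts:

```python
# Illustrative sketch of a REAP-style saliency score (not the official implementation).
# Each expert is scored by the average of (router gate weight * expert output norm)
# over the tokens routed to it; the lowest-scoring experts are pruned.
import torch

def expert_saliency(gate_weights: torch.Tensor,      # [tokens, experts], 0 where not routed
                    expert_out_norms: torch.Tensor   # [tokens, experts], ||expert_j(x_t)||_2
                    ) -> torch.Tensor:
    routed = gate_weights > 0
    contrib = gate_weights * expert_out_norms        # gate-weighted contribution per token/expert
    counts = routed.sum(dim=0).clamp(min=1)          # tokens routed to each expert
    return contrib.sum(dim=0) / counts               # [experts] saliency per expert

# Toy example with synthetic data: keep 180 of 256 experts in one MoE block,
# matching this checkpoint's 256 -> 180 expert pruning.
num_tokens, num_experts, num_keep = 4096, 256, 180
gates = torch.rand(num_tokens, num_experts) * (torch.rand(num_tokens, num_experts) < 8 / num_experts)
norms = torch.rand(num_tokens, num_experts)
scores = expert_saliency(gates, norms)
keep_idx = scores.argsort(descending=True)[:num_keep]
print(f"Keeping {keep_idx.numel()} of {num_experts} experts in this block")
```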

### Key Advantages

- **One-Shot Compression**: No fine-tuning is required after pruning; the model is immediately ready for deployment
- **Preserved Router Control**: Unlike expert-merging methods, REAP maintains the router's independent, input-dependent control over the remaining experts, avoiding "functional subspace collapse"
- **Generative Task Superiority**: REAP significantly outperforms expert-merging approaches on generative benchmarks (code generation, creative writing, mathematical reasoning) while maintaining competitive performance on discriminative tasks

📚 For more details, refer to the following resources:

- [🧾 arXiv Preprint](https://arxiv.org/abs/2510.13999)
- [🧾 REAP Blog](https://www.cerebras.ai/blog/reap)
- [💻 REAP Codebase (GitHub)](https://github.com/CerebrasResearch/reap)

---

## ⚖️ License

This model is derived from **[`MiniMaxAI/MiniMax-M2`](https://huggingface.co/MiniMaxAI/MiniMax-M2)** and is distributed under the **modified MIT license**.

---

## 🧾 Citation

If you use this checkpoint, please cite the REAP paper:

```bibtex
@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```