Model Description

MiniMax-M2.5-NVFP4-REAP is an NVFP4-quantized version of cerebras/MiniMax-M2.5-REAP-139B-A10B, a 139B-parameter Mixture-of-Experts language model with 10B active parameters and 154 experts (pruned from 256 via REAP).

The REAP checkpoint's FP8 weights were first dequantized to BF16, then quantized to NVFP4 (4-bit FP4 values with an FP8 scale per 16-element block) using NVIDIA Model Optimizer.
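
As a rough illustration of what that format means numerically, the sketch below quantizes a single 16-element block with numpy: values snap to the 4-bit E2M1 grid and share one block scale. This illustrates the format only, not Model Optimizer's actual algorithm, which also applies a per-tensor scale and stores the block scales in FP8.

import numpy as np

# Magnitudes representable by a 4-bit E2M1 (FP4) value.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block):
    """Quantize one 16-element block: choose a scale so the largest
    magnitude lands on 6.0 (the FP4 maximum), then round each scaled
    value to the nearest FP4 grid point. Illustrative only."""
    amax = float(np.abs(block).max())
    scale = amax / 6.0 if amax > 0 else 1.0   # stored as an FP8 (E4M3) value in the real format
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx], scale

block = np.random.randn(16).astype(np.float32)
fp4, scale = quantize_block(block)
print("max abs error:", np.abs(block - fp4 * scale).max())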

The 40% expert pruning from REAP combined with NVFP4 quantization makes this model small enough to run on a single 96GB GPU.

What's quantized

Only the MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. Attention layers are left in BF16. Since the expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings.

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. Because rarely-routed experts would otherwise see too few tokens, calibration was run on a much larger sample count than is typical, ensuring every expert still gets broad coverage through natural routing alone.
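
For reference, a minimal sketch of an expert-only NVFP4 calibration pass with NVIDIA Model Optimizer might look like the following. The NVFP4_DEFAULT_CFG recipe, the module-name patterns used to disable quantization outside the expert MLPs, and the tiny placeholder calibration list are assumptions for illustration; this is not the exact script used to produce this checkpoint.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

MODEL = "cerebras/MiniMax-M2.5-REAP-139B-A10B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

# Start from the stock NVFP4 recipe, then switch quantization off for
# everything except the expert MLP projections (name globs are illustrative).
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*self_attn*"] = {"enable": False}
cfg["quant_cfg"]["*lm_head*"] = {"enable": False}

# Placeholder prompts; the real run used a far larger, more diverse mix
# (see the calibration dataset section below).
calibration_texts = [
    "Write a Python function that merges two sorted lists.",
    "用中文解释一下快速排序的原理。",
]

def forward_loop(m):
    # Plain forward passes: experts are reached through the model's own
    # top-k routing, so each expert's scales are calibrated on the tokens
    # it would actually receive at inference time.
    for text in calibration_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(m.device)
        with torch.no_grad():
            m(ids)

model = mtq.quantize(model, cfg, forward_loop)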

Calibration dataset

Samples were drawn from a diverse mix of publicly available datasets spanning code generation, function/tool calling, multi-turn reasoning, math, and multilingual (English + Chinese) instruction following. System prompts were randomly varied across samples. The dataset was designed to broadly exercise the model's capabilities and activate diverse token distributions across expert modules.
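
The sketch below shows how such a mixed calibration set could be assembled; the prompt pools, system prompts, and sampling scheme are placeholders rather than the actual sources used.

import random

# Placeholder prompt pools; in practice each pool is drawn from a public
# dataset covering one capability (code, tool calling, math, bilingual chat, ...).
POOLS = {
    "code":      ["Write a Python function that merges two sorted lists."],
    "tool_call": ["Book a table for two at 7pm and report the confirmation number."],
    "math":      ["Evaluate the integral of x * exp(-x^2) from 0 to infinity."],
    "chinese":   ["用中文解释一下注意力机制的工作原理。"],
}

SYSTEM_PROMPTS = [
    "You are a helpful assistant.",
    "You are a senior software engineer. Answer concisely.",
    "You are a careful mathematician. Show your reasoning.",
]

def build_calibration_samples(n, tokenizer):
    """Interleave domains and vary system prompts so calibration traffic
    activates a broad range of experts through natural routing."""
    samples = []
    for _ in range(n):
        domain = random.choice(list(POOLS))
        messages = [
            {"role": "system", "content": random.choice(SYSTEM_PROMPTS)},
            {"role": "user", "content": random.choice(POOLS[domain])},
        ]
        samples.append(tokenizer.apply_chat_template(messages, tokenize=False))
    return samples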

How to Run

Fits on a single RTX Pro 6000 Blackwell!

SGLang

export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1

python3 -m sglang.launch_server \
  --model lukealonso/MiniMax-M2.5-NVFP4-REAP \
  --served-model-name MiniMax-M2.5 \
  --reasoning-parser minimax \
  --tool-call-parser minimax-m2 \
  --trust-remote-code \
  --tp 1 \
  --mem-fraction-static 0.95 \
  --max-running-requests 32 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --kv-cache-dtype bf16 \
  --enable-flashinfer-allreduce-fusion \
  --host 0.0.0.0 \
  --port 8000
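
Once the server is running it exposes an OpenAI-compatible API on port 8000; a minimal request (assuming the openai Python package is installed) looks like this:

from openai import OpenAI

# Points at the local SGLang (or vLLM) server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

resp = client.chat.completions.create(
    model="MiniMax-M2.5",  # matches --served-model-name
    messages=[{"role": "user", "content": "Summarize what NVFP4 quantization does."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)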

vLLM

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export NCCL_IB_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

python -m vllm.entrypoints.openai.api_server \
  --model lukealonso/MiniMax-M2.5-NVFP4-REAP \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name MiniMax-M2.5 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --attention-backend FLASH_ATTN \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 64 \
  --disable-custom-all-reduce \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
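
Both launch commands configure a tool-call parser, so the server can return structured tool calls; a sketch of such a request is below (the weather function schema is purely illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Illustrative tool schema; not something shipped with the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)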

Acknowledgments

See Also
