This repository is primarily for the Transformers framework. If you're using other open-source frameworks, please use the alternative repository: MiniMax-Text-01


MiniMax-Text-01

1. Introduction

MiniMax-Text-01 is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the long context capabilities of the model, MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methodsโ€”such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, Expert Tensor Parallel (ETP), etc., MiniMax-Text-01's training context length is extended to 1 million tokens, and it can handle a context of up to 4 million tokens during the inference. On various academic benchmarks, MiniMax-Text-01 also demonstrates the performance of a top-tier model.

2. Model Architecture

The architecture of MiniMax-Text-01 is briefly described as follows:

  • Total Parameters: 456B
  • Activated Parameters per Token: 45.9B
  • Number Layers: 80
  • Hybrid Attention: a softmax attention is positioned after every 7 lightning attention.
    • Number of attention heads: 64
    • Attention head dimension: 128
  • Mixture of Experts:
    • Number of experts: 32
    • Expert hidden dimension: 9216
    • Top-2 routing strategy
  • Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
  • Hidden Size: 6144
  • Vocab Size: 200,064

3. Evaluation

Core Academic Benchmarks

Tasks GPT-4o (11-20) Claude-3.5-Sonnet (10-22) Gemini-1.5-Pro (002) Gemini-2.0-Flash (exp) Qwen2.5-72B-Inst. DeepSeek-V3 Llama-3.1-405B-Inst. MiniMax-Text-01
General
MMLU* 85.7 88.3 86.8 86.5 86.1 88.5 88.6 88.5
MMLU-Pro* 74.4 78.0 75.8 76.4 71.1 75.9 73.3 75.7
SimpleQA 39.0 28.1 23.4 26.6 10.3 24.9 23.2 23.7
C-SimpleQA 64.6 56.8 59.4 63.3 52.2 64.8 54.7 67.4
IFEval (avg) 84.1 90.1 89.4 88.4 87.2 87.3 86.4 89.1
Arena-Hard 92.4 87.6 85.3 72.7 81.2 91.4 63.5 89.1
Reasoning
GPQA* (diamond) 46.0 65.0 59.1 62.1 49.0 59.1 50.7 54.4
DROP* (F1) 89.2 88.8 89.2 89.3 85.0 91.0 92.5 87.8
Mathematics
GSM8k* 95.6 96.9 95.2 95.4 95.8 96.7 96.7 94.8
MATH* 76.6 74.1 84.6 83.9 81.8 84.6 73.8 77.4
Coding
MBPP + 76.2 75.1 75.4 75.9 77.0 78.8 73.0 71.7
HumanEval 90.2 93.7 86.6 89.6 86.6 92.1 89.0 86.9

* Evaluated following a 0-shot CoT setting.

Long Benchmarks

4M Needle In A Haystack Test

Ruler

Model 4k 8k 16k 32k 64k 128k 256k 512k 1M
GPT-4o (11-20) 0.970 0.921 0.890 0.888 0.884 - - - -
Claude-3.5-Sonnet (10-22) 0.965 0.960 0.957 0.950 0.952 0.938 - - -
Gemini-1.5-Pro (002) 0.962 0.960 0.960 0.958 0.938 0.917 0.916 0.861 0.850
Gemini-2.0-Flash (exp) 0.960 0.960 0.951 0.957 0.937 0.860 0.797 0.709 -
MiniMax-Text-01 0.963 0.961 0.953 0.954 0.943 0.947 0.945 0.928 0.910

LongBench v2

Model overall easy hard short medium long
Human 53.7 100.0 25.1 47.2 59.1 53.7
w/ CoT
GPT-4o (11-20) 51.4 54.2 49.7 59.6 48.6 43.5
Claude-3.5-Sonnet (10-22) 46.7 55.2 41.5 53.9 41.9 44.4
Deepseek-V3 - - - - - -
Qwen2.5-72B-Inst. 43.5 47.9 40.8 48.9 40.9 39.8
MiniMax-Text-01 56.5 66.1 50.5 61.7 56.7 47.2
w/o CoT
GPT-4o (11-20) 50.1 57.4 45.6 53.3 52.4 40.2
Claude-3.5-Sonnet (10-22) 41.0 46.9 37.3 46.1 38.6 37.0
Deepseek-V3 48.7 - - - - -
Qwen2.5-72B-Inst. 42.1 42.7 41.8 45.6 38.1 44.4
MiniMax-Text-01 52.9 60.9 47.9 58.9 52.6 43.5

MTOB

Context Type no context half book full book ฮ” half book ฮ” full book
eng โ†’ kalam (ChrF)
GPT-4o (11-20) 9.90 54.30 - 44.40 -
Claude-3.5-Sonnet (10-22) 20.22 53.62 55.65 33.39 35.42
Gemini-1.5-Pro (002) 16.79 53.68 57.90 36.89 41.11
Gemini-2.0-Flash (exp) 12.20 49.50 53.30 37.30 41.10
Qwen-Long 16.55 48.48 45.94 31.92 29.39
MiniMax-Text-01 6.0 51.74 51.60 45.7 45.6
kalam โ†’ eng (BLEURT)
GPT-4o (11-20) 33.20 58.30 - 25.10 -
Claude-3.5-Sonnet (10-22) 31.42 59.70 62.30 28.28 30.88
Gemini-1.5-Pro (002) 32.02 61.52 63.09 29.50 31.07
Gemini-2.0-Flash (exp) 33.80 57.50 57.00 23.70 23.20
Qwen-Long 30.13 53.14 32.15 23.01 2.02
MiniMax-Text-01 33.65 57.10 58.00 23.45 24.35

4. Quickstart

Here we provide a simple example of loading the tokenizer and model to generate content.

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, QuantoConfig, GenerationConfig

# load hf config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-Text-01-hf")

# quantization config, int8 is recommended
quantization_config =  QuantoConfig(
            weights="int8",
            modules_to_not_convert=[
                "lm_head",
                "embed_tokens",
            ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
            + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
        )

# assume 8 GPUs
world_size = 8
layers_per_device = hf_config.num_hidden_layers // world_size
# set device map
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}'
}
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-Text-01-hf")
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# tokenize and move to device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")

# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-Text-01-hf",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    offload_buffers=True,
)

# generate response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

5. Deployment Guide

For production deployment, we recommend using vLLM to serve MiniMax-Text-01. vLLM provides excellent performance for serving large language models with the following features:

๐Ÿ”ฅ Outstanding service throughput performance
โšก Efficient and intelligent memory management
๐Ÿ“ฆ Powerful batch request processing capability
โš™๏ธ Deeply optimized underlying performance

For detailed deployment instructions, please refer to our vLLM Deployment Guide.

6. Function Calling

MiniMax-Text-01 supports Function Calling capability, enabling the model to intelligently identify when external functions need to be called and output parameters in structured JSON format. With Function Calling, you can:

  • Let the model recognize implicit function call needs in user requests
  • Receive structured parameter outputs for seamless application integration
  • Support various complex parameter types, including nested objects and arrays Function Calling supports standard OpenAI-compatible format definitions and integrates seamlessly with the Transformers library. For detailed usage instructions, please refer to our Function Call Guide or Chinese Guide.

7. Citation

@misc{minimax2025minimax01scalingfoundationmodels,
      title={MiniMax-01: Scaling Foundation Models with Lightning Attention}, 
      author={MiniMax and Aonian Li and Bangwei Gong and Bo Yang and Boji Shan and Chang Liu and Cheng Zhu and Chunhao Zhang and Congchao Guo and Da Chen and Dong Li and Enwei Jiao and Gengxin Li and Guojun Zhang and Haohai Sun and Houze Dong and Jiadai Zhu and Jiaqi Zhuang and Jiayuan Song and Jin Zhu and Jingtao Han and Jingyang Li and Junbin Xie and Junhao Xu and Junjie Yan and Kaishun Zhang and Kecheng Xiao and Kexi Kang and Le Han and Leyang Wang and Lianfei Yu and Liheng Feng and Lin Zheng and Linbo Chai and Long Xing and Meizhi Ju and Mingyuan Chi and Mozhi Zhang and Peikai Huang and Pengcheng Niu and Pengfei Li and Pengyu Zhao and Qi Yang and Qidi Xu and Qiexiang Wang and Qin Wang and Qiuhui Li and Ruitao Leng and Shengmin Shi and Shuqi Yu and Sichen Li and Songquan Zhu and Tao Huang and Tianrun Liang and Weigao Sun and Weixuan Sun and Weiyu Cheng and Wenkai Li and Xiangjun Song and Xiao Su and Xiaodong Han and Xinjie Zhang and Xinzhu Hou and Xu Min and Xun Zou and Xuyang Shen and Yan Gong and Yingjie Zhu and Yipeng Zhou and Yiran Zhong and Yongyi Hu and Yuanxiang Fan and Yue Yu and Yufeng Yang and Yuhao Li and Yunan Huang and Yunji Li and Yunpeng Huang and Yunzhi Xu and Yuxin Mao and Zehan Li and Zekang Li and Zewei Tao and Zewen Ying and Zhaoyang Cong and Zhen Qin and Zhenhua Fan and Zhihang Yu and Zhuo Jiang and Zijia Wu},
      year={2025},
      eprint={2501.08313},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.08313}, 
}

8. Chatbot & API

For general use and evaluation, we provide a Chatbot with online search capabilities and the online API for developers. For general use and evaluation, we provide the MiniMax MCP Server with video generation, image generation, speech synthesis, and voice cloning for developers.

9. Contact Us

Contact us at [email protected].

Downloads last month
10,341
Safetensors
Model size
456B params
Tensor type
F32
ยท
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Collection including MiniMaxAI/MiniMax-Text-01-hf