MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model with 309B total parameters and 15B active parameters. Designed for high-speed reasoning and agentic workflows, it utilizes a novel hybrid attention architecture and Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs.
MiMo-V2-Flash strikes a new balance between long-context modeling capability and inference efficiency. The released checkpoints are listed below.
| Model | Total Params | Active Params | Context Length | Download |
|---|---|---|---|---|
| MiMo-V2-Flash-Base | 309B | 15B | 256k | 🤗 HuggingFace |
| MiMo-V2-Flash | 309B | 15B | 256k | 🤗 HuggingFace |
> [!IMPORTANT]
> We also open-source the 3-layer MTP weights to foster community research.

MiMo-V2-Flash-Base demonstrates strong performance across standard benchmarks, surpassing models with significantly larger parameter counts.
| Category | Benchmark | Setting/Length | MiMo-V2-Flash-Base | Kimi-K2 Base | DeepSeek-V3.1 Base | DeepSeek-V3.2 Exp Base |
|---|---|---|---|---|---|---|
| Params | #Activated / #Total | - | 15B / 309B | 32B / 1043B | 37B / 671B | 37B / 671B |
| General | BBH | 3-shot | 88.5 | 88.7 | 88.2 | 88.7 |
| | MMLU | 5-shot | 86.7 | 87.8 | 87.4 | 87.8 |
| | MMLU-Redux | 5-shot | 90.6 | 90.2 | 90.0 | 90.4 |
| | MMLU-Pro | 5-shot | 73.2 | 69.2 | 58.8 | 62.1 |
| | DROP | 3-shot | 84.7 | 83.6 | 86.3 | 86.6 |
| | ARC-Challenge | 25-shot | 95.9 | 96.2 | 95.6 | 95.5 |
| | HellaSwag | 10-shot | 88.5 | 94.6 | 89.2 | 89.4 |
| | WinoGrande | 5-shot | 83.8 | 85.3 | 85.9 | 85.6 |
| | TriviaQA | 5-shot | 80.3 | 85.1 | 83.5 | 83.9 |
| | GPQA-Diamond | 5-shot | 55.1 | 48.1 | 51.0 | 52.0 |
| | SuperGPQA | 5-shot | 41.1 | 44.7 | 42.3 | 43.6 |
| | SimpleQA | 5-shot | 20.6 | 35.3 | 26.3 | 27.0 |
| Math | GSM8K | 8-shot | 92.3 | 92.1 | 91.4 | 91.1 |
| | MATH | 4-shot | 71.0 | 70.2 | 62.6 | 62.5 |
| | AIME 24&25 | 2-shot | 35.3 | 31.6 | 21.6 | 24.8 |
| Code | HumanEval+ | 1-shot | 70.7 | 84.8 | 64.6 | 67.7 |
| | MBPP+ | 3-shot | 71.4 | 73.8 | 72.2 | 69.8 |
| | CRUXEval-I | 1-shot | 67.5 | 74.0 | 62.1 | 63.9 |
| | CRUXEval-O | 1-shot | 79.1 | 83.5 | 76.4 | 74.9 |
| | MultiPL-E HumanEval | 0-shot | 59.5 | 60.5 | 45.9 | 45.7 |
| | MultiPL-E MBPP | 0-shot | 56.7 | 58.8 | 52.5 | 50.6 |
| | BigCodeBench | 0-shot | 70.1 | 61.7 | 63.0 | 62.9 |
| | LiveCodeBench v6 | 1-shot | 30.8 | 26.3 | 24.8 | 24.9 |
| | SWE-Bench (AgentLess) | 3-shot | 30.8 | 28.2 | 24.8 | 9.4* |
| Chinese | C-Eval | 5-shot | 87.9 | 92.5 | 90.0 | 91.0 |
| | CMMLU | 5-shot | 87.4 | 90.9 | 88.8 | 88.9 |
| | C-SimpleQA | 5-shot | 61.5 | 77.6 | 70.9 | 68.0 |
| Multilingual | GlobalMMLU | 5-shot | 76.6 | 80.7 | 81.9 | 82.0 |
| | INCLUDE | 5-shot | 71.4 | 75.3 | 77.2 | 77.2 |
| Long Context | NIAH-Multi | 32K | 99.3 | 99.8 | 99.7 | 85.6* |
| | | 64K | 99.9 | 100.0 | 98.6 | 85.9* |
| | | 128K | 98.6 | 99.5 | 97.2 | 94.3* |
| | | 256K | 96.7 | - | - | - |
| | GSM-Infinite Hard | 16K | 37.7 | 34.6 | 41.5 | 50.4 |
| | | 32K | 33.7 | 26.1 | 38.8 | 45.2 |
| | | 64K | 31.5 | 16.0 | 34.7 | 32.6 |
| | | 128K | 29.0 | 8.8 | 28.7 | 25.7 |

An asterisk (\*) indicates that the model may fail to follow the prompt or format.
Following our Post-Training Paradigm with MOPD and Agentic RL, the model achieves SOTA reasoning and agentic performance.
| Benchmark | MiMo-V2-Flash | Kimi-K2 Thinking | DeepSeek-V3.2 Thinking | Gemini-3.0 Pro | Claude Sonnet 4.5 | GPT-5 High |
|---|---|---|---|---|---|---|
| Reasoning | | | | | | |
| MMLU-Pro | 84.9 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 |
| GPQA-Diamond | 83.7 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 |
| HLE (no tools) | 22.1 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 |
| AIME 2025 | 94.1 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 |
| HMMT Feb. 2025 | 84.4 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 |
| LiveCodeBench-v6 | 80.6 | 83.1 | 83.3 | 90.7 | 64.0 | 84.5 |
| General Writing | | | | | | |
| Arena-Hard (Hard Prompt) | 54.1 | 71.9 | 53.4 | 72.6 | 63.3 | 71.9 |
| Arena-Hard (Creative Writing) | 86.2 | 80.1 | 88.8 | 93.6 | 76.7 | 92.2 |
| Long Context | | | | | | |
| LongBench V2 | 60.6 | 45.1 | 58.4 | 65.6 | 61.8 | - |
| MRCR | 45.7 | 44.2 | 55.5 | 89.7 | 55.4 | - |
| Code Agent | | | | | | |
| SWE-Bench Verified | 73.4 | 71.3 | 73.1 | 76.2 | 77.2 | 74.9 |
| SWE-Bench Multilingual | 71.7 | 61.1 | 70.2 | - | 68.0 | 55.3 |
| Terminal-Bench Hard | 30.5 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 |
| Terminal-Bench 2.0 | 38.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 |
| General Agent | | | | | | |
| BrowseComp | 45.4 | - | 51.4 | - | 24.1 | 54.9 |
| BrowseComp (w/ Context Manage) | 58.3 | 60.2 | 67.6 | 59.2 | - | - |
| \(\tau^2\)-Bench | 80.3 | 74.3 | 80.3 | 85.4 | 84.7 | 80.2 |

MiMo-V2-Flash addresses the quadratic complexity of long contexts by interleaving Local Sliding Window Attention (SWA) and Global Attention (GA).
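
To make the interleaving concrete, here is a minimal sketch in plain Python. The rule "every k-th layer is global", the window size, and the helper names `layer_pattern`, `attends`, and `scored_pairs` are illustrative assumptions; this section does not state the actual ratio or window used by the model.

```python
def layer_pattern(num_layers: int, global_every: int) -> list[str]:
    """Label each layer "SWA" or "GA"; every `global_every`-th layer is global (assumed rule)."""
    return ["GA" if (i + 1) % global_every == 0 else "SWA" for i in range(num_layers)]


def attends(kind: str, query_pos: int, key_pos: int, window: int) -> bool:
    """Return True if the query at query_pos may attend to the key at key_pos."""
    if key_pos > query_pos:
        return False  # causal mask: never look at future tokens
    if kind == "GA":
        return True  # global attention sees the whole prefix
    return query_pos - key_pos < window  # sliding window: only the last `window` tokens


def scored_pairs(kind: str, seq_len: int, window: int) -> int:
    """Count (query, key) pairs that are scored, a proxy for attention cost."""
    return sum(
        attends(kind, q, k, window) for q in range(seq_len) for k in range(seq_len)
    )


if __name__ == "__main__":
    print(layer_pattern(num_layers=6, global_every=3))
    print("GA pairs:", scored_pairs("GA", seq_len=8, window=2))
    print("SWA pairs:", scored_pairs("SWA", seq_len=8, window=2))
```

The pair count grows quadratically with sequence length for a GA layer but only linearly for an SWA layer, which is why replacing most layers with SWA reduces long-context cost.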
Unlike traditional speculative decoding, our MTP module is natively integrated for training and inference.
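
For readers unfamiliar with draft-and-verify decoding, the following toy sketch shows the generic greedy acceptance rule: draft tokens are kept while they match what the main model would have produced, and the first mismatch is replaced by the main model's token. It is not the MTP module or the SGLang implementation; `verify_draft` and `target_next` are hypothetical names, and a real system scores all draft positions in one forward pass rather than one call per token.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification of draft tokens against the main model.

    `target_next(context)` stands in for the main model's greedy next token.
    Returns the tokens actually emitted in this step.
    """
    context = list(prefix)
    emitted: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            emitted.append(expected)  # replace the first wrong draft token and stop
            return emitted
        emitted.append(token)  # draft token accepted
        context.append(token)
    emitted.append(target_next(context))  # all drafts accepted: one bonus token
    return emitted


if __name__ == "__main__":
    # Toy "model": the next token is always the current context length.
    toy_model = lambda context: len(context)
    print(verify_draft([0, 1, 2], [3, 4, 9], toy_model))
```

With three draft tokens per step, one verification pass emits between one and four tokens, so the output matches plain greedy decoding while needing fewer main-model passes.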
MiMo-V2-Flash leverages a post-training pipeline designed to maximize reasoning and agentic capabilities through innovative distillation and reinforcement learning strategies.
We introduce Multi-Teacher On-Policy Distillation (MOPD), a new paradigm that formulates knowledge distillation as a reinforcement learning process.
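
This section does not spell out the MOPD objective. One common way to cast on-policy distillation as reinforcement learning is to sample from the student and reward each sampled token by how much more likely the teacher finds it than the student does. The sketch below assumes that formulation; `token_rewards` and `pick_teacher` are hypothetical helpers, and the per-domain teacher routing is an illustrative assumption.

```python
from typing import Dict, List


def token_rewards(student_logprobs: List[float], teacher_logprobs: List[float]) -> List[float]:
    """Per-token reward on a student-sampled sequence (assumed formulation).

    reward_t = log p_teacher(token_t) - log p_student(token_t)
    Positive values mean the teacher prefers the token more than the student does.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


def pick_teacher(domain: str, teachers: Dict[str, str]) -> str:
    """Route a prompt to the teacher registered for its domain (hypothetical routing)."""
    return teachers[domain]


if __name__ == "__main__":
    teachers = {"math": "math-teacher", "code": "code-teacher"}  # made-up names
    print(pick_teacher("math", teachers))
    print(token_rewards([-1.0, -2.0], [-0.5, -3.0]))
```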
We significantly scale up the agentic training environments to improve intelligence and generalization.
To support high-throughput RL training for large-scale MoE models, we implemented several infrastructure optimizations on top of SGLang and Megatron-LM.
MiMo-V2-Flash supports FP8 mixed precision inference. We recommend using SGLang for optimal performance.
pip install sglang
# Launch server
python3 -m sglang.launch_server \
  --model-path XiaomiMiMo/MiMo-V2-Flash \
  --served-model-name mimo-v2-flash \
  --pp-size 1 \
  --dp-size 2 \
  --enable-dp-attention \
  --tp-size 8 \
  --moe-a2a-backend deepep \
  --page-size 1 \
  --host 0.0.0.0 \
  --port 9001 \
  --trust-remote-code \
  --mem-fraction-static 0.75 \
  --max-running-requests 128 \
  --chunked-prefill-size 16384 \
  --reasoning-parser qwen3 \
  --tool-call-parser mimo \
  --context-length 262144 \
  --attention-backend fa3 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-mtp
# Send request
curl -i http://localhost:9001/v1/chat/completions \
  -H 'Content-Type:application/json' \
  -d '{
    "messages" : [{
      "role": "user",
      "content": "Nice to meet you MiMo"
    }],
    "model": "mimo-v2-flash",
    "max_tokens": 4096,
    "temperature": 0.8,
    "top_p": 0.95,
    "stream": true,
    "chat_template_kwargs": {
      "enable_thinking": true
    }
  }'
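
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the fields of the curl call, except that it sets `"stream": False` so the reply arrives as a single JSON document. The helper names `build_payload` and `chat` are hypothetical, and the response is assumed to follow the OpenAI-compatible chat completions shape.

```python
import json
import urllib.request


def build_payload(prompt: str, temperature: float = 0.8, enable_thinking: bool = True) -> dict:
    """Build a request body with the same fields as the curl example."""
    return {
        "model": "mimo-v2-flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 4096,
        "temperature": temperature,
        "top_p": 0.95,
        "stream": False,  # assumption: non-streaming keeps the example short
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def chat(prompt: str, base_url: str = "http://localhost:9001") -> dict:
    """POST the payload to the server started above and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    reply = chat("Nice to meet you MiMo")
    print(reply["choices"][0]["message"]["content"])
```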
> [!IMPORTANT]
> We highly recommend using one of the following system prompts. Choose either the English or the Chinese version.

English
You are MiMo, an AI assistant developed by Xiaomi. Today's date: {date} {week}. Your knowledge cutoff date is December 2024.
Chinese
你是MiMo(中文名称也是MiMo),是小米公司研发的AI智能助手。 今天的日期:{date} {week},你的知识截止日期是2024年12月。
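
The `{date}` and `{week}` placeholders need to be filled in before the prompt is sent. The sketch below does this for the English prompt. The exact date and weekday formats are not specified in this document, so the ISO date and English weekday name used here are assumptions, and `build_system_prompt` is a hypothetical helper.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

SYSTEM_PROMPT_EN = (
    "You are MiMo, an AI assistant developed by Xiaomi. "
    "Today's date: {date} {week}. "
    "Your knowledge cutoff date is December 2024."
)


def build_system_prompt(today: date) -> str:
    """Fill the placeholders with an ISO date and weekday name (assumed formats)."""
    return SYSTEM_PROMPT_EN.format(
        date=today.isoformat(),
        week=WEEKDAYS[today.weekday()],  # date.weekday(): Monday is 0
    )


if __name__ == "__main__":
    print(build_system_prompt(date.today()))
```

The resulting string goes into a `{"role": "system", ...}` message at the start of the `messages` array.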
> [!IMPORTANT]
> Recommended sampling parameters:
> - top_p=0.95
> - temperature=0.8 for math, writing, web-dev
> - temperature=0.3 for agentic tasks (e.g., vibe-coding, tool-use)

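
A small lookup keeps these recommendations in one place. The task labels below are taken from the list above; the function name `sampling_params` and the choice to reject unknown labels are assumptions made for illustration.

```python
GENERAL_TASKS = {"math", "writing", "web-dev"}
AGENTIC_TASKS = {"vibe-coding", "tool-use"}


def sampling_params(task: str) -> dict:
    """Return the recommended top_p and temperature for a task label."""
    if task in GENERAL_TASKS:
        temperature = 0.8
    elif task in AGENTIC_TASKS:
        temperature = 0.3
    else:
        raise ValueError(f"no recommendation for task {task!r}")
    return {"top_p": 0.95, "temperature": temperature}


if __name__ == "__main__":
    print(sampling_params("math"))
    print(sampling_params("tool-use"))
```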
> [!IMPORTANT]
> In thinking mode with multi-turn tool calls, the model returns a reasoning_content field alongside tool_calls. To continue the conversation, the client must keep the reasoning_content of every previous assistant turn in the messages array of each subsequent request.

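
A minimal sketch of that bookkeeping follows, assuming OpenAI-style message dictionaries. The helper names are hypothetical and the sample tool call is made up; the point is only that `reasoning_content` is copied into the history together with `tool_calls` instead of being dropped.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def append_assistant_turn(messages: List[Message], assistant_message: Message) -> List[Message]:
    """Return a new history that keeps reasoning_content and tool_calls of the reply."""
    kept: Message = {"role": "assistant", "content": assistant_message.get("content")}
    for key in ("reasoning_content", "tool_calls"):
        if key in assistant_message:
            kept[key] = assistant_message[key]
    return messages + [kept]


def append_tool_result(messages: List[Message], tool_call_id: str, result: str) -> List[Message]:
    """Return a new history with the tool output for the given call appended."""
    return messages + [{"role": "tool", "tool_call_id": tool_call_id, "content": result}]


if __name__ == "__main__":
    history = [{"role": "user", "content": "What is the weather in Beijing?"}]
    reply = {  # made-up assistant reply for illustration
        "role": "assistant",
        "content": None,
        "reasoning_content": "I should call the weather tool.",
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": "{\"city\": \"Beijing\"}"},
        }],
    }
    history = append_assistant_turn(history, reply)
    history = append_tool_result(history, "call_1", "Sunny, 5 degrees")
    print(history)  # send this full list as "messages" in the next request
```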
If you find our work helpful, please cite our technical report:
@misc{mimo2025flash,
  title={MiMo-V2-Flash Technical Report},
  author={LLM-Core Xiaomi},
  year={2025},
  url={https://github.com/XiaomiMiMo/MiMo-V2-Flash/paper.pdf}
}
If you have any questions, please contact us at mimo@xiaomi.com, join our WeChat group below, or open an issue.