MiniMax M3 GGUF

MiniMax M3 GGUF by unsloth is an image-text-to-text model with multimodal capabilities. Understand and compare its multimodal features, benchmarks, and capabilities.

Comparison

Feature | MiniMax M3 GGUF | Interfaze
Input Modalities | text, image, video | image, text, audio, video, document
Native OCR | No | Yes
Long Document Processing | Yes | Yes
Language Support | 40 partial | 162+
Native Speech-to-Text | No | Yes
Native Object Detection | No | Yes
Guardrail Controls | Yes | Yes
Context Input Size | 1M | 1M
Tool Calling | Yes | Tool calling supported + built in browser, code execution and web search

Scaling

Feature | MiniMax M3 GGUF | Interfaze
Scaling | Self-hosted/Provider-hosted with quantization | Unlimited


MiniMax-M3 support in llama.cpp is preliminary and not yet in a released build. To run these GGUFs, build llama.cpp from PR #24523:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24523/head:minimax-m3
git checkout minimax-m3
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-cli llama-server

Then run a quant. The model is large (~428B parameters), so either offload the layers across GPUs by adding -ngl 99 to the command, or keep the weights in CPU RAM:

./build/bin/llama-cli -hf unsloth/MiniMax-M3-GGUF:UD-IQ1_M

Note: MiniMax Sparse Attention is not supported yet, so inference falls back to dense attention.
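The build step above also produces llama-server, which the text does not otherwise show in use. The following is a minimal sketch of the two placements mentioned above (all layers offloaded with -ngl 99, or weights kept in CPU RAM). The helper name m3_server_cmd, the choice of -ngl 0 for the CPU case, and port 8080 are assumptions for illustration, not part of the original instructions. The helper only prints the command so it can be reviewed before running.

```sh
# Sketch: print a llama-server command line for a GPU or CPU placement.
# MODEL reuses the quant tag from the llama-cli command above.
MODEL="unsloth/MiniMax-M3-GGUF:UD-IQ1_M"

# m3_server_cmd is a hypothetical helper, not part of llama.cpp.
#   gpu -> -ngl 99 (offload all layers to the available GPUs)
#   cpu -> -ngl 0  (assumed here to mean "keep the weights in CPU RAM")
m3_server_cmd() {
  case "$1" in
    gpu) ngl=99 ;;
    cpu) ngl=0 ;;
    *) echo "usage: m3_server_cmd gpu|cpu" >&2; return 1 ;;
  esac
  # Port 8080 is an arbitrary choice for this sketch.
  echo "./build/bin/llama-server -hf $MODEL -ngl $ngl --port 8080"
}

m3_server_cmd gpu
m3_server_cmd cpu
```

Printing rather than executing keeps the sketch safe to run on a machine that does not have the PR build or enough memory for the model.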


Highlights:

  • Native Multimodality: M3 undergoes mixed-modality training from the very first step, enabling deeper semantic fusion across text, image, and video.
  • Context Scaling via Sparse Attention: M3 introduces MiniMax Sparse Attention (MSA) to improve long context efficiency. M3 delivers 9× prefill and 15× decode speedups compared to M2 at 1M context, reducing per-token compute to 1/20.
  • Coding & Cowork Capability: M3 achieves frontier-level performance across long-horizon agentic benchmarks, excelling in both coding and cowork.

Model Details

Architecture | MoE + MSA (MiniMax Sparse Attention)
Total Parameters | ~428B
Activated Parameters | ~23B
Experts | 128 (4 active per token)
Layers | 60
Context Length | 1M tokens
Modalities | Text, Image, Video
Precision | bfloat16
Transformers | ≥ 4.52.4 (trust_remote_code=True)
License | MiniMax Community License
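
As a rough sizing aid, the figures above can be turned into a weights-only memory estimate. This is a back-of-the-envelope sketch: it assumes a uniform number of bits per weight and ignores the KV cache, activations, and any per-tensor mixing used by real quantization schemes. The helper names weight_gb and active_pct and the 4-bit figure are illustrative assumptions, not measurements of any published quant.

```sh
# Sketch: weights-only memory estimate from parameter count and bits per weight.
# weight_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT -> size in GB (10^9 bytes)
weight_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b / 8 }'
}

# active_pct ACTIVE_B TOTAL_B -> share of parameters used per token, in percent
active_pct() {
  awk -v a="$1" -v t="$2" 'BEGIN { printf "%.1f\n", a / t * 100 }'
}

echo "bf16 weights:      $(weight_gb 428 16) GB"   # 428e9 params * 2 bytes
echo "4-bit (uniform):   $(weight_gb 428 4) GB"    # hypothetical uniform 4-bit quant
echo "active per token:  $(active_pct 23 428) %"   # ~23B of ~428B
```

With the approximate counts from the table, the bf16 weights alone come to about 856 GB, which is why quantized builds and multi-GPU or CPU offload are the practical options for local use.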

How to Use

M3 supports two reasoning modes:

  • thinking — for complex reasoning, agentic tasks, and long-horizon collaboration.
  • non-thinking — for latency-sensitive scenarios such as chat and code completion.

Local Deployment

Download the model:

hf download MiniMaxAI/MiniMax-M3 --local-dir MiniMax-M3

You can also get model weights from ModelScope.

Inference Parameters

We recommend the following parameters for best performance: temperature=1.0, top_p=0.95, top_k=40. Default system prompt:

You are a helpful assistant. Your name is MiniMax-M3 and was built by MiniMax.
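
For the GGUF build, the recommended parameters could be passed to llama-cli roughly as follows. This is a sketch under assumptions: it reuses the quant tag from the llama.cpp instructions earlier, uses the standard llama.cpp sampling flags --temp, --top-p, and --top-k, and assumes the build in use accepts a system prompt via -sys. The helper name m3_cli_cmd is hypothetical, and it only prints the command rather than running it.

```sh
# Sketch: assemble a llama-cli command with the recommended sampling settings.
MODEL="unsloth/MiniMax-M3-GGUF:UD-IQ1_M"
SAMPLING_ARGS="--temp 1.0 --top-p 0.95 --top-k 40"
SYSTEM_PROMPT="You are a helpful assistant. Your name is MiniMax-M3 and was built by MiniMax."

# m3_cli_cmd is a hypothetical helper; it prints the command for review.
# -sys is assumed to be available in the llama-cli build being used.
m3_cli_cmd() {
  printf '%s\n' "./build/bin/llama-cli -hf $MODEL $SAMPLING_ARGS -sys \"$SYSTEM_PROMPT\""
}

m3_cli_cmd
```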

Want more deterministic results?