Mellum2 12B A2.5B Thinking

Mellum2 12B A2.5B Thinking by JetBrains, a text-generation model. Understand and compare features, benchmarks, and capabilities.

Comparison

| Feature | Mellum2 12B A2.5B Thinking | Interfaze |
| --- | --- | --- |
| Input Modalities | text | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | 131.1K | 1M |
| Tool Calling | Yes | Tool calling supported + built in browser, code execution and web search |

Scaling

| Feature | Mellum2 12B A2.5B Thinking | Interfaze |
| --- | --- | --- |
| Scaling | Self-hosted/Provider-hosted with quantization | Unlimited |

View model card on Hugging Face

Note: Use this model when you want explicit chain-of-thought before the final answer: complex debugging, multi-step planning, agentic workflows, and math- or reasoning-heavy tasks. For direct, low-latency answers without reasoning traces, use Instruct instead.

Mellum2 Thinking Highlights

Mellum 2 Thinking is a post-trained, reasoning-augmented assistant model from JetBrains.

The model uses a Mixture-of-Experts architecture with 64 experts and activates 8 experts per token. It uses a combination of sliding-window and full attention layers, with a context length of 131,072 tokens.
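As a rough illustration of what "activates 8 experts per token" means, the sketch below performs generic top-k routing over 64 router scores in plain Python. It is not JetBrains' implementation: the softmax over the selected experts only, the random router scores, and the function name `route_token` are assumptions made for this example.

```python
import math
import random

NUM_EXPERTS = 64   # from the model description
TOP_K = 8          # experts activated per token


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs.

    Assumption: weights are a softmax over the selected logits only; real
    models differ in how they normalize.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    # Indices of the top_k largest logits, highest first.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Random scores stand in for a learned router's output.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print(len(selected), "of", NUM_EXPERTS, "experts active")
```

Only the selected experts' feed-forward blocks would run for that token, which is why the active parameter count is much smaller than the total.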

It is produced from Mellum2-12B-A2.5B-Base by supervised fine-tuning (loss computed only on the final assistant turn) followed by reinforcement learning with verifiable rewards (RLVR) on a harder data mix that includes a long-form math subset. The model emits its reasoning inside <think>...</think> blocks before the final answer.
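When the raw completion is consumed directly, without a server-side reasoning parser, the `<think>...</think>` block arrives inline with the answer. A minimal sketch for separating the two parts follows. `split_reasoning` is a hypothetical helper; it assumes at most one think block and treats a missing closing tag as truncated reasoning.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output."""
    match = THINK_RE.search(text)
    if match:
        reasoning = match.group(1).strip()
        answer = text[match.end():].strip()
        return reasoning, answer
    if "<think>" in text:
        # Opening tag but no close: output was likely cut off mid-reasoning.
        return text.split("<think>", 1)[1].strip(), ""
    # No reasoning block at all: everything is the answer.
    return "", text.strip()


raw = "<think>1024 = 2 ** 10, so it is a power of 2.</think>\nYes, 1024 is 2^10."
reasoning, answer = split_reasoning(raw)
print(answer)
```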

Mellum2 Model Family

This repository contains one checkpoint from the Mellum 2 family.

| Checkpoint | Description |
| --- | --- |
| Base Pretrain | Base checkpoint before long-context extension |
| Base | Final base model |
| Instruct SFT | Supervised instruction-tuned checkpoint |
| Thinking SFT | Supervised thinking checkpoint |
| Instruct | RL-tuned instruction model |
| Thinking | RL-tuned thinking model |

Model Overview

Mellum2 Thinking has the following features:

  • Number of Layers: 28
  • Hidden Size: 2304
  • Intermediate Size: 7168
  • MoE Intermediate Size: 896
  • Number of Experts: 64
  • Number of Activated Experts: 8
  • Number of Attention Heads (GQA): 32 for Q and 4 for KV
  • Context Length: 131,072
  • Sliding Window: 1,024
  • Vocabulary Size: 98,304
  • Precision: bfloat16
  • License: Apache 2.0
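
A few ratios follow directly from the numbers listed above. The sketch below records them in a small dataclass and derives the grouped-query sharing factor, the fraction of experts active per token, and how many sliding windows span the full context. `Mellum2ThinkingSpec` is a name chosen for this example, not a class from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mellum2ThinkingSpec:
    # Values copied from the feature list above.
    num_layers: int = 28
    hidden_size: int = 2304
    num_experts: int = 64
    num_active_experts: int = 8
    num_q_heads: int = 32
    num_kv_heads: int = 4
    context_length: int = 131_072
    sliding_window: int = 1_024

    @property
    def q_heads_per_kv_head(self) -> int:
        # GQA: each key/value head is shared by this many query heads.
        return self.num_q_heads // self.num_kv_heads

    @property
    def active_expert_fraction(self) -> float:
        # Share of the experts that run for any single token.
        return self.num_active_experts / self.num_experts

    @property
    def context_to_window_ratio(self) -> int:
        # How many sliding windows fit into the full context.
        return self.context_length // self.sliding_window


spec = Mellum2ThinkingSpec()
print(spec.q_heads_per_kv_head, spec.active_expert_fraction, spec.context_to_window_ratio)
# prints: 8 0.125 128
```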

Serving with vLLM

Serve the model with reasoning parsing:

vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking \
  --max-model-len 131072 \
  --reasoning-parser qwen3

To also enable tool calling, add the tool-call flags:

vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking \
  --max-model-len 131072 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
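
With the tool-calling variant of the command running, a tool-enabled request can be sent to the server's OpenAI-compatible chat endpoint. The sketch below builds the JSON payload with the standard library only. The `get_weather` tool, its schema, the helper names, and the `http://localhost:8000/v1` base URL are illustrative assumptions, not part of the model card.

```python
import json
import urllib.request

MODEL = "JetBrains/Mellum2-12B-A2.5B-Thinking"

# Hypothetical tool definition in the OpenAI function-calling format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_payload(user_text):
    """Assemble a chat-completions request body that offers one tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def send(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the server (assumed base URL) and return parsed JSON."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_payload("What is the weather in Amsterdam?")
print(json.dumps(payload, indent=2))
# send(payload) would perform the request; it needs a running server.
```

If the model decides to call the tool, the call should appear under `choices[0].message.tool_calls` in the response rather than in the message content.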

Quickstart

Text-Only Input

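from openai import OpenAI

# OpenAI() reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment.
# To target a self-hosted vLLM server, point OPENAI_BASE_URL at its /v1 endpoint.
client = OpenAI()

messages = [
    {"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."},
]

chat_response = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Thinking",
    messages=messages,
    max_tokens=81920,  # generous budget so long reasoning traces are not cut off
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,  # not an OpenAI parameter; sent as an extra field in the request body
    },
)
print("Chat response:", chat_response)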
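
The request above reserves 81,920 tokens for generation out of a 131,072-token context, which bounds how long the prompt can be. A small sketch of that arithmetic follows. `max_prompt_tokens` is a hypothetical helper, and it assumes the server requires prompt tokens plus `max_tokens` to fit within the served context length.

```python
CONTEXT_LENGTH = 131_072   # model context length from this page
MAX_TOKENS = 81_920        # generation budget used in the request above


def max_prompt_tokens(context_length, max_tokens):
    """Largest prompt that still leaves room for max_tokens of output."""
    if max_tokens > context_length:
        raise ValueError("max_tokens exceeds the context length")
    return context_length - max_tokens


budget = max_prompt_tokens(CONTEXT_LENGTH, MAX_TOKENS)
print(budget)  # 49152
```

For longer prompts, lowering `max_tokens` frees up the corresponding amount of context.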

Evaluation

Post-training evaluation for the thinking/reasoning variants. All values are percentages; higher is better except HarmBench, where lower is better. All values self-reported by JetBrains.

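| Benchmark | Mellum2 Thinking SFT | Mellum2 Thinking | Qwen3.5 (4B) | Qwen3.5 (9B) | OLMo-3 (7B) | Ministral 3 (14B) |
| --- | --- | --- | --- | --- | --- | --- |
| Coding | | | | | | |
| LiveCodeBench v6 | 75.1 | 69.9 | 59.4 | 68.3 | 59.8 | 42.7 |
| Tool Use | | | | | | |
| BFCL v4 | 38.8 | 45.6 | 42.9 | 42.7 | n/a | 35.9 |
| BFCL v3 | 60.5 | 69.4 | 73.9 | 68.5 | n/a | 52.2 |
| Math | | | | | | |
| AIME | 20.0 | 58.4 | 68.3 | 73.4 | 61.7 | 38.3 |
| GSM-Plus | 62.6 | 87.0 | 89.3 | 90.7 | 88.1 | 86.5 |
| Knowledge | | | | | | |
| MMLU-Redux | 84.8 | 86.2 | 88.3 | 91.7 | 71.3 | 84.4 |
| GPQA Diamond | 39.9 | 57.6 | 76.8 | 81.3 | 29.3 | 46.0 |
| Conversational | | | | | | |
| IFEval | 69.1 | 76.5 | 87.1 | 89.8 | 84.7 | 59.7 |
| JetBrains pairwise | 64.4 | 69.5 | 40.5 | 56.7 | 32.2 | 63.8 |
| MixEval | 63.4 | 66.9 | 71.9 | 76.0 | 67.0 | 70.8 |
| BS-Bench | 14.0 | 15.0 | 63.0 | 70.0 | 23.0 | 9.0 |
| Safety | | | | | | |
| HarmBench (↓) | 12.2 | 20.6 | 15.9 | 6.6 | 48.7 | 70.0 |
| XSTest | 90.8 | 89.6 | 96.8 | 97.6 | 93.2 | 96.8 |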

Notes:

  • AIME is the mean of AIME 2025 and AIME 2026 (30 questions each).
  • BFCL v4 is the macro-average of five subtasks: v1, v2, v3, web search, memory.
  • JetBrains pairwise is win rate against Qwen2.5-7B-Instruct on an internal benchmark.
  • A missing Tool Use value indicates that the model lacks native tool calling (OLMo-3-7B-Thinking).
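
The two aggregate scores described in the notes are unweighted means. The sketch below shows the computation with made-up placeholder numbers; none of the inputs are reported results, and `macro_average` is a name chosen for this example.

```python
def macro_average(scores):
    """Unweighted mean of per-subtask scores (each subtask counts equally)."""
    if not scores:
        raise ValueError("no scores given")
    return sum(scores.values()) / len(scores)


# Placeholder values for illustration only, not reported results.
aime_by_year = {"AIME 2025": 50.0, "AIME 2026": 70.0}
bfcl_v4_subtasks = {
    "v1": 80.0,
    "v2": 70.0,
    "v3": 60.0,
    "web search": 20.0,
    "memory": 20.0,
}

aime_score = macro_average(aime_by_year)
bfcl_v4_score = macro_average(bfcl_v4_subtasks)
print(aime_score, bfcl_v4_score)  # 60.0 50.0
```

Because both AIME years have 30 questions, the mean of the two percentages equals the pooled accuracy over all 60 questions. A macro-average over BFCL subtasks of different sizes would generally differ from a pooled score.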

For more details, see the Mellum2 Technical Report.

License

Released under the Apache 2.0 license.
