This leaderboard tracks the latest public benchmark results for SOTA models in a similar pricing tier and feature set to the flash/mini series, along with task-specific models for OCR and object detection. Models in this range are tuned to deliver the most performance at the fastest speed while keeping cost low at scale. Each model is compared head-to-head across nine benchmarks: OCRBench V2, olmOCR, RefCOCO, VoxPopuli-Cleaned-AA, SOB Value, Spider-2.0-Lite, GPQA Diamond, MMMLU, and MMMU-Pro. The data comes from independently run evaluations and from the model providers.
Each chart shows the top five models on a single benchmark; the leader is always on the left.
Every benchmark, every model, every score in a single table.
| Benchmark | Interfaze | Gemini-3-Flash | Claude-Sonnet-4.6 | GPT-5.4-Mini | Grok-4.3 |
|---|---|---|---|---|---|
| OCRBench V2 (Native OCR) | **70.7%** | 55.8% | 54.7% | 52.7% | 54.7% |
| olmOCR (Complex document processing) | **85.7%** | 75.3% | 73.9% | 80.1% | 81.9% |
| RefCOCO (Object detection, NL prompts) | **82.1%** | 75.2% | 75.5% | 67.0% | 25.0% |
| VoxPopuli-Cleaned-AA (ASR; WER, lower is better) | **2.4%** | 4.0% | — | — | — |
| SOB Value Acc (Structured output) | **79.5%** | 77.3% | 77.9% | 75.1% | 78.4% |
| Spider-2.0-Lite (Text-to-SQL) | **52.9%** | 45.2% | 49.6% | 26.7% | 45.9% |
| GPQA Diamond (PhD-level problem solving) | **89.9%** | 88.5% | **89.9%** | 82.8% | 73.6% |
| MMMLU (Multilingual Q&A) | **90.9%** | 88.7% | 84.9% | 75.3% | 89.7% |
| MMMU-Pro (Multimodal understanding) | **71.1%** | 67.6% | 46.3% | 40.4% | 68.7% |
Bold cells mark the leader for each benchmark. Models without native audio show '—' on VoxPopuli.
One full-width chart per benchmark with the task it measures and what the score actually means.
Native OCR
Reading text directly from images: multilingual scripts, low-quality scans, handwriting, structured layouts, charts, and screenshots. The benchmark that matters when the document never reaches a parser.
Complex document processing
End-to-end document understanding on long, layout-rich PDFs with tables, footnotes, equations, headers, and multi-column flows. Tests whether the model preserves reading order, not just characters.
Includes general-purpose LLMs and purpose-built OCR systems. * Self-reported score from the model's own announcement.
Object detection (NL prompts)
Visual grounding: given a free-form natural-language description, the model must return the exact bounding box of the object referenced — not classify it, but locate it.
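Scoring follows the standard RefCOCO convention: a predicted box counts as correct when its intersection-over-union (IoU) with the annotated box is at least 0.5. Here is a minimal sketch of that check; the boxes and the prompt in the comments are invented for illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Hypothetical example: model's box for "the dog on the left" vs. ground truth.
pred = (48, 120, 210, 305)
gold = (52, 118, 205, 300)
print(iou(pred, gold) >= 0.5)  # True -> counts as a hit toward the score above
```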
ASR (speech recognition)
Speech-to-text Word Error Rate on the cleaned Audio-AA subset of VoxPopuli (multilingual European Parliament speeches). Lower is better. Only models with native audio input are scored.
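WER counts word-level substitutions, deletions, and insertions against the reference transcript, divided by the reference length, so a 2.4% score means roughly 2.4 errors per 100 words. A minimal sketch of the standard dynamic-programming computation (the example strings are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Edit distance over words via the usual DP table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the session is now open", "the session now open"))  # 0.2 -> 20% WER
```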
Structured output
SOB Value (Structured Output Benchmark — Value Accuracy). Measures whether extracted leaf values are exactly correct, not just whether the JSON parses.
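The exact SOB scoring pipeline isn't reproduced here, but the leaf-value idea is easy to sketch: flatten both the gold and predicted JSON into path-to-value pairs and count exact matches. The invoice example below is invented:

```python
import json

def leaf_values(obj, path=()):
    """Flatten parsed JSON into (path, leaf value) pairs."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from leaf_values(v, path + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from leaf_values(v, path + (i,))
    else:
        yield path, obj

def value_accuracy(gold_json: str, pred_json: str) -> float:
    """Fraction of gold leaf values the prediction reproduces exactly."""
    gold = dict(leaf_values(json.loads(gold_json)))
    pred = dict(leaf_values(json.loads(pred_json)))
    hits = sum(1 for path, v in gold.items() if pred.get(path) == v)
    return hits / len(gold)

gold = '{"invoice": {"total": 1042.50, "currency": "EUR"}}'
pred = '{"invoice": {"total": 1042.50, "currency": "USD"}}'  # parses fine, one wrong value
print(value_accuracy(gold, pred))  # 0.5
```

This is the distinction the benchmark name points at: the second JSON above would pass a schema check, yet only half its values are right.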
Text-to-SQL
Natural-language to SQL on real warehouse-scale schemas. The lite track focuses on multi-step queries against a single database, where the model has to pick the right tables, joins, and filters.
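Spider-style leaderboards typically score text-to-SQL by execution accuracy: run the predicted query and the gold query, then compare result sets. A minimal sketch against a local SQLite copy of the database; this setup is an assumption for illustration (Spider 2.0 itself targets cloud warehouses), and db_path, gold_sql, and pred_sql are placeholder parameters:

```python
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Execution accuracy: the prediction is correct if it returns the same rows as gold."""
    with sqlite3.connect(db_path) as conn:
        gold_rows = conn.execute(gold_sql).fetchall()
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # a query that fails to run scores zero
    # Order-insensitive comparison; ordering only matters when the question asks for it.
    return sorted(map(repr, gold_rows)) == sorted(map(repr, pred_rows))
```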
PhD-level problem solving
The hardest split of GPQA across physics, chemistry, biology, and graduate-level reasoning. Designed so domain experts score ~65% and laypeople ~30% even with web access.
Multilingual Q&A
Massively Multilingual MMLU. Knowledge and reasoning across 14 languages — exposes which models actually generalize beyond English.
Multimodal understanding
Hard subset of MMMU for college-level multimodal problems. Removes shortcut-prone questions to isolate true vision-language reasoning over diagrams, charts, and figures.