This leaderboard tracks the latest public benchmark results for SOTA models in a similar pricing tier and feature set to the flash/mini series, along with task-specific models for OCR and object detection. Models in this range are tuned to deliver the most performance at the fastest speed while keeping cost low at scale. Each model is compared head-to-head across nine benchmarks: OCRBench V2, olmOCR, RefCOCO, VoxPopuli-Cleaned-AA, SOB Value, Spider-2.0-Lite, GPQA Diamond, MMMLU, and MMMU-Pro. The data comes from independently run evaluations and from the model providers.
Each chart shows the top five models on a single benchmark; the leader is always on the left.
Every benchmark, every model, every score in a single table.
| Benchmark | Interfaze | Gemini-3-Flash | Claude-Sonnet-4.6 | GPT-5.4-Mini | Grok-4.3 |
|---|---|---|---|---|---|
| OCRBench V2 (Native OCR) | **70.7%** | 55.8% | 54.7% | 52.7% | 54.7% |
| olmOCR (Complex document processing) | **85.7%** | 75.3% | 73.9% | 80.1% | 81.9% |
| RefCOCO (Object detection, NL prompts) | **82.1%** | 75.2% | 75.5% | 67.0% | 25.0% |
| VoxPopuli-Cleaned-AA (ASR; WER, lower is better) | **2.4%** | 4.0% | — | — | — |
| SOB Value Acc (Structured output) | **79.5%** | 77.3% | 77.9% | 75.1% | 78.4% |
| Spider-2.0-Lite (Text-to-SQL) | **52.9%** | 45.2% | 49.6% | 26.7% | 45.9% |
| GPQA Diamond (PhD-level problem solving) | **89.9%** | 88.5% | **89.9%** | 82.8% | 73.6% |
| MMMLU (Multilingual Q&A) | **90.9%** | 88.7% | 84.9% | 75.3% | 89.7% |
| MMMU-Pro (Multimodal understanding) | **71.1%** | 67.6% | 46.3% | 40.4% | 68.7% |
Bold cells mark the leader for each benchmark. Models without native audio show '—' on VoxPopuli.
One full-width chart per benchmark with the task it measures and what the score actually means.
Native OCR
Reading text directly from images: multilingual scripts, low-quality scans, handwriting, structured layouts, charts, and screenshots. The benchmark that matters when the document never reaches a parser.
Complex document processing
End-to-end document understanding on long, layout-rich PDFs with tables, footnotes, equations, headers, and multi-column flows. Tests whether the model preserves reading order, not just characters.
Includes general-purpose LLMs and purpose-built OCR systems. * Self-reported score from the model's own announcement.
Object detection (NL prompts)
Visual grounding: given a free-form natural-language description, the model must return the exact bounding box of the object referenced — not classify it, but locate it.
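Scoring follows the standard RefCOCO convention: a predicted box counts as correct when its intersection-over-union (IoU) with the annotated box is at least 0.5. Here is a minimal sketch of that check; the boxes and the prompt in the comments are invented for illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Hypothetical example: model's box for "the dog on the left" vs. ground truth.
pred = (48, 120, 210, 305)
gold = (52, 118, 205, 300)
print(iou(pred, gold) >= 0.5)  # True -> counts as a hit toward the score above
```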
ASR (speech recognition)
Speech-to-text Word Error Rate on the cleaned Audio-AA subset of VoxPopuli (multilingual European Parliament speeches). Lower is better. Only models with native audio input are scored.
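WER counts word-level substitutions, deletions, and insertions against the reference transcript, divided by the reference length, so a 2.4% score means roughly 2.4 errors per 100 words. A minimal sketch of the standard dynamic-programming computation (the example strings are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Edit distance over words via the usual DP table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the session is now open", "the session now open"))  # 0.2 -> 20% WER
```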
Structured output
SOB Value (Structured Output Benchmark — Value Accuracy). Measures whether extracted leaf values are exactly correct, not just whether the JSON parses.
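The exact SOB scoring pipeline isn't reproduced here, but the leaf-value idea is easy to sketch: flatten both the gold and predicted JSON into path-to-value pairs and count exact matches. The invoice example below is invented:

```python
import json

def leaf_values(obj, path=()):
    """Flatten parsed JSON into (path, leaf value) pairs."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from leaf_values(v, path + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from leaf_values(v, path + (i,))
    else:
        yield path, obj

def value_accuracy(gold_json: str, pred_json: str) -> float:
    """Fraction of gold leaf values the prediction reproduces exactly."""
    gold = dict(leaf_values(json.loads(gold_json)))
    pred = dict(leaf_values(json.loads(pred_json)))
    hits = sum(1 for path, v in gold.items() if pred.get(path) == v)
    return hits / len(gold)

gold = '{"invoice": {"total": 1042.50, "currency": "EUR"}}'
pred = '{"invoice": {"total": 1042.50, "currency": "USD"}}'  # parses fine, one wrong value
print(value_accuracy(gold, pred))  # 0.5
```

This is the distinction the benchmark name points at: the second JSON above would pass a schema check, yet only half its values are right.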
Text-to-SQL
Natural-language to SQL on real warehouse-scale schemas. The lite track focuses on multi-step queries against a single database, where the model has to pick the right tables, joins, and filters.
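Spider-style leaderboards typically score text-to-SQL by execution accuracy: run the predicted query and the gold query, then compare result sets. A minimal sketch against a local SQLite copy of the database; this setup is an assumption for illustration (Spider 2.0 itself targets cloud warehouses), and db_path, gold_sql, and pred_sql are placeholder parameters:

```python
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Execution accuracy: the prediction is correct if it returns the same rows as gold."""
    with sqlite3.connect(db_path) as conn:
        gold_rows = conn.execute(gold_sql).fetchall()
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # a query that fails to run scores zero
    # Order-insensitive comparison; ordering only matters when the question asks for it.
    return sorted(map(repr, gold_rows)) == sorted(map(repr, pred_rows))
```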
PhD-level problem solving
The hardest split of GPQA across physics, chemistry, biology, and graduate-level reasoning. Designed so domain experts score ~65% and laypeople ~30% even with web access.
Multilingual Q&A
Massively Multilingual MMLU. Knowledge and reasoning across 14 languages — exposes which models actually generalize beyond English.
Multimodal understanding
Hard subset of MMMU for college-level multimodal problems. Removes shortcut-prone questions to isolate true vision-language reasoning over diagrams, charts, and figures.