| Rank | Model | Overall | Value Acc | Faithfulness | JSON Pass | Path Recall | Structure | Type Safety | Perfect |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | 87.0% | 79.8% | 86.9% | 99.3% | 98.8% | 98.1% | 99.3% | 46.9% |
| 2 | Gemini-3.1-Pro | 86.9% | 82.0% | 87.6% | 96.6% | 96.0% | 95.8% | 96.6% | 54.2% |
| 3 | GLM-5.1 | 86.6% | 80.6% | 87.2% | 97.5% | 96.9% | 96.7% | 97.5% | 49.8% |
| 4 | Claude-Opus-4.7 | 86.4% | 78.7% | 87.7% | 99.3% | 98.8% | 98.3% | 99.3% | 42.4% |
| 5 | GLM-4.7 | 86.1% | 80.4% | 86.8% | 96.5% | 95.9% | 95.7% | 96.5% | 50.8% |
| 6 | Qwen3.5-35B | 86.1% | 80.1% | 86.3% | 96.9% | 96.2% | 96.0% | 96.9% | 50.0% |
| 7 | GPT-5.5 | 86.0% | 79.5% | 86.8% | 97.8% | 97.1% | 96.8% | 97.8% | 46.4% |
| 8 | Gemini-2.5-Flash | 86.0% | 79.6% | 85.6% | 97.2% | 96.7% | 96.1% | 97.2% | 49.8% |
| 9 | Qwen3-235B | 85.7% | 78.6% | 85.4% | 97.8% | 97.0% | 96.8% | 97.8% | 46.3% |
| 10 | Interfaze-Beta | 85.5% | 79.5% | 85.8% | 96.7% | 96.2% | 95.7% | 96.7% | 48.0% |
| 11 | Claude-Sonnet-4.6 | 85.4% | 77.9% | 85.8% | 97.9% | 97.5% | 96.9% | 97.9% | 44.2% |
| 12 | Claude-Opus-4.6 | 85.3% | 77.9% | 86.0% | 97.7% | 97.3% | 96.8% | 97.7% | 43.7% |
| 13 | DeepSeek-V4-Pro | 85.3% | 79.6% | 85.8% | 96.0% | 95.2% | 95.3% | 96.0% | 49.0% |
| 14 | Kimi-2.6 | 85.3% | 79.1% | 85.6% | 96.4% | 95.8% | 95.4% | 96.4% | 48.2% |
| 15 | GPT-4.1 | 85.0% | 78.3% | 85.3% | 96.9% | 96.3% | 95.9% | 96.9% | 45.4% |
| 16 | GPT-5 | 84.9% | 76.9% | 85.9% | 98.3% | 97.8% | 97.2% | 98.3% | 39.8% |
| 17 | Gemma-3-27B | 84.7% | 77.7% | 84.2% | 96.9% | 96.1% | 95.8% | 96.9% | 45.4% |
| 18 | Qwen3-30B | 84.2% | 75.3% | 83.2% | 98.3% | 97.4% | 97.0% | 98.3% | 40.1% |
| 19 | Nemotron-3-Nano-30B | 84.1% | 74.7% | 81.7% | 98.7% | 97.5% | 97.1% | 98.7% | 40.0% |
| 20 | GPT-5-Mini | 83.5% | 75.1% | 83.7% | 97.2% | 96.6% | 96.0% | 97.2% | 38.8% |
| 21 | Gemma-4-31B | 83.3% | 77.8% | 84.3% | 94.3% | 93.4% | 93.4% | 94.3% | 46.1% |
| 22 | Gemini-3-Flash-Preview | 83.3% | 77.3% | 83.1% | 93.9% | 93.5% | 92.9% | 93.9% | 48.4% |
| 23 | Schematron-8B | 83.2% | 73.1% | 80.7% | 98.7% | 97.6% | 96.9% | 98.7% | 37.0% |
| 24 | IBM-Granite-4.0 | 83.2% | 73.6% | 81.2% | 98.3% | 96.5% | 96.7% | 98.3% | 38.1% |
| 25 | Phi-4 | 83.1% | 78.7% | 84.9% | 96.9% | 96.1% | 96.1% | 96.9% | 45.2% |
| 26 | DS-R1-Distill-32B | 82.7% | 74.7% | 81.9% | 96.0% | 94.5% | 94.7% | 96.0% | 41.1% |
| 27 | Ministral-3-14B | 77.8% | 70.0% | 77.3% | 90.6% | 89.8% | 89.6% | 90.6% | 36.8% |
| 28 | GPT-OSS-20B | 73.2% | 66.7% | 73.0% | 84.5% | 83.8% | 83.6% | 84.5% | 36.2% |
Models are sorted by the difficulty-weighted average across all seven metrics (28 models on text and image, 27 on audio; Phi-4 is excluded from audio due to its 16K context limit). All runs use temperature 0.0, a 2,048-token output cap, and no reasoning/thinking, so scores reflect pure structured-output capability.
The following models cannot have reasoning fully turned off, so they ran in their lowest-reasoning configuration. That gives them a small reasoning advantage the others do not get, yet several non-reasoning models still beat them on Value Accuracy.
| Model | Why reasoning can't be fully disabled |
|---|---|
| GPT-5, GPT-5-Mini | API only exposes a minimum reasoning effort, not a full disable. |
| Gemini-3.1-Pro, Gemini-3-Flash-Preview | Thinking is built in and can be set to its lowest budget but not switched off. |
| DS-R1-Distill-32B | Chain-of-thought is intrinsic to the model, baked in during distillation. |
Structural metrics (JSON Pass, Path Recall, Structure Coverage, Type Safety) cluster near the ceiling. Value Accuracy and Perfect Response are where the real differences appear.
The same model scores very differently across text, image, and audio, even when every model gets the same text-normalized context. Audio is the hardest by far: transcripts are long (~7,300 tokens on average) and full of overlapping speakers, so models struggle to pull out the right values.
Best Value Accuracy by modality across all valid models:
| Modality | Best Value Accuracy | Leader |
|---|---|---|
| Text | 84.5% | Gemini-3.1-Pro |
| Image | 67.2% | Gemma-4-31B |
| Audio | 23.7% | Gemini-2.5-Flash |
- **Text** (HotpotQA passages): top-tier models cluster within ~5 points of each other.
- **Image** (olmOCR-bench documents normalized to text): the spread widens.
- **Audio** (AMI multi-speaker meetings): scores collapse and the ranking reshuffles entirely.
No single model wins all three. GPT-5.4 ranks 5th on text but 13th on images. Schematron-8B ranks 26th on text but 15th on images. Gemma-4-31B ranks 18th on text but 1st on images.
The single most important view: most models clear 95%+ on JSON Pass, but Value Accuracy sits 15 to 30 points lower. That gap is the space where schema-only benchmarks have been lying to us.
Each chart re-sorts all 28 models on that single metric. Each x-axis starts from a floor appropriate to that metric so the top cluster doesn't look identical.
- **Value Accuracy** — exact leaf-value match against the verified ground truth. The metric production systems care about.
- **Faithfulness** — how often values are grounded in the source context instead of hallucinated.
- **JSON Pass** — whether the response is parseable JSON. Almost every modern model clears 95%+, which is why pass-rate-only benchmarks can't separate them anymore.
- **Path Recall** — whether all required keys appear in the output.
- **Structure Coverage** — whether nested objects and arrays are present with the correct shape.
- **Type Safety** — whether leaf values respect the declared JSON Schema types (no strings where numbers are expected).
- **Perfect Response** — the fraction of records where every single leaf value is exactly right. The hardest metric: it collapses to roughly half even for the best models.
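The leaf-level scoring above can be sketched in a few lines. This is a minimal illustration, assuming dot-separated leaf paths and exact equality; the harness's actual path and normalization rules may differ:

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into a {dot.path: leaf_value} mapping."""
    if isinstance(obj, dict):
        out = {}
        for key, val in obj.items():
            out.update(flatten(val, f"{prefix}{key}."))
        return out
    if isinstance(obj, list):
        out = {}
        for i, val in enumerate(obj):
            out.update(flatten(val, f"{prefix}{i}."))
        return out
    return {prefix.rstrip("."): obj}

def score_record(pred, truth):
    """Leaf-level metrics for one record (illustrative names)."""
    p, t = flatten(pred), flatten(truth)
    # Path Recall: required paths that appear in the output.
    path_recall = sum(k in p for k in t) / len(t)
    # Value Accuracy: exact leaf match; missing paths count as wrong.
    correct = sum(k in p and p[k] == t[k] for k in t)
    value_acc = correct / len(t)
    # Perfect Response: every leaf exactly right.
    perfect = correct == len(t)
    return {"path_recall": path_recall, "value_acc": value_acc, "perfect": perfect}
```

For example, a prediction that gets one of two leaves right scores `path_recall` 1.0, `value_acc` 0.5, `perfect` False.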
- **Pick by metric, not by overall.** The top six models are within 1 point overall but trade leadership across metrics. Choose the model that wins on the metric your workload depends on.
- **JSON Pass is table stakes.** Every frontier model clears 95%+. The interesting question is what happens after parse: Value Accuracy, Faithfulness, and Perfect Response.
- **Modality matters more than size.** A 35B open model can beat a frontier proprietary model on text and lose on audio. Test on your input distribution.
- **Schema-constrained decoding isn't a free win.** Forcing the schema at decode time helps JSON Pass for some models and hurts it for others, while Value Accuracy barely moves. It doesn't fix the value-extraction gap.
SOB scores three modalities (text, image, audio) on the same harness. Image and audio records are converted to text-normalized context before scoring so the score isolates structured-output capability from raw vision or speech-processing quality.
| Modality | Source dataset | Eval records |
|---|---|---|
| Text | HotpotQA context passages | 5,000 |
| Image | olmOCR-bench documents | 209 |
| Audio | AMI Meeting Corpus conversations | 115 |
- **Hardening gate:** if JSON parsing fails, all downstream semantic metrics are zeroed for that record.
- **Coverage gate:** Value Accuracy is scored over every ground-truth path, with missing paths counting as wrong, so a model can't inflate its score by returning only the fields it is confident about.
- **Complexity weighting:** schemas are tagged easy, medium, or hard, and the final leaderboard is schema-complexity-weighted (easy = 1.0, medium = 2.0, hard = 3.0), so harder schemas contribute more.
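Both gates and the complexity weighting can be combined into a short scoring sketch. Function names and the flat-object assumption are illustrative, not the actual harness:

```python
import json

COMPLEXITY_WEIGHT = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def gated_score(raw_output, truth_leaves, difficulty):
    """Score one record; `truth_leaves` maps flattened paths to values."""
    weight = COMPLEXITY_WEIGHT[difficulty]
    # Hardening gate: a parse failure zeroes the semantic metrics.
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"json_pass": 0.0, "value_acc": 0.0, "weight": weight}
    if not isinstance(pred, dict):  # sketch simplification: expect a flat object
        return {"json_pass": 1.0, "value_acc": 0.0, "weight": weight}
    # Coverage gate: iterate over ALL ground-truth paths, so a field
    # the model omitted counts as wrong instead of being skipped.
    missing = object()  # sentinel that never equals a real value
    correct = sum(pred.get(path, missing) == value
                  for path, value in truth_leaves.items())
    return {"json_pass": 1.0,
            "value_acc": correct / len(truth_leaves),
            "weight": weight}

def weighted_value_accuracy(records):
    """Schema-complexity-weighted mean Value Accuracy across records."""
    total = sum(r["weight"] for r in records)
    return sum(r["value_acc"] * r["weight"] for r in records) / total
```

An unparseable record thus drags down every metric at three times the weight of an easy record if its schema was tagged hard.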
For the full methodology, scoring details, and analysis, read the blog post introducing SOB.
If you're benchmarking a new model, open a PR with the metric breakdown and we'll add it to the next refresh.