Structured Output Benchmark (SOB) Leaderboard

Read the SOB paper on arXiv · Read the introducing SOB blog post · View the SOB dataset on Hugging Face · View the SOB benchmark on GitHub

Full breakdown

| Rank | Model | Overall | Value Acc | Faithfulness | JSON Pass | Path Recall | Structure | Type Safety | Perfect |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | 87.0% | 79.8% | 86.9% | 99.3% | 98.8% | 98.1% | 99.3% | 46.9% |
| 2 | GLM-4.7 | 86.1% | 80.4% | 86.8% | 96.5% | 95.9% | 95.7% | 96.5% | 50.8% |
| 3 | Qwen3.5-35B | 86.1% | 80.1% | 86.3% | 96.9% | 96.2% | 96.0% | 96.9% | 50.0% |
| 4 | Gemini-2.5-Flash | 86.0% | 79.6% | 85.6% | 97.2% | 96.7% | 96.1% | 97.2% | 49.8% |
| 5 | Qwen3-235B | 85.7% | 78.6% | 85.4% | 97.8% | 97.0% | 96.8% | 97.8% | 46.3% |
| 6 | Interfaze-Beta | 85.5% | 79.5% | 85.8% | 96.7% | 96.2% | 95.7% | 96.7% | 48.0% |
| 7 | Claude-Sonnet-4.6 | 85.4% | 77.9% | 85.8% | 97.9% | 97.5% | 96.9% | 97.9% | 44.2% |
| 8 | GPT-4.1 | 85.0% | 78.3% | 85.3% | 96.9% | 96.3% | 95.9% | 96.9% | 45.4% |
| 9 | GPT-5 | 84.9% | 76.9% | 85.9% | 98.3% | 97.8% | 97.2% | 98.3% | 39.8% |
| 10 | Gemma-3-27B | 84.7% | 77.7% | 84.2% | 96.9% | 96.1% | 95.8% | 96.9% | 45.4% |
| 11 | Qwen3-30B | 84.2% | 75.3% | 83.2% | 98.3% | 97.4% | 97.0% | 98.3% | 40.1% |
| 12 | Nemotron-3-Nano-30B | 84.1% | 74.7% | 81.7% | 98.7% | 97.5% | 97.1% | 98.7% | 40.0% |
| 13 | GPT-5-Mini | 83.5% | 75.1% | 83.7% | 97.2% | 96.6% | 96.0% | 97.2% | 38.8% |
| 14 | Gemma-4-31B | 83.3% | 77.8% | 84.3% | 94.3% | 93.4% | 93.4% | 94.3% | 46.1% |
| 15 | Gemini-3-Flash-Preview | 83.3% | 77.3% | 83.1% | 93.9% | 93.5% | 92.9% | 93.9% | 48.4% |
| 16 | Schematron-8B | 83.2% | 73.1% | 80.7% | 98.7% | 97.6% | 96.9% | 98.7% | 37.0% |
| 17 | IBM-Granite-4.0 | 83.2% | 73.6% | 81.2% | 98.3% | 96.5% | 96.7% | 98.3% | 38.1% |
| 18 | Phi-4 | 83.1% | 78.7% | 84.9% | 96.9% | 96.1% | 96.1% | 96.9% | 45.2% |
| 19 | DS-R1-Distill-32B | 82.7% | 74.7% | 81.9% | 96.0% | 94.5% | 94.7% | 96.0% | 41.1% |
| 20 | Ministral-3-14B | 77.8% | 70.0% | 77.3% | 90.6% | 89.8% | 89.6% | 90.6% | 36.8% |
| 21 | GPT-OSS-20B | 73.2% | 66.7% | 73.0% | 84.5% | 83.8% | 83.6% | 84.5% | 36.2% |

Overall ranking

Models are sorted by a difficulty-weighted average across all seven metrics (20 models on the audio split, where Phi-4 is excluded due to its 16K context limit). All runs use temperature 0.0, a 2,048-token output cap, and no reasoning/thinking, so the score reflects pure structured-output capability.
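
For concreteness, the decoding setup might look like this in a harness config. This is a sketch, not SOB's actual code; the parameter names follow the common OpenAI-style chat API.

```python
# Decoding settings as described above -- a sketch, not SOB's actual harness.
GENERATION_CONFIG = {
    "temperature": 0.0,   # deterministic decoding
    "max_tokens": 2048,   # cap on output length
    # Reasoning/"thinking" modes are left disabled so the score measures
    # structured-output capability rather than extra test-time compute.
}
```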

Top 5 across every metric

Structural metrics (JSON Pass, Path Recall, Structure Coverage, Type Safety) cluster near the ceiling. Value Accuracy and Perfect Response are where the real differences appear.

By modality

The same model scores very differently across text, image, and audio, even when every model gets the same text-normalized context. Audio is the hardest by far: transcripts are long (~7,300 tokens on average) and full of overlapping speakers, so models struggle to pull out the right values.

Best Value Accuracy by modality across all valid models:

| Modality | Best Value Accuracy | Leader |
|---|---|---|
| Text | 83.0% | GLM-4.7 |
| Image | 67.2% | Gemma-4-31B |
| Audio | 23.7% | Gemini-2.5-Flash |

Text

HotpotQA passages. Top-tier models cluster within ~5 points of each other.

Image

olmOCR-bench documents normalized to text. Spread widens to ~11 points; vision pretraining matters.

Audio

AMI multi-speaker meetings. Scores collapse and the ranking reshuffles entirely.

No single model wins all three. GPT-5.4 ranks 3rd on text but 9th on images. Schematron-8B ranks 19th on text but 10th on images. Gemma-4-31B ranks 11th on text but 1st on images.

JSON Pass vs Value Accuracy

The single most important view: most models clear 95%+ on JSON Pass, but Value Accuracy sits 15 to 30 points lower. That gap is the space where schema-only benchmarks have been lying to us.

Per-metric rankings

Each chart re-sorts all 21 models on that single metric. Each x-axis starts from a floor appropriate to that metric so the top cluster doesn't look identical.

Value Accuracy

Exact leaf-value match against the verified ground truth. The metric production systems care about.
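
As a rough illustration of what exact leaf-value matching involves (a sketch under our own assumptions, not SOB's published scoring code), flatten both documents into path → leaf pairs and compare:

```python
# Illustrative sketch of exact leaf-value matching; not SOB's actual code.
_MISSING = object()

def flatten(obj, prefix=""):
    """Flatten nested JSON into (path, leaf_value) pairs, e.g. ("c[0]", 2)."""
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from flatten(val, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from flatten(val, f"{prefix}[{i}]")
    else:
        yield prefix, obj

def value_accuracy(predicted: dict, truth: dict) -> float:
    gold = dict(flatten(truth))
    pred = dict(flatten(predicted))
    # Every ground-truth leaf must match exactly; a path the model never
    # returned scores zero for that leaf.
    hits = sum(1 for path, val in gold.items() if pred.get(path, _MISSING) == val)
    return hits / len(gold) if gold else 1.0

# Two of three leaves match: 0.67
print(value_accuracy({"a": {"b": 1}, "c": [2, 3]}, {"a": {"b": 1}, "c": [2, 4]}))
```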

Faithfulness

How often values are grounded in the source context instead of hallucinated.

JSON Pass Rate

Whether the response is parseable JSON. Almost every modern model clears 95%+, which is why pass-rate-only benchmarks can't separate them anymore.

Path Recall

Whether all required keys appear in the output.
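
A minimal sketch of this check, assuming the required paths are derived from the ground-truth record (SOB may instead derive them from the schema):

```python
# Minimal path-recall sketch; assumes required paths come from the
# ground-truth record rather than the schema itself.
def leaf_paths(obj, prefix=""):
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from leaf_paths(val, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from leaf_paths(val, f"{prefix}[{i}]")
    else:
        yield prefix

def path_recall(predicted: dict, truth: dict) -> float:
    required = set(leaf_paths(truth))
    returned = set(leaf_paths(predicted))
    return len(required & returned) / len(required) if required else 1.0

print(path_recall({"a": 1}, {"a": 0, "b": 2}))  # 0.5: key "b" is missing
```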

Structure Coverage

Whether nested objects and arrays are present with the correct shape.

Type Safety

Whether leaf values respect the declared JSON Schema types (no strings where numbers are expected).
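
One way to check this, sketched here with the off-the-shelf jsonschema package; SOB's own implementation may differ:

```python
# Type-safety check sketched with the jsonschema package.
from jsonschema import Draft202012Validator

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
}
doc = {"name": "widget", "price": "12.99"}  # string where a number is declared

type_errors = [
    err for err in Draft202012Validator(schema).iter_errors(doc)
    if err.validator == "type"
]
print(len(type_errors))  # 1 -- "12.99" violates the declared number type
```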

Perfect Response Rate

The fraction of records where every single leaf value is exactly right. The hardest metric: it collapses to roughly half even for the best models.
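
In terms of the value_accuracy() sketch above, a record is perfect only when that per-record score is exactly 1.0:

```python
# Perfect-response rate, reusing the illustrative value_accuracy() above.
def perfect_response_rate(records: list[tuple[dict, dict]]) -> float:
    """records: (predicted, ground_truth) pairs."""
    perfect = sum(1 for pred, gold in records if value_accuracy(pred, gold) == 1.0)
    return perfect / len(records) if records else 0.0
```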

How to read this leaderboard

Pick by metric, not by overall. The top four are within 1 point overall but trade leadership across metrics. Choose the model that wins on the metric your workload depends on.

JSON Pass is table stakes. Every frontier model clears 95%+. The interesting question is what happens after parse: Value Accuracy, Faithfulness, and Perfect Response.

Modality matters more than size. A 35B open model can beat a frontier proprietary model on text and lose on audio. Test on your input distribution.

Schema-constrained decoding isn't a free win. Forcing the schema at decode time helps JSON Pass for some models and hurts it for others, while Value Accuracy barely moves. It doesn't fix the value-extraction gap.
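
For reference, schema-constrained decoding here means asking the API to enforce the schema at generation time. With the OpenAI SDK that looks roughly like the following; the model name and schema are made-up examples, not part of SOB:

```python
# Hypothetical schema-constrained request via OpenAI's json_schema
# response format; the model and schema are illustrative only.
from openai import OpenAI

client = OpenAI()

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "total"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4.1",
    temperature=0.0,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "strict": True, "schema": invoice_schema},
    },
    messages=[{"role": "user", "content": "Extract the invoice fields from: ..."}],
)
print(response.choices[0].message.content)
```

A request like this can guarantee parseable, schema-valid output, but as the finding above notes, it does little for whether the extracted values themselves are correct.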

Methodology

SOB scores three modalities (text, image, audio) on the same harness. Image and audio records are converted to text-normalized context before scoring so the score isolates structured-output capability from raw vision or speech-processing quality.

| Modality | Source dataset | Eval records |
|---|---|---|
| Text | HotpotQA context passages | 5,000 |
| Image | olmOCR-bench documents | 209 |
| Audio | AMI Meeting Corpus conversations | 115 |

Hardening gate: if JSON parse fails, downstream semantic metrics are zeroed for that record.

Coverage gate: Value Accuracy is only credited on fields the model actually returned, with missing paths counting as wrong.
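
Taken together, the two gates might look like this in a per-record scorer (a sketch building on the illustrative metric functions above, not SOB's actual harness):

```python
# Sketch of both gates; value_accuracy() and path_recall() refer to the
# illustrative functions defined earlier.
import json

def score_record(raw_response: str, gold: dict) -> dict:
    # Hardening gate: an unparseable response zeroes every downstream
    # semantic metric for this record.
    try:
        predicted = json.loads(raw_response)
    except json.JSONDecodeError:
        return {"json_pass": 0.0, "value_acc": 0.0, "path_recall": 0.0}

    # Coverage gate: value accuracy is computed over all ground-truth
    # paths, so a path the model never returned counts as wrong instead
    # of being dropped from the denominator.
    return {
        "json_pass": 1.0,
        "value_acc": value_accuracy(predicted, gold),
        "path_recall": path_recall(predicted, gold),
    }
```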

Schemas are tagged easy, medium, or hard, and the final leaderboard is schema-complexity-weighted (easy = 1.0, medium = 2.0, hard = 3.0), so harder schemas contribute proportionally more to the score.
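
Applied to per-record scores, this weighting is a straightforward weighted mean (again a sketch):

```python
# Schema-complexity weighting as a weighted mean, using the easy/medium/hard
# weights described above.
WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def weighted_score(records: list[tuple[str, float]]) -> float:
    """records: (difficulty_tag, per_record_score) pairs."""
    total = sum(WEIGHTS[tag] * score for tag, score in records)
    mass = sum(WEIGHTS[tag] for tag, _ in records)
    return total / mass

# One easy record scored 1.0 and one hard record scored 0.5:
print(weighted_score([("easy", 1.0), ("hard", 0.5)]))  # 0.625
```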

For the full methodology, scoring details, and analysis, read the introducing SOB blog post.

Run it yourself

If you're benchmarking a new model, open a PR with the metric breakdown and we'll add it to the next refresh.