Structured Output Benchmark (SOB) Leaderboard

Read the SOB paper on arXiv · Read the introducing SOB blog post · View the SOB dataset on Hugging Face · View the SOB benchmark on GitHub

Full breakdown

| Rank | Model | Overall | Value Acc | Faithfulness | JSON Pass | Path Recall | Structure | Type Safety | Perfect |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | 87.0% | 79.8% | 86.9% | 99.3% | 98.8% | 98.1% | 99.3% | 46.9% |
| 2 | GLM-4.7 | 86.1% | 80.4% | 86.8% | 96.5% | 95.9% | 95.7% | 96.5% | 50.8% |
| 3 | Qwen3.5-35B | 86.1% | 80.1% | 86.3% | 96.9% | 96.2% | 96.0% | 96.9% | 50.0% |
| 4 | Gemini-2.5-Flash | 86.0% | 79.6% | 85.6% | 97.2% | 96.7% | 96.1% | 97.2% | 49.8% |
| 5 | Qwen3-235B | 85.7% | 78.6% | 85.4% | 97.8% | 97.0% | 96.8% | 97.8% | 46.3% |
| 6 | Interfaze-Beta | 85.5% | 79.5% | 85.8% | 96.7% | 96.2% | 95.7% | 96.7% | 48.0% |
| 7 | Claude-Sonnet-4.6 | 85.4% | 77.9% | 85.8% | 97.9% | 97.5% | 96.9% | 97.9% | 44.2% |
| 8 | GPT-4.1 | 85.0% | 78.3% | 85.3% | 96.9% | 96.3% | 95.9% | 96.9% | 45.4% |
| 9 | GPT-5 | 84.9% | 76.9% | 85.9% | 98.3% | 97.8% | 97.2% | 98.3% | 39.8% |
| 10 | Gemma-3-27B | 84.7% | 77.7% | 84.2% | 96.9% | 96.1% | 95.8% | 96.9% | 45.4% |
| 11 | Qwen3-30B | 84.2% | 75.3% | 83.2% | 98.3% | 97.4% | 97.0% | 98.3% | 40.1% |
| 12 | Nemotron-3-Nano-30B | 84.1% | 74.7% | 81.7% | 98.7% | 97.5% | 97.1% | 98.7% | 40.0% |
| 13 | GPT-5-Mini | 83.5% | 75.1% | 83.7% | 97.2% | 96.6% | 96.0% | 97.2% | 38.8% |
| 14 | Gemma-4-31B | 83.3% | 77.8% | 84.3% | 94.3% | 93.4% | 93.4% | 94.3% | 46.1% |
| 15 | Gemini-3-Flash-Preview | 83.3% | 77.3% | 83.1% | 93.9% | 93.5% | 92.9% | 93.9% | 48.4% |
| 16 | Schematron-8B | 83.2% | 73.1% | 80.7% | 98.7% | 97.6% | 96.9% | 98.7% | 37.0% |
| 17 | IBM-Granite-4.0 | 83.2% | 73.6% | 81.2% | 98.3% | 96.5% | 96.7% | 98.3% | 38.1% |
| 18 | Phi-4 | 83.1% | 78.7% | 84.9% | 96.9% | 96.1% | 96.1% | 96.9% | 45.2% |
| 19 | DS-R1-Distill-32B | 82.7% | 74.7% | 81.9% | 96.0% | 94.5% | 94.7% | 96.0% | 41.1% |
| 20 | Ministral-3-14B | 77.8% | 70.0% | 77.3% | 90.6% | 89.8% | 89.6% | 90.6% | 36.8% |
| 21 | GPT-OSS-20B | 73.2% | 66.7% | 73.0% | 84.5% | 83.8% | 83.6% | 84.5% | 36.2% |

Overall ranking

Models are sorted by a difficulty-weighted average across all seven metrics (20 models on the audio split, where Phi-4 is excluded due to its 16K context limit). All runs use temperature 0.0, a 2,048-token output cap, and no reasoning/thinking, so the score reflects pure structured-output capability.
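
For concreteness, the decoding setup might look like this in a harness config. This is a sketch, not SOB's actual code; the parameter names follow the common OpenAI-style chat API.

```python
# Decoding settings as described above -- a sketch, not SOB's actual harness.
GENERATION_CONFIG = {
    "temperature": 0.0,   # deterministic decoding
    "max_tokens": 2048,   # cap on output length
    # Reasoning/"thinking" modes are left disabled so the score measures
    # structured-output capability rather than extra test-time compute.
}
```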

Top 5 across every metric

Structural metrics (JSON Pass, Path Recall, Structure Coverage, Type Safety) cluster near the ceiling. Value Accuracy and Perfect Response are where the real differences appear.

By modality

The same model scores very differently across text, image, and audio, even when every model gets the same text-normalized context. Audio is the hardest by far: transcripts are long (~7,300 tokens on average) and full of overlapping speakers, so models struggle to pull out the right values.

Best Value Accuracy by modality across all valid models:

| Modality | Best Value Accuracy | Leader |
|---|---|---|
| Text | 83.0% | GLM-4.7 |
| Image | 67.2% | Gemma-4-31B |
| Audio | 23.7% | Gemini-2.5-Flash |

Text

HotpotQA passages. Top-tier models cluster within ~5 points of each other.

Image

olmOCR-bench documents normalized to text. Spread widens to ~11 points; vision pretraining matters.

Audio

AMI multi-speaker meetings. Scores collapse and the ranking reshuffles entirely.

No single model wins all three. GPT-5.4 ranks 3rd on text but 9th on images. Schematron-8B ranks 19th on text but 10th on images. Gemma-4-31B ranks 11th on text but 1st on images.

JSON Pass vs Value Accuracy

The single most important view: most models clear 95%+ on JSON Pass, but Value Accuracy sits 15 to 30 points lower. That gap is the space where schema-only benchmarks have been lying to us.

Per-metric rankings

Each chart re-sorts all 21 models on that single metric. Each x-axis starts from a floor appropriate to that metric so the top cluster doesn't look identical.

Value Accuracy

Exact leaf-value match against the verified ground truth. The metric production systems care about.
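
As a rough illustration of what exact leaf-value matching involves (a sketch under our own assumptions, not SOB's published scoring code), flatten both documents into path → leaf pairs and compare:

```python
# Illustrative sketch of exact leaf-value matching; not SOB's actual code.
_MISSING = object()

def flatten(obj, prefix=""):
    """Flatten nested JSON into (path, leaf_value) pairs, e.g. ("c[0]", 2)."""
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from flatten(val, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from flatten(val, f"{prefix}[{i}]")
    else:
        yield prefix, obj

def value_accuracy(predicted: dict, truth: dict) -> float:
    gold = dict(flatten(truth))
    pred = dict(flatten(predicted))
    # Every ground-truth leaf must match exactly; a path the model never
    # returned scores zero for that leaf.
    hits = sum(1 for path, val in gold.items() if pred.get(path, _MISSING) == val)
    return hits / len(gold) if gold else 1.0

# Two of three leaves match: 0.67
print(value_accuracy({"a": {"b": 1}, "c": [2, 3]}, {"a": {"b": 1}, "c": [2, 4]}))
```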

Faithfulness

How often values are grounded in the source context instead of hallucinated.

JSON Pass Rate

Whether the response is parseable JSON. Almost every modern model clears 95%+, which is why pass-rate-only benchmarks can't separate them anymore.

Path Recall

Whether all required keys appear in the output.
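
A minimal sketch of this check, assuming the required paths are derived from the ground-truth record (SOB may instead derive them from the schema):

```python
# Minimal path-recall sketch; assumes required paths come from the
# ground-truth record rather than the schema itself.
def leaf_paths(obj, prefix=""):
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from leaf_paths(val, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from leaf_paths(val, f"{prefix}[{i}]")
    else:
        yield prefix

def path_recall(predicted: dict, truth: dict) -> float:
    required = set(leaf_paths(truth))
    returned = set(leaf_paths(predicted))
    return len(required & returned) / len(required) if required else 1.0

print(path_recall({"a": 1}, {"a": 0, "b": 2}))  # 0.5: key "b" is missing
```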

Structure Coverage

Whether nested objects and arrays are present with the correct shape.

Type Safety

Whether leaf values respect the declared JSON Schema types (no strings where numbers are expected).
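
One way to check this, sketched here with the off-the-shelf jsonschema package; SOB's own implementation may differ:

```python
# Type-safety check sketched with the jsonschema package.
from jsonschema import Draft202012Validator

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
}
doc = {"name": "widget", "price": "12.99"}  # string where a number is declared

type_errors = [
    err for err in Draft202012Validator(schema).iter_errors(doc)
    if err.validator == "type"
]
print(len(type_errors))  # 1 -- "12.99" violates the declared number type
```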

Perfect Response Rate

The fraction of records where every single leaf value is exactly right. The hardest metric: it collapses to roughly half even for the best models.
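
In terms of the value_accuracy() sketch above, a record is perfect only when that per-record score is exactly 1.0:

```python
# Perfect-response rate, reusing the illustrative value_accuracy() above.
def perfect_response_rate(records: list[tuple[dict, dict]]) -> float:
    """records: (predicted, ground_truth) pairs."""
    perfect = sum(1 for pred, gold in records if value_accuracy(pred, gold) == 1.0)
    return perfect / len(records) if records else 0.0
```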

How to read this leaderboard

Pick by metric, not by overall. The top four are within 1 point overall but trade leadership across metrics. Choose the model that wins on the metric your workload depends on.

JSON Pass is table stakes. Every frontier model clears 95%+. The interesting question is what happens after parse: Value Accuracy, Faithfulness, and Perfect Response.

Modality matters more than size. A 35B open model can beat a frontier proprietary model on text and lose on audio. Test on your input distribution.

Schema-constrained decoding isn't a free win. Forcing the schema at decode time helps JSON Pass for some models and hurts it for others, while Value Accuracy barely moves. It doesn't fix the value-extraction gap.
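
For reference, schema-constrained decoding here means asking the API to enforce the schema at generation time. With the OpenAI SDK that looks roughly like the following; the model name and schema are made-up examples, not part of SOB:

```python
# Hypothetical schema-constrained request via OpenAI's json_schema
# response format; the model and schema are illustrative only.
from openai import OpenAI

client = OpenAI()

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "total"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4.1",
    temperature=0.0,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "strict": True, "schema": invoice_schema},
    },
    messages=[{"role": "user", "content": "Extract the invoice fields from: ..."}],
)
print(response.choices[0].message.content)
```

A request like this can guarantee parseable, schema-valid output, but as the finding above notes, it does little for whether the extracted values themselves are correct.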

Methodology

SOB scores three modalities (text, image, audio) on the same harness. Image and audio records are converted to text-normalized context before scoring so the score isolates structured-output capability from raw vision or speech-processing quality.

| Modality | Source dataset | Eval records |
|---|---|---|
| Text | HotpotQA context passages | 5,000 |
| Image | olmOCR-bench documents | 209 |
| Audio | AMI Meeting Corpus conversations | 115 |

Hardening gate: if JSON parse fails, downstream semantic metrics are zeroed for that record.

Coverage gate: Value Accuracy is only credited on fields the model actually returned, with missing paths counting as wrong.
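
Taken together, the two gates might look like this in a per-record scorer (a sketch building on the illustrative metric functions above, not SOB's actual harness):

```python
# Sketch of both gates; value_accuracy() and path_recall() refer to the
# illustrative functions defined earlier.
import json

def score_record(raw_response: str, gold: dict) -> dict:
    # Hardening gate: an unparseable response zeroes every downstream
    # semantic metric for this record.
    try:
        predicted = json.loads(raw_response)
    except json.JSONDecodeError:
        return {"json_pass": 0.0, "value_acc": 0.0, "path_recall": 0.0}

    # Coverage gate: value accuracy is computed over all ground-truth
    # paths, so a path the model never returned counts as wrong instead
    # of being dropped from the denominator.
    return {
        "json_pass": 1.0,
        "value_acc": value_accuracy(predicted, gold),
        "path_recall": path_recall(predicted, gold),
    }
```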

Schemas are tagged easy, medium, or hard, and the final leaderboard is schema-complexity-weighted (easy = 1.0, medium = 2.0, hard = 3.0), so harder schemas contribute proportionally more to the score.
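
Applied to per-record scores, this weighting is a straightforward weighted mean (again a sketch):

```python
# Schema-complexity weighting as a weighted mean, using the easy/medium/hard
# weights described above.
WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def weighted_score(records: list[tuple[str, float]]) -> float:
    """records: (difficulty_tag, per_record_score) pairs."""
    total = sum(WEIGHTS[tag] * score for tag, score in records)
    mass = sum(WEIGHTS[tag] for tag, _ in records)
    return total / mass

# One easy record scored 1.0 and one hard record scored 0.5:
print(weighted_score([("easy", 1.0), ("hard", 0.5)]))  # 0.625
```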

For the full methodology, scoring details, and analysis, read the introducing SOB blog post.

Run it yourself

If you're benchmarking a new model, open a PR with the metric breakdown and we'll add it to the next refresh.