| Rank | Model | Overall | Value Acc | Faithfulness | JSON Pass | Path Recall | Structure | Type Safety | Perfect |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | 87.0% | 79.8% | 86.9% | 99.3% | 98.8% | 98.1% | 99.3% | 46.9% |
| 2 | GLM-4.7 | 86.1% | 80.4% | 86.8% | 96.5% | 95.9% | 95.7% | 96.5% | 50.8% |
| 3 | Qwen3.5-35B | 86.1% | 80.1% | 86.3% | 96.9% | 96.2% | 96.0% | 96.9% | 50.0% |
| 4 | Gemini-2.5-Flash | 86.0% | 79.6% | 85.6% | 97.2% | 96.7% | 96.1% | 97.2% | 49.8% |
| 5 | Qwen3-235B | 85.7% | 78.6% | 85.4% | 97.8% | 97.0% | 96.8% | 97.8% | 46.3% |
| 6 | Interfaze-Beta | 85.5% | 79.5% | 85.8% | 96.7% | 96.2% | 95.7% | 96.7% | 48.0% |
| 7 | Claude-Sonnet-4.6 | 85.4% | 77.9% | 85.8% | 97.9% | 97.5% | 96.9% | 97.9% | 44.2% |
| 8 | GPT-4.1 | 85.0% | 78.3% | 85.3% | 96.9% | 96.3% | 95.9% | 96.9% | 45.4% |
| 9 | GPT-5 | 84.9% | 76.9% | 85.9% | 98.3% | 97.8% | 97.2% | 98.3% | 39.8% |
| 10 | Gemma-3-27B | 84.7% | 77.7% | 84.2% | 96.9% | 96.1% | 95.8% | 96.9% | 45.4% |
| 11 | Qwen3-30B | 84.2% | 75.3% | 83.2% | 98.3% | 97.4% | 97.0% | 98.3% | 40.1% |
| 12 | Nemotron-3-Nano-30B | 84.1% | 74.7% | 81.7% | 98.7% | 97.5% | 97.1% | 98.7% | 40.0% |
| 13 | GPT-5-Mini | 83.5% | 75.1% | 83.7% | 97.2% | 96.6% | 96.0% | 97.2% | 38.8% |
| 14 | Gemma-4-31B | 83.3% | 77.8% | 84.3% | 94.3% | 93.4% | 93.4% | 94.3% | 46.1% |
| 15 | Gemini-3-Flash-Preview | 83.3% | 77.3% | 83.1% | 93.9% | 93.5% | 92.9% | 93.9% | 48.4% |
| 16 | Schematron-8B | 83.2% | 73.1% | 80.7% | 98.7% | 97.6% | 96.9% | 98.7% | 37.0% |
| 17 | IBM-Granite-4.0 | 83.2% | 73.6% | 81.2% | 98.3% | 96.5% | 96.7% | 98.3% | 38.1% |
| 18 | Phi-4 | 83.1% | 78.7% | 84.9% | 96.9% | 96.1% | 96.1% | 96.9% | 45.2% |
| 19 | DS-R1-Distill-32B | 82.7% | 74.7% | 81.9% | 96.0% | 94.5% | 94.7% | 96.0% | 41.1% |
| 20 | Ministral-3-14B | 77.8% | 70.0% | 77.3% | 90.6% | 89.8% | 89.6% | 90.6% | 36.8% |
| 21 | GPT-OSS-20B | 73.2% | 66.7% | 73.0% | 84.5% | 83.8% | 83.6% | 84.5% | 36.2% |
Models are sorted by the difficulty-weighted average across all seven metrics (20 models on audio, where Phi-4 is excluded by its 16K context limit). All runs use temperature 0.0, a 2,048-token output cap, and no reasoning/thinking mode, so the score reflects pure structured-output capability.
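Those decoding settings can be pinned down as a small config fragment. This is illustrative only; the harness's actual config object is not shown here, and the key names are our choice:

```python
# Decoding settings used for every leaderboard run (values from the text;
# the dict itself and its key names are illustrative, not the harness's).
RUN_CONFIG = {
    "temperature": 0.0,          # deterministic decoding
    "max_output_tokens": 2048,   # hard output cap
    "reasoning": "off",          # no thinking/reasoning mode
}
```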
Structural metrics (JSON Pass, Path Recall, Structure Coverage, Type Safety) cluster near the ceiling. Value Accuracy and Perfect Response are where the real differences appear.
The same model scores very differently across text, image, and audio, even when every model gets the same text-normalized context. Audio is the hardest by far: transcripts are long (~7,300 tokens on average) and full of overlapping speakers, so models struggle to pull out the right values.
Best Value Accuracy by modality across all valid models:
| Modality | Best Value Accuracy | Leader |
|---|---|---|
| Text | 83.0% | GLM-4.7 |
| Image | 67.2% | Gemma-4-31B |
| Audio | 23.7% | Gemini-2.5-Flash |
- **Text** — HotpotQA passages. Top-tier models cluster within ~5 points of each other.
- **Image** — olmOCR-bench documents normalized to text. The spread widens to ~11 points; vision pretraining matters.
- **Audio** — AMI multi-speaker meetings. Scores collapse and the ranking reshuffles entirely.
No single model wins all three. GPT-5.4 ranks 3rd on text but 9th on images. Schematron-8B ranks 19th on text but 10th on images. Gemma-4-31B ranks 11th on text but 1st on images.
The single most important view: most models clear 95%+ on JSON Pass, but Value Accuracy sits 15 to 30 points lower. That gap is the space where schema-only benchmarks have been lying to us.
Each chart re-sorts all 21 models on that single metric. Each x-axis starts from a floor appropriate to that metric so the top cluster doesn't look identical.
- **Value Accuracy** — Exact leaf-value match against the verified ground truth. The metric production systems care about.
- **Faithfulness** — How often values are grounded in the source context instead of hallucinated.
- **JSON Pass** — Whether the response is parseable JSON. Almost every modern model clears 95%+, which is why pass-rate-only benchmarks can't separate them anymore.
- **Path Recall** — Whether all required keys appear in the output.
- **Structure Coverage** — Whether nested objects and arrays are present with the correct shape.
- **Type Safety** — Whether leaf values respect the declared JSON Schema types (no strings where numbers are expected).
- **Perfect Response** — The fraction of records where every single leaf value is exactly right. The hardest metric: it collapses to roughly half even for the best models.
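As a concrete illustration of what a Type Safety check involves, here is a minimal sketch that walks the declared schema recursively. The `leaf_types_ok` helper and example schema are ours, not the benchmark's; the real harness presumably uses a full JSON Schema validator:

```python
# Minimal leaf-type check in the spirit of the Type Safety metric (a sketch).
# Missing keys are deliberately ignored here: omissions are Path Recall's
# problem, not Type Safety's.
JSON_TYPES = {
    "string": str, "boolean": bool,
    "object": dict, "array": list, "null": type(None),
}

def leaf_types_ok(value, schema):
    """Recursively verify that leaves match their declared JSON Schema type."""
    t = schema.get("type")
    if t == "object":
        return isinstance(value, dict) and all(
            leaf_types_ok(value[k], sub)
            for k, sub in schema.get("properties", {}).items()
            if k in value
        )
    if t == "array":
        return isinstance(value, list) and all(
            leaf_types_ok(v, schema.get("items", {})) for v in value
        )
    if t == "integer":  # bool is an int subclass in Python; exclude it
        return isinstance(value, int) and not isinstance(value, bool)
    if t == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    expected = JSON_TYPES.get(t)
    return expected is None or isinstance(value, expected)

schema = {"type": "object", "properties": {
    "price": {"type": "number"},
    "tags": {"type": "array", "items": {"type": "string"}},
}}
leaf_types_ok({"price": 9.5, "tags": ["sale"]}, schema)    # True
leaf_types_ok({"price": "9.5", "tags": ["sale"]}, schema)  # False: string where number expected
```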
- **Pick by metric, not by overall.** The top four are within 1 point overall but trade leadership across metrics. Choose the model that wins on the metric your workload depends on.
- **JSON Pass is table stakes.** Every frontier model clears 95%+. The interesting question is what happens after the parse: Value Accuracy, Faithfulness, and Perfect Response.
- **Modality matters more than size.** A 35B open model can beat a frontier proprietary model on text and lose to it on audio. Test on your own input distribution.
- **Schema-constrained decoding isn't a free win.** Forcing the schema at decode time helps JSON Pass for some models and hurts it for others, while Value Accuracy barely moves. It doesn't fix the value-extraction gap.
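For reference, decode-time schema enforcement typically looks like the OpenAI-style `json_schema` response format sketched below. Exact field names vary by provider, and the extraction schema here is a made-up example, not one of the benchmark's:

```python
# Illustrative request payload for decode-time schema enforcement,
# in the style of OpenAI's "json_schema" response_format.
extraction_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "extraction",
        "strict": True,  # the decoder may only emit schema-conforming tokens
        "schema": extraction_schema,
    },
}
```

This guarantees a parse (JSON Pass) but, as the takeaway above notes, does nothing to guarantee the values inside are right.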
SOB scores three modalities (text, image, audio) on the same harness. Image and audio records are converted to text-normalized context before scoring so the score isolates structured-output capability from raw vision or speech-processing quality.
| Modality | Source dataset | Eval records |
|---|---|---|
| Text | HotpotQA context passages | 5,000 |
| Image | olmOCR-bench documents | 209 |
| Audio | AMI Meeting Corpus conversations | 115 |
- **Hardening gate:** if JSON parsing fails, all downstream semantic metrics are zeroed for that record.
- **Coverage gate:** Value Accuracy is only credited on fields the model actually returned, and missing paths count as wrong.
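Both gates can be sketched in a few lines, assuming a ground truth flattened to one level of leaf paths for brevity. The function and the example record are hypothetical, not the harness's code:

```python
import json

def value_accuracy(response_text, gold):
    """Gated Value Accuracy sketch. `gold` maps leaf path -> expected value."""
    try:
        pred = json.loads(response_text)
    except json.JSONDecodeError:
        return 0.0  # hardening gate: a parse failure zeroes semantic metrics
    # Coverage gate: the denominator is the gold leaf count, so any path
    # the model omits counts as wrong rather than being skipped.
    hits = sum(pred.get(path) == val for path, val in gold.items())
    return hits / len(gold)

gold = {"title": "Kickoff", "speakers": 4, "room": "ES2002a"}
value_accuracy('{"title": "Kickoff", "speakers": 4}', gold)  # 2/3: "room" omitted
value_accuracy("not json", gold)                             # 0.0
```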
Schemas are tagged easy, medium, or hard, and the final leaderboard is schema-complexity-weighted (easy = 1.0, medium = 2.0, hard = 3.0), so harder schemas contribute proportionally more to the score.
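The weighting reduces to a weighted mean over per-record scores. A minimal sketch with the weights from the text and a hypothetical record list:

```python
# Schema-complexity weights from the text; the record list is hypothetical.
WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def complexity_weighted(scores):
    """scores: iterable of (difficulty_tag, per_record_score) pairs."""
    total = sum(WEIGHTS[tag] * s for tag, s in scores)
    weight = sum(WEIGHTS[tag] for tag, _ in scores)
    return total / weight

complexity_weighted([("easy", 1.0), ("hard", 0.5)])  # 0.625: the hard record counts 3x
```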
For the full methodology, scoring details, and analysis, read the blog post introducing SOB.
Paper: arXiv preprint
Dataset: Hugging Face
Code: GitHub
Try Interfaze: Structured output docs · Playground
If you're benchmarking a new model, open a PR with the metric breakdown and we'll add it to the next refresh.