| Rank | Model | Overall | Value Acc | Faithfulness | JSON Pass | Path Recall | Structure | Type Safety | Perfect |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | 87.0% | 79.8% | 86.9% | 99.3% | 98.8% | 98.1% | 99.3% | 46.9% |
| 2 | Gemini-3.1-Pro | 86.9% | 82.0% | 87.6% | 96.6% | 96.0% | 95.8% | 96.6% | 54.2% |
| 3 | GLM-5.1 | 86.6% | 80.6% | 87.2% | 97.5% | 96.9% | 96.7% | 97.5% | 49.8% |
| 4 | Claude-Opus-4.7 | 86.4% | 78.7% | 87.7% | 99.3% | 98.8% | 98.3% | 99.3% | 42.4% |
| 5 | GLM-4.7 | 86.1% | 80.4% | 86.8% | 96.5% | 95.9% | 95.7% | 96.5% | 50.8% |
| 6 | Qwen3.5-35B | 86.1% | 80.1% | 86.3% | 96.9% | 96.2% | 96.0% | 96.9% | 50.0% |
| 7 | GPT-5.5 | 86.0% | 79.5% | 86.8% | 97.8% | 97.1% | 96.8% | 97.8% | 46.4% |
| 8 | Gemini-2.5-Flash | 86.0% | 79.6% | 85.6% | 97.2% | 96.7% | 96.1% | 97.2% | 49.8% |
| 9 | Qwen3-235B | 85.7% | 78.6% | 85.4% | 97.8% | 97.0% | 96.8% | 97.8% | 46.3% |
| 10 | Interfaze-Beta | 85.5% | 79.5% | 85.8% | 96.7% | 96.2% | 95.7% | 96.7% | 48.0% |
| 11 | Claude-Sonnet-4.6 | 85.4% | 77.9% | 85.8% | 97.9% | 97.5% | 96.9% | 97.9% | 44.2% |
| 12 | Claude-Opus-4.6 | 85.3% | 77.9% | 86.0% | 97.7% | 97.3% | 96.8% | 97.7% | 43.7% |
| 13 | DeepSeek-V4-Pro | 85.3% | 79.6% | 85.8% | 96.0% | 95.2% | 95.3% | 96.0% | 49.0% |
| 14 | Kimi-2.6 | 85.3% | 79.1% | 85.6% | 96.4% | 95.8% | 95.4% | 96.4% | 48.2% |
| 15 | GPT-4.1 | 85.0% | 78.3% | 85.3% | 96.9% | 96.3% | 95.9% | 96.9% | 45.4% |
| 16 | GPT-5 | 84.9% | 76.9% | 85.9% | 98.3% | 97.8% | 97.2% | 98.3% | 39.8% |
| 17 | Gemma-3-27B | 84.7% | 77.7% | 84.2% | 96.9% | 96.1% | 95.8% | 96.9% | 45.4% |
| 18 | Qwen3-30B | 84.2% | 75.3% | 83.2% | 98.3% | 97.4% | 97.0% | 98.3% | 40.1% |
| 19 | Nemotron-3-Nano-30B | 84.1% | 74.7% | 81.7% | 98.7% | 97.5% | 97.1% | 98.7% | 40.0% |
| 20 | GPT-5-Mini | 83.5% | 75.1% | 83.7% | 97.2% | 96.6% | 96.0% | 97.2% | 38.8% |
| 21 | Gemma-4-31B | 83.3% | 77.8% | 84.3% | 94.3% | 93.4% | 93.4% | 94.3% | 46.1% |
| 22 | Gemini-3-Flash-Preview | 83.3% | 77.3% | 83.1% | 93.9% | 93.5% | 92.9% | 93.9% | 48.4% |
| 23 | Schematron-8B | 83.2% | 73.1% | 80.7% | 98.7% | 97.6% | 96.9% | 98.7% | 37.0% |
| 24 | IBM-Granite-4.0 | 83.2% | 73.6% | 81.2% | 98.3% | 96.5% | 96.7% | 98.3% | 38.1% |
| 25 | Phi-4 | 83.1% | 78.7% | 84.9% | 96.9% | 96.1% | 96.1% | 96.9% | 45.2% |
| 26 | DS-R1-Distill-32B | 82.7% | 74.7% | 81.9% | 96.0% | 94.5% | 94.7% | 96.0% | 41.1% |
| 27 | Ministral-3-14B | 77.8% | 70.0% | 77.3% | 90.6% | 89.8% | 89.6% | 90.6% | 36.8% |
| 28 | GPT-OSS-20B | 73.2% | 66.7% | 73.0% | 84.5% | 83.8% | 83.6% | 84.5% | 36.2% |
Models are sorted by the difficulty-weighted average across all seven metrics (28 models on text and image, 27 on audio; Phi-4 is excluded from audio due to its 16K context limit). All runs use temperature 0.0, a 2,048-token output cap, and no reasoning/thinking, so scores reflect pure structured-output capability.
The following models cannot have reasoning fully turned off, so they ran in their lowest-reasoning configuration. That gives them a small reasoning advantage the others do not get, yet several non-reasoning models still beat them on Value Accuracy.
| Model | Why reasoning can't be fully disabled |
|---|---|
| GPT-5, GPT-5-Mini | API only exposes a minimum reasoning effort, not a full disable. |
| Gemini-3.1-Pro, Gemini-3-Flash-Preview | Thinking is built in and can be set to its lowest budget but not switched off. |
| DS-R1-Distill-32B | Chain-of-thought is intrinsic to the model, baked in during distillation. |
Structural metrics (JSON Pass, Path Recall, Structure Coverage, Type Safety) cluster near the ceiling. Value Accuracy and Perfect Response are where the real differences appear.
The same model scores very differently across text, image, and audio, even when every model gets the same text-normalized context. Audio is the hardest by far: transcripts are long (~7,300 tokens on average) and full of overlapping speakers, so models struggle to pull out the right values.
Best Value Accuracy by modality across all valid models:
| Modality | Best Value Accuracy | Leader |
|---|---|---|
| Text | 84.5% | Gemini-3.1-Pro |
| Image | 67.2% | Gemma-4-31B |
| Audio | 23.7% | Gemini-2.5-Flash |
- **Text** (HotpotQA passages): top-tier models cluster within ~5 points of each other.
- **Image** (olmOCR-bench documents normalized to text): the spread widens.
- **Audio** (AMI multi-speaker meetings): scores collapse and the ranking reshuffles entirely.
No single model wins all three. GPT-5.4 ranks 5th on text but 13th on images. Schematron-8B ranks 26th on text but 15th on images. Gemma-4-31B ranks 18th on text but 1st on images.
The single most important view: most models clear 95%+ on JSON Pass, but Value Accuracy sits 15 to 30 points lower. That gap is the space where schema-only benchmarks have been lying to us.
Each chart re-sorts all 28 models on that single metric. Each x-axis starts from a floor appropriate to that metric so the top cluster doesn't look identical.
- **Value Accuracy** — exact leaf-value match against the verified ground truth. The metric production systems care about.
- **Faithfulness** — how often values are grounded in the source context instead of hallucinated.
- **JSON Pass** — whether the response is parseable JSON. Almost every modern model clears 95%+, which is why pass-rate-only benchmarks can't separate them anymore.
- **Path Recall** — whether all required keys appear in the output.
- **Structure Coverage** — whether nested objects and arrays are present with the correct shape.
- **Type Safety** — whether leaf values respect the declared JSON Schema types (no strings where numbers are expected).
- **Perfect Response** — the fraction of records where every single leaf value is exactly right. The hardest metric: it collapses to roughly half even for the best models.
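The leaf-level scoring above can be sketched in a few lines. This is a minimal illustration, assuming dot-separated leaf paths and exact equality; the harness's actual path and normalization rules may differ:

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into a {dot.path: leaf_value} mapping."""
    if isinstance(obj, dict):
        out = {}
        for key, val in obj.items():
            out.update(flatten(val, f"{prefix}{key}."))
        return out
    if isinstance(obj, list):
        out = {}
        for i, val in enumerate(obj):
            out.update(flatten(val, f"{prefix}{i}."))
        return out
    return {prefix.rstrip("."): obj}

def score_record(pred, truth):
    """Leaf-level metrics for one record (illustrative names)."""
    p, t = flatten(pred), flatten(truth)
    # Path Recall: required paths that appear in the output.
    path_recall = sum(k in p for k in t) / len(t)
    # Value Accuracy: exact leaf match; missing paths count as wrong.
    correct = sum(k in p and p[k] == t[k] for k in t)
    value_acc = correct / len(t)
    # Perfect Response: every leaf exactly right.
    perfect = correct == len(t)
    return {"path_recall": path_recall, "value_acc": value_acc, "perfect": perfect}
```

For example, a prediction that gets one of two leaves right scores `path_recall` 1.0, `value_acc` 0.5, `perfect` False.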
- **Pick by metric, not by overall.** The top six models are within 1 point overall but trade leadership across metrics. Choose the model that wins on the metric your workload depends on.
- **JSON Pass is table stakes.** Every frontier model clears 95%+. The interesting question is what happens after parse: Value Accuracy, Faithfulness, and Perfect Response.
- **Modality matters more than size.** A 35B open model can beat a frontier proprietary model on text and lose on audio. Test on your input distribution.
- **Schema-constrained decoding isn't a free win.** Forcing the schema at decode time helps JSON Pass for some models and hurts it for others, while Value Accuracy barely moves. It doesn't fix the value-extraction gap.
SOB scores three modalities (text, image, audio) on the same harness. Image and audio records are converted to text-normalized context before scoring so the score isolates structured-output capability from raw vision or speech-processing quality.
| Modality | Source dataset | Eval records |
|---|---|---|
| Text | HotpotQA context passages | 5,000 |
| Image | olmOCR-bench documents | 209 |
| Audio | AMI Meeting Corpus conversations | 115 |
- **Hardening gate:** if JSON parsing fails, all downstream semantic metrics are zeroed for that record.
- **Coverage gate:** Value Accuracy is scored over every ground-truth path, with missing paths counting as wrong, so a model can't inflate its score by returning only the fields it is confident about.
- **Complexity weighting:** schemas are tagged easy, medium, or hard, and the final leaderboard is schema-complexity-weighted (easy = 1.0, medium = 2.0, hard = 3.0), so harder schemas contribute more.
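Both gates and the complexity weighting can be combined into a short scoring sketch. Function names and the flat-object assumption are illustrative, not the actual harness:

```python
import json

COMPLEXITY_WEIGHT = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def gated_score(raw_output, truth_leaves, difficulty):
    """Score one record; `truth_leaves` maps flattened paths to values."""
    weight = COMPLEXITY_WEIGHT[difficulty]
    # Hardening gate: a parse failure zeroes the semantic metrics.
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"json_pass": 0.0, "value_acc": 0.0, "weight": weight}
    if not isinstance(pred, dict):  # sketch simplification: expect a flat object
        return {"json_pass": 1.0, "value_acc": 0.0, "weight": weight}
    # Coverage gate: iterate over ALL ground-truth paths, so a field
    # the model omitted counts as wrong instead of being skipped.
    missing = object()  # sentinel that never equals a real value
    correct = sum(pred.get(path, missing) == value
                  for path, value in truth_leaves.items())
    return {"json_pass": 1.0,
            "value_acc": correct / len(truth_leaves),
            "weight": weight}

def weighted_value_accuracy(records):
    """Schema-complexity-weighted mean Value Accuracy across records."""
    total = sum(r["weight"] for r in records)
    return sum(r["value_acc"] * r["weight"] for r in records) / total
```

An unparseable record thus drags down every metric at three times the weight of an easy record if its schema was tagged hard.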
For the full methodology, scoring details, and analysis, read the blog post introducing SOB.
If you're benchmarking a new model, open a PR with the metric breakdown and we'll add it to the next refresh.