Complex document processing
End-to-end document understanding on long, layout-rich PDFs with tables, footnotes, equations, headers, and multi-column flows. Tests whether the model preserves reading order, not just characters.
Mean accuracy on long, layout-rich PDFs — graded against the original document, including reading order. Higher is better.
Includes general-purpose LLMs and purpose-built OCR systems.
Beyond general-purpose LLMs, Interfaze also outperforms purpose-built OCR systems on the same benchmark — the models you'd reach for if you were going to wire up a dedicated document pipeline.
Every model's overall score and per-task accuracy across the eight olmOCR-bench task categories.
| # | Model | Overall | ArXiv | Old Scans Math | Tables | Old Scans | Headers | Multi-Column | Long Tiny Text | Base |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Interfaze | 85.7% | 87.2% | 88.9% | 86.4% | 53.1% | 91.7% | 83.6% | 94.8% | 99.8% |
| 2 | Chandra OCR 2 | 84.3% | 86.5% | 83.0% | 87.9% | 49.2% | 92.5% | 81.3% | 93.9% | 99.9% |
| 3 | olmOCR v0.4.0 | 82.4% | 83.0% | 82.3% | 84.9% | 47.7% | 96.1% | 83.7% | 81.9% | 99.7% |
| 4 | Grok-4.3 | 81.9% | 79.6% | 77.5% | 81.5% | 47.3% | 95.8% | 81.6% | 92.1% | 99.6% |
| 5 | GPT-5.4-Mini | 80.1% | 79.1% | 78.6% | 81.1% | 43.9% | 90.9% | 79.4% | 87.6% | 99.9% |
| 6 | PaddleOCR-VL* | 80.0% | 85.7% | 71.0% | 84.1% | 37.8% | 97.0% | 79.9% | 85.7% | 98.5% |
| 7 | Reducto | 76.2% | 68.7% | 68.8% | 92.5% | 45.8% | 79.6% | 68.0% | 86.5% | 99.5% |
| 8 | DeepSeek-OCR | 75.7% | 77.2% | 73.6% | 80.2% | 33.3% | 96.1% | 66.4% | 79.4% | 99.8% |
| 9 | Gemini-3-Flash | 75.3% | 78.6% | 70.2% | 79.8% | 33.9% | 93.4% | 73.1% | 79.5% | 94.0% |
| 10 | Claude-Sonnet-4.6 | 73.9% | 76.4% | 63.5% | 73.2% | 31.7% | 92.5% | 69.8% | 76.1% | 98.0% |
| 11 | Mistral OCR | 72.0% | 77.2% | 67.5% | 60.6% | 29.3% | 93.6% | 71.3% | 77.1% | 99.4% |
* Self-reported score from the model's own announcement.
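The Overall column is consistent with a simple unweighted mean of the eight task-category scores, Base included (e.g. Interfaze: 685.5 / 8 = 85.69 ≈ 85.7). A minimal sketch checking that against a few rows of the table above, with the scores hard-coded from those rows:

```python
# Per-task scores in table order:
# ArXiv, Old Scans Math, Tables, Old Scans, Headers, Multi-Column, Long Tiny Text, Base
rows = {
    "Interfaze":     (85.7, [87.2, 88.9, 86.4, 53.1, 91.7, 83.6, 94.8, 99.8]),
    "Chandra OCR 2": (84.3, [86.5, 83.0, 87.9, 49.2, 92.5, 81.3, 93.9, 99.9]),
    "olmOCR v0.4.0": (82.4, [83.0, 82.3, 84.9, 47.7, 96.1, 83.7, 81.9, 99.7]),
}

for name, (overall, tasks) in rows.items():
    mean = sum(tasks) / len(tasks)
    # The reported Overall matches the unweighted mean to within rounding.
    assert abs(mean - overall) < 0.05, (name, mean, overall)
    print(f"{name}: reported {overall}, computed mean {mean:.2f}")
```

This is only a consistency check on the published numbers, not the benchmark's official scoring code; olmOCR-bench's harness may aggregate per-test pass rates rather than averaging the rounded category percentages shown here.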
How each model performs on individual olmOCR-bench task categories.