Multimodal understanding
MMMU-Pro is a hard subset of MMMU for college-level multimodal problems. It removes shortcut-prone questions to isolate vision-language reasoning over diagrams, charts, and figures.
Accuracy on hard college-level multimodal questions that combine diagrams, charts, and figures with text. Higher is better.
Headline MMMU-Pro score is the mean of two tracks: standard (text + image, n=1729) and vision-only (image-only, n=1730).
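The averaging rule above can be checked against the published numbers. The sketch below is illustrative only: the function name `headline_score` is hypothetical, and rounding half-up to one decimal place is an assumption, since the page does not state how the reported values were rounded or whether they were derived from unrounded track scores.

```python
from decimal import Decimal, ROUND_HALF_UP

# (standard, vision-only) track scores in percent, copied from the summary table.
TRACK_SCORES = {
    "Interfaze": ("72.5", "69.7"),
    "Grok-4.3": ("69.4", "67.9"),
    "Gemini-3-Flash": ("68.0", "67.2"),
    "Claude-Sonnet-4.6": ("47.1", "45.5"),
    "GPT-5.4-Mini": ("42.0", "38.8"),
}

# Overall scores as reported in the same table.
REPORTED_OVERALL = {
    "Interfaze": "71.1",
    "Grok-4.3": "68.7",
    "Gemini-3-Flash": "67.6",
    "Claude-Sonnet-4.6": "46.3",
    "GPT-5.4-Mini": "40.4",
}


def headline_score(standard: str, vision_only: str) -> Decimal:
    """Unweighted mean of the two track scores.

    Rounding half-up to one decimal is an assumption, not a documented rule.
    Decimal avoids binary floating-point surprises on ties such as 68.65.
    """
    mean = (Decimal(standard) + Decimal(vision_only)) / 2
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for model, (standard, vision_only) in TRACK_SCORES.items():
    computed = headline_score(standard, vision_only)
    print(f"{model}: computed {computed}%, reported {REPORTED_OVERALL[model]}%")
```

Because the two tracks have nearly identical sizes (n=1729 and n=1730), weighting the mean by question count would change these values by far less than the one-decimal precision shown.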
Each model's combined MMMU-Pro score alongside the two track scores it averages.
| # | Model | Overall | Standard | Vision-only |
|---|---|---|---|---|
| 1 | Interfaze | 71.1% | 72.5% | 69.7% |
| 2 | Grok-4.3 | 68.7% | 69.4% | 67.9% |
| 3 | Gemini-3-Flash | 67.6% | 68.0% | 67.2% |
| 4 | Claude-Sonnet-4.6 | 46.3% | 47.1% | 45.5% |
| 5 | GPT-5.4-Mini | 40.4% | 42.0% | 38.8% |
Standard MMMU-Pro track (n=1729). Models receive both the rendered question image and surrounding text. Per-subject accuracy across 30 college-level subjects.
| # | Model | Overall | Accounting | Agriculture | Architecture & Engineering | Art | Art Theory | Basic Medical Science | Biology | Chemistry | Clinical Medicine | Computer Science | Design | Diagnostics & Lab Medicine | Economics | Electronics | Energy & Power | Finance | Geography | History | Literature | Manage | Marketing | Materials | Math | Mechanical Engineering | Music | Pharmacy | Physics | Psychology | Public Health | Sociology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Interfaze | 72.5% | 65.5% | 68.3% | 65.0% | 90.6% | 83.6% | 82.7% | 72.9% | 61.7% | 67.8% | 65.0% | 75.0% | 53.3% | 89.1% | 90.0% | 77.6% | 86.4% | 71.2% | 76.8% | 78.8% | 70.0% | 88.1% | 73.3% | 60.0% | 52.5% | 35.0% | 68.4% | 78.3% | 71.7% | 82.8% | 79.6% |
| 2 | Grok-4.3 | 69.4% | 82.8% | 58.3% | 66.7% | 77.4% | 81.8% | 59.6% | 64.4% | 70.0% | 61.0% | 65.0% | 80.0% | 40.0% | 91.5% | 70.0% | 56.9% | 88.3% | 73.1% | 71.4% | 82.7% | 70.0% | 86.4% | 55.0% | 70.0% | 49.1% | 31.7% | 84.2% | 78.3% | 70.0% | 74.1% | 77.8% |
| 3 | Gemini-3-Flash | 68.0% | 63.8% | 70.0% | 48.3% | 83.0% | 85.5% | 84.6% | 76.3% | 70.0% | 72.9% | 63.3% | 75.0% | 51.7% | 84.7% | 75.0% | 44.8% | 65.0% | 75.0% | 80.4% | 80.8% | 56.0% | 69.5% | 46.7% | 63.3% | 61.0% | 28.3% | 71.9% | 71.7% | 75.0% | 79.3% | 75.9% |
| 4 | Claude-Sonnet-4.6 | 47.1% | 48.3% | 57.1% | 13.3% | 83.9% | 81.8% | 51.9% | 40.7% | 51.7% | 47.5% | 45.0% | 71.7% | 40.0% | 59.3% | 16.7% | 17.2% | 31.7% | 55.8% | 69.6% | 75.0% | 38.0% | 45.6% | 21.7% | 33.3% | 15.3% | 33.3% | 59.7% | 30.0% | 56.7% | 70.7% | 64.8% |
| 5 | GPT-5.4-Mini | 42.0% | 27.6% | 40.0% | 21.7% | 75.5% | 65.5% | 57.7% | 40.7% | 31.7% | 47.5% | 50.0% | 70.0% | 38.3% | 30.5% | 55.0% | 22.4% | 23.3% | 36.5% | 53.6% | 71.2% | 30.0% | 43.2% | 25.0% | 26.7% | 37.3% | 26.7% | 54.4% | 31.7% | 43.3% | 36.2% | 57.4% |
Vision-only MMMU-Pro track (n=1730). The model sees only the rendered image, with no text context, so VLM grounding becomes the bottleneck. Per-subject accuracy across the same 30 college-level subjects.
| # | Model | Overall | Accounting | Agriculture | Architecture & Engineering | Art | Art Theory | Basic Medical Science | Biology | Chemistry | Clinical Medicine | Computer Science | Design | Diagnostics & Lab Medicine | Economics | Electronics | Energy & Power | Finance | Geography | History | Literature | Manage | Marketing | Materials | Math | Mechanical Engineering | Music | Pharmacy | Physics | Psychology | Public Health | Sociology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Interfaze | 69.7% | 62.3% | 63.7% | 45.3% | 83.1% | 89.3% | 88.5% | 69.8% | 73.7% | 78.3% | 68.7% | 75.3% | 57.0% | 90.1% | 73.7% | 53.7% | 63.7% | 73.2% | 75.2% | 82.8% | 66.0% | 73.2% | 52.0% | 70.3% | 57.9% | 43.7% | 79.2% | 75.3% | 60.3% | 76.1% | 77.9% |
| 2 | Grok-4.3 | 67.9% | 82.8% | 53.3% | 68.3% | 67.9% | 83.6% | 57.7% | 55.9% | 70.0% | 55.9% | 71.7% | 78.3% | 40.0% | 88.1% | 68.3% | 62.1% | 83.3% | 67.3% | 71.4% | 82.7% | 66.0% | 86.4% | 45.0% | 75.0% | 54.2% | 41.7% | 73.7% | 80.0% | 58.3% | 81.0% | 70.4% |
| 3 | Gemini-3-Flash | 67.2% | 62.1% | 66.7% | 38.3% | 81.1% | 87.3% | 80.8% | 67.8% | 68.3% | 74.6% | 66.7% | 73.3% | 56.7% | 79.7% | 76.7% | 51.7% | 65.0% | 78.8% | 73.2% | 84.6% | 56.0% | 71.2% | 41.7% | 63.3% | 54.2% | 36.7% | 75.4% | 70.0% | 68.3% | 79.3% | 74.1% |
| 4 | Claude-Sonnet-4.6 | 45.5% | 50.0% | 45.0% | 18.3% | 73.6% | 85.5% | 63.5% | 50.8% | 43.3% | 52.5% | 48.3% | 68.3% | 41.7% | 59.3% | 15.0% | 19.0% | 28.3% | 51.9% | 69.6% | 75.0% | 28.0% | 40.7% | 15.0% | 31.7% | 17.0% | 21.7% | 47.4% | 36.7% | 48.3% | 63.8% | 70.4% |
| 5 | GPT-5.4-Mini | 38.8% | 19.0% | 28.3% | 20.0% | 52.8% | 60.0% | 48.1% | 40.7% | 35.0% | 54.2% | 35.0% | 66.7% | 36.7% | 30.5% | 46.7% | 15.5% | 16.7% | 44.2% | 57.1% | 61.5% | 30.0% | 39.0% | 21.7% | 36.7% | 35.6% | 26.7% | 52.6% | 45.0% | 36.7% | 25.9% | 53.7% |
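One way to quantify how much the vision-only setting costs each model is to subtract its vision-only overall score from its standard overall score. The sketch below does that with the overall figures from the two per-subject tables. The helper name `track_gap` is hypothetical, and the gap is only a rough indicator, not a metric defined by the benchmark.

```python
# (standard, vision-only) overall accuracy in percent, from the two tables.
OVERALL = {
    "Interfaze": (72.5, 69.7),
    "Grok-4.3": (69.4, 67.9),
    "Gemini-3-Flash": (68.0, 67.2),
    "Claude-Sonnet-4.6": (47.1, 45.5),
    "GPT-5.4-Mini": (42.0, 38.8),
}


def track_gap(standard: float, vision_only: float) -> float:
    """Percentage points by which the standard score exceeds the vision-only score."""
    return round(standard - vision_only, 1)


gaps = {model: track_gap(*scores) for model, scores in OVERALL.items()}

# List models from the largest drop to the smallest.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap:+.1f} pp")
```

On these figures every model scores lower on the vision-only track, with drops ranging from 0.8 to 3.2 percentage points.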