PhD-level problem solving
GPQA Diamond, the hardest split of GPQA, tests graduate-level reasoning in physics, chemistry, and biology. Questions are designed so that domain experts score ~65% while laypeople score ~30% even with web access.
Percent correct on PhD-level science questions requiring graduate-level reasoning in physics, chemistry, and biology. Higher is better.
Every model's overall score and per-domain accuracy across physics, chemistry, and biology. Bold cells mark the leader for each domain.
| # | Model | Overall | Physics | Chemistry | Biology |
|---|---|---|---|---|---|
| 1 | Interfaze | 89.9% | 95.3% | 88.2% | 73.7% |
| 2 | Claude-Sonnet-4.6 | 89.9% | 93.0% | **89.0%** | 80.0% |
| 3 | Gemini-3-Flash | 88.5% | **96.1%** | 84.6% | 73.4% |
| 4 | GPT-5.4-Mini | 82.8% | 90.7% | 75.3% | **84.2%** |
| 5 | Grok-4.3 | 73.6% | 79.1% | 67.7% | 77.8% |
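For readers who want to check the per-domain leaders themselves, here is a minimal Python sketch. The `scores` dictionary is copied by hand from the table above, and the helper name `domain_leaders` is hypothetical, introduced only for illustration. It is not part of any benchmark tooling.

```python
# Per-domain accuracy (percent), transcribed from the table above.
scores = {
    "Interfaze":         {"Physics": 95.3, "Chemistry": 88.2, "Biology": 73.7},
    "Claude-Sonnet-4.6": {"Physics": 93.0, "Chemistry": 89.0, "Biology": 80.0},
    "Gemini-3-Flash":    {"Physics": 96.1, "Chemistry": 84.6, "Biology": 73.4},
    "GPT-5.4-Mini":      {"Physics": 90.7, "Chemistry": 75.3, "Biology": 84.2},
    "Grok-4.3":          {"Physics": 79.1, "Chemistry": 67.7, "Biology": 77.8},
}


def domain_leaders(scores):
    """Return {domain: model} for the highest-scoring model in each domain."""
    # Take the domain names from the first model's entry.
    domains = list(next(iter(scores.values())))
    return {d: max(scores, key=lambda model: scores[model][d]) for d in domains}


leaders = domain_leaders(scores)
for domain, model in leaders.items():
    print(f"{domain}: {model} ({scores[model][domain]}%)")
```

No single model leads every domain: the top two models overall lead at most one domain each, and the biology leader ranks fourth overall.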
How each model performs on individual GPQA Diamond science domains. Each chart describes what the domain measures.