PhD-level problem solving

GPQA Diamond

The hardest split of GPQA, with graduate-level reasoning questions across physics, chemistry, and biology. Designed so that domain experts score ~65% and laypeople ~30%, even with web access.

Percent correct on PhD-level science questions (physics, chemistry, biology, graduate reasoning). Higher is better.

Model rankings

Per-domain breakdown

Every model's overall score and per-domain accuracy across physics, chemistry, and biology.

#  Model              Overall  Physics  Chemistry  Biology
1  Interfaze          89.9%    95.3%    88.2%      73.7%
2  Claude-Sonnet-4.6  89.9%    93.0%    89.0%      80.0%
3  Gemini-3-Flash     88.5%    96.1%    84.6%      73.4%
4  GPT-5.4-Mini       82.8%    90.7%    75.3%      84.2%
5  Grok-4.3           73.6%    79.1%    67.7%      77.8%
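
The per-domain comparison in the table above can be reproduced offline. The following is a minimal sketch in Python, assuming the scores are transcribed by hand from the table; the names `COLUMNS`, `ROWS`, `sort_by`, and `leaders` are illustrative and are not part of any Interfaze API.

```python
# Scores transcribed by hand from the table above (percent correct).
# All names here are hypothetical helpers for illustration only.
COLUMNS = ("overall", "physics", "chemistry", "biology")

ROWS = [
    ("Interfaze",         89.9, 95.3, 88.2, 73.7),
    ("Claude-Sonnet-4.6", 89.9, 93.0, 89.0, 80.0),
    ("Gemini-3-Flash",    88.5, 96.1, 84.6, 73.4),
    ("GPT-5.4-Mini",      82.8, 90.7, 75.3, 84.2),
    ("Grok-4.3",          73.6, 79.1, 67.7, 77.8),
]


def sort_by(column, rows=ROWS):
    """Return the rows sorted by one column, highest score first."""
    idx = COLUMNS.index(column) + 1  # +1 skips the model name
    return sorted(rows, key=lambda row: row[idx], reverse=True)


def leaders(rows=ROWS):
    """Map each column to the models holding its top score (ties are kept)."""
    result = {}
    for idx, column in enumerate(COLUMNS, start=1):
        best = max(row[idx] for row in rows)
        result[column] = [row[0] for row in rows if row[idx] == best]
    return result


if __name__ == "__main__":
    for column, models in leaders().items():
        print(f"{column}: {', '.join(models)}")
```

Keeping ties in `leaders` matters for this table: Interfaze and Claude-Sonnet-4.6 share the top overall score of 89.9%, while each science domain has a single leader.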

Domain details

How each model performs on individual GPQA Diamond science domains. Each chart describes what the domain measures.