Text-to-SQL
Natural-language to SQL on real warehouse-scale schemas. The lite track focuses on multi-step queries against a single database, where the model has to pick the right tables, joins, and filters.
Execution accuracy — fraction of generated SQL queries that return the correct result against the live DB. Higher is better. Hover a bar to reveal the exact score.
Every model evaluated on Spider-2.0-Lite, ranked highest to lowest.
| # | Model | Score |
|---|---|---|
| 1 | Interfaze | 52.9% |
| 2 | Claude-Sonnet-4.6 | 49.6% |
| 3 | Grok-4.3 | 45.9% |
| 4 | Gemini-3-Flash | 45.2% |
| 5 | GPT-5.4-Mini | 26.7% |