The green region highlights the "most attractive quadrant": high mean quality at a low blended price.
Mean quality across the 9 benchmarks and blended price in $/MTok at a 1:1 input:output weighting, sorted by quality in descending order. Gemini-3-Flash and Interfaze sit in the most attractive quadrant on the chart above.
| Model | Mean quality | Blended $/MTok |
|---|---|---|
| Interfaze | 80.4% | $2.500 |
| Gemini-3.5-Flash | 77.1% | $5.250 |
| Gemini-3-Flash | 74.4% | $1.750 |
| Claude-Sonnet-4.6 | 69.1% | $9.000 |
| Grok-4.3 | 64.7% | $1.875 |
| GPT-5.4-Mini | 62.5% | $2.625 |
Public list prices used to compute the blended axis. The blended price weights input and output costs equally (50% each), a 1:1 baseline.
| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| Interfaze | $1.50 | $3.50 |
| Gemini-3-Flash | $0.50 | $3.00 |
| Gemini-3.5-Flash | $1.50 | $9.00 |
| Claude-Sonnet-4.6 | $3.00 | $15.00 |
| GPT-5.4-Mini | $0.75 | $4.50 |
| Grok-4.3 | $1.25 | $2.50 |
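As a quick check, the blended column can be recomputed from the list prices above. This is a minimal sketch: the `PRICES` dictionary and the `blended_price` helper are illustrative names, not part of the leaderboard's own tooling.

```python
# List prices in USD per million tokens, copied from the table above
# as (input, output) pairs.
PRICES = {
    "Interfaze": (1.50, 3.50),
    "Gemini-3-Flash": (0.50, 3.00),
    "Gemini-3.5-Flash": (1.50, 9.00),
    "Claude-Sonnet-4.6": (3.00, 15.00),
    "GPT-5.4-Mini": (0.75, 4.50),
    "Grok-4.3": (1.25, 2.50),
}


def blended_price(input_price: float, output_price: float) -> float:
    """Equal-weight (1:1) blend of input and output list prices."""
    return 0.5 * input_price + 0.5 * output_price


blended = {model: blended_price(i, o) for model, (i, o) in PRICES.items()}

for model, price in blended.items():
    print(f"{model}: ${price:.3f}")
```

The resulting values match the Blended $/MTok column in the first table.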
Quality. The mean of each model's scores on the 9 benchmarks on this leaderboard. VoxPopuli is flipped from WER to (1 − WER) so that higher is better on every benchmark. Models without an audio modality are averaged over the other 8 benchmarks only, so they are not penalized for missing VoxPopuli.
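The averaging rule can be sketched as follows. The per-benchmark scores here are made-up placeholders, not leaderboard results, and the benchmark names other than VoxPopuli are hypothetical; the sketch also assumes scores are fractions in [0, 1] and that a missing benchmark is recorded as `None`.

```python
from typing import Dict, Optional


def mean_quality(scores: Dict[str, Optional[float]],
                 wer_key: str = "VoxPopuli") -> float:
    """Average the available scores, flipping the WER benchmark to (1 - WER).

    A score of None marks a benchmark the model cannot run (for example,
    no audio modality); it is skipped rather than counted as zero.
    """
    values = []
    for name, score in scores.items():
        if score is None:
            continue
        values.append(1.0 - score if name == wer_key else score)
    return sum(values) / len(values)


# Placeholder data: 8 hypothetical benchmarks at 0.8 each.
audio_model = {f"benchmark_{i}": 0.8 for i in range(1, 9)}
audio_model["VoxPopuli"] = 0.1  # WER of 10%, counted as 0.9

text_only_model = {f"benchmark_{i}": 0.8 for i in range(1, 9)}
text_only_model["VoxPopuli"] = None  # no audio modality, so skipped

print(round(mean_quality(audio_model), 4))
print(round(mean_quality(text_only_model), 4))
```

The text-only model is averaged over 8 values and the audio model over 9, so the missing benchmark does not lower the text-only model's mean.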
Price. Public list prices in USD per million tokens, blended at 50% input / 50% output. Caching, batching, volume discounts, and per-modality pricing (e.g. Gemini's separate audio rate, image token packing) are excluded because they vary per workload; list prices keep the comparison apples-to-apples. Effective per-task cost can differ for reasoning-heavy or multimodal workloads.
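To illustrate how the weighting affects the comparison, the sketch below re-blends two of the list prices at an output-heavy ratio. The 1:3 input:output ratio is an assumption chosen for illustration (it treats a reasoning-heavy workload as one that produces more billed output tokens); it is not a figure from the leaderboard, and `blended_price` is a hypothetical helper.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: float = 1, output_parts: float = 1) -> float:
    """Blend list prices at an input_parts:output_parts token ratio."""
    total = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total


# (input, output) list prices in USD per million tokens, from the table above.
gemini_3_flash = (0.50, 3.00)
grok_4_3 = (1.25, 2.50)

# Leaderboard baseline: 1:1 weighting.
baseline = {
    "Gemini-3-Flash": blended_price(*gemini_3_flash),
    "Grok-4.3": blended_price(*grok_4_3),
}

# Assumed output-heavy workload: 1 part input to 3 parts output.
output_heavy = {
    "Gemini-3-Flash": blended_price(*gemini_3_flash, input_parts=1, output_parts=3),
    "Grok-4.3": blended_price(*grok_4_3, input_parts=1, output_parts=3),
}

print(baseline)
print(output_heavy)
```

Under the 1:1 baseline Gemini-3-Flash is the cheaper of the two, while under the assumed 1:3 ratio Grok-4.3 is, which is why the blended axis is best read as a baseline and not as a per-task cost.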