Quality vs Price

Chart: mean quality plotted against blended price. The green region highlights the "most attractive quadrant" (higher quality at lower price).

Model summary

Mean quality across the 9 benchmarks and blended price in USD per million tokens (1:1 input:output weighting), sorted by descending quality. Gemini-3-Flash and Interfaze sit in the most attractive quadrant of the chart above.

Model	Mean quality	Blended $/MTok
Interfaze	80.4%	$2.500
Gemini-3.5-Flash	77.1%	$5.250
Gemini-3-Flash	74.4%	$1.750
Claude-Sonnet-5	70.1%	$9.000
Claude-Sonnet-4.6	69.1%	$9.000
Grok-4.3	64.7%	$1.875
GPT-5.4-Mini	62.5%	$2.625

Pricing reference

Public list prices used to compute the blended axis. The blended price is the simple average of the input and output rates, i.e. a 1:1 baseline that weights input and output costs equally.

Model	Input $/MTok	Output $/MTok
Interfaze	$1.50	$3.50
Gemini-3-Flash	$0.50	$3.00
Gemini-3.5-Flash	$1.50	$9.00
Claude-Sonnet-4.6	$3.00	$15.00
Claude-Sonnet-5	$3.00	$15.00
GPT-5.4-Mini	$0.75	$4.50
Grok-4.3	$1.25	$2.50
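
As a quick check, the blended column in the model summary can be reproduced from the list prices above. The following is a minimal sketch rather than the leaderboard's actual code: the names `LIST_PRICES` and `blended_price` are made up for illustration, and the `input_weight` parameter is an assumption added so that other input:output mixes can be tried.

```python
# List prices in USD per million tokens, copied from the pricing table above.
# Each value is (input price, output price).
LIST_PRICES = {
    "Interfaze": (1.50, 3.50),
    "Gemini-3-Flash": (0.50, 3.00),
    "Gemini-3.5-Flash": (1.50, 9.00),
    "Claude-Sonnet-4.6": (3.00, 15.00),
    "Claude-Sonnet-5": (3.00, 15.00),
    "GPT-5.4-Mini": (0.75, 4.50),
    "Grok-4.3": (1.25, 2.50),
}


def blended_price(input_price, output_price, input_weight=0.5):
    """Weighted average of input and output prices.

    input_weight=0.5 corresponds to the 1:1 baseline described above;
    the parameter itself is a hypothetical knob, not part of the leaderboard.
    """
    if not 0.0 <= input_weight <= 1.0:
        raise ValueError("input_weight must be between 0 and 1")
    return input_weight * input_price + (1.0 - input_weight) * output_price


# Blended price per model at the 1:1 baseline.
blended = {model: blended_price(*prices) for model, prices in LIST_PRICES.items()}

if __name__ == "__main__":
    # Print cheapest first.
    for model, price in sorted(blended.items(), key=lambda kv: kv[1]):
        print(f"{model:<20} ${price:.3f}")
```

Changing `input_weight` (for example to 0.75 for a prompt-heavy workload) shows how sensitive the ranking is to the assumed mix; the leaderboard itself reports only the 1:1 figure.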

Methodology

Quality. The mean of each model's scores across the 9 benchmarks on this leaderboard. VoxPopuli is flipped from WER to (1 − WER) so that higher is better on every benchmark. Models without an audio modality are averaged over the other 8 benchmarks, so they are not penalized for missing VoxPopuli.
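
The averaging rule can be sketched as follows. This is an illustration under stated assumptions, not the leaderboard's implementation: scores are assumed to be fractions in [0, 1], a missing benchmark is represented as `None`, and the benchmark names `bench_a` and `bench_b` and all score values are invented for the example.

```python
from statistics import mean


def mean_quality(scores, error_rate_benchmarks=("VoxPopuli",)):
    """Average per-benchmark scores into a single quality number.

    scores: dict mapping benchmark name to a fraction in [0, 1],
            or None when the model cannot run that benchmark.
    error_rate_benchmarks: benchmarks reported as an error rate (WER),
            flipped to 1 - value so that higher is better everywhere.
    """
    usable = []
    for name, value in scores.items():
        if value is None:
            # Missing benchmark: skip it instead of counting it as zero.
            continue
        usable.append(1.0 - value if name in error_rate_benchmarks else value)
    if not usable:
        raise ValueError("no benchmark scores available")
    return mean(usable)


# Hypothetical scores, for illustration only.
audio_model = {"bench_a": 0.80, "bench_b": 0.60, "VoxPopuli": 0.10}
text_only_model = {"bench_a": 0.80, "bench_b": 0.60, "VoxPopuli": None}

audio_quality = mean_quality(audio_model)          # mean of 0.80, 0.60, 0.90
text_only_quality = mean_quality(text_only_model)  # mean of 0.80, 0.60

if __name__ == "__main__":
    print(f"audio model:     {audio_quality:.3f}")
    print(f"text-only model: {text_only_quality:.3f}")
```

Skipping a missing benchmark rather than scoring it as zero keeps text-only models comparable, at the cost of averaging over a slightly different benchmark set.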

Price. Public list prices in USD per million tokens, blended at 50% input / 50% output. Caching, batching, volume discounts, and per-modality pricing (e.g. Gemini's separate audio rate, image token packing) are excluded because they vary by workload; using list prices keeps the comparison like-for-like. Effective per-task cost can differ for reasoning-heavy or multimodal workloads.
