VoxPopuli-Cleaned-AA

Name: VoxPopuli-Cleaned-AA — AI model leaderboard
Creator: Interfaze
License: https://creativecommons.org/licenses/by/4.0/
Keywords: VoxPopuli-Cleaned-AA, ASR (speech recognition), AI benchmark, model leaderboard, Interfaze

Speech-to-text Word Error Rate on the cleaned Audio-AA subset of VoxPopuli (multilingual European Parliament speeches). Lower is better. Only models with native audio input are scored.

Solid bars: WER (Word Error Rate) — % of words mis-transcribed. Lower is better. Striped bars: compute time per second of audio (ms). Lower is faster.
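As a rough illustration of the WER metric described above, here is a minimal sketch using the conventional definition: substitutions, deletions, and insertions divided by the number of reference words. The leaderboard does not state its exact formula or text normalization. The function name `word_error_rate`, the whitespace tokenization, and the lack of case or punctuation normalization are all assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace with no further normalization;
    the leaderboard's own preprocessing is not documented here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.1%}")
```

Because insertions count as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.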

Model rankings

Inference speed

Real-time multiplier: how many seconds of audio each model transcribes in 1 second of compute. The compute time per second of audio (in ms) is 1000 / multiplier; for example, a 26.0x multiplier corresponds to about 38.5 ms per second of audio.
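The conversion between the two speed metrics can be sketched as follows. The speed factors are copied from the Scores table on this page; the function name `compute_ms_per_audio_second` and the dictionary layout are illustrative choices, not part of the leaderboard.

```python
def compute_ms_per_audio_second(speed_factor: float) -> float:
    """Convert a real-time multiplier (e.g. 26.0 for "26.0x") to milliseconds
    of compute needed per second of audio."""
    if speed_factor <= 0:
        raise ValueError("speed factor must be positive")
    return 1000.0 / speed_factor


# Speed factors as listed in the Scores table.
SPEED_FACTORS = {
    "Scribe v2": 26.0,
    "Interfaze": 209.4,
    "Deepgram Nova-3": 143.3,
    "Whisper Large v3": 76.3,
    "Gemini-3-Flash": 17.8,
}

for model, factor in SPEED_FACTORS.items():
    ms = compute_ms_per_audio_second(factor)
    print(f"{model}: {ms:.1f} ms per second of audio")
```

Rounded to one decimal place, these values match the "Compute (ms / audio sec)" column of the table.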

Scores

Word Error Rate (WER), speed factor (real-time multiplier), and compute time per second of audio. Scribe v2 leads on WER; Interfaze leads on speed factor and compute time.

#	Model	WER	Speed factor	Compute (ms / audio sec)
1	Scribe v2	1.7%	26.0x	38.5
2	Interfaze	2.4%	209.4x	4.8
3	Deepgram Nova-3	2.9%	143.3x	7.0
4	Whisper Large v3	3.5%	76.3x	13.1
5	Gemini-3-Flash	4.0%	17.8x	56.2
