Interfaze


ASR (speech recognition)

VoxPopuli-Cleaned-AA

Speech-to-text Word Error Rate on the cleaned Audio-AA subset of VoxPopuli (multilingual European Parliament speeches). Lower is better. Only models with native audio input are scored.
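For reference, WER is the word-level edit distance (substitutions + insertions + deletions) between the model transcript and the reference, divided by the number of reference words. A minimal sketch (the function name and dynamic-programming layout are illustrative, not this leaderboard's scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

So one wrong word in a three-word reference gives a WER of 1/3, i.e. about 33%.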

Chart legend — solid bars: WER (Word Error Rate), the percentage of words mis-transcribed; striped bars: compute time per second of audio (ms). Lower is better for both.

Model rankings

Inference speed

Real-time multiplier — how many seconds of audio each model transcribes in 1 second of compute. The compute time per audio second (in ms) is just 1000 / multiplier.
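The conversion stated above can be sketched in a couple of lines (the function name is illustrative):

```python
def compute_ms_per_audio_second(realtime_multiplier: float) -> float:
    """Convert a real-time speed multiplier into compute ms per second of audio.

    A model that transcribes N seconds of audio per second of compute
    spends 1000 / N milliseconds of compute on each audio second.
    """
    return 1000.0 / realtime_multiplier
```

For example, a 209.4x model spends about 4.8 ms of compute per second of audio, and a 26.0x model about 38.5 ms, matching the table below.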

Scores

Word Error Rate (WER), real-time factor, and compute time per second of audio. Click any column header to sort. Bold cells mark the leader for each metric.

#  Model             WER    Speed factor   Compute (ms / audio sec)
1  Scribe v2         1.7%   26.0x          38.5
2  Interfaze         2.4%   209.4x         4.8
3  Deepgram Nova-3   2.9%   143.3x         7.0
4  Whisper Large v3  3.5%   76.3x          13.1
5  Gemini-3-Flash    4.0%   17.8x          56.2