ASR (speech recognition)
Speech-to-text Word Error Rate on the cleaned Audio-AA subset of VoxPopuli (multilingual European Parliament speeches). Lower is better. Only models with native audio input are scored.
Solid bars: WER (Word Error Rate), the number of substituted, deleted, and inserted words as a percentage of the words in the reference transcript. Lower is better. Striped bars: compute time per second of audio (ms). Lower is faster.
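To make the WER metric concrete, here is a minimal Python sketch of the standard word-level edit-distance definition. It is not the benchmark's scoring code: the source does not describe how transcripts are normalized (casing, punctuation, numerals), so this version only splits on whitespace. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Assumption: plain whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion (second "the"): 2 errors / 6 words.
    score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"{score:.1%}")
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.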
Real-time multiplier: how many seconds of audio each model transcribes in 1 second of compute. The compute time per audio second (in ms) is 1000 / multiplier; for example, a 26.0x multiplier corresponds to 1000 / 26.0 ≈ 38.5 ms.
Word Error Rate (WER), real-time multiplier, and compute time per second of audio for each model.
| # | Model | WER | Real-time multiplier | Compute (ms per audio second) |
|---|---|---|---|---|
| 1 | Scribe v2 | 1.7% | 26.0x | 38.5 |
| 2 | Interfaze | 2.4% | 209.4x | 4.8 |
| 3 | Deepgram Nova-3 | 2.9% | 143.3x | 7.0 |
| 4 | Whisper Large v3 | 3.5% | 76.3x | 13.1 |
| 5 | Gemini-3-Flash | 4.0% | 17.8x | 56.2 |
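The two speed columns carry the same information, related by compute time (ms) = 1000 / multiplier. The short Python sketch below recomputes the compute column from the multipliers listed in the table as a consistency check. The helper name `compute_ms_per_audio_second` and the dictionary layout are illustrative, not part of the benchmark.

```python
def compute_ms_per_audio_second(multiplier: float) -> float:
    """Convert a real-time multiplier (audio seconds per compute second)
    into compute milliseconds needed per second of audio."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return 1000.0 / multiplier


# Multipliers copied from the table above.
multipliers = {
    "Scribe v2": 26.0,
    "Interfaze": 209.4,
    "Deepgram Nova-3": 143.3,
    "Whisper Large v3": 76.3,
    "Gemini-3-Flash": 17.8,
}

if __name__ == "__main__":
    for model, multiplier in multipliers.items():
        ms = compute_ms_per_audio_second(multiplier)
        print(f"{model}: {ms:.1f} ms per audio second")
```

Rounded to one decimal place, the results match the table's compute column (38.5, 4.8, 7.0, 13.1, and 56.2 ms).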