Interfaze

logo

Beta

pricing

docs

blog

sign in

Get Started

Introduction

Examples

Vision

Concepts

Resources

Integrations

Speech to Text (STT)

copy markdown

Transcribe and diarize audio files of multiple speakers and languages at blazing fast speeds.

  • Over 100+ languages with mixed language support. View all supported languages
  • Blazing fast transcription speeds (1hr 30mins of audio in 50 seconds)
  • Automatically de-noise low-quality audio for better transcription
  • Intelligent audio analysis for intent, sentiment, and more
  • Speaker diarization up to 50 speakers. Learn more about speaker diarization

Basic audio transcription


OpenAI SDK

Vercel AI SDK

LangChain SDK

...

JSON output

...

Blazing fast STT

Running STT as a single task with <task>speech_to_text</task> in the system message makes it significantly faster and cheaper with a fixed structured output that's pre-defined.

Learn more about running a task.

OpenAI SDK

Vercel AI SDK

LangChain SDK

...

Note how the URL is passed in the prompt instead of in the file object. This is another way to pass files to the model which has a marginal speed increase.

JSON output

...

Audio translation

Translate any audio or text to over 100+ languages while maintaining the original meaning and context.

OpenAI SDK

Vercel AI SDK

LangChain SDK

...

JSON output

...

You can reference the precontext to get the raw results from the model for both the STT and translation processes.

Noisy bad quality audio transcription


Automatically de-noise low-quality enhance audio for better transcription.

OpenAI SDK

Vercel AI SDK

LangChain SDK

...

JSON output

...

Audio summary and understanding


OpenAI SDK

Vercel AI SDK

LangChain SDK

...

JSON output

...

Long audio transcription (1hr+)


To get the best performance with long audio file is to use run task with the <task>speech_to_text</task> in the system prompt, this only activates a part of the model used for audio.

OpenAI SDK

Vercel AI SDK

LangChain SDK

...

This took 50s to transcribe a 1hr and 35min audio file.

JSON output

...

The output is truncated for this example.

Speaker diarization

Check out how to perform speaker diarization here.