STT Speaker Diarization

Diarize multiple speakers on long and short audio files with multilingual support.

  • More than 100 languages, with mixed-language support
  • Speaker diarization for up to 50 speakers
  • Speaker-based intent, sentiment, and other audio analysis

Basic speaker diarization


OpenAI SDK

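import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

// Assumes `interfaze` is an OpenAI SDK client that has already been configured for the Interfaze API.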

const DiarizationSchema = z.object({
	full_text: z.string(),
	chunks: z.array(
		z.object({
			speaker_id: z.string(),
			text: z.string(),
			start_time: z.number(),
			end_time: z.number(),
		})
	),
	number_of_speakers: z.number(),
});

const response = await interfaze.chat.completions.create({
	model: "interfaze-beta",
	messages: [
		{
			role: "user",
			content: [
				{ type: "text", text: "Transcribe and identify the speakers in the audio file" },
				{
					type: "file",
					file: {
						filename: "stt_multispeaker.mp3",
						file_data: "https://r2public.jigsawstack.com/interfaze/examples/stt_multispeaker.mp3",
					},
				},
			],
		},
	],
	response_format: zodResponseFormat(DiarizationSchema, "diarization_schema"),
});

console.log(response.choices[0].message.content);

//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("STT Results:", precontext?.[0]?.result);
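Once the structured response is back, the chunks can be turned into a readable transcript. The following is a minimal sketch rather than part of the Interfaze SDK: it assumes `message.content` is a JSON string matching `DiarizationSchema`, that `start_time` and `end_time` are in seconds, and that speaker IDs look like `speaker_1`. The helpers `formatTime`, `toTurns`, and `formatTranscript` and the sample payload are hypothetical.

```typescript
// Shape mirrors DiarizationSchema from the example above.
interface DiarizedChunk {
	speaker_id: string;
	text: string;
	start_time: number; // assumed to be seconds
	end_time: number; // assumed to be seconds
}

interface Diarization {
	full_text: string;
	chunks: DiarizedChunk[];
	number_of_speakers: number;
}

// Format a number of seconds as m:ss for display.
function formatTime(seconds: number): string {
	const total = Math.floor(seconds);
	const m = Math.floor(total / 60);
	const s = total % 60;
	return `${m}:${s < 10 ? "0" + s : String(s)}`;
}

// Merge consecutive chunks from the same speaker into a single turn.
function toTurns(chunks: DiarizedChunk[]): DiarizedChunk[] {
	const turns: DiarizedChunk[] = [];
	for (const chunk of chunks) {
		const last = turns[turns.length - 1];
		if (last && last.speaker_id === chunk.speaker_id) {
			last.text = `${last.text} ${chunk.text}`;
			last.end_time = chunk.end_time;
		} else {
			turns.push({ ...chunk }); // copy so the input is not mutated
		}
	}
	return turns;
}

// Parse the JSON content string and render one line per speaker turn.
function formatTranscript(content: string): string {
	const result = JSON.parse(content) as Diarization;
	return toTurns(result.chunks)
		.map((t) => `[${formatTime(t.start_time)}-${formatTime(t.end_time)}] ${t.speaker_id}: ${t.text}`)
		.join("\n");
}

// Hypothetical payload standing in for response.choices[0].message.content.
const sampleContent = JSON.stringify({
	full_text: "Hello there. How are you? Fine, thanks.",
	chunks: [
		{ speaker_id: "speaker_1", text: "Hello there.", start_time: 0, end_time: 2.4 },
		{ speaker_id: "speaker_1", text: "How are you?", start_time: 2.4, end_time: 5.1 },
		{ speaker_id: "speaker_2", text: "Fine, thanks.", start_time: 5.1, end_time: 67.9 },
	],
	number_of_speakers: 2,
});

console.log(formatTranscript(sampleContent));
```

Merging adjacent chunks keeps the transcript compact when the model splits one speaker's turn into several segments. If the real timestamps turn out to be in another unit, only `formatTime` needs to change.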

Sentiment analysis by speaker


OpenAI SDK

import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const SentimentSchema = z.object({
	full_text: z.string(),
	chunks: z.array(
		z.object({
			speaker_id: z.string(),
			text: z.string(),
			sentiment: z.enum(["positive", "negative", "neutral"]).describe("sentiment of the audio chunk"),
			start_time: z.number(),
			end_time: z.number(),
		})
	),
	number_of_speakers: z.number(),
});

const response = await interfaze.chat.completions.create({
	model: "interfaze-beta",
	messages: [
		{
			role: "user",
			content: [
				{ type: "text", text: "Transcribe the audio file, identify the speakers, and analyze the sentiment of each speaker" },
				{
					type: "file",
					file: {
						filename: "stt_call.mp3",
						file_data: "https://r2public.jigsawstack.com/interfaze/examples/stt_call.mp3",
					},
				},
			],
		},
	],
	response_format: zodResponseFormat(SentimentSchema, "sentiment_schema"),
});

console.log(response.choices[0].message.content);

//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("STT Results:", precontext?.[0]?.result);
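The schema above labels sentiment per chunk, so a per-speaker view needs a small aggregation step. The sketch below is one possible way to do that. It assumes the parsed `chunks` array matches `SentimentSchema`. The helpers `countSentimentBySpeaker` and `dominantSentiment`, the speaker IDs, and the sample data are hypothetical and not part of the Interfaze API.

```typescript
type Sentiment = "positive" | "negative" | "neutral";

// Shape mirrors one entry of SentimentSchema.chunks.
interface SentimentChunk {
	speaker_id: string;
	text: string;
	sentiment: Sentiment;
	start_time: number;
	end_time: number;
}

type SentimentCounts = Record<Sentiment, number>;

// Count how many chunks of each sentiment every speaker produced.
function countSentimentBySpeaker(chunks: SentimentChunk[]): Map<string, SentimentCounts> {
	const bySpeaker = new Map<string, SentimentCounts>();
	for (const chunk of chunks) {
		const counts = bySpeaker.get(chunk.speaker_id) ?? { positive: 0, negative: 0, neutral: 0 };
		counts[chunk.sentiment] += 1;
		bySpeaker.set(chunk.speaker_id, counts);
	}
	return bySpeaker;
}

// Pick the most frequent label; ties resolve in the order positive, negative, neutral.
function dominantSentiment(counts: SentimentCounts): Sentiment {
	const order: Sentiment[] = ["positive", "negative", "neutral"];
	return order.reduce<Sentiment>((best, s) => (counts[s] > counts[best] ? s : best), "positive");
}

// Hypothetical chunks standing in for JSON.parse(message.content).chunks.
const sampleChunks: SentimentChunk[] = [
	{ speaker_id: "speaker_1", text: "Thanks for calling.", sentiment: "positive", start_time: 0, end_time: 2 },
	{ speaker_id: "speaker_2", text: "My order is late.", sentiment: "negative", start_time: 2, end_time: 5 },
	{ speaker_id: "speaker_1", text: "Let me check.", sentiment: "neutral", start_time: 5, end_time: 7 },
	{ speaker_id: "speaker_2", text: "This keeps happening.", sentiment: "negative", start_time: 7, end_time: 10 },
	{ speaker_id: "speaker_1", text: "Good news, it ships today.", sentiment: "positive", start_time: 10, end_time: 14 },
	{ speaker_id: "speaker_2", text: "Okay.", sentiment: "neutral", start_time: 14, end_time: 15 },
];

const sentimentBySpeaker = countSentimentBySpeaker(sampleChunks);
const speaker1Counts = sentimentBySpeaker.get("speaker_1");
const speaker2Counts = sentimentBySpeaker.get("speaker_2");
if (speaker1Counts && speaker2Counts) {
	console.log("speaker_1:", dominantSentiment(speaker1Counts));
	console.log("speaker_2:", dominantSentiment(speaker2Counts));
}
```

Counting chunks treats a short interjection the same as a long statement. Weighting each chunk by `end_time - start_time` is a reasonable alternative when segment lengths vary widely.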

Blazing fast diarization on long audio files

For the best performance on long audio files, run the request as a task by putting <task>speech_to_text</task> in the system prompt. This activates only the part of the model used for audio.


OpenAI SDK

import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const response = await interfaze.chat.completions.create({
	model: "interfaze-beta",
	messages: [
		{
			role: "system",
			content: "<task>speech_to_text</task>",
		},
		{
			role: "user",
			content: [
				{ type: "text", text: "Transcribe and identify the speakers in the audio file https://r2public.jigsawstack.com/interfaze/examples/stt_long_audio_sample_3.mp3" },
			],
		},
	],
	response_format: zodResponseFormat(z.any(), "empty_schema"),
});

console.log(response.choices[0].message.content);

This request took 1 min 10 s to transcribe and diarize a 1 hr 35 min audio file.
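To put that timing in perspective, the ratio of audio duration to processing time can be worked out directly. This is a small arithmetic sketch using only the two durations quoted above. The function name `speedOverRealTime` is hypothetical.

```typescript
// Ratio of audio length to processing time: how many seconds of audio
// are handled per second of wall-clock time.
function speedOverRealTime(audioSeconds: number, processingSeconds: number): number {
	return audioSeconds / processingSeconds;
}

const audioSeconds = (1 * 60 + 35) * 60; // 1 hr 35 min = 5700 s
const processingSeconds = 1 * 60 + 10; // 1 min 10 s = 70 s

const speed = speedOverRealTime(audioSeconds, processingSeconds);
console.log(`${speed.toFixed(1)}x faster than real time`); // prints "81.4x faster than real time"
```

This figure comes from the single run reported here, so other files, languages, or speaker counts may give different ratios.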
