STT Speaker Diarization

Diarize multiple speakers on long and short audio files with multilingual support.

More than 100 languages, with mixed-language support. View all supported languages.
Speaker diarization for up to 50 speakers.
Speaker-based intent, sentiment, and other audio analysis.

Basic speaker diarization

OpenAI SDK

import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const DiarizationSchema = z.object({
	full_text: z.string(),
	chunks: z.array(
		z.object({
			speaker_id: z.string(),
			text: z.string(),
			start_time: z.number(),
			end_time: z.number(),
		})
	),
	number_of_speakers: z.number(),
});

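// `interfaze` is assumed to be an OpenAI SDK client already configured for Interfaze (see the OpenAI SDK integration page).
const response = await interfaze.chat.completions.create({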
	model: "interfaze-beta",
	messages: [
		{
			role: "user",
			content: [
				{ type: "text", text: "Transcribe and identify the speakers in the audio file" },
				{
					type: "file",
					file: {
						filename: "stt_multispeaker.mp3",
						file_data: "https://r2public.jigsawstack.com/interfaze/examples/stt_multispeaker.mp3",
					},
				},
			],
		},
	],
	response_format: zodResponseFormat(DiarizationSchema, "diarization_schema"),
});

console.log(response.choices[0].message.content);

//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("STT Results:", precontext?.[0]?.result);
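Once the structured response arrives, the chunks can be post-processed like any other JSON. The sketch below is illustrative only: it assumes `message.content` is a JSON string matching `DiarizationSchema`, that `start_time` and `end_time` are in seconds, and it uses hypothetical sample data and speaker IDs (`speaker_0`, `speaker_1`) rather than real API output. The helper `talkTimeBySpeaker` is not part of any SDK.

```typescript
// Shapes mirror DiarizationSchema from the example above.
interface DiarizationChunk {
	speaker_id: string;
	text: string;
	start_time: number; // assumed to be seconds
	end_time: number; // assumed to be seconds
}

interface DiarizationResult {
	full_text: string;
	chunks: DiarizationChunk[];
	number_of_speakers: number;
}

// Sum the duration of every chunk per speaker_id.
function talkTimeBySpeaker(result: DiarizationResult): Record<string, number> {
	const totals: Record<string, number> = {};
	for (const chunk of result.chunks) {
		const duration = chunk.end_time - chunk.start_time;
		totals[chunk.speaker_id] = (totals[chunk.speaker_id] ?? 0) + duration;
	}
	return totals;
}

// Hypothetical stand-in for response.choices[0].message.content (not real API output).
const sampleContent: string = JSON.stringify({
	full_text: "Hi, thanks for calling. Hello, I have a question. Sure, go ahead.",
	chunks: [
		{ speaker_id: "speaker_0", text: "Hi, thanks for calling.", start_time: 0, end_time: 4 },
		{ speaker_id: "speaker_1", text: "Hello, I have a question.", start_time: 4, end_time: 9 },
		{ speaker_id: "speaker_0", text: "Sure, go ahead.", start_time: 9, end_time: 12 },
	],
	number_of_speakers: 2,
});

const diarization = JSON.parse(sampleContent) as DiarizationResult;
const talkTime = talkTimeBySpeaker(diarization);
console.log(talkTime); // { speaker_0: 7, speaker_1: 5 }
```

In real use, replace `sampleContent` with the content string from the response, and consider validating it with `DiarizationSchema.parse` instead of a type cast.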

Sentiment analysis by speaker

OpenAI SDK

import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const SentimentSchema = z.object({
	full_text: z.string(),
	chunks: z.array(
		z.object({
			speaker_id: z.string(),
			text: z.string(),
			sentiment: z.enum(["positive", "negative", "neutral"]).describe("sentiment of the audio chunk"),
			start_time: z.number(),
			end_time: z.number(),
		})
	),
	number_of_speakers: z.number(),
});

const response = await interfaze.chat.completions.create({
	model: "interfaze-beta",
	messages: [
		{
			role: "user",
			content: [
				{ type: "text", text: "Transcribe the audio file, identify the speakers, and analyze the sentiment of each speaker" },
				{
					type: "file",
					file: {
						filename: "stt_call.mp3",
						file_data: "https://r2public.jigsawstack.com/interfaze/examples/stt_call.mp3",
					},
				},
			],
		},
	],
	response_format: zodResponseFormat(SentimentSchema, "sentiment_schema"),
});

console.log(response.choices[0].message.content);

//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("STT Results:", precontext?.[0]?.result);
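Per-chunk sentiment labels are often easier to act on once rolled up per speaker. The following is a minimal sketch, assuming the response content has already been parsed into chunks shaped like `SentimentSchema`. The sample chunks, the speaker IDs, and the helper `sentimentBySpeaker` are hypothetical and not part of the Interfaze API.

```typescript
type Sentiment = "positive" | "negative" | "neutral";

// Shape mirrors one entry of SentimentSchema.chunks from the example above.
interface SentimentChunk {
	speaker_id: string;
	text: string;
	sentiment: Sentiment;
	start_time: number;
	end_time: number;
}

type SentimentCounts = Record<Sentiment, number>;

// Count how many chunks of each sentiment every speaker produced.
function sentimentBySpeaker(chunks: SentimentChunk[]): Record<string, SentimentCounts> {
	const counts: Record<string, SentimentCounts> = {};
	for (const chunk of chunks) {
		const entry = counts[chunk.speaker_id] ?? { positive: 0, negative: 0, neutral: 0 };
		entry[chunk.sentiment] += 1;
		counts[chunk.speaker_id] = entry;
	}
	return counts;
}

// Hypothetical chunks for illustration (not real API output).
const sampleChunks: SentimentChunk[] = [
	{ speaker_id: "speaker_0", text: "How can I help?", sentiment: "neutral", start_time: 0, end_time: 2 },
	{ speaker_id: "speaker_1", text: "My order never arrived.", sentiment: "negative", start_time: 2, end_time: 5 },
	{ speaker_id: "speaker_0", text: "I am sorry, I will fix that now.", sentiment: "positive", start_time: 5, end_time: 9 },
	{ speaker_id: "speaker_1", text: "Still, this is frustrating.", sentiment: "negative", start_time: 9, end_time: 12 },
];

const sentimentCounts = sentimentBySpeaker(sampleChunks);
console.log(sentimentCounts);
```

Counting chunks treats a two-second remark the same as a long monologue; weighting by `end_time - start_time` is a reasonable alternative when chunk lengths vary widely.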

Blazing fast diarization on long audio files

For the best performance on long audio files, use a run task by placing <task>speech_to_text</task> in the system prompt. This activates only the part of the model used for audio.

OpenAI SDK

import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const response = await interfaze.chat.completions.create({
	model: "interfaze-beta",
	messages: [
		{
			role: "system",
			content: "<task>speech_to_text</task>",
		},
		{
			role: "user",
			content: [
				{ type: "text", text: "Transcribe and identify the speakers in the audio file https://r2public.jigsawstack.com/interfaze/examples/stt_long_audio_sample_3.mp3" },
			],
		},
	],
	response_format: zodResponseFormat(z.any(), "empty_schema"),
});

console.log(response.choices[0].message.content);

This request took 1 min 10 s to transcribe and diarize a 1 hr 35 min audio file.
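To put that timing in perspective, the arithmetic below converts both durations to seconds and computes how many seconds of audio were processed per second of wall-clock time. It is based only on the single run quoted above, so treat the result as a rough indication, not a benchmark. The helper `toSeconds` is introduced here for illustration.

```typescript
// Convert hours, minutes, and seconds to total seconds.
function toSeconds(hours: number, minutes: number, seconds: number): number {
	return hours * 3600 + minutes * 60 + seconds;
}

const audioSeconds = toSeconds(1, 35, 0); // 1 hr 35 min of audio = 5700 s
const processingSeconds = toSeconds(0, 1, 10); // 1 min 10 s of processing = 70 s

// Seconds of audio processed per second of wall-clock time.
const speedup = audioSeconds / processingSeconds;
console.log(`${speedup.toFixed(1)}x faster than real time`); // prints "81.4x faster than real time"
```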

JSON output

The output is truncated for this example.
