Diarize multiple speakers on long and short audio files with multilingual support.
OpenAI SDK
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

// `interfaze` is assumed to be an OpenAI SDK client already configured for the Interfaze API.

// Structured output: the full transcript plus one entry per speaker turn.
const DiarizationSchema = z.object({
  full_text: z.string(),
  chunks: z.array(
    z.object({
      speaker_id: z.string(),
      text: z.string(),
      start_time: z.number(),
      end_time: z.number(),
    })
  ),
  number_of_speakers: z.number(),
});

const response = await interfaze.chat.completions.create({
  model: "interfaze-beta",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Transcribe and identify the speakers in the audio file" },
        {
          type: "file",
          file: {
            filename: "stt_multispeaker.mp3",
            file_data: "https://r2public.jigsawstack.com/interfaze/examples/stt_multispeaker.mp3",
          },
        },
      ],
    },
  ],
  response_format: zodResponseFormat(DiarizationSchema, "diarization_schema"),
});

// The structured result, returned as a JSON string that follows DiarizationSchema.
console.log(response.choices[0].message.content);

//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("STT Results:", precontext?.[0]?.result);

JSON output
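Once the response comes back, the content string can be parsed and post-processed like any other JSON. The following is a minimal sketch, not part of the Interfaze API: the helpers `formatTime`, `toTranscript`, and `speakingTime` are hypothetical names, the sample data is made up, and the timestamps are assumed to be in seconds (the schema above only says they are numbers).

```typescript
// Shape mirrors DiarizationSchema from the example above.
interface DiarizationChunk {
  speaker_id: string;
  text: string;
  start_time: number; // assumed to be seconds
  end_time: number; // assumed to be seconds
}

interface Diarization {
  full_text: string;
  chunks: DiarizationChunk[];
  number_of_speakers: number;
}

// Hypothetical helper: format a number of seconds as mm:ss.
function formatTime(seconds: number): string {
  const total = Math.floor(seconds);
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return pad(Math.floor(total / 60)) + ":" + pad(total % 60);
}

// Hypothetical helper: one readable line per speaker turn.
function toTranscript(result: Diarization): string {
  return result.chunks
    .map((c) => `[${formatTime(c.start_time)}-${formatTime(c.end_time)}] ${c.speaker_id}: ${c.text}`)
    .join("\n");
}

// Hypothetical helper: total speaking time per speaker, in the same unit as the timestamps.
function speakingTime(result: Diarization): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const c of result.chunks) {
    totals[c.speaker_id] = (totals[c.speaker_id] ?? 0) + (c.end_time - c.start_time);
  }
  return totals;
}

// Made-up stand-in for response.choices[0].message.content.
const sampleContent = JSON.stringify({
  full_text: "Hello, thanks for calling. Hi, I have a question. Sure.",
  chunks: [
    { speaker_id: "speaker_0", text: "Hello, thanks for calling.", start_time: 0, end_time: 4.5 },
    { speaker_id: "speaker_1", text: "Hi, I have a question.", start_time: 4.5, end_time: 9 },
    { speaker_id: "speaker_0", text: "Sure.", start_time: 9, end_time: 12 },
  ],
  number_of_speakers: 2,
});

const sampleResult = JSON.parse(sampleContent) as Diarization;
console.log(toTranscript(sampleResult));
console.log(speakingTime(sampleResult));
```

In a real project, validating the parsed object against `DiarizationSchema` before using it would be safer than the plain type assertion used here.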
OpenAI SDK
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

// `interfaze` is assumed to be an OpenAI SDK client already configured for the Interfaze API.

// Same shape as a diarization result, with a sentiment label added to each chunk.
const SentimentSchema = z.object({
  full_text: z.string(),
  chunks: z.array(
    z.object({
      speaker_id: z.string(),
      text: z.string(),
      sentiment: z.enum(["positive", "negative", "neutral"]).describe("sentiment of the audio chunk"),
      start_time: z.number(),
      end_time: z.number(),
    })
  ),
  number_of_speakers: z.number(),
});

const response = await interfaze.chat.completions.create({
  model: "interfaze-beta",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Transcribe the audio file, identify the speakers, and analyze the sentiment of each speaker" },
        {
          type: "file",
          file: {
            filename: "stt_call.mp3",
            file_data: "https://r2public.jigsawstack.com/interfaze/examples/stt_call.mp3",
          },
        },
      ],
    },
  ],
  response_format: zodResponseFormat(SentimentSchema, "sentiment_schema"),
});

// The structured result, returned as a JSON string that follows SentimentSchema.
console.log(response.choices[0].message.content);

//@ts-expect-error precontext is not typed
const precontext = response.precontext;
console.log("STT Results:", precontext?.[0]?.result);

JSON output
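Because each chunk carries its own sentiment label, a per-speaker summary is a simple aggregation over the `chunks` array. The sketch below is illustrative only: `sentimentBySpeaker` is a hypothetical helper, and the sample chunks are made up rather than taken from `stt_call.mp3`.

```typescript
type Sentiment = "positive" | "negative" | "neutral";

// Shape mirrors one entry of `chunks` in SentimentSchema above.
interface SentimentChunk {
  speaker_id: string;
  text: string;
  sentiment: Sentiment;
  start_time: number;
  end_time: number;
}

type SentimentCounts = Record<Sentiment, number>;

// Hypothetical helper: count how many chunks of each sentiment every speaker has.
function sentimentBySpeaker(chunks: SentimentChunk[]): Record<string, SentimentCounts> {
  const tally: Record<string, SentimentCounts> = {};
  for (const chunk of chunks) {
    const counts = tally[chunk.speaker_id] ?? { positive: 0, negative: 0, neutral: 0 };
    counts[chunk.sentiment] += 1;
    tally[chunk.speaker_id] = counts;
  }
  return tally;
}

// Made-up sample data in the shape of the parsed response.
const sampleChunks: SentimentChunk[] = [
  { speaker_id: "agent", text: "How can I help?", sentiment: "neutral", start_time: 0, end_time: 2 },
  { speaker_id: "customer", text: "My order is late.", sentiment: "negative", start_time: 2, end_time: 5 },
  { speaker_id: "agent", text: "I have refunded it.", sentiment: "positive", start_time: 5, end_time: 8 },
  { speaker_id: "customer", text: "Great, thank you!", sentiment: "positive", start_time: 8, end_time: 10 },
];

const sentimentTally = sentimentBySpeaker(sampleChunks);
console.log(sentimentTally);
```

Counting chunks treats a two-second remark and a two-minute monologue equally; weighting each chunk by `end_time - start_time` is one possible alternative when speaking time matters.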
For the best performance on long audio files, run a task by putting <task>speech_to_text</task> in the system prompt. This activates only the part of the model used for audio.
OpenAI SDK
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

// `interfaze` is assumed to be an OpenAI SDK client already configured for the Interfaze API.

const response = await interfaze.chat.completions.create({
  model: "interfaze-beta",
  messages: [
    {
      // The task tag in the system prompt selects the speech-to-text task.
      role: "system",
      content: "<task>speech_to_text</task>",
    },
    {
      role: "user",
      content: [
        { type: "text", text: "Transcribe and identify the speakers in the audio file https://r2public.jigsawstack.com/interfaze/examples/stt_long_audio_sample_3.mp3" },
      ],
    },
  ],
  // No fixed schema here: z.any() accepts whatever structure the task returns.
  response_format: zodResponseFormat(z.any(), "empty_schema"),
});

console.log(response.choices[0].message.content);

This took 1m10s to transcribe and diarize a 1hr 35min audio file.
JSON output
The output is truncated for this example.
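To put the timing quoted for the long-audio example in perspective, the arithmetic can be written out as a speed-up over real time. This is only a worked calculation on the two figures given above (1m10s of processing for a 1hr 35min file); it reflects a single run, not a guaranteed throughput, and `toSeconds` is a hypothetical helper.

```typescript
// Hypothetical helper: convert hours, minutes, and seconds to a total in seconds.
function toSeconds(hours: number, minutes: number, seconds: number): number {
  return hours * 3600 + minutes * 60 + seconds;
}

const audioSeconds = toSeconds(1, 35, 0); // 1hr 35min of audio = 5700 s
const processingSeconds = toSeconds(0, 1, 10); // 1m10s of processing = 70 s

// How many seconds of audio were handled per second of processing.
const speedup = audioSeconds / processingSeconds;
console.log(`${speedup.toFixed(1)}x faster than real time`); // prints "81.4x faster than real time"
```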