Diarization

Identify and separate multiple speakers in audio recordings with speaker diarization.

Overview

Speaker diarization answers the question “who spoke when?” It splits audio into timestamped, speaker-labeled segments, which is useful for:

Meeting transcripts — Attribute statements to participants
Interviews — Separate interviewer and interviewee
Podcasts — Identify hosts and guests
Call recordings — Distinguish callers

Granite Speech speaker-turn transcripts are documented separately as Speaker Attributed ASR. Use diarization when you need timestamped speaker segments.

Getting Started

Download Diarization Pipeline Models

For best results, run diarization as a three-model pipeline (diarization, ASR, and forced aligner) and pull all three models:

izwi pull diar_streaming_sortformer_4spk-v2.1
izwi pull Parakeet-TDT-0.6B-v3
izwi pull Qwen3-ForcedAligner-0.6B

Start the Server

izwi serve

Using the Web UI

Navigate to Transcription in the sidebar
Switch to Diarization mode
Upload an audio file with multiple speakers
Click Analyze
View the speaker-segmented transcript

Output

The diarization view shows:

Speaker labels — Speaker 1, Speaker 2, etc.
Timestamps — When each speaker talks
Transcript — What each speaker said

Example output:

[00:00 - 00:05] Speaker 1: Welcome to the meeting.
[00:05 - 00:12] Speaker 2: Thanks for having me.
[00:12 - 00:20] Speaker 1: Let's start with the agenda.
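If you need the formatted transcript as structured data, the lines can be parsed with a regular expression. The sketch below assumes every line follows the `[MM:SS - MM:SS] Label: text` layout shown in the example; the `Segment` type and `parse_line` helper are illustrative names, not part of izwi, and longer recordings may use a different timestamp layout.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the example output: "[MM:SS - MM:SS] Label: text".
LINE_RE = re.compile(r"^\[(\d+):(\d{2}) - (\d+):(\d{2})\] ([^:]+): (.*)$")


class Segment(NamedTuple):
    start_secs: int
    end_secs: int
    speaker: str
    text: str


def parse_line(line: str) -> Optional[Segment]:
    """Parse one transcript line; return None if it does not match the pattern."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    start_min, start_sec, end_min, end_sec, speaker, text = match.groups()
    return Segment(
        start_secs=int(start_min) * 60 + int(start_sec),
        end_secs=int(end_min) * 60 + int(end_sec),
        speaker=speaker,
        text=text,
    )


segment = parse_line("[00:05 - 00:12] Speaker 2: Thanks for having me.")
print(segment)
```

For timing that must be exact, the `segments` and `words` fields of the ready job record are likely a better source than the formatted text.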

Using the API

Endpoint

POST /v1/speech-to-text/jobs?job_kind=diarization

Request (multipart/form-data)

Field	Type	Description
`file`	File	Audio file to analyze
`model`	String	Diarization model (for example `diar_streaming_sortformer_4spk-v2.1`)
`asr_model`	String	Optional ASR model override
`aligner_model`	String	Optional forced aligner model override
`llm_model`	String	Optional transcript refinement model
`min_speakers`	Integer	Optional minimum expected speakers
`max_speakers`	Integer	Optional maximum expected speakers
`min_speech_duration_ms`	Number	Optional VAD tuning: minimum speech duration, in milliseconds
`min_silence_duration_ms`	Number	Optional VAD tuning: minimum silence duration, in milliseconds
`enable_llm_refinement`	Boolean/String	Enable optional transcript refinement

Example (curl)

curl -X POST "http://localhost:8080/v1/speech-to-text/jobs?job_kind=diarization" \
  -F "file=@meeting.wav" \
  -F "model=diar_streaming_sortformer_4spk-v2.1" \
  -F "asr_model=Parakeet-TDT-0.6B-v3" \
  -F "aligner_model=Qwen3-ForcedAligner-0.6B" \
  -F "min_speakers=2" \
  -F "max_speakers=2"
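The same request can be sent from Python using only the standard library. This is a minimal sketch, not an official client: `encode_multipart` and `create_diarization_job` are hypothetical helper names, the base URL is taken from the curl example, and the generic `application/octet-stream` content type for the file part is an assumption about what the server accepts.

```python
import json
import os
import urllib.request
import uuid

BASE_URL = "http://localhost:8080"  # host and port from the curl example


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one file part named "file" as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"  # assumed to be acceptable
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def create_diarization_job(audio_path, fields, base_url=BASE_URL):
    """POST the audio file and form fields; return the created job record as a dict."""
    with open(audio_path, "rb") as audio:
        body, content_type = encode_multipart(
            fields, os.path.basename(audio_path), audio.read()
        )
    request = urllib.request.Request(
        f"{base_url}/v1/speech-to-text/jobs?job_kind=diarization",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `create_diarization_job("meeting.wav", {"model": "diar_streaming_sortformer_4spk-v2.1", "min_speakers": 2, "max_speakers": 2})` should mirror the curl call above.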

Response

The create request returns a persisted diarization job record. Poll it with GET /v1/speech-to-text/jobs/{record_id}?job_kind=diarization, substituting the returned id for {record_id}, until processing_status is ready.

{
  "id": "diarization_...",
  "kind": "diarization",
  "processing_status": "pending"
}

Ready job records include segments, words, utterances, speaker_count, duration_secs, alignment_coverage, LLM refinement status, processing metrics, and the formatted transcript. See the API Reference for JSON input and exact response shapes.
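The polling step described in this section can be sketched in Python as follows. The endpoint, the `processing_status` field, and the `ready` value come from this page; the helper names, the polling interval, and the timeout are assumptions. Failure statuses are not documented here, so the sketch gives up after a timeout instead of checking for one.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port from the curl example


def fetch_job(record_id, base_url=BASE_URL):
    """GET one diarization job record and return it as a dict."""
    url = f"{base_url}/v1/speech-to-text/jobs/{record_id}?job_kind=diarization"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_until_ready(record_id, fetch=fetch_job, interval_secs=2.0,
                     timeout_secs=600.0, sleep=time.sleep):
    """Poll until processing_status is "ready"; raise TimeoutError otherwise."""
    waited = 0.0
    while True:
        job = fetch(record_id)
        if job.get("processing_status") == "ready":
            return job
        if waited >= timeout_secs:
            raise TimeoutError(f"job {record_id} not ready after {timeout_secs} seconds")
        sleep(interval_secs)
        waited += interval_secs
```

Pass the `id` from the create response as `record_id`. The `fetch` and `sleep` parameters exist so the loop can be exercised without a running server.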

Configuration

Number of Speakers

If you know how many speakers are in the audio, set both min_speakers and max_speakers to that number to improve accuracy:

# Via API
curl -X POST "http://localhost:8080/v1/speech-to-text/jobs?job_kind=diarization" \
  -F "file=@meeting.wav" \
  -F "min_speakers=3" \
  -F "max_speakers=3"
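When the request is built in code, the two speaker-count fields can be assembled and checked before sending. The helper below is a hypothetical sketch; the client-side checks are an assumption, and the server may apply its own validation.

```python
from typing import Dict, Optional


def speaker_count_fields(min_speakers: Optional[int] = None,
                         max_speakers: Optional[int] = None) -> Dict[str, str]:
    """Return form fields for the speaker-count hints, omitting unset ones."""
    for name, value in (("min_speakers", min_speakers), ("max_speakers", max_speakers)):
        if value is not None and value < 1:
            raise ValueError(f"{name} must be at least 1")
    if (min_speakers is not None and max_speakers is not None
            and min_speakers > max_speakers):
        raise ValueError("min_speakers must not exceed max_speakers")
    fields = {}
    if min_speakers is not None:
        fields["min_speakers"] = str(min_speakers)
    if max_speakers is not None:
        fields["max_speakers"] = str(max_speakers)
    return fields


# Exactly three speakers, matching the curl example above.
print(speaker_count_fields(3, 3))
# Only bounds are known.
print(speaker_count_fields(min_speakers=2, max_speakers=4))
```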

Speaker Labels

By default, speakers are labeled “Speaker 1”, “Speaker 2”, etc. You can rename them in the UI after processing.
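Outside the UI, default labels in a formatted transcript can also be replaced with a small script. This sketch is not an izwi feature: it assumes the `[MM:SS - MM:SS] Label: text` line layout from the example output, and `rename_speakers` is an illustrative name.

```python
import re
from typing import Dict

# Captures the timestamp prefix, the label before the first colon, and the separator.
LABEL_RE = re.compile(r"^(\[[^\]]*\] )([^:]+)(: )", re.MULTILINE)


def rename_speakers(transcript: str, names: Dict[str, str]) -> str:
    """Replace whole speaker labels using the mapping; unknown labels are kept."""

    def substitute(match):
        label = match.group(2)
        return match.group(1) + names.get(label, label) + match.group(3)

    return LABEL_RE.sub(substitute, transcript)


transcript = (
    "[00:00 - 00:05] Speaker 1: Welcome to the meeting.\n"
    "[00:05 - 00:12] Speaker 2: Thanks for having me."
)
print(rename_speakers(transcript, {"Speaker 1": "Alice", "Speaker 2": "Bob"}))
```

Matching the whole label, not a substring, keeps a mapping for "Speaker 1" from also rewriting "Speaker 10".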

Tips for Best Results

Quality audio — Clear recordings with minimal background noise
Distinct voices — Works best when speakers have different voice characteristics
Minimal overlap — Speakers talking over each other reduces accuracy
Specify speaker count — If known, helps the algorithm
Longer segments — Short utterances are harder to attribute

Limitations

Similar voices — May confuse speakers with very similar voices
Overlapping speech — Simultaneous talking is challenging
Background noise — Reduces speaker detection accuracy
Very short clips — Need enough audio to identify speaker patterns

Use Cases

Meeting Minutes

Upload a meeting recording to get a transcript with speaker attribution:

Record your meeting
Upload the recording in Diarization mode
Rename speakers as needed
Export the speaker-labeled transcript

Interview Transcription

Well suited to journalistic and research interviews:

Record the interview
Process with diarization
Get clean Q&A format output

Podcast Production

Identify speakers for editing and show notes:

Upload raw podcast audio
See who spoke when
Use timestamps for editing
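For show notes, the "who spoke when" data can be summarized as talk time per speaker. The sketch below works on plain `(start_secs, end_secs, speaker)` tuples, which you would build yourself from the job output; that tuple layout is an assumption made for illustration, not the API's response shape.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def talk_time_by_speaker(
    segments: Iterable[Tuple[float, float, str]]
) -> Dict[str, float]:
    """Sum segment durations, in seconds, for each speaker label."""
    totals = defaultdict(float)
    for start_secs, end_secs, speaker in segments:
        totals[speaker] += end_secs - start_secs
    return dict(totals)


# Segments taken from the example output earlier on this page.
segments = [(0, 5, "Speaker 1"), (5, 12, "Speaker 2"), (12, 20, "Speaker 1")]
print(talk_time_by_speaker(segments))  # {'Speaker 1': 13.0, 'Speaker 2': 7.0}
```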
