Overview
Speaker diarization answers the question “who spoke when?” It segments audio by speaker, making it invaluable for:
- Meeting transcripts — Attribute statements to participants
- Interviews — Separate interviewer and interviewee
- Podcasts — Identify hosts and guests
- Call recordings — Distinguish callers
Granite Speech speaker-turn transcripts are documented separately as Speaker Attributed ASR. Use diarization when you need timestamped speaker segments.
Getting Started
Download Diarization Pipeline Models
For best results, use a diarization + ASR + aligner pipeline.
Start the Server
Using the Web UI
- Navigate to Transcription in the sidebar
- Switch to Diarization mode
- Upload an audio file with multiple speakers
- Click Analyze
- View the speaker-segmented transcript
Output
The diarization view shows:
- Speaker labels — Speaker 1, Speaker 2, etc.
- Timestamps — When each speaker talks
- Transcript — What each speaker said
Using the API
Endpoint
Request (multipart/form-data)
| Field | Type | Description |
|---|---|---|
| file | File | Audio file to analyze |
| model | String | Diarization model (for example diar_streaming_sortformer_4spk-v2.1) |
| asr_model | String | Optional ASR model override |
| aligner_model | String | Optional forced aligner model override |
| llm_model | String | Optional transcript refinement model |
| min_speakers | Integer | Optional minimum expected speakers |
| max_speakers | Integer | Optional maximum expected speakers |
| min_speech_duration_ms | Number | Optional VAD speech-duration tuning |
| min_silence_duration_ms | Number | Optional VAD silence-duration tuning |
| enable_llm_refinement | Boolean/String | Enable optional transcript refinement |
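As a rough illustration of how the text fields in the table might be assembled on the client side, here is a minimal Python sketch. The helper name `build_diarization_fields` and its client-side validation are hypothetical and not part of the product; only the field names come from the table. It also assumes the string "true" is accepted for the Boolean/String field.

```python
from typing import Optional


def build_diarization_fields(
    model: str,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
    enable_llm_refinement: bool = False,
) -> dict[str, str]:
    """Build the text fields of a multipart/form-data diarization request."""
    # Hypothetical client-side checks; the server may enforce its own rules.
    if min_speakers is not None and min_speakers < 1:
        raise ValueError("min_speakers must be at least 1")
    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers cannot exceed max_speakers")

    fields = {"model": model}
    # Optional fields are left out when unset so any server-side defaults can apply.
    if min_speakers is not None:
        fields["min_speakers"] = str(min_speakers)
    if max_speakers is not None:
        fields["max_speakers"] = str(max_speakers)
    if enable_llm_refinement:
        # Assumption: "true" is an accepted value for this Boolean/String field.
        fields["enable_llm_refinement"] = "true"
    return fields


fields = build_diarization_fields(
    "diar_streaming_sortformer_4spk-v2.1", min_speakers=2, max_speakers=4
)
print(fields)
```

The audio itself would go in the `file` part of the same multipart request, sent with whichever HTTP client you prefer. The endpoint path is listed in the API Reference.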
Example (curl)
Response
The create response is a persisted diarization job. Poll the returned id with
GET /v1/speech-to-text/jobs/{record_id}?job_kind=diarization until
processing_status is ready.
The completed job includes segments, words, utterances, speaker_count,
duration_secs, alignment_coverage, LLM refinement status, processing metrics,
and the formatted transcript.
See the API Reference for JSON input and exact response shapes.
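The polling step could be sketched in Python roughly as follows. The job path and the `ready` status come from the description above. The base URL, the polling interval, the attempt limit, and the assumption that `processing_status` sits at the top level of a JSON object are all illustrative. Failure states are not handled because their names are not documented here.

```python
import json
import time
import urllib.request
from typing import Callable


def job_url(base_url: str, record_id: str) -> str:
    """Build the job-status URL from the documented path."""
    base = base_url.rstrip("/")
    return f"{base}/v1/speech-to-text/jobs/{record_id}?job_kind=diarization"


def fetch_job(url: str) -> dict:
    """GET the job record and decode it as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_until_ready(
    url: str,
    fetch: Callable[[str], dict] = fetch_job,
    interval_secs: float = 2.0,
    max_attempts: int = 150,
) -> dict:
    """Poll until processing_status is "ready" or the attempts run out."""
    for attempt in range(max_attempts):
        job = fetch(url)
        # Assumption: processing_status is a top-level key of the job record.
        if job.get("processing_status") == "ready":
            return job
        if attempt < max_attempts - 1:
            time.sleep(interval_secs)
    raise TimeoutError(f"job not ready after {max_attempts} attempts")
```

A call might look like `wait_until_ready(job_url("http://localhost:8080", record_id))`, where the host and port are placeholders for your own server address. Passing `fetch` as a parameter keeps the loop easy to test without a running server.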
Configuration
Number of Speakers
If you know how many speakers are in the audio, specify it with min_speakers and max_speakers for better accuracy.
Speaker Labels
By default, speakers are labeled “Speaker 1”, “Speaker 2”, etc. You can rename them in the UI after processing.
Tips for Best Results
- Quality audio — Clear recordings with minimal background noise
- Distinct voices — Works best when speakers have different voice characteristics
- Minimal overlap — Speakers talking over each other reduces accuracy
- Specify speaker count — If known, helps the algorithm
- Longer segments — Short utterances are harder to attribute
Limitations
- Similar voices — May confuse speakers with very similar voices
- Overlapping speech — Simultaneous talking is challenging
- Background noise — Reduces speaker detection accuracy
- Very short clips — Need enough audio to identify speaker patterns
Use Cases
Meeting Minutes
Upload a meeting recording to get a transcript with speaker attribution:
- Record your meeting
- Upload to Diarization
- Export the speaker-labeled transcript
- Edit speaker names as needed
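If you work with the job result in code rather than the UI, renaming speakers and producing a readable transcript might look like the sketch below. The segment keys (`speaker`, `start`, `text`) are hypothetical stand-ins, so check the API Reference for the exact response shape before relying on them.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as MM:SS."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def render_transcript(segments, names=None):
    """Render segments as text lines, replacing default labels where a name is given."""
    names = names or {}
    lines = []
    for seg in segments:
        # Fall back to the default label when no replacement name exists.
        speaker = names.get(seg["speaker"], seg["speaker"])
        lines.append(f"[{format_timestamp(seg['start'])}] {speaker}: {seg['text']}")
    return "\n".join(lines)


# Hypothetical segment shape, for illustration only.
segments = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 3.5, "text": "Let's get started."},
    {"speaker": "Speaker 2", "start": 4.2, "end": 6.0, "text": "Sounds good."},
]
transcript = render_transcript(segments, names={"Speaker 1": "Alice"})
print(transcript)
```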
Interview Transcription
Perfect for journalist interviews or research:
- Record the interview
- Process with diarization
- Get clean Q&A format output
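A Q&A layout is usually easier to read when consecutive segments from the same speaker are merged into a single turn. The following is a minimal sketch of that merge. It assumes hypothetical segment keys (`speaker`, `start`, `end`, `text`) and segments already sorted by start time, and it does not describe how the product builds its own output.

```python
def merge_turns(segments):
    """Merge consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker continues: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            # New speaker: start a new turn, copying so the input stays untouched.
            turns.append(dict(seg))
    return turns


# Hypothetical segment shape, for illustration only.
interview = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 2.0, "text": "Where did you grow up?"},
    {"speaker": "Speaker 2", "start": 2.5, "end": 4.0, "text": "In Lisbon."},
    {"speaker": "Speaker 2", "start": 4.1, "end": 6.0, "text": "We moved later."},
]
turns = merge_turns(interview)
for turn in turns:
    print(f"{turn['speaker']}: {turn['text']}")
```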
Podcast Production
Identify speakers for editing and show notes:
- Upload raw podcast audio
- See who spoke when
- Use timestamps for editing
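Segment timestamps can also be summarized for show notes, for example as total talk time per speaker. This sketch assumes hypothetical `speaker`, `start`, and `end` keys measured in seconds. The real field names are in the API Reference.

```python
from collections import defaultdict


def talk_time(segments):
    """Sum segment durations in seconds for each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# Hypothetical segment shape, for illustration only.
episode = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.0},
    {"speaker": "Speaker 2", "start": 4.0, "end": 10.0},
    {"speaker": "Speaker 1", "start": 10.0, "end": 12.5},
]
totals = talk_time(episode)
print(totals)
```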
See Also
- Transcription — Single-speaker transcription
- Speaker Attributed ASR — Granite speaker-turn transcripts
- Voice Mode — Real-time conversations
- CLI Reference — Command documentation