Overview
Izwi’s transcription feature converts spoken audio into written text. Capabilities include:
- High accuracy — State-of-the-art speech recognition
- Multiple formats — Support for WAV, MP3, M4A, FLAC, and more
- Language detection — Automatic language identification
- Timestamps — Optional word-level timing through a forced aligner
- Summaries — Optional AI summaries for persisted speech-text jobs
- Local processing — Complete privacy, no cloud
Getting Started
Download an ASR Model
Transcribe Audio
Using the CLI
Basic Usage
Options
| Option | Description | Default |
|---|---|---|
| --model, -m | ASR model to use | parakeet-tdt-0.6b-v3 |
| --language, -l | Language hint | auto-detect |
| --format, -f | Output format | text |
| --output, -o | Output file | stdout |
| --word-timestamps | Include word timing | — |
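When scripting batch runs, the options above can be assembled programmatically. The following Python sketch builds an argument list from the flags in the table. The executable name `izwi`, the `transcribe` subcommand, and the position of the audio path are assumptions made for illustration; only the flags themselves come from the table. See the CLI Reference for the exact invocation.

```python
from typing import List, Optional, Tuple


def build_transcribe_args(
    audio_path: str,
    model: Optional[str] = None,
    language: Optional[str] = None,
    output_format: Optional[str] = None,
    output_path: Optional[str] = None,
    word_timestamps: bool = False,
    command: Tuple[str, ...] = ("izwi", "transcribe"),  # assumed executable and subcommand
) -> List[str]:
    """Return an argv list using the flags from the Options table.

    Options left as None are omitted so the CLI's own defaults apply.
    """
    args = list(command) + [audio_path]
    if model is not None:
        args += ["--model", model]
    if language is not None:
        args += ["--language", language]
    if output_format is not None:
        args += ["--format", output_format]
    if output_path is not None:
        args += ["--output", output_path]
    if word_timestamps:
        args.append("--word-timestamps")
    return args


# Build (but do not run) a command with a language hint and an output file.
example_args = build_transcribe_args(
    "meeting.wav", language="en", output_path="meeting.txt"
)
print(example_args)
```

The resulting list can be passed to `subprocess.run` once the command name has been confirmed against your installation.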
Examples
Basic transcription:
Using the Web UI
- Navigate to Transcription in the sidebar
- Choose a mode
- Upload an audio file or record directly
- Select the required model
- Click Transcribe or submit the selected workflow
- View, copy, summarize, or download the result
| Mode | Best for | Notes |
|---|---|---|
| Transcription | Single-speaker or general ASR | Streaming is enabled by default. Word timestamps require a ready Qwen3-ForcedAligner-0.6B aligner and disable streaming. |
| Speaker Attributed ASR | Granite Speech speaker-turn transcripts | Requires Granite-Speech-4.1-2B-Plus, accepts speaker expectation hints, and disables streaming/timestamps. |
| Diarization | Speaker-separated timelines with start/end times | Uses the diarization pipeline and optional ASR/aligner models. |
Features
- Drag and drop — Upload files easily
- Record — Transcribe directly from microphone
- Mode switch — Choose Transcription, Speaker Attributed ASR, or Diarization
- Summaries — Generate or regenerate AI summaries for saved records
- Copy — One-click copy to clipboard
- Download — Save as text or JSON
Using the API
Endpoint
Request (multipart/form-data)
| Field | Type | Description |
|---|---|---|
| file | File | Audio file to transcribe |
| model | String | Model name |
| language | String | Language code (optional) |
| response_format | String | text, json, verbose_json, srt, or vtt |
| stream | Boolean/String | Enable SSE transcript events (true, 1, yes, or on) |
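A client can validate the text fields before sending the request. The sketch below builds the non-file form fields from the table above and rejects a response_format value the table does not list. The helper name `build_form_fields` is hypothetical, and omitting unset fields so that server defaults apply is an assumption; the audio itself would be attached separately as the file part.

```python
from typing import Dict, Optional

# Values the request table lists for response_format.
RESPONSE_FORMATS = {"text", "json", "verbose_json", "srt", "vtt"}


def build_form_fields(
    model: str,
    language: Optional[str] = None,
    response_format: Optional[str] = None,
    stream: bool = False,
) -> Dict[str, str]:
    """Return the text fields of the multipart form (hypothetical helper).

    The `file` field is not included here; it is sent as the file part.
    """
    if response_format is not None and response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    fields = {"model": model}
    if language:
        fields["language"] = language
    if response_format is not None:
        fields["response_format"] = response_format
    if stream:
        # "true" is one of the accepted values (true, 1, yes, or on).
        fields["stream"] = "true"
    return fields


print(build_form_fields("Qwen3-ASR-0.6B-GGUF", language="en", response_format="verbose_json"))
```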
Example (curl)
Response (JSON)
Response (verbose_json)
When streaming is enabled, the response is a sequence of SSE events with type values such as transcript.text.delta, transcript.text.done, and error.
See the API Reference for JSON input,
streaming events, upload limits, and exact response shapes.
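A streaming client usually accumulates delta events until the done event arrives. The following sketch shows that pattern over a list of SSE lines. The payload shape is an assumption, not something this page confirms: it supposes each event is a single `data:` line holding a JSON object with a `type` field, with text in a `delta` field for delta events, an optional `text` field on the done event, and a `message` field on errors. Check the API Reference for the exact shapes before relying on it.

```python
import json
from typing import Iterable


def collect_transcript(sse_lines: Iterable[str]) -> str:
    """Accumulate transcript text from SSE lines (assumed payload shape)."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "transcript.text.delta":
            parts.append(event.get("delta", ""))  # assumed field name
        elif kind == "transcript.text.done":
            # Prefer a final text if the event carries one (assumed field name).
            return event.get("text") or "".join(parts)
        elif kind == "error":
            raise RuntimeError(event.get("message", "transcription failed"))
    return "".join(parts)


sample_events = [
    'data: {"type": "transcript.text.delta", "delta": "Hello "}',
    "",
    'data: {"type": "transcript.text.delta", "delta": "world"}',
    "",
    'data: {"type": "transcript.text.done"}',
]
print(collect_transcript(sample_events))
```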
Persisted Speech-Text Jobs
The web UI uses /v1/speech-to-text/jobs for saved transcription records,
Speaker Attributed ASR records, and diarization jobs:
Set job_kind=speaker_attributed_asr or job_kind=saa for Granite speaker-turn
transcripts, and job_kind=diarization for speaker timelines.
Supported Audio Formats
| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Best quality, recommended |
| MP3 | .mp3 | Widely compatible |
| M4A | .m4a | Apple format |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open format |
| WebM | .webm | Web recordings |
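Checking a file's extension before uploading can catch unsupported inputs early. This sketch tests a path against the extensions in the table above. The helper name is hypothetical, and since the overview mentions support for further formats, the set may not be exhaustive.

```python
from pathlib import Path

# Extensions listed in the Supported Audio Formats table.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension appears in the table (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ["interview.WAV", "notes.flac", "slides.pdf"]:
    print(name, is_supported_audio(name))
```

The check looks only at the file name; it does not inspect the audio data itself.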
Available Models
| Model | Size | Accuracy | Speed |
|---|---|---|---|
| Parakeet-TDT-0.6B-v3 | 9.4 GB | Strong baseline | Medium |
| Whisper-Large-v3-Turbo | 1.5 GB | Strong multilingual baseline | Medium |
| Qwen3-ASR-0.6B-GGUF | 1.0 GB | Good | Fast |
| Qwen3-ASR-1.7B-GGUF | 2.5 GB | Better | Medium |
| VibeVoice-ASR | 16.2 GB | Long-form Microsoft ASR checkpoint | Medium |
| Nemotron-3.5-ASR-Streaming-0.6B | 2.37 GB | 40-locale NVIDIA FastConformer-RNNT; native offline transcription with prompt-conditioned language control | Medium |
| Granite-Speech-4.1-2B-Plus | 4.2 GB | IBM rich transcription model with prompt guidance, speaker-attributed output, and word timestamps | Medium |
| LFM2.5-Audio-1.5B-GGUF | 1.2 GB | Good integrated speech model | Medium |
| Voxtral-Mini-4B-Realtime-2602 | 8 GB | Rust/Candle offline transcription; realtime planned | Medium |
Larger models tend to cope better with difficult audio, such as:
- Noisy audio
- Accented speech
- Technical vocabulary
Output Formats
Text
Plain text transcript:
JSON
Verbose JSON
Includes language, duration, processing-time, realtime-factor, and optional runtime diagnostics. Word-level timestamps are not currently returned by this endpoint.
Tips for Best Results
- Use quality audio — Clear recordings transcribe better
- Minimize noise — Background noise reduces accuracy
- Proper format — WAV files work best
- Right model size — Larger models for difficult audio
- Language hints — Specify language if known
See Also
- Diarization — Identify multiple speakers
- Speaker Attributed ASR — Granite speaker-turn transcripts
- Voice Mode — Real-time transcription
- CLI Reference — Full command documentation