Convert audio to text with high accuracy using automatic speech recognition (ASR).

Overview

Izwi’s transcription feature converts spoken audio into written text. Capabilities include:
  • High accuracy — State-of-the-art speech recognition
  • Multiple formats — Support for WAV, MP3, M4A, FLAC, and more
  • Language detection — Automatic language identification
  • Timestamps — Optional word-level timing through a forced aligner
  • Summaries — Optional AI summaries for persisted speech-text jobs
  • Local processing — Complete privacy, no cloud

Getting Started

Download an ASR Model

izwi pull Parakeet-TDT-0.6B-v3

Transcribe Audio

izwi transcribe audio.wav

Using the CLI

Basic Usage

izwi transcribe <audio-file>

Options

| Option | Description | Default |
| --- | --- | --- |
| --model, -m | ASR model to use | parakeet-tdt-0.6b-v3 |
| --language, -l | Language hint | auto-detect |
| --format, -f | Output format | text |
| --output, -o | Output file | stdout |
| --word-timestamps | Include word timing | |

Examples

Basic transcription:
izwi transcribe meeting.wav
Save to file:
izwi transcribe meeting.wav --output transcript.txt
JSON output with metadata:
izwi transcribe meeting.wav --format json --output transcript.json
With word timestamps:
izwi transcribe meeting.wav --format verbose_json --word-timestamps
Specify language:
izwi transcribe audio.wav --language en
izwi transcribe audio.wav --language es
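
The commands above can also be scripted. The following is a minimal sketch, assuming `izwi` is on your PATH and exits with a non-zero status on failure (an assumption, not something this page confirms). The helper names `build_transcribe_command` and `transcribe_directory` are hypothetical; only the flags listed in the Options table are used.

```python
import subprocess
from pathlib import Path


def build_transcribe_command(audio, output=None, fmt="text", language=None,
                             model=None, word_timestamps=False):
    """Build an `izwi transcribe` argument list from the documented options."""
    cmd = ["izwi", "transcribe", str(audio)]
    if model:
        cmd += ["--model", model]
    if language:
        cmd += ["--language", language]
    if fmt != "text":  # "text" is the documented default, so it can be omitted
        cmd += ["--format", fmt]
    if word_timestamps:
        cmd.append("--word-timestamps")
    if output:
        cmd += ["--output", str(output)]
    return cmd


def transcribe_directory(folder, pattern="*.wav", **options):
    """Run one `izwi transcribe` call per matching file.

    Each transcript is written next to its audio file with a .txt suffix
    (a naming choice made for this sketch).
    """
    for audio in sorted(Path(folder).glob(pattern)):
        cmd = build_transcribe_command(
            audio, output=audio.with_suffix(".txt"), **options
        )
        # Assumption: izwi signals failure through its exit status.
        subprocess.run(cmd, check=True)
```

For example, `transcribe_directory("recordings", language="en")` would transcribe every WAV file in a `recordings` folder with an English language hint.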

Using the Web UI

  1. Navigate to Transcription in the sidebar
  2. Choose a mode
  3. Upload an audio file or record directly
  4. Select the required model
  5. Click Transcribe or submit the selected workflow
  6. View, copy, summarize, or download the result

The Transcription workspace has three modes:

| Mode | Best for | Notes |
| --- | --- | --- |
| Transcription | Single-speaker or general ASR | Streaming is enabled by default. Word timestamps require a ready Qwen3-ForcedAligner-0.6B aligner and disable streaming. |
| Speaker Attributed ASR | Granite Speech speaker-turn transcripts | Requires Granite-Speech-4.1-2B-Plus, accepts speaker expectation hints, and disables streaming/timestamps. |
| Diarization | Speaker-separated timelines with start/end times | Uses the diarization pipeline and optional ASR/aligner models. |

Features

  • Drag and drop — Upload files easily
  • Record — Transcribe directly from microphone
  • Mode switch — Choose Transcription, Speaker Attributed ASR, or Diarization
  • Summaries — Generate or regenerate AI summaries for saved records
  • Copy — One-click copy to clipboard
  • Download — Save as text or JSON

Using the API

Endpoint

POST /v1/audio/transcriptions

Request (multipart/form-data)

| Field | Type | Description |
| --- | --- | --- |
| file | File | Audio file to transcribe |
| model | String | Model name |
| language | String | Language code (optional) |
| response_format | String | text, json, verbose_json, srt, or vtt |
| stream | Boolean/String | Enable SSE transcript events (true, 1, yes, or on) |

Example (curl)

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=Parakeet-TDT-0.6B-v3" \
  -F "response_format=json"
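
The same request can be sent from Python using only the standard library. This is a sketch rather than an official client: the helper names `encode_multipart` and `transcribe` are hypothetical, and it assumes the server is reachable at the address used in the curl example and returns a JSON body with a `text` field when `response_format` is `json`.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: {mime}\r\n\r\n'.encode()
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, model="Parakeet-TDT-0.6B-v3", base_url="http://localhost:8080"):
    """POST an audio file to /v1/audio/transcriptions and return the transcript."""
    path = Path(path)
    body, content_type = encode_multipart(
        {"model": model, "response_format": "json"},
        "file", path.name, path.read_bytes(),
    )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

Calling `transcribe("audio.wav")` mirrors the curl command above; upload limits and error responses are not handled here.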

Response (JSON)

{
  "text": "Hello, this is a transcription test."
}

Response (verbose_json)

{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5,
  "processing_time_ms": 812.4,
  "rtf": 0.23,
  "izwi_asr_diagnostics": null
}
Streaming responses emit SSE payloads with type values such as transcript.text.delta, transcript.text.done, and error. See the API Reference for JSON input, streaming events, upload limits, and exact response shapes.
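
As an illustration of consuming the streaming events, the sketch below assembles a transcript from SSE lines. Only the `type` values come from this page. The payload fields `delta`, `text`, and `message` are assumptions modeled on common SSE transcription APIs, and the function name `collect_transcript` is hypothetical; check the API Reference for the exact shapes.

```python
import json


def collect_transcript(sse_lines):
    """Assemble a transcript from an iterable of SSE text lines.

    Assumed payload shapes (not confirmed by this page):
      {"type": "transcript.text.delta", "delta": "..."}
      {"type": "transcript.text.done",  "text": "..."}
      {"type": "error", "message": "..."}
    """
    pieces = []
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        # Some SSE servers end with a "[DONE]" sentinel; skipping it is
        # harmless if the server never sends one.
        if not payload or payload == "[DONE]":
            continue
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "transcript.text.delta":
            pieces.append(event.get("delta", ""))
        elif kind == "transcript.text.done":
            # Prefer the final text if present, else join the deltas seen so far.
            return event.get("text") or "".join(pieces)
        elif kind == "error":
            raise RuntimeError(event.get("message", "transcription failed"))
    return "".join(pieces)
```

The function accepts any iterable of decoded lines, so it can be fed from an HTTP response read line by line or from a recorded event log.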

Persisted Speech-Text Jobs

The web UI uses /v1/speech-to-text/jobs for saved transcription records, Speaker Attributed ASR records, and diarization jobs:
curl -X POST "http://localhost:8080/v1/speech-to-text/jobs?job_kind=transcription" \
  -F "file=@meeting.wav" \
  -F "model_id=Parakeet-TDT-0.6B-v3" \
  -F "generate_summary=true"
Use job_kind=speaker_attributed_asr or job_kind=saa for Granite speaker-turn transcripts, and job_kind=diarization for speaker timelines.
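
When building job requests in code, it can help to validate the job kind before sending anything. A small sketch, using only the `job_kind` values and form fields shown above; the helper names `job_url` and `job_fields` are hypothetical.

```python
from urllib.parse import urlencode

# Job kinds named on this page; "saa" is the short alias for speaker_attributed_asr.
JOB_KINDS = {"transcription", "speaker_attributed_asr", "saa", "diarization"}


def job_url(job_kind, base_url="http://localhost:8080"):
    """Return the /v1/speech-to-text/jobs URL for a documented job kind."""
    if job_kind not in JOB_KINDS:
        raise ValueError(f"unknown job_kind: {job_kind!r}")
    return f"{base_url}/v1/speech-to-text/jobs?{urlencode({'job_kind': job_kind})}"


def job_fields(model_id, generate_summary=False):
    """Return the text form fields used in the curl example above."""
    fields = {"model_id": model_id}
    if generate_summary:
        fields["generate_summary"] = "true"
    return fields
```

The audio file itself still has to be attached as the `file` part of a multipart/form-data body, as in the curl example.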

Supported Audio Formats

| Format | Extension | Notes |
| --- | --- | --- |
| WAV | .wav | Best quality, recommended |
| MP3 | .mp3 | Widely compatible |
| M4A | .m4a | Apple format |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open format |
| WebM | .webm | Web recordings |
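
A client can check the file extension against this table before uploading. The sketch below is a convenience filter only: the function name `is_listed_format` is hypothetical, and since the overview mentions support for "and more" formats, a file that fails this check may still be accepted.

```python
from pathlib import Path

# Extensions from the Supported Audio Formats table.
LISTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_listed_format(path):
    """Return True if the file extension appears in the table (case-insensitive)."""
    return Path(path).suffix.lower() in LISTED_EXTENSIONS
```

This looks only at the file name, not at the audio data, so a mislabeled file will still pass.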

Available Models

| Model | Size | Accuracy | Speed |
| --- | --- | --- | --- |
| Parakeet-TDT-0.6B-v3 | 9.4 GB | Strong baseline | Medium |
| Whisper-Large-v3-Turbo | 1.5 GB | Strong multilingual baseline | Medium |
| Qwen3-ASR-0.6B-GGUF | 1.0 GB | Good | Fast |
| Qwen3-ASR-1.7B-GGUF | 2.5 GB | Better | Medium |
| VibeVoice-ASR | 16.2 GB | Long-form Microsoft ASR checkpoint | Medium |
| Nemotron-3.5-ASR-Streaming-0.6B | 2.37 GB | 40-locale NVIDIA FastConformer-RNNT; native offline transcription with prompt-conditioned language control | Medium |
| Granite-Speech-4.1-2B-Plus | 4.2 GB | IBM rich transcription model with prompt guidance, speaker-attributed output, and word timestamps | Medium |
| LFM2.5-Audio-1.5B-GGUF | 1.2 GB | Good integrated speech model | Medium |
| Voxtral-Mini-4B-Realtime-2602 | 8 GB | Rust/Candle offline transcription; realtime planned | Medium |

Use larger models for:
  • Noisy audio
  • Accented speech
  • Technical vocabulary
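
When disk space is the constraint, the listed sizes can be used to shortlist models. The sketch below copies the sizes from the Available Models table; the function name `models_within` is hypothetical, and the sizes say nothing about memory use at runtime.

```python
# Listed sizes in GB, copied from the Available Models table.
MODEL_SIZES_GB = {
    "Parakeet-TDT-0.6B-v3": 9.4,
    "Whisper-Large-v3-Turbo": 1.5,
    "Qwen3-ASR-0.6B-GGUF": 1.0,
    "Qwen3-ASR-1.7B-GGUF": 2.5,
    "VibeVoice-ASR": 16.2,
    "Nemotron-3.5-ASR-Streaming-0.6B": 2.37,
    "Granite-Speech-4.1-2B-Plus": 4.2,
    "LFM2.5-Audio-1.5B-GGUF": 1.2,
    "Voxtral-Mini-4B-Realtime-2602": 8.0,
}


def models_within(budget_gb):
    """Return model names whose listed size fits the budget, smallest first."""
    fits = [(size, name) for name, size in MODEL_SIZES_GB.items() if size <= budget_gb]
    return [name for size, name in sorted(fits)]
```

For instance, `models_within(2.5)` returns the four models listed at 2.5 GB or less, ordered by size.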

Output Formats

Text

Plain text transcript:
Hello, this is a transcription test.

JSON

{
  "text": "Hello, this is a transcription test."
}

Verbose JSON

Adds the language, audio duration, processing time, real-time factor (rtf), and optional runtime diagnostics to the transcript text. Word-level timestamps are not currently returned by this endpoint.
{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5,
  "processing_time_ms": 812.4,
  "rtf": 0.23,
  "izwi_asr_diagnostics": null
}
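
The sample values above are consistent with the usual definition of real-time factor: processing time divided by audio duration. The sketch below recomputes it under the assumption that `duration` is in seconds and that `rtf` follows that definition, which this page does not state explicitly. The function name `realtime_factor` is hypothetical.

```python
import json

SAMPLE = json.loads("""
{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5,
  "processing_time_ms": 812.4,
  "rtf": 0.23,
  "izwi_asr_diagnostics": null
}
""")


def realtime_factor(result):
    """Processing time in seconds divided by audio duration in seconds.

    Assumes `duration` is expressed in seconds. Values below 1.0 mean the
    audio was transcribed faster than real time.
    """
    processing_seconds = result["processing_time_ms"] / 1000.0
    return processing_seconds / result["duration"]
```

For the sample, `realtime_factor(SAMPLE)` is about 0.23, matching the `rtf` field: the 3.5 second clip took roughly 0.8 seconds to process.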

Tips for Best Results

  1. Use quality audio — Clear recordings transcribe better
  2. Minimize noise — Background noise reduces accuracy
  3. Proper format — WAV files work best
  4. Right model size — Larger models for difficult audio
  5. Language hints — Specify language if known

See Also