Convert audio to text.

Synopsis

izwi transcribe <FILE> [OPTIONS]

Description

Transcribes audio files to text using automatic speech recognition (ASR). Supports multiple audio formats and output options.

Arguments

Argument    Description
<FILE>      Audio file to transcribe

Options

Option                 Description                                       Default
-m, --model <MODEL>    ASR model to use                                  parakeet-tdt-0.6b-v3
-l, --language <LANG>  Language hint (e.g., en, es)                      Auto-detect
--prompt <TEXT>        Initial ASR prompt or keyword guidance
--max-tokens <N>       Maximum number of ASR decoder tokens to generate  Model default
-f, --format <FORMAT>  Output format: text, json, verbose_json           text
-o, --output <PATH>    Output file (default: stdout)
--word-timestamps      Request word-level timestamps; automatically uses verbose_json response format
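
The options above can be combined in a single invocation. As a minimal sketch for scripting the command from Python, the helper below builds an argument list using only the flags documented in this table. The names build_transcribe_command and run_transcribe are hypothetical and not part of izwi, and the sketch assumes the izwi binary is on PATH.

```python
import subprocess
from typing import List, Optional


def build_transcribe_command(
    audio_file: str,
    model: Optional[str] = None,
    language: Optional[str] = None,
    output_format: Optional[str] = None,
    output_path: Optional[str] = None,
    word_timestamps: bool = False,
) -> List[str]:
    """Build an `izwi transcribe` argument list (hypothetical helper)."""
    cmd = ["izwi", "transcribe", audio_file]
    if model is not None:
        cmd += ["--model", model]
    if language is not None:
        cmd += ["--language", language]
    if output_format is not None:
        # Documented values: text, json, verbose_json
        cmd += ["--format", output_format]
    if output_path is not None:
        cmd += ["--output", output_path]
    if word_timestamps:
        cmd.append("--word-timestamps")
    return cmd


def run_transcribe(cmd: List[str]) -> str:
    """Run the command and return its stdout. Assumes `izwi` is on PATH."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Building the list does not run anything; pass it to run_transcribe() to execute.
command = build_transcribe_command("audio.wav", language="en", output_format="json")
print(" ".join(command))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.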

Examples

Basic transcription

izwi transcribe audio.wav

Save to file

izwi transcribe audio.wav --output transcript.txt

JSON output

izwi transcribe audio.wav --format json

Prompt-guided Granite Speech transcription

izwi transcribe meeting.wav \
  --model Granite-Speech-4.1-2B-Plus \
  --prompt "keywords: Izwi, Granite Speech" \
  --max-tokens 256 \
  --format verbose_json

Word timestamps

izwi transcribe audio.wav \
  --model Granite-Speech-4.1-2B-Plus \
  --word-timestamps

With timestamps

izwi transcribe audio.wav --format verbose_json --word-timestamps

Specify language

izwi transcribe audio.wav --language en
izwi transcribe audio.wav --language es

Use a larger model

izwi transcribe audio.wav --model Qwen3-ASR-1.7B-GGUF

Use Voxtral

izwi transcribe audio.wav --model Voxtral-Mini-4B-Realtime-2602 --format verbose_json

Output Formats

Text

Plain text transcript:
Hello, this is a transcription test.

JSON

{
  "text": "Hello, this is a transcription test."
}

Verbose JSON

{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5,
  "processing_time_ms": 812.4,
  "rtf": 0.23,
  "izwi_asr_diagnostics": null
}

--word-timestamps requests verbose_json output; the response includes a words array when the selected model or a forced aligner can provide word timing metadata.
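
As a minimal sketch of consuming this output, the Python below parses the verbose JSON sample shown above. It assumes, without confirmation from this page, that duration is in seconds and that rtf is the real-time factor (processing time divided by audio duration); the sample values are consistent with that reading. The function name summarize_transcript is hypothetical, and because the shape of the entries in words is not specified here, the sketch only counts them.

```python
import json

# The verbose_json sample from this page, reproduced as a string.
SAMPLE = """
{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5,
  "processing_time_ms": 812.4,
  "rtf": 0.23,
  "izwi_asr_diagnostics": null
}
"""


def summarize_transcript(raw: str) -> dict:
    """Extract a few fields from verbose_json output (hypothetical helper)."""
    payload = json.loads(raw)
    # Assumption: duration is in seconds and rtf = processing time / duration.
    computed_rtf = (payload["processing_time_ms"] / 1000.0) / payload["duration"]
    # "words" may be absent when no word timing metadata is available.
    words = payload.get("words") or []
    return {
        "text": payload["text"],
        "language": payload.get("language"),
        "computed_rtf": round(computed_rtf, 2),
        "word_count": len(words),
    }


summary = summarize_transcript(SAMPLE)
print(summary["text"])
print(summary["computed_rtf"])
```

For the sample, 0.8124 s of processing over 3.5 s of audio rounds to 0.23, matching the reported rtf field.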

Supported Audio Formats

  • WAV (.wav)
  • MP3 (.mp3)
  • M4A (.m4a)
  • FLAC (.flac)
  • OGG (.ogg)
  • WebM (.webm)
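
When batch-processing a directory, it can help to filter files by extension before calling the CLI. The sketch below is a hypothetical helper based only on the extensions listed above. It assumes case-insensitive matching on the file suffix, which this page does not confirm, and it does not inspect file contents.

```python
from pathlib import Path
from typing import Iterable, List

# Extensions taken from the list of supported audio formats on this page.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file suffix is in the documented list.

    Assumption: suffix matching is case-insensitive.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def filter_supported(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose suffix is in the documented list."""
    return [p for p in paths if is_supported_audio(p)]


print(filter_supported(["talk.WAV", "notes.txt", "clip.webm"]))
```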

Models

Model                            Size     Speed   Accuracy / Notes
Parakeet-TDT-0.6B-v3             9.4 GB   Medium  Strong baseline (default)
Whisper-Large-v3-Turbo           1.5 GB   Medium  Strong multilingual baseline
Qwen3-ASR-0.6B-GGUF              1.0 GB   Fast    Good
Qwen3-ASR-1.7B-GGUF              2.5 GB   Medium  Better
VibeVoice-ASR                    16.2 GB  Medium  Microsoft long-form ASR checkpoint
Nemotron-3.5-ASR-Streaming-0.6B  2.37 GB  Medium  40-locale NVIDIA FastConformer-RNNT; native offline transcription with prompt-conditioned language control
Granite-Speech-4.1-2B-Plus       4.2 GB   Medium  IBM rich transcription model with prompt guidance and word timestamps
LFM2.5-Audio-1.5B-GGUF           1.2 GB   Medium  Unified audio model with transcription support
Voxtral-Mini-4B-Realtime-2602    8 GB     Medium  Rust/Candle offline transcription; realtime planned

See Also