Synopsis
Description
Transcribes audio files to text using automatic speech recognition (ASR). Supports multiple audio formats and output options.
Arguments
| Argument | Description |
|---|---|
| `<FILE>` | Audio file to transcribe |
Options
| Option | Description | Default |
|---|---|---|
| `-m, --model <MODEL>` | ASR model to use | `parakeet-tdt-0.6b-v3` |
| `-l, --language <LANG>` | Language hint (e.g., `en`, `es`) | Auto-detect |
| `--prompt <TEXT>` | Initial ASR prompt or keyword guidance | None |
| `--max-tokens <N>` | Maximum number of ASR decoder tokens to generate | Model default |
| `-f, --format <FORMAT>` | Output format: `text`, `json`, or `verbose_json` | `text` |
| `-o, --output <PATH>` | Output file | stdout |
| `--word-timestamps` | Request word-level timestamps; automatically uses the `verbose_json` response format | Off |
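The options above can be assembled programmatically when the command is driven from a script. The following is a minimal sketch, not part of the tool itself: `build_transcribe_args` is a hypothetical helper, and because this section does not show the executable name, it returns only the flags and the `<FILE>` argument. It also assumes options may precede the positional argument, which is common for CLIs but not confirmed here.

```python
from typing import List, Optional

# Output formats listed in the options table.
ALLOWED_FORMATS = {"text", "json", "verbose_json"}


def build_transcribe_args(
    audio_file: str,
    model: Optional[str] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    max_tokens: Optional[int] = None,
    output_format: Optional[str] = None,
    output_path: Optional[str] = None,
    word_timestamps: bool = False,
) -> List[str]:
    """Return the flag list for one run (hypothetical helper, not the tool's API)."""
    if output_format is not None and output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown format: {output_format!r}")

    args: List[str] = []
    if model is not None:
        args += ["--model", model]
    if language is not None:
        args += ["--language", language]
    if prompt is not None:
        args += ["--prompt", prompt]
    if max_tokens is not None:
        args += ["--max-tokens", str(max_tokens)]
    if output_format is not None:
        args += ["--format", output_format]
    if output_path is not None:
        args += ["--output", output_path]
    if word_timestamps:
        # Per the table, this flag selects verbose_json on its own.
        args.append("--word-timestamps")
    # Assumption: the positional <FILE> argument can come last.
    args.append(audio_file)
    return args
```

The caller would prepend the actual executable and subcommand before passing the list to something like `subprocess.run`. Omitted options are left out entirely so the tool's own defaults apply.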
Examples
Basic transcription
Save to file
JSON output
Prompt-guided Granite Speech transcription
Word timestamps
With timestamps
Specify language
Use larger model
Use Voxtral
Output Formats
Text
Plain text transcript.
JSON
Verbose JSON
--word-timestamps requests verbose_json output with a words array when the
selected model or forced aligner can provide word timing metadata.
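Because the `words` array is only present when timing metadata is available, a consumer should treat it as optional. The sketch below shows one way to read it. The field names `word`, `start`, and `end` are an assumption modeled on the common OpenAI-style `verbose_json` layout; this section does not document the actual schema, so check real output before relying on them. `extract_word_timings` is a hypothetical helper.

```python
import json
from typing import List, Tuple


def extract_word_timings(verbose_json_text: str) -> List[Tuple[str, float, float]]:
    """Return (word, start_seconds, end_seconds) tuples, or [] if no timings exist.

    Assumes each entry looks like {"word": "...", "start": 0.0, "end": 0.4};
    these key names are not confirmed by the documentation.
    """
    payload = json.loads(verbose_json_text)
    # The array may be missing when the model cannot provide word timing.
    words = payload.get("words") or []
    return [
        (entry["word"], float(entry["start"]), float(entry["end"]))
        for entry in words
    ]
```

Returning an empty list instead of raising keeps the caller's code path the same for models with and without word timing support.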
Supported Audio Formats
- WAV (`.wav`)
- MP3 (`.mp3`)
- M4A (`.m4a`)
- FLAC (`.flac`)
- OGG (`.ogg`)
- WebM (`.webm`)
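When transcribing a batch of files, it can help to filter out unsupported inputs before invoking the tool. The following is a small sketch built from the list above. `is_supported_audio` is a hypothetical helper, and matching on the file extension (case-insensitively) is an assumption: the tool may instead inspect file contents.

```python
from pathlib import Path

# Extensions taken from the supported formats list.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension is in the documented format list."""
    # Lowercase the suffix so "TALK.WAV" is treated like "talk.wav" (assumption).
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `[p for p in paths if is_supported_audio(p)]` keeps only candidates whose extension appears in the list.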
Models
| Model | Size | Speed | Accuracy / notes |
|---|---|---|---|
| Parakeet-TDT-0.6B-v3 | 9.4 GB | Medium | Strong baseline (default) |
| Whisper-Large-v3-Turbo | 1.5 GB | Medium | Strong multilingual baseline |
| Qwen3-ASR-0.6B-GGUF | 1.0 GB | Fast | Good |
| Qwen3-ASR-1.7B-GGUF | 2.5 GB | Medium | Better |
| VibeVoice-ASR | 16.2 GB | Medium | Microsoft long-form ASR checkpoint |
| Nemotron-3.5-ASR-Streaming-0.6B | 2.37 GB | Medium | 40-locale NVIDIA FastConformer-RNNT; native offline transcription with prompt-conditioned language control |
| Granite-Speech-4.1-2B-Plus | 4.2 GB | Medium | IBM rich transcription model with prompt guidance and word timestamps |
| LFM2.5-Audio-1.5B-GGUF | 1.2 GB | Medium | Unified audio model with transcription support |
| Voxtral-Mini-4B-Realtime-2602 | 8 GB | Medium | Rust/Candle offline transcription; realtime planned |
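On machines with limited disk space, the sizes in this table can guide the choice of model. The following is an illustrative sketch that hard-codes the names and sizes exactly as listed above. `models_within` is a hypothetical helper, and whether `--model` accepts these names in this capitalization is not stated here.

```python
from typing import Dict, List

# Download sizes in GB, copied from the models table.
MODEL_SIZES_GB: Dict[str, float] = {
    "Parakeet-TDT-0.6B-v3": 9.4,
    "Whisper-Large-v3-Turbo": 1.5,
    "Qwen3-ASR-0.6B-GGUF": 1.0,
    "Qwen3-ASR-1.7B-GGUF": 2.5,
    "VibeVoice-ASR": 16.2,
    "Nemotron-3.5-ASR-Streaming-0.6B": 2.37,
    "Granite-Speech-4.1-2B-Plus": 4.2,
    "LFM2.5-Audio-1.5B-GGUF": 1.2,
    "Voxtral-Mini-4B-Realtime-2602": 8.0,
}


def models_within(budget_gb: float) -> List[str]:
    """Return model names whose listed size fits the budget, smallest first."""
    fitting = [(size, name) for name, size in MODEL_SIZES_GB.items() if size <= budget_gb]
    return [name for size, name in sorted(fitting)]
```

Size is only one criterion; features such as prompt guidance or word timestamps, noted in the last column, may matter more for a given task.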