Forced alignment — align text to audio at word level.

Synopsis

izwi align <FILE> <TEXT> [OPTIONS]

Description

Aligns reference text to audio, producing word-level timestamps. Useful for:
  • Subtitle generation
  • Karaoke timing
  • Audio editing
  • Pronunciation analysis

Arguments

Argument    Description
<FILE>      Audio file to align
<TEXT>      Reference text to align

Options

Option                   Description                                Default
-m, --model <MODEL>      Alignment model                            qwen3-forcedaligner-0.6b
-f, --format <FORMAT>    Output format: text, json, verbose_json    json
-o, --output <PATH>      Output file (default: stdout)

Examples

Basic alignment

izwi align audio.wav "Hello world, this is a test." --model Qwen3-ForcedAligner-0.6B

Save to file

izwi align audio.wav "Hello world" --output alignment.json

Text output

izwi align audio.wav "Hello world" --format text
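
The same invocations can be scripted. The following is a minimal sketch, assuming `izwi` is on the PATH; it uses only the flags listed under Options. The helper names `build_align_command` and `run_align` are hypothetical and not part of izwi.

```python
import subprocess
from typing import Optional


def build_align_command(
    audio_path: str,
    text: str,
    fmt: str = "json",
    output: Optional[str] = None,
    model: Optional[str] = None,
) -> list:
    """Build the argument list for `izwi align` from the documented options."""
    cmd = ["izwi", "align", audio_path, text, "--format", fmt]
    if model is not None:
        cmd += ["--model", model]
    if output is not None:
        cmd += ["--output", output]
    return cmd


def run_align(audio_path: str, text: str, **options) -> str:
    """Run the command and return its stdout.

    check=True makes subprocess raise CalledProcessError on a non-zero exit.
    """
    cmd = build_align_command(audio_path, text, **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_align(...) to execute it.
    print(build_align_command("audio.wav", "Hello world", fmt="text"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the reference text contains quotes or other shell metacharacters.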

Output Formats

JSON (default)

{
  "alignments": [
    {"word": "Hello", "start": 0.0, "end": 0.45},
    {"word": "world", "start": 0.50, "end": 0.95},
    {"word": "this", "start": 1.10, "end": 1.30},
    {"word": "is", "start": 1.35, "end": 1.45},
    {"word": "a", "start": 1.50, "end": 1.55},
    {"word": "test", "start": 1.60, "end": 2.00}
  ],
  "duration": 2.0
}
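
To consume this output from a script, a minimal sketch along these lines may help. It assumes the schema shown above (an `alignments` list of `word`/`start`/`end` objects plus `duration`); `load_alignments` is a hypothetical helper name.

```python
import json

# Shortened copy of the JSON example above; in practice this would be the
# contents of the file written with --output.
SAMPLE = """
{
  "alignments": [
    {"word": "Hello", "start": 0.0, "end": 0.45},
    {"word": "world", "start": 0.50, "end": 0.95}
  ],
  "duration": 2.0
}
"""


def load_alignments(raw: str) -> list:
    """Return (word, start, end) tuples from an alignment JSON string."""
    data = json.loads(raw)
    return [
        (item["word"], float(item["start"]), float(item["end"]))
        for item in data["alignments"]
    ]


if __name__ == "__main__":
    for word, start, end in load_alignments(SAMPLE):
        print(f"{word:<10}{start:.2f} - {end:.2f}")
```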

Text

Hello     0.00 - 0.45
world     0.50 - 0.95
this      1.10 - 1.30
is        1.35 - 1.45
a         1.50 - 1.55
test      1.60 - 2.00
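
If only the text output is available, it can be parsed back into numbers. This sketch assumes each line has exactly the layout shown above (word, start, a hyphen, end, separated by whitespace); `parse_text_line` is a hypothetical helper name.

```python
def parse_text_line(line: str) -> tuple:
    """Parse one 'word  start - end' line into (word, start, end)."""
    # Split from the right so the three trailing fields are always
    # start, "-", end, whatever the word itself looks like.
    word, start, dash, end = line.rsplit(None, 3)
    if dash != "-":
        raise ValueError(f"unexpected line format: {line!r}")
    return word, float(start), float(end)


if __name__ == "__main__":
    for line in ["Hello     0.00 - 0.45", "world     0.50 - 0.95"]:
        print(parse_text_line(line))
```

For anything beyond quick inspection, the JSON format is likely the safer choice, since it does not depend on column layout.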

Use Cases

Subtitle Generation

Generate precise timestamps for subtitles:
izwi align video_audio.wav "$(cat script.txt)" --output subtitles.json
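
The JSON output still has to be converted into a subtitle format. The sketch below turns word-level alignments into SubRip (SRT) cues, assuming the default JSON schema shown under Output Formats. Grouping a fixed number of words per cue is a simplification chosen for illustration; `srt_timestamp`, `to_srt`, and `words_per_cue` are hypothetical names.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(alignments: list, words_per_cue: int = 4) -> str:
    """Group consecutive words into fixed-size cues and render them as SRT."""
    cues = []
    for i in range(0, len(alignments), words_per_cue):
        chunk = alignments[i:i + words_per_cue]
        text = " ".join(item["word"] for item in chunk)
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        {"word": "Hello", "start": 0.0, "end": 0.45},
        {"word": "world", "start": 0.50, "end": 0.95},
    ]
    print(to_srt(sample))
```

Real subtitles usually break cues at sentence or phrase boundaries and enforce a maximum line length, so the fixed word count would need replacing for production use.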

Audio Editing

Find exact word boundaries for editing:
izwi align podcast.wav "um actually" --format json
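
Once the words are aligned, the edit region follows from the first and last timestamps. A minimal sketch, assuming the default JSON schema and that the aligned text covers exactly the phrase to cut; `cut_region` and its `padding` parameter are hypothetical.

```python
def cut_region(alignments: list, duration: float, padding: float = 0.05) -> tuple:
    """Return (start, end) in seconds spanning all aligned words.

    The span is widened by `padding` on each side and clamped to
    the range [0, duration].
    """
    if not alignments:
        raise ValueError("no aligned words")
    start = max(0.0, alignments[0]["start"] - padding)
    end = min(duration, alignments[-1]["end"] + padding)
    return start, end


if __name__ == "__main__":
    words = [
        {"word": "um", "start": 1.0, "end": 1.25},
        {"word": "actually", "start": 1.5, "end": 2.0},
    ]
    print(cut_region(words, duration=10.0, padding=0.25))
```

A small padding tends to avoid clipping the onset or release of a word; the right value depends on the material.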

Pronunciation Analysis

Analyze timing of spoken words:
izwi align recording.wav "The quick brown fox" --format verbose_json
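
Simple timing measures can be derived from the timestamps. The sketch below computes each word's duration and the pause before it. The `verbose_json` schema is not shown on this page, so the code relies only on the `word`, `start`, and `end` fields of the default JSON format and assumes they are also present in `verbose_json`; `timing_stats` is a hypothetical helper name.

```python
def timing_stats(alignments: list) -> list:
    """Return per-word duration and preceding pause, in seconds."""
    rows = []
    prev_end = None
    for item in alignments:
        # The first word has no preceding pause; overlaps are clamped to zero.
        pause = 0.0 if prev_end is None else max(0.0, item["start"] - prev_end)
        rows.append({
            "word": item["word"],
            "duration": round(item["end"] - item["start"], 3),
            "pause_before": round(pause, 3),
        })
        prev_end = item["end"]
    return rows


if __name__ == "__main__":
    sample = [
        {"word": "Hello", "start": 0.0, "end": 0.45},
        {"word": "world", "start": 0.50, "end": 0.95},
    ]
    for row in timing_stats(sample):
        print(row)
```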

Available Models

Model                            Description
qwen3-forcedaligner-0.6b         CLI default alias
Qwen3-ForcedAligner-0.6B         Canonical forced aligner model ID
Qwen3-ForcedAligner-0.6B-4bit    Lower-memory variant

See Also