Overview
Izwi’s text-to-speech converts written text into spoken audio. Features include:
- Natural voices — High-quality, expressive speech
- Local audio output — WAV for files and raw PCM for low-level API clients
- Speed control — Adjust playback speed
- Streaming — Real-time audio generation
- Local processing — No cloud, complete privacy
Getting Started
Download a TTS Model
Kokoro-82M Prerequisite (espeak-ng)
If you plan to use Kokoro-82M, install espeak-ng on your system first.
Izwi uses it for Kokoro phonemization and will return an error if it is missing.
- macOS: see macOS Installation
- Linux: see Linux Installation
- Windows: see Windows Installation
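A script can check for the phonemizer before it selects Kokoro-82M. The following is a minimal sketch, assuming espeak-ng is installed under its usual executable name `espeak-ng` and is reachable on `PATH`. The helper name `has_espeak_ng` is illustrative and not part of Izwi.

```python
import shutil


def has_espeak_ng(executable: str = "espeak-ng") -> bool:
    """Return True if the espeak-ng executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if has_espeak_ng():
        print("espeak-ng found; Kokoro-82M phonemization should be available.")
    else:
        print("espeak-ng not found; install it before using Kokoro-82M.")
```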
Generate Speech
Command line:
Using the CLI
Basic Usage
Options
| Option | Description | Default |
|---|---|---|
| `--model`, `-m` | TTS model to use | `qwen3-tts-0.6b-base` |
| `--output`, `-o` | Output file path | stdout |
| `--format`, `-f` | Audio format | `wav` |
| `--speed`, `-r` | Speech speed (0.5-2.0) | `1.0` |
| `--speaker`, `-s` | Voice/speaker ID | `default` |
| `--saved-voice-id` | Reuse a saved reference voice | — |
| `--reference-audio` | Reference audio file for cloning | — |
| `--reference-text` | Transcript for the reference audio | — |
| `--reference-text-file` | File containing the reference transcript | — |
| `--instructions` | Voice-design direction prompt | — |
| `--temperature`, `-t` | Sampling temperature | `0.7` |
| `--play`, `-p` | Request playback after generation; current CLI reports playback as not implemented | — |
| `--stream` | Stream output in real-time | — |
| `--allow-format-fallback` | Accept WAV bytes when a compressed format is unavailable | — |
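When the CLI is driven from a script, it helps to validate options before spawning a process. The sketch below builds an argument list from a few of the flags in the table and rejects speeds outside the documented 0.5-2.0 range. It is a hedged illustration: the `izwi tts` command name and the positional text argument are assumptions, and `build_tts_command` is a hypothetical helper. Only the flag names come from the table.

```python
import shlex

# ASSUMPTION: the executable and subcommand are "izwi tts" and the text is a
# positional argument. Only the flag names are taken from the options table.
BASE_COMMAND = ["izwi", "tts"]


def build_tts_command(text, output=None, model=None, speed=1.0, fmt="wav", stream=False):
    """Build an argv list for a TTS invocation without running it."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed must be between 0.5 and 2.0, got {speed}")
    argv = BASE_COMMAND + [text, "--format", fmt, "--speed", str(speed)]
    if model:
        argv += ["--model", model]
    if output:
        argv += ["--output", output]
    if stream:
        argv.append("--stream")
    return argv


if __name__ == "__main__":
    cmd = build_tts_command("Hello, world.", output="hello.wav", speed=1.2)
    # shlex.join quotes the text so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the text safe from shell quoting problems if the list is later passed to `subprocess.run`.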
Examples
WAV output:
Using the Web UI
- Navigate to Text to Speech in the sidebar
- Enter your text in the input field
- Select a voice (if available)
- Click Generate
- Play or download the audio
Features
- Live preview — Hear audio as it generates
- Download — Save audio files locally
- History — Access recent generations
Using the API
Endpoint
Request
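A request can be assembled with the Python standard library alone. The sketch below is a hedged illustration, not the documented schema: the base URL, the `/v1/audio/speech` path, and the `model`, `input`, and `speed` field names are assumptions modeled on common speech APIs. Only `stream` and `allow_format_fallback` are names that appear on this page, so check the API Reference for the actual fields.

```python
import json
import urllib.request

# ASSUMPTIONS: the base URL, the path, and the "model"/"input"/"speed" field
# names are hypothetical placeholders. "stream" and "allow_format_fallback"
# are the only field names taken from this page.
BASE_URL = "http://localhost:8080"
SPEECH_PATH = "/v1/audio/speech"


def build_speech_request(text, model, speed=1.0, stream=False, allow_format_fallback=False):
    """Return an unsent urllib Request carrying a JSON body."""
    body = {"model": model, "input": text, "speed": speed}
    if stream:
        body["stream"] = True
    if allow_format_fallback:
        body["allow_format_fallback"] = True
    return urllib.request.Request(
        BASE_URL + SPEECH_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("Hello from Izwi.", model="Kokoro-82M")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The function only constructs the request. Passing the result to `urllib.request.urlopen` would send it to whatever server is listening at the assumed address.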
Response
Binary audio data with an appropriate `Content-Type` header.
Set stream to true or stream_format to sse to receive server-sent
audio events instead of one binary response. See the
API Reference for streaming event shapes,
voice-cloning fields, saved voices, and model-specific controls.
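With streaming enabled, a client reads server-sent events instead of a single body. The sketch below parses the generic SSE wire format (`event:` and `data:` fields, with a blank line ending each event) from any iterable of text lines. It treats every payload as an opaque string because the Izwi event shapes are defined in the API Reference, not here. `iter_sse_events` is an illustrative helper, and the `chunk` event name in the sample is a made-up placeholder.

```python
def iter_sse_events(lines):
    """Yield (event_name, data) tuples from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
    # An event not followed by a blank line is dropped, as in the SSE format.


if __name__ == "__main__":
    # Placeholder sample; real event names and payloads may differ.
    sample = ["event: chunk\n", "data: AAAA\n", "\n", "data: done\n", "\n"]
    for name, payload in iter_sse_events(sample):
        print(name, payload)
```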
Example (curl)
Available Models
| Model | Size | Quality | Speed |
|---|---|---|---|
| Kokoro-82M | ~0.4 GB | Good | Fast |
| Qwen3-TTS-12Hz-0.6B-Base | ~2.3 GB | Good | Fast |
| Qwen3-TTS-12Hz-1.7B-Base | ~4.2 GB | Better | Medium |
| Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit | ~1.6 GB | Good | Fast |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign-4bit | ~2.2 GB | Better | Medium |
| VibeVoice-1.5B | ~5.0 GB | Better long-form/reference voice | Medium |
| Voxtral-4B-TTS-2603 | ~8.1 GB | Better | Medium |
VibeVoice-1.5B. For built-in voice presets, use CustomVoice variants.
For prompt-based voice design, use VoiceDesign variants.
Kokoro-82M requires espeak-ng to be installed separately.
Voxtral-4B-TTS-2603 supports 20 preset voices and emits 24 kHz audio. Its
model and bundled voice assets inherit a CC BY-NC 4.0 license.
For built-in speaker IDs, see Voice Presets.
Audio Formats
| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Uncompressed, highest quality |
| PCM | .pcm | Raw PCM for low-level playback pipelines |
| MP3, OPUS, OGG, FLAC, AAC | Matching extension | Recognized request names. The OSS server does not bundle compressed encoders yet, so API clients must set allow_format_fallback: true if they intentionally want WAV bytes returned for these names. |
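Because a compressed format name can come back as WAV bytes when fallback is allowed, a client may want to check what it actually received, and raw PCM needs a container before most players will open it. The sketch below does both with the standard library. The sample layout is an assumption: it treats PCM as 16-bit mono at 24 kHz, which this page states only as the sample rate for Voxtral-4B-TTS-2603. Both helper names are illustrative.

```python
import io
import wave


def is_wav(data: bytes) -> bool:
    """Check for the RIFF/WAVE container signature at the start of the data."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container.

    ASSUMPTION: the defaults (24 kHz, mono, 2 bytes per sample) are
    placeholders; use the values your model actually emits.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    silence = b"\x00\x00" * 2400  # 0.1 s of silence at the assumed settings
    wav_bytes = pcm_to_wav(silence)
    print(is_wav(silence), is_wav(wav_bytes))
```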
Tips
- Punctuation matters — Use proper punctuation for natural pauses
- Break long text — Split very long text into paragraphs
- Test different speeds — Find the right pace for your use case
- Use appropriate models — Larger models = better quality but slower
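The advice to break long text can be automated before the text reaches the synthesizer. The sketch below splits on blank lines first, then packs whole sentences into chunks up to a size limit. The `max_chars` default of 400 is an arbitrary illustration, not an Izwi limit, and `split_text` is a hypothetical helper.

```python
import re


def split_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text into chunks at paragraph, then sentence, boundaries.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())  # collapse inner whitespace
        if not paragraph:
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First sentence. Second sentence.\n\nNew paragraph here."
    for chunk in split_text(sample, max_chars=20):
        print(repr(chunk))
```

Each chunk can then be synthesized separately and the resulting audio concatenated, which also keeps the punctuation that drives natural pauses.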
See Also
- Voice Cloning — Clone custom voices
- Voice Design — Create voices from descriptions
- Voices — Manage saved and built-in voices
- CLI Reference — Full command documentation