Generate natural, human-like speech from text using state-of-the-art TTS models.

Overview

Izwi’s text-to-speech converts written text into spoken audio. Features include:
  • Natural voices — High-quality, expressive speech
  • Local audio output — WAV for files and raw PCM for low-level API clients
  • Speed control — Adjust playback speed
  • Streaming — Real-time audio generation
  • Local processing — No cloud, complete privacy

Getting Started

Download a TTS Model

izwi pull Qwen3-TTS-12Hz-0.6B-Base

Kokoro-82M Prerequisite (espeak-ng)

If you plan to use Kokoro-82M, install espeak-ng on your system first. Izwi uses it for Kokoro phonemization and will return an error if it is missing.
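
A script that drives Kokoro-82M can check for espeak-ng before calling Izwi, so the failure surfaces early. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the executable is named `espeak-ng` and is discoverable on PATH.

```python
import shutil
from typing import Callable, Optional


def espeak_ng_available(which: Callable[[str], Optional[str]] = shutil.which) -> bool:
    """Return True if an `espeak-ng` executable is found on PATH."""
    # Assumption: the binary is installed under the name `espeak-ng`.
    return which("espeak-ng") is not None


def require_espeak_ng(which: Callable[[str], Optional[str]] = shutil.which) -> None:
    """Raise a clear error before attempting Kokoro-82M synthesis."""
    if not espeak_ng_available(which):
        raise RuntimeError(
            "espeak-ng not found on PATH; install it before using Kokoro-82M"
        )
```

Calling `require_espeak_ng()` at start-up is enough; the `which` parameter exists only so the lookup can be replaced in tests.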

Generate Speech

Command line:
izwi tts "Hello, welcome to Izwi!" --output hello.wav
Request playback:
izwi tts "Hello, welcome to Izwi!" --play
The current CLI saves the audio and reports playback as not implemented. Use your system audio player to open the generated file.
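
Until playback is available, a quick sanity check on the generated file can be scripted. This sketch reads the WAV header with Python's standard `wave` module; `describe_wav` is a hypothetical helper, and it assumes the file is integer PCM WAV, since `wave` raises an error for formats it cannot parse.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic properties from a WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            # Duration is the frame count divided by frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }
```

For example, `describe_wav("hello.wav")` returns the channel count, sample rate, sample width, and duration of the file written by the command above.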

Using the CLI

Basic Usage

izwi tts "Your text here" --output output.wav

Options

| Option | Description | Default |
| --- | --- | --- |
| --model, -m | TTS model to use | qwen3-tts-0.6b-base |
| --output, -o | Output file path | stdout |
| --format, -f | Audio format | wav |
| --speed, -r | Speech speed (0.5-2.0) | 1.0 |
| --speaker, -s | Voice/speaker ID | default |
| --saved-voice-id | Reuse a saved reference voice | |
| --reference-audio | Reference audio file for cloning | |
| --reference-text | Transcript for the reference audio | |
| --reference-text-file | File containing the reference transcript | |
| --instructions | Voice-design direction prompt | |
| --temperature, -t | Sampling temperature | 0.7 |
| --play, -p | Request playback after generation; current CLI reports playback as not implemented | |
| --stream | Stream output in real-time | |
| --allow-format-fallback | Accept WAV bytes when a compressed format is unavailable | |
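
When the CLI is called from a script, it can help to assemble the argument list in one place and reject an out-of-range speed before launching the process. The sketch below uses only flags listed in the table above; `build_tts_command` is a hypothetical helper, and how the CLI itself handles speeds outside 0.5-2.0 is not stated here, so the early check is an assumption about what is convenient, not a description of Izwi's behavior.

```python
from typing import List, Optional

MIN_SPEED = 0.5  # lower bound from the --speed row of the options table
MAX_SPEED = 2.0  # upper bound from the same row


def build_tts_command(
    text: str,
    output: str,
    model: Optional[str] = None,
    speed: float = 1.0,
    audio_format: str = "wav",
) -> List[str]:
    """Assemble an `izwi tts` argument list from documented flags."""
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(
            f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {speed}"
        )
    command = [
        "izwi", "tts", text,
        "--output", output,
        "--format", audio_format,
        "--speed", str(speed),
    ]
    # Only pass --model when the caller chose one; otherwise the CLI default applies.
    if model is not None:
        command += ["--model", model]
    return command
```

The returned list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting problems in the text argument.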

Examples

WAV output:
izwi tts "Hello world" --format wav --output hello.wav
Adjust speed:
# Slower (0.5x - 1.0x)
izwi tts "Speaking slowly" --speed 0.75 --output slow.wav

# Faster (1.0x - 2.0x)
izwi tts "Speaking quickly" --speed 1.5 --output fast.wav
Read from stdin:
echo "Text from pipe" | izwi tts - --output piped.wav
cat article.txt | izwi tts - --output article.wav
Streaming output:
izwi tts "Long text for streaming" --stream --play
Clone from reference audio:
izwi tts "This line uses the reference voice." \
  --model Qwen3-TTS-12Hz-0.6B-Base \
  --reference-audio samples/reference.wav \
  --reference-text "This is the text spoken in the reference sample." \
  --output cloned.wav
Reuse a saved voice:
izwi tts "This line uses a saved voice." \
  --model Qwen3-TTS-12Hz-0.6B-Base \
  --saved-voice-id voice_abc123 \
  --output saved-voice.wav
Design a prompted voice:
izwi tts "This line uses a designed voice." \
  --model Qwen3-TTS-12Hz-1.7B-VoiceDesign \
  --instructions "A friendly audiobook narrator with warm pacing" \
  --output designed.wav
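
The stdin example above can also be driven from Python without a shell pipe. This is a minimal sketch, assuming `izwi` is on PATH; `speak_from_stdin` and its `runner` parameter are hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable


def speak_from_stdin(text: str, output: str, runner: Callable = subprocess.run):
    """Pipe `text` to `izwi tts -` and write the audio to `output`."""
    command = ["izwi", "tts", "-", "--output", output]
    # input= feeds the text on stdin; check=True raises if izwi exits non-zero.
    return runner(command, input=text, text=True, check=True)
```

Passing the text through `input=` keeps quotes and newlines intact, which matters for long articles read from a file.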

Using the Web UI

  1. Navigate to Text to Speech in the sidebar
  2. Enter your text in the input field
  3. Select a voice (if available)
  4. Click Generate
  5. Play or download the audio

Features

  • Live preview — Hear audio as it generates
  • Download — Save audio files locally
  • History — Access recent generations

Using the API

Endpoint

POST /v1/audio/speech

Request

{
  "model": "Qwen3-TTS-12Hz-0.6B-Base",
  "input": "Hello, this is a test.",
  "voice": "default",
  "speed": 1.0,
  "response_format": "wav"
}
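
A client can build this body programmatically. The sketch below uses the field names from the request above and omits any optional field left unset, mirroring the curl example, which sends only `model` and `input`. `build_speech_payload` is a hypothetical helper, not part of Izwi.

```python
import json
from typing import Any, Dict, Optional


def build_speech_payload(
    text: str,
    model: str = "Qwen3-TTS-12Hz-0.6B-Base",
    voice: Optional[str] = None,
    speed: Optional[float] = None,
    response_format: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the JSON body for POST /v1/audio/speech."""
    payload: Dict[str, Any] = {"model": model, "input": text}
    optional = {"voice": voice, "speed": speed, "response_format": response_format}
    # Leave out unset fields so the server applies its own defaults.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


def to_json(payload: Dict[str, Any]) -> str:
    """Serialize the payload for use as a request body."""
    return json.dumps(payload)
```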

Response

The response body is binary audio data with the matching Content-Type header. To receive server-sent audio events instead of a single binary response, set stream to true or stream_format to sse. See the API Reference for streaming event shapes, voice-cloning fields, saved voices, and model-specific controls.

Example (curl)

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-TTS-12Hz-0.6B-Base", "input": "Hello world"}' \
  --output speech.wav
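
A rough Python equivalent of the curl call, using only the standard library, might look like the following. It assumes the server is reachable at `http://localhost:8080` as in the example above and handles only the non-streaming binary response; `build_speech_request` and `save_speech` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Callable, Dict


def build_speech_request(
    payload: Dict[str, Any], base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Create the POST request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(
    request: urllib.request.Request,
    path: str,
    opener: Callable = urllib.request.urlopen,
) -> int:
    """Send the request, write the audio bytes to `path`, return the byte count."""
    with opener(request) as response:
        audio = response.read()
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

With a server running, `save_speech(build_speech_request({"model": "Qwen3-TTS-12Hz-0.6B-Base", "input": "Hello world"}), "speech.wav")` writes the same file as the curl command.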

Available Models

| Model | Size | Quality | Speed |
| --- | --- | --- | --- |
| Kokoro-82M | ~0.4 GB | Good | Fast |
| Qwen3-TTS-12Hz-0.6B-Base | ~2.3 GB | Good | Fast |
| Qwen3-TTS-12Hz-1.7B-Base | ~4.2 GB | Better | Medium |
| Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit | ~1.6 GB | Good | Fast |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign-4bit | ~2.2 GB | Better | Medium |
| VibeVoice-1.5B | ~5.0 GB | Better long-form/reference voice | Medium |
| Voxtral-4B-TTS-2603 | ~8.1 GB | Better | Medium |
For reference-audio cloning, use Base variants or VibeVoice-1.5B.
For built-in voice presets, use CustomVoice variants.
For prompt-based voice design, use VoiceDesign variants.
Kokoro-82M requires espeak-ng to be installed separately.
Voxtral-4B-TTS-2603 supports 20 preset voices and emits 24 kHz audio. Its model and bundled voice assets inherit a CC BY-NC 4.0 license.
For built-in speaker IDs, see Voice Presets.
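
The selection guidance above can be expressed as a small lookup. This sketch hard-codes the model names from the table and matches on name suffixes and substrings, which is an assumption about naming, not an Izwi API; `models_for` and the use-case labels are hypothetical. It follows the stated guidance literally, so Voxtral-4B-TTS-2603's own preset voices are not included under "preset".

```python
from typing import List

# Model names copied from the Available Models table.
MODELS = [
    "Kokoro-82M",
    "Qwen3-TTS-12Hz-0.6B-Base",
    "Qwen3-TTS-12Hz-1.7B-Base",
    "Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign-4bit",
    "VibeVoice-1.5B",
    "Voxtral-4B-TTS-2603",
]


def models_for(use_case: str) -> List[str]:
    """Return candidate models for 'clone', 'preset', or 'design'."""
    if use_case == "clone":
        # Reference-audio cloning: Base variants or VibeVoice-1.5B.
        return [m for m in MODELS if m.endswith("-Base") or m == "VibeVoice-1.5B"]
    if use_case == "preset":
        # Built-in voice presets: CustomVoice variants.
        return [m for m in MODELS if "CustomVoice" in m]
    if use_case == "design":
        # Prompt-based voice design: VoiceDesign variants.
        return [m for m in MODELS if "VoiceDesign" in m]
    raise ValueError(f"unknown use case: {use_case!r}")
```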

Audio Formats

| Format | Extension | Notes |
| --- | --- | --- |
| WAV | .wav | Uncompressed, highest quality |
| PCM | .pcm | Raw PCM for low-level playback pipelines |
| MP3, OPUS, OGG, FLAC, AAC | Matching extension | Recognized request names. The OSS server does not bundle compressed encoders yet, so API clients must set allow_format_fallback: true if they intentionally want WAV bytes returned for these names. |
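
The fallback rule in the table can be made explicit in client code, so that a compressed format name is never sent without a deliberate opt-in. This is a sketch under stated assumptions: the lowercase request names (`mp3`, `opus`, and so on) are inferred from the lowercase `wav` in the request example, and `format_fields` is a hypothetical helper.

```python
from typing import Any, Dict

NATIVE_FORMATS = {"wav", "pcm"}
# Assumption: request names are the lowercase forms of the names in the table.
COMPRESSED_FORMATS = {"mp3", "opus", "ogg", "flac", "aac"}


def format_fields(response_format: str, accept_wav_fallback: bool = False) -> Dict[str, Any]:
    """Return the format-related request fields for the chosen format."""
    name = response_format.lower()
    if name in NATIVE_FORMATS:
        return {"response_format": name}
    if name in COMPRESSED_FORMATS:
        if not accept_wav_fallback:
            raise ValueError(
                f"{name} is not encoded by the OSS server; "
                "opt in to WAV fallback or request wav"
            )
        # The server will return WAV bytes under the requested name.
        return {"response_format": name, "allow_format_fallback": True}
    raise ValueError(f"unrecognized format: {response_format!r}")
```

Because the returned bytes are WAV in the fallback case, saving them with a `.wav` extension avoids confusing downstream players.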

Tips

  1. Punctuation matters — Use proper punctuation for natural pauses
  2. Break long text — Split very long text into paragraphs
  3. Test different speeds — Find the right pace for your use case
  4. Use appropriate models — Larger models = better quality but slower
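
The second tip can be automated. The sketch below splits text on blank lines and then packs whole sentences into chunks; `split_into_chunks` is a hypothetical helper, and the 500-character default is an arbitrary illustration, not an Izwi limit.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Split text into paragraph-based chunks of roughly max_chars each.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    chunks: List[str] = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        current = ""
        # Break after sentence-ending punctuation so pauses stay natural.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately, for example one `izwi tts` call or one API request per chunk.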

See Also