Clone any voice from a short audio sample and use it for text-to-speech generation.

Overview

Voice cloning creates a custom voice from a reference audio sample. Use it to:
  • Personalize TTS — Generate speech in a specific voice
  • Create characters — Unique voices for games or media
  • Accessibility — Preserve a person’s voice
  • Localization — Maintain voice consistency across languages

Getting Started

Download a Voice Cloning Model

izwi pull Qwen3-TTS-12Hz-0.6B-Base

Clone a Voice

  1. Prepare a reference audio file (5-30 seconds of clear speech) and, for the CLI, a transcript of what is spoken in it
  2. Generate speech with that reference through the Web UI, CLI, or API

Using the Web UI

Voice cloning now lives inside the unified Voices workspace.

Step 1: Upload Reference Audio

  1. Navigate to Voices in the sidebar and choose the clone flow
  2. Upload a reference audio file
  3. The audio should be:
    • 5-30 seconds long
    • Clear speech, minimal background noise
    • Single speaker

Step 2: Generate Speech

  1. Enter the text you want to speak
  2. Click Generate
  3. Listen to the output in the cloned voice

Step 3: Save and Reuse

  • Download generated audio
  • Save the voice profile for future use

Using the CLI

Use izwi tts with a Base or VibeVoice TTS model and provide both the reference audio and its transcript:

izwi tts "This sentence will use the cloned voice." \
  --model Qwen3-TTS-12Hz-0.6B-Base \
  --reference-audio samples/reference.wav \
  --reference-text "This is the exact text spoken in the reference sample." \
  --output cloned.wav

For longer reference transcripts, store the transcript in a file:

izwi tts "This sentence uses a transcript file." \
  --model Qwen3-TTS-12Hz-1.7B-Base \
  --reference-audio samples/reference.wav \
  --reference-text-file samples/reference.txt \
  --output cloned-from-file.wav

Saved voices created in Voice Studio or through /v1/voices can be reused from the CLI:

izwi tts "This sentence uses my saved voice." \
  --model Qwen3-TTS-12Hz-0.6B-Base \
  --saved-voice-id voice_abc123 \
  --output saved-voice.wav

--saved-voice-id is mutually exclusive with direct --reference-audio and --reference-text input.
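
The two input modes above can be guarded in a small wrapper so a script never mixes them. This is a minimal sketch, not part of izwi: clone_tts and the IZWI_BIN override are hypothetical names introduced here, and only the flags shown in the commands above are used.

```shell
# Hypothetical wrapper around `izwi tts`: use either a saved voice or a
# reference sample, and refuse to mix the two.
# IZWI_BIN is an assumed override (handy for dry runs); it defaults to `izwi`.
# Usage: clone_tts TEXT MODEL OUTPUT SAVED_ID [AUDIO TRANSCRIPT]
#        Pass "" as SAVED_ID to use AUDIO and TRANSCRIPT instead.
clone_tts() {
  text=$1 model=$2 output=$3 saved=$4 audio=${5:-} transcript=${6:-}

  # Reject a saved voice combined with any direct reference input.
  if [ -n "$saved" ] && { [ -n "$audio" ] || [ -n "$transcript" ]; }; then
    echo "error: pass a saved voice ID or reference audio/text, not both" >&2
    return 2
  fi

  if [ -n "$saved" ]; then
    "${IZWI_BIN:-izwi}" tts "$text" --model "$model" \
      --saved-voice-id "$saved" --output "$output"
  elif [ -n "$audio" ] && [ -n "$transcript" ]; then
    "${IZWI_BIN:-izwi}" tts "$text" --model "$model" \
      --reference-audio "$audio" --reference-text "$transcript" \
      --output "$output"
  else
    echo "error: reference audio and its transcript are both required" >&2
    return 2
  fi
}
```

Setting IZWI_BIN=echo prints the command that would run instead of executing it, which is a cheap way to check the arguments before generating audio.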

Using the API

Endpoint

POST /v1/audio/speech

Request (JSON)

Field             Type     Description
model             String   Base model ID (for example Qwen3-TTS-12Hz-0.6B-Base)
input             String   Text to synthesize
reference_audio   String   Base64-encoded reference audio
reference_text    String   Transcript of reference audio
saved_voice_id    String   Optional saved voice reference to reuse instead of resending audio

Example (curl)

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS-12Hz-0.6B-Base",
    "input": "Hello, this is my cloned voice",
    "reference_audio": "<base64-audio>",
    "reference_text": "Hello, this is my cloned voice sample."
  }' \
  --output cloned.wav
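
The <base64-audio> placeholder in the request above has to be produced from a real file. The sketch below shows one way to build the request body in plain shell, assuming the standard base64, sed, and tr utilities are available. json_escape and build_clone_payload are hypothetical helper names, and the escaping is deliberately minimal.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Assumption: the text contains no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print the JSON body for POST /v1/audio/speech.
# Usage: build_clone_payload MODEL TEXT AUDIO_FILE TRANSCRIPT
build_clone_payload() {
  # base64 may wrap its output across lines; strip the newlines so the
  # encoded audio is a single JSON string value.
  audio_b64=$(base64 < "$3" | tr -d '\n')
  printf '{"model":"%s","input":"%s","reference_audio":"%s","reference_text":"%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" \
    "$audio_b64" "$(json_escape "$4")"
}

# Example call (not executed here); curl reads the body from stdin via -d @-:
#   build_clone_payload "Qwen3-TTS-12Hz-0.6B-Base" \
#       "Hello, this is my cloned voice" samples/reference.wav \
#       "Hello, this is my cloned voice sample." |
#     curl -X POST http://localhost:8080/v1/audio/speech \
#       -H "Content-Type: application/json" -d @- --output cloned.wav
```

Piping the body through stdin keeps the base64 data off the command line, where a 30-second clip could exceed argument length limits.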

Saved voices can also be managed through /v1/voices and reused from /v1/audio/speech. See the API Reference for the saved voice routes and exact fields.

Reference Audio Guidelines

Ideal Reference Audio

Aspect       Recommendation
Duration     5-30 seconds
Quality      High quality, clear audio
Content      Natural speech, varied intonation
Background   Minimal noise
Speaker      Single speaker only
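
The duration recommendation in the table above can be checked before uploading a clip. This is a rough sketch that assumes ffprobe (from FFmpeg, a separate tool that izwi does not provide) is installed. duration_ok and check_reference are hypothetical helper names.

```shell
# Succeed when a duration in seconds falls inside the recommended
# 5-30 second window (bounds included).
duration_ok() {
  awk -v d="$1" 'BEGIN { exit !(d >= 5 && d <= 30) }'
}

# Read the clip length with ffprobe and report whether it fits the window.
# Assumption: ffprobe is on PATH; this is not an izwi command.
check_reference() {
  seconds=$(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1") || return 1
  if duration_ok "$seconds"; then
    echo "$1: ${seconds}s, within the recommended range"
  else
    echo "$1: ${seconds}s, outside the recommended 5-30 second range" >&2
    return 1
  fi
}

# Example call (not executed here):
#   check_reference samples/reference.wav
```

This covers length only. Noise, overlapping speakers, and background music still need a listen.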

Good Examples

  • Podcast clips
  • Interview segments
  • Voice memos
  • Audiobook excerpts

Poor Examples

  • Music with vocals
  • Multiple speakers
  • Heavy background noise
  • Very short clips (under 3 seconds)
  • Whispered or distorted speech

Tips for Best Results

  1. Quality over quantity — A clear 10-second clip beats a noisy 30-second one
  2. Natural speech — Avoid monotone or exaggerated delivery
  3. Match content — Reference emotion should match desired output
  4. Consistent volume — Avoid clips with volume changes
  5. No music — Background music interferes with cloning

Available Models

Model                      Size      Quality
Qwen3-TTS-12Hz-0.6B-Base   ~2.3 GB   Good
Qwen3-TTS-12Hz-1.7B-Base   ~4.2 GB   Better
VibeVoice-1.5B             ~5.0 GB   Better long-form/reference voice

Larger models generally produce more accurate voice clones, at the cost of a larger download.

Ethical Considerations

Voice cloning is a powerful technology. Please use it responsibly:
  • Get consent — Only clone voices with permission
  • Don’t impersonate — Never use cloned voices to deceive
  • Respect privacy — Don’t clone voices without authorization
  • Legal compliance — Follow applicable laws and regulations

Limitations

  • Accent accuracy — May not perfectly capture all accents
  • Emotional range — Cloned voices may have limited expressiveness
  • Unique characteristics — Some voice qualities are hard to replicate
  • Language — Best results in the model’s primary language

See Also