Voice Cloning

Clone any voice from a short audio sample and use it for text-to-speech generation.

Overview

Voice cloning creates a custom voice from a reference audio sample. Use it to:

Personalize TTS — Generate speech in a specific voice
Create characters — Unique voices for games or media
Accessibility — Preserve a person’s voice
Localization — Maintain voice consistency across languages

Getting Started

Download a Voice Cloning Model

izwi pull Qwen3-TTS-12Hz-0.6B-Base

Clone a Voice

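Prepare a reference audio file (5-30 seconds of clear speech) and a transcript of what is spoken in it (the CLI and API examples below pass both)
Use the reference as the voice for TTS generation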

Using the Web UI

Voice cloning now lives inside the unified Voices workspace.

Step 1: Upload Reference Audio

Navigate to Voices in the sidebar and choose the clone flow
Upload a reference audio file
The audio should be:
- 5-30 seconds long
- Clear speech, minimal background noise
- Single speaker

Step 2: Generate Speech

Enter the text you want to speak
Click Generate
Listen to the output in the cloned voice

Step 3: Save and Reuse

Download generated audio
Save the voice profile for future use

Using the CLI

Use izwi tts with a Base or VibeVoice TTS model and provide both the reference audio and its transcript:

izwi tts "This sentence will use the cloned voice." \
  --model Qwen3-TTS-12Hz-0.6B-Base \
  --reference-audio samples/reference.wav \
  --reference-text "This is the exact text spoken in the reference sample." \
  --output cloned.wav

For longer reference transcripts, store the transcript in a file:

izwi tts "This sentence uses a transcript file." \
  --model Qwen3-TTS-12Hz-1.7B-Base \
  --reference-audio samples/reference.wav \
  --reference-text-file samples/reference.txt \
  --output cloned-from-file.wav

Saved voices created in Voice Studio or through /v1/voices can be reused from the CLI:

izwi tts "This sentence uses my saved voice." \
  --model Qwen3-TTS-12Hz-0.6B-Base \
  --saved-voice-id voice_abc123 \
  --output saved-voice.wav

--saved-voice-id is mutually exclusive with --reference-audio and --reference-text: pass either a saved voice ID or a reference sample with its transcript, not both.
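When izwi tts is driven from a script, the mutual-exclusivity rule can be checked before the command runs. The following is a minimal Python sketch, not part of izwi: the helper name build_tts_command is hypothetical, and it only uses the flags shown in the examples above.

```python
from typing import List, Optional


def build_tts_command(
    text: str,
    model: str,
    output: str,
    reference_audio: Optional[str] = None,
    reference_text: Optional[str] = None,
    saved_voice_id: Optional[str] = None,
) -> List[str]:
    """Assemble an `izwi tts` argument list (hypothetical helper, not part of izwi)."""
    has_reference = reference_audio is not None or reference_text is not None

    # A saved voice and direct reference input cannot be combined.
    if saved_voice_id is not None and has_reference:
        raise ValueError("--saved-voice-id cannot be combined with reference input")

    # Without a saved voice, the audio and its transcript are both needed.
    if saved_voice_id is None and (reference_audio is None or reference_text is None):
        raise ValueError("provide reference audio and its transcript, or a saved voice ID")

    command = ["izwi", "tts", text, "--model", model]
    if saved_voice_id is not None:
        command += ["--saved-voice-id", saved_voice_id]
    else:
        command += ["--reference-audio", reference_audio,
                    "--reference-text", reference_text]
    command += ["--output", output]
    return command
```

The returned list can be passed to subprocess.run(command, check=True). Building a list instead of a shell string avoids quoting problems when the text contains spaces or quotes.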

Using the API

Endpoint

POST /v1/audio/speech

Request (JSON)

Field	Type	Description
`model`	String	Base model ID (for example `Qwen3-TTS-12Hz-0.6B-Base`)
`input`	String	Text to synthesize
`reference_audio`	String	Base64-encoded reference audio
`reference_text`	String	Transcript of reference audio
`saved_voice_id`	String	Optional saved voice reference to reuse instead of resending audio

Example (curl)

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS-12Hz-0.6B-Base",
    "input": "Hello, this is my cloned voice",
    "reference_audio": "<base64-audio>",
    "reference_text": "Hello, this is my cloned voice sample."
  }' \
  --output cloned.wav
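
The same request can be built from Python using only the standard library. This is a sketch under stated assumptions: the function names are illustrative, the field names come from the request table above, and it assumes `reference_audio` is the raw file bytes in standard Base64 with no data-URI prefix, which the table does not spell out.

```python
import base64
import json
import urllib.request
from pathlib import Path


def build_clone_payload(audio_path: str, reference_text: str, text: str,
                        model: str = "Qwen3-TTS-12Hz-0.6B-Base") -> dict:
    """Build the JSON body using the field names from the request table."""
    audio_bytes = Path(audio_path).read_bytes()
    return {
        "model": model,
        "input": text,
        # Assumption: raw file bytes, standard Base64, no data-URI prefix.
        "reference_audio": base64.b64encode(audio_bytes).decode("ascii"),
        "reference_text": reference_text,
    }


def synthesize(payload: dict, output_path: str,
               url: str = "http://localhost:8080/v1/audio/speech") -> None:
    """POST the payload and write the response body to output_path."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        Path(output_path).write_bytes(response.read())
```

With a server running locally, `synthesize(build_clone_payload("samples/reference.wav", "Hello, this is my cloned voice sample.", "Hello, this is my cloned voice"), "cloned.wav")` mirrors the curl call.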

Saved voices can also be managed through /v1/voices and reused from /v1/audio/speech. See the API Reference for the saved voice routes and exact fields.
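
For reuse, the request body carries `saved_voice_id` in place of the reference audio and transcript. The Python sketch below only builds that body. The helper name is hypothetical, and the exact fields should be checked against the API Reference.

```python
import json


def build_saved_voice_payload(saved_voice_id: str, text: str,
                              model: str = "Qwen3-TTS-12Hz-0.6B-Base") -> str:
    """Return a JSON body that reuses a saved voice instead of resending audio."""
    body = {
        "model": model,
        "input": text,
        "saved_voice_id": saved_voice_id,  # for example "voice_abc123"
    }
    return json.dumps(body)
```

The resulting string is sent to POST /v1/audio/speech with Content-Type: application/json, the same way as the reference-audio request.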

Reference Audio Guidelines

Ideal Reference Audio

Aspect	Recommendation
Duration	5-30 seconds
Quality	High quality, clear audio
Content	Natural speech, varied intonation
Background	Minimal noise
Speaker	Single speaker only
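
The duration recommendation is easy to check before uploading a clip. The following Python sketch assumes the reference is an uncompressed WAV file readable by the standard-library wave module. The function names are illustrative, and the check cannot detect noise or multiple speakers.

```python
import wave


def reference_duration_seconds(path: str) -> float:
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_recommended_length(path: str, min_seconds: float = 5.0,
                          max_seconds: float = 30.0) -> bool:
    """True when the clip falls inside the recommended 5-30 second range."""
    return min_seconds <= reference_duration_seconds(path) <= max_seconds
```

Other formats such as MP3 would need to be converted to WAV first, or checked with a different library.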

Good Examples

Podcast clips
Interview segments
Voice memos
Audiobook excerpts

Poor Examples

Music with vocals
Multiple speakers
Heavy background noise
Very short clips (under 3 seconds)
Whispered or distorted speech

Tips for Best Results

Quality over quantity — A clear 10-second clip beats a noisy 30-second one
Natural speech — Avoid monotone or exaggerated delivery
Match content — Reference emotion should match desired output
Consistent volume — Avoid clips with volume changes
No music — Background music interferes with cloning

Available Models

Model	Size	Quality
`Qwen3-TTS-12Hz-0.6B-Base`	~2.3 GB	Good
`Qwen3-TTS-12Hz-1.7B-Base`	~4.2 GB	Better
`VibeVoice-1.5B`	~5.0 GB	Better long-form/reference voice

Larger models generally produce more accurate voice clones but take more disk space.

Ethical Considerations

Voice cloning is a powerful technology. Please use it responsibly:

Get consent — Only clone voices with permission
Don’t impersonate — Never use cloned voices to deceive
Respect privacy — Don’t clone voices without authorization
Legal compliance — Follow applicable laws and regulations

Limitations

Accent accuracy — May not perfectly capture all accents
Emotional range — Cloned voices may have limited expressiveness
Unique characteristics — Some voice qualities are hard to replicate
Language — Best results in the model’s primary language
