Overview
Voice cloning creates a custom voice from a reference audio sample. Use it to:
- Personalize TTS — Generate speech in a specific voice
- Create characters — Unique voices for games or media
- Accessibility — Preserve a person’s voice
- Localization — Maintain voice consistency across languages
Getting Started
Download a Voice Cloning Model
Clone a Voice
- Prepare a reference audio file (5-30 seconds of clear speech)
- Use the voice for TTS generation
Using the Web UI
Voice cloning now lives inside the unified Voices workspace.
Step 1: Upload Reference Audio
- Navigate to Voices in the sidebar and choose the clone flow
- Upload a reference audio file
- The audio should be:
  - 5-30 seconds long
  - Clear speech, minimal background noise
  - Single speaker
Step 2: Generate Speech
- Enter the text you want to speak
- Click Generate
- Listen to the output in the cloned voice
Step 3: Save and Reuse
- Download generated audio
- Save the voice profile for future use
Using the CLI
Use izwi tts with a Base or VibeVoice TTS model and provide both the
reference audio (--reference-audio) and its transcript (--reference-text).
Saved voices created through /v1/voices can be reused from the CLI with
--saved-voice-id. This flag is mutually exclusive with direct
--reference-audio and --reference-text input.
Using the API
Endpoint
Request (JSON)
| Field | Type | Description |
|---|---|---|
| model | String | Base model ID (for example Qwen3-TTS-12Hz-0.6B-Base) |
| input | String | Text to synthesize |
| reference_audio | String | Base64-encoded reference audio |
| reference_text | String | Transcript of reference audio |
| saved_voice_id | String | Optional saved voice reference to reuse instead of resending audio |
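As an illustration of how the fields above fit together, here is a minimal Python sketch that assembles a request body. The function name `build_speech_payload` is hypothetical, and the rule that `saved_voice_id` cannot be combined with direct reference input is an assumption carried over from the CLI behavior; the API may be more lenient.

```python
import base64
from typing import Optional


def build_speech_payload(
    model: str,
    text: str,
    reference_audio: Optional[bytes] = None,
    reference_text: Optional[str] = None,
    saved_voice_id: Optional[str] = None,
) -> dict:
    """Build a JSON-serializable body from the documented request fields."""
    payload = {"model": model, "input": text}
    has_reference = reference_audio is not None or reference_text is not None

    if saved_voice_id is not None:
        # Assumption: mirror the CLI rule and reject mixed input.
        if has_reference:
            raise ValueError("pass either saved_voice_id or reference audio/text, not both")
        payload["saved_voice_id"] = saved_voice_id
        return payload

    # Direct cloning needs both the audio and its transcript.
    if reference_audio is None or reference_text is None:
        raise ValueError("reference_audio and reference_text must be given together")

    # The table specifies Base64-encoded audio, so encode the raw file bytes.
    payload["reference_audio"] = base64.b64encode(reference_audio).decode("ascii")
    payload["reference_text"] = reference_text
    return payload
```

In practice the raw bytes would come from reading the reference file in binary mode, and the resulting dictionary can be serialized with `json.dumps`.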
Example (curl)
Saved voices can be created through /v1/voices and reused from
/v1/audio/speech. See the API Reference for the
saved voice routes and exact fields.
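For readers who prefer Python to curl, the sketch below prepares a POST request for /v1/audio/speech using only the standard library. The helper name `make_speech_request`, the base URL, and the voice ID in the usage note are placeholders, and the JSON content type is an assumption based on the "Request (JSON)" heading; check the API Reference for the exact contract.

```python
import json
import urllib.request


def make_speech_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST to the speech endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A request that reuses a saved voice might be built with `make_speech_request("http://localhost:8080", {"model": "Qwen3-TTS-12Hz-0.6B-Base", "input": "Hello", "saved_voice_id": "my-voice"})` and sent with `urllib.request.urlopen`. The response format is not described here, so treat the returned bytes according to the API Reference.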
Reference Audio Guidelines
Ideal Reference Audio
| Aspect | Recommendation |
|---|---|
| Duration | 5-30 seconds |
| Quality | High quality, clear audio |
| Content | Natural speech, varied intonation |
| Background | Minimal noise |
| Speaker | Single speaker only |
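The duration recommendation in the table above can be checked locally before uploading a clip. The following is a rough sketch that assumes the reference file is an uncompressed WAV (the only format Python's standard `wave` module reads); the function name `check_reference_clip` is hypothetical, and this page does not state which audio formats the product accepts. Noise level and speaker count are not checked.

```python
import wave


def check_reference_clip(path: str, min_s: float = 5.0, max_s: float = 30.0) -> list:
    """Return a list of duration problems for a WAV reference clip."""
    problems = []
    with wave.open(path, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        duration = wav.getnframes() / wav.getframerate()

    if duration < min_s:
        problems.append(f"clip is {duration:.1f}s, shorter than {min_s:.0f}s")
    elif duration > max_s:
        problems.append(f"clip is {duration:.1f}s, longer than {max_s:.0f}s")
    return problems
```

An empty list means the clip falls inside the recommended 5-30 second window.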
Good Examples
- Podcast clips
- Interview segments
- Voice memos
- Audiobook excerpts
Poor Examples
- Music with vocals
- Multiple speakers
- Heavy background noise
- Very short clips (under 3 seconds)
- Whispered or distorted speech
Tips for Best Results
- Quality over quantity — A clear 10-second clip beats a noisy 30-second one
- Natural speech — Avoid monotone or exaggerated delivery
- Match content — Reference emotion should match desired output
- Consistent volume — Avoid clips with volume changes
- No music — Background music interferes with cloning
Available Models
| Model | Size | Quality |
|---|---|---|
| Qwen3-TTS-12Hz-0.6B-Base | ~2.3 GB | Good |
| Qwen3-TTS-12Hz-1.7B-Base | ~4.2 GB | Better |
| VibeVoice-1.5B | ~5.0 GB | Better long-form/reference voice |
Ethical Considerations
Voice cloning is a powerful technology. Please use it responsibly:
- Get consent — Only clone voices with permission
- Don’t impersonate — Never use cloned voices to deceive
- Respect privacy — Don’t clone voices without authorization
- Legal compliance — Follow applicable laws and regulations
Limitations
- Accent accuracy — May not perfectly capture all accents
- Emotional range — Cloned voices may have limited expressiveness
- Unique characteristics — Some voice qualities are hard to replicate
- Language — Best results in the model’s primary language
See Also
- Voices — Manage and reuse saved voices
- Voice Design — Create voices from descriptions
- Text-to-Speech — Standard TTS
- Models — Download models