Core Features
| Feature | Description | Guide |
|---|---|---|
| Voice | Real-time voice conversations with AI | Voice Guide |
| Chat | Text-based AI conversations | Chat Guide |
| Transcription | Convert audio to text | Transcription Guide |
| Speaker Attributed ASR | Generate Granite speaker-turn transcripts | SAA Guide |
| Voices | Create, clone, design, preview, manage, and reuse voice assets | Voice Studio Guide |
| Text-to-Speech | Generate natural speech from text | TTS Guide |
| Studio | Manage long-form TTS projects and exports | Studio Guide |
| Settings | Configure appearance, updates, privacy, onboarding, and desktop behavior | Settings Guide |
| Diarization | Identify multiple speakers | Diarization Guide |
| Voice Cloning | Clone voices from audio samples inside Voice Studio | Voice Cloning Guide |
| Voice Design | Create voices from descriptions inside Voice Studio | Voice Design Guide |
Feature Comparison
| Feature | Web UI | Desktop | CLI | API |
|---|---|---|---|---|
| Voice | ✓ | ✓ | — | ✓ |
| Chat | ✓ | ✓ | ✓ | ✓ |
| Transcription | ✓ | ✓ | ✓ | ✓ |
| Speaker Attributed ASR | ✓ | ✓ | — | ✓ |
| Voices | ✓ | ✓ | ✓ | ✓ |
| Text-to-Speech | ✓ | ✓ | ✓ | ✓ |
| Studio | ✓ | ✓ | — | ✓ |
| Settings | ✓ | ✓ | — | ✓ |
| Diarization | ✓ | ✓ | — | ✓ |
| Voice Cloning | ✓ | ✓ | ✓ | ✓ |
| Voice Design | ✓ | ✓ | ✓ | ✓ |
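
When scripting against the project, it can help to have the comparison table available as data. The sketch below transcribes the table above into a Python dictionary and adds two lookups. The names `INTERFACES`, `AVAILABILITY`, `features_for`, and `missing_from` are hypothetical helpers introduced here for illustration; they are not part of the product's CLI or API.

```python
# Availability matrix transcribed from the Feature Comparison table.
# All names here are illustrative assumptions, not product identifiers.
INTERFACES = ("Web UI", "Desktop", "CLI", "API")

AVAILABILITY = {
    "Voice": (True, True, False, True),
    "Chat": (True, True, True, True),
    "Transcription": (True, True, True, True),
    "Speaker Attributed ASR": (True, True, False, True),
    "Voices": (True, True, True, True),
    "Text-to-Speech": (True, True, True, True),
    "Studio": (True, True, False, True),
    "Settings": (True, True, False, True),
    "Diarization": (True, True, False, True),
    "Voice Cloning": (True, True, True, True),
    "Voice Design": (True, True, True, True),
}


def features_for(interface: str) -> list[str]:
    """Return the features marked as available on the given interface."""
    idx = INTERFACES.index(interface)
    return [name for name, flags in AVAILABILITY.items() if flags[idx]]


def missing_from(interface: str) -> list[str]:
    """Return the features the table marks as unavailable on the interface."""
    idx = INTERFACES.index(interface)
    return [name for name, flags in AVAILABILITY.items() if not flags[idx]]


if __name__ == "__main__":
    print("Not available from the CLI:", ", ".join(missing_from("CLI")))
```

Per the table, the CLI is the only interface with gaps, so `missing_from` returns an empty list for the other three.
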
Getting Started
- Start the server.
- Open the web UI.
- Download required models.
Model Requirements
Different features require different models:

| Feature | Required Models |
|---|---|
| Voice | TTS + ASR + Chat model (or unified LFM2.5-Audio-1.5B-GGUF) |
| Chat | Chat model (Qwen3, Qwen3.5, LFM2.5, or Gemma) |
| Speaker Attributed ASR | Granite-Speech-4.1-2B-Plus |
| Voices | Built-in voice model for presets; Base or VibeVoice model for cloning; VoiceDesign model for design |
| Text-to-Speech | TTS model |
| Studio | TTS model |
| Settings | No model required |
| Transcription | ASR model (Parakeet-TDT-0.6B-v3 default; Qwen3/Whisper/Granite Speech/LFM2.5 also supported) |
| Diarization | diar_streaming_sortformer_4spk-v2.1 (+ optional ASR and aligner models) |
| Forced Alignment | Qwen3-ForcedAligner-0.6B (or -4bit) |
| Voice Cloning | Qwen3 TTS Base model (Qwen3-TTS-12Hz-*-Base*) |
| Voice Design | Qwen3 TTS VoiceDesign model (Qwen3-TTS-12Hz-1.7B-VoiceDesign*) |
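
To work out what to download for a given set of features, the table above can be treated as a lookup. The following is a minimal sketch under that assumption: `REQUIRED_MODELS` copies the table's requirement notes verbatim, and `models_needed` is a hypothetical helper (not a product API) that collects the distinct notes for the features you plan to use.

```python
# Requirement notes copied from the Model Requirements table.
# None marks a feature that needs no model. Names are illustrative only.
REQUIRED_MODELS = {
    "Voice": "TTS + ASR + Chat model (or unified LFM2.5-Audio-1.5B-GGUF)",
    "Chat": "Chat model (Qwen3, Qwen3.5, LFM2.5, or Gemma)",
    "Speaker Attributed ASR": "Granite-Speech-4.1-2B-Plus",
    "Text-to-Speech": "TTS model",
    "Studio": "TTS model",
    "Settings": None,
    "Diarization": "diar_streaming_sortformer_4spk-v2.1 (+ optional ASR and aligner models)",
    "Forced Alignment": "Qwen3-ForcedAligner-0.6B (or -4bit)",
}


def models_needed(features: list[str]) -> list[str]:
    """Return the distinct requirement notes for the features, in input order."""
    notes: list[str] = []
    for feature in features:
        note = REQUIRED_MODELS[feature]  # raises KeyError for unknown features
        if note is not None and note not in notes:
            notes.append(note)
    return notes


if __name__ == "__main__":
    for note in models_needed(["Text-to-Speech", "Studio", "Settings"]):
        print(note)
```

Only a subset of the table's rows is included to keep the sketch short; the remaining rows can be added the same way.
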
Next Steps
Choose a feature to learn more:

- Voice Mode — Real-time conversations
- Voices — Manage and create reusable voices
- Text-to-Speech — Generate speech
- Studio — Build long-form TTS projects
- Settings — Configure app preferences
- Transcription — Convert audio to text
- Speaker Attributed ASR — Granite speaker-turn transcripts
- API Reference — Integrate with HTTP, SSE, and WebSocket APIs