/v1. The exceptions are /docs, /openapi.json, /livez, and /readyz.
When izwi serve is running:
http://localhost:8080/docsopens the local Scalar reference.http://localhost:8080/openapi.jsonreturns the generated OpenAPI document. It includes Scalar sidebar entries for preview first-party, operator, and realtime route families with lightweight summaries.- This page is the detailed contract guide for the broader first-party, preview, operator, and realtime route surface.
Surface Maturity
| Surface | Status | Notes |
|---|---|---|
/v1/models, /v1/models/{model} | Stable | OpenAI-compatible model catalog. |
/v1/chat/completions | Stable | OpenAI-compatible chat completions. |
/v1/audio/speech | Stable | OpenAI-compatible text-to-speech endpoint with Izwi voice extensions. |
/v1/audio/transcriptions | Stable | OpenAI-compatible transcription endpoint with local streaming support. |
/livez, /readyz, /v1/live, /v1/ready, /v1/health | Stable | Operational health and readiness endpoints. |
/v1/responses* | Preview | OpenAI-compatible Responses API shape with process-local response retention. |
| First-party workflow routes | Preview | Persisted local product APIs used by the web UI and desktop app. |
/v1/admin/* | Preview | Local model-management APIs. Bind carefully on shared hosts. |
| WebSocket realtime routes | Preview | Browser-facing low-latency protocols that may evolve. |
/internal/* aliases | Internal | Compatibility aliases for tooling. Prefer /v1/* or root probes where available. |
Common Conventions
Request IDs
Clients may sendx-request-id. If absent, the server generates one. Responses include the same header and structured logs use it as the correlation ID.
Errors
JSON errors use this envelope:| Status | Meaning |
|---|---|
400 | Invalid request shape, model id, unsupported field, or invalid option. |
404 | Resource, model, media object, or process-local record was not found. |
413 | Body exceeded an upload limit before the handler could parse it. |
415 | Endpoint expected JSON or multipart content but received another content type. |
500 | Runtime, model, storage, or server failure. |
503 | Readiness endpoint reports the server is alive but not ready. |
401, 403, or 500 before the route handler runs.
Security And CORS
Community builds do not require API keys by default. Treat the server as a local trusted process unless you deliberately expose it.izwi servedefaults to port8080.--host 0.0.0.0binds beyond loopback. Use it only on trusted networks or behind your own access controls.--corsenables wildcard browser CORS responses.- Desktop origins such as
tauri://localhostare allowed for the native app.
Pagination
Preview list APIs use cursor pagination where the response includes apagination object.
Query parameters:
| Parameter | Description |
|---|---|
limit | Page size. Values are clamped by each store, usually up to 500. |
cursor | Opaque cursor returned from the previous page. |
Limits And Runtime Controls
| Control | Default | Notes |
|---|---|---|
--max-concurrent | 100 | Maximum concurrent runtime requests. |
--timeout | 300 seconds | Request timeout for regular HTTP routes. Long ASR streaming avoids a hard wall-clock cutoff while active. |
IZWI_OPENAI_AUDIO_UPLOAD_LIMIT_MB | 25 strict, 64 relaxed | Upload limit for OpenAI audio routes. |
| First-party audio upload limit | 64 MiB | Applies to persisted transcription, diarization, TTS, voice design, voice clone, and saved voice creation routes. |
IZWI_AUDIO_STREAM_EVENT_QUEUE_CAPACITY | 32 | Buffered SSE audio events for /v1/audio/speech. |
IZWI_MAX_RESPONSE_STORE_ENTRIES | 512 | Process-local /v1/responses retention cap. |
IZWI_MAX_AGENT_SESSION_STORE_ENTRIES | 512 | Process-local agent session metadata cap. |
Streaming
HTTP streaming routes use server-sent events (SSE):- Response content type is
text/event-stream. - Each payload is sent as a
data:frame. - OpenAI-compatible chat and Responses streams end with
data: [DONE]. - Some preview first-party streams emit JSON objects with an
eventfield and close after the terminal event. - Client disconnects cancel delivery; some model work may finish internally before cleanup.
OpenAI-Compatible APIs
Models
GET /v1/models
Returns enabled model variants in OpenAI list format.
GET /v1/models/{model}
Returns one enabled model in the same object shape. Unknown or disabled variants return 404.
Use /v1/admin/models when you need download, load, unload, local path, status, or speech-capability details.
Chat Completions
POST /v1/chat/completions
Basic request:
| Field | Notes |
|---|---|
model | Required model variant. Must resolve to a chat-capable model. |
messages | Required array. Roles: system, user, assistant, tool. |
max_tokens, max_completion_tokens | Optional output budgets. |
stream | true returns SSE chat chunks. |
stream_options.include_usage | Adds usage to the terminal stream chunk. |
temperature, top_p, presence_penalty | Passed to runtime where supported. |
frequency_penalty, stop | Rejected in strict OpenAI compatibility mode when non-default. |
n | Only 1 is supported. |
tools, tool_choice | Accepted for tool-call prompting. Strict mode only allows tool_choice as auto, none, or null. |
enable_thinking | Izwi extension for thinking-capable local models. |
user | Accepted for compatibility; not used for local auth. |
- Assistant messages may include
tool_calls. - Tool responses can be sent with role
tool. - Model-emitted tool calls are returned with
finish_reason: "tool_calls"when detected.
Audio Speech
POST /v1/audio/speech
Generates audio bytes. JSON request:
| Field | Notes |
|---|---|
model | Required TTS model variant. |
input | Required text to synthesize. |
voice | Built-in voice/speaker name where the model supports presets. |
response_format | Native OSS formats: wav (default), pcm, pcm16, pcm_i16, raw_i16, raw_f32, pcm_f32. Recognized compressed names mp3, opus, ogg, aac, and flac require explicit fallback opt-in because compressed encoders are not bundled. |
allow_format_fallback | Optional boolean. When true, recognized compressed response_format values return WAV bytes with X-Actual-Response-Format: wav, X-Response-Format-Fallback, and Warning headers. When omitted or false, compressed formats return 400. |
speed | Optional model-dependent speed control. |
language | Optional language hint such as English, Chinese, or Auto. |
temperature, top_k | Optional sampling controls. |
max_tokens, max_output_tokens | Optional output token budget aliases. |
instructions | Voice-design prompt for voice-design models. |
reference_audio | Base64 audio for voice cloning. |
reference_text | Transcript of the reference audio. |
saved_voice_id | Server-side saved voice reference to reuse. |
stream, stream_format | stream: true or stream_format: "sse" enables SSE audio chunks. |
- Body is binary audio.
- Content type follows the actual generated format.
X-Requested-Response-FormatandX-Actual-Response-Formatare exposed. Explicit fallbacks also includeX-Response-Format-Fallbackand an HTTPWarningheader.
| Event | Fields |
|---|---|
audio.started | request_id, sample_rate, audio_format, optional explicit fallback note in error. |
audio.chunk | request_id, sequence, audio_base64, sample_count, is_final. |
audio.done | request_id, tokens_generated, generation_time_ms, audio_duration_secs, rtf. |
audio.failed | request_id, error. |
Audio Transcriptions
POST /v1/audio/transcriptions
Accepts JSON or multipart input.
JSON request:
| Field | Notes |
|---|---|
file or audio | Uploaded audio file. |
audio_base64 | Base64 audio alternative. |
model | Optional ASR, Granite Speech, Voxtral offline transcription, or audio-chat model variant. Voxtral realtime is planned separately. |
language | Optional language hint. |
response_format | json, verbose_json, text, srt, or vtt. Default json. |
stream | true, 1, yes, or on enables SSE. |
timestamp_granularities[] | Optional word, segment, or both. Requires response_format=verbose_json; model-provided timestamps are used before forced alignment fallback. |
aligner_model | Optional forced-aligner model for timestamp generation. Defaults to Qwen3-ForcedAligner-0.6B. |
prompt | Optional ASR prompt/context. Granite Speech uses this for prompt guidance and keyword biasing. |
max_tokens | Optional ASR decoder token budget. |
temperature | Accepted for compatibility; currently ignored by native ASR models. |
json response:
verbose_json response:
| Type | Fields |
|---|---|
transcript.text.delta | delta |
transcript.text.done | text, language, audio_duration_secs |
error | error.message |
Audio Alignment
POST /v1/audio/align
Forced alignment accepts JSON or multipart input and aligns reference text to audio at word level.
JSON request:
| Field | Notes |
|---|---|
file or audio | Uploaded audio file. |
audio_base64 | Base64 audio alternative. |
text or reference_text | Required reference text to align. |
model | Optional forced-aligner model variant. Defaults to Qwen3-ForcedAligner-0.6B. |
language | Optional language hint. |
response_format | json, verbose_json, or text. Default json. |
json response:
verbose_json adds model, language, word_count, and processing_time_ms.
Example multipart request:
Responses
POST /v1/responses
Preview OpenAI-compatible Responses API shape.
| Field | Notes |
|---|---|
model | Required chat-capable model variant. |
input | Text, one input item, or an array of input items. |
instructions | Optional system instruction. Required if input is empty. |
max_output_tokens | Optional output limit. |
stream | true returns SSE events. |
metadata, user | Stored or accepted for compatibility. |
temperature, top_p | Optional runtime controls. |
store | false skips process-local retention. Default retains completed records. |
tools, tool_choice, enable_thinking | Same behavior as chat completions. |
- They are lost on server restart.
- They can be evicted after
IZWI_MAX_RESPONSE_STORE_ENTRIES. - Streaming records are stored only after a terminal completion or failure.
canceldoes not provide durable active-response cancellation semantics.
| Method | Path | Notes |
|---|---|---|
GET | /v1/responses/{response_id} | Fetch retained process-local response. |
DELETE | /v1/responses/{response_id} | Delete retained process-local response. |
POST | /v1/responses/{response_id}/cancel | Mark retained process-local response canceled. |
GET | /v1/responses/{response_id}/input_items | Return normalized retained input items. |
First-Party Workflow APIs
These routes are preview APIs used by the web UI and desktop app. They are local, SQLite-backed stores unless otherwise noted.Route Rename Migration
The following preview route names were replaced by canonical names. The old runtime routes were removed.| Removed route family | Current route family |
|---|---|
/v1/text-to-speech-generations | /v1/text-to-speech |
/v1/voice-design-generations | /v1/voice-designs |
/v1/voice-clone-generations | /v1/voice-clones |
/v1/transcriptions | /v1/speech-to-text/jobs?job_kind=transcription |
/v1/transcriptions/{record_id} | /v1/speech-to-text/jobs/{record_id}?job_kind=transcription |
/v1/transcriptions/{record_id}/audio | /v1/speech-to-text/jobs/{record_id}/audio?job_kind=transcription |
/v1/transcriptions/{record_id}/summary/regenerate | /v1/speech-to-text/jobs/{record_id}/summary/regenerate?job_kind=transcription |
/v1/transcriptions/jobs | /v1/speech-to-text/jobs |
/v1/transcription/realtime/ws | /v1/speech-to-text/realtime/ws |
/v1/audio/diarize | /v1/speech-to-text/jobs?job_kind=diarization |
/v1/audio/diarizations | /v1/speech-to-text/jobs?job_kind=diarization |
job_kind=transcription on the
persisted speech-to-text job flow. The removed direct audio diarization routes
use the same job flow with job_kind=diarization: create a job, poll the
returned record until processing_status is ready, and then read the
diarization fields from that job record.
Direct /v1/diarizations* routes remain supported first-party APIs. Use
/v1/speech-to-text/jobs?job_kind=diarization when an app wants a unified
speech-text list across transcription and diarization. Use /v1/diarizations*
when an app wants diarization-specific resource names and does not need to mix
transcription records into the same collection.
Speech-Text Jobs
Canonical saved transcription, speaker-attributed ASR, and diarization job routes:| Method | Path | Notes | ||||
|---|---|---|---|---|---|---|
GET | /v1/speech-to-text/jobs | List jobs. Supports limit, cursor, and `job_kind=transcription | speaker_attributed_asr | saa | diarization | all`. |
POST | /v1/speech-to-text/jobs | Create transcription, speaker-attributed ASR, or diarization job. Multipart uploads allowed. | ||||
GET | /v1/speech-to-text/jobs/{record_id} | Fetch one job. job_kind can disambiguate. | ||||
PATCH, PUT | /v1/speech-to-text/jobs/{record_id} | Update editable metadata such as title, transcript fields, speaker labels, or summary state depending on job kind. | ||||
DELETE | /v1/speech-to-text/jobs/{record_id} | Delete job and associated stored media. | ||||
GET | /v1/speech-to-text/jobs/{record_id}/audio | Fetch stored source audio. | ||||
POST | /v1/speech-to-text/jobs/{record_id}/reruns | Re-run diarization from stored source audio. | ||||
POST | /v1/speech-to-text/jobs/{record_id}/cancel | Cancel an in-flight diarization job. | ||||
POST | /v1/speech-to-text/jobs/{record_id}/summary/regenerate | Regenerate transcription, speaker-attributed ASR, or diarization summary. |
job_kind query parameter is important for shared IDs and for clients that
want a specific record family. speaker_attributed_asr and saa select the
Granite Speech speaker-turn transcript mode.
For transcription job creation, JSON and multipart requests accept
generate_summary. It defaults to false; set it to true to generate an AI
summary automatically after the transcript finishes. Records created without an
automatic summary can still use
POST /v1/speech-to-text/jobs/{record_id}/summary/regenerate?job_kind=transcription
later.
For speaker-attributed ASR, use:
Granite-Speech-4.1-2B-Plus. JSON and multipart requests accept
model_id/model, language, generate_summary, min_speakers, and
max_speakers. SAA does not support streaming, timestamp alignment, or
include_timestamps; the server clears aligner/timestamp fields for this mode.
Diarization Records
Persisted diarization routes:| Method | Path | Notes |
|---|---|---|
GET, POST | /v1/diarizations | List or create saved diarization records. |
GET, PATCH, PUT, DELETE | /v1/diarizations/{record_id} | Fetch, update, or delete a saved diarization record. |
GET | /v1/diarizations/{record_id}/audio | Fetch source audio. |
POST | /v1/diarizations/{record_id}/reruns | Re-run diarization. |
POST | /v1/diarizations/{record_id}/cancel | Cancel in-flight diarization. |
POST | /v1/diarizations/{record_id}/summary/regenerate | Regenerate the LLM summary. |
Speech History
All three speech history families share list/create, member, audio, pagination, and deletion behavior. Create routes can generate audio and persist the resulting record.| Route family | Purpose |
|---|---|
/v1/text-to-speech | Plain TTS history. |
/v1/voice-designs | Voice-design prompt records. |
/v1/voice-clones | Reference-audio voice clone records. |
| Method | Path pattern |
|---|---|
GET, POST | /v1/{family} |
GET, DELETE | /v1/{family}/{record_id} |
GET | /v1/{family}/{record_id}/audio |
event field:
| Event | Notes |
|---|---|
created | Includes the persisted record shell. |
start | Includes request_id, sample_rate, and audio_format. |
chunk | Includes request_id, sequence, audio_base64, and sample_count. |
final | Includes generation stats and the completed record. |
error | Includes an error string. |
done | Terminal stream marker. |
Saved Voices
Reusable voice clone references:| Method | Path | Notes |
|---|---|---|
GET | /v1/voices | List saved voices with cursor pagination. |
POST | /v1/voices | Create a saved voice from reference audio/text or a generated voice source. |
GET | /v1/voices/{voice_id} | Fetch saved voice metadata. |
DELETE | /v1/voices/{voice_id} | Delete saved voice and audio. |
GET | /v1/voices/{voice_id}/audio | Fetch saved reference audio. |
saved_voice_id on /v1/audio/speech or first-party generation routes to reuse a saved voice without resending reference audio.
Studio
Studio is the long-form TTS project API. Project and folder routes:| Method | Path |
|---|---|
GET, POST | /v1/studio/folders |
GET, POST | /v1/studio/projects |
GET, PATCH, DELETE | /v1/studio/projects/{project_id} |
GET | /v1/studio/projects/{project_id}/audio |
GET, PATCH | /v1/studio/projects/{project_id}/meta |
| Parameter | Notes | ||
|---|---|---|---|
download=true | Prefer attachment-style download headers. | ||
| `format=wav | raw_i16 | raw_f32` | Requested export format. |
segment_ids=a,b,c | Export selected segments in order. |
| Method | Path |
|---|---|
GET, POST | /v1/studio/projects/{project_id}/pronunciations |
DELETE | /v1/studio/projects/{project_id}/pronunciations/{pronunciation_id} |
GET, POST | /v1/studio/projects/{project_id}/snapshots |
POST | /v1/studio/projects/{project_id}/snapshots/{snapshot_id}/restore |
| Method | Path |
|---|---|
GET, POST | /v1/studio/projects/{project_id}/render-jobs |
PATCH | /v1/studio/projects/{project_id}/render-jobs/{job_id} |
| Method | Path |
|---|---|
POST | /v1/studio/projects/{project_id}/segments |
GET, PATCH, DELETE | /v1/studio/projects/{project_id}/segments/{segment_id} |
POST | /v1/studio/projects/{project_id}/segments/{segment_id}/split |
POST | /v1/studio/projects/{project_id}/segments/{segment_id}/merge-next |
PATCH | /v1/studio/projects/{project_id}/segments/reorder |
POST | /v1/studio/projects/{project_id}/segments/bulk-delete |
POST | /v1/studio/projects/{project_id}/segments/{segment_id}/render |
Chat Threads
Durable local chat history:| Method | Path | Notes |
|---|---|---|
GET, POST | /v1/chat/threads | List or create threads. |
GET, PATCH, DELETE | /v1/chat/threads/{thread_id} | Fetch, rename, or delete a thread. |
GET, POST | /v1/chat/threads/{thread_id}/messages | List messages or send a new user message. |
| Field | Notes |
|---|---|
model | Optional chat model. |
content | User text. |
content_parts | Multimodal content parts in OpenAI-like shape. |
max_tokens | Optional output limit. |
stream | true emits SSE events. |
system_prompt | Optional per-request system prompt. |
enable_thinking | Izwi extension for thinking-capable models. |
| Event | Notes |
|---|---|
start | Includes thread_id, model_id, and persisted user message. |
delta | Text delta. |
done | Includes persisted assistant message and generation stats. |
error | Error string. |
Agent Sessions
Agent session metadata is process-local preview state. The linked chat thread is durable.| Method | Path | Notes |
|---|---|---|
POST | /v1/agent/sessions | Create an agent session and linked thread. |
GET | /v1/agent/sessions/{session_id} | Fetch retained process-local session metadata. |
POST | /v1/agent/sessions/{session_id}/turns | Run one agent turn. |
agent_id, model_id, system_prompt, planning_mode (off, auto, on), and title.
Turn responses include assistant text, optional plan steps, tool calls, and ordered events such as turn_started, plan_created, tool_call_started, tool_call_completed, assistant_message, and turn_completed.
Voice Profile, Memory, And Sessions
Voice-mode persisted state:| Method | Path | Notes |
|---|---|---|
GET, PATCH | /v1/voice/profile | Fetch or update name, system prompt, and observational memory setting. |
GET, DELETE | /v1/voice/observations | List or clear remembered observations. limit controls list size. |
DELETE | /v1/voice/observations/{observation_id} | Forget one observation. |
GET | /v1/voice/sessions | List voice sessions. |
POST | /v1/voice/sessions | Create a persisted session shell. Defaults to the default profile, modular mode, and the profile system prompt. |
GET | /v1/voice/sessions/{session_id} | Fetch a session with turns. |
PATCH | /v1/voice/sessions/{session_id} | Update system_prompt and/or set ended: true. |
DELETE | /v1/voice/sessions/{session_id} | Delete a session and its stored turns. |
GET | /v1/voice/sessions/{session_id}/turns | List only the stored turns for a session. |
POST | /v1/voice/sessions/{session_id}/end | Mark a session ended. |
GET | /v1/voice/sessions/{session_id}/export?format=json|text | Export session metadata, turn metadata, and a transcript view. |
Media
Media lifecycle routes:| Method | Path | Notes |
|---|---|---|
GET | /v1/media?limit=100 | List media objects when the server is using local media storage; provider-backed storage can return 501 if listing is unavailable. |
POST | /v1/media | Upload a base64 media object. |
GET | /v1/media/{path} | Download a persisted media object. |
DELETE | /v1/media/{path} | Delete a persisted media object. |
{path} can contain nested segments such as
images/example.png or chat/thread-1/attachment.wav.
Upload request:
audio_base64 is accepted as an alias for data_base64, and data URLs such as
data:audio/wav;base64,... are accepted. Upload responses include path,
url, content_type, filename, and size_bytes.
Rules:
- The path is relative to the media root.
- Nested paths are allowed.
- Absolute paths and
..traversal are rejected. - Unknown media returns
404. - Treat media URLs as local API resources, not stable public object-store URLs.
Onboarding And Preferences
Small first-party UI state APIs:| Method | Path | Response |
|---|---|---|
GET | /v1/onboarding | { completed, completed_at, analytics_opt_in } |
POST | /v1/onboarding/complete | Marks onboarding complete and returns state. |
GET | /v1/preferences | { analytics_opt_in } |
PUT | /v1/preferences/analytics | Body { "opt_in": true }; returns preferences. |
Operator APIs
Health And Readiness
| Method | Path | Notes |
|---|---|---|
GET | /livez | Cheap liveness probe. |
GET | /readyz | Deployment readiness probe. Returns 503 when alive but not ready. |
GET | /v1/live | /v1 alias for liveness. |
GET | /v1/ready | /v1 alias for readiness. |
GET | /v1/health | Rich runtime/backend status used by izwi status. |
GET | /internal/live, /internal/ready, /internal/health | Internal compatibility aliases. |
/v1/health includes requested and selected backend, compiled backend support, detected device capabilities, dtype policy, CUDA runtime diagnostics, and fused-attention status.
Metrics
| Method | Path | Notes |
|---|---|---|
GET | /v1/metrics | JSON runtime telemetry snapshot. |
GET | /v1/metrics/prometheus | Prometheus text format. |
GET | /internal/metrics | Internal alias. |
GET | /internal/metrics/prometheus | Internal alias. |
Admin Model Management
Preview local admin routes. Use these routes as the OSS model lifecycle and discovery surface for voice apps: each model record includes local status, broad modalities, speech-generation capabilities when present, and route-level capability booleans.| Method | Path | Notes |
|---|---|---|
GET | /v1/admin/models | List known enabled variants, local status, modalities, and route capabilities. |
GET | /v1/admin/models/{variant} | Fetch one model status and capability contract. |
POST | /v1/admin/models/{variant}/download | Start background download. |
GET | /v1/admin/models/{variant}/download/progress | SSE download progress. |
POST | /v1/admin/models/{variant}/download/cancel | Cancel active download. |
POST | /v1/admin/models/{variant}/load | Load model into runtime memory. |
POST | /v1/admin/models/{variant}/unload | Unload model from runtime memory. |
DELETE | /v1/admin/models/{variant} | Unload and delete local model files. |
status can be downloading, completed, error, or cancelled.
Realtime WebSocket APIs
Realtime routes are preview browser protocols. They use JSON text messages for control events and binary PCM16 frames for audio.Transcription Realtime
GET /v1/speech-to-text/realtime/ws
Server starts with:
| Type | Fields |
|---|---|
session_start | Optional model_id, language. |
session_stop | No fields. |
ping | Optional timestamp_ms. |
| Bytes | Value |
|---|---|
0..4 | ASCII magic ITRW. |
4 | Version 1. |
5 | Kind 1 for client PCM16. |
6..8 | Reserved. |
8..12 | Little-endian sample_rate (u32). |
12..16 | Little-endian frame_seq (u32). |
16.. | Mono PCM16 little-endian audio bytes. |
| Type | Notes |
|---|---|
session_started | Session accepted. |
transcript_partial | Includes sequence, text, optional language, and audio duration. |
session_done | Session stopped. |
pong | Ping response. |
error | Error message. |
- Binary frames larger than 512 KiB are rejected.
- Sample rate must remain stable during a session.
- Sample rates outside the accepted runtime range return an error.
Voice Realtime
GET /v1/voice/realtime/ws
Server starts with:
| Type | Fields |
|---|---|
session_start | Optional system_prompt. |
input_stream_start | Optional mode, model ids, speaker, ASR language, max tokens, VAD settings, and input sample rate. |
input_stream_stop | Stops listening and closes the current session. |
interrupt | Optional reason; interrupts active assistant turn. |
ping | Optional timestamp_ms. |
mode values:
| Mode | Notes |
|---|---|
modular | ASR -> chat/agent -> TTS. Requires ASR, text chat, and TTS models. |
unified | Uses a supported audio-chat model for speech-to-speech style turns. |
input_stream_start fields:
| Field | Notes |
|---|---|
asr_model_id, text_model_id, tts_model_id | Modular model overrides. |
s2s_model_id | Unified audio-chat model override. |
speaker | TTS speaker/voice. |
asr_language | Language hint. |
max_output_tokens | Text output budget. |
vad_threshold | Earshot speech score threshold, default 0.5. Values are clamped to the valid score range. |
min_speech_ms | Minimum speech duration. |
silence_duration_ms | Silence before utterance end. |
max_utterance_ms | Hard utterance duration cap. |
pre_roll_ms | Audio retained before speech start. |
input_sample_rate | Expected input sample rate. |
| Bytes | Value |
|---|---|
0..4 | ASCII magic IVWS. |
4 | Version 1. |
5 | Kind 1 for client PCM16. |
6..8 | Reserved. |
8..12 | Little-endian sample_rate (u32). |
12..16 | Little-endian frame_seq (u32). |
16.. | Mono PCM16 little-endian audio bytes. |
| Bytes | Value |
|---|---|
0..4 | ASCII magic IVWS. |
4 | Version 1. |
5 | Kind 2 for assistant PCM16. |
6..8 | Flags; bit 0 marks final chunk. |
8..16 | Little-endian utterance_seq (u64). |
16..20 | Little-endian chunk sequence (u32). |
20..24 | Little-endian sample rate (u32). |
24.. | Mono PCM16 little-endian audio bytes. |
| Type | Notes |
|---|---|
connected | Socket accepted and protocol version announced. |
session_ready | Voice session initialized. |
input_stream_ready | Includes resolved VAD settings, including backend, score sample rate, and score frame duration. |
input_stream_stopped | Input stream stopped. |
user_speech_start, user_speech_end | VAD utterance boundaries. |
user_speech_rejected | A too-short speech start was rejected as noise and the input stream remains ready. |
turn_processing | Turn started. |
user_transcript_start, user_transcript_delta, user_transcript_final | User transcript events. |
assistant_text_start, assistant_text_delta, assistant_text_final | Assistant text events. |
assistant_audio_start, assistant_audio_done | Assistant audio envelope around binary chunks. |
turn_done | Terminal turn status: ok, error, timeout, interrupted, or no_input. |
pong | Ping response. |
error | Error with optional utterance identifiers. |