Synopsis
Subcommands
| Command | Description |
|---|---|
| `chat` | Benchmark chat inference |
| `tts` | Benchmark TTS inference |
| `asr` | Benchmark ASR inference |
| `throughput` | Benchmark system throughput |
| `run` | Run a benchmark manifest |
| `compare` | Compare JSON reports and fail on regressions |
izwi bench chat
Benchmark chat performance, including time-to-first-token.
Options
| Option | Description | Default |
|---|---|---|
| `-m, --model <MODEL>` | Model to benchmark | Qwen3.5-4B |
| `-i, --iterations <N>` | Number of requests | 10 |
| `-p, --prompt <TEXT>` | User prompt to send | Default benchmark prompt |
| `--system <TEXT>` | Optional system prompt | - |
| `--max-tokens <N>` | Maximum completion tokens | 128 |
| `-c, --concurrent <N>` | Concurrent requests | 1 |
| `--warmup` | Enable warmup iteration | - |
Example
izwi bench tts
Benchmark text-to-speech performance.
Options
| Option | Description | Default |
|---|---|---|
| `-m, --model <MODEL>` | Model to benchmark | qwen3-tts-0.6b-base |
| `-i, --iterations <N>` | Number of iterations | 10 |
| `-t, --text <TEXT>` | Text to synthesize | Default test text |
| `-c, --concurrent <N>` | Concurrent requests | 1 |
| `--warmup` | Enable warmup iteration | - |
Example
izwi bench asr
Benchmark speech recognition performance.
Options
| Option | Description | Default |
|---|---|---|
| `-m, --model <MODEL>` | Model to benchmark | parakeet-tdt-0.6b-v3 |
| `-i, --iterations <N>` | Number of iterations | 10 |
| `-f, --file <PATH>` | Audio file to use | Built-in test audio |
| `-l, --language <LANG>` | Optional ASR language hint (e.g., en) | Auto-detect |
| `-c, --concurrent <N>` | Concurrent requests | 1 |
| `--warmup` | Enable warmup iteration | - |
Example
Whisper ASR Performance Protocol
For CPU and Metal regressions, start a warmed server for the backend under test, then run the smoke guardrail before recording a benchmark.
The expected configuration for CPU and Metal is `device.model_dtype = "F32"`, `device.cuda_dtype_shim = false`, and `device.whisper_impl = "local_whisper"`. CUDA runs keep the CUDA-specific Whisper dtype policy.
Use JSON output for comparable benchmark artifacts.
Set `IZWI_WHISPER_PROFILE_SYNC=1` when investigating backend timing attribution. Set `IZWI_WHISPER_DEVICE_GREEDY=0` only when isolating the scalar device-greedy decode path from the incremental decoder cache.
For CUDA second-wave kernel benchmarks, build and compare Candle feature combinations before changing CUDA defaults.
Enable `cudnn` only in builds whose runtime image or host provides matching cuDNN libraries.
Qwen-ASR CUDA Performance Protocol
Qwen-ASR CUDA kernel work follows the same guardrail rule as Whisper: prove transcript quality and kernel telemetry before changing defaults. Start a server with the backend and model under test, then run the Qwen smoke script before recording benchmark artifacts.
The metrics to track are RTF, `timings_ms.prefill`, `timings_ms.decode`, generated tokens, prompt/audio tokens, fused attention attempt/success/fallback counts, chunk attention fused/unfused/fallback counts, dense vs paged decode share, and Qwen execution diagnostics for device, dtypes, dense decode, FlashAttention, and text projection backend.
FlashAttention experiments require both the `flash-attn` build feature and `IZWI_USE_FLASH_ATTENTION=1`; unsupported CUDA shapes must fall back through the existing Candle path with telemetry rather than failing the request.
When validating audio chunk varlen FlashAttention specifically, also set
IZWI_QWEN_ASR_SMOKE_REQUIRE_FUSED_CHUNKS=1 so the smoke script fails if the
runtime chunk-attention counters do not show fused spans.
Paged attention or elementwise custom CUDA kernels are a later escalation, not
the default second-wave path. Only start that work when the JSON artifacts show
that dense decode, Candle FlashAttention, audio varlen attention, and QMatMul are
all active or inapplicable, and the remaining bottleneck is still paged decode
or elementwise launch overhead. Until then, keep Qwen-ASR CUDA work routed
through Candle kernels and existing exact fallbacks.
izwi bench throughput
Benchmark lightweight HTTP server overhead against `/livez`.
Options
| Option | Description | Default |
|---|---|---|
-d, --duration <SECONDS> | Test duration | 30 |
-c, --concurrent <N> | Concurrent requests | 1 |
Example
izwi bench run
Run a TOML benchmark manifest with one or more benchmark cases.
Manifest Format
Valid `command` values are `chat`, `tts`, `asr`, and `throughput`. Relative ASR file paths resolve from the manifest directory.
Each benchmark can include a `[benchmarks.matrix]` table to generate a cartesian matrix from array values. Scalar values on the benchmark are used as defaults, and matrix values override them per generated case.
Expanded case names take the form `chat-short[model=Qwen3.5-4B,concurrent=1,max_tokens=64]`. Duplicate expanded case names are rejected so JSON reports can be compared safely by case identity.
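The matrix expansion rules can be illustrated with a short Python sketch. This is not the CLI's implementation: `expand_matrix` and the dictionary layout are hypothetical, and the naming scheme is inferred from the single example case name in this section.

```python
from itertools import product


def expand_matrix(name, defaults, matrix):
    """Expand one benchmark into named cases (hypothetical helper)."""
    if not matrix:
        return {name: dict(defaults)}
    keys = list(matrix)
    cases = {}
    # Cartesian product over every array value in the matrix table.
    for combo in product(*(matrix[key] for key in keys)):
        overrides = dict(zip(keys, combo))
        label = ",".join(f"{key}={value}" for key, value in overrides.items())
        case_name = f"{name}[{label}]"
        # Duplicate names would make case-by-case comparison ambiguous.
        if case_name in cases:
            raise ValueError(f"duplicate expanded case name: {case_name}")
        # Scalar defaults come first; matrix values override them.
        cases[case_name] = {**defaults, **overrides}
    return cases


cases = expand_matrix(
    "chat-short",
    defaults={"command": "chat", "iterations": 10, "max_tokens": 128},
    matrix={"model": ["Qwen3.5-4B"], "concurrent": [1, 4], "max_tokens": [64]},
)
for case_name in cases:
    print(case_name)
```

With these inputs the sketch yields two cases, one per `concurrent` value, and each keeps `iterations = 10` from the scalar defaults while `max_tokens` is overridden to 64.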
When --artifact-dir is provided, the CLI writes:
- `report.json`: suite report with all case summaries and samples
- `manifest.toml`: copied manifest text
- `metadata.json`: CLI version, git SHA when available, platform, server, and run timestamps
- `observability.json`: before/after `/v1/health`, metrics JSON, and Prometheus telemetry snapshots
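A CI job may want to confirm that a run produced the full artifact set before archiving it. The following is a minimal sketch under that assumption; `missing_artifacts` and the `bench-artifacts` directory name are hypothetical, and only the four file names come from the list above.

```python
from pathlib import Path

# File names taken from the artifact list; nothing about their contents is assumed.
EXPECTED_ARTIFACTS = (
    "report.json",
    "manifest.toml",
    "metadata.json",
    "observability.json",
)


def missing_artifacts(artifact_dir):
    """Return the expected artifact names that are absent from artifact_dir."""
    root = Path(artifact_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]


# "bench-artifacts" is a placeholder path used for illustration only.
print(missing_artifacts("bench-artifacts"))
```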
izwi bench compare
Compare a current JSON report against a baseline JSON report. The command exits with a non-zero status when any comparable metric regresses beyond the tolerance.
It accepts reports produced with `--output-format json` and suite reports from `izwi bench run`. Lower-is-better checks include p95 latency, p95 TTFT, p95 end-to-end latency, and average RTF. Higher-is-better checks include request throughput, chat completion TPS, and TTS tokens/sec when present.
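The direction-aware check can be sketched in Python as follows. This is an illustration only: the metric key names, the `find_regressions` helper, and the interpretation of the tolerance as a relative fraction are all assumptions, not the CLI's actual report schema or comparison rule.

```python
# Hypothetical metric keys; the real report schema may differ.
LOWER_IS_BETTER = {"p95_latency_ms", "p95_ttft_ms", "avg_rtf"}
HIGHER_IS_BETTER = {"requests_per_sec", "completion_tps"}


def find_regressions(baseline, current, tolerance=0.05):
    """Return (metric, baseline, current, relative_change) for each regression."""
    regressions = []
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is None or not base:
            continue  # metric missing or baseline is zero: not comparable
        if metric in LOWER_IS_BETTER:
            change = (cur - base) / base  # growth is a regression
        elif metric in HIGHER_IS_BETTER:
            change = (base - cur) / base  # shrinkage is a regression
        else:
            continue
        if change > tolerance:
            regressions.append((metric, base, cur, change))
    return regressions


baseline = {"p95_latency_ms": 200.0, "requests_per_sec": 50.0}
current = {"p95_latency_ms": 230.0, "requests_per_sec": 49.0}
regressions = find_regressions(baseline, current)
exit_code = 1 if regressions else 0
print(regressions, exit_code)
```

Here p95 latency grows by 15%, which exceeds the assumed 5% tolerance, while throughput drops by only 2%, so only the latency metric is flagged and the sketch would signal failure.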
Output
Benchmarks report:
- TTFT: Time to first streamed token for chat benchmarks
- Latency: Average, min, max, p50, p95, p99
- Throughput: Requests per second
- Tokens/second: For chat completion output and TTS server token generation when available
- Real-time factor: Audio duration vs processing time for TTS/ASR when available
- Runtime telemetry snapshot: Counter deltas for the run plus rolling runtime latency quantiles (explicitly labeled as rolling in CLI output)
- Structured reports: Use global `--output-format json` to emit a machine-readable report with run config, summary metrics, per-request samples, and runtime telemetry snapshots.
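To make the summary metrics concrete, here is a small Python sketch that derives them from per-request samples. It rests on assumptions: the nearest-rank percentile method, the `summarize` and `real_time_factor` helpers, and RTF defined as processing time divided by audio duration (consistent with RTF being a lower-is-better metric) are illustrative choices, not confirmed details of the CLI.

```python
def percentile(ordered, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer 1..100."""
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def summarize(latencies_ms, wall_time_s):
    """Summary metrics for one benchmark case (hypothetical field names)."""
    ordered = sorted(latencies_ms)
    return {
        "avg": sum(ordered) / len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "requests_per_sec": len(ordered) / wall_time_s,
    }


def real_time_factor(processing_s, audio_s):
    """Assumed definition: values below 1.0 mean faster than real time."""
    return processing_s / audio_s


samples_ms = [120, 95, 110, 300, 105, 98, 102, 115, 99, 101]
summary = summarize(samples_ms, wall_time_s=2.0)
print(summary)
print(real_time_factor(processing_s=0.5, audio_s=2.0))
```

With only ten samples, a single slow request (300 ms) determines both p95 and p99, which is one reason to raise `--iterations` when tail latency matters.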
Repeatable Kernel Performance Protocol
For regression tracking, use a fixed protocol so TTFT/TPS comparisons remain apples-to-apples:
- Start from a warmed server and unchanged model weights.
- Run both a short and long prompt profile with the same iterations and max tokens.
- Capture both `izwi bench chat` output and `/internal/metrics/prometheus`.
Recommended command set
The `izwi bench chat` runtime delta now includes kernel-path counts for:
- prefill token-mode vs sequence-mode activity,
- dense vs paged decode attention routing,
- RoPE kernel vs manual path usage,
- fused attention attempts/success/fallback totals.
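A counter delta of this kind can be reproduced by hand from two Prometheus text snapshots taken before and after a run. The sketch below assumes snapshot lines of the form `series value` with no trailing timestamps; the metric names in the sample data are invented for illustration and are not the server's real metric names.

```python
def parse_counters(text):
    """Parse Prometheus text exposition lines into {series: value}."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split on the last run of whitespace so labels stay with the name.
        series, value = line.rsplit(None, 1)
        counters[series] = float(value)
    return counters


def counter_deltas(before_text, after_text):
    """Per-series increase between two snapshots; new series start from 0."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {series: after[series] - before.get(series, 0.0) for series in after}


# Hypothetical metric names, used only to demonstrate the calculation.
before_snapshot = """
# TYPE fused_attention_attempts_total counter
fused_attention_attempts_total 100
fused_attention_fallback_total 4
decode_attention_total{path="dense"} 80
"""
after_snapshot = """
# TYPE fused_attention_attempts_total counter
fused_attention_attempts_total 160
fused_attention_fallback_total 5
decode_attention_total{path="dense"} 130
"""
deltas = counter_deltas(before_snapshot, after_snapshot)
for series, delta in deltas.items():
    print(series, delta)
```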
See Also
`izwi status`: System status