Run performance benchmarks.

Synopsis

izwi bench <COMMAND>

Subcommands

CommandDescription
chatBenchmark chat inference
ttsBenchmark TTS inference
asrBenchmark ASR inference
throughputBenchmark system throughput
runRun a benchmark manifest
compareCompare JSON reports and fail on regressions

izwi bench chat

Benchmark chat performance, including time-to-first-token.
izwi bench chat [OPTIONS]

Options

OptionDescriptionDefault
-m, --model <MODEL>Model to benchmarkQwen3.5-4B
-i, --iterations <N>Number of requests10
-p, --prompt <TEXT>User prompt to sendDefault benchmark prompt
--system <TEXT>Optional system prompt
--max-tokens <N>Maximum completion tokens128
-c, --concurrent <N>Concurrent requests1
--warmupEnable warmup iteration

Example

izwi bench chat --model Qwen3.5-4B --iterations 8 --max-tokens 160 --warmup

izwi bench tts

Benchmark text-to-speech performance.
izwi bench tts [OPTIONS]

Options

OptionDescriptionDefault
-m, --model <MODEL>Model to benchmarkqwen3-tts-0.6b-base
-i, --iterations <N>Number of iterations10
-t, --text <TEXT>Text to synthesizeDefault test text
-c, --concurrent <N>Concurrent requests1
--warmupEnable warmup iteration

Example

izwi bench tts --model qwen3-tts-0.6b-base --iterations 20 --warmup

izwi bench asr

Benchmark speech recognition performance.
izwi bench asr [OPTIONS]

Options

OptionDescriptionDefault
-m, --model <MODEL>Model to benchmarkparakeet-tdt-0.6b-v3
-i, --iterations <N>Number of iterations10
-f, --file <PATH>Audio file to useBuilt-in test audio
-l, --language <LANG>Optional ASR language hint (e.g., en)Auto-detect
-c, --concurrent <N>Concurrent requests1
--warmupEnable warmup iteration

Example

izwi bench asr --model Whisper-Large-v3-Turbo --file test.wav --language en --iterations 20 --warmup

Whisper ASR Performance Protocol

For CPU and Metal regressions, start a warmed server for the backend under test, then run the smoke guardrail before recording a benchmark:
IZWI_BACKEND=metal \
IZWI_PRELOAD_MODELS=Whisper-Large-v3-Turbo \
IZWI_WARMUP_PRELOADED_MODELS=1 \
izwi serve --backend metal
IZWI_WHISPER_SMOKE_BACKEND=metal \
IZWI_WHISPER_SMOKE_SHORT_AUDIO=fixtures/short.wav \
IZWI_WHISPER_SMOKE_LONG_AUDIO=fixtures/long.wav \
IZWI_WHISPER_SMOKE_SILENCE_AUDIO=fixtures/silence.wav \
IZWI_WHISPER_SMOKE_NON_EN_AUDIO=fixtures/non-en.wav \
scripts/ci/whisper-smoke.sh
The smoke output must show Whisper diagnostics with device.model_dtype = "F32", device.cuda_dtype_shim = false, and device.whisper_impl = "local_whisper" for CPU and Metal. CUDA runs keep the CUDA-specific Whisper dtype policy. Use JSON output for comparable benchmark artifacts:
izwi --output-format json bench asr \
  --model Whisper-Large-v3-Turbo \
  --file fixtures/short.wav \
  --language en \
  --iterations 10 \
  --warmup > whisper-current.json
izwi bench compare whisper-current.json whisper-baseline.json --tolerance-percent 10
Set IZWI_WHISPER_PROFILE_SYNC=1 when investigating backend timing attribution. Set IZWI_WHISPER_DEVICE_GREEDY=0 only when isolating the scalar device-greedy decode path from the incremental decoder cache. For CUDA second-wave kernel benchmarks, build and compare Candle feature combinations before changing CUDA defaults:
IZWI_CUDA_FEATURES=cuda scripts/ci/check-backend-truth.sh cargo-cuda
IZWI_CUDA_FEATURES=cuda,flash-attn scripts/ci/check-backend-truth.sh cargo-cuda
IZWI_CUDA_FEATURES=cuda,cudnn scripts/ci/check-backend-truth.sh cargo-cuda
IZWI_CUDA_FEATURES=cuda,cudnn,flash-attn scripts/ci/check-backend-truth.sh cargo-cuda
Run FlashAttention experiments with both the build feature and runtime opt-in:
IZWI_BACKEND=cuda \
IZWI_USE_FLASH_ATTENTION=1 \
IZWI_WHISPER_PROFILE_SYNC=1 \
izwi --output-format json bench asr \
  --model Whisper-Large-v3-Turbo \
  --file fixtures/short.wav \
  --language en \
  --iterations 10 \
  --warmup > whisper-cuda-flash.json
The CUDA FlashAttention path is Candle-backed and opportunistic: Whisper uses it only for CUDA tensors when the build, dtype, head dimension, shape, and runtime flag allow it. The benchmark telemetry should show fused-attention attempt/success/fallback counters, and the transcript/WER must be compared against the non-FlashAttention CUDA baseline. Only enable cudnn in builds whose runtime image or host provides matching cuDNN libraries.

Qwen-ASR CUDA Performance Protocol

Qwen-ASR CUDA kernel work follows the same guardrail rule as Whisper: prove transcript quality and kernel telemetry before changing defaults. Start a server with the backend and model under test, then run the Qwen smoke script before recording benchmark artifacts:
IZWI_BACKEND=cuda \
IZWI_PRELOAD_MODELS=Qwen3-ASR-0.6B-GGUF \
IZWI_WARMUP_PRELOADED_MODELS=1 \
izwi serve --backend cuda
IZWI_QWEN_ASR_SMOKE_BACKEND=cuda \
IZWI_QWEN_ASR_SMOKE_SHORT_AUDIO=data/fox.wav \
IZWI_QWEN_ASR_SMOKE_LONG_AUDIO=data/diarization-2.mp3 \
IZWI_QWEN_ASR_SMOKE_SHORT_EXPECT="quick brown fox" \
IZWI_QWEN_ASR_SMOKE_LONG_EXPECTED_PREFIX="So human" \
IZWI_QWEN_ASR_SMOKE_REQUIRE_CHUNK_ATTENTION=1 \
scripts/ci/qwen-asr-smoke.sh
Run the same smoke script for CPU and Metal when changing shared Qwen code. Those backends are protected behavior for CUDA second-wave work. Build and compare the CUDA feature matrix before changing CUDA defaults:
IZWI_CUDA_FEATURES=cuda scripts/ci/check-backend-truth.sh cargo-cuda
IZWI_CUDA_FEATURES=cuda,flash-attn scripts/ci/check-backend-truth.sh cargo-cuda
IZWI_CUDA_FEATURES=cuda,cudnn scripts/ci/check-backend-truth.sh cargo-cuda
IZWI_CUDA_FEATURES=cuda,cudnn,flash-attn scripts/ci/check-backend-truth.sh cargo-cuda
Benchmark each candidate with JSON output. At minimum, compare the CUDA baseline, dense decode, QMatMul, and FlashAttention runtime gates:
IZWI_BACKEND=cuda \
IZWI_QWEN3_DENSE_DECODE_ATTENTION=1 \
IZWI_QWEN3_ASR_GGUF_QMATMUL_TEXT=1 \
izwi --output-format json bench asr \
  --model Qwen3-ASR-0.6B-GGUF \
  --file data/diarization.wav \
  --iterations 10 \
  --warmup > qwen-asr-cuda-current.json
IZWI_BACKEND=cuda \
IZWI_USE_FLASH_ATTENTION=1 \
IZWI_QWEN3_DENSE_DECODE_ATTENTION=1 \
IZWI_QWEN3_ASR_GGUF_QMATMUL_TEXT=1 \
izwi --output-format json bench asr \
  --model Qwen3-ASR-0.6B-GGUF \
  --file data/diarization-2.mp3 \
  --iterations 5 \
  --warmup > qwen-asr-cuda-flash-long.json
Required review fields are the normalized transcript or WER, RTF, timings_ms.prefill, timings_ms.decode, generated tokens, prompt/audio tokens, fused attention attempt/success/fallback counts, chunk attention fused/unfused/fallback counts, dense vs paged decode share, and Qwen execution diagnostics for device, dtypes, dense decode, FlashAttention, and text projection backend. FlashAttention experiments require both the flash-attn build feature and IZWI_USE_FLASH_ATTENTION=1; unsupported CUDA shapes must fall back through the existing Candle path with telemetry rather than failing the request. When validating audio chunk varlen FlashAttention specifically, also set IZWI_QWEN_ASR_SMOKE_REQUIRE_FUSED_CHUNKS=1 so the smoke script fails if the runtime chunk-attention counters do not show fused spans. Paged attention or elementwise custom CUDA kernels are a later escalation, not the default second-wave path. Only start that work when the JSON artifacts show that dense decode, Candle FlashAttention, audio varlen attention, and QMatMul are all active or inapplicable, and the remaining bottleneck is still paged decode or elementwise launch overhead. Until then, keep Qwen-ASR CUDA work routed through Candle kernels and existing exact fallbacks.

izwi bench throughput

Benchmark lightweight HTTP server overhead against /livez.
izwi bench throughput [OPTIONS]

Options

OptionDescriptionDefault
-d, --duration <SECONDS>Test duration30
-c, --concurrent <N>Concurrent requests1

Example

izwi bench throughput --duration 60 --concurrent 4

izwi bench run

Run a TOML benchmark manifest with one or more benchmark cases.
izwi bench run benchmarks/local.toml
izwi bench run benchmarks/local.toml --artifact-dir benchmarks/results/run-001

Manifest Format

server = "http://localhost:8080"

[[benchmarks]]
name = "chat-short-c1"
command = "chat"
model = "Qwen3.5-4B"
iterations = 10
concurrent = 1
warmup = true
prompt = "Summarize why batching helps transformer prefill."
max_tokens = 128

[[benchmarks]]
name = "asr-short-c2"
command = "asr"
model = "parakeet-tdt-0.6b-v3"
file = "data/test.wav"
iterations = 4
concurrent = 2
Supported command values are chat, tts, asr, and throughput. Relative ASR file paths resolve from the manifest directory. Each benchmark can include a [benchmarks.matrix] table to generate a cartesian matrix from array values. Scalar values on the benchmark are used as defaults, and matrix values override them per generated case.
[[benchmarks]]
name = "chat-short"
command = "chat"
prompt = "Summarize why batching helps transformer prefill."
iterations = 10
warmup = true

[benchmarks.matrix]
model = ["Qwen3.5-4B", "Qwen3.5-8B"]
concurrent = [1, 2, 4]
max_tokens = [64, 128]
The example expands to 12 named cases such as chat-short[model=Qwen3.5-4B,concurrent=1,max_tokens=64]. Duplicate expanded case names are rejected so JSON reports can be compared safely by case identity. When --artifact-dir is provided, the CLI writes:
  • report.json — suite report with all case summaries and samples
  • manifest.toml — copied manifest text
  • metadata.json — CLI version, git SHA when available, platform, server, and run timestamps
  • observability.json — before/after /v1/health, metrics JSON, and Prometheus telemetry snapshots

izwi bench compare

Compare a current JSON report against a baseline JSON report. The command exits with a non-zero status when any comparable metric regresses beyond the tolerance.
izwi bench compare current.json baseline.json --tolerance-percent 5
Supported inputs are single reports from --output-format json and suite reports from izwi bench run. Lower-is-better checks include p95 latency, p95 TTFT, p95 end-to-end latency, and average RTF. Higher-is-better checks include request throughput, chat completion TPS, and TTS tokens/sec when present.

Output

Benchmarks report:
  • TTFT — Time to first streamed token for chat benchmarks
  • Latency — Average, min, max, p50, p95, p99
  • Throughput — Requests per second
  • Tokens/second — For chat completion output and TTS server token generation when available
  • Real-time factor — Audio duration vs processing time for TTS/ASR when available
  • Runtime telemetry snapshot — Counter deltas for the run plus rolling runtime latency quantiles (explicitly labeled as rolling in CLI output)
  • Structured reports — Use global --output-format json to emit a machine-readable report with run config, summary metrics, per-request samples, and runtime telemetry snapshots.

Repeatable Kernel Performance Protocol

For regression tracking, use a fixed protocol so TTFT/TPS comparisons remain apples-to-apples:
  1. Start from a warmed server and unchanged model weights.
  2. Run both a short and long prompt profile with the same iterations and max tokens.
  3. Capture both izwi bench chat output and /internal/metrics/prometheus.
# Short prompt profile (TTFT-sensitive)
izwi bench chat \
  --model Qwen3.5-4B \
  --iterations 20 \
  --max-tokens 128 \
  --warmup \
  --prompt "Summarize why batching helps transformer prefill in two concise bullet points."

# Long prompt profile (decode-path-sensitive)
izwi bench chat \
  --model Qwen3.5-4B \
  --iterations 20 \
  --max-tokens 128 \
  --warmup \
  --prompt "You are optimizing an inference runtime. Explain, in detail, how prefill and decode differ for attention kernels, why page-based KV caches help memory at long context, what tradeoffs they can introduce for throughput, and how to design a benchmark protocol that separates TTFT improvements from decode TPS improvements."

# Kernel-path counter snapshot
curl -sS http://127.0.0.1:11435/internal/metrics/prometheus | rg '^izwi_kernel_'
The izwi bench chat runtime delta now includes kernel-path counts for:
  • prefill token-mode vs sequence-mode activity,
  • dense vs paged decode attention routing,
  • RoPE kernel vs manual path usage,
  • fused attention attempts/success/fallback totals.

See Also