Benchmark a Whisper ASR server
The voice.transcription plugin produces signed WER envelopes for any
OpenAI-compatible audio-transcription endpoint.
That covers the SaaS providers (OpenAI, Cohere) and the self-hosted servers
that follow the same wire format — primarily
faster-whisper-server and
forks. The recipe below uses faster-whisper-server running on a single GPU
and the bundled voice.transcription.librispeech-clean-mini benchmark (5 real
LibriSpeech test-clean utterances, ~18s of audio).
Why a real ASR benchmark matters
A WER number means nothing without the audio it was measured against. The envelope binds:
- the model id (Systran/faster-whisper-large-v3 etc.)
- the engine kind and version
- the audio fixture set (hashed in dataset.path + per-file sha256_16)
- the hardware fingerprint
- a Sigstore or dev-key signature over the canonical hash of all of the above
So a downstream consumer can compare WER between two engines on the same audio under the same conditions, instead of comparing your 2.1% on LibriSpeech-test-clean to someone else's 2.8% on a private dataset.
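To make the binding concrete, here is a minimal sketch of hashing an envelope body through a deterministic JSON encoding. It is not the inferencebench implementation: the field names, the `canonical_hash` helper, and the reading of `sha256_16` as "first 16 hex characters of SHA-256" are all assumptions made for illustration.

```python
import hashlib
import json


def sha256_16(data: bytes) -> str:
    """First 16 hex characters of the SHA-256 digest (assumed meaning of sha256_16)."""
    return hashlib.sha256(data).hexdigest()[:16]


def canonical_hash(payload: dict) -> str:
    """Hash a payload via a deterministic JSON encoding (sorted keys, no extra whitespace)."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Hypothetical envelope body; the real schema and field names may differ.
envelope_body = {
    "model": "Systran/faster-whisper-large-v3",
    "engine": {"kind": "whisper-http", "version": "unknown"},
    "dataset": {
        "path": "librispeech-clean-mini",
        "files": {"utt-0.wav": sha256_16(b"fake audio bytes 0")},
    },
    "hardware": {"gpu": "H100"},
}

digest = canonical_hash(envelope_body)

# Changing any bound field, for example one audio file, changes the digest,
# so a signature over the digest no longer matches.
tampered = json.loads(json.dumps(envelope_body))
tampered["dataset"]["files"]["utt-0.wav"] = sha256_16(b"different audio")
assert canonical_hash(tampered) != digest
```

Because the signature covers one digest over every field, a WER figure cannot be detached from the audio, model, or hardware it was measured with.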
Setup
- Start the server on a GPU box (one H100 / L40 / RTX 4090 is fine for
whisper-large-v3, ~3GB VRAM at int8; the command below sets float16, which needs more). The server exposes
POST /v1/audio/transcriptions on port 8000:
# Easiest: the upstream Docker image.
docker run --rm --gpus all -p 8000:8000 \
-e WHISPER__MODEL=Systran/faster-whisper-large-v3 \
-e WHISPER__COMPUTE_TYPE=float16 \
fedirz/faster-whisper-server:latest-cuda
Wait for INFO: Application startup complete. in the logs (~10-30s while
the model downloads on first run).
- Install bench + the voice plugin on the box that will call the server (can be the same machine):
git clone https://github.com/yobitelcomm/bench
cd bench
uv sync --all-packages --dev --prerelease=allow
- Mint a one-shot dev key (no Sigstore network for the demo):
uv run python -c "from inferencebench.envelope import generate_dev_keypair; generate_dev_keypair('cosign.key')"
Run the benchmark
uv run bench run voice.transcription.librispeech-clean-mini \
--model Systran/faster-whisper-large-v3 \
--engine whisper-http \
--base-url http://localhost:8000/v1 \
--signing-mode dev --dev-key cosign.key \
--output ./envelopes
Expected runtime: ~10s on H100, ~30s on an L40. The plugin sends each fixture WAV in order, waits for the transcription, scores it against the reference, then emits one signed envelope summarizing the run.
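For orientation, the request made for each fixture can be sketched with the standard library alone. This is an illustration of the OpenAI-style wire format (a multipart form with `model` and `file` fields, JSON response with a `text` key), not the plugin's actual code; `build_multipart` and `transcribe` are hypothetical helper names.

```python
import json
import urllib.request
import uuid


def build_multipart(fields: dict[str, str], filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode form fields plus one WAV file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'
        ).encode()
        + audio
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(base_url: str, model: str, wav_path: str, timeout: float = 60.0) -> str:
    """POST one WAV file to {base_url}/audio/transcriptions and return the transcript text."""
    with open(wav_path, "rb") as fh:
        audio = fh.read()
    body, content_type = build_multipart({"model": model}, wav_path.rsplit("/", 1)[-1], audio)
    request = urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["text"]
```

With the server from the setup step running, `transcribe("http://localhost:8000/v1", "Systran/faster-whisper-large-v3", "sample.wav")` would return the transcript for a local `sample.wav` (a placeholder file name). It is a quick way to confirm the endpoint responds before starting a full run.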
Inspect the result
You should see something like:
suite: voice.transcription.librispeech-clean-mini v1.0.0
model: Systran/faster-whisper-large-v3
engine: whisper-http
n_samples: 5
ok_rate: 1.00
wer_mean: 0.02 # 2 % — whisper-large-v3 is very strong on LS-clean
wer_p50: 0.00
wer_p95: 0.07
total_p50_ms: 420
Whisper-large-v3 on LibriSpeech test-clean is published at ~2 % WER; a single H100 box should reproduce that band on this 5-utterance slice (within ±2 percentage points given the small N). Higher WER is usually caused by one of:
- Endpoint default language is wrong: pass --engine-arg language=en (see your server's docs).
- Server still warming up: the first call after startup can include model compile time; re-run.
- Reference has a quirk the normalizer doesn't strip: the bundled scorer lowercases + strips ASCII punctuation; if the engine writes an abbreviation or numeral differently from the reference ("M.A." → "MA" vs the reference "M A"), the mismatch counts as word errors. Open an issue if you hit a normalization case that's not Whisper-style.
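The normalization case above can be reproduced with a small WER sketch. It assumes the scorer behaves as described (lowercase, strip ASCII punctuation, split on whitespace) and uses a plain word-level edit distance; the bundled scorer may differ in detail, and the function names here are illustrative.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,         # reference word deleted
                    current[j - 1] + 1,      # extra word inserted
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference word count."""
    ref_words = normalize(reference)
    if not ref_words:
        raise ValueError("reference is empty after normalization")
    return word_errors(ref_words, normalize(hypothesis)) / len(ref_words)


# "M.A." collapses to "ma" after punctuation stripping, while the reference has "m a".
score = wer("He holds an M A degree", "He holds an M.A. degree")
print(f"{score:.2f}")
```

Here the single abbreviation costs two edits (one substitution, one deletion) against six reference words, so one formatting difference moves this utterance from 0 to about 0.33 WER. On a 5-utterance set, that is enough to shift the mean noticeably.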
Verify the envelope
uv run bench verify ./envelopes/voice.transcription.librispeech-clean-mini-*.json \
--dev-public-key cosign.key.pub
OK
method: dev-key
content_hash: <sha256>
suite: voice.transcription.librispeech-clean-mini v1.0.0
model: Systran/faster-whisper-large-v3
engine: whisper-http
For keyless verification of a community-published envelope, see the Sigstore keyless verify recipe.
What this benchmark isn't
- Not a multi-engine bake-off. It runs against whatever endpoint you
point it at. To compare engines, run the same spec against two engines and
combine with bench compare.
- Not a license certificate. LibriSpeech is CC BY 4.0; the bundled 5
utterances are sourced from hf-internal-testing/librispeech_asr_dummy. If you publish numbers, attribute appropriately.
- Not enough audio to make a paper claim. 18s of speech is for smoke
validation. For headline numbers, use a longer subset of LibriSpeech +
CommonVoice + AMI + earnings22: write a custom JSONL pointing at the same
schema and bench run will pick it up.