10-minute tour
This page walks a brand-new user from git clone to a signed envelope you can hand to anyone and have them verify. It's the same path bench tour runs, with each step spelled out so you can stop and inspect.
The numbers shown in the output blocks come from a real corpus captured on H100-80GB-HBM3 in May 2026 (validation-runs/2026-05-16-cross-model-corpus/). Your numbers will differ; the shape will not.
0–2 min: Install
git clone https://github.com/yobitelcomm/bench
cd bench
uv sync --all-packages --dev --prerelease=allow
uv run bench --version
Expected:
If uv is not on your PATH, install it first with pipx install uv. The repo uses uv workspace mode, so a single uv sync resolves the CLI, the harness, the envelope library, and every plugin from one lock file.
What you learned
- The repo is a uv workspace; one sync installs every package in the monorepo.
2–4 min: Hardware check
uv run bench doctor
Expected (excerpt, on a healthy H100 node):
Hardware diagnostic
Check             Status  Detail
NVML available    PASS    8 GPUs visible
Driver version    PASS    560.35.03
ECC enabled       PASS    enabled on all GPUs
Persistence mode  PASS    enabled
Thermal headroom  PASS    all GPUs < 75 degC
Clock state       PASS    no throttling flags
OK — all checks passed.
Field by field:
- NVML available: bench reads telemetry via pynvml. No NVML, no envelope.
- Driver version: captured into the envelope's hardware fingerprint.
- ECC enabled: without ECC, single-bit memory errors can silently corrupt logits. ECC off means the result is rejected.
- Persistence mode: keeps the driver loaded between runs so cold-start latency does not pollute warm-up timings.
- Thermal headroom: anything above ~83 °C will trigger throttling on H100s and skew TTFT (time to first token).
- Clock state: bench aborts if any throttling flag is set during the run.
bench doctor exits non-zero on a failure. --strict also fails on warnings.
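If you want to see roughly what checks like these look like against NVML, here is a minimal sketch. It is not bench's implementation: the GpuReading type, the evaluate and read_gpus helpers, the 75 °C limit (borrowed from the sample output above), and the choice to ignore NVML's "GPU idle" reason bit are all assumptions made for illustration. Only the pynvml calls themselves are real.

```python
from dataclasses import dataclass

MAX_TEMP_C = 75          # assumed limit, taken from the sample output above
GPU_IDLE_REASON = 0x1    # NVML "GPU idle" clock reason; assumed harmless here


@dataclass
class GpuReading:
    """Hypothetical per-GPU snapshot; not a bench data structure."""
    ecc_enabled: bool
    persistence_enabled: bool
    temperature_c: int
    throttle_reasons: int  # NVML bitmask; 0 means no reason flags set


def evaluate(readings):
    """Return a pass/fail flag per check for a list of GpuReading objects."""
    has_gpus = bool(readings)
    return {
        "NVML available": has_gpus,
        "ECC enabled": has_gpus and all(r.ecc_enabled for r in readings),
        "Persistence mode": has_gpus and all(r.persistence_enabled for r in readings),
        "Thermal headroom": has_gpus and all(r.temperature_c < MAX_TEMP_C for r in readings),
        "Clock state": has_gpus and all(
            (r.throttle_reasons & ~GPU_IDLE_REASON) == 0 for r in readings
        ),
    }


def read_gpus():
    """Collect one GpuReading per visible GPU (needs pynvml and an NVIDIA driver)."""
    import pynvml  # imported lazily so evaluate() works on machines without a GPU

    pynvml.nvmlInit()
    try:
        readings = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            ecc_current, _ecc_pending = pynvml.nvmlDeviceGetEccMode(handle)
            readings.append(GpuReading(
                ecc_enabled=ecc_current == pynvml.NVML_FEATURE_ENABLED,
                persistence_enabled=(
                    pynvml.nvmlDeviceGetPersistenceMode(handle)
                    == pynvml.NVML_FEATURE_ENABLED
                ),
                temperature_c=pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU
                ),
                throttle_reasons=pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle),
            ))
        return readings
    finally:
        pynvml.nvmlShutdown()
```

On a GPU node, evaluate(read_gpus()) gives you a dict you can compare against the doctor table; on a laptop you can still exercise evaluate with hand-built readings.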
What you learned
- Every check doctor runs corresponds to a field captured in the signed envelope's hardware fingerprint.
4–7 min: First signed envelope
A real bench run needs a model server. The realistic flow on an 8×H100 box is:
# In one terminal:
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
# In another, once it's serving:
cosign generate-key-pair # produces ./cosign.key + cosign.pub
uv run bench run llm.inference.chatbot-short \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--engine vllm --quant fp16 \
--sweep 1,4 \
--base-url http://localhost:8000/v1 \
--signing-mode dev --dev-key ./cosign.key \
--output ./corpus/tiny
The harness discards three warm-up runs, waits for the convergence gate (coefficient of variation, CoV, below 5% over the last 30 requests), then drives Poisson-arrival load at each concurrency level in the sweep. The output ends with the envelope path:
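As an aside on the convergence gate and the Poisson load mentioned above, the sketch below shows one way those two ideas can be expressed. It assumes the gate is computed over per-request latencies and that arrivals are drawn from a fixed rate; the harness may measure a different quantity or tie the rate to concurrency differently. All function names here are hypothetical.

```python
import random
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def converged(latencies, window=30, threshold=0.05):
    """True once the last `window` latencies have a CoV below `threshold`."""
    if len(latencies) < window:
        return False  # not enough requests yet to judge
    return coefficient_of_variation(latencies[-window:]) < threshold


def poisson_arrival_times(rate_per_s, n, seed=0):
    """Timestamps (seconds) for n arrivals with exponential gaps of mean 1/rate."""
    rng = random.Random(seed)  # seeded so a run can be replayed
    now = 0.0
    times = []
    for _ in range(n):
        now += rng.expovariate(rate_per_s)
        times.append(now)
    return times
```

Exponential gaps between requests are what make the arrival process Poisson; a fixed seed keeps the schedule reproducible, which fits with the seed being recorded in the envelope.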
Inspect the JSON directly:
You'll see the content_hash, the hardware fingerprint, the dataset hash, the seed, the metrics, and a Sigstore signature block.
Verify it:
Expected:
OK ./corpus/tiny/c1-<id>.json
method: sigstore-cosign-dev
content_hash: 8b1a…e2c4
suite: llm.inference v1.0.0
Verification recomputes the content hash, checks the cosign signature against the bundled public key, and confirms every metric is internally consistent. Any mismatch is a hard failure.
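To make the "recomputes the content hash" step concrete, here is a small sketch of the general technique. Everything specific in it is an assumption: SHA-256 as the digest, sorted-key compact JSON as the canonical form, and the top-level key names content_hash and signature. The real envelope schema and canonicalization may differ, and the cosign signature check is not covered here.

```python
import hashlib
import json

# Assumed key names; the actual envelope may nest or name these differently.
EXCLUDED_KEYS = ("content_hash", "signature")


def content_hash(envelope):
    """SHA-256 hex digest over canonical JSON of everything except hash and signature."""
    payload = {k: v for k, v in envelope.items() if k not in EXCLUDED_KEYS}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def hash_matches(envelope):
    """True if the stored hash equals the hash recomputed from the content."""
    return envelope.get("content_hash") == content_hash(envelope)
```

Because the hash covers the metrics and the seed, editing any of them after the fact changes the recomputed digest and hash_matches returns False.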
What you learned
- An envelope is the unit of trust. If bench verify passes, you can hand the JSON to anyone and they can reproduce the claim.
7–9 min: Compare + leaderboard
Run a second model the same way (or copy two envelopes from validation-runs/2026-05-16-cross-model-corpus/corpus/all/). Diff them:
You'll get a side-by-side metric table with deltas. The corpus shipped in the repo shows Llama-3.1-8B at 1384.2 tok/s and Qwen2.5-7B at 1362.3 tok/s at concurrency 16 — same hardware, same suite, ~1.6 % apart on throughput and ~1.4 % apart on J/tok.
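The throughput gap quoted above is easy to check by hand. The helper below is a hypothetical one-liner, not part of bench, and it assumes the delta is taken relative to the lower of the two values.

```python
def relative_delta_pct(a, b):
    """Percentage by which `a` exceeds `b`, relative to `b`."""
    return (a - b) / b * 100.0


llama_tok_s = 1384.2  # Llama-3.1-8B at concurrency 16, from the shipped corpus
qwen_tok_s = 1362.3   # Qwen2.5-7B at concurrency 16, from the shipped corpus

print(f"{relative_delta_pct(llama_tok_s, qwen_tok_s):.1f} %")  # prints "1.6 %"
```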
Render a static leaderboard from any directory of envelopes:
uv run bench leaderboard --build \
--envelopes validation-runs/2026-05-16-cross-model-corpus/corpus/all \
--out ./site
open ./site/index.html
The output is a self-contained HTML site with Pareto plots — no JavaScript framework, no server.
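If Pareto plots are new to you, the sketch below shows how a Pareto front can be picked out of a set of results. It assumes the two axes are throughput (higher is better) and energy per token (lower is better); the leaderboard may plot other metric pairs. The function name and the sample values are made up for illustration and do not come from the corpus.

```python
def pareto_front(points):
    """Return names of points that no other point beats on both axes.

    points: list of (name, tok_per_s, joules_per_tok) tuples.
    """
    front = []
    for name, tps, jpt in points:
        dominated = any(
            # another point is at least as good on both axes and better on one
            o_tps >= tps and o_jpt <= jpt and (o_tps > tps or o_jpt < jpt)
            for _, o_tps, o_jpt in points
        )
        if not dominated:
            front.append(name)
    return front


# Made-up values, purely to exercise the function.
sample = [("a", 100.0, 1.0), ("b", 90.0, 0.8), ("c", 80.0, 1.2)]
print(pareto_front(sample))  # prints "['a', 'b']"
```

Here "c" drops out because "a" is both faster and cheaper per token, while "a" and "b" each win on one axis.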
What you learned
- bench diff and bench leaderboard --build are pure functions of a directory of envelopes; no network, no DB.
9–10 min: Share
Bundle a single envelope with the public key and the cosign certificate so a recipient can verify it offline:
A recipient can verify your run without the original repo by running bench bundle extract followed by bench verify.
Mirror a whole corpus to a local directory tree (a stand-in for a future hosted Studio mirror):
The mirror layout matches what bench fetch consumes, so collaborators can pull from a shared NFS path or an S3 bucket synced to local disk.
What you learned
- Envelopes are portable. Bundle for one-off sharing; publish to a workspace mirror for a team.
Where to go next
- Quickstart: the canonical 5-minute install + run.
- Signed envelope: what is in the JSON, and why every field is there.
- Cross-model comparison recipe: the full Llama vs Qwen walkthrough with real numbers.
- Plugin authoring: scaffold your own benchmark with bench plugin init.