InferenceBench¶
Vendor-neutral, hardware-fingerprinted, Sigstore-signed AI benchmarks for inference systems.
pip install -e ./cli -e ./envelope -e ./harness
bench run llm.inference.chatbot-short \
--model meta-llama/Llama-3.1-8B-Instruct \
--engine vllm --quant fp16 --sweep 1,4,16,64 \
--base-url http://localhost:8000/v1
InferenceBench is a CLI plus a reproducibility envelope. Every result captures the exact hardware, software stack, dataset, and seed, then signs the bundle so anyone can verify it independently.
Live leaderboard
The 50-envelope cross-vendor marathon (9 LLMs × 5 vendors on 8× H100) is rendered at /leaderboard/ — sortable tables, Pareto frontiers, per-envelope verify snippets.
What you get¶
- A
benchCLI. 22 commands grouped around three things: running benchmarks (run,replay,doctor,list,history,profile,matrix), reasoning about results (compare,diff,summary,cost,schema,audit), and moving envelopes around (fetch,cache,publish,verify,bundle,export,leaderboard,watch,ci,plugin,plugins). See the CLI overview. Highlights since the last refresh:bench auditfor trust-but-verify on third-party corpora,bench ci initfor one-shot GitHub Actions regression gates, andbench matrixfor multi-endpoint comparisons. - A signed envelope per result. Hardware fingerprint, software provenance, dataset hash, seed, metrics, signature.
- Pareto outputs. Throughput, latency, cost, energy, and quality together. No single headline number.
- Hugging Face Hub publishing.
bench publish --to hfmints a citable dataset repo.
Recipes — start here¶
Four end-to-end workflows built from the commands above. The numbers in each recipe come from a real corpus captured on a single H100-80GB-HBM3 in May 2026 — they're the same envelopes that ship under validation-runs/ in the repo.
- Concurrency sweep. Throughput climbs 122 → 1384 tok/s on Llama-3.1-8B as J/tok drops from 7.24 to 0.70. The textbook story for
bench run --sweep. - Regression check. Capture a baseline, change a variable,
bench diff --strict. Drop the diff into CI to fail the build on any regression. - Verify and replay. Anyone can verify a signed envelope and replay it on their own hardware.
- Cross-model comparison. Llama-3.1-8B vs Qwen2.5-7B on the same suite — both hit ~1380 tok/s at conc=16 with a slightly different energy profile.
What this is not (yet)¶
- A SaaS. (Phase 2.)
- Multi-modal. Phase 1 ships the
llm.inferenceplugin only. - Multi-vendor at GA. Phase 1 ships with H100 coverage from one cluster; MI300X, RTX 5090, and M5 Max are deferred until partnerships land. Engine breadth is vLLM today, with an SGLang skeleton in tree.
Next steps¶
- Install the CLI
- Run your first benchmark in 5 minutes
- Read the envelope concept
- Browse the CLI reference
Project status¶
Phase 1 is active. PyPI release is pending — install from a git clone for now. The CLI is open source under Apache 2.0. Source lives at github.com/yobitelcomm/bench.