Plugin: llm.inference
The `llm.inference` plugin benchmarks LLM serving systems. Phase 1 supports vLLM on Linux x86_64 with H100 GPUs; SGLang, TensorRT-LLM, llama.cpp, and MLX are deferred to Phase 2.
```bash
pip install inferencebench inferencebench-llm
bench run llm.inference --model meta-llama/Llama-4-Maverick --engine vllm --quant fp8
```
What it measures
The plugin drives prompts through a serving endpoint and measures:
- Time-to-first-token (TTFT). Latency from request submission to first decoded token, in ms. Reported as `p50` and `p99`.
- Time-per-output-token (TPOT). Latency between successive decoded tokens, in ms. Reported as `p50` and `p99`.
- Throughput. Tokens produced per second across all concurrent requests.
- Goodput at SLO. Tokens-per-second the system can sustain while still satisfying the SLO template.
- Power. Average wall power across the GPUs, in watts.
- Energy per token. `power_avg_w / throughput_tok_per_s`, in joules per token.
- Cost. USD per million tokens, computed against a published pricing snapshot when the provider is a hosted endpoint.
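As a rough illustration of how the latency, throughput, and energy figures above relate to raw timestamps, the sketch below reduces per-request traces to those metrics. The `RequestTrace` layout, the `percentile` helper, and the `summarize` function are hypothetical names introduced here, not part of the plugin. Pooling inter-token gaps across requests for TPOT and using linear-interpolation percentiles are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps for one request, in seconds (hypothetical record layout)."""
    submitted_s: float
    token_times_s: list[float]  # arrival time of each decoded token, in order


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between ranks; q is in [0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(traces: list[RequestTrace], window_s: float,
              power_avg_w: float) -> dict[str, float]:
    """Reduce per-request traces to headline metrics.

    Assumes every trace holds at least one token, at least one trace holds
    two or more, and window_s is the length of the measurement window.
    """
    # TTFT: submission to first decoded token, one sample per request.
    ttft_ms = [(t.token_times_s[0] - t.submitted_s) * 1e3 for t in traces]
    # TPOT: gap between successive tokens, pooled across requests (assumption).
    tpot_ms = [
        (later - earlier) * 1e3
        for t in traces
        for earlier, later in zip(t.token_times_s, t.token_times_s[1:])
    ]
    total_tokens = sum(len(t.token_times_s) for t in traces)
    throughput = total_tokens / window_s
    return {
        "ttft_p50_ms": percentile(ttft_ms, 50),
        "ttft_p99_ms": percentile(ttft_ms, 99),
        "tpot_p50_ms": percentile(tpot_ms, 50),
        "tpot_p99_ms": percentile(tpot_ms, 99),
        "throughput_tok_s": throughput,
        # Watts are joules per second, so W / (tok/s) yields joules per token.
        "joules_per_token": power_avg_w / throughput,
    }
```

The energy line is a unit identity: dividing average watts by tokens per second cancels the seconds and leaves joules per token.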
Datasets
Phase 1 ships:
| Dataset id | Description | Size |
|---|---|---|
| `sharegpt-v3` | A canonical-ordered subset of ShareGPT V3 conversations | 10K turns |
Additional datasets land in Phase 2.
Engines
| Engine | Status |
|---|---|
| `vllm` | Phase 1 |
| `sglang` | Phase 2 |
| `trtllm` | Phase 2 |
| `llama.cpp` | Phase 2 |
| `mlx` | Phase 2 |
SLO templates
| Template | TTFT p99 | TPOT p99 |
|---|---|---|
| `llm.standard` | 300 ms | 50 ms |
| `llm.realtime` | 100 ms | 30 ms |
| `llm.batch` | n/a | 200 ms |
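To make the template bounds concrete, here is a minimal sketch of an SLO check and one possible reading of goodput. The `SloTemplate` class and the `meets_slo` and `goodput_at_slo` functions are hypothetical and do not reflect the plugin's internals. Two assumptions are baked in: bounds are treated as inclusive, and goodput is taken as the highest throughput among sweep points whose p99 latencies satisfy the template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SloTemplate:
    ttft_p99_ms: Optional[float]  # None: the template places no bound on TTFT
    tpot_p99_ms: float


# Bounds copied from the SLO template table.
SLO_TEMPLATES = {
    "llm.standard": SloTemplate(ttft_p99_ms=300.0, tpot_p99_ms=50.0),
    "llm.realtime": SloTemplate(ttft_p99_ms=100.0, tpot_p99_ms=30.0),
    "llm.batch": SloTemplate(ttft_p99_ms=None, tpot_p99_ms=200.0),
}


def meets_slo(template: str, ttft_p99_ms: float, tpot_p99_ms: float) -> bool:
    """True when both p99 latencies fall within the template's bounds."""
    slo = SLO_TEMPLATES[template]
    ttft_ok = slo.ttft_p99_ms is None or ttft_p99_ms <= slo.ttft_p99_ms
    return ttft_ok and tpot_p99_ms <= slo.tpot_p99_ms


def goodput_at_slo(points: list[dict[str, float]], template: str) -> float:
    """Best throughput among SLO-compliant sweep points, or 0.0 if none comply.

    Each point is a dict with ttft_p99_ms, tpot_p99_ms, and throughput_tok_s,
    for example one entry per concurrency level in a sweep.
    """
    compliant = [
        p["throughput_tok_s"]
        for p in points
        if meets_slo(template, p["ttft_p99_ms"], p["tpot_p99_ms"])
    ]
    return max(compliant, default=0.0)
```

Under this reading, a sweep point that raises throughput but pushes either p99 past its bound does not count toward goodput.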
Example run
```bash
bench run llm.inference \
  --model meta-llama/Llama-4-Maverick \
  --engine vllm \
  --hardware h100 \
  --quant fp8 \
  --concurrency 1,4,16,64 \
  --duration 300 \
  --slo-template llm.standard \
  --seed 42
```
Expected output (truncated):
```text
Run id: 01J7Q5C6...
Model: meta-llama/Llama-4-Maverick @ fp8 on H100-SXM5-80GB
Engine: vllm 0.7.2
Metrics:
  ttft_p50_ms 142.0
  ttft_p99_ms 280.3
  tpot_p50_ms 18.5
  throughput_tok_s 1842.1
  goodput_at_slo 142.3 req/s
  power_avg_w 612
  joules_per_token 0.32
```
Methodology
Three warm-up runs are discarded. The convergence gate requires a coefficient of variation (CoV) below 5% across the last 30 requests before measurement begins. The driver is open-loop, with Poisson arrivals at the requested concurrency. Percentiles are reported with 95% bootstrap confidence intervals (1000 resamples).
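The three mechanisms in this paragraph can be sketched in a few lines of standard-library Python. All function names here are hypothetical, and several details are assumptions rather than documented behavior: the gate is applied to per-request latency using the sample standard deviation, the arrival rate is supplied directly because the mapping from requested concurrency to rate is not specified here, and the interval uses the plain percentile bootstrap.

```python
import math
import random
import statistics


def is_converged(latencies_ms: list[float], window: int = 30,
                 threshold: float = 0.05) -> bool:
    """CoV gate: stdev / mean over the last `window` samples is below threshold."""
    if len(latencies_ms) < window:
        return False
    tail = latencies_ms[-window:]
    mean = statistics.fmean(tail)
    if mean == 0:
        return False
    return statistics.stdev(tail) / mean < threshold


def poisson_arrival_times(rate_per_s: float, duration_s: float,
                          seed: int) -> list[float]:
    """Open-loop schedule: exponential inter-arrival gaps form a Poisson process."""
    rng = random.Random(seed)
    now, times = 0.0, []
    while True:
        now += rng.expovariate(rate_per_s)
        if now >= duration_s:
            return times
        times.append(now)


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between ranks; q is in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def bootstrap_percentile_ci(samples: list[float], q: float,
                            resamples: int = 1000, level: float = 0.95,
                            seed: int = 42) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the q-th percentile."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and recompute the statistic each time.
    estimates = [percentile(rng.choices(samples, k=n), q) for _ in range(resamples)]
    tail_pct = (1.0 - level) / 2.0 * 100.0
    return percentile(estimates, tail_pct), percentile(estimates, 100.0 - tail_pct)
```

Because arrivals are scheduled ahead of time and do not wait on responses, a slow server accumulates queueing delay instead of throttling the load, which is the point of an open-loop driver.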
For cross-engine comparisons, three independent process launches with different seeds are required. The plugin enforces this when more than one engine is in the comparison.
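A check of this kind might look like the sketch below. The function name and the `(engine, seed)` record shape are hypothetical, and "different seeds" is read here as distinct seeds among one engine's launches. The sketch only counts launches and seeds; it does not verify that launches ran in separate processes.

```python
from collections import defaultdict


def cross_engine_problems(runs: list[tuple[str, int]],
                          min_launches: int = 3) -> list[str]:
    """Return human-readable violations of the cross-engine launch rule.

    `runs` holds one (engine, seed) pair per process launch. An empty
    result means the comparison satisfies the rule.
    """
    seeds_by_engine: dict[str, list[int]] = defaultdict(list)
    for engine, seed in runs:
        seeds_by_engine[engine].append(seed)

    # The rule only applies when more than one engine is being compared.
    if len(seeds_by_engine) <= 1:
        return []

    problems = []
    for engine, seeds in sorted(seeds_by_engine.items()):
        if len(seeds) < min_launches:
            problems.append(
                f"{engine}: {len(seeds)} launch(es), need at least {min_launches}"
            )
        if len(set(seeds)) != len(seeds):
            problems.append(f"{engine}: seeds are not all different")
    return problems
```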
Known limitations (Phase 1)
- vLLM only. SGLang/TensorRT-LLM/llama.cpp/MLX support is Phase 2.
- Linux x86_64 H100 only. Other hardware classes are accepted by the driver but lack tuned engine configs.
- No vision-language models. Multi-modal prompts are Phase 2.
- The cost figure assumes the listed provider's published pricing snapshot; promotional pricing is not reflected.