
Plugin: llm.inference

The llm.inference plugin benchmarks LLM serving systems. Phase 1 ships with vLLM on Linux H100; SGLang, TensorRT-LLM, llama.cpp, and MLX are deferred to Phase 2.

pip install inferencebench inferencebench-llm
bench run llm.inference --model meta-llama/Llama-4-Maverick --engine vllm --quant fp8

What it measures

The plugin drives prompts through a serving endpoint and measures:

  • Time-to-first-token (TTFT). Latency from request submission to first decoded token, in ms. Reported as p50 and p99.
  • Time-per-output-token (TPOT). Latency between successive decoded tokens, in ms. Reported as p50 and p99.
  • Throughput. Tokens produced per second across all concurrent requests.
  • Goodput at SLO. Request rate the system can sustain while still satisfying the SLO template, in requests per second.
  • Power. Average wall power across the GPUs, in watts.
  • Energy per token. power_avg_w / throughput_tok_s, in joules per token.
  • Cost. USD per million tokens, computed against a published pricing snapshot when the provider is a hosted endpoint.
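As a rough illustration of how the latency, throughput, and energy figures above relate to raw timings, the sketch below derives them from per-request token timestamps. It is not the plugin's implementation: the `summarize` and `percentile` helpers, the input layout, and the linear-interpolation percentile are all assumptions made for the example.

```python
def percentile(values, q):
    """Linear-interpolated percentile for q in [0, 100] (assumed method)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no samples")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize(requests, wall_time_s, power_avg_w):
    """Hypothetical helper.

    requests: list of (submit_time_s, [token_time_s, ...]) tuples.
    """
    ttft_ms, tpot_ms, total_tokens = [], [], 0
    for submit, token_times in requests:
        total_tokens += len(token_times)
        # TTFT: request submission to first decoded token.
        ttft_ms.append((token_times[0] - submit) * 1000.0)
        # TPOT: gaps between successive decoded tokens of one request.
        tpot_ms.extend(
            (b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])
        )
    # Throughput: tokens from all concurrent requests per second of wall time.
    throughput = total_tokens / wall_time_s
    return {
        "ttft_p50_ms": percentile(ttft_ms, 50),
        "ttft_p99_ms": percentile(ttft_ms, 99),
        "tpot_p50_ms": percentile(tpot_ms, 50),
        "tpot_p99_ms": percentile(tpot_ms, 99),
        "throughput_tok_s": throughput,
        # Watts divided by tokens per second gives joules per token.
        "joules_per_token": power_avg_w / throughput,
    }


# Two made-up requests observed over one second at 600 W average power.
stats = summarize(
    [(0.0, [0.10, 0.12, 0.14]), (0.0, [0.20, 0.23, 0.26])],
    wall_time_s=1.0,
    power_avg_w=600.0,
)
print(stats)
```

The energy figure needs no separate measurement: it follows directly from average power and throughput, so any error in either input carries straight through to joules per token.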

Datasets

Phase 1 ships:

Dataset id    Description                                                  Size
sharegpt-v3   A canonically ordered subset of ShareGPT V3 conversations   10K turns

Additional datasets land in Phase 2.

Engines

Engine      Status
vllm        Phase 1
sglang      Phase 2
trtllm      Phase 2
llama.cpp   Phase 2
mlx         Phase 2

SLO templates

Template       TTFT p99   TPOT p99
llm.standard   300 ms     50 ms
llm.realtime   100 ms     30 ms
llm.batch      n/a        200 ms
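To make the table concrete, here is a minimal sketch of checking a set of measured percentiles against a template. The `SLO_TEMPLATES` dictionary and `meets_slo` function are hypothetical names, and treating a value exactly at the limit as passing is an assumption; the plugin's actual comparison rule is not documented here.

```python
# Limits copied from the table above; None means the template sets no limit.
SLO_TEMPLATES = {
    "llm.standard": {"ttft_p99_ms": 300.0, "tpot_p99_ms": 50.0},
    "llm.realtime": {"ttft_p99_ms": 100.0, "tpot_p99_ms": 30.0},
    "llm.batch": {"ttft_p99_ms": None, "tpot_p99_ms": 200.0},
}


def meets_slo(metrics, template):
    """Hypothetical check: True if every limited metric is within its limit."""
    for key, limit in SLO_TEMPLATES[template].items():
        if limit is None:
            continue  # e.g. llm.batch places no bound on TTFT
        if metrics[key] > limit:  # assumption: equality counts as passing
            return False
    return True


# Illustrative values only; tpot_p99_ms is made up for the example.
measured = {"ttft_p99_ms": 280.3, "tpot_p99_ms": 41.0}
for name in SLO_TEMPLATES:
    print(name, meets_slo(measured, name))
```

With these illustrative numbers the run passes `llm.standard` and `llm.batch` but fails `llm.realtime`, because both percentiles exceed the tighter limits.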

Example run

bench run llm.inference \
  --model meta-llama/Llama-4-Maverick \
  --engine vllm \
  --hardware h100 \
  --quant fp8 \
  --concurrency 1,4,16,64 \
  --duration 300 \
  --slo-template llm.standard \
  --seed 42

Expected output (truncated):

Run id:    01J7Q5C6...
Model:     meta-llama/Llama-4-Maverick @ fp8 on H100-SXM5-80GB
Engine:    vllm 0.7.2
Metrics:
  ttft_p50_ms          142.0
  ttft_p99_ms          280.3
  tpot_p50_ms           18.5
  throughput_tok_s    1842.1
  goodput_at_slo       142.3 req/s
  power_avg_w          612
  joules_per_token       0.33

Methodology

Three warm-up runs are discarded. Measurement begins only after the convergence gate passes: the coefficient of variation (CoV) across the last 30 requests must be below 5%. The driver issues requests open-loop with Poisson arrivals at the requested concurrency. Reported percentiles include 95% bootstrap confidence intervals (1000 resamples).
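The convergence gate and the bootstrap interval can be sketched in a few lines. This is an illustration under stated assumptions, not the plugin's code: the function names are hypothetical, the gated signal is assumed to be per-request latency, the CoV uses the population standard deviation, and the interval is a plain percentile bootstrap.

```python
import random
from statistics import mean, pstdev


def percentile(values, q):
    """Linear-interpolated percentile for q in [0, 100] (assumed method)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def converged(latencies_ms, window=30, threshold=0.05):
    """Gate: CoV (stdev / mean) over the last `window` requests below 5%."""
    if len(latencies_ms) < window:
        return False
    recent = latencies_ms[-window:]
    return pstdev(recent) / mean(recent) < threshold


def bootstrap_ci(samples, q, resamples=1000, level=0.95, seed=42):
    """Percentile-bootstrap confidence interval for the q-th percentile."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and recompute the percentile each time.
    estimates = sorted(
        percentile(rng.choices(samples, k=n), q) for _ in range(resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = round(alpha * resamples)
    hi_idx = min(round((1.0 - alpha) * resamples), resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Synthetic latencies, for illustration only.
gen = random.Random(0)
latencies = [100.0 + gen.uniform(-2.0, 2.0) for _ in range(200)]
print("converged:", converged(latencies))
ci_low, ci_high = bootstrap_ci(latencies, 99)
print("p99 CI:", ci_low, ci_high)
```

Tail percentiles such as p99 rest on only a handful of samples in a short run, so the interval around them is typically much wider than the interval around p50.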

For cross-engine comparisons, three independent process launches with different seeds are required. The plugin enforces this when more than one engine is in the comparison.
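One way to picture the launch requirement is as a validation step applied before results are compared. The sketch below is hypothetical: `validate_comparison` and the run-record layout are invented for illustration, and reading the rule as "three distinct seeds per engine" is an assumption.

```python
def validate_comparison(runs, min_launches=3):
    """Hypothetical check on a list of run records.

    runs: list of dicts with "engine" and "seed" keys, one per process launch.
    Raises ValueError if a cross-engine comparison lacks enough launches.
    """
    seeds_by_engine = {}
    for run in runs:
        seeds_by_engine.setdefault(run["engine"], set()).add(run["seed"])
    if len(seeds_by_engine) <= 1:
        return  # the rule only applies when more than one engine is compared
    for engine, seeds in seeds_by_engine.items():
        if len(seeds) < min_launches:
            raise ValueError(
                f"{engine}: {len(seeds)} distinct seeds, need {min_launches}"
            )


# A single-engine run set passes regardless of launch count.
validate_comparison([{"engine": "vllm", "seed": 42}])

# A cross-engine set with too few launches for one engine is rejected.
try:
    validate_comparison(
        [{"engine": "vllm", "seed": s} for s in (1, 2, 3)]
        + [{"engine": "sglang", "seed": 1}]
    )
    rejected = False
except ValueError:
    rejected = True
print("rejected:", rejected)
```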

Known limitations (Phase 1)

  • vLLM only. SGLang/TensorRT-LLM/llama.cpp/MLX support is Phase 2.
  • Linux x86_64 H100 only. Other hardware classes run through the driver but lack tuned engine configs.
  • No vision-language models. Multi-modal prompts are Phase 2.
  • The cost figure assumes the listed provider's published pricing snapshot; promotional pricing is not reflected.

See also