
bench diff

Per-metric delta between two envelopes — a baseline and a candidate. Each metric is classified as improvement, regression, no_change, unknown, or missing using a direction-aware policy (lower-is-better for latencies / cost / energy; higher-is-better for throughput / quality / goodput).

Narrower in scope than bench compare, which renders Pareto frontiers across many runs: bench diff compares exactly two envelopes. It is the canonical "did my optimisation actually help?" command and, via --strict, the canonical CI regression check.

Synopsis

bench diff <baseline.json> <candidate.json> [--tolerance 0.02] [--report table|json] [--strict] [--verify]

Example: kernel change regression check

bench diff \
  baseline/c16-60be8efd6d21.json \
  candidate/c16-60be8efd6d21.json \
  --strict

Expected output (truncated, with one regression):

                                    Envelope diff
 Metric                       Baseline  Candidate  Δ abs    Δ rel    Verdict
 ttft_p99_ms                  64.71     78.50      +13.79   +21.31%  ↑ regression
 throughput_tok_per_s         1,384.2   1,402.7    +18.50   +1.34%   ≈
 joules_per_token             0.70      0.68       -0.02    -2.86%   ↓ improvement
 ok_rate                      1.000     1.000      +0.00    +0.00%   ≈

Exit code is 1 when --strict is set and any metric is classified as a regression, 0 otherwise.

Flags

Flag         Default  Description
--tolerance  0.02     Relative-delta band (±2 %) inside which a metric is no_change.
--report     table    Output format: table or json.
--strict     off      Exit 1 if any metric is a regression. Use this in CI.
--verify     off      Verify both envelopes' signatures before diffing.
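
The flags and the exit-code contract above can be driven from a CI script. The following is a minimal sketch, not part of bench itself: it assumes the bench executable is on PATH, uses only the flags listed in the table, and the helper names build_command and gate are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_command(baseline: str, candidate: str,
                  tolerance: Optional[float] = None,
                  verify: bool = False) -> List[str]:
    """Assemble a `bench diff --strict` invocation as an argv list."""
    cmd = ["bench", "diff", baseline, candidate, "--strict"]
    if tolerance is not None:
        cmd += ["--tolerance", str(tolerance)]
    if verify:
        cmd.append("--verify")
    return cmd


def gate(baseline: str, candidate: str, **kwargs) -> int:
    """Run the diff and return its exit code.

    Per the exit-code rule documented above, 1 means --strict found a regression.
    Assumption: `bench` is installed and on PATH.
    """
    return subprocess.run(build_command(baseline, candidate, **kwargs)).returncode


# Typical CI usage (paths are placeholders):
#   raise SystemExit(gate("baseline/run.json", "candidate/run.json", tolerance=0.05))
```

Passing the exit code straight through lets the CI job fail exactly when bench diff reports a regression, without parsing the table output.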

Direction policy

Lower is better                          Higher is better
ttft_*, tpot_*, total_* ms percentiles   throughput_tok_per_s
joules_per_token, energy_joules_total    req_per_s_passing, req_per_s_all
power_avg_w, power_peak_w                compliance_rate, ok_rate
cost_usd_per_million_tokens              goodput_at_slo

Metrics not in either set are tagged unknown — the delta is still rendered but no verdict is emitted.
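
To make the policy concrete, here is a rough sketch of how a direction-aware classifier could combine the two metric sets with the --tolerance band. It is an illustration under stated assumptions, not bench's actual implementation: glob matching for the ttft_*, tpot_* and total_* families, an inclusive tolerance band, treating an absent value as missing, and the zero-baseline handling are all guesses.

```python
import math
from fnmatch import fnmatchcase
from typing import Optional

# Names taken from the direction-policy table; glob matching is an assumption.
LOWER_IS_BETTER = (
    "ttft_*", "tpot_*", "total_*",
    "joules_per_token", "energy_joules_total",
    "power_avg_w", "power_peak_w",
    "cost_usd_per_million_tokens",
)
HIGHER_IS_BETTER = (
    "throughput_tok_per_s",
    "req_per_s_passing", "req_per_s_all",
    "compliance_rate", "ok_rate",
    "goodput_at_slo",
)


def direction(metric: str) -> Optional[str]:
    """Return "lower", "higher", or None when the metric is in neither set."""
    if any(fnmatchcase(metric, p) for p in LOWER_IS_BETTER):
        return "lower"
    if any(fnmatchcase(metric, p) for p in HIGHER_IS_BETTER):
        return "higher"
    return None


def classify(metric: str, baseline: Optional[float], candidate: Optional[float],
             tolerance: float = 0.02) -> str:
    """Classify one metric as improvement, regression, no_change, unknown, or missing."""
    if baseline is None or candidate is None:
        return "missing"  # assumption: the metric is absent from one envelope
    want = direction(metric)
    if want is None:
        return "unknown"
    if baseline == 0:
        # Relative delta is undefined here; assumption: decide on the sign alone.
        rel = 0.0 if candidate == 0 else math.copysign(math.inf, candidate)
    else:
        rel = (candidate - baseline) / abs(baseline)
    if abs(rel) <= tolerance:  # assumption: the band is inclusive
        return "no_change"
    went_down = rel < 0
    return "improvement" if went_down == (want == "lower") else "regression"
```

Applied to the example table, classify("ttft_p99_ms", 64.71, 78.50) yields regression and classify("joules_per_token", 0.70, 0.68) yields improvement, matching the verdict column.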

Context warnings

If the baseline and candidate envelopes differ on suite_id, model.id, engine.name, engine.version, quantization.format, or the hardware fingerprint, bench diff still runs but prints a yellow warning. Diffing across contexts is supported; interpret the deltas with care.
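
The same context check can be reproduced before diffing, for example to refuse cross-context comparisons in CI. The sketch below assumes the dotted field names above map onto nested JSON objects in the envelope; the envelope schema is not documented here, and the hardware.fingerprint key in particular is a hypothetical stand-in for wherever the hardware fingerprint is stored.

```python
from typing import Any, Dict, List

# Fields named in the paragraph above; "hardware.fingerprint" is a hypothetical key.
CONTEXT_FIELDS = (
    "suite_id",
    "model.id",
    "engine.name",
    "engine.version",
    "quantization.format",
    "hardware.fingerprint",
)

_ABSENT = object()  # sentinel so an absent field differs from a stored None


def lookup(envelope: Dict[str, Any], dotted: str) -> Any:
    """Walk a dotted path through nested dicts, returning _ABSENT if any key is missing."""
    node: Any = envelope
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return _ABSENT
        node = node[key]
    return node


def context_mismatches(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> List[str]:
    """List the context fields whose values differ between the two envelopes."""
    return [f for f in CONTEXT_FIELDS if lookup(baseline, f) != lookup(candidate, f)]
```

An empty list means the two envelopes share a context under these assumptions. A non-empty list names the fields to mention when interpreting the deltas.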

See also