Reproducibility¶
The reason InferenceBench exists is to make benchmark results reproducible. Four mechanisms enforce reproducibility:
- Warm-up discipline. Three runs are discarded before measurement.
- A convergence gate. Measurement does not start until the system is statistically stable.
- Explicit seeds. Every stochastic step uses a seed that is recorded in the envelope.
- The hardware fingerprint. Every result is bound to the configuration that produced it.
Together, these give two readers of an envelope enough information to attempt the same run.
Warm-up discipline¶
Run 0 → discarded (cold caches, JIT warm, allocator settling)
Run 1 → discarded
Run 2 → discarded
Run 3+ → measurement begins, conditional on convergence
Discarding three warm-ups is not the only behavior worth measuring. Cold-start latency is reported separately in metrics.cold_start_ms and is sampled in a dedicated phase. The warm steady-state numbers are what ttft_p50_ms and friends measure.
The convergence gate¶
After the discarded warm-ups, the harness waits for the coefficient of variation of TTFT to drop below 5% across a rolling window of 30 requests. If the gate does not pass within 60 seconds, the harness records a warning in the envelope and proceeds anyway, so the run is not lost.
The gate exists because cold-start variance can persist longer than three warm iterations on large models with paged-attention KV caches.
Seeds¶
Every stochastic step uses a seed:
- The dataset sampler
- The request inter-arrival generator (Poisson)
- The prompt shuffle
- The decoding generator (when sampling > 0 temperature)
The seed is recorded in envelope.seed. Re-running with the same seed, same dataset hash, same engine config hash, and same hardware fingerprint should produce identical traces modulo non-determinism in the kernel (which is itself measured and reported in warnings).
Three independent process launches¶
Cross-engine comparisons require three independent process launches with different seeds, then aggregating via bootstrap CI. This catches engine-internal warm-up effects and allocator nondeterminism that a single run hides.
Bootstrap percentile CIs¶
Percentiles are reported with a 95% bootstrap confidence interval (1000 resamples). The harness never reports a percentile as a single point estimate.
Expected output snippet:
What you need to reproduce a run¶
From an envelope alone, a reader needs:
- The model id and revision (
model.id,model.revision) - The engine name, version, and config hash (
engine.*) - The dataset id and hash (
dataset.*) - The seed (
seed) - The driver options (
driver_options) - A matching hardware configuration (
hardware_fingerprint)
Phase 1 ships everything but the engine config replay tool. Recovering the engine config from config_hash is on the v0.2 roadmap.