InferenceBench

llm.quality.factual-mini

9 entries. Pareto frontier computed on throughput_tok_per_s (higher is better) vs. ttft_p50_ms (lower is better). Rows marked P are on the frontier.
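
The frontier rule described above can be sketched as follows. This is a minimal illustration, not InferenceBench's actual implementation: the `Entry` type, the helper names, and the sample numbers are all hypothetical and are not taken from the table. An entry counts as on the frontier when no other entry is at least as good on both metrics and strictly better on one.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    # Hypothetical record type; field names mirror the metric names in the text.
    name: str
    throughput_tok_per_s: float  # higher is better
    ttft_p50_ms: float           # lower is better


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one."""
    no_worse = (a.throughput_tok_per_s >= b.throughput_tok_per_s
                and a.ttft_p50_ms <= b.ttft_p50_ms)
    strictly_better = (a.throughput_tok_per_s > b.throughput_tok_per_s
                       or a.ttft_p50_ms < b.ttft_p50_ms)
    return no_worse and strictly_better


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Return the entries that no other entry dominates (simple O(n^2) scan)."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


# Made-up sample data for illustration only.
entries = [
    Entry("model-a", 1200.0, 30.0),
    Entry("model-b", 900.0, 20.0),
    Entry("model-c", 800.0, 35.0),   # dominated by model-a and model-b
    Entry("model-d", 1200.0, 45.0),  # same throughput as model-a, slower TTFT
]

frontier = pareto_frontier(entries)
for e in entries:
    marker = "P" if e in frontier else " "
    print(f"{marker} {e.name} {e.throughput_tok_per_s} tok/s {e.ttft_p50_ms} ms")
```

The quadratic scan is adequate for a board of this size. Entries that lack either metric would need an explicit policy (for example, exclusion from the frontier), which this sketch does not attempt.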

Model Engine Hardware Quant TTFT P50 (ms) TTFT P99 (ms) Throughput (tok/s) $/M tokens J/token Power avg (W) Power peak (W) WER mean J / audio s Envelope
deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct vllm unknown 8x NVIDIA H100 80GB HBM3 fp16 45.01 JSON
Qwen/Qwen2-VL-7B-Instruct vllm unknown 8x NVIDIA H100 80GB HBM3 fp16 29.44 JSON
mistralai/Mistral-7B-Instruct-v0.3 vllm unknown 8x NVIDIA H100 80GB HBM3 fp16 11.70 JSON
google/gemma-2-9b-it vllm unknown 8x NVIDIA H100 80GB HBM3 fp16 16.89 JSON
meta-llama/Llama-3.1-8B-Instruct vllm unknown 8x NVIDIA H100 80GB HBM3 fp16 14.08 JSON
microsoft/Phi-3.5-mini-instruct vllm unknown 8x NVIDIA H100 80GB HBM3 fp16 11.21 JSON
meta-llama/Llama-3.1-70B-Instruct vllm unknown 8x NVIDIA H100 80GB HBM3 fp16 32.61 JSON
Qwen/Qwen2.5-Coder-7B-Instruct vllm unknown 8x NVIDIA H100 80GB HBM3 fp16 13.96 JSON
Qwen/Qwen2.5-7B-Instruct vllm unknown 8x NVIDIA H100 80GB HBM3 fp16 14.12 JSON