InferenceBench

llm.inference.chatbot-short

9 entries. Pareto frontier computed on throughput_tok_per_s (higher is better) vs. ttft_p50_ms (lower is better). Rows marked P are on the frontier.

| Frontier | Model | Engine | Hardware | Quant | TTFT P50 (ms) | TTFT P99 (ms) | Throughput (tok/s) | $/M tokens | J/token | Power avg (W) | Power peak (W) | WER mean | J / audio s | Envelope |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | mistralai/Mistral-7B-Instruct-v0.3 | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 20.77 | 87.92 | 472 | | 1.88 | 901 | 937 | | | JSON |
| | meta-llama/Llama-3.1-70B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 46.74 | 1,333 | 195 | | 10.07 | 2,003 | 2,153 | | | JSON |
| | Qwen/Qwen2.5-7B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 26.84 | 52.98 | 541 | | 1.65 | 909 | 948 | | | JSON |
| | meta-llama/Llama-3.1-8B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 25.39 | 1,678 | 228 | | 3.26 | 758 | 943 | | | JSON |
| | Qwen/Qwen2.5-Coder-7B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 26.43 | 74.97 | 529 | | 1.69 | 905 | 944 | | | JSON |
| | Qwen/Qwen2-VL-7B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 42.50 | 198 | 216 | | 3.83 | 837 | 857 | | | JSON |
| P | microsoft/Phi-3.5-mini-instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 18.41 | 265 | 716 | | 1.19 | 862 | 898 | | | JSON |
| | deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 74.43 | 521 | 134 | | 5.94 | 808 | 855 | | | JSON |
| | google/gemma-2-9b-it | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 30.05 | 184 | 385 | | 2.31 | 901 | 938 | | | JSON |
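
The frontier marking can be checked from the two columns it is computed on. The following is a minimal sketch, not the site's own implementation: it assumes the usual dominance rule (an entry is off the frontier if some other entry is no worse on both axes and strictly better on at least one), and the names `Entry`, `dominates`, and `pareto_frontier` are hypothetical. The TTFT P50 and throughput values are copied from the nine rows above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    ttft_p50_ms: float           # lower is better
    throughput_tok_per_s: float  # higher is better


# TTFT P50 (ms) and throughput (tok/s) for the nine listed entries.
ENTRIES = [
    Entry("mistralai/Mistral-7B-Instruct-v0.3", 20.77, 472),
    Entry("meta-llama/Llama-3.1-70B-Instruct", 46.74, 195),
    Entry("Qwen/Qwen2.5-7B-Instruct", 26.84, 541),
    Entry("meta-llama/Llama-3.1-8B-Instruct", 25.39, 228),
    Entry("Qwen/Qwen2.5-Coder-7B-Instruct", 26.43, 529),
    Entry("Qwen/Qwen2-VL-7B-Instruct", 42.50, 216),
    Entry("microsoft/Phi-3.5-mini-instruct", 18.41, 716),
    Entry("deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", 74.43, 134),
    Entry("google/gemma-2-9b-it", 30.05, 385),
]


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is no worse than b on both axes and strictly better on one.

    Assumed dominance rule; the site does not state how it handles ties.
    """
    no_worse = (
        a.ttft_p50_ms <= b.ttft_p50_ms
        and a.throughput_tok_per_s >= b.throughput_tok_per_s
    )
    strictly_better = (
        a.ttft_p50_ms < b.ttft_p50_ms
        or a.throughput_tok_per_s > b.throughput_tok_per_s
    )
    return no_worse and strictly_better


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep every entry that no other entry dominates (O(n^2) scan)."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


frontier = pareto_frontier(ENTRIES)
print([e.model for e in frontier])
# → ['microsoft/Phi-3.5-mini-instruct']
```

Under these assumptions only microsoft/Phi-3.5-mini-instruct survives, because it has both the lowest TTFT P50 (18.41 ms) and the highest throughput (716 tok/s) and therefore dominates every other row. That matches the single P marker in the list.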