llm.inference.chatbot-short
9 entries.
Pareto frontier computed on throughput_tok_per_s (higher is better) vs. ttft_p50_ms (lower is better).
Rows marked P are on the frontier.
| Frontier | Model | Engine | Hardware | Quant | TTFT P50 (ms) | TTFT P99 (ms) | Throughput (tok/s) | $/M tokens | J/token | Power avg (W) | Power peak (W) | WER mean | J / audio s | Envelope |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | mistralai/Mistral-7B-Instruct-v0.3 | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 20.77 | 87.92 | 472 | — | 1.88 | 901 | 937 | — | — | JSON |
| | meta-llama/Llama-3.1-70B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 46.74 | 1,333 | 195 | — | 10.07 | 2,003 | 2,153 | — | — | JSON |
| | Qwen/Qwen2.5-7B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 26.84 | 52.98 | 541 | — | 1.65 | 909 | 948 | — | — | JSON |
| | meta-llama/Llama-3.1-8B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 25.39 | 1,678 | 228 | — | 3.26 | 758 | 943 | — | — | JSON |
| | Qwen/Qwen2.5-Coder-7B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 26.43 | 74.97 | 529 | — | 1.69 | 905 | 944 | — | — | JSON |
| | Qwen/Qwen2-VL-7B-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 42.50 | 198 | 216 | — | 3.83 | 837 | 857 | — | — | JSON |
| P | microsoft/Phi-3.5-mini-instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 18.41 | 265 | 716 | — | 1.19 | 862 | 898 | — | — | JSON |
| | deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 74.43 | 521 | 134 | — | 5.94 | 808 | 855 | — | — | JSON |
| | google/gemma-2-9b-it | vllm 0.21.0 | 8x NVIDIA H100 80GB HBM3 | fp16 | 30.05 | 184 | 385 | — | 2.31 | 901 | 938 | — | — | JSON |
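
The frontier marking can be reproduced from the two metrics named above. The following is a minimal sketch, not the leaderboard's own code: it assumes the standard dominance rule (an entry is dominated if another entry is no worse on both metrics and strictly better on at least one), since the page does not state how ties are handled. The names `Entry`, `dominates`, and `pareto_frontier` are hypothetical; the numbers are copied from the table.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    ttft_p50_ms: float            # lower is better
    throughput_tok_per_s: float   # higher is better


# TTFT P50 and throughput values copied from the table above.
ENTRIES = [
    Entry("mistralai/Mistral-7B-Instruct-v0.3", 20.77, 472),
    Entry("meta-llama/Llama-3.1-70B-Instruct", 46.74, 195),
    Entry("Qwen/Qwen2.5-7B-Instruct", 26.84, 541),
    Entry("meta-llama/Llama-3.1-8B-Instruct", 25.39, 228),
    Entry("Qwen/Qwen2.5-Coder-7B-Instruct", 26.43, 529),
    Entry("Qwen/Qwen2-VL-7B-Instruct", 42.50, 216),
    Entry("microsoft/Phi-3.5-mini-instruct", 18.41, 716),
    Entry("deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", 74.43, 134),
    Entry("google/gemma-2-9b-it", 30.05, 385),
]


def dominates(a: Entry, b: Entry) -> bool:
    """Return True if `a` is no worse than `b` on both metrics and strictly better on one.

    Assumed tie rule: entries with identical metrics do not dominate each other.
    """
    no_worse = (
        a.throughput_tok_per_s >= b.throughput_tok_per_s
        and a.ttft_p50_ms <= b.ttft_p50_ms
    )
    strictly_better = (
        a.throughput_tok_per_s > b.throughput_tok_per_s
        or a.ttft_p50_ms < b.ttft_p50_ms
    )
    return no_worse and strictly_better


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep every entry that no other entry dominates (O(n^2), fine for 9 rows)."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


frontier = pareto_frontier(ENTRIES)
print([e.model for e in frontier])
# → ['microsoft/Phi-3.5-mini-instruct']
```

With these nine rows the frontier collapses to a single entry, because microsoft/Phi-3.5-mini-instruct has both the lowest TTFT P50 (18.41 ms) and the highest throughput (716 tok/s) and therefore dominates every other row. That matches the single P in the table.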