llm.inference.chatbot-short

12 entries. The Pareto frontier is computed on throughput_tok_per_s (the Throughput (tok/s) column; higher is better) vs. ttft_p50_ms (the TTFT P50 (ms) column; lower is better). Rows marked P in the first column are on the frontier.

Frontier	Model	Engine	Hardware	Quant	TTFT P50 (ms)	TTFT P99 (ms)	Throughput (tok/s)	$/M tokens	J/token	Power avg (W)	Power peak (W)	WER mean	J/audio s	Envelope
	mistralai/Mistral-7B-Instruct-v0.3	vllm 0.21.0	8x NVIDIA H100 80GB HBM3	fp16	20.77	87.92	472	—	1.88	901	937	—	—	JSON
	meta-llama/Llama-3.1-70B-Instruct	vllm 0.21.0	8x NVIDIA H100 80GB HBM3	fp16	46.74	1,333	195	—	10.07	2,003	2,153	—	—	JSON
	Qwen/Qwen2.5-7B-Instruct	vllm 0.21.0	8x NVIDIA H100 80GB HBM3	fp16	26.84	52.98	541	—	1.65	909	948	—	—	JSON
	meta-llama/Llama-3.1-8B-Instruct	vllm 0.21.0	8x NVIDIA H100 80GB HBM3	fp16	25.39	1,678	228	—	3.26	758	943	—	—	JSON
	Qwen/Qwen2.5-Coder-7B-Instruct	vllm 0.21.0	8x NVIDIA H100 80GB HBM3	fp16	26.43	74.97	529	—	1.69	905	944	—	—	JSON
	Qwen/Qwen2-VL-7B-Instruct	vllm 0.21.0	8x NVIDIA H100 80GB HBM3	fp16	42.50	198	216	—	3.83	837	857	—	—	JSON
P	microsoft/Phi-3.5-mini-instruct	vllm 0.21.0	8x NVIDIA H100 80GB HBM3	fp16	18.41	265	716	—	1.19	862	898	—	—	JSON
	Qwen/Qwen2.5-72B-Instruct	vllm 0.22.1	8x NVIDIA H100 80GB HBM3	bf16	24.21	33.59	56.47	—	37.18	2,112	2,189	—	—	JSON
P	Qwen/Qwen2.5-72B-Instruct	vllm 0.22.1	8x NVIDIA H100 80GB HBM3	bf16	47.83	85.24	891	—	2.56	2,287	2,312	—	—	JSON
	Qwen/Qwen2.5-72B-Instruct	vllm 0.22.1	8x NVIDIA H100 80GB HBM3	bf16	46.34	48.40	234	—	9.32	2,184	2,198	—	—	JSON
	deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct	vllm 0.21.0	8x NVIDIA H100 80GB HBM3	fp16	74.43	521	134	—	5.94	808	855	—	—	JSON
	google/gemma-2-9b-it	vllm 0.21.0	8x NVIDIA H100 80GB HBM3	fp16	30.05	184	385	—	2.31	901	938	—	—	JSON
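
The frontier rule stated above the table can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual implementation. It assumes the usual dominance definition: a row is dropped when some other row is at least as good on both metrics and strictly better on one. The `Entry` type, the function names, and the `#1`/`#2`/`#3` suffixes used to tell the three Qwen2.5-72B rows apart (in table order) are hypothetical. The numbers are copied from the TTFT P50 and Throughput columns.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    label: str                   # model name; "#n" suffixes are illustrative only
    ttft_p50_ms: float           # lower is better
    throughput_tok_per_s: float  # higher is better


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = (a.ttft_p50_ms <= b.ttft_p50_ms
                and a.throughput_tok_per_s >= b.throughput_tok_per_s)
    strictly_better = (a.ttft_p50_ms < b.ttft_p50_ms
                       or a.throughput_tok_per_s > b.throughput_tok_per_s)
    return no_worse and strictly_better


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Keep every entry that no other entry dominates (O(n^2), fine for 12 rows)."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


# (label, TTFT P50 in ms, throughput in tok/s), in table order.
ENTRIES = [
    Entry("mistralai/Mistral-7B-Instruct-v0.3", 20.77, 472),
    Entry("meta-llama/Llama-3.1-70B-Instruct", 46.74, 195),
    Entry("Qwen/Qwen2.5-7B-Instruct", 26.84, 541),
    Entry("meta-llama/Llama-3.1-8B-Instruct", 25.39, 228),
    Entry("Qwen/Qwen2.5-Coder-7B-Instruct", 26.43, 529),
    Entry("Qwen/Qwen2-VL-7B-Instruct", 42.50, 216),
    Entry("microsoft/Phi-3.5-mini-instruct", 18.41, 716),
    Entry("Qwen/Qwen2.5-72B-Instruct #1", 24.21, 56.47),
    Entry("Qwen/Qwen2.5-72B-Instruct #2", 47.83, 891),
    Entry("Qwen/Qwen2.5-72B-Instruct #3", 46.34, 234),
    Entry("deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", 74.43, 134),
    Entry("google/gemma-2-9b-it", 30.05, 385),
]

frontier = pareto_frontier(ENTRIES)
for e in frontier:
    print(f"P\t{e.label}\t{e.ttft_p50_ms}\t{e.throughput_tok_per_s}")
```

Under these assumptions the sketch selects the same two rows that carry the P marker: microsoft/Phi-3.5-mini-instruct and the 891 tok/s Qwen/Qwen2.5-72B-Instruct run. Every other row is dominated by the Phi-3.5 entry, which has both the lowest TTFT P50 and a higher throughput than each of them.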