We’ve been frustrated by how scattered LLM benchmarking has become: endless Reddit threads, conflicting posts, and inconsistent metrics across GPUs and inference engines.
So we built Inference Arena, an open benchmarking hub where you can:
- Discover and compare inference results for open models across vLLM, SGLang, Ollama, MLX, and LM Studio
- See performance trade-offs for quantized versions
- Analyze throughput, latency, and cost side by side across hardware setups
- Explore benchmark data interactively or access it programmatically via MCP (Model Context Protocol)
You can also use it in Agent Mode, where an agent can search, analyze, and compare results (and even fetch new ones from the web or subreddits).
We’d love your feedback on:
- Which metrics matter most for your workflows (TTFT, TPS, memory, cost?)
- Other engines or quantization methods you’d like to see
- How we can make the data more useful for real-world inference tuning
MCP URL: https://mcp-api-production-44d1.up.railway.app/
GitHub: https://github.com/firstbatchxyz/inference-arena
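For programmatic access, here's a minimal sketch of connecting to the MCP endpoint with the official MCP Python SDK (`pip install mcp`). It assumes the server speaks MCP's streamable HTTP transport at the URL above (the exact endpoint path may differ), and the tool name and arguments in the commented call are hypothetical placeholders; check the `list_tools()` output for what the server actually exposes.

```python
# Minimal sketch: connect to the Inference Arena MCP server and list its tools.
# Assumes the streamable HTTP transport; the tool name/arguments in call_tool()
# are hypothetical placeholders, not the server's real schema.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

MCP_URL = "https://mcp-api-production-44d1.up.railway.app/"

async def main() -> None:
    async with streamablehttp_client(MCP_URL) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()

            # Discover what the server actually exposes.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

            # Hypothetical call: swap in a real tool name and arguments
            # from the list_tools() output above.
            # result = await session.call_tool(
            #     "search_benchmarks", {"model": "llama-3.1-8b"}
            # )
            # print(result)

asyncio.run(main())
```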