I built LLMKit after getting frustrated trying to pick the right LLM for different projects. Instead of guessing or relying on benchmarks that don't match real use cases, I wanted to see actual performance on my own prompts.
What it does:
• Compare up to 5 models simultaneously (GPT-4, Claude, Gemini, etc.)
• Real-time streaming comparison: watch models race to respond
• Custom scoring weights based on your priorities (speed vs. cost vs. quality); there's a rough sketch of the scoring idea below
• System prompt support for production-realistic testing
• TTFT (time to first token) metrics for latency-sensitive apps
• No signup required; API keys stay in your browser
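For the weighted scoring, this is the general idea; the field names, the 0-10 quality rating, and the normalization here are illustrative assumptions, not LLMKit's exact formula:

```typescript
// Sketch: combine normalized speed/cost/quality into one score per model.
// Field names and scales are hypothetical, for illustration only.
interface ModelRun {
  model: string;
  ttftMs: number;        // time to first token
  costUsd: number;       // cost of the completion
  qualityRating: number; // 0-10, e.g. from a manual rating
}

interface Weights {
  speed: number;
  cost: number;
  quality: number;
}

// Normalize each metric to 0..1 across the compared runs, then apply the user's weights.
function score(runs: ModelRun[], w: Weights): Array<{ model: string; score: number }> {
  const maxTtft = Math.max(...runs.map((r) => r.ttftMs)) || 1;
  const maxCost = Math.max(...runs.map((r) => r.costUsd)) || 1;
  return runs.map((r) => ({
    model: r.model,
    score:
      w.speed * (1 - r.ttftMs / maxTtft) +   // faster is better
      w.cost * (1 - r.costUsd / maxCost) +   // cheaper is better
      w.quality * (r.qualityRating / 10),    // higher rating is better
  }));
}
```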
The "aha moment" was adding streaming comparison - seeing GPT-4 start fast but Claude catch up, or watching cost-effective models perform surprisingly well. It's like A/B testing but for LLMs.
Built with Next.js + TypeScript. The streaming implementation was the tricky part: each provider formats its SSE events differently (OpenAI vs. Anthropic), and the app has to keep several connections streaming in parallel.
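Here's a minimal sketch of what that normalization and parallel streaming can look like, assuming browser fetch + ReadableStream. The delta paths (choices[0].delta.content for OpenAI chat chunks, content_block_delta / delta.text for Anthropic message streams) reflect how those streaming formats commonly look, but treat this as illustrative rather than LLMKit's actual code; endpoints and request bodies are left to the caller:

```typescript
// Sketch: normalize streaming deltas across providers and race them in parallel.
type Provider = "openai" | "anthropic";

interface StreamResult {
  provider: Provider;
  text: string;
  ttftMs: number | null; // time to first token
  totalMs: number;
}

// Pull the text delta out of one parsed SSE payload, per provider format.
function extractDelta(provider: Provider, payload: any): string {
  if (provider === "openai") {
    // OpenAI chat completion chunks: choices[0].delta.content
    return payload.choices?.[0]?.delta?.content ?? "";
  }
  // Anthropic message streams: content_block_delta events carry delta.text
  if (payload.type === "content_block_delta") {
    return payload.delta?.text ?? "";
  }
  return "";
}

async function streamCompletion(
  provider: Provider,
  url: string,
  init: RequestInit,
  onDelta: (chunk: string) => void
): Promise<StreamResult> {
  const start = performance.now();
  let ttftMs: number | null = null;
  let text = "";

  const res = await fetch(url, init);
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE frames are newline-separated; data lines start with "data: ".
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep any partial line for the next read

    for (const line of lines) {
      if (!line.startsWith("data: ") || line.includes("[DONE]")) continue;
      const delta = extractDelta(provider, JSON.parse(line.slice(6)));
      if (delta) {
        if (ttftMs === null) ttftMs = performance.now() - start; // first token seen
        text += delta;
        onDelta(delta);
      }
    }
  }

  return { provider, text, ttftMs, totalMs: performance.now() - start };
}

// Kick off several streams at once; each onDelta callback can update its own UI pane.
async function compareModels(
  requests: Array<{ provider: Provider; url: string; init: RequestInit }>
): Promise<StreamResult[]> {
  return Promise.all(
    requests.map((r) =>
      streamCompletion(r.provider, r.url, r.init, (d) => console.log(r.provider, d))
    )
  );
}
```

Promise.all keeps the streams running concurrently, so each pane updates as its model's tokens arrive rather than waiting for the slowest response.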