Over the last few months, I noticed a massive problem: developers (including me) are lazy. We were sending every single prompt, even basic JSON extractions, to GPT-4o or Claude 3.5 Sonnet, and my API bills were skyrocketing.
So I built an AI gateway to fix this. It acts as a drop-in replacement for your OpenAI endpoint: when a request comes in, a tiny, fast classifier scores the prompt's complexity in a few milliseconds and picks an LLM based on that score, cutting costs by around 30%.
- Simple extraction and formatting: routes to Llama 3 8B or Gemini Flash (costs almost nothing).
- Complex reasoning: routes to GPT-4o/Claude (costs dollars).
- Semantic cache: if the exact same question was asked in the last 5 minutes, it serves the cached response instantly (costs zero).
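To make the flow concrete, here's a minimal sketch of the routing and caching logic. Everything here is an assumption for illustration: the real gateway presumably uses a trained classifier rather than this crude keyword/length heuristic, and `call_llm`, the model names, and the thresholds are all hypothetical.

```python
import hashlib
import time

# Hypothetical stand-in for the gateway's fast complexity classifier.
# Scores a prompt from 0.0 (trivial) to 1.0 (complex) using a crude
# heuristic; the real service would use a small trained model instead.
def complexity_score(prompt: str) -> float:
    reasoning_markers = ("why", "explain", "prove", "compare", "step by step")
    score = min(len(prompt) / 2000, 0.5)  # longer prompts trend harder
    if any(m in prompt.lower() for m in reasoning_markers):
        score += 0.5
    return min(score, 1.0)

# Illustrative routing table; model names and the 0.5 cutoff are assumptions.
def pick_model(score: float) -> str:
    return "gpt-4o" if score >= 0.5 else "llama-3-8b"

# Minimal exact-match cache with a 5-minute TTL, keyed on a prompt hash.
_CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 300

def handle(prompt: str, call_llm) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    hit = _CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: no model call, costs zero
    answer = call_llm(pick_model(complexity_score(prompt)), prompt)
    _CACHE[key] = (time.time(), answer)
    return answer
```

With this shape, a plain extraction prompt falls below the threshold and goes to the cheap model, a "explain why, step by step" prompt crosses it and goes to the expensive one, and a repeated prompt within the TTL never leaves the cache.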
I'd love feedback on this. Happy to answer any questions!