InferShrink wraps your existing OpenAI/Anthropic/Google client in 3 lines. It classifies prompt complexity and routes to the cheapest model that can handle it. Same provider, no surprise switches.
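To make the "wrap in 3 lines" idea concrete, here is a hypothetical sketch: the post doesn't show the real API, so `shrink`, the method name `complete`, and the model names below are all assumptions, and the client is a stub so the example runs offline without the package.

```python
# Hypothetical usage sketch. `shrink` is an assumed wrapper name (not
# necessarily InferShrink's real API); the client is a stub standing in
# for a real OpenAI client so nothing here touches the network.

class FakeOpenAIClient:
    def complete(self, model: str, prompt: str) -> str:
        return f"[{model}] reply"

def shrink(client):
    """Assumed wrapper: intercepts calls and downgrades the model for
    simple prompts (the real routing logic lives inside InferShrink)."""
    inner = client.complete            # keep the original bound method
    def routed(model: str, prompt: str) -> str:
        if len(prompt) < 100:          # stand-in complexity check
            model = "gpt-4o-mini"      # cheaper tier, same provider
        return inner(model, prompt)
    client.complete = routed
    return client

client = shrink(FakeOpenAIClient())
print(client.complete("gpt-4o", "hi"))  # → [gpt-4o-mini] reply
```

The caller keeps using the same client object and method; only the model argument is rewritten on the way through.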
The pipeline: classify → compress (LLMLingua, optional) → retrieve (FAISS, optional) → route → track. With all stages combined, it delivers a 10x+ cost reduction on mixed workloads.
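The five stages can be sketched as stubs to show the control flow; everything below is illustrative (the real LLMLingua and FAISS calls are elided, and the model names are placeholders):

```python
# Stub sketch of the classify → compress → retrieve → route → track
# pipeline described above. Each function stands in for the real stage.

def classify(prompt: str) -> str:
    """Cheap stand-in complexity check (the real classifier is not shown)."""
    return "complex" if len(prompt) > 2000 else "simple"

def compress(prompt: str) -> str:
    """Optional LLMLingua stage, stubbed as a no-op here."""
    return prompt

def retrieve(prompt: str) -> list:
    """Optional FAISS retrieval stage, stubbed to return no context."""
    return []

def route(label: str) -> str:
    """Map the complexity label to the cheapest adequate model tier."""
    return {"simple": "cheap-model", "complex": "strong-model"}[label]

def track(model: str, prompt: str) -> None:
    """Usage/cost accounting, stubbed out."""
    pass

def handle(prompt: str):
    label = classify(prompt)
    prompt = compress(prompt)
    context = retrieve(prompt)
    model = route(label)
    track(model, prompt)
    return model, context

print(handle("Translate 'hello' to French"))  # → ('cheap-model', [])
```

Compression and retrieval only change the prompt and context; the cost savings come from the classify → route pair picking a cheaper tier.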
Key design decisions:
• Same-provider routing only: if you use OpenAI, it stays on OpenAI. No cross-provider surprises.
• Sub-millisecond classification overhead.
• Optional FAISS retrieval and LLMLingua compression for RAG pipelines.
• 539 tests; scanned with Semgrep and Trivy.
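The same-provider guarantee amounts to routing only within the caller's own tier table. A minimal sketch, assuming hypothetical tier tables (the model names are placeholders, not InferShrink's actual routing data):

```python
# Illustrative tier tables, cheapest model first. The invariant shown:
# the router indexes into the caller's own provider list and never
# crosses the provider boundary.

TIERS = {
    "openai":    ["gpt-4o-mini", "gpt-4o"],
    "anthropic": ["claude-haiku", "claude-sonnet"],
    "google":    ["gemini-flash", "gemini-pro"],
}

def pick_model(provider: str, complexity: int) -> str:
    """Pick the cheapest adequate model, clamped to the top tier."""
    tiers = TIERS[provider]
    return tiers[min(complexity, len(tiers) - 1)]

print(pick_model("anthropic", 0))  # → claude-haiku
print(pick_model("openai", 5))     # clamps to top tier → gpt-4o
```

Because `TIERS[provider]` is the only lookup, a misclassified prompt can at worst be sent to a pricier tier of the same provider, never to a different vendor.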
pip install infershrink
Blog post with the reasoning: https://musashimiyamoto1-cloud.github.io/infershrink-site/bl...
Happy to answer questions about the routing heuristics or compression tradeoffs.