We built a RAG system for enterprise clients and realized most production RAGs are optimization disasters. The literature obsesses over accuracy while completely ignoring unit economics.
## The Three Cost Buckets

**Vector Database (40-50% of the bill)**
Standard RAG pipelines make 3-5 unnecessary DB queries per question. We were making 5 round trips for what should've been about 1.5.
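For illustration, here's a minimal sketch of collapsing several retrieval calls into one request, with ChromaDB standing in for the unnamed vector store and the query-variant framing as an assumption, not the actual pipeline:

```python
# Sketch only: ChromaDB stands in for the vector DB, which isn't named above.
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("docs")

def retrieve_once(question_variants, n_results=20):
    # One request carries every query variant instead of one round trip each;
    # results come back grouped per variant.
    return collection.query(query_texts=question_variants, n_results=n_results)
```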
**LLM API (30-40%)**
Standard RAG pumps 8-15k tokens into the LLM. That's 5-10x more than necessary. We found that beyond 3,000 tokens of context, accuracy plateaus; everything past that point is noise and cost.
**Infrastructure (15-25%)**
Vector databases sitting idle, monitoring overhead, unnecessary load balancing.
## What Actually Moved the Needle

**Token-Aware Context (35% savings)**
Budget-based context assembly that stops once it has used enough tokens. Before: 12k tokens/query. After: 3.2k tokens. Same accuracy.
```python
def _build_context(self, results, settings):
    """Assemble retrieved chunks until the token budget is spent."""
    max_tokens = settings.get("max_context_tokens", 2000)
    current_tokens = 0
    context_parts = []
    for result in results:
        tokens = self.llm.count_tokens(result)
        # Stop once the next chunk would blow the budget
        if current_tokens + tokens > max_tokens:
            break
        context_parts.append(result)
        current_tokens += tokens
    return "\n\n".join(context_parts)
```

**Hybrid Reranking (25% savings)**
70% semantic + 30% keyword scoring. Better ranking means fewer chunks needed: top-20 → top-8 retrieval while maintaining quality.
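A minimal sketch of what a 70/30 blend can look like, assuming min-max normalized semantic and keyword scores; the names and inputs here are illustrative, not the production reranker:

```python
def hybrid_rerank(chunks, semantic_scores, keyword_scores,
                  top_k=8, w_semantic=0.7, w_keyword=0.3):
    """Blend semantic and keyword relevance, then keep only the top_k chunks."""
    def normalize(scores):
        # Min-max normalize so the two score scales are comparable
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

    sem, kw = normalize(semantic_scores), normalize(keyword_scores)
    blended = [w_semantic * s + w_keyword * k for s, k in zip(sem, kw)]
    ranked = sorted(zip(chunks, blended), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Feeding only the top 8 blended chunks into the token-budgeted context builder above is where the savings compound.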
**Embedding Caching (20% savings)**
Workspace-isolated cache with a 7-day TTL. We see a 45-60% hit rate intra-day.
```python
import hashlib, json

async def set_embedding(self, text, embedding, workspace_id=None):
    # Workspace-isolated key; sha256 keeps keys stable across processes
    key = f"embedding:ws_{workspace_id}:{hashlib.sha256(text.encode()).hexdigest()}"
    await self.redis.setex(key, 604800, json.dumps(embedding))  # 7-day TTL
```

**Batch Embedding (15% savings)**
Batch API pricing is 30-40% cheaper per token. Process 50 texts per request instead of individually.
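To show the grouping side of this (the discounted batch endpoint itself is provider-specific and not shown), here's a minimal sketch assuming the OpenAI embeddings client; the model name is illustrative:

```python
from openai import OpenAI

client = OpenAI()

def embed_in_batches(texts, batch_size=50, model="text-embedding-3-small"):
    """Embed texts 50 at a time instead of one request per text."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Embeddings come back in input order, one per text
        vectors.extend(item.embedding for item in response.data)
    return vectors
```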