I built a Ruby gem that caches LLM responses using semantic similarity.
If someone asks "What's the capital of France?" and later "What is France's
capital city?" — the second call hits the cache instead of the API.
How it works:
- Queries are converted to embeddings (text-embedding-3-small)
- Cosine similarity finds matches above a threshold (default 0.85)
- Cache hit = instant response, no API call, no cost
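The matching step above boils down to cosine similarity between embedding vectors. A minimal Ruby sketch of that check, not the gem's actual internals; embeddings are assumed to be plain arrays of floats, and the 0.85 cutoff matches the default mentioned above:

```ruby
# Cosine similarity: dot product divided by the product of vector norms.
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

THRESHOLD = 0.85 # default similarity cutoff from the post

# A stored response counts as a hit if its embedding is close enough
# to the incoming query's embedding.
def cache_hit?(query_embedding, cached_embedding)
  cosine_similarity(query_embedding, cached_embedding) >= THRESHOLD
end
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so the threshold is effectively a knob for how loosely "the same question" is interpreted.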
Usage is simple:
cache = SemanticCache.new

response = cache.fetch("What's the capital of France?") do
  openai.chat(messages: [{ role: "user", content: "..." }])
end

# This returns the cached response — no API call
response = cache.fetch("What is France's capital city?") do
  openai.chat(messages: [{ role: "user", content: "..." }])
end
Features:
- In-memory and Redis stores
- TTL expiry and tag-based invalidation
- Cost tracking with savings reports
- Works with OpenAI, Anthropic, Gemini
- Client wrapper that caches all calls automatically
- Rails integration (concern + per-user namespacing)
- Max cache size with automatic LRU eviction
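For anyone curious about the LRU eviction in the last bullet, here's a minimal sketch of how it can be done with Ruby's insertion-ordered Hash. This is an illustration under my own assumptions, not the gem's actual store:

```ruby
# Toy LRU store: the Hash's insertion order doubles as recency order.
# Reading or writing a key re-inserts it, moving it to the "newest" end;
# eviction drops the first (oldest) key when the size cap is exceeded.
class LruStore
  def initialize(max_size)
    @max_size = max_size
    @data = {}
  end

  def put(key, value)
    @data.delete(key)  # re-inserting moves the key to the newest position
    @data[key] = value
    @data.delete(@data.keys.first) if @data.size > @max_size # evict oldest
  end

  def get(key)
    return nil unless @data.key?(key)
    value = @data.delete(key) # touch: move to newest position
    @data[key] = value
  end
end
```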
In my testing, hit rates of 60-80% are typical for apps with
repetitive user queries (chatbots, search, FAQ tools).
The math: if you spend $500/mo on OpenAI and get a 70% hit rate,
that's $350/mo saved minus ~$2 in embedding costs.
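Spelled out in Ruby, using the post's own numbers (the ~$2 embedding figure is an estimate; the real cost depends on query volume and token counts):

```ruby
# Back-of-envelope savings: cached calls skip the completion API but
# every query still pays for one embedding lookup.
monthly_spend = 500.0  # $/mo on completion calls today
hit_rate      = 0.70   # fraction of calls served from cache

gross_saved = monthly_spend * hit_rate  # completions avoided
net_saved   = gross_saved - 2.0         # minus ~$2/mo of embedding costs

puts format("gross: $%.2f/mo, net: $%.2f/mo", gross_saved, net_saved)
```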
Repo: https://github.com/stokry/semantic-cache
Install: gem install semantic-cache