The problems we kept hitting without these tools:
* One endpoint dies, workflows stall
* No model unification, so routing isn't great
* No unified load balancing across boxes
* Limited visibility into what’s actually healthy
* Queries failing as a result of all of the above
* We'd love to merge them all into OpenAI-compatible, queryable endpoints
Olla fixes that - or tries to. It's a lightweight Go proxy that sits in front of Ollama, LM Studio, vLLM or other OpenAI-compatible backends/endpoints (there's a quick client-side sketch after the list) and:
* Auto-failover with health checks (transparent to callers)
* Model-aware routing (knows what’s available where)
* Priority-based, round-robin, or least-connections balancing
* Normalises model names across endpoints of the same provider, so they show up as one big list in, say, OpenWebUI
* Safeguards like circuit breakers, rate limits, size caps
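To make the single-endpoint idea concrete, here's a minimal Go sketch of the client side: an ordinary OpenAI-style chat request pointed at Olla instead of any one backend, with routing and failover happening behind that one URL. The host, port, route and model name here are placeholder assumptions for illustration, not Olla's documented defaults; check the docs linked below for the routes your deployment actually exposes.

    package main

    // Client-side sketch: talk to Olla as if it were a single
    // OpenAI-compatible endpoint. Failover, balancing and model
    // routing happen behind the one base URL.

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    func main() {
        // Placeholder base URL - substitute your Olla host/port and
        // the OpenAI-compatible route from the docs.
        base := "http://localhost:40114/olla/openai/v1"

        // Standard OpenAI-style chat completion payload; the model is
        // served by whichever backend actually has it loaded.
        body, err := json.Marshal(map[string]any{
            "model": "llama3.1:8b", // example model name
            "messages": []map[string]string{
                {"role": "user", "content": "Say hello from whichever box answers."},
            },
        })
        if err != nil {
            panic(err)
        }

        resp, err := http.Post(base+"/chat/completions", "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out map[string]any
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            panic(err)
        }
        fmt.Println(out)
    }

The point is that nothing in the caller changes when a backend goes down or a new one is added; you only ever configure the proxy's URL.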
We’ve been running it in production for months now, and a few other large orgs are using it too for local inference on on-prem Mac Studios and RTX 6000 rigs.
A few folks who use JetBrains Junie put Olla (https://thushan.github.io/olla/usage/#development-tools-juni...) in the middle so they can work from home or the office without reconfiguring each time (and possibly Cursor etc.).
You can see how Olla is complementary to tools like LiteLLM (https://thushan.github.io/olla/compare/litellm/) and others in our docs (https://thushan.github.io/olla/compare/overview/).
Links:
GitHub: https://github.com/thushan/olla
Docs: https://thushan.github.io/olla/
Olla is still very much in early development (v0.0.16).
Next up: auth support so it can also proxy to OpenRouter, GroqCloud, etc.
If you give it a spin, let us know how it goes (and what breaks). Oh yes, Olla does mean other things (https://thushan.github.io/olla/about/#the-name-olla).