Herd is a zero-dependency Go library that manages fleets of OS subprocesses and routes HTTP traffic to them with strict 1:1 session affinity.
If you put heavy, stateful binaries (like Ollama, headless Chromium, or Python REPLs) behind a standard reverse proxy and get a spike in traffic, it usually ends badly. You either trigger a massive CUDA/Metal context storm that OOM-kills the host machine, or you bleed state across different users' sessions.
Herd handles this without needing heavy infrastructure like Kubernetes StatefulSets or Firecracker microVMs. It gives you automatic process lifecycle management and a built-in reverse proxy in about 10 lines of Go.
How it works under the hood:
- It spawns OS-level subprocesses via exec.Cmd.
- It routes incoming HTTP traffic based on any custom Session ID you define (a header, a cookie, a path parameter).
- If a session exists, it routes to that exact pinned OS process.
- If it doesn't, it acquires a singleflight lock so that concurrent requests for the same new session trigger only one spawn, starts a new process, waits for its /health endpoint to respond, and then proxies the request.
- If a process crashes, the blast radius is contained to one session, and the pool auto-recovers.
To test the concurrency constraints, I hurled 200 concurrent LLM inference requests at a Herd gateway backed by a pool capped at 10 Ollama (Qwen3:0.6B) workers on an M4 Pro Mac. All 200 requests completed, with none dropped: the gateway acted as a backpressure queue, drip-feeding work to the capped pool without thrashing the host's unified memory.
It’s MIT licensed. Would love for you to check out the repo, try to break the singleflight lock, or review the architecture.
Repo: https://github.com/HackStrix/herd
Architecture & Mermaid Diagrams: https://github.com/HackStrix/herd/blob/main/docs/ARCHITECTUR...