- 100 workers all retry independently → retry storm
- Sequential fallbacks are slow (try OpenRouter, wait 5s, try Anthropic, wait 5s)
- No coordination across instances

So I built a coordination layer that:
- Races multiple providers simultaneously (OpenRouter + Anthropic + OpenAI)
- Coordinates retries across all workers (no retry storms)
- Resumes workflows via webhooks (idempotent keys = checkpoints)

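To make the "race" idea concrete, here is a minimal single-process sketch in Python. It is not the actual implementation (which runs on BEAM). `call_provider` and `race` are hypothetical names, and the provider calls are simulated with sleeps rather than real HTTP requests. It only shows the control flow: start every provider at once, take the first success, cancel the rest.

```python
import asyncio


async def call_provider(name: str, delay: float, fail: bool = False) -> str:
    """Hypothetical stand-in for a real provider request (simulated with a sleep)."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"response from {name}"


async def race(calls: dict) -> tuple[str, str]:
    """Run all provider calls concurrently; return (name, result) of the first success."""
    tasks = {asyncio.ensure_future(coro): name for name, coro in calls.items()}
    pending = set(tasks)
    errors = []
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            winner = None
            for task in done:
                exc = task.exception()
                if exc is not None:
                    # A failed provider does not end the race; keep waiting.
                    errors.append(f"{tasks[task]}: {exc}")
                elif winner is None:
                    winner = task
            if winner is not None:
                return tasks[winner], winner.result()
        raise RuntimeError("all providers failed: " + "; ".join(errors))
    finally:
        # Cancel whatever is still in flight once we have an answer (or gave up).
        for task in pending:
            task.cancel()


async def main() -> tuple[str, str]:
    return await race({
        "openrouter": call_provider("openrouter", 0.05, fail=True),
        "anthropic": call_provider("anthropic", 0.10),
        "openai": call_provider("openai", 0.30),
    })


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With the sequential approach, the failed first provider would cost its full timeout before the second one is even tried. Here the total latency is roughly that of the fastest provider that succeeds. This sketch covers one worker only; coordinating retries across many workers is the part that needs a shared layer.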
It runs on Fly.io's anycast network + BEAM for distributed coordination.

Architecture deep dive: https://www.ezthrottle.network/blog/making-failure-boring-ag...

Happy to answer questions about the approach or why BEAM made this possible when other languages would struggle.