tokio-prompt-orchestrator breaks LLM inference into 5 physical stages (RAG → Assemble → Inference → Post-Process → Stream), each running in its own Tokio task with bounded channels between them. When a stage falls behind, backpressure builds locally instead of blowing up the whole pipeline. Some things that might be interesting to folks here:
- Circuit breakers per provider (OpenAI, Anthropic, local llama.cpp) so one failing API doesn't cascade
- Request deduplication that saved 60-80% on inference costs in my testing
- Prometheus metrics + a TUI dashboard for watching the pipeline in real time
- MCP server integration so you can use it as a Claude Desktop tool
It's 58k lines of Rust, MIT licensed, no unsafe. I've been running it in production for my own projects for a few months now. I'd love feedback on the channel-sizing heuristics and the retry/backoff strategy; those were the hardest parts to get right. Happy to answer questions about the architecture.
GitHub: https://github.com/Mattbusel/tokio-prompt-orchestrator