We kept seeing teams reinvent similar patterns in slightly different ways, especially around correlating events, handling partial failures, and keeping the frontend in sync with what actually happened on the backend. The goal with this writeup was to make those tradeoffs explicit and show what’s actually happening on the wire in each approach.
Curious to hear how others here are handling long-lived or streaming AI requests in production, especially once things start failing in non-obvious ways.
Teams usually integrate it incrementally in front of existing calls. If you remove it, you’re mostly deleting the orchestration layer and keeping your provider integrations and client logic. You lose centralized retries and observability, but you’re not stuck rewriting your entire request model.
If adopting it requires a full rewrite, that’s usually a sign it’s being applied too broadly.
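As a rough illustration of what "in front of existing calls" can look like, here is a minimal sketch. The `orchestrated` helper, its parameters, and the log format are all hypothetical and not taken from any particular product; the point is only that the existing provider call stays untouched, so removing the layer means calling it directly again.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def orchestrated(
    call: Callable[[], T],
    *,
    max_attempts: int = 3,
    backoff_s: float = 0.0,
    log: Optional[List[str]] = None,
) -> T:
    """Run an existing provider call with centralized retries and a simple log.

    Hypothetical helper: `call` is whatever function you already use today.
    Only pre-request style retries are handled here (the whole call is redone).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            result = call()
        except Exception as exc:
            if log is not None:
                log.append(f"attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the real error
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
        else:
            if log is not None:
                log.append(f"attempt {attempt} ok")
            return result
    raise AssertionError("unreachable")
```

Usage would be something like `orchestrated(lambda: existing_client_call(prompt))`, where `existing_client_call` stands in for your current integration. Deleting the wrapper leaves that call exactly as it was.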
Queue-based async still works well for batch jobs, offline processing, or anything where latency and ordering aren’t user-visible. The event-driven approach mainly pays off once you have long-lived or interactive requests where failures can happen mid-response and you care about what the user actually sees.
The requests that “grow” tend to share a few signals early on: they stream partial results, they take long enough that the frontend needs progress updates, or failures start happening after something has already been shown to the user. Another common signal is when retries stop being transparent and you start needing to explain to users what actually happened.
Once those patterns show up, teams usually end up reworking the flow anyway. The event-driven approach just makes that lifecycle explicit earlier, instead of letting it emerge implicitly and painfully over time.
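To make "explicit lifecycle" a bit more concrete, here is a small sketch of a request modeled as a sequence of phases with allowed transitions. The phase names and the `failed_after_output` check are assumptions for illustration, not a fixed schema.

```python
from enum import Enum
from typing import Dict, List, Optional, Set


class Phase(Enum):
    ACCEPTED = "accepted"
    STREAMING = "streaming"   # at least one partial result has been emitted
    COMPLETED = "completed"
    FAILED = "failed"


# Which phase may follow which. None means "no event seen yet".
ALLOWED: Dict[Optional[Phase], Set[Phase]] = {
    None: {Phase.ACCEPTED},
    Phase.ACCEPTED: {Phase.STREAMING, Phase.COMPLETED, Phase.FAILED},
    Phase.STREAMING: {Phase.STREAMING, Phase.COMPLETED, Phase.FAILED},
    Phase.COMPLETED: set(),   # terminal
    Phase.FAILED: set(),      # terminal
}


def advance(current: Optional[Phase], nxt: Phase) -> Phase:
    """Return the new phase, or raise if the transition is not allowed."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} -> {nxt}")
    return nxt


def failed_after_output(history: List[Phase]) -> bool:
    """True when a failure arrived after something was already shown."""
    if Phase.FAILED not in history:
        return False
    return Phase.STREAMING in history[: history.index(Phase.FAILED)]
```

The second function captures the signal mentioned above: a failure before any output can be retried quietly, while a failure after streaming has started is something the user has to be told about.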
The main thing we try to avoid is pretending mid-stream retries are the same as pre-request retries. Once a stream has started, we treat it as a sequence of events with checkpoints rather than a single opaque response. Retries are scoped to known safe boundaries, and anything ambiguous is surfaced explicitly instead of silently re-emitting tokens.
In other words, we prioritize correctness over making the stream look seamless. If we can’t guarantee no duplication, we make that visible rather than hide it.
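A minimal sketch of the checkpoint idea, under stated assumptions: events carry a monotonically increasing `seq`, the producer emits `"checkpoint"` events at safe boundaries, and a retry resumes from the event after the last checkpoint. All names here (`Event`, `StreamState`, `resume_after_failure`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    seq: int          # position in the stream, strictly increasing
    kind: str         # "token" or "checkpoint"
    text: str = ""


@dataclass
class StreamState:
    committed: List[str] = field(default_factory=list)  # confirmed by a checkpoint
    pending: List[str] = field(default_factory=list)    # received since last checkpoint
    last_seq: int = -1
    last_checkpoint: int = -1
    notices: List[str] = field(default_factory=list)    # things surfaced to the user

    def apply(self, ev: Event) -> bool:
        """Apply one event. Returns False if it was dropped as a duplicate."""
        if ev.seq <= self.last_seq:
            return False  # already seen, e.g. replayed after a retry
        if ev.kind == "token":
            self.pending.append(ev.text)
        elif ev.kind == "checkpoint":
            self.committed.extend(self.pending)
            self.pending.clear()
            self.last_checkpoint = ev.seq
        self.last_seq = ev.seq
        return True

    def resume_after_failure(self) -> int:
        """Roll back to the last checkpoint and return the seq to resume from."""
        if self.pending:
            # Tokens after the checkpoint are ambiguous: say so instead of
            # silently re-emitting them.
            self.notices.append(
                f"discarded {len(self.pending)} unconfirmed token(s) "
                f"after checkpoint {self.last_checkpoint}"
            )
            self.pending.clear()
        self.last_seq = self.last_checkpoint
        return self.last_checkpoint + 1
```

The design choice worth noting is that the rollback produces a notice rather than hiding the gap. Whether the UI shows that notice or just re-renders from the committed text is a product decision, but the information is there either way.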
akarshc•1h ago
This page breaks down the three request patterns we see teams actually using in production (sync, async, and event-driven async), how data flows in each case, and why we ended up favoring an event-driven approach for interactive, streaming apps.
Happy to answer questions or go deeper on any part of the architecture.