It's yet another system that needs some DRAM though. The good news is that you can auto-expire the queued up responses pretty fast :shrug:
No idea if it's worth it, though. Someone with access to the statistics surrounding dropped connections/repeated prompts at a big LLM service provider would need to do some math.
I don't understand. The payload can be designed to carry a sequence number. On reconnect, the client sends the last sequence number it saw. This sounds like an application-level protocol problem, not a transport problem. Am I missing something?
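To make the sequence-number idea concrete, here is a minimal in-memory sketch. The names `SequencedStream`, `Client`, `read_after` and `receive` are made up for illustration and are not from the article or any library; a real service would put a network hop between the two classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SequencedStream:
    """Server side (hypothetical): each chunk gets the next sequence number."""
    chunks: list[str] = field(default_factory=list)

    def append(self, text: str) -> int:
        self.chunks.append(text)
        return len(self.chunks)  # sequence numbers start at 1

    def read_after(self, last_seq: int) -> list[tuple[int, str]]:
        """Return every (seq, chunk) pair with seq greater than last_seq."""
        return [(i + 1, c) for i, c in enumerate(self.chunks) if i + 1 > last_seq]


@dataclass
class Client:
    """Client side (hypothetical): remembers the last sequence number applied."""
    last_seq: int = 0
    text: str = ""

    def receive(self, seq: int, chunk: str) -> None:
        if seq <= self.last_seq:  # already applied, e.g. resent after a reconnect
            return
        self.text += chunk
        self.last_seq = seq
```

After a dropped connection the client would call `read_after(client.last_seq)` over whatever transport is available, which is why this looks like an application-level concern.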
The pub/sub mentioned in the article essentially does the same thing.
We need more frameworks like ADK that handle the bookkeeping and give us great abstractions for building agentic systems.
Yeah, it's a pretty common misunderstanding among us self-taught developers who never came across things like the OSI model (https://en.wikipedia.org/wiki/OSI_model) or similar: we confuse which layer things actually happen at.
[1] https://developer.mozilla.org/en-US/docs/Web/API/Server-sent...
Citation Needed.
More importantly, benchmarks needed.
You cannot claim that approach X is better than Y without benchmarks. It is an idea, but it needs to be proven better.
Until then, this post is nothing more than yet another opinion.
You can store the state in the SSE connection and have the problems described, and if you don't like those, you can move the state to something distributed/persisted.
Pubsub is just a layer on top of SSE or websockets, cause guess how it'd end up sending things to the browser
But it's not a gigantic improvement, as models don't regenerate "lost"/past parts of the conversation. Those are heavily cached and have been from pretty much day 1, which is why they have a highly reduced cost.
There's no solution other than to store the tokens somewhere, or drop them. You have to make a choice how long you want to allow reconnects for. And this is all pretty independent of the transport layer, as the author even mentioned themselves, you can resume even a new session as long as you have a prompt ID or something to tie it back to the original request.
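A rough sketch of the "store the tokens, but only for so long" choice, assuming an in-memory store keyed by a prompt ID. `ExpiringTokenStore`, its method names and the injectable `clock` are all hypothetical, chosen only to illustrate the trade-off.

```python
import time
from typing import Callable, Optional


class ExpiringTokenStore:
    """Hypothetical buffer: tokens per prompt ID, dropped after a TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def append(self, prompt_id: str, token: str) -> None:
        self._evict()
        _, tokens = self._entries.get(prompt_id, (0.0, []))
        tokens.append(token)
        # Each new token pushes the deadline out by one TTL.
        self._entries[prompt_id] = (self.clock() + self.ttl, tokens)

    def resume(self, prompt_id: str, offset: int = 0) -> Optional[list[str]]:
        """Return tokens from offset onward, or None if unknown or expired."""
        self._evict()
        entry = self._entries.get(prompt_id)
        return None if entry is None else entry[1][offset:]

    def _evict(self) -> None:
        now = self.clock()
        expired = [k for k, (deadline, _) in self._entries.items() if deadline <= now]
        for k in expired:
            del self._entries[k]
```

The TTL is the "how long do you allow reconnects" knob; nothing here depends on whether the tokens reach the client over SSE, WebSockets or polling.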
I don't know enough about how the LLM providers stream results, but the original claim that inference is more expensive than transport is a good point, and caching tokens seems like a smart move. Unfortunately, we pay by the token, so I don't see the incentive for providers to spend time and money doing this for us.
Providing a better service, for one. Plenty of providers do offer caching, both input and output tokens, and usually give you a cheaper price for it too. Example from two of them: https://platform.claude.com/docs/en/build-with-claude/prompt... & https://api-docs.deepseek.com/guides/kv_cache
https://github.com/durable-streams/durable-streams
https://electric-sql.com/blog/2025/12/09/announcing-durable-...
When we built ElectricSQL we needed a resumable and durable stream of messages for sync and developed a highly robust and scalable protocol for it. We have now taken that experience and are extracting the underlying transport as an open protocol. This is something the industry needs, and it's essential that it's a standard that is portable between providers, libraries and SDKs.
The idea is that a stream is a URL-addressable entity that can be read and tailed using a very simple HTTP protocol (long polling and an SSE-like mode). But it's fully resumable from a known offset.
We've been using the previous iteration of this as the transport part of the Electric sync protocol for the last 18 months. It's very well tested on servers, in the browser, and importantly in combination with CDNs. It's possible to scale this to essentially unlimited connections (we've tested to 1 million) by request collapsing in the CDN, and as it's so cacheable it lifts a lot of load off your origin when a client reconnects from the start.
For the LLM use case you will be able to append messages/tokens directly to a stream via an HTTP POST (we're working on specifying a WebSocket write path) and the client just tails it. If the user refreshes the page it will just read back from the start and continue tailing the live session. This avoids appending tokens to a database in order to provide durability.
But, as the OP suggested, none of these are resumable. So we have created a pub/sub over HTTP that does exactly this.
The way it works is that you get a unique URL that is specific to the current authenticated session. You can request data as many times as you like from that URL and it will get streamed to your client from the last checkpoint.
It works well even with rudimentary tools like curl. There is practically no protocol of sorts required - the kind of implementation I really like because who wants more complications in their life, right?
The big issue with building conversational and even agentic AI on off-the-shelf frameworks, rather than on a platform such as ours, is that there is a lot of plumbing involved to get these functionalities going, which makes even a basic setup a lot more complicated depending on the kind of platform you are deploying to.
You can have RPC prompt(input) return an opaque cursor, and promptNext(cursor) which returns partial output and the next cursor. As appropriate, the client could specify the desired size of output chunks along with input. The server could advertise or have documented the grace period for exhausting the cursors, but that’s not strictly necessary as it could be indicated in a promptNext(cursor) call that failed because of timeout. Transport session reuse can be handled automatically by HTTP.
When the client receives output from a promptNext call, it can work on updating the UI while the next promptNext is dispatched in the background.
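Here is a small sketch of that prompt/promptNext shape, written as plain Python methods rather than real RPCs. `PromptService`, `CursorExpired` and the stand-in `_generate` are invented for illustration, and the whole output is produced up front purely to keep the example short; a real server would generate incrementally and expire cursors after its grace period.

```python
import secrets
from typing import Optional


class CursorExpired(Exception):
    """Raised when a cursor is unknown (in a real server, also when timed out)."""


class PromptService:
    def __init__(self, chunk_size: int = 8):
        self.chunk_size = chunk_size
        # cursor -> (full output, position of the next chunk)
        self._cursors: dict[str, tuple[str, int]] = {}

    def _generate(self, text: str) -> str:
        return text.upper()  # stand-in for model inference

    def _issue(self, output: str, pos: int) -> str:
        cursor = secrets.token_hex(8)  # opaque to the client
        self._cursors[cursor] = (output, pos)
        return cursor

    def prompt(self, text: str) -> str:
        """Start a generation and return the first cursor."""
        return self._issue(self._generate(text), 0)

    def prompt_next(self, cursor: str) -> tuple[str, Optional[str]]:
        """Return (chunk, next cursor); the next cursor is None at the end."""
        if cursor not in self._cursors:
            raise CursorExpired(cursor)
        output, pos = self._cursors[cursor]
        chunk = output[pos:pos + self.chunk_size]
        end = pos + len(chunk)
        next_cursor = self._issue(output, end) if end < len(output) else None
        return chunk, next_cursor
```

Because a cursor is not consumed when it is read, a client that lost the response can call `prompt_next` again with the same cursor and get the same chunk back.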
A server would keep a buffer of events and a cursor keyed by some kind of event id. Resuming clients would reattach on a new dissociated connection and ask to resume from a particular id.
This is what the id field in events and the last-event-id are for in the SSE specification.
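For illustration, a minimal sketch of the server half: a buffer that frames events with an `id:` field and replays everything after the value a reconnecting client sends in `Last-Event-ID`. The `id:`/`data:` framing follows the SSE format; `EventBuffer`, `replay` and `format_sse` are hypothetical names, and the integer IDs are an assumption (the spec only requires a string).

```python
from typing import Iterator, Optional


def format_sse(event_id: int, data: str) -> str:
    """Frame one event; multi-line data becomes one data: field per line."""
    lines = [f"id: {event_id}"]
    lines += [f"data: {line}" for line in data.split("\n")]
    return "\n".join(lines) + "\n\n"  # a blank line terminates the event


class EventBuffer:
    """Hypothetical in-memory buffer of events keyed by increasing integer ID."""

    def __init__(self) -> None:
        self._events: list[tuple[int, str]] = []

    def publish(self, data: str) -> int:
        event_id = len(self._events) + 1
        self._events.append((event_id, data))
        return event_id

    def replay(self, last_event_id: Optional[str]) -> Iterator[str]:
        """Yield frames newer than the Last-Event-ID header value, if any."""
        try:
            last = int(last_event_id) if last_event_id else 0
        except ValueError:
            last = 0  # unparseable header: replay from the start
        for event_id, data in self._events:
            if event_id > last:
                yield format_sse(event_id, data)
```

A request handler would pass the incoming `Last-Event-ID` header (or `None` on a first connection) to `replay` and write the frames to the response before switching to live events.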
A pub sub system would also not solve the described unidirectional problem.
As inference is so expensive it seems natural to add a sensible stream buffer solution in some middleware and that’s trivial to write. Hell it’s the kind of thing a one paragraph prompt would almost certainly one-shot.
The post seems to be more of a reaction to some specific situation than a meaningful commentary on the protocol or associated architectures.
ivan_gammel•2h ago
riskable•2h ago
That's why LLM outputs that get cut off mid-stream require the end user to click the "retry" button and not the "re-send me that last output" button (which doesn't exist).
I would imagine that a simpler approach would be to simply make the last prompt idempotent... Which would require caching on their servers; something that supposedly isn't happening right now. That way, if the user re-sends the last prompt the server just responds with the same exact output it just generated. Except LLMs often make mistakes and hallucinate things... So re-sending the last prompt and hoping for a better output isn't an uncommon thing.
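One way to sketch the idempotent version while keeping "retry for a better answer" possible is to key the cache on a client-chosen request ID instead of the prompt text. That keying is my assumption, not something any provider is confirmed to do, and `IdempotentResponder` is a made-up name.

```python
from typing import Callable


class IdempotentResponder:
    """Hypothetical wrapper: same request ID returns the same cached output."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate  # stand-in for the expensive inference call
        self._cache: dict[str, str] = {}

    def respond(self, request_id: str, prompt: str) -> str:
        # Only run inference the first time this request ID is seen.
        if request_id not in self._cache:
            self._cache[request_id] = self._generate(prompt)
        return self._cache[request_id]
```

A client recovering from a dropped connection would re-send with the same request ID, while a deliberate "regenerate" would use a fresh one.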
Soooo... Back to my suggested workaround in my other comment: Pub/sub over WebSockets :D
debazel•1h ago
bjt•1h ago
ivan_gammel•1h ago
imsh4yy•2h ago
ivan_gammel•1h ago
verdverm•45m ago
ivan_gammel•6m ago