SSE sucks for transporting LLM tokens

https://zknill.io/posts/sse-sucks-for-transporting-llm-tokens/

38•zknill•2mo ago

Comments

ivan_gammel•1mo ago

I don’t get it. The client generates a UUID for the prompt and PUTs the prompt to the server under this UUID. The server caches the generated output for a reasonable time, so that subsequent PUTs get 200 instead of 201. Transport protocol failures then do not matter. If the response isn’t a 4xx, just retry.
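A minimal sketch of that idempotent-PUT idea, assuming an in-memory cache and a stand-in generator function. `PromptStore`, `put`, and `fake_llm` are hypothetical names, and cache expiry is left out for brevity:

```python
import uuid
from typing import Callable, Dict, List, Tuple


class PromptStore:
    """In-memory stand-in for the server-side output cache (hypothetical)."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate          # placeholder for the real LLM call
        self._cache: Dict[str, str] = {}   # prompt UUID -> generated output

    def put(self, prompt_id: str, prompt: str) -> Tuple[int, str]:
        """Model PUT /prompts/{prompt_id}; returns (status code, body)."""
        if prompt_id in self._cache:
            # A retry of a request already served: no new inference.
            return 200, self._cache[prompt_id]
        output = self._generate(prompt)
        self._cache[prompt_id] = output
        return 201, output


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)      # count how often "inference" actually runs
    return prompt.upper()


store = PromptStore(fake_llm)
prompt_id = str(uuid.uuid4())
first = store.put(prompt_id, "hello")   # first attempt
retry = store.put(prompt_id, "hello")   # client retries after a dropped connection
```

The retry returns the cached body with a different status code, and the generator runs only once.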

riskable•1mo ago

The way the current architecture works—as far as I know—is your assumed "server caches the generated output" step doesn't exist. What you get in your output is streamed directly from the LLM to your client. Which is, in theory, the most efficient way to do it.

That's why LLM outputs that get cut off mid-stream require the end user to click the "retry" button and not the "re-send me that last output" button (which doesn't exist).

I would imagine that a simpler approach would be to simply make the last prompt idempotent... Which would require caching on their servers; something that supposedly isn't happening right now. That way, if the user re-sends the last prompt the server just responds with the same exact output it just generated. Except LLMs often make mistakes and hallucinate things... So re-sending the last prompt and hoping for a better output isn't an uncommon thing.

Soooo... Back to my suggested workaround in my other comment: Pub/sub over WebSockets :D

debazel•1mo ago

But adding caching to SSE is trivial compared to completely changing your transfer protocol, so why wouldn't you just do that instead?

bjt•1mo ago

The user's last prompt can be sent with an idempotency key that changes each time the user initiates a new request. If that's the same, use the cache. If it's new, hit the LLM again.

ivan_gammel•1mo ago

The only reason an LLM server responds with partial results instead of waiting and returning everything at once is UX. It’s just too slow. But the problem of slow bulk responses isn’t unique to LLMs and can be solved within HTTP/1.1 well enough. It doesn’t have to be the same server; it can be a caching proxy in front of it. Any privacy concerns can be addressed by giving the user the opportunity to tell the server to cache or not to cache (it can be as easy as submitting with PUT vs POST requests).

imsh4yy•1mo ago

Yep, came here expecting to read an interesting take on why SSE sucks or a better alternative, but this just reads like "skill issue." A term I very much dislike but seems appropriate here.

ivan_gammel•1mo ago

A significant part of relatively new technology stacks and tech slang is “skill issue”. A lot of problems were already solved or at least analyzed 20-40 years ago and hardly need to be re-invented, maybe just modernized.

verdverm•1mo ago

It's likely better to just store them in the database, it could be hours or days later that I want to look at the session again. If you're going to want a database for conversation anyway, then it doesn't make sense to cache and query individual messages

ivan_gammel•1mo ago

A database can be the implementation of choice for the cache. But not all use cases do require long-term storage like that.

verdverm•1mo ago

yup, the ADK framework I use handles all that bookkeeping for me, has a few options built in, and pretty easy to add new implementations. I'm extending it to attach filesys+exe changes during a session, persisted as container layers via Dagger

riskable•1mo ago

Pub/sub via WebSockets seems like the simplest solution. You'll need to change your LLM serving architecture around a little bit to use a pub/sub system that a microservice can grab the output from (to send to the client) but it's not rocket science.

It's yet another system that needs some DRAM though. The good news is that you can auto-expire the queued up responses pretty fast :shrug:

No idea if it's worth it, though. Someone with access to the statistics surrounding dropped connections/repeated prompts at a big LLM service provider would need to do some math.

nightshift1•1mo ago

I think it would be even more wasteful to continue inference in background for nothing if the user decided to leave without pressing the stop button. Saving the partial answer at the exact moment the client disappeared would be better.

verdverm•1mo ago

What if I want to have the agent go off and work on something for a while and I'll check back tomorrow?

bragh•1mo ago

Corporate security hates websockets though, SSE is much easier for end-users to get approved.

anonymoushn•1mo ago

so sad to hear that about Streaming SIMD Extensions

devnull3•1mo ago

> the model has to re-run the generation, and the client has to start receiving tokens from scratch again.

I don't understand. The payload can be designed to have a sequence number. In case of a reconnect, send the last known sequence number. Sounds like an application-level protocol problem and not a transport one. Am I missing something?

The pub/sub mentioned in the article essentially does the same thing.
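A rough sketch of the sequence-number approach, assuming the server keeps the chunks it has already produced. `ResumableReader`, `stream_from`, and the `(seq, token)` payload shape are illustrative assumptions, not something from the article:

```python
from typing import Iterable, Iterator, List, Tuple

Chunk = Tuple[int, str]  # (sequence number, token text)


class ResumableReader:
    """Client-side bookkeeping: remember the last sequence number seen."""

    def __init__(self) -> None:
        self.last_seq = 0            # 0 means nothing received yet
        self.tokens: List[str] = []

    def consume(self, chunks: Iterable[Chunk]) -> None:
        for seq, token in chunks:
            if seq <= self.last_seq:
                continue             # duplicate delivered after a reconnect
            self.tokens.append(token)
            self.last_seq = seq


def stream_from(history: List[Chunk], after_seq: int) -> Iterator[Chunk]:
    """Server side: yield only chunks the client has not seen yet."""
    return (chunk for chunk in history if chunk[0] > after_seq)


history = [(1, "Hel"), (2, "lo"), (3, " wor"), (4, "ld")]
reader = ResumableReader()
reader.consume(history[:2])                              # connection drops after seq 2
reader.consume(stream_from(history, reader.last_seq))    # reconnect and resume
text = "".join(reader.tokens)
```

Nothing here depends on the transport; the same bookkeeping works over SSE, WebSockets, or plain polling.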

medbrane•1mo ago

Indeed, the reconnect behavior is described in the protocol and the server will simply resume from the requested sequence id.

petcat•1mo ago

The blog author is confusing SSE, the protocol itself, with how the application is typically implemented. SSE is great and can trivially be implemented in a way that allows history, catch-up, and resuming. The "Pub/Sub" mentioned at the end is the exact application of SSE that the author wants.

verdverm•1mo ago

Exactly, I use ADK (https://google.github.io/adk-docs/runtime/) and it handles all that bookkeeping for me. If my client disconnects, the engine continues to run until the turn is up, recording all events.

We need more frameworks like ADK that handle the bookkeeping and give us great abstractions for building agentic systems.

embedding-shape•1mo ago

If you don't have a proper grasp of what the transport is, what the protocol is, and what your application protocol is, I think chucking in libraries to try to help often makes things too complicated. You would still need to understand the differences and nuances.

verdverm•1mo ago

yup, I have recently started saying "building blocks over batteries included"

particularly as it comes to people trying to sell me on their Agent "framework", which amounts to little more than some well built tools and prompts, but pigeon holes me into how they think about solving certain issues in the agentic space, based on how things work today. If I go out 2 years, do I have to wait for the "framework" to realize their ideas are now out-of-touch and wait for them to course correct, or have I selected a framework that allows me to easily experiment, evaluate, and adjust any technique, with an ecosystem of building blocks for both the provider and user side of what I am building

embedding-shape•1mo ago

> The blog author is confusing SSE the protocol itself, with how the application is typically implemented

Yeah, pretty common misunderstanding among us self-taught developers who never came across things like the OSI model (https://en.wikipedia.org/wiki/OSI_model) or similar, so we confuse what layer things actually happen at.

normie3000•1mo ago

What is SSE?

devnull3•1mo ago

Server Sent Events [1]

[1] https://developer.mozilla.org/en-US/docs/Web/API/Server-sent...

inesranzo•1mo ago

> A better approach: Pub/Sub

Citation Needed.

More importantly, benchmarks needed.

You cannot claim that X is a better approach than Y without benchmarks. It is an idea, but it needs to be proven better.

Until then, this post is nothing more than yet another opinion.

tedivm•1mo ago

SSE just sucks for most use cases, we don't have to go through each one pointing it out.

sauercrowd•1mo ago

Seems like there's a few abstractions mixed up, the problems have nothing to do with SSE.

You can store the state in the SSE connection and have the problems described, and if you don't like those, you can move the state to something distributed/persisted.

Pubsub is just a layer on top of SSE or websockets, cause guess how it'd end up sending things to the browser

aguynamedben•1mo ago

Yeah I didn't really get that... PubSub is more of a design pattern... you still have to get the data transported to the browser (via WebSockets, SSE, etc.)

bjt•1mo ago

Weird take. The id field in the SSE spec is there specifically so you can resume a stream. And that requires persistence/caching on the server side. Once you have those things, you're practically at the pubsub option that the article prefers.

the_mitsuhiko•1mo ago

Precisely. In fact SSE as a protocol was specifically designed to support resuming. It’s unfortunate that most APIs don’t support that but that’s not the fault of SSE.

wyldfire•1mo ago

Not x86_64's Streaming SIMD Extensions, but Server-sent events [1]. SSE and AVX are probably not that bad at /handling/ LLM tokens...

[1] https://en.wikipedia.org/wiki/Server-sent_events

FuckButtons•1mo ago

SSE is ancient and limited to 128-bit-wide registers without any native 16-bit float ops or 8/4-bit ints. It’s definitely going to be very slow.

wyldfire•1mo ago

And yet it's an improvement over the scalar core.

mirekrusin•1mo ago

Response API does support resuming.

But it's not a gigantic improvement, as models don't regenerate "lost"/past parts of the conversation; those are heavily cached and have been from pretty much day 1, which is why they have highly reduced cost.

nixpulvis•1mo ago

Author tries to avoid a database for storing tokens while the client is disconnected and ends up storing them in a pub/sub provider.

There's no solution other than to store the tokens somewhere, or drop them. You have to make a choice how long you want to allow reconnects for. And this is all pretty independent of the transport layer, as the author even mentioned themselves, you can resume even a new session as long as you have a prompt ID or something to tie it back to the original request.

I don't know enough about how the LLM providers stream results, but the original claim that inference is more expensive than transport is a good point, and caching tokens seems like a smart move. Unfortunately, we pay by the token, so I don't see the incentive for providers to spend time and money doing this for us.

embedding-shape•1mo ago

> Unfortunately, we pay by the token, so I don't see the incentive for providers to spend time and money doing this for us.

Providing a better service, for one. Plenty of providers do offer caching, both input and output tokens, and usually give you a cheaper price for it too. Example from two of them: https://platform.claude.com/docs/en/build-with-claude/prompt... & https://api-docs.deepseek.com/guides/kv_cache

nixpulvis•1mo ago

I feel like it's slightly different to cache duplicate parts of the input, vs storing outputs when a connection drops.

FuckButtons•1mo ago

It seems like a good use case for a caching layer. It seems like you would probably be able to make a set up for agentic systems more simply / cheaply in Hetzner than trying to cobble together a bunch of fragmented apis.

binarymax•1mo ago

Websockets are great for this. Initialize the completion with an id, and store the token stream while emitting as a websocket event. Keep the id in state somewhere on the client (url works great with pushstate). If the client disconnects then just replay the event stream for the id.

samwillis•1mo ago

This is one of the main use cases we are building "Durable Streams" for, it's an open source spec for a resumable and durable stream protocol. It's essentially an append only log with a http api.

https://github.com/durable-streams/durable-streams

https://electric-sql.com/blog/2025/12/09/announcing-durable-...

When we built ElectricSQL we needed a resumable and durable stream of messages for sync and developed a highly robust and scalable protocol for it. We have now taken that experience and are extracting the underlying transport as an open protocol. This is something the industry needs, and it's essential that it's a standard that's portable between providers, libraries and SDKs.

The idea is that a stream is a URL-addressable entity that can be read and tailed, using a very simple HTTP protocol (long polling and an SSE-like mode). But it's fully resumable from a known offset.

We've been using the previous iteration of this as the transport part of the electric sync protocol for the last 18 months. It's very well tested, both on servers and in the browser, but importantly in combination with CDNs. It's possible to scale this to essentially unlimited connections (we've tested to 1 million) by request collapsing in the CDN, and as it's so cacheable it lifts a lot of load off your origin when a client reconnects from the start.

For the LLM use case you will be able to append messages/tokens directly to a stream via a http post (we're working on specifying a websocket write path) and the client just tails it. If the user refreshes the page it will just read back from the start and continue tailing the live session. Avoids appending tokens to a database in order to provide durability.
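To illustrate the general shape of "an append-only log that is resumable from a known offset", here is a small in-memory sketch. It is not the Durable Streams protocol or its API; `AppendOnlyLog`, `append`, and `read` are hypothetical names chosen for the example:

```python
from typing import List, Tuple


class AppendOnlyLog:
    """Entries are only ever appended; readers resume from an offset."""

    def __init__(self) -> None:
        self._entries: List[bytes] = []

    def append(self, payload: bytes) -> int:
        """Append a payload and return the offset just past it."""
        self._entries.append(payload)
        return len(self._entries)

    def read(self, offset: int = 0) -> Tuple[List[bytes], int]:
        """Return entries at or after `offset` plus the next offset to request."""
        entries = self._entries[offset:]
        return entries, offset + len(entries)


log = AppendOnlyLog()
log.append(b"Hel")
log.append(b"lo")
chunk, next_offset = log.read(0)          # initial read from the start
log.append(b"!")                          # writer keeps appending meanwhile
tail, tail_offset = log.read(next_offset) # reader resumes where it left off
replayed, _ = log.read(0)                 # a page refresh re-reads from the start
```

Because a read at a given offset always returns the same prefix, responses of this kind are easy to cache, which matches the CDN point made above.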

aeciorc•1mo ago

Weird post. Do dropped connections happen often enough that your users are inconvenienced about it? If not, why overcomplicate? SSE + polling for cached output as plan b is just fine for most people.

_pdp_•1mo ago

We ended up using JSONL for our implementation but it is also possible to switch using other "streamable" variants like SSE and even CSV if that is what you are into.

But, as the OP suggested, none of these are resumable. So we have created a pub/sub over HTTP that does exactly this.

The way it works is that you get a unique URL that is specific to the current authenticated session. You can request data as many times as you like from that URL and it will get streamed to your client from the last checkpoint.

It works well even with rudimentary tools like curl. There is practically no protocol of sorts required - the kind of implementation I really like because who wants more complications in their life, right?

The big issue with building conversational and even agentic AI with off-the-shelf frameworks, rather than using a platform such as ours, is that there is a lot of plumbing involved to get these functionalities going, which makes even a basic setup a lot more complicated depending on the kind of platform you are deploying to.

michaelsbradley•1mo ago

The concept of database cursor, or more generally continuation token, has been around a long time. Jumping to pub/sub is missing the mark.

You can have RPC prompt(input) return an opaque cursor, and promptNext(cursor) which returns partial output and the next cursor. As appropriate, the client could specify the desired size of output chunks along with input. The server could advertise or have documented the grace period for exhausting the cursors, but that’s not strictly necessary as it could be indicated in a promptNext(cursor) call that failed because of timeout. Transport session reuse can be handled automatically by HTTP.

When client receives output from a promptNext call it can work on updating the UI while promptNext is dispatched again in the background.
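A toy version of the cursor idea, assuming the output is available as a list of tokens. `CursorServer` is a hypothetical name, `prompt_next` stands in for `promptNext`, and splitting the input on whitespace is only a placeholder for model output:

```python
import secrets
from typing import Dict, List, Optional, Tuple


class CursorServer:
    """Hands out opaque cursors; each cursor maps to a position in the output."""

    def __init__(self, chunk_size: int = 2) -> None:
        self._chunk_size = chunk_size
        self._cursors: Dict[str, Tuple[List[str], int]] = {}

    def _issue(self, tokens: List[str], position: int) -> str:
        cursor = secrets.token_hex(8)   # opaque to the client
        self._cursors[cursor] = (tokens, position)
        return cursor

    def prompt(self, text: str) -> str:
        tokens = text.split()           # placeholder for generated tokens
        return self._issue(tokens, 0)

    def prompt_next(self, cursor: str) -> Tuple[List[str], Optional[str]]:
        """Return the next chunk and a new cursor, or None when exhausted."""
        tokens, position = self._cursors[cursor]   # KeyError models an expired cursor
        end = position + self._chunk_size
        chunk = tokens[position:end]
        next_cursor = self._issue(tokens, end) if end < len(tokens) else None
        return chunk, next_cursor


server = CursorServer(chunk_size=2)
first_cursor = server.prompt("a b c d e")
received: List[str] = []
cursor: Optional[str] = first_cursor
while cursor is not None:
    chunk, cursor = server.prompt_next(cursor)
    received.extend(chunk)
retried_chunk, _ = server.prompt_next(first_cursor)   # retry after a failed call
```

In this sketch an old cursor stays valid, so a call that failed in transit can simply be repeated and returns the same chunk. A real server would expire cursors after the grace period mentioned above.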

raggi•1mo ago

Resumable “at least once” pub sub would be exactly the same architectural semantics as a more complete SSE implementation.

A server would keep a buffer of events and a cursor keyed by some kind of event id. Resuming clients would reattach on a new dissociated connection and ask to resume from a particular id.

This is what the id field in events and the last-event-id are for in the SSE specification.
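As a sketch of how that could look on the server, assuming integer event ids and a bounded in-memory buffer: `format_sse` writes the `id:` and `data:` fields of the `text/event-stream` format, and `replay` takes the value of the `Last-Event-ID` request header. `EventBuffer` and both function names are hypothetical:

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def format_sse(event_id: int, data: str) -> str:
    """Serialize one event; a blank line terminates it."""
    data_lines = "".join(f"data: {line}\n" for line in data.split("\n"))
    return f"id: {event_id}\n{data_lines}\n"


class EventBuffer:
    """Bounded buffer of recent events keyed by event id (illustrative)."""

    def __init__(self, max_events: int = 1024) -> None:
        self._events: Deque[Tuple[int, str]] = deque(maxlen=max_events)
        self._next_id = 1

    def append(self, data: str) -> int:
        event_id = self._next_id
        self._events.append((event_id, data))
        self._next_id += 1
        return event_id

    def replay(self, last_event_id: Optional[str]) -> List[str]:
        """Return serialized events newer than the Last-Event-ID header value."""
        after = int(last_event_id) if last_event_id else 0
        return [format_sse(i, d) for i, d in self._events if i > after]


buffer = EventBuffer()
for token in ["a", "b", "c"]:
    buffer.append(token)
resumed = buffer.replay("1")    # client reconnects having seen event 1
fresh = buffer.replay(None)     # a new client gets everything still buffered
```

The `maxlen` on the buffer is the knob for how long reconnects are allowed; events that have fallen out of it cannot be replayed.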

A pub sub system would also not solve the described unidirectional problem.

As inference is so expensive it seems natural to add a sensible stream buffer solution in some middleware and that’s trivial to write. Hell it’s the kind of thing a one paragraph prompt would almost certainly one-shot.

The post seems to be more of a reaction to some specific situation than a meaningful commentary on the protocol or associated architectures.

cheald•1mo ago

There's absolutely nothing wrong with SSE. The problem is that the author is, apparently, stapling SSE directly onto streaming LLM responses, using the LLM inference provider as the "message queue", and then complaining that it doesn't provide seekability and resumption. Things that aren't MQs make pretty bad MQs. The problem isn't the message bus, it's trying to treat an LLM as a message queue.

Use an MQ if you want MQ things. Worse, the conclusion is wrong - pubsub isn't an MQ, though it often uses them. Pubsub is a pattern for message delivery, which by itself, makes no inherent guarantees about message durability or ordering; Rails' ActionCable, for example, utilizes the PubSub pattern (and in fact is built on Redis' pubsub feature!), but due to its implementation, it makes no delivery order guarantees, which is completely inappropriate for LLM inference.
