WebSocket+Huffman vs. SSE+JSON for streaming LLM tokens

https://github.com/vidur2/token_entropy_encoder

1•vidur2•3h ago

Comments

I built a proof-of-concept that streams LLM tokens as Huffman-compressed binary over WebSocket instead of JSON text over SSE.

The Problem: Current LLM APIs (OpenAI, Anthropic, self-hosted) send decoded text wrapped in JSON. For every token, you get something like this OpenAI-style event: `data: {"choices":[{"delta":{"content":"hello"}}]}`. The framing is much larger than the token text it carries, which wastes bandwidth, and the server has to decode token IDs to text before sending, which costs CPU.
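To make the overhead concrete, here is a minimal sketch that counts the bytes in one SSE event shaped like the example above. The function names are mine, not from the repo, and the exact count depends on which fields a given API includes.

```typescript
// Count the bytes one SSE event carries for a single token.
// The frame layout mirrors the example in the post; real APIs usually add more fields.
function sseFrameBytes(content: string): number {
  const payload = JSON.stringify({ choices: [{ delta: { content } }] });
  // An SSE event is "data: <payload>" terminated by a blank line ("\n\n").
  const frame = `data: ${payload}\n\n`;
  return new TextEncoder().encode(frame).length;
}

// UTF-8 size of the token text alone, i.e. the useful part of the frame.
function textBytes(content: string): number {
  return new TextEncoder().encode(content).length;
}

const token = "hello";
const total = sseFrameBytes(token);
const useful = textBytes(token);
console.log(`frame: ${total} bytes, token text: ${useful} bytes, overhead: ${total - useful} bytes`);
```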

The Solution: Stream raw token IDs as binary. The server sends Huffman-compressed token IDs over WebSocket, and the client decodes them locally using WASM. This offloads token decoding from server to client.
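For anyone who wants the encoding idea in one place, here is a rough TypeScript sketch of Huffman coding over token IDs. It is not the repo's Rust/WASM code: the frequency table is made up, the bit strings are for readability rather than speed, and passing the bit length alongside the bytes is my assumption about how a decoder would know where the padding starts.

```typescript
// Huffman coding over token IDs: an illustrative sketch, not the repository's implementation.
type HuffNode =
  | { kind: "leaf"; tokenId: number; weight: number }
  | { kind: "inner"; left: HuffNode; right: HuffNode; weight: number };

// Build the tree by repeatedly merging the two lightest nodes.
function buildTree(freqs: Array<[number, number]>): HuffNode {
  if (freqs.length < 2) throw new Error("need at least two distinct token IDs");
  const nodes: HuffNode[] = freqs.map(
    ([tokenId, weight]): HuffNode => ({ kind: "leaf", tokenId, weight })
  );
  while (nodes.length > 1) {
    // Sort descending so the two lightest nodes sit at the end of the array.
    nodes.sort((a, b) => b.weight - a.weight);
    const right = nodes.pop()!;
    const left = nodes.pop()!;
    nodes.push({ kind: "inner", left, right, weight: left.weight + right.weight });
  }
  return nodes[0];
}

// Walk the tree to assign a bit string to every token ID (left = 0, right = 1).
function buildCodes(node: HuffNode, prefix = "", out = new Map<number, string>()): Map<number, string> {
  if (node.kind === "leaf") {
    out.set(node.tokenId, prefix);
  } else {
    buildCodes(node.left, prefix + "0", out);
    buildCodes(node.right, prefix + "1", out);
  }
  return out;
}

// Pack the codes for a token sequence into bytes, most significant bit first.
function encode(tokenIds: number[], codes: Map<number, string>): { bytes: Uint8Array; bitLength: number } {
  let bits = "";
  for (const id of tokenIds) {
    const code = codes.get(id);
    if (code === undefined) throw new Error(`token ${id} has no code`);
    bits += code;
  }
  const bytes = new Uint8Array(Math.ceil(bits.length / 8));
  for (let i = 0; i < bits.length; i++) {
    if (bits[i] === "1") bytes[i >> 3] |= 0x80 >> (i & 7);
  }
  return { bytes, bitLength: bits.length };
}

// Follow the tree bit by bit; emit a token ID at each leaf and restart from the root.
function decode(bytes: Uint8Array, bitLength: number, root: HuffNode): number[] {
  const out: number[] = [];
  let node = root;
  for (let i = 0; i < bitLength; i++) {
    if (node.kind !== "inner") throw new Error("tree must have at least two leaves");
    const bit = (bytes[i >> 3] >> (7 - (i & 7))) & 1;
    node = bit === 0 ? node.left : node.right;
    if (node.kind === "leaf") {
      out.push(node.tokenId);
      node = root;
    }
  }
  return out;
}

// Hypothetical frequencies: [token ID, count]. Frequent tokens get shorter codes.
const freqs: Array<[number, number]> = [[11, 50], [262, 30], [1917, 15], [50256, 5]];
const tree = buildTree(freqs);
const codes = buildCodes(tree);
const stream = [11, 11, 262, 11, 1917, 262, 50256, 11];
const packed = encode(stream, codes);
const roundTrip = decode(packed.bytes, packed.bitLength, tree);
console.log(`${packed.bytes.length} bytes for ${stream.length} tokens`, roundTrip);
```

With this toy table the eight tokens pack into 14 bits. Both sides need the same tree, which is consistent with the tokenizer being fixed at build time.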

Results from mock benchmarks:
- 30% faster for inline completions (the critical vibecoding use case)
- 25% faster for small completions (100 tokens)
- 12% faster overall average
- ~60% bandwidth savings (3 bytes/token vs. 8 bytes/token)
- Client-side decoding means servers can handle more concurrent users
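The bandwidth figure follows directly from the two per-token sizes. A quick sketch of the arithmetic (the helper name is hypothetical; the 8 and 3 are the numbers quoted above):

```typescript
// Relative bandwidth saving when the per-token payload shrinks from beforeBytes to afterBytes.
function savingsFraction(beforeBytes: number, afterBytes: number): number {
  if (beforeBytes <= 0) throw new Error("beforeBytes must be positive");
  return (beforeBytes - afterBytes) / beforeBytes;
}

// Figures quoted in the post: 8 bytes/token (SSE+JSON) vs. 3 bytes/token (WebSocket+Huffman).
const saving = savingsFraction(8, 3);
console.log(`${(saving * 100).toFixed(1)}% saved`); // prints "62.5% saved"

// Bytes saved over a 100-token completion at those rates.
const savedPer100Tokens = (8 - 3) * 100; // 500
console.log(`${savedPer100Tokens} bytes saved per 100 tokens`);
```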

Architecture:

WebSocket+Huffman: LLM → Token IDs → Huffman encode → WebSocket (binary) → WASM decode → Text

SSE+JSON: LLM → Token IDs → Decode to text → JSON → SSE (HTTP) → Parse → Text
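Here is roughly what the client ends of the two pipelines look like in TypeScript. This is a sketch under assumptions: the vocabulary is a two-entry toy table, and `fixedWidthDecoder` is a stand-in I made up (32-bit little-endian IDs) for the slot where the project's Huffman/WASM decoder would go. Real tokenizers work on bytes, so joining string pieces like this is a simplification.

```typescript
// SSE + JSON path: each event line already carries decoded text inside JSON.
function textFromSseLine(line: string): string {
  const prefix = "data: ";
  if (line.indexOf(prefix) !== 0) return ""; // comment line or another SSE field
  const event = JSON.parse(line.slice(prefix.length));
  return event.choices?.[0]?.delta?.content ?? "";
}

// Binary path: the frame carries token IDs and the client maps them to text.
type IdDecoder = (frame: Uint8Array) => number[];

function textFromBinaryFrame(frame: Uint8Array, decodeIds: IdDecoder, vocab: Map<number, string>): string {
  return decodeIds(frame)
    .map((id) => {
      const piece = vocab.get(id);
      if (piece === undefined) throw new Error(`unknown token id ${id}`);
      return piece;
    })
    .join("");
}

// Stand-in decoder (assumption): fixed-width 32-bit little-endian token IDs.
// In the project this is where the Huffman decoder compiled to WASM would sit.
const fixedWidthDecoder: IdDecoder = (frame) => {
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  const ids: number[] = [];
  for (let offset = 0; offset + 4 <= frame.byteLength; offset += 4) {
    ids.push(view.getUint32(offset, true));
  }
  return ids;
};

// Toy vocabulary (hypothetical): token ID -> text piece.
const vocab = new Map<number, string>([[1, "hel"], [2, "lo"]]);
const frame = new Uint8Array([1, 0, 0, 0, 2, 0, 0, 0]); // IDs 1 and 2
const viaBinary = textFromBinaryFrame(frame, fixedWidthDecoder, vocab);
const viaSse = textFromSseLine('data: {"choices":[{"delta":{"content":"hello"}}]}');
console.log(viaBinary, viaSse, viaBinary === viaSse);
```

In a browser the binary path would hang off a `WebSocket` with `binaryType = "arraybuffer"`, wrapping each message's data in a `Uint8Array` before decoding.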

Tech Stack: Rust (WASM for encoder/decoder), TypeScript (test harness), Node.js (mock servers). Includes comprehensive benchmarks comparing both protocols on identical workloads.

Limitations:
- Requires modifying the LLM server to expose token IDs (standard APIs don't do this)
- Tokenizer is baked in at build time (`./build.sh <tokenizer_name>`), so you can't switch models dynamically
- Mock server only, no real LLM integration yet
- VS Code extension is non-functional (command registration issues)
- Best for self-hosted deployments where you control the stack

The VS Code extension code is included but doesn't work yet. The benchmarks and Node.js examples are the working parts that demonstrate the approach.

Why it matters:
- Protocol-level thinking for LLM APIs (not just server scaling)
- Suggests that a binary protocol plus client-side decoding can beat traditional HTTP/JSON, at least in these mock benchmarks
- Opens discussion about whether LLM APIs should expose token IDs

Built this in ~3K LOC. Fully open source (MIT).

Try it: https://github.com/vidur2/token_entropy_encoder

Looking for feedback on the approach, potential issues, and whether this is worth pursuing further!
