
Show HN: Stateful LLM inference (no cost for input tokens, not prompt-caching)

2•arkonrad•1h ago
Hi HN,

I’ve been frustrated for a while with how LLM inference works in the cloud today. Every API call starts from scratch: you resend your entire prompt + conversation history, and you’re charged for every input token, even if the model has already “seen” that context before.

This leads to two big problems:

Performance & cost – constantly resending input tokens is wasteful.

Quality loss – because the state is rebuilt on a new GPU each time, the model loses internal state (e.g. its KV cache and reasoning traces) beyond the raw text you resend.
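To see how the resend cost compounds, here is a back-of-the-envelope sketch (illustrative numbers, not measured figures from any provider):

```python
# Illustrative only: cumulative input tokens billed over a multi-turn chat
# when the full history is resent on every call (stateless) versus a
# stateful session where each message is sent once.

def stateless_input_tokens(turn_sizes):
    """Each call resends the entire history so far, so billed input
    grows quadratically with the number of turns."""
    total, history = 0, 0
    for t in turn_sizes:
        history += t          # the new user message joins the history
        total += history      # the whole history is billed as input
    return total

def stateful_input_tokens(turn_sizes):
    """Each message crosses the wire exactly once -- or, in the
    zero-input-token model described above, is not billed at all."""
    return sum(turn_sizes)

turns = [500] * 20            # twenty 500-token user turns
print(stateless_input_tokens(turns))  # 105000
print(stateful_input_tokens(turns))   # 10000
```

For a twenty-turn conversation the stateless model bills over 10x the tokens actually typed, and the gap widens with every turn.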

Most “optimizations” offered in the industry are really just prompt-caching. That’s useful for cutting repeated input costs, but we’ve all seen the side-effects: outputs that don’t match subtle variations in the prompt, or the model confidently “jumping” to the wrong cached response because it thought your query was a near-duplicate.

We’re taking a different approach with ark-labs.cloud:

True stateful inference – when you start a session, all requests are processed on the same set of GPUs, and the full internal state of the model (prompt, history, reasoning traces) is preserved between calls.

Zero input token cost – because the model doesn’t need you to resend your input on each request. You pay only for generated output.

Better responses, not just cheaper ones – maintaining the internal state can improve consistency and reasoning quality, not just save money.

From a developer perspective, it’s simple: enable cookies, and the API will keep a session alive (ark_session_id). No SDK magic, no hacks. Sessions do expire after inactivity to free resources, but while they’re active, you’re talking to a model that actually remembers internally, not just through string concatenation of prompts.

Docs https://ark-labs.cloud/documentation/
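In client terms the flow looks roughly like this (a minimal stdlib sketch; the endpoint URL and payload shape here are hypothetical placeholders -- see the docs above for the real ones):

```python
# Sketch of the client-side session flow. The only requirement is a
# cookie-aware HTTP client; the ark_session_id cookie set by the server
# is replayed automatically on every later request.
import http.cookiejar
import json
import urllib.request

API = "https://api.ark-labs.cloud/v1/chat/completions"  # hypothetical path

# A cookie-aware opener: it stores ark_session_id from the first response
# and attaches it to each subsequent request.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def chat_request(text):
    # Only the new turn goes over the wire -- no history concatenation.
    body = json.dumps({"messages": [{"role": "user", "content": text}]})
    return urllib.request.Request(
        API, data=body.encode(), headers={"Content-Type": "application/json"}
    )

req = chat_request("And then?")
# opener.open(req) would send it, replaying the session cookie.
```

The point of the sketch is that "no SDK magic" claim: any HTTP stack with a cookie jar gets session affinity for free.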

We’d love your thoughts — especially from those who’ve wrestled with the “why am I paying 10x for tokens I already sent” problem, or who’ve hit caching systems that mismatched prompts to outputs. Does this approach make sense to you?

Comments

NitpickLawyer•1h ago
> Most “optimizations” offered in the industry are really just prompt-caching. That’s useful for cutting repeated input costs, but we’ve all seen the side-effects: outputs that don’t match subtle variations in the prompt, or the model confidently “jumping” to the wrong cached response because it thought your query was a near-duplicate.

Perhaps you misspoke or misquoted some internal copy, but that doesn't mean what you think it means, and "caching" in KV caching doesn't mean what you imply here. KV caching reuses computation for an exact token prefix; the model doesn't "jump" to anything because of it.
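For readers following along, a toy model of why prefix caching can't serve the wrong answer (the cache key is the exact token prefix, so a one-token change is simply a cache miss):

```python
# Toy trie-based model of KV prefix caching: the cache is keyed on the
# exact token prefix. A prompt differing in even one token just misses
# the cache past that point; it can never be handed a "near-duplicate"
# prompt's output.

def insert_prefix(cache, tokens):
    """Record a prompt's tokens in the prefix trie."""
    node = cache
    for t in tokens:
        node = node.setdefault(t, {})

def longest_cached_prefix(cache, tokens):
    """Return how many leading tokens are already cached."""
    n, node = 0, cache
    for t in tokens:
        if t not in node:
            break
        node = node[t]
        n += 1
    return n

cache = {}
insert_prefix(cache, ["the", "quick", "brown", "fox"])
print(longest_cached_prefix(cache, ["the", "quick", "brown", "fox"]))  # 4
print(longest_cached_prefix(cache, ["the", "quick", "red", "fox"]))    # 2
```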

> From a developer perspective, it’s simple: enable cookies, and the API will keep a session alive

How is this related to LLM inference?! What are cookies doing there? What?

(from your docs) > OpenAI optimizes by processing every single request on randomly selected GPUs - but in the process most of the state is lost because only the final assistant reply is kept. Ark allows users to have a session during which all requests are processed on the same set of GPUs and the full internal state is maintained between requests. Depending on use case, this approach can improve both model's response quality and performance.

Yeah, except no. Every model builder so far has emphasised that this is not how you want to do it. With "thinking" models, you want to NOT include thinking steps from earlier messages, since that degrades the model's outputs.

----

If you want to convince people about a better way of doing things, when the entire industry is doing another thing, you have to come up with data supporting your stance. Can you show such data? Do you have qualitative studies / benchmarks on your methods? Can you show that whatever state you hold is actually helping? That would go against the current practices of every inference engine out there currently, so it would be quite a thing to show.

arkonrad•25m ago
On cookies: we use an HTTP cookie (ark_session_id) purely as an opaque session identifier. The cookie is how the client ties subsequent requests to the same pinned session/worker/GPUs on the provider side, so the provider can keep the model activations/state in GPU memory between calls. It's not magic for the model; it's a routing key that enables true session affinity.
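Schematically, the server side of that routing key looks something like this (a hypothetical sketch of session affinity, not ARK's actual implementation):

```python
# Session affinity sketch: the cookie value maps to a pinned worker, so
# every call in a session lands on the same GPUs and the model state in
# GPU memory stays warm. Worker names and the pinning policy here are
# invented for illustration.
import uuid

WORKERS = ["gpu-pool-a", "gpu-pool-b", "gpu-pool-c"]
sessions = {}  # ark_session_id -> pinned worker

def route(session_id=None):
    """Return (session_id, worker), creating and pinning a session
    when no valid session_id is supplied."""
    if session_id is None or session_id not in sessions:
        session_id = str(uuid.uuid4())
        # Pin the new session to the least-loaded worker (by session count).
        pinned = min(WORKERS, key=lambda w: sum(v == w for v in sessions.values()))
        sessions[session_id] = pinned
    return session_id, sessions[session_id]

sid, worker1 = route()       # first call: new session gets pinned
_, worker2 = route(sid)      # later calls: always the same worker
assert worker1 == worker2
```

Session expiry would simply delete the entry (and free the GPU state), after which the next request starts a fresh session.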

On “thinking steps” and contamination: good point – naively persisting raw chain-of-thought tokens can degrade outputs. The ARKLABS stateful approach is not a blanket “store everything” policy.

My criticism targets higher-level provider practices: things like response caching, aggressive prompt-matching/deduplication heuristics, or systems that return previously generated outputs when a new prompt is “similar enough.” Those high-level caches absolutely can produce the behaviour I described: a subtle prompt change that nevertheless gets routed to a cached reply.

The platform has launched and we’re still collecting data, but early results are very promising: per-conversation cost that scales linearly with conversation length (rather than quadratically from resending history), lower latency, and ~80% input-token savings. At the same time we’d love more feedback on whether this approach is useful in real-world projects.

And about going against the grain, as you mentioned at the end… well — if startups didn’t think differently from everyone else, what would be the point of being a startup?