Ask HN: What makes it so hard to keep LLMs online?

2•realberkeaslan•3h ago

It feels like every few days one of the big AI services is down, degraded, or just slow. I don't mean this as a complaint. I'm just genuinely curious. These are well-funded companies with smart people. What is it about running these models that makes reliability so elusive? Is it just demand nobody predicted, or is there something fundamentally different about serving AI vs. a normal web app?

Comments

zipy124•2h ago

One large contributor is likely that for a normal service, recovering from an outage is as simple as re-routing to another instance, and there is a practically unlimited number of CPU servers around the world to spin up on demand. GPU servers are much harder to spin up on demand because supply is so constrained.

Another factor is that it's a new field, and "move fast and break things" is still the go-to approach because competition is intense and the financial stakes are incredibly high.

A pessimistic but perhaps true theory is that vibe-coding/slop is reducing their reliability.

A counterpoint is that regular services like GitHub seem to go down almost as frequently.

arc_light•2h ago

A piece that often gets overlooked: a web app can cache aggressively and serve millions of users from relatively few servers. LLM inference is stateless in a weird way, in that each request is expensive and can't really be batched the way traditional workloads can. A spike in traffic doesn't just slow you down linearly; it creates queuing effects that cascade fast.

There's also the memory side. A large model has to live entirely in GPU VRAM to run efficiently. You can't just "add more RAM" on the fly the way you can with CPU workloads. Scaling means acquiring, provisioning, and loading entirely new physical machines, which takes minutes to hours, not seconds.
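
To put rough numbers on the VRAM point, here is a back-of-the-envelope sketch. The model size (70B parameters), 2 bytes per parameter, and 80 GB of VRAM per GPU are all hypothetical figures picked for illustration, not numbers from any specific provider. It also counts only the weights and ignores KV cache and activations, so real requirements would be higher.

```python
import math


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params, bytes_per_param, vram_gb_per_gpu):
    """Smallest GPU count whose combined VRAM holds the weights alone."""
    needed_gb = weight_memory_gb(num_params, bytes_per_param)
    return math.ceil(needed_gb / vram_gb_per_gpu)


if __name__ == "__main__":
    # Hypothetical example: 70B parameters at 16-bit precision on 80 GB GPUs.
    params = 70e9
    print(weight_memory_gb(params, 2), "GB of weights")
    print(min_gpus_for_weights(params, 2, 80), "GPUs minimum, weights only")
```

Under those assumptions a single replica already spans more than one GPU, so "adding capacity" means bringing up whole multi-GPU groups and loading the weights onto each of them.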

So you end up with a system that's simultaneously very expensive per-request, very hard to scale horizontally in real time, and very sensitive to traffic spikes. That's a reliability engineer's nightmare even before you factor in the supply constraints the sibling comment mentioned.
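
The non-linear part can be illustrated with the textbook M/M/1 queue, where mean time in the system is 1 / (service_rate - arrival_rate). This is a deliberate simplification (one server, random arrivals, no batching, no variable output lengths), and the service rate of 10 requests per second is a made-up number, so treat it as a sketch of the shape of the curve rather than a model of any real serving stack.

```python
def mm1_mean_latency(arrival_rate, service_rate):
    """Mean time in system (waiting + service) for an M/M/1 queue, in seconds.

    Returns infinity when the queue is unstable (arrivals >= capacity).
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    service_rate = 10.0  # hypothetical: one replica finishes 10 requests/s
    for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
        arrival_rate = utilization * service_rate
        latency = mm1_mean_latency(arrival_rate, service_rate)
        print(f"utilization {utilization:.2f}: mean latency {latency:.2f}s")
```

In this toy model, going from 50% to 90% utilization multiplies mean latency by five, and the last few percent before saturation are far worse. If you can't add capacity in seconds, a modest spike near saturation is enough to push latency past client timeouts.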

andyjohnson0•24m ago

Not sure why this was apparently flagged to death. Vouched.

angarrido•1h ago

Most people think it's just GPU cost. In practice it's coordination: model latency variance + queueing + retries under load. You don't scale linearly; you get cascading slowdowns.
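
A minimal sketch of the retry part, assuming each attempt fails independently with the same probability and clients retry immediately up to a fixed limit. Both assumptions are simplifications: under real overload, failures are correlated, which tends to make things worse than this shows. The function name and the numbers are illustrative only.

```python
def expected_attempts(failure_prob, max_retries):
    """Expected number of attempts per original request.

    Attempt k (0-based) only happens if the previous k attempts all failed,
    which has probability failure_prob ** k.
    """
    return sum(failure_prob ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    for failure_prob in (0.0, 0.1, 0.5, 0.9):
        load = expected_attempts(failure_prob, max_retries=3)
        print(f"failure rate {failure_prob:.1f}: {load:.3f}x offered load")
```

The point is the feedback loop: as timeouts push the failure rate up, retries multiply the offered load, which pushes latency and failures up further. That is one reason backoff with jitter and retry budgets are commonly used.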

