frontpage.

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
460•klaussilveira•6h ago•112 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
800•xnx•12h ago•483 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
154•isitcontent•7h ago•15 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
149•dmpetrov•7h ago•65 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
24•matheusalmeida•1d ago•0 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
48•quibono•4d ago•5 comments

A century of hair samples proves leaded gas ban worked

https://arstechnica.com/science/2026/02/a-century-of-hair-samples-proves-leaded-gas-ban-worked/
88•jnord•3d ago•10 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
259•vecti•9h ago•122 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
326•aktau•13h ago•157 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
199•eljojo•9h ago•128 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
321•ostacke•12h ago•85 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
405•todsacerdoti•14h ago•218 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
331•lstoll•13h ago•239 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
20•kmm•4d ago•1 comment

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
51•phreda4•6h ago•8 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
113•vmatsiiako•11h ago•36 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
192•i5heu•9h ago•140 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
150•limoce•3d ago•79 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
240•surprisetalk•3d ago•31 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
3•romes•4d ago•0 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
990•cdrnsf•16h ago•417 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
23•gfortaine•4h ago•2 comments

Make Trust Irrelevant: A Gamer's Take on Agentic AI Safety

https://github.com/Deso-PK/make-trust-irrelevant
7•DesoPK•1h ago•4 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
45•rescrv•14h ago•17 comments

I'm going to cure my girlfriend's brain tumor

https://andrewjrod.substack.com/p/im-going-to-cure-my-girlfriends-brain
61•ray__•3h ago•18 comments

Evaluating and mitigating the growing risk of LLM-discovered 0-days

https://red.anthropic.com/2026/zero-days/
36•lebovic•1d ago•11 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
78•antves•1d ago•57 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
5•gmays•2h ago•1 comment

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
40•nwparker•1d ago•10 comments

The Oklahoma Architect Who Turned Kitsch into Art

https://www.bloomberg.com/news/features/2026-01-31/oklahoma-architect-bruce-goff-s-wild-home-desi...
21•MarlonPro•3d ago•4 comments

We "solved" C10K years ago yet we keep reinventing it (2003)

https://www.kegel.com/c10k.html
108•birdculture•1mo ago

Comments

gnabgib•1mo ago
(2011 / 2003)

Title: The C10K problem

Popular in:

2014 (112 points, 55 comments) https://news.ycombinator.com/item?id=7250432

2007 (13 points, 3 comments) https://news.ycombinator.com/item?id=45603

trueismywork•1mo ago
With nginx and 256-core Epycs, most single servers can easily do 200k requests per second. Very few companies need more than that.
intothemild•1mo ago
I can't tell if this is sarcasm or not.

They didn't have this kind of compute back when the article was written. Which is the point of the article.

trueismywork•1mo ago
Half serious. I guess what I was saying is that this kind of science is still very useful, but more to nginx developers themselves. And most users now don't have to worry about this anymore.

Should have prefixed my comment with "nowadays".

hinkley•1mo ago
In spring 2005 Azul introduced a 24 core machine tuned for Java. A couple years later they were at 48 and then jumped to an obscene 768 cores which seemed like such an imaginary number at the time that small companies didn’t really poke them to see what the prices were like. Like it was a typo.
fweimer•1mo ago
Before clusters with fast interconnects were a thing, there were quite a few systems that had more than a thousand hardware threads: https://linuxdevices.org/worlds-largest-single-kernel-linux-...

We're slowly getting back to similarly-sized systems. IBM now has POWER systems with more than 1,500 threads (although I assume those are SMT8 configurations). This is a bit annoying because too many programs assume that the CPU mask fits into 128 bytes, which limits the CPU (hardware thread) count to 1,024. We fixed a few of these bugs twenty years ago, but as these systems fell out of use, similar problems are back.
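
(A minimal sketch of the fix being described, assuming Linux/glibc: allocate the affinity mask with CPU_ALLOC instead of relying on the fixed 128-byte cpu_set_t, so the code keeps working past 1,024 hardware threads.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long ncpus = sysconf(_SC_NPROCESSORS_CONF);  /* can exceed 1024 on large POWER/SGI-class boxes */
        cpu_set_t *set = CPU_ALLOC(ncpus);           /* dynamically sized mask, not the fixed cpu_set_t */
        size_t setsize = CPU_ALLOC_SIZE(ncpus);

        CPU_ZERO_S(setsize, set);
        CPU_SET_S(0, setsize, set);                  /* pin to CPU 0, purely as a demo */

        if (sched_setaffinity(0, setsize, set) != 0) /* pass the real size, not sizeof(cpu_set_t) */
            perror("sched_setaffinity");

        CPU_FREE(set);
        return 0;
    }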

alexjplant•1mo ago
> Driven by 1,024 Dual-Core Intel Itanium 2 processors, the new system will generate 13.1 TFLOPs (Teraflops, or trillions of calculations per second) of compute power.

This is equal to the combined single precision GPU and CPU horsepower of a modern MacBook [1]. Really makes you think about how resource-intensive even the simplest of modern software is...

[1] https://www.cpu-monkey.com/en/igpu-apple_m4_10_core

fweimer•1mo ago
Note that those 13.1 TFLOPs are FP64, which isn't supported natively on the MacBook GPU. On the other hand, local/per-node memory bandwidth is significantly higher on the MacBook. (Apparently, SGI Altix only had 8.5 to 12.8 GB/s.) Total memory bandwidth on larger Altix systems was of course much higher due to the ridiculous node count. Access to remote memory on other nodes could be quite slow because it had to go through multiple router hops.
hinkley•1mo ago
My Apple Watch can blow the doors off a Cray 1. It’s crazy.
marcosdumay•1mo ago
The article was written exactly because they had machines capable enough at the time. But the software worked against it on every level.
yencabulator•1mo ago
I mean, yes and no. It was a software challenge to hit the hardware limit, but the hardware limits were also much lower. My team stopped optimizing when we maxed out the PCI bus in ~2001.
Maxatar•1mo ago
I don't see how you could have read the article and come to this conclusion. The first few sentences of the article even go into detail about how a cheap $1200 consumer grade computer should be able to handle 10,000 concurrent connections with ease. It's literally the entire focus of the second paragraph.

2003 might seem like ancient history, but computers back then absolutely could handle 10,000 concurrent connections.

api•1mo ago
I’m shocked that a 256 core Epyc can’t do millions of requests per second at a minimum. Is it limited by the net connection or is there still this much inefficiency?
otterdude•1mo ago
256 Processes x 10k clients (per the article) = 256K RPS
mrweasel•1mo ago
Aren't you off by a zero? 10K requests per core per second, times 256 cores, is 2,560,000 RPS.

There's probably going to be some overhead, but it seems like you could do 1M, if you have the bandwidth.

cap11235•1mo ago
I think the most likely bottleneck is gonna be your NIC hating getting a ton of packets. Line rate with huge frames is quite different than line rate with just ICMP packets, for instance (see CME binary glink market data for a similarly stressful experience to the ICMP).
tempest_•1mo ago
Like anything, it really depends on what they are doing. If you wanted to just open and close a connection you might run into bottlenecks in other parts of the stack before the CPU tops out, but the real point is that, yeah, a single machine is going to be enough.
zipy124•1mo ago
It almost certainly can; even old Intel systems with dual 16-core CPUs could do four and a half million a second [1]. At a certain point, though, network/kernel bottlenecks become apparent rather than compute limits.

[1]: https://www.amd.com/content/dam/amd/en/documents/products/et...

tempest_•1mo ago
This is how I feel about this industry's fetishization of "scalability".

A lot of software time is spent making something scalable when, in 2025, I can probably run any site in the bottom 99% of the most visited sites on the internet on a couple of machines and < 40k in capital.

tbrownaw•1mo ago
> any site in the bottom 99% of the most visited sites on the internet

What % is the AWS console, and what counts as "running" it?

tempest_•1mo ago
> What % is the AWS console

0%

Prior to the recent RAM insanity (a big caveat, I know), a 1U Supermicro machine with 768GB, some NVMe storage, and twin 32-core Epyc 9004s was ~12K USD. You can get 3 of those and some redundant 10G network infra (people are literally throwing this out) for < 40k. Then you just have to find a rack/internet connection to put them in, which would be a few hundred a month.

The reality is most sites don't need multi-region setups; they have very predictable load, and 3 of those machines would be massive overkill for many. A lot of people like to think they will lose millions per second of downtime, and some sites certainly do, but most won't.

All of this of course would be using new stuff. If you wanted to use used stuff, the most cost-effective option is the 5-year-old second-gen Xeon Scalables being dumped by cloud providers. Those are more than enough compute for most; they are just really thirsty, so you will pay in the power bill.

This of course is predicated on the assumption that you have the skill set to support these machines, and that is increasingly becoming less common. Though as successful companies that started in the last 10 years do more "hybrid cloud", it is starting to come back around.

cap11235•1mo ago
If you are paying 12k, why would you ever subject yourself to Supermicro?
trueismywork•1mo ago
Hehe.
tempest_•1mo ago
They just happen to have an online config tool that gives a price somewhat close to what you would pay if you didn't engage with sales, which is useful for a Hacker News comment.
oblio•1mo ago
Raw technical excellence doesn't rake in billions, despite what IT people keep saying.

Otherwise Viaweb would be the shining star of 2025. Instead it's a forgotten footnote on a path to programming with money (VC).

Animats•1mo ago
The sites that think they need huge numbers of small network interactions are probably collecting too much detailed data about user interaction. Like capturing cursor movement. That might be worth doing for 1% of users to find hot spots, but capturing it for all of them is wasteful.

A lot of analytic data is like that. If you captured it for 1% of users you'd find out what you needed to know at 1% of the cost.

otterdude•1mo ago
When people talk about a single server they're not talking about one hunk of metal, they're talking about 1 server process.

This article describes the 10k client connection problem, you should be handling 256K clients :)

marcosdumay•1mo ago
When people talk about a single server they are pretty much talking about either a single physical box with a CPU inside or a VPS using a few processor threads.

When they say "most companies can run in a single server, but do backups" they usually mean the physical kind.

Maxatar•1mo ago
The term is absolutely ambiguous and I know I've run into confusion in my own work due to the ambiguity. For the purpose of the C10K, server is intended to mean server process rather than hardware.
marcosdumay•1mo ago
> You can buy a 1000MHz machine with 2 gigabytes of RAM and an 1000Mbit/sec Ethernet card for $1200 or so. Let's see - at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients.

It was about physical servers.

Aloisius•1mo ago
This is definitely talking about scaling past 10K open connections on a single server daemon (hence the reference to a web server and an ftp server).

However, most people used dedicated machines when this was written, so scaling 10K open connections on a daemon was essentially the same thing as 10K open connections on a single machine.

marcosdumay•1mo ago
> You can buy a 1000MHz machine with 2 gigabytes of RAM and an 1000Mbit/sec Ethernet card for $1200 or so

Those are not "by process" capabilities and daemons were never restricted to a single process.

The article focuses on threads because processes had more kernel-level problems than threads. But it was never about process limitations.

And by the way, process capabilities on Linux are exactly the same as machine capabilities. There's no limitation. You are insisting everybody uses a category that doesn't even exist.

Aloisius•1mo ago
Of course daemons weren't limited to a single process, but the old 1 process per connection forking model wasn't remotely scalable, not only because of kernel deficiencies (which certainly didn't help), but also because of the extreme cost of context switches on the commodity servers back then.

Now perhaps my memory is a bit fuzzy after all these years, but I'm pretty sure that when I was asking about scaling above 15,000 simultaneous connections back in 1999 (I think the discussion on linux-kernel is referenced in this article), it was for a server listening on a single port that required communication between users, and the only feasible way at the time to do that was multiplexing the connections in a single process.

Without that restriction, hitting 10,000 connections on a single Linux machine was much easier: run multiple daemons, each listening on its own port, and just use select(). It still wasn't great, but it wasn't eating 40% of the time in poll() either.

Most of the things the article covers (multiplexing, non-blocking IO and event handling) were strategies for handling more connections in a process. The various multiplexing methods were discussed because syscalls like poll() scaled extremely poorly as the number of fds increased. None of that is particularly relevant for 1-connection-per-process forking daemons, where in many cases you don't even need polling at all.
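
(For readers who never wrote one: the single-process multiplexing the article catalogues looks roughly like the sketch below, here using epoll rather than the select()/poll() of the era. listen_fd is assumed to be already bound, listening and non-blocking, and error handling is elided.)

    #include <sys/epoll.h>
    #include <sys/socket.h>

    #define MAX_EVENTS 1024

    void event_loop(int listen_fd) {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event events[MAX_EVENTS];
        for (;;) {
            /* Unlike poll(), the cost here scales with ready fds, not total fds. */
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    int client = accept(listen_fd, NULL, NULL);  /* new connection */
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
                } else {
                    /* read the request and write the response here, non-blocking */
                }
            }
        }
    }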

cap11235•1mo ago
Who cares? A "process" is a made up division.
dilyevsky•1mo ago
At the time this was written, a powerful backend server only had like 4 cores. Linux only started adopting SMP like that same year. Also, CPU caches were tiny.

Serving less than 1k qps per core is pretty underwhelming today; at such a high core count you'd likely hit OS limitations way before you're bound by hardware.

Grosvenor•1mo ago
Linux had been doing SMP for about 5 years by that point.

But you're right, OS resource limitations (file handles, PIDs, etc.) would be the real pain for you. One problem after another.

Now, the real question is: do you want to spend your engineering time on that? A small cluster running Erlang is probably better than a tiny number of finely tuned race-car boxen.

dilyevsky•1mo ago
My recollection is fuzzy, but I remember having to recompile 2.4-ish kernels to enable SMP back in the day, which took hours... And I think it was buggy too.

Totally agree on many smaller boxes vs. a bigger box, especially for the proxying use case.

hinkley•1mo ago
I don’t think I even heard of C10K until around 2003.
hoppp•1mo ago
We solved it 2 decades ago, but then decided to use JavaScript on the server...
wmf•1mo ago
Node.js is actually pretty good at C10K but it failed at multicore and C10M.
mifreewil•1mo ago
Node.js uses libuv, which implements strategy 2 mentioned on the linked webpage.

"libuv is a multi-platform C library that provides support for asynchronous I/O based on event loops. It supports epoll(4), kqueue(2)"

marcosdumay•1mo ago
Except that it wastes 2 or 3 orders of magnitude in performance and polls all the connections from a single OS thread, locking everything if it has to do extra work on any of them.

Picking the correct theoretical architecture can't save you if you bog down on every practical decision.

mifreewil•1mo ago
I'm sure there are plenty of data/benchmarks out there and I'll let that speak for itself, but I'll just point out that there are 2 built-in core modules in Node.js, worker_threads (threads) and cluster (processes), which are very easy to bolt onto an existing plain HTTP app.
IgorPartola•1mo ago
So think of it this way: you want to avoid calling malloc() to increase performance. JavaScript does not have the semantics to avoid this. You also want to avoid looping. JavaScript does not have the semantics to avoid it.

If you haven't had experience with actual performant code, JS can seem fast. But it's a Huffy bike compared to a Kawasaki H2. Sure, it is better than a kid's trike, but it is not a performance system by any stretch of the imagination. You use JS for convenience, not performance.

winrid•1mo ago
(to be fair the memory manager reuses memory, so it's not calling out to malloc all the time, but yes a manually-managed impl. will be much more efficient)
IgorPartola•1mo ago
Whichever way you manage memory, it is overhead. But the main problem is the language does not have zero copy semantics so lots of things trigger a memcpy(). But if you also need to call malloc() or even worse if you have to make syscalls you are hosed. Syscalls aren’t just functions, they require a whole lot of orchestration to make happen.

JavaScript engines are also JITted, which is better than a straight interpreter but, outside of microbenchmarks, worse than compiled code.

I use it for nearly all my projects. It is fine for most UI stuff and is OK for some server stuff (though Python is superior in every way). But I would never want to replace something like nginx with a JavaScript-based web server.

winrid•1mo ago
V8 does a lot of things to prevent copies. If you create two strings, concat them, and assign the result to a third var, no copy happens (c = a+b, c is a rope). Objects are by reference... Strings are interned... The main gotcha with copies is when you need to convert from the internal representation (UTF-16) to UTF-8 for output; it will copy then.
gbuk2013•1mo ago
IIRC V8 actually does some tricks under the hood to avoid mallocs, which is why Node.js can be unexpectedly fast (I saw some benchmarks where it was only about 4x the runtime of equivalent C code) - for example, it recycles objects of the same shape (which is why it is beneficial not to modify object structure in hot code paths).
hinkley•1mo ago
Hidden classes are a whoooole thing. I've switched several projects to Maps for lookup tables so as not to poke that bear. Storage is the unhappy path for this.
yencabulator•1mo ago
JITs are a great magic trick but it's nowhere near guaranteed you'll get good steady performance out of one, especially when the workload is wide not narrow.

https://arxiv.org/abs/1602.00602v1

drogus•1mo ago
Worker threads can't handle I/O, so a single-process Node.js app will still have a much lower connection limit than languages where you can handle I/O on multiple threads. Obviously, the second thing you mention, i.e. multiple processes, "solves" this problem, but at the cost of running more than one process. In the case of web apps it probably doesn't matter too much (although it can hurt performance, especially if you cache stuff in memory), but there are things where it just isn't a good trade-off.
hinkley•1mo ago
And I have confirmed to my own satisfaction that both PM2 and worker threads have their own libuv threads. Yes, it's very common in Node to run around one instance per core, give or take.
hinkley•1mo ago
Libuv now supports io_uring, but I'm fuzzy on how broadly Node.js is applying that fact. It seems to be a function-by-function migration with lots of rollbacks.
nine_k•1mo ago
It's mostly RAM allocated per client. E.g. Postgres is very much limited by this in supporting massive numbers of clients; hence pgbouncer and other kinds of connection pooling, which allow a Postgres server to serve many more clients than it has RAM to connect directly.

If your Node app spends very little RAM per client, it can indeed service a great many of them.

A PHP script that does little more than checking credentials and invoking sendfile() could be adequate for the case of serving small files described in the article.
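
(A sketch of what that looks like beneath the scripting layer, using Linux sendfile(2); client_fd is assumed to be an already-accepted socket and the path is a placeholder.)

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int send_small_file(int client_fd, const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        fstat(fd, &st);

        off_t off = 0;
        /* The kernel copies file -> socket directly; no per-client user-space buffer. */
        while (off < st.st_size) {
            ssize_t sent = sendfile(client_fd, fd, &off, st.st_size - off);
            if (sent <= 0)
                break;
        }

        close(fd);
        return (off == st.st_size) ? 0 : -1;
    }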

drob518•1mo ago
FFS, yep. Sigh.
eternityforest•1mo ago
I'm surprised there's not a lot more work on "backend free" systems.

A lot of apps seem like they could literally all use the same exact backend, if there was a service that just gave you the 20 or so features these kinds of things need.

Pocketbase is pretty close, but not all the way there yet: you still have to handle billing and security yourself. You're not just creating a client that has access to whatever resources the end user paid for and assigned to the app via the host's UI.

xnx•1mo ago
Yes. Most SaaS products are a tiny bit of business logic with a substantial helping of design and code framework bloat.
_qua•1mo ago
I personally think it's more of a https://c25k.com/ time of year.
alwa•1mo ago
Apparently this refers to making a web server able to serve 10,000 clients simultaneously.
IgorPartola•1mo ago
It has been long enough that C10K is not in common software engineer vernacular anymore. There was a time when people did not trust async anything. This was also a time when PHP was much more dominant on the web, async database drivers were rare and unreliable, and you had to roll your own thread pools.
amelius•1mo ago
Yes. But it's easy to reinvent it, with modern OSes and tools.
readthenotes1•1mo ago
The internationally famous Unix Network Programming book. An icon, a shibboleth, a cynosure

https://youtu.be/hjjydz40rNI?si=F7aLOSkLqMzgh2-U

(From Wayne's World--how we knew the comedians had smart advisors)

senko•1mo ago
The date (2003) is incorrect. The article itself refers to events from 2009, is listed at the bottom of the page as having been last updated in 2014, with a copyright notice spanning 2018, and a minor correction in 2019.
throwaway29303•1mo ago

  The date (2003) is incorrect.
You're right, it's even older than that; it should be (1999).

https://web.archive.org/web/*/https://www.kegel.com/c10k.htm...

petters•1mo ago
Seems to be some kind of living document. It was referred to as an "oldie" in 2007: https://news.ycombinator.com/item?id=45603
zahlman•1mo ago
> And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and an 1000Mbit/sec Ethernet card for $1200 or so. Let's see - at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients. (That works out to $0.08 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.

It seems to me that there are far fewer problems nowadays with trying to figure out how to serve a tiny bit of data to many people with those kinds of resources, and more problems with understanding how to make a tiny bit of data relevant.

It still absolutely can be. We've just lost touch.

nine_k•1mo ago
This particular case, with the numbers given, would work as a server for profile pictures, for instance. Or for package signatures. Or for status pages of a bunch of services (generated statically, since the status rarely changes).

Yes, an RPi4 might be adequate to serve 20k client requests in parallel without crashing or breaking too much of a sweat. You usually want to plan for 5%-10% of this load as a norm if you care about tail latency, but a 20K spike should not kill it.

cxr•1mo ago
Stop fucking editorializing the fucking submission titles.
signa11•1mo ago
No one is talking about Erlang here? I was / am under the impression that it is designed for these scenarios.
yencabulator•1mo ago
Not really. Erlang's VM is actually surprisingly slow, and in the era of c10k it was still using select/poll, which were getting to be bottlenecks.

What Erlang excels at is "no programming mistake should ever take the whole system permanently down". As in, components will reboot to recover. It's not a magic fix for anything outside of that.

fallingfrog•1mo ago
Ha, I instantly read "reverse log taper, 10k ohms".