You see this most obviously, even visually, in places like game engines. In Unity, the difference between non-Burst and Burst-compiled code is extreme, and by comparison the difference between single-core and multi-core in the job system is often irrelevant. If the CPU time spent on each job isn't high enough, the benefit of multicore evaporates. Sending a job off to be run on the fleet has a lot of overhead; it has to be worth that one-time ~100x latency cost in both directions.
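To make the amortization point concrete, here's a minimal sketch in std-only Rust (not Unity or Burst; the channel hand-off and the hashing loop are stand-ins I picked) that times the same work done inline versus dispatched to a worker thread and waited on:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

// Stand-in for "CPU time spent per job"; the iteration count is the knob.
fn busy_work(iters: u64) -> u64 {
    (0..iters).fold(0u64, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

fn main() {
    let (to_worker, jobs) = mpsc::channel::<u64>();
    let (to_main, results) = mpsc::channel::<u64>();
    thread::spawn(move || {
        for iters in jobs {
            to_main.send(busy_work(iters)).unwrap();
        }
    });

    for &iters in &[100u64, 10_000, 1_000_000] {
        // Do the work inline on this thread.
        let t = Instant::now();
        let local = busy_work(std::hint::black_box(iters));
        let inline_ns = t.elapsed().as_nanos();

        // Hand the same work to the worker and wait for the answer.
        let t = Instant::now();
        to_worker.send(iters).unwrap();
        let remote = results.recv().unwrap();
        let dispatched_ns = t.elapsed().as_nanos();

        assert_eq!(local, remote);
        println!(
            "{:>9} iterations: inline {:>9} ns, dispatched {:>9} ns",
            iters, inline_ns, dispatched_ns
        );
    }
}
```

The dispatched column only stops looking silly once the job is long enough to swamp the hand-off, which is the same threshold argument as above.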
The GPU is the ultimate example of this. Some workloads benefit dramatically from its enormous parallelism; others are entirely infeasible by comparison. This is at the heart of my problem with the current machine-learning research paradigm: some ML techniques run terribly on a GPU, yet we seem to have convinced ourselves that a GPU is a prerequisite for any kind of ML work. It all comes down to the latency of the compute. Getting data into and out of a GPU takes an eternity compared to an L1 hit, and there are other fundamental problems with GPUs (warp divergence, for instance) that preclude clever workarounds.
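As a back-of-envelope illustration of "an eternity compared to L1", here's a tiny Rust calculation; the constants (about 1 ns per L1 hit, ~32 GB/s over PCIe 4.0 x16, ~10 µs of launch/driver overhead, a 64 MiB buffer) are assumed ballpark figures, not measurements from any particular card:

```rust
fn main() {
    // Assumed ballpark constants; real values vary by part and bus generation.
    let l1_hit_ns = 1.0_f64;             // ~1 ns for an L1 cache hit
    let pcie_bytes_per_s = 32.0e9_f64;   // ~32 GB/s, PCIe 4.0 x16
    let launch_overhead_us = 10.0_f64;   // kernel launch + driver round trip
    let buffer_bytes = 64.0 * 1024.0 * 1024.0; // a 64 MiB buffer, each way

    // Host->device plus device->host, in microseconds, plus fixed overhead.
    let transfer_us = 2.0 * buffer_bytes / pcie_bytes_per_s * 1.0e6;
    let total_ns = (transfer_us + launch_overhead_us) * 1.0e3;

    println!(
        "GPU round trip ~{:.0} us, i.e. on the order of {:.0e} L1-hit latencies",
        total_ns / 1.0e3,
        total_ns / l1_hit_ns
    );
}
```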
If your workload is majority CPU-bound, then this is true sometimes, and only at best.
Most workloads are I/O-bound (i.e., syscall-bound), and I/O/syscall overhead is far larger than cross-core communication overhead.
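One rough way to sanity-check that ordering on your own machine: the std-only Rust sketch below (Linux assumed, since it reads from /dev/zero) compares a trivial read(2) syscall against one hop of a cache-line ping-pong between two threads.

```rust
use std::io::Read;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    const ROUNDS: u64 = 200_000;

    // 1) Syscall cost: one read(2) of a single byte from /dev/zero per iteration.
    let mut dev_zero = std::fs::File::open("/dev/zero").expect("Linux-style /dev/zero assumed");
    let mut buf = [0u8; 1];
    let t = Instant::now();
    for _ in 0..ROUNDS {
        dev_zero.read_exact(&mut buf).unwrap();
    }
    let syscall_ns = t.elapsed().as_nanos() as f64 / ROUNDS as f64;

    // 2) Cross-core cost: ping-pong a counter between two spinning threads.
    let flag = Arc::new(AtomicU64::new(0));
    let other = Arc::clone(&flag);
    let pong = thread::spawn(move || {
        // Responder: wait for each odd value, reply with the next even value.
        for i in (1..2 * ROUNDS).step_by(2) {
            while other.load(Ordering::Acquire) != i {
                std::hint::spin_loop();
            }
            other.store(i + 1, Ordering::Release);
        }
    });
    let t = Instant::now();
    for i in (1..2 * ROUNDS).step_by(2) {
        flag.store(i, Ordering::Release);
        while flag.load(Ordering::Acquire) != i + 1 {
            std::hint::spin_loop();
        }
    }
    let cross_core_ns = t.elapsed().as_nanos() as f64 / ROUNDS as f64 / 2.0;
    pong.join().unwrap();

    println!("syscall     ~{:.0} ns/op", syscall_ns);
    println!("cross-core  ~{:.0} ns/hop", cross_core_ns);
}
```

The absolute numbers are machine-dependent; the comparison between the two lines is the point.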
* I prefer the term "work-sharding" over "thread-per-core", because work-stealing architectures usually also use one thread per core, so it tends to confuse people.
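For concreteness, here's a minimal work-sharding sketch in Rust; the shard count, the modulo routing, and the counting workload are placeholder choices. The property that distinguishes it from work-stealing is that each key is owned by exactly one thread, so its state never crosses cores.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

fn main() {
    let shards = 4; // stand-in for the core count
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for shard_id in 0..shards {
        let (tx, rx) = mpsc::channel::<(u64, u64)>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // Per-shard state stays private to this thread: no locks, no sharing.
            let mut counts: HashMap<u64, u64> = HashMap::new();
            for (key, value) in rx {
                *counts.entry(key).or_insert(0) += value;
            }
            println!("shard {shard_id}: {} distinct keys", counts.len());
        }));
    }

    // Route each item to the shard that owns its key.
    for key in 0..100_000u64 {
        let shard = (key % shards as u64) as usize;
        senders[shard].send((key, 1)).unwrap();
    }

    drop(senders); // close the channels so the shard threads exit their loops
    for handle in handles {
        handle.join().unwrap();
    }
}
```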
The architectures from circa 2010 were a bit rough. While the article has some validity for designs from 10+ years ago, today's state of the art for thread-per-core looks nothing like them and largely doesn't have the issues raised.
News of thread-per-core's demise has been greatly exaggerated. The benefits have measurably increased in practice as the hardware has evolved, especially for ultra-scale data infrastructure.
vacuity•17h ago
That being said, some things hold generally for the long term: use a pinned thread per core, maximize locality (of data and of code, wherever relevant), and use asynchronous programming if performance is necessary. To incorporate the OP, give control to each entity where it's due (here, the scheduler). Cross-core data movement was never the enemy, but unprincipled cross-core data movement can be. If even distribution of work is important, work-stealing is excellent, as long as it's done carefully. Details like how concurrency is implemented (shared state, here) or who controls the data are specific to the circumstances.
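A minimal sketch of the "pinned thread per core" part, assuming the third-party core_affinity crate (libc's sched_setaffinity would do the same job on Linux):

```rust
use std::thread;

fn main() {
    let core_ids = core_affinity::get_core_ids().expect("could not enumerate cores");

    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|core_id| {
            thread::spawn(move || {
                // Pin this thread; returns false if the OS refused the request.
                let pinned = core_affinity::set_for_current(core_id);
                // Per-core state lives here, so the hot path needs no cross-core traffic.
                let mut local_sum: u64 = 0;
                for i in 0..1_000_000u64 {
                    local_sum = local_sum.wrapping_add(i ^ core_id.id as u64);
                }
                println!("core {:>2}: pinned={} sum={}", core_id.id, pinned, local_sum);
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```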
AaronAPU•1h ago
This was on a wide variety of Intel, AMD, NUMA, and ARM processors with different architectures, OSes, and memory configurations.
Part of the reason is hyper-threading (or Threadripper-type architectures), but even locking threads to core groups usually wasn't faster.
This was even more so the case when competing workloads were stealing cores from the OS scheduler.
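For reference, "locking to a group" rather than to a single core can look like the sketch below: Linux-only Rust using the libc crate, with placeholder core numbers standing in for whatever CCX or NUMA grouping a machine actually has. The OS scheduler still moves the thread, just only within the allowed set.

```rust
fn pin_to_group(cpus: &[usize]) -> std::io::Result<()> {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        for &cpu in cpus {
            libc::CPU_SET(cpu, &mut set);
        }
        // pid 0 means "the calling thread".
        if libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() {
    // e.g. keep this thread on cores 0-3 (hypothetically one sibling group).
    pin_to_group(&[0, 1, 2, 3]).expect("sched_setaffinity failed");
    println!("restricted to cores 0-3; the OS may still move us within that group");
}
```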
zamadatix•42m ago
A really interesting niche is the set of performance considerations around the design and use of VPP (Vector Packet Processing) in networking. It's just one example from a single niche, but it gives a good idea of how both "changing the way the computation works" and "changing the locality and pinning" can come together at the same time (a toy sketch of the batching idea follows below). I forget the username, but the person behind VPP is actually on HN often, and a pretty cool guy to chat with.
Or, as vacuity put it, "there are no hard rules; use principles flexibly".
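Here's that toy sketch of the batching idea behind VPP, in Rust; the stage functions and batch size are illustrative and not VPP's actual node graph. The point is that a whole batch runs through one processing node before the next, so each node's code and lookup tables stay hot.

```rust
const BATCH: usize = 256;

#[derive(Clone, Copy)]
struct Packet {
    len: u16,
    ttl: u8,
    dropped: bool,
}

// Each "node" sweeps the whole batch before the next node runs.
fn validate(batch: &mut [Packet]) {
    for p in batch.iter_mut() {
        p.dropped |= p.len < 20; // too short to hold a header
    }
}

fn decrement_ttl(batch: &mut [Packet]) {
    for p in batch.iter_mut() {
        p.ttl = p.ttl.saturating_sub(1);
        p.dropped |= p.ttl == 0;
    }
}

fn main() {
    let mut batch = [Packet { len: 64, ttl: 8, dropped: false }; BATCH];
    batch[3].len = 10; // one malformed packet for the validator to drop

    // Batch-at-a-time processing, rather than packet-at-a-time.
    validate(&mut batch);
    decrement_ttl(&mut batch);

    let forwarded = batch.iter().filter(|p| !p.dropped).count();
    println!("forwarded {forwarded} of {BATCH} packets");
}
```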