You see this most obviously, even visually, in places like game engines. In Unity, the difference between non-Burst and Burst-compiled code is extreme, and by comparison the difference between single-core and multi-core in the job system is often irrelevant. If the CPU time spent on each job isn't high enough, the benefit of multicore evaporates. Sending a job off to be run on the fleet has a lot of overhead; it has to be worth that one-time ~100x latency cost in both directions.
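To make the amortization point concrete, here's a minimal sketch in std-only Rust (not Unity or Burst; the channel hand-off and the hashing loop are stand-ins I picked) that times the same work done inline versus dispatched to a worker thread and waited on:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

// Stand-in for "CPU time spent per job"; the iteration count is the knob.
fn busy_work(iters: u64) -> u64 {
    (0..iters).fold(0u64, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

fn main() {
    let (to_worker, jobs) = mpsc::channel::<u64>();
    let (to_main, results) = mpsc::channel::<u64>();
    thread::spawn(move || {
        for iters in jobs {
            to_main.send(busy_work(iters)).unwrap();
        }
    });

    for &iters in &[100u64, 10_000, 1_000_000] {
        // Do the work inline on this thread.
        let t = Instant::now();
        let local = busy_work(std::hint::black_box(iters));
        let inline_ns = t.elapsed().as_nanos();

        // Hand the same work to the worker and wait for the answer.
        let t = Instant::now();
        to_worker.send(iters).unwrap();
        let remote = results.recv().unwrap();
        let dispatched_ns = t.elapsed().as_nanos();

        assert_eq!(local, remote);
        println!(
            "{:>9} iterations: inline {:>9} ns, dispatched {:>9} ns",
            iters, inline_ns, dispatched_ns
        );
    }
}
```

The dispatched column only stops looking silly once the job is long enough to swamp the hand-off, which is the same threshold argument as above.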
The GPU is the ultimate example of this. Some workloads benefit dramatically from its enormous parallelism; others are entirely infeasible by comparison. This is at the heart of my problem with the current machine-learning research paradigm: some ML techniques run terribly on a GPU, yet we seem to have convinced ourselves that a GPU is a prerequisite for any kind of ML work. It all comes down to the latency of the compute. Getting data into and out of a GPU takes an eternity compared to an L1 hit, and there are other fundamental problems with GPUs (warp divergence, for instance) that preclude clever workarounds.
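As a back-of-envelope illustration of "an eternity compared to L1", here's a tiny Rust calculation; the constants (about 1 ns per L1 hit, ~32 GB/s over PCIe 4.0 x16, ~10 µs of launch/driver overhead, a 64 MiB buffer) are assumed ballpark figures, not measurements from any particular card:

```rust
fn main() {
    // Assumed ballpark constants; real values vary by part and bus generation.
    let l1_hit_ns = 1.0_f64;             // ~1 ns for an L1 cache hit
    let pcie_bytes_per_s = 32.0e9_f64;   // ~32 GB/s, PCIe 4.0 x16
    let launch_overhead_us = 10.0_f64;   // kernel launch + driver round trip
    let buffer_bytes = 64.0 * 1024.0 * 1024.0; // a 64 MiB buffer, each way

    // Host->device plus device->host, in microseconds, plus fixed overhead.
    let transfer_us = 2.0 * buffer_bytes / pcie_bytes_per_s * 1.0e6;
    let total_ns = (transfer_us + launch_overhead_us) * 1.0e3;

    println!(
        "GPU round trip ~{:.0} us, i.e. on the order of {:.0e} L1-hit latencies",
        total_ns / 1.0e3,
        total_ns / l1_hit_ns
    );
}
```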
If your workload is majority CPU-bound, then this is true sometimes, and only at best.
Most workloads are I/O-bound (i.e., syscall-bound), and I/O/syscall overhead is far larger than cross-core communication overhead.
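One rough way to sanity-check that ordering on your own machine: the std-only Rust sketch below (Linux assumed, since it reads from /dev/zero) compares a trivial read(2) syscall against one hop of a cache-line ping-pong between two threads.

```rust
use std::io::Read;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    const ROUNDS: u64 = 200_000;

    // 1) Syscall cost: one read(2) of a single byte from /dev/zero per iteration.
    let mut dev_zero = std::fs::File::open("/dev/zero").expect("Linux-style /dev/zero assumed");
    let mut buf = [0u8; 1];
    let t = Instant::now();
    for _ in 0..ROUNDS {
        dev_zero.read_exact(&mut buf).unwrap();
    }
    let syscall_ns = t.elapsed().as_nanos() as f64 / ROUNDS as f64;

    // 2) Cross-core cost: ping-pong a counter between two spinning threads.
    let flag = Arc::new(AtomicU64::new(0));
    let other = Arc::clone(&flag);
    let pong = thread::spawn(move || {
        // Responder: wait for each odd value, reply with the next even value.
        for i in (1..2 * ROUNDS).step_by(2) {
            while other.load(Ordering::Acquire) != i {
                std::hint::spin_loop();
            }
            other.store(i + 1, Ordering::Release);
        }
    });
    let t = Instant::now();
    for i in (1..2 * ROUNDS).step_by(2) {
        flag.store(i, Ordering::Release);
        while flag.load(Ordering::Acquire) != i + 1 {
            std::hint::spin_loop();
        }
    }
    let cross_core_ns = t.elapsed().as_nanos() as f64 / ROUNDS as f64 / 2.0;
    pong.join().unwrap();

    println!("syscall     ~{:.0} ns/op", syscall_ns);
    println!("cross-core  ~{:.0} ns/hop", cross_core_ns);
}
```

The absolute numbers are machine-dependent; the comparison between the two lines is the point.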
* I prefer the term "work-sharding" over "thread-per-core", because work-stealing architectures usually also use one thread per core, so it tends to confuse people.
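For concreteness, here's a minimal work-sharding sketch in Rust; the shard count, the modulo routing, and the counting workload are placeholder choices. The property that distinguishes it from work-stealing is that each key is owned by exactly one thread, so its state never crosses cores.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

fn main() {
    let shards = 4; // stand-in for the core count
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for shard_id in 0..shards {
        let (tx, rx) = mpsc::channel::<(u64, u64)>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // Per-shard state stays private to this thread: no locks, no sharing.
            let mut counts: HashMap<u64, u64> = HashMap::new();
            for (key, value) in rx {
                *counts.entry(key).or_insert(0) += value;
            }
            println!("shard {shard_id}: {} distinct keys", counts.len());
        }));
    }

    // Route each item to the shard that owns its key.
    for key in 0..100_000u64 {
        let shard = (key % shards as u64) as usize;
        senders[shard].send((key, 1)).unwrap();
    }

    drop(senders); // close the channels so the shard threads exit their loops
    for handle in handles {
        handle.join().unwrap();
    }
}
```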
The architectures from circa 2010 were a bit rough. While the article has some validity for designs from 10+ years ago, today's state of the art for thread-per-core looks nothing like them and largely doesn't have the issues raised.
News of thread-per-core's demise has been greatly exaggerated. The benefits have measurably increased in practice as the hardware has evolved, especially for ultra-scale data infrastructure.
vacuity•17h ago
That being said, some things hold generally for the long term: use a pinned thread per core, maximize locality (of data and of code, wherever relevant), and use asynchronous programming if performance is necessary. To incorporate the OP, give control to each entity where it's due (here, the scheduler). Cross-core data movement was never the enemy, but unprincipled cross-core data movement can be. If even distribution of work is important, work-stealing is excellent, as long as it's done carefully. Details like how concurrency is implemented (shared state, here) or who controls the data are specific to the circumstances.
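A minimal sketch of the "pinned thread per core" part, assuming the third-party core_affinity crate (libc's sched_setaffinity would do the same job on Linux):

```rust
use std::thread;

fn main() {
    let core_ids = core_affinity::get_core_ids().expect("could not enumerate cores");

    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|core_id| {
            thread::spawn(move || {
                // Pin this thread; returns false if the OS refused the request.
                let pinned = core_affinity::set_for_current(core_id);
                // Per-core state lives here, so the hot path needs no cross-core traffic.
                let mut local_sum: u64 = 0;
                for i in 0..1_000_000u64 {
                    local_sum = local_sum.wrapping_add(i ^ core_id.id as u64);
                }
                println!("core {:>2}: pinned={} sum={}", core_id.id, pinned, local_sum);
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```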
AaronAPU•1h ago
This was on a wide variety of Intel, AMD, NUMA, and ARM processors with different architectures, OSes, and memory configurations.
Part of the reason is hyper-threading (or Threadripper-type architectures), but even locking threads to core groups usually wasn't faster.
This was even more so the case when competing workloads were stealing cores from the OS scheduler.
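For reference, "locking to a group" rather than to a single core can look like the sketch below: Linux-only Rust using the libc crate, with placeholder core numbers standing in for whatever CCX or NUMA grouping a machine actually has. The OS scheduler still moves the thread, just only within the allowed set.

```rust
fn pin_to_group(cpus: &[usize]) -> std::io::Result<()> {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        for &cpu in cpus {
            libc::CPU_SET(cpu, &mut set);
        }
        // pid 0 means "the calling thread".
        if libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() {
    // e.g. keep this thread on cores 0-3 (hypothetically one sibling group).
    pin_to_group(&[0, 1, 2, 3]).expect("sched_setaffinity failed");
    println!("restricted to cores 0-3; the OS may still move us within that group");
}
```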
zamadatix•42m ago
A really interesting niche is the set of performance considerations around the design and use of VPP (Vector Packet Processing) in networking. It's just one example from a single niche, but it gives a good idea of how both "changing the way the computation works" and "changing the locality and pinning" can come together at the same time (a toy sketch of the batching idea follows below). I forget the username, but the person behind VPP is actually on HN often, and a pretty cool guy to chat with.
Or, as vacuity put it, "there are no hard rules; use principles flexibly".
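Here's that toy sketch of the batching idea behind VPP, in Rust; the stage functions and batch size are illustrative and not VPP's actual node graph. The point is that a whole batch runs through one processing node before the next, so each node's code and lookup tables stay hot.

```rust
const BATCH: usize = 256;

#[derive(Clone, Copy)]
struct Packet {
    len: u16,
    ttl: u8,
    dropped: bool,
}

// Each "node" sweeps the whole batch before the next node runs.
fn validate(batch: &mut [Packet]) {
    for p in batch.iter_mut() {
        p.dropped |= p.len < 20; // too short to hold a header
    }
}

fn decrement_ttl(batch: &mut [Packet]) {
    for p in batch.iter_mut() {
        p.ttl = p.ttl.saturating_sub(1);
        p.dropped |= p.ttl == 0;
    }
}

fn main() {
    let mut batch = [Packet { len: 64, ttl: 8, dropped: false }; BATCH];
    batch[3].len = 10; // one malformed packet for the validator to drop

    // Batch-at-a-time processing, rather than packet-at-a-time.
    validate(&mut batch);
    decrement_ttl(&mut batch);

    let forwarded = batch.iter().filter(|p| !p.dropped).count();
    println!("forwarded {forwarded} of {BATCH} packets");
}
```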