https://www.phoronix.com/news/Linux-6.17-NUMA-Locality-Rando... https://www.phoronix.com/news/Linux-6.13-Sched_Ext https://www.phoronix.com/news/DAMON-Self-Tuned-Memory-Tierin... https://www.phoronix.com/news/Linux-6.14-FUSE
There's some big work I'm missing that's more recent too, again about allocating & scheduling IIRC. Still trying to find it. The third link is about DAMON, which is trying to do a lot to optimize; good thread to tug on more!
I have this pocket belief that eventually we might see post-NUMA, post-coherency architectures, where even a single chip acts more like multiple independent clusters that use something more like networking (CXL or UltraEthernet or something) to allow RDMA, but without coherency.
Even today, the title here woefully under-describes the problem. An Epyc chip is actually multiple compute dies, each with its own NUMA zone and its own L3 and other caches. For now, yes, each socket's memory all goes through a single IO die and is semi-uniform, but whether that holds is in question, and even today the multiple NUMA zones on one socket already require careful tuning for efficient workload processing.
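To see how many NUMA zones a given Linux box actually exposes, something like the rough sketch below works. It assumes the usual sysfs layout under /sys/devices/system/node, where each node directory has a cpulist file; the function names parse_cpulist and numa_nodes are made up for illustration.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> list of CPU ids, read from sysfs (assumed Linux layout)."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        node_id = int(re.search(r"node(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as f:
            nodes[node_id] = parse_cpulist(f.read())
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: {len(cpus)} CPUs {cpus}")
```

On a machine with no NUMA information in sysfs this just prints nothing.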
Even the Raspberry Pi 5 benefits from NUMA emulation because it makes memory use patterns better match the memory controller’s parallelization capabilities.
stego-tech•2h ago
One thing the writeup didn’t seem to get into is the lack of scalability of this approach (manual pinning). As core counts and chiplets continue to explode, we still need better ways of scaling manual pinning or building more NUMA-aware OSes/applications that can auto-schedule with minimal penalties. Don’t get me wrong, it’s a lot better than ye olden days of dual core, multi-socket servers and stern warnings against fussing with NUMA schedulers from vendors if you wanted to preserve basic functionality, but it’s not a solved problem just yet.
jasonjayr•2h ago
EDIT: aaaand ... I commented before reading the article, which describes this very mechanism.
colechristensen•2h ago
Most of us are in the realm of the lowest hanging fruit being database queries that could be 100x faster and functions being called a million times a day that only need to be called twice.
stego-tech•1h ago
In 99% of use cases, there are other, easier optimizations to be had. You’ll know if you’re in the 1% of workloads where pinning is advantageous.
For everyone else, it’s an excellent explainer of why most guides and documentation will sternly warn you against fussing with the NUMA scheduler.
toast0•42m ago
CPU pinning can be super easy too. If you have an application that uses the whole machine, you probably already spawn one thread per CPU thread. Pinning those threads is usually pretty easy. Checking if it makes a difference might be harder... For most applications, it won't make a big difference, but some applications will see a big difference. Usually a positive difference, but it depends on the application. If nobody has tried CPU pinning your application lately, it's worth trying.
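A minimal sketch of the one-thread-per-CPU pinning idea, assuming Linux (os.sched_setaffinity is not available elsewhere). run_pinned and its work callback are hypothetical names. With pid 0 the call applies to the calling thread, so each worker pins itself. Python's GIL means this mostly illustrates the affinity call; a real workload would more likely do the same thing from C or Rust, or with one process per CPU.

```python
import os
import threading


def run_pinned(work, cpus=None):
    """Start one thread per CPU, pin each thread to its CPU, and collect results.

    `work` is a hypothetical callback taking the CPU id it is pinned to.
    """
    if cpus is None:
        # Only use CPUs this process is allowed to run on (respects cgroups/taskset).
        cpus = sorted(os.sched_getaffinity(0))
    results = {}

    def worker(cpu):
        # pid 0 means "the calling thread" on Linux, so only this thread is pinned.
        os.sched_setaffinity(0, {cpu})
        results[cpu] = work(cpu)

    threads = [threading.Thread(target=worker, args=(cpu,)) for cpu in cpus]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Each worker reports the affinity mask it ended up with.
    for cpu, mask in sorted(run_pinned(lambda cpu: os.sched_getaffinity(0)).items()):
        print(f"worker for cpu {cpu} runs on {sorted(mask)}")
```

Checking whether it helps is still the hard part: benchmark the same workload with and without the sched_setaffinity call.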
Of course, doing something efficiently is nice, but not doing it is often a lot faster... Not doing things that don't need to be done has huge potential speedups.
If you want to CPU pin network sockets, that's not as easy, but it can also make a big difference in some circumstances; mostly if you're a load balancer/proxy kind of thing where you don't spend much time processing packets, just receive and forward. In that case, avoiding cross-CPU reads and writes can provide huge speedups, but it's not easy. That one, yeah, only do it if you have a good idea it will help; it's kind of invasive, and it won't be noticeable if you do a lot of work on requests.
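One common starting point for the socket side, sketched under assumptions: give each pinned worker its own listening socket on the same port via SO_REUSEPORT (Linux), so workers don't share one accept queue. This is not the whole job. It does not by itself make the kernel deliver a connection to the worker on the CPU that received the packets; that needs NIC queue/IRQ affinity or further socket steering, which isn't shown here. reuseport_listeners is a made-up helper name.

```python
import socket


def reuseport_listeners(count, host="127.0.0.1", port=0):
    """Create `count` TCP listening sockets that all share one port via SO_REUSEPORT.

    With port=0 the first socket picks a free port and the rest reuse it.
    Intended use (an assumption): hand one socket to each pinned worker.
    """
    listeners = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Must be set on every socket before bind() for the port to be shared.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        s.listen()
        port = s.getsockname()[1]
        listeners.append(s)
    return listeners


if __name__ == "__main__":
    socks = reuseport_listeners(4)
    print("listening on port", socks[0].getsockname()[1], "with", len(socks), "sockets")
    for s in socks:
        s.close()
```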
frollogaston•1h ago
Probably another situation is if you're working on a DBMS itself.
PerryStyle•1h ago
Would be interesting to see if something similar appears for cloud workloads.
wmf•1h ago
ccgreg•1h ago
Last time I was architect of a network chip, 21 years ago, our library did that for the user. For workloads that use threads that consume entire cores, it's a solved problem.
I'd guess that the workload you had in mind doesn't have that property.