"1/61 of the time, check the global run queue." Stuff like this is a little odd; I would have thought this would be a variable dependent on the number of physical cores.
I once profiled a slow Go program running on a node with 168 cores, but cpu.max was 2 cores for the cgroup. The runtime defaults GOMAXPROCS to the number of visible cores, which was 168 in this case. Over half the runtime was the scheduler bouncing goroutines between 168 OS threads despite cpu.max being 2 CPUs.
The JRE is smart enough to figure out whether it is running in a resource-limited cgroup and make sane decisions based on that, but Go has no such thing.
https://github.com/golang/go/blob/a1a151496503cafa5e4c672e0e...
As said in 2019, import https://github.com/uber-go/automaxprocs to get the functionality ASAP.
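For reference, the usual wiring is just a blank import in package main; a minimal sketch (printing the result is only there to show the effect):

    package main

    import (
        "fmt"
        "runtime"

        // Blank import for its side effect: the package's init() reads the
        // cgroup CPU quota and adjusts GOMAXPROCS accordingly at startup.
        _ "go.uber.org/automaxprocs"
    )

    func main() {
        // GOMAXPROCS(0) queries the current value without changing it.
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    }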
Always a weird feeling, it’s a small world
If they'd now also make the GC respect memory cgroup limits (i.e. automatic GOMEMLIMIT), we'd probably be freeing up a couple petabytes of memory across the globe.
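In the meantime you can approximate it yourself. A minimal sketch, assuming cgroup v2 (the /sys/fs/cgroup/memory.max path) and an arbitrary 90% headroom factor; setMemLimitFromCgroup is a made-up helper name, and the value is fed into the runtime's soft memory limit, i.e. the same knob GOMEMLIMIT sets:

    package main

    import (
        "os"
        "runtime/debug"
        "strconv"
        "strings"
    )

    // setMemLimitFromCgroup is a made-up helper: it reads the cgroup v2
    // memory.max file and, if a hard limit is configured, sets the runtime's
    // soft memory limit (the same knob as GOMEMLIMIT) to ~90% of it.
    func setMemLimitFromCgroup() {
        data, err := os.ReadFile("/sys/fs/cgroup/memory.max")
        if err != nil {
            return // not cgroup v2, or not readable; keep the default (no limit)
        }
        s := strings.TrimSpace(string(data))
        if s == "max" {
            return // no hard limit configured
        }
        limit, err := strconv.ParseInt(s, 10, 64)
        if err != nil {
            return
        }
        // Leave headroom for non-heap memory (goroutine stacks, cgo, etc.).
        debug.SetMemoryLimit(limit / 10 * 9)
    }

    func main() {
        setMemLimitFromCgroup()
        // ... start the actual program here
    }

There are also community packages in the spirit of automaxprocs that do this automatically (automemlimit, if I remember the name right).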
Java has been doing these things for a while; even OpenJDK 8 has had those patches since probably before COVID.
Or is it? Need calculations
Let's go with three quadrillion (a quadrillion is apparently 10^15). Assume a server CPU runs at 3 GHz (3x10^9 cycles per second); that works out to 10^6 seconds, and a day is about 100k seconds, so roughly ten days. But of course we're only saving cycles. I've seen throughput increase by about 50% when setting GOMAXPROCS on bigger machines, but in most of those cases we're looking at containers with fractional cores. On the other hand, there are many containers. So...
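Written out in the same notation:

    (3x10^15 cycles) / (3x10^9 cycles/s) = 10^6 s;  10^6 s / 86,400 s/day ~ 11.6 days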
Hey, but what did you have in mind with regard to bigger machines? I think we're talking here about lowering GOMAXPROCS to get, in effect, less context switching of OS threads. While that can bring some good results, my gut feeling is that it'd hardly be 50% faster overall. Is your scenario the same?
> It may overcount the amount of parallelism available when limited by a process-wide affinity mask or cgroup quotas and sched_getaffinity() or cgroup fs can’t be queried, e.g. due to sandboxing.
[1] https://docs.rs/tokio/1.45.0/src/tokio/loom/std/mod.rs.html#...
[2] https://doc.rust-lang.org/stable/std/thread/fn.available_par...
The fundamental issue comes down to background GC and CPU quotas in cgroups.
If your number of worker threads is too high, GC will eat up all the quota.
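A minimal sketch of the manual workaround, assuming cgroup v2 and its cpu.max format of "<quota> <period>"; maxProcsFromQuota is a made-up helper that turns the quota into a whole CPU count and caps GOMAXPROCS to it, so the background GC workers can't burn through the whole quota:

    package main

    import (
        "math"
        "os"
        "runtime"
        "strconv"
        "strings"
    )

    // maxProcsFromQuota is a made-up helper that parses the cgroup v2 cpu.max
    // file ("<quota> <period>", or "max <period>" when unlimited) and returns
    // the quota expressed as a whole number of CPUs, rounded up, minimum 1.
    func maxProcsFromQuota() (int, bool) {
        data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
        if err != nil {
            return 0, false // not cgroup v2, or not readable
        }
        fields := strings.Fields(string(data))
        if len(fields) != 2 || fields[0] == "max" {
            return 0, false // no quota configured
        }
        quota, err1 := strconv.ParseFloat(fields[0], 64)
        period, err2 := strconv.ParseFloat(fields[1], 64)
        if err1 != nil || err2 != nil || period <= 0 {
            return 0, false
        }
        procs := int(math.Ceil(quota / period))
        if procs < 1 {
            procs = 1
        }
        return procs, true
    }

    func main() {
        // Only lower GOMAXPROCS; never raise it above the visible core count.
        if procs, ok := maxProcsFromQuota(); ok && procs < runtime.NumCPU() {
            runtime.GOMAXPROCS(procs)
        }
        // ... start the actual program here
    }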
NAHWheatCracker•1mo ago
I'm not a low-level optimization guy, but I've had occasions where I wanted control over which threads my goroutines run on, or wanted to prioritize important goroutines. It's a trade-off for making things less complex, which is standard for Go.
I suppose there's always hope that the Go developers can change things.
silisili•1mo ago
If you model it in a way where you have one goroutine per OS thread that receives and does work, it gets you close. But in many cases that means re-architecting the entire code base, as it's not a style I typically reach for.
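Roughly the shape I mean, as a sketch (the worker and channel here are made up; runtime.LockOSThread is the real API that ties a goroutine to its OS thread):

    package main

    import (
        "fmt"
        "runtime"
    )

    // pinnedWorker drains a channel of work on a dedicated OS thread:
    // LockOSThread ties this goroutine to whichever thread it is running on
    // until the goroutine exits (or calls UnlockOSThread).
    func pinnedWorker(work <-chan func()) {
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()
        for fn := range work {
            fn()
        }
    }

    func main() {
        work := make(chan func())
        done := make(chan struct{})
        go pinnedWorker(work)

        work <- func() { fmt.Println("runs on the pinned thread") }
        work <- func() { close(done) }
        <-done
        close(work)
    }

Note that this pins to a thread, not to a particular core, and it says nothing about priority; for those you'd still be dropping to platform-specific syscalls, which is exactly the part Go doesn't expose.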
silisili•1mo ago
That said, if you're greenfielding and see this as a limitation to begin with, picking another language is probably the right way.
jerf•1mo ago
If you need it pervasively, Go may not be the correct choice. Then again, the list of languages that is not a correct choice in that case is quite long. That's a minority case. An important one, but a minority one.