You can move objects while using conservative stack scanning. This is a common approach. JavaScriptCore used to use it.
You can have super fast allocation in a non-moving collector, but that involves an algorithm that is vastly different from the Boehm one. I think the fastest non-moving collectors have similar allocation fast paths to the fastest moving collectors. JavaScriptCore has a fast non-moving allocator called bump'n'pop. In Fil-C, I use a different approach that I call SIMD turbosweep. There's also the Immix approach. And there are many others.
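For a rough idea of what a fast non-moving path can look like, here's a hypothetical sketch of the general bump'n'pop idea as I understand it (illustrative only, not JSC's actual code): bump through the current free interval, and pop the next interval off a per-size-class list when the current one runs out.

    #include <stddef.h>

    /* Hypothetical bump'n'pop-style fast path (illustrative, not JSC's code):
       bump through the current free interval; when it's exhausted, pop the
       next free interval left behind by the sweeper. */
    struct free_interval { struct free_interval *next; char *end; };
    struct size_class {
        char *bump;                      /* next free byte in current interval */
        char *limit;                     /* end of current interval */
        struct free_interval *intervals; /* free intervals found by the sweeper */
        size_t cell_size;
    };

    void *alloc_slow(struct size_class *sc);  /* sweep or collect, then allocate */

    static inline void *alloc_fast(struct size_class *sc) {
        char *p = sc->bump;
        if (p + sc->cell_size <= sc->limit) {  /* common case: just bump */
            sc->bump = p + sc->cell_size;
            return p;
        }
        struct free_interval *iv = sc->intervals;
        if (iv) {                              /* pop the next free interval */
            sc->intervals = iv->next;
            sc->bump = (char *)iv + sc->cell_size;
            sc->limit = iv->end;
            return iv;
        }
        return alloc_slow(sc);
    }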
I don’t believe this in the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.
A few years ago I ran some benchmarks on an old but vaguely reasonable workload. I came up with a p95 of just 25 nanoseconds but a p99.9 on the order of tens of microseconds. https://www.forrestthewoods.com/blog/benchmarking-malloc-wit...
Of course “2% of time in GC” is doing a lot of heavy lifting here. But I’d really need to see a real workload before I start to believe it.
You were measuring malloc, so of course you came up with numbers that were 20 times worse than PyPy's nursery allocator. That's because malloc is 20 times slower, whatever common sense may say.
Also, you're talking about tail latency, while CF Bolz was measuring throughput. Contrary to your assertion, throughput is indeed a meaningful metric, though for interactive UIs such as videogames, tail latency is often more important. For applications like compilers and SMT solvers, on the other hand, throughput matters more.
You're right that there are some other costs. A generational or incremental GC more or less requires write barriers, which slow down the parts of your code that mutate existing objects instead of allocating new ones. And a generational collector can move existing objects to a different address, which complicates FFIs that might try to retain pointers to them outside the GC's purview.
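For concreteness, here's a minimal sketch of the kind of generational write barrier meant here; the names are made up, and real collectors typically use card tables or sequential store buffers rather than an explicit flag plus a list.

    /* Hypothetical write barrier sketch: when an old-generation object is
       mutated to point at a young (nursery) object, remember it so the next
       minor collection can treat it as a root. */
    struct gc_object { unsigned char in_old_gen; unsigned char remembered; };

    void remembered_set_add(struct gc_object *obj);  /* assumed runtime hook */

    static inline void write_field(struct gc_object *obj,
                                   struct gc_object **field,
                                   struct gc_object *value) {
        *field = value;                              /* the actual store */
        if (obj->in_old_gen && value && !value->in_old_gen
            && !obj->remembered) {                   /* old -> young edge */
            obj->remembered = 1;
            remembered_set_add(obj);
        }
    }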
There are a lot of papers running captured allocation traces like you did against different memory allocators, some of them including GCed allocators. Ben Zorn did a lot of work like this in the 90s, coming to the conclusion that GC was somewhat faster than malloc/free. Both GCs and mallocs have improved substantially since then, but it would be easy to imagine that the results would be similar. Maybe someone has done it.
But to a significant extent we write performance-critical code according to the performance characteristics of our platforms. If mutation is fast but allocation is slow, you tend to write code that does less allocation and more mutation than you would if mutation was slow and allocation was fast, all else being equal. So you'd expect code written for platforms with fast allocation to allocate more, which makes faster allocation a bigger advantage for that code. Benchmarking captured allocation traces won't capture that effect.
Tail latency has always been one of the worst things about GCs. I hear amazing things about new real-time GC algorithms, including remarkable reports from Azul, but I don't know if they're true or what their price is.
> coming to the conclusion that GC was somewhat faster than malloc/free
Many people have reached that conclusion. It's certainly not true for real-time cases where tail latency is critical.
Modern GCs are certainly better than old ones.
But you can’t take a random Python GC’d workload, slap it on a bump allocator, and call it good. A consequence of bump allocators is that you can’t, ahem, freely free objects in the middle. And pushing onto alternating lists/vecs/etc. is notoriously tricky.
Workload always matters. Always.
> The allocation fast path of the RPython GC is a simple bump pointer, and the assembly for it.
Indeed, a bump allocator has the disadvantage that it continuously thrashes the lower cache levels, but I'd think that's a pretty universal, workload-independent issue. (And the cache's LRU will still keep recently used lines resident, and the thrashing is by definition spread fairly evenly, giving the user code plenty of time between reuses of the data it wants to stay in cache, which softens the issue a bit.)
And given that a Python impl will necessarily need a GC path anyway, which will already thrash the cache (and Python users aren't and won't be writing free()s regardless), it's probably not a bad trade overall.
I'd assume that for pushing to lists the runtime just explicitly reserves capacity ahead of time instead of trying to integrate with the memory manager in any way. You'd want to do that anyway even if using malloc & malloc_usable_size or similar, as you'll end up reallocating way too much without it. (In a quick test, using my system malloc + malloc_usable_size + realloc to push 100000 8-byte elements ended up calling realloc about 8500 times, with realloc settling on rounding up to the next 4KB; the actual pointer only changes 10-50 times, but that's still pretty bad, and would presumably get much worse if the code did anything other than push to a single vec.)
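Roughly the kind of test I mean, as a sketch assuming glibc's malloc_usable_size; the exact realloc counts will of course vary by allocator.

    /* Push 100000 8-byte elements, calling realloc only when
       malloc_usable_size says we're out of room, and count how often
       realloc gets called and how often the pointer actually moves. */
    #include <malloc.h>   /* malloc_usable_size (glibc) */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        long long *v = NULL;
        size_t n = 0, reallocs = 0, moves = 0;
        for (long long i = 0; i < 100000; i++) {
            if ((n + 1) * sizeof *v > malloc_usable_size(v)) {
                long long *nv = realloc(v, (n + 1) * sizeof *v);
                if (!nv) { free(v); return 1; }
                reallocs++;
                if (nv != v) moves++;
                v = nv;
            }
            v[n++] = i;
        }
        printf("reallocs: %zu, pointer moved: %zu times\n", reallocs, moves);
        free(v);
        return 0;
    }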
FWIW I’m super supportive of bump allocators and arenas. I just don’t find this particular analysis compelling or sufficiently informative. YMMV.
I could perhaps be convinced that modern GCs are super fast. But in hindsight your blog post is intended to preach to the choir, not to change the minds of non-believers. If you ever write an apologetic I will gladly read it. It needs to be a holistic analysis of memory operation cost across all threads.
Bolz's blog post is not written to persuade anyone but to provide objective, verifiable information from which people can form their own conclusions.
I agree that a holistic analysis of memory operation cost across all threads would be not only more persuasive but more informative. But it is also much more difficult.
Unfortunately I appear to lack sufficient context to draw any conclusions from the OP's post. More context and more data needed (for me).
Yes, actually, you can, if your pointer-bumping allocator is the nursery allocator for a generational GC, which is the case in PyPy. In fact, you can free objects in the middle so freely that it takes literally zero instructions to free them; all you have to do is not retain a reference to them, and when it is time to move the live objects out of the nursery, the GC will avoid moving them because it never finds a pointer to them. That's how this microbenchmark is able to spend only 2% of its time in the GC, even though it's doing nothing but allocating and initializing objects!
This allows you to, precisely, take a random Python GC'd workload and satisfy almost all of its allocations with a simple pointer-bumping allocator. It avoids all the trickiness of pushing onto alternating lists/vec/etc.
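A minimal sketch of what such a nursery fast path looks like (hypothetical names, not RPython's actual code); everything left behind in the nursery that the minor collection can't reach is "freed" for free.

    #include <stddef.h>

    /* Hypothetical nursery bump-allocation fast path for a generational GC.
       Dead objects cost nothing to free: the minor collection only copies
       reachable objects out of the nursery and then resets the bump pointer. */
    struct nursery { char *top; char *end; };

    void *minor_collect_and_alloc(struct nursery *n, size_t size);  /* slow path */

    static inline void *nursery_alloc(struct nursery *n, size_t size) {
        char *obj = n->top;
        char *new_top = obj + size;        /* size assumed already aligned */
        if (new_top > n->end)              /* nursery full: collect, then retry */
            return minor_collect_and_alloc(n, size);
        n->top = new_top;                  /* the entire fast path: bump + check */
        return obj;
    }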
It's true that in production workloads the time spent in the GC is usually more than 2%, typically more like 15%. The time wasted on write barriers, or on writing your code to copy objects unnecessarily to avoid write barriers, is harder to quantify, much like the time wasted on mutating objects unnecessarily to avoid allocation.
It is also true that workload always matters for allocator performance. So it's instructive to see how well RPython's allocator does with a workload which would be a pathological worst case for the kinds of allocators you're accustomed to, yielding performance so dramatically superior that you dismissed it as impossible, like when GMail launched with a gigabyte of email storage per user and Hotmail assumed it was a joke.
> I don’t believe this in even the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.
I just want to make the point here that making bump allocators fast is useful because it's an essential part of modern allocators and will be called A LOT in the case of CPython.
That benchmark is useful to make sure the codegen/JIT is performing like it should, but it is still just measuring bump-pointer allocation in the most ideal situation. 2 cycles/alloc is entirely reasonable considering modern uarchs; the critical path is two one-cycle uops.
Now, I agree with your point that this is far from meaningful for the p95/p99 of actual workloads. That is indeed the metric that users or customers actually care about. Modern GCs, like Go's, have moved away from using throughput as a metric. It's "easy" to make bump allocators fast; the hard problem is everything else around them: you still need to allocate your arenas/pools and manage object lifetimes. In the case of CPython that's refcounting, deallocs, GCs, work that modern CPUs really suck at (bad locality, bad branching).
I measured ~19.5% of the wallclock time of pyperformance benchmarks being related to memory management (ISPASS'25 poster). That's not taking into account the extra boxing and struct slots being created for bookkeeping purposes, which adds more second-order effects related to cache pressure. It's insanely hard to accurately measure memory management overheads.
Happy to hear it! Thanks.
kragen•7mo ago
This is about 2–4× faster than my pointer-bumping arena allocator for C, kmregion†, which is a similar number of instructions on the (inlined) fast path. But possibly that's because I was testing on slower hardware. I was also testing with 16-byte initialized objects, but without a GC. It's about 10× the speed of malloc/free.
I don't know that I'd recommend using kmregion, since it's never been used for anything serious, but it should at least serve as a proof of concept.
______
† http://canonical.org/~kragen/sw/dev3/kmregion.h http://canonical.org/~kragen/sw/dev3/kmregion.c http://canonical.org/~kragen/sw/dev3/kmregion_example.c
runlaszlorun•7mo ago
And is the speedup over malloc/free due to large-block allocation as opposed to individual mallocs?
charleslmunger•7mo ago
https://godbolt.org/z/1oTrv1Y58
Messing with it a bit, it seems like yours has a slightly shorter dependency chain due to loading the two members separately, where UPB loads them as a pair (as it needs both in order to determine how much space is available). It also seems to have less register pressure. I think that's because yours bumps down; UPB supports in-place forward extension, so it needs to bump up.
If you added branch hints to signal to the compiler that your slow path is not often hit, you might see some improvement (although if you have PGO it should already do this). These paths could also be good candidates for the `preserve_most` calling convention.
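Something like this, as a sketch assuming Clang/GCC builtins (`preserve_most` is a Clang extension, and the names here are illustrative rather than taken from either implementation):

    #include <stddef.h>

    /* Sketch: hint that the refill path is rarely taken, and give it the
       preserve_most calling convention so the hot path keeps more registers
       live across the out-of-line call.  preserve_most is Clang-only. */
    struct arena { char *ptr; char *limit; };

    __attribute__((preserve_most, noinline))
    void *arena_alloc_slow(struct arena *a, size_t size);  /* refill, then allocate */

    static inline void *arena_alloc(struct arena *a, size_t size) {
        char *p = a->ptr;
        if (__builtin_expect(p + size > a->limit, 0))       /* rarely taken */
            return arena_alloc_slow(a, size);
        a->ptr = p + size;
        return p;
    }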
However, there is an unfortunate compiler behavior here for both implementations: the compiler doesn't track whether the slow path (which is not inlined, and clobbers the pointers) was actually hit, so it reloads on the hot path either way. This means that a sequence of allocations will store and load the arena pointers repeatedly, when ideally they'd keep the current position in a register on the hot path and refill that register only after the cold path clobbers it.