
How fast can the RPython GC allocate?

https://pypy.org/posts/2025/06/rpython-gc-allocation-speed.html
41•todsacerdoti•10h ago

Comments

kragen•8h ago
My summary is that it's about one or two allocations per nanosecond on CF Bolz's machine, an AMD Ryzen 7 PRO 7840U, presumably on one core, and it's about 11 instructions per allocation.

This is about 2–4× faster than my pointer-bumping arena allocator for C, kmregion†, which uses a similar number of instructions on its (inlined) fast path. But possibly that's because I was testing on slower hardware. I was also testing with 16-byte initialized objects, but without a GC. It's about 10× the speed of malloc/free.

I don't know that I'd recommend using kmregion, since it's never been used for anything serious, but it should at least serve as a proof of concept.

______

† http://canonical.org/~kragen/sw/dev3/kmregion.h http://canonical.org/~kragen/sw/dev3/kmregion.c http://canonical.org/~kragen/sw/dev3/kmregion_example.c

runlaszlorun•7h ago
I don't know much about language internals or allocation, but I'm learning. Why could this be significantly faster than a bump/arena allocator?

And is the speedup over malloc/free due to allocating one large block as opposed to many individual mallocs?

kragen•7h ago
It is a bump allocator. I don't know why it's so much faster than mine, but my hypothesis was that CF Bolz was testing on a faster machine. The speedup over malloc/free is because bumping a pointer is much faster than calling a subroutine.
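
Concretely, the fast path looks roughly like this (a minimal sketch with hypothetical names, not kmregion's actual code; alignment and block headers are omitted): the whole allocation is a subtract, a compare, and a store, and the out-of-line refill only runs when the current block is exhausted.

```c
#include <stddef.h>

/* Hypothetical bump-down arena: the cursor moves from the top of the
   current block toward 'limit'. Not kmregion's actual layout. */
typedef struct {
    char *p;      /* current allocation cursor */
    char *limit;  /* lowest usable address in the current block */
} arena;

void *arena_refill(arena *a, size_t n);  /* slow path: get a fresh block */

static inline void *arena_alloc(arena *a, size_t n)
{
    char *q = a->p - n;          /* bump down */
    if (q < a->limit)            /* out of room in this block? */
        return arena_refill(a, n);
    a->p = q;                    /* commit the new cursor */
    return q;
}
```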
charleslmunger•4h ago
I simulated yours vs the UPB arena fast path:

https://godbolt.org/z/1oTrv1Y58

Messing with it a bit, it seems like yours has a slightly shorter dependency chain due to loading the two members separately, whereas UPB loads them as a pair (as it needs both in order to determine how much space is available). Yours also seems to have less register pressure. I think that's because yours bumps down; UPB supports in-place forward extension, so it needs to bump up.

If you added branch hints to signal to the compiler that your slow path is not often hit, you might see some improvement (although if you have PGO it should already do this). These paths could also be good candidates for the `preserve_most` calling convention.
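
A sketch of what that could look like, assuming GCC/Clang's `__builtin_expect` for the branch hint and Clang's `preserve_most` attribute (both compiler-specific; hypothetical names, same arena layout as the sketch upthread):

```c
#include <stddef.h>

typedef struct { char *p, *limit; } arena;  /* same hypothetical layout as upthread */

/* Out-of-line slow path. With Clang's preserve_most calling convention the
   callee saves nearly all registers, so the hot caller spills less around
   the (rare) call. */
__attribute__((preserve_most))
void *arena_refill(arena *a, size_t n);

static inline void *arena_alloc(arena *a, size_t n)
{
    char *q = a->p - n;
    if (__builtin_expect(q < a->limit, 0))  /* hint: refill is rarely taken */
        return arena_refill(a, n);
    a->p = q;
    return q;
}
```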

However, there is an unfortunate compiler behavior here for both implementations: the compiler doesn't track whether the slow path (which is not inlined, and clobbers the pointers) was actually hit, so it reloads the pointers on the hot path in both approaches. This means that a sequence of allocations will store and load the arena pointers repeatedly, when ideally it would keep the current position in a register on the hot path and refill that register only after the cold path clobbers it.
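
Continuing the hypothetical sketch above (and assuming arena_refill installs a fresh block, carves the requested bytes out of it, updates a->p/a->limit, and returns the new object), the reload problem and a hand-written version of the ideal code look roughly like this:

```c
/* What the compiler effectively generates: each call reloads a->p because
   it can't prove the possible refill call above didn't change it. */
void two_allocs(arena *a, void **out1, void **out2)
{
    *out1 = arena_alloc(a, 16);  /* load a->p, store the new cursor */
    *out2 = arena_alloc(a, 16);  /* reloads a->p even if refill never ran */
}

/* The ideal shape: cursor and limit live in locals (registers) across the
   run of allocations, with one store on the way out; they are re-read only
   after the rare cold call. */
void two_allocs_cached(arena *a, void **out1, void **out2)
{
    char *p = a->p, *limit = a->limit, *q;

    q = p - 16;                              /* first allocation */
    if (__builtin_expect(q < limit, 0)) {
        a->p = p;                            /* sync before the cold call */
        *out1 = arena_refill(a, 16);
        p = a->p; limit = a->limit;          /* re-read after the cold call */
    } else {
        p = q;
        *out1 = q;
    }

    q = p - 16;                              /* second allocation */
    if (__builtin_expect(q < limit, 0)) {
        a->p = p;
        *out2 = arena_refill(a, 16);
        p = a->p; limit = a->limit;
    } else {
        p = q;
        *out2 = q;
    }

    a->p = p;                                /* one store on the way out */
}
```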

kragen•4h ago
Thank you very much! I vaguely remember that it did that, and the failure to keep the pointer in registers might explain why PyPy's version is twice as fast (?).
pizlonator•8h ago
The reason their allocator is faster than Boehm's isn't conservative stack scanning.

You can move objects while using conservative stack scanning. This is a common approach. JavaScriptCore used to use it.

You can have super fast allocation in a non-moving collector, but that involves an algorithm that is vastly different from the Boehm one. I think the fastest non-moving collectors have similar allocation fast paths to the fastest moving collectors. JavaScriptCore has a fast non-moving allocator called bump'n'pop. In Fil-C, I use a different approach that I call SIMD turbosweep. There's also the Immix approach. And there are many others.
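
For a flavor of how a non-moving collector can still allocate in a handful of instructions, here is a generic segregated free-list pop (an illustration only, with hypothetical names; it is not JavaScriptCore's bump'n'pop, Fil-C's turbosweep, or Immix, whose details all differ): the fast path is one load, one well-predicted branch, and one store.

```c
#include <stddef.h>

/* Each free cell stores a pointer to the next free cell of its size class. */
typedef struct free_cell { struct free_cell *next; } free_cell;

typedef struct {
    free_cell *head;               /* free list for one size class */
} size_class;

void *alloc_slow(size_class *sc);  /* slow path: sweep or fetch more memory */

static inline void *alloc_fast(size_class *sc)
{
    free_cell *c = sc->head;
    if (__builtin_expect(c == NULL, 0))
        return alloc_slow(sc);     /* rare: this size class is exhausted */
    sc->head = c->next;            /* pop the cell off the free list */
    return c;
}
```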

forrestthewoods•5h ago
> Every allocation takes 110116790943 / 10000000000 ≈ 11 instructions and 21074240395 / 10000000000 ≈ 2.1 cycles

I don’t believe this even in the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.

A few years ago I ran some benchmarks on an old but vaguely reasonable workload. I came up with a p95 of just 25 nanoseconds but a p99.9 on the order of tens of microseconds. https://www.forrestthewoods.com/blog/benchmarking-malloc-wit...

Of course “2% of time in GC” is doing a lot of heavy lifting here. But I’d really need to see a real workload before I start to believe it.

kragen•2h ago
You should believe it, because it's true.

You were measuring malloc, so of course you came up with numbers that were 20 times worse than PyPy's nursery allocator. That's because malloc is 20 times slower, whatever common sense may say.

Also, you're talking about tail latency, while CF Bolz was measuring throughput. Contrary to your assertion, throughput is indeed a meaningful metric, although tail latency is often more important, especially for interactive UIs such as videogames. For applications like compilers and SMT solvers, on the other hand, throughput matters more.

You're right that there are some other costs. A generational or incremental GC more or less requires write barriers, which slow down the parts of your code that mutate existing objects instead of allocating new ones. And a generational collector can move existing objects to a different address, which complicates FFIs that might try to retain pointers to them outside the GC's purview.
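
For a sense of what the write barrier costs, here is a sketch of one common scheme, card marking (hypothetical names and constants; not necessarily what PyPy does): every pointer store into the heap also dirties one byte in a side table, so a minor collection only has to scan dirty cards for old-to-young pointers, at the price of extra work on every mutation.

```c
#include <stdint.h>

#define CARD_SHIFT 9                     /* 512-byte cards, an arbitrary choice */

extern uint8_t  *card_table;             /* one dirty byte per card */
extern uintptr_t heap_base;              /* start of the GC-managed heap */

/* Store a reference into a heap slot and mark its card dirty. */
static inline void write_ref(void **slot, void *new_value)
{
    *slot = new_value;                                    /* the actual store */
    uintptr_t card = ((uintptr_t)slot - heap_base) >> CARD_SHIFT;
    card_table[card] = 1;                                 /* barrier overhead */
}
```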

There are a lot of papers running captured allocation traces like you did against different memory allocators, some of them including GCed allocators. Ben Zorn did a lot of work like this in the 90s, coming to the conclusion that GC was somewhat faster than malloc/free. Both GCs and mallocs have improved substantially since then, but it would be easy to imagine that the results would be similar. Maybe someone has done it.

But to a significant extent we write performance-critical code according to the performance characteristics of our platforms. If mutation is fast but allocation is slow, you tend to write code that does less allocation and more mutation than you would if mutation was slow and allocation was fast, all else being equal. So you'd expect code written for platforms with fast allocation to allocate more, which makes faster allocation a bigger advantage for that code. Benchmarking captured allocation traces won't capture that effect.

Tail latency has always been one of the worst things about GCs. I hear amazing things about new real-time GC algorithms, including remarkable reports from Azul, but I don't know if they're true or what their price is.

forrestthewoods•35m ago
Sorry, I refuse to accept a trivial loop that generates up to 2 objects at a time as ever being a relevant benchmark. I didn’t say throughput wasn’t relevant. I said this test isn’t convincing.

> coming to the conclusion that GC was somewhat faster than malloc/free

Many people have reached that conclusion. It’s certainly not true for real-time cases where tail latency is critical.

Modern GCs are certainly better than old ones.

Accumulation of cognitive debt when using an AI assistant for essay writing task

https://arxiv.org/abs/2506.08872
98•stephen_g•3h ago•48 comments

Being a Force Multiplier

https://substack.com/home/post/p-165651243
7•jandrewrogers•3d ago•1 comment

Lisp-stat: Lisp environment for statistical computing

https://lisp-stat.dev/about/
35•oumua_don17•1d ago•8 comments

Modifying an HDMI dummy plug's EDID using a Raspberry Pi

https://www.downtowndougbrown.com/2025/06/modifying-an-hdmi-dummy-plugs-edid-using-a-raspberry-pi/
217•zdw•14h ago•58 comments

Twin – A Textmode WINdow Environment

https://github.com/cosmos72/twin
75•kim_rutherford•10h ago•11 comments

Why SSL was renamed to TLS in late 90s (2014)

https://tim.dierks.org/2014/05/security-standards-and-name-changes-in.html
242•Bogdanp•16h ago•115 comments

Chemical knowledge and reasoning of large language models vs. chemist expertise

https://www.nature.com/articles/s41557-025-01815-x
46•bookofjoe•1d ago•9 comments

Canyon.mid

https://canyonmid.com/
263•LorenDB•16h ago•155 comments

Childhood leukemia: how a deadly cancer became treatable

https://ourworldindata.org/childhood-leukemia-treatment-history
183•surprisetalk•16h ago•47 comments

Telephone Exchanges in the UK

https://telephone-exchanges.org.uk/
110•petecooper•10h ago•36 comments

Jokes and Humour in the Public Android API

https://voxelmanip.se/2025/06/14/jokes-and-humour-in-the-public-android-api/
30•todsacerdoti•5h ago•6 comments

Datalog in Rust

https://github.com/frankmcsherry/blog/blob/master/posts/2025-06-03.md
268•brson•18h ago•28 comments

First 2D, non-silicon computer developed

https://www.psu.edu/news/research/story/worlds-first-2d-non-silicon-computer-developed
88•giuliomagnifico•3d ago•16 comments

DARPA program sets distance record for power beaming

https://www.darpa.mil/news/2025/darpa-program-distance-record-power-beaming
31•gnabgib•7h ago•16 comments

Reinventing circuit breakers with supercritical CO2

https://spectrum.ieee.org/sf6-gas-replacement
64•rbanffy•7h ago•27 comments

Datalog in miniKanren

https://deosjr.github.io/dynamicland/datalog.html
92•deosjr•13h ago•8 comments

How to modify Starlink Mini to run without the built-in WiFi router

https://olegkutkov.me/2025/06/15/how-to-modify-starlink-mini-to-run-without-the-built-in-wifi-router/
281•LorenDB•17h ago•77 comments

Simplest C++ Callback, from SumatraPDF

https://blog.kowalczyk.info/a-stsj/simplest-c-callback-from-sumatrapdf.html
105•jandeboevrie•12h ago•82 comments

Real-time CO2 monitoring without batteries or external power

https://news.kaist.ac.kr/newsen/html/news/?mode=V&mng_no=47450
20•gnabgib•7h ago•4 comments

Fields where Native Americans farmed a thousand years ago discovered in Michigan

https://www.smithsonianmag.com/smart-news/massive-field-where-native-american-farmers-grew-corn-beans-and-squash-1000-years-ago-discovered-in-michigan-180986758/
178•CoopaTroopa•3d ago•75 comments

Random Walk: A Modern Introduction [pdf]

https://www.math.uchicago.edu/~lawler/srwbook.pdf
19•Anon84•3d ago•1 comment

Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book

https://www.understandingai.org/p/metas-llama-31-can-recall-42-percent
110•aspenmayer•18h ago•147 comments

Is Gravity Just Entropy Rising? Long-Shot Idea Gets Another Look

https://www.quantamagazine.org/is-gravity-just-entropy-rising-long-shot-idea-gets-another-look-20250613/
9•pseudolus•5h ago•0 comments

David Attenborough at 99: 'I will not see how the story ends'

https://www.thetimes.com/life-style/celebrity/article/david-attenborough-book-extract-age-99-lj3rd2fg7
170•herbertl•8h ago•82 comments

Foundations of Computer Vision

https://visionbook.mit.edu
175•tzury•19h ago•7 comments

The Hewlett-Packard Archive

https://hparchive.com
11•joebig•4h ago•0 comments

Cyborg Embryos Offer New Insights into Brain Growth

https://spectrum.ieee.org/embryo-electrode-array
20•rbanffy•3d ago•1 comment

Cure Dolly's Japanese Grammar Lessons

https://kellenok.github.io/cure-script/
71•agnishom•2d ago•13 comments

The Art of Lisp and Writing (2003)

https://www.dreamsongs.com/ArtOfLisp.html
172•Bogdanp•23h ago•68 comments

Text-to-LoRA: Hypernetwork that generates task-specific LLM adapters (LoRAs)

https://github.com/SakanaAI/text-to-lora
108•dvrp•4d ago•10 comments