frontpage.

GLM-OCR: Accurate × Fast × Comprehensive

https://github.com/zai-org/GLM-OCR
1•ms7892•47s ago•0 comments

Local Agent Bench: Test 11 small LLMs on tool-calling judgment, on CPU, no GPU

https://github.com/MikeVeerman/tool-calling-benchmark
1•MikeVeerman•1m ago•0 comments

Show HN: AboutMyProject – A public log for developer proof-of-work

https://aboutmyproject.com/
1•Raiplus•1m ago•0 comments

Expertise, AI and the Work of the Future [video]

https://www.youtube.com/watch?v=wsxWl9iT1XU
1•indiantinker•2m ago•0 comments

So Long to Cheap Books You Could Fit in Your Pocket

https://www.nytimes.com/2026/02/06/books/mass-market-paperback-books.html
1•pseudolus•2m ago•1 comments

PID Controller

https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller
1•tosh•7m ago•0 comments

SpaceX Rocket Generates 100GW of Power, or 20% of US Electricity

https://twitter.com/AlecStapp/status/2019932764515234159
1•bkls•7m ago•0 comments

Kubernetes MCP Server

https://github.com/yindia/rootcause
1•yindia•8m ago•0 comments

I Built a Movie Recommendation Agent to Solve Movie Nights with My Wife

https://rokn.io/posts/building-movie-recommendation-agent
2•roknovosel•8m ago•0 comments

What were the first animals? The fierce sponge–jelly battle that just won't end

https://www.nature.com/articles/d41586-026-00238-z
2•beardyw•16m ago•0 comments

Sidestepping Evaluation Awareness and Anticipating Misalignment

https://alignment.openai.com/prod-evals/
1•taubek•16m ago•0 comments

OldMapsOnline

https://www.oldmapsonline.org/en
1•surprisetalk•19m ago•0 comments

What It's Like to Be a Worm

https://www.asimov.press/p/sentience
2•surprisetalk•19m ago•0 comments

Don't go to physics grad school and other cautionary tales

https://scottlocklin.wordpress.com/2025/12/19/dont-go-to-physics-grad-school-and-other-cautionary...
1•surprisetalk•19m ago•0 comments

Lawyer sets new standard for abuse of AI; judge tosses case

https://arstechnica.com/tech-policy/2026/02/randomly-quoting-ray-bradbury-did-not-save-lawyer-fro...
2•pseudolus•19m ago•0 comments

AI anxiety batters software execs, costing them combined $62B: report

https://nypost.com/2026/02/04/business/ai-anxiety-batters-software-execs-costing-them-62b-report/
1•1vuio0pswjnm7•20m ago•0 comments

Bogus Pipeline

https://en.wikipedia.org/wiki/Bogus_pipeline
1•doener•21m ago•0 comments

Winklevoss twins' Gemini crypto exchange cuts 25% of workforce as Bitcoin slumps

https://nypost.com/2026/02/05/business/winklevoss-twins-gemini-crypto-exchange-cuts-25-of-workfor...
1•1vuio0pswjnm7•21m ago•0 comments

How AI Is Reshaping Human Reasoning and the Rise of Cognitive Surrender

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6097646
3•obscurette•21m ago•0 comments

Cycling in France

https://www.sheldonbrown.com/org/france-sheldon.html
1•jackhalford•23m ago•0 comments

Ask HN: What breaks in cross-border healthcare coordination?

1•abhay1633•23m ago•0 comments

Show HN: Simple – a bytecode VM and language stack I built with AI

https://github.com/JJLDonley/Simple
1•tangjiehao•26m ago•0 comments

Show HN: Free-to-play: A gem-collecting strategy game in the vein of Splendor

https://caratria.com/
1•jonrosner•27m ago•1 comments

My Eighth Year as a Bootstrapped Founder

https://mtlynch.io/bootstrapped-founder-year-8/
1•mtlynch•27m ago•0 comments

Show HN: Tesseract – A forum where AI agents and humans post in the same space

https://tesseract-thread.vercel.app/
1•agliolioyyami•27m ago•0 comments

Show HN: Vibe Colors – Instantly visualize color palettes on UI layouts

https://vibecolors.life/
2•tusharnaik•29m ago•0 comments

OpenAI is Broke ... and so is everyone else [video][10M]

https://www.youtube.com/watch?v=Y3N9qlPZBc0
2•Bender•29m ago•0 comments

We interfaced single-threaded C++ with multi-threaded Rust

https://antithesis.com/blog/2026/rust_cpp/
1•lukastyrychtr•30m ago•0 comments

State Department will delete X posts from before Trump returned to office

https://text.npr.org/nx-s1-5704785
7•derriz•30m ago•1 comments

AI Skills Marketplace

https://skly.ai
1•briannezhad•30m ago•1 comments

How fast can the RPython GC allocate?

https://pypy.org/posts/2025/06/rpython-gc-allocation-speed.html
55•todsacerdoti•7mo ago

Comments

kragen•7mo ago
My summary is that it's about one or two allocations per nanosecond on CF Bolz's machine, an AMD Ryzen 7 PRO 7840U, presumably on one core, and it's about 11 instructions per allocation.

This is about 2–4× faster than my pointer-bumping arena allocator for C, kmregion†, which uses a similar number of instructions on the (inlined) fast path. But possibly that's because I was testing on slower hardware. I was also testing with 16-byte initialized objects, but without a GC. It's about 10× the speed of malloc/free.

I don't know that I'd recommend using kmregion, since it's never been used for anything serious, but it should at least serve as a proof of concept.

______

† http://canonical.org/~kragen/sw/dev3/kmregion.h http://canonical.org/~kragen/sw/dev3/kmregion.c http://canonical.org/~kragen/sw/dev3/kmregion_example.c

runlaszlorun•7mo ago
I don't know much about language internals or allocation, but I'm learning. Why could this be significantly faster than a bump/arena allocator?

And is the speed up over malloc/free due to large block allocation as opposed to individual malloc?

kragen•7mo ago
It is a bump allocator. I don't know why it's so much faster than mine, but my hypothesis was that CF Bolz was testing on a faster machine. The speedup over malloc/free is because bumping a pointer is much faster than calling a subroutine.
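
For illustration, a minimal bump-pointer fast path might look like this in C. This is a hedged sketch under invented assumptions (the block size, alignment, and the `arena`/`arena_alloc` names), not kmregion's or RPython's actual code:

    /* Toy bump-pointer arena: illustrative only, not kmregion or RPython.
     * The fast path is a bounds check plus a pointer increment; the slow
     * path (new block from malloc) is kept out of line. */
    #include <stddef.h>
    #include <stdlib.h>

    typedef struct {
        char *ptr;    /* next free byte in the current block */
        char *limit;  /* end of the current block */
    } arena;

    /* Slow path: grab a fresh block and retry.  A real arena would chain
     * blocks so they can all be freed together; omitted here. */
    static __attribute__((noinline)) void *arena_alloc_slow(arena *a, size_t n) {
        size_t block = n > (1u << 20) ? n : (1u << 20);   /* 1 MiB default */
        char *mem = malloc(block);
        if (mem == NULL)
            return NULL;
        a->ptr = mem + n;
        a->limit = mem + block;
        return mem;
    }

    /* Fast path: in the common case this is a handful of instructions and
     * no subroutine call, which is where the speedup over malloc comes from. */
    static inline void *arena_alloc(arena *a, size_t n) {
        n = (n + 15) & ~(size_t)15;                       /* 16-byte align */
        if ((size_t)(a->limit - a->ptr) < n)              /* rarely true */
            return arena_alloc_slow(a, n);
        void *p = a->ptr;
        a->ptr += n;
        return p;
    }

Resetting or discarding the blocks then reclaims everything at once, which is also why a GC nursery can reclaim dead objects without per-object work, as discussed further down the thread.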
charleslmunger•7mo ago
I simulated yours vs the UPB arena fast path:

https://godbolt.org/z/1oTrv1Y58

Messing with it a bit, it seems like yours has a slightly shorter dependency chain due to loading the two members separately, whereas UPB loads them as a pair (as it needs both in order to determine how much space is available). It also seems to have less register pressure. I think that's because yours bumps down. UPB's supports in-place forward extension, so it needs to bump up.

If you added branch hints to signal to the compiler that your slow path is not often hit, you might see some improvement (although if you have PGO it should already do this). These paths could also be good candidates for the `preserve_most` calling convention.

However, there is an unfortunate compiler behavior here for both implementations - it doesn't track whether the slow path (which is not inlined, and clobbers the pointers) was actually hit, so it reloads on the hot path, for both approaches. Unfortunately this means that a sequence of allocations will store and load the arena pointers repeatedly, when ideally they'd keep the current position in a register on the hot path and refill that register after clobbering in the cold path.
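
Concretely, those two suggestions could look roughly like this. Everything here is an assumption for illustration (the `SLOWPATH` macro, the arena layout carried over from the sketch above); `preserve_most` is a Clang extension, and whether either change actually helps would need measuring:

    /* Sketch of the branch-hint / calling-convention idea.  The slow path
     * is marked cold, noinline and (under Clang) preserve_most, so it
     * preserves most caller-saved registers and the hot path has less
     * spill pressure around the call site. */
    #include <stddef.h>

    typedef struct { char *ptr, *limit; } arena;    /* as in the sketch above */

    #if defined(__clang__)
    #  define SLOWPATH __attribute__((preserve_most, cold, noinline))
    #else
    #  define SLOWPATH __attribute__((cold, noinline))
    #endif

    SLOWPATH void *arena_alloc_slow(arena *a, size_t n);  /* refill, not shown */

    static inline void *arena_alloc(arena *a, size_t n) {
        n = (n + 15) & ~(size_t)15;
        /* tell the compiler the refill branch is rarely taken */
        if (__builtin_expect((size_t)(a->limit - a->ptr) < n, 0))
            return arena_alloc_slow(a, n);
        void *p = a->ptr;
        a->ptr += n;
        return p;
    }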

kragen•7mo ago
Thank you very much! I vaguely remember that it did that, and the failure to keep the pointer in registers might explain why PyPy's version is twice as fast (?).
pizlonator•7mo ago
The reason their allocator is faster than Boehm isn't conservative stack scanning.

You can move objects while using conservative stack scanning. This is a common approach. JavaScriptCore used to use it.

You can have super fast allocation in a non-moving collector, but that involves an algorithm that is vastly different from the Boehm one. I think the fastest non-moving collectors have similar allocation fast paths to the fastest moving collectors. JavaScriptCore has a fast non-moving allocator called bump'n'pop. In Fil-C, I use a different approach that I call SIMD turbosweep. There's also the Immix approach. And there are many others.
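
As a rough illustration of how a non-moving fast path can stay short, here is a generic per-size-class free-list pop. This is a sketch of the general shape only; it is not JavaScriptCore's bump'n'pop, Fil-C's turbosweep, or Immix, and all names are invented:

    /* Generic segregated free-list fast path (illustrative only).  The
     * common case is: load the list head, store the next pointer back,
     * return the cell -- comparable in length to a bump-pointer path. */
    #include <stddef.h>
    #include <stdlib.h>

    typedef struct free_cell { struct free_cell *next; } free_cell;

    typedef struct {
        free_cell *head;      /* free cells of this size class */
        size_t cell_size;
    } size_class;

    /* Slow path: a real collector would refill the list from a freshly
     * swept page; falling back to malloc keeps the sketch self-contained. */
    static void *alloc_slow(size_class *sc) {
        return malloc(sc->cell_size);
    }

    static inline void *alloc_fast(size_class *sc) {
        free_cell *c = sc->head;
        if (c == NULL)                    /* list exhausted: rare */
            return alloc_slow(sc);
        sc->head = c->next;               /* pop: one load and one store */
        return c;
    }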

forrestthewoods•7mo ago
> Every allocation takes 110116790943 / 10000000000 ≈ 11 instructions and 21074240395 / 10000000000 ≈ 2.1 cycles

I don’t believe this even in the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.

A few years ago I ran some benchmarks on an old but vaguely reasonable workload. I came up with a p95 of just 25 nanoseconds but a p99.9 on the order of tens of microseconds. https://www.forrestthewoods.com/blog/benchmarking-malloc-wit...

Of course “2% of time in GC” is doing a lot of heavy lifting here. But I’d really need to see a real work load for me to start to believe.

kragen•7mo ago
You should believe it, because it's true.

You were measuring malloc, so of course you came up with numbers that were 20 times worse than PyPy's nursery allocator. That's because malloc is 20 times slower, whatever common sense may say.

Also, you're talking about tail latency, while CF Bolz was measuring throughput. Contrary to your assertion, throughput is indeed a meaningful metric, though tail latency is often more important, especially for interactive UIs such as videogames. For applications like compilers and SMT solvers, on the other hand, throughput matters more.

You're right that there are some other costs. A generational or incremental GC more or less requires write barriers, which slow down the parts of your code that mutate existing objects instead of allocating new ones. And a generational collector can move existing objects to a different address, which complicates FFIs that might try to retain pointers to them outside the GC's purview.
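
For readers who haven't met write barriers, a minimal card-marking barrier looks roughly like this. It is a toy sketch, not PyPy's actual barrier; the card size, the table, and all names are made up:

    /* Toy card-marking write barrier for a generational GC (illustrative
     * only).  Every pointer store into the heap also marks the card
     * covering the written slot, so a minor collection can find
     * old-to-young pointers without scanning the whole old generation.
     * The extra store on every pointer write is the mutation cost
     * mentioned above. */
    #include <stdint.h>

    #define CARD_SHIFT 9                        /* 512-byte cards */
    #define HEAP_SIZE  (64u << 20)              /* 64 MiB toy heap */

    static uint8_t   card_table[HEAP_SIZE >> CARD_SHIFT];
    static uintptr_t heap_base;                 /* set when the heap is mapped */

    typedef struct object object;

    static inline void write_field(object **slot, object *value) {
        *slot = value;                                      /* the store itself */
        uintptr_t offset = (uintptr_t)slot - heap_base;
        card_table[offset >> CARD_SHIFT] = 1;               /* the barrier */
    }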

There are a lot of papers running captured allocation traces like you did against different memory allocators, some of them including GCed allocators. Ben Zorn did a lot of work like this in the 90s, coming to the conclusion that GC was somewhat faster than malloc/free. Both GCs and mallocs have improved substantially since then, but it would be easy to imagine that the results would be similar. Maybe someone has done it.

But to a significant extent we write performance-critical code according to the performance characteristics of our platforms. If mutation is fast but allocation is slow, you tend to write code that does less allocation and more mutation than you would if mutation was slow and allocation was fast, all else being equal. So you'd expect code written for platforms with fast allocation to allocate more, which makes faster allocation a bigger advantage for that code. Benchmarking captured allocation traces won't capture that effect.

Tail latency has always been one of the worst things about GCs. I hear amazing things about new real-time GC algorithms, including remarkable reports from Azul, but I don't know if they're true or what their price is.

forrestthewoods•7mo ago
Sorry, I refuse to accept a trivial loop that generates up to 2 objects at a time as ever being a relevant benchmark. I didn’t say throughput wasn’t relevant. I said this test isn’t convincing.

> coming to the conclusion that GC was somewhat faster than malloc/free

Many people have reached that conclusion. It's certainly not true for real-time cases where tail latency is critical.

Modern GCs are certainly better than old ones.

dzaima•7mo ago
For malloc/free, the specific workload matters much more than it does for a bump allocator; malloc potentially has branching and different cache access/utilization patterns depending on requested sizes or the distance/pattern between mallocs & frees, whereas a bump allocator just bumps a single pointer, with basically zero mispredicts and very simple linear cache utilization that's easily prefetched.
forrestthewoods•7mo ago
Yes bump allocators are very fast. (It would have been helpful if the blog post explained what type of allocator it was benchmarking).

But you can’t take a random Python GC’d workload and slap it on a bump allocator and call it good. A consequence of bump allocators is you can’t, ahem, freely free objects in the middle. And pushing onto alternating lists/vec/etc is notoriously tricky.

Workload always matters. Always.

yvdriess•7mo ago
In defense of the OP: bump allocators are part of the CPython runtime. They're used to avoid calling malloc a bazillion times a second. Of course, the runtime still needs to defer to libc for certain objects and to grab more memory for its pools/arena/nursery.
kragen•7mo ago
This is about the PyPy runtime, not the CPython runtime.
dzaima•7mo ago
The article does include:

> The allocation fast path of the RPython GC is a simple bump pointer,

and the assembly for it.

Indeed a bump allocator has the disadvantage that it effectively thrashes the lower cache layers continuously, but I'd think that'd be a pretty universal, workload-independent issue. (And the cache's LRU will still keep the used parts in cache, and the thrashing is by definition spread pretty evenly, giving the user code plenty of time in between to reuse whatever it wants kept in cache, which lessens the issue slightly.)

And given that a Python impl will necessarily need to have a GC path anyway, which already will thrash cache (and Python users aren't and won't be writing free()s regardless), it's probably not a bad trade overall.

I'd assume that for pushing to lists the runtime just explicitly reserves capacity ahead of time instead of trying to integrate with the memory manager in any way. You'd want to do that anyway even if using malloc & malloc_usable_size or similar, as you'll end up reallocating way too much without it. (In a quick test, my system malloc + malloc_usable_size + realloc, used to push 100000 8-byte elements, ends up reallocating 8500 times, with the entire code being just that push loop and realloc settling on rounding up to the next 4KB; the actual pointer only changes 10-50 times, but that's still pretty bad, and it would presumably get much worse if the code did anything other than push to a single vec.)
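
A rough reconstruction of the kind of test described above (not dzaima's exact code; `malloc_usable_size` is glibc-specific):

    /* Push 100000 8-byte elements, growing the buffer only as far as
     * malloc_usable_size() reports, and count the realloc calls.  A
     * capacity-doubling vector would need only ~17 reallocations for
     * the same number of pushes. */
    #include <malloc.h>     /* malloc_usable_size (glibc) */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 100000, reallocs = 0, len = 0, cap = 0;
        int64_t *buf = NULL;

        for (size_t i = 0; i < n; i++) {
            if (len == cap) {
                int64_t *p = realloc(buf, (len + 1) * sizeof *buf);
                if (p == NULL) { free(buf); return 1; }
                buf = p;
                cap = malloc_usable_size(buf) / sizeof *buf;
                reallocs++;
            }
            buf[len++] = (int64_t)i;
        }
        printf("%zu realloc calls for %zu pushes\n", reallocs, n);
        free(buf);
        return 0;
    }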

forrestthewoods•7mo ago
The bump allocator declaration occurs after the 2.1 cycles per allocation line. Perhaps all the information is in the article. But it is not presented clearly, imho. YMMV.

FWIW I’m super supportive of bump allocators and arenas. I just don’t find this particular analysis compelling or sufficiently informative. YMMV.

dzaima•7mo ago
Yeah, it is a good bit down; that said, bump allocators are a very common (most common?) option around GC anyway; I personally scrolled straight to the disassembly, and it's pretty clear from that, and the explicit mention of bump allocator is just around there too.
kragen•7mo ago
The presentation clearly presumes a lot of implicit knowledge about generational garbage collectors that you not only don't have but firmly believe to be false.
forrestthewoods•7mo ago
I have spent quite a lot of time updating code to not generate garbage because the ostensibly super fast GC was in fact demonstrably not super fast!

I could perhaps be convinced that modern GC are super fast. But in hindsight your blog post is intended to preach to the choir and not change the minds of non-believers. If you ever write an apologetic I will gladly read it. It needs to be a holistic analysis of memory operation cost across all threads.

kragen•7mo ago
I haven't written a blog post about this; do you mean my HN comment or CF Bolz's blog post?

Bolz's blog post is not written to persuade anyone but to provide objective, verifiable information from which people can form their own conclusions.

I agree that a holistic analysis of memory operation cost across all threads would be not only more persuasive but more informative. But it is also much more difficult.

forrestthewoods•7mo ago
Sorry I thought you were OP. Saw your root reply and got wires crossed.

Unfortunately I appear to lack sufficient context to draw any conclusions from the OP post. More context and more data needed (for me).

kragen•7mo ago
No, Bolz is way more polite and circumspect than I am. They also know a great deal more.
kragen•7mo ago
> But you can’t take a random Python GC’d workload and slap it on a bump allocator and call it good. A consequence of bump allocators is you can’t, ahem, freely free objects in the middle.

Yes, actually, you can, if your pointer-bumping allocator is the nursery allocator for a generational GC, which is the case in PyPy. In fact, you can free objects in the middle so freely that it takes literally zero instructions to free them; all you have to do is not retain a reference to them, and when it is time to move the live objects out of the nursery, the GC will avoid moving them because it never finds a pointer to them. That's how this microbenchmark is able to spend only 2% of its time in the GC, even though it's doing nothing but allocating and initializing objects!

This allows you to, precisely, take a random Python GC'd workload and satisfy almost all of its allocations with a simple pointer-bumping allocator. It avoids all the trickiness of pushing onto alternating lists/vec/etc.
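
A toy Cheney-style sketch of that idea (heavily simplified, not PyPy's GC; all names and layouts are invented): a minor collection copies only the objects reachable from the roots, leaves forwarding pointers behind, and then reclaims the whole nursery by resetting the bump pointer.

    /* Toy minor collection for a nursery + bump allocator (illustrative
     * only).  Objects never referenced again are simply never visited:
     * "freeing" them costs zero instructions. */
    #include <stddef.h>
    #include <string.h>

    typedef struct obj {
        struct obj *forward;    /* non-NULL once evacuated */
        size_t      size;       /* total size in bytes, header included */
        struct obj *fields[];   /* pointer fields only, for the toy model */
    } obj;

    static char *nursery, *nursery_ptr;    /* bump-allocated young space */
    static char *old_ptr;                  /* survivors are copied here */

    static obj *evacuate(obj *o) {
        if (o == NULL) return NULL;
        if (o->forward) return o->forward;  /* already copied */
        obj *copy = (obj *)old_ptr;         /* copy the live object out */
        memcpy(copy, o, o->size);
        old_ptr += o->size;
        o->forward = copy;                  /* leave a forwarding pointer */
        return copy;
    }

    /* Trace from the roots, Cheney-scan the copies, then reset the nursery:
     * everything not reached is reclaimed without ever being touched. */
    static void minor_collect(obj **roots, size_t nroots) {
        char *scan = old_ptr;
        for (size_t i = 0; i < nroots; i++)
            roots[i] = evacuate(roots[i]);
        while (scan < old_ptr) {
            obj *o = (obj *)scan;
            size_t nfields = (o->size - sizeof(obj)) / sizeof(obj *);
            for (size_t i = 0; i < nfields; i++)
                o->fields[i] = evacuate(o->fields[i]);
            scan += o->size;
        }
        nursery_ptr = nursery;              /* dead objects: zero work */
    }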

It's true that in production workloads the time spent in the GC is usually more than 2%, typically more like 15%. The time wasted on write barriers, or on writing your code to copy objects unnecessarily to avoid write barriers, is harder to quantify, much like the time wasted on mutating objects unnecessarily to avoid allocation.

It is also true that workload always matters for allocator performance. So it's instructive to see how well RPython's allocator does with a workload which would be a pathological worst case for the kinds of allocators you're accustomed to, yielding performance so dramatically superior that you dismissed it as impossible, like when GMail launched with a gigabyte of email storage per user and Hotmail assumed it was a joke.

yvdriess•7mo ago
Thanks for that article by the way. I've referenced it often to make the point that allocation behaviour is complex and messy.

> I don’t believe this in even the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.

I just want to make the point here that making bump allocators fast is useful because they're an essential part of modern allocators and will be called A LOT in the case of CPython.

That benchmark is useful to make sure the codegen/JIT is performing like it should, but it is still just measuring bump pointer allocation in the most ideal situation. 2 cycles/alloc is entirely reasonable considering modern uarchs; the critical path is two one-cycle uops.

Now, I agree with your point that this is far from meaningful for the p95/p99 of actual workloads. That is indeed the metric that users or customers actually care about. Modern GCs, like Go's, have moved away from using throughput as a metric. It's "easy" to make bump allocators fast; the hard problem is everything else around it: you still need to allocate your arenas/pools and manage the object lifetime. In the case of CPython that's refcounting, deallocs, GCs, work that modern CPUs really suck at (bad locality, bad branching).

I measured ~19.5% of the wallclock time of pyperformance benchmarks being related to memory management (ISPASS25 poster). That's not taking into account the extra boxing and struct slots being created for bookkeeping purposes, which add more second-order effects related to cache pressure. It's insanely hard to accurately measure memory management overheads.

forrestthewoods•7mo ago
> Thanks for that article by the way. I've referenced it often to make the point that allocation behaviour is complex and messy.

Happy to hear it! Thanks.

antilisp•7mo ago
From the perspective of someone rewriting an interpreter in RPython from CPython, I have no idea what to think of this besides "Yes, RPython is fast. I am aware, and I still haven't recovered from the shock of using Python to write a production-grade interpreter."