frontpage.

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

https://codemanship.wordpress.com/2026/01/05/the-ai-ready-software-developer-conclusion-same-game...
1•lifeisstillgood•54s ago•0 comments

AI Agent Automates Google Stock Analysis from Financial Reports

https://pardusai.org/view/54c6646b9e273bbe103b76256a91a7f30da624062a8a6eeb16febfe403efd078
1•JasonHEIN•4m ago•0 comments

Voxtral Realtime 4B Pure C Implementation

https://github.com/antirez/voxtral.c
1•andreabat•6m ago•0 comments

I Was Trapped in Chinese Mafia Crypto Slavery [video]

https://www.youtube.com/watch?v=zOcNaWmmn0A
1•mgh2•12m ago•0 comments

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

https://www.cbp.gov/newsroom/stats/reported-employee-arrests
1•ludicrousdispla•14m ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•19m ago•1 comment

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

https://github.com/thealidev/VectorVision-SVGV
1•thealidev•21m ago•0 comments

Study of 150 developers shows AI generated code no harder to maintain long term

https://www.youtube.com/watch?v=b9EbCb5A408
1•lifeisstillgood•21m ago•0 comments

Spotify now requires premium accounts for developer mode API access

https://www.neowin.net/news/spotify-now-requires-premium-accounts-for-developer-mode-api-access/
1•bundie•24m ago•0 comments

When Albert Einstein Moved to Princeton

https://twitter.com/Math_files/status/2020017485815456224
1•keepamovin•25m ago•0 comments

Agents.md as a Dark Signal

https://joshmock.com/post/2026-agents-md-as-a-dark-signal/
2•birdculture•27m ago•0 comments

System time, clocks, and their syncing in macOS

https://eclecticlight.co/2025/05/21/system-time-clocks-and-their-syncing-in-macos/
1•fanf2•28m ago•0 comments

McCLIM and 7GUIs – Part 1: The Counter

https://turtleware.eu/posts/McCLIM-and-7GUIs---Part-1-The-Counter.html
1•ramenbytes•31m ago•0 comments

So what's the next word, then? Almost-no-math intro to transformer models

https://matthias-kainer.de/blog/posts/so-whats-the-next-word-then-/
1•oesimania•32m ago•0 comments

Ed Zitron: The Hater's Guide to Microsoft

https://bsky.app/profile/edzitron.com/post/3me7ibeym2c2n
2•vintagedave•35m ago•1 comment

UK infants ill after drinking contaminated baby formula of Nestle and Danone

https://www.bbc.com/news/articles/c931rxnwn3lo
1•__natty__•36m ago•0 comments

Show HN: Android-based audio player for seniors – Homer Audio Player

https://homeraudioplayer.app
3•cinusek•36m ago•1 comment

Starter Template for Ory Kratos

https://github.com/Samuelk0nrad/docker-ory
1•samuel_0xK•38m ago•0 comments

LLMs are powerful, but enterprises are deterministic by nature

2•prateekdalal•41m ago•0 comments

Make your iPad 3 a touchscreen for your computer

https://github.com/lemonjesus/ipad-touch-screen
2•0y•47m ago•1 comment

Internationalization and Localization in the Age of Agents

https://myblog.ru/internationalization-and-localization-in-the-age-of-agents
1•xenator•47m ago•0 comments

Building a Custom Clawdbot Workflow to Automate Website Creation

https://seedance2api.org/
1•pekingzcc•49m ago•1 comment

Why the "Taiwan Dome" won't survive a Chinese attack

https://www.lowyinstitute.org/the-interpreter/why-taiwan-dome-won-t-survive-chinese-attack
2•ryan_j_naughton•50m ago•0 comments

Xkcd: Game AIs

https://xkcd.com/1002/
2•ravenical•51m ago•0 comments

Windows 11 is finally killing off legacy printer drivers in 2026

https://www.windowscentral.com/microsoft/windows-11/windows-11-finally-pulls-the-plug-on-legacy-p...
1•ValdikSS•52m ago•0 comments

From Offloading to Engagement (Study on Generative AI)

https://www.mdpi.com/2306-5729/10/11/172
1•boshomi•54m ago•1 comment

AI for People

https://justsitandgrin.im/posts/ai-for-people/
1•dive•55m ago•0 comments

Rome is studded with cannon balls (2022)

https://essenceofrome.com/rome-is-studded-with-cannon-balls
1•thomassmith65•1h ago•0 comments

8-piece tablebase development on Lichess (op1 partial)

https://lichess.org/@/Lichess/blog/op1-partial-8-piece-tablebase-available/1ptPBDpC
2•somethingp•1h ago•0 comments

US to bankroll far-right think tanks in Europe against digital laws

https://www.brusselstimes.com/1957195/us-to-fund-far-right-forces-in-europe-tbtb
5•saubeidl•1h ago•0 comments

How fast can the RPython GC allocate?

https://pypy.org/posts/2025/06/rpython-gc-allocation-speed.html
55•todsacerdoti•7mo ago

Comments

kragen•7mo ago
My summary is that it's about one or two allocations per nanosecond on CF Bolz's machine, an AMD Ryzen 7 PRO 7840U, presumably on one core, and it's about 11 instructions per allocation.

This is about 2–4× faster than my pointer-bumping arena allocator for C, kmregion†, which uses a similar number of instructions on the (inlined) fast path. Possibly that's just because I was testing on slower hardware; I was also testing with 16-byte initialized objects, but without a GC. kmregion itself is about 10× the speed of malloc/free.

I don't know that I'd recommend using kmregion, since it's never been used for anything serious, but it should at least serve as a proof of concept.

______

† http://canonical.org/~kragen/sw/dev3/kmregion.h http://canonical.org/~kragen/sw/dev3/kmregion.c http://canonical.org/~kragen/sw/dev3/kmregion_example.c
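
A minimal sketch of the kind of pointer-bumping fast path being compared here (for illustration only; this is not the kmregion code linked above, which among other differences bumps downward where this sketch bumps upward, and arena_alloc_slow is an assumed out-of-line refill):

    #include <stddef.h>

    typedef struct {
        char *ptr;   /* next free byte in the current chunk */
        char *end;   /* one past the end of the current chunk */
    } arena;

    /* Assumed slow path: grab a fresh chunk and retry. */
    void *arena_alloc_slow(arena *a, size_t n);

    static inline void *arena_alloc(arena *a, size_t n) {
        n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment */
        if ((size_t)(a->end - a->ptr) < n)     /* rare: chunk exhausted */
            return arena_alloc_slow(a, n);
        void *p = a->ptr;                      /* fast path: bump and return */
        a->ptr += n;
        return p;
    }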

runlaszlorun•7mo ago
I don't know much about language internals or allocation, but I'm learning. Why could this be significantly faster than a bump/arena allocator?

And is the speed up over malloc/free due to large block allocation as opposed to individual malloc?

kragen•7mo ago
It is a bump allocator. I don't know why it's so much faster than mine, but my hypothesis was that CF Bolz was testing on a faster machine. The speedup over malloc/free is because bumping a pointer is much faster than calling a subroutine.
charleslmunger•7mo ago
I simulated yours vs the UPB arena fast path:

https://godbolt.org/z/1oTrv1Y58

Messing with it a bit, it seems like yours has a slightly shorter dependency chain because it loads the two members separately, whereas UPB loads them as a pair (as it needs both in order to determine how much space is available). Yours also seems to have less register pressure; I think that's because yours bumps down. UPB's arena supports in-place forward extension, so it needs to bump up.

If you added branch hints to signal to the compiler that your slow path is not often hit, you might see some improvement (although if you have PGO it should already do this). These paths could also be good candidates for the `preserve_most` calling convention.

However, there is an unfortunate compiler behavior here for both implementations - it doesn't track whether the slow path (which is not inlined, and clobbers the pointers) was actually hit, so it reloads on the hot path, for both approaches. Unfortunately this means that a sequence of allocations will store and load the arena pointers repeatedly, when ideally they'd keep the current position in a register on the hot path and refill that register after clobbering in the cold path.
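
A hypothetical sketch of those two suggestions applied to a generic bump fast path (arena_alloc_slow is an assumed out-of-line refill; __builtin_expect is the gcc/clang branch hint, and preserve_most is a clang calling-convention attribute):

    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    typedef struct {
        char *ptr;
        char *end;
    } arena;

    /* Cold refill path, kept out of line and told to preserve most
       registers so the inlined fast path spills less around the call. */
    __attribute__((preserve_most))
    void *arena_alloc_slow(arena *a, unsigned long n);

    static inline void *arena_alloc(arena *a, unsigned long n) {
        if (UNLIKELY((unsigned long)(a->end - a->ptr) < n))
            return arena_alloc_slow(a, n);     /* rarely taken */
        void *p = a->ptr;
        a->ptr += n;
        return p;
    }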

kragen•7mo ago
Thank you very much! I vaguely remember that it did that, and the failure to keep the pointer in registers might explain why PyPy's version is twice as fast (?).
pizlonator•7mo ago
The reason their allocator is faster than Boehm's isn't conservative stack scanning.

You can move objects while using conservative stack scanning. This is a common approach. JavaScriptCore used to use it.

You can have super fast allocation in a non-moving collector, but that involves an algorithm that is vastly different from the Boehm one. I think the fastest non-moving collectors have similar allocation fast paths to the fastest moving collectors. JavaScriptCore has a fast non-moving allocator called bump'n'pop. In Fil-C, I use a different approach that I call SIMD turbosweep. There's also the Immix approach. And there are many others.
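
As a generic illustration only (not JavaScriptCore's or Fil-C's actual code): one way a non-moving collector keeps the allocation fast path tiny is to pop fixed-size free cells off a per-size-class linked list, refilling the list from a sweep only when it runs dry.

    typedef struct free_cell { struct free_cell *next; } free_cell;

    typedef struct {
        free_cell *head;   /* singly linked list of free cells of one size class */
    } size_class;

    /* Assumed slow path: sweep a block and rebuild the free list. */
    void *size_class_refill(size_class *sc);

    static inline void *size_class_alloc(size_class *sc) {
        free_cell *cell = sc->head;
        if (cell == NULL)                  /* rare: no free cells left */
            return size_class_refill(sc);
        sc->head = cell->next;             /* fast path: pop one cell */
        return cell;
    }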

forrestthewoods•7mo ago
> Every allocation takes 110116790943 / 10000000000 ≈ 11 instructions and 21074240395 / 10000000000 ≈ 2.1 cycles

I don’t believe this even in the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.

A few years ago I ran some benchmarks on an old but vaguely reasonable workload. I came up with a p95 of just 25 nanoseconds but a p99.9 on the order of tens of microseconds. https://www.forrestthewoods.com/blog/benchmarking-malloc-wit...

Of course “2% of time in GC” is doing a lot of heavy lifting here. But I’d really need to see a real workload before I start to believe.

kragen•7mo ago
You should believe it, because it's true.

You were measuring malloc, so of course you came up with numbers that were 20 times worse than PyPy's nursery allocator. That's because malloc is 20 times slower, whatever common sense may say.

Also, you're talking about tail latency, while CF Bolz was measuring throughput. Contrary to your assertion, throughput is indeed a meaningful metric, though tail latency is often more important, especially for interactive UIs such as videogames. For applications like compilers and SMT solvers, on the other hand, throughput matters more.

You're right that there are some other costs. A generational or incremental GC more or less requires write barriers, which slow down the parts of your code that mutate existing objects instead of allocating new ones. And a generational collector can move existing objects to a different address, which complicates FFIs that might try to retain pointers to them outside the GC's purview.
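
For a sense of that cost, here is a generic card-marking write barrier (not RPython's actual barrier; the card size, card_table, and heap_base are assumptions). Every pointer store pays for the extra mark, and a later minor collection scans only the marked cards for old-to-young pointers instead of the whole old generation.

    #include <stdint.h>

    #define CARD_SHIFT 9                  /* 512-byte cards (assumed) */
    extern uint8_t  *card_table;          /* one byte per card */
    extern uintptr_t heap_base;

    static inline void write_ref(void **slot, void *new_value) {
        *slot = new_value;                                            /* the store itself */
        card_table[((uintptr_t)slot - heap_base) >> CARD_SHIFT] = 1;  /* mark the card */
    }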

There are a lot of papers running captured allocation traces like you did against different memory allocators, some of them including GCed allocators. Ben Zorn did a lot of work like this in the 90s, coming to the conclusion that GC was somewhat faster than malloc/free. Both GCs and mallocs have improved substantially since then, but it would be easy to imagine that the results would be similar. Maybe someone has done it.

But to a significant extent we write performance-critical code according to the performance characteristics of our platforms. If mutation is fast but allocation is slow, you tend to write code that does less allocation and more mutation than you would if mutation was slow and allocation was fast, all else being equal. So you'd expect code written for platforms with fast allocation to allocate more, which makes faster allocation a bigger advantage for that code. Benchmarking captured allocation traces won't capture that effect.

Tail latency has always been one of the worst things about GCs. I hear amazing things about new real-time GC algorithms, including remarkable reports from Azul, but I don't know if they're true or what their price is.

forrestthewoods•7mo ago
Sorry, I refuse to accept a trivial loop generating up to 2 objects at a time as ever being a relevant benchmark. I didn’t say throughput wasn’t relevant. I said this test isn’t convincing.

> coming to the conclusion that GC was somewhat faster than malloc/free

Many people have come to that conclusion. It's certainly not true for real-time cases where tail latency is critical.

Modern GCs are certainly better than old ones.

dzaima•7mo ago
For malloc/free, the specific workload matters much more than it does for a bump allocator; malloc potentially has branching and different cache access/utilization patterns depending on requested sizes or the distance/pattern between mallocs & frees, whereas a bump allocator is just bumping a single pointer, with basically zero mispredicts and very simple linear cache utilization that's easily prefetched.
forrestthewoods•7mo ago
Yes, bump allocators are very fast. (It would have been helpful if the blog post had explained what type of allocator it was benchmarking.)

But you can’t take a random Python GC’d workload and slap it on a bump allocator and call it good. A consequence of bump allocators is you can’t, ahem, freely free objects in the middle. And pushing onto alternating lists/vec/etc is notoriously tricky.

Workload always matters. Always.

yvdriess•7mo ago
In defense of the OP: a bump allocator is part of the CPython runtime. It's used to avoid calling malloc a bazillion times a second. Of course, it still needs to defer to libc for certain objects and to grab more memory for its pools/arena/nursery.
kragen•7mo ago
This is about the PyPy runtime, not the CPython runtime.
dzaima•7mo ago
The article does include:

> The allocation fast path of the RPython GC is a simple bump pointer,

and the assembly for it.

Indeed, a bump allocator has the disadvantage that it effectively thrashes the lower cache levels continuously, but I'd think that's a pretty universal, workload-independent issue. (And the cache's LRU will still keep recently used lines around, and the thrashing is by definition spread fairly evenly, giving the user code plenty of time to do other things between uses of whatever it wants to keep in cache, which softens the issue.)

And given that a Python impl will necessarily need to have a GC path anyway, which already will thrash cache (and Python users aren't and won't be writing free()s regardless), it's probably not a bad trade overall.

I'd assume that for pushing to lists the runtime just explicitly reserves capacity ahead of time instead of trying to integrate with the memory manager in any way. You'd want to do that anyway even if using malloc & malloc_usable_size or similar, as you'll end up reallocating way too much without it. (In a quick test, my system malloc + malloc_usable_size + realloc, used to push 100000 8-byte elts with the entire code being just this, ended up reallocating 8500 times, with realloc settling on rounding up to the next 4KB; the actual pointer only changes 10–50 times, but that's still pretty bad, and it would presumably get much worse if the code did anything other than just pushing to a single vec.)
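
A rough reconstruction of that kind of test (not the original code; malloc_usable_size is a glibc extension from <malloc.h>):

    #include <malloc.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 100000, len = 0, reallocs = 0, moves = 0;
        uint64_t *buf = NULL;
        for (size_t i = 0; i < n; i++) {
            size_t needed = (len + 1) * sizeof(uint64_t);
            if (buf == NULL || malloc_usable_size(buf) < needed) {
                uint64_t *old = buf;
                buf = realloc(buf, needed);   /* grow only to the exact size needed */
                if (!buf) { free(old); return 1; }
                reallocs++;
                if (buf != old) moves++;      /* how often the pointer actually changed */
            }
            buf[len++] = i;
        }
        printf("realloc calls: %zu, pointer moved: %zu times\n", reallocs, moves);
        free(buf);
        return 0;
    }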

forrestthewoods•7mo ago
The bump allocator declaration occurs after the 2.1-cycles-per-allocation line. Perhaps all the information is in the article, but it is not presented clearly, imho. YMMV.

FWIW I’m super supportive of bump allocators and arenas. I just don’t find this particular analysis compelling or sufficiently informative. YMMV.

dzaima•7mo ago
Yeah, it is a good bit down; that said, bump allocators are a very common (most common?) option around GC anyway; I personally scrolled straight to the disassembly, and it's pretty clear from that, and the explicit mention of bump allocator is just around there too.
kragen•7mo ago
The presentation clearly presumes a lot of implicit knowledge about generational garbage collectors that you not only don't have but firmly believe to be false.
forrestthewoods•7mo ago
I have spent quite a lot of time updating code to not generate garbage because the ostensibly super fast GC was in fact demonstrably not super fast!

I could perhaps be convinced that modern GCs are super fast. But in hindsight your blog post is intended to preach to the choir, not to change the minds of non-believers. If you ever write an apologetic I will gladly read it. It needs to be a holistic analysis of memory operation cost across all threads.

kragen•7mo ago
I haven't written a blog post about this; do you mean my HN comment or CF Bolz's blog post?

Bolz's blog post is not written to persuade anyone but to provide objective, verifiable information from which people can form their own conclusions.

I agree that a holistic analysis of memory operation cost across all threads would be not only more persuasive but more informative. But it is also much more difficult.

forrestthewoods•7mo ago
Sorry I thought you were OP. Saw your root reply and got wires crossed.

Unfortunately I appear to lack sufficient context to draw any conclusions from the OP post. More context and more data needed (for me).

kragen•7mo ago
No, Bolz is way more polite and circumspect than I am. They also know a great deal more.
kragen•7mo ago
> But you can’t take a random Python GC’d workload and slap it on a bump allocator and call it good. A consequence of bump allocators is you can’t, ahem, freely free objects in the middle.

Yes, actually, you can, if your pointer-bumping allocator is the nursery allocator for a generational GC, which is the case in PyPy. In fact, you can free objects in the middle so freely that it takes literally zero instructions to free them; all you have to do is not retain a reference to them, and when it is time to move the live objects out of the nursery, the GC will avoid moving them because it never finds a pointer to them. That's how this microbenchmark is able to spend only 2% of its time in the GC, even though it's doing nothing but allocating and initializing objects!
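
A conceptual sketch only (not PyPy's code; forwarding pointers and the transitive copying of objects reachable from survivors are elided, and copy_to_old_space/visit_roots are assumed helpers). The point is that nothing in the nursery is ever freed individually; the bump pointer is simply reset:

    typedef struct {
        char *start, *ptr, *end;   /* the nursery: a single bump region */
    } nursery;

    void *copy_to_old_space(void *obj);            /* assumed helper */
    void  visit_roots(void (*fn)(void **slot));    /* assumed helper */

    static void evacuate(void **slot) {
        if (*slot != NULL)
            *slot = copy_to_old_space(*slot);      /* survivors get copied out */
    }

    static void minor_collect(nursery *n) {
        visit_roots(evacuate);    /* only reachable objects are ever visited */
        n->ptr = n->start;        /* everything unreachable dies by pointer reset */
    }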

This allows you to, precisely, take a random Python GC'd workload and satisfy almost all of its allocations with a simple pointer-bumping allocator. It avoids all the trickiness of pushing onto alternating lists/vec/etc.

It's true that in production workloads the time spent in the GC is usually more than 2%, typically more like 15%. The time wasted on write barriers, or on writing your code to copy objects unnecessarily to avoid write barriers, is harder to quantify, much like the time wasted on mutating objects unnecessarily to avoid allocation.

It is also true that workload always matters for allocator performance. So it's instructive to see how well RPython's allocator does with a workload which would be a pathological worst case for the kinds of allocators you're accustomed to, yielding performance so dramatically superior that you dismissed it as impossible, like when GMail launched with a gigabyte of email storage per user and Hotmail assumed it was a joke.

yvdriess•7mo ago
Thanks for that article by the way. I've referenced it often to make the point that allocation behaviour is complex and messy.

> I don’t believe this even in the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.

I just want to make the point here that making bump allocators fast is useful because they're an essential part of modern allocators and will be called A LOT in the case of CPython.

That benchmark is useful to make sure the codegen/JIT is performing like it should, but it is still just measuring bump-pointer allocation in the most ideal situation. 2 cycles/alloc is entirely reasonable considering modern uarchs: the critical path is two one-cycle uops.

Now, I agree with your point that this is far from meaningful for the p95/p99 of actual workloads. That is indeed the metric that users or customers actually care about. Modern GCs, like Go's, have moved away from using throughput as a metric. It's "easy" to make bump allocators fast; the hard problem is everything else around it: you still need to allocate your arenas/pools and manage object lifetimes. In the case of CPython that means refcounting, deallocs, and GC passes, work that modern CPUs really suck at (bad locality, bad branching).

I measured ~19.5% of the wall-clock time of the pyperformance benchmarks being related to memory management (ISPASS25 poster). That's not taking into account the extra boxing and struct slots created for bookkeeping purposes, which add more second-order effects related to cache pressure. It's insanely hard to accurately measure memory management overheads.

forrestthewoods•7mo ago
> Thanks for that article by the way. I've referenced it often to make the point that allocation behaviour is complex and messy.

Happy to hear it! Thanks.

antilisp•7mo ago
From the perspective of someone rewriting an interpreter in RPython from CPython, I have no idea what to think of this besides "Yes, RPython is fast. I am aware, and I still haven't recovered from the shock of using Python to write a production-grade interpreter."