in unity, you literally can't run Burst-compiled jobs efficiently unless you choose the right allocator. they expose `Temp`, `TempJob`, `Persistent`, etc. because GC isn't even an option when you're fighting for milliseconds per frame. no allocator choice = frame skips = shipped bugs.
in embedded, FreeRTOS ships multiple heap implementations (heap_1 through heap_5) for a reason. sometimes you need fixed-size pools, sometimes you care about fragmentation. plain malloc is out of the question. same in any real-time firmware, safety-critical or not.
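here's the fixed-size-pool idea in ~25 lines of plain C, for anyone who hasn't seen one. a sketch with made-up names, not FreeRTOS's actual heap code: alloc and free are O(1) pointer swaps, and same-size blocks mean fragmentation can't happen.

```c
#include <stdint.h>

#define BLOCK_SIZE  32   /* must be >= sizeof(void *) */
#define BLOCK_COUNT 64

static _Alignas(void *) uint8_t pool_storage[BLOCK_COUNT][BLOCK_SIZE];
static void *free_list   = 0;
static int   initialized = 0;

static void pool_init(void) {
    /* thread each block's first word into an intrusive free list */
    for (int i = 0; i < BLOCK_COUNT; i++) {
        *(void **)pool_storage[i] = free_list;
        free_list = pool_storage[i];
    }
    initialized = 1;
}

void *pool_alloc(void) {
    if (!initialized) pool_init();
    void *block = free_list;
    if (block) free_list = *(void **)block;  /* pop head: O(1) */
    return block;                            /* NULL when exhausted */
}

void pool_free(void *block) {
    *(void **)block = free_list;             /* push head: O(1) */
    free_list = block;
}
```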
infra world has been using arena allocators for years. folly's `Arena`, gRPC's arenas, even jemalloc has thread caching and slab-style region reuse built in. this isn't academic. large-scale systems hit allocation pressure that a general-purpose allocator can't be tuned for globally.
and rust's whole allocator abstraction: it's more about expressing ownership across lifetimes in memory-sensitive codebases. `bumpalo` isn't a premature optimization. it's what you reach for when you know the object graph's layout and want to free the whole thing in one call. also safe by design. it's not even painful to use.
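the bump-arena model `bumpalo` gives you, sketched in C with hypothetical names (the rust version gets you this without the unsafety): allocation is a pointer bump, and the whole object graph dies in one call.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base, *cur, *end;
} arena_t;

arena_t arena_new(size_t cap) {
    uint8_t *p = malloc(cap);
    arena_t a = { p, p, p ? p + cap : p };
    return a;
}

void *arena_alloc(arena_t *a, size_t size) {
    size = (size + 15) & ~(size_t)15;            /* keep 16-byte alignment */
    if ((size_t)(a->end - a->cur) < size) return NULL;
    void *p = a->cur;
    a->cur += size;                              /* allocation = one bump  */
    return p;
}

void arena_free_all(arena_t *a) {                /* the "one call" */
    free(a->base);
    a->base = a->cur = a->end = NULL;
}
```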
imo allocator choice is an interface decision. it shapes how the rest of your code handles memory lifetimes. once you start passing lifetimes around as part of your type system or job model, the whole architecture shifts. this is way deeper than "faster malloc".
This is difficult though in higher-level languages. Go tried and failed with arenas.
Has some shortcomings, but I think it should work.
At the start of a scope, mark the allocator's state; defer a release so that when the scope exits, everything allocated since the mark goes away. The cheesy version would be a release function that takes a list of objects to keep.
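That mark/release scheme in C, as I understand it (a sketch with invented names, and without the keep-list part, which would mean copying survivors down below the mark):

```c
#include <stddef.h>

typedef struct { unsigned char *cur, *end; } scratch_t;
typedef unsigned char *scratch_mark_t;

static unsigned char backing[4096];
scratch_t scratch_init(void) {
    scratch_t s = { backing, backing + sizeof backing };
    return s;
}

void *scratch_alloc(scratch_t *s, size_t n) {
    if ((size_t)(s->end - s->cur) < n) return NULL;
    void *p = s->cur;
    s->cur += n;
    return p;
}

scratch_mark_t scratch_mark(scratch_t *s)              { return s->cur; }
void scratch_release(scratch_t *s, scratch_mark_t m)   { s->cur = m;    }

void example(scratch_t *s) {
    scratch_mark_t m = scratch_mark(s);   /* "start of scope": save state  */
    void *tmp = scratch_alloc(s, 256);    /* any amount of scratch work... */
    (void)tmp;
    scratch_release(s, m);                /* "defer": all of it released   */
}
```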
Hmmm... ;)
I'm doing it bare-metal/no-allocator, as I do most embedded projects... and it's flirting with running me out of memory! What do most (and the most popular) protobuf libs in Rust do? Use an allocator. What does the ESP itself do? Use an allocator (via FreeRTOS).
Meanwhile I'm using `heapless` (Vec and String syntax backed by statically allocated arrays), on an MCU with 128K flash and 32K RAM... This won't end well.
but it's just more blah-blah about ai :/
...I'll be here all week. Try the veal!
And replacement: https://news.ycombinator.com/item?id=44350333
If you click through to the issue you will see that mimalloc already offers this feature: `mi_manage_os_memory` to create an arena, `mi_heap_new_in_arena` to create a heap that uses the arena, and `mi_heap_malloc` to allocate memory from that heap.
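Roughly like this, going by the function names above. I'm working from memory of mimalloc's header, so double-check the exact signatures (in particular, I believe it's the `_ex` variant that hands back the arena id):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <mimalloc.h>

int main(void) {
    size_t size = 64u * 1024 * 1024;
    void *block = malloc(size);          /* or mmap'd / pinned / device memory */

    mi_arena_id_t arena;
    mi_manage_os_memory_ex(block, size,
                           /*is_committed=*/true, /*is_large=*/false,
                           /*is_zero=*/false, /*numa_node=*/-1,
                           /*exclusive=*/true, &arena);

    mi_heap_t *heap = mi_heap_new_in_arena(arena);
    void *p = mi_heap_malloc(heap, 128); /* served out of our block */
    (void)p;

    mi_heap_destroy(heap);               /* drops everything in the heap */
    free(block);
    return 0;
}
```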
A basic first-fit free-list allocator can be less than 100 LoC, even in x86 asm. Using the free space itself to store the bookkeeping is a simple but effective optimisation too.
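For illustration, here's that shape in C rather than asm; the headers live inside the blocks themselves, so free memory carries its own bookkeeping. A sketch: no coalescing, no thread safety.

```c
#include <stddef.h>

typedef struct block {
    size_t size;          /* payload bytes after this header          */
    struct block *next;   /* only meaningful while the block is free  */
} block_t;

static _Alignas(16) unsigned char heap[1 << 16];
static block_t *freelist = NULL;

void ff_init(void) {
    freelist = (block_t *)heap;
    freelist->size = sizeof(heap) - sizeof(block_t);
    freelist->next = NULL;
}

void *ff_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;                           /* 8-byte granules */
    for (block_t **prev = &freelist; *prev; prev = &(*prev)->next) {
        block_t *b = *prev;
        if (b->size < n) continue;                      /* first fit */
        if (b->size >= n + sizeof(block_t) + 8) {       /* split the tail */
            block_t *rest = (block_t *)((unsigned char *)(b + 1) + n);
            rest->size = b->size - n - sizeof(block_t);
            rest->next = b->next;
            b->size = n;
            *prev = rest;
        } else {
            *prev = b->next;                            /* take it whole */
        }
        return b + 1;
    }
    return NULL;
}

void ff_free(void *p) {
    block_t *b = (block_t *)p - 1;
    b->next = freelist;                                 /* no coalescing */
    freelist = b;
}
```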
One question I have for those who know the topic at greater depth: are there simpler memory managers that don't use a fixed page size? Or does that make things a lot more complex? Imagine requesting a 1-byte allocation and the kernel handing you a 1-byte page, then immediately requesting 20 GB and getting a single 20 GB page. Other than the memory-management metadata taking up more memory of its own, what are the downsides of dynamic page sizes?
Implementations may also define certain sizes that are more efficient. On modern x86-64, large pages may allow 2 MB, 1 GB, 512 GB, etc. (each level 512x the previous) to be more efficient. Modern ARMv8/v9 has similar large-page support (the exact details depend on the minimum mapping granularity), and it may also support aligned, contiguous mappings which make other sizes efficient (the exact details are even more complicated than simple large pages, and highly optional).
As such, if you want just "1 byte", you write a userspace allocator that asks the kernel for memory conforming to the MMU's limits; it is then the userspace allocator's job to abstract that away by chopping the memory up under its own auspices.
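The classic shape of that, assuming Linux and POSIX `mmap` (names mine): ask the kernel for page multiples, hand out arbitrarily small slices from userspace.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *cur = NULL, *end = NULL;

void *tiny_alloc(size_t n) {
    if (cur == NULL || (size_t)(end - cur) < n) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);    /* typically 4096 */
        size_t len  = (n + page - 1) / page * page;     /* round up to pages */
        if (len < 64 * page) len = 64 * page;           /* amortise syscalls */
        void *m = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED) return NULL;
        cur = m;                   /* leftover tail of the old mapping is
                                      abandoned -- fine for a sketch      */
        end = cur + len;
    }
    void *p = cur;
    cur += n;                      /* the kernel never sees this "1 byte" */
    return p;
}
```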
There's also the matter of space overhead.
A bitmap allocator is simple and has an overhead of 1 bit per chunk (allocated or not). Normally the chunk is at least 16 bytes, but if you know you're doing smaller allocations there's nothing stopping you from doing one at byte granularity.
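A sketch of that in C, at 16-byte chunks. Note one classic wart of pure bitmaps: the allocator can't recover the size on free, so the caller has to pass it back (or you spend extra bits tracking run lengths).

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK   16
#define NCHUNKS 4096

static uint8_t storage[NCHUNKS * CHUNK];   /* 64 KiB managed...            */
static uint8_t bitmap[NCHUNKS / 8];        /* ...for 512 bytes of overhead */

static int  taken(size_t i)       { return bitmap[i / 8] & (1u << (i % 8)); }
static void mark(size_t i, int v) {
    if (v) bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
    else   bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
}

void *bm_alloc(size_t n) {
    size_t need = n ? (n + CHUNK - 1) / CHUNK : 1, run = 0;
    for (size_t i = 0; i < NCHUNKS; i++) {          /* first fit over bits */
        run = taken(i) ? 0 : run + 1;
        if (run == need) {
            size_t start = i + 1 - need;
            for (size_t j = start; j <= i; j++) mark(j, 1);
            return &storage[start * CHUNK];
        }
    }
    return NULL;
}

void bm_free(void *p, size_t n) {                   /* size passed back in */
    size_t start = (size_t)((uint8_t *)p - storage) / CHUNK;
    size_t need  = n ? (n + CHUNK - 1) / CHUNK : 1;
    for (size_t j = start; j < start + need; j++) mark(j, 0);
}
```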
If there exists an invalid value (for example, UTF-8 strings cannot contain the byte 0xff), you can implement allocators with zero overhead, though efficiency requires changing the API (`malloc` requires writing to all the memory to erase the tombstones, so it's more efficient for the allocator to support `strdup` and `realloc_strcat` directly).
By bucketing you can vary the overhead based on allocation size. But also remember, a lot of the trickery you can do here comes at the cost of becoming more vulnerable to use-after-free bugs.
Besides all the missing APIs, one major annoyance related to allocators is the fact that userland lacks support for CPU-local variables that the kernel can use for more efficient operations. `rseq` helps a little but still falls short.
It does not provide a `resize`. You're thinking of `realloc` on hosted environments.
All that to say that writing an allocator is not hard. Writing a good allocator however is another story. Not because the algorithms are complicated, but because it is a very critical part of most systems and requires extensive knowledge of real life situations.
Just like writing a lexer should be. It's amazing how many programming tasks are improved by writing a lexer for them, and how awful the code is from people who have no idea how to write one.
(For example, I wrote a lexer for a function that read a date like "Mar 7, 1997" and turned it into a time_t. The lexer greatly simplified it, as strings like "Mar" and "March" became the same token. A number from 13 to 31 would lex as a Day token. The number 3 would lex as a DayOrMonth token. And so on.)
(In hindsight, I'm not sure that trying to accept any conceivable date format as if by magic was really such a great idea, but that's what my client wanted, and building it was an enjoyable challenge.)
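The token-classification step being described might look something like this (my reconstruction in C, not the original code):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef enum { TOK_MONTH, TOK_DAY, TOK_DAY_OR_MONTH, TOK_YEAR, TOK_BAD } tok_t;

static const char *months[12] = { "jan", "feb", "mar", "apr", "may", "jun",
                                  "jul", "aug", "sep", "oct", "nov", "dec" };

/* classify one whitespace/comma-delimited word of the date string */
tok_t classify(const char *word, int *value) {
    if (isalpha((unsigned char)word[0])) {
        char p[4] = {0};                         /* "Mar" and "March" both */
        for (int i = 0; i < 3 && word[i]; i++)   /* reduce to "mar"...     */
            p[i] = (char)tolower((unsigned char)word[i]);
        for (int m = 0; m < 12; m++)
            if (strcmp(p, months[m]) == 0) { *value = m + 1; return TOK_MONTH; }
        return TOK_BAD;
    }
    long n = strtol(word, NULL, 10);
    *value = (int)n;
    if (n >= 13 && n <= 31) return TOK_DAY;          /* can't be a month */
    if (n >= 1 && n <= 12)  return TOK_DAY_OR_MONTH; /* ambiguous        */
    if (n >= 32)            return TOK_YEAR;         /* e.g. 1997        */
    return TOK_BAD;
}
```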
The reason is pragmatic: fancy strategies take extra compute time, which is too expensive when the allocator sits on the critical path. So what follows has a good chance of not bearing practical value, but (i) there are setups where memory capacity is the bottleneck, and (ii) hey, it's just a PhD anyway.
So what does a fancy strategy even look like? Let's look at the simplest non-trivial version of the problem. Say we know the sizes and lifetimes of all objects in advance, and we want to pack them into as tight a contiguous address space as possible. Mathematicians call this the Dynamic Storage Allocation (DSA) problem, and it's NP-complete, i.e., no polynomial-time optimal algorithm is known for the general case. Deep learning compilers battling the AI memory wall have been returning to DSA lately (it's an old problem).
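Stated concretely (my notation): object i has size s_i and lifetime [a_i, b_i); pick offsets o_i >= 0 minimizing the peak address M, such that objects alive at the same time never overlap in space:

```latex
\min M \quad \text{s.t.} \quad o_i + s_i \le M \;\; \forall i, \qquad
[o_i, o_i + s_i) \cap [o_j, o_j + s_j) = \emptyset
\;\; \text{whenever} \;\; [a_i, b_i) \cap [a_j, b_j) \ne \emptyset .
```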
The implication of the above is that a real allocator, that is, one forced to work under uncertainty w.r.t. the object sizes and lifetimes coming down the line, can always be made to fail remarkably. So going with a general-purpose solution presumes that you know what you're doing (one could argue that most programs out there are "well-behaved", and I would agree, right up until that observation gets turned into a universal law).
Nevertheless, it remains a fact that writing a toy allocator is both educational and soberingly not hard.
Though it's not clear to me that the article does a good job of establishing that this is actually true ("mimalloc is only a few thousand lines of code" doesn't pass the smell test).
I generally agree with the "memory management doesn't have to be as complicated as you might think" vibe, especially if you've read about some optimizations in fancy modern GCs and aren't aware of what a basic simple non-GC world can look like. That said, of course, you can indeed get into a lot of complexity beyond the textbook 101 examples. Like the mentioned threading...