in unity, you literally can't run Burst-compiled jobs efficiently unless you choose the right allocator. they expose `Temp`, `TempJob`, `Persistent`, etc. because GC isn't even an option when you're fighting for milliseconds per frame. no allocator choice = frame skips = shipped bugs.
in embedded, FreeRTOS ships multiple heap implementations (heap_1 through heap_5) for a reason. sometimes you need fixed-size pools, sometimes you care about fragmentation. plain malloc is out of the question. same in any real-time firmware, safety-critical or not.
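here's the fixed-size-pool idea in ~25 lines of plain C, for anyone who hasn't seen one. a sketch with made-up names, not FreeRTOS's actual heap code: alloc and free are O(1) pointer swaps, and same-size blocks mean fragmentation can't happen.

```c
#include <stdint.h>

#define BLOCK_SIZE  32   /* must be >= sizeof(void *) */
#define BLOCK_COUNT 64

static _Alignas(void *) uint8_t pool_storage[BLOCK_COUNT][BLOCK_SIZE];
static void *free_list   = 0;
static int   initialized = 0;

static void pool_init(void) {
    /* thread each block's first word into an intrusive free list */
    for (int i = 0; i < BLOCK_COUNT; i++) {
        *(void **)pool_storage[i] = free_list;
        free_list = pool_storage[i];
    }
    initialized = 1;
}

void *pool_alloc(void) {
    if (!initialized) pool_init();
    void *block = free_list;
    if (block) free_list = *(void **)block;  /* pop head: O(1) */
    return block;                            /* NULL when exhausted */
}

void pool_free(void *block) {
    *(void **)block = free_list;             /* push head: O(1) */
    free_list = block;
}
```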
infra world has been using arena allocators for years. folly's `Arena`, gRPC's arenas, even jemalloc has thread caching and slab-style region reuse built in. this isn't academic. large-scale systems hit allocation pressure that a general-purpose allocator can't be tuned for globally.
and rust's whole allocator abstraction: it's more about expressing ownership across lifetimes in memory-sensitive codebases. `bumpalo` isn't a premature optimization. it's what you reach for when you know the object graph's layout and want to free the whole thing in one call. also safe by design. it's not even painful to use.
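the bump-arena model `bumpalo` gives you, sketched in C with hypothetical names (the rust version gets you this without the unsafety): allocation is a pointer bump, and the whole object graph dies in one call.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base, *cur, *end;
} arena_t;

arena_t arena_new(size_t cap) {
    uint8_t *p = malloc(cap);
    arena_t a = { p, p, p ? p + cap : p };
    return a;
}

void *arena_alloc(arena_t *a, size_t size) {
    size = (size + 15) & ~(size_t)15;            /* keep 16-byte alignment */
    if ((size_t)(a->end - a->cur) < size) return NULL;
    void *p = a->cur;
    a->cur += size;                              /* allocation = one bump  */
    return p;
}

void arena_free_all(arena_t *a) {                /* the "one call" */
    free(a->base);
    a->base = a->cur = a->end = NULL;
}
```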
imo allocator choice is an interface decision. it shapes how the rest of your code handles memory lifetimes. once you start passing lifetimes around as part of your type system or job model, the whole architecture shifts. this is way deeper than "faster malloc".
This is difficult though in higher-level languages. Go tried and failed with arenas.
Has some shortcomings, but I think it should work.
At the start of a scope, mark the allocator's state; defer a release so that when the scope exits, everything allocated since the mark goes away. The cheesy version would be a release function that takes a list of objects to keep.
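That mark/release scheme in C, as I understand it (a sketch with invented names, and without the keep-list part, which would mean copying survivors down below the mark):

```c
#include <stddef.h>

typedef struct { unsigned char *cur, *end; } scratch_t;
typedef unsigned char *scratch_mark_t;

static unsigned char backing[4096];
scratch_t scratch_init(void) {
    scratch_t s = { backing, backing + sizeof backing };
    return s;
}

void *scratch_alloc(scratch_t *s, size_t n) {
    if ((size_t)(s->end - s->cur) < n) return NULL;
    void *p = s->cur;
    s->cur += n;
    return p;
}

scratch_mark_t scratch_mark(scratch_t *s)              { return s->cur; }
void scratch_release(scratch_t *s, scratch_mark_t m)   { s->cur = m;    }

void example(scratch_t *s) {
    scratch_mark_t m = scratch_mark(s);   /* "start of scope": save state  */
    void *tmp = scratch_alloc(s, 256);    /* any amount of scratch work... */
    (void)tmp;
    scratch_release(s, m);                /* "defer": all of it released   */
}
```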
Hmmm... ;)
I'm doing it bare-metal/no-allocator, as I do most embedded projects... and it's flirting with running me out of memory! What do most (and the most popular) protobuf libs in Rust do? Use an allocator. What does the ESP itself do? Use an allocator (via FreeRTOS).
Meanwhile I'm using `heapless` (Vec and String syntax backed by statically allocated arrays), on an MCU with 128K flash and 32K RAM... This won't end well.
but it's just more blah-blah about ai :/
...I'll be here all week. Try the veal!
And replacement: https://news.ycombinator.com/item?id=44350333
If you click through to the issue you will see that mimalloc already offers this feature: `mi_manage_os_memory` to create an arena, `mi_heap_new_in_arena` to create a heap that uses the arena, and `mi_heap_malloc` to allocate memory from that heap.
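Roughly like this, going by the function names above. I'm working from memory of mimalloc's header, so double-check the exact signatures (in particular, I believe it's the `_ex` variant that hands back the arena id):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <mimalloc.h>

int main(void) {
    size_t size = 64u * 1024 * 1024;
    void *block = malloc(size);          /* or mmap'd / pinned / device memory */

    mi_arena_id_t arena;
    mi_manage_os_memory_ex(block, size,
                           /*is_committed=*/true, /*is_large=*/false,
                           /*is_zero=*/false, /*numa_node=*/-1,
                           /*exclusive=*/true, &arena);

    mi_heap_t *heap = mi_heap_new_in_arena(arena);
    void *p = mi_heap_malloc(heap, 128); /* served out of our block */
    (void)p;

    mi_heap_destroy(heap);               /* drops everything in the heap */
    free(block);
    return 0;
}
```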
A basic first-fit free-list allocator can be less than 100 LoC, even in x86 asm. Using the free space itself to store the bookkeeping is a simple but effective optimisation too.
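For illustration, here's that shape in C rather than asm; the headers live inside the blocks themselves, so free memory carries its own bookkeeping. A sketch: no coalescing, no thread safety.

```c
#include <stddef.h>

typedef struct block {
    size_t size;          /* payload bytes after this header          */
    struct block *next;   /* only meaningful while the block is free  */
} block_t;

static _Alignas(16) unsigned char heap[1 << 16];
static block_t *freelist = NULL;

void ff_init(void) {
    freelist = (block_t *)heap;
    freelist->size = sizeof(heap) - sizeof(block_t);
    freelist->next = NULL;
}

void *ff_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;                           /* 8-byte granules */
    for (block_t **prev = &freelist; *prev; prev = &(*prev)->next) {
        block_t *b = *prev;
        if (b->size < n) continue;                      /* first fit */
        if (b->size >= n + sizeof(block_t) + 8) {       /* split the tail */
            block_t *rest = (block_t *)((unsigned char *)(b + 1) + n);
            rest->size = b->size - n - sizeof(block_t);
            rest->next = b->next;
            b->size = n;
            *prev = rest;
        } else {
            *prev = b->next;                            /* take it whole */
        }
        return b + 1;
    }
    return NULL;
}

void ff_free(void *p) {
    block_t *b = (block_t *)p - 1;
    b->next = freelist;                                 /* no coalescing */
    freelist = b;
}
```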
One question I have for those who know the topic at greater depth: are there simpler memory managers that don't use a fixed page size? Or does that make things a lot more complex? Imagine requesting a 1-byte allocation and the kernel handing you a 1-byte page, then immediately requesting 20 GB and getting a single 20 GB page. Other than the memory-management metadata taking up more memory of its own, what are the downsides of dynamic page sizes?
Implementations may also define certain sizes that are more efficient. On modern x86-64, large pages may allow 2 MB, 1 GB, 512 GB, etc. (each level 512x the previous) to be more efficient. Modern ARMv8/v9 has similar large-page support (the exact details depend on the minimum mapping granularity), and it may also support aligned, contiguous mappings which make other sizes efficient (the exact details are even more complicated than simple large pages, and highly optional).
As such, if you want just "1 byte", you write a userspace allocator that asks the kernel for memory conforming to the MMU's limits; it is then the userspace allocator's job to abstract that away by chopping the memory up under its own auspices.
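The classic shape of that, assuming Linux and POSIX `mmap` (names mine): ask the kernel for page multiples, hand out arbitrarily small slices from userspace.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *cur = NULL, *end = NULL;

void *tiny_alloc(size_t n) {
    if (cur == NULL || (size_t)(end - cur) < n) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);    /* typically 4096 */
        size_t len  = (n + page - 1) / page * page;     /* round up to pages */
        if (len < 64 * page) len = 64 * page;           /* amortise syscalls */
        void *m = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED) return NULL;
        cur = m;                   /* leftover tail of the old mapping is
                                      abandoned -- fine for a sketch      */
        end = cur + len;
    }
    void *p = cur;
    cur += n;                      /* the kernel never sees this "1 byte" */
    return p;
}
```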
There's also the matter of space overhead.
A bitmap allocator is simple and has an overhead of 1 bit per chunk (allocated or not). Normally the chunk is at least 16 bytes, but if you know you're doing smaller allocations there's nothing stopping you from doing one at byte granularity.
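A sketch of that in C, at 16-byte chunks. Note one classic wart of pure bitmaps: the allocator can't recover the size on free, so the caller has to pass it back (or you spend extra bits tracking run lengths).

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK   16
#define NCHUNKS 4096

static uint8_t storage[NCHUNKS * CHUNK];   /* 64 KiB managed...            */
static uint8_t bitmap[NCHUNKS / 8];        /* ...for 512 bytes of overhead */

static int  taken(size_t i)       { return bitmap[i / 8] & (1u << (i % 8)); }
static void mark(size_t i, int v) {
    if (v) bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
    else   bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
}

void *bm_alloc(size_t n) {
    size_t need = n ? (n + CHUNK - 1) / CHUNK : 1, run = 0;
    for (size_t i = 0; i < NCHUNKS; i++) {          /* first fit over bits */
        run = taken(i) ? 0 : run + 1;
        if (run == need) {
            size_t start = i + 1 - need;
            for (size_t j = start; j <= i; j++) mark(j, 1);
            return &storage[start * CHUNK];
        }
    }
    return NULL;
}

void bm_free(void *p, size_t n) {                   /* size passed back in */
    size_t start = (size_t)((uint8_t *)p - storage) / CHUNK;
    size_t need  = n ? (n + CHUNK - 1) / CHUNK : 1;
    for (size_t j = start; j < start + need; j++) mark(j, 0);
}
```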
If there exists an invalid value (for example, UTF-8 strings cannot contain the byte 0xff), you can implement allocators with zero overhead, though efficiency requires changing the API (`malloc` requires writing to all the memory to erase the tombstones, so it's more efficient for the allocator to support `strdup` and `realloc_strcat` directly).
By bucketing you can vary the overhead based on allocation size. But also remember, a lot of the trickery you can do here comes at the cost of becoming more vulnerable to use-after-free bugs.
Besides all the missing APIs, one major annoyance related to allocators is the fact that userland lacks support for CPU-local variables that the kernel can use for more efficient operations. `rseq` helps a little but still falls short.
It does not provide a `resize`. You're thinking of `realloc` on hosted environments.
All that to say that writing an allocator is not hard. Writing a good allocator however is another story. Not because the algorithms are complicated, but because it is a very critical part of most systems and requires extensive knowledge of real life situations.
Just like writing a lexer should be. It's amazing how many programming tasks are improved by writing a lexer for them, and how awful the code is from people who have no idea how to write one.
(For example, I wrote a lexer for a function that read a date like "Mar 7, 1997" and turned it into a time_t. The lexer greatly simplified it, as strings like "Mar" and "March" became the same token. A number from 13 to 31 would lex as a Day token. The number 3 would lex as a DayOrMonth token. And so on.)
(In hindsight, I'm not sure that trying to accept any conceivable date format as if by magic was really such a great idea, but that's what my client wanted, and building it was an enjoyable challenge.)
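The token-classification step being described might look something like this (my reconstruction in C, not the original code):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef enum { TOK_MONTH, TOK_DAY, TOK_DAY_OR_MONTH, TOK_YEAR, TOK_BAD } tok_t;

static const char *months[12] = { "jan", "feb", "mar", "apr", "may", "jun",
                                  "jul", "aug", "sep", "oct", "nov", "dec" };

/* classify one whitespace/comma-delimited word of the date string */
tok_t classify(const char *word, int *value) {
    if (isalpha((unsigned char)word[0])) {
        char p[4] = {0};                         /* "Mar" and "March" both */
        for (int i = 0; i < 3 && word[i]; i++)   /* reduce to "mar"...     */
            p[i] = (char)tolower((unsigned char)word[i]);
        for (int m = 0; m < 12; m++)
            if (strcmp(p, months[m]) == 0) { *value = m + 1; return TOK_MONTH; }
        return TOK_BAD;
    }
    long n = strtol(word, NULL, 10);
    *value = (int)n;
    if (n >= 13 && n <= 31) return TOK_DAY;          /* can't be a month */
    if (n >= 1 && n <= 12)  return TOK_DAY_OR_MONTH; /* ambiguous        */
    if (n >= 32)            return TOK_YEAR;         /* e.g. 1997        */
    return TOK_BAD;
}
```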
The reason is pragmatic: fancy strategies take extra compute time, which is too expensive when the allocator sits on the critical path. So what follows has a good chance of not bearing practical value, but (i) there are setups where memory capacity is the bottleneck, and (ii) hey, it's just a PhD anyway.
So what does a fancy strategy even look like? Let's look at the simplest non-trivial version of the problem. Say we know the sizes and lifetimes of all objects in advance, and we want to pack them into as tight a contiguous address space as possible. Mathematicians call this the Dynamic Storage Allocation (DSA) problem, and it's NP-complete, i.e., no polynomial-time optimal algorithm is known for the general case. Deep learning compilers battling the AI memory wall have been returning to DSA lately (it's an old problem).
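Stated concretely (my notation): object i has size s_i and lifetime [a_i, b_i); pick offsets o_i >= 0 minimizing the peak address M, such that objects alive at the same time never overlap in space:

```latex
\min M \quad \text{s.t.} \quad o_i + s_i \le M \;\; \forall i, \qquad
[o_i, o_i + s_i) \cap [o_j, o_j + s_j) = \emptyset
\;\; \text{whenever} \;\; [a_i, b_i) \cap [a_j, b_j) \ne \emptyset .
```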
The implication of the above is that a real allocator, that is, one forced to work under uncertainty w.r.t. the object sizes and lifetimes coming down the line, can always be made to fail remarkably. So going with a general-purpose solution presumes that you know what you're doing (one could argue that most programs out there are "well-behaved", and I would agree, right up until that observation gets turned into a universal law).
Nevertheless, it remains a fact that writing a toy allocator is both educational and soberingly not hard.
Though it's not clear to me that the article does a good job of establishing that this is actually true ("mimalloc is only a few thousand lines of code" doesn't pass the smell test).
I generally agree with the "memory management doesn't have to be as complicated as you might think" vibe, especially if you've read about some optimizations in fancy modern GCs and aren't aware of what a basic simple non-GC world can look like. That said, of course, you can indeed get into a lot of complexity beyond the textbook 101 examples. Like the mentioned threading...