A fast, growable array with stable pointers in C

https://danielchasehooper.com/posts/segment_array/

144•ibobev•9h ago

Comments

zokier•8h ago

> Today’s computers use only 48 bits of the 64 bits in a pointer

https://en.wikipedia.org/wiki/Intel_5-level_paging introduced in Ice Lake 6 years ago.

But anyways, isn't this just variant of std::deque? https://en.cppreference.com/w/cpp/container/deque.html

cornstalks•8h ago

What kind of setups use over 256 TiB of RAM?

reorder9695•8h ago

"640k ought to be enough for anybody"

bonzini•8h ago

In practice it's over 64 TiB because kernels often use a quarter of the available addressing space (half of the kernel addressing space) to map the physical addresses (e.g. FFFFC000_12345678 maps physical address 0x12345678). So 48 virtual address bits can be used with up to 2^46 bytes of RAM.

hinkley•5h ago

And how long has 48 bit addressing been de rigeur? Not so long ago we had processors that could address 40 bits of address space. Or was that 38?

winocm•5h ago

At least since maybe the DEC Alpha 21264. It could address 48-bits of VA space, but that comes with caveats due to PALcode specific intricacies.

I think VMS (or was it Tru64?) uses this mode, but many other OSes just use 43-bit or 40-bit addressing. Realistically though, I don’t think many users would be using workloads that addressed more than 38-bits worth of contiguous VA space in 1998-1999.

loeg•4h ago

amd64 has had 48-bit addressing / 4-level paging from the beginning.

TapamN•8h ago

It not necessarily physical RAM. If you memmap large files, like maybe a large file from RAID or network share, you could still need that much virtual address space.

sestep•8h ago

Does std::deque support random access?

mwkaufma•8h ago

Yes, you can see operator[] in the linked reference docs.

ksherlock•6h ago

The cppreference documentation specifies "Random access - constant O(1)" (same as std::vector). There's a slight overhead from going through 2 pointers instead of 1 and keeping track of how many items are in the first bucket (so you know which bucket to check), but that's a constant and the big O don't care about constants.

mwkaufma•8h ago

In principle it's not that different that deque, though:

(1) deque uses fixed-sized blocks, not increasing-size blocks. (2) dequeue supports prepending, which adds another level of indirection internally.

sigbottle•8h ago

You can support prepending by mirroring the allocations, probably? eg for the "negative index" case do an exponential thing in the other direction.

Your indexing has some legitimate math to be done now which can be annoying (efficiency perspective) I think you can still get o(1) with careful allocation of powers of 2.

o11c•7h ago

That's fine if you only ever add, but is likely to fail if you pop FIFO-style. This too is ultimately fixable but means we can no longer assume "every bucket size doubles".

That said, IMO "stable pointers" is overrated; "minimize copying" is all that's useful.

dwattttt•6h ago

Stable pointers is a limitation for simplicity's sake. If C were better at tracking pointer invalidation, we'd be better at propagating updated pointers.

sigbottle•8h ago

I don't know the precise details of how deques are implemented in C++, but given the most popular stack overflow explanation of them, some immidiate pitfalls are that the T* map itself sounds unbounded and if each chunk allocates only a fixed constant size it's probably horrible for fragmentation or overallocation. The indexing also seems dependent on division.

With this power of twos approach you can't really truly delete from the front of the array but the amount of pointers you store is constant and the memory fragmentation is better. (Though OP never claimed to want to support deque behavior, it shouldn't be that hard to modify, though indexing seems like it has to go thru more arithmetic again)

I haven't used OP's array, but I have been bit plenty of times with std::deque's memory allocation patterns and had to rewrite with raw arrays and pointer tracking.

forrestthewoods•6h ago

std::deque details vary by implementation and is largely considered unusable for MSVC.

MSVC uses a too small block size making it worthless. libc++ block size is 16 elements or 4096 bytes.

It is generally better to use a container you can actually understand the implementation details and control.

I would not call it a variant of std::deque myself. Not wrong. But not a helpful observation imho.

01HNNWZ0MV43FF•8h ago

Readers might also find `plf::colony` interesting: https://www.plflib.org/colony.htm

variadix•8h ago

You can also use virtual memory for a stable resizable vector implementation, up to some max length based on how much you virtual memory you reserve initially, then commit as required to grow the physical capacity.

loeg•8h ago

Yeah, with less runtime overhead, so long as you're ok with the ~4kB minimum allocation size.

IshKebab•6h ago

I think optimal would be just normal vector under 4 kB and then switch to whole pages after that.

Do any vector implementations & allocators actually do this though?

loeg•4h ago

You lose the in-place resize property if you don't start with an entire page.

As far as "has anyone implemented this?" -- I don't know.

fyrn_•7h ago

This mention this alturnative in the article, and also point out how it does not work in embeded contexts or with WASM

ncruces•5h ago

I've actually used this to implement the linear memory of a Wasm runtime (reserve 4GB and commit as needed) and have faced user complaints (it's in a library that uses a Wasm runtime internally) due to it unexpectedly running into issues, particularly under nested virtualization scenarios.

I've needed to add knobs to configure it, because even a handful of 4GB instances causes issues. I've defaulted to 256MB/instance, and for my own GitHub CI use 32MB/instance to reduce test flakiness.

This is to say: I found the idea that “just reserve address space, it's not an issue of you don't commit it” very flaky in practice, unless you're running on bare metal Linux.

loeg•4h ago

Historically the Java runtime did something similar, with similar problems (exacerbated by being older hardware with smaller RAM).

Virtual address space just isn't free.

loeg•4h ago

Embedded environments without virtual memory are increasingly rare, and generally not places where you would need a variably sized generic list ADT anyway.

tovej•8h ago

Very nice! I do wonder if it would be useful to be able to skip even more smaller segments, maybe a ctor argument for the minimum segment size. Or maybe some housekeeping functions to collapse the smallest segments into one.

Mostly the thing that feels strange is when using say, n > 10 segments, then the smallest segment will be less than a thousandth of the largest, and iterating over the first half will access n-1 or n-2 segments, worse cache behaviour, while iterating over the second half will access 1 or two segments.

Seems like, in most cases, you would want to be able to collapse those earlier segments together.

o11c•7h ago

Can we really call it an array if it's not contiguous (or at least strided)? Only a small fraction of APIs take an `iovec, iovcnt`-equivalent ...

dhooper•7h ago

feel free to call it a "levelwise-allocated pile"

jandrese•7h ago

Yeah, the limitation that it can't be just dumped into anything that expects a C array is a large one. You need to structure your code around the access primitives this project implements.

unwind•7h ago

Very nice, although I think the level of "trickery" with the macros becomes a bit much. I do understand that is The Way in C (I've written C for 30 years), it's just not something I'd do very often.

Also, from a strictly prose point of view, isn't it strange that the `clz` instruction doesn't actually appear in the 10-instruction disassembly of the indexing function? It feels like it was optimized out by the compiler perhaps due to the index being compile-time known or something, but after the setup and explanation that was a bit jarring to me.

mananaysiempre•7h ago

The POSIX name for the function is clz() [the C23 name is stdc_leading_zeros(), because that's how the committee names things now, while the GCC intrinsic is __builtin_clz()]. The name of the x86 instruction, on the other hand, is BSR (80386+) or LZCNT (Nehalem+, K10+) depending on what semantics you want for zero inputs (keep in mind that early implementations of BSF/BSR are very slow and take time proportional to the output value). The compiled code uses BSR. (All of these are specified slightly differently, take care if you plan to actually use them.)

unwind•5h ago

Got it, thanks. I suck at x86, even more than I thought. :/

Edit: it's CLZ on Arm [1], probably what I was looking for.

[1]: https://developer.arm.com/documentation/100069/0610/A32-and-...

jovial_cavalier•7h ago

The example code doesn't seem to compile.

seanwessmith•7h ago

Make sure the create the segment_array.h file. Mine outputted just fine on Mac M4

  Desktop gcc main.c        
  Desktop ./a.out

entities[0].a = 1 entities[0].a = 1 entities[1].a = 2

pfg_•7h ago

Zig has this as std.SegmentedList, but it can resize the segment array dynamically

andrewla•7h ago

This is really clever but better to call this a list rather than an array; functions which expect array semantics will simply not work, and there's no way to transparently pass slices of this data structure around.

In the past I've abused virtual memory systems to block off a bunch of pages after my array. This lets you use an array data structure, have guard pages to prevent out of bounds access, and to have stable pointers in the data structure.

hinkley•5h ago

I believe I've seen problems like this also solved with arena allocators. You have certain very special allocations have an arena unto themselves.

stmw•4h ago

Same re: virtual memory systems (using guard pages), that is an old idea that works well but it did once produce a really unpleasant bug in production... But that was an unfortunate implementation mishap.

benlwalker•2h ago

For an expanding array in a 64 bit address space, reserving a big region and mmaping it in as you go is usually the top performing solution by a wide margin. At least on Linux, it is faster to speculatively mmap ahead with MAP_POPULATE rather than relying on page faults, too.

And, if you find you didn't reserve enough address space, Linux has mremap() which can grow the reserved region. Or map the region to two places at once (the original place and a new, larger place).

OptionOfT•1h ago

One place I had issues was rapidly allocating space I needed temporarily but then discarding it.

The space I needed was too large to be added to the heap, so I used mmap. Because of the nature of the processing (mmap, process, yeet mmap) I put the system under a lot of pressure. Maintaining the set of mapped blocks and reusing them around fixed the issue.

vlovich123•1h ago

Freeing memory to the OS or doing things like munmap involves a TLB shootdown which by definition is a performance bottleneck. Probably what you ended up experiencing.

vlovich123•1h ago

FWIW mremap is very expensive as it involves a TLB shootdown.

zoogeny•6h ago

I think the article buries a significant drawback: contiguity. It is obviously implied by the design but I think this approach would have hard-to-define characteristics for things like cache prefetching. The next address is a function, not an easily predictable change.

One frequent reason to use an array is to iterate the items. In those cases, non-contiguous memory layout is not ideal.

forrestthewoods•6h ago

I would not call it “non-contiguous”. It’s more like “mostly contiguous”. Which for large amounts data is “amortized contiguous” just like a regular vector has “amortized constant” time to add an element.

dhooper•6h ago

1. The image at the top of the article makes it clear the segments aren't contiguous

2. iterating a 4 billion item segmented array would have 26 cache misses. Not a big deal.

taminka•6h ago

i love a nice single header project, it's c23 only tho (bc of typeof)?

listeria•4h ago

They link to an old article [1] that was featured in HN [2] somewhat recently, in which there's a workaround for older standards with regards to typeof.

[1]: https://danielchasehooper.com/posts/typechecked-generic-c-da...

[2]: https://news.ycombinator.com/item?id=44425461

johnisgood•4h ago

It is a GNU extension, supported by both GCC and Clang.

gotoeleven•6h ago

Under what conditions is exponential segment sizing preferable to fixed size segments? Are there any specific algorithms or situations where this is especially good? It seems like the likelihood of large amounts of wasted space is a major downside.

o11c•6h ago

It's always better - the increase in indexing complexity is negligible, and it completely eliminates resizing of the top-level array. It also reduces the number of calls to `malloc` while keeping capacity proportional to active size.

listeria•6h ago

They mention using this as the backing array for a power-of-two-sized hash table, but I don't think it would be very useful considering that the hash table won't provide stable pointers, given that you would need to rehash every element as the table grows. Even if you just wanted to reuse the memory, rehashing in-place would be a PITA.

Lichtso•5h ago

Another similar data structure which has a balanced tree (instead of a list) that references array segments is the https://en.wikipedia.org/wiki/Rope_(data_structure)

Its main advantages are the O(log n) time complexity for all size changes at any index, meaning you can efficiently insert and delete anywhere, and it is easy to implement copy-on-write version control on top of it.

monkeyelite•3h ago

A lot of standard libraries have tried to implement ropes for things like strings and it usually is reverted for a simpler structure.

josephg•2h ago

There's a good reason for that. Almost all strings ever created in programs are either very small, immutable or append-only. Eg, text labels in a user interface, body of a downloaded HTTP request or a templated HTML string, respectively. For these use cases, small string optimisations and resizable vecs are better choices. They're simpler and faster for the operations you actually care about.

The only time I've ever wanted ropes is in text editing - either in an editor or in a CRDT library. They're a good choice for text editing because they let users type anywhere in a document. But that comes at a cost: Rope implementations are very complex (skip lists have similar complexity to a b-tree) and they can be quite memory inefficient too, depending on how they're implemented. They're a bad choice for small strings, immutable strings and append only strings - which as I said, are the most common string types.

Ropes are amazing when you need them. But they don't improve the performance of the average string, or the average program.

leecommamichael•4h ago

I’d rather do it in Odin, and I do.

willtemperley•1h ago

Looks a bit like rust-array-stump [1] which was built to optimise insertions in the middle of an array in computational geometry.

[1] https://github.com/bluenote10/rust-array-stump

kazinator•1h ago

> In other words [because the access sequence is just 10 instructions], memory will be the bottleneck, not the instructions to calculate where an index is.

Ha, that is wishful thinking. If you do this in a tight loop in which everything is in the L1 cache, the instructions hurt!

"Memory bandwidth is the bottleneck" reasoning applies when you access bulk data without localized repetition.

Mac history echoes in current Mac operating systems

Claude Code IDE integration for Emacs

Rules by Which a Great Empire May Be Reduced to a Small One (1773)

A Candidate Giant Planet Imaged in the Habitable Zone of α Cen A

Project Hyperion: Interstellar ship design competition

Litestar is worth a look

Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs

The Day MOOCs Died: Coursera's Preview Mode Kills Free Learning

More than two hard disks in DOS

We'd be better off with 9-bit bytes

Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model

Jules, our asynchronous coding agent

Writing a Rust GPU kernel driver: a brief introduction on how GPU drivers work

You know more Finnish than you think

A fast, growable array with stable pointers in C

The Bluesky Dictionary

Apple increases US commitment to $600B, announces American Manufacturing Program

301party.com: Intentionally open redirect

Multics

Out-Fibbing CPython with the Plush Interpreter

Comptime.ts: compile-time expressions for TypeScript

A Man Who Beat IBM

Show HN: HMPL – Small Template Language for Rendering UI from Server to Client

Breaking the sorting barrier for directed single-source shortest paths

The Inkhaven Blogging Residency

Automerge 3.0

Zig Error Patterns

303Gen – 303 acid loops generator

Rethinking DOM from first principles

AI in Search is driving more queries and higher quality clicks

Mac history echoes in current Mac operating systems

Claude Code IDE integration for Emacs

Rules by Which a Great Empire May Be Reduced to a Small One (1773)

A Candidate Giant Planet Imaged in the Habitable Zone of α Cen A

Project Hyperion: Interstellar ship design competition

Litestar is worth a look

Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs

The Day MOOCs Died: Coursera's Preview Mode Kills Free Learning

More than two hard disks in DOS

We'd be better off with 9-bit bytes

Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model

Jules, our asynchronous coding agent

Writing a Rust GPU kernel driver: a brief introduction on how GPU drivers work

You know more Finnish than you think

A fast, growable array with stable pointers in C

The Bluesky Dictionary

Apple increases US commitment to $600B, announces American Manufacturing Program

301party.com: Intentionally open redirect

Multics

Out-Fibbing CPython with the Plush Interpreter

Comptime.ts: compile-time expressions for TypeScript

A Man Who Beat IBM

Show HN: HMPL – Small Template Language for Rendering UI from Server to Client

Breaking the sorting barrier for directed single-source shortest paths

The Inkhaven Blogging Residency

Automerge 3.0

Zig Error Patterns

303Gen – 303 acid loops generator

Rethinking DOM from first principles

AI in Search is driving more queries and higher quality clicks

A fast, growable array with stable pointers in C

Comments