CPU cache-friendly data structures in Go

https://skoredin.pro/blog/golang/cpu-cache-friendly-go

193•g0xA52A2A•4mo ago

Comments

truth_seeker•4mo ago

> False Sharing : "Pad for concurrent access: Separate goroutine data by cache lines"

This would be worth adding to the Go race detector, so that it warns the developer.

solatic•4mo ago

Most modern processor architectures use 64-byte cache lines, but not all of them do. Once you start adding optimizations tuned to the cache line size, you're fundamentally optimizing for a particular processor architecture.

That's fine for most deployments, since the vast majority of deployments will go to x86_64 or arm64 these days. But Go supports PowerPC, Sparc, RISCV, S390X... I don't know enough about them, but I wouldn't be surprised if they weren't all 64-byte CPU cache lines. I can understand how a language runtime that is designed for architecture independence has difficulty with that.

dadkins•4mo ago

The big two, x86_64 and arm64, have 64-byte cache lines, so that's a reasonable assumption in practice. But I was surprised to discover that Apple's M-series laptops have 128-byte cache lines, and that's something a lot of people have and run, albeit not as a server.

pstuart•4mo ago

Seems like judicious build tag/file extensions would allow for such optimizations with a fallback to no optimization.

danudey•4mo ago

Something like C++17's `std::hardware_destructive_interference_size` would be nice; being able to just say "Align this variable to whatever the cache line size is on the architecture I'm building for".

If you use these tricks to align everything to 64-byte boundaries you'll see those speedups on most common systems but lose them on e.g. Apple's ARM64 chips, and POWER7, 8, and 9 chips (128 byte cache line), s390x (256 byte cache line), etc. Having some way of doing the alignment dynamically based on the build target would be optimal.

loeg•4mo ago

Apple arm64 supposedly has 64-byte L1 cache line size and 128-byte L2? How does that work? Presumably the lines are independent in L1, but can different cores have exclusive access to adjacent lines? What's the point of narrower lines in L1?

danudey•4mo ago

Maybe the point isn't narrower lines in L1 but wider lines in L2? Implicitly bringing in more data to the L2 cache but allowing the CPU to pick smaller chunks of it into L1 cache to work on. Something like a forced prefetch or something? Honestly no idea.

senderista•4mo ago

Only landed in clang last year: https://github.com/llvm/llvm-project/pull/89446

loeg•4mo ago

On which architecture are cache lines not 64 bytes? It's almost universal.

x-complexity•4mo ago

https://news.ycombinator.com/item?id=45529810

https://cpufun.substack.com/i/32474663/notable-differences

As noted by the other comments, Apple's M-series chips seem to use a 128-byte cache line. ARM doesn't mandate that their licensees must use a pre-specified cache line size: 64 bytes just happens to be the consensus-arrived standard.

readthenotes1•4mo ago

I wonder how many nanoseconds it'll take for the next maintainer to obliterate the savings?

hu3•4mo ago

That's just one prompt away!

jasonthorsness•4mo ago

For low-level small things tests can help. Go has good benchmarking built-in and you can use a tool that passes/fails based on statistically-significant regressions, benchstat (https://pkg.go.dev/golang.org/x/perf/cmd/benchstat).

wy1981•4mo ago

Looks nice. Some explanation for those of us not familiar with Go would've been more educational. Could be future posts, I suppose.

danudey•4mo ago

Honestly I don't think there's much in here that's Go-specific other than the syntax itself. I've seen basically the same tricks used in C or C++.

Was there any particular part that felt like it needed more explanation?

gethly•4mo ago

Most of this should be handled by the compiler already. But it is only 2025, I guess we're just not ready for it.

pphysch•4mo ago

Not really, virtually all these patterns involve tradeoffs that require understanding the data access patterns.

I don't want my compiler adding more padding than bare minimum to every struct. I don't want it transforming an AoS to SoA when I choose AoS to match data access patterns. And so on...

At best Go could add some local directives for compiling these optimizations, but these code changes are really minimal anyways. I would rather see the padding explicitly than some abstract directive.

danudey•4mo ago

I could imagine some kind of compiler declaration in C that would do something like specify break points - sort of like page breaks - for structs, or tell the compiler to automatically pad structs out so that components are on page boundaries, cache line boundaries, etc. Sort of "If we're not properly aligned, add whatever padding you think is best here".

I guess this is largely provided by std::hardware_destructive_interference_size in C++17, but I'm not sure if there are other language equivalents.

https://en.cppreference.com/w/cpp/thread/hardware_destructiv...

mtklein•4mo ago

I think this is _Alignas/alignas.

    struct foo {
        _Alignas(64) float x,y;
        _Alignas(64) int     z;
    };
    _Static_assert(sizeof(struct foo) == 192, "");

danudey•4mo ago

The example I linked uses alignas, but the key is knowing what value to pass. std::hardware_destructive_interference_size tells you what the current/target hardware's correct align value is, which is the challenge.

truth_seeker•4mo ago

Or at least a linter should catch this.

https://golangci-lint.run/docs/linters/

CamouflagedKiwi•4mo ago

I think it's really beyond the power of a linter to understand when this would matter. It'd be issuing warnings on almost every struct out there saying "these two members share a cache line" which you almost never care about.

jerf•4mo ago

Are you thinking of some sort of annotation the compiler could read and handle?

Because if a compiler starts automatically padding all my structures to put all of the members on their own cache line I'm going to be quite peeved. It would be easy for it to do, yes, but it would be wrong 99%+ of the time.

A far more trenchant complaint is that Go won't automatically sort struct members if necessary to shrink them and you have to apply a linter to get that at linting time if you want it.
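
A small sketch of the field-ordering point, using two made-up types (`Loose` and `Tight`) that hold the same fields in a different order. The sizes are whatever the compiler reports on the build target, so the program prints them instead of assuming particular values.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Loose declares fields in an order that forces padding around the int64.
type Loose struct {
	A bool
	B int64
	C bool
}

// Tight holds the same fields, largest first, so the two bools sit together.
type Tight struct {
	B int64
	A bool
	C bool
}

// Sizes returns the size in bytes of each layout on the current target.
func Sizes() (loose, tight uintptr) {
	return unsafe.Sizeof(Loose{}), unsafe.Sizeof(Tight{})
}

func main() {
	loose, tight := Sizes()
	fmt.Printf("Loose: %d bytes, Tight: %d bytes\n", loose, tight)
}
```

The `fieldalignment` analyzer in `golang.org/x/tools` is one linter that reports structs like `Loose`.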

danudey•4mo ago

I'm not sure if golang has the same fundamental issues in common use, but in e.g. C you don't want the compiler reordering your structs or adding arbitrary padding because that makes it incompatible with other in-memory representations - e.g. if you're using shared memory with another process that hasn't received the same optimizations, if you're loading raw data into memory/using mmap, etc.

Likewise, one of the examples is moving from an array of structs to a struct of arrays; that's a lot more complex of a code reorganization than you'd want a compiler doing.

It would be good to have a static analyzer that could suggest these changes, but, at least in many cases, you don't want them done automatically.

Thaxll•4mo ago

Do compilers (GCC/LLVM) actually do that?

jandrewrogers•4mo ago

How would that even work? The layout of data structures is constrained by many invariants not visible to the compiler (see also: auto-vectorization). It would be more work and boilerplate to add sufficient annotations to a data structure to enable the compiler to safely modify the layout than just using the layout you want.

mappu•4mo ago

Some languages like Odin, ISPC, and Jai all have annotations that can automatically transform AoS to SoA. A key benefit is you can easily experiment to see if this helps your application, without doing a major refactor.

In https://github.com/golang/go/issues/64926 it was a bridge-too-far for the Go developers (fair enough) but maybe it could still happen one day.

tuetuopay•4mo ago

Overall great article, applicable to other languages too.

I'm curious about the Goroutine pinning though:

    // Pin goroutine to specific CPU
    func PinToCPU(cpuID int) {
        runtime.LockOSThread()
        // ...
        tid := unix.Gettid()
        unix.SchedSetaffinity(tid, &cpuSet)
    }

The way I read this snippet is it pins the go runtime thread that happens to run this goroutine to a cpu, not the goroutine itself. Afaik a goroutine can move from one thread to another, decided by the go scheduler. This obviously has some merits, however without pinning the actual goroutine...

superb_dev•4mo ago

`runtime.LockOSThread()` will pin the current goroutine to the OS thread that it's currently running on.

tuetuopay•4mo ago

Oooh, that's what is happening. I assumed it locked some structure about the thread while touching it, to prevent races with the runtime. That's what I get for not RTFM'ing (in fairness, why Lock and not Pin, when both of these words have pretty well defined meanings in programming?)

Thank you

Someone•4mo ago

And prevents other goroutines from running on that thread. I think that’s crucial.
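
A minimal stdlib-only sketch of the locking behaviour described in this subthread. `RunPinned` is a hypothetical helper, and the CPU affinity call from the article's snippet is deliberately left out, since it needs `golang.org/x/sys/unix` and only works on Linux.

```go
package main

import (
	"fmt"
	"runtime"
)

// RunPinned runs work on a goroutine that is wired to one OS thread for
// the duration of the call, then returns the result.
func RunPinned(work func() int) int {
	done := make(chan int)
	go func() {
		// From here on, this goroutine always runs on the same OS thread,
		// and no other goroutine is scheduled onto that thread.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		// A thread-level call such as setting CPU affinity would go here.
		done <- work()
	}()
	return <-done
}

func main() {
	sum := RunPinned(func() int {
		total := 0
		for i := 1; i <= 100; i++ {
			total += i
		}
		return total
	})
	fmt.Println(sum)
}
```

If the thread's state has been changed (for example its affinity mask), it may be preferable to skip the `UnlockOSThread` call: per the `runtime` documentation, a goroutine that exits while still locked causes its thread to be terminated instead of being reused.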

hu3•4mo ago

> False sharing occurs when multiple cores update different variables in the same cache line.

I got hit by this. In a trading algorithm backtest, I shared a struct pointer between threads that changed different members of the same struct.

Once I split this struct in 2, one per core, I got almost 10x speedup.

spockz•4mo ago

Interesting! Did you find out a way to bench this with the built in benchmarking suite?

hu3•4mo ago

No it was just a hunch derived from ancient times of SoA vs AoS game dev optimization. I didn't need the perf gain but tried and it worked.
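
For anyone wanting to try this themselves, here is a rough timing sketch, not the code from the backtest described above. The types `shared` and `padded` and the 128-byte spacing are assumptions for illustration. Results vary a lot by machine, so no numbers are claimed here.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// shared keeps both counters adjacent, very likely on one cache line.
type shared struct {
	a, b uint64
}

// padded puts b 128 bytes after a (assumed to exceed the line size).
type padded struct {
	a uint64
	_ [120]byte
	b uint64
}

// hammer increments two counters from two goroutines and reports elapsed time.
func hammer(a, b *uint64, n int) time.Duration {
	var wg sync.WaitGroup
	start := time.Now()
	for _, p := range []*uint64{a, b} {
		wg.Add(1)
		go func(p *uint64) {
			defer wg.Done()
			for i := 0; i < n; i++ {
				atomic.AddUint64(p, 1)
			}
		}(p)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	const n = 5_000_000
	var s shared
	var p padded
	fmt.Println("shared line:", hammer(&s.a, &s.b, n))
	fmt.Println("padded:     ", hammer(&p.a, &p.b, n))
}
```

The same two cases could be wrapped in `testing.B` benchmarks and compared with benchstat, which gives a statistical comparison instead of a single wall-clock reading.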

tapirl•4mo ago

Source code of the benchmarks?

At least, the False Sharing and AddVectors tricks don't work on my computer. (I only benchmarked the two. The "Data-Oriented Design" trick is a joke to me, so I stopped benchmarking more.)

And I never heard of this following trick. Can anyone explain it?

    // Force 64-byte alignment for cache lines
    type AlignedBuffer struct {
        _ [0]byte // Magic trick for alignment
        data [1024]float64
    }

Maybe the intention of this article is to fool LLMs. :D

kbolino•4mo ago

I can't find any claim anywhere else about the [0]byte trick, and in fact my own experiments in the playground show that it doesn't do anything.

If you embed an AlignedBuffer in another struct type, with smaller fields in front of it, it doesn't get 64-byte alignment.

If you directly allocate an AlignedBuffer (as a stack var or with new), it seems to end up page-aligned (the allocator probably has size classes) regardless of the presence of the [0]byte field.

https://go.dev/play/p/Ok7fFk3uhDn

Example output (w is a wrapper, w.b is the field in the wrapper, x is an allocated int32 to try to push the heap base forward, b is an allocated AlignedBuffer):

  &w   = 0xc000126000
  &w.b = 0xc000126008
  &x   = 0xc00010e020
  &b   = 0xc000138000

Take out the [0]byte field and the results look similar.
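
The same conclusion can be checked without looking at addresses, by asking the compiler what alignment and size it assigns to the type. This is a sketch with two made-up type names: `WithTrick` copies the article's layout and `Plain` drops the zero-length field.

```go
package main

import (
	"fmt"
	"unsafe"
)

// WithTrick copies the article's layout, including the zero-length field.
type WithTrick struct {
	_    [0]byte
	data [1024]float64
}

// Plain is the same struct without the zero-length field.
type Plain struct {
	data [1024]float64
}

// Layout reports the alignment and size the compiler assigns to each type.
func Layout() (alignTrick, alignPlain, sizeTrick, sizePlain uintptr) {
	return unsafe.Alignof(WithTrick{}), unsafe.Alignof(Plain{}),
		unsafe.Sizeof(WithTrick{}), unsafe.Sizeof(Plain{})
}

func main() {
	at, ap, st, sp := Layout()
	fmt.Printf("WithTrick: align=%d size=%d\n", at, st)
	fmt.Printf("Plain:     align=%d size=%d\n", ap, sp)
}
```

A struct's alignment is the largest alignment among its fields, and `[0]byte` has alignment 1, so it cannot raise the alignment above that of `float64`.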

skinowski•4mo ago

They meant "[0]uint64" probably, not "[0]byte".

tapirl•4mo ago

"[0]uint64" only guarantees 8-byte alignment (and only under certain scenarios), not 64-byte.

tapirl•3mo ago

Sorry, the False Sharing trick works. See https://news.ycombinator.com/item?id=45547441

OutOfHere•4mo ago

If you are worrying about cache structure latencies in Go, maybe you should just be using Rust or Zig instead that implicitly handle this better.

nasretdinov•4mo ago

Not necessarily: you can go quite far with Go alone. It also makes it trivial to run "green threads" code, so if you need both (decent) performance and easy async code then Go still might be a good fit. Despite Go being a pretty high-level GC language on the surface, it actually allows you to control stuff like struct layout, CPU affinity, etc., which typically matter more for performance than the choice of programming language. There's a reason why e.g. VictoriaMetrics is still in Go even though they could've easily chosen any other language too.

OutOfHere•4mo ago

In what way does Go have async?

all2•4mo ago

Aren't goroutines by their nature asynchronous? Am I misunderstanding what you mean by 'async'?

OutOfHere•4mo ago

Asynchronous is a programming style; it does NOT apply to Go. The goroutines run in parallel. Also, don't use complicated words when simple words will do.

all2•4mo ago

    Asynchronous is a programming style; it does NOT apply to Go.

Ok, good to know. I guess I jammed threading and async into the same slot in my brain.

    Also, don't use complicated words when simple words will do.

I'm not sure what you mean by this in relation to my above comment.

aforwardslash•4mo ago

In golang, there is no guarantee that goroutines will run in parallel; also, it is quite common to use channels as means of synchronization of results, akin to common async programming patterns.

OutOfHere•4mo ago

> In golang, there is no guarantee that goroutines will run in parallel;

That's not specific to Go lang. Most of the constraints you're thinking of apply to all parallel programming languages. It goes with the territory. All parallel programming languages impose certain flavors of their management of parallelism.

sa46•4mo ago

Go is certainly capable of async programming. https://en.wikipedia.org/wiki/Asynchrony_(computer_programmi...

> The goroutines run in parallel. Also, don't use complicated words when simple words will do.

That’s not called for, especially since you’re wrong.

nasretdinov•4mo ago

Well, essentially Go doesn't have a separate async keyword because all goroutines run asynchronously under the hood. In the beginning the advised (and default) way of running Go code was GOMAXPROCS=1, essentially ensuring there is no actual parallelism, just asynchronous code. Since then, of course, around Go 1.5, the default switched to number of cores, making goroutines both async and parallel
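
A short sketch of the pattern discussed in this subthread: goroutines started without any async keyword, with a channel used to collect their results. `SumSquares` is a made-up function for illustration. Whether the goroutines actually run in parallel depends on GOMAXPROCS and the scheduler; the result is the same either way.

```go
package main

import (
	"fmt"
	"runtime"
)

// SumSquares starts one goroutine per input and collects results over a channel.
func SumSquares(nums []int) int {
	results := make(chan int)
	for _, n := range nums {
		go func(n int) { results <- n * n }(n)
	}
	total := 0
	for range nums {
		total += <-results // blocks until some goroutine has a result ready
	}
	return total
}

func main() {
	// Passing 0 queries the current setting without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println(SumSquares([]int{1, 2, 3, 4}))
}
```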

pjmlp•4mo ago

News for most folks: even writing C does not help if this advice on how to lay out structures is ignored and algorithms are not written with mechanical sympathy in mind.

citizenpaul•4mo ago

Really cool article this is the kind of stuff I still come to HN for.

jasonthorsness•4mo ago

If you are sweating this level of performance, are larger gains possible by switching to C, C++, Rust? How is Rust for micro-managing memory layouts?

loeg•4mo ago

You need to do the exact same kinds of thing in C/C++/Rust. I believe Rust struct layout is not guaranteed to match program order unless you use an annotation forcing it (repr(C)). (So to answer the question: it's great; as good as any other language for micromanaging layout.)

0x457•4mo ago

Yes, without repr(C) order and padding isn't guaranteed. You would use https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struc... or similar to force fields not being on the same cache line.

jasonthorsness•3mo ago

huh TIL

"On modern Intel architectures, spatial prefetcher is pulling pairs of 64-byte cache lines at a time, so we pessimistically assume that cache lines are 128 bytes long."

loeg•3mo ago

That was true in like, 2011. I'm not sure if it's true anymore.

0x457•3mo ago

Pretty sure it started being a thing at Sandy Bridge and never stopped?

loeg•3mo ago

I don't think the impact on adjacent cache lines is as severe as it was on Sandy Bridge.

luispa•4mo ago

great article!

ls-a•4mo ago

reminds me of cache oblivious data structures

gr4vityWall•4mo ago

Good article.

Regarding AoS vs SoA, I'm curious about the impact in JS engines. I believe it would be a significant compute performance difference in favor of SoA if you use typed arrays.

matheusmoreira•4mo ago

Structure of arrays makes a lot of sense, reminds me of how old video games worked under the hood. It seems very difficult to work with though. I'm so used to packing things into neat little objects. Maybe I just need to tough it out.
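
To show what the two layouts look like side by side, here is a minimal sketch with a hypothetical `Particle` type. It only illustrates the shape of the code; whether the SoA version is faster depends on which fields a loop touches, so it should be measured.

```go
package main

import "fmt"

// Particle is the array-of-structs element: all fields of one item together.
type Particle struct {
	X, Y, Z float64
	Mass    float64
}

// Particles is the struct-of-arrays form: one slice per field.
type Particles struct {
	X, Y, Z []float64
	Mass    []float64
}

// ToSoA copies an AoS slice into the SoA form.
func ToSoA(in []Particle) Particles {
	var out Particles
	for _, p := range in {
		out.X = append(out.X, p.X)
		out.Y = append(out.Y, p.Y)
		out.Z = append(out.Z, p.Z)
		out.Mass = append(out.Mass, p.Mass)
	}
	return out
}

// TotalMassAoS strides past X, Y and Z to reach each Mass.
func TotalMassAoS(ps []Particle) float64 {
	total := 0.0
	for i := range ps {
		total += ps[i].Mass
	}
	return total
}

// TotalMassSoA reads one dense slice and touches nothing else.
func TotalMassSoA(ps Particles) float64 {
	total := 0.0
	for _, m := range ps.Mass {
		total += m
	}
	return total
}

func main() {
	aos := []Particle{{Mass: 1}, {Mass: 2}, {Mass: 3}}
	fmt.Println(TotalMassAoS(aos), TotalMassSoA(ToSoA(aos)))
}
```

Keeping the neat little object as the public type and converting at the boundary of the hot loop, as `ToSoA` does, is one way to limit how much of the codebase has to change.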

ardanur•4mo ago

"Data Oriented Design" is more than just for performant code.

You can and perhaps should also use it to reason about and design software in general. All software is just the transformation of data structures. Even when generating side-effects is the goal, those side-effects consume data structures.

I generally always start a project by sketching out data structures all the way from the input to the output. May get much harder to do when the input and output become series of different size and temporal order and with other complexities in what the software is supposed to be doing.

another_twist•4mo ago

Good programmers worry about the algorithms. Great ones worry about the data structures and the relationships between them. If memory serves, it was Kernighan.

vacuity•3mo ago

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships." - Linus Torvalds

"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they’ll be obvious." - Fred Brooks, The Mythical Man Month

And two threads with some further discussion I found while looking for these quotes:

https://news.ycombinator.com/item?id=17580598

https://news.ycombinator.com/item?id=10293795

loeg•4mo ago

This is a really dense whirlwind summary of some common performance pitfalls. It's a nice overview in a sort of terse way. The same optimizations / patterns apply in other languages as well.

furyofantares•4mo ago

I waited half a day to post this, I think we aren't supposed to question if articles are LLM written - but this one really triggered my LLM-radar, while also being very well received.

I'd love to know how much LLM was used to write this if any, and how much effort went into it as well (if it was LLM-assisted.)

ayuhito•4mo ago

> I'd love to know how much LLM was used to write this if any, and how much effort went into it as well (if it was LLM-assisted.)

Are people supposed to be obligated to post such a report nowadays?

I enjoyed the article and found it really interesting, but seeing these types of comments always kind of puts a damper on it afterwards.

furyofantares•4mo ago

> Are people supposed to be obligated to post such a report nowadays?

No, typically when I ask questions it's optional.

> I enjoyed the article and found it really interesting, but seeing these types of comments always kind of puts a damper on it afterwards.

That is why I waited half a day, and until after there were lots of comments praising the article. Still, I'm sorry if it put a damper on it for you.

Also the whole reason I asked about the source is because I think the article has a lot of merit and so I am curious if it's because the author put a lot of work in (LLM-assisted or not.) Usually when I get that feeling it's followed by a realization I'm wasting my time on something the author didn't even read closely.

But I didn't get that this time, and I'd love more examples of LLMs being used (with effort, presumably) to produce something the author could take pride in.

furyofantares•3mo ago

> But I didn't get that this time,

Actually, I take it back. I did think I was wasting my time when I noticed it was written by an LLM. But then I came back to HN and saw only praise and decided to wait a bit to see if people kept finding it useful before commenting.

I was somewhat excited by the prospect of this article being useful, but I've started to come around to my initial impression after another day. I don't really trust it.

darccio•4mo ago

The structure reads as LLM written. I don't mind this unless the content is utterly wrong. I was actually learning about cache-friendly data structures and I'm really interested in that cache-friendly Robin Hood hashing but now I worry it's a hallucination.

tapirl•3mo ago

None of the tricks in this article get verified. Almost all of them are false.

tapirl•3mo ago

Sorry, the False Sharing trick works. See https://news.ycombinator.com/item?id=45547441

tapirl•3mo ago

None of the tricks in this article get verified. It is totally solemn drivel.

Interestingly and surprisingly, there are numerous comments here praising it.

furyofantares•3mo ago

FWIW, which may be not much - I had codex cli try to verify the results. On my M2 Macbook Air only the first example (False Sharing) did anything - a 23x speedup compared to the article's 6x speedup. All the others didn't produce any speedup at all.

Of course I didn't verify the results I got either - I'm not about to spend hours trying to figure out if this is just slop. But I think it is.

tapirl•3mo ago

Could you share the benchmark source code of the first example?

furyofantares•3mo ago

Here's the one that showed a lot more speedup than the article:

https://pastebin.com/v9tczpus

Looks like the LLM invented somewhat different test for it than the article had. I tried again and have this with the same data structure as in the article:

https://pastebin.com/SDdcchZG

That gave similar results to the article.

All the other tests still give little-to-no speedup on my machine.

tapirl•3mo ago

Many thanks for providing the source. It also works on my machine.

TIL.

furyofantares•3mo ago

I tried the others on my x86 machine and they all do something for me - not nearly as much as the article, but something.

tapirl•3mo ago

The "_ [0]byte" trick has no base in my knowledge. For the author's specified example, [1024]float64 will be always allocated on one whole page, aka, always 64-byte aligned.

For "Array of Structs vs Struct of Arrays", using slices as fields is a good idea. If the purpose is to make fields allocated on their respective memory block, just use pointers instead.

furyofantares•3mo ago

> The "_ [0]byte" trick has no base in my knowledge. For the author's specified example, [1024]float64 will be always allocated on one whole page, aka, always 64-byte aligned.

You're right - I read the results I had wrong on that one. That one is slower, not faster, on both my M2 and on x86 machine.

tapirl•3mo ago

My last comment has imprecision and misunderstanding.

> ... [1024]float64 will be always allocated on one whole page, aka, always 64-byte aligned.

if it is allocated on heap and at the start of allocated memory block.

> For "Array of Structs vs Struct of Arrays", using slices as fields is a good idea. If the purpose is to make fields allocated on their respective memory block, just use pointers instead.

I misunderstood it.

It is like row-based database vs. column-based database. Both ways have their respective advantages and disadvantages.

kbolino•3mo ago

I don't see this mentioned anywhere else, but Go may start experimenting with rearranging struct fields at some point. The marker type structs.HostLayout was added in Go 1.23 to indicate that you want the struct to follow the platform's layout rules (think of it like #[repr(C)] in Rust). This may become necessary to ensure the padding actually sits between the two falsely shared fields. You could combine it with the padding technique like this:

  type PaddedExample struct {
    _       structs.HostLayout
    Field1  int64
    _       [56]byte
    Field2  int64
  }
