- Header parsing (40% of time)
- Template instantiation (40% of time)
- Backend (20% of time)
For my use case, it seems like this cache would only kick in after 80% of the work has already been done. Ccache, on the other hand, doesn't require any of that work to be done. On a side note, template instantiation caching is a very interesting strategy, but today's compilers don't use it (there was a commercially sold compiler a while back that did have it, though [1]).
Another thing I'd consider interesting is parse caching, from tokens to AST. Most headers don't change, so even when a TU needs to be recompiled, most parts of the AST could be reused. (Some kind of cleverer, more transparent precompiled headers.) This would likely require changes to the AST data structures for fast serialization and loading/inserting, which makes me think that the textbook approach of generating an AST may be a bad idea if we care about fast compilation.
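To make that concrete, here is a minimal sketch of the header-level parse cache I have in mind. All types and names are hypothetical (not from any existing compiler), and the hard part, AST nodes that can be stored and reloaded, is exactly what's stubbed out:

    #include <cstddef>
    #include <memory>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Hypothetical stand-ins for a real compiler's data structures.
    struct AstFragment {              // parsed contents of one header
        std::string placeholder;      // a real compiler would store AST nodes here
    };
    using ContentHash = std::size_t;  // hash of the preprocessed header text

    class ParseCache {
    public:
        // Return the cached fragment if the header's content hash still matches;
        // otherwise (re)parse the header and remember the result.
        std::shared_ptr<const AstFragment>
        getOrParse(const std::string &path, ContentHash hash) {
            auto it = cache_.find(path);
            if (it != cache_.end() && it->second.first == hash)
                return it->second.second;                 // header unchanged: reuse
            auto frag = std::make_shared<const AstFragment>(parse(path));
            cache_[path] = {hash, frag};
            return frag;
        }

    private:
        static AstFragment parse(const std::string &path) {
            // Stub: tokenization and AST construction would happen here.
            return AstFragment{"ast-of:" + path};
        }

        std::unordered_map<std::string,
                           std::pair<ContentHash,
                                     std::shared_ptr<const AstFragment>>> cache_;
    };

The real difficulty is inside AstFragment: nodes in one header refer to declarations from other headers, so reuse needs position-independent references or a merge-on-load step, which is exactly why the textbook pointer-graph AST serializes poorly.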
Tangentially, I'm astonished that they claim correctness while a large amount of IR is inadequately (if at all) captured in the hash: comdats, symbol visibility, aliases, constant expressions, blockaddress, calling conventions/attributes for indirect calls, phi nodes, fast-math flags, GEP types, ... I'm also a bit annoyed, because this is the kind of research that is sloppily implemented, only evaluates projects where compile time is not a big problem and therefore achieves only small absolute savings, and papers over the inherent difficulties (here: capturing the IR, and parse time) that make it unlikely to be used in practice.
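To illustrate what I mean, here is the kind of state a correctness-preserving IR hash would have to fold in, sketched against the LLVM C++ API. This is purely illustrative, not the paper's scheme, and it still omits operands, attributes, metadata, globals' initializers, and more:

    // Sketch only: some of the extra state a "correct" IR hash needs.
    #include "llvm/ADT/Hashing.h"
    #include "llvm/IR/GlobalAlias.h"
    #include "llvm/IR/InstrTypes.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Operator.h"

    llvm::hash_code hashModuleForCache(const llvm::Module &M) {
      llvm::hash_code H = llvm::hash_value(0);  // seed

      // Aliases are easy to forget: they change link-time behavior without
      // touching any function body.
      for (const llvm::GlobalAlias &GA : M.aliases())
        H = llvm::hash_combine(H, GA.getName(), GA.getAliasee()->getName());

      for (const llvm::Function &F : M) {
        // Linkage, visibility, comdat and calling convention all affect the
        // emitted object file even if the instruction stream is identical.
        H = llvm::hash_combine(H, F.getLinkage(), F.getVisibility(),
                               F.getCallingConv());
        if (F.hasComdat())
          H = llvm::hash_combine(H, F.getComdat()->getName());

        for (const llvm::BasicBlock &BB : F)
          for (const llvm::Instruction &I : BB) {
            H = llvm::hash_combine(H, I.getOpcode());
            // Calling convention on (indirect) call sites.
            if (auto *CB = llvm::dyn_cast<llvm::CallBase>(&I))
              H = llvm::hash_combine(H, CB->getCallingConv());
            // Fast-math flags change which optimizations are legal.
            if (llvm::isa<llvm::FPMathOperator>(&I)) {
              llvm::FastMathFlags FMF = I.getFastMathFlags();
              H = llvm::hash_combine(H, FMF.isFast(), FMF.noNaNs(), FMF.noInfs(),
                                     FMF.noSignedZeros(), FMF.allowContract());
            }
            // GEP source element type (a real hash would encode the full type).
            if (auto *GEP = llvm::dyn_cast<llvm::GetElementPtrInst>(&I))
              H = llvm::hash_combine(H, GEP->getSourceElementType()->getTypeID());
          }
      }
      // Still missing: operands, phi incoming blocks, constant expressions,
      // blockaddress, attributes, metadata, global initializers, ...
      return H;
    }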
That's interesting to hear that so much of the IR is missing from the hash. I'm also surprised that it could provide much gain over hashing the preprocessed output - maybe my workflow is different from others', but typically a change to the preprocessed output implies a change to the IR (e.g., it's a functional change and not just a variable rename). Otherwise, why would I recompile it?
Parse caching does sound interesting. Also, a lot of what makes its way into the preprocessed output doesn't end up getting used (perhaps related to the 40-50% figure you gave). Lazy parsing could be helpful: just scan for structural characters to determine entity start/stop ranges and add the names to a set, then do the actual parsing lazily.
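Roughly what I have in mind, as a toy sketch. It ignores comments, strings, preprocessor lines, templates, lambdas and everything else that makes real C++ hard, which is of course where it stops being simple:

    #include <cctype>
    #include <cstddef>
    #include <string>
    #include <string_view>
    #include <unordered_map>

    // Byte range of one top-level entity in the (preprocessed) source.
    struct EntityRange { std::size_t begin, end; };

    // Index top-level entities by scanning only for structural characters.
    // The real parser runs later, and only for entities that are looked up.
    std::unordered_map<std::string, EntityRange>
    indexTopLevelEntities(std::string_view src) {
        std::unordered_map<std::string, EntityRange> index;
        std::size_t entityBegin = 0;
        int braces = 0, parens = 0;
        std::string name;  // last identifier seen outside any braces/parens

        for (std::size_t i = 0; i < src.size(); ++i) {
            char c = src[i];
            if (braces == 0 && parens == 0 &&
                (std::isalpha((unsigned char)c) || c == '_')) {
                std::size_t j = i + 1;
                while (j < src.size() &&
                       (std::isalnum((unsigned char)src[j]) || src[j] == '_'))
                    ++j;
                name.assign(src.substr(i, j - i));
                i = j - 1;
            } else if (c == '(') {
                ++parens;
            } else if (c == ')') {
                --parens;
            } else if (c == '{') {
                ++braces;
            } else if (c == '}' && --braces == 0) {
                // End of a braced entity (function body, class, namespace, ...).
                if (!name.empty())
                    index[name] = {entityBegin, i + 1};
                entityBegin = i + 1;
                name.clear();
            } else if (c == ';' && braces == 0) {
                // End of a non-braced entity (declaration, global variable, ...).
                if (!name.empty())
                    index[name] = {entityBegin, i + 1};
                entityBegin = i + 1;
                name.clear();
            }
        }
        return index;
    }

On first lookup you'd hand the recorded byte range to the real parser; the painful part is everything this scan ignores (macros that change structure, braces inside string literals, template angle brackets, ...).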
For C++, this could happen more often, e.g. when changing the implementation of an inline function or of a non-instantiated template in a header, where that code is not used by the compilation unit.
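A concrete instance of that (file names are just for illustration):

    // header.h -- included by many translation units
    inline int helper(int x) {
        return x + 1;               // editing this body changes the preprocessed
    }                               // output of every includer, used or not

    template <typename T>
    T twice(T v) { return v + v; }  // same for a template nobody instantiates

    // user.cpp -- never calls helper() and never instantiates twice<T>
    #include "header.h"
    int answer() { return 42; }

    // Editing header.h invalidates a hash of user.cpp's preprocessed output,
    // but the IR emitted for user.cpp is unchanged (unused inline functions and
    // uninstantiated templates produce no code), so an IR-level cache could
    // still hit where a preprocessor-level cache misses.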
[1] https://github.com/yrnkrn/zapcc
[2] https://lists.llvm.org/pipermail/cfe-dev/2015-May/043155.htm...
johnisgood•21h ago
The documentation may come in handy:
1. https://ccache.dev/manual/4.11.3.html#_how_ccache_works
2. https://ccache.dev/manual/4.11.3.html#_cache_statistics
and so forth.
[1] https://ccache.dev (ccache - a fast C/C++ compiler cache)
aengelke•20h ago
(source: https://martinfowler.com/bliki/TwoHardThings.html)
ACCount37•19h ago