at some point surely some dynamic linking is warranted
Why are debug symbols so big? For C++, they’ll include detailed type information for every instantiation of every type everywhere in your program, including the types of every field (recursively), method signatures, and so on, along with the types and locations of local variables in every method (updated on every spill and move), line number data, and more, for every specialization of every function. This produces a lot of data even for “moderate”-sized projects.
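You can see this directly with a section-level size breakdown; bloaty is one tool that does it (the binary name and the breakdown below are made up, just to illustrate the shape):

    bloaty -d sections ./myserver
    #   .debug_info   the per-instantiation type/variable/location data described above
    #   .debug_str    mangled names and other strings
    #   .debug_line   line number tables
    #   .text         the code that actually runs - often a small fraction of the file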
Worse: for C++, you don’t win much through dynamic linking because dynamically linking C++ libraries sucks so hard. Templates defined in header files can’t easily be put in shared libraries; ABI variations mean that dynamic libraries generally have to be updated in sync; and duplication across modules is bound to happen (thanks to inlined functions and templates). A single “stuck” or outdated .so might completely break a deployment too, which is a much worse situation than deploying a single binary (either you get a new version or an old one, not a broken service).
Isn't the simple solution to use detached debug files?
I think Windows and Linux both support them. That's how Android and iOS phones get useful crash reports out of small binaries: they just upload the stack trace, and a service like Sentry translates it back into source line numbers. (It's easy to do manually too.)
I'm surprised the author didn't mention it first. A 25 GB exe might be 1 GB of code and 24 GB of debug crud.
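On the GNU toolchain the split is basically three commands (sketch; the binary name is made up):

    objcopy --only-keep-debug myserver myserver.debug    # full debug info goes into a side file
    strip --strip-debug --strip-unneeded myserver        # ship this much smaller binary
    objcopy --add-gnu-debuglink=myserver.debug myserver  # breadcrumb so gdb can find the symbols
    # later, symbolise a crash address offline against the side file:
    addr2line -e myserver.debug -f -C 0x4212ab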
Detached debug files have been the default (only?) option in MS's compiler since at least the 90s.
I'm not sure at what point it became hip to do that around Linux.
[1] "debhelper: support for split debugging symbols"
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=215670
[2] https://salsa.debian.org/debian/debhelper/-/commit/79411de84...
It should be. But the tooling for this kind of thing (anything to do with executable formats including debug info and also things like linking and cross-compilation) is generally pretty bad.
This also requires careful tracking of prod builds and their symbol files... A kind of symbol db.
Of course, separate debug files make no difference at runtime, since only the LOAD segments get loaded (by either the kernel or the dynamic loader, depending). The size of a binary on disk has little to do with its size in memory.
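readelf makes the point visible (binary name made up):

    readelf -lW ./myserver
    # only the PT_LOAD segments get mapped; in the "Section to Segment mapping" at the
    # bottom of the output, the .debug_* sections don't appear under any segment at all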
I don't think that's the case on Linux: when using -gsplit-dwarf, the debug info is put in separate files at the object-file level; they are never linked into binaries.
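Roughly (GCC or Clang):

    gcc -g -gsplit-dwarf -c foo.c   # the bulk of the debug info lands in foo.dwo, not foo.o
    gcc foo.o -o app                # the linked binary only carries small skeleton debug sections
    dwp -e app                      # optionally pack the referenced .dwo files into app.dwp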
I am very sympathetic to wanting nice static binaries that can be shipped around as a single artifact[0], but... surely at some point we have to ask if it's worth it? If nothing else, that feels like a little bit of a code smell; surely if your actual executable code doesn't even fit in 2GB it's time to ask if that's really one binary's worth of code or if you're actually staring at like... a dozen applications that deserve to be separate? Or get over it the other way and accept that sometimes the single artifact you ship is a tarball / OCI image / EROFS image for systemd[1] to mount+run / self-extracting archive[2] / ...
[0] Seriously, one of my background projects right now is trying to figure out if it's really that hard to make fat ELF binaries.
[1] https://systemd.io/PORTABLE_SERVICES/
[2] https://justine.lol/ape.html > "PKZIP Executables Make Pretty Good Containers"
The answer to an ever-increasing size of binaries was always "let's make the infrastructure scale up!" instead of "let's... not do this crazy thing maybe?". By the time I left, there were some new initiatives towards the latter and the feeling that "maybe we should have put limits much earlier" but retrofitting limits into the existing bloat was going to be exceedingly difficult.
- google-wide profiling: the core C++ team can collect data on how much of fleet CPU % is spent in absl::flat_hash_map re-bucketing (you can find papers on this publicly)
- crashdump telemetry
- dapper stack trace -> codesearch
Borg literally had to pin the bash version because letting the bash version float caused bugs. I can't imagine how much harder debugging L7 proxy issues would be if I had to follow a .so rabbit hole.
I can believe shrinking binary size would solve a lot of problems, and I can imagine ways to solve the .so versioning problem, but for every problem you mention I can name multiple other probable causes (eg was startup time really execvp time, or was it networked deps like FFs).
From my https://maskray.me/blog/2023-05-14-relocation-overflow-and-c... (linked by the author - thanks! Though I maintain lld/ELF rather than "wrote" it; it's the engineering work of many folks), quoting the relevant paragraphs below:
## Static linking
In this section, we will deviate slightly from the main topic to discuss static linking. By including all dependencies within the executable itself, it can run without relying on external shared objects. This eliminates the potential risks associated with updating dependencies separately.
Certain users prefer static linking or mostly static linking for the sake of deployment convenience and performance aspects:
* Link-time optimization is more effective when all dependencies are known. Providing shared object information during executable optimization is possible, but it may not be a worthwhile engineering effort.
* Profiling techniques are more efficient when dealing with a single executable.
* The traditional ELF dynamic linking approach incurs overhead to support [symbol interposition](https://maskray.me/blog/2021-05-16-elf-interposition-and-bsy...).
* Dynamic linking involves PLT and GOT, which can introduce additional overhead. Static linking eliminates the overhead.
* Loading libraries in the dynamic loader has a time complexity `O(|libs|^2*|libname|)`. The existing implementations are designed to handle tens of shared objects, rather than a thousand or more.
Furthermore, the current lack of techniques to partition an executable into a few larger shared objects, as opposed to numerous smaller shared objects, exacerbates the overhead issue.
In scenarios where the distributed program contains a significant amount of code (related: software bloat), employing full or mostly static linking can result in very large executable files. Consequently, certain relocations may be close to the distance limit, and even a minor disruption (e.g. add a function or introduce a dependency) can trigger relocation overflow linker errors.
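If you do stay dynamic, part of the interposition/PLT overhead above can be opted out of per library (a sketch, not a prescription from the post):

    g++ -O2 -fPIC -fno-semantic-interposition -fno-plt -c hot.cc
    g++ -shared hot.o -Wl,-Bsymbolic-functions -o libhot.so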
Those who do not understand dynamic linking are doomed to reinvent it.
See for more: https://google.github.io/building-secure-and-reliable-system...
It depends.
If it is a vulnerability stemming from libc, then every single binary has to be re-linked and redeployed, which can lead to a situation where something has been accidentally left out due to an unaccounted-for artefact.
One solution could be bundling the binary, or multiple related binaries, with the operating system image, but that would incur a multidimensional overhead that would be unacceptable for most people, and then we would be talking about «an application binary statically linked into the operating system», so to speak.
The whole point of Binary Provenance is that there are no unaccounted-for artifacts: Every build should produce binary provenance describing exactly how a given binary artifact was built: the inputs, the transformation, and the entity that performed the build. So, to use your example, you'll always know which artefacts were linked against that bad version of libc.
See https://google.github.io/building-secure-and-reliable-system...
However,
> […] which artefacts were linked against that bad version of libc.
There is one libc for the entire system (a physical server, a virtual one, etc.), including the application(s) that have been deployed into an operating environment.
In the case of the entire operating environment (the OS + applications) being statically linked against a libc, the entire operating environment has to be re-linked and redeployed as a single concerted effort.
In dynamically linked operating environments, only the libc needs to be updated.
The former is a substantially more laborious and inherently more risky effort unless the organisation has achieved a sufficiently large scale where such deployment artefacts are fully disposable and the deployment process is fully automated. Not many organisations practically operate at that level of maturity and scale, with FAANG or similar scale being a notable exception. It is often cited as an aspiration, yet the road to that level of maturity is windy and fraught with real-life shortcuts that result in the binary provenance being ignored or rendered irrelevant. The expected aftermath is, of course, a security incident.
I claimed that Binary Provenance was important to organizations such as Google where it is important to know exactly what has gone into the artefacts that have been deployed into production. You then replied "it depends" but, when pressed, defended your claim by saying, in effect, that binary provenance doesn't work in organizations with immature engineering practices that don't actually follow the practice of enforcing Binary Provenance.
But I feel like we already knew that practices don't work unless organizations actually follow them.
So what was your point?
Static linking actually destroys the boundaries that a provenance consumer would normally want: global code optimisation, (sometimes heavy) inlining, LTO, dead code elimination and the like erase the dependency identities, making them irrecoverable in a trustworthy way from the binary. It is harder to reason about and audit a single opaque blob than a set of separately versioned shared libraries.
Static linking, however, is very good at avoiding «shared/dynamic library dependency hell» which is a reliability and operability win. From a binary provenance standpoint, it is largely orthogonal.
Static linking can improve one narrow provenance-adjacent property: fewer moving parts at deploy and run time.
The «it depends» part of the comment concerned the FAANG-scale level of infrastructure and operational maturity where the organisation can reliably enforce hermetic builds and dependency pinning across teams, produce and retain attestations and SBOMs bound to release artefacts, rebuild the world quickly on demand and roll out safely with strong observability and rollback. Many organisations choose dynamic linking plus image sealing because it gives them similar provenance and incident response properties with less rebuild pressure at a substantially smaller cost.
So static linking mainly changes operational risk and deployment ergonomics, not evidentiary quality about where the code came from and how it was produced, whereas dynamic linking, on the other hand, may yield better provenance properties when the shared libraries themselves have strong identity and distribution provenance.
NB Please do note that the diatribe is not directed at you in any way; it is an off-hand remark and a reference to people who ascribe purported benefits to static linking because «Google does it», without taking into account the overall context, maturity and scale of the operating environment Google et al. operate at.
Early on Google used dynamic libraries. But weird things happen at Google scale. For example Google has a dataset known, for fairly obvious reasons, as "the web". Basically any interesting computation with it takes years. Enough to be a multiple of the expected lifespan of a random computer. Therefore during that computation, you have to expect every random thing that tends to go wrong, to go wrong. Up to and including machines dying.
One of the weird things that becomes common at Google scale, are cosmic bit flips. With static binaries, you can figure out that something went wrong, kill the instance, launch a new one, and you're fine. That machine will later launch something else and also be fine.
But what happens if there was a cosmic bit flip in a dynamic library? Everything launched on that machine will be wrong. This has to get detected, then the processes killed and relaunched. Since this keeps happening, that machine is always there lightly loaded, ready for new stuff to launch. New stuff that... winds up broken for the same reason! Often the killed process will relaunch on the bad machine, failing again! This will continue until someone reboots the machine.
Static binaries are wasteful. But they aren't as problematic for the infrastructure as detecting and fixing this particular condition. And, according to SRE lore circa 2010, this was the actual reason for the switch to static binaries. And then they realized all sorts of other benefits. Like having a good upgrade path for what would normally be shared libraries.
Statically linking being necessary for scaling does not pass the smell test for me.
As I see it, the issue isn’t requiring static compiling to scale. It’s requiring it to make troubleshooting or measuring performance at scale easier. Not required, per se, but very helpful.
Google runs on a microservices architecture. It's done that since before that was cool. You have to do a lot to make a microservices architecture work. Google did not advertise a lot of that. Today we have things like Data Dog that give you some of the basics. But for a long time, people who left Google faced a world of pain because of how far behind the rest of the world was.
The biggest datasets that ChatGPT is aware of being processed in complex analytics jobs on Azure are roughly a thousand times smaller than an estimate of Google's regularly processed snapshot of the web. There is a reason why most of the fundamental advancements in how to parallelize data and computations - such as map-reduce and BigTable - all came from Google. Nobody else worked at their scale before they did. (Then Google published it, and people began to implement it. Then failed to understand what was operationally important to making it actually work at scale...)
So, despite how big it is, I don't think that Azure operates at Google scale.
For the record, back when I worked at Google, the public internet was only the third largest network that I knew of. Larger still was the network that Google uses for internal API calls. (Do you have any idea how many API calls it takes to serve a Google search page?) And larger still was the network that kept data synchronized between data centers. (So, for example, you don't lose your mail if a data center goes down.)
I think there were more basic reasons we didn't ship shared libraries to production.
1. They wouldn't have been "shared", because every program was built from its own snapshot of the monorepo, and would naturally have slightly different library versions. Nobody worried about ABI compatibility when evolving C++ interfaces, so (in general) it wasn't possible to reuse a .so built at another time. Thus, it wouldn't actually save any disk space or memory to use dynamic linking.
2. When I arrived in 2005, the build system was embedding absolute paths to shared libraries into the final executable. So it wasn't possible to take a dynamically linked program, copy it to a different machine, and execute it there, unless you used a chroot or container. (And at that time we didn't even use mount namespaces on prod machines.) This was one of the things we had to fix to make it possible to run tests on Forge.
3. We did use shared libraries for tests, and this revealed that ld.so's algorithm for symbol resolution was quadratic in the number of shared objects. Andrew Chatham fixed some of this (https://sourceware.org/legacy-ml/libc-alpha/2006-01/msg00018...), and I got the rest of it eventually; but there was a time before GRTE, when we didn't have a straightforward way to patch the glibc in prod.
That said, I did hear a similar story from an SRE about fear of bitflips being the reason they wouldn't put the gws command line into a flagfile. So I can imagine it being a rationale for not even trying to fix the above problems in order to enable dynamic linking.
> Since this keeps happening, that machine is always there lightly loaded, ready for new stuff to launch. New stuff that...wind up broken for the same reason!
I did see this failure mode occur for similar reasons, such as corruption of the symlinks in /lib. (google3 executables were typically not totally static, but still linked libc itself dynamically.) But it always seemed to me that we had way more problems attributable to kernel, firmware, and CPU bugs than to SEUs.
But here is a question. How much of SEUs not being a problem was because they genuinely weren't a problem, versus because there were solutions in place to mitigate the potential severity of that kind of problem? (The other problems that you name are harder to mitigate.)
It's pretty challenging for software to defend against SEUs corrupting memory, especially when retrofitting an existing design like Linux. While operating Forge, we saw plenty of machines miscompute stuff, and we definitely worried about garbage getting into our caches. But my recollection is that the main cause was individual bad CPUs. We would reuse files in tmpfs for days without reverifying their checksums, and while we considered adding a scrubber, we never saw evidence that it would have caught much.
Maybe the CPU failures were actually due to radiation damage, but as they tended to be fairly sticky, my guess is something more like electromigration.
When Forge has a problem, I might as well go on a nature hike.
You'd need multiple of those, because you have ECC. Not impossible, but getting all those dice rolled the same way requires even bigger scale than Google's.
But the compression ratio isn't magical (approx. 1:0.25, for both zlib and zstd in the examples given). You'd probably still want to set aside debuginfo in separate files.
With embedded firmware you only flash the .text (and data) image to the device, but you can still debug using the .elf file. In my case, if I get a bus fault I'll pull the offending address off the stack and use binutils and the .elf to show me who was naughty. I think if you have a crash dump you should be able to make sense of things as long as you keep the unstripped .elf file around.
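Concretely (paths and the fault address are made up):

    arm-none-eabi-objcopy -O binary firmware.elf firmware.bin   # the image that actually gets flashed
    # after a bus fault, map the PC/LR pulled off the stack back to a source line:
    arm-none-eabi-addr2line -e firmware.elf -f -C 0x0800c132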
I don’t think I’ve ever seen a 4gb binary yet. I have seen instances where a PDB file hit 4gb and that caused problems. Debug symbols getting that large is totally plausible. I’m ok with that at least.
This was a problem because code signing meant it needed to be completely replaced by updates.
Is this because they are embedding assets into the binary? I find it hard to believe anyone was carrying around enough code to fill 4GB in the PS3 era...
It varied between games; one of the Battlefield games (3 or Bad Company 2) was what I was thinking of. It generally improved with later releases.
The 4GB file size was significant, since it meant I couldn't run them from a backup on a FAT32 USB drive. There are workarounds for many games nowadays.
Systemd and portable?
but the article says 25+GiB including debug symbols, in a single binary?
also, I appreciate your enthusiasm in assuming that because some people do something in an organization, it is applied consistently everywhere. Hell, if it were Microsoft, other departments would try to shoot down the "debug tooling optimization" dept.
For example I can imagine desktop operating system antivirus/integrity checks having this effect.
Makes sense, but in the assembly output just after, there is not a single JMP instruction. Instead, CALL <immediate> is replaced with putting the address in a 64-bit register, then CALL <register>, which makes even more sense. But why mention the JMP thing then? Is it a mistake or am I missing something? (I know some calls are replaced by JMP, but that's done regardless of -mcmodel=large)
It's performing a call; ABIs define registers that are not preserved over calls, so writing the destination to one of those won't affect register pressure.
> The simplest solution however is to use -mcmodel=large which changes all the relative CALL instructions to absolute 64bit ones; kind of like a JMP.
(We still need to use CALL in order to push a return address.)
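Easy to see side by side (sketch; -fno-pie just to keep the output simple, and the exact register choice may vary):

    # demo.c:
    #   void callee(void);
    #   int caller(void) { callee(); return 1; }
    gcc -O2 -fno-pie -S -o small.s demo.c                 # small model: "call callee" (rel32, ±2 GiB reach)
    gcc -O2 -fno-pie -S -mcmodel=large -o large.s demo.c  # large model: roughly "movabs $callee, %rax" then "call *%rax"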
Regardless of whether you're FAANG or not, nothing you're running should require an executable with a 2 GB large .text section. If you're bumping into that limit, then your build process likely lacks dead code elimination in the linking step. You should be using LTO for release builds. Even the traditional solution (compile your object files with -ffunction-sections and link with --gc-sections) does a good job of culling dead code at function-level granularity.
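Both routes are cheap to try (sketch):

    # LTO: the linker sees the whole program and can drop unreferenced code
    g++ -O2 -flto -c app.cc lib.cc
    g++ -O2 -flto app.o lib.o -o app
    # traditional route: one section per function, then let the linker garbage-collect them
    g++ -O2 -ffunction-sections -fdata-sections -c app.cc lib.cc
    g++ app.o lib.o -Wl,--gc-sections -Wl,--print-gc-sections -o app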
https://research.google/pubs/thinlto-scalable-and-incrementa...
And other refs.
And yet...
But since each product in some different domains had to actively enable those optimizations for themselves, they were occasionally forgotten, and I found a few in the app I worked for (but not directly on).
Chromium is in the hundred and something MB range on mine last I looked. Might expand to more on install.
Move all the hot BBs near each other, right?
Facebook's solution: https://github.com/llvm/llvm-project/blob/main/bolt%2FREADME...
Google's:
https://lists.llvm.org/pipermail/llvm-dev/2019-September/135...
BOLT AFAIU is more about cache locality of putting hot code near each other and not really breaking the 2GiB barrier.
On x86-64 it would probably be easier to point the relative call to a synthesized trampoline that does a 64-bit one, but it seems nobody has bothered thus far. You have to admit that sounds pretty painful.
You can use thunks/trampolines. lld can make them for some architectures, presumably also for x86_64. Though I don't know why it didn't in your case.
But, like the large code model it can be expensive to add trampolines, both in icache performance and just execution if a trampoline is in a particularly hot path.
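For the record, a hand-written range-extension thunk on x86-64 is conceptually just this (sketch; as noted, lld doesn't synthesize these for x86-64 today):

    # the caller keeps its ordinary rel32 "call __thunk_far_func"; the thunk does the 64-bit hop
    __thunk_far_func:
        movabs $far_func, %r11   # r11 is call-clobbered scratch, so no extra register pressure
        jmp    *%r11             # tail-jump: the caller's return address is already on the stack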
This is what my next post will explore. I ran into some issues with the GOT that I'll have to explore solutions for.
I'm writing this for myself mostly. The whole idea for code models when you have thunks feels unnecessary.
Does the linker have access to the same hotness information that the compiler uses during PGO? Well -- presumably it could, even if it doesn't now. But it would be like a heuristic with a hotness threshold? Do linkers "do" heuristics?
I see this often even in communities of software engineers, where people who are unaware of certain limitations at scale will announce that the research is unnecessary.
(I wonder but have no particular insight into if LTO builds can do smarter things here -- most calls are local, but the handful of far calls can use the more expensive spelling.)
[0] https://research.google/pubs/ubiq-a-scalable-and-fault-toler...
… on the x86 ISA because it encodes the 32-bit jump/call offset directly in the opcode.
Whilst most RISC architectures do allow PC-relative branches, the offset is relatively small, as 32-bit opcodes do not have enough room to squeeze a large offset in.
«Long» jumps and calls are indirect branches / calls done via registers where the entirety of 64 bits is available (address alignment rules apply in RISC architectures). The target address has to be loaded / calculated beforehand, though. Available in RISC and x86 64-bit architectures.
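On AArch64, for instance, the two forms look roughly like this (sketch):

    bl   near_func                  // direct call: 26-bit offset × 4, about ±128 MiB of reach
    // long form: materialise the 64-bit address, then branch through a register
    adrp x16, far_func              // page of the target, ±4 GiB reach
    add  x16, x16, :lo12:far_func   // add the low 12 bits
    blr  x16                        // indirect call via x16 (the scratch register veneers use)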
Also, we, as an industry of software engineers, need to re-examine these hard defaults we thought would never be hit, such as the .text limits.
Anyway, very good read.
However, Google, Meta, and ByteDance have encountered the x86-64 relocation distance issue with their huge C++ server binaries. To my knowledge, industry users in other domains haven't run into this problem.
To address this, Google adopted the medium code model approximately two years ago for its sanitizer and PGO instrumentation builds. CUDA fat binaries also caused problems. I suggest that linker script `INSERT BEFORE/AFTER` for orphan sections (https://reviews.llvm.org/D74375) served as a key mitigation.
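For the CUDA case the supplementary script is tiny; something like this (a sketch of the idea - park the huge orphan section away from the code that reaches data through 32-bit relocations):

    /* fatbin.lds */
    SECTIONS {
      .nv_fatbin : { *(.nv_fatbin) }
    } INSERT AFTER .bss;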
I hope that a range extension thunk ABI, similar to AArch64/Power, is defined for the x86-64 psABI. It is better than the current long branch pessimization we have with -mcmodel=large.
---
It seems that nobody has run into this .eh_frame_hdr implementation limitation yet:
* `.eh_frame_hdr -> .text`: GNU ld and ld.lld only support 32-bit offsets (`table_enc = DW_EH_PE_datarel | DW_EH_PE_sdata4;`) as of Dec 2025.