Performance Hints

143•danlark1•1mo ago

Comments

xnx•1mo ago

This formatting is more intuitive to me.

  L1 cache reference                   2,000,000,000 ops/sec
  L2 cache reference                   333,333,333 ops/sec
  Branch mispredict                    200,000,000 ops/sec
  Mutex lock/unlock (uncontended)      66,666,667 ops/sec
  Main memory reference                20,000,000 ops/sec
  Compress 1K bytes with Snappy        1,000,000 ops/sec
  Read 4KB from SSD                    50,000 ops/sec
  Round trip within same datacenter    20,000 ops/sec
  Read 1MB sequentially from memory    15,625 ops/sec
  Read 1MB over 100 Gbps network       10,000 ops/sec
  Read 1MB from SSD                    1,000 ops/sec
  Disk seek                            200 ops/sec
  Read 1MB sequentially from disk      100 ops/sec
  Send packet CA->Netherlands->CA      7 ops/sec

barfoure•1mo ago

The reason why that formatting is not used is because it’s not useful nor true. The table in the article is far more relevant to the person optimizing things. How many of those I can hypothetically execute per second is a data point for the marketing team. Everyone else is beholden to real world data sets and data reads and fetches that are widely distributed in terms of timing.

twotwotwo•1mo ago

Your version only describes what happens if you do the operations serially, though. For example, a consumer SSD can do a million (or more) operations in a second not 50K, and you can send a lot more than 7 total packets between CA and the Netherlands in a second, but to do either of those you need to take advantage of parallelism.

If the reciprocal numbers are more intuitive for you you can still say an L1 cache reference takes 1/2,000,000,000 sec. It's "ops/sec" that makes it look like it's a throughput.

An interesting thing about the latency numbers is they mostly don't vary with scale, whereas something like the total throughput with your SSD or the Internet depends on the size of your storage or network setups, respectively. And aggregate CPU throughput varies with core count, for example.

I do think it's still interesting to think about throughputs (and other things like capacities) of a "reference deployment": that can affect architectural things like "can I do this in RAM?", "can I do this on one box?", "what optimizations do I need to fix potential bottlenecks in XYZ?", "is resource X or Y scarcer?" and so on. That was kind of done in "The Datacenter as a Computer" (https://pages.cs.wisc.edu/~shivaram/cs744-readings/dc-comput... and https://books.google.com/books?id=Td51DwAAQBAJ&pg=PA72#v=one... ) with a machine, rack, and cluster as the units. That diagram is about the storage hierarchy and doesn't mention compute, and a lot has improved since 2018, but an expanded table like that is still seems like an interesting tool for engineering a system.

chrisweekly•1mo ago

Great comment. I like your phrasing "capacities of a reference deployment", this is what I tend to refer to as the performance ceiling. In practical terms, if you're doing synthetic performance measurements in the lab, it's a good idea to try to recreate optimal field conditions so your benchmarks have a proper frame of reference.

zahlman•1mo ago

> For example, a consumer SSD can do a million (or more) operations in a second not 50K

The "Read 1MB from SSD" entry translates into a higher throughput (still not as high as you imply, but "SSD" is also a broad category ranging from SATA-connected devices though I think five generations of NVMe now); I assume the "Read 4KB" timing really describes a single, isolated page read which would be rather difficult to parallelize.

jcelerier•1mo ago

> Your version only describes what happens if you do the operations serially, though

and that's what the intuition should be based on, because serial problems sadly exist and a lot of common wisdom suddenly disappear when confronted to one.

jesse__•1mo ago

I prefer a different encoding: cycles/op

Both ops/sec and sec/op vary on clock rate, and clock rate varies across machines, and along the execution time of your program.

AFAIK, Cycles (a la _rdtsc) is as close as you can get to a stable performance measurement for an operation. You can compare it on chips with different clock rates and architectures, and derive meaningful insight. The same cannot be said for op/sec or sec/op.

cogman10•1mo ago

Unfortunately, what you'll find if you dig into this is that cycles/op isn't as meaningful as you might imagine.

Most modern CPUs are out of order executors. That means that while a floating point operation might take 4 cycles to complete, if you put a bunch of other instructions around it like adds, divides, and multiplies, those will all finish at roughly the same time.

That makes it somewhat hard to reason about exactly how long any given set of operations will be. A FloatMul could take 4 cycles on it's own, and if you have

    FloatMul
    ADD
    MUL
    DIV

That can also take 4 cycles to finish. It's simply not as simple as saying "Let's add up the cycles for these 4 ops to get the total cycle count".

Realistically, what you'll actually be waiting on is cache and main memory. This fact is so reliable that it underpins SMT. It's why most modern CPUs will do that in some form.

yvdriess•1mo ago

Your critique applies to measuring one or a handful of instructions. In practice you count the number of cycles over million or billion instructions. CPI is very meaningful and it is the main throughput performance metric for CPU core architects.

jesse__•1mo ago

I agree that what you're saying is true, but in the context of my comment, I stand by the statement that cycles/op is still a more meaningful measurement of performance than seconds.

---

Counter-nitpick .. your statement of "if you put a bunch of other instructions around it" assumes there are no data dependencies between instructions.

In the example you gave:

    FloatMul
    ADD
    MUL
    DIV

Sure .. if all of those are operating on independent data sources, they could conceivably retire on the same cycle, but in the context of the article (approximating the performance profile of a series of operations) we're assuming they have data dependencies on one another, and are going to be executed serially.

VorpalWay•1mo ago

Your suggestion confuses latency and throughput. So it isn't correct.

For example, a modern CPU will be able to execute other instructions while waiting for a cache miss, and will also be able to have multiple cache loads in flight at once (especially for caches shared between cores).

Main memory is asynchronous too, so multiple loads might be in flight, per memory channel. Same goes for all the other layers here (multiple SSD transactions in flight at once, multiple network requests, etc)

Approximately everything in modern computers is async at the hardware level, often with multiple units handling the execution of the "thing". All the way from the network and SSD to the ALUs (arithmetic logic unit) in the CPU.

Modern CPUs are pipelined (and have been since the mid to late 90s), so they will be executing one instruction, decoding the next instruction and retiring (writing out the result of) the previous instruction all at once. But real pipelines have way more than the 3 basic stages I just listed. And they can reorder, do things in parallel, etc.

SeanSullivan86•1mo ago

I'm aware of this to an extent. Do you know of any list of what degree of parallelization to expect out of various components? I know this whole napkin-math thing is mostly futile and the answer should mostly be "go test it", but just curious.

I was interviewing recently and was asked about implementing a web crawler and then were discussing bottlenecks (network fetching the pages, writing the content to disk, CPU usage for stuff like parsing the responses) and parallelism, and I wanted to just say "well, i'd test it to figure out what I was bottlenecked on and then iterate on my solution".

moab•1mo ago

Napkin math is how you avoid spending several weeks of your life going down ultimately futile rabbit holes. Yes, it's approximations, often very coarse ones, but done right they do work.

Your question about what degree of parallelization is unfortunately too vague to really answer. SSDs offer some internal parallelism. Need more parallelism / IOPS? You can stick a lot more SSDs on your machine. Need many machines worth of SSDs? Disaggregate them, but now you need to think about your network bandwidth, NICs, cross-machine latency, and fault-tolerance.

The best engineers I've seen are usually excellent at napkin math.

CalChris•1mo ago

I’ve seen this list many many times and I’m always surprised it doesn’t include registers.

yvdriess•1mo ago

CalChris•1mo ago

H+P says register allocation is one of the most important—if not the most important—optimizations.

yvdriess•1mo ago

In cpu uarch design, sure, but that's outside the context of the discussion. There's nothing you can do to that C++ library you are optimizing that will impact performance due to register allocation/renaming.

barchar•1mo ago

This is not always true. Compilers are quite good at register allocation but sometimes they get it wrong and sometimes you can make small changes to code that improve register allocation and thus performance.

Usually the problem is an unfortunately placed spill, so the operation is actually l1d$ traffic, but still.

zahlman•1mo ago

> l1d$

I don't know how to interpret this.

andrewf•1mo ago

Level 1 data cache

barfoure•1mo ago

Some of this can be reduced to a trivial form, which is to say practiced in reality on a reasonable scale, by getting your hands on a microcontroller. Not RTOS or Linux or any of that, but just a microcontroller without an OS, and learning it and learning its internal fetching architecture and getting comfortable with timings, and seeing how the latency numbers go up when you introduce external memory such as SD Cards and the like. Knowing to read the assembly printout and see how the instruction cycles add up in the pipeline is also good, because at least you know what is happening. It will then make it much easier to apply the same careful mentality to this which is ultimately what this whole optimization game is about - optimizing where time is spent with what data. Otherwise, someone telling you so-and-so takes nanoseconds or microseconds will be alien to you because you wouldn’t normally be exposed to an environment where you regularly count in clock cycles. So consider this a learning opportunity.

simonask•1mo ago

Just be careful not to blindly apply the same techniques to a mobile or desktop class CPU or above.

A lot of code can be pessimized by golfing instruction counts, hurting instruction-level parallelism and microcode optimizations by introducing false data dependencies.

Compilers outperform humans here almost all the time.

barfoure•1mo ago

It is not about outperforming the compiler - it’s about being comfortable with measuring where your clock cycles are spent, and for that you first need to be comfortable with clock cycle scale of timing. You’re not expected to rewrite the program in assembly. But you should have a general idea given an instruction what its execution entails, and where the data is actually coming from. A read from different busses means different timings.

Compilers make mistakes too and they can output very erroneous code. But that’s a different topic.

jesse__•1mo ago

Excellent corrective summary.

"Compilers can do all these great transformations, but they can also be incredibly dumb"

-Mike Acton, CPPCON 2014

jesse__•1mo ago

> Compilers outperform humans here almost all the time.

I'm going to be annoying and nerd-snipe you here. It's, generally, really easy to beat the compiler.

https://scallywag.software/vim/blog/simd-perlin-noise-i

dardeaup•1mo ago

"A lot of code can be pessimized by golfing instruction counts"

Can you explain what this phrase means?

simonask•1mo ago

An old approach to micro-optimization is to look at the generated assembly, and trying to achieve the same thing with fewer instructions. However, modern CPUs are able to execute multiple instructions in parallel (out-of-order execution), and this mechanism relies on detecting data dependencies between instructions.

It means that the shorter sequence of instructions is not necessarily faster, and can in fact make the CPU stall unnecessarily.

The fastest sequence of instructions is the one that makes the best use of the CPU’s resources.

sevensor•1mo ago

I’ve done this: I had a hot loop and I discovered that I could reduce instruction counts by adding a branch inside the loop. Definitely slower, which I expected, but it’s worth measuring.

Pannoniae•1mo ago

Compilers massively outperform humans if the human has to write the entire program in assembly. Even if a human could write a sizable program in assembly, it would be subpar compared to what a compiler would write. This is true.

However, that doesn't mean that looking at the generated asm / even writing some is useless! Just because you can't globally outperform the compiler, doesn't mean you can't do it locally! If you know where the bottleneck is, and make those few functions great, that's a force multiplier for you and your program.

simonask•1mo ago

It’s absolutely not useless, I do it often as a way to diagnose various kinds of problems. But it’s extremely rare that a handwritten version actually performs better.

jesse__•1mo ago

yo, completely off topic, but do you work on a voxel game/engine?

Pannoniae•1mo ago

yes and you already know me lol, we have been chatting on discord :P

jesse__•1mo ago

Wonderful article. I wish more people had this pragmatic approach when thinking about performance

menaerus•1mo ago

I actually wish the audience to take the opposite, or perhaps a more balanced view. Being pragmatic is like taking an extreme view and as much as this article is a great resource, and contains some legit advice otherwise difficult to find elsewhere in such a concise form, folks need to be aware that this advice is what Google found for their unfathomable scale codebase to gain some real world benefits.

The things this article is describing are more nuanced than just "think about the performance sooner than latter". I say this as someone who does these kind of optimizations for a living and all too often I see teams wasting time trying to micro-optimize codepaths which by the end of the day do not provide any real demonstrable value.

And this is a real trap you can get into really easily if you read this article as a general wisdom, which is not.

squirrellous•1mo ago

I wish Google would open source their gtl library. Similar utilities exist elsewhere but not in the same consistent quality and well-integrated package.

I particularly like the “what to do for flat profiles” ad “protobuf tips” sections. Similar advice distilled to this level is difficult to find elsewhere.

canyp•1mo ago

Interesting that the blog only runs until 2023. Have they been absorbed by AI, Rust, or both?

I think I'd rather be eaten by a giant crustacean than work on AI.

svat•1mo ago

The HN title here is currently “Performance Hints (2023)”, but this was only published externally recently (2025). (See e.g. https://x.com/JeffDean/status/2002089534188892256 announcing it.) And of course 2023 is when the document was first created, but much of the content is more recent than that. So IMO it's a bit misleading to put "(2023)" in the title.

danlark1•1mo ago

Surprisingly I didn't put 2023, it was merged with another submission possibly with the help of mods

dredmorbius•1mo ago

FYI: It's possible for that to be edited by others.

zahlman•1mo ago

If the numbers come from analyzing performance in 2023, that seems more important than the external publication time.

svat•1mo ago

The page is about tips for writing fast code. Much of it applied 20 years ago, and will apply 20 years from now. If by "the numbers" you mean specifically just the table ("rough costs for some basic low-level operations") in the "Estimation" section (which accounts for less than 0.5% of the words on the page), then that table was initially created in 2007, and is up-to-date as of 2025. Other numbers on the page are given with their dates, like 2001 and so on. So 2023 does not seem relevant in any way.

dang•1mo ago

(Ok, we've belatedly taken 2023 out of the title now)

justicehunter•1mo ago

Really helps to have all this good info in one page. I often find myself focusing on few aspects here while ignoring the rest. Definitely saved to remind myself that there is a lot more to performance that the few tricks I know.

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

The Waymo World Model

How we made geo joins 400× faster with H3 indexes

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Show HN: I spent 4 years building a UI design tool with only the features I use

Dark Alley Mathematics

Microsoft open-sources LiteBox, a security-focused library OS

Show HN: If you lose your memory, how to regain access to your computer?

Sheldon Brown's Bicycle Technical Info

Hackers (1995) Animated Experience

Unseen Footage of Atari Battlezone Arcade Cabinet Production

An Update on Heroku

PC Floppy Copy Protection: Vault Prolok

Delimited Continuations vs. Lwt for Threads

Show HN: ARM64 Android Dev Kit

Why I Joined OpenAI

How to effectively write quality code with AI

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

Introducing the Developer Knowledge API and MCP Server

Female Asian Elephant Calf Born at the Smithsonian National Zoo

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

Learning from context is harder than we thought

Understanding Neural Network, Visually

I now assume that all ads on Apple news are scams

FORTH? Really!?

Evaluating and mitigating the growing risk of LLM-discovered 0-days

WebView performance significantly slower than PWA

I'm going to cure my girlfriend's brain tumor

Show HN: Smooth CLI – Token-efficient browser for AI agents

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

The Waymo World Model

How we made geo joins 400× faster with H3 indexes

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Show HN: I spent 4 years building a UI design tool with only the features I use

Dark Alley Mathematics

Microsoft open-sources LiteBox, a security-focused library OS

Show HN: If you lose your memory, how to regain access to your computer?

Sheldon Brown's Bicycle Technical Info

Hackers (1995) Animated Experience

Unseen Footage of Atari Battlezone Arcade Cabinet Production

An Update on Heroku

PC Floppy Copy Protection: Vault Prolok

Delimited Continuations vs. Lwt for Threads

Show HN: ARM64 Android Dev Kit

Why I Joined OpenAI

How to effectively write quality code with AI

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

Introducing the Developer Knowledge API and MCP Server

Female Asian Elephant Calf Born at the Smithsonian National Zoo

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

Learning from context is harder than we thought

Understanding Neural Network, Visually

I now assume that all ads on Apple news are scams

FORTH? Really!?

Evaluating and mitigating the growing risk of LLM-discovered 0-days

WebView performance significantly slower than PWA

I'm going to cure my girlfriend's brain tumor

Show HN: Smooth CLI – Token-efficient browser for AI agents

Performance Hints

Comments