If your CPU runs at 1000 MHz, that's 10^9 cycles per second. On that CPU the right-hand side of the picture corresponds to 1 ms: you can do 1 million register-register operations in 1 ms, or 1 billion in 1 s.
Computers are fast.
This improvement is sufficient to tip the balance toward favoring division in some algorithms where historically programmers went out of their way to avoid it.
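For context, the classic way programmers went out of their way to avoid division by a constant is strength reduction: multiply by a precomputed "magic" reciprocal and shift, which compilers still emit for constant divisors. A minimal C sketch (the constant 0xAAAAAAAB is the standard magic number for unsigned 32-bit division by 3):

```c
#include <stdint.h>

/* Strength-reduced division: x / 3 computed without a divide.
   0xAAAAAAAB = ceil(2^33 / 3), so (x * 0xAAAAAAAB) >> 33 equals
   x / 3 exactly for every 32-bit unsigned x. */
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABULL) >> 33);
}
```

When hardware dividers were dozens of cycles slower than a multiply, this trade was almost always a win; on recent cores the gap has narrowed enough that a plain divide can be the better choice.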
net01•5h ago
https://uops.info/table.html
supports most modern and old architectures
bee_rider•5h ago
Although, reasoning about performance can be hard anyway.
bee_rider•3h ago
I have a sneaking suspicion that this table is satisfying for our brains as a vaguely technical and interesting thing, but I’m not sure how useful it really is. In general the compiler will be really creative in reordering instructions, and the CPU will also be creative about which ones it runs parallel (since it is good at discovering instruction level parallelism). So, I wonder if the level of study necessary to use this information also requires the level of data that is available in the detailed table.
I haven't done much caring about individual instructions; it seems very hard. FWIW I have had some success reducing the number of trips to memory and making sure the dependencies are obvious to the computer, so I'm not totally naive… but I think that caring about instruction timing is mostly for the real hardcore optimization badasses.
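One concrete instance of "making the dependencies obvious": a reduction written as one long chain is bound by add latency, while splitting it into independent accumulators lets the out-of-order core keep several adds in flight. A hedged C sketch (function names are my own, not from the thread):

```c
#include <stddef.h>

/* One long dependency chain: each add must wait for the previous
   result, so throughput is limited by the add's latency. */
double sum_chained(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the core can overlap the adds, often
   running several times faster on the same hardware. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note the two versions can differ in the last bits for floating point, since the unrolled one reassociates the sum; that's why compilers won't do this transformation for you without something like `-ffast-math`.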
owlbite•2h ago
For the uninitiated, most high-performance CPUs of recent years:
- Are massively out-of-order: any operation whose inputs are all ready will run in the next available execution slot of the right type.
- Have multiple functional units. A recent Apple CPU can and will run 5+ integer ops, 3+ loads/stores, and 3+ floating-point ops per cycle if it can feed them all, and it may well do zero-cost register renames on the fly for "free".
- Have pipelined functional units: you can throw one op into the front end of the pipe each cycle, but the result may not be available for consumption until 3-20 cycles later (latency depends on the type of the op and whether it can bypass into the next op executed).
- Speculate on branch outcomes, and when they guess wrong they have to flush the pipeline and redo the right path.
- Are subject to assorted hazards, so the timing you measure in one situation may be better or worse in another.
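To make the branch-speculation point concrete, here's the standard branchy-vs-branchless comparison in C (names are my own). On random input the branchy version mispredicts roughly half the time and pays a pipeline flush each miss; the branchless version turns the condition into a mask, so there's nothing to mispredict:

```c
#include <stdint.h>
#include <stddef.h>

/* Branchy: the predictor must guess `v[i] >= 128` every iteration.
   Fast on sorted data, slow on random data. */
int64_t sum_branchy(const uint8_t *v, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= 128)
            s += v[i];
    return s;
}

/* Branchless: the comparison produces 0 or 1, negated into a mask
   of all-zeros or all-ones. Timing is the same for any input order. */
int64_t sum_branchless(const uint8_t *v, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(v[i] >= 128); /* 0 or ~0 */
        s += v[i] & mask;
    }
    return s;
}
```

Which one wins depends on exactly the hazards described above, which is why measuring on your real data beats reasoning from instruction tables alone.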