This is a significant problem on AMD; Intel and Apple seem to be better.
When did this change? In my testing years ago (while I was writing Rosetta 2, so Ice Lake-era Intel), Intel only allowed a load to forward from a single store, and did not allow partial forwarding (i.e. a load fed partly from a pending store and partly from cache) without a huge penalty, whereas AMD at least allowed partial forwarding (or had a considerably lower penalty than Intel).
I haven't tested Zen 4 or 5, but I haven't heard anything that indicates they should be a lot better.
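For anyone who wants to measure this on newer cores, here is a minimal sketch in C of the three access patterns being compared. The names (`slot_t`, `load_after_single_store`, and so on) are hypothetical, and the sketch assumes that `volatile` accesses are enough to make the compiler emit the narrow stores and the wide load as written. It only sets up the patterns; it does not time them and makes no claim about what any given CPU will do.

```c
#include <stdint.h>

/* Hypothetical 8-byte slot that can be written as a whole or in halves.
 * Reading a different union member than the one last written is
 * permitted type punning in C. */
typedef union {
    uint64_t whole;
    uint32_t half[2];
} slot_t;

/* Pattern A: one 8-byte store, then an 8-byte load of the same bytes.
 * The load can be satisfied entirely by a single pending store. */
uint64_t load_after_single_store(volatile slot_t *slot, uint64_t value) {
    slot->whole = value;
    return slot->whole;
}

/* Pattern B: two 4-byte stores, then one 8-byte load spanning both.
 * The load would need data from two separate pending stores. */
uint64_t load_after_two_stores(volatile slot_t *slot,
                               uint32_t first, uint32_t second) {
    slot->half[0] = first;
    slot->half[1] = second;
    return slot->whole;
}

/* Pattern C: one 4-byte store, then an 8-byte load. The other four
 * bytes were written by the caller earlier, so the load mixes a
 * pending store with older data (the "partial forwarding" case). */
uint64_t load_after_partial_store(volatile slot_t *slot, uint32_t first) {
    slot->half[0] = first;
    return slot->whole;
}
```

To turn this into a benchmark, one would call each function in a long loop, feeding each result into the next call's arguments so the loads form a dependency chain, and compare cycles per iteration across the three patterns.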
haberman•3h ago