So, he black-box tests the CPU to try to discover its innards.
Agner
That sounds like a weird design choice. Curious if this will affect memcpy-heavy workloads.
Writes aside, Zen5 is taking much longer to roll out than I thought, and some of AMD's positioning is (almost expectedly) misleading, especially around AI.
AMD's website claims Zen5 is the "Leading CPU for AI" (<https://www.amd.com/en/products/processors/server/epyc/ai.ht...>), but I strongly doubt that. First, they compare Zen5 (9965), which is still largely unavailable, to Xeon2 (8280), a processor two generations older. Xeon4 is abundantly available and comes with AMX, a feature exclusive to Intel. I doubt AVX-512 support with a 512-bit physical path, and even twice as many cores, will be enough to compete with that (if we consider just the ALU throughput rather than the overall system & memory).
Consider the standard matrix-multiplication primitive, the FMAC (multiply-accumulate): three reads and one write, if I'm counting correctly (Output = A * B + C: three inputs, one output).
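To make the operand counting concrete, here is a minimal sketch (my own illustration, not from the thread) of an FMAC-based dot product in C. Each step reads the two array elements plus the accumulator and writes the accumulator back, matching the three-reads-one-write count above; with `-ffp-contract=fast`, compilers emit a hardware FMA for the `a*b + c` pattern.

```c
#include <stddef.h>

/* One FMAC step per element: reads a[i], b[i], and the accumulator,
 * and writes the accumulator back -- three reads, one write. */
double dot_fmac(const double *a, const double *b, size_t n) {
    double acc = 0.0;               /* the "C" operand, kept in a register */
    for (size_t i = 0; i < n; i++)
        acc = a[i] * b[i] + acc;    /* Output = A * B + C */
    return acc;
}
```

Note that only two of the three reads per step actually hit memory, since the accumulator stays in a register; that is exactly why the memory-traffic accounting in the next comment matters.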
Whether the core does a 512-bit write in 1 cycle or 2 because it is two 256-bit writes is immaterial. Memory bandwidth is bottlenecked by 64GB/sec per CCX. You need to use cores from multiple CCXs to get full bandwidth.
That said, the EPYC 9175F has 614.4GB/sec memory bandwidth and should be able to use all of it. I have one, although the machine is not yet assembled (Supermicro took 7 weeks to send me a motherboard, which delayed assembly), so I have not confirmed that it can use all of it yet.
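As a back-of-envelope check of the 614.4GB/sec figure, assuming the platform runs 12 DDR5 channels at 6400 MT/s with 8 bytes per transfer (my assumption about the configuration, not stated in the comment):

```c
/* Peak theoretical DRAM bandwidth in GB/s:
 * channels * transfers-per-second * bytes-per-transfer. */
double peak_bw_gbs(int channels, double mega_transfers, int bytes_per_xfer) {
    return channels * mega_transfers * 1e6 * bytes_per_xfer / 1e9;
}
```

With those numbers, `peak_bw_gbs(12, 6400.0, 8)` works out to 12 × 6.4e9 × 8 / 1e9 = 614.4 GB/s, matching the quoted spec.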
This was a typo. It should have been “inference is memory bandwidth bound”.
How much is it going to cost you to build the box?
Now, if we say "Zen5 is the leading consumer CPU for AI", then no objections can be made: consumer Intel models do not even support AVX-512.
Also, note that for inference they compare with Xeon 8592+ which is the top Emerald Rapids model. Not sure if comparison with Granite Rapids would have been more appropriate but they surely dodged the AMX bullet by testing FP32 precision instead of BF16.
On the right, they compare the EPYC 9965 (launched 10/10/24) with the Xeon Platinum 8592+ (launched Q4 '23), a like-for-like comparison against Intel's competition at launch.
The argument is essentially in two pieces - "If you're upgrading, you should pick AMD. If you're not upgrading, you should be."
Still, if you decode the unreadable footnotes 2 & 3 at the bottom of the page, a few things stand out: avoiding AMX, using CPUs with different core counts & costs, and even running on a different Linux kernel version, which may affect scheduling…
The floating point schedulers have a slow region, in the oldest entries of a scheduler and only when the scheduler is full. If an operation is in the slow region and it is dependent on a 1-cycle latency operation, it will see a 1 cycle latency penalty.
There is no penalty for operations in the slow region that depend on longer latency operations or loads.
There is no penalty for any operations in the fast region.
To write a latency test that does not see this penalty, the test needs to keep the FP schedulers from filling up.
The latency test could interleave NOPs to prevent the scheduler from filling up.
Basically, short vector code sequences that don't fill up the scheduler will have better latency. [1]

[1] https://www.amd.com/content/dam/amd/en/documents/processor-t...
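A sketch of what such a latency-test inner step could look like (x86-64 GCC/Clang inline asm, my own illustration, not a calibrated benchmark): the dependent FP operation is followed by NOPs, so the scheduler drains between steps and the op never ages into the slow region.

```c
/* One step of a dependency chain for measuring FP add latency.
 * The NOPs are filler that keeps the FP scheduler from filling up,
 * avoiding the Zen 5 slow-region penalty described above. */
double chain_step(double x) {
    __asm__ volatile(
        "addsd %0, %0\n\t"             /* dependent add: x = x + x */
        "nop\n\tnop\n\tnop\n\tnop\n\t" /* keep the scheduler drained */
        : "+x"(x));
    return x;
}
```

In a real latency test you would run millions of these back-to-back dependent steps and time them with `rdtsc`, varying the NOP count to see the slow-region penalty appear and disappear.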
There is very little reason to use integers for anything anymore. Loop counter? Why not make it a double - you never know when you might need an extra 0.5 loops at the end!
Zen 5 breaks several performance "conventions" e.g. AMD went directly from one to three complex scalar integer units (multiplication, PDEP/PEXT, etc.).
Intel effectively has two vector pipelines and the shortest instruction latency is a single cycle while Zen 5 has four pipelines with a two cycle minimum latency. That's a *very* different optimisation target (aim for eight instead of two independent instructions in flight) for low level SIMD code going forward despite an identical instruction set.
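The "eight independent instructions in flight" target can be illustrated with a hypothetical reduction (my own sketch): with 4 FP pipes at 2-cycle latency, latency × throughput = 8 independent dependency chains are needed to saturate the machine. Eight scalar accumulators show the dependency structure; a vectorizing compiler would widen each chain further.

```c
#include <stddef.h>

/* Sum with 8 independent accumulator chains so a 2-cycle-latency
 * add across 4 pipelines never stalls on its own result. */
float sum8(const float *x, size_t n) {
    float a0 = 0, a1 = 0, a2 = 0, a3 = 0, a4 = 0, a5 = 0, a6 = 0, a7 = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {      /* 8 independent chains per pass */
        a0 += x[i];     a1 += x[i + 1];
        a2 += x[i + 2]; a3 += x[i + 3];
        a4 += x[i + 4]; a5 += x[i + 5];
        a6 += x[i + 6]; a7 += x[i + 7];
    }
    float s = ((a0 + a1) + (a2 + a3)) + ((a4 + a5) + (a6 + a7));
    for (; i < n; i++) s += x[i];     /* scalar tail */
    return s;
}
```

On a machine with two pipes and 1-cycle latency (the Intel target described above), two chains would already suffice; the same source code is optimal for neither without retuning.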
If a laptop will need to be plugged in to deliver full performance, whilst blasting fans at full throttle, what is the point? (apart from server / workstation use, where you don't like MacOS or need a different OS)
Desktops for gaming? AMD makes the best gaming CPUs with the X3D series.
Most of the workforce use Windows.
You can also use Linux if you want on Intel&AMD.
M CPUs are great but constrained by Apple.
In multi-threaded scenarios, for example, M chips are not better at all and AFAICR are worse than Threadripper. So, a different trade-off really.
Getting near desktop performance when plugged but portability and lower consumption when unplugged is a pretty good tradeoff.
And they suck big time. And, to add insult to injury, there are also desktops which use laptop CPUs, with the same (lack of) performance.
What sort of workload sucks big time? Assuming the workload is even laptop-focused in the first place.
For doing any kind of work that requires focus it is an absolute nightmare.
But she needs a laptop to occasionally take to Uni.
The CPU in itself should be pretty good by modern standards: https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+Ultra+9+...
> (like even struggles to open Word).
Her issue is not the form factor. Is it the RAM? Did she activate all the marketing apps? Is a bitcoin farmer running in the background? I don't know, but it's worth looking into.
For comparison, I have at hand a Surface Pro 8 that should be 3x slower than hers on sheer CPU benchmarks, and I can throw any run-of-the-mill task at it (doing the taxes with 3-4 Word documents, Excel, dozens of tabs in Firefox and a call session in the background) and it's fine. It will burn through its battery within two or three hours under that load, and yes, the fans will be running, but it doesn't crawl when unplugged.
I'm jealous of the M-series MacBooks: fast, quiet, and cool on wall power or battery.
There is also AMD's "Software Optimization Guide" that might contain some background information. [4] has many direct attachments, AMD tends to break direct links. Intel should have similar docs, but I am currently more focused on AMD, so I only have those links at hand.
[1] https://www.agner.org/optimize/instruction_tables.pdf
[2] https://www.uops.info/background.html
It's often faster to use one less core than the number you hit constraints at, so that the processor can juggle threads between cores to balance the thermal load, as opposed to trying to keep it completely saturated.
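A minimal sketch of that rule of thumb (POSIX-only, using `sysconf`; the function name is my own): size the worker pool one below the online CPU count so the scheduler always has a spare core to rotate hot threads onto.

```c
#include <unistd.h>

/* One-less-than-saturated worker count, leaving the OS a spare
 * core for thermal load balancing. Never returns less than 1. */
long worker_count(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 1 ? n - 1 : 1;
}
```

Whether the tradeoff pays off depends on the workload; it tends to help on thermally limited parts where sustained all-core clocks drop under full saturation.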
thanks
Perhaps instruction fusion somehow played into it?
Then I moved stuff into huge precalced arrays instead, and it became intensely memory bound. :-)
So, what did you end up having in the code? Ugly and fast or nice and slow? :)
Classic compiler games. Something similar happened to me just recently when I wrote micro-optimized SIMD code for a utility producing a monotonically increasing integer sequence. It achieved something like 80% of the theoretical IPC (for Skylake-X) in ubenchmarks. However, once I moved the code from the ubenchmark into the production code, what I saw was surprising (or not really): the compiler merged my carefully optimized SIMD code with the surrounding code and largely nullified the optimizations I'd done.
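One common guard against this kind of merging, sketched below with a trivial scalar stand-in for the SIMD kernel (GCC/Clang-specific attributes; not the commenter's actual code): `noinline` keeps the tuned kernel a separate optimization unit the compiler can't fuse with its callers, and an empty asm with a memory clobber discourages rewrites across the boundary.

```c
#include <stddef.h>

/* Hand-tuned kernel kept opaque to interprocedural optimization. */
__attribute__((noinline))
void iota_u32(unsigned *dst, unsigned start, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = start + (unsigned)i;   /* monotonically increasing seq */
    __asm__ volatile("" ::: "memory");  /* barrier: stores stay visible */
}
```

The cost is a real call at every use site, so this only makes sense when the kernel body is large enough to amortize it.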
Edit: I ran the code on an Intel CPU (Kaby Lake, on my laptop) and there's no slowdown when removing the assert(). So it really seems to be something Zen-specific and weird.
> So it really seems to be something Zen-specific and weird.
Number and/or type of ports. Perhaps even the code generation is different, so it could also be compiler-backend differences for different uarchs.
You might want to download it and just take a look at it so you know that this content exists.
And that is also the case with Qualcomm's Oryon and ARM's own Cortex X93x series.
Still really looking forward to Zen 6 on server though. I can't wait to see 256 Zen 6c cores.
Not only is the Zen 5 slower, it also uses more energy to achieve its results. Thinking about that, the gap is staggering.
> since 2013, Intel offers a feature called "Intel Processor Trace" [2]
> [not answered]
> When will AMD cpus introduce Intel-PT tech or the Intel branch trace store feature? (2024) [3]
> [not answered]
Is Intel-PT over-engineered and not really needed in practice?
[1] https://github.com/janestreet/magic-trace/wiki/How-could-mag...
[2] https://community.amd.com/t5/pc-processors/amd-ipt-intelpt-i...
[3] https://community.amd.com/t5/pc-processors/will-amd-cpus-hav...
In general, Intel is _way_ ahead of AMD in the performance monitoring game. For instance, IBS is a really poor replacement for PEBS (it still hits the wrong instructions, it just re-weights them and this rarely goes well), which makes profiling anything branchy or memory-bound really hard. This is the only real reason why I prefer to buy Intel CPUs still myself (although I understand this is a niche use case!).