I know it's a server but I'd be so ready to use all of that as a RAM disk. Crazy amount at a crazy high speed. Even 1% would be enough just to play around with something.
This has been the basic pattern for ages, particularly with large C++ projects. Since the introduction of multi-CPU and multi-core systems, C++ builds have become IO-bound workflows, especially during linking.
Creating RAM disks to speed up builds is one of the most basic, lowest-effort strategies for improving build times, and I think it was the main driver for a few commercial RAM drive apps.
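A minimal sketch of the idea on Linux - the tmpfs size, mount point, and the CMake build used here are illustrative assumptions rather than anything from this thread, and the mount needs root:

    # Mount a tmpfs "RAM disk" and point the build output at it.
    import os
    import subprocess

    MOUNT_POINT = "/mnt/ramdisk"  # hypothetical mount point
    SIZE = "16G"                  # hypothetical size

    os.makedirs(MOUNT_POINT, exist_ok=True)
    subprocess.run(
        ["mount", "-t", "tmpfs", "-o", f"size={SIZE}", "tmpfs", MOUNT_POINT],
        check=True,
    )
    # Keep sources on disk, but do the compiling and linking in RAM:
    subprocess.run(["cmake", "-S", ".", "-B", f"{MOUNT_POINT}/build"], check=True)
    subprocess.run(["cmake", "--build", f"{MOUNT_POINT}/build", "-j"], check=True)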
The RAMsan line, for example, started in 2000 with a 64GB DRAM-based SSD with up to 15 1Gbit FC interfaces, providing a shared SAN SSD for multiple hosts (very well utilized by some of the beefier clustered SQL databases like Oracle RAC), but the company itself had been providing high-speed specialized DRAM-based SSDs since 1978.
Last time I saw one was with a mainframe, which kind of makes sense if adding cheaper third party memory to the machine would void warranties or breach support contracts. People really depend on company support for those machines.
A fast scratch pad that can be shared between multiple machines can be ideal at times.
Still seems like a kludge - The One Right Way to do it would be to add that memory directly to CPU-addressable space rather than across a SCSI (or channel, or whatever) link. Might as well add it to the RAM in the storage server and let it manage the memory optimally (with hints from the host).
You are arguing hypotheticals, whereas for decades the world had to deal with practicals. I recommend you spend a few minutes looking into how to create RAM drives on, say, Windows, and think through how to achieve that when your build workstation has 8GB of RAM and you need a scratchpad memory of, say, 16GB of RAM.
Recommended reading: https://en.wikipedia.org/wiki/RAM_drive
These are only for when the OS and the machine itself can't deal with the extra memory and wouldn't know what to do with it, things you buy when you run out of sensible options (such as adding more memory to your machine and/or configuring a RAM disk).
A) this technique precedes the existence of Linux.
B) Linux is far from the most popular OS in use today.
C) some software projects are developed on and target non-Linux platforms (see Windows)
I assume the same would be true for any project that is configure-heavy.
Nowadays NVMe drives might indeed be able to get close - but we'd probably still need to span multiple SSDs (reducing the cost savings), and the developers there are incredibly sensitive to build times. If a 5-minute build suddenly takes 30 seconds more we have some unhappy developers.
Another reason is that it'd eat SSDs like candy. Current enterprise SSDs have something like a 10000 TBW rating, which we'd exceed in the first month. So we'd either get cheap consumer SSDs and replace them every few days, or enterprise SSDs and replace them every few months - or stick with the RAM setup, which over the life of the build system will be cheaper than constantly buying SSDs.
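For scale, a rough check of what "exceed 10000 TBW in the first month" implies as a sustained write rate (assuming a 30-day month):

    TBW_RATING_TB = 10_000  # endurance rating from the comment above
    DAYS = 30               # "first month"

    tb_per_day = TBW_RATING_TB / DAYS
    gb_per_s = tb_per_day * 1e12 / 86_400 / 1e9
    print(f"{tb_per_day:.0f} TB/day, ~{gb_per_s:.1f} GB/s of sustained writes")
    # -> 333 TB/day, ~3.9 GB/s of sustained writes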
Wow. What’s your use case?
We actually did try with SSDs about 15 years ago, and had a lot of dead SSDs in a very short time. After that we went with estimating the data written instead; it's cheaper. While SSD durability has increased a lot since then, everything else has gotten faster as well - so SSDs would last a bit longer now (back then it was a weekly thing), but still nowhere near the point where it'd be a sensible thing to do.
They sound incredibly spoiled. Where should I send my CV?
They indeed are quite spoiled - and that's not necessarily a good thing. Part of the issue is that our CI was good and fast enough that at some point a lot of the new hires never bothered to figure out how to build the code - so for quite a few the workflow is "commit to a branch, push it, wait for CI, repeat". And as they often just work on a single problem, the "wait" is time lost for them, which leads to unhappiness if we are too slow.
Running the numbers to verify: a read-write-mixed enterprise SSD will typically have 3 DWPD (drive writes per day) across its 5-year warranty. At 2TB, that would be 10950 TBW, so that sort of checks out. If endurance were a concern, upgrading to a higher capacity would linearly increase the endurance. For example the Kioxia CD8P-V. https://americas.kioxia.com/en-us/business/ssd/data-center-s...
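The conversion behind that figure:

    DWPD = 3            # drive writes per day
    CAPACITY_TB = 2     # drive capacity in TB
    WARRANTY_YEARS = 5

    tbw = DWPD * CAPACITY_TB * 365 * WARRANTY_YEARS
    print(tbw)  # 10950 (TBW), matching the number above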
Finding it a bit hard to imagine build machines working that hard, but I could believe it!
I don't know where you're buying your NVMe drives, but mine usually respond within a hundred microseconds.
this kit? https://www.newegg.com/nemix-ram-1tb/p/1X5-003Z-01930
I also have an M920Q with an 8500T, an HP ProDesk with a 10500T, and a Lenovo P520 -> these three are truly for home purposes.
If I were to do the pricetracker machine again, I'd go much smaller and get a JBOD plus probably a P520.
So those components alone would be just over $12k.
That's just from regular consumer shops, and includes 25% VAT. Without the VAT it's about $9800.
The problem for consumers is that just about all the shops that sell such gear, and that you might get a deal from, are geared towards companies and not interested in dealing with consumers due to consumer protection laws.
I found a used server with 768 GB DDR4 and dual Intel Gold 6248 CPUs for $4200 including 25% VAT.
That's a complete 2U server, the CPUs are a bit weak but not too bad all in all.
That's 300GB/s slower than my old Mac Studio (M1 Ultra). Memory speeds in 2025 remain thoroughly unimpressive outside of high-end GPUs and fully integrated systems.
The M1 Ultra doesn't have 800GB/s because it's "integrated", it simply has 16 channels of DDR5-6400, which it could have whether it was soldered or not. And none of the more recent Apple chips have any more than that.
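A quick check of the ~800GB/s figure using the numbers in this comment - the 64-bit channel width is an assumption (LPDDR5 is organized as narrower channels, but the total bus width comes out the same):

    CHANNELS = 16
    TRANSFERS_PER_S = 6400e6   # 6400 MT/s
    BYTES_PER_CHANNEL = 8      # assumed 64-bit channels

    bandwidth = CHANNELS * TRANSFERS_PER_S * BYTES_PER_CHANNEL / 1e9
    print(f"{bandwidth:.0f} GB/s")  # ~819 GB/s, i.e. the quoted ~800GB/s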
It's the GPUs that use integrated memory, i.e. GDDR or HBM. That actually gets you somewhere -- the RTX 5090 has 1.8TB/s with GDDR7, the MI300X has 5.3TB/s with HBM3. But that stuff is also more expensive which limits how much of it you get, e.g. the MI300X has 192GB of HBM3, whereas normal servers support 6TB per socket.
And it's the same problem with Apple even though there's no great reason for it to be. The 2019 Intel Xeon Mac Pro supported 1.5TB of RAM -- still in slots -- but the newer ones barely reach a third of that at the top end.
The M1 Ultra has LPDDR5, not DDR5. And the M1 Ultra was running its memory at 6400MT/s about two and a half years before any EPYC or Xeon parts supported that speed - due in part to the fact that the memory on an M1 Ultra is soldered down. And as far as I can tell, neither Intel nor AMD has shipped a CPU socket supporting 16 channels of DRAM; they're having enough trouble with 12 channels per socket, which often means you need the full width of a 19-inch rack for DIMM slots.
Existing servers typically have 12 channels per socket, but they also have two DIMMs per channel, so you could double the number of channels per socket without taking up any more space for slots. You could also use CAMM which takes up less space.
They don't currently use more than 12 channels per socket, even though they could, because that's enough not to be a constraint for most common workloads, more channels increase costs, and people with workloads that need more can get systems with more sockets. Apple only uses more because they're using the same memory for the GPU, and that is often constrained by memory bandwidth.
Usually this comes at a pretty sizable hit to MHz available. For example STH notes that their Zen5 ASRock Rack EPYC4000D4U goes from DDR5-5600 down to DDR5-3600 with the second slot populated, a 35% drop in throughput. https://www.servethehome.com/amd-epyc-4005-grado-is-great-an...
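Checking the quoted drop against the speeds in that article:

    one_dimm = 5600   # DDR5-5600 with one DIMM per channel
    two_dimm = 3600   # DDR5-3600 with both slots populated

    drop = (one_dimm - two_dimm) / one_dimm
    print(f"{drop:.0%}")  # 36%, roughly the 35% drop mentioned above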
(It's also because of servers being ultra-cautious again. The desktops say the same thing in the manual but then don't enforce it in the BIOS and people run two sticks per channel at the full speed all over the place.)
So they have been really optimising that IO die for latency.
NUMA is already workload sensitive, you need to benchmark your exact workload to know if it’s worth enabling or not, and this change is probably going to make it even less worthwhile. Sounds like you will need a workload that really pushes total memory bandwidth to make NUMA worthwhile.
It says 16 cores per die with up to 16 Zen 5 dies per chip. For Zen 5 it's 8 cores per die and 16 dies per chip, giving a total of 128 cores.
For Zen 5c it's 16 cores per die and 12 dies per chip, giving a total of 192 cores.
Weirdly it's correct on the right side of the image.
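The corrected arithmetic:

    zen5_cores = 8 * 16    # 8 cores per Zen 5 die, 16 dies per chip
    zen5c_cores = 16 * 12  # 16 cores per Zen 5c die, 12 dies per chip
    print(zen5_cores, zen5c_cores)  # 128 192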