Potential and Limitation of High-Frequency Cores and Caches (2024)

https://arch.cs.ucdavis.edu/simulation/2024/08/06/potentiallimitationhighfreqcorescaches.html

30•matt_d•8mo ago

Comments

bob1029•8mo ago

> We also did not model the SERDES (serializer-deserializer) circuits that would be required to interface the superconductor components with the room-temperature components, which would have an impact on the performance of the workloads. Instead, we assumed that the interconnect is unchanged from CMOS.

I had a little chuckle when I got to this. I/O is the hard part. Getting the information from A to B.

IBM is probably pushing the practical limits with 5.5GHz base clock on every core. When you can chew through 10+ gigabytes of data per second per core, it becomes a lot less about what the CPU can do and more about what everything around it can do.

The software is usually the weak link in all of this. Disrespect the NUMA and nothing will matter. The layers of abstraction can make it really easy to screw this up.

PaulHoule•8mo ago

In a phase when I was doing a lot of networking I hooked up with a chip designer who familiarized me with the "memory wall", ASIC and FPGA aren't quite the panacea they seem to be because if you have a large working set you are limited by memory bandwidth and latency.

Note faster-than-silicon electronics have been around for a while, the DOD put out an SBIR for a microprocessor based on Indium Phosphide in the 1990s which I suspect is a real product today but secret. [1] Looking at what fabs provide it seems one could make something a bit better than a 6502 that clocks out at 60 GHz and maybe you can couple it to 64kb of static RAM, maybe more with 2.5-d packaging. You might imagine something like that would be good for electronic warfare and for the simplest algorithms and waveforms it could buy a few ns of reduced latency but for more complex algorithms modern chips get a lot of parallelism and are hard to beat on throughput.

[1] Tried talking with people who might know, nobody wanted to talk.

foota•8mo ago

I've rea confidential proposals for chips with very high available memory bandwidth, but otherwise reduced performance compared to a standard general purpose CPU.

Something somewhere between a CPU and a GPU, that could handle many parallel streams, but at lower throughput than a CPU, and with very high memory bandwidth for tasks that need to be done against main memory. The niche here is for things like serialization and compression that need lots of bandwidth, can't be done efficiently on the GPU (not parallel), and waste precious time on the CPU.

PaulHoule•8mo ago

https://en.wikipedia.org/wiki/UltraSPARC_T1

foota•8mo ago

Similar in concept, I think the idea is that it would be used as an application coprocessor though, as opposed to the main processor, and obviously a lot more threads.

I don't remember all the details, but picture a bunch of those attached to different parts of the processor hierarchy remotely, e.g., one per core or one per NUMA node etc.,. The connection between the coprocessor and the processor can be thin, because the processor would just be sending commands to the coprocessor, so they wouldn't consume much of the constrained processor bandwidth, and each coprocessor would have a high bandwidth connection to memory.

saltcured•8mo ago

There was also the Tera MTA and various "processor-in-memory" research projects in academia.

Eventually, it's all full circle to supercomputer versus "hadoop cluster" again. Can you farm out work locally near bits of data or does your algorithm effectively need global scope to "transpose" data and hit bisection bandwidth limits of your interconnect topology.

Veserv•8mo ago

I am not sure that is the case anymore. High Bandwidth Memory (HBM) [1] as used on modern ML training GPUs has immensely more memory bandwidth than traditional CPU systems.

DDR5 [2] tops out around 60-80 GB/s. HBM3, used on the H100 GPUs, tops out at 819 GB/s. 10-15x more bandwidth. At a 4 GHz clock, you need to crunch 200 bytes/clock to become memory bandwidth limited.

[1] https://en.wikipedia.org/wiki/High_Bandwidth_Memory

[2] https://en.wikipedia.org/wiki/DDR5_SDRAM

ryao•8mo ago

The memory wall (also known as the Von Neumann bottleneck) is still true. Token generation on Nvidia GPUs is memory bound, unless you do very large batch sizes to become compute bound.

That said, more exotic architectures from cerebras and groq get far less token per second performance than their memory bandwidth suggests they can, so they have a bottleneck elsewhere.

Veserv•8mo ago

You get a memory bound on GPUs because they have so much more compute per memory. The H100 has 144 SMs driving 4x32 threads per clock. That is 18,432 threads demanding memory.

Now to be fair, that is separated into 8 clusters which I assume are connected to their own memory so you actually only have 576 threads sharing memory bandwidth. But that is still way more compute than any single processing element could ever hope to have. You can drown any individual processor in memory bandwidth these days unless you somehow produce a processor clocked at multiple THz.

The problem does not seem to be memory bandwidth, but cost, latency, and finding the cost-efficient compute-bandwidth tradeoff for a given task.

ryao•7mo ago

You can predict the token generation performance of a GPU or CPU by dividing the memory bandwidth by the size of the active parameters. By definition, that is a memory bandwidth bottleneck. I have no idea why you think it is not.

Anyone who has worked on inference code knows that memory bandwidth is the principal bottleneck for token generation. For example:

https://github.com/ryao/llama3.c

PaulHoule•8mo ago

Certainly an ASIC or FPGA on a package with HBM could do more.

So far as exotic 10x clocked systems based on 3-5 semiconductors, squids, or something, I think memory does have to be packaged with the rest of it. Ecauss of speed of light issues.

markhahn•8mo ago

they're both DRAM, so have roughly the same performance per interface-bit-width and clock. you can see this very naturally by looking at higher-end CPUs, which have wider DDR interfaces (currently up to 12x64b per socket - not as wide as in-package HBM, but duh)

Show HN: Poddley.com – Follow people, not podcasts

Layoffs Surge 118% in January – The Highest Since 2009

Papyrus 114: Homer's Iliad

DicePit – Real-time multiplayer Knucklebones in the browser

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

Show HN: AI Agent Tool That Keeps You in the Loop

Why Every R Package Wrapping External Tools Needs a Sitrep() Function

Achieving Ultra-Fast AI Chat Widgets

Show HN: Runtime Fence – Kill switch for AI agents

Researchers surprised by the brain benefits of cannabis usage in adults over 40

Peter Thiel warns the Antichrist, apocalypse linked to the 'end of modernity'

USS Preble Used Helios Laser to Zap Four Drones in Expanding Testing

Show HN: Animated beach scene, made with CSS

An update on unredacting select Epstein files – DBC12.pdf liberated

Was going to share my work

Pitchfork: A devilishly good process manager for developers

You Are Here

Why social apps need to become proactive, not reactive

How patient are AI scrapers, anyway? – Random Thoughts

Vouch: A contributor trust management system

I built a terminal monitoring app and custom firmware for a clock with Claude

Tiny C Compiler

Y Combinator Founder Organizes 'March for Billionaires'

Ask HN: Need feedback on the idea I'm working on

OpenClaw Addresses Security Risks

Apple finalizes Gemini / Siri deal

Italy Railways Sabotaged

Emacs-tramp-RPC: high-performance TRAMP back end using MsgPack-RPC

Nintendo Wii Themed Portfolio

"There must be something like the opposite of suicide "

Show HN: Poddley.com – Follow people, not podcasts

Layoffs Surge 118% in January – The Highest Since 2009

Papyrus 114: Homer's Iliad

DicePit – Real-time multiplayer Knucklebones in the browser

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

Show HN: AI Agent Tool That Keeps You in the Loop

Why Every R Package Wrapping External Tools Needs a Sitrep() Function

Achieving Ultra-Fast AI Chat Widgets

Show HN: Runtime Fence – Kill switch for AI agents

Researchers surprised by the brain benefits of cannabis usage in adults over 40

Peter Thiel warns the Antichrist, apocalypse linked to the 'end of modernity'

USS Preble Used Helios Laser to Zap Four Drones in Expanding Testing

Show HN: Animated beach scene, made with CSS

An update on unredacting select Epstein files – DBC12.pdf liberated

Was going to share my work

Pitchfork: A devilishly good process manager for developers

You Are Here

Why social apps need to become proactive, not reactive

How patient are AI scrapers, anyway? – Random Thoughts

Vouch: A contributor trust management system

I built a terminal monitoring app and custom firmware for a clock with Claude

Tiny C Compiler

Y Combinator Founder Organizes 'March for Billionaires'

Ask HN: Need feedback on the idea I'm working on

OpenClaw Addresses Security Risks

Apple finalizes Gemini / Siri deal

Italy Railways Sabotaged

Emacs-tramp-RPC: high-performance TRAMP back end using MsgPack-RPC

Nintendo Wii Themed Portfolio

"There must be something like the opposite of suicide "

Potential and Limitation of High-Frequency Cores and Caches (2024)

Comments