Potential and Limitation of High-Frequency Cores and Caches (2024)

https://arch.cs.ucdavis.edu/simulation/2024/08/06/potentiallimitationhighfreqcorescaches.html

18•matt_d•3d ago

Comments

bob1029•4h ago

> We also did not model the SERDES (serializer-deserializer) circuits that would be required to interface the superconductor components with the room-temperature components, which would have an impact on the performance of the workloads. Instead, we assumed that the interconnect is unchanged from CMOS.

I had a little chuckle when I got to this. I/O is the hard part. Getting the information from A to B.

IBM is probably pushing the practical limits with 5.5GHz base clock on every core. When you can chew through 10+ gigabytes of data per second per core, it becomes a lot less about what the CPU can do and more about what everything around it can do.

The software is usually the weak link in all of this. Disrespect the NUMA and nothing will matter. The layers of abstraction can make it really easy to screw this up.

PaulHoule•3h ago

In a phase when I was doing a lot of networking I hooked up with a chip designer who familiarized me with the "memory wall". ASICs and FPGAs aren't quite the panacea they seem to be, because if you have a large working set you are limited by memory bandwidth and latency.

Note that faster-than-silicon electronics have been around for a while. The DOD put out an SBIR for a microprocessor based on Indium Phosphide in the 1990s, which I suspect is a real product today but secret. [1] Looking at what fabs provide, it seems one could make something a bit better than a 6502 that clocks out at 60 GHz, and maybe you can couple it to 64kb of static RAM, maybe more with 2.5D packaging. You might imagine something like that would be good for electronic warfare, and for the simplest algorithms and waveforms it could buy a few ns of reduced latency, but for more complex algorithms modern chips get a lot of parallelism and are hard to beat on throughput.

[1] Tried talking with people who might know, nobody wanted to talk.

foota•3h ago

I've read confidential proposals for chips with very high available memory bandwidth, but otherwise reduced performance compared to a standard general purpose CPU.

Something somewhere between a CPU and a GPU, that could handle many parallel streams, but at lower throughput than a CPU, and with very high memory bandwidth for tasks that need to be done against main memory. The niche here is for things like serialization and compression that need lots of bandwidth, can't be done efficiently on the GPU (not parallel), and waste precious time on the CPU.

PaulHoule•3h ago

https://en.wikipedia.org/wiki/UltraSPARC_T1

foota•2h ago

Similar in concept, I think the idea is that it would be used as an application coprocessor though, as opposed to the main processor, and obviously a lot more threads.

I don't remember all the details, but picture a bunch of those attached to different parts of the processor hierarchy remotely, e.g., one per core or one per NUMA node, etc. The connection between the coprocessor and the processor can be thin, because the processor would just be sending commands to the coprocessor, so they wouldn't consume much of the constrained processor bandwidth, and each coprocessor would have a high bandwidth connection to memory.

saltcured•2h ago

There was also the Tera MTA and various "processor-in-memory" research projects in academia.

Eventually, it all comes full circle to supercomputer versus "hadoop cluster" again. Can you farm out work locally near bits of data, or does your algorithm effectively need global scope to "transpose" data and hit the bisection bandwidth limits of your interconnect topology?

Veserv•2h ago

I am not sure that is the case anymore. High Bandwidth Memory (HBM) [1] as used on modern ML training GPUs has immensely more memory bandwidth than traditional CPU systems.

DDR5 [2] tops out around 60-80 GB/s. HBM3, used on the H100 GPUs, tops out at 819 GB/s. 10-15x more bandwidth. At a 4 GHz clock, you need to crunch 200 bytes/clock to become memory bandwidth limited.

[1] https://en.wikipedia.org/wiki/High_Bandwidth_Memory

[2] https://en.wikipedia.org/wiki/DDR5_SDRAM
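The bytes-per-clock figure above is just bandwidth divided by clock rate. A minimal sketch of that arithmetic, using only the numbers quoted in the comment (the function name is made up for illustration):

```python
def bytes_per_clock(bandwidth_gb_s: float, clock_ghz: float) -> float:
    """Bytes a core must consume every cycle to saturate the given bandwidth."""
    return (bandwidth_gb_s * 1e9) / (clock_ghz * 1e9)


if __name__ == "__main__":
    clock_ghz = 4.0  # clock assumed in the comment
    for name, gb_s in [("DDR5 (low)", 60), ("DDR5 (high)", 80), ("HBM3", 819)]:
        print(f"{name}: {bytes_per_clock(gb_s, clock_ghz):.2f} bytes/clock")
    # Ratio of the quoted HBM3 figure to the quoted DDR5 range
    print(f"HBM3 vs DDR5: {819 / 80:.1f}x to {819 / 60:.1f}x")
```

With the quoted figures this comes out to roughly 205 bytes/clock for HBM3 against 15 to 20 bytes/clock for DDR5.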

ryao•1h ago

The memory wall (also known as the Von Neumann bottleneck) is still true. Token generation on Nvidia GPUs is memory bound, unless you do very large batch sizes to become compute bound.

That said, more exotic architectures from Cerebras and Groq get far fewer tokens per second than their memory bandwidth suggests they can, so they have a bottleneck elsewhere.
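A rough way to see the memory-bound versus compute-bound crossover is a roofline-style estimate: each decode step has to stream the weights once regardless of batch size, while compute grows with the batch. The sketch below assumes about 2 FLOPs per parameter per token and ignores KV-cache traffic; the model size, bandwidth, and peak FLOPs are hypothetical placeholders, not measurements of any real GPU.

```python
def decode_tokens_per_s(params: float, bytes_per_param: float,
                        bandwidth_gb_s: float, peak_tflops: float,
                        batch: int) -> float:
    """Upper bound on decode throughput under a simple roofline model."""
    weight_bytes = params * bytes_per_param
    # Steps/s if limited by reading all weights once per step
    steps_mem = bandwidth_gb_s * 1e9 / weight_bytes
    # Steps/s if limited by arithmetic (assumed ~2 FLOPs per param per token)
    steps_compute = peak_tflops * 1e12 / (2 * params * batch)
    return batch * min(steps_mem, steps_compute)


if __name__ == "__main__":
    # Hypothetical: 7B parameters, 2 bytes each, 819 GB/s, 100 TFLOP/s peak
    for batch in (1, 32, 256):
        tps = decode_tokens_per_s(7e9, 2, 819, 100, batch)
        print(f"batch={batch}: <= {tps:.0f} tokens/s")
```

Under these assumed numbers, throughput scales with batch size until the compute ceiling takes over, which is the "very large batch sizes" regime described above.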

PaulHoule•1h ago

Certainly an ASIC or FPGA on a package with HBM could do more.

So far as exotic 10x clocked systems based on III-V semiconductors, SQUIDs, or something, I think memory does have to be packaged with the rest of it, because of speed of light issues.
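To put a number on the speed of light issue: the distance a signal can cover in one clock period shrinks in proportion to the clock rate. A small sketch, assuming a signal velocity of half the vacuum speed of light (a placeholder value, real interconnects vary) and a 40 GHz clock as an example of "10x":

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum


def reach_mm(clock_ghz: float, velocity_factor: float = 0.5,
             round_trip: bool = True) -> float:
    """Distance in mm a signal can travel within one clock period."""
    period_s = 1.0 / (clock_ghz * 1e9)
    distance_m = C_M_PER_S * velocity_factor * period_s
    if round_trip:
        distance_m /= 2  # a request has to go out and the reply come back
    return distance_m * 1e3


if __name__ == "__main__":
    for ghz in (4, 40, 60):
        print(f"{ghz} GHz: {reach_mm(ghz):.2f} mm reachable per cycle (round trip)")
```

At tens of GHz the single-cycle reach is on the order of a millimetre or two, so anything farther away costs multiple cycles just in flight time.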

markhahn•1h ago

They're both DRAM, so they have roughly the same performance per interface bit-width and clock. You can see this very naturally by looking at higher-end CPUs, which have wider DDR interfaces (currently up to 12x64b per socket - not as wide as in-package HBM, but duh).
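The width-times-clock point can be checked with a quick calculation: peak bandwidth is interface width in bytes times the per-pin transfer rate. The sketch below assumes DDR5-4800 for the 12x64b socket and a single 1024-bit HBM3 stack at 6.4 GT/s; both are illustrative choices, not figures from the thread.

```python
def peak_gb_s(width_bits: int, mega_transfers_per_s: float) -> float:
    """Peak bandwidth in GB/s for an interface of the given width and rate."""
    return (width_bits / 8) * mega_transfers_per_s * 1e6 / 1e9


if __name__ == "__main__":
    ddr5_socket = peak_gb_s(12 * 64, 4800)   # 12 channels x 64 bits, assumed DDR5-4800
    hbm3_stack = peak_gb_s(1024, 6400)       # one stack, assumed 1024 bits at 6.4 GT/s
    print(f"12-channel DDR5 socket: {ddr5_socket:.1f} GB/s")
    print(f"One HBM3 stack:         {hbm3_stack:.1f} GB/s")
    # Bandwidth per interface bit, which is what the comment says is comparable
    print(f"Per bit: DDR5 {ddr5_socket / (12 * 64):.2f}, HBM3 {hbm3_stack / 1024:.2f} GB/s")
```

Per interface bit the two land in the same range, so under these assumptions most of the gap comes from HBM simply being wider.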
