> We also did not model the SERDES (serializer-deserializer) circuits that would be required to interface the superconductor components with the room-temperature components, which would have an impact on the performance of the workloads. Instead, we assumed that the interconnect is unchanged from CMOS.
I had a little chuckle when I got to this. I/O is the hard part. Getting the information from A to B.
IBM is probably pushing the practical limits with 5.5GHz base clock on every core. When you can chew through 10+ gigabytes of data per second per core, it becomes a lot less about what the CPU can do and more about what everything around it can do.
The software is usually the weak link in all of this. Disrespect the NUMA and nothing will matter. The layers of abstraction can make it really easy to screw this up.
PaulHoule•3h ago
In a phase when I was doing a lot of networking I hooked up with a chip designer who familiarized me with the "memory wall", ASIC and FPGA aren't quite the panacea they seem to be because if you have a large working set you are limited by memory bandwidth and latency.
Note faster-than-silicon electronics have been around for a while, the DOD put out an SBIR for a microprocessor based on Indium Phosphide in the 1990s which I suspect is a real product today but secret. [1] Looking at what fabs provide it seems one could make something a bit better than a 6502 that clocks out at 60 GHz and maybe you can couple it to 64kb of static RAM, maybe more with 2.5-d packaging. You might imagine something like that would be good for electronic warfare and for the simplest algorithms and waveforms it could buy a few ns of reduced latency but for more complex algorithms modern chips get a lot of parallelism and are hard to beat on throughput.
[1] Tried talking with people who might know, nobody wanted to talk.
foota•3h ago
I've rea confidential proposals for chips with very high available memory bandwidth, but otherwise reduced performance compared to a standard general purpose CPU.
Something somewhere between a CPU and a GPU, that could handle many parallel streams, but at lower throughput than a CPU, and with very high memory bandwidth for tasks that need to be done against main memory. The niche here is for things like serialization and compression that need lots of bandwidth, can't be done efficiently on the GPU (not parallel), and waste precious time on the CPU.
Similar in concept, I think the idea is that it would be used as an application coprocessor though, as opposed to the main processor, and obviously a lot more threads.
I don't remember all the details, but picture a bunch of those attached to different parts of the processor hierarchy remotely, e.g., one per core or one per NUMA node etc.,. The connection between the coprocessor and the processor can be thin, because the processor would just be sending commands to the coprocessor, so they wouldn't consume much of the constrained processor bandwidth, and each coprocessor would have a high bandwidth connection to memory.
saltcured•2h ago
There was also the Tera MTA and various "processor-in-memory" research projects in academia.
Eventually, it's all full circle to supercomputer versus "hadoop cluster" again. Can you farm out work locally near bits of data or does your algorithm effectively need global scope to "transpose" data and hit bisection bandwidth limits of your interconnect topology.
Veserv•2h ago
I am not sure that is the case anymore. High Bandwidth Memory (HBM) [1] as used on modern ML training GPUs has immensely more memory bandwidth than traditional CPU systems.
DDR5 [2] tops out around 60-80 GB/s. HBM3, used on the H100 GPUs, tops out at 819 GB/s. 10-15x more bandwidth. At a 4 GHz clock, you need to crunch 200 bytes/clock to become memory bandwidth limited.
The memory wall (also known as the Von Neumann bottleneck) is still true. Token generation on Nvidia GPUs is memory bound, unless you do very large batch sizes to become compute bound.
That said, more exotic architectures from cerebras and groq get far less token per second performance than their memory bandwidth suggests they can, so they have a bottleneck elsewhere.
PaulHoule•1h ago
Certainly an ASIC or FPGA on a package with HBM could do more.
So far as exotic 10x clocked systems based on 3-5 semiconductors, squids, or something, I think memory does have to be packaged with the rest of it. Ecauss of speed of light issues.
markhahn•1h ago
they're both DRAM, so have roughly the same performance per interface-bit-width and clock. you can see this very naturally by looking at higher-end CPUs, which have wider DDR interfaces (currently up to 12x64b per socket - not as wide as in-package HBM, but duh)
bob1029•4h ago
I had a little chuckle when I got to this. I/O is the hard part. Getting the information from A to B.
IBM is probably pushing the practical limits with 5.5GHz base clock on every core. When you can chew through 10+ gigabytes of data per second per core, it becomes a lot less about what the CPU can do and more about what everything around it can do.
The software is usually the weak link in all of this. Disrespect the NUMA and nothing will matter. The layers of abstraction can make it really easy to screw this up.
PaulHoule•3h ago
Note faster-than-silicon electronics have been around for a while, the DOD put out an SBIR for a microprocessor based on Indium Phosphide in the 1990s which I suspect is a real product today but secret. [1] Looking at what fabs provide it seems one could make something a bit better than a 6502 that clocks out at 60 GHz and maybe you can couple it to 64kb of static RAM, maybe more with 2.5-d packaging. You might imagine something like that would be good for electronic warfare and for the simplest algorithms and waveforms it could buy a few ns of reduced latency but for more complex algorithms modern chips get a lot of parallelism and are hard to beat on throughput.
[1] Tried talking with people who might know, nobody wanted to talk.
foota•3h ago
Something somewhere between a CPU and a GPU, that could handle many parallel streams, but at lower throughput than a CPU, and with very high memory bandwidth for tasks that need to be done against main memory. The niche here is for things like serialization and compression that need lots of bandwidth, can't be done efficiently on the GPU (not parallel), and waste precious time on the CPU.
PaulHoule•3h ago
https://en.wikipedia.org/wiki/UltraSPARC_T1
?
foota•2h ago
I don't remember all the details, but picture a bunch of those attached to different parts of the processor hierarchy remotely, e.g., one per core or one per NUMA node etc.,. The connection between the coprocessor and the processor can be thin, because the processor would just be sending commands to the coprocessor, so they wouldn't consume much of the constrained processor bandwidth, and each coprocessor would have a high bandwidth connection to memory.
saltcured•2h ago
Eventually, it's all full circle to supercomputer versus "hadoop cluster" again. Can you farm out work locally near bits of data or does your algorithm effectively need global scope to "transpose" data and hit bisection bandwidth limits of your interconnect topology.
Veserv•2h ago
DDR5 [2] tops out around 60-80 GB/s. HBM3, used on the H100 GPUs, tops out at 819 GB/s. 10-15x more bandwidth. At a 4 GHz clock, you need to crunch 200 bytes/clock to become memory bandwidth limited.
[1] https://en.wikipedia.org/wiki/High_Bandwidth_Memory
[2] https://en.wikipedia.org/wiki/DDR5_SDRAM
ryao•1h ago
That said, more exotic architectures from cerebras and groq get far less token per second performance than their memory bandwidth suggests they can, so they have a bottleneck elsewhere.
PaulHoule•1h ago
So far as exotic 10x clocked systems based on 3-5 semiconductors, squids, or something, I think memory does have to be packaged with the rest of it. Ecauss of speed of light issues.
markhahn•1h ago