To me, this looks like a win.
Governments are there to finance projects like this, ones that keep certain skill sets in the country that otherwise wouldn't exist because other countries already have better solutions on the global market.
But what governments often can do is break out of the local optimums that cluster around quarterly-earnings thinking, take moonshot chances, and find paths that would otherwise never be taken. Hopefully one of those paths turns out to be great.
The difficult thing becomes deciding when to pull the plug. Is ITER a good thing or not? (Results-wise, it is, but for the money? Who can tell, really.)
Just as it doesn't work to build an ecosystem around one species, a society has to blend government and private spending. They work on different incentives and timeframes, and each has pitfalls the other might handle better.
The FP64 GFLOPS-per-watt metric in the post is almost entirely meaningless for comparing these accelerators with NVIDIA GPUs. For example, it says:
> Hopper H200 is 47.9 gigaflops per watt at FP64 (33.5 teraflops divided by 700 watts)
But if you consider the H100 PCIe [0] instead, you get 26000/350 = 74.29 GFLOPS per watt. Look harder and you can find cards with even better on-paper FP64 performance; the AMD MI300X, for example, has 81.7 TFLOPS with a typical board power of "750W Peak", which gives 108.9 GFLOPS per watt.
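For concreteness, the arithmetic in a couple of lines (just a sketch; the numbers are the spec-sheet figures quoted above and in the post):

```python
# Spec-sheet FP64 throughput per watt for the cards quoted in this thread.
# TDP-based, so only as meaningful as TDP is for an FP64-only workload.
cards = {
    "H200 SXM (33.5 TFLOPS / 700 W)": (33.5, 700),
    "H100 PCIe (26 TFLOPS / 350 W)":  (26.0, 350),
    "MI300X (81.7 TFLOPS / 750 W)":   (81.7, 750),
}

for name, (tflops, watts) in cards.items():
    print(f"{name}: {tflops * 1000 / watts:.1f} GFLOPS/W")
```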
The truth is that the power budget of most GPGPUs is heavily tilted toward tensor-core usage, and that trend started well before the B300.
That's all on the HPC side.
And Pezy processors are certainly not designed for "AI" (i.e. linear algebra at lower input precision). For AI inference, since about 2020 everyone talks about how many T(FL)OPS per watt, not G.
[0] which is a nerfed version of H200's precursor.
These Pezy chips are also made for large clusters. There is a whole system design around the chips that wasn't presented here. The Pezy-SC2, for instance, was built around liquid immersion cooling. I am not sure you could ever buy an air-cooled version.
Is the whole board submersed in liquid? Or just the processor?
"Each immersion tank can contain 16 Bricks. A Brick consists of a backplane board, 32 PEZY-SC2 modules, 4 Intel Xeon D host processors, and 4 InfiniBand EDR cards. Modules inside a Brick are connected by hierarchical PCI Express fabric switches, and the Bricks are interconnected by InfiniBand."
Well that was a disappointing end to a sentence. I was hoping another company would invest a few million in HPC to play SC2!
GPUs are great if your workload can use them, but not so great for more general tasks. These chips are aimed at more traditional supercomputing tasks; that is, they're not optimized for lower-precision AI work the way NVIDIA GPUs are.
> The Hopper H200 is 47.9 gigaflops per watt at FP64 (33.5 teraflops divided by 700 watts), and the Blackwell B200 is rated at 33.3 gigaflops per watt (40 teraflops divided by 1,200 watts). The Blackwell B300 has FP64 severely deprecated at 1.25 teraflops and burns 1,400 watts, which is 0.89 gigaflops per watt. (The B300 is really aimed at low precision AI inference.)
Nevertheless, my point is more that if FP64 performance is poor on purpose, then you're probably not using anywhere near the card's TDP to do FP64 calculations, so FLOPS/watt(TDP) is misleading.
jacquesm•5h ago
I really can't complain. Now, FPGAs, however... And if a company ever comes out and improves substantially on this, I'll be happy for sure, but if you asked me off the bat what they should improve, I honestly wouldn't know, especially taking into account that this was an incremental effort over roughly two decades, one that originated in an industry that has nothing to do with the main use case today, with some detours into unrelated industries besides (crypto, for instance).
Covering fluid dynamics, FEA, crypto, gaming, genetics, AI and many other fields with a single generic architecture, while delivering very good performance, is no mean feat.
I'd love to hear in what way you would improve on their toolset.
programjames•4h ago
1. Memory indexing. It's a pain to avoid bank conflicts and to implement cooperative loading of transposed matrices. To improve this, (1) emit a warning when bank conflicts are detected, and (2) have the compiler solve cooperative loading. It wouldn't be too hard to offer a second form of indexing, memory_{idx}, for which the compiler solves a linear programming problem to maximize throughput (do you spend more thread cycles on cooperative loading, or are bank conflicts fine because you have other work to overlap?).
2. Why is there no warning when shared memory is left uninitialized? It isn't hard to check whether you're reading an index that might never have been assigned a value. The compiler should emit a warning and zero it out, or maybe even just throw an error.
3. Timing - doesn't exist. Pretty much the gold standard is to run your kernel 10_000 times in a loop and subtract the time from before and after the loop (a sketch of that loop, using CUDA events, follows this list). This isn't terribly important, I'm just getting flashbacks to before I learned `timeit` was a thing in Python.
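For reference, roughly what that looks like in practice, as a minimal sketch assuming PyTorch and a CUDA device (the relu is just a stand-in for whatever kernel you actually care about):

```python
import torch

# CUDA events give device-side timestamps around the launches, so you're not
# measuring host-side noise while still using the "big loop" approach.
x = torch.randn(4096, 4096, device="cuda")

for _ in range(10):           # warm up: caches, clocks, first-launch overhead
    torch.relu(x)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

iters = 1000
start.record()
for _ in range(iters):
    torch.relu(x)             # kernel under test
end.record()
torch.cuda.synchronize()      # wait until both events have completed

print(f"{start.elapsed_time(end) / iters:.4f} ms per launch")
```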
jacquesm•3h ago
https://forums.developer.nvidia.com/c/accelerated-computing/...
They regularly have threads asking for such suggestions.
But I don't think those suggestions add up to the general conclusion that the tooling is bad.
numpad0•2h ago
There is a type of research called a traffic survey, which involves hiring a few people with adequate education to sit or stand at an intersection for a whole day and count passing entities by type. YOLO wasn't accurate enough; I have a gut feeling that a vision-enabled LLM would be. That use case doesn't require constant updates or upgrades to the latest NN innovations, so there's no need for full CUDA support, as long as one known-good set of weights works.
CoastalCoder•7h ago
I've only done a little work on CUDA, but I was pretty impressed with it and with their NSys tools.
I'm curious what you wish was different.
saagarjha•6h ago
Of course, one might mention that GPUs are nothing like CPUs, but the programming model works super hard to try to hide this. So it's not really well designed in my book. I actually quite like the compilers people are designing these days for writing block-level code, because I feel they better represent the work people want to do, and then you pick which way you want it lowered.
As for Nsight Systems, it is… OK, I guess? It's fine for games and such, but for HPC or AI it doesn't really surface the information you want. People who run their GPUs really hard already know they have kernels running all the time and what those kernels' performance characteristics are. Nsight Compute is the tool that's supposed to tell you that, but it's kind of a mediocre profiler (some of this may be limitations of the hardware performance counters), and to use it effectively you basically have to read a bunch of blog posts instead of the official documentation.
Despite not having used it much, my impression is that Nvidia's "moat" is that they have good networking libraries, that they are pretty good (relatively) at making sure all their tools work, and that they have had consistent investment in this for a decade.
electroglyph•6h ago
jandrewrogers•2h ago
The wide vectors on GPUs are somewhat irrelevant. Scalar barrel processors exist and have the same issues. A scalar barrel processor feels deceptively CPU-like and will happily compile and run normal CPU code. The performance will nonetheless be poor unless the C++ code is designed to be a good fit for the nature of a barrel processor, code which will look weird and non-idiomatic to someone who has only written code for CPUs.
There is no way to hide that a barrel processor is not a CPU even though they superficially have a lot of CPU-like properties. A barrel processor is extremely efficient once you learn to write code for them and exceptionally well-suited to HPC since they are not latency-sensitive. However, most people never learn how to write proper code for barrel processors.
Ironically, barrel processor style code architecture is easy to translate into highly optimized CPU code, just not the reverse.
DrNosferatu•6h ago
I seriously doubt that: free hardware (or tens of bucks) would galvanize the community and win huge support. Look at the Raspberry Pi project's original prices and the consequences.
DrNosferatu•2h ago
Say, release it as extensions to a RISC-V design.
londons_explore•5h ago
I suspect, though, that whenever you look like you're making good progress on this front, Nvidia gives you a lot of chips for free on the condition that you shelve the effort!
The latest example is Tesla, who were designing their own hardware and software stack for NN training, then suspiciously got huge numbers of H100s ahead of other clients and cancelled the Dojo effort.
AlotOfReading•3h ago
To combat all of these issues, they were fighting with Nvidia (and losing) for access to leading edge nodes, which kept going up in price. Their personnel costs kept rising as the company became more politicized, people left to join other companies (e.g. densityai), and they became embroiled in the salary wars to replace them.
My suspicion is that Musk told them to just buy Nvidia instead of waiting around for years of slow iteration to get something competitive.
The custom silicon I was involved with experienced similar issues. It was too expensive and slow to try competing with Nvidia, and no one could stomach the costs to do so.
nromiun•4h ago
If people use PyTorch on an Nvidia GPU, they are running layers and layers of code written by people who know how to write fast kernels for GPUs. In some cases they drop down to assembly as well.
Nvidia stuck to one stack and wrote all their high-level libraries on top of it, while their competitors kept switching from old APIs to new ones and never made anything close to CUDA.
woooooo•3h ago
CUDA and cuBLAS being capable of a bunch of other things is really cool, and would take a long time to catch up with, but getting the bare minimum to run LLMs on any platform with a bunch of GDDR7 channels and cores at a reasonable price would have people writing torch/ggml backends within weeks.
nromiun•3h ago
Here is an example of how hard it is: https://siboehm.com/articles/22/CUDA-MMM
And this is just basic matrix multiplication. If you add activation functions it slows down even more. There is nothing easy about GPU programming if you care about performance. CUDA gives you all that optimization on a plate (see the rough comparison below).
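To make the gap concrete, a rough sketch assuming PyTorch and a CUDA device (sizes and iteration counts are arbitrary): the "obvious" formulation versus torch.matmul, which dispatches to tuned cuBLAS kernels.

```python
import time
import torch

N = 256
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(N, N, device=device)
b = torch.randn(N, N, device=device)

def naive_matmul(a, b):
    # One dot product per output element, materializing an N*N*N temporary:
    # no tiling, no shared-memory reuse, terrible arithmetic intensity.
    return (a.unsqueeze(2) * b.unsqueeze(0)).sum(dim=1)

def bench(fn, *args, iters=20):
    fn(*args)                      # warm-up launch
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print(f"naive broadcast-and-sum: {bench(naive_matmul, a, b) * 1e3:.3f} ms")
print(f"torch.matmul (cuBLAS):   {bench(torch.matmul, a, b) * 1e3:.3f} ms")
```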
woooooo•3h ago
I'm saying the API surface of what to offer for LLMs is pretty small. Yeah, optimizing it is hard but it's "one really smart person works for a few weeks" hard, and most of the tiling techniques are public. Speaking of which, thanks for that blog post, off to read it now.
kadushka•44m ago
AMD should hire that one really smart person.
numpad0•2h ago
Another one of these I still sometimes think about is the NEC Vector Engine: 5 TFLOPS FP32 with 48GB of HBM2 at 1.5TB/s of bandwidth for $10k in 2020. That was within a digit or two of NVIDIA at basically the same price. But they then didn't capitalize on it, and just kept delivering to national institutes in a ritualistic manner.
I do have a basic conceptual understanding of these grant businesses, and vague intuitions about how the bureaucracy wants substantial capital investment and report files without commercial capitalization, with emphasis on the last part, since commercialization would disrupt internal politics inside government agencies and also create unfair government competitive pressure on civilian sectors. But at some point it starts looking like a cash campfire. I don't know exactly how slow M4 Mac Studios are relative to NVIDIA Tesla clusters normalized for VRAM, but they're considered comparable regardless, just because they run LLMs at 10-20 tok/s. So it's just unfortunate that these accelerators, of basically the same nature as M-series CPUs, are built, kept idle, and then recycled.
The one that sticks in my mind as "no way these brochure figures are real" is the PFN MN-Core, though it looks like they might be doing an LLM-specific variant in the future. Hopefully they sell it at retail.