Basic Facts about GPUs

https://damek.github.io/random/basic-facts-about-gpus/

344•ibobev•7mo ago

Comments

kittikitti•7mo ago

This is a really good introduction and I appreciate it. When I was building my AI PC, the deep-dive research into GPUs took a few days, but this lays it all out in front of me. It's especially great because it touches on high-value applications like generative artificial intelligence. A notable diagram from the page that I wasn't able to find represented well elsewhere was the memory hierarchy of the A100 GPUs. The diagrams were very helpful. Thank you for this!

b0a04gl•7mo ago

been running llama.cpp and vllm on same 4070, trying to batch more prompts for serving. llama.cpp was lagging bad once I hit batch 8 or so, even though GPU usage looked fine. vllm handled it way better.

later found vllm uses a paged kv cache with a layout that matches how the GPU wants to read: fully coalesced, without strided jumps. llama.cpp was using a flat layout that’s fine for a single prompt but breaks L2 access patterns when batching.

reshaped kv tensors in llama.cpp to interleave; made it [head, seq, dim] instead of [seq, head, dim], closer to how vllm feeds data into its fused attention kernel. 2x speedup right there for the same ops.

GPU was never the bottleneck. it was memory layout not aligning with SM’s expected access stride. vllm just defaults to layouts that make better use of shared memory and reduce global reads. that’s the real reason it scales better per batch.
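To make the layout point concrete, here is a minimal sketch of the index arithmetic behind the two layouts named above. The tensor sizes are hypothetical, and the code only computes row-major strides; it does not reproduce how llama.cpp or vllm actually store their KV caches.

```python
# Hypothetical sizes, chosen only for illustration.
SEQ, HEADS, DIM = 1024, 8, 128


def strides(shape):
    """Row-major (C-order) element strides for a dense tensor of this shape."""
    out, step = [], 1
    for extent in reversed(shape):
        out.append(step)
        step *= extent
    return tuple(reversed(out))


seq_head_dim = strides((SEQ, HEADS, DIM))  # layout [seq, head, dim]
head_seq_dim = strides((HEADS, SEQ, DIM))  # layout [head, seq, dim]

# Distance, in elements, between token t and token t+1 for one fixed head.
token_step_a = seq_head_dim[0]  # seq is axis 0 in [seq, head, dim]
token_step_b = head_seq_dim[1]  # seq is axis 1 in [head, seq, dim]

print(token_step_a)  # 1024: other heads' data sits between consecutive tokens
print(token_step_b)  # 128: one head's tokens are packed back to back
```

With [head, seq, dim], a kernel that walks one head across the sequence reads one contiguous run of memory; with [seq, head, dim], the same walk jumps HEADS * DIM elements per token.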

this took its own time, say 2+ days, and I had to dig under the nice-looking GPU graphs to find the real bottlenecks. it was wildly trial and error tbf.

> anybody got idea on how to do this kinda experiment in hot reload mode without so much hassle??

jcelerier•7mo ago

did you do a PR to integrate these changes back into llama.cpp? 2x speedup would be absolutely wild

zargon•7mo ago

Almost nobody using llama.cpp does batch inference. I wouldn’t be surprised if the change is somewhat involved to integrate with all of llama.cpp’s other features. Combined with lack of interest and keeping up with code churn, that would probably make it difficult to get included, with the number of PRs the maintainers are flooded with.

tough•7mo ago

if you open a PR, even if it doesn't get merged, anyone with the same issue can find it, and use your PR/branch/fix if it suits their needs better than master

zargon•7mo ago

Yeah good point. I have applied such PRs myself in the past. Eventually the code churn can sometimes make it too much of a pain to maintain them, but they’re useful for a while.

buildxyz•7mo ago

Any speed up that is 2x is definitely worth fixing. Especially since someone has already figured out the issue and performance testing [1] shows that llamacpp* is lagging behind vLLM by 2x. This is a positive for all running LLMs locally using llamacpp.

Even if llamacpp isn't used for batch inference now, this could finally let people run llamacpp for batching, and on any hardware, since vLLM supports only select hardware. Maybe we can finally stop all this GPU API software fragmentation and the CUDA moat, as llamacpp benchmarks have shown Vulkan to be as performant as, or more performant than, CUDA or SYCL.

[1] https://miro.medium.com/v2/resize:fit:1400/format:webp/1*lab...

menaerus•7mo ago

So, what exactly is batch inference workload and how would someone running inference on local setup benefit from it? Or how would I even benefit from it if I had a single machine hosting multiple users simultaneously?

I believe batching is a concept only useful during training or fine-tuning.

zargon•7mo ago

Batch inference is just running multiple inferences simultaneously. If you have simultaneous requests, you’ll get incredible performance gains, since a single inference doesn’t leverage any meaningful fraction of a GPU’s compute capability.

For local hosting, a more likely scenario where you could use batching is if you had a lot of different data you wanted to process (lots of documents or whatever). You could batch them in sets of x and have it complete in 1/x the time.

A less likely scenario is having enough users that you can make the first user wait a few seconds while you wait to see if a second user submits a request. If you do get a second request, then you can batch them and the second user will get their result back much faster than if they had had to wait for the first user’s request to complete first.

Most people doing local hosting on consumer hardware won’t have the extra VRAM for the KV cache for multiple simultaneous inferences though.
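A rough way to see the VRAM cost: the KV cache stores one key and one value vector per layer, per KV head, per token, per request. The sketch below applies that standard formula to a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, FP16); the numbers are assumptions for illustration, not measurements of any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes needed for the KV cache: the factor 2 covers keys plus values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model shape and context length (assumptions, not a real spec).
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

one_request = kv_cache_bytes(batch=1, **shape)
eight_requests = kv_cache_bytes(batch=8, **shape)

print(one_request / GIB)     # 1.0
print(eight_requests / GIB)  # 8.0
```

The cache grows linearly with the number of simultaneous requests, on top of the memory the weights already occupy.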

menaerus•7mo ago

Wouldn't batching the multiple inference requests from multiple different users with multiple different contexts simultaneously impact the inference results for each of those users?

pests•7mo ago

The different prompts being batched do not mathematically affect each other. When running inference you have massive weights that need to get loaded and unloaded just to serve the current prompt and however long its context is (maybe just a few tokens, even). This batching lets you manipulate and move the weights around less to serve the same amount of combined context.

menaerus•7mo ago

Batching isn't about "moving weights around less". Where do you move the weights anyway once they are loaded into the GPU VRAM? Batching, as always in CS problems, is about maximizing the compute for a unit of a single round trip, and in this case DMA-context-from-CPU-RAM-to-GPU-VRAM.

Self attention premise is exactly that it isn't context free so it is also incorrect to say that batched requests do not mathematically affect each other. They do, and that's by design.

zargon•7mo ago

> Where do you move the weights anyway once they are loaded into the GPU VRAM?

The GPU can’t do anything with weights while they are in VRAM. They have to be moved into the GPU itself first.

So it is about memory round-trips, but not between RAM and VRAM. It’s the round trips between the VRAM and the registers in the GPU die. When batch processing, the calculations for all batched requests can be done while the model parameters are in the GPU registers. If they were done sequentially instead, you would multiply the number of trips between the VRAM and the GPU by the number of individual inferences.

Also, batched prompts and outputs are indeed mathematically independent from each other.

menaerus•7mo ago

Round-trip between VRAM and GPU registers? That's what the cache hierarchies are for. I think you confused quite a few concepts here.

Moving data to and from VRAM is ~100ns of latency. Moving data from RAM to VRAM through PCIe 5.0 is 1-10us of latency. So, ~1 to ~2 orders of magnitude of difference.

And this is the reason why batching is used - you don't want to pay the price of that latency for each and every CPU-to-GPU request but you want to push as much data as you can through a single round-trip.

pests•7mo ago

> That's what the cache hierarchies are for

That’s the core point though. If you do batches the cache and registers are already primed and ready. The model runs in steps/layers accessing different weights in VRAM along the way. When batching you take advantage of this.

I’m in agreement that RAM to VRAM is important too but I feel the key speed up for inference batching is my above point.

menaerus•7mo ago

Not really. Registers are irrelevant. They are not the bottleneck.

pests•7mo ago

Computation happens in the registers. If you’re not moving data to registers you aren’t doing any compute.

menaerus•7mo ago

Obviously yes but NVIDIA Ampere/Hopper architecture has 64k 32-bit registers per SM. A100 has 108 SMs and H100 has 132 SMs so go figure - registers aren't a bottleneck.
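For scale, the register figures quoted above work out as follows. This is plain arithmetic on the numbers in the comment (64k 32-bit registers per SM, 108 and 132 SMs); it says nothing by itself about where the bottleneck is.

```python
REGISTERS_PER_SM = 64 * 1024   # 64k registers per SM, as quoted above
BYTES_PER_REGISTER = 4         # 32-bit registers

KIB = 1024
MIB = 1024 ** 2

bytes_per_sm = REGISTERS_PER_SM * BYTES_PER_REGISTER


def register_file_bytes(sm_count):
    """Total register file capacity across all SMs, in bytes."""
    return bytes_per_sm * sm_count


print(bytes_per_sm / KIB)              # 256.0 KiB per SM
print(register_file_bytes(108) / MIB)  # 27.0 MiB on an A100
print(register_file_bytes(132) / MIB)  # 33.0 MiB on an H100
```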

hexaga•7mo ago

Model weights are significantly larger than cache in almost all cases. Even an 8B parameter model is ~16G in half precision. The caches are not large enough to actually cache that.

Every weight has to be touched for every forward pass, meaning you have to wait for 16G to transfer from VRAM -> SRAM -> registers. That's not even close to 100ns: on a 4090 with ~1TB/s memory bandwidth that's 16 milliseconds. PCIe latency to launch kernels or move 20 integers or whatever is functionally irrelevant on this scale.

The real reason for batching is it lets you re-use that gigantic VRAM->SRAM transfer across the batch & sequence dimensions. Instead of paying a 16ms memory tax for each token, you pay it once for the whole batched forward pass.
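The arithmetic in this comment can be sketched directly. The model below assumes the forward pass is purely memory-bound and that the weights are streamed from VRAM exactly once per pass; it ignores KV-cache reads, activations, and the point at which a large batch becomes compute-bound.

```python
PARAMS = 8_000_000_000         # 8B parameters
BYTES_PER_PARAM = 2            # half precision
BANDWIDTH = 1_000_000_000_000  # ~1 TB/s, the 4090 figure quoted above

weight_bytes = PARAMS * BYTES_PER_PARAM
pass_ms = weight_bytes * 1000 / BANDWIDTH  # time to stream the weights once


def ms_per_token(batch):
    """Weight-streaming cost per generated token when `batch` sequences
    share one forward pass (simplified, memory-bound assumption)."""
    return pass_ms / batch


print(pass_ms)           # 16.0
print(ms_per_token(1))   # 16.0
print(ms_per_token(8))   # 2.0
```

Under these assumptions the 16 ms cost is fixed per pass, so each extra sequence in the batch lowers the per-token share of it.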

menaerus•7mo ago

You've made several incorrect assumptions and I am not bothered enough to try to correct them so I apologize for my ignorance. I'll just say that 16ms memory tax is wildly incorrect.

namibj•7mo ago

You are either having a massive misconception of GPT-like decoder transformers, of how GPU data paths are architected, or are trolling. Go talk to a modern reasoning model to get yourself some knowledge, it's gonna be much better than what you appear to have.

menaerus•7mo ago

Why are you mad? Take a chill pill dude

spwa4•7mo ago

If you add a dimension to the input vector you can do them independently and more efficiently. Look at this. Let's say you have a 2x2 network, and you apply it to an input vector of two values:

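$$\begin{pmatrix} i_1 & i_2 \end{pmatrix} \cdot \begin{pmatrix} w_1 & w_2 \\ w_3 & w_4 \end{pmatrix} = \begin{pmatrix} i_1 w_1 + i_2 w_3 & i_1 w_2 + i_2 w_4 \end{pmatrix}$$

Cool. Now what happens if we make the input vector a 2x2 matrix with, for some reason, a second set of two input values:

$$\begin{pmatrix} i_1 & i_2 \\ j_1 & j_2 \end{pmatrix} \cdot \begin{pmatrix} w_1 & w_2 \\ w_3 & w_4 \end{pmatrix} = \begin{pmatrix} i_1 w_1 + i_2 w_3 & i_1 w_2 + i_2 w_4 \\ j_1 w_1 + j_2 w_3 & j_1 w_2 + j_2 w_4 \end{pmatrix}$$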

Look at that! The input has 2 rows, each row has an input value for the network and the output matrix has 2 rows, each containing the outputs for the respective inputs. So you can "just" apply your neural network to any number of input values by just putting one to each row. You could do 2, or 1000 this way ... and a number of values would only need to be calculated once.
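A small check of the same idea in code, using an arbitrary 2x2 weight matrix and two made-up input rows: stacking the inputs as rows and multiplying once gives exactly the rows you would get from multiplying each input separately. This covers only a single linear layer; it is a sketch of the row-independence argument, not a full network.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [
        [sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
        for r in range(len(a))
    ]


W = [[1, 2],
     [3, 4]]   # arbitrary example weights
i = [5, 6]     # first input
j = [7, 8]     # second input

single_i = matmul([i], W)    # one input at a time
single_j = matmul([j], W)
batched = matmul([i, j], W)  # both inputs stacked as rows

print(single_i)  # [[23, 34]]
print(single_j)  # [[31, 46]]
print(batched)   # [[23, 34], [31, 46]]
```

Each output row depends only on its own input row, so the batch members do not mix, while the weights are read once for the whole batch.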

zozbot234•7mo ago

It depends, if the optimization is too hardware-dependent it might hurt/regress performance on other platforms. One would have to find ways to generalize and auto-tune it based on known features of the local hardware architecture.

amelius•7mo ago

Yes, the easiest is to separate it into a set of options. Then have a bunch of JSON/YAML files, one for each hw configuration. From there, the community can fiddle with the settings and share new settings if new hardware is released.
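One possible shape for such per-hardware settings, as a sketch only: the profile names, keys, and values below are all invented for illustration and do not correspond to any real llama.cpp or vllm option.

```python
import json

# Hypothetical tuning profiles; in practice each could live in its own file.
PROFILES_JSON = """
{
  "default":       {"kv_layout": "seq_head_dim", "max_batch": 1},
  "example-gpu-a": {"kv_layout": "head_seq_dim", "max_batch": 8}
}
"""


def load_profile(hardware_id, text=PROFILES_JSON):
    """Return the default settings, overridden by any hardware-specific ones."""
    profiles = json.loads(text)
    merged = dict(profiles["default"])
    merged.update(profiles.get(hardware_id, {}))
    return merged


print(load_profile("example-gpu-a"))  # tuned settings for a known device
print(load_profile("unknown-gpu"))    # falls back to the defaults
```

Falling back to a default profile means unknown hardware keeps the current behaviour instead of picking up a tuning that might regress it.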

tough•7mo ago

did you see yesterday nano-vllm [1] from a deepseek employee 1200LOC and faster than vanilla vllm?

1. https://github.com/GeeeekExplorer/nano-vllm

Gracana•7mo ago

Is it faster for large models, or are the optimizations more noticeable with small models? Seeing that the benchmark uses a 0.6B model made me wonder about that.

tough•7mo ago

I have not tested it, but it's from a deepseek employee. i don't know if it's used in prod there or not!

leeoniya•7mo ago

try https://github.com/ikawrakow/ik_llama.cpp

chickenzzzzu•7mo ago

> GPU was never the bottleneck
> it was memory layout

ah right so the GPU was the bottleneck then

CardenB•7mo ago

No because he was able to achieve the speedup without changing the GPU.

chickenzzzzu•7mo ago

A more technically correct way to express this feeling is:

"The computational power of the cores on the GPU was never the issue-- however the code that I wrote resulted in a memory bandwidth bottleneck that starved the GPU cores of data to work on, which is firmly within my responsibilities as a programmer -- to fully understand the bandwidth and latency characteristics of the device(s) i'm running on"

saagarjha•7mo ago

I mean they didn't write the code

chickenzzzzu•7mo ago

And that's the reason why they misspoke

SoftTalker•7mo ago

Contrasting colors. Use them!

jasonjmcghee•7mo ago

If the author stops by- the links and the comments in the code blocks were the ones that I had to use extra effort to read.

It might be worth trying to increase the contrast a bit.

The content is really great though!

cubefox•7mo ago

The website seems to use alpha transparency for text. A grave, contrast-reducing sin.

xeonmc•7mo ago

It’s just liquid-glass text and you’ll get used to it soon enough.

currency•7mo ago

The author might be formatting for and editing in dark mode. I use edge://flags/#enable-force-dark and the links are readable.

Yizahi•7mo ago

font-weight: 300;

I'm 99% sure that the author designed this website on an Apple Mac with so-called "font smoothing" enabled, which makes all regular fonts artificially "semi-bold". So to make a normal-looking font, Mac designers use this thinner font weight and then Apple helpfully makes it kinda "normal".

https://news.ycombinator.com/item?id=23553486

neuroelectron•7mo ago

Jfc

elashri•7mo ago

Good article summarizing a good chunk of information that people should have some idea about. I just want to comment that the title is a little bit misleading, because this is about the specific choices NVIDIA makes in developing its GPU architectures, which are not always what others do.

For example, the arithmetic intensity break-even point (ridge point) is very different once you leave NVIDIA-land. Take the AMD Instinct MI300: up to 160 TFLOPS FP32 paired with ~6 TB/s of HBM3/3E bandwidth gives a ridge point near 27 FLOPs/byte, about double the A100’s 13 FLOPs/byte. The larger on-package HBM (128 – 256 GB) also shifts the practical trade-offs between tiling depth and occupancy. That said, it is very expensive and does not have CUDA (which can be good and bad at the same time).
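The ridge point is just peak compute divided by memory bandwidth. The sketch below reproduces the two figures in this comment. The MI300 inputs are the ones quoted above; the A100 inputs are 19.5 TFLOPS FP32 and an assumed ~1.5 TB/s of bandwidth, which is the value that yields 13 FLOPs/byte.

```python
def ridge_point(peak_tflops, bandwidth_tb_s):
    """Arithmetic intensity (FLOPs per byte) at which a kernel stops being
    memory-bound: TFLOP/s divided by TB/s."""
    return peak_tflops / bandwidth_tb_s


mi300 = ridge_point(peak_tflops=160, bandwidth_tb_s=6)
a100 = ridge_point(peak_tflops=19.5, bandwidth_tb_s=1.5)  # bandwidth is an assumption

print(round(mi300, 1))         # 26.7
print(round(a100, 1))          # 13.0
print(round(mi300 / a100, 2))  # 2.05
```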

apitman•7mo ago

Unfortunately Nvidia GPUs are the only ones that matter until AMD starts taking their computer software seriously.

fooblaster•7mo ago

They are. It's just not at the consumer hardware level.

have-a-break•7mo ago

You could argue it's all the nice GPU debugging tools nVidia provides which makes GPU programming accessible.

There are so many potential bottlenecks (normally just memory access patterns, but without tools to verify you have to design and run manual experiments).

tucnak•7mo ago

This misconception is repeated time and time again; software support of their datacenter-grade hardware is just as bad. I've had the displeasure of using MI50, MI100 (a lot), MI210 (very briefly.) All three are supposedly enterprise-grade computing hardware, and yet, it was a pathetic experience with a myriad of disconnected components which had to be patched, & married with a very specific kernel version to get ANY kind of LLM inference going.

Now, the last of it I bothered with was 9 months ago; enough is enough.

fooblaster•7mo ago

this hardware is ancient history. mi250 and mi300 are much better supported

tucnak•7mo ago

What a load of nonsense. MI210 effectively hit the market in 2023, similarly to H100. We're talking about datacenter-grade, two-year out of date card, and it's already "ancient history?"

No wonder nobody on this site trusts AMD.

bluescrn•7mo ago

Unless you're, you know, using GPUs for graphics...

Xbox, Playstation, and Steam Deck seem to be doing pretty nicely with AMD.

MindSpunk•7mo ago

The quantity of people on this site now that care about GPUs all of a sudden because of the explosion of LLMs, who fail to understand that GPUs are _graphics_ processors that are designed for _graphics_ workloads is insane. It almost feels like the popular opinion here is that graphics is just dead and AMD and NVIDIA should throw everything else they do in the bin to chase the LLM bag.

AMD make excellent graphics hardware, and the graphics tools are also fantastic. AMD's pricing and market positioning can be questionable but the hardware is great. They're not as strong with machine learning tasks, and they're in a follower position for tensor acceleration, but for graphics they are very solid.

almostgotcaught•7mo ago

The quantity of people on this site now that think they understand modern GPUs because back in the day they wrote some opengl...

1. Both AMD and NVIDIA have "tensorcore" ISA instructions (ie real silicon/data-path, not emulation) which have zero use case in graphics

2. Ain't no one playing video games on MI300/H100 etc and the ISA/architecture reflects that

> but for graphics they are very solid.

Hmmm I wonder if AMD's overfit-to-graphics architectural design choices are a source of friction as they now transition to serving the ML compute market... Hmmm I wonder if they're actively undoing some of these choices...

MindSpunk•7mo ago

AMD isn't overfit to graphics. AMD's GPUs were friendly to general purpose compute well before Nvidia was. Hardware-wise anyway. AMD's memory access system and resource binding model was well ahead of Nvidia for a long time. When Nvidia was stuffing resource descriptors into special palettes with addressing limits, AMD was fully bindless under the hood. Everything was just one big address space, descriptors and data.

Nvidia 15 years ago was overfit to graphics. Nvidia just made smarter choices, sold more hardware and re-invested their winnings into software and improving their hardware. Now they're just as good at GPGPU with a stronger software stack.

AMD has struggled to be anything other than a follower in the market and has suffered quite a lot as a result. Even in graphics. Mesh shaders in DX12 was the result of NVIDIA dictating a new execution model that was very favorable to their new hardware while AMD had already had a similar (but not perfectly compatible) system since the Vega called primitive shaders.

averne_•7mo ago

Matrix instructions do of course have uses in graphics. One example of this is DLSS.

Agentlien•7mo ago

This feels backwards to me when GPUs were created largely because graphics needed lots of parallel floating point operations, a big chunk of which are matrix multiplications.

When I think of matrix multiplication in graphics I primarily think of transforms between spaces: moving vertices from object space to camera space, transforming from camera space to screen space, ... This is a big part of the math done in regular rendering and needs to be done for every visible vertex in the scene - typically in the millions in modern games.

I suppose the difference here is that DLSS is a case where you primarily do large numbers of consecutive matrix multiplications with little other logic, since it's more ANN code than graphics code.
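As a small illustration of the space transforms described above, here is one 4x4 matrix applied to a couple of homogeneous vertices. The matrix is a made-up translation standing in for an object-to-camera transform; a real pipeline would run this on the GPU for every vertex, typically with rotation and projection composed in.

```python
def transform(matrix, vertex):
    """Multiply a 4x4 row-major matrix by a homogeneous vertex (x, y, z, w)."""
    return tuple(
        sum(matrix[r][c] * vertex[c] for c in range(4)) for r in range(4)
    )


# Hypothetical transform: translate by (10, 0, -5).
TRANSLATE = [
    [1, 0, 0, 10],
    [0, 1, 0, 0],
    [0, 0, 1, -5],
    [0, 0, 0, 1],
]

vertices = [(0, 0, 0, 1), (1, 2, 3, 1)]
moved = [transform(TRANSLATE, v) for v in vertices]

print(moved)  # [(10, 0, -5, 1), (11, 2, -2, 1)]
```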

lomase•7mo ago

Imagine thinking you know more than others because you use a different abstraction layer.

_carbyau_•7mo ago

Just having fun with an out of context quote.

> graphics is just dead and AMD and NVIDIA should throw everything else they do in the bin to chase the LLM bag

No graphics means that games of the future will be like:

"You have been eaten by a ClautGemPilot."

fooblaster•7mo ago

my experience with the mi300 does not mirror yours. If I have a complaint, it's that its performance does not live up to expectations.

tucnak•7mo ago

Unfortunately, GPU's are old news now. When it comes to perf/watt/dollar, TPU's are substantially ahead for both training and inference. There's a sparsity disadvantage with the trailing-edge TPU devices such as v4 but if you care about large-scale training of any sort, it's not even close. Additionally, Tenstorrent p300 devices are hitting the market soon enough, and there's lots of promising stuff coming on the Xilinx side of the AMD shop: the recent Versal chips allow for AI compute-in-network capabilities that puts NVIDIA Bluefield's supposed programmability to shame. NVIDIA likes to say Bluefield is like a next-generation SmartNIC, but compared to actually field-programmable Versal stuff, it's more like 100BASE-T cards from the 90s.

I think it's very naive to assume that GPU's will continue to dominate the AI landscape.

menaerus•7mo ago

So, where does one buy a TPU?

tucnak•7mo ago

The actual lead times on similarly-capable GPU systems are so long, by the time your order is executed, you're already losing money. Even assuming perfect utilization, and perfect after-market conditions—you won't be making any money on the hardware anyway.

Buy v. rent calculus is only viable if there's no asymmetry between the two. Oftentimes, what you can rent you cannot buy, and vice-versa, what you can buy—you could never rent. Even if you _could_ buy an actual TPU, you wouldn't be able to run it anyway, as it's all built around sophisticated networking and switching topologies[1]. The same goes for GPU deployments of comparable scale: what made you think that you could buy and run GPU's at scale?

It's a fantasy.

[1] https://arxiv.org/abs/2304.01433

almostgotcaught•7mo ago

Is your answer to "where can I buy a TPU" that you can't buy a GPU either? That's a new one.

First of all I don't understand how that's an answer. Second of all it's laughably wrong - I can name 5 firms (outside of FAANG) off the top of my head with >1k Blackwell devices and they're making very good money (have you ever heard of quantfi....). Third of all, how is TPU going to conquer absolutely anything when (as you admit) you couldn't run one even if you could buy one?

tucnak•7mo ago

I'd never claimed that "TPU is going to conquer everything," it's a matter of fact that the latest-generation TPU is currently the most cost-effective solution for large-scale training. I'm not even saying that NVIDIA has lost, just that GPU's have lost. Maybe NVIDIA comes up with a non-GPU based system, and it includes programmable fabric to enable compute-in-network capabilities, sure, anything other than Bluefield nonsense, but it's already clear from the engineering standpoint that the large HBM-stacks attached to a "GPU"+Bluefield formula is over.

almostgotcaught•7mo ago

> NVIDIA has lost, just that GPU's have lost

i hope you realize how silly you sound when

1. NVDA's market cap is 70% more than GOOG's

2. there is literally not a single other viable competitor to GPGPU amongst the 30 or so "accelerator" companies that all swear their thing will definitely be the one, even with many of them approaching 10 years in the market by now (cerebras, samba nova, groq, dmatrix, blah blah blah).

menaerus•7mo ago

Right. Your argument doesn't really follow. Since I cannot buy a TPU, which you agree with, then a single viable option is really only a GPU, which I _can_ buy.

So, according to that, GPUs aren't really going anywhere unless there's a new player in town who will compete with Nvidia and sell at lower prices.

almostgotcaught•7mo ago

> Unfortunately, GPU's are old news now

...

> the recent Versal chips allow for AI compute-in-network capabilities that puts NVIDIA Bluefield's supposed programmability to shame

I'm always just like... who are you people. Like what is the profile of a person that just goes around proclaiming wild things as if they're completely established. And I see this kind of comment on hn very frequently. Like you either work for Tenstorrent or you're an influencer or a zdnet presenter or just ... because none of this is even remotely true.

Reminds me of

"My father would womanize; he would drink. He would make outrageous claims like he invented the question mark. Sometimes, he would accuse chestnuts of being lazy."

> I think it's very naive to assume that GPU's will continue to dominate the AI landscape

I'm just curious - how much of your portfolio is AMD and how much is NVDA and how much is GOOG?

timeinput•7mo ago

Listen, I'm ~~not~~ all in on Ferrero Rocher, and chestnuts *are* lazy. Nowhere near as productive as hazelnuts.

tucnak•7mo ago

> I'm just curious - how much of your portfolio is AMD

I'm always just like... who are you people: financiers, or hackers? :-) I don't work for TT, but I am a founder in the vertical AI space. Firstly, every major player is making AI accelerators of their own now, and guess what, most state-of-the-art designs have very little in common with a GPGPU design of yester-year. We have thoroughly evaluated various options, including buying/renting NVIDIA hardware; unfortunately, it didn't make any sense—neither in terms of cost, nor capability. Buying (and waiting _months_ for) NVIDIA rack-fuls is the quickest way to bankrupt your business with CAPEX. Renting the same hardware is merely moving the disease to OPEX, and in post-ZIRP era this is equally devastating.

No matter how much HBM memory you get for whatever individual device, no matter the packaging—it's never going to be enough. The weights alone are quickly dwarfed by K/V cache pages anyway. This is doubly true, if you're executing highly-concurrent agents that share a lot of the context, or doing dataset-scale inference transformations. The only thing that matters, truly, is the ability to scale-out, meaning fabrics, RDMA over fabrics. Even the leading-edge GPU systems aren't really good at it, because none of the interconnect is actually programmable.

The current generation of TT cards (7nm) has four 800G NIC's per card, and the actual Blackhole chips[1] support up to 12x400G. You can approach TT, they will license you the IP, and you get to integrate it at whatever scale you please (good luck even getting in a room with Arm people!) and because TT's whole stack is open source, you get to "punch in" whatever topology you want[2]. In other words, at least with TT you would get a chance to scale-out without bankrupting your business.

The compute hierarchy is fresh and in line with the latest research, their toolchain is as hackable as it gets, and stands multiple heads above anything that AMD or Intel had ever released. Most importantly, because TT is currently under-valued, it presents an outstanding opportunity for businesses like ours in navigating around the established cost-centers. For example, TT still offers "Galaxy" deployments which used to contain 32 previous-generation (Wormhole) devices in a 6U air-cooled chassis. It's not a stretch that a similar setup, composed of 32 liquid-cooled Blackholes (2 TB GDDR6, 100 Tbps interconnect) would fit in a 4U chassis. AFAIK, there's no GPU deployment in the world at that density. Similarly to TPU design, it's also infinitely scalable by means of 3+D twisted torus topologies.

What's currently missing in the TT ecosystem: (1) the "superchip" package including state of the art CPU cores, like TT-Ascalon, that they would also happily license to you, and perhaps more importantly, (2) compute-in-network capability, so that the stupidly-massive TT interconnect bandwidth could be exploited/informed by applications.

Firstly, the Grendel superchip is expected to hit the market by the end of next year.

Secondly, because the interconnect is not some proprietary bullshit from Mellanox, you get to introduce the programmable-logic NIC's into the topology, and maybe even avoid IP encapsulation altogether! There are many reasons to do so, and indeed, Versal FPGA's have lots to offer in terms of hard IP in addition to PL. K/V cache management with offloading to NVMe-oF clusters, prefix-matching, reshaping, quantization, compression, and all the other terribly-parallel tasks which are basically intractable for anything other than FPGA's.

Today, if we wanted to do a large-scale training run, we would simply go for the most cost-effective option available at scale, which is renting TPU v6 from Google. This is a temporary measure, if anything, because compute-in-network in AI deployments is still a novelty, and nobody can really do it at sufficiently-large scale yet. Thankfully, Xilinx is getting there[3]. AWS offers f1 instances, it does offer NVMe-accelerated ones, as well as AI accelerators, but there's a good reason they're unable to offer all three at the same time.

[1] https://riscv.epcc.ed.ac.uk/assets/files/hpcasia25/Tenstorre...

[2] https://github.com/tenstorrent/tt-metal/blob/main/tech_repor...

[3] https://www.amd.com/en/products/accelerators/alveo/v80.html

_zoltan_•7mo ago

comparing the MI300 to the A100 is not a fair comparison. they aren't of the equivalent generation.

MI250 to A100 would be a fair comparison; those are very similar.

eapriv•7mo ago

Spoiler: it’s not about how GPUs work, it’s about how to use them for machine learning computations.

oivey•7mo ago

It’s a pretty standard run down of CUDA. Nothing to do with ML other than using relu in an example and mentioning torch.

neuroelectron•7mo ago

ASCII diagrams, really?

LarsDu88•7mo ago

Maybe this should be titled "Basic Facts about Nvidia GPUs" as the WARP terminology is a feature of modern Nvidia GPUs.

Again, I emphasize "modern"

An NVIDIA GPU from circa 2003 is completely different and has baked in circuitry specific to the rendering pipelines used for videogames at that time.

So most of this post is not quite general to all "GPUs", which is a much broader category of devices that don't necessarily encompass the type of general purpose computation we use modern Nvidia GPUs for.

_zoltan_•7mo ago

2003 was 22 years ago. those GPUs are either in a landfill or a museum by now

LarsDu88•7mo ago

By limiting your definition of GPU to what currently exists, you're sandboxing the possibilities of what a GPU can be.

bjornsing•7mo ago

So how are we doing with whole program optimization on the compiler level? Feels kind of backwards that people are optimizing these LLM architectures, one at a time.

geoffbp•7mo ago

“Arithmetic Intensity (AI)”

Hmm

gdiamos•7mo ago

Wow - the title is "basic facts" - but it should be "key insights"

You wouldn't believe how many PhDs I've met who have no idea what a roofline is.

Agentlien•7mo ago

I wasn't expecting the strong CUDA/ML focus. My own work is primarily in graphics and performance in video games; while this is all familiar and useful it feels like a very different view of the hardware than mine.

saagarjha•7mo ago

> The “Peak Compute” roof of 19.5 TFLOPS is an ideal, achievable only with highly optimized instructions like Tensor Core matrix multiplications and high enough power limits.

As mentioned below, 19.5 TFLOPS is the FP32 compute roofline, which doesn't support Tensor Cores. If you want to use those you need to use FP16 and you can get substantially improved performance.
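The roofline model makes this concrete: attainable throughput is the smaller of the compute roof and arithmetic intensity times memory bandwidth. In the sketch below, 19.5 TFLOPS is the FP32 roof quoted above; the ~1.5 TB/s bandwidth and the 312 TFLOPS dense FP16 Tensor Core roof are assumptions taken from the A100's published specs, and the two intensities are arbitrary sample points.

```python
def attainable_tflops(intensity, peak_tflops, bandwidth_tb_s):
    """Roofline: performance is capped by compute or by memory traffic,
    whichever limit is lower. `intensity` is in FLOPs per byte."""
    return min(peak_tflops, intensity * bandwidth_tb_s)


BANDWIDTH = 1.5       # TB/s, assumed A100 figure
FP32_ROOF = 19.5      # TFLOPS, from the quoted passage
FP16_TC_ROOF = 312.0  # TFLOPS, assumed dense FP16 Tensor Core figure

for intensity in (4, 100):
    fp32 = attainable_tflops(intensity, FP32_ROOF, BANDWIDTH)
    fp16 = attainable_tflops(intensity, FP16_TC_ROOF, BANDWIDTH)
    print(intensity, fp32, fp16)
# 4 6.0 6.0
# 100 19.5 150.0
```

At low intensity both roofs give the same memory-bound result; the higher Tensor Core roof only pays off once a kernel does enough work per byte moved.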
