
OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
475•klaussilveira•7h ago•116 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
813•xnx•12h ago•487 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
33•matheusalmeida•1d ago•1 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
157•isitcontent•7h ago•17 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
156•dmpetrov•7h ago•67 comments

A century of hair samples proves leaded gas ban worked

https://arstechnica.com/science/2026/02/a-century-of-hair-samples-proves-leaded-gas-ban-worked/
92•jnord•3d ago•12 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
50•quibono•4d ago•6 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
260•vecti•9h ago•123 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
207•eljojo•10h ago•134 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
328•aktau•13h ago•158 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
327•ostacke•13h ago•86 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
411•todsacerdoti•15h ago•219 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
23•kmm•4d ago•1 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
337•lstoll•13h ago•242 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
52•phreda4•6h ago•9 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
4•romes•4d ago•0 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
195•i5heu•10h ago•145 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
115•vmatsiiako•12h ago•38 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
152•limoce•3d ago•79 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
245•surprisetalk•3d ago•32 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
996•cdrnsf•16h ago•420 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
26•gfortaine•5h ago•3 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
46•rescrv•15h ago•17 comments

I'm going to cure my girlfriend's brain tumor

https://andrewjrod.substack.com/p/im-going-to-cure-my-girlfriends-brain
67•ray__•3h ago•30 comments

Evaluating and mitigating the growing risk of LLM-discovered 0-days

https://red.anthropic.com/2026/zero-days/
38•lebovic•1d ago•11 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
78•antves•1d ago•59 comments

How virtual textures work

https://www.shlom.dev/articles/how-virtual-textures-really-work/
30•betamark•14h ago•28 comments

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
41•nwparker•1d ago•11 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
7•gmays•2h ago•2 comments

Evolution of car door handles over the decades

https://newatlas.com/automotive/evolution-car-door-handle/
41•andsoitis•3d ago•62 comments

Touching the Elephant – TPUs

https://considerthebulldog.com/tte-tpu/
199•giuliomagnifico•2mo ago

Comments

Simplita•2mo ago
This was a nice breakdown. I always feel most TPU articles skip over the practical parts. This one actually connects the concepts in a way that clicks.
Zigurd•2mo ago
The extent to which TPU architecture is built for the purpose also doesn't happen in a single design generation. Ironwood is the seventh generation of TPU, and that matters a lot.
alecco•2mo ago
I'm surprised the prospect of China making TPUs at scale in a couple of years isn't bigger news. It could be a deadly blow for Google, NVIDIA, and the rest. Combine it with China's nuclear base and labor pool. And, the cherry on top: America will train 600k Chinese students, as Trump agreed to.

The TPUv4 and TPUv6 docs were stolen by a Chinese national in 2022/2023: https://www.cyberhaven.com/blog/lessons-learned-from-the-goo... https://www.justice.gov/opa/pr/superseding-indictment-charge...

And that's just 1 guy that got caught. Who knows how many other cases were there.

A Chinese startup is already making clusters of TPUs and has revenue https://www.scmp.com/tech/tech-war/article/3334244/ai-start-...

Workaccount2•2mo ago
Manufacturing is the hard part. China certainly has the knowledge to design a TPU architecture without needing to steal the plans. What they don't have is the ability to actually build the chips, and that's despite also stealing lithography plans.

There is a dark art to semiconductor manufacturing that pretty much only TSMC really has the wizards for. Maybe Intel and Samsung a bit too.

tomrod•2mo ago
Lot of retired fab folks in the Austin area if you needed to spin up a local fab. It's really not a dark art, there are plenty of folks that have experience in the industry.
Workaccount2•2mo ago
This is sort of like saying there are lots of kids in the local community college shop class if you want to spin up an F1 team.

Knowing how to make 2008-era chips doesn't get you past the actual gating factor: getting a handful of atoms to function as a transistor in current SOTA chips. There are probably 100 people on earth who know how to do this, and the majority of them are in Taiwan.

Again, China has literally stolen the plans for EUV lithography, years ago, and still cannot get it to work. Even Samsung and Intel, using the same machines as TSMC, cannot match what they are doing.

It's a dark art in the most literal sense.

Never mind that these new cutting-edge fabs cost ~$50 billion each.

checker659•2mo ago
I've always wondered: if you have fuck-you money, wouldn't it be possible to build GPUs to do LLM matmul with 2008 technology? Again, assuming energy costs / cooling costs don't matter.
Zigurd•2mo ago
Energy, cooling, and how much of the building you're taking up do matter. They matter less, and in a more manageable way, for hyperscalers with a long-established resource-management practice across lots of big data centers, because they can phase in new technologies as they phase out the old. But it's a lot more daunting to think about building a data center big enough to compete with one full of Blackwell systems that are more than 10 times more performant per watt and per square foot.
pixl97•2mo ago
Building the clean rooms at this scale is a limitation in itself. Just getting the factory set up and the machines put in so they don't generate particulate matter in operation is an art that compares in difficulty to making the chips themselves.
Workaccount2•2mo ago
IIRC people have gotten LLMs to run on '80s hardware. Inference isn't overly compute heavy.

The killer really is training, which is insanely compute-intensive and only recently became practical at the hardware scale needed.
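The training-vs-inference gap can be put in rough numbers with the common scaling rule of thumb (training ≈ 6·N·D FLOPs for N parameters and D training tokens, inference ≈ 2·N FLOPs per generated token; the model and token counts below are illustrative assumptions, not figures from the article):

```python
# Back-of-envelope compute for training vs. inference, using the common
# rule of thumb: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs per token.
N = 175e9   # parameter count (GPT-3 scale, for illustration)
D = 300e9   # training tokens (illustrative)

train_flops = 6 * N * D            # ~3.15e23 FLOPs for one training run
infer_flops_per_token = 2 * N      # ~3.5e11 FLOPs per generated token

print(f"training:  {train_flops:.2e} FLOPs")
print(f"inference: {infer_flops_per_token:.2e} FLOPs/token")
# One training run costs as much as generating ~9e11 tokens:
print(f"ratio: {train_flops / infer_flops_per_token:.1e} tokens")
```

On these assumptions a single training run equals roughly 900 billion tokens of inference, which is the asymmetry the comment is pointing at.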

adgjlsfhk1•2mo ago
you could probably train a gpt 2 sized model with sota architecture on a 2008 supercomputer. it would take a while though.
Zigurd•2mo ago
The mask shops at TSMC and Samsung kind of are a dark art. It's one of the interesting things about the contract manufacturing business in chips. It's not just a matter of having access to state of the art equipment.
aunty_helen•2mo ago
For China there is no plan B for semiconductor manufacturing. Invading Taiwan would be a dice roll and the consequences would be severe. They will create their own SOTA semiconductor industry. Same goes for their military.

The question is when? Does that come in time to deflate the US tech stock bubble? Or will the bubble start to level out and reality catch up, or will the market crash for another reason beforehand?

snek_case•2mo ago
China has their own fabs. They are behind TSMC in terms of technology, but that doesn't mean they don't have fabs. They're currently ~7nm AFAIK. That's behind TSMC, but also not useless. They are obviously trying hard to catch up. I don't think we should just imagine that they never will. China has a lot of smart engineers and they know how strategically important chip manufacturing is.

This is like this funny idea people had in the early 2000s that China would continue to manufacture most US technology but they could never design their own competitive tech. Why would anyone think that?

Wrt invading Taiwan, I don't think there is any way China can get TSMC intact. If they do invade Taiwan (please God no), it would be a horrible bloodbath. Deaths in the hundreds of thousands and probably relentless bombing. Taiwan would likely destroy its own fabs to avoid them being taken. It would be sad and horrible.

renewiltord•2mo ago
If they invade Taiwan, we will scuttle the plants and direct ASML to disable their machines which they will do because that’s the condition under which we gave them the tech. They’re not going to get it this way.

They’ll just catch the next wave of tech or eventually break into EUV.

adgjlsfhk1•2mo ago
imo the most likely answer is that asml funds a second source for the optics that isn't US controlled and starts shipping to China. The US is losing influence fast.
renewiltord•2mo ago
We’re not above stuxnetting them if it comes to it. They operate at the pleasure of the US with US tech.
jandrewrogers•2mo ago
It would likely take ASML decades to develop an alternative EUV light source not encumbered by US defense technology, at which time it may not matter.

Everyone is still dependent on a single American manufacturer for this tech after decades of development. This strongly suggests that it is considerably more difficult than just "funding a second source".

mr_toad•2mo ago
> Wrt invading Taiwan, I don't think there is any way China can get TSMC intact.

There are so many trade and manufacturing links between China and Taiwan that an outright war would be economically disastrous for both countries.

dpe82•2mo ago
That doesn't mean they won't try anyway; political ideology often trumps rational planning.
overfeed•2mo ago
> Why would anyone think that?

That'd be the belief in good old American exceptionalism. Up until recently, a common meme on HN was "freedom" is fundamental to innovation, and naturally the country with the most Freedom(TM) wins. This even persisted after it was clear that DJI was kicking all kinds of ass, outcompeting multiple western drone companies.

snek_case•2mo ago
It's probably true that free enterprise helps a lot, but China has that in large part. Even though the CCP calls itself communist, China is very capitalist in a number of ways. But I guess China is showing us that capitalism can exist without democracy.
overfeed•2mo ago
> capitalism can exist without democracy.

It's not exactly a new idea. This was the CIA's operating principle in the western hemisphere since before the cold war.

radialstub•2mo ago
The software is the hard part. Western software still outclasses what the Chinese produce by a good amount.
PunchyHamster•2mo ago
This. The amount of investment into CUDA is high enough most companies won't even consider competition, even if it was lower cost.

We desperately need more open frameworks for competition to work

mr_toad•2mo ago
> What they don't have is the ability to actually build the chips.

China has fabs. Most are older nodes and are used to manufacture chips used in cars and consumer electronics. They have companies that design chips (manufactured by TSMC), like the Ascend 910, which are purpose built for AI. They may be behind, but they’re not standing still.

fullofideas•2mo ago
>Combine it with China's nuclear base and labor pool. And the cherry on top, America will train 600k Chinese students as Trump agreed to.

I don't understand this part. What has a nuclear base got to do with chip manufacturing? And surely not all 600k students are learning chip design or stealing plans.

pixl97•2mo ago
Nuclear power is what they are talking about, not weapons.
alecco•2mo ago
I mean they have the power grid to run TPUs at 10x the scale of USA.

About students, have you seen the microelectronic labs in American universities lately? A huge chunk are Chinese already. Same with some of the top AI labs.

dylanowen•2mo ago
I assume the nuclear reactors are to power the data centers using the new chips. There have been a few mentions on HN about the US being very behind in building enough power plants to run LLM workloads
renewiltord•2mo ago
We should ask ourselves: is it worth ruining local communities in order to beat China in the global sphere?
pstuart•2mo ago
That question was asked and answered years ago and the answer is YES (not me personally, but the people in charge)

There are things about China not to be celebrated but one cannot help but admire the way that they invest in their country as a whole. The US is all about "what's in it for me".

renewiltord•2mo ago
Fortunately, we have environmentalists who can protect us from a future of towering nuclear plants and wind turbines with hills covered in solar panels.

Is all that construction really worth it when we could be protecting neighborhoods and historic views?

pstuart•2mo ago
That's absolutely a fair dig but it's far more complex than that. Our whole manufacturing base being outsourced is on the corporations who chose that "cost-cutting" path.

And it's not an entirely binary choice on protecting neighborhoods and views; for example what's happening in south Memphis with the power plant that's powering the Grok center there is a classic case of environmental racism -- they are cutting costs on pollution regulation because they have a community that they can dump the externalized costs on via their emissions.

Nobody's saying Grok shouldn't have the power, it's just a small detail on how that impact is managed.

renewiltord•2mo ago
I don’t think anyone is convinced that if the small detail were managed that there wouldn’t be another God of the Gaps small detail.
mr_toad•2mo ago
The frenetic pace of data center construction in the US means that nuclear is not a short-term option. No way are they going to wait a decade or more for generation to come on line. It’s going to be solar, batteries, and gas (turbines, and possibly fuel cells).
tormeh•2mo ago
Thankfully LLMs are a dead end, so nobody will make it to AGI by just throwing more electricity at the problem. Now if we could only have a new AI winter we could postpone the end of mankind as the dominant species on earth by another couple of decades.
Spooky23•2mo ago
The current narrative is that we are out of power so we must shut down power projects that are not politically prioritized, and build nuclear and coal capacity, which is.
llm_nerd•2mo ago
>It could be a deadly blow for Google, NVIDIA, and the rest.

How would this be a deadly blow to Google? Google makes TPUs for their own services and products, avoiding paying the expensive nvidia tax. If other people make similar products, this has effectively zero impact on Google.

nvidia knew their days were numbered, at least in their ownership of the whole market. And China hardly had to steal the great plans for a TPU to make one, and a FMA/MAC unit is actually a surprisingly simple bit of hardware to design. Everyone is adding "TPUs" in their chips - Apple, Qualcomm, Google, AMD, Amazon, Huawei, nvidia (that's what tensor cores are) and everyone else.

And that startup isn't the big secret. Huawei already has solutions matching the H20. Once the specific need that can be serviced by an ASIC is clear, everyone starts building it.

>America will train 600k Chinese students as Trump agreed to

What great advantage do you think this is?

America isn't remotely the great gatekeeper on this. If anything, Taiwan + the Netherlands (ASML) are. China would yield infinitely more value in learning manufacturing and fabrication secrets than cloning some specific ASIC.

lukasb•2mo ago
Yeah I'm terrified that TPUs will get cheaper, that would be awful.
pests•2mo ago
Half the article was about the extensive software codependency between TPUs, Borg, lilpunet, their optical switching network, etc. How much of that is manufacturing, and not just software and engineering experience, which won't be so easy to copy?
desideratum•2mo ago
The Scaling ML textbook also has an excellent section on TPUs. https://jax-ml.github.io/scaling-book/tpus/
jauntywundrkind•2mo ago
I also enjoyed https://henryhmko.github.io/posts/tpu/tpu.html https://news.ycombinator.com/item?id=44342977 .

The work that XLA & schedulers are doing here is wildly impressive.

This feels drastically harder to work with than Itanium must have been: ~400-bit VLIW, across extremely diverse execution units. The workload is different, it's not general purpose, but it's still awe-inspiring to know not just that they built the chip but that the software folks can actually use such a wildly weird beast.

I wish we saw more industry uptake for XLA. Uptake's not bad, per se: there's a bunch of different hardware it can target! But what amazing secret sauce, it's open source, and it doesn't feel like there's the industry rally behind it that it deserves. It feels like Nvidia is only barely beginning to catch up, to dig a new moat, with the just-announced Nvidia Tiles. Such huge overlap. Afaik, please correct if wrong, but XLA isn't at present particularly useful at scheduling across machines, is it? https://github.com/openxla/xla

desideratum•2mo ago
Thanks for sharing this. I agree w.r.t. XLA. I've been moving to JAX after many years of using torch and XLA is kind of magic. I think torch.compile has quite a lot of catching up to do.

> XLA isn't at present particularly useful at scheduling across machines,

I'm not sure if you mean compiler-based distributed optimizations, but JAX does this with XLA: https://docs.jax.dev/en/latest/notebooks/Distributed_arrays_...
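For intuition, the partitioning those JAX/XLA sharding annotations automate looks roughly like this hand-rolled sketch (a pure-Python toy with made-up shapes; in real JAX the compiler derives the split and any collectives from the annotations, you never write this by hand):

```python
# Toy of GSPMD-style data-parallel sharding: split the batch across
# "devices", run the local matmuls, reassemble, and check against the
# unsharded result.
def matmul(a, b):
    """Plain row-by-column matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

x = [[float(i + j) for j in range(4)] for i in range(8)]      # activations
w = [[float(i * j + 1) for j in range(3)] for i in range(4)]  # replicated weights

n_devices = 2
shard_size = len(x) // n_devices
shards = [x[d * shard_size:(d + 1) * shard_size] for d in range(n_devices)]

partials = [matmul(s, w) for s in shards]        # per-device compute
y = [row for part in partials for row in part]   # gather the shards

print(y == matmul(x, w))   # sharded result matches the unsharded one
```

A row-wise (batch) split needs no cross-device communication for the matmul itself; splitting along the contracted dimension would additionally require a reduction, which is the part XLA inserts for you.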

alevskaya•2mo ago
I do think it's a lot simpler than the problem Itanium was trying to solve. Neural nets are just way more regular in nature, even with block sparsity, compared to generic consumer pointer-hopping code. I wouldn't call it "easy", but we've found that writing performant NN kernels for a VLIW architecture chip is in practice a lot more straightforward than other architectures.

JAX/XLA does offer some really nice tools for doing automated sharding of models across devices, but for really large performance-optimized models we often handle the comms stuff manually, similar in spirit to MPI.

jauntywundrkind•2mo ago
I agree with regards to the actual work being done by the systolic arrays, which sort of are VLIW-ish and have a predictable, plannable workflow. Not easy, but there's a very direct path to actually executing these NN kernels. The article does an excellent job setting up how great a win it is that the systolic MXUs can do the work, don't need anything but local registers and local communication across cells, and don't need much control.

But if you make it 2900 words into this 9000-word document, to the "Sample VLIW Instructions" and "Simplified TPU Instruction Overlay" diagrams, trying to map the VLIW slots ("They contain slots for 2 scalar, 4 vector, 2 matrix, 1 miscellaneous, and 6 immediate instructions") to useful work seems incredibly challenging, given the vast disparity in functionality and style of the attached units it governs, and given the extreme complexity of keeping that MXU constantly fed, with timing tight enough that it stays well utilized.

> Subsystems operate with different latencies: scalar arithmetic might take single digit cycles, vector arithmetic 10s, and matrix multiplies 100s. DMAs, VMEM loads/stores, FIFO buffer fill/drain, etc. all must be coordinated with precise timing.

Whereas Itanium's compilers needed to pack parallel work into a single instruction, there's maybe less need for that here. But that quote feels like the heart-of-the-machine challenge: writing instruction bundles that feed a variety of subsystems all at once, when those subsystems have such drastically different performance profiles and pipeline depths. Truly an awe-some system, IMO.

Still though, yes: Itanium's software teams did have an incredibly hard challenge finding enough work at compile time to pack into instructions. Maybe it was a harder task. What a marvel modern cores are, having almost a dozen execution units that the CPU's control logic can juggle and keep utilized, analyzing incoming instructions on the fly with deep out-of-order dependency-tracking insight. Trying to figure it all out ahead of time and pack it into the instructions a priori was a wildly hard task.
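The static-scheduling problem described in that quote can be sketched as a toy model (unit latencies are made-up round numbers matching the quote's orders of magnitude, not real TPU figures):

```python
# Toy model of static VLIW scheduling: every op's completion cycle must
# be known at compile time, because there is no out-of-order hardware to
# discover readiness dynamically. Latencies are illustrative only.
LATENCY = {"scalar": 2, "vector": 10, "matrix": 100}

def schedule(ops):
    """ops: list of (issue_cycle, unit). Return (completion_cycle, unit)."""
    return [(cycle + LATENCY[unit], unit) for cycle, unit in ops]

# One bundle issued at cycle 0 touching all three unit classes:
bundle = [(0, "scalar"), (0, "vector"), (0, "matrix")]
for done, unit in schedule(bundle):
    print(f"{unit:>6} result ready at cycle {done}")
```

Even this trivial model shows the tension: a consumer of the matrix result must be placed ~100 cycles downstream while the scalar and vector units are kept busy with other work in between, all decided ahead of time by the compiler.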

cpgxiii•2mo ago
In Itanium's heyday, the compilers and libraries were pretty good at handling HPC workloads, which is really the closest anyone was running then to modern NN training/inference. The problem with Itanium and its compilers was that people obviously wanted to run workloads that looked nothing like HPC (databases, web servers, etc) and the architecture and compilers weren't very good at that. There have always been very successful VLIW-style architectures in more specialized domains (graphics, HPC, DSP, now NPU) it just hasn't worked out well for general-purpose processors.
jauntywundrkind•2mo ago
Side note, just ran into this article that mentions how Amazon is planning to have XLA / JAX support in the future for their Trainium's. https://newsletter.semianalysis.com/p/aws-trainium3-deep-div...
ddtaylor•2mo ago
Are TPUs still stuck to their weird Google bucket thing when using GCP? I hated that.
randomtoast•2mo ago
I don't think Moore's Law is dead.

Starting point: In 1965, the most advanced chips contained roughly 50 to 100 transistors (e.g., early integrated logic).

Let's take 1965 -> 2025, which is 60 years.

Number of doubling intervals: 60 years / 2 years per doubling = 30 doublings

So the theoretical prediction is:

Transistors in 2025 (predicted) = 100 × 2^30 ≈ 107 billion transistors

The Apple M1 Ultra has 114 billion transistors.
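The extrapolation above checks out numerically (taking the commenter's assumptions of ~100 transistors in 1965 and a constant 2-year doubling period):

```python
# Verify the Moore's-law extrapolation from the comment above.
start_year, end_year = 1965, 2025
doubling_period = 2          # years per doubling (classic Moore's law)
initial_transistors = 100    # assumed 1965 chip complexity

doublings = (end_year - start_year) // doubling_period    # 30
predicted = initial_transistors * 2 ** doublings

print(doublings)                                    # 30
print(f"{predicted / 1e9:.0f} billion transistors") # 107 billion
print(f"M1 Ultra / prediction: {114e9 / predicted:.2f}")  # ~1.06
```

The M1 Ultra lands within about 6% of the naive 2-year-doubling prediction, which is the point being made.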

KolenCh•2mo ago
Some people take Moore's law in a strong sense: the doubling rate is a constant. That version is long dead.

But if we relax it to a slowly varying constant, then it is not dead. That constant has been revised (by consensus) a few times already.

Your mistake is to (1) take the constant literally (i.e., use the strong law) and (2) use the boundary points to find the "average" effect. The latter is a flawed argument because it cannot show the law hasn't died recently: you haven't considered its change over time.