TPU Deep Dive

https://henryhmko.github.io/posts/tpu/tpu.html

451•transpute•7mo ago

Comments

almostgotcaught•7mo ago

> In essence, caches allow hardware to be flexible and adapt to a wide range of applications. This is a large reason why GPUs are very flexible hardware (note: compared to TPUs).

this is correct but mis-stated - it's not the caches themselves that cost energy but MMUs that automatically load/fetch/store to cache on "page faults". TPUs don't have MMUs and furthermore are a push architecture (as opposed to pull).

RossBencina•7mo ago

Can you suggest a good reference for understanding which algorithms map well onto the regular grid systolic arrays used by TPUs? The fine article says dense matmul and convolution are good, but is there anything else? Eigendecomposition? SVD? Matrix exponential? Solving Ax = b or AX = B? Cholesky?

WithinReason•7mo ago

Anything that you can express as 128x128 (but ideally much larger) dense matrix multiplication and nothing else
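To make the "express it as tile-sized dense matmuls" idea concrete, here is a minimal pure-Python sketch of a blocked matrix multiply. The name `tiled_matmul` and the `tile` parameter are made up for illustration; the default of 128 only mirrors the array size mentioned above, and nothing here claims to reflect how XLA actually lowers a matmul onto TPU hardware.

```python
def tiled_matmul(a, b, tile=128):
    """Multiply a (n x k) by b (k x m) one tile-sized block at a time.

    Each (i0, j0, k0) step is a small dense block product accumulated
    into the output, which is the shape of work a matmul unit wants.
    `tile` is a hypothetical knob, not a real hardware parameter.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # block row of the output
        for j0 in range(0, m, tile):          # block column of the output
            for k0 in range(0, k, tile):      # block along the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

Matrices whose dimensions are much smaller than the tile size leave most of each block empty, which is roughly why "ideally much larger" matters.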

musebox35•7mo ago

I think https://jax-ml.github.io/scaling-book/ is one of the best references to go through. It details how single device and distributed computations map to TPU hardware features. The emphasis is on mapping the transformer computations, both forwards and backwards, so requires some familiarity with how transformer networks are structured.

cdavid•7mo ago

SVD/eigendecomposition will often boil down to many matmuls (e.g. when using Krylov-based methods such as Arnoldi, Krylov-Schur, etc.), so I would expect TPUs to work well there. GMRES, one method to solve Ax = b, is also based on the Arnoldi decomposition.

Straw•7mo ago

You can do all of these in terms of matmul to some extent:

Solving AX=B can be done with Newton's method to invert A, which boils down to matmuls.

Matrix exponential is normally done with matmuls: the scaling-and-squaring approach (scale down, apply Taylor/Padé, then square back up).

Why do you need Cholesky? It's typically a means to an end, and when matmul is your primitive, you reach for it much less often.

Eigendecomposition is hard. If we limit ourselves to symmetric, we could use a blocked Jacobi algorithm where we run a non-matmul Jacobi to do 128x128 off-diagonal blocks and then use the matmul unit to apply to the whole matrix- for large enough matrices, still bottlenecked on matmul.

SVD we can get from Polar decomposition, which has purely-matmul iterations, and symmetric eigendecomposition.

One does have to watch out for numerical stability and precision very carefully when doing all these!
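As a rough illustration of the "Newton's method to invert A" point, here is a minimal pure-Python sketch of what is usually called the Newton-Schulz iteration, X <- X(2I - AX), which uses nothing but matmuls. The function names, the iteration count, and the starting guess X0 = A^T / (||A||_1 ||A||_inf) are assumptions chosen for the sketch, not something taken from the comment above.

```python
def matmul(a, b):
    """Plain dense matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def newton_schulz_inverse(a, iters=30):
    """Approximate inv(a) using only matrix multiplications.

    Assumes `a` is square and nonsingular. `iters` is a hypothetical
    fixed budget; a real implementation would monitor the residual.
    """
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = 1.0 / (norm_1 * norm_inf)
    # Starting guess: scaled transpose, a common choice that guarantees convergence.
    x = [[a[j][i] * scale for j in range(n)] for i in range(n)]
    for _ in range(iters):
        ax = matmul(a, x)
        # r = 2I - A X
        r = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)] for i in range(n)]
        x = matmul(x, r)
    return x

def solve(a, b, iters=30):
    """Solve A X = B by forming the approximate inverse, then one more matmul."""
    return matmul(newton_schulz_inverse(a, iters), b)
```

Convergence slows down badly for ill-conditioned matrices, and in low precision such as bfloat16 the result can be noticeably less accurate than a factorization-based solve, which is the stability caveat raised above.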

RossBencina•7mo ago

Cholesky for generating so-called sigma points in the Unscented Transformation.

serf•7mo ago

does that cooling channel have a NEMA stepper on it as a pump or metering valve?[0]

If so, wild. That seems like overkill.

[0]: https://henryhmko.github.io/posts/tpu/images/tpu_tray.png

fellowmartian•7mo ago

definitely closed-loop, might even be a servo

frays•7mo ago

How can someone have this level of knowledge about TPUs without working at Google?

musebox35•7mo ago

From the acknowledgment at the end, I guess the author has access to TPUs through https://sites.research.google/trc/about/

This is not the only way though. TPUs are available to companies operating on GCP as an alternative to GPUs with a different price/performance point. That is another way to get hands-on experience with TPUs.

erwincoumans•7mo ago

A quick free way to access TPUs is through https://colab.research.google.com, Runtime / Change Runtime Type / v2-8 TPU

ipsum2•7mo ago

Everything that's in the blog post is basically well known already. Google publishes papers and gives talks about their TPUs. Many details are lacking though, and require some assumptions/best guesses. Jax and XLA are (partially) open source and give clues about how TPUs work under the hood as well.

https://arxiv.org/abs/2304.01433

https://jax-ml.github.io/scaling-book/

antognini•7mo ago

There's a pretty detailed description of TPUs in the latest edition of Hennessy's Computer Architecture. (Hennessy was also involved with the design of the early TPU architectures if I recall right.)

ariwilson•7mo ago

Cool article!

sgt101•7mo ago

ELI5: how (specifically) do GPU and TPU optimisations affect determinism in LLMs? Or is this all a myth?

barrkel•7mo ago

LLMs are generally deterministic. The token sampling step is usually randomized to some degree because it gets better results (creativity) and helps avoid loops, but you can turn that off (temp zero for simple samplers).
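A minimal sketch of the sampling step being described, in plain Python. The name `sample_token` and its signature are hypothetical and not taken from any particular inference library; it covers only the sampler, not whatever numerical variation the forward pass itself may have.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits.

    temperature == 0 means greedy decoding (plain argmax, no randomness).
    Otherwise draw from softmax(logits / temperature) using `rng`,
    so the draw is repeatable when the generator is seeded.
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

With this shape, two runs that pass `random.Random(1234)` and the same logits produce the same sequence of draws, which is the "pin the seed" alternative to setting the temperature to zero.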

perching_aix•7mo ago

+ can also just pin the seed instead, right?

sgeisenh•7mo ago

This is an oversimplification. When distributed, the nondeterministic order of additions during reductions can produce nondeterministic results due to floating point error.

It’s nitpicking for sure, but it causes real challenges for reproducibility, especially during model training.
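The effect is easy to reproduce on a CPU. The sketch below, with made-up helper names `sequential_sum` and `sharded_sum`, adds the same four numbers once left to right and once as two per-shard partial sums that are then combined, which is a simplified stand-in for a reduction split across devices.

```python
def sequential_sum(values):
    """Add values strictly left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total

def sharded_sum(values, num_shards):
    """Sum each contiguous shard separately, then add the partial sums.

    A toy model of a distributed reduction: same inputs, different
    association order. `num_shards` is a hypothetical parameter.
    """
    chunk = -(-len(values) // num_shards)  # ceiling division
    partials = [sequential_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return sequential_sum(partials)

values = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(values))   # 1.0
print(sharded_sum(values, 2))   # 0.0
```

The exact answer is 2.0, so both orders lose information, just different amounts of it. If the shard count or the order in which partial sums arrive changes from run to run, so can the result.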

Der_Einzige•7mo ago

This belief (LLMs are deterministic except for samplers) is very wrong and will get you into hilariously large amounts of trouble for assuming it's true.

Also greedy sampling considered harmful: https://arxiv.org/abs/2506.09501

From the abstract:

"For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision—while critical for reproducibility—is often neglected in evaluation practices."

sgt101•7mo ago

Great reference - thanks.

marcinzm•7mo ago

Does this apply to TPUs or just GPUs?

recursivecaveat•7mo ago

It's more a system level property. Even if you used CPUs, if you're not careful in your design to control how results are distributed and combined, you will get variance.

jpgvm•7mo ago

They don't affect determinism of the results but different architectures have different determinism guarantees with respect to performance, as a result of scheduling and other things.

TPUs share a similar lineage to the Groq TPU accelerators (disclaimer: I work at Groq) which are actually fully deterministic which means not only do you get deterministic output, you get it in a deterministic number of cycles.

There is a trade off though, making the hardware deterministic means you give up HW level scheduling and other sources of non-determinism. This makes the architecture highly dependent on a "sufficiently smart compiler". TPUs and processors like them are generally considered VLIW and are all similarly dependent on the compiler doing all the smart scheduling decisions upfront to ensure good compute/IO overlap and eliminating pipeline bubbles etc.

GPUs on the other hand have very sophisticated scheduling systems on the chips themselves along with stuff like kernel swapping etc that make them much more flexible, less dependent on the compiler and generally easier to reach a fairly high utilisation of the processor without too much work.

TLDR: TPUs MAY have deterministic cycle guarantees. GPUs (of the current generation/architectures) cannot because they use non-deterministic scheduling and memory access patterns. Both still produce deterministic output for deterministic programs.

sgt101•7mo ago

This is gold dust. Thank you for taking the time to share your knowledge.

lanthissa•7mo ago

can someone help me understand how the following can be true:

1. TPU's are a serious competitor to nvidia chips.

2. Chip makers with the best chips are valued at 1-3.5T.

3. Google's market cap is 2T.

4. It is correct for google to not sell TPU's.

I have heard the whole "it's better to rent them" thing, but if they're actually good, selling them is almost as good a business as every other part of the company.

smokel•7mo ago

Selling them and supporting that in the field requires quite some infrastructure you'd have to build. Why go through all that trouble if you already make higher margins renting them out?

Also, if they are so good, it's best to not level the playing field by sharing that with your competitors.

Also "chip makers with the best chips" == Nvidia, there aren't many others. And Alphabet does more than just produce TPUs.

matt-p•7mo ago

Does Google cloud offer them on a "aws outpost" style model? I think that plus cloud access is probably the easiest and ' best ' way to offer them. Last thing you need to be dealing with is super micro, gigabyte etc building a box for them and so on - I can definitely understand not selling the raw chip.

dismalaf•7mo ago

Nvidia is selling a ton of chips on hype.

Google is saving a ton of money by making TPUs, which will pay off in the future when AI is better monetized, but so far no one is directly making a massive profit from foundation models. It's a long term play.

Also, I'd argue Nvidia is massively overvalued.

CalChris•7mo ago

Common in gold rushes but then they are selling chips. Are they overvalued? Maybe. Are they profitable (something WeWork and Uber aren't) ? Yes, quite.

rwmj•7mo ago

Aren't Google's TPUs a bit like a research project with practical applications as a nice side effect?

hackernudes•7mo ago

Why do you say that? They are on their seventh iteration of hardware and even from the beginning (according to the article) they were designed to serve Google AI needs.

My take is "sell access to TPUs on Google cloud" is the nice side effect.

surajrmal•7mo ago

On what basis do you make that claim? It's incredibly misleading and wrong.

silentsea90•7mo ago

All of Google ML runs on TPUs tied to $ billions in revenue. You make it sound like TPUs are a Google X startup that's going to get killed tomorrow.

lmm•7mo ago

What revenue is that? Hardly anyone's paying for Google's AI directly, and it doesn't seem to have dramatically changed their ad business.

happyopossum•7mo ago

GP said ML, not AI - Google’s entire search and ad businesses are run on ML, as are many of the moonshots of Alphabet (DeepMind, Waymo, etc).

rossjudson•7mo ago

Like all 5 of your AI friends aren't paying for Google AI? ;)

wkat4242•7mo ago

I think you're thinking of the coral ones. The only ones they've sold directly to the public.

mft_•7mo ago

If they think they’ve got a competitive advantage vs. GPUs which benefits one of their core products, it would make sense to retain that competitive advantage for the long term, no?

Uehreka•7mo ago

No. If they sell the TPUs for “what they’re worth”, they get to reap a portion of the benefit their competitors would get from them. There’s money they could be making that they aren’t.

Or rather, there would be if TPUs were that good in practice. From the other comments it sounds like TPUs are difficult to use for a lot of workloads, which probably leads to the real explanation: No one wants to use them as much as Google does, so selling them for a premium price as I mentioned above won’t get them many buyers.

radialstub•7mo ago

I believe Broadcom is also very involved in the making of the TPU's and networking infrastructure and they are valued at 1.2T currently. Maybe consider the combined value of Broadcom and Google.

lftl•7mo ago

Wouldn't you also need to add TSMC to Nvidia's side in that case?

radialstub•7mo ago

Broadcom is fabless. I think they aid in hardware design, while google mostly does the software stack. Nvidia does both hardware and software stack.

santaboom•7mo ago

Not sure what you mean. Who do you think fabs Broadcom's and Google's chips?

lftl•7mo ago

Ah, I didn't realize Broadcom was fabless and only helping in design.

Velorivox•7mo ago

Wall street undervalued Google even on day one (IPO). Bezos has said that some of the times the stock had been doing the worst were when the company was doing great.

So, to help you understand how they can be true: market cap is governed by something other than what a business is worth.

As an aside, here's a fun article that embarrasses wall street. [0]

[0] https://www.nbcnews.com/id/wbna15536386

mrlongroots•7mo ago

Hah thanks for sharing this, fun read!

YZF•7mo ago

I remember sitting around the lunch table in a tech company when Google IPO'd and none of us understood the IPO valuation. I didn't buy any stocks. I also didn't get "cloud" either. Sometimes new business is essentially created out of thin air. Google and Amazon's valuation did not increase only due to their efforts, it also increased because the broader market shifted.

I guess that means don't take investment advice from me ;) I've done OK buying indices though.

ethbr1•7mo ago

The fact that Wall Street sometimes misses revolutions and misvalues stocks does not mean that all perceived misvalued stocks are revolutionary.

Plenty of companies have screwed up execution, and the market has correctly noticed and penalized them for that.

ghoshbinayak•7mo ago

Reading through this news article is hilarious.

P.S. I did not have access to internet in 2006, so I guess the skepticism was normal at the time.

jeffbee•7mo ago

Like other Google internal technologies, the amount of custom junk you'd need to support to use a TPU is pretty extreme, and the utility of the thing without the custom junk is questionable. You might as well ask why they aren't marketing their video compression cards.

michaelt•7mo ago

nvidia, who make AI chips with kinda good software support, and who have sales reflecting that, is worth 3.5T

google, who make AI chips with barely-adequate software, is worth 2.0T

AMD, who also make AI chips with barely-adequate software, is worth 0.2T

Google made a few decisions with TPUs that might have made business sense at the time, but with hindsight haven't helped adoption. They closely bound TPUs with their 'TensorFlow 1' framework (which was kinda hard to use) then they released 'TensorFlow 2' which was incompatible enough it was just as easy to switch to PyTorch, which has TPU support in theory but not in practice.

They also decided TPUs would be Google Cloud only. Might make sense, if they need water cooling or they have special power requirements. But it turns out the sort of big corporations that have multi-cloud setups and a workload where a 1.5x improvement in performance-per-dollar is worth pursuing aren't big open source contributors. And understandably, the academics and enthusiasts who are giving their time away for free aren't eager to pay Google for the privilege.

Perhaps Google's market cap already reflects the value of being a second-place AI chipmaker?

que-encrypt•7mo ago

jax is very much a working (and in my view better, aside from the lack of community) software support. Especially if you use their images (which they do).

> Tensorflow

They have been using jax/flax/etc rather than tensorflow for a while now. They don't really use pytorch from what I see on the outside from their research works. For instance, they released siglip/siglip2 with flax linen: https://github.com/google-research/big_vision

TPUs very much have software support, hence why SSI etc use TPUs.

P.S. Google gives their tpus for free at: https://sites.research.google/trc/about/, which I've used for the past 6 months now

throwaway314155•7mo ago

> They have been using jax/flax/etc rather than tensorflow for a while now

Jax has a harsher learning curve than Pytorch in my experience. Perhaps it's worth it (yay FP!) but it doesn't help adoption.

> They don't really use pytorch from what I see on the outside from their research works

Of course not, there is no outside world at Google - if internal tooling exists for a problem their culture effectively mandates using that before anything else, no matter the difference in quality. This basically explains the whole TF1/TF2 debacle which understandably left a poor taste in people's mouths. In any case while they don't use Pytorch, the rest of us very much do.

> P.S. Google gives their tpus for free at: https://sites.research.google/trc/about/, which I've used for the past 6 months now

Right and in order to use it effectively you basically have to use Jax. Most researchers don't have the advantage of free compute so they are effectively trying to buy mindshare rather than winning on quality. This is fine, but it's worth repeating as it biases the discussion heavily - many proponents of Jax just so happen to be on TRC or have been given credits for TPU's via some other mechanism.

throwaway314155•7mo ago

Also - getting access to a TPU on GCP (particularly when you don't have a <fancy_school>.edu email address) has historically been a _fucking nightmare_. Absolute shit show.

que-encrypt•7mo ago

I am a high schooler, and easily got a tpuv4-64. No fancy school or edu email address, just a dream of winning geoguessr. They are very receptive to emails, I asked for more and they got more for me.

throwaway314155•7mo ago

I did say historically. Maybe they finally improved things. It left me and others with a poor first impression however

roughly•7mo ago

Aside from the specifics of Nvidia vs Google, one thing to note regarding company valuations is that not all parts of the company are necessarily additive. As an example (read: a thing I’m making up), consider something like Netflix vs Blockbuster back in the early days - once Blockbuster started to also ship DVDs, you’d think it’d obviously be worth more than Netflix, because they’ve got the entire retail operation as well, but that presumes the retail operation is actually a long-term asset. If Blockbuster has a bunch of financial obligations relating to the retail business (leases, long-term agreements with shippers and suppliers, etc), it can very quickly wind up that the retail business is a substantial drag on Blockbuster’s valuation, as opposed to something that makes it more valuable.

santaboom•7mo ago

Good questions, below I attempt to respond to each point then wrap it up. TLDR: even if TPU is good (and it is good for Google) it wouldn’t be “almost as good a business as every other part of their company” because the value add isn’t FROM Google in the form of a good chip design(TPU). Instead the value add is TO Google in form of specific compute (ergo) that is cheap and fast FROM relatively simple ASICs(TPU chip) stitched together into massively complex systems (TPU super pods).

If interested in further details:

1) TPUs are a serious competitor to Nvidia chips for Google’s needs, per the article they are not nearly as flexible as a GPU (dependence on precompiled workloads, high usage of PEs in systolic array). Thus for broad ML market usage, they may not be competitive with Nvidia gpu/rack/clusters.

2) Chip makers with the best chips are not valued at 1-3.5T; per other comments to OC, only Nvidia and Broadcom are worth this much. These are not just “chip makers”, they are (the best) “system makers” driving designs for chips and interconnect required to go from a diced piece of silicon to a data center consuming MWs. This part is much harder, which is why Google (who design TPU) still has to work with Broadcom to integrate their solution. Indeed every hyperscaler is designing chips and software for their needs, but every hyperscaler works with companies like Broadcom or Marvell to actually create a complete competitive system. Side note: Marvell has deals with Amazon, Microsoft and Meta to mostly design these systems, and they are worth “only” 66B. So, you can’t just design chips to be valuable, you have to design systems. The complete systems have to be the best, wanted by everyone (Nvidia, Broadcom) in order to be in Ts, otherwise you’re in Bs (Marvell).

4) I see two problems with selling TPU: customers and margins. If you want to sell someone a product, it needs to match their use; currently the use only matches Google’s needs, so who are the customers? Maybe you want to capture hyperscalers / big AI labs, whose use case is likely similar to Google's. If so, margins would have to be thin, otherwise they just work directly with Broadcom/Marvell (and they all do). If Google wants everyone using CUDA/Nvidia as a customer, then you massively change the purpose of TPU and even Google.

To wrap up, even if TPU is good (and it is good for Google) it wouldn’t be “almost as good a business as every other part of their company” because the value add isn’t FROM Google in the form of a good chip design(TPU). Instead the value add is TO Google in form of specific compute (ergo) that is cheap and fast FROM relatively simple ASICs(TPU chip) stitched together into massively complex systems (TPU super pods).

Sorry that got a bit long winded, hope it’s helpful!

throwaway31131•7mo ago

This also all assumes that there is excess foundry capacity in the world for Google to expand into, which is not obvious. One would need exceptionally good operations to compete here and that has never been Google's forte.

https://www.tomshardware.com/tech-industry/artificial-intell...

"Nvidia to consume 77% of wafers used for AI processors in 2025: Report...AWS, AMD, and Google lose wafer share."

matt-p•7mo ago

AMD and even people like Huawei also make somewhat acceptable chips but using them is a bit of a nightmare. Is it a similar thing here? Using TPUs is more difficult, only exists inside Google cloud etc

epolanski•7mo ago

> can someone help me understand how the following can be true

You're conflating price with intrinsic value with market analysis. All different things.

foota•7mo ago

5. The efficient market hypothesis is true :-)

Workaccount2•7mo ago

Ironically, despite Google ultimately being an advertising company, it is the absolute worst company at advertising itself.

cdg007•7mo ago

What will competitors say?

b0a04gl•7mo ago

tpu's predictable latency under scale. when you control the compiler, the runtime, the interconnect and the chip, you eliminate so much variance that you can actually schedule jobs efficiently at data center scale. so the obvious question why haven't we seen anyone outside Google replicate this full vertical stack yet? is it because the hardware's hard or because no one has nailed the compiler/runtime contract at that scale?

kevindamm•7mo ago

you mean other than NVidia and AMD?

transpute•7mo ago

Groq, https://en.wikipedia.org/wiki/Groq & https://news.ycombinator.com/item?id=44345738

Neywiny•7mo ago

What's not mentioned is a comparison vs FPGAs. You can have a systolic, fully pipelined system for any data processing not just vectorized SIMD. Every primitive is able to work independently of everything else. For example, if you have 240 DSP slices (which is far from outrageous on low scale), a perfect design could use those as 240 cores at 100% throughput. No memory, caching, decoding, etc overhead.

adrian_b•7mo ago

True, but FPGAs are suitable only for things that will not be produced in great numbers, because their cost and power consumption are many times higher than those of an ASIC.

For a company of the size of Google, the development costs for a custom TPU are quickly recovered.

Comparing a Google TPU with an FPGA is like comparing an injection-moulded part with a 3D-printed part.

Unfortunately, the difference in performance between FPGAs and ASICs has greatly increased in recent years, because the FPGAs have remained stuck on relatively ancient CMOS manufacturing processes, which are much less efficient than the state-of-the-art CMOS manufacturing processes.

Neywiny•7mo ago

When you can ASIC, yes, do an ASIC. But my point was that there was a lot of GPU comparison. GPUs are also not ASICs relative to AI.

QuadmasterXLII•7mo ago

They’re close, they’re basically matmul asics

Neywiny•7mo ago

Arguably so are the DSP heavy FPGAs. And the unused logic will have a minimal static power draw relative to the unused but clocked G-only parts of the GPU.

daxfohl•7mo ago

I have to imagine google considered this and decided against it. I assume it's that all the high-perf matmul stuff needs to be ASIC'd out to get max performance, quality heat dissipation, etc. And for anything reconfigurable, a CPU-based controller or logic chip is sufficient and easier to maintain.

FPGA's kind of sit in this very niche middle ground. Yes you can optimize your logic so that the FPGA does exactly the thing that your use case needs, so your hardware maps more precisely to your use case than a generic TPU or GPU would. But what you gain in logic efficiency, you'll lose several times over in raw throughput to a generic TPU or GPU, at least for AI stuff which is almost all matrix math.

Plus, getting that efficiency isn't easy; FPGAs have a higher learning curve and a way slower dev cycle than writing TPU or GPU apps, and take much longer to compile and test than CUDA code, especially when they get dense and you have to start working around gate timing constraints and such. It's easy to get to a point where even a tiny change can exceed some timing constraint and you've got to rewrite a whole subsystem to get it to synthesize again.

c-c-c-c-c•7mo ago

fpga's are not expensive when ordered in bulk, the volume prices you see on mouser are way higher than the going rates.

monocasa•7mo ago

The actual cost of the part (within reason) doesn't matter all that much for a hyperscaler. The real cost is in the perf/watt, which an FPGA is around an order of magnitude worse for the same RTL.

cpgxiii•7mo ago

> True, but FPGAs are suitable only for things that will not be produced in great numbers, because their cost and power consumption are many times higher than those of an ASIC.

While common folk wisdom, this really isn't true. A surprising number of products ship with FPGAs inside, including ones designed to be "cheap". A great example of this is that Blackmagic, a brand known for being a "cheap" option in cinema/video gear, bases everything on Xilinx/AMD FPGAs (for some "software heavy" products they use the Xilinx/AMD Zynq line, which combines hard ARM cores with an FPGA). Pretty much every single machine vision camera on the market uses an FPGA for image processing as well. These aren't "one in every pocket" level products, but they are widely produced.

> Unfortunately, the difference in performance between FPGAs and ASICs has greatly increased in recent years, because the FPGAs have remained stuck on relatively ancient CMOS manufacturing processes

This isn't true either. At the high end, FPGAs are made on whatever the best process available is. Particularly in the top-end models that combine programmable fabric with hard elements, it would be insane not to produce them on the best process available. What is the big hindrance with FPGAs is that almost by definition the cell structures needed to produce programability are inherently more complex and less efficient than the dedicated circuits of an ASIC. That often means a big hit to maximum clock rate, with resulting consequences to any serial computation being performed.

santaboom•7mo ago

All very informative, I had some quibbles.

While it is true that cheap and expensive FPGAs exist, an FPGA system to replace a TPU would not use a $0.50 or even $100 FPGA; it would use a Versal or UltraScale+ FPGA that costs thousands, compared to the (rough guess) $100/die you might spend for the largest chip on the most advanced process. Furthermore, the overhead of an FPGA means every single one may support a few million logic gates (maybe 2-5x if you use hardened blocks), compared to billions of transistors on the largest chips in the most advanced node, so the cost per chip to buy is much, much higher.

To the second point, afaik, leading edge Versal FPGAs are in 7nm, not ancient also not cutting edge used for asic(n3).

adrian_b•7mo ago

By high end I assume that you mean something like some of the AMD Versal series, which are made on a TSMC 7 nm process, like the AMD Zen 2 CPUs from 6 years ago.

While TSMC 7 nm is much better than what most FPGAs use, it is still ancient in comparison with what the current CPUs and GPUs use.

Moreover, I have never seen such FPGAs sold for less than thousands of $, i.e. they are much more expensive than GPUs of similar throughput.

Perhaps they are heavily discounted for big companies, but those are also the companies which could afford to design an ASIC with better energy efficiency.

I always prefer to implement an FPGA solution over any alternatives, but unfortunately much too often the FPGAs with high enough performance have also high enough prices to make them unaffordable.

The FPGAs with the highest performance that are still reasonably priced are the AMD UltraScale+ families, made with a TSMC 14 nm process, which is still good in comparison with most FPGAs, but nevertheless it is a decade old manufacturing process.

cheptsov•7mo ago

It’s so ridiculous to see TPUs being compared to NVIDIA GPUs. IMO proprietary chips such as TPU have no future due to the monopoly on the cloud services. There is no competition across the cloud service providers. The only way to access TPUs is through GCP. As a result nobody wants to use them regardless of the technology. This is the biggest fault of GCP. Further down the road, the gap between NVIDIA GPUs and Google TPUs (call it „moat“ or CUDA) is going to grow.

The opposite situation is with AMD which are avoiding the mistakes of Google.

My hope though is that AMD doesn’t start to compete with cloud service providers, e.g. by introducing their own cloud.

hiddencost•7mo ago

TPUs will thrive regardless of public adoption; Google's internal demand for TPU is such that they could buy every TPU ever produced.

roughly•7mo ago

One thing worth note here - TPUs are optimized for a fairly constrained set of operations. Google’s had good success with them, but, like many of the other Google architectural choices, this will constrain Google’s technical choice space in the future - if they’ve gone all in on TPUs, future Google machine learning projects will be using the sets of operations the TPUs excel at because that’s what Google has a lot of, not necessarily because that’s the optimal choice. This will have knock-on effects across the industry due to Google’s significant influence on industry practice and technical direction.

hustwindmaple1•7mo ago

Every major cloud vendor is trying to develop their own custom AI ASIC. Putting Google aside, Amazon has Trainium/Inferentia, which Anthropic uses quite extensively. Microsoft is doing something similar, although they are quite behind. OpenAI is doing it. Meta is doing it. That's why the stock prices of Broadcom/Marvell soared.

trostaft•7mo ago

Excellent write up, thank you. The benefits section was illustrative

mdaniel•7mo ago

Related: OpenTPU: Open-Source Reimplementation of Google Tensor Processing Unit (TPU) - https://news.ycombinator.com/item?id=44111452 - May, 2025 (23 comments)

wkat4242•7mo ago

Haha I thought this was about 3D printing with thermoplastic polyurethane. It's one of the harder materials to print and it also took me some time to get my head around it.
