What else would one expect, given that their core value is hiring generalists over specialists* and given their lousy retention record?
*Pay no attention to the specialists they acquihire and pay top dollar for... And even they don't stick around.
In practice it doesn't quite work out that way.
They do voluntarily offer a way to signal that the data Googlebot sees is not to be used for training (for now, and assuming you take them at their word), but AFAIK there is no way to stop them from doing RAG on your content without destroying your SEO in the process.
Arguably OpenAI's main raison d'être was to be a counterweight to that pre-2023 Google AI dominance. But I'd also argue that OpenAI lost its way.
I think we can be reasonably sure that search, Gmail, and some flavor of AI will live on, but other than that, Google apps are basically end-of-life at launch.
Agree there are lots of other contributing causes like culture, incentives, security, etc.
nothing changed...
[1] Ref: he mistakenly posted what was meant to be an internal memo publicly on G+. He quickly took it down, but of course The Internet Never Forgets: https://gist.github.com/chitchcock/1281611
https://aibusiness.com/companies/google-ceo-sundar-pichai-we...
Google will have no problem discontinuing Google "AI" if they finally notice that people want a computer to shut up rather than talk at them.
How do you define big? My understanding is that they failed to compete with Facebook and decided to redirect resources elsewhere.
The hype when it was first coming to market was intense. But then nobody could get access because they heavily restricted sign ups.
By the time it was in "open beta" (IIRC like 6-7 mos later), the hype had long died and nobody cared about it anymore.
I and a lot of other googlers were really confused by all of this because at the time we were advocating that Google put more effort into its nascent cloud business (often to get the reply "but we already have appengine" or "cloud isn't as profitable as ads") and that social, while getting a lot of attention, wasn't really a good business for google to be in (with a few exceptions like Orkut and Youtube, Google's attempts at social have been pretty uninspired).
There were even books written at the time that said Google looked lazy and slow and that Meta was going to eat their lunch. But shortly after Google+ tanked, Google really began to focus on Cloud (in a way that pissed off a lot of Googlers in the same way Google+ did, by taking resources and attention from other projects). Now, Meta looks like it's going to have a challenging future while Google is on to achieving what Larry Page originally intended: a reliable revenue stream that is reinvested into development of true AI.
The only lunch that will be eaten is Apple's own, since it would probably cannibalize their own sales of the MacBook Air.
Nvidia is tied down to support previous and existing customers while Google can still easily shift things around without needing to worry too much about external dependencies.
Totally possible, but the second-order effects are much more complex than "leader once and for all". The path to victory for China is not a war in spite of the West, but a war the West would not care about.
Even the US suffers (their veterans do, anyway) but that's the country that in general least suffers from their constant involvement in warfare. They have this industry down to a "T". However, you cannot generalize from a nation that can project military force almost anywhere in the globe, with little fear of repercussion back home; most countries cannot afford this. China certainly cannot.
So what about the rest? Internecine conflicts are outrageously wasteful, and sadly common in the modern age. Russia's war with Ukraine has turned incredibly wasteful and costly, and Russians are suffering (and dying) regardless of whatever Putin says.
I think China is not generally oriented towards waging war. They do have their military, military projects, and their nationalistic things (what I learn from Wikipedia is called "irredentism"), but generally they seem to be trying to become an economic world power. War would mess with and interfere with that. War is too fucking risky.
As long as "tomorrow" is a better day to invade Taiwan than today is, China will wait for tomorrow.
Zeihan's predictions on China have been fabulously wrong for 20+ years now.
I'd guess most of their handicap comes from their hardware and software not being as refined as the US's
What I'm sure about is that a processing unit purpose-built for a task is more efficient than a general-purpose unit designed to accommodate all programming tasks.
More and more, the economics of computing boils down to energy usage and, invariably, to physical limits; a more efficient process has the benefit of consuming less energy.
As a layman, it makes general sense to me. Maybe a future where productivity is based more on energy efficiency than on monetary gain pushes the economy in better directions.
Cryptocurrency and LLMs seem like they'll play out that story over the next 10 years.
Am I misunderstanding "TPU" in the context of the article?
The fact that they also support vector operations or matrix multiplication is kind of irrelevant and not a defining characteristic of DSPs. If you want to go that far, then everything is a DSP, because all signals are analog.
Maybe also note that Qualcomm has renamed their Hexagon DSP to Hexagon NN. Likely the change was adding activation functions, but otherwise it's a VLIW architecture with accelerated MAC operations, aka a DSP architecture.
The basic operation that an NN needs accelerated is... go figure, multiply-and-accumulate, with an added activation function.
See for example how the Intel NPU is structured here: https://intel.github.io/intel-npu-acceleration-library/npu.h...
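To make the "MAC plus activation" point concrete, here's a toy NumPy sketch (function names and sizes are mine, purely illustrative, not any vendor's API). The inner loop is nothing but multiply-accumulate; the only NN-specific part is the nonlinearity at the end, which is exactly the primitive these DSP/NPU blocks accelerate.

    import numpy as np

    def mac_with_activation(x, w, bias, activation=np.tanh):
        # A dot product is just a chain of multiply-accumulates (MACs),
        # which is the operation DSP/NPU hardware accelerates.
        acc = bias
        for xi, wi in zip(x, w):
            acc += xi * wi          # one MAC per coefficient
        return activation(acc)      # the "added activation function"

    x = np.random.randn(128).astype(np.float32)
    w = np.random.randn(128).astype(np.float32)
    print(mac_with_activation(x, w, bias=0.0))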
What makes a DSP different from a GPU is that the algorithms typically do not scale nicely to large matrices and vectors. For example, recursive filters. They are also usually much cheaper and lower power, and the reason they lost popularity is that Arm MCUs got good enough and economies of scale kicked in.
I've written code for DSPs both in college and professionally. It's much like writing code for CPUs or MCUs (it's all C or C++ at the end of the day). But it's very different from writing compute shaders or designing an ASIC.
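The classic example of the "doesn't scale to big matrices" point is a recursive (IIR) filter, where each output depends on the previous output. A toy sketch (my own simplification, first-order filter only):

    import numpy as np

    def iir_first_order(x, a, b):
        # y[n] = b * x[n] + a * y[n-1]
        # The dependence on y[n-1] forces sequential processing; unlike an
        # FIR filter or a dense NN layer, you can't trivially express this
        # as one big matrix multiply across the whole signal.
        y = np.zeros_like(x)
        prev = 0.0
        for n in range(len(x)):
            prev = b * x[n] + a * prev
            y[n] = prev
        return y

    print(iir_first_order(np.ones(8, dtype=np.float32), a=0.5, b=0.5))

(There are scan/prefix tricks to parallelize this, but they don't look anything like the dense matmuls GPUs and TPUs are built around.)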
With simulations becoming key to training models doesn't this seem like a huge problem for Google?
To quote The Next Platform: "An Ironwood cluster linked with Google’s absolutely unique optical circuit switch interconnect can bring to bear 9,216 Ironwood TPUs with a combined 1.77 PB of HBM memory... This makes a rackscale Nvidia system based on 144 “Blackwell” GPU chiplets with an aggregate of 20.7 TB of HBM memory look like a joke."
Nvidia may have the superior architecture at the single-chip level, but for large-scale distributed training (and inference) they currently have nothing that rivals Google's optical switching scalability.
While the B200 wins on raw FP8 throughput (~9000 vs 4614 TFLOPs), that makes sense given NVIDIA has optimized for the single-chip game for over 20 years. But the bottleneck here isn't the chip—it's the domain size.
NVIDIA's top-tier NVL72 tops out at an NVLink domain of 72 Blackwell GPUs. Meanwhile, Google is connecting 9216 chips at 9.6Tbps to deliver nearly 43 ExaFlops. NVIDIA has the ecosystem (CUDA, community, etc.), but until they can match that interconnect scale, they simply don't compete in this weight class.
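As a quick sanity check on those numbers (a rough back-of-the-envelope sketch; the per-chip figures are the ones quoted in this thread, not something I'm vouching for independently):

    # Back-of-the-envelope check of the cluster numbers quoted above.
    chips = 9216
    per_chip_tflops_fp8 = 4614          # Ironwood FP8 TFLOPS, as quoted above
    per_chip_hbm_gb = 192               # implied by 1.77 PB / 9216 chips

    cluster_exaflops = chips * per_chip_tflops_fp8 / 1e6
    cluster_hbm_pb = chips * per_chip_hbm_gb / 1e6

    print(f"{cluster_exaflops:.1f} EFLOPS")   # ~42.5, i.e. "nearly 43 ExaFlops"
    print(f"{cluster_hbm_pb:.2f} PB of HBM")  # ~1.77 PB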
Same logic as when Nvidia quotes the "bidirectional bandwidth" of high-speed interconnects to make the numbers look big, instead of the more common bandwidth per direction, forcing everyone else to adopt the same metric in marketing materials.
Ecosystem is a MASSIVE factor and will be a massive factor for all but the biggest models.
Also I feel you completely misunderstand: the problem isn't how fast ONE GPU is vs ONE TPU; what matters is the cost for the same output. If I can fill a datacenter at half the cost for the same output, does it matter that I've used twice the TPUs and that a single Nvidia Blackwell was faster? No...
And hardware cost isn't even the biggest problem; operational costs, mostly power and cooling, are another huge one.
So if you design a solution that fits your stack (designed for it) and optimize for your operational costs, you're light years ahead of your competition using the more powerful solution that costs 5 times more in hardware and twice as much to operate.
All I'm saying is more or less true for inference economics; I have no clue about training.
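To put toy numbers on it (everything below is made up, purely to illustrate the shape of the argument, not real pricing):

    # Cost per unit of output, not per-chip speed, is what decides this.
    def cost_per_unit(chip_cost, power_kw, kwh_price, lifetime_hours, output_per_hour):
        capex = chip_cost
        opex = power_kw * kwh_price * lifetime_hours
        return (capex + opex) / (output_per_hour * lifetime_hours)

    # Hypothetical "fast but expensive" chip vs "slower but cheap" chip.
    fast = cost_per_unit(chip_cost=40000, power_kw=1.0, kwh_price=0.10,
                         lifetime_hours=3 * 8760, output_per_hour=2.0)
    slow = cost_per_unit(chip_cost=10000, power_kw=0.5, kwh_price=0.10,
                         lifetime_hours=3 * 8760, output_per_hour=1.0)
    print(fast, slow)   # the slower, cheaper chip wins on cost per output here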
No surprises there, Google is not the greatest company at productizing their tech for external consumption.
> The other players are certainly more than just competing with Google.
TBF, it's easy to stay in the game when you're flush with cash, and for the past N quarters investors have been throwing money at AI companies; Nvidia's margins have greatly benefited from this largesse. There will be blood on the floor once investors start demanding returns on their investments.
If Google’s TPUs were really substantially superior, don’t you think that would result in at least short term market advantages for Gemini? Where are they?
Seemingly fast interconnects benefit training more than inference since training can have more parallel communication between nodes. Inference for users is more embarrassingly parallel (requires less communication) than updating and merging network weights.
My point: cool benchmark, what does it matter? The original post says Nvidia doesn’t have anything to compete with massively interconnected TPUs. It didn’t merely say Google’s TPUs were better. It said that Nvidia can’t compete. That’s clearly bullshit and wishful thinking, right? There is no evidence in the market to support that, and no actual technical points have been presented in this thread either. OpenAI, Anthropic, etc are certainly competing with Google, right?
And then people explained why the effects are smoothed over right now but will matter eventually and you rejected them as if they didn't understand your question. They answered it, take the answer.
> It didn’t merely say Google’s TPUs were better. It said that Nvidia can’t compete.
Can't compete at clusters of a certain size. The argument is that anyone on nVidia simply isn't building clusters that big.
I _really_ want an alternative but the architecture churn imposed by targeting ROCm for say an MI350X is brutal. The way their wavefronts and everything work is significantly different enough that if you're trying to get last-mile perf (which for GPUs unfortunately yawns back into the 2-5x stretch) you're eating a lot of pain to get the same cost-efficiency out of AMD hardware.
FPGAs aren't really any more cost effective unless the $/kWh goes into the stratosphere, which is a hypothetical I don't care to contemplate.
PyTorch, JAX, and TensorFlow are all examples to me of very capable products that compete very well in the ML space.
But more broadly, work like XLA and IREE are very interesting toolkits for mapping a huge variety of computation onto many types of hardware. While PyTorch et al. are fine example applications, things you can do, XLA is the Big Tent idea: the toolkit to erode not just specific CUDA use cases, but to allow hardware in general to be more broadly useful.
a generation is 6 months
* Turing: September 2018
* Ampere: May 2020
* Hopper: March 2022
* Lovelace (designed to work with Hopper): October 2022
* Blackwell: November 2024
* Next: December 2025 or later
With a single exception for Lovelace (arguably not a generation), there are multiple years between generations.
The tweet gives their justification; CUDA isn't ASIC. Nvidia GPUs were popular for crypto mining, protein folding, and now AI inference too. TPUs are tensor ASICs.
FWIW I'm inclined to agree with Nvidia here. Scaling up a systolic array is impressive but nothing new.
And the question is what do programs that max out Ironwood look like vs TPU programs written 5 years ago?
Just because it's still called CUDA doesn't mean it's portable over a not-that-long of a timeframe.
You are aware that Gemini was trained on TPU, and that most research at Deepmind is done on TPU?
It’s better to have a faster, smaller network for model parallelism and a larger, slower one for data parallelism than a very large, but slower, network for everything. This is why NVIDIA wins.
For example, the currently very popular Mixture of Experts architectures require a lot of all-to-all traffic (for expert parallelism), which works a lot better on the switched NVLink fabric, where it doesn't need to traverse multiple links the way it would in the torus.
Bisection bandwidth is a useful metric, but is hop count? Per-hop cost tends to be pretty small.
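Per-hop cost is small, but hops add up on all-to-all patterns. Rough sketch of the difference (the torus dimensions below are an assumption for illustration, not Google's actual Ironwood topology):

    # Worst-case hop count in a 3D torus vs. a single-tier switched fabric.
    def torus_max_hops(dims):
        # With wraparound links, the farthest node along each dimension
        # is at most dims[i] // 2 hops away.
        return sum(d // 2 for d in dims)

    print(torus_max_hops((16, 24, 24)))   # 8 + 12 + 12 = 32 hops worst case
    # A switched fabric like NVLink is ~1 hop between any pair, which is why
    # all-to-all expert-parallel traffic tends to favor switches, while
    # nearest-neighbor traffic maps fine onto the torus.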
NVFP4 is, to put it mildly, a masterpiece, the UTF-8 of its domain, and in strikingly similar ways it is 1. general, 2. robust to gross misuse, 3. not optional if success and cost both matter.
It's not a gap that can be closed by a process node or an architecture tweak: it's an order of magnitude where the polynomials that were killing you on the way up are now working for you.
sm_120 (what NVIDIA's quiet repos call CTA1) consumer gear does softmax attention and projection/MLP blockscaled GEMM at a bit over a petaflop at 300W and close to two (dense) at 600W.
This changes the whole game, and it's not clear anyone outside the lab even knows the new equilibrium points. It's nothing like Flash3 on Hopper: lotta stuff looks FLOPs bound, and GDDR7 looks like a better deal than HBM3e. The DGX Spark is in no way deficient; it has ample memory bandwidth.
This has been in the pipe for something like five years and even if everyone else started at the beginning of the year when this was knowable, it would still be 12-18 months until tape out. And they haven't started.
Years Until Anyone Can Compete With NVIDIA is back up to the 2-5 it was 2-5 years ago.
This was supposed to be the year ROCm and the new Intel stuff became viable.
They had a plan.
So if we look at what NVIDIA has to say about NVFP4, it sure sounds impressive [1]. But look closely: that initial graph never compares FP8 and FP4 on the same hardware. They jump from H100 to B200 while implying a 5x gain from going with FP4, which it isn't. Accompanied by scary words like: if you use MXFP4, there is a "Risk of noticeable accuracy drop compared to FP8".
Contrast that with what AMD has to say on the open MXFP4 approach which is quite similar to NVFP4 [2]. Ohh the horrors of getting 79.6 instead of 79.9 on GPQA Diamond when using MXFP4 instead of FP8.
[1] https://developer.nvidia.com/blog/introducing-nvfp4-for-effi...
[2] https://rocm.blogs.amd.com/software-tools-optimization/mxfp4...
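For intuition on why the accuracy hit is that small, here's a toy emulation of block-scaled FP4 quantization (illustrative only; this is neither NVIDIA's NVFP4 nor AMD's MXFP4 kernel, and the real formats also restrict how the per-block scale itself is stored):

    import numpy as np

    FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
    FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])

    def quantize_dequantize(x, block=32):
        out = np.empty_like(x)
        for i in range(0, len(x), block):
            chunk = x[i:i + block]
            scale = np.max(np.abs(chunk)) / 6.0 + 1e-12   # map the block into [-6, 6]
            idx = np.argmin(np.abs(chunk[:, None] / scale - FP4_GRID[None, :]), axis=1)
            out[i:i + block] = FP4_GRID[idx] * scale
        return out

    x = np.random.randn(1024).astype(np.float32)
    print("mean abs error:", np.abs(x - quantize_dequantize(x)).mean())

With a fresh scale for every small block, even this tiny 15-value grid tracks the data closely, which is why the benchmark deltas end up in the noise.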
^Even now I get capacity related error messages, so many days after the Gemini 3 launch. Also, Jules is basically unusable. Maybe Gemini 3 is a bigger resource hog than anyone outside of Google realizes.
Why? To me, it seems better for the market, if the best models and the best hardware were not controlled by the same company.
The truth is the LLM boom has opened the first major crack in Google as the front page of the web (the biggest since Facebook), in the same way the web in the long run made Windows so irrelevant Microsoft seemingly don’t care about it at all.
Sparse models have the same quality of results but fewer coefficients to process; in the case described in the link above, sixteen (16) times fewer.
This means that these models need 8 times less data to store, can be 16 or more times faster, and use 16+ times less energy.
TPUs are not all that good in the case of sparse matrices. They can be used to train dense versions, but inference efficiency with sparse matrices may not be all that great.
https://docs.cloud.google.com/tpu/docs/system-architecture-t...
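To make the coefficient-count argument concrete, a toy comparison (pure illustration; real sparse kernels and hardware support are the hard part):

    import numpy as np
    from scipy.sparse import random as sparse_random

    # A sparse matrix-vector product only does a MAC per *stored* nonzero,
    # so at 1/16 density you do roughly 1/16 of the work -- if the hardware
    # can exploit the irregular access pattern, which dense-matmul engines
    # like the TPU's systolic array generally can't.
    n = 4096
    dense = np.random.randn(n, n).astype(np.float32)
    sparse = sparse_random(n, n, density=1 / 16, format="csr", dtype=np.float32)
    x = np.random.randn(n).astype(np.float32)

    print(dense.size / sparse.nnz)   # ~16x fewer MACs
    y_dense = dense @ x
    y_sparse = sparse @ x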
Here's another inference-efficient architecture where TPUs are useless: https://arxiv.org/pdf/2210.08277
There is no matrix-vector multiplication. Parameters are estimated using Gumbel-Softmax. TPUs are of no use here.
Inference is done bit-wise and most efficient inference is done after application of boolean logic simplification algorithms (ABC or mockturtle).
In my (not so) humble opinion, TPUs are an example case of premature optimization.
Does anyone have a sense of why CUDA is more important for training than inference?
What does it even mean in neural net context?
> numerical stability
also nice to expand a bit.
Further, it's worth noting that Ironwood, Google's v7 TPU, supports only up to BF16 (a 16-bit floating point format that has the range of FP32 minus the precision). Many training processes rely upon larger types, quantizing later, so this breaks a lot of assumptions. Yet Google surprised everyone and actually trained Gemini 3 with just that type, so I think a lot of people are reconsidering assumptions.
Another factor is that training is always done with batches. Inference batching depends on the number of concurrent users. This means training tends to be compute bound where supporting the latest data types is critical, whereas inference speeds are often bottlenecked by memory which does not lend itself to product differentiation. If you put the same memory into your chip as your competitor, the difference is going to be way smaller.
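A toy roofline-style way to see that (all hardware numbers below are placeholders, not any specific chip): for a dense layer, FLOPs scale with batch size, but the weights have to be streamed from memory regardless, so small-batch inference goes memory-bound while big-batch training goes compute-bound.

    def bound(batch, d_in=8192, d_out=8192, bytes_per_param=2,
              peak_tflops=1000, mem_bw_tbs=3):
        flops = 2 * batch * d_in * d_out               # one matmul
        bytes_moved = d_in * d_out * bytes_per_param   # weight traffic dominates
        compute_time = flops / (peak_tflops * 1e12)
        memory_time = bytes_moved / (mem_bw_tbs * 1e12)
        return "compute-bound" if compute_time > memory_time else "memory-bound"

    for batch in (1, 8, 64, 512, 4096):
        print(batch, bound(batch))   # flips from memory- to compute-bound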
A real shame, BTW, that all that silicon doesn't do FP32 (very well). Once training ceases to be so needed, we could use all that number crunching for climate models and weather prediction.
Once you have trained, you have frozen weights: feed-forward networks consisting of fixed weights that you can just program in and run data over. These weights can be duplicated across any number of devices and just sit there running inference on new data.
If this turns out to be the future use-case for NNs(it is today), then Google are better set.
Once you settle on a design, then doing ASICs to accelerate it might make sense. But I'm not sure the gap is so big; the article says some things that aren't really true of datacenter GPUs (Nvidia DC GPUs haven't wasted hardware on graphics-related stuff for years).
"Meta in talks to spend billions on Google's chips, The Information reports"
https://www.reuters.com/business/meta-talks-spend-billions-g...
Perhaps the assumptions are true. The mere presence of LLMs seems to have lowered the IQ of the Internet drastically, sopping up financial investors and resources that might otherwise be put to better use.
And outside of Google this is a very academic debate. Any efficiency gains over GPUs will primarily turn into profit for Google rather than benefit for me as a developer or user of AI systems. Since Google doesn't sell TPUs, they are extremely well-positioned to ensure no one else can profit from any advantages created by TPUs.
First part is true at the moment, not sure the second follows. Microsoft is developing their own “Maia” chips for running AI on Azure with custom hardware, and everyone else is also getting in the game of hardware accelerators. Google is certainly ahead of the curve in making full-stack hardware that’s very very specialized for machine learning. But everyone else is moving in the same direction: lots of action is in buying up other companies that make interconnects and fancy networking equipment, and AMD/NVIDIA continue to hyper specialize their data center chips for neural networks.
Google is in a great position, for sure. But I don’t see how they can stop other players from converging on similar solutions.
As you note, they'll set the margins to benefit themselves, but you can still eke out some benefit.
Also, you can buy Edge TPUs, but as the name says these are for edge AI inference and useless for any heavy lifting workloads like training or LLMs.
https://www.amazon.com/Google-Coral-Accelerator-coprocessor-...
The influence goes both ways however - both the Steam Deck and Steam Machine are attempts to build a PC that resembles a game console (the former being Switch-like and the latter as something of a "GabeCube.") Steam Deck's Game Mode UI and Steam's Big Picture mode try to provide a console-like experience as well.
If Google wins, we all lose.
But they're not.
There's a few confounding problems:
1. Actually using that hardware effectively isn't easy. It's not as simple as jacking up some constant values and reaping the benefits. Actually using the hardware is hard, and by the time you've optimized for it, you're already working on the next model.
2. This is a problem that, if you're not Google, you can just spend your way out of. A model doesn't take a petabyte of memory to train or run. Regular old H100s still mostly work fine. Faster models are nice, but Gemini 3 Pro having 50% of the latency of Opus 4.5 or GPT 5.1 doesn't add enough value to matter to really anyone.
3. There's still a lot of clever tricks that work as low hanging fruit to improve almost everything about ML models. You can make stuff remarkably good with novel research without building your own chips.
4. A surprising amount of ML model development is boots on the ground work. Doing evals. Curating datasets. Tweaking system prompts. Having your own Dyson sphere doesn't obviate a lot of the typing and staring at a screen that necessarily has to be done to make a model half decent.
5. Fancy bespoke hardware means fancy bespoke failure modes. You can search stack overflow for CUDA problems, you can't just Bing your way to victory when your fancy TPU cluster isn't doing the thing you want it to do.
Slightly more seriously: what you say makes sense if and only if you're projecting Sam Altman and assuming that a) real legit superhuman AGI is just around the corner, and b) all the spoils will accrue to the first company that finds it, which means you need to be 100% in on building the next model that will finally unlock AGI.
But if this is not the case -- and it's increasingly looking like it's not -- it's going to continue to be a race of competing AIs, and that race will be won by the company that can deliver AI at scale the most cheaply. And the article is arguing that company will be Google.
I think you are missing the point. They are saying "weeks old" isn't very old.
> it's going to continue to be a race of competing AIs, and that race will be won by the company that can deliver AI at scale the most cheaply.
I don't see how that follows at all. Quality and distribution both matter a lot here.
Google has some advantages but some disadvantages here too.
If you are on AWS GovCloud, Anthropic is right there. Same on Azure, and on Oracle.
I believe Gemini will be available on the Oracle Cloud at some point (it has been announced) but they are still behind in the enterprise distribution race.
OpenAI is only available on Azure, although I believe their new contract lets them strike deals elsewhere.
On the consumer side, OpenAI and Google are well ahead of course.
Last week it looked like Google had won (hence the blog post), but now almost nobody is talking about Antigravity and Gemini 3 anymore, so yeah, what OP says is relevant.
I am fairly pro-Google (they invented the LLM, FFS...) and recognize the advantages (price/token, efficiency, vertical integration, established DCs w/ power allocations), but I also know they have a habit of slightly sucking at everything but search.
For example, OpenAI has announced trillion-dollar investments in data centers to continue scaling. They need to go through a middle-man (Nvidia), while Google does not, and will be able to use their investment much more efficiently to train and serve their own future models.
Performance per dollar doesn't "win" anything though. Performance (as in speed) hardly cracks the top five concerns that most folks have when choosing a model provider, because fast, good models already exist at price points that are acceptable. That might mean slightly better margins for Google, but ultimately isn't going to make them "win"
Also, performance and user choice are definitely impacted by compute. If they ever find a way to replace a job with LLMs, those who can throw more compute at it for a lower price point will win.
Arguably indeed, because I think it still is.
Which is to say, if Google was set up to win, it shouldn't even be a question that 3 Pro is the best. It should be obvious. But it's definitely not obvious that it's the best, and many benchmarks don't support it as being the best.
https://www.anthropic.com/news/expanding-our-use-of-google-c...
The biggest problem though is trust, and I'm still holding back from letting anyone under my authority in my org use Gemini because of the lack of any clear or reasonable statement or guidelines on how they use your data. I think it won't matter in the end if they execute their way to domination - but it's going to give everyone else a chance at least for a while.
Yes, but Google will never be able to compete with their greatest challenge... Google's attention span.
They’ve been very clear, in my opinion: https://cloud.google.com/gemini/docs/discover/data-governanc...
I suppose there will always be the people who refuse to trust them or choose to believe they’re secretly doing something different.
However I’m not sure what you’re referring to by saying they haven’t said anything about how data is used.
If you’re a business/enterprise, you get a different ToS that very clearly states that your data is yours.
If you use the free/consumer options, that’s where they are vague or direct about vacuuming up data.
Cerebras CS-3 specs:
• 4 trillion transistors
• 900,000 AI cores
• 125 petaflops of peak AI performance
• 44GB on-chip SRAM
• 5nm TSMC process
• External memory: 1.5TB, 12TB, or 1.2PB
• Trains AI models up to 24 trillion parameters
• Cluster size of up to 2048 CS-3 systems
• Memory B/W of 21 PB/s
• Fabric B/W of 214 Pb/s (~26.75 PB/s)
Comparing GPU to TPU is helpful for showcasing the advantages of the TPU in the same way that comparing CPU to Radeon GPU is helpful for showcasing the advantages of GPU, but everyone knows Radeon GPU's competition isn't CPU, it's Nvidia GPU!
TPU vs GPU is new paradigm vs old paradigm. GPUs aren't going away even after they "lose" the AI inference wars, but the winner isn't necessarily guaranteed to be the new paradigm chip from the most famous company.
Cerebras inference remains the fastest on the market to this day to my knowledge due to the use of massive on-chip SRAM rather than DRAM, and to my knowledge, they remain the only company focused on specialized inference hardware that has enough positive operating revenue to justify the costs from a financial perspective.
I get how valuable and important Google's OCS interconnects are, not just for TPUs or inference, but really as a demonstrated PoC for computing in general. Skipping the E-O-E translation in general is huge and the entire computing hardware industry would stand to benefit from taking notes here, but that alone doesn't automatically crown Google the victor here, does it?
to quote from their paper "In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is codesigned with the MoE gating algorithm and the network topology of our cluster."
The downside is the same thing that makes them fast: they’re very specialized. If your code already fits the TPU stack (JAX/TensorFlow), you get great performance per dollar. If not, the ecosystem gap and fear of lock-in make GPUs the safer default.
Google is a giant without a direction. The ads money is so good that it just doesn't have the gut to leave it on the table.
In AI, we're still in the explosion phase. If you build the perfect ASIC for Transformers today, and tomorrow a paper drops with a new architecture, your chip becomes a brick. NVIDIA pays the "legacy tax" and keeps CUDA specifically as insurance against algorithm churn. As long as the industry moves this fast, flexibility beats raw efficiency
- Google isn't just owning the technology, it builds a cohesive cloud around it; Tesla and Meta are working on their own ASIC AI chips, and I guess others are too.
- A signal has already been given: SoftBank sold its entire Nvidia stake and Berkshire added Google to their portfolio.
Microsoft "has" a lot of companies' data, and Google is probably building the most advanced AI cloud.
However, I can't help thinking that they had a cloud which was light-years ahead of AWS 15 years ago and now GCP is no. 3; they also released open-source GPT models more than 5 years ago that constituted the foundation for OpenAI's closed-source models.
- they were way ahead, and they didn't make any big mistakes
- they weren't waiting for others to catch up. They were aggressively improving
- memory bandwidth is almost always the bottleneck. Hence systolic array is "overrated". Furthermore, interconnect is the new bottleneck now
- cuda offers the most flexibility in the world of ever changing model requirements
TAKE MY MONEY!!!
sbarre•2mo ago
blibble•2mo ago
turning a giant lumbering ship around is not easy
sbarre•2mo ago
coredog64•2mo ago
sofixa•2mo ago
Nothing prevents them per se, but it would risk cannibalising their highly profitable (IIRC 50% margin) higher end cards.
numbers_guy•2mo ago
mindv0rtex•2mo ago
ezekiel68•2mo ago
fooker•2mo ago
bjourne•2mo ago
saagarjha•2mo ago
fooker•2mo ago
If not, what's fundamentally difficult about doing 32 vs 256 here?
neilmovva•2mo ago
see this blog for a reference on Blackwell:
https://hazyresearch.stanford.edu/blog/2025-03-15-tk-blackwe...
llm_nerd•2mo ago
To put it into perspective, the tensor cores deliver about 2,000 TFLOPs of FP8, and half that for FP16, and this is all tensor FMA/MAC (comprising the bulk of compute for AI workloads). The CUDA cores -- the rest of the GPU -- deliver more in the 70 TFLOP range.
So if data centres are buying nvidia hardware for AI, they already are buying focused TPU chips that almost incidentally have some other hardware that can do some other stuff.
I mean, GPUs still have a lot of non-tensor general uses in the sciences, finance, etc, and TPUs don't touch that, but yes a lot of nvidia GPUs are being sold as a focused TPU-like chip.
sorenjan•2mo ago
qcnguy•2mo ago
LogicFailsMe•2mo ago
The real challenge is getting the TPU to do more general purpose computation. But that doesn't make for as good a story. And the point about Google arbitrarily raising the prices as soon as they think they have the upper hand is good old fashioned capitalism in action.
timmg•2mo ago
The big difference is that Google is both the chip designer *and* the AI company. So they get both sets of profits.
Both Google and Nvidia contract TSMC for chips. Then Nvidia sells them at a huge profit. Then OpenAI (for example) buys them at that inflated rate and them puts them into production.
So while Nvidia is "selling shovels", Google is making their own shovels and has their own mines.
1980phipsi•2mo ago
m4rtink•2mo ago
And Google will end up with lots of useless super specialized custom hardware.
acoustics•2mo ago
iamtheworstdev•2mo ago
ithkuil•2mo ago
bryanlarsen•2mo ago
pksebben•2mo ago
timmg•2mo ago
If it gets to the point where this hardware is useless (I doubt it), yes Google will have it sitting there. But it will have cost Google less to build that hardware than any of the companies who built on Nvidia.
immibis•2mo ago
kolbe•2mo ago
UncleOxidant•2mo ago
heisenbit•2mo ago
skybrian•2mo ago
Even if TPU’s weren’t all that useful, they still own the data centers and can upgrade equipment, or not. They paid for the hardware out of their large pile of cash, so it’s not debt overhang.
Another issue is loss of revenue. Google cloud revenue is currently 15% of their total, so still not that much. The stock market is counting on it continuing to increase, though.
If the stock market crashes, Google’s stock price will go down too, and that could be a very good time to buy, much like it was in 2008. There’s been a spectacular increase since then, the best investment I ever made. (Repeating that is unlikely, though.)
nutjob2•2mo ago
Meanwhile OpenAI et al dumping GPUs while everyone else is doing the same will get pennies on the dollar. It's exactly the opposite to what you describe.
I hope that comes to pass, because I'll be ready to scoop up cheap GPUs and servers.
qcnguy•2mo ago
blinding-streak•2mo ago
pzo•2mo ago
throwawayffffas•2mo ago
sagarm•2mo ago
Citation needed. But the vertical integration is likely valuable right now, especially with NVidia being supply constrained.
ForHackernews•2mo ago
Having your own mines only pays off if you actually do strike gold. So far AI undercuts Google's profitable search ads, and loses money for OpenAI.
veunes•2mo ago
sojuz151•2mo ago
HarHarVeryFunny•2mo ago
UncleOxidant•2mo ago
mr_toad•2mo ago
If LLMs become unfashionable they’ll still be good for other ML tasks like image recognition.
Workaccount2•2mo ago
Everyone using Nvidia hardware has a lot of overlap in requirements, but they also all have enough architectural differences that they won't be able to match Google.
OpenAI announced they will be designing their own chips, exactly for this reason, but that also becomes another extremely capital intensive investment for them.
This also doesn't get into the fact that Google also already has S-tier datacenters and datacenter construction/management capabilities.
wood_spirit•2mo ago
saagarjha•2mo ago
overfeed•2mo ago
saagarjha•2mo ago
overfeed•2mo ago
saagarjha•2mo ago
overfeed•2mo ago
saagarjha•2mo ago
01100011•2mo ago
You don't think Nvidia has field-service engineers and applications engineers with their big customers? Come on man. There is quite a bit of dialogue between the big players and the chipmaker.
Workaccount2•2mo ago
Deepmind can do whatever they want, and get the exact hardware to match it. It's a massive advantage when you can discover a bespoke way of running a filter, and you can get a hardware implementation of it without having to share that with any third parties. If OpenAI takes a new find to Nvidia, everyone else using Nvidia chips gets it too.
01100011•2mo ago
In your example, if OpenAI makes a massive new find they aren't taking it to NVDA.
Nvidia has the advantage of a broad base of customers that gives it a lot of information on what needs work and it tries to quickly respond to those deficiencies.
Workaccount2•2mo ago
Right, and therefore they are stuck doing it in software, while google can do it in hardware.
jauntywundrkind•2mo ago
They could make a systolic array TPU and software, perhaps. But it would mean abandoning 18 years of CUDA.
The top post right now is talking about the TPU's colossal advantage in scaling & throughput. Ironwood is massively bigger & faster than what Nvidia is shooting for, already. And that's a huge advantage. But imo that is a replicable win. Throw gobs more at networking and scaling and Nvidia could do similar with their architecture.
The architectural win of what TPU is more interesting. Google sort of has a working super powerful Connection Machine CM-1. The systolic array is a lot of (semi-)independent machines that communicate with nearby chips. There's incredible work going on to figure out how to map problems onto these arrays.
Whereas on a GPU, main memory is used to transfer intermediary results. It doesn't really matter who picks up work; there are lots of worklets with equal access time to that bit of main memory. The actual situation is a little more nuanced (even in consumer GPUs there are really multiple different main memories, which creates some locality), but there's much less need for data locality on the GPU, while the TPU has much, much tighter needs: the whole premise of the TPU is to exploit data locality, because sending data to a neighbor is cheap, while storing and retrieving data from memory is slower and much more energy intense.
CUDA takes advantage of, and relies strongly on, the GPU's main memory being (somewhat) globally accessible. There's plenty of workloads folks do in CUDA that would never work on a TPU, on these much more specialized data-passing systolic arrays. That's why TPUs are so amazing: because they are much more constrained devices that require so much more careful workload planning, to get the work to flow across the 2D array of the chip.
Google's work on projects like XLA and IREE is a wonderful & glorious general pursuit of how to map these big crazy machine learning pipelines down onto specific hardware. Nvidia could make their own or join forces here. And perhaps they will. But the CUDA moat would have to be left behind.
zzzoom•2mo ago
Tensor cores are specialized and have CUDA support.
jauntywundrkind•2mo ago
But it's still something grafted onto the existing architecture, of many grids with many blocks with many warps, and lots and lots of coordination and passing intermediary results around. It's only a 4x4x4 unit, afaik. There's still a lot of main memory being used to combine data, a lot of orchestration among the different warps and blocks and grids, to get big matrices crunched.
The systolic array is designed to allow much more fire-and-forget operations. Its inputs are 128 x 128, and each cell is basically its own compute node, shuffling data through and across (but not transiting a far-off memory).
TPU architecture has plenty of limitations. It's not great at everything. But if you can design work to flow from cell to neighboring cell, you can crunch very sizable chunks of data with amazing data locality. The efficiency there is unparalleled.
Nvidia would need a radical change of their architecture to get anything like the massive data locality wins a systolic array can do. It would come with massively more constraints too.
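If it helps, here's a tiny software model of the output-stationary systolic idea (a gross simplification of my own, not Google's design): every cell only ever talks to its neighbors and accumulates its own output entry locally, with no shared main memory in the inner loop.

    import numpy as np

    def systolic_matmul(A, B):
        # A values flow right, B values flow down; cell (i, j) keeps C[i, j].
        n = A.shape[0]
        C = np.zeros((n, n))
        for step in range(3 * n - 2):                 # enough cycles to drain the array
            for i in range(n):
                for j in range(n):
                    k = step - i - j                  # which A/B element arrives this cycle
                    if 0 <= k < n:
                        C[i, j] += A[i, k] * B[k, j]  # local MAC on data from neighbors
        return C

    A = np.random.randn(8, 8)
    B = np.random.randn(8, 8)
    print(np.allclose(systolic_matmul(A, B), A @ B))  # True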
Would love if anyone else has recommended reading. I have this piece earmarked. https://henryhmko.github.io/posts/tpu/tpu.html https://news.ycombinator.com/item?id=44342977
baron816•2mo ago
storus•2mo ago
torginus•2mo ago
It might even be 'free' to fill it with more complicated logic (especially logic that allows you to write clever algorithms that let you save on bandwidth).