frontpage.

France's homegrown open source online office suite

https://github.com/suitenumerique
202•nar001•2h ago•110 comments

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
374•theblazehen•2d ago•134 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
65•AlexeyBrin•3h ago•12 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
40•onurkanbkrc•3h ago•2 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
749•klaussilveira•18h ago•234 comments

Coding agents have replaced every framework I used

https://blog.alaindichiappari.dev/p/software-engineering-is-back
108•alainrk•2h ago•116 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
1001•xnx•23h ago•569 comments

Show HN: One-click AI employee with its own cloud desktop

https://cloudbot-ai.com
7•fainir•1h ago•1 comment

First Proof

https://arxiv.org/abs/2602.05192
11•samasblack•32m ago•4 comments

Stories from 25 Years of Software Development

https://susam.net/twenty-five-years-of-computing.html
6•vinhnx•1h ago•1 comment

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
132•jesperordrup•8h ago•55 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
91•videotopia•4d ago•20 comments

Ga68, a GNU Algol 68 Compiler

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
30•matt_d•4d ago•6 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
148•matheusalmeida•2d ago•40 comments

Reputation Scores for GitHub Accounts

https://shkspr.mobi/blog/2026/02/reputation-scores-for-github-accounts/
6•edent•2h ago•0 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
253•isitcontent•18h ago•27 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
266•dmpetrov•18h ago•142 comments

A Fresh Look at IBM 3270 Information Display System

https://www.rs-online.com/designspark/a-fresh-look-at-ibm-3270-information-display-system
6•rbanffy•3d ago•0 comments

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

https://github.com/sandys/kappal
10•sandGorgon•2d ago•2 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
530•todsacerdoti•1d ago•257 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
409•ostacke•1d ago•105 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
353•vecti•20h ago•159 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
321•eljojo•21h ago•198 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
448•lstoll•1d ago•296 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
54•helloplanets•4d ago•54 comments

Cross-Region MSK Replication: K2K vs. MirrorMaker2

https://medium.com/lensesio/cross-region-msk-replication-a-comprehensive-performance-comparison-o...
6•andmarios•4d ago•1 comment

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
365•aktau•1d ago•190 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
292•i5heu•21h ago•246 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
103•quibono•5d ago•29 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
53•gmays•13h ago•22 comments

TinyTinyTPU: 2×2 systolic-array TPU-style matrix-multiply unit deployed on FPGA

https://github.com/Alanma23/tinytinyTPU-co
135•Xenograph•1mo ago

Comments

hinkley•1mo ago
I think I could trust AI more if we used it to do heuristics for expensive deterministic processes. Sort of a cross between Bloom filters and speculative execution. Determine the odds that expensive operation 1 will indicate that expensive operation 2 needs to happen, then start expensive operation 2 while we determine whether it's actually needed. If it's right 95% of the time, which is the sort of range AI can aspire to, that skips the high-latency task chaining 19 times out of 20, which would be pretty good.
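
A minimal sketch of that pattern in Python; the predictor, the two operations, and the 0.95 cutoff are illustrative assumptions rather than anything from a specific library:

    import concurrent.futures

    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

    def speculative_chain(x, predictor, expensive_op_1, expensive_op_2):
        """Start op 2 early whenever a cheap learned predictor says op 1
        will almost certainly require it (the 0.95 cutoff is illustrative)."""
        speculated = _pool.submit(expensive_op_2, x) if predictor(x) > 0.95 else None

        needed = expensive_op_1(x)   # slow, deterministic, authoritative check

        if needed:
            # Hit (~19 times out of 20): op 2 is already in flight or finished.
            return speculated.result() if speculated else expensive_op_2(x)
        if speculated:
            speculated.cancel()      # miss: discard the speculative work
        return None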
rjsw•1mo ago
There have been comments that some leading AI researchers were switching away from working on language models to do stuff with "real world data".
p1esk•1mo ago
What do you mean?
tornikeo•1mo ago
Meaning a GPT, but the next token is a live sensor reading, a servo angle, or an accelerometer state. Then connect that GPT to an actual LLM as a controller and you (hopefully) have a physical machine with arms, legs, and a mind.
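
A toy illustration of that idea; the bin count and value range below are made-up assumptions, and it only shows how continuous readings could be discretized into a token vocabulary for a next-token model:

    import numpy as np

    N_BINS = 256        # assumed vocabulary size per sensor channel
    LO, HI = -1.0, 1.0  # assumed normalized sensor / servo range

    def to_tokens(readings):
        # Map float readings in [LO, HI] onto integer token ids 0..N_BINS-1.
        clipped = np.clip(readings, LO, HI)
        return np.round((clipped - LO) / (HI - LO) * (N_BINS - 1)).astype(int)

    def from_tokens(tokens):
        # Invert the mapping so a predicted token becomes an actuator command.
        return tokens / (N_BINS - 1) * (HI - LO) + LO

    print(to_tokens(np.array([-1.0, 0.0, 0.37])))  # e.g. [  0 128 175]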
hnuser123456•1mo ago
There are Bayesian neural networks that could apparently track probability rather than just, e.g., randomly selecting one output from the top-k based on probability, but I'm still reading up on them myself. Sounds like they're not normally combined with language models.
inkysigma•1mo ago
Iirc, the problem with Bayesian neural networks is that they're significantly more difficult to train. Using stuff like SVI reduces a lot of the representational ability of the distribution over weights. It's also questionable how useful the uncertainty over weights is.

I suppose in the tradition of Bayesian inference, VAEs and the like are still common, though.

0-_-0•1mo ago
CPU branch predictors use neural networks
hinkley•1mo ago
CPU branch predictors aren’t going to run long expensive operations in the background. This is like saying bloom filters are speculative memory fetches. That’s not completely untrue but it misses the point.
aunty_helen•1mo ago
I think it's only a matter of time before we see ASIC vendors making TPU devices. The same thing happened with BTC: there was enough money there to spawn an industry. Nvidia's 70% margins are too hard to ignore. And if playing on the open market seems too rough, there's always acquisition potential, like what happened to Groq.
NitpickLawyer•1mo ago
Aren't high end accelerators already closer to ASICs than to og GPUs, tho?
tonetegeatinst•1mo ago
Yes, but not as much as you think.

A lot of silicon on a GPU is dedicated to upscaling and matrix multiply.

Ultimately, a GPU's main use is multimedia and graphics.

See all the miners that used to do GPU-based mining... or the other niche markets where eventually the cost of a custom ASIC becomes too attractive to ignore, even if you as a consumer have to handle a few years of growing pains.

ssivark•1mo ago
> Ultimately GPU's main use is multimedia and graphics focused

This has long ceased to be true, especially for data center focused gpus from the last few years; the "gpu" moniker is really a misnomer / historical artifact.

alanma•1mo ago
hard to argue today's GPUs are really graphics focused anymore in the training / inference race :O

really excited about Rubin CPX / Feynman generations, let's see what the LPU does to the inference stack

fooblaster•1mo ago
Great! How do you program it?
alanma•1mo ago
A couple of core commands in our ISA are detailed on our GitHub; map your problem to matrix ops. Here's a brief excerpt, but our tpu_compiler and tpu_driver are the core of programming your own:

import torch
import torch.nn as nn

from tpu_compiler import TPUCompiler, TPURuntime

# A tiny 2-in / 2-out model, sized to the 2x2 systolic array.
class Custom(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 2, bias=False)
        self.layer2 = nn.Linear(2, 2, bias=False)

    def forward(self, x):
        x = self.layer1(x)
        x = torch.relu(x)
        x = self.layer2(x)
        return x

# train_model, your_data, input_data and tpu are placeholders from the
# original snippet: your own training loop, data, and device handle.
model = train_model(your_data)

# compile to the tiny tiny TPU format
compiler = TPUCompiler()
compiled = compiler.compile(model)

# run and enjoy :)
runtime = TPURuntime(tpu)
result = runtime.inference(compiled, input_data)

Will update soon with some better documentation, but hopefully this will get you started!

- Alan and Abiral

ph4evers•1mo ago
Such a cool project! Next one is to run jaxprs via the driver?
alanma•1mo ago
Definitely thinking about that! Would be very cool to run the JAX / Pallas stack, noted on our end :)

- Alan and Abiral

mrinterweb•1mo ago
I've been wondering when we will see general-purpose consumer FPGAs, and eventually ASICs, for inference. This reminds me of bitcoin mining. Bitcoin mining started with GPUs. I think I remember a brief FPGA period that transitioned to ASICs. My limited understanding of Google's tensor processing unit chips is that they are effectively a transformer ASIC. That's likely a wild over-simplification of Google's TPU, but Gemini is proof that GPUs are not needed for inference.

I suspect GPU inference will come to an end soon, as it will likely be wildly inefficient compared to purpose-built transformer chips. All those Nvidia GPU-based servers may become obsolete should transformer ASICs become mainstream. GPU bitcoin mining is just an absolute waste of money (cost of electricity) now. I believe the same will soon be true for GPU-based inference. The hundreds of billions of dollars being invested in GPU-based inference seem like an extremely risky bet that transformer ASICs won't happen, although Google has already widely deployed its own TPUs.

tucnak•1mo ago
It all comes down to memory and fabric bandwidth. For example, the state-of-the-art developer-friendly (PCIe 5.0) FPGA platform is the Alveo V80, which rocks four 200G NICs. Basically, Alveo currently occupies this niche where it's the only platform on the market to allow programmable in-network compute. However, what's available in terms of bandwidth lags behind even pathetic platforms like Bluefield. Those in the know are aware of the challenges of actually saturating it for inference in practical designs. I think Xilinx is super well-positioned here, but without some solid hard IP it's still a far cry from purpose-built silicon.
mrinterweb•1mo ago
As far as I understand, all the purpose-built inference silicon out there is kept in-house rather than sold to competitors: Google's TPU, Amazon's Inferentia (horrible name), Microsoft's Maia, Meta's MTIA. It seems that custom inference silicon is a huge part of the AI game. I doubt GPU-based inference will stay relevant/competitive for long.
almostgotcaught•1mo ago
[flagged]
mrinterweb•1mo ago
Soon was wrong. I should have said it is already happening. Google Gemini already uses their own TPU chips. Nvidia just dropped $20B to buy the IP for Groq's LPU (custom silicon for inference). $20B says Nvidia sees the writing on the wall for GPU-based inference. https://www.tomshardware.com/tech-industry/semiconductors/nv...
almostgotcaught•1mo ago
There are so many people on here that are outsiders commenting way out of their depth:

> Google Gemini already uses their own TPU chips

Google has been using TPUs in prod for like a decade.

nomel•1mo ago
> It seems that custom inference silicon is a huge part of the AI game.

Is there any public info about % inference on custom vs GPU, for these companies?

mrinterweb•1mo ago
Gemini is likely the most widely used gen AI model in the world considering search, Android integration, and countless other integrations into the Google ecosystem. Gemini runs on their custom TPU chips. So I would say a large portion of inference is already using ASIC. https://cloud.google.com/tpu
nightshift1•1mo ago
According to this SemiAnalysis article, the Google/Broadcom TPUs are being sold to others like Anthropic.

https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...

fooblaster•1mo ago
FPGAs will never rival GPUs or TPUs for inference. The main reason is that GPUs aren't really GPUs anymore: 50% of the die area or more is fixed-function matrix multiplication units and associated dedicated storage. This just isn't general-purpose anymore, and FPGAs cannot rival it with their configurable DSP slices. They would need dedicated systolic blocks, which they aren't getting. The closest thing is the Versal ML tiles, and those are entire processors, not FPGA blocks. Those have failed by being impossible to program.
ithkuil•1mo ago
Turns out that a lot of interesting computation can be expressed as a matrix multiplication.
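
Two quick, unrelated-looking examples of that in plain NumPy (nothing specific to this project): a graph path count and a 1-D convolution, both phrased as matrix multiplies:

    import numpy as np

    # 1. Counting length-2 paths in a graph is just A @ A.
    A = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])   # adjacency matrix of a 3-node cycle
    two_hop = A @ A             # entry (i, j) = number of 2-hop paths i -> j

    # 2. A 1-D convolution is a matmul with a banded (Toeplitz) matrix.
    kernel = np.array([1.0, -2.0, 1.0])            # second-difference stencil
    signal = np.array([0.0, 1.0, 4.0, 9.0, 16.0])  # samples of x^2
    T = np.zeros((3, 5))
    for i in range(3):
        T[i, i:i + 3] = kernel                     # slide the stencil along
    print(T @ signal)                              # [2. 2. 2.] -- constant curvature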
fooblaster•1mo ago
Yeah, I wouldn't have guessed it would be helping me write systemverilog.
alanma•1mo ago
yup, GBs are so much tensor core nowadays :)
Lerc•1mo ago
I think it'll get to a point with quantisation that the GPUs that run models will be more FPGA-like than graphics renderers. If you quantize far enough, things begin to look more like gates than floating-point units. At that level an FPGA wouldn't run your model, it would be your model.
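
A toy version of that endpoint, using the standard binarized-network trick (not anything from this thread): with weights and activations restricted to ±1, a dot product collapses to XNOR plus popcount, i.e. pure gate logic:

    def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
        # Elements are +/-1, packed one per bit (1 -> +1, 0 -> -1).
        agree = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # XNOR: 1 where signs match
        return 2 * bin(agree).count("1") - n         # popcount, rescaled to a sum

    # a = [+1, -1, +1, +1], w = [+1, +1, -1, +1]  ->  1 - 1 - 1 + 1 = 0
    print(binary_dot(0b1011, 0b1101, 4))  # 0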
fpgaminer•1mo ago
> FPGAs will never rival gpus or TPUs for inference. The main reason is that GPUs aren't really gpus anymore.

Yeah. Even for Bitcoin mining GPUs dominated FPGAs. I created the Bitcoin mining FPGA project(s), and they were only interesting for two reasons: 1) they were far more power efficient, which in the case of mining changes the equation significantly. 2) GPUs at the time had poor binary math support, which hampered their performance; whereas an FPGA is just one giant binary math machine.

beeflet•1mo ago
I have wondered if it is possible to make a mining algorithm FPGA-hard in the same way that RandomX is CPU-hard and memory-hard. Relative to CPUs, the "programming time" cost is high.

Nice username btw.

hayley-patton•1mo ago
My recollection is that ASIC-resistance involves using lots of scratchpad memory and mixing multiple hashing algorithms, so that you'd have to use a lot of silicon and/or bottleneck hard on external RAM. I think the same would hurt FPGAs too.
ksk23•1mo ago
Imho, not knowing too much about either concept, it kinda is!

You would need to re-implement a general-purpose CPU to beat it; at least that was the idea behind RandomX.

dnautics•1mo ago
I don't think this is correct. For inference, the bottleneck is memory bandwidth, so if you can hook up an FPGA with better memory, it has an outside shot at beating GPUs, at least in the short term.

I mean, I have worked with FPGAs that outperform H200s in Llama3-class models a while and a half ago.
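
A rough back-of-envelope for why batch-1 decoding is bandwidth-bound; the figures below are assumed ballpark numbers, not measurements from any of these systems:

    # Assumed ballpark figures, not measurements:
    hbm_bandwidth_gb_s = 4800   # roughly H200-class HBM bandwidth
    weight_gb = 70              # ~70B-parameter model at 8-bit weights

    # At batch size 1, every generated token streams (roughly) all weights once,
    # so memory bandwidth caps token rate no matter how many FLOPS the chip has:
    print(round(hbm_bandwidth_gb_s / weight_gb))   # ~69 tokens/s upper bound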

fooblaster•1mo ago
Show me a single FPGA that can outperform a B200 at matrix multiplication (or even come close) at any usable precision.

B200 can do 10 peta ops at fp8, theoretically.

I do agree memory bandwidth is also a problem for most FPGA setups, but Xilinx ships HBM with some SKUs and they are not competitive at inference as far as I know.

checker659•1mo ago
Said GPUs spend half the time just waiting for memory.
fooblaster•1mo ago
Yep, but they are still 50x faster than any fpga.
dnautics•1mo ago
probably not B200 level but better than you might expect:

https://www.positron.ai/

i believe a B200 is ~3x the H200 at llama-3, so that puts the FPGAs at around 60% the speed of B200s?

fooblaster•1mo ago
I wouldn't trust any benchmarks on the vendors site. Microsoft went down this path for years with FPGAs and wrote off the entire effort.
dnautics•1mo ago
ok? i worked on those devices, those numbers are real. there's a reason why they compare to h200 and not b200

> I have worked with FPGAs that outperform H200s in Llama3-class models a while and a half ago

fooblaster•1mo ago
I'd like to know more. I expect these systems are 8xvh1782. Is that true? What's the theoretical math throughput - my expectation is that it isn't very high per chip. How is performance in the prefill stage when inference is actually math limited?
dnautics•1mo ago
i was a software guy, sorry, but those token rates are correct and what was flowing through my software.

i believe there was a special deal on super special fpgas. there were dsps involved.

imtringued•1mo ago
I feel like your entire comment is a self-contradicting mess.

You say FPGAs won't get dedicated logic for ML, then you say they did.

Why does it matter whether the matrix multiplication units inside the AI Engine are a systolic array or not? The multipliers support 512-bit inputs, which means a 4x8 times 8x4 multiply for bfloat16, with one multiplication per cycle and bigger multiplications for smaller data types. Since it is a VLIW processor, it is much easier to achieve full utilisation of the matrix multiplication units, because you can run loads, stores, and tile processing all simultaneously in the same cycle.

The only thing that might be a challenge is arranging the communication between the AI Engines, but even that should be blatantly obvious. If you are doing matrix multiplication, you should be using the entire array in exactly the pattern you think they should be using internally.

Who knows, maybe there is a way to implement flash attention like that too.
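
Spelling out the shape arithmetic from the comment above, taking its numbers at face value rather than checking them against AMD documentation: 512 bits / 16 bits per bfloat16 = 32 elements per operand, i.e. a 4x8 tile times an 8x4 tile yielding a 4x4 accumulator each cycle:

    import numpy as np

    ELEMS = 512 // 16                  # 32 bfloat16 values fit in one 512-bit operand
    A_tile = np.arange(ELEMS, dtype=np.float32).reshape(4, 8)   # 4x8 operand
    B_tile = np.ones((8, 4), dtype=np.float32)                  # 8x4 operand
    C_tile = A_tile @ B_tile           # 4x4 partial product per "cycle"
    print(C_tile.shape)                # (4, 4)

    # A full matmul is then a loop over such tiles, accumulating into C --
    # exactly the dataflow a systolic array hard-wires.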

fooblaster•1mo ago
The Versal stuff isn't really an FPGA anymore. The chips have PL on them, but many don't. The consumer NPUs from AMD are the same Versal AIE cores with no PL. They just aren't configurable blocks in fabric anymore and don't have the same programming model. So I'm not contradicting myself here.

That being said, Versal AIE for ML has been a terrible failure. The reasons why are complicated. One reason is that the memory hierarchy for SRAM is not a unified pool: it's partitioned into tiles and can't be accessed by all cores. Additionally, this SRAM is accessed only via DMA engines and not directly from the cores. Thirdly, the datapaths for feeding the VLIW cores are statically set, and require a software reconfiguration to change at runtime, which is slow. Programming this thing makes the Cell processor look like a cakewalk. You have to program DMA engines, you program hundreds of VLIW cores, you need to explicitly set up the on-chip network fabric. I could go on.

Anyway, my point is FPGAs aren't getting ML slices. Some FPGAs do have a completely separate thing that can do ML, but what is shipped is terrible. Hopefully that makes sense.

teleforce•1mo ago
>Those have failed by being impossible to program.

I think you spoke too soon about their failure; soon they will be much easier to program [1].

Interestingly, Nvidia is now also moving to a tile-based GPU programming model that targets portability for NVIDIA Tensor Cores [2]. There were recent discussions on the topic at HN [3].

[1] Developing a BLAS Library for the AMD AI Engine [pdf]:

https://uni.tlaan.nl/thesis/msc_thesis_tristan_laan_aieblas....

[2] NVIDIA CUDA Tile:

https://developer.nvidia.com/cuda/tile

[3] CUDA Tile Open Sourced (103 comments):

https://news.ycombinator.com/item?id=46330732

fooblaster•1mo ago
The AMD NPU and Versal ML tiles (same underlying architecture) have been a complete failure. Dynamic programming models like CUDA Tile do not work on them at all, because they require an entirely static graph to function. AMD is going to walk away from its NPU architecture and unify around its GPU IP for inference products in the future.
Narew•1mo ago
There were, in the past. Google had the Coral TPU and Intel the Neural Compute Stick (NCS). The NCS is from 2018, so it's really outdated now. Both were mainly oriented toward edge computing, so the FLOPS were not comparable to a desktop computer.
moffkalast•1mo ago
Even for edge computing, neither was really capable of keeping up with the slowest Jetson's GPU, for not much less power draw.
bee_rider•1mo ago
There are also CPU extensions like AVX512-VNNI and AVX512-BF16. Maybe the idea of communicating out to a card that holds your model will eventually go away. Inference is not too memory bandwidth hungry, right?
liuliu•1mo ago
This is a common misunderstanding from industry observers (not industry practitioners). Each generation of (NVIDIA) GPU is an ASIC with a different ISA, etc. Bitcoin mining simply was not important enough (last year, only $23B of Bitcoin was mined in total, at $100,000 per coin). There is ample incentive to implement every possible useful instruction in the GPU (without worrying about backward compatibility, thanks to PTX).

ASIC transformers won't happen (defined as: a chip with a single instruction to do SDPA, from anyone not broadly marketing it as a GPU, won't see annualized sales of more than $3B). Mark my words. I am happy to take a bet on longbets.org with anyone on this for $1000, and my part will go to the PSF.

dnautics•1mo ago
I don't know if they'll reach $3B, but at least one company is using FPGA transformers (that perform well) to get revenue in before going to ASIC transformers:

https://www.positron.ai/

zhemao•1mo ago
TPUs aren't transformer ASICs. The Ironwood TPU that Gemini was trained on was designed before LLMs became popular with ChatGPT's release. The architecture was general enough that it ended up being efficient for LLM training.

A special-purpose transformer inference ASIC would be like Etched's Sohu chip.

mrinterweb•1mo ago
> TPUs aren't transformer ASICs.

https://cloud.google.com/tpu

> A TPU is an application-specific integrated circuit (ASIC) designed by Google for neural networks.

seamossfet•1mo ago
The only time FPGAs / ASICs are better is if there are gains to be made by innovating on the hardware architecture itself. That's pretty hard to do, considering GPUs are already heavily optimized for this use case.
babl-yc•1mo ago
This is cool. I'm observing a trend of "build a tiny version from the ground up to understand it," a la Karpathy's micrograd/minGPT. Seems like one of the best ways to learn.
alanma•1mo ago
thanks for the kind words of support! definitely taught us a thing or two, hope you enjoyed the ride along

- Alan and Abiral

alanma•1mo ago
Thanks again for the repost and all the support!! It's been a blast and super cool to see the interest. If you want to follow along for more of our writeups, our blog can be found here: https://chewingonchips.substack.com/