kouteiheika•23h ago
So it's called an "AI Engine", but its performance is worse than just running the same thing on the CPU? Doesn't that make it essentially useless for anything AI-related? What's the point of this hardware, then? Better power efficiency for tiny models? Surely someone must be using it for something?
shetaye•23h ago
The CPU baseline seems to be the beefy host CPU. The AIE is presumably faster than what you could do with the FPGA fabric (DSP slices, LUTs, etc.) alone.
heavyset_go•23h ago
The point is offloading ML workloads to hardware that is energy efficient, not necessarily "fast".
You want to minimize the dollar and energy costs at the expense of time.
Assuming NPUs don't get pulled from consumer hardware altogether, the performance you give up for that efficiency should keep shrinking as the hardware matures.
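To make the trade-off concrete with illustrative numbers (not measurements from the article): an NPU drawing 2 W that takes 10 s for an inference spends 2 W × 10 s = 20 J, while a CPU drawing 45 W that finishes in 2 s spends 90 J. The NPU is 5x slower but roughly 4.5x cheaper in energy, and that's the trade being made.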
titanix88•23h ago
Looks like the author has not used software-pipelining compiler directives on the kernel loops. The AMD AIE architecture has a 5-cycle load/store latency and a 7-cycle FP-unit latency; with software pipelining, long loops could see a 5-10x speedup.
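For illustration, here's roughly what that looks like in AIE kernel C++ using the CHESS compiler's pipelining directives (a sketch with made-up kernel and buffer names, not code from the article):

    #include <aie_api/aie.hpp>

    // Hypothetical kernel: out[i] = 2 * in[i] over n floats (n assumed a multiple of 8).
    void scale_by_two(const float* __restrict in, float* __restrict out, unsigned n) {
        auto in_it  = aie::begin_vector<8>(in);
        auto out_it = aie::begin_vector<8>(out);
        for (unsigned i = 0; i < n / 8; ++i)
            // Ask the compiler to overlap iterations so the 5-cycle
            // load/store and 7-cycle FP latencies are hidden.
            chess_prepare_for_pipelining
            // Promise a minimum trip count so the pipelined schedule is legal.
            chess_loop_range(8, )
        {
            aie::vector<float, 8> v = *in_it++;
            *out_it++ = aie::add(v, v);  // 2*x without an accumulator round-trip
        }
    }

Without the directives, the scheduler serializes each iteration behind those latencies; with them, back-to-back iterations keep the load, FP, and store units busy, which is where a 5-10x gain on long loops would come from.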
It's an IP block that Xilinx can provide for use on their FPGAs, but as implemented on the Ryzen parts it's hardened silicon, not FPGA fabric plus a bitstream.
fooblaster•20h ago
This architecture is likely to be a dead end for AMD. It has been in the wild for several years, yet it still has no open programming model, and its multiple compiler stacks have poor software support. I find it likely that AMD drops this architecture and unifies its ML support around its GPGPU hardware.