Cutting-edge and innovative AI hardware research from China.
KnuthIsGod•3h ago
Looks like American sanctions are driving a new wave of innovation in China.
" This work addresses that gap by introducing the Ten- sor Manipulation Unit (TMU): a reconfigurable, near-memory hardware block designed to execute data-movement-intensive (DMI) operators efficiently. TMU manipulates long datastreams in a memory-to-memory fashion using a RISC-inspired execution model and a unified addressing abstraction, enabling broad support for both coarse- and fine-grained tensor transformations.
The proposed architecture integrates TMU alongside a TPU within a high-throughput AI SoC, leveraging double buffering and output forwarding to improve pipeline utilization. Fab- ricated in SMIC 40 nm technology, the TMU occupies only 0.019 mm2 while supporting over 10 representative TM operators. Benchmarking shows that TMU alone achieves up to 1413.43× and 8.54× operator-level latency reduction over ARM A72 and NVIDIA Jetson TX2, respectively.
When integrated with the in- house TPU, the complete system achieves a 34.6% reduction in end-to-end inference latency, demonstrating the effectiveness and scalability of reconfigurable tensor manipulation in modern AI SoCs."
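For anyone wondering what a "unified addressing abstraction" for tensor-manipulation operators amounts to in practice, here is a minimal C sketch (my own illustration, not the paper's design): a single strided loop nest, programmed per operator with a descriptor, covers transpose, slice, plain copy, and similar memory-to-memory rearrangements.

```c
/* Hypothetical sketch (not from the paper): one strided address
 * generator, parameterized per operator, illustrating how a single
 * "unified addressing" loop can cover many tensor-manipulation
 * operators in a memory-to-memory fashion. */
#include <stdio.h>
#include <stddef.h>

#define MAX_DIMS 4

typedef struct {
    size_t shape[MAX_DIMS];      /* iteration extents */
    size_t src_stride[MAX_DIMS]; /* element strides into the source */
    size_t dst_stride[MAX_DIMS]; /* element strides into the dest */
    int    ndim;
} tm_descriptor;

/* Memory-to-memory copy driven purely by the descriptor: the same
 * loop nest performs a transpose, a slice, or a plain copy depending
 * on how the strides are programmed. */
static void tm_execute(const tm_descriptor *d,
                       const float *src, float *dst)
{
    size_t idx[MAX_DIMS] = {0};
    for (;;) {
        size_t s = 0, t = 0;
        for (int k = 0; k < d->ndim; ++k) {
            s += idx[k] * d->src_stride[k];
            t += idx[k] * d->dst_stride[k];
        }
        dst[t] = src[s];

        int k = d->ndim - 1;              /* odometer-style increment */
        while (k >= 0 && ++idx[k] == d->shape[k])
            idx[k--] = 0;
        if (k < 0) break;
    }
}

int main(void)
{
    float src[2 * 3] = {0, 1, 2, 3, 4, 5};
    float dst[3 * 2];
    /* 2x3 -> 3x2 transpose: walk dst in row-major order while the
     * source strides are swapped. */
    tm_descriptor t = {
        .shape      = {3, 2},
        .src_stride = {1, 3},   /* read columns of the 2x3 source */
        .dst_stride = {2, 1},   /* write rows of the 3x2 dest */
        .ndim       = 2,
    };
    tm_execute(&t, src, dst);
    for (int i = 0; i < 6; ++i) printf("%g ", dst[i]);
    printf("\n");               /* prints: 0 3 1 4 2 5 */
    return 0;
}
```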
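The double buffering the abstract mentions is the classic ping-pong pattern. A minimal sketch of the idea, with tmu_fill() and tpu_consume() as made-up stand-ins for the real hardware interface (in silicon the two stages overlap; this sequential version just shows the buffer alternation):

```c
/* Hypothetical ping-pong double-buffering sketch: two staging buffers
 * alternate so the TMU can fill one while the TPU drains the other. */
#include <stdio.h>

#define TILE 4
#define NUM_TILES 3

/* Stand-in for the TMU writing a rearranged tile into a buffer. */
static void tmu_fill(float *buf, int tile)
{
    for (int i = 0; i < TILE; ++i)
        buf[i] = (float)(tile * TILE + i);
}

/* Stand-in for the TPU consuming a staged tile. */
static void tpu_consume(const float *buf, int tile)
{
    float sum = 0;
    for (int i = 0; i < TILE; ++i)
        sum += buf[i];
    printf("tile %d consumed, sum = %g\n", tile, sum);
}

int main(void)
{
    float ping[TILE], pong[TILE];
    float *bufs[2] = {ping, pong};

    tmu_fill(bufs[0], 0);                       /* prime buffer 0 */
    for (int t = 0; t < NUM_TILES; ++t) {
        if (t + 1 < NUM_TILES)
            tmu_fill(bufs[(t + 1) & 1], t + 1); /* overlaps in HW */
        tpu_consume(bufs[t & 1], t);
    }
    return 0;
}
```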