I've been working on optimizing perception pipelines for SWaP-constrained (size, weight, and power) FPGAs, like those in satellite or drone payloads. I realized that we often run out of DSP slices even for simple 3x3 convolutions.
I implemented a method that approximates these convolutions by learning coefficients constrained strictly to powers of two (PoT). This allows the constant multipliers to be replaced with bit-shifts and adders implemented in LUTs.
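To make the trick concrete, here is a simplified C sketch of one "atomic" 1x3 dot-product (illustrative only; dot3_pot, k, and neg are made-up names, and the real kernel lives in the linked repo). Two coefficients stay as constant multipliers, while the third is constrained to +/-2^k and degenerates into a shift plus an add/subtract:

    #include <stdint.h>

    /* Illustrative 1x3 dot-product: c0 and c1 remain full constant
     * multipliers (DSP slices on an FPGA), while the third coefficient
     * is constrained to +/- 2^k, so its multiply reduces to a shift
     * plus an add/subtract that synthesizes into LUT fabric. */
    static inline int32_t dot3_pot(uint8_t x0, uint8_t x1, uint8_t x2,
                                   int16_t c0, int16_t c1,
                                   unsigned k, int neg)
    {
        int32_t acc = (int32_t)x0 * c0    /* DSP multiply */
                    + (int32_t)x1 * c1;   /* DSP multiply */
        int32_t pot = (int32_t)x2 << k;   /* 2^k * x2 as a left shift */
        return neg ? acc - pot : acc + pot;
    }

In hardware the shift itself is free (it is just wiring), so the only fabric cost for the PoT term is the final adder.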
The results:
Reduces DSP usage by 33% (2 multipliers instead of 3 per atomic dot-product).
Achieves >99% SSIM on correlated images.
The error manifests as a global DC offset, which batch-norm layers in CNNs can typically absorb (a sketch of that argument follows below).
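To illustrate the DC-offset point, here is a hypothetical C helper (not from the repo) that estimates the offset as the mean difference between exact and approximated outputs; if that mean is near-constant across pixels, any downstream per-channel additive term (a batch-norm beta or a conv bias) can subtract it out:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: estimate the global DC offset as the mean
     * difference between exact and PoT-approximated conv outputs. */
    static double dc_offset(const int32_t *exact, const int32_t *approx,
                            size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i)
            sum += (double)approx[i] - (double)exact[i];
        return sum / (double)n;
    }

    /* Folding -round(dc_offset(...)) into a bias term removes the
     * systematic part of the error, leaving only small residuals. */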
I wrote a blog post detailing the math and the hardware implementation. The full C benchmark and PoC code are on GitHub (linked in the post).
I'd love to hear from the FPGA folks here: Is this trade-off (accuracy vs. resources) something you'd use in production payloads?
el_dockerr•1h ago
Other sources:
[blog] https://www.dockerr.blog/blog/lowrank-hardware-approximation
[git] https://github.com/el-dockerr/Low-Rank_Hardware_Approximatio...
[LinkedIn] https://www.linkedin.com/in/swen-kalski-062b64299/