The design originally targeted Agilex 3. We later rebuilt it for a Certus-NX board to see how portable the RTL actually was and how resource usage and timing changed.
## Model
Input: 128×128 grayscale images streamed over UART from a host PC webcam.
Architecture:

- 3×3 conv (8 filters) + pooling
- 3×3 conv (12 filters) + pooling
- 3×3 conv (16 filters) + pooling
- 3×3 conv (24 filters) + pooling
- 3×3 conv (32 filters) + pooling
- fully connected 512 → 10
All arithmetic is INT8. The design is single-clock and streaming; feature maps are buffered in block RAM between layers.
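The 512-input fully connected layer is consistent with the layer list if each conv preserves the spatial size and each pooling stage halves it. Here is a minimal sketch of that arithmetic, assuming "same" padding and 2×2 pooling with stride 2. Neither is stated above, and `feature_map_shapes` is a hypothetical helper, not part of the design:

```python
# Assumptions (not stated in the post): every 3x3 conv uses "same" padding,
# and every pooling stage is 2x2 with stride 2.
FILTERS = [8, 12, 16, 24, 32]


def feature_map_shapes(size=128, filters=FILTERS):
    """Return (height, width, channels) after each conv + pooling stage."""
    shapes = []
    for out_channels in filters:
        size //= 2  # conv keeps the size, pooling halves it
        shapes.append((size, size, out_channels))
    return shapes


shapes = feature_map_shapes()
for stage, (h, w, c) in enumerate(shapes, start=1):
    # One byte per INT8 activation; this is the size of a full feature map,
    # which is not necessarily what the design buffers in block RAM.
    print(f"stage {stage}: {h}x{w}x{c} = {h * w * c} bytes")

h, w, c = shapes[-1]
flattened = h * w * c
print("flattened features:", flattened)
```

Under these assumptions the last stage is 4×4×32, which flattens to the 512 inputs of the fully connected layer.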
---
## Platform 1: Intel Agilex 3
Device: A3CY135BM16AE6S

Board: Agilex 3 C-Series Development Kit

Toolchain: Quartus Prime Pro 25.3

Resource usage:

- ALMs: 2,526 / 45,800 (6%)
- RAM blocks: 36 / 353 (10%)
- DSP blocks: 17 / 184 (9%)
Fmax: 146 MHz
---
## Platform 2: Lattice Certus-NX
Device: LFD2NX-40-7BG196I

Board: Cruvi CR00103-03

Toolchain: Radiant 2025.2
Resource usage:

- LUT4: 13,757 / 32,256 (42.6%)
- MULT9: 66 / 112 (59%)
- MULT18: 12 / 56 (21%)
- Block RAM: 20 / 84 (24%)
Fmax (single clock domain): 273 MHz

The Certus design uses an on-chip PLL (12 → 48 MHz) to adapt the board's input clock. The Agilex board required an external UART adapter; the Certus board has one integrated.
---
## Porting effort
The RTL is written in plain VHDL without vendor IP. No vendor-specific primitives are instantiated in the CNN datapath.
In practice:
- No DSP or RAM wrapper layer was required.
- No changes to arithmetic or pipeline structure were necessary.
- No timing constraint rework was needed beyond board-specific clock definitions.
- Only board-level adaptations (clocking, UART wiring) were needed.
The vendor itself was largely irrelevant for this design. The differences were at the board and toolchain level.
---
## Observations
- On Agilex 3, the design is small relative to the device (at most 10% of any resource).
- On Certus-NX-40, the same design consumes a significant fraction of the device (42.6% of LUT4s, 59% of MULT9 blocks).
- Achieved Fmax on Certus-NX is higher in this configuration (273 MHz vs 146 MHz), though the system clocking and board setup differ.

The DSP usage profile differs noticeably: Certus-NX’s MULT9 blocks are heavily used (59%), so scaling up the number of parallel MAC units would hit the multiplier limit much sooner than on Agilex 3, where only 9% of the DSP blocks are in use.
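To put a rough number on that headroom, the sketch below takes the used/available counts from the two resource reports and asks how many whole copies of the design would fit if every resource scaled linearly. That linearity is a naive assumption, since a wider datapath would not simply duplicate control logic and buffers. `max_copies` and `bottleneck` are hypothetical helper names:

```python
# (used, available) pairs copied from the resource reports in this post.
USAGE = {
    "Agilex 3": {
        "ALM": (2526, 45800),
        "RAM blocks": (36, 353),
        "DSP blocks": (17, 184),
    },
    "Certus-NX": {
        "LUT4": (13757, 32256),
        "MULT9": (66, 112),
        "MULT18": (12, 56),
        "Block RAM": (20, 84),
    },
}


def max_copies(resources):
    """Whole copies that fit, assuming every resource scales linearly."""
    return min(available // used for used, available in resources.values())


def bottleneck(resources):
    """Name of the resource with the highest utilization ratio."""
    return max(resources, key=lambda name: resources[name][0] / resources[name][1])


for device, resources in USAGE.items():
    print(f"{device}: tightest resource is {bottleneck(resources)}, "
          f"room for {max_copies(resources)} copies")
```

With these figures the tightest resource is block RAM on Agilex 3 and MULT9 on Certus-NX, leaving room for about 9 copies versus 1 under the linear-scaling assumption.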
For this size of INT8 CNN, portability at the RTL level was straightforward. The limiting factor when moving to the smaller device was resource headroom rather than functional incompatibility.
---
## Question
For those who have moved similar streaming CNN datapaths across vendors: Have you found cases where DSP inference or block RAM inference diverged enough to require structural RTL changes, or does that mostly appear once designs become more deeply pipelined or multi-clock?