I'll try to calculate it from the information given: 12 parallel instances at a clock speed of 62.5 MHz, with 68 clock cycles per hash.
62.5MHz * 12 / 68 = ~11MH/s
That seems... slow? Did I do the math right? How big of an FPGA do you need before this would compete with a GPU, and how much would it cost?
For reference, an RTX 4090 can do 21975.5 MH/s according to hashcat benchmarks.
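To sanity-check the numbers in this exchange, here is a rough sketch of the arithmetic. It assumes throughput scales linearly with the number of parallel instances and that the hashcat figure measures comparable work. The function and variable names are made up for illustration.

```python
import math


def hash_rate_mhs(clock_mhz: float, instances: int, cycles_per_hash: int) -> float:
    """Aggregate hash rate in MH/s, assuming each instance finishes
    one hash every `cycles_per_hash` clock cycles."""
    return clock_mhz * instances / cycles_per_hash


# Figures quoted in the thread.
fpga_mhs = hash_rate_mhs(clock_mhz=62.5, instances=12, cycles_per_hash=68)
gpu_mhs = 21975.5  # RTX 4090, per the hashcat benchmark cited above

ratio = gpu_mhs / fpga_mhs
boards_needed = math.ceil(ratio)  # naive count of identical FPGA boards to match the GPU

print(f"FPGA: {fpga_mhs:.2f} MH/s")
print(f"GPU is about {ratio:.0f}x faster")
print(f"Boards needed to match: {boards_needed}")
```

So the ~11 MH/s estimate holds up, and on these numbers the GPU is roughly three orders of magnitude ahead of this particular board.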
Keep in mind the RTX 4090 is on a 5 nm process node and has a lot more transistors and memory than the XC7A100T, which is 28 nm. That's a huge difference in terms of dynamic performance. The two were also released about 10 years apart. If you compare the RTX 4090 against a similarly modern UltraScale part from Xilinx, I believe the FPGA can be notably faster than the RTX 4090.
Edit - I misread your comment. ASIC designers will use FPGAs to test their designs, but those designs won't be optimized for FPGAs, which have different logic and memory characteristics than ASICs. There aren't many great SHA256 FPGA implementations, largely because there's not that much demand for one.
No matmul coin where the hardware could be repurposed for AI stuff?
(ASIC simulation on an FPGA will retain the combinatorial stages but run at dramatically lower fMax)
UltraScale+ chips will run a proper design at 600-800 MHz, and big chips might be able to fit 24 cores. The Artix chip OP used is extremely slow and too small to fit this style of implementation.
FPGA cryptographic acceleration is about batch task bandwidth; OpenSSL has few places where this is required.
The other part is bulk encryption. CPUs have lots of acceleration for that, but cleartext is still faster. So the win is not to ship data to an accelerator, then back to the CPU, and then out to the NIC, but to ship it to the accelerator and from there straight to the NIC without touching the CPU again. Often the accelerator is integrated with the NIC.
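As a back-of-the-envelope sketch of what those UltraScale+ numbers could mean: the thread does not say how many clock cycles per hash such a design needs, so the code below takes that as a parameter and shows two assumed cases. One is a fully pipelined core producing one hash per clock (an assumption, not something stated here). The other reuses the 68 cycles per hash from the Artix calculation. `estimate_mhs` is a hypothetical helper name.

```python
def estimate_mhs(clock_mhz: float, cores: int, cycles_per_hash: int) -> float:
    """Aggregate hash rate in MH/s for `cores` identical cores."""
    return clock_mhz * cores / cycles_per_hash


CORES = 24  # "big chips might be able to fit 24 cores"

for clock_mhz in (600, 800):
    # ASSUMPTION: fully pipelined core, one hash per clock cycle.
    pipelined = estimate_mhs(clock_mhz, CORES, cycles_per_hash=1)
    # ASSUMPTION: same 68 cycles per hash as the Artix design.
    iterative = estimate_mhs(clock_mhz, CORES, cycles_per_hash=68)
    print(f"{clock_mhz} MHz: {pipelined:.0f} MH/s pipelined, "
          f"{iterative:.1f} MH/s at 68 cycles/hash")
```

The spread between the two cases shows that cycles per hash matters far more to the estimate than the clock range does.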
It works even better if the data never has to touch the CPU.
For an alternative design/writeup, check out http://nsa.unaligned.org