The goal was to explore the performance limits of:
Jacobian mixed-add
Batch inversion using Montgomery’s trick
Large-scale scalar stepping
GPU memory coalescing strategies
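
For readers less familiar with the first three items, here is a minimal CPU-side sketch in Python of how they fit together: a Jacobian + affine mixed-add, a stepping loop that repeatedly adds a fixed affine point, and Montgomery's trick to convert the whole batch back to affine with a single modular inversion. The post does not name a curve, so secp256k1 is assumed purely for illustration, and every function name here is hypothetical. It is not the project's kernel code.

```python
# Assumption: secp256k1 (y^2 = x^3 + 7 over F_P) as an example curve.
# Requires Python 3.8+ for pow(x, -1, P).
P = 2**256 - 2**32 - 977
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
G = (GX, GY)

def mixed_add(X1, Y1, Z1, x2, y2):
    """Jacobian (X1, Y1, Z1) + affine (x2, y2), no inversion needed.
    Assumes both points are finite and neither equal nor inverse."""
    Z1Z1 = Z1 * Z1 % P
    U2 = x2 * Z1Z1 % P
    S2 = y2 * Z1 * Z1Z1 % P
    H = (U2 - X1) % P
    R = (S2 - Y1) % P
    H2 = H * H % P
    H3 = H * H2 % P
    V = X1 * H2 % P
    X3 = (R * R - H3 - 2 * V) % P
    Y3 = (R * (V - X3) - Y1 * H3) % P
    Z3 = Z1 * H % P
    return X3, Y3, Z3

def batch_inverse(values):
    """Montgomery's trick: invert n nonzero values with one inversion
    plus roughly 3n multiplications."""
    prefix, acc = [], 1
    for v in values:
        prefix.append(acc)          # product of all earlier values
        acc = acc * v % P
    inv = pow(acc, -1, P)           # the single modular inversion
    out = [0] * len(values)
    for i in reversed(range(len(values))):
        out[i] = inv * prefix[i] % P
        inv = inv * values[i] % P   # peel off values[i]
    return out

def batch_to_affine(points):
    """Convert Jacobian points to affine using one shared inversion."""
    z_invs = batch_inverse([Z for _, _, Z in points])
    result = []
    for (X, Y, _), zi in zip(points, z_invs):
        zi2 = zi * zi % P
        result.append((X * zi2 % P, Y * zi2 * zi % P))
    return result

def step_points(start, count):
    """start, start+G, start+2G, ... kept in Jacobian coordinates."""
    pts = [(start[0], start[1], 1)]
    for _ in range(count - 1):
        pts.append(mixed_add(*pts[-1], GX, GY))
    return pts

def affine_add(p1, p2):
    """Slow reference: one inversion per addition (curve coefficient a = 0)."""
    (x1, y1), (x2, y2) = p1, p2
    if p1 == p2:
        lam = 3 * x1 * x1 * pow(2 * y1, -1, P) % P
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P

# Usage: step from 2G and normalise the whole batch at once.
start = affine_add(G, G)
affine_points = batch_to_affine(step_points(start, 4))
```

The stepping loop never inverts, and the per-point inversion cost is paid once per batch, which is why the two techniques are usually paired.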
On an RTX 5060 I’m getting ~2.5B mixed-add operations/sec.
Key design decisions:
Little-endian limb layout for hardware efficiency
Big-endian only for visualization
Deterministic memory layout
No dynamic allocation in hot paths
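
To make the layout decisions concrete, here is a small Python sketch of one way they could look. The 8 x 32-bit limb width and the limb-major (structure-of-arrays) indexing are assumptions for illustration, since the post does not specify either, and all names are hypothetical.

```python
LIMBS = 8            # assumed: 8 x 32-bit limbs per 256-bit field element
MASK = 0xFFFFFFFF

def to_limbs(x):
    """Little-endian: limb 0 holds the least significant 32 bits."""
    return [(x >> (32 * i)) & MASK for i in range(LIMBS)]

def from_limbs(limbs):
    return sum(l << (32 * i) for i, l in enumerate(limbs))

def to_display_hex(limbs):
    """Big-endian hex string, used only when printing a value."""
    return "".join(f"{l:08x}" for l in reversed(limbs))

def add_limbs(a, b):
    """Carry-propagating add. The carry moves from index 0 upward,
    which is what makes the little-endian order convenient."""
    out, carry = [], 0
    for i in range(LIMBS):
        s = a[i] + b[i] + carry
        out.append(s & MASK)
        carry = s >> 32
    return out, carry

def soa_index(limb, element, n):
    """Limb-major offset: limb `limb` of consecutive elements is adjacent,
    so consecutive GPU threads would touch consecutive 32-bit words."""
    return limb * n + element

def pack_soa(values):
    """Pack values into one flat buffer whose size is fixed up front."""
    n = len(values)
    buf = [0] * (LIMBS * n)
    for e, v in enumerate(values):
        for i, l in enumerate(to_limbs(v)):
            buf[soa_index(i, e, n)] = l
    return buf

def unpack_soa(buf, element, n):
    return from_limbs([buf[soa_index(i, element, n)] for i in range(LIMBS)])
```

Because the offset of every limb is a pure function of its indices and the batch size, the layout is deterministic and the buffer can be allocated once outside the hot loop.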
Would love feedback from people working on ECC or GPU math.
shrecshrec•1h ago