I’ve been working on an optimized prime sieve implementation called the Turkish Sieve Engine (TSE). I recently got my hands on an RTX 5090 and managed to break the 1 Tera-item/s barrier, reaching 1.136 T-items/s at the 10^12 range.
The project uses a methodology I call "N/6 Bit-Indexing." By mapping only numbers of the form 6k±1 (every prime above 3 has this form) and using a bit-compressed representation, I’ve significantly reduced the memory footprint. This allows processing ranges up to 10^14 (100 trillion) with about 17 GB of VRAM, which fits within the 32 GB of the RTX 5090.
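To make the 6k±1 idea concrete, here is a minimal CPU-side sketch of one common way to index those candidates with one bit each. It is an illustration under my own assumptions, not the TSE code: the helper names (index_to_value, value_to_index, count_primes_wheel6) are hypothetical, and the project's actual N/6 layout may differ.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helpers: map bit index <-> value for numbers coprime to 6,
// starting at 5 (index 0 -> 5, 1 -> 7, 2 -> 11, 3 -> 13, ...).
inline std::uint64_t index_to_value(std::uint64_t i) {
    return 6 * (i >> 1) + 5 + 2 * (i & 1);
}
inline std::uint64_t value_to_index(std::uint64_t n) {
    return n / 3 - 1;  // valid only for n >= 5 with n % 6 == 1 or n % 6 == 5
}

// Counts primes <= limit while storing one bit per 6k±1 candidate.
std::uint64_t count_primes_wheel6(std::uint64_t limit) {
    if (limit < 2) return 0;
    if (limit < 3) return 1;
    if (limit < 5) return 2;
    // Number of candidates 6k-1 and 6k+1 (k >= 1) that are <= limit.
    const std::uint64_t n_cand = (limit + 1) / 6 + (limit - 1) / 6;
    std::vector<std::uint64_t> composite((n_cand + 63) / 64, 0);  // bit set = composite

    for (std::uint64_t i = 0; i < n_cand; ++i) {
        const std::uint64_t p = index_to_value(i);
        if (p * p > limit) break;
        if (composite[i >> 6] & (1ULL << (i & 63))) continue;
        // Odd multiples of p from p*p; multiples of 3 are not stored, so skip them.
        for (std::uint64_t m = p * p; m <= limit; m += 2 * p) {
            if (m % 3 == 0) continue;
            const std::uint64_t j = value_to_index(m);
            composite[j >> 6] |= 1ULL << (j & 63);
        }
    }

    std::uint64_t count = 2;  // the primes 2 and 3 are handled outside the bitmap
    for (std::uint64_t i = 0; i < n_cand; ++i) {
        if (!(composite[i >> 6] & (1ULL << (i & 63)))) ++count;
    }
    return count;
}
```

In this sketch the bitmap holds roughly one bit per three integers, since multiples of 2 and 3 are never stored. The inner loop is kept simple for readability; a tuned version would step directly between 6k±1 multiples instead of testing m % 3.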
Key results from my latest benchmarks:
Range 10^12 (1 Trillion): 0.880 seconds (1.136 T-items/s)
Range 10^14 (100 Trillion): 359 seconds (about 6 minutes) on an RTX 5090
Twin vs Cousin Primes: The counts at 10^14 are consistent with the Hardy-Littlewood conjecture, with a relative deviation of only 0.0003%.
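For context on the last result: the Hardy-Littlewood conjecture predicts that twin primes (p, p+2) and cousin primes (p, p+4) have the same asymptotic density, so their counts should converge as the range grows. The sketch below counts both kinds of pairs on a small range with a plain sieve. It only illustrates what is being compared; the function name count_twin_cousin is hypothetical, and this is not how TSE does it at 10^14.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Returns {twin_pairs, cousin_pairs}, counting only pairs whose
// larger member is also <= limit.
std::pair<std::uint64_t, std::uint64_t> count_twin_cousin(std::uint64_t limit) {
    std::vector<char> is_prime(limit + 1, 1);
    is_prime[0] = 0;
    if (limit >= 1) is_prime[1] = 0;
    // Plain sieve of Eratosthenes, one byte per number for clarity.
    for (std::uint64_t p = 2; p * p <= limit; ++p) {
        if (!is_prime[p]) continue;
        for (std::uint64_t m = p * p; m <= limit; m += p) is_prime[m] = 0;
    }
    std::uint64_t twins = 0, cousins = 0;
    for (std::uint64_t p = 2; p + 2 <= limit; ++p) {
        if (!is_prime[p]) continue;
        if (is_prime[p + 2]) ++twins;                       // (p, p+2)
        if (p + 4 <= limit && is_prime[p + 4]) ++cousins;   // (p, p+4)
    }
    return {twins, cousins};
}
```

On small ranges the two counts can differ noticeably; the conjecture only concerns how their ratio behaves as the range grows.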
The engine uses a hybrid CUDA/OpenMP approach. For the CPU side, I optimized it for the L3 cache of the Ryzen 9 9950X3D (using 192.5 KB segments), hitting 66 G-items/s.
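For readers unfamiliar with cache-sized segments, here is a minimal sketch of a segmented sieve with an OpenMP loop over segments. It is written under my own assumptions and is not the TSE implementation: it uses one byte per number instead of bit-packing, the function name count_primes_segmented is hypothetical, and the default of 192 * 1024 entries is only an illustrative stand-in for the segment size mentioned above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Counts primes <= limit by sieving fixed-size segments independently.
std::uint64_t count_primes_segmented(std::uint64_t limit,
                                     std::uint64_t segment_size = 192 * 1024) {
    if (limit < 2) return 0;

    // Integer square root of limit, corrected for floating-point rounding.
    std::uint64_t root = static_cast<std::uint64_t>(std::sqrt(static_cast<double>(limit)));
    while (root * root > limit) --root;
    while ((root + 1) * (root + 1) <= limit) ++root;

    // Base primes up to sqrt(limit), shared read-only by all segments.
    std::vector<char> small(root + 1, 1);
    std::vector<std::uint64_t> base;
    for (std::uint64_t p = 2; p <= root; ++p) {
        if (!small[p]) continue;
        base.push_back(p);
        for (std::uint64_t m = p * p; m <= root; m += p) small[m] = 0;
    }

    // Segments cover [2, limit]; each one fits a buffer of segment_size bytes.
    const std::int64_t num_segments =
        static_cast<std::int64_t>((limit - 2) / segment_size + 1);
    std::uint64_t count = 0;

    // The pragma is ignored if the file is compiled without OpenMP support.
    #pragma omp parallel for reduction(+:count) schedule(dynamic)
    for (std::int64_t s = 0; s < num_segments; ++s) {
        const std::uint64_t lo = 2 + static_cast<std::uint64_t>(s) * segment_size;
        const std::uint64_t hi = std::min(lo + segment_size - 1, limit);
        std::vector<char> seg(hi - lo + 1, 1);
        for (std::uint64_t p : base) {
            if (p * p > hi) break;
            // First multiple of p inside the segment, but never below p*p.
            const std::uint64_t start = std::max(p * p, (lo + p - 1) / p * p);
            for (std::uint64_t m = start; m <= hi; m += p) seg[m - lo] = 0;
        }
        for (char c : seg) count += c;
    }
    return count;
}
```

The point of the design is that each thread only touches its own small buffer plus the read-only base primes, so the working set stays in cache regardless of how large the overall range is.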
I’ve archived the results and the methodology on Zenodo (DOI: 10.5281/zenodo.18038661) for those interested in the academic side.
The code is open source and I’d love to hear your thoughts on further CUDA kernel optimizations or memory management.
bilgisoft•2d ago
Would appreciate feedback from the community!