I’ve been working on an optimized prime sieve implementation called the Turkish Sieve Engine (TSE). I recently got my hands on an RTX 5090 and managed to break the 1 Tera-item/s barrier, reaching 1.136 T-items/s when sieving the 10^12 range.
The project uses a methodology I call “N/6 Bit-Indexing.” Every prime greater than 3 is congruent to 1 or 5 (mod 6), so by storing one bit per number of the form 6k±1 I skip the other two thirds of the integers outright and significantly reduce the memory footprint. This allows processing ranges up to 10^14 (100 trillion) with about 17GB of VRAM usage, which is well within the limits of modern consumer GPUs. A minimal sketch of the index mapping follows.
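To make the bit layout concrete, here is a minimal sketch of one possible 6k±1 index mapping. The function names and the even/odd bit convention are my own illustration, not necessarily the exact scheme TSE uses:

```cuda
#include <cstdint>
#include <cstdio>

// One bit per 6k±1 candidate: two candidates per block of six integers,
// i.e. ~N/3 bits (about N/24 bytes) for a segment of N consecutive numbers.

// Map a candidate n (n >= 5, n congruent to 1 or 5 mod 6) to its bit position.
__host__ __device__ inline uint64_t number_to_bit_index(uint64_t n) {
    return (n % 6 == 5) ? 2 * (n / 6)       // 6k-1 -> even bit
                        : 2 * (n / 6) - 1;  // 6k+1 -> odd bit
}

// Inverse: recover the candidate a bit position represents.
__host__ __device__ inline uint64_t bit_index_to_number(uint64_t i) {
    return 3 * i + 5 - (i & 1);
}

int main() {
    // Round trip for the first few candidates: 5, 7, 11, 13, 17, 19, 23, 25.
    for (uint64_t i = 0; i < 8; ++i) {
        uint64_t n = bit_index_to_number(i);
        printf("bit %llu <-> %llu <-> bit %llu\n",
               (unsigned long long)i, (unsigned long long)n,
               (unsigned long long)number_to_bit_index(n));
    }
    return 0;
}
```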
Key results from my latest benchmarks:
Range 10^12 (1 Trillion): 0.880 seconds (1.13 T-items/s)
Range 10^14 (100 Trillion): 359 seconds (6 minutes) on an RTX 5090
Twin vs Cousin Primes: at 10^14 the twin and cousin prime counts agree with the Hardy-Littlewood prediction to within 0.0003% (see the formula below).
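For context on that comparison: the first Hardy-Littlewood conjecture predicts the same asymptotic count for gap-2 (twin) and gap-4 (cousin) pairs below x, so the two counts staying in lockstep is exactly what the conjecture expects:

```latex
\pi_2(x) \sim \pi_4(x) \sim 2 C_2 \int_2^x \frac{dt}{(\ln t)^2},
\qquad
C_2 = \prod_{p > 2} \left(1 - \frac{1}{(p-1)^2}\right) \approx 0.6601618
```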
The engine uses a hybrid CUDA/OpenMP approach. For the CPU side, I tuned the segment size (192.5 KB) to the L3 cache of the Ryzen 9 9950X3D, hitting 66 G-items/s.
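To make the discussion of kernel optimizations concrete, here is a rough sketch of what a GPU cross-off pass over a bit-packed 6k±1 segment could look like. The kernel name, parameters, toy segment, and one-thread-per-prime launch are assumptions for illustration, not TSE's actual code; it reuses the same index mapping as the sketch above:

```cuda
#include <cstdint>
#include <cstdio>

// Bit position of candidate n (n >= 5, n congruent to 1 or 5 mod 6),
// same mapping as the earlier sketch.
__host__ __device__ inline uint64_t bit_of(uint64_t n) {
    return (n % 6 == 5) ? 2 * (n / 6) : 2 * (n / 6) - 1;
}

// One thread per sieving prime: strike its multiples inside [seg_low, seg_high).
// seg_bits is the segment's packed 6k±1 bitmap (1 = composite) and seg_bit_base
// is the global bit index of the segment's first candidate. All names and the
// one-thread-per-prime launch shape are illustrative, not TSE's actual kernel.
__global__ void cross_off_segment(unsigned int* seg_bits, uint64_t seg_low,
                                  uint64_t seg_high, uint64_t seg_bit_base,
                                  const uint64_t* primes, int num_primes) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_primes) return;

    uint64_t p = primes[t];
    uint64_t m = ((seg_low + p - 1) / p) * p;   // first multiple of p >= seg_low
    if (m < p * p) m = p * p;                   // smaller multiples were struck in earlier segments

    for (; m < seg_high; m += p) {
        uint64_t r = m % 6;
        if (r != 1 && r != 5) continue;         // only 6k±1 residues are stored
        uint64_t b = bit_of(m) - seg_bit_base;
        atomicOr(&seg_bits[b >> 5], 1u << (b & 31));
    }
}

int main() {
    // Toy segment [120, 180): 20 candidate bits, sieving primes up to sqrt(180).
    const uint64_t seg_low = 120, seg_high = 180;         // both multiples of 6
    const uint64_t seg_bit_base = 2 * (seg_low / 6) - 1;  // bit of first candidate, seg_low + 1
    const uint64_t num_bits = (seg_high - seg_low) / 3;   // two candidates per block of six
    const uint64_t h_primes[] = {5, 7, 11, 13};           // 2 and 3 are handled by the wheel
    const int num_primes = 4;

    unsigned int* d_bits;
    uint64_t* d_primes;
    cudaMalloc(&d_bits, sizeof(unsigned int));            // 20 bits fit in one word
    cudaMemset(d_bits, 0, sizeof(unsigned int));
    cudaMalloc(&d_primes, sizeof(h_primes));
    cudaMemcpy(d_primes, h_primes, sizeof(h_primes), cudaMemcpyHostToDevice);

    cross_off_segment<<<1, 32>>>(d_bits, seg_low, seg_high, seg_bit_base,
                                 d_primes, num_primes);

    unsigned int bits = 0;
    cudaMemcpy(&bits, d_bits, sizeof(bits), cudaMemcpyDeviceToHost);

    int primes_found = 0;
    for (uint64_t b = 0; b < num_bits; ++b)
        if (!((bits >> b) & 1u)) ++primes_found;          // surviving bits are primes
    printf("primes in (120, 180): %d (expected 11)\n", primes_found);

    cudaFree(d_bits);
    cudaFree(d_primes);
    return 0;
}
```

One thread per sieving prime is the simplest launch shape but load-balances poorly, since small primes have far more multiples per segment than large ones; bucketing primes or giving whole warps/blocks to the dense small primes is the kind of kernel-level optimization I'm looking for feedback on.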
I’ve archived the results and the methodology on Zenodo (DOI: 10.5281/zenodo.18038661) for those interested in the academic side.
The code is open source and I’d love to hear your thoughts on further CUDA kernel optimizations or memory management.