The Discovery: My deterministic engine matches primesieve perfectly, but Thomas Nicely's historical data (hosted at the University of Lynchburg) shows persistent +1 discrepancies in several cumulative twin counts:
0 to 30: Shows 5 twins (Actual: 4)
0 to 600: Shows 27 twins (Actual: 26)
0 to 30M: Shows 152,892 twins (Actual: 152,891)
It appears to be a systematic off-by-one error or a segment-boundary issue in the legacy code used decades ago.
Performance & Methodology: TSE achieves record-breaking speeds by using an N/6-bit data structure, making it 6x more memory-efficient than a classical one-bit-per-number sieve.
Peak Throughput: 1.136 trillion candidates/sec (on an RTX 5090).
Efficiency: Scanned the full 10^14 range (twin and cousin primes) in ~6 minutes.
Hardware-Friendly: No modular arithmetic in the hot loop; just simple integer additions (n <- n+p) that map cleanly onto CUDA warps.
Technical Deep Dive: The N/6 indexing paradigm exploits the fact that every twin pair above (3,5) has the form (6k-1, 6k+1) and every cousin pair above (3,7) has the form (6k+1, 6k+5), so redundant candidates are eliminated before they even hit VRAM. This let me process 100 trillion numbers using only 1.1 GB of VRAM.
GitHub: https://github.com/bilgisofttr/TurkishSieve
Zenodo (Methodology): https://zenodo.org/records/18038661
I'd love to hear your thoughts on the CUDA kernel optimization and the historical discrepancy I found!