Hey HN, I've been working on a CUDA pipeline that systematically searches for Bitcoin private keys generated by flawed software from Bitcoin's early years. The linked article is a technical deep dive into the GPU engineering — no source code (project is closed-source), but
detailed pseudocode and architecture diagrams.
The problem: Between 2009 and 2012, many Bitcoin wallet tools used predictable entropy — timestamp-seeded LCGs, brain wallets with simple passwords, the Debian OpenSSL bug (CVE-2008-0166) that reduced entropy to 15 bits, the Randstorm BitcoinJS flaw, PHP's mt_rand() seeded with
time*pid, and others. The private keys generated by these algorithms have tiny search spaces (some as small as 65,536 candidates), and the addresses are still on the blockchain with funded balances.
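To make "tiny search space" concrete, here is a minimal sketch of the failure pattern, assuming a wallet that seeds a glibc-style LCG with a Unix timestamp (illustrative, not code from the pipeline):

    // Hypothetical illustration of the failure pattern (not the pipeline's
    // code): a wallet seeds a glibc-style LCG with a Unix timestamp and
    // draws its "random" private key from it. Recovering the key is then a
    // search over plausible timestamps (about 2^25 seeds per year), not 2^256.
    #include <cstdint>

    static uint32_t lcg_state;

    static uint32_t lcg_next(void) {
        // glibc rand() TYPE_0 constants
        lcg_state = lcg_state * 1103515245u + 12345u;
        return (lcg_state >> 16) & 0x7fffu;    // only 15 usable bits per call
    }

    static void derive_privkey(uint32_t timestamp, uint8_t key[32]) {
        lcg_state = timestamp;                 // the entire "entropy"
        for (int i = 0; i < 32; ++i)
            key[i] = (uint8_t)lcg_next();      // 32 seed bits smeared over 256
    }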
The engineering: Everything lives in a single ~3,400-line CUDA C++ file. The per-thread pipeline is:
private key → secp256k1 EC multiply → SHA-256 → RIPEMD-160 → bloom filter → binary search → match
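In kernel form that pipeline looks roughly like the skeleton below. Names and types are hypothetical, and the helpers are stand-ins for functions defined elsewhere in the file; the real kernel has one specialization per vulnerability mode:

    // Simplified per-thread skeleton (hypothetical names and types).
    struct uint256 { uint32_t limb[8]; };
    struct ECPoint { uint256 x, y, z; };           // Jacobian coordinates
    struct Match   { uint64_t index; };

    // Stand-ins for functions defined elsewhere in the file.
    __device__ uint256 candidate_from_index(uint64_t idx);
    __device__ ECPoint scalar_mul_table(const uint256& k);
    __device__ void    hash160_pubkey(const ECPoint& p, uint32_t h160[5]);
    __device__ bool    bloom_maybe_contains(const uint32_t h160[5]);
    __device__ int     binary_search_hash160(const uint32_t* t, int n,
                                             const uint32_t h160[5]);

    __global__ void __launch_bounds__(256, 2)
    search_kernel(uint64_t batch_offset,
                  const uint32_t* __restrict__ targets, int num_targets,
                  Match* matches, int* match_count)
    {
        uint64_t idx  = batch_offset + blockIdx.x * (uint64_t)blockDim.x
                                     + threadIdx.x;
        uint256  priv = candidate_from_index(idx);  // replay the weak PRNG
        ECPoint  pub  = scalar_mul_table(priv);     // 16 lookups + 15 adds

        uint32_t h160[5];
        hash160_pubkey(pub, h160);                  // SHA-256 then RIPEMD-160

        if (bloom_maybe_contains(h160) &&           // rejects ~all candidates
            binary_search_hash160(targets, num_targets, h160) >= 0) {
            matches[atomicAdd(match_count, 1)].index = idx;  // host re-derives key
        }
    }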
Some decisions that might interest the GPU computing crowd (simplified sketches of these follow, after the list):
- EC multiplication via precomputed lookup tables (67 MB): The 256-bit private key is split into 16 × 16-bit chunks, each indexing into a precomputed table of generator point multiples. This reduces scalar multiplication from ~256 double-and-add iterations to 16 lookups + 15
point additions.
- Specialized SHA-256 kernels per pubkey format: Instead of a generic SHA-256, we have two dedicated functions — one for 33-byte compressed pubkeys (single block, partially precomputable message schedule) and one for 65-byte uncompressed pubkeys (two blocks, where the second block is almost entirely constant padding: only the final byte of Y spills into it). Both build the message schedule directly from EC point coordinates without a serialization buffer.
- 32 KB bloom filter in __constant__ memory: With ~3,100 target addresses and 100M candidates per batch, a naive binary search would thrash global memory. A bloom filter with 7 hash functions (FNV-1a + rotate-XOR double hashing) in constant memory rejects 99.9999% of candidates
with broadcast cache reads. Only ~100 false positives per batch reach the binary search confirmation stage in global memory.
- 256-bit modular arithmetic in PTX inline assembly: secp256k1 operates over a 256-bit prime field, implemented as eight 32-bit limbs with carry propagation. The sparse form of the secp256k1 prime (2^256 - 2^32 - 977) allows fast reduction.
- __launch_bounds__(256, 2) + --maxrregcount=128: Targeting 2 blocks per SM with 128 registers per thread (256 threads × 2 blocks × 128 registers = 65,536, exactly filling a 64K-register SM). Higher occupancy isn't better here: the kernel is compute-bound with heavy ALU usage, and 2 blocks provide enough ILP to hide latency without forcing register spills. The Mersenne Twister mode (2.5 KB of state per thread) is the exception and occasionally spills.
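To make those bullets concrete, a few simplified sketches (all names and helper functions are hypothetical stand-ins, not the production code). First, the table-lookup multiply: with table[i][j] = (j * 2^(16i)) * G precomputed on the host (16 × 65,536 affine points × 64 bytes = 64 MiB, the 67 MB above), the scalar multiplication collapses to a windowed sum:

    // Sketch: table[i][j] holds (j * 2^(16*i)) * G as an affine point in
    // global memory. Key limbs are little-endian (limb[0] least significant).
    __device__ ECPoint scalar_mul_table(const uint256& priv,
                                        const ECPointAffine* __restrict__ table)
    {
        ECPoint acc = ec_identity();
        for (int i = 0; i < 16; ++i) {
            // 16-bit chunk i of the key (two chunks per 32-bit limb)
            uint32_t chunk = (priv.limb[i / 2] >> ((i & 1) * 16)) & 0xffffu;
            if (chunk)                       // zero chunks contribute nothing
                acc = ec_add_mixed(acc, table[i * 65536 + chunk]);
        }
        return acc;                          // one Jacobian-to-affine at the end
    }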
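The compressed-pubkey SHA-256: a 33-byte message (a 0x02/0x03 prefix plus big-endian X) pads to exactly one 64-byte block, so the first 16 schedule words can be assembled straight from the x limbs, and W[9..14] are known zeros whose expansion terms fold into constants:

    // First 16 schedule words for SHA-256 of a 33-byte compressed pubkey,
    // built directly from the x coordinate (big-endian 32-bit limbs, x[0]
    // most significant). No serialization buffer needed.
    __device__ void sha256_schedule_pub33(uint32_t prefix,   // 0x02 or 0x03
                                          const uint32_t x[8], uint32_t W[16])
    {
        W[0] = (prefix << 24) | (x[0] >> 8);      // prefix + top 3 bytes of x
        #pragma unroll
        for (int i = 1; i < 8; ++i)
            W[i] = (x[i - 1] << 24) | (x[i] >> 8);
        W[8] = (x[7] << 24) | 0x00800000u;        // last x byte, then 0x80 pad
        #pragma unroll
        for (int i = 9; i < 15; ++i)
            W[i] = 0;                             // known zeros in the padding
        W[15] = 33 * 8;                           // length field: 264 bits
    }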
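The bloom filter probe, with the 7 hash functions derived from two base hashes via double hashing (Kirsch-Mitzenmacher); FNV-1a and rotate-XOR mixing as described above, constants illustrative:

    __constant__ uint32_t bloom_bits[8192];        // 32 KB = 262,144 bits

    __device__ bool bloom_maybe_contains(const uint32_t h160[5])
    {
        uint32_t h1 = 2166136261u;                 // FNV-1a over the 5 words
        uint32_t h2 = h160[0];
        for (int i = 0; i < 5; ++i) {
            h1 = (h1 ^ h160[i]) * 16777619u;
            if (i) h2 = ((h2 << 13) | (h2 >> 19)) ^ h160[i];  // rotate-XOR
        }
        h2 |= 1u;                                  // keep the probe stride odd

        for (int k = 0; k < 7; ++k) {              // 7 derived hash functions
            uint32_t bit = (h1 + k * h2) & 0x3ffffu;          // mod 2^18 bits
            if (!(bloom_bits[bit >> 5] & (1u << (bit & 31))))
                return false;                      // definitely not a target
        }
        return true;                               // candidate for binary search
    }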
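And the limb arithmetic: the carry chain is written out explicitly in PTX rather than left to the compiler. A plain 256-bit add looks like the sketch below; the real field addition would capture the final carry and feed it into the fast reduction by p = 2^256 - 2^32 - 977:

    // 256-bit add over eight 32-bit limbs via inline PTX. add.cc / addc.cc
    // propagate the hardware carry flag through the whole chain.
    __device__ __forceinline__ void add256(uint32_t r[8],
                                           const uint32_t a[8],
                                           const uint32_t b[8])
    {
        asm("add.cc.u32  %0, %8,  %16;\n\t"
            "addc.cc.u32 %1, %9,  %17;\n\t"
            "addc.cc.u32 %2, %10, %18;\n\t"
            "addc.cc.u32 %3, %11, %19;\n\t"
            "addc.cc.u32 %4, %12, %20;\n\t"
            "addc.cc.u32 %5, %13, %21;\n\t"
            "addc.cc.u32 %6, %14, %22;\n\t"
            "addc.u32    %7, %15, %23;\n\t"       // final carry dropped here
            : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3]),
              "=r"(r[4]), "=r"(r[5]), "=r"(r[6]), "=r"(r[7])
            : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
              "r"(a[4]), "r"(a[5]), "r"(a[6]), "r"(a[7]),
              "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3]),
              "r"(b[4]), "r"(b[5]), "r"(b[6]), "r"(b[7]));
    }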
There are 23 vulnerability modes total, covering LCG PRNGs (glibc, MSVC, Borland Delphi, Java), brain wallet patterns (SHA-256 of counters, timestamps, phone numbers, short ASCII strings), MT19937, V8 JavaScript PRNGs, and the Debian OpenSSL disaster. The smaller modes (Debian:
65K candidates) complete in milliseconds; the largest (SHA-256 of ASCII strings 1-8 chars: ~2T) takes days across multiple GPUs.
The distributed side uses a FastAPI backend that assigns work units, verifies results via deterministic checkpoint regeneration (server independently recomputes the key from mode+offset using Python), and injects canary targets (honeypot hashes where the server knows the
private key) to detect cheating workers.
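One clarification on the verification step, since "deterministic checkpoint regeneration" is doing a lot of work in that sentence: every candidate key is a pure function of (mode, offset), so the server can re-derive any reported checkpoint without trusting the worker. A hypothetical host-side mirror of that mapping (the actual server-side reimplementation is in Python):

    #include <cstdint>

    enum Mode : uint32_t { MODE_GLIBC_LCG = 0 /* ...22 more modes */ };

    struct Candidate { uint8_t key[32]; };

    // Pure function of (mode, offset): GPU workers and the verifying server
    // compute exactly this, so a checkpoint is just an (offset, key) pair
    // the server can recompute and compare.
    Candidate candidate_from(Mode mode, uint64_t offset)
    {
        Candidate c{};
        switch (mode) {
        case MODE_GLIBC_LCG: {
            uint32_t state = (uint32_t)offset;     // offset doubles as the seed
            for (int i = 0; i < 32; ++i) {
                state = state * 1103515245u + 12345u;
                c.key[i] = (uint8_t)(state >> 16); // same replay as the GPU side
            }
            break;
        }
        // ...one deterministic replay per remaining mode
        }
        return c;
    }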
Happy to answer questions about the CUDA optimization, the secp256k1 implementation, or the weak entropy patterns.