Stack: Python asyncio, Kafka in KRaft mode, ClickHouse, k3s. Cloudflare Tunnel handles ingress.
Some things that broke along the way:
ORDERBOOK GAPS
Exchanges skip sequence numbers sometimes. Your local book drifts and you don't notice until something goes wrong. Had to build per-symbol gap detection with automatic snapshot recovery. Each exchange does sequencing differently, so that's four separate implementations.
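Rough sketch of the per-symbol gap check, not the production code. It assumes the simplest case, where sequence numbers go up by exactly 1 per symbol; real exchanges differ, which is why there are four versions. `GapDetector`, `check` and `reset` are made-up names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GapDetector:
    """Tracks the last applied sequence number per symbol (illustrative helper)."""
    last_seq: dict[str, int] = field(default_factory=dict)

    def check(self, symbol: str, seq: int) -> str:
        """Return "ok", "stale" or "gap" for an incoming update."""
        prev = self.last_seq.get(symbol)
        if prev is None or seq == prev + 1:
            # First message for this symbol, or the expected next one.
            self.last_seq[symbol] = seq
            return "ok"
        if seq <= prev:
            # Duplicate or out-of-date update; safe to drop.
            return "stale"
        # Sequence jumped ahead. State is left untouched, so every later
        # update also reports "gap" until reset() is called.
        return "gap"

    def reset(self, symbol: str, snapshot_seq: int) -> None:
        """Call after loading a fresh snapshot for the symbol."""
        self.last_seq[symbol] = snapshot_seq
```

On "gap" the caller would fetch a fresh snapshot for that symbol, rebuild the book, and call `reset` with the snapshot's sequence number before applying further updates.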
CLICKHOUSE INSERTS
Started with small batches, and ClickHouse sat at 30% CPU just doing merges. Bumping the batch size to 5000 rows, flushed every 2 seconds, dropped that to 8%. Also moved inserts to an async queue so the Kafka consumer never blocks.
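A minimal sketch of the batching idea with plain asyncio, assuming rows arrive on an `asyncio.Queue` and that `insert` is whatever coroutine actually writes to ClickHouse (hypothetical here, swap in your client). `batch_writer` and the `None` stop sentinel are my own naming for the example.

```python
import asyncio
from typing import Awaitable, Callable

Row = tuple
InsertFn = Callable[[list[Row]], Awaitable[None]]


async def batch_writer(
    queue: asyncio.Queue,
    insert: InsertFn,
    max_rows: int = 5000,
    max_wait: float = 2.0,
) -> None:
    """Drain `queue` into batches and flush when a batch reaches max_rows
    or the max_wait interval has passed. Putting None on the queue stops it."""
    loop = asyncio.get_running_loop()
    batch: list[Row] = []
    deadline = loop.time() + max_wait
    running = True
    while running:
        try:
            timeout = max(0.0, deadline - loop.time())
            row = await asyncio.wait_for(queue.get(), timeout)
            if row is None:
                running = False  # stop after flushing what is left
            else:
                batch.append(row)
        except asyncio.TimeoutError:
            pass  # interval elapsed with no new row
        due = loop.time() >= deadline
        if batch and (not running or due or len(batch) >= max_rows):
            await insert(batch)
            batch = []  # new list, the flushed one is not mutated
        if due or not batch:
            deadline = loop.time() + max_wait
```

The consumer side only ever calls `queue.put_nowait(row)` (or `await queue.put(row)` on a bounded queue if you want backpressure), so a slow insert never stalls message handling directly.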
LOGGING
At 500 msg/s the logger was allocating thousands of strings per second. OOM killer got us twice before I figured it out. Set everything on the data path to WARNING and it went away.
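For reference, a small sketch of what that looks like with the standard `logging` module. The logger name `datapath` and the `on_message` handler are placeholders. Beyond the level change, passing arguments instead of pre-built f-strings means nothing gets formatted when the level is disabled.

```python
import logging

log = logging.getLogger("datapath")  # placeholder logger name
log.setLevel(logging.WARNING)


def on_message(symbol: str, seq: int, payload: dict) -> None:
    # %-style arguments are only formatted if the record is emitted,
    # unlike an f-string, which is built before the call is made.
    log.debug("recv %s seq=%d", symbol, seq)

    # Guard anything that is expensive to build just for logging.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("payload=%s", sorted(payload.items()))
```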
Current numbers:
- 120M+ messages/day
- Latency: P50 250ms, P95 400ms
- >99.8% data coverage
- 5 months, no major incidents
If anyone wants to poke around the data:
qalypto.com/data-lab
CSV samples, no signup needed.
Happy to answer questions.