In-Memory C++ Leap in Blockchain Analysis

https://caudena.com/the-in-memory-c-leap-in-blockchain-analysis/

62•caudena•23h ago

Hey HN

We’re the core engineering team at Caudena (which is used globally by investigative and intelligence agencies, including: Europol, Interpol, BKA, DHS, IRS-CI, FBI, NPA and others), and we just released the technical details behind Prism - our real-time, in-memory C++ database for blockchain analysis.

To tackle the massive scale and complexity of blockchain data, we had to get creative with low-level engineering:

- We utilize barebone servers with 2TB RAM and 48 Cores.

- Implemented lock-free concurrent data structures

- Developed a custom memory management system

- Leveraged CPU-level vectorization

- Built a custom in-memory columnar/graph database from scratch

We’d love to AMA about:

- the engineering choices we made

- crazy optimizations that paid off

- pitfalls we hit

Ask us anything about scaling, memory trade-offs, building real-time analytics on immutable data, or the crypto-forensics space.

Looking forward to a great convo!

Comments

rubenvanwyk•7h ago

This should be a Show HN?

caudena•7h ago

We've been thinking about it, but according to Show HN rules, blog posts are considered off-topic.

Snoozus•7h ago

We built something very similar back in 2016, on the JVM with unsafe memory and garbage-free data structures to avoid GC pauses. The dynamic clustering is not too hard; are you able to dynamically undo a cluster when new information shows up?

Are you running separate instances per customer to separate the information they have access to?

caudena•6h ago

Assuming by undoing you mean splitting the cluster:

A linked list can be split in two in O(1). When it comes to updating the roots for all the removed nodes, there is no easy way out, but luckily:

- This process can be parallelized.

- It could be done just once for multiple clustering changes.

- This is a multi-level disjoint set; usually not all levels or sub-clusters are affected. Upper-level clustering, which is based on a lower confidence level, can be rebuilt more easily.
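
A minimal sketch of the split described above, assuming each cluster is a singly linked list of address ids with a cached root per address. The `Clusters` type, its fields, and `split_after` are hypothetical names for illustration, not Caudena's code, and the relabel loop here is serial where the real one could be parallelized or batched:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sentinel meaning "no next member".
constexpr std::size_t kNone = static_cast<std::size_t>(-1);

// Hypothetical layout: every cluster is a singly linked list of address ids,
// and every address caches the id of its cluster root.
struct Clusters {
    std::vector<std::size_t> next;  // next[i]: following member, or kNone
    std::vector<std::size_t> root;  // root[i]: representative of i's cluster

    // Build one cluster containing 0..n-1 in order, rooted at 0.
    explicit Clusters(std::size_t n) : next(n, kNone), root(n, 0) {
        for (std::size_t i = 0; i + 1 < n; ++i) next[i] = i + 1;
    }

    // Detach everything after `last_kept` into a new cluster.
    // Cutting the link is O(1); relabeling the detached tail is O(k).
    // Returns the new root, or kNone if nothing followed `last_kept`.
    std::size_t split_after(std::size_t last_kept) {
        assert(last_kept < next.size());
        const std::size_t new_root = next[last_kept];
        next[last_kept] = kNone;  // the O(1) cut
        for (std::size_t i = new_root; i != kNone; i = next[i]) {
            root[i] = new_root;   // the relabel that has no shortcut
        }
        return new_root;
    }
};
```

Splitting a five-member cluster after member 1 leaves members 0 and 1 rooted at 0 and moves members 2 to 4 under a new root, 2.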

If by undoing you mean reverting the changes, we don’t use a persistent data structure. When we need historical clustering, we use a patched forest with concurrent hash maps to track the changes, and then apply or throw them away.
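
One way to picture the apply-or-discard idea is an overlay of parent overrides on top of an untouched base forest. This is only a sketch under assumptions: `PatchedForest` and its methods are hypothetical names, and a plain `std::unordered_map` stands in for the concurrent hash map, so this version is single-threaded:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: `base_` is the shared parent-pointer forest and
// `patch_` holds tentative overrides that can be applied or thrown away.
class PatchedForest {
public:
    explicit PatchedForest(std::vector<std::size_t> base) : base_(std::move(base)) {}

    // Parent of `node` as seen through the patch.
    std::size_t parent(std::size_t node) const {
        assert(node < base_.size());
        const auto it = patch_.find(node);
        return it != patch_.end() ? it->second : base_[node];
    }

    // Follow parent links to the root (a root is its own parent).
    std::size_t find(std::size_t node) const {
        while (parent(node) != node) node = parent(node);
        return node;
    }

    // Record a tentative re-parenting without touching the base forest.
    void relink(std::size_t node, std::size_t new_parent) { patch_[node] = new_parent; }

    // Fold the patch into the base forest.
    void apply() {
        for (const auto& kv : patch_) base_[kv.first] = kv.second;
        patch_.clear();
    }

    // Throw the patch away; the base forest was never modified.
    void discard() { patch_.clear(); }

private:
    std::vector<std::size_t> base_;
    std::unordered_map<std::size_t, std::size_t> patch_;
};
```

With a base of two clusters, {0, 1} and {2, 3}, calling `relink(2, 0)` makes `find(3)` return 0 until `discard()` restores the original answer, 2.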

We use a single instance for all clients, but when one CFD server processes new block data, it becomes fully blocked for read access. To solve this, we built a smart load balancer that redirects user requests to a secondary CFD server. This ensures there are always at least two servers running, and more if we need additional throughput.
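
The redirect logic could look roughly like the sketch below. It assumes each server exposes an "ingesting" flag that is set while it processes a block; `Server`, `ReadBalancer`, and `pick` are hypothetical names, and the real balancer presumably does much more (health checks, routing over the network):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the flag is set while a server is ingesting a block
// and is therefore closed to readers.
struct Server {
    std::atomic<bool> ingesting{false};
};

class ReadBalancer {
public:
    // The thread describes at least two servers, so require that here.
    explicit ReadBalancer(std::size_t n) : servers_(n) { assert(n >= 2); }

    Server& server(std::size_t i) { return servers_[i]; }

    // Round-robin over the servers, skipping any that are currently ingesting.
    // Returns the chosen index, or -1 if every server is busy.
    int pick() {
        const std::size_t n = servers_.size();
        const std::size_t start = cursor_.fetch_add(1, std::memory_order_relaxed);
        for (std::size_t k = 0; k < n; ++k) {
            const std::size_t i = (start + k) % n;
            if (!servers_[i].ingesting.load(std::memory_order_acquire)) {
                return static_cast<int>(i);
            }
        }
        return -1;
    }

private:
    std::vector<Server> servers_;
    std::atomic<std::size_t> cursor_{0};
};
```

With two servers, reads alternate between them until one sets its flag, after which every read lands on the other.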

joshstrange•7h ago

Do you see crypto as anything more than scams/crime/speculation?

Most people involved in crypto pretend it's the future and their business models depend on pumping up crypto. That might be the same for you all but I figure of anyone in the space, a group dedicated to tracking down where coins are moving for government agencies (I assume for scams/crime reasons) might not have the wool so pulled over their eyes.

caudena•6h ago

First of all, at Caudena, we are not involved in crypto projects or investments ourselves. Our expertise lies in analyzing blockchains and providing deep technical insights into how various blockchains operate. We focus on tracking and understanding the flow of digital assets, often in support of government agencies investigating scams, fraud, and other illicit activities.

That said, we absolutely believe that blockchain and cryptocurrency will shape the future of the financial system. When you look beyond the noise of scam tokens, speculative NFTs, and high-profile scandals, there is significant and meaningful financial innovation happening. This extends beyond DeFi to include the tokenization of RWA, where major institutions like BlackRock and JPM Chase are actively exploring and implementing blockchain-based solutions. Numerous projects are driving real progress, and there’s a slow but steady movement toward a more decentralized and transparent financial ecosystem.

jnkl•4h ago

Can you be a bit more specific about the practical aspects of block chain technology regarding RWA?

germandiago•5h ago

Who says that crypto is exclusively scams? There is that of course, but not only that. I do not find Bitcoin to be a scam.

newswasboring•5h ago

There are like attempts at non scam projects, but none of them get any traction and usually end up closing. What, in your opinion, is a success story in this space?

seviu•4h ago

Caudena's post above yours mentioned quite a few successful use cases, all built on top of Ethereum or copycats (Ethereum is itself a successful use case).

Without thinking too hard, Aave is shaping up to be a giant in its own right as a lending protocol.

Circle recently had a very successful IPO.

Farcaster and Lens are attempting to compete as social network platforms (surprisingly, they lack much of the toxicity found on the best-known ones).

And lastly don’t forget Polymarket, which is pretty well known beyond the crypto space.

The list goes on and on if you care to dig a bit deeper

newswasboring•3h ago

All of these are at best nice prospects and most of these are just services which are only useful if the space itself is useful. I'm sorry but I'm not convinced of utility by layers upon layers of "successful" protocols.

seviu•1h ago

I wouldn’t call polymarket a nice prospect though.

IshKebab•5h ago

Apart from Bitcoin is there anything successful that isn't a scam? I never heard of any.

drdrey•3h ago

Stripe would like to have a word

IshKebab•2h ago

What successful crypto products do they have?

plq•7h ago

When implementing the lock-free stuff, was portability (across processors) a goal? If yes, did you have to deal with anything specific? Do you notice any difference in behavior of correct implementations when run on different processors? How do you test for correctness of lock-free stuff?

EDIT: Oh and did you implement from scratch? Why not use e.g. the RCU implementation from folly?

caudena•5h ago

We never targeted weakly-ordered architectures like ARM, only x86, and we never used a wide variety of processors. We are not developing the Linux kernel and are not into control dependencies; we just rely on the fences and the memory model. There may be some CPU-dependent performance differences, such as discrepancies caused by NUMA, or false sharing being noticeable on one processor but not on another. RCU and hazard pointers are nothing new. For the disjoint sets we don't need them. For the forest patches and the tries we do. We are using TBB and OpenMP whenever possible and trying to keep things simple.
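
For readers unfamiliar with "relying on the memory model", here is a small sketch using only standard C++: a release store paired with an acquire load to publish a plain value, plus a counter padded to its own cache line to avoid false sharing. The names are illustrative, and the 64-byte line size is an assumption that holds on most current x86 parts:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// A counter aligned and padded to 64 bytes so that two of them never share
// a cache line (the 64-byte figure is an assumption, see above).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

// Publish a plain int through a release store and read it after an acquire
// load. The pairing guarantees the reader observes the payload write.
inline int publish_and_read() {
    int payload = 0;
    std::atomic<bool> ready{false};

    std::thread producer([&] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish it
    });

    while (!ready.load(std::memory_order_acquire)) {   // wait for the publish
        std::this_thread::yield();
    }
    const int seen = payload;  // ordered after the producer's write
    producer.join();
    return seen;
}
```

On x86 the release/acquire pair compiles to ordinary loads and stores, but the annotations still stop the compiler from reordering the accesses.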

folk111•6h ago

is it true that XMR / monero is untraceable?

caudena•6h ago

No.

layer8•6h ago

> barebone servers

You mean bare-metal servers?

caudena•6h ago

Ohh, you're absolutely right!

dboreham•5h ago

barebone server is a thing fwiw: A product that comprises a motherboard installed in a case with PSU. Customer adds CPU, memory and storage devices to make a complete usable server. We typically buy servers in this way because figuring out what motherboard fits in which case is a pita, conversely buying complete servers is more expensive and potentially runs into inventory issues at the vendor. So possibly they are running bare metal servers that were also barebone.

generalenvelope•5h ago

Curious why you chose C++? Were there aspects of other languages/ecosystems like Rust that were lacking? Would choosing Rust be advantageous for blockchains that natively support it (like Solana)?

To be clear: I don't mean to imply you should have done it any other way. I'm interested mainly in gaps in existing ecosystems and whether popular suggestions to "deprecate C++ for memory safe languages" (like one made by Azure CTO years ago) are realistic.

kanbankaren•5h ago

What is wrong with C++?

With POSIX semaphores, mutexes, and shared pointers, it is very rare to hit upon a memory issue in modern C++.

Source: Writing code in C/C++ for 30 years.

wat10000•5h ago

What a terrifying statement.

Edit: to be less glib, this is like saying “our shred-o-matic is perfectly safe due to its robust and thoroughly tested off switch.” An off switch is essential but not nearly enough. It only provides acceptable safety if the operator is perfect, and people are not. You need guards and safety interlocks that ensure, for example, that the machine can’t be turned on while Bob is inside lubricating the bearings.

Mutexes and smart pointers are important constructs but they don’t provide safety. Safety isn’t the presence of safe constructs, but the absence of unsafe ones. Smart pointers don’t save you when you manage to escape a reference beyond the lifetime of the object because C++ encourages passing parameters by reference all over the place. Mutexes and semaphores don’t save you from failing to realize that some shared state can be mutated on two threads simultaneously. And none of this saves you from indexing off the end of a vector.
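
The vector point is easy to show with standard calls only. In this sketch (the helper name is made up for illustration), `at()` reports a bad index by throwing, while the unchecked `v[i]` with the same index would be undefined behaviour that the compiler accepts silently:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if reading v.at(i) throws std::out_of_range.
// The unchecked form v[i] is undefined behaviour for the same index.
inline bool checked_read_fails(const std::vector<int>& v, std::size_t i) {
    try {
        (void)v.at(i);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

Nothing forces a caller to pick `at()` over `operator[]`, which is the gap being described: the safe construct exists, but the unsafe one is just as easy to reach.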

You can probably pick a subset of C++ that lets you write reasonably safe code. But the presence of semaphores, mutexes, and shared pointers isn’t what does it.

Source: also writing C and C++ for 30 years.

lisper•3h ago

> Safety isn’t the presence of safe constructs, but the absence of unsafe ones.

Exactly. Here is a data point: https://spinroot.com/spin/Doc/rax.pdf

TL;DR: This was software that ran on a spacecraft. Specifically designed to be safe, formally analyzed, and tested out the wazoo, but nonetheless failed in flight because someone did an end-run around the safe constructs to get something to work, which ended up producing a race condition.

nesarkvechnep•5h ago

The worst code is usually written by someone who’s doing it for 30 years and can’t find a problem with their technology of choice.

Especially with shared pointers you can encounter pretty terrible memory issues.
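
One well-known case is an ownership cycle: two objects holding `std::shared_ptr` to each other are never freed. The sketch below uses a hypothetical `Node` type and a `std::weak_ptr` probe to observe whether the first node survives its scope; it breaks the cycle afterwards so the demo itself does not leak:

```cpp
#include <cassert>
#include <memory>

struct Node {
    std::shared_ptr<Node> strong_peer;  // owning link
    std::weak_ptr<Node> weak_peer;      // non-owning link
};

// Link two nodes to each other, drop the local handles, and report whether
// the first node is still alive afterwards (true means it would have leaked).
inline bool leaks_after_scope(bool use_weak_back_link) {
    std::weak_ptr<Node> probe;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        probe = a;
        a->strong_peer = b;
        if (use_weak_back_link) {
            b->weak_peer = a;    // back link does not own: no cycle
        } else {
            b->strong_peer = a;  // a and b now keep each other alive
        }
        assert(!probe.expired());  // alive while the local handles exist
    }
    const bool leaked = !probe.expired();
    if (auto a = probe.lock()) a->strong_peer.reset();  // break the cycle, free both
    return leaked;
}
```

With a strong back link the pair outlives its scope; with a weak back link both nodes are destroyed as soon as the handles go away.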

kanbankaren•4h ago

Dude, provide examples of "terrible" memory issues. Otherwise, you are just repeating the folklore which is outdated.

CharlesW•4h ago

> With POSIX semaphores, mutexes, and shared pointers, it is very rare to hit upon a memory issue in modern C++.

There is a mountain of evidence (two examples follow) that this is not true. Roughly two-thirds of serious security bugs in large C++ products are still memory-safety violations.

(1) https://msrc.microsoft.com/blog/2019/07/we-need-a-safer-syst... (2) https://www.chromium.org/Home/chromium-security/memory-safet...

kanbankaren•4h ago

Show me a memory issue that was caused by proper usage of POSIX concurrency primitives.

jenadine•4h ago

Proper usage is fine. The problem is that it is easy to make mistakes. The compiler won't tell you and you may not notice until too late in production, and it will take forever to debug.

CharlesW•4h ago

Here's two: CVE-2021-33574, CVE-2023-6705. The former had to be fixed in glibc, illustrating that proper usage of POSIX concurrency primitives does nothing when the rest of the ecosystem is a minefield of memory safety issues. There are some good citations on page 6 of this NSA Software Memory Safety overview in case you're interested: https://media.defense.gov/2022/Nov/10/2003112742/-1/-1/0/CSI_SOFTWARE_MEMORY_SAFETY.PDF

treyd•4h ago

You're right, if you use the concurrency primitives properly you won't have data races. But the issue is when people don't use the concurrency primitives properly, which there is ample evidence for (posted in this thread) happening all the time.

But with this argument, the response is "well they didn't use the primitives properly so the problem is them", which shifts the blame onto the developer and away from the tools which are too easy to silently misuse.

This also ignores memory safety issues that aren't data races, like buffer overflows, UAF, etc.

wat10000•3h ago

Any reasonable meaning of “proper” would include not causing memory issues, so you’ve just defined away any problems. Note that this is substantially different from not having any problems.

The great lesson in software security of the past few decades is that you can’t just document “proper usage,” declare all other usage to be the programmer’s fault, and achieve anything close to secure software. You must have systems that either disallow unsafe constructs (e.g. rust preventing references from escaping at compile time) or can handle “improper usage” without allowing it to become a security vulnerability (e.g. sandboxing).

Correctly use your concurrency primitives and you won’t have thread safety bugs, hooray! And when was the last time you found a bug in C-family code caused by someone who didn’t correctly use concurrency primitives because the programmer incorrectly believed that a certain piece of mutable data would only be accessed on a single thread? I’ll give you my answer: it was yesterday. Quite likely the only reason it’s not today is because I have the day off.

kanbankaren•2h ago

> And when was the last time you found a bug in C-family code caused by someone who didn’t correctly use concurrency primitives because the programmer incorrectly believed that a certain piece of mutable data would only be accessed on a single thread? I’ll give you my answer: it was yesterday.

You answered my question. My original argument was using concurrency primitives "properly" in C++ prevents memory issues and Rust isn't strictly necessary.

I have nothing against Rust. I will use it when they freeze the language and publish an ISO spec and multiple compilers are available.

bobmcnamara•2h ago

dozens caused by folks thinking pthread_cancel() was the right tool for the job

npalli•4h ago

Rust is the future of systems programming and will always be for the foreseeable future. The memory issue will mostly be addressed as needed (see John Carmack's post from yesterday [1]); the C++ ecosystem advantage (a broad sense of how problems in DS, storage, OS, networking, etc. have been solved) will be very hard to overcome for newer programming languages. I think it is ironic how modern C++ folks just keep chugging along releasing products while Rust folks are generally haranguing everyone about "memory safety" and generally leaving half-finished projects (turns out writing Rust code is more fun than reading someone else's, who would have guessed).

[1] https://x.com/ID_AA_Carmack/status/1935353905149341968

wgjordan•4h ago

> The memory issue will mostly be addressed as needed

I have no allegiance to either lang ecosystem, but I think it's an overly optimistic take to consider memory safety a solved problem from a tweet about fil-c, especially considering "the performance cost is not negligible" (about 2x according to a quick search?)

npalli•4h ago

A performance drop of 2x for memory-safety-critical sections vs. a Rust rewrite taking years or decades: not even a contest. Now, if that drop was 10x, maybe, but at 2x it is a no-brainer to continue with C++. I'm not certain Fil-C totally works in all cases, but it is an example of how the ecosystem will evolve to solve this issue and not migrate to Rust.

hexaga•2h ago

What would you consider to be a non memory safety critical section? I tried to answer this and ended up in a chain of 'but wait, actually memory issues here would be similarly bad...', mainly because UB and friends tend to propagate and make local problems very non-local.

secondcoming•4h ago

What's this Rust thing?

caudena•4h ago

Because we are in 'unsafe' territory. And Rust doesn't even have a defined memory model. Rust is a little bit immature. We have some other services written in Rust though.

wslh•5h ago

Thank you for the AMA. A few initial questions:

- Would it be possible to open source your DB in the future? I think there are challenges in blockchain analysis (e.g. internal transactions) that go beyond the specific DB.

- Having used Chainalysis and others, your product seems superior based on your presentation. Which blockchains do you support?

- Is there a "HN Code" to test Prism?

caudena•4h ago

Thanks for the questions! We don't currently have plans to open-source it. For anything else, feel free to reach out at pa@caudena.com - happy to discuss further there. We'd like to keep this thread focused on the technical side rather than product discussions :)

Snoozus•5h ago

If the FBI tells you wallet A and wallet B belong to the same actor, how do you use that information, so that they can see it on their view, without leaking it to Europol?

caudena•4h ago

Are you from CA or CT? :)

FBI and Europol will work with the same forest (unless they are using on-premise setup), but with different "patches".
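
As a rough illustration of "same forest, different patches": one shared parent-pointer forest, plus a private overlay of parent overrides per client, consulted only for that client's queries. `SharedForest` and its methods are hypothetical names, and this is a single-threaded sketch rather than the actual design:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: a shared base forest with one private patch per client.
class SharedForest {
public:
    explicit SharedForest(std::vector<std::size_t> base) : base_(std::move(base)) {}

    // Record a link that only `client` should see.
    void add_private_link(const std::string& client, std::size_t node, std::size_t parent) {
        assert(node < base_.size() && parent < base_.size());
        patches_[client][node] = parent;
    }

    // Cluster root of `node` as seen by `client`.
    std::size_t find(const std::string& client, std::size_t node) const {
        const auto patch = patches_.find(client);
        for (;;) {
            std::size_t parent = base_[node];
            if (patch != patches_.end()) {
                const auto hit = patch->second.find(node);
                if (hit != patch->second.end()) parent = hit->second;
            }
            if (parent == node) return node;  // a root is its own parent
            node = parent;
        }
    }

private:
    using Patch = std::unordered_map<std::size_t, std::size_t>;
    std::vector<std::size_t> base_;
    std::unordered_map<std::string, Patch> patches_;
};
```

If one client links wallet 1 to wallet 0, only that client's `find` returns 0 for wallet 1; every other client still sees wallet 1 as its own cluster.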

BiraIgnacio•5h ago

I didn't even know there were companies doing work in the "blockchain services" space. Kinda cool, tech begets tech, begets tech.

Love the C++ work, btw

canyp•4h ago

You really had to call it Prism (PRISM), didn't you?

It's great to see C++ resulting in orders of magnitude cost reduction anyway. Do you have more details on the various C++ tricks done for optimization?

caudena•4h ago

Yeah, we figured people would compare it to PRISM :)

There are many possible optimizations, but they’re all highly specific to the particular problems you’re trying to solve.

CharlesW•4h ago

> "Built a custom in-memory columnar/graph database from scratch"

This seems like an odd place to spend your resources. What do Prism's benchmarks look like vs Memgraph, KX kdb+, Apache Ignite, TigerGraph, etc.?

actionfromafar•3h ago

Could it be related to the fact that the data field sizes are fixed and never change?
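
To make the fixed-width point concrete, here is a sketch of a column layout: one contiguous array per field instead of one struct per row. The `OutputColumns` type and its fields are invented for illustration and say nothing about Prism's actual schema:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical column store for transaction outputs.
struct OutputColumns {
    std::vector<std::uint64_t> value;  // amount in the chain's smallest unit
    std::vector<std::uint32_t> block;  // block height

    void append(std::uint64_t v, std::uint32_t height) {
        value.push_back(v);
        block.push_back(height);
    }

    // Sum the values of rows whose block height lies in [from_block, to_block].
    // A simple counted loop over two dense arrays is the shape compilers can
    // often auto-vectorize.
    std::uint64_t sum_between(std::uint32_t from_block, std::uint32_t to_block) const {
        assert(value.size() == block.size());
        std::uint64_t total = 0;
        for (std::size_t i = 0; i < value.size(); ++i) {
            const bool in_range = block[i] >= from_block && block[i] <= to_block;
            total += in_range ? value[i] : 0;
        }
        return total;
    }
};
```

Because every field has a fixed width, row i of any column sits at a computable offset, so a scan never has to parse or chase pointers.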

kayamon•1h ago

why you spyin on folks
