On latency, measurement, and optimization in algorithmic trading systems

https://www.architect.co/posts/how-fast-is-it-really

36•auc•3d ago

Comments

nine_k•9h ago

Fast code is easy. But slow code is equally easy, unless you keep an eye, and measure.

And measuring is hard. This us why consistently fast code is hard.

In any case, adding some crude performance testing into your CI/CD suite, and signaling a problem if a test ran for much longer than it used to, is very helpful at quickly detecting bad performance regressions.

mattigames•8h ago

Exactly, another instance where perfect can be the enemy of good, many times you are better out deploying something to prod, have some fairly good logging system and whenever you see an spike in slowness you try to replicate the conditions that made it slow, and debug from there, instead of expecting to have the impossible perfect measuring system that can detect even missing atoms of networking cables.

auc•8h ago

Agreed, not worth making a huge effort toward an advanced system for measuring an ATS until you’ve really built out at scale

tombert•9h ago

I remember in 2017, I was trying to benchmark some highly concurrent code in F# using the async monad.

I was using timers, and I was getting insanely different times for the same code, going anywhere from 0ms to 20ms without any obvious changes to the environment or anything.

I was banging my head against it for hours, until I realized that async code is weird. Async code isn’t directly “run”, it’s “scheduled” and the calling thread can yield until we get the result. By trying to do microbenchmarks, I wasn’t really testing “my code”, I was testing the .NET scheduler.

It was my first glimpse into seeing why benchmarking is deceptively hard. I think about it all the time whenever I have to write performance tests.

am17an•8h ago

Typically you want to measure both things - time it takes to send an order and time it takes to calculate the decision to send an order. Both are important choke points, one for latency and the other for throughput (in case of busy markets, you can spend a lot of time deciding to send an order, creating backpressure)

The other thing is that L1/L2 switches provide this functionality, of taking switch timestamps and marking them, which is the true test of e2e latency without any clock drift etc.

Also, fast code is actually really really hard, you just to create the right test harness once

auc•8h ago

Yeah definitely. Don’t want to have an algo that makes money when times are slow but then blows up/does nothing when market volume is 10x

iammabd•7h ago

Yeah, Most people write for the happy path... Few obsess over the runtime behavior under stress.

webdevver•1h ago

sadly its usually cheaper to just kick the server once a week than to spend $$$xN dev hours doing the right thing and making it work well

...unless youre faang and can amortize the costs across your gigafleet

weinzierl•6h ago

Is your code really fast if you haven't measured it properly? I'd say measuring is hard but a prerequisite for writing fast code, so truly fast code is harder.

The number one mistake I see people make is measuring one time and taking the results at face value. If you do nothing else, measure three times and you will at least have a feeling for the variability of your data. If you want to compare two versions of your code with confidence there is usually no way around proper statistical analysis.

Which brings me to the second mistake. When measuring runtime, taking the mean is not a good idea. Runtime measurements usually skew heavily towards a theoretical minimum which is a hard lower bound. The distribution is heavily lopsided with a long tail. If your objective is to compare two versions of some code, the minimum is a much better measure than the mean.

Leszek•5h ago

Fast code isn't a quantum effect, it doesn't wait for a measurement to wave collapse into being fast. The _assertion_ that a certain piece of code is fast probably requires a measurement (maybe you can get away with reasoning, e.g. algorithmic complexity or counting instructions; each have their flaws but so does measurement).

bostik•2h ago

> The distribution is heavily lopsided with a long tail.

You'll see this in any properly active online system. Back in the previous job we had to drill it to teams that mean() was never an acceptable latency measurement. For that reason the telemetry agent we used provided out-of-the-box p50 (median), p90, p95, p99 and max values for every timer measurement window.

The difference between p99 and max was an incredibly useful indicator of poor tail latency cases. After all, every one of those max figures was an occurrence of someone or something experiencing the long wait.

These days, if I had the pleasure of dealing with systems where individual nodes handled thousands of messages per second, I'd add p999 to the mix.

ethan_smith•1h ago

For comparing HFT implementations, the 99th percentile is often more practical than minimum values since it accounts for tail latency while excluding extreme outliers caused by GC pauses or OS scheduling.

omgtehlion•4h ago

In HFT context (as in the article) measurement is quite easy: you tap incoming and outgoing network fibers and measure this time. Also you can do this in production, as this kind of measurement does not impact latency at all

dmurray•37m ago

The article also touches on some reasons this isn't enough. You might want to test outside of production, you might want to measure the latency when you decide to send no order, and you might want to profile your code at a more granular level than the full lifecycle of market data to order.

Attummm•2h ago

The title is clickbait, unfortunately.

The article states the opposite.

> Writing fast algorithmic trading system code is hard. Measuring it properly is even harder.

foobar10000•15m ago

You generally want the following in trading:

* mean/med/p99/p999/p9999/max over day, minute, second, 10ms

* software timestamps of rdtsc counter for interval measurements - am17 says why below

* all of that not just on a timer - but also for each event - order triggered for send, cancel sent, etc - for ease of correlation to markouts.

* hw timestamps off some sort of port replicator that has under 3ns jitter - and a way to correlate to above.

* network card timestamps for similar - solar flare card (amd now) support start of frame to start of Ethernet frame measurements.

Hidden interface controls that affect usability

Overthinking GIS (2024)

Valve conquered PC gaming – what comes next?

Local-first software (2019)

Serving 200M requests per day with a CGI-bin

Take Two: Eshell

July 5, 1687: When Newton explained why you don't float away

Eastern Baltic cod grow much smaller than they did due to overfishing

Show HN: I made Logic gates using CSS if() function

What a Hacker Stole from Me

Can we test it? Yes, was can [video]

How to Network as an Introvert

Development of a transputer ISA board

The Force-Feeding of AI on an Unwilling Public

Europe's first geostationary sounder satellite is launched

macOS Icon History

Volvo delivers 5,000th electric semi

"Swiss Cheese" Failure Model

X-Clacks-Overhead

Four integers are enough to write a Snake Game

Optimizing Tool Selection for LLM Workflows with Differentiable Programming

ClojureScript from First Principles [video]

Speeding up PostgreSQL dump/restore snapshots

On latency, measurement, and optimization in algorithmic trading systems

Are we the baddies?

Yet Another Zip Trick

Ask HN: If AGI were invented tomorrow which countries would fare better?

Techno-feudalism and the rise of AGI: A future without economic rights?

New study offers clues about what makes someone cool

Show HN: a community for collaborating on sideprojects

On latency, measurement, and optimization in algorithmic trading systems

Comments

Hidden interface controls that affect usability

Overthinking GIS (2024)

Valve conquered PC gaming – what comes next?

Local-first software (2019)

Serving 200M requests per day with a CGI-bin

Take Two: Eshell

July 5, 1687: When Newton explained why you don't float away

Eastern Baltic cod grow much smaller than they did due to overfishing

Show HN: I made Logic gates using CSS if() function

What a Hacker Stole from Me

Can we test it? Yes, was can [video]

How to Network as an Introvert

Development of a transputer ISA board

The Force-Feeding of AI on an Unwilling Public

Europe's first geostationary sounder satellite is launched

macOS Icon History

Volvo delivers 5,000th electric semi

"Swiss Cheese" Failure Model

X-Clacks-Overhead

Four integers are enough to write a Snake Game

Optimizing Tool Selection for LLM Workflows with Differentiable Programming

ClojureScript from First Principles [video]

Speeding up PostgreSQL dump/restore snapshots

On latency, measurement, and optimization in algorithmic trading systems

Are we the baddies?

Yet Another Zip Trick

Ask HN: If AGI were invented tomorrow which countries would fare better?

Techno-feudalism and the rise of AGI: A future without economic rights?

New study offers clues about what makes someone cool

Show HN: a community for collaborating on sideprojects