A couple of quants had built a random forest regression model that could take inputs like time of day, exchange, order volume etc and spit out an interval of what latency had historically been in that range.
If the latency moved outside that range, an alert would fire and then I would co-ordinate a response with a variety of teams e.g. trading, networking, Linux etc.
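To give a flavor of the mechanics, here is a sketch only; I never saw the quants' actual code. It assumes scikit-learn's RandomForestRegressor, numeric features (hour of day, an encoded exchange id, order volume), and uses the spread of per-tree predictions as a rough "historical range" for those conditions:

```python
# Sketch only: not the quants' actual model. Assumes numeric features such as
# hour of day, an encoded exchange id, and order volume; the "historical range"
# is approximated from the spread of per-tree predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_latency_model(X_hist, y_latency_us):
    """Fit a forest on historical (features, observed latency in microseconds)."""
    model = RandomForestRegressor(n_estimators=200, min_samples_leaf=50)
    model.fit(X_hist, y_latency_us)
    return model

def latency_interval(model, x_now, lo_pct=1, hi_pct=99):
    """Approximate the historical range via percentiles of per-tree predictions."""
    per_tree = np.array([tree.predict([x_now])[0] for tree in model.estimators_])
    return np.percentile(per_tree, lo_pct), np.percentile(per_tree, hi_pct)

def check_alert(model, x_now, observed_latency_us):
    """Fire an alert when the observed latency falls outside the modeled range."""
    lo, hi = latency_interval(model, x_now)
    if not (lo <= observed_latency_us <= hi):
        print(f"ALERT: {observed_latency_us:.0f}us outside historical [{lo:.0f}, {hi:.0f}]us")
```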
If we ruled out changes on our side as the culprit, we would reach out to the exchange and talk to our sales rep there, who might also pull in their networking team etc.
Some exchanges, EUREX comes to mind, were phenomenal at helping us identify issues. e.g. they once traced a latency increase to a cable they had swapped in that was a few feet longer than the old one.
One day, it's IEX, of Flash Boys fame, that triggers an alert. Nothing changed on our side so we call them. We go back and forth with their networking engineer and then the sales rep says, in almost hushed tones:
"Look, I've worked at other exchange so I get where you are coming from in asking these questions. Problem is, b/c of our founding ethos, we are actually not allowed to track our own internal latency so we really can't help you identify the root cause. I REALLY wish it was different."
I love this story b/c HN, as a technology-focused site, often thinks all problems have technical solutions, but sometimes the answer is actually a people or process solution.
Also, incentives and "philosophy of the founders" matter a lot too.
Many firms have them and they are a hybrid of:
- DevOps (e.g. we help, or own, deployments to production)
- SRE (e.g. we own the dashboards that monitor trading and we manage outages etc)
- Trading Operations (e.g. we would work with exchanges to set up connections, cancel orders etc)
My background is:
- CompSci/Economics BA
- MBA
- ~20 years of basically doing the above roles. I started out supporting an in-house Order Management System at a large bank and went from there.
For more detail, here is my LinkedIn: https://www.linkedin.com/in/alex-elliott-3210352/
I also have a thread about the types of outages you see in this line of work here: https://x.com/alexpotato/status/1215876962809339904?s=20
(I have a lot of other trading/SRE related threads here: https://x.com/alexpotato/status/1212223167944478720?s=20)
I'm a front office engineer at a prop firm -- always interesting to get insight into how others do it.
There are fairly close parallels with our setup, maybe with the exception of throwing new exchange connections to the dedicated networking group.
Always love watching their incident responses from afar (usually while getting impacted desks to put away the pitchforks). Great examples of crisis management, effectiveness, and prioritization under pressure, all while being extremely pragmatic about actual vs perceived risk.
(I'm sure joining KCG in August of 2012 was a wild time...)
A notable edge case here is that if EVERYTHING (i.e. market data AND orders) goes through the sequencer then you can, essentially, denial-of-service key parts of the trading flow.
e.g. one of the first exchanges to switch to a sequencer model was famous for having big market data bursts and then huge order entry delays b/c each order got stuck in the sequencer queue. In other words, the queue would be 99.99% market data with orders sprinkled in randomly.
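A toy illustration of that head-of-line blocking (numbers invented, not any specific exchange's design): a single FIFO drained at a fixed rate, where the rare orders sit behind whatever market data happens to be queued ahead of them.

```python
# Toy model: one burst of events arrives at t=0 into a single sequencer FIFO,
# which drains one event per microsecond. Rates and sizes are invented.
import random

PROCESS_TIME_US = 1.0
burst = ["order" if random.random() < 0.0001 else "mkt_data" for _ in range(100_000)]

# An order's delay is just its position in the queue times the drain rate,
# i.e. everything queued ahead of it (almost all market data).
order_delays_us = [i * PROCESS_TIME_US for i, kind in enumerate(burst) if kind == "order"]

print(f"orders in burst: {len(order_delays_us)}")
if order_delays_us:
    print(f"worst order delay: {max(order_delays_us) / 1000:.1f} ms "
          f"(stuck behind ~{int(max(order_delays_us))} market-data events)")
```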
for an exchange: market data is a projection of the order book, an observer that sits on the stream but doesn't contribute to it
and client ports have rate limits
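A rough sketch of those two points, with invented class and field names: the market-data side consumes the sequenced stream read-only, and the client port enforces a simple token-bucket budget.

```python
# Sketch with invented names.
# 1) Market data as a read-only projection: an observer consumes the sequenced
#    event stream and maintains a derived book view, but never emits events into it.
# 2) Client ports enforce a rate limit (here a simple token bucket).
import time
from collections import defaultdict

class MarketDataProjector:
    """Consumes sequenced events in order and maintains a derived (toy) book."""
    def __init__(self):
        self.book = defaultdict(int)   # price -> net resting quantity

    def on_event(self, event):
        if event["type"] == "add":
            self.book[event["price"]] += event["qty"]
        elif event["type"] == "cancel":
            self.book[event["price"]] -= event["qty"]
        # ...publish an incremental update downstream; nothing flows back upstream

class ClientPort:
    """Throttles order submissions beyond a per-second message budget."""
    def __init__(self, msgs_per_sec=100):
        self.rate = msgs_per_sec
        self.tokens = float(msgs_per_sec)
        self.last = time.monotonic()

    def try_submit(self, order):
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            return False               # exchange would reject / throttle here
        self.tokens -= 1
        return True                    # forward to the sequencer
```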
dmurray•1h ago
First, it is of course possible to apply horizontal scaling through sharding. My order on Tesla doesn't affect your order on Apple, so it's possible to run each product on its own matching engine, its own set of gateways, etc. Most exchanges don't go this far: they might have one cluster for stocks starting A-E, etc. So they don't even exhaust the benefits available from horizontal scaling, partly because this would be expensive.
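A minimal sketch of that kind of symbol-range sharding (shard boundaries and cluster names invented): each shard is its own sequencer plus matching engine, and orders routed to different shards never interact.

```python
# Symbol-range sharding sketch; boundaries and cluster names are invented.
SHARDS = {
    "A-E": "matching-cluster-1",
    "F-M": "matching-cluster-2",
    "N-S": "matching-cluster-3",
    "T-Z": "matching-cluster-4",
}

def shard_for(symbol: str) -> str:
    """Route an order to the cluster owning its symbol's letter range."""
    first = symbol[0].upper()
    for rng, cluster in SHARDS.items():
        lo, hi = rng.split("-")
        if lo <= first <= hi:
            return cluster
    raise ValueError(f"no shard for symbol {symbol}")

assert shard_for("AAPL") == "matching-cluster-1"
assert shard_for("TSLA") == "matching-cluster-4"
```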
On the other hand, it's not just the sequencer that has to process all these events in strict order; handing out a single increasing sequence number for every request might sound like the whole job, but the matching engine which sits downstream of the sequencer also has to consume all the events and apply a much more complicated algorithm: the matching algorithm described in the article as "a pure function of the log".
Components outside of that can generally be scaled more easily: for example, a gateway cares only about activity on the orders it originally received.
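To make the "pure function of the log" point concrete, here is a toy version (my own sketch, not the article's code): the entire book state is just a fold over the sequenced events, which is also what makes deterministic replay possible. The event fields (side/price/qty/seq) are invented and the matching logic is a bare-bones price-time toy.

```python
# Toy illustration: the book is a pure function of the sequenced event log.
# Limit orders only, price-time priority, no trade reports.
import heapq

def apply_event(book, event):
    """Fold one sequenced event into the book. book = (bids, asks) as heaps."""
    bids, asks = book
    side, price, qty, seq = event["side"], event["price"], event["qty"], event["seq"]
    if side == "buy":
        # Match against the best ask while prices cross
        while asks and asks[0][0] <= price and qty > 0:
            ask_price, ask_seq, ask_qty = heapq.heappop(asks)
            traded = min(qty, ask_qty)
            qty -= traded
            if ask_qty > traded:
                heapq.heappush(asks, (ask_price, ask_seq, ask_qty - traded))
        if qty > 0:
            heapq.heappush(bids, (-price, seq, qty))   # max-heap via negated price
    else:
        while bids and -bids[0][0] >= price and qty > 0:
            neg_bid, bid_seq, bid_qty = heapq.heappop(bids)
            traded = min(qty, bid_qty)
            qty -= traded
            if bid_qty > traded:
                heapq.heappush(bids, (neg_bid, bid_seq, bid_qty - traded))
        if qty > 0:
            heapq.heappush(asks, (price, seq, qty))
    return bids, asks

def replay(log):
    """Recover the book deterministically by replaying the sequenced log in order."""
    book = ([], [])
    for event in log:
        book = apply_event(book, event)
    return book
```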
The article is largely correct that separating the sequencer from the matching engine allows you to recover if the latter crashes. But this may only be a theoretical benefit. Replaying and reprocessing a day's worth of messages takes a substantial fraction of the day, because the system is already operating close to its capacity. And after it crashed, you still need to figure out which customers think they got their orders executed, and allow them to cancel outstanding orders.
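Back-of-envelope on the replay point, with invented numbers:

```python
# Invented numbers. If the live system runs near capacity, replay throughput
# isn't much higher than live throughput, so replay takes most of a trading day.
trading_day_hours = 8
live_utilization  = 0.8        # assumed: live traffic uses 80% of max throughput

# Day's messages = live_rate * hours; replaying at full (max) rate:
replay_hours = trading_day_hours * live_utilization
print(f"replaying an {trading_day_hours}h day takes ~{replay_hours:.1f}h "
      f"even at full speed (only {1 / live_utilization:.2f}x the live rate)")
```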
londons_explore•1h ago
For example, order A and order B might interact with each other... but they also might not. If we assume they do not, we can process them totally independently and in parallel, and then, only if we later determine they should have interacted, throw away the results and reprocess.
It is very similar to the way speculative execution happens in CPUs: assume something, then throw away the results if your assumption was wrong.
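A toy sketch of that optimistic scheme (not any real engine's design; the conflict rule and helper names are made up): process speculatively in parallel, validate, and redo only the conflicting orders on the ordered path.

```python
# Toy sketch of optimistic/speculative parallel processing of orders.
# The conflict rule and both processing paths are invented placeholders.
from concurrent.futures import ThreadPoolExecutor

def conflicts(a, b):
    """Assumed conflict rule: same instrument, opposite sides, crossing prices."""
    if a["symbol"] != b["symbol"] or a["side"] == b["side"]:
        return False
    buy, sell = (a, b) if a["side"] == "buy" else (b, a)
    return buy["price"] >= sell["price"]

def process_speculatively(order):
    """Fast path: assume the order interacts with nothing and simply rests."""
    return {"order": order, "speculative": True}

def match_sequentially(order):
    """Placeholder for the real, ordered matching path (the correct fallback)."""
    return {"order": order, "speculative": False}

def process_batch(orders):
    # 1) Process everything independently and in parallel
    with ThreadPoolExecutor() as pool:
        speculative = list(pool.map(process_speculatively, orders))

    # 2) Validate: any pair that should have interacted invalidates its speculation
    conflicting = set()
    for i in range(len(orders)):
        for j in range(i + 1, len(orders)):
            if conflicts(orders[i], orders[j]):
                conflicting.update((i, j))

    # 3) Keep the safe speculative results; throw away and reprocess the rest in order
    return [match_sequentially(orders[i]) if i in conflicting else speculative[i]
            for i in range(len(orders))]
```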
noitpmeder•37m ago
Happy to discuss more, might be off the mark... these optimizations are always very interesting in their theoretical vs actual perf impact.
blibble•7m ago
not necessarily
many exchanges allow orders in one instrument to match against orders in another
(very, very common on derivatives exchanges, e.g. implied matching between spread and outright contracts)