My first instinct, knowing less about this domain than maybe I should, would be to abuse the return address predictor. I believe CPUs will generally predict the target of a “ret” instruction using an internal stack of return addresses; some ARM flavours even make this explicit (https://developer.arm.com/documentation/den0042/0100/Unified...).
The way to abuse this would be to put send() on the normal return path and call abandon() by rewriting the return address. In code:
void resolve(Transaction *t) {
    predict(t);
    send(t);
}

void predict(Transaction *t) {
    // On x86-64 with frame pointers, the return address sits one word above
    // the saved frame pointer; overwrite it so predict() "returns" into
    // abandon() instead of back into resolve().
    if (!should_send(t))
        *((void **)__builtin_frame_address(0) + 1) = (void *)&abandon;
}
This isn’t exactly correct because it ignores control flow integrity (which you’d have to bypass), doesn’t work like this on every architecture, and abandon() would need to be written partly in assembly to deal with the fact that the stack is in a weird state post-return, but hopefully it conveys the idea anyway. The if in predict() is implementable as a branchless conditional move. The return address predictor should guess that predict() will return to send(), but in most cases you’ll smash the return address to point at abandon() instead.
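A minimal sketch of that branchless version (same x86-64/frame-pointer assumptions as the snippet above; whether the compiler actually emits a cmov is worth checking in the disassembly):

void predict(Transaction *t) {
    // Sketch only: select the post-return target without a conditional jump.
    void **ret_slot = (void **)__builtin_frame_address(0) + 1; // slot holding the return address
    void *keep   = *ret_slot;          // normal path: fall through into send()
    void *divert = (void *)&abandon;   // failure path
    // Compilers will often turn this ternary into a cmov rather than a branch.
    *ret_slot = should_send(t) ? keep : divert;
}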
Seems you'd be doing this anyway with the dummy transactions.
Then you have no branch, though you may want to add dummy transactions anyway to keep the code in cache.
The article suggests flooding the system with dummy "should send" transactions so that they become the majority.
Quote:
> One such solution that I know of is one that Carl Cook talked about during his CppCon 17 talk2: we can fill our system with mocked transaction data for which should_send(t) returns true. We have to do this enough times such that the mocked transaction data becomes the vast majority over the real data, so the branch predictor will be primed to assume that the send() path will be executed practically on every resolve() call. Overall, this may actually be a better approach than hard-coding prediction rules, because those rules wouldn’t really guarantee us that the whole send() path would get executed (the assumption may just lead to partial execution, until the should_send(t) condition is actually evaluated and the pipeline is flushed; so at the end we may still have important stuff not placed in the instruction/data cache).
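A rough sketch of that warm-up idea (make_mock_transaction() and the dry_run flag are made-up names; the real plumbing for keeping mock sends off the wire is the hard part):

void warm_hot_path(int iterations) {
    // Keep the predictor and caches trained on the send() path by pushing
    // mocked transactions through resolve() between real events.
    for (int i = 0; i < iterations; ++i) {
        Transaction *t = make_mock_transaction(); // crafted so should_send(t) is true
        t->dry_run = 1;     // hypothetical flag: send() stops just short of the NIC
        resolve(t);         // exercises the full hot path
    }
}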
What I am suggesting is to remove the branch entirely, and instead poison the "should abandon" transactions so they get dropped or null-routed on the NIC. This is the kind of thing that low-latency cut-through L1 switches do.
Thereby removing the CPU branch predictor from the equation entirely.
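As a sketch of what that could look like (the MAC addresses, nic_transmit(), and the idea that the switch null-routes one particular destination are all assumptions):

#include <stdint.h>
#include <string.h>

static const uint8_t real_dst[6]  = {0x02, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE}; // exchange-facing next hop (made up)
static const uint8_t blackhole[6] = {0x02, 0xDE, 0xAD, 0xBE, 0xEF, 0x00}; // switch null-routes this (made up)

void emit(Transaction *t, uint8_t *frame) {
    // Always build and transmit the frame; a branchless select poisons the
    // destination MAC on "abandon" so the cut-through switch drops it for us.
    const uint8_t *dst = should_send(t) ? real_dst : blackhole; // typically a cmov, not a jump
    memcpy(frame, dst, 6);   // destination MAC is the first 6 bytes of the Ethernet header
    nic_transmit(frame);     // unconditional: the "branch" now lives in the switch
}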
> Those ARM instructions are just hallucinated, and the reality is actually the other way around: ARM doesn’t have a way of hard-coding ‘predictions’, but x86 does.
This made me chuckle. Thanks.
Yes, sorry, you’re correct. I’ve usually had 97 more double ristrettos by this time in the morning.
Some schools of thought suggest this has already happened.
It was so detailed it makes me wonder if maybe it was in the training data somewhere. Maybe it ingested an internal MS doc for a proposed API or something. The case of the missing ARM instructions makes me wonder the same. Maybe someone on a forum proposed them and they were ingested.
I did actually verify on the OS that the calls do not exist in the kernel or any driver or DLL on the system.
They've been fed enough real text in each of these various styles that they can parrot the shallow surface style without reference to the internals.
Last time I checked a CPU's documentation, it had a simple static rule that branches are always predicted taken, which would be easy for the compiler to exploit by ordering code accordingly. I don't recall which CPU that was, though. Still, this whole thing feels like it needs a citation, and I'm suspicious it's false: CPU designers know this matters, and compilers sometimes have information the CPU doesn't that users care about, so they document how prediction works. (This is about the CPU, not the instruction set.)
Was that in the 80s? Modern performant CPUs all use dynamic branch prediction.
I’m not really sure I understand the “where does it get the memory” point. Yes, this requires some amount of memory per tracked branch. This memory is hardwired into the CPU, just like all the other memory a CPU needs to do its job (registers, L1 cache, TLB, various in-flight instruction state, etc.)
This sucks.
The trigger doesn't happen often, but when it does, the naive implementation incurs a misprediction penalty.
I wonder if the solution here might be to avoid the branch altogether?
Write some boolean logic monster that isn't an if condition. It unconditionally creates the reaction (new/cancel order), but if the trigger was not present, the payload is invalid and downstream processing will not let it out of your system onto the network.
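Something like this, as a sketch (Signal, Order, fill_order(), and trigger_present() are invented names, and "downstream drops a zero-length message" is an assumption about the pipeline):

#include <stdint.h>

void build_reaction(const Signal *s, Order *out) {
    // Always build the reaction; a mask derived from the trigger decides
    // whether it stays well-formed, with no conditional jump anywhere.
    fill_order(s, out);                                       // runs unconditionally
    uint64_t ok = (uint64_t)0 - (uint64_t)trigger_present(s); // all-ones if triggered, else zero
    out->length   &= ok;   // zero length => downstream refuses to put it on the wire
    out->checksum &= ok;   // and the checksum no longer matches, as a second guard
}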
1. They don't work that well.
2. The intended use case is that you'd label an unlikely branch that you want to speed up as `[[likely]]`, which is confusing.
They are certainly motivated by good intentions (like the HFT use-case as mentioned in TFA, I remember that talk too, I think I was a volunteer that year for CppCon), but even after knowing both of my points above _they were still added to the standard_.
2 - Can you back that up? Generally [[likely]] is for the codepath that you want to be faster, where common path = fast path. It is a specific HFT use case to desire the fast path to be the uncommon path. [[likely]] is definitely intended to be the fast path per https://en.cppreference.com/w/cpp/language/attributes/likely
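For reference, the intended shape of it (C++20; Packet, handle(), and drop() are placeholders, and the compiler is free to ignore the hint):

int process(Packet &p) {
    if (p.valid()) [[likely]] {     // the path you expect to be hot and want laid out as the fast path
        return handle(p);
    } else [[unlikely]] {           // rare path, e.g. malformed input
        return drop(p);
    }
}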
I write software packet processors for a living, and it's common to have requirements stated as "process X packets per second, no matter what", i.e. you're judged by the worst-case kind of packets. Also common is optimization of 99.9th percentile latency.
There's also just cases like if (being ddosed) or if (overloaded) where you definitely want the rare case to be faster.
As to point 1, there's... significant doubt as to whether these actually change performance. I have never personally managed to confirm a performance gain from them, and not for lack of trying. CPUs also seem not to give them a ton of weight, and will probably override you if they see a way more common other path.
For example, predicting a different branch can lead to different memory access patterns. Maybe you can get rid of most of the real-world cost of the misprediction by just forcefully prefetching the memory the send() path is going to use.
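A sketch of that with GCC/Clang's __builtin_prefetch (which fields the send() path actually touches, t->payload and t->route here, is a guess):

void resolve(Transaction *t) {
    // Start pulling in the data send() will need regardless of the prediction,
    // so a mispredict doesn't also pay for a cache miss.
    __builtin_prefetch(&t->payload, 0 /* read */, 3 /* high temporal locality */);
    __builtin_prefetch(t->route,    0, 3);
    if (should_send(t))
        send(t);
    else
        abandon(t);
}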
tux3•2mo ago
This isn't always a win, because you prevent the CPU from speculating down the wrong path, but you also prevent it from speculating the correct path.
If you really don't care about the failure path and really don't mind unmaintainable low-level hacks, I can think of a few ways to get creative.
First there's the whole array of anti uarch-speculation-exploit tricks in the kernel that you can use as inspiration to control what the CPU is allowed to speculate. These little bits of assembly were reviewed by engineers from Intel and AMD, so these tricks can't stop working without breaking the kernel along with them.
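The simplest of those, sketched (x86-only; an lfence used as a speculation barrier in the spirit of the kernel's barrier_nospec, at the cost of stalling both paths):

static inline void speculation_barrier(void) {
    asm volatile("lfence" ::: "memory");   // x86: no later instruction starts until everything before has completed
}

void resolve(Transaction *t) {
    int go = should_send(t);
    speculation_barrier();   // the decision is final before any work starts, speculatively or otherwise
    if (go)
        send(t);
    else
        abandon(t);
}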
Another idea is to take inspiration from anti-reverse engineering tricks. Make the failure path an actual exception. I don't mean software stack unwinding, I mean divide by your boolean and then call your send function unconditionally. If the boolean is true, it costs nothing because the result of the division is unused and we just speculate past it. If the boolean is false, the CPU will raise a divide by 0 exception, and this invisible branch will never be predicted by the CPU. Then your exception handler recovers and calls the cold path.
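A sketch of that trick on Linux/x86-64 (it leans on hardware trap behaviour rather than the C++ abstract machine, where dividing by zero is undefined; the sigsetjmp recovery is one way for the handler to "recover and call the cold path", and the signal round-trip is brutally slow, which only works if the failure path truly doesn't matter):

#include <setjmp.h>
#include <signal.h>

static sigjmp_buf cold_path;   // single-threaded sketch

static void on_sigfpe(int) {
    siglongjmp(cold_path, 1);   // can't just return: the faulting div would re-execute
}

void install_handler(void) {
    signal(SIGFPE, on_sigfpe);  // done once at startup
}

void resolve(Transaction *t) {
    if (sigsetjmp(cold_path, 1)) {   // re-entered from the handler: the invisible "branch" fired
        abandon(t);
        return;
    }
    volatile int ok   = should_send(t);  // 1 = send, 0 = abandon
    volatile int poke = 1 / ok;          // ok == 0 raises #DE -> SIGFPE; never seen by the predictor
    send(t);                             // reached only when ok == 1
}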
sparkie•2mo ago
If we're sending significantly more often than 1/256 calls to resolve, it may be possible to train the BTP to prefer the send branch, as it will correctly predict this branch more often than the others which are chosen randomly - though this would depend on how the branch target predictor is implemented in the processor.
Dwedit•2mo ago
On GCC, it's multiply by 1103515245 then add 12345, and return low 30 bits. On MSVC, it's multiply by 214013 and add 2531011 then return bits 16...30.
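Written out as code, per the description above (these follow the comment's description and haven't been checked against the actual library sources):

unsigned gcc_style_rand(unsigned &state) {
    state = state * 1103515245u + 12345u;
    return state & 0x3FFFFFFFu;         // low 30 bits
}

unsigned msvc_style_rand(unsigned &state) {
    state = state * 214013u + 2531011u;
    return (state >> 16) & 0x7FFFu;     // bits 16..30
}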
sparkie•2mo ago