I call into a vendor library that eventually calls into Solarflare's TCPDirect API to send packets to the network.
This call usually stays under 5us and has very little variance, but I was seeing the occasional 30ms or 60ms outlier.
I attached probes to track the latency of each call down the stack and it became quite easy to pin down the specific function.
Eventually tracking it down to a single assignment to a variable that was mmapped to a file, and getting spikes due to page cache writeback.
After figuring it out I found these two articles that go into more detail on the problem.
https://rigtorp.se/virtual-memory/
https://tungdam.medium.com/our-lessons-on-linux-writeback-do...
Here's one of the probes producing a histogram of the latency of the call to the TCPDirect function, which was surprising in how simple it was.
sudo bpftrace -p "$PID" -e '
// —— TCPDirect ——
uprobe:/lib/x86_64-linux-gnu/libonload_zf.so.1:zft_send_single {
@zf_start[tid] = nsecs;
}
uretprobe:/lib/x86_64-linux-gnu/libonload_zf.so.1:zft_send_single /@zf_start[tid]/ {
$lat = nsecs - @zf_start[tid];
@h_zft = hist($lat / 1000);
delete(@zf_start[tid]);
}
'
mrlongroots•1d ago
eBPF is incredible: I think of it as flipping the relationship between you and the perf tool. Instead of the tool giving you information, it asks you "what do you want to know?".
I don't know if it's google search but I find it hard to discover blogs etc. when I'm hunting for the syntax/language etc, having documented "case studies" is super useful.