One of the things that struck me when reading this with only general knowledge of the Linux kernel is: what makes things so terrible? Is iptables really that bad? Is something serialized to a single core somewhere in the other 3 scenarios? Is the CPU at 100% in all cases? Is this TCP or UDP traffic? How many threads is iperf using? It would be cool to see the CPU utilization of all 4 scenarios, along with CPU flamegraphs.
You’ll have to wait for the follow-up post with the CNI plugin for the fully reproducible benchmark, but on a 16-core EC2 instance with a 10Gbps connection, iptables couldn’t do more than 5Gbps of throughput (TCP!), whereas XDP was able to do 9.84Gbps on average.
Furthermore, running bidirectional iPerf3 tests on the larger hosts shows us that both ingress and egress throughput increase when we swap out iptables on just the egress path.
This is all to say, our current assumption is that when the CPU is thrashed by iPerf3, the RSS queues, the Linux kernel’s ksoftirqd threads, etc. all at once, performance collapses. XDP moves some of the work out of the kernel stack, and at the same time the packet is only processed through the stack half as much as without XDP (only on the path before or after the veth).
It really is all CPU usage in the end as far as I can tell. It’s not like our checksumming approach is any better than what the kernel already does.
In the non-XDP case (eBPF on TC) you have to allocate an sk_buff and initialize it. This is very expensive: there's tons of accounting in the struct itself, and components that track every sk_buff. Then there are the various CPU-bound routing layers.
Overall the network core of Linux is very efficient. The actual page pool buffer isn't copied until the user reads the data. But there are a million features the stack needs to support, and all of them cost efficiency.
On what's now almost 10-year-old hardware, we could drop 44Mpps of a volumetric DoS attack and still serve our nominal workload with no impact. See PFILCTL(8) and PFIL(9); focus on Ethernet (link layer) packets.
It relies on the same principle -- the NIC passes the RX buffer directly to the firewall (ipfw, pf, or ipfilter). If the firewall says the packet is OK, RX processing happens as normal. If it says to drop, then dropping is very fast because it can simply reuse the buffer without re-allocating, re-doing the DMA mapping, etc.
The beauty of XDP is that it's all eBPF. Completely customizable by injecting policy where it's needed and native to the kernel.
You'll bypass a memory copy (ringbuf -> kernel memory), allocations (skb), parsing (IPs & such), firewalling, checking if the packet is local, checksum validation; the list goes on...
The following diagram helps show all the things that happen: https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilte...
(yes, XDP is the leftmost step, literally right after "card DMA'd packet into memory")
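To make that concrete, here's a minimal sketch of an XDP program, assuming a clang/libbpf toolchain (the program name and port are made up for illustration). Anything it drops never gets an sk_buff allocated for it:

    // Minimal XDP sketch: drop UDP packets to an arbitrary port, pass the rest.
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int drop_udp_9999(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        // Bounds checks are mandatory: the verifier rejects any access
        // that could read past data_end.
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS; // ARP etc. stay on the normal kernel path

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
            return XDP_PASS;

        struct udphdr *udp = (void *)(ip + 1); // assumes no IP options, for brevity
        if ((void *)(udp + 1) > data_end)
            return XDP_PASS;

        if (udp->dest == bpf_htons(9999))
            return XDP_DROP; // RX buffer is recycled; no skb was ever allocated

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

XDP_PASS hands the packet to the regular stack, so ARP, ICMP, local delivery and friends keep working untouched.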
You can also use XDP for outgoing packets on tap interfaces.
In some scenarios veth is being replaced with netkit for a similar reason. Does this impact how you're going to manage this?
Honestly, the real news is that they're doing it in production, not that they found anything unique.
Heck, all the XDP development I've ever done was against a veth interface on my laptop, to run later on server metal.
Maybe more importantly: they're not building a middlebox. DPDK's ultra-high performance comes in part from polling: it's always running. XDP is just an extension to the existing network driver.
Why doesn’t checksum offload in the NIC take care of that?
I'll definitely be coming to check you all out at Kubecon.
* The BPF verifier's DX is not great yet. If it finds problems with your BPF code it will spit out a rather inscrutable set of error messages that often requires a good understanding of the verifier internals (e.g. the register nomenclature) to debug
* For the same source code, the code generated by the compiler can change across compiler versions in a way that breaks verification, e.g. because the new compiler version implemented an optimization the verifier can't follow (see https://github.com/iovisor/bcc/issues/4612)
* Checksum updating requires extra care. I believe you can only do incremental updates, not just for the better perf the post suggests but also because the verifier does not allow BPF programs to operate on unbounded buffers (so checksumming a whole packet of unknown size is tricky / cumbersome). This mostly works, but you have to be careful with packets that were generated with csum offload and so don't yet have a valid checksum, whose csum therefore can't be incrementally updated. (A sketch of the incremental fold follows after this list.)
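For reference, the incremental update is just the one's-complement fold from RFC 1624; here's a minimal sketch in plain C (the function name is mine; in an actual XDP program you'd do the same arithmetic by hand or lean on bpf_csum_diff()):

    #include <stdint.h>

    // RFC 1624, eqn. 3: HC' = ~(~HC + ~m + m')
    // Recompute a 16-bit one's-complement checksum after a single 16-bit
    // field covered by it changes from `old` to `new`.
    static uint16_t csum_update16(uint16_t csum, uint16_t old, uint16_t new)
    {
        uint32_t sum = (uint16_t)~csum;
        sum += (uint16_t)~old;
        sum += new;
        // fold the carries back into 16 bits
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

Since this only touches the fields you actually rewrote, the verifier never has to reason about a variable-length buffer.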
As the blog post points out, the kernel networking stack does a lot of work that we don't generally think about. Once you start taking things into your own hands you don't have the luxury of ignorance anymore (think not just ARP but also MTU, routing, RP filtering etc.), something any user of userspace networking frameworks like DPDK will tell you.
My general recommendation is to stick with the kernel unless you have a very good justification for chasing better performance. If you do use eBPF, save yourself some trouble and try to limit yourself to read-only operations, if your use case allows.
Also, if you are trying to debug packet drops, newer kernels record a reason for every dropped skb (kfree_skb_reason()), which you can track using bpftrace for better diagnostics.
Example script (might have to adjust based on kernel version):

    bpftrace -e '
    kprobe:kfree_skb_reason {
      // arg0 = the sk_buff being freed, arg1 = enum skb_drop_reason
      $skb = (struct sk_buff *)arg0;
      $ipheader = (struct iphdr *)($skb->head + $skb->network_header);
      printf("reason: %d %s -> %s\n", arg1, ntop($ipheader->saddr), ntop($ipheader->daddr));
    }'
loopholelabs•1d ago
This post not only expands on the overall implementation but also outlines how existing container and VM workloads can immediately take advantage of it with minimal effort and zero infrastructure changes.
toprerules•1h ago
You can't compare the efficiency of the frameworks without talking about the specific setups on the host. The major advantage of XDP is that it is completely baked into the kernel. All you need to do is bring your eBPF program and attach it, as the sketch below shows. DPDK requires a great deal of setup and user space libraries to work.
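To illustrate how little that is, the whole XDP "setup" can be a handful of libbpf calls, or a single `ip link set dev eth0 xdp obj prog.o` command. A hedged sketch (the object, program, and interface names are assumptions):

    #include <bpf/libbpf.h>
    #include <linux/if_link.h>
    #include <net/if.h>
    #include <stdio.h>

    int main(void)
    {
        // Load the compiled eBPF object...
        struct bpf_object *obj = bpf_object__open_file("prog.o", NULL);
        if (!obj || bpf_object__load(obj))
            return 1;

        struct bpf_program *prog =
            bpf_object__find_program_by_name(obj, "xdp_prog");
        int ifindex = if_nametoindex("eth0");
        if (!prog || !ifindex)
            return 1;

        // ...and attach it; the NIC driver takes it from here.
        if (bpf_xdp_attach(ifindex, bpf_program__fd(prog),
                           XDP_FLAGS_UPDATE_IF_NOEXIST, NULL))
            return 1;

        printf("attached\n");
        return 0;
    }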
tuetuopay•23m ago
DPDK has a few fundamental drawbacks:
1. to get the absolute best performance, you're running in poll mode, and burning CPU cores just for packet processing
2. the network interface is invisible to the kernel, making non-accelerated traffic on said interface tricky (say, letting the kernel perform ARP resolution for you).
3. your dataplane is now a long-lived process, which means that stopping said process equates to no more network (hello restarts!)
Alleviating most of those takes a lot of effort or some tradeoffs making it less worth it:
1. can be mitigated by adaptive polling at the cost of latency.
2. by using either software bifurcation, re-injecting non-accelerated traffic into a tap, or hardware bifurcation on NICs that support it (e.g. ConnectX), installing the flows in the NIC's flow engine. Both are quite time-consuming to get right
3. by manually writing a handoff system between new and old processes, and making sure it never crashes
DPDK also needs its own runtime, with its own libraries. Some stuff will be manual (e.g. giving it routing tables). XDP gives all of those for free:
1. All modern NIC drivers already perform adaptive polling and interrupt moderation, so you're not burning CPU cycles polling the card outside of high packet rate scenarios (where you'd burn CPU on IRQs and context switches anyway).
2. It's just an extra bit of software in the driver's path, and the XDP program decides whether to handle a packet itself or pass it down to the kernel. Pretty useful for keeping ARP, ICMP, BGP, etc. working without extra code.
3. XDP is closer to a lambda than anything: the code runs once for every single packet, meaning its runtime is extremely short. This also means that the long-running process is your kernel, and that updating the code is an atomic operation done on the fly.
4. A lot of facilities are already provided, and the biggest of them is maps (see the sketch after this list). The kernel handles all the stateful plumbing needed to feed data (routing tables, ARP tables, etc.) to your dataplane code. CPU affinity is also handled by the kernel, in the sense that XDP runs on the CPU responsible for the NIC queue, whose mapping is controlled through standard kernel interfaces unrelated to XDP (meaning: not on your mind).
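To illustrate point 4: declaring a map is all it takes to get a kernel-managed table that userspace can update while the dataplane runs. A sketch with made-up names (route_table, struct route_key), assuming an IPv4 routing use case:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct route_key {
        __u32 prefixlen; // mandatory first field for LPM trie keys
        __u32 addr;      // IPv4 destination, network byte order
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(max_entries, 65536);
        __type(key, struct route_key);
        __type(value, __u32);                 // e.g. egress ifindex
        __uint(map_flags, BPF_F_NO_PREALLOC); // required for LPM tries
    } route_table SEC(".maps");

    // In the XDP program, longest-prefix match is a single lookup:
    //   struct route_key k = { .prefixlen = 32, .addr = ip->daddr };
    //   __u32 *out_if = bpf_map_lookup_elem(&route_table, &k);

Userspace fills the same map through the bpf() syscall or bpftool, so route updates never restart the dataplane.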
Now, speaking purely of optimizations: yes, DPDK will always be better CPU-wise because you can compile it with -march=native, while eBPF is JIT-ed when available (and pretty poorly, having looked at the output). However, from experience, the parts that actually take time are map lookups (looking up the nexthop, looking up the MAC address, etc.), and those are written in C in the kernel, thus are as optimized as the kernel can be. Recompiling the kernel for your CPU can boost performance, but I've never done it myself.
Today, I would consider that unless you truly need the absolute best performance, XDP is more than fine. Modern CPUs are so fast that it's not worth considering DPDK for most cases.
- container routing like in the article? the DPDK runtime is a no-go, and the operational flexibility of XDP is a killer.
- network appliances like switches/routers? shell out a few extra bucks and buy a slightly better CPU. If latency is paramount, or you're doing per-packet processing that cannot fit in an eBPF probe, then go the DPDK route.
At a previous job, I rewrote for fun a simple internal DPDK routing application using XDP: I got only half the performance (in packets per second, not bits per second) on the same hardware with no optimizations whatsoever, in 100 lines of eBPF. Mind you, I could saturate a 100Gbps link with 100-byte packets instead of 64-byte ones, what a tragedy /s. On more modern hardware (latest EPYC), I trivially reached 200Mpps on an 8-core CPU using XDP.
Long story short, you'll know when you need DPDK.