One of the things that struck me when reading this with only general knowledge of the Linux kernel is: what makes things so terrible? Is iptables really that bad? Is something serialized to a single core somewhere in the other 3 scenarios? Is the CPU at 100% in all cases? Is this TCP or UDP traffic? How many threads is iperf using? It would be cool to see the CPU utilization of all 4 scenarios, along with CPU flamegraphs.
You’ll have to wait for the follow-up post with the CNI plugin for the fully reproducible benchmark, but on a 16-core EC2 instance with a 10Gbps connection, iptables couldn’t do more than 5Gbps of throughput (TCP!), whereas XDP was able to do 9.84Gbps on average.
Furthermore, running bidirectional iPerf3 tests on the larger hosts shows us that both ingress and egress throughput increase when we swap out iptables on just the egress path.
This is all to say, our current assumption is that when the CPU is thrashed by iPerf3, the RSS queues, the Linux kernel’s ksoftirqd threads, etc. all at once, performance collapses. XDP moves some of the work out of the kernel stack, and at the same time the packet is only processed through the stack half as much as without XDP (only on the path before or after the veth).
It really is all CPU usage in the end as far as I can tell. It’s not like our checksumming approach is any better than what the kernel already does.
In the non-XDP case (eBPF on TC) you have to allocate an sk_buff and initialize it. This is very expensive: there's tons of accounting in the struct itself, and components that track every sk_buff. Then there are the various CPU-bound routing layers.
Overall the network core of Linux is very efficient. The actual page pool buffer isn't copied until the user reads the data. But there are a million features the stack needs to support, and all of them cost efficiency.
On what's now almost 10-year-old hardware, we could drop 44Mpps of a volumetric DoS attack and still serve our nominal workload with no impact. See PFILCTL(8) and PFIL(9); focus on Ethernet (link layer) packets.
It relies on the same principle -- the NIC passes the RX buffer directly to the firewall (ipfw, pf, or ipfilter). If the firewall says the packet is OK, RX processing happens as normal. If it says to drop, then dropping is very fast because it can simply reuse the buffer without re-allocating, re-doing the DMA mapping, etc.
The beauty of XDP is that it's all eBPF. Completely customizable by injecting policy where it's needed and native to the kernel.
You'll bypass a memory copy (ringbuf -> kernel memory), allocations (skb), parsing (IPs & such), firewalling, checking if the packet is local, checksum validation; the list goes on...
The following diagram helps show all the things that happen: https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilte...
(yes, XDP is the leftmost step, literally right after "card DMA'd packet into memory")
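To make that concrete, here's a minimal sketch of an XDP program, assuming a clang/libbpf toolchain (the program name and port are made up for illustration). Anything it drops never gets an sk_buff allocated for it:

    // Minimal XDP sketch: drop UDP packets to an arbitrary port, pass the rest.
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int drop_udp_9999(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        // Bounds checks are mandatory: the verifier rejects any access
        // that could read past data_end.
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS; // ARP etc. stay on the normal kernel path

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
            return XDP_PASS;

        struct udphdr *udp = (void *)(ip + 1); // assumes no IP options, for brevity
        if ((void *)(udp + 1) > data_end)
            return XDP_PASS;

        if (udp->dest == bpf_htons(9999))
            return XDP_DROP; // RX buffer is recycled; no skb was ever allocated

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

XDP_PASS hands the packet to the regular stack, so ARP, ICMP, local delivery and friends keep working untouched.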
You can also use XDP for outgoing packets on tap interfaces.
In some scenarios veth is being replaced with netkit for a similar reason. Does this impact how you're going to manage this?
Honestly, the real news is that they're doing it in production, not that they found anything unique.
Heck, all the XDP development I've ever done was against a veth interface on my laptop, to run later on server metal.
Maybe more importantly: they're not building a middlebox. DPDK's ultra-high performance comes in part from polling: it's always running. XDP is just an extension to the existing network driver.
Why doesn’t checksum offload in the NIC take care of that?
I'll definitely be coming to check you all out at Kubecon.
* The BPF verifier's DX is not great yet. If it finds problems with your BPF code it will spit out a rather inscrutable set of error messages that often requires a good understanding of the verifier internals (e.g. the register nomenclature) to debug
* For the same source code, the code generated by the compiler can change across compiler versions in a way that breaks verification, e.g. because the new compiler version implemented an optimization the verifier can't follow (see https://github.com/iovisor/bcc/issues/4612)
* Checksum updating requires extra care. I believe you can only do incremental updates, not just for the better perf the post suggests but also because the verifier does not allow BPF programs to operate on unbounded buffers (so checksumming a whole packet of unknown size is tricky / cumbersome). This mostly works, but you have to be careful with packets that were generated with csum offload and so don't yet have a valid checksum, whose csum therefore can't be incrementally updated. (A sketch of the incremental fold follows after this list.)
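For reference, the incremental update is just the one's-complement fold from RFC 1624; here's a minimal sketch in plain C (the function name is mine; in an actual XDP program you'd do the same arithmetic by hand or lean on bpf_csum_diff()):

    #include <stdint.h>

    // RFC 1624, eqn. 3: HC' = ~(~HC + ~m + m')
    // Recompute a 16-bit one's-complement checksum after a single 16-bit
    // field covered by it changes from `old` to `new`.
    static uint16_t csum_update16(uint16_t csum, uint16_t old, uint16_t new)
    {
        uint32_t sum = (uint16_t)~csum;
        sum += (uint16_t)~old;
        sum += new;
        // fold the carries back into 16 bits
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

Since this only touches the fields you actually rewrote, the verifier never has to reason about a variable-length buffer.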
As the blog post points out, the kernel networking stack does a lot of work that we don't generally think about. Once you start taking things into your own hands you don't have the luxury of ignorance anymore (think not just ARP but also MTU, routing, RP filtering etc.), something any user of userspace networking frameworks like DPDK will tell you.
My general recommendation is to stick with the kernel unless you have a very good justification for chasing better performance. If you do use eBPF, save yourself some trouble and try to limit yourself to read-only operations, if your use case allows.
Also, if you are trying to debug packet drops, newer kernels record a reason for every dropped skb (kfree_skb_reason()), which you can track using bpftrace for better diagnostics.
Example script (might have to adjust based on kernel version):

    bpftrace -e '
    kprobe:kfree_skb_reason {
      // arg0 = the sk_buff being freed, arg1 = enum skb_drop_reason
      $skb = (struct sk_buff *)arg0;
      $ipheader = (struct iphdr *)($skb->head + $skb->network_header);
      printf("reason: %d %s -> %s\n", arg1, ntop($ipheader->saddr), ntop($ipheader->daddr));
    }'
loopholelabs•1d ago
This post not only expands on the overall implementation but also outlines how existing container and VM workloads can immediately take advantage of it with minimal effort and zero infrastructure changes.
toprerules•1h ago
You can't compare the efficiency of the frameworks without talking about the specific setups on the host. The major advantage of XDP is that it is completely baked into the kernel. All you need to do is bring your eBPF program and attach it, as the sketch below shows. DPDK requires a great deal of setup and user space libraries to work.
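To illustrate how little that is, the whole XDP "setup" can be a handful of libbpf calls, or a single `ip link set dev eth0 xdp obj prog.o` command. A hedged sketch (the object, program, and interface names are assumptions):

    #include <bpf/libbpf.h>
    #include <linux/if_link.h>
    #include <net/if.h>
    #include <stdio.h>

    int main(void)
    {
        // Load the compiled eBPF object...
        struct bpf_object *obj = bpf_object__open_file("prog.o", NULL);
        if (!obj || bpf_object__load(obj))
            return 1;

        struct bpf_program *prog =
            bpf_object__find_program_by_name(obj, "xdp_prog");
        int ifindex = if_nametoindex("eth0");
        if (!prog || !ifindex)
            return 1;

        // ...and attach it; the NIC driver takes it from here.
        if (bpf_xdp_attach(ifindex, bpf_program__fd(prog),
                           XDP_FLAGS_UPDATE_IF_NOEXIST, NULL))
            return 1;

        printf("attached\n");
        return 0;
    }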
tuetuopay•23m ago
DPDK has a few fundamental drawbacks:
1. to get the absolute best performance, you're running in poll mode, and burning CPU cores just for packet processing
2. the network interface is invisible to the kernel, making non-accelerated traffic on said interface tricky (say, letting the kernel perform ARP resolution for you).
3. your dataplane is now a long-lived process, which means that stopping said process equates to no more network (hello restarts!)
Alleviating most of those takes a lot of effort or some tradeoffs making it less worth it:
1. can be mitigated by adaptive polling at the cost of latency.
2. by using either software bifurcation, re-injecting non-accelerated traffic into a tap, or hardware bifurcation on NICs that support it (e.g. ConnectX), installing the flows in the NIC's flow engine. Both are quite time-consuming to get right
3. by manually writing a handoff system between new and old processes, and making sure it never crashes
DPDK also needs its own runtime, with its own libraries. Some stuff will be manual (e.g. giving it routing tables). XDP gives all of those for free:
1. All modern NIC drivers already perform adaptive polling and interrupt moderation, so you're not burning CPU cycles polling the card outside of high packet rate scenarios (where you'd burn CPU on IRQs and context switches anyway).
2. It's just an extra bit of software in the driver's path, and the XDP program decides whether to handle a packet itself or pass it down to the kernel. Pretty useful for keeping ARP, ICMP, BGP, etc. working without extra code.
3. XDP is closer to a lambda than anything: the code runs once for every single packet, meaning its runtime is extremely short. This also means that the long-running process is your kernel, and that updating the code is an atomic operation done on the fly.
4. A lot of facilities are already provided, and the biggest of them is maps (see the sketch after this list). The kernel handles all the stateful plumbing needed to feed data (routing tables, ARP tables, etc.) to your dataplane code. CPU affinity is also handled by the kernel, in the sense that XDP runs on the CPU responsible for the NIC queue, whose mapping is controlled through standard kernel interfaces unrelated to XDP (meaning: not on your mind).
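To illustrate point 4: declaring a map is all it takes to get a kernel-managed table that userspace can update while the dataplane runs. A sketch with made-up names (route_table, struct route_key), assuming an IPv4 routing use case:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct route_key {
        __u32 prefixlen; // mandatory first field for LPM trie keys
        __u32 addr;      // IPv4 destination, network byte order
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(max_entries, 65536);
        __type(key, struct route_key);
        __type(value, __u32);                 // e.g. egress ifindex
        __uint(map_flags, BPF_F_NO_PREALLOC); // required for LPM tries
    } route_table SEC(".maps");

    // In the XDP program, longest-prefix match is a single lookup:
    //   struct route_key k = { .prefixlen = 32, .addr = ip->daddr };
    //   __u32 *out_if = bpf_map_lookup_elem(&route_table, &k);

Userspace fills the same map through the bpf() syscall or bpftool, so route updates never restart the dataplane.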
Now, speaking purely of optimizations: yes, DPDK will always be better CPU-wise because you can compile it with -march=native, while eBPF is JIT-ed when available (and pretty poorly, having looked at the output). However, from experience, the parts that actually take time are map lookups (looking up the nexthop, looking up the MAC address, etc.), and those are written in C in the kernel, thus are as optimized as the kernel can be. Recompiling the kernel for your CPU can boost performance, but I've never done it myself.
Today, I would consider that unless you truly need the absolute best performance, XDP is more than fine. Modern CPUs are so fast that it's not worth considering DPDK for most cases.
- container routing like in the article? the DPDK runtime is a no-go, and the operational flexibility of XDP is a killer.
- network appliances like switches/routers? shell out a few extra bucks and buy a slightly better CPU. If latency is paramount, or you're doing per-packet processing that cannot fit in an eBPF probe, then go the DPDK route.
At a previous job, I rewrote for fun a simple internal DPDK routing application using XDP: I got only half the performance (in packets per second, not bits per second) on the same hardware with no optimizations whatsoever, in 100 lines of eBPF. Mind you, I could saturate a 100Gbps link with 100-byte packets instead of 64-byte ones, what a tragedy /s. On more modern hardware (latest EPYC), I trivially reached 200Mpps on an 8-core CPU using XDP.
Long story short, you'll know when you need DPDK.