I was expecting to see io_uring mentioned somewhere in the linux section of the article.
Do you mean zero-copy rx by ZCRX? If so, io_uring should support that, but you need a sufficiently recent kernel.
It supports both zero-copy rx/tx on tcp and udp afaik, and there was a paper from the google engineers that implemented zero-copy send/recv (originally for tcp only) showing that the whole endeavor was worth it when the payload was >= 20kb in size (iirc).
It could be interesting to see if that's still the case with current state of linux, memory sticks, etc.
I've looked up the links from when i dug into that:
- https://www.kernel.org/doc/html/latest/networking/msg_zeroco...
- the paper: https://netdevconf.org/2.1/papers/debruijn-msgzerocopy-talk....
- presentation about the paper: https://netdevconf.org/2.1/session.html?debruijn
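For reference, this is roughly what the MSG_ZEROCOPY opt-in described in those links looks like in practice. A minimal sketch, assuming a Linux 4.14+ TCP socket (5.0+ for UDP); fd/buf/len are placeholders and error handling is reduced to early returns:

    #include <string.h>
    #include <sys/socket.h>

    #ifndef SO_ZEROCOPY
    #define SO_ZEROCOPY 60          /* present in kernel headers since 4.14 */
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000
    #endif

    static int send_zerocopy(int fd, const void *buf, size_t len)
    {
        int one = 1;

        /* Opt in once per socket. */
        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
            return -1;

        /* Pages are pinned instead of copied; the buffer must stay
         * untouched until the completion notification arrives. */
        if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
            return -1;

        /* Completions come back on the socket error queue; real code would
         * poll for them and batch many sends per notification. */
        char control[128];
        struct msghdr msg = { .msg_control = control, .msg_controllen = sizeof(control) };
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
            return -1;

        return 0;
    }

The buffer has to stay stable until that error-queue notification arrives, which is part of why the paper only saw a clear win above roughly 20 KB payloads.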
> In extreme cases, on purely CPU bound benchmarks, we’re seeing a jump from < 1Gbit/s to 4 Gbit/s. Looking at CPU flamegraphs, the majority of CPU time is now spent in I/O system calls and cryptography code.
400% increase in throughput, which should translate to a proportionate reduction in CPU utilization for UDP network activity. That's pretty cool, especially for better power efficiency on portable clients (mobile and notebook).
I found this presentation refreshing. Too often, claims about transition to "modern" stacks are treated as being inherently good and do not come with the data to back it up.
https://github.com/snabbco/snabb/blob/master/src/apps/intel/...
They suggest thinking of busybox, but if you use busybox their Makefile will fail. Using toybox instead will work.
4 Gbit/s is on our rather dated benchmark machines. If you run the below command on a modern laptop, you likely reach higher throughput. (Consider disabling PMTUD to use a realistic Internet-like MTU. We do the same on our benchmark machines.)
https://github.com/mozilla/neqo
cargo bench --features bench --bench main -- "Download"
Pretty sure GSO/GRO aren't the only buggy parts either.
The reason that Firefox -- and other major browsers -- make self-signed certs so difficult to use is that allowing users to override certificate checks weakens the security of HTTPS, which otherwise relies on certificates being verifiable against the trust anchor list. It's true that this makes certain cases harder, but the judgement of the browser community was that that wasn't worth the security tradeoff. In other words, it's a policy decision, not a technical one.
Security is still a lot better because the root is communicated out of band.
However, it's not an entirely trivial problem to get right, especially because of how deeply the scheme is tied into the Web security model. Your example here is a good one of what I'm talking about:
> At the moment if you just use plain HTTP then things do mostly work (apart from some APIs which are somewhat arbitrarily locked to 'secure contexts' which means very little about the trustworthiness of the code that does or does not have access to those APIs),
You're right that being served over HTTPS doesn't make the site trustworthy, but what it does do is provide integrity for the identity of the server. So, for instance, the user might look at the URL and decide that the server is trustworthy and can be allowed to use the camera or microphone. However, if you use HTTPS but without verifying the certificate, then an attacker might in the future substitute themselves and take advantage of that camera and microphone access. Another example is when the user enters their password.
Rather than saying that browser vendors don't think this is worth solving in the abstract I would say that it's not very high on the priority list, especially because most of the ideas people have proposed don't work very well.
IMHO they should gradually lock all dynamic code execution, such as dynamic CSS and JavaScript, behind an explicit toggle for insecure http sites.
> It massively undermines the security of connections to local devices
No, you see the prompt, it is insecure. If the network admin wants it secure, that means either an internal CA or a literally free cert from Let's Encrypt. As the network admin did not care, it's insecure.
"but I have legacy garbage with hardcoded self-signed certs" then reverse proxy that legacy garbage with Caddy?
(An option to get some authentication, and one that I think Chrome has kind of started to figure out, is to allow a PWA to connect to a local device and authenticate with its own keys. This still means you need to connect to the internet once with the client device, but at least from that point onwards it can work without internet. But then you need a whole other flow so that random sites can't just connect to your local devices...)
You can't just have some random router, printer, NAS, etc. generate its own cert out of thin air and tell the browser to ignore the fact that it can't be verified.
IMO this is a good thing. The way browsers handle HTTPS on older protocols is a result of the number of legacy badly configured systems there are out there which browser vendors don't want to break. Anywhere someone's supporting HTTP/3 they're doing something new, so enforcing a "do it right or don't do it at all" policy is possible.
HTTP/3 just removes the space for misunderstanding.
Just like self-signed certs worked for 20 years until the megacorps decided to break people's browsers because only their for-profit use cases matter. You might not remember, but random self-signed certs worked for a long, long time. I use them. And their purpose is as a speed bump against massive passive surveillance, something that still works. TOFU works. ID isn't actually needed for most personal use cases on the web. That's a corporate thing. HTTP + HTTPS (self-signed) is the perfect combo for human-person use cases. And much more robust than HTTPS-only, which will break within a year or two if left unwatched by human eyes.
The misunderstanding Chrome and its followers (like Firefox) removed was that they were for anything except corporate use cases.
As the author cited, a kernel context switch is only on the order of 1 us (which seems too high for a system call anyway). You can reach 500 MB/s even if you still call sendmsg() on literally every packet, as long as you average ~500 bytes/packet, which is ~1/3 of the standard 1500-byte MTU. So if you average MTU-sized packets, you get 2 us of processing in addition to a full system call to reach 4 Gb/s.
The old number of 1 Gb/s could be reached with an average of ~125 bytes/packet, ~1/12 of the MTU, or with ~11 us of processing per MTU-sized packet.
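To make the arithmetic explicit, a throwaway program that just restates those budgets (my own sketch, not from the article):

    #include <stdio.h>

    int main(void)
    {
        const double syscall_ns = 1000.0;   /* ~1 us per sendmsg(), as assumed above */
        const double mtu_bytes  = 1500.0;   /* standard Ethernet MTU */
        const double targets_gbps[] = { 1.0, 4.0 };

        for (int i = 0; i < 2; i++) {
            /* Gb/s is bits per nanosecond, so the time per MTU-sized packet is: */
            double per_packet_ns = mtu_bytes * 8.0 / targets_gbps[i];
            printf("%.0f Gb/s: %5.0f ns per packet, %5.0f ns left after the syscall\n",
                   targets_gbps[i], per_packet_ns, per_packet_ns - syscall_ns);
        }
        return 0;   /* prints 12000/11000 ns for 1 Gb/s and 3000/2000 ns for 4 Gb/s */
    }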
“But there are also memory copies in the network stack.” A trivial 3-instruction memory copy will go ~10-20 GB/s, i.e. 80-160 Gb/s. In 2 us you can drive 20-40 KB of copies. You are arguing the network stack does 40-80(!) copies to put a UDP datagram, a thin veneer over a literal packet, into a packet buffer. I have written commercial network drivers. Even without zero-copy, with direct access you can shovel UDP packets into the NIC buffers at basically memory-copy speeds.
“But encryption is slow.” Not that slow. Here are some AES-128-GCM performance numbers from what looks like over 5 years ago. [1] The Intel i5-6500, a midline processor from 8 years ago, averages 1729 MB/s. It can do the encryption for a 500-byte packet in ~300 ns, 1/6 of the remaining 2 us budget. Modern processors seem to be closer to 3-5 GB/s per core, or about 25-40 Gb/s, 6-10x the stated UDP throughput.
“It is slow because it is being layered over QUIC.” Then why did you layer over a bottleneck that slows you down by 25x? Second of all, they did not use to do that, and they still only got 1 Gb/s previously, which is abysmal.
Third of all, you can achieve QUIC feature parity (minus encryption which will be your per-core bottleneck) at 50-100 Gb/s per core, so even that is just a function of using a slow protocol.
Finally, the CPU class used in benchmarking is largely irrelevant because I am discussing 20x per-core performance bottlenecks. You would need to be benchmarking on a desktop CPU from 25 years ago to get that degree of single-core performance difference. We are talking iPhone 6 territory, a decade-old phone, for an efficient implementation to bottleneck on the processor at just 4 Gb/s.
But again, it is probably not a problem with their code. It is likely something else stupid happening on the network stack or protocol side of which they are merely a client.
Spectre & Meltdown.
> you get 2 us of processing in addition to a full system call to reach 4 Gb/s
TCP has route binding, UDP does not (connect(2) helps one side, but not both sides).
> “But encryption is slow.” Not that slow.
Encryption _is slow_ for small PDUs, at least the common constructions we're currently using. Everyone's essentially been optimizing for and benchmarking TCP with large frames.
If you hot-loop the state as the micro-benchmarks do, you can do better, but you still see a very visible cost of state setup that only starts to amortize decently well above 1024-byte payloads. Eradicate a bunch of cache efficiency by removing the tightness of the loop and this amortization boundary shifts quite far to the right, up into tens of kilobytes.
---
All of the above, plus the additional framing overheads, come into play. Hell, even the OOB data blocks are quite expensive to actually validate; it's not a good API to fix this problem, it's just the API we have, shoved over bsd sockets.
And we haven't even gotten to buffer constraints and contention yet, but the default UDP buffer memory available on most systems is woefully inadequate for these use cases today. TCP buffers were scaled up over time, but UDP buffers basically never were; they're still conservative values from the late 90s/00s, really.
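For illustration, the per-socket side of that looks roughly like this (a sketch; the 16 MB figure is an arbitrary example, and the request is silently clamped by the net.core.rmem_max / net.core.wmem_max sysctls, which is where the conservative defaults live):

    #include <stdio.h>
    #include <sys/socket.h>

    /* Ask for bigger per-socket UDP buffers. The kernel clamps the request
     * to net.core.rmem_max / net.core.wmem_max, which on many distros are
     * still a few hundred KB. */
    static void grow_udp_buffers(int fd)
    {
        int size = 16 * 1024 * 1024;
        socklen_t len = sizeof(size);

        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
            perror("SO_RCVBUF");
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
            perror("SO_SNDBUF");

        /* Read back what we actually got (the kernel reports double the value
         * to account for its own bookkeeping overhead). */
        if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) == 0)
            printf("effective receive buffer: %d bytes\n", size);
    }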
The API we really need for this kind of UDP setup is one where you can do something like fork the fd, connect(2) it with a full route bind, and then fix the RSS/XSS challenges that come from this splitting. After that we need a submission queue API rather than another bsd sockets ioctl style mess (uring, rio, etc). Sadly none of this is portable.
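As a rough idea of what a submission-queue style send path looks like on Linux today, a liburing sketch (my own illustration, not portable, and it omits the batching and buffer-registration tricks a real implementation would want):

    #include <liburing.h>
    #include <sys/socket.h>

    /* Queue a batch of already-prepared sendmsg()s and push them to the
     * kernel with a single submit; msgs/count are placeholders. */
    static int submit_batch(int fd, struct msghdr *msgs, unsigned count)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;

        if (io_uring_queue_init(256, &ring, 0) < 0)
            return -1;

        for (unsigned i = 0; i < count; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            if (!sqe)
                break;                           /* submission queue full */
            io_uring_prep_sendmsg(sqe, fd, &msgs[i], 0);
        }

        int submitted = io_uring_submit(&ring);  /* one kernel transition for the whole batch */

        for (int i = 0; i < submitted; i++) {    /* reap completions; real code overlaps this */
            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                break;
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return submitted;
    }

Newer kernels add zero-copy send variants on top of the same ring, which ties back to the ZCRX discussion further up the thread.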
On the crypto side there are KDF approaches which can remove a lot of the state cost involved. It's not popular, but some vendors are very taken with PSP for this reason - though PSP becoming better known or used was largely suppressed by its various rejections in the ietf and in linux. Vendors doing scale tests with it have clear numbers though: under high concurrency you can scale this much better than the common tls or tls-like constructions.
You are basically saying: “It is slow because of all these system/protocol decisions that mismatch what you need to get high performance out of the primitives.”
Which is my point. They are leaving, by my estimation, 10-20x performance on the floor due to external factors. They might be “fast given that they are bottlenecked by low performance systems”, which is good as their piece is not the bottleneck, but they are not objectively “fast” as the primitives can be configured to solve a substantially similar problem dramatically faster if integrated correctly.
sure, i mean i have no goal of alignment or misalignment, i'm just trying to provide more insights into what's going on based on my observations of this from having also worked on this udp path.
> Which is my point. They are leaving, by my estimation, 10-20x performance on the floor due to external factors. They might be “fast given that they are bottlenecked by low performance systems”, which is good as their piece is not the bottleneck, but they are not objectively “fast” as the primitives can be configured to solve a substantially similar problem dramatically faster if integrated correctly.
yes, though this basically means we're talking about throwing out chunks of the os, the crypto design, the protocol, and a whole lot of tuning at each layer.
the only vendor in a good position to do this is apple (being the only vendor that owns every involved layer in a single product chain), and they're failing to do so as well.
the alternative is a long old road, where folks make articles like this from time to time, we share our experiences and hope that someone is inspired enough reading it to be sniped into making incremental progress. it'd be truly fantastic if we sniped a group with the vigor and drive that the mptcp folks seem to have, as they've managed to do an unusually broad and deep push across a similar set of layered challenges (though still in progress).
I just measured. On my Ryzen 7 9700X, with Linux 6.12, it's about 50ns to call syscall(__NR_gettimeofday). Even post-spectre, entering the kernel isn't so expensive.
Using the libc wrapper will use the vdso. Using syscall() will enter the kernel.
I haven't measured, but calling the vdso should be closer to 5ns.
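Roughly the shape of the measurement, as a quick-and-dirty sketch (my own reconstruction, not the exact code; pin the core and watch out for frequency scaling if you want stable numbers):

    #define _GNU_SOURCE            /* for syscall() */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <time.h>
    #include <unistd.h>

    /* Time N calls through the raw syscall vs. the libc/vDSO path. */
    static double ns_per_call(int raw)
    {
        enum { N = 10 * 1000 * 1000 };
        struct timeval tv;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            if (raw)
                syscall(SYS_gettimeofday, &tv, NULL);  /* forces a real kernel entry */
            else
                gettimeofday(&tv, NULL);               /* normally served by the vDSO */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N;
    }

    int main(void)
    {
        printf("raw syscall: %.1f ns/call\n", ns_per_call(1));
        printf("vDSO path:   %.1f ns/call\n", ns_per_call(0));
        return 0;
    }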
Someone else did more detailed measurements here:
https://arkanis.de/weblog/2017-01-05-measurements-of-system-...
Also: register renaming is a thing, as is write combining and pipelining. You're not flushing to L1 synchronously for every register, or ordinary userspace function calls would regularly take hundreds of cycles for handling saved registers. They don't.
It's easy to measure how long it takes to push and pop all registers, as well as writing a moderate number of entries to the stack. It's very cheap.
As far as switching into the kernel -- the syscall instruction is more or less just setting a few permission bits and acting as a speculation barrier; there's no reason for that to be expensive. I don't have information on the cost in isolation, but it's entirely unsurprising to me that the majority of the cost is in shuffling around registers. (The post-spectre TLB flush has a cost, but ASIDs mitigate the cost, and measuring the time spent entering and exiting the kernel wouldn't show it even if ASIDs weren't in use)
What do you say about the measurements from https://gms.tf/on-the-costs-of-syscalls.html? The table suggests that the cost is an order of magnitude larger, depending on the host CPU, from 250 to 620 ns.
As far as that article, it's interesting that the numbers vary between 76 and 560 ns; the benchmark itself has an order of magnitude variation. It also doesn't say what syscall is being done -- __NR_clock_gettime is very cheap, but, for example, __NR_sched_yield will be relatively expensive.
That makes me suspect something else is up in that benchmark.
For what it's worth, here's some more evidence that touching the stack with easily pipelined/parallelized MOV is very cheap. 100 million calls to this assembly cost 200 ms, or about 2 ns/call:
f:
.LFB6:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $8, %rsp
movq $42, -128(%rbp)
movq $42, -120(%rbp)
movq $42, -112(%rbp)
movq $42, -104(%rbp)
movq $42, -96(%rbp)
movq $42, -88(%rbp)
movq $42, -80(%rbp)
movq $42, -72(%rbp)
movq $42, -64(%rbp)
movq $42, -56(%rbp)
movq $42, -48(%rbp)
movq $42, -40(%rbp)
movq $42, -32(%rbp)
movq $42, -24(%rbp)
movq $42, -16(%rbp)
movq $42, -8(%rbp)
nop
leave
.cfi_def_cfa 7, 8
ret

An interesting thing that reinforces their hypothesis and measurements is the fact that, for example, getpid and clock_gettime_mono_raw on some platforms run much faster (vDSO) than on the rest.
Also, the variance between different CPUs is what IMO is reinforcing their results and not the other way around - I don't expect the same call to have the same cost on different CPU models. Different CPUs, different cores, different clock frequencies, different tradeoffs in design, etc.
The code is here: https://github.com/gsauthof/osjitter/blob/master/bench_sysca...
The syscall() row invokes a simple syscall(423) and it seems to be expensive. Other calls such as close(999), getpid(), getuid(), clock_gettime(CLOCK_MONOTONIC_RAW, &ts), and sched_yield() produce similar results. All of them are basically an order of magnitude larger than 50 ns.
As for register renaming, I know what it is, but I still don't get what register renaming has to do with making the storage of state (registers) a cheaper operation.
This is from Intel manual:
Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible).
So, I wrongly assumed that the core has to wait until the data is completely written, but it seems it acts more like a memory barrier with relaxed properties - instructions are serialized, but the data written doesn't have to become globally visible.

I think the most important aspect of it is "until all instructions prior to the SYSCALL have completed". This means that the whole pipeline has to be drained. With a 20+ deep instruction pipeline, and whatever instructions happen to be in it, I can imagine that this can easily become the most expensive part of the syscall.
AMD Ryzen 7 9700X Desktop:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 38.6 ns 38.5 ns 18160546
bench_getpid 39.9 ns 39.9 ns 17703749
bench_close 45.2 ns 45.1 ns 15711379
bench_syscall 42.2 ns 42.1 ns 16638675
bench_sched_yield 81.7 ns 81.6 ns 8623522
bench_clock_gettime 15.9 ns 15.9 ns 44010857
bench_clock_gettime_tai 15.9 ns 15.9 ns 43997256
bench_clock_gettime_monotonic 15.9 ns 15.9 ns 44012908
bench_clock_gettime_monotonic_raw 15.9 ns 15.9 ns 43982277
bench_nanosleep0 49961 ns 370 ns 100000
bench_nanosleep0_slack1 10839 ns 351 ns 1000000
bench_nanosleep1_slack1 10878 ns 358 ns 1000000
bench_pthread_cond_signal 1.37 ns 1.37 ns 503715097
bench_assign 0.563 ns 0.562 ns 1000000000
bench_sqrt 1.63 ns 1.63 ns 430096636
bench_sqrtrec 5.33 ns 5.33 ns 132574542
bench_nothing 0.394 ns 0.394 ns 1000000000
12th Gen Intel(R) Core(TM) i5-12600H
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 70.0 ns 70.0 ns 9985369
bench_getpid 71.6 ns 71.6 ns 9763016
bench_close 76.7 ns 76.7 ns 9131090
bench_syscall 66.8 ns 66.8 ns 10533946
bench_sched_yield 160 ns 160 ns 4377987
bench_clock_gettime 12.2 ns 12.2 ns 57432496
bench_clock_gettime_tai 12.1 ns 12.1 ns 57826299
bench_clock_gettime_monotonic 12.2 ns 12.2 ns 57736141
bench_clock_gettime_monotonic_raw 12.3 ns 12.3 ns 57070425
bench_nanosleep0 63154 ns 11834 ns 55756
bench_nanosleep0_slack1 2933 ns 1700 ns 348675
bench_nanosleep1_slack1 2654 ns 1479 ns 467420
bench_pthread_cond_signal 1.39 ns 1.39 ns 483995101
bench_assign 0.868 ns 0.868 ns 821103909
bench_sqrt 1.69 ns 1.69 ns 422094139
bench_sqrtrec 4.06 ns 4.06 ns 174511095
bench_nothing 0.750 ns 0.750 ns 941204159
AMD Ryzen 5 PRO 7545U Laptop:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 106 ns 106 ns 6581746
bench_getpid 111 ns 111 ns 6271878
bench_close 116 ns 116 ns 5944154
bench_syscall 85.9 ns 85.9 ns 7317584
bench_sched_yield 315 ns 315 ns 2249333
bench_clock_gettime 17.6 ns 17.6 ns 39935693
bench_clock_gettime_tai 17.6 ns 17.6 ns 39920957
bench_clock_gettime_monotonic 17.5 ns 17.5 ns 39962966
bench_clock_gettime_monotonic_raw 17.5 ns 17.5 ns 39561163
bench_nanosleep0 52720 ns 3058 ns 100000
bench_nanosleep0_slack1 13815 ns 2969 ns 244790
bench_nanosleep1_slack1 13710 ns 2722 ns 254666
bench_pthread_cond_signal 2.66 ns 2.66 ns 264735233
bench_assign 0.930 ns 0.930 ns 813279743
bench_sqrt 2.43 ns 2.43 ns 286953468
bench_sqrtrec 5.67 ns 5.67 ns 123889652
bench_nothing 0.812 ns 0.812 ns 860562208
So, I've tested multiple times in multiple ways, and the results don't seem to match.

----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 384 ns 384 ns 1822307
bench_getpid 382 ns 382 ns 1835289
bench_close 390 ns 390 ns 1796493
bench_syscall 374 ns 374 ns 1874165
bench_sched_yield 611 ns 611 ns 1143456
bench_clock_gettime 44.1 ns 44.1 ns 15872740
bench_clock_gettime_tai 44.1 ns 44.1 ns 15879915
bench_clock_gettime_monotonic 44.1 ns 44.1 ns 15887383
bench_clock_gettime_monotonic_raw 44.4 ns 44.4 ns 15755225
bench_nanosleep0 55617 ns 4647 ns 100000
bench_nanosleep0_slack1 7144 ns 4362 ns 160448
bench_nanosleep1_slack1 7159 ns 4369 ns 160645
bench_pthread_cond_signal 7.38 ns 7.38 ns 94670062
bench_assign 0.523 ns 0.523 ns 1000000000
bench_sqrt 8.04 ns 8.04 ns 86998912
bench_sqrtrec 11.4 ns 11.4 ns 61428535
bench_nothing 0.000 ns 0.000 ns 1000000000
EDIT: also reproducible on my skylake-x (Gold 6152) machine.

With turbo-boost @3.7GHz enabled:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 619 ns 616 ns 1153007
bench_getpid 632 ns 627 ns 1150829
bench_close 629 ns 626 ns 1110226
bench_syscall 617 ns 613 ns 1160239
bench_sched_yield 974 ns 969 ns 702773
bench_clock_gettime 17.9 ns 17.8 ns 39368735
bench_clock_gettime_tai 17.8 ns 17.7 ns 39109544
bench_clock_gettime_monotonic 17.9 ns 17.8 ns 39591364
bench_clock_gettime_monotonic_raw 19.0 ns 18.8 ns 38902038
bench_nanosleep0 63993 ns 4381 ns 100000
bench_nanosleep0_slack1 7445 ns 2115 ns 328474
bench_nanosleep1_slack1 7346 ns 2111 ns 334833
bench_pthread_cond_signal 2.13 ns 2.12 ns 327903411
bench_assign 0.167 ns 0.166 ns 1000000000
bench_sqrt 1.87 ns 1.85 ns 374885774
bench_sqrtrec 0.000 ns 0.000 ns 1000000000
bench_nothing 0.000 ns 0.000 ns 1000000000
With turbo-boost disabled (@2.1GHz base frequency):

----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 1019 ns 1012 ns 688965
bench_getpid 1057 ns 1048 ns 688020
bench_close 1039 ns 1029 ns 684537
bench_syscall 1010 ns 1003 ns 696919
bench_sched_yield 1653 ns 1642 ns 434212
bench_clock_gettime 30.7 ns 30.4 ns 22999055
bench_clock_gettime_tai 30.5 ns 30.2 ns 23716873
bench_clock_gettime_monotonic 29.8 ns 29.6 ns 23643198
bench_clock_gettime_monotonic_raw 30.5 ns 30.3 ns 23277717
bench_nanosleep0 65256 ns 5114 ns 100000
bench_nanosleep0_slack1 11649 ns 3402 ns 197983
bench_nanosleep1_slack1 11572 ns 3528 ns 209371
bench_pthread_cond_signal 3.62 ns 3.60 ns 195696177
bench_assign 0.255 ns 0.253 ns 1000000000
bench_sqrt 3.13 ns 3.10 ns 225561559
bench_sqrtrec 0.000 ns 0.000 ns 1000000000
bench_nothing 0.000 ns 0.000 ns 1000000000
I wonder why your results are so different. Mine scale almost linearly with the core frequency.

Why are your calls to sqrt so slow on your newest machine? Why is sqrtrec free on the others?
As for the sqrt, I don't think it is unusually slow if we compare it against the results from the table above - it's definitely not an outlier, since the recorded range is from 1 ns to 15 ns and I recorded a value of 8 ns. Why that is so is not the question here.
A better question is why your results are such a big outlier.
https://arkanis.de/weblog/2017-01-05-measurements-of-system-...
Google also reported similar numbers in 2011, when publicizing their fiber work.
I can also get similar numbers (~68ns) on 9front, though a little higher.
Since you have been laser-focused on the "bad" sqrt performance and the obvious optimization with sqrtrec, while deciding to ignore the rest of the results, maybe you can explain why there is such a large difference in your measurements between seemingly very similar platforms in terms of compute. After all, this is a pure compute problem.
For example, why does 4.9GHz CPU (AMD Ryzen™ 5 7545U) yield 2x to 4x worse results than 5.5GHz CPU (AMD Ryzen™ 7 9700X)?
AMD Ryzen 7 9700X Desktop:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 38.6 ns 38.5 ns 18160546
bench_getpid 39.9 ns 39.9 ns 17703749
bench_close 45.2 ns 45.1 ns 15711379
bench_syscall 42.2 ns 42.1 ns 16638675
bench_sched_yield 81.7 ns 81.6 ns 8623522
AMD Ryzen 5 PRO 7545U Laptop:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 106 ns 106 ns 6581746
bench_getpid 111 ns 111 ns 6271878
bench_close 116 ns 116 ns 5944154
bench_syscall 85.9 ns 85.9 ns 7317584
bench_sched_yield 315 ns 315 ns 2249333

Edit: And, apparently, because, regardless of what I do with `cpupower` and twiddling the governors, the CPU frequency on this machine is getting scaled. I've run out of time to debug that; I'll update later.
https://www.cpubenchmark.net/compare/6205vs6367vs4835/AMD-Ry...
I'm not sure what's up with sched_yield.
I can also replicate these numbers with `perf bench syscall basic`.
I reran the experiment in a VM, on a company's Xeon server clocked @2.2GHz, and results are again pretty much the same as before:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 778 ns 778 ns 901999
bench_getpid 774 ns 774 ns 902699
bench_close 779 ns 779 ns 896939
bench_syscall 761 ns 761 ns 916941
bench_sched_yield 1121 ns 1121 ns 566012
bench_clock_gettime 22.1 ns 22.1 ns 31579512
bench_clock_gettime_tai 22.0 ns 22.0 ns 31502402
bench_clock_gettime_monotonic 22.1 ns 22.1 ns 31848177
bench_clock_gettime_monotonic_raw 22.4 ns 22.4 ns 30953415
bench_nanosleep0 57424 ns 6967 ns 98218
bench_nanosleep0_slack1 6342 ns 6340 ns 110862
bench_nanosleep1_slack1 6310 ns 6308 ns 111064
bench_pthread_cond_signal 3.23 ns 3.23 ns 216726274
bench_assign 0.323 ns 0.323 ns 1000000000
bench_sqrt 2.64 ns 2.64 ns 265275643
bench_sqrtrec 4.40 ns 4.40 ns 160328959
bench_nothing 0.000 ns 0.000 ns 1000000000

Awesome, so you sponsored them right?
Why bother sponsoring any open source projects when they can throw a few extra million into their CEO's salary, while that CEO is running their flagship product (Firefox) into the ground?
To be fair, we've gotten a great amount of code contributions from the Mozilla folks, so it's not like they haven't contributed anything.
(I am one of the Quinn maintainers.)
Source: https://assets.mozilla.net/annualreport/2024/b200-mozilla-fo...
Still seeing this in Firefox with Cloudflare-hosted sites on both macOS and Fedora.
https://github.com/webcompat/web-bugs/issues/168913
Although the form result made it sound like a macOS-only issue, I actually have observed (and continue to observe) it on both macOS and Fedora.
EDIT: In the thread, I am seeing the reference to how Firefox-on-QUIC works if one has IPv6. My ISP (Frontier FiOS) infamously doesn't support IPv6, so I'm out of luck there where Firefox is concerned.
> The combination of the two did cost me a couple of days, resulting in this (basically single line) change in quinn-udp.
Two hyperlinks here were probably meant to be different, but got copy-pasted as the same link.
It’s more like the kernel puts multiple datagrams into a single structure and passes that around between layers, maintaining the boundaries between them in that structure (sk_buff data fragments?)
Not an expert, but I tried looking at how this works and stumbled upon [0].
QUIC does not depend on UDP datagrams to be delivered in order. Re-ordering happens on the QUIC layer. Thus, when receiving, the kernel passes a batch (i.e. segmented super datagram) of potentially out-of-order datagrams to the QUIC layer. QUIC reorders them.
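For concreteness, the userspace side of that batching looks roughly like this on Linux (a sketch assuming UDP_SEGMENT, 4.18+, for send-side GSO and UDP_GRO, 5.0+, for receive-side coalescing; the fallback constants cover older libc headers):

    #include <netinet/in.h>      /* IPPROTO_UDP (== SOL_UDP) */
    #include <netinet/udp.h>     /* UDP_SEGMENT / UDP_GRO on recent libcs */
    #include <string.h>
    #include <sys/socket.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103      /* kernel values, in case the headers are old */
    #endif
    #ifndef UDP_GRO
    #define UDP_GRO 104
    #endif

    /* Send side (GSO): one big write that the kernel/NIC chops into
     * gso_size-byte datagrams; only the last one may be shorter. */
    static int enable_udp_gso(int fd, int gso_size)
    {
        return setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));
    }

    /* Receive side (GRO): let the kernel hand us one coalesced buffer; the
     * segment size it was built from comes back as a control message. */
    static int enable_udp_gro(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_UDP, UDP_GRO, &one, sizeof(one));
    }

    /* After recvmsg(), recover that segment size so the QUIC layer can split
     * the buffer back into individual (possibly out-of-order) datagrams. */
    static int gro_segment_size(struct msghdr *msg)
    {
        for (struct cmsghdr *cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
            if (cm->cmsg_level == IPPROTO_UDP && cm->cmsg_type == UDP_GRO) {
                int seg;
                memcpy(&seg, CMSG_DATA(cm), sizeof(seg));
                return seg;
            }
        }
        return -1;   /* not coalesced; the buffer is a single datagram */
    }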
Maybe https://blog.cloudflare.com/accelerating-udp-packet-transmis... brings some clarity.
For example GSO might split a 3.5KB data buffer into 4 UDP datagrams: U1, U2, U3, and U4, with U1/U2/U3 being 1KB and U4 being 512B. When U1-U4 arrive on the receiving host, how does GRO deal with the different permutations of ordering of the four packets (assuming no loss) and pass them to the QUIC layer? Like if U1/U2/U3/U4 come in the original sending order GRO can batch nicely. But what if they come in the order U1/U4/U3/U2? How does GRO deal with the fact that U4 is shorter?
Glad to know that networking still produces insanity trying to reproduce issues à la https://xkcd.com/2259/.
0x0000 is a special value for some NICs meaning "please calculate for me".
One NIC years ago would set 0xFFFF for bad checksum. At first we thought this was horrifyingly broken. But really you can just fallback to software verification for the handful of legitimate and bad packets that arrive with that checksum.
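For context, that software fallback is just the RFC 1071 ones'-complement sum (for UDP it also has to cover a pseudo-header, omitted here); a minimal sketch:

    #include <stddef.h>
    #include <stdint.h>

    /* RFC 1071-style 16-bit ones'-complement sum: the hypothetical fallback
     * path for the handful of packets the NIC flags as "verify in software". */
    static uint16_t inet_checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {
            sum += ((uint32_t)p[0] << 8) | p[1];
            p += 2;
            len -= 2;
        }
        if (len)                    /* odd trailing byte */
            sum += (uint32_t)p[0] << 8;
        while (sum >> 16)           /* fold carries back into the low 16 bits */
            sum = (sum & 0xFFFF) + (sum >> 16);

        return (uint16_t)~sum;
    }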
UDP/QUIC can DoS any system not based on a cloud deployment large enough to soak up the peak traffic. It is silly, but it pushes out any hosting operation that can't reach a disproportionate bandwidth asymmetry with the client traffic. i.e. fine for FAANG, but a death knell for most other small/medium organizations.
This is why many LAN still drop most UDP traffic, and rate-limit the parts needed for normal traffic. Have a nice day =3