I was expecting to see io_uring mentioned somewhere in the linux section of the article.
Do you mean zero-copy rx by ZCRX? If so, io_uring should support that, but you need a sufficiently recent kernel.
It supports both zero-copy rx/tx on tcp and udp afaik, and there was a paper from the google engineers that implemented zero-copy send/recv (originally for tcp only) showing that the whole endeavor was worth it when the payload was >= 20kb in size (iirc).
It could be interesting to see if that's still the case with current state of linux, memory sticks, etc.
I've looked up the links from when i dug into that:
- https://www.kernel.org/doc/html/latest/networking/msg_zeroco...
- the paper: https://netdevconf.org/2.1/papers/debruijn-msgzerocopy-talk....
- presentation about the paper: https://netdevconf.org/2.1/session.html?debruijn
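For reference, this is roughly what the MSG_ZEROCOPY opt-in described in those links looks like in practice. A minimal sketch, assuming a Linux 4.14+ TCP socket (5.0+ for UDP); fd/buf/len are placeholders and error handling is reduced to early returns:

    #include <string.h>
    #include <sys/socket.h>

    #ifndef SO_ZEROCOPY
    #define SO_ZEROCOPY 60          /* present in kernel headers since 4.14 */
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000
    #endif

    static int send_zerocopy(int fd, const void *buf, size_t len)
    {
        int one = 1;

        /* Opt in once per socket. */
        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
            return -1;

        /* Pages are pinned instead of copied; the buffer must stay
         * untouched until the completion notification arrives. */
        if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
            return -1;

        /* Completions come back on the socket error queue; real code would
         * poll for them and batch many sends per notification. */
        char control[128];
        struct msghdr msg = { .msg_control = control, .msg_controllen = sizeof(control) };
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
            return -1;

        return 0;
    }

The buffer has to stay stable until that error-queue notification arrives, which is part of why the paper only saw a clear win above roughly 20 KB payloads.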
> In extreme cases, on purely CPU bound benchmarks, we’re seeing a jump from < 1Gbit/s to 4 Gbit/s. Looking at CPU flamegraphs, the majority of CPU time is now spent in I/O system calls and cryptography code.
400% increase in throughput, which should translate to a proportionate reduction in CPU utilization for UDP network activity. That's pretty cool, especially for better power efficiency on portable clients (mobile and notebook).
I found this presentation refreshing. Too often, claims about transition to "modern" stacks are treated as being inherently good and do not come with the data to back it up.
https://github.com/snabbco/snabb/blob/master/src/apps/intel/...
They suggest thinking of busybox, but if you use busybox their Makefile will fail. Using toybox instead will work.
4 Gbit/s is on our rather dated benchmark machines. If you run the below command on a modern laptop, you likely reach higher throughput. (Consider disabling PMTUD to use a realistic Internet-like MTU. We do the same on our benchmark machines.)
https://github.com/mozilla/neqo
cargo bench --features bench --bench main -- "Download"
Pretty sure GSO/GRO aren't the only buggy parts either.
The reason that Firefox -- and other major browsers -- make self-signed certs so difficult to use is that allowing users to override certificate checks weakens the security of HTTPS, which otherwise relies on certificates being verifiable against the trust anchor list. It's true that this makes certain cases harder, but the judgement of the browser community was that that wasn't worth the security tradeoff. In other words, it's a policy decision, not a technical one.
Security is still a lot better because the root is communicated out of band.
However, it's not an entirely trivial problem to get right, especially because of how deeply the scheme is tied into the Web security model. Your example here is a good one of what I'm talking about:
> At the moment if you just use plain HTTP then things do mostly work (apart from some APIs which are somewhat arbitrarily locked to 'secure contexts' which means very little about the trustworthiness of the code that does or does not have access to those APIs),
You're right that being served over HTTPS doesn't make the site trustworthy, but what it does do is provide integrity for the identity of the server. So, for instance, the user might look at the URL and decide that the server is trustworthy and can be allowed to use the camera or microphone. However, if you use HTTPS but without verifying the certificate, then an attacker might in the future substitute themselves and take advantage of that camera and microphone access. Another example is when the user enters their password.
Rather than saying that browser vendors don't think this is worth solving in the abstract I would say that it's not very high on the priority list, especially because most of the ideas people have proposed don't work very well.
IMHO they should gradually lock all dynamic code execution, such as dynamic CSS and JavaScript, behind an explicit toggle for insecure http sites.
> It massively undermines the security of connections to local devices
No, you see the prompt, it is insecure. If the network admin wants it secure, that means either an internal CA or a literally free cert from Let's Encrypt. As the network admin did not care, it's insecure.
"but I have legacy garbage with hardcoded self-signed certs" then reverse proxy that legacy garbage with Caddy?
(An option to get some authentication, and one that I think Chrome has kind of started to figure out, is to allow a PWA to connect to a local device and authenticate with its own keys. This still means you need to connect to the internet once with the client device, but at least from that point onwards it can work without internet. But then you need a whole other flow so that random sites can't just connect to your local devices...)
You can't just have some random router, printer, NAS, etc. generate its own cert out of thin air and tell the browser to ignore the fact that it can't be verified.
IMO this is a good thing. The way browsers handle HTTPS on older protocols is a result of the number of legacy badly configured systems there are out there which browser vendors don't want to break. Anywhere someone's supporting HTTP/3 they're doing something new, so enforcing a "do it right or don't do it at all" policy is possible.
HTTP/3 just removes the space for misunderstanding.
Just like self-signed certs worked for 20 years until the megacorps decided to break people's browsers because only their for-profit use cases matter. You might not remember, but random self-signed certs worked for a long, long time. I use them. And their purpose is as a speed bump against massive passive surveillance, something that still works. TOFU works. ID isn't actually needed for most personal use cases on the web. That's a corporate thing. HTTP + HTTPS (self-signed) is the perfect combo for human-person use cases. And much more robust than HTTPS-only, which will break within a year or two if left unwatched by human eyes.
The misunderstanding Chrome and its followers (like Firefox) removed was that they were for anything except corporate use cases.
As the author cited, a kernel context switch is only on the order of 1 us (which seems too high for a system call anyway). You can reach 500 MB/s even if you still call sendmsg() on literally every packet, as long as you average ~500 bytes/packet, which is ~1/3 of the standard 1500-byte MTU. So if you average MTU-sized packets, you get 2 us of processing in addition to a full system call to reach 4 Gb/s.
The old number of 1 Gb/s could be reached with an average of ~125 bytes/packet, ~1/12 of the MTU, or with ~11 us of processing per MTU-sized packet.
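To make the arithmetic explicit, a throwaway program that just restates those budgets (my own sketch, not from the article):

    #include <stdio.h>

    int main(void)
    {
        const double syscall_ns = 1000.0;   /* ~1 us per sendmsg(), as assumed above */
        const double mtu_bytes  = 1500.0;   /* standard Ethernet MTU */
        const double targets_gbps[] = { 1.0, 4.0 };

        for (int i = 0; i < 2; i++) {
            /* Gb/s is bits per nanosecond, so the time per MTU-sized packet is: */
            double per_packet_ns = mtu_bytes * 8.0 / targets_gbps[i];
            printf("%.0f Gb/s: %5.0f ns per packet, %5.0f ns left after the syscall\n",
                   targets_gbps[i], per_packet_ns, per_packet_ns - syscall_ns);
        }
        return 0;   /* prints 12000/11000 ns for 1 Gb/s and 3000/2000 ns for 4 Gb/s */
    }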
“But there are also memory copies in the network stack.” A trivial 3-instruction memory copy will go ~10-20 GB/s, i.e. 80-160 Gb/s. In 2 us you can drive 20-40 KB of copies. You are arguing the network stack does 40-80(!) copies to put a UDP datagram, a thin veneer over a literal packet, into a packet buffer. I have written commercial network drivers. Even without zero-copy, with direct access you can shovel UDP packets into the NIC buffers at basically memory-copy speeds.
“But encryption is slow.” Not that slow. Here are some AES-128-GCM performance numbers from what looks like over 5 years ago. [1] The Intel i5-6500, a midline processor from 8 years ago, averages 1729 MB/s. It can do the encryption for a 500-byte packet in ~300 ns, 1/6 of the remaining 2 us budget. Modern processors seem to be closer to 3-5 GB/s per core, or about 25-40 Gb/s, 6-10x the stated UDP throughput.
“It is slow because it is being layered over QUIC.” Then why did you layer over a bottleneck that slows you down by 25x? Second of all, they did not use to do that, and they still only got 1 Gb/s previously, which is abysmal.
Third of all, you can achieve QUIC feature parity (minus encryption which will be your per-core bottleneck) at 50-100 Gb/s per core, so even that is just a function of using a slow protocol.
Finally, the CPU class used in benchmarking is largely irrelevant because I am discussing 20x per-core performance bottlenecks. You would need to be benchmarking on a desktop CPU from 25 years ago to get that degree of single-core performance difference. We are talking iPhone 6 territory, a decade-old phone, for an efficient implementation to bottleneck on the processor at just 4 Gb/s.
But again, it is probably not a problem with their code. It is likely something else stupid happening on the network stack or protocol side of which they are merely a client.
Spectre & Meltdown.
> you get 2 us of processing in addition to a full system call to reach 4 Gb/s
TCP has route binding, UDP does not (connect(2) helps one side, but not both sides).
> “But encryption is slow.” Not that slow.
Encryption _is slow_ for small PDUs, at least the common constructions we're currently using. Everyone's essentially been optimizing for and benchmarking TCP with large frames.
If you hot-loop the state as the micro-benchmarks do, you can do better, but you still see a very visible cost of state setup that only starts to amortize decently well above 1024-byte payloads. Eradicate a bunch of cache efficiency by removing the tightness of the loop and this amortization boundary shifts quite far to the right, up into tens of kilobytes.
---
All of the above, plus the additional framing overheads, come into play. Hell, even the OOB data blocks are quite expensive to actually validate; it's not a good API to fix this problem, it's just the API we have, shoved over bsd sockets.
And we haven't even gotten to buffer constraints and contention yet, but the default UDP buffer memory available on most systems is woefully inadequate for these use cases today. TCP buffers were scaled up over time, but UDP buffers basically never were; they're still conservative values from the late 90s/00s, really.
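For illustration, the per-socket side of that looks roughly like this (a sketch; the 16 MB figure is an arbitrary example, and the request is silently clamped by the net.core.rmem_max / net.core.wmem_max sysctls, which is where the conservative defaults live):

    #include <stdio.h>
    #include <sys/socket.h>

    /* Ask for bigger per-socket UDP buffers. The kernel clamps the request
     * to net.core.rmem_max / net.core.wmem_max, which on many distros are
     * still a few hundred KB. */
    static void grow_udp_buffers(int fd)
    {
        int size = 16 * 1024 * 1024;
        socklen_t len = sizeof(size);

        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
            perror("SO_RCVBUF");
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
            perror("SO_SNDBUF");

        /* Read back what we actually got (the kernel reports double the value
         * to account for its own bookkeeping overhead). */
        if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) == 0)
            printf("effective receive buffer: %d bytes\n", size);
    }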
The API we really need for this kind of UDP setup is one where you can do something like fork the fd, connect(2) it with a full route bind, and then fix the RSS/XSS challenges that come from this splitting. After that we need a submission queue API rather than another bsd sockets ioctl style mess (uring, rio, etc). Sadly none of this is portable.
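As a rough idea of what a submission-queue style send path looks like on Linux today, a liburing sketch (my own illustration, not portable, and it omits the batching and buffer-registration tricks a real implementation would want):

    #include <liburing.h>
    #include <sys/socket.h>

    /* Queue a batch of already-prepared sendmsg()s and push them to the
     * kernel with a single submit; msgs/count are placeholders. */
    static int submit_batch(int fd, struct msghdr *msgs, unsigned count)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;

        if (io_uring_queue_init(256, &ring, 0) < 0)
            return -1;

        for (unsigned i = 0; i < count; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            if (!sqe)
                break;                           /* submission queue full */
            io_uring_prep_sendmsg(sqe, fd, &msgs[i], 0);
        }

        int submitted = io_uring_submit(&ring);  /* one kernel transition for the whole batch */

        for (int i = 0; i < submitted; i++) {    /* reap completions; real code overlaps this */
            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                break;
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return submitted;
    }

Newer kernels add zero-copy send variants on top of the same ring, which ties back to the ZCRX discussion further up the thread.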
On the crypto side there are KDF approaches which can remove a lot of the state cost involved. It's not popular, but some vendors are very taken with PSP for this reason - though PSP becoming better known or used was largely suppressed by its various rejections in the ietf and in linux. Vendors doing scale tests with it have clear numbers though: under high concurrency you can scale this much better than the common tls or tls-like constructions.
You are basically saying: “It is slow because of all these system/protocol decisions that mismatch what you need to get high performance out of the primitives.”
Which is my point. They are leaving, by my estimation, 10-20x performance on the floor due to external factors. They might be “fast given that they are bottlenecked by low performance systems”, which is good as their piece is not the bottleneck, but they are not objectively “fast” as the primitives can be configured to solve a substantially similar problem dramatically faster if integrated correctly.
sure, i mean i have no goal of alignment or misalignment, i'm just trying to provide more insights into what's going on based on my observations of this from having also worked on this udp path.
> Which is my point. They are leaving, by my estimation, 10-20x performance on the floor due to external factors. They might be “fast given that they are bottlenecked by low performance systems”, which is good as their piece is not the bottleneck, but they are not objectively “fast” as the primitives can be configured to solve a substantially similar problem dramatically faster if integrated correctly.
yes, though this basically means we're talking about throwing out chunks of the os, the crypto design, the protocol, and a whole lot of tuning at each layer.
the only vendor in a good position to do this is apple (being the only vendor that owns every involved layer in a single product chain), and they're failing to do so as well.
the alternative is a long old road, where folks make articles like this from time to time, we share our experiences and hope that someone is inspired enough reading it to be sniped into making incremental progress. it'd be truly fantastic if we sniped a group with the vigor and drive that the mptcp folks seem to have, as they've managed to do an unusually broad and deep push across a similar set of layered challenges (though still in progress).
I just measured. On my Ryzen 7 9700X, with Linux 6.12, it's about 50ns to call syscall(__NR_gettimeofday). Even post-spectre, entering the kernel isn't so expensive.
Using the libc wrapper will use the vdso. Using syscall() will enter the kernel.
I haven't measured, but calling the vdso should be closer to 5ns.
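Roughly the shape of the measurement, as a quick-and-dirty sketch (my own reconstruction, not the exact code; pin the core and watch out for frequency scaling if you want stable numbers):

    #define _GNU_SOURCE            /* for syscall() */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <time.h>
    #include <unistd.h>

    /* Time N calls through the raw syscall vs. the libc/vDSO path. */
    static double ns_per_call(int raw)
    {
        enum { N = 10 * 1000 * 1000 };
        struct timeval tv;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            if (raw)
                syscall(SYS_gettimeofday, &tv, NULL);  /* forces a real kernel entry */
            else
                gettimeofday(&tv, NULL);               /* normally served by the vDSO */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N;
    }

    int main(void)
    {
        printf("raw syscall: %.1f ns/call\n", ns_per_call(1));
        printf("vDSO path:   %.1f ns/call\n", ns_per_call(0));
        return 0;
    }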
Someone else did more detailed measurements here:
https://arkanis.de/weblog/2017-01-05-measurements-of-system-...
Also: register renaming is a thing, as is write combining and pipelining. You're not flushing to L1 synchronously for every register, or ordinary userspace function calls would regularly take hundreds of cycles for handling saved registers. They don't.
It's easy to measure how long it takes to push and pop all registers, as well as writing a moderate number of entries to the stack. It's very cheap.
As far as switching into the kernel -- the syscall instruction is more or less just setting a few permission bits and acting as a speculation barrier; there's no reason for that to be expensive. I don't have information on the cost in isolation, but it's entirely unsurprising to me that the majority of the cost is in shuffling around registers. (The post-spectre TLB flush has a cost, but ASIDs mitigate the cost, and measuring the time spent entering and exiting the kernel wouldn't show it even if ASIDs weren't in use)
What do you say about the measurements from https://gms.tf/on-the-costs-of-syscalls.html? The table suggests that the cost is an order of magnitude larger, depending on the host CPU, from 250 to 620 ns.
As far as that article, it's interesting that the numbers vary between 76 and 560 ns; the benchmark itself has an order of magnitude variation. It also doesn't say what syscall is being done -- __NR_clock_gettime is very cheap, but, for example, __NR_sched_yield will be relatively expensive.
That makes me suspect something else is up in that benchmark.
For what it's worth, here's some more evidence that touching the stack with easily pipelined/parallelized MOV is very cheap. 100 million calls to this assembly cost 200 ms, or about 2 ns/call:
f:
.LFB6:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $8, %rsp
movq $42, -128(%rbp)
movq $42, -120(%rbp)
movq $42, -112(%rbp)
movq $42, -104(%rbp)
movq $42, -96(%rbp)
movq $42, -88(%rbp)
movq $42, -80(%rbp)
movq $42, -72(%rbp)
movq $42, -64(%rbp)
movq $42, -56(%rbp)
movq $42, -48(%rbp)
movq $42, -40(%rbp)
movq $42, -32(%rbp)
movq $42, -24(%rbp)
movq $42, -16(%rbp)
movq $42, -8(%rbp)
nop
leave
.cfi_def_cfa 7, 8
ret

An interesting thing that reinforces their hypothesis and measurements is the fact that, for example, getpid and clock_gettime_mono_raw on some platforms run much faster (vDSO) than on the rest.
Also, the variance between different CPUs is what IMO is reinforcing their results and not the other way around - I don't expect the same call to have the same cost on different CPU models. Different CPUs, different cores, different clock frequencies, different tradeoffs in design, etc.
The code is here: https://github.com/gsauthof/osjitter/blob/master/bench_sysca...
The syscall() row invokes a simple syscall(423) and it seems to be expensive. Other calls such as close(999), getpid(), getuid(), clock_gettime(CLOCK_MONOTONIC_RAW, &ts), and sched_yield() produce similar results. All of them are basically an order of magnitude larger than 50 ns.
As for register renaming, I know what it is, but I still don't get what register renaming has to do with making the storage of state (registers) a cheaper operation.
This is from Intel manual:
Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible).
So, I wrongly assumed that the core has to wait until the data is completely written, but it seems it acts more like a memory barrier with relaxed properties - instructions are serialized, but the data written doesn't have to become globally visible.

I think the most important aspect of it is "until all instructions prior to the SYSCALL have completed". This means that the whole pipeline has to be drained. With a 20+ deep instruction pipeline, and whatever instructions happen to be in it, I can imagine that this can easily become the most expensive part of the syscall.
AMD Ryzen 7 9700X Desktop:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 38.6 ns 38.5 ns 18160546
bench_getpid 39.9 ns 39.9 ns 17703749
bench_close 45.2 ns 45.1 ns 15711379
bench_syscall 42.2 ns 42.1 ns 16638675
bench_sched_yield 81.7 ns 81.6 ns 8623522
bench_clock_gettime 15.9 ns 15.9 ns 44010857
bench_clock_gettime_tai 15.9 ns 15.9 ns 43997256
bench_clock_gettime_monotonic 15.9 ns 15.9 ns 44012908
bench_clock_gettime_monotonic_raw 15.9 ns 15.9 ns 43982277
bench_nanosleep0 49961 ns 370 ns 100000
bench_nanosleep0_slack1 10839 ns 351 ns 1000000
bench_nanosleep1_slack1 10878 ns 358 ns 1000000
bench_pthread_cond_signal 1.37 ns 1.37 ns 503715097
bench_assign 0.563 ns 0.562 ns 1000000000
bench_sqrt 1.63 ns 1.63 ns 430096636
bench_sqrtrec 5.33 ns 5.33 ns 132574542
bench_nothing 0.394 ns 0.394 ns 1000000000
12th Gen Intel(R) Core(TM) i5-12600H
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 70.0 ns 70.0 ns 9985369
bench_getpid 71.6 ns 71.6 ns 9763016
bench_close 76.7 ns 76.7 ns 9131090
bench_syscall 66.8 ns 66.8 ns 10533946
bench_sched_yield 160 ns 160 ns 4377987
bench_clock_gettime 12.2 ns 12.2 ns 57432496
bench_clock_gettime_tai 12.1 ns 12.1 ns 57826299
bench_clock_gettime_monotonic 12.2 ns 12.2 ns 57736141
bench_clock_gettime_monotonic_raw 12.3 ns 12.3 ns 57070425
bench_nanosleep0 63154 ns 11834 ns 55756
bench_nanosleep0_slack1 2933 ns 1700 ns 348675
bench_nanosleep1_slack1 2654 ns 1479 ns 467420
bench_pthread_cond_signal 1.39 ns 1.39 ns 483995101
bench_assign 0.868 ns 0.868 ns 821103909
bench_sqrt 1.69 ns 1.69 ns 422094139
bench_sqrtrec 4.06 ns 4.06 ns 174511095
bench_nothing 0.750 ns 0.750 ns 941204159
AMD Ryzen 5 PRO 7545U Laptop:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 106 ns 106 ns 6581746
bench_getpid 111 ns 111 ns 6271878
bench_close 116 ns 116 ns 5944154
bench_syscall 85.9 ns 85.9 ns 7317584
bench_sched_yield 315 ns 315 ns 2249333
bench_clock_gettime 17.6 ns 17.6 ns 39935693
bench_clock_gettime_tai 17.6 ns 17.6 ns 39920957
bench_clock_gettime_monotonic 17.5 ns 17.5 ns 39962966
bench_clock_gettime_monotonic_raw 17.5 ns 17.5 ns 39561163
bench_nanosleep0 52720 ns 3058 ns 100000
bench_nanosleep0_slack1 13815 ns 2969 ns 244790
bench_nanosleep1_slack1 13710 ns 2722 ns 254666
bench_pthread_cond_signal 2.66 ns 2.66 ns 264735233
bench_assign 0.930 ns 0.930 ns 813279743
bench_sqrt 2.43 ns 2.43 ns 286953468
bench_sqrtrec 5.67 ns 5.67 ns 123889652
bench_nothing 0.812 ns 0.812 ns 860562208
So, I've tested multiple times in multiple ways, and the results don't seem to match.

----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 384 ns 384 ns 1822307
bench_getpid 382 ns 382 ns 1835289
bench_close 390 ns 390 ns 1796493
bench_syscall 374 ns 374 ns 1874165
bench_sched_yield 611 ns 611 ns 1143456
bench_clock_gettime 44.1 ns 44.1 ns 15872740
bench_clock_gettime_tai 44.1 ns 44.1 ns 15879915
bench_clock_gettime_monotonic 44.1 ns 44.1 ns 15887383
bench_clock_gettime_monotonic_raw 44.4 ns 44.4 ns 15755225
bench_nanosleep0 55617 ns 4647 ns 100000
bench_nanosleep0_slack1 7144 ns 4362 ns 160448
bench_nanosleep1_slack1 7159 ns 4369 ns 160645
bench_pthread_cond_signal 7.38 ns 7.38 ns 94670062
bench_assign 0.523 ns 0.523 ns 1000000000
bench_sqrt 8.04 ns 8.04 ns 86998912
bench_sqrtrec 11.4 ns 11.4 ns 61428535
bench_nothing 0.000 ns 0.000 ns 1000000000
EDIT: also reproducible on my skylake-x (Gold 6152) machine.

With turbo-boost @3.7GHz enabled:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 619 ns 616 ns 1153007
bench_getpid 632 ns 627 ns 1150829
bench_close 629 ns 626 ns 1110226
bench_syscall 617 ns 613 ns 1160239
bench_sched_yield 974 ns 969 ns 702773
bench_clock_gettime 17.9 ns 17.8 ns 39368735
bench_clock_gettime_tai 17.8 ns 17.7 ns 39109544
bench_clock_gettime_monotonic 17.9 ns 17.8 ns 39591364
bench_clock_gettime_monotonic_raw 19.0 ns 18.8 ns 38902038
bench_nanosleep0 63993 ns 4381 ns 100000
bench_nanosleep0_slack1 7445 ns 2115 ns 328474
bench_nanosleep1_slack1 7346 ns 2111 ns 334833
bench_pthread_cond_signal 2.13 ns 2.12 ns 327903411
bench_assign 0.167 ns 0.166 ns 1000000000
bench_sqrt 1.87 ns 1.85 ns 374885774
bench_sqrtrec 0.000 ns 0.000 ns 1000000000
bench_nothing 0.000 ns 0.000 ns 1000000000
With turbo-boost disabled (@2.1GHz base frequency):

----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 1019 ns 1012 ns 688965
bench_getpid 1057 ns 1048 ns 688020
bench_close 1039 ns 1029 ns 684537
bench_syscall 1010 ns 1003 ns 696919
bench_sched_yield 1653 ns 1642 ns 434212
bench_clock_gettime 30.7 ns 30.4 ns 22999055
bench_clock_gettime_tai 30.5 ns 30.2 ns 23716873
bench_clock_gettime_monotonic 29.8 ns 29.6 ns 23643198
bench_clock_gettime_monotonic_raw 30.5 ns 30.3 ns 23277717
bench_nanosleep0 65256 ns 5114 ns 100000
bench_nanosleep0_slack1 11649 ns 3402 ns 197983
bench_nanosleep1_slack1 11572 ns 3528 ns 209371
bench_pthread_cond_signal 3.62 ns 3.60 ns 195696177
bench_assign 0.255 ns 0.253 ns 1000000000
bench_sqrt 3.13 ns 3.10 ns 225561559
bench_sqrtrec 0.000 ns 0.000 ns 1000000000
bench_nothing 0.000 ns 0.000 ns 1000000000
I wonder why your results are so different. Mine scale almost linearly with the core frequency.

Why are your calls to sqrt so slow on your newest machine? Why is sqrtrec free on the others?
As for the sqrt, I don't think it is unusually slow if we compare it against the results from the table above - it's definitely not an outlier, since the recorded range is from 1 ns to 15 ns and I recorded a value of 8 ns. Why that is so is not the question here.
A better question is why your results are such a big outlier.
https://arkanis.de/weblog/2017-01-05-measurements-of-system-...
Google also reported similar numbers in 2011, when publicizing their fiber work.
I can also get similar numbers (~68ns) on 9front, though a little higher.
Since you have been laser-focused on the "bad" sqrt performance and the obvious optimization with sqrtrec, while deciding to ignore the rest of the results, maybe you can explain why there is such a large difference in your measurements between seemingly very similar platforms in terms of compute. After all, this is a pure compute problem.
For example, why does 4.9GHz CPU (AMD Ryzen™ 5 7545U) yield 2x to 4x worse results than 5.5GHz CPU (AMD Ryzen™ 7 9700X)?
AMD Ryzen 7 9700X Desktop:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 38.6 ns 38.5 ns 18160546
bench_getpid 39.9 ns 39.9 ns 17703749
bench_close 45.2 ns 45.1 ns 15711379
bench_syscall 42.2 ns 42.1 ns 16638675
bench_sched_yield 81.7 ns 81.6 ns 8623522
AMD Ryzen 5 PRO 7545U Laptop:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 106 ns 106 ns 6581746
bench_getpid 111 ns 111 ns 6271878
bench_close 116 ns 116 ns 5944154
bench_syscall 85.9 ns 85.9 ns 7317584
bench_sched_yield 315 ns 315 ns 2249333

Edit: And, apparently, because, regardless of what I do with `cpupower` and twiddling the governors, the CPU frequency on this machine is getting scaled. I've run out of time to debug that; I'll update later.
https://www.cpubenchmark.net/compare/6205vs6367vs4835/AMD-Ry...
I'm not sure what's up with sched_yield.
I can also replicate these numbers with `perf bench syscall basic`.
I reran the experiment in a VM, on a company's Xeon server clocked @2.2GHz, and results are again pretty much the same as before:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
bench_getuid 778 ns 778 ns 901999
bench_getpid 774 ns 774 ns 902699
bench_close 779 ns 779 ns 896939
bench_syscall 761 ns 761 ns 916941
bench_sched_yield 1121 ns 1121 ns 566012
bench_clock_gettime 22.1 ns 22.1 ns 31579512
bench_clock_gettime_tai 22.0 ns 22.0 ns 31502402
bench_clock_gettime_monotonic 22.1 ns 22.1 ns 31848177
bench_clock_gettime_monotonic_raw 22.4 ns 22.4 ns 30953415
bench_nanosleep0 57424 ns 6967 ns 98218
bench_nanosleep0_slack1 6342 ns 6340 ns 110862
bench_nanosleep1_slack1 6310 ns 6308 ns 111064
bench_pthread_cond_signal 3.23 ns 3.23 ns 216726274
bench_assign 0.323 ns 0.323 ns 1000000000
bench_sqrt 2.64 ns 2.64 ns 265275643
bench_sqrtrec 4.40 ns 4.40 ns 160328959
bench_nothing 0.000 ns 0.000 ns 1000000000

Awesome, so you sponsored them right?
Why bother sponsoring any open source projects when they can throw a few extra million into their CEO's salary, while that CEO is running their flagship product (Firefox) into the ground?
To be fair, we've gotten a great amount of code contributions from the Mozilla folks, so it's not like they haven't contributed anything.
(I am one of the Quinn maintainers.)
Source: https://assets.mozilla.net/annualreport/2024/b200-mozilla-fo...
Still seeing this in Firefox with Cloudflare-hosted sites on both macOS and Fedora.
https://github.com/webcompat/web-bugs/issues/168913
Although the form result made it sound like a macOS-only issue, I actually have observed (and continue to observe) it on both macOS and Fedora.
EDIT: In the thread, I am seeing the reference to how Firefox-on-QUIC works if one has IPv6. My ISP (Frontier FiOS) infamously doesn't support IPv6, so I'm out of luck there where Firefox is concerned.
> The combination of the two did cost me a couple of days, resulting in this (basically single line) change in quinn-udp.
Two hyperlinks here were probably meant to be different, but got copy-pasted as the same link.
It’s more like the kernel puts multiple datagrams into a single structure and passes that around between layers, maintaining the boundaries between them in that structure (sk_buff data fragments?)
Not an expert, but I tried looking at how this works and stumbled upon [0].
QUIC does not depend on UDP datagrams to be delivered in order. Re-ordering happens on the QUIC layer. Thus, when receiving, the kernel passes a batch (i.e. segmented super datagram) of potentially out-of-order datagrams to the QUIC layer. QUIC reorders them.
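For concreteness, the userspace side of that batching looks roughly like this on Linux (a sketch assuming UDP_SEGMENT, 4.18+, for send-side GSO and UDP_GRO, 5.0+, for receive-side coalescing; the fallback constants cover older libc headers):

    #include <netinet/in.h>      /* IPPROTO_UDP (== SOL_UDP) */
    #include <netinet/udp.h>     /* UDP_SEGMENT / UDP_GRO on recent libcs */
    #include <string.h>
    #include <sys/socket.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103      /* kernel values, in case the headers are old */
    #endif
    #ifndef UDP_GRO
    #define UDP_GRO 104
    #endif

    /* Send side (GSO): one big write that the kernel/NIC chops into
     * gso_size-byte datagrams; only the last one may be shorter. */
    static int enable_udp_gso(int fd, int gso_size)
    {
        return setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));
    }

    /* Receive side (GRO): let the kernel hand us one coalesced buffer; the
     * segment size it was built from comes back as a control message. */
    static int enable_udp_gro(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_UDP, UDP_GRO, &one, sizeof(one));
    }

    /* After recvmsg(), recover that segment size so the QUIC layer can split
     * the buffer back into individual (possibly out-of-order) datagrams. */
    static int gro_segment_size(struct msghdr *msg)
    {
        for (struct cmsghdr *cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
            if (cm->cmsg_level == IPPROTO_UDP && cm->cmsg_type == UDP_GRO) {
                int seg;
                memcpy(&seg, CMSG_DATA(cm), sizeof(seg));
                return seg;
            }
        }
        return -1;   /* not coalesced; the buffer is a single datagram */
    }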
Maybe https://blog.cloudflare.com/accelerating-udp-packet-transmis... brings some clarity.
For example GSO might split a 3.5KB data buffer into 4 UDP datagrams: U1, U2, U3, and U4, with U1/U2/U3 being 1KB and U4 being 512B. When U1-U4 arrive on the receiving host, how does GRO deal with the different permutations of ordering of the four packets (assuming no loss) and pass them to the QUIC layer? Like if U1/U2/U3/U4 come in the original sending order GRO can batch nicely. But what if they come in the order U1/U4/U3/U2? How does GRO deal with the fact that U4 is shorter?
Glad to know that networking still produces insanity trying to reproduce issues à la https://xkcd.com/2259/.
0x0000 is a special value for some NICs meaning "please calculate for me".
One NIC years ago would set 0xFFFF for bad checksum. At first we thought this was horrifyingly broken. But really you can just fallback to software verification for the handful of legitimate and bad packets that arrive with that checksum.
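For context, that software fallback is just the RFC 1071 ones'-complement sum (for UDP it also has to cover a pseudo-header, omitted here); a minimal sketch:

    #include <stddef.h>
    #include <stdint.h>

    /* RFC 1071-style 16-bit ones'-complement sum: the hypothetical fallback
     * path for the handful of packets the NIC flags as "verify in software". */
    static uint16_t inet_checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {
            sum += ((uint32_t)p[0] << 8) | p[1];
            p += 2;
            len -= 2;
        }
        if (len)                    /* odd trailing byte */
            sum += (uint32_t)p[0] << 8;
        while (sum >> 16)           /* fold carries back into the low 16 bits */
            sum = (sum & 0xFFFF) + (sum >> 16);

        return (uint16_t)~sum;
    }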
UDP/QUIC can DoS any system not based on a cloud deployment large enough to soak up the peak traffic. It is silly, but it pushes out any hosting operation that can't reach a disproportionate bandwidth asymmetry with the client traffic. i.e. fine for FAANG, but a death knell for most other small/medium organizations.
This is why many LAN still drop most UDP traffic, and rate-limit the parts needed for normal traffic. Have a nice day =3