
RIP pthread_cancel

https://eissing.org/icing/posts/rip_pthread_cancel/
238•robin_reala•4mo ago

Comments

pizlonator•4mo ago
At first I wondered if musl does it better, so I checked, and the version I have disables cancellation in the guts of `getaddrinfo`.

I've always thought APIs like `pthread_cancel` are too nasty to use. Glad to see well-documented evidence of my crank opinion.

pengaru•4mo ago
Asynchronous cancellation in particular is difficult to use correctly, but it is also one of the most useful aspects of the API in situations where it is appropriate.

Imagine cpu-bound worker threads that do nothing but consume work via condition variables and spend long periods of time in hot compute-only loops working on said work... Instead of adding a conditional in the compute you're probably not interested in slowing down at all, you turn on async cancellation and pthread_cancel() the workers when you need to interrupt what's going on.

But it's worth noting pthread_cancel() is also rarely supported anywhere outside first-class pthreads-capable systems like modern linux. So if you have any intention of running elsewhere, forget about it. Thread cancellation support in general is actually somewhat rare IME.
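
A minimal sketch of the pattern described above, assuming glibc on Linux: the worker arms asynchronous cancellation, signals that it is ready, and then spins in a loop that touches only local state and a `volatile` sink, so `pthread_cancel()` can stop it with no check in the hot path. The names (`compute_worker`, `cancel_compute_demo`) and the LCG loop body are hypothetical stand-ins for real compute; per the man page excerpt later in the thread, anything beyond pure computation (malloc, locks, most library calls) is unsafe inside such a loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int worker_ready;           /* set once async cancel is armed */
static volatile unsigned long long sink;  /* keeps the loop from being optimized away */

/* A pure compute loop: no locks, no malloc, no library calls inside it. */
static void *compute_worker(void *arg)
{
    (void)arg;
    /* pthread_setcanceltype() is one of the few async-cancel-safe functions. */
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
    atomic_store(&worker_ready, 1);
    unsigned long long x = 1;
    for (;;) {                            /* never exits on its own */
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        sink = x;
    }
    return NULL;
}

/* Returns 1 if the worker was cancelled mid-loop, 0 otherwise. */
int cancel_compute_demo(void)
{
    pthread_t t;
    void *result = NULL;
    atomic_store(&worker_ready, 0);
    if (pthread_create(&t, NULL, compute_worker, NULL) != 0)
        return 0;
    while (!atomic_load(&worker_ready))
        sched_yield();                    /* wait until async cancel is armed */
    pthread_cancel(t);                    /* takes effect without a cancellation point */
    pthread_join(t, &result);
    return result == PTHREAD_CANCELED;
}
```

`cancel_compute_demo()` should return 1, meaning `pthread_join()` reported `PTHREAD_CANCELED`. The ready handshake keeps the sketch from depending on how a cancel request that arrives before `pthread_setcanceltype()` runs is handled.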

epcoa•4mo ago
> But it's worth noting pthread_cancel() is also rarely supported anywhere outside first-class pthreads-capable systems like modern linux

Having written some of the implementation for a non x86 commercial Unix well over 30 years ago now (yeah, I know), pthread_cancel is not that rare. A carve out like “modern linux” is io_uring or even inotify and epoll. AIX and HP-UX, fuck even OSF/1 had pthread_cancel.

Windows has TerminateThread. Most RTOS have some kind of thread level task killing interface.

While they have different semantics than pthread_cancel, that doesn’t really affect the example you’re giving - they can all be used for the “cpu-bound worker”

pizlonator•4mo ago
pthreads has `pthread_kill`, which is like `TerminateThread`.

`pthread_cancel` is different

lll-o-lll•4mo ago
I’m not familiar with pthread_cancel, but I am with TerminateThread. It’s not something that can be used safely: ever. Raymond Chen has written a few times about it, including the history.

> Originally, there was no TerminateThread function. The original designers felt strongly that no such function should exist because there was no safe way to terminate a thread, and there’s no point having a function that cannot be called safely. But people screamed that they needed the TerminateThread function, even though it wasn’t safe, so the operating system designers caved and added the function because people demanded it. Of course, those people who insisted that they needed TerminateThread now regret having been given it.

hedora•4mo ago
Assuming it’s OK to take 10msec to cancel, that conditional can be a well-predicted branch and a read of a cached memory address every 10msec. On a 1GHz processor, that’s a one-cycle instruction that’s run every 10 million cycles. Unless the conditional or the cached read is the straw that breaks the back of the cache, there’s no way it’ll be measurable.
achierius•4mo ago
How do you insert a branch "every 10ms" without some sort of hardware-provided interrupt?

If your code is running in a hot loop, you would have to insert that branch into the hot loop (even well-predicted branches add a few cycles, and can do things like break up decode groups) or have the hot loop bail out every once in a while to go execute the branch and code, which would mean tiling your interior hot loop and thus adding probably significant overhead that way.

Also, you say "cached memory address", but I can almost guarantee that unless you're doing that load a lot more frequently than once every 10 milliseconds, the inner loop is going to knock that address out of L1 and probably L2 by the time you get back around to it.

hedora•4mo ago
You put the check outside the innermost loop. Put it up one or two loops instead, and reason that the check runs frequently enough and also infrequently enough.

Also, don’t you have to hit a pthread cancellation point for pthread_cancel to take effect?

Those are way more expensive than a branch, but if you want the exact behavior, you could do “if (done) { break; } else { pthread_??? }”
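
The check-outside-the-inner-loop idea can be sketched like this, assuming a shared `atomic_bool` that some other thread sets when it wants the worker to stop (the names and the row/column layout are hypothetical). The flag is read once per row with a relaxed load, so the innermost loop carries no branch on it; the row length then bounds how long a stop request can go unnoticed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Processes rows * cols cells, checking `stop` once per row rather than once
   per cell. Returns the number of rows fully processed. */
size_t process_rows(atomic_bool *stop, double *cells, size_t rows, size_t cols)
{
    size_t r;
    for (r = 0; r < rows; r++) {
        /* One relaxed load per row: cheap, well predicted, outside the hot loop. */
        if (atomic_load_explicit(stop, memory_order_relaxed))
            break;
        double *row = cells + r * cols;
        for (size_t c = 0; c < cols; c++)
            row[c] = row[c] * 0.5 + 1.0;  /* stand-in for the real compute */
    }
    return r;
}
```

Another thread stops the worker with `atomic_store(&stop, true)`. No cancellation machinery is involved, so the worker exits through its normal return path and can release whatever it holds.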

pengaru•4mo ago
> Also, don’t you have to hit a pthread cancellation point for pthread_cancel to take effect?

No, the whole point here is async cancellation - you don't test for it and you don't enter a cancellation point.

Excerpt from pthread_setcancelstate(3):

  > Asynchronous cancelability
  >
  > Setting the cancelability type to PTHREAD_CANCEL_ASYNCHRONOUS is rarely useful. Since the thread could be canceled at any time, it cannot safely reserve resources (e.g., allocating memory with malloc(3)), acquire mutexes, semaphores, or locks, and so on. Reserving resources is unsafe because the application has no way of knowing what the state of these resources is when the thread is canceled; that is, did cancelation occur before the resources were reserved, while they were reserved, or after they were released? Furthermore, some internal data structures (e.g., the linked list of free blocks managed by the malloc(3) family of functions) may be left in an inconsistent state if cancelation occurs in the middle of the function call. Consequently, clean-up handlers cease to be useful.
  >
  > Functions that can be safely asynchronously canceled are called async-cancel-safe functions. POSIX.1-2001 and POSIX.1-2008 require only that pthread_cancel(3), pthread_setcancelstate(), and pthread_setcanceltype() be async-cancel-safe. In general, other library functions can't be safely called from an asynchronously cancelable thread.
  >
  > One of the few circumstances in which asynchronous cancelability is useful is for cancelation of a thread that is in a pure compute-bound loop.
sthustfo•4mo ago
Asynchronous cancellation (when compared to deferred) is only recommended in scenarios where the thread does not share any data, semaphores, or condition variables with other threads. The target thread tends to clean up any data within the thread cleanup handlers via pthread_cleanup_pop(). If not, the entire application might end up going down. Async cancellation has a very narrow application scope, IMHO.
rwmj•4mo ago
Netscape used to start a new thread (or maybe it was a subprocess?) to handle DNS lookups, because the API at the time (gethostbyname) was blocking. It's kind of amazing that we're 30 years on and this is still a problem.
nly•4mo ago
If you want DNS resolution to obey user/system preferences then you need to use the system provided API
rwmj•4mo ago
For sure! The only problem is there should be a non-blocking system-provided API and there isn't.
foota•4mo ago
System provided is maybe a strange word to use here since getaddrinfo is a libc function, not a system call.
rwmj•4mo ago
POSIX as the system, of course.
froh•4mo ago
The system API is not syscalls but libc, so why does it feel strange?
Seattle3503•4mo ago
In this case it isn't in the kernel, but in glibc. Could someone implement an equivalent alternative? Do any language runtimes re-implement DNS resolution?
NewJazz•4mo ago
I think most languages use the OS api by default, but there are plenty of libraries out there that bypass the system resolution.
bradfitz•4mo ago
Go does. And it supports timeouts and cancelation.
tremon•4mo ago
The system-provided API for getting DNS user/system preferences on Unix systems is to read /etc/resolv.conf. Every application is free to implement their own lookup from that.
dcrazy•4mo ago
That is absolutely not the API on macOS, which is a certified UNIX.
Spivak•4mo ago
This isn't even correct on Linux as it won't work if your user has anything other than or in addition to the dns module in their nsswitch.conf. You must use glibc's resolution on Linux for correct behavior. If it's software on your own systems then do what you want but you'll piss off some sysadmins deploying your software if you don't. Even Go farms out to cgo to resolve names if it detects modules it doesn't recognize.
silon42•4mo ago
As long as broken APIs exist, they will be problematic... they really should be deprecated.

Calling a separate (non-cancellable) thread to perform the lookup sounds like a viable solution...

jeroenhd•4mo ago
getaddrinfo_a is available, but not widely adopted (*BSD and Linux), probably because you can't guarantee it'll be available on every computer/phone/modem. This is only an issue if you're targeting POSIX rather than modern operating systems.

Windows 8 and above also have their own asynchronous DNS API on NON-POSIX land.

rfl890•4mo ago
> Windows 8 and above also have their own asynchronous DNS API on NON-POSIX land.

Interesting. Which API?
poizan42•4mo ago
GetAddrInfoEx[0] has had async support since Windows 8; it had the overlapped parameters earlier but didn't support them. I'm guessing that is what GP is referring to.

[0] https://learn.microsoft.com/en-us/windows/win32/api/ws2tcpip...

Arnavion•4mo ago
>getaddrinfo_a is available, but not widely adopted (*BSD and Linux), probably because you can't guarantee it'll be available on every computer/phone/modem. This is only an issue if you're targeting POSIX rather than modern operating systems.

To be precise, even on Linux getaddrinfo_a is not guaranteed to be present. It's a glibc extension. musl doesn't have it.

o11c•4mo ago
And MUSL's explicit policy of "do not implement essential features, just because the standards are falling behind" is a major reason why many programs choose to never support building against MUSL.
pjz•4mo ago
Funny, I see it as them choosing not to add another de-facto 'standard' ala https://xkcd.com/927/ .
fpoling•4mo ago
In 1996 it was a subprocess. LinuxThreads appeared only later that year.
Aardwolf•4mo ago
Maybe this is naive, but couldn't there just be some number of worker threads that run forever, wait for and take jobs when needed, and send a message when the jobs are done? They don't need to be canceled, and they don't block.
danappelxx•4mo ago
If the DNS resolution call blocks the thread, then you need N worker threads to perform N DNS calls. Threads aren’t free, so this is suboptimal. OTOH some thread pools e.g. libdispatch on Apple operating systems will spawn new threads on demand to prevent starvation, so this _can_ be viable. Though of course this can lead to thread explosion which may be even more problematic depending on the use case. In libcurl’s situation, spawning a million threads is probably even worse than a memory leak, which is worse than long timeouts.

In general, what you really want is for the API call to be nonblocking so you’re not forced to burn a thread.

ComputerGuru•4mo ago
This is, essentially, what the previous (largely pathetic) excuse for true asynchronous I/O on Linux did with the libc aio(7) interface to essentially fake support for truly asynchronous file IO. It wasn’t great.
variadix•4mo ago
Yeah, I’m not sure I see the problem (other than that threads are more expensive than e.g. file descriptors, but this is a moot point without a better API). You define how many requests in flight you want to allow, and that sets the cap on how many worker threads you spawn/use; you could also support an unbounded number in flight this way by lazily spawning a worker thread per request. Cancellation/kill interfaces for multithreading are pretty much always a footgun. Even for multiprocessing on modern machines, if you’re doing something non-trivial and decide to use SIGKILL to terminate a worker process, it’s easy to leave e.g. file system resources in a bad state.
nly•4mo ago
Why is running the DNS resolution thread a problem? It should be dequeuing resolution requests and pushing responses and sleeping when there is nothing to do

When someone kills off the curl context surely you simply set a suicide flag on the thread and wake it up so it can be joined.
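
A rough sketch of that design, with hypothetical names (`struct resolver`, `resolver_submit`, `resolver_shutdown`) and a fixed-size queue assumed for brevity: one thread sleeps on a condition variable, resolves queued names with plain `getaddrinfo()`, and exits once a stop flag is set and the queue is drained. Results are only counted here; a real client would hand each `addrinfo` list back to its requester. Queued host strings must stay alive until they are processed.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

#define RESOLVER_QCAP 16   /* hypothetical fixed queue size */

struct resolver {
    pthread_mutex_t lock;
    pthread_cond_t wake;
    const char *queue[RESOLVER_QCAP];
    int head, count;
    bool stop;             /* the "suicide flag" */
    int resolved_ok;       /* lookups that succeeded (stand-in for real replies) */
    pthread_t thread;
};

static void *resolver_main(void *arg)
{
    struct resolver *r = arg;
    pthread_mutex_lock(&r->lock);
    for (;;) {
        while (r->count == 0 && !r->stop)
            pthread_cond_wait(&r->wake, &r->lock);   /* sleep while idle */
        if (r->count == 0)
            break;                                    /* stop requested, queue drained */
        const char *host = r->queue[r->head];
        r->head = (r->head + 1) % RESOLVER_QCAP;
        r->count--;
        pthread_mutex_unlock(&r->lock);

        /* The blocking lookup runs without holding the lock. */
        struct addrinfo hints, *res = NULL;
        memset(&hints, 0, sizeof hints);
        hints.ai_socktype = SOCK_STREAM;
        int rc = getaddrinfo(host, NULL, &hints, &res);
        if (rc == 0)
            freeaddrinfo(res);

        pthread_mutex_lock(&r->lock);
        if (rc == 0)
            r->resolved_ok++;
    }
    pthread_mutex_unlock(&r->lock);
    return NULL;
}

int resolver_start(struct resolver *r)
{
    memset(r, 0, sizeof *r);
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->wake, NULL);
    return pthread_create(&r->thread, NULL, resolver_main, r);
}

bool resolver_submit(struct resolver *r, const char *host)
{
    bool queued = false;
    pthread_mutex_lock(&r->lock);
    if (!r->stop && r->count < RESOLVER_QCAP) {
        r->queue[(r->head + r->count) % RESOLVER_QCAP] = host;
        r->count++;
        queued = true;
        pthread_cond_signal(&r->wake);
    }
    pthread_mutex_unlock(&r->lock);
    return queued;
}

/* Teardown without pthread_cancel: set the flag, wake the thread, join it. */
int resolver_shutdown(struct resolver *r)
{
    pthread_mutex_lock(&r->lock);
    r->stop = true;
    pthread_cond_signal(&r->wake);
    pthread_mutex_unlock(&r->lock);
    pthread_join(r->thread, NULL);
    pthread_cond_destroy(&r->wake);
    pthread_mutex_destroy(&r->lock);
    return r->resolved_ok;
}
```

The catch is that `resolver_shutdown()` still waits in `pthread_join()` for any lookup already in flight, so a hung `getaddrinfo()` becomes a hung teardown: the leak is traded for a possible stall.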

rwmj•4mo ago
One problem may be that fork() kills background threads, so now any program that uses libcurl + fork has to have a new API to restart the DNS thread (or use pthread_atfork, which is a big PITA), and that might break existing programs using curl.
ComputerGuru•4mo ago
It’s not too much of an exaggeration to say that everything about using fork() instead of vfork() plus exec() is essentially fundamentally broken in modern osdev without a whole stack of hacks to try and patch individual issues one-by-one.
EPWN3D•4mo ago
It's not an exaggeration in any sense. fork(2) basically cannot be done correctly in modern userspace stacks.
loeg•4mo ago
A surmountable problem, sure.
rwmj•4mo ago
Sometimes. To give one counterexample, golang doesn't have a way to restart the threads it uses for I/O (apparently a decision the golang developers made), so if you're embedding golang code in another binary, it better not call fork. (The reason for this warning: https://gitlab.com/nbdkit/nbdkit/-/commit/2589e6da40939af9ae...)
loeg•4mo ago
This is a policy choice of Golang, but a C library like Curl (the topic of this thread) is not constrained by the policy choices of the Go developers. Curl could use MADV_WIPEONFORK or other primitives to restart its DNS thread automatically.
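
One way to restart a helper automatically, sketched with `pthread_atfork()` (the POSIX primitive mentioned upthread) rather than MADV_WIPEONFORK, and assuming a lazily started helper guarded by a mutex (all names hypothetical). The child handler records that the helper did not survive `fork()`, so the next `ensure_helper()` call in the child starts a fresh one instead of handing work to a thread that no longer exists.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t helper_lock = PTHREAD_MUTEX_INITIALIZER;
static bool helper_running = false;   /* true only if a helper exists in THIS process */
static pthread_once_t atfork_once = PTHREAD_ONCE_INIT;

static void *helper_main(void *arg)
{
    (void)arg;
    for (;;)
        pause();                      /* stand-in for "wait for resolution requests" */
    return NULL;
}

/* Hold the lock across fork() so the child never inherits it mid-update. */
static void before_fork(void)       { pthread_mutex_lock(&helper_lock); }
static void after_fork_parent(void) { pthread_mutex_unlock(&helper_lock); }
static void after_fork_child(void)
{
    helper_running = false;           /* fork() copied only the calling thread */
    pthread_mutex_unlock(&helper_lock);
}

static void register_atfork(void)
{
    pthread_atfork(before_fork, after_fork_parent, after_fork_child);
}

/* Start the helper if this process does not have one (e.g. right after fork). */
int ensure_helper(void)
{
    pthread_once(&atfork_once, register_atfork);
    int rc = 0;
    pthread_mutex_lock(&helper_lock);
    if (!helper_running) {
        pthread_t t;
        rc = pthread_create(&t, NULL, helper_main, NULL);
        if (rc == 0) {
            pthread_detach(t);
            helper_running = true;
        }
    }
    pthread_mutex_unlock(&helper_lock);
    return rc;
}

bool helper_is_running(void)
{
    pthread_mutex_lock(&helper_lock);
    bool running = helper_running;
    pthread_mutex_unlock(&helper_lock);
    return running;
}

/* Demo: 1 if a forked child still believes the helper exists, 0 if not, -1 on error. */
int child_sees_helper(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(helper_is_running() ? 1 : 0);
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```
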
foota•4mo ago
The thread started sounds like it's single use, not a thread handling requests in a loop. Anyway, a single thread handling requests in a loop would serialize these DNS lookups which if they're hanging would be problematic.
loeg•4mo ago
Yes, but why? As GP notes, the thread doesn't have to be single-use.
throwaway81523•4mo ago
There might be a way to getaddrinfo asynchronously with io_uring by now. Otherwise just call the synchronous version in another thread and let it time out so the thread exits normally, right? Why bother with pthread_cancel?
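
A sketch of the "call it in another thread and let it time out" approach, with no `pthread_cancel()` involved: a detached helper runs the ordinary blocking `getaddrinfo()`, and the caller waits on a condition variable with a deadline. Both sides hold a reference to the shared state, so whichever finishes last frees it; on timeout the caller walks away and the helper cleans up when `getaddrinfo()` eventually returns. The names, the 256-byte host limit, and reusing EAI_AGAIN to mean "timed out" are assumptions of this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <netdb.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LOOKUP_HOST_MAX 256   /* hypothetical limit for this sketch */

/* Shared by the caller and a detached helper; the last one out frees it. */
struct lookup {
    pthread_mutex_t lock;
    pthread_cond_t done_cv;
    int refs;                 /* 2 while both sides hold a reference */
    int done, rc;
    struct addrinfo *res;
    char host[LOOKUP_HOST_MAX];
};

/* Drop one reference. Must be called with l->lock held; unlocks it. */
static void lookup_release(struct lookup *l)
{
    int last = (--l->refs == 0);
    pthread_mutex_unlock(&l->lock);
    if (last) {
        if (l->res != NULL)
            freeaddrinfo(l->res);                /* result nobody collected */
        pthread_cond_destroy(&l->done_cv);
        pthread_mutex_destroy(&l->lock);
        free(l);
    }
}

static void *lookup_thread(void *arg)
{
    struct lookup *l = arg;
    struct addrinfo *res = NULL;
    int rc = getaddrinfo(l->host, NULL, NULL, &res);  /* may block for a long time */
    pthread_mutex_lock(&l->lock);
    l->rc = rc;
    l->res = (rc == 0) ? res : NULL;
    l->done = 1;
    pthread_cond_signal(&l->done_cv);
    lookup_release(l);
    return NULL;
}

/* Returns 0 (caller owns *out), an EAI_* code, or EAI_AGAIN once the deadline
   passes; an abandoned helper finishes and cleans up on its own. */
int resolve_or_abandon(const char *host, int timeout_sec, struct addrinfo **out)
{
    size_t len = strlen(host);
    *out = NULL;
    if (len >= LOOKUP_HOST_MAX)
        return EAI_NONAME;
    struct lookup *l = calloc(1, sizeof *l);
    if (l == NULL)
        return EAI_MEMORY;
    pthread_mutex_init(&l->lock, NULL);
    pthread_cond_init(&l->done_cv, NULL);
    l->refs = 2;
    memcpy(l->host, host, len + 1);

    pthread_t t;
    if (pthread_create(&t, NULL, lookup_thread, l) != 0) {
        pthread_cond_destroy(&l->done_cv);
        pthread_mutex_destroy(&l->lock);
        free(l);
        return EAI_SYSTEM;
    }
    pthread_detach(t);

    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);   /* default clock for cond_timedwait */
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&l->lock);
    int w = 0;
    while (!l->done && w != ETIMEDOUT)
        w = pthread_cond_timedwait(&l->done_cv, &l->lock, &deadline);
    int rc = EAI_AGAIN;
    if (l->done) {
        rc = l->rc;
        *out = l->res;
        l->res = NULL;                           /* ownership moves to the caller */
    }
    lookup_release(l);
    return rc;
}
```

The price is an extra thread per outstanding lookup, and abandoned ones linger until `getaddrinfo()` gives up on its own.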
loeg•4mo ago
io_uring is for calling kernel APIs; this is a userspace API.
yxhuvud•4mo ago
No. Getaddrinfo is libc, not the kernel. It is of course possible, but complicated, to implement dns resolution with io_uring, but making it behave the same as glibc is very much a nontrivial piece of work.
gary_0•4mo ago
The problem is that the standard library function is specified to be blocking (and it's in userspace, so io_uring is not relevant). It's quite possible to do a non-blocking DNS lookup but you have to use a separate non-standard library (like c-ares).
1over137•4mo ago
io_uring is a linux-ism, curl is cross-platform.
Someone•4mo ago
> Then it needs to sort them if there is more than one address. And in order to do that it needs to read /etc/gai.conf

I don’t see why glibc would have to do that inside a call to getaddrinfo. Can’t it do that once at library initialization? If it has to react to changes to that file while a process is running, couldn’t it have a separate thread poll that file for changes, or use inotify so that a separate thread is notified when it changes? Swapping in the new config atomically might be problematic, but I would think that is solvable.

Even ignoring the issue mentioned it seems wasteful to open, parse, and close that file repeatedly.

loeg•4mo ago
I think the libc people might argue this level of functionality is just outside the scope of libc. (Arguably, it is a mistake for DNS to be part of libc, given how complicated it is.)
ComputerGuru•4mo ago
To be sure, complexity isn’t the deciding factor for whether something is or isn’t in scope for libc, though.
loeg•4mo ago
Historically, libcs have been leery of imposing threading on otherwise single-threaded applications, and for similar reasons they try to minimize startup costs.
Someone•4mo ago
https://sourceware.org/glibc/ says

“The GNU C Library is designed to be a backwards compatible, portable, and high performance ISO C library. It aims to follow all relevant standards including ISO C11, POSIX.1-2008, and IEEE 754-2008.”

⇒ I don’t think they make that argument.

loeg•4mo ago
The standards make no requirement that an implementation be good, performant, or even useful.
NewJazz•4mo ago
You want libc to start a thread whenever it is loaded?
Someone•4mo ago
If it’s the only alternative to being broken: yes.

It could read and parse the file the first time a thread gets created, too.

A cheaper alternative is to check the modification time of the config file and only reparse it in getaddrinfo when that changes. That doesn’t 100% fix the problem in theory, but would do it in practice.
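
The cheaper alternative, sketched: `stat()` the file and reparse only when its modification time differs from the one recorded at the last parse. `struct config_cache` and `config_needs_reload` are hypothetical names, not glibc internals; in the case under discussion the path would be `/etc/gai.conf`. As the comment says, this is a practical fix rather than a complete one, since whenever the file does change it still has to be parsed inside the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Remembers the mtime seen at the last parse. */
struct config_cache {
    bool loaded;
    struct timespec mtime;
};

/* Returns 1 if `path` should be (re)parsed now, 0 if the cached parse is still
   current, or -1 if stat() fails. On 1, the caller is expected to parse. */
int config_needs_reload(struct config_cache *c, const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    if (c->loaded &&
        st.st_mtim.tv_sec == c->mtime.tv_sec &&
        st.st_mtim.tv_nsec == c->mtime.tv_nsec)
        return 0;                 /* unchanged since the last parse */
    c->loaded = true;
    c->mtime = st.st_mtim;
    return 1;
}
```
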

cesarb•4mo ago
> I don’t see why glibc would have to do that inside a call to getaddrinfo. can’t it do that once at library initialization?

If it were a library dedicated to DNS, sure, but glibc is used by nearly every process in the system, including many which will never call getaddrinfo.

jart•4mo ago
Why can't they help fix the C library in question? Cancelation is really tricky to implement for the C library author. It's one of those concepts that, like fork, has implications that pervade everything. Please give your C library maintainers a little leeway if they get cancelation wrong. Especially if it's just a memory leak.
RedShift1•4mo ago
I'm betting this code is so old and its behavior so ingrained everywhere else that nobody dares touching it.
jart•4mo ago
No it sounds like they just need to add a pthread_cleanup_push() call somewhere in the getaddrinfo() implementation.

C libraries are not black magic. Nor are they holy code. We needn't fear them. It's the simplest part of the software stack.
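
For reference, the `pthread_cleanup_push()`/`pthread_cleanup_pop()` mechanism in a minimal sketch with hypothetical names: the handler frees a buffer if the thread is cancelled at a cancellation point (here `pthread_testcancel()`) while the buffer is live. This only illustrates the general mechanism; it is not a patch to any libc's `getaddrinfo()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

static atomic_int cleanups_run;     /* how many times the handler ran */
static atomic_int worker_started;   /* set once the handler is registered */

static void free_buffer(void *p)
{
    free(p);
    atomic_fetch_add(&cleanups_run, 1);
}

static void *parsing_worker(void *arg)
{
    (void)arg;
    char *buf = malloc(4096);       /* stand-in for state a parser allocates */
    if (buf == NULL) {
        atomic_store(&worker_started, 1);
        return NULL;
    }
    pthread_cleanup_push(free_buffer, buf);
    atomic_store(&worker_started, 1);
    for (;;)
        pthread_testcancel();       /* deferred cancellation takes effect here */
    pthread_cleanup_pop(1);         /* unreachable, but must pair with the push */
    return NULL;
}

/* Returns the number of cleanup handlers that ran (1 expected), or -1. */
int cleanup_demo(void)
{
    pthread_t t;
    void *result = NULL;
    atomic_store(&cleanups_run, 0);
    atomic_store(&worker_started, 0);
    if (pthread_create(&t, NULL, parsing_worker, NULL) != 0)
        return -1;
    while (!atomic_load(&worker_started))
        sched_yield();
    pthread_cancel(t);
    pthread_join(t, &result);
    return result == PTHREAD_CANCELED ? atomic_load(&cleanups_run) : -1;
}
```
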

gary_0•4mo ago
[deleted]
okl•4mo ago
https://c-ares.org/
comex•4mo ago
pthread_cancel is not a good design because it operates entirely separately from normal mechanisms of error handling and unwinding. (That is, if you’re using C. If you’re using C++ it can integrate with exception handling.)

A better approach would have been to mimic how kernels internally handle signals received during syscalls. Receiving a signal is supposed to cancel the syscall. But from the kernel’s perspective, a syscall implementation is just some code. It can call other functions, acquire locks, wait for conditions, and do anything else you would expect code to do. All of that needs to be cleanly cancelled and unwound to avoid breaking the rest of the system.

So it works like this: when a signal is sent to a thread, a persistent “interrupted” flag is set for that thread. Like with pthread_cancel, this doesn’t immediately interrupt the thread, but only has an effect once the thread calls one of a specific set of functions. For pthread_cancel, that set consists of a bunch of syscalls and other “cancellation points”. For kernel-internal code, it consists of most functions that wait for a condition. The difference is in what happens afterwards. In pthread_cancel’s case, the thread is immediately aborted with only designated cleanups running. In the kernel, the condition-waiting function simply returns an error code. The caller is expected to handle this like any other error code, i.e. by performing any necessary cleanup and then returning the same error code itself. This continues until the entire chain of calls has been unwound. Classic C manual error handling. It’s nothing special, but because interruption works the same way as regular error handling, it’s more likely to “just work”. Once everything is unwound, the “interrupted” flag is cleared and the original signal can be handled.

(The error code for interruption is usually EINTR, but don’t confuse this with EINTR handling in userspace, which is a mess. The difference is because userspace generally doesn’t want to abort operations upon receiving EINTR, and because from userspace’s perspective there’s no persistent flag.)

pthread_cancel could have been designed the same way: cancellation points return an error code rather than forcibly unwinding. Admittedly, this system might not work quite as well in userspace as it does in kernels. Kernel code already needs to be scrupulous about proper error handling, whereas userspace code often just aborts if a syscall fails. Still, the system would work fine for well-written userspace code, which is more than can be said for pthread_cancel.

jcarrano•4mo ago
In libdill, cancelling a coroutine makes all blocking calls on that routine return immediately with ECANCELED. The code must check this condition and exit any loop and so eventually the coroutine finishes, having released all resources in the process.
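
A minimal userspace sketch of the error-code style of cancellation described in the two comments above, with hypothetical names throughout (`struct cancel_token`, `token_cancel`, `wait_ready`, `do_step`; none is a POSIX API): cancelling sets a persistent flag and wakes waiters, the blocking helper returns `ECANCELED` instead of unwinding, and each caller frees what it holds and propagates the code like any other error.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* A persistent "interrupted" flag plus the condition the worker waits for. */
struct cancel_token {
    pthread_mutex_t lock;
    pthread_cond_t cv;
    bool cancelled;
    bool ready;
};

void token_cancel(struct cancel_token *t)
{
    pthread_mutex_lock(&t->lock);
    t->cancelled = true;              /* stays set; nothing is unwound forcibly */
    pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->lock);
}

/* A "cancellation point" that reports cancellation as an ordinary error. */
static int wait_ready(struct cancel_token *t)
{
    int err = 0;
    pthread_mutex_lock(&t->lock);
    while (!t->ready && !t->cancelled)
        pthread_cond_wait(&t->cv, &t->lock);
    if (!t->ready)
        err = ECANCELED;
    pthread_mutex_unlock(&t->lock);
    return err;
}

static int live_buffers;              /* outstanding allocations (single-threaded demo) */

/* Callers clean up and propagate ECANCELED like any other error code. */
int do_step(struct cancel_token *t)
{
    char *buf = malloc(4096);
    if (buf == NULL)
        return ENOMEM;
    live_buffers++;
    int err = wait_ready(t);
    /* ... real work using buf would go here when err == 0 ... */
    free(buf);
    live_buffers--;
    return err;
}
```
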
yardstick•4mo ago
It’s been decades, why doesn’t getaddrinfo have a standardised way to specify a timeout? Set a timeout to 10 seconds and life becomes a lot easier.

Yes I know in Linux you can set the timeout in a config file.

But really the DNS timeout should be configurable by the calling code. Some code requires fast lookups and doesn’t mind failing, while other code won’t mind waiting longer. It’s not a one-size-fits-all thing.

ComputerGuru•4mo ago
I disagree; there are too many variables, and ultimately the end user would be the one who knows best. The proper solution isn’t to have the library or application developer pick a timeout, since they have no idea what kind of network connection the user is running, the type of DNS server (caching or not, LAN or remote, etc.), or the name servers of the target domain and their performance or availability. This is all really the domain of the sysadmin.

The solution is to make it a properly non-blocking api.

o11c•4mo ago
On Linux you can do what you're asking with `getaddrinfo_a` + `gai_suspend`.
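
A sketch of that glibc-only combination, assuming glibc (before 2.34, link with `-lanl`): the lookup is started with `GAI_NOWAIT`, `gai_suspend()` waits up to a caller-chosen timeout, and on expiry `gai_cancel()` is attempted. `resolve_with_timeout` and `struct pending_lookup` are hypothetical names. If `gai_cancel()` reports `EAI_NOTCANCELED`, the request is still running inside glibc, so the sketch deliberately leaks the request (host copy and hints included) rather than freeing memory the resolver may still use; a signal interrupting `gai_suspend()` is treated as a timeout for brevity.

```c
#define _GNU_SOURCE                     /* getaddrinfo_a() is a glibc extension */
#include <netdb.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PENDING_HOST_MAX 256            /* hypothetical limit for this sketch */

/* Everything glibc may still read or write after we stop waiting. */
struct pending_lookup {
    struct gaicb cb;
    struct addrinfo hints;
    char host[PENDING_HOST_MAX];
};

/* Returns 0 and stores the result in *out (free with freeaddrinfo()), or an
   EAI_* code; a lookup that outlives timeout_ms yields EAI_AGAIN. */
int resolve_with_timeout(const char *host, const struct addrinfo *hints,
                         long timeout_ms, struct addrinfo **out)
{
    size_t len = strlen(host);
    *out = NULL;
    if (len >= PENDING_HOST_MAX)
        return EAI_NONAME;
    struct pending_lookup *p = calloc(1, sizeof *p);
    if (p == NULL)
        return EAI_MEMORY;
    memcpy(p->host, host, len + 1);
    p->cb.ar_name = p->host;
    if (hints != NULL) {
        p->hints = *hints;
        p->cb.ar_request = &p->hints;
    }

    struct gaicb *start_list[1] = { &p->cb };
    struct sigevent sev;
    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_NONE;      /* no notification; we wait below */
    int err = getaddrinfo_a(GAI_NOWAIT, start_list, 1, &sev);
    if (err != 0) {
        free(p);
        return err;
    }

    const struct gaicb *wait_list[1] = { &p->cb };
    struct timespec timeout = { timeout_ms / 1000, (timeout_ms % 1000) * 1000000L };
    (void)gai_suspend(wait_list, 1, &timeout);   /* outcome is read via gai_error() */

    err = gai_error(&p->cb);
    if (err == EAI_INPROGRESS) {
        int c = gai_cancel(&p->cb);
        if (c == EAI_NOTCANCELED)
            return EAI_AGAIN;           /* still running in glibc: leak p on purpose */
        if (c == EAI_CANCELED) {
            free(p);                    /* dequeued before it started */
            return EAI_AGAIN;
        }
        err = gai_error(&p->cb);        /* EAI_ALLDONE: finished in the meantime */
    }
    if (err == 0)
        *out = p->cb.ar_result;
    free(p);
    return err;
}
```
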

As always, on non-Linux Unixen the answer is "screw you!"

sedatk•4mo ago
Just wanted to note that Windows doesn't have that problem either. Even Windows NT had async getaddrinfo() variants.
BobbyTables2•4mo ago
Wow, TIL ! Thanks!
throwawayoogux•4mo ago
OpenBSD has getaddrinfo_async since 5.6 (March 2014).
surajrmal•4mo ago
The fact that a library as prolific as libcurl doesn't use it and tried to resort to pthread_cancel, known to have serious problems, points to a larger problem. Either POSIX needs to incorporate an async getaddrinfo or there is an education problem that needs to be addressed.
okanat•4mo ago
Just leave DNS out: is there any POSIX-standard async functionality for networking, or even normal I/O? All I know from reading some libraries is epoll or io_uring on Linux, and kevent on BSDs.
mort96•4mo ago
POSIX has poll for that.
comex•4mo ago
Yes for networking. You set your sockets into O_NONBLOCK mode and use poll() or select(). These APIs are in POSIX and also have direct equivalents in Winsock.

There is also POSIX AIO for async I/O on any file descriptor, but at least historically speaking it doesn't work properly on Linux.
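
A minimal sketch of the POSIX pattern described above, applied to reading: put the descriptor in `O_NONBLOCK` mode, and when `read()` reports `EAGAIN`/`EWOULDBLOCK`, wait with `poll()` up to a timeout. `set_nonblocking` and `read_with_timeout` are hypothetical helper names; for brevity, the full timeout restarts after each `EINTR` or spurious wakeup.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Returns bytes read, 0 on EOF, or -1 with errno set (ETIMEDOUT if nothing
   arrived within timeout_ms). `fd` must already be non-blocking. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int r = poll(&p, 1, timeout_ms);   /* sleep until readable or timeout */
        if (r == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (r < 0 && errno != EINTR)
            return -1;
    }
}
```
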

epcoa•4mo ago
POSIX AIO on “Linux” is implemented in glibc with userland threads using regular blocking syscalls behind the scenes. It basically works properly, it just doesn’t gain any potential efficiency benefits, it adds avoidable overhead, prone to priority inversion, etc. The linux kernel has no provision for POSIX AIO.

Until io_uring the only asynchronous disk IO interface was the io_* syscalls, which were confusingly referred to as Asynchronous IO, though these have nothing to do with POSIX AIO and can only be used bypassing the page cache, and suck for general purpose use.

eklitzke•4mo ago
A few reasons, I think.

The first is that getaddrinfo is specified by POSIX, and POSIX standards evolve very conservatively and at a glacial pace.

The second reason is that specifying a timeout breaks symmetry with a lot of other functions in Unix/C, both system calls and libc calls. For example, you can't specify a timeout when opening a file, reading from a file, or closing a file, which are all potentially blocking operations. There are ways to do these things in a non-blocking manner with timeouts using aio or io_uring, but those are already relatively complicated APIs for just using simple system calls, and getaddrinfo is much more complicated.

The last reason is that if you use the sockets APIs directly it's not that hard to write a non-blocking DNS resolver (c-ares is one example). The thing is though that if you write your own resolver you have to consider how to do caching, it won't work with NSS on Linux, etc. You can implement these things (systemd-resolved does it, and works with NSS) but they are a lot of work to do properly.

jstimpfle•4mo ago
> For example, you can't specify a timeout when opening a file, reading from a file, or closing a file, which are all potentially blocking operations.

No they're not. Not really, unless you consider disk access and interacting with the page cache/inode cache inside the kernel to be blocking. But if you do that, you should probably also consider scheduling and really any CPU instruction to be blocking. (If the system is too loaded, anything can be slow).

To be fair, network requests can be considered non-blocking in a similar way, but they're depending on other systems that you generally can't control or inspect. In practice you'll see network timeouts. Note that you (at least normally -- there might be tricky exceptions) won't see EINTR from read() to a filesystem file. But you can see EINTR for network sockets. The difference is that, in Unix terminology, disks are not considered "slow devices".

jcelerier•4mo ago
I'd consider "blocking" anything that given same inputs, state and cpu frequency, may take variable time. That means pretty much every system call and entering the system scheduler, doing something that leads to a page fault, etc. Pretty much only pure math in total functions and function calls to paged functions are acceptable.
Joker_vD•4mo ago
Yeah... the sudden paging in has also been noted as a source of latency in network-oriented software. But that's the problem with our current state of the APIs and their implementations: ideally, you'd have as many independent threads of execution as you want/need, and every time one of them initiates some "blocking" operation, it is quickly and efficiently scheduled, and another ready-to-run thread is switched in. Native threads don't give you that context-switching efficiency, and user-space threads can accidentally cause an underlying native thread to block even on "read a non-local variable".
Joker_vD•4mo ago
> disks are not considered "slow devices".

And neither are the tapes. But the pipes, apparently, are.

Well, unfortunately, disk^H^H^H^H large persistent storage I/O is actually slow, or people wouldn't have been writing thread-pools to make it look asynchronous, or sometimes even process-pools to convert disk I/O to pipe I/O, for the last two decades.

jstimpfle•4mo ago
There is a misunderstanding. "Slow device" in the POSIX sense is about unpredictability, not maximal possible bandwidth. Reading from a spinning disk might be comparably slow in the bandwidth sense, but it's actually quite deterministic how much data you can shovel to or from it.

A pipe on the other hand might easily stall for an hour. The kernel generally can't know how long it will have to wait for more data. That's why pipe reads (as well as writes) are interruptible.

The absolute bandwidth number of a harddisk doesn't matter --- in principle you can overload any system such that it fails to schedule and complete all processes in time. Putting aside possible system overload, the "slow device" terminology makes a lot of sense.

Joker_vD•4mo ago
Seeking a tape also takes an unpredictable amount of time, and so does seeking a disk, for that matter (IIRC, historically it was actually quite difficult for UNIX systems to saturate a disk's throughput with random reads).
jstimpfle•4mo ago
According to ChatGPT, a tape device is actually considered a "slow device", though I'm not sure it's really that unpredictable. Maybe it is for the most common use cases.

I was under the impression that you can generally budget about 10ms for a disk seek? Again, it depends on the file system abstractions built on top, and then on the cache and the current system load: how many seeks will be required?

jcalvinowens•4mo ago
>> disks are not considered "slow devices".

> And neither are the tapes. But the pipes, apparently, are.

The "slow vs fast" language is really unfortunate. I realize it's traditional, but it's unnecessarily confusing.

A better way to conceptualize it IMHO is bounded vs unbounded: a file or a tape contains a fixed amount of data known a priori; a socket or a pipe does not.
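One way to see the bounded/unbounded distinction from user space is sketched below in POSIX C. `readable_now` is a hypothetical helper. POSIX specifies that regular files always poll as ready for reading, because a read can be answered immediately with either data or EOF. A pipe only becomes readable once some writer has actually produced data.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if `fd` is readable right now (zero timeout), 0 if a read
   would have to wait, -1 on error. */
int readable_now(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int n = poll(&p, 1, 0);
    if (n < 0)
        return -1;
    return (n > 0 && (p.revents & POLLIN)) ? 1 : 0;
}
```

For example, `readable_now(fileno(tmpfile()))` returns 1 even though the file is empty. The read end of a fresh `pipe()` returns 0 until a byte is written to the other end.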

surajrmal•4mo ago
In a data center, networks can have lower latency than disks (even SSDs). Where this all really falls on its head, though, is page faults. There are definitely places where you need granular control over what can and cannot stall a thread's progress.
jcalvinowens•4mo ago
> No they're not. Not really, unless you consider disk access and interacting with the page cache/inode cache inside the kernel to be blocking.

The important point is that the kernel takes locks during all those operations, and will wait an unbounded amount of time if those locks are contended.

So really and truly, yes, any synchronous syscall can schedule out for an arbitrary amount of time, no matter what you do.

It's sort of semantic. The word "block" isn't a synonym for "sleep", it has a specific meaning in POSIX. In that meaning, you're correct, they never "block". But in the generic way most people use the term "block", they absolutely do.

asveikau•4mo ago
I think getaddrinfo_a is cancellable, including the ability to block with a timeout. It is a glibc extension.
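Below is a minimal glibc sketch of that idea. It assumes Linux with glibc (link with -lanl on glibc older than 2.34), and `resolve_with_deadline` is a hypothetical name. The getaddrinfo_a(3) man page has one caveat worth knowing. `gai_cancel` only cancels a request that has not started yet, and it returns `EAI_NOTCANCELED` for one that is already being processed. The timeout therefore mostly means "stop waiting", and everything the request points to has to outlive the call.

```c
#define _GNU_SOURCE
#include <netdb.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Resolves `host`, waiting at most `timeout_ms`. Returns 0 and sets *out on
   success, EAI_AGAIN on timeout, or another EAI_* code on failure. */
int resolve_with_deadline(const char *host, long timeout_ms, struct addrinfo **out)
{
    /* Heap-allocate everything the request points at: a request that is
       already running cannot be canceled and keeps using this memory. */
    struct gaicb *req = calloc(1, sizeof *req);
    struct addrinfo *hints = calloc(1, sizeof *hints);
    char *name = strdup(host);
    *out = NULL;
    if (!req || !hints || !name) {
        free(req); free(hints); free(name);
        return EAI_MEMORY;
    }
    hints->ai_family = AF_UNSPEC;
    hints->ai_socktype = SOCK_STREAM;
    req->ar_name = name;
    req->ar_request = hints;

    struct gaicb *list[1] = { req };
    int rc = getaddrinfo_a(GAI_NOWAIT, list, 1, NULL);
    if (rc != 0) {
        free(req); free(hints); free(name);
        return rc;
    }

    struct timespec timeout = { timeout_ms / 1000, (timeout_ms % 1000) * 1000000L };
    /* Return value ignored: gai_error() below is the source of truth. */
    gai_suspend((const struct gaicb *const *)list, 1, &timeout);

    rc = gai_error(req);
    if (rc == EAI_INPROGRESS) {
        int c = gai_cancel(req);
        if (c == EAI_CANCELED) {      /* never started: safe to free */
            free(req); free(hints); free(name);
            return EAI_AGAIN;
        }
        if (c == EAI_NOTCANCELED)     /* still running: deliberately leak */
            return EAI_AGAIN;
        rc = gai_error(req);          /* EAI_ALLDONE: finished meanwhile */
    }

    if (rc == 0)
        *out = req->ar_result;        /* caller frees with freeaddrinfo() */
    free(req); free(hints); free(name);
    return rc;
}
```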
albertzeyer•4mo ago
Why not use getaddrinfo_a / getaddrinfo_async_start / GetAddrInfoExW?

Or just use some standalone DNS resolver code or library (which basically replicates getaddrinfo but supports this in an async way)?

See also here the discussion: https://github.com/crystal-lang/crystal/issues/13619

eliaspro•4mo ago
A standalone library would have to work with all the existing system facilities (e.g. NSS on Linux systems) so that it isn't restricted to resolv.conf entries but supports all the various other methods of resolving names.
flopsamjetsam•4mo ago
libcurl's c-ares support would fit the bill?
charcircuit•4mo ago
Why isn't DNS a service provided by the operating system instead of part of libc? You'll want requests to be cached locally anyway. This also makes it easier to just abandon an RPC instead of stopping a thread you don't control.
cesarb•4mo ago
> Why isn't DNS in a service on the operating system instead of libc?

On modern Linux systems, it is: systemd-resolved (https://www.freedesktop.org/software/systemd/man/latest/syst...) is a system service which can be queried through RPC (using dbus or varlink), through the traditional glibc APIs (using an NSS plugin), or even as if it were a normal recursive DNS server on the loopback interface (for compatibility with software which bypasses glibc and does direct DNS queries).

charcircuit•4mo ago
Well, from the blog post it looks like calling it via glibc does more work than just querying the service.
kelnos•4mo ago
Not sure how standard that is, though. E.g. on Debian systemd-resolved isn't even installed by default, let alone enabled and set up as the default resolver.
pajko•4mo ago
This is clearly an implementation error in getaddrinfo(). It should set up cleanup functions: https://man7.org/linux/man-pages/man3/pthread_cleanup_push.3...
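For reference, this is the suggested mechanism on a toy worker thread, not anything inside glibc. The names are illustrative, and older toolchains need -pthread. A handler registered with `pthread_cleanup_push` runs if the thread is canceled at a cancellation point such as `read()` before the matching `pthread_cleanup_pop`.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static int cleanup_ran = 0;

/* Cleanup handler: runs on cancellation between push and pop. */
static void free_buffer(void *p)
{
    free(p);
    cleanup_ran = 1;
}

static void *worker(void *arg)
{
    int fd = *(int *)arg;
    char *buf = malloc(4096);
    pthread_cleanup_push(free_buffer, buf);
    ssize_t n = read(fd, buf, 4096);   /* read() is a cancellation point */
    (void)n;
    pthread_cleanup_pop(1);            /* 1: also run the handler normally */
    return NULL;
}

/* Starts a worker that blocks reading an empty pipe, cancels it and reports
   whether it was canceled with its cleanup handler run (1), not (0), or -1. */
int cancel_with_cleanup(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    pthread_t t;
    if (pthread_create(&t, NULL, worker, &fds[0]) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    pthread_cancel(t);                 /* deferred: acted on at read() */
    void *ret = NULL;
    pthread_join(t, &ret);
    close(fds[0]);
    close(fds[1]);
    return (ret == PTHREAD_CANCELED && cleanup_ran) ? 1 : 0;
}
```

Every resource acquired while the thread is cancelable needs such a handler. That bookkeeping is what makes this hard to retrofit onto a deep call chain.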
hacker_homie•4mo ago
What's old is new again. I loved trying to remotely stop a thread in Java in the early 2000s:

Thread.destroy(), Thread.stop(), Thread.suspend()

So much potential for corrupted state.

senderista•4mo ago
Relevant for Linux: https://www.imperialviolet.org/2005/06/01/asynchronous-dns-l...
blaz0•4mo ago
I'm the author of the GitHub issue that the blog links to, and I'd like to thank Stefan for quickly acknowledging the problem and addressing the issue! I try to keep one of our internal applications up to date with the latest libcurl version within a day or two of a release, so we sometimes hit fresh problems while running our battery of tests.

Ironically, our application has also struggled with blocking DNS resolution in the past, so I appreciate the discussion here. In case anyone is interested, here is a quick reference of the different asynchronous DNS resolution methods that you can use in a native code application for some of the most popular platforms:

  - Windows/Xbox: GetAddrInfoExW / GetAddrInfoExCancel
  - macOS/iOS: CFHostStartInfoResolution / CFHostCancelInfoResolution
  - Linux (glibc): getaddrinfo_a / gai_cancel
  - Android: android.net.DnsResolver.query (requires the use of JNI)
  - PS5: proprietary DNS resolver API
wolletd•4mo ago
This information about `getaddrinfo_a` should probably also be in the Github issue?
somat•4mo ago
If we are doing a survey, I found a few more. It feels like we need to get everyone together in a room and say "we will let you out when you decide on a standard non-blocking address lookup". What a mess.

  - OpenBSD: getaddrinfo_async / asr_abort https://man.openbsd.org/asr_run
  - FreeBSD: dnsres_getaddrinfo / found no way to abort https://man.freebsd.org/cgi/man.cgi?query=dnsres
jnwatson•4mo ago
As the article mentions, why not just delegate it to a library dedicated to the solution? c-ares is a solid, well-maintained library.
JosifA•4mo ago
Unfortunately, c-ares is not problem-free on all platforms.

On iOS, its use triggers a local network access popup (it tries to reach your DNS server, which is often on your LAN). If a user denies access, your app will simply not work.

On Android, it's not compatible with some VPN apps. Those apps are to blame, but your users are going to blame you, not them.

So, at my previous company we ended up building libcurl with a threaded DNS resolver on both iOS and Android.
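Here is a rough sketch of the threaded-resolver pattern in portable POSIX C. It runs the blocking `getaddrinfo` in a detached thread, waits on a condition variable with a deadline, and simply abandons the lookup on timeout instead of canceling anything. This is not libcurl's implementation. `resolve_threaded` and `struct dns_job` are hypothetical names, and the reference count is just one way to let whichever side finishes last free the shared state.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <netdb.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Shared by caller and resolver thread; freed by whoever drops the last ref. */
struct dns_job {
    pthread_mutex_t lock;
    pthread_cond_t done_cv;
    int refs, done, rc;           /* refs is 2 while both sides hold it */
    char *host;
    struct addrinfo *res;         /* owned by the job until claimed */
};

/* Must be called with job->lock held; releases the lock. */
static void job_release(struct dns_job *job)
{
    job->refs--;
    int last = (job->refs == 0);
    pthread_mutex_unlock(&job->lock);
    if (last) {
        if (job->res)
            freeaddrinfo(job->res);   /* result of an abandoned lookup */
        pthread_cond_destroy(&job->done_cv);
        pthread_mutex_destroy(&job->lock);
        free(job->host);
        free(job);
    }
}

static void *resolver_main(void *arg)
{
    struct dns_job *job = arg;
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    int rc = getaddrinfo(job->host, NULL, &hints, &res);  /* may block long */

    pthread_mutex_lock(&job->lock);
    job->rc = rc;
    job->res = res;
    job->done = 1;
    pthread_cond_signal(&job->done_cv);
    job_release(job);
    return NULL;
}

/* Returns 0 and sets *out on success, EAI_AGAIN on timeout, or an EAI_* code. */
int resolve_threaded(const char *host, long timeout_ms, struct addrinfo **out)
{
    *out = NULL;
    struct dns_job *job = calloc(1, sizeof *job);
    if (!job || !(job->host = strdup(host))) {
        free(job);
        return EAI_MEMORY;
    }
    job->refs = 2;
    pthread_mutex_init(&job->lock, NULL);
    pthread_cond_init(&job->done_cv, NULL);

    pthread_t tid;
    int err = pthread_create(&tid, NULL, resolver_main, job);
    if (err != 0) {
        pthread_cond_destroy(&job->done_cv);
        pthread_mutex_destroy(&job->lock);
        free(job->host);
        free(job);
        errno = err;
        return EAI_SYSTEM;
    }
    pthread_detach(tid);          /* never joined: we may walk away */

    struct timespec deadline;     /* CLOCK_REALTIME: the condvar default */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&job->lock);
    err = 0;
    while (!job->done && err != ETIMEDOUT)
        err = pthread_cond_timedwait(&job->done_cv, &job->lock, &deadline);

    int rc = EAI_AGAIN;           /* timed out: the thread cleans up later */
    if (job->done) {
        rc = job->rc;
        *out = job->res;
        job->res = NULL;          /* ownership moves to the caller */
    }
    job_release(job);
    return rc;
}
```

The cost of this design is that an abandoned lookup still occupies a thread until `getaddrinfo` returns on its own. In exchange, nothing is ever killed mid-operation.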

enva2712•4mo ago
I've run into this issue with pthread_cancel.

It makes cleanup logic convoluted when you can't easily pthread_cleanup_push, forcing you to either block on the join or signal cancellation and detach.

rurban•4mo ago
Just use c-ares. Threads with signals are evil
juliangmp•4mo ago
I feel like the entire pthread_cancel and cancellation point feature is only really useful when your application is exiting and resource management doesn't matter anymore. If you want to use it in a long-running process, the thread that gets canceled must effectively be completely free of any resource allocation.

Unless of course you're making use of shared memory between processes, then you could still leak even when the entire process is shut down.

whatifitoldyou•4mo ago
pthread_cancel is never the answer. Abrupt thread cancellation always risks leaks or, even worse, deadlocks, no matter whether the thread was blocked in a single libc function. That function is not one instruction; you don't know what the poor thread was doing when it was killed.