RIP pthread_cancel

https://eissing.org/icing/posts/rip_pthread_cancel/

71•robin_reala•3h ago

Comments

pizlonator•1h ago

At first I wondered if musl does it better, so I checked, and the version I have disables cancellation in the guts of `getaddrinfo`.

I've always thought APIs like `pthread_cancel` are too nasty to use. Glad to see well-documented evidence for my crank opinion.

pengaru•39m ago

Asynchronous cancellation in particular is difficult to use correctly, but it is also one of the most useful parts of the API in the situations where it is appropriate.

Imagine cpu-bound worker threads that do nothing but consume work via condition variables and spend long periods of time in hot compute-only loops working on said work... Instead of adding a conditional in the compute you're probably not interested in slowing down at all, you turn on async cancellation and pthread_cancel() the workers when you need to interrupt what's going on.
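A minimal sketch of the pattern described above, assuming a worker whose loop touches no locks, allocations, or I/O (the only case where asynchronous cancellation is reasonably safe). The names `spin_worker` and `cancel_spin_worker_demo` are made up for illustration; the pthread calls are the standard POSIX ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical compute-only worker: nothing inside the loop holds a lock,
 * allocates memory, or performs I/O. */
static void *spin_worker(void *arg)
{
    volatile unsigned long *counter = arg;
    int old_type;

    /* Let a cancel request take effect at any instruction, not only at
     * cancellation points. */
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &old_type);

    for (;;)
        (*counter)++;              /* stand-in for real number crunching */

    return NULL;                   /* not reached */
}

/* Starts the worker, lets it run briefly, then cancels and joins it.
 * Returns 0 if the join reported PTHREAD_CANCELED, -1 otherwise. */
int cancel_spin_worker_demo(void)
{
    static volatile unsigned long counter;
    struct timespec pause = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    pthread_t tid;
    void *result = NULL;

    int err = pthread_create(&tid, NULL, spin_worker, (void *)&counter);
    assert(err == 0);
    (void)err;

    nanosleep(&pause, NULL);       /* give the loop time to get going */

    if (pthread_cancel(tid) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)
        return -1;

    return result == PTHREAD_CANCELED ? 0 : -1;
}
```

The loop never checks a flag, which is the whole appeal. The cost is that anything the loop might have been in the middle of (a lock, a `malloc`) is left in an unknown state, so the sketch only holds while the loop stays pure computation.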

But it's worth noting that pthread_cancel() is rarely supported outside first-class pthreads-capable systems like modern Linux. So if you have any intention of running elsewhere, forget about it. Thread cancellation support in general is actually somewhat rare IME.

rwmj•1h ago

Netscape used to start a new thread (or maybe it was a subprocess?) to handle DNS lookups, because the API at the time (gethostbyname) was blocking. It's kind of amazing that we're 30 years on and this is still a problem.

nly•1h ago

If you want DNS resolution to obey user/system preferences then you need to use the system provided API

rwmj•1h ago

For sure! The only problem is there should be a non-blocking system-provided API and there isn't.

foota•59m ago

System provided is maybe a strange word to use here since getaddrinfo is a libc function, not a system call.

rwmj•58m ago

POSIX as the system, of course.

silon42•1h ago

As long as broken APIs exist, they will be problematic... they really should be deprecated.

Calling a separate (non-cancellable) thread to perform the lookup sounds like a viable solution...

jeroenhd•10m ago

getaddrinfo_a is available, but not widely adopted (*BSD and Linux), probably because you can't guarantee it'll be available on every computer/phone/modem. This is only an issue if you're targeting POSIX rather than modern operating systems.

Windows 8 and above also have their own asynchronous DNS API in non-POSIX land.

Aardwolf•1h ago

Maybe this is naive, but could there just be some amount of worker threads that run forever, wait for and take jobs when needed, and message when the jobs are done? Don't need to be canceled, don't block

danappelxx•53m ago

If the DNS resolution call blocks the thread, then you need N worker threads to perform N DNS calls. Threads aren’t free, so this is suboptimal. OTOH some thread pools e.g. libdispatch on Apple operating systems will spawn new threads on demand to prevent starvation, so this _can_ be viable. Though of course this can lead to thread explosion which may be even more problematic depending on the use case. In libcurl’s situation, spawning a million threads is probably even worse than a memory leak, which is worse than long timeouts.

In general, what you really want is for the API call to be nonblocking so you’re not forced to burn a thread.

ComputerGuru•37m ago

This is essentially what the previous (largely pathetic) excuse for true asynchronous I/O on Linux did: the libc aio(7) interface faked asynchronous file I/O with a pool of worker threads. It wasn’t great.

nly•1h ago

Why is running the DNS resolution thread a problem? It should be dequeuing resolution requests and pushing responses and sleeping when there is nothing to do

When someone kills off the curl context surely you simply set a suicide flag on the thread and wake it up so it can be joined.
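A rough sketch of that idea, assuming a single long-lived resolver thread fed from a small fixed-size queue. Every name here (`struct resolver`, `resolver_start`, `resolver_submit`, `resolver_stop`) is hypothetical and is not libcurl's API; the actual lookup is left as a comment so the example stays self-contained.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAP 16

/* Hypothetical resolver state shared between callers and the worker. */
struct resolver {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    pthread_t       thread;
    const char     *queue[QUEUE_CAP];  /* pending host names (ring buffer) */
    int             head, count;
    int             handled;           /* requests processed so far */
    bool            stop;              /* the "suicide flag" */
};

static void *resolver_main(void *arg)
{
    struct resolver *r = arg;

    pthread_mutex_lock(&r->lock);
    for (;;) {
        /* Sleep while there is nothing to do. */
        while (r->count == 0 && !r->stop)
            pthread_cond_wait(&r->wake, &r->lock);
        if (r->count == 0)
            break;                     /* stop requested and queue drained */

        assert(r->count > 0);
        const char *host = r->queue[r->head];
        r->head = (r->head + 1) % QUEUE_CAP;
        r->count--;

        pthread_mutex_unlock(&r->lock);
        (void)host;  /* a real worker would call getaddrinfo(host, ...) here */
        pthread_mutex_lock(&r->lock);
        r->handled++;
    }
    pthread_mutex_unlock(&r->lock);
    return NULL;
}

int resolver_start(struct resolver *r)
{
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->wake, NULL);
    r->head = r->count = r->handled = 0;
    r->stop = false;
    return pthread_create(&r->thread, NULL, resolver_main, r);
}

bool resolver_submit(struct resolver *r, const char *host)
{
    bool ok = false;

    pthread_mutex_lock(&r->lock);
    if (!r->stop && r->count < QUEUE_CAP) {
        r->queue[(r->head + r->count) % QUEUE_CAP] = host;
        r->count++;
        ok = true;
        pthread_cond_signal(&r->wake);
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}

/* Sets the flag, wakes the thread, joins it; returns the handled count. */
int resolver_stop(struct resolver *r)
{
    pthread_mutex_lock(&r->lock);
    r->stop = true;
    pthread_cond_signal(&r->wake);
    pthread_mutex_unlock(&r->lock);

    pthread_join(r->thread, NULL);
    int handled = r->handled;
    pthread_cond_destroy(&r->wake);
    pthread_mutex_destroy(&r->lock);
    return handled;
}
```

One caveat the sketch makes visible: the flag is only checked between requests. If the worker is already blocked inside a slow lookup, `resolver_stop` waits in `pthread_join` until that lookup returns, and a single thread also serializes all lookups behind it.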

rwmj•1h ago

One problem may be that fork() kills background threads, so now any program that uses libcurl + fork has to have a new API to restart the DNS thread (or use pthread_atfork, which is a big PITA), and that might break existing programs using curl.

ComputerGuru•40m ago

It’s not too much of an exaggeration to say that everything about using fork() instead of vfork() plus exec() is essentially fundamentally broken in modern osdev without a whole stack of hacks to try and patch individual issues one-by-one.

loeg•37m ago

A surmountable problem, sure.

rwmj•27m ago

Sometimes. To give one counterexample, golang doesn't have a way to restart the threads it uses for I/O (apparently a decision the golang developers made), so if you're embedding golang code in another binary, it better not call fork. (The reason for this warning: https://gitlab.com/nbdkit/nbdkit/-/commit/2589e6da40939af9ae...)

foota•1h ago

The thread started sounds like it's single use, not a thread handling requests in a loop. Anyway, a single thread handling requests in a loop would serialize these DNS lookups which if they're hanging would be problematic.

loeg•38m ago

Yes, but why? As GP notes, the thread doesn't have to be single-use.

throwaway81523•1h ago

There might be a way to getaddrinfo asynchronously with io_uring by now. Otherwise just call the synchronous version in another thread and let it time out so the thread exits normally, right? Why bother with pthread_cancel?
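One way the "separate thread, let it time out" idea could look, as a sketch only. It assumes a detached thread per lookup and a small reference-counted struct so that whichever side finishes last frees the state; `struct lookup`, `lookup_with_timeout`, and the `LOOKUP_*` codes are invented for this example. Only `getaddrinfo`, `freeaddrinfo`, and the pthread calls are real APIs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

enum lookup_rc { LOOKUP_OK, LOOKUP_FAILED, LOOKUP_TIMED_OUT };

/* Hypothetical shared state; freed by whichever side lets go last. */
struct lookup {
    pthread_mutex_t  lock;
    pthread_cond_t   done_cv;
    char            *host;
    struct addrinfo *result;
    int              status;   /* getaddrinfo() return value */
    int              done;
    int              refs;     /* caller + worker */
};

static void lookup_unref(struct lookup *l)
{
    pthread_mutex_lock(&l->lock);
    int left = --l->refs;
    pthread_mutex_unlock(&l->lock);
    if (left > 0)
        return;
    if (l->result)
        freeaddrinfo(l->result);       /* nobody collected the answer */
    pthread_cond_destroy(&l->done_cv);
    pthread_mutex_destroy(&l->lock);
    free(l->host);
    free(l);
}

static void *lookup_main(void *arg)
{
    struct lookup *l = arg;
    struct addrinfo hints = { .ai_family = AF_UNSPEC,
                              .ai_socktype = SOCK_STREAM };
    struct addrinfo *res = NULL;
    int rc = getaddrinfo(l->host, NULL, &hints, &res);   /* may block */

    pthread_mutex_lock(&l->lock);
    l->status = rc;
    l->result = (rc == 0) ? res : NULL;
    l->done = 1;
    pthread_cond_signal(&l->done_cv);
    pthread_mutex_unlock(&l->lock);
    lookup_unref(l);                   /* exits normally, never cancelled */
    return NULL;
}

enum lookup_rc lookup_with_timeout(const char *host, int timeout_sec,
                                   struct addrinfo **out)
{
    assert(host != NULL && out != NULL && timeout_sec >= 0);
    *out = NULL;

    struct lookup *l = calloc(1, sizeof *l);
    if (!l || !(l->host = strdup(host))) {
        free(l);
        return LOOKUP_FAILED;
    }
    pthread_mutex_init(&l->lock, NULL);
    pthread_cond_init(&l->done_cv, NULL);
    l->refs = 2;

    pthread_t tid;
    if (pthread_create(&tid, NULL, lookup_main, l) != 0) {
        l->refs = 1;                   /* no worker exists to share with */
        lookup_unref(l);
        return LOOKUP_FAILED;
    }
    pthread_detach(tid);               /* nobody will join this thread */

    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    enum lookup_rc rc = LOOKUP_TIMED_OUT;
    pthread_mutex_lock(&l->lock);
    while (!l->done &&
           pthread_cond_timedwait(&l->done_cv, &l->lock, &deadline) != ETIMEDOUT)
        ;
    if (l->done) {
        rc = (l->status == 0) ? LOOKUP_OK : LOOKUP_FAILED;
        *out = l->result;              /* ownership moves to the caller */
        l->result = NULL;
    }
    pthread_mutex_unlock(&l->lock);
    lookup_unref(l);
    return rc;
}
```

On success the caller owns `*out` and releases it with `freeaddrinfo`. The trade-off is the one raised elsewhere in the thread: a timed-out caller moves on, but the thread itself stays alive until `getaddrinfo` returns, so many hanging lookups mean many lingering threads.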

loeg•36m ago

io_uring is for calling kernel APIs; this is a userspace API.

yxhuvud•29m ago

No. Getaddrinfo is libc, not the kernel. It is of course possible, but complicated, to implement dns resolution with io_uring, but making it behave the same as glibc is very much a nontrivial piece of work.

gary_0•21m ago

The problem is that the standard library function is specified to be blocking (and it's in userspace, so io_uring is not relevant). It's quite possible to do a non-blocking DNS lookup but you have to use a separate non-standard library (like c-ares).

1over137•14m ago

io_uring is a linux-ism, curl is cross-platform.

Someone•52m ago

> Then it needs to sort them if there is more than one address. And in order to do that it needs to read /etc/gai.conf

I don’t see why glibc would have to do that inside a call to getaddrinfo. Can’t it do that once at library initialization? If it has to react to changes to that file while a process is running, couldn’t it have a separate thread for polling that file for changes, or use inotify for a separate thread to be called when it changes? Swapping in the new config atomically might be problematic, but I would think that is solvable.

Even ignoring the issue mentioned it seems wasteful to open, parse, and close that file repeatedly.

loeg•39m ago

I think the libc people might argue this level of functionality is just outside the scope of libc. (Arguably, it is a mistake for DNS to be part of libc, given how complicated it is.)

ComputerGuru•34m ago

To be sure, complexity isn’t the determinant of whether something is or isn’t in scope for libc, though.

jart•45m ago

Why can't they help fix the C library in question? Cancelation is really tricky to implement for the C library author. It's one of those concepts that, like fork, has implications that pervade everything. Please give your C library maintainers a little leeway if they get cancelation wrong. Especially if it's just a memory leak.

RedShift1•28m ago

I'm betting this code is so old and its behavior so ingrained everywhere else that nobody dares touch it.

gary_0•45m ago

> c-ares ... will not be able to do everything that glibc does.

Does anyone have any idea what things they're referring to here?

okl•37m ago

https://c-ares.org/

gary_0•29m ago

I meant what specific DNS-related things glibc does that c-ares can't. Your reply does not answer that question.

comex•32m ago

pthread_cancel is not a good design because it operates entirely separately from normal mechanisms of error handling and unwinding. (That is, if you’re using C. If you’re using C++ it can integrate with exception handling.)

A better approach would have been to mimic how kernels internally handle signals received during syscalls. Receiving a signal is supposed to cancel the syscall. But from the kernel’s perspective, a syscall implementation is just some code. It can call other functions, acquire locks, wait for conditions, and do anything else you would expect code to do. All of that needs to be cleanly cancelled and unwound to avoid breaking the rest of the system.

So it works like this: when a signal is sent to a thread, a persistent “interrupted” flag is set for that thread. Like with pthread_cancel, this doesn’t immediately interrupt the thread, but only has an effect once the thread calls one of a specific set of functions. For pthread_cancel, that set consists of a bunch of syscalls and other “cancellation points”. For kernel-internal code, it consists of most functions that wait for a condition. The difference is in what happens afterwards. In pthread_cancel’s case, the thread is immediately aborted with only designated cleanups running. In the kernel, the condition-waiting function simply returns an error code. The caller is expected to handle this like any other error code, i.e. by performing any necessary cleanup and then returning the same error code itself. This continues until the entire chain of calls has been unwound. Classic C manual error handling. It’s nothing special, but because interruption works the same way as regular error handling, it’s more likely to “just work”. Once everything is unwound, the “interrupted” flag is cleared and the original signal can be handled.

(The error code for interruption is usually EINTR, but don’t confuse this with EINTR handling in userspace, which is a mess. The difference is because userspace generally doesn’t want to abort operations upon receiving EINTR, and because from userspace’s perspective there’s no persistent flag.)

pthread_cancel could have been designed the same way: cancellation points return an error code rather than forcibly unwinding. Admittedly, this system might not work quite as well in userspace as it does in kernels. Kernel code already needs to be scrupulous about proper error handling, whereas userspace code often just aborts if a syscall fails. Still, the system would work fine for well-written userspace code, which is more than can be said for pthread_cancel.
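To make the proposed design concrete, here is a small userspace sketch of "cancellation points return an error code". It is not an existing API: `struct cancel_token`, `wait_ready`, `token_cancel`, and `do_work` are hypothetical names, and `ECANCELED` is simply used as the error code that gets propagated up the call chain.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical cancel token: a persistent flag plus the condition variable
 * that waiters sleep on. */
struct cancel_token {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    bool            cancelled;   /* stays set until the owner clears it */
    bool            ready;       /* the condition being waited for */
};

/* The "cancellation point": 0 when ready, ECANCELED when cancelled. */
static int wait_ready(struct cancel_token *t)
{
    int rc = 0;

    pthread_mutex_lock(&t->lock);
    while (!t->ready && !t->cancelled)
        pthread_cond_wait(&t->cv, &t->lock);
    if (t->cancelled)
        rc = ECANCELED;
    pthread_mutex_unlock(&t->lock);
    return rc;
}

void token_cancel(struct cancel_token *t)
{
    pthread_mutex_lock(&t->lock);
    t->cancelled = true;
    pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->lock);
}

/* A caller with a resource to release. Cancellation takes the same
 * cleanup path as any other error. */
static int do_work(struct cancel_token *t, int *cleanups)
{
    char *buf = malloc(64);
    if (!buf)
        return ENOMEM;

    int rc = wait_ready(t);      /* may come back with ECANCELED */

    free(buf);                   /* ordinary cleanup, no special handler */
    (*cleanups)++;
    return rc;                   /* propagate the error to our caller */
}

struct job { struct cancel_token *token; int rc; int cleanups; };

static void *job_main(void *arg)
{
    struct job *j = arg;
    j->rc = do_work(j->token, &j->cleanups);
    return NULL;
}

/* Returns 0 if the worker unwound with ECANCELED after exactly one cleanup. */
int cancel_by_error_demo(void)
{
    struct cancel_token t = { .cancelled = false, .ready = false };
    struct job j = { &t, 0, 0 };
    pthread_t tid;

    pthread_mutex_init(&t.lock, NULL);
    pthread_cond_init(&t.cv, NULL);

    int err = pthread_create(&tid, NULL, job_main, &j);
    assert(err == 0);
    (void)err;

    token_cancel(&t);            /* flag persists even if set before the wait */
    pthread_join(tid, NULL);

    pthread_cond_destroy(&t.cv);
    pthread_mutex_destroy(&t.lock);
    return (j.rc == ECANCELED && j.cleanups == 1) ? 0 : -1;
}
```

Because the flag is persistent, it does not matter whether the cancel request arrives before or during the wait; either way `wait_ready` returns `ECANCELED` and the buffer is freed by plain error handling rather than by a cleanup handler.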

yardstick•27m ago

It’s been decades, why doesn’t getaddrinfo have a standardised way to specify a timeout? Set a timeout to 10 seconds and life becomes a lot easier.

Yes I know in Linux you can set the timeout in a config file.

But really the DNS timeout should be configurable by the calling code. Some code requires fast lookups and doesn’t mind failing, while other code won’t mind waiting longer. It’s not a one-size-fits-all thing.

ComputerGuru•17m ago

I disagree: there are too many variables, and ultimately the end user is the one who knows best. The proper solution isn’t to leave it to the library or application dev, who has no idea what kind of network connection the user is running, the type of DNS server (caching or not, LAN or remote, etc.), or the name servers of the target domain and their performance or availability. This is all really the domain of the sysadmin.

The solution is to make it a properly non-blocking api.

albertzeyer•22m ago

Why not use getaddrinfo_a / getaddrinfo_async_start / GetAddrInfoExW?

Or just use some standalone DNS resolve code or library (which basically replicates getaddrinfo but supports this in an async way)?

See also here the discussion: https://github.com/crystal-lang/crystal/issues/13619
