Faster asin() was hiding in plain sight

https://16bpp.net/blog/post/faster-asin-was-hiding-in-plain-sight/
90•def-pri-pub•1h ago

Comments

erichocean•1h ago
Ideal HN content, thanks!
orangepanda•1h ago
> Nobody likes throwing away work they've done

I like throwing away work I've done. Frees up my mental capacity for other work to throw away.

patchnull•1h ago
The huge gap between Intel (1.5x) and M4 (1.02x) speedups is the most interesting result here. Apple almost certainly uses similar polynomial approximations inside their libm already, tuned for the M-series pipeline. glibc on x86 tends to be more conservative with precision, leaving more room on the table. The Cg version is from Abramowitz and Stegun formula 4.4.45, which has been a staple in shader math for decades. Funny how knowledge gets siloed: game devs and GPU folks have known about this class of approximation forever, but it rarely crosses into general systems programming.
stephencanon•47m ago
These sorts of approximations (and more sophisticated methods) are fairly widely used in systems programming, as seen by the fact that Apple's asin is only a couple percent slower and sub-ulp accurate (https://members.loria.fr/PZimmermann/papers/accuracy.pdf). I would expect to get similar performance on non-Apple x86 using Intel's math library, which does not seem to have been measured, and significantly better performance while preserving accuracy using a vectorized library call.

The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question, but no one would ever use it for a system library; less than half the bits are good.

Also worth noting that it's possible to go faster without further loss of accuracy--the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.

Pannoniae•44m ago
Yeah, the only big problem with approx. sqrt is that it's not consistent across systems, for example Intel and AMD implement RSQRT differently... Fine for graphics, but if you need consistency, that messes things up.
stephencanon•19m ago
Newer rsqrt approximations (ARM NEON and SVE, and the AVX512F approximations on x86) make the behavior architectural so this is somewhat less of a problem (it still varies between _architectures_, however).
patchnull•20m ago
Great point about the approximate sqrt being low-hanging fruit. The correctly rounded sqrt is doing way more work than the rest of the pipeline deserves at that error budget. I wonder if the author benchmarked with rsqrtss plus a Newton-Raphson refinement step — on x86 that gives you roughly 23 bits of precision for a fraction of the latency of sqrtss, which is still massive overkill for a 2.7e11 ulp result but would show an even bigger speedup.
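
The refinement step being described looks roughly like this (plain C for illustration; in the real thing the seed `y` would come from `rsqrtss` rather than being computed here):

```c
#include <math.h>

/* One Newton-Raphson step for 1/sqrt(x): y' = y * (1.5 - 0.5*x*y*y).
 * Seeded with a ~12-bit estimate (roughly what rsqrtss provides on x86),
 * a single step lands near 23 bits. Illustrative sketch, not benchmarked. */
float rsqrt_nr_step(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}
```

Each step roughly doubles the number of accurate bits, since the error term is quadratic in the seed's error.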
stephencanon•7m ago
For the asinf libcall on macOS/x86, my former colleague Eric Postpischil invented the novel (at least at the time, I believe) technique of using a Remez-optimized refinement polynomial following rsqrtss instead of the standard Newton-Raphson iteration coefficients, which allowed him to squeeze out just enough extra precision to make the function achieve sub-ulp accuracy. One of my favorite tricks.

We didn't carry that algorithm forward to arm64, sadly, because the architects made fsqrt fast enough that it wasn't worth it in scalar contexts.

adampunk•1h ago
We love to leave faster functions languishing in library code. The basis for Q3A’s fast inverse square root had been sitting in fdlibm since 1986, on the net since 1993: https://www.netlib.org/fdlibm/e_sqrt.c
drsopp•1h ago
Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache depending on the CPU model.
Pannoniae•1h ago
Microbenchmarks. A LUT will win many of them but you pessimise the rest of the code. So unless a significant (read: 20+%) portion of your code goes into the LUT, there isn't that much point to bother. For almost any pure calculation without I/O, it's better to do the arithmetic than to do memory access.
jcalvinowens•21m ago
Locality within the LUT matters too: if you know you're looking up identical or nearby-enough values to benefit from caching, an LUT can be more of a win. You only pay the cache cost for the portion you actually touch at runtime.

I could imagine some graphics workloads tend to compute asin() repeatedly with nearby input values. But I'd guess the locality isn't local enough to matter; only eight double-precision floats fit in a cache line.

groundzeros2015•58m ago
I don’t want to fill up L1 for sin.
jcalvinowens•40m ago
Surely the loss in precision of a 32KB LUT for double precision asin() would be unacceptable?
Jyaif•33m ago
By interpolating between values you can get excellent results with LUTs much smaller than 32KB. Will it be faster than the computation from OP? That I don't know.
jcalvinowens•15m ago
I'm very skeptical you wouldn't get perceptible visual artifacts if you rounded the trig functions to 4096 linear approximations. But I'd be happy to be proven wrong :)
scottlamb•1h ago
Isn't the faster approach SIMD? A 1.05x to 1.90x speedup is great. A 16x speedup is better!

They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.

I searched for asin on Intel's intrinsics guide. They have an AVX-512 intrinsic `_mm512_asin_ps` but it says "sequence" rather than single-instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.

https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

TimorousBestie•48m ago
I don’t know much about raytracing but it’s probably tricky to orchestrate all those asin calls so that the input and output memory is aligned and contiguous. My uneducated intuition is that there’s little regularity as to which pixels will take which branches and will end up requiring which asin calls, but I might be wrong.
scottlamb•28m ago
I'd expect it to come down to data-oriented design: SoA (structure of arrays) rather than AoS (array of structures).

I skimmed the author's source code, and this is where I'd start: https://github.com/define-private-public/PSRayTracing/blob/8...

Instead of an `_objects`, I might try for a `_spheres`, `_boxes`, etc. (Or just `_lists`, still using the virtual dispatch, but for each list rather than each object.) The `asin` seems to be used just for spheres. Within my `Spheres` class (note the plural), I'd work to SIMDify its `closest_hit`. (I'd try to SIMDify the others too of course, but apparently not with `asin`.) I think it's doable: https://github.com/define-private-public/PSRayTracing/blob/8...

I don't know much about ray tracers either (having only written a super-naive one back in college) but this is the general technique used to speed up games, I believe. Besides enabling SIMD, it's more cache-efficient and minimizes dispatch overhead.

edit: there's also stuff that you can hoist in this impl. Restructuring as SoA isn't strictly necessary to do that, but it might make it more obvious and natural. As an example, this `ray_dir.length_squared()` is the same for the whole list. You'd notice that when iterating over the spheres. https://github.com/define-private-public/PSRayTracing/blob/8...
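
The SoA layout being described would look roughly like this (a hypothetical sketch; the field names are illustrative, not from the PSRayTracing codebase):

```c
#include <stddef.h>

/* Structure of arrays: one contiguous array per sphere component. */
typedef struct {
    float *cx, *cy, *cz;  /* sphere centers, one array per coordinate */
    float *r2;            /* radius squared */
    size_t count;
} Spheres;

/* For each sphere, test whether point (px, py, pz) is inside it.
 * The contiguous per-component arrays let the compiler auto-vectorize
 * this loop, which an array-of-structs layout usually defeats. */
void inside_all(const Spheres *s, float px, float py, float pz, int *out) {
    for (size_t i = 0; i < s->count; i++) {
        float dx = s->cx[i] - px, dy = s->cy[i] - py, dz = s->cz[i] - pz;
        out[i] = dx * dx + dy * dy + dz * dz < s->r2[i];
    }
}
```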

TimorousBestie•24m ago
This tracks with my experience and seems reasonable, yes. I tend to SoA all the things, sometimes to my coworkers’ amusement/annoyance.
Am4TIfIsER0ppos•28m ago
I don't do much float work, but I don't think there is a single regular sine instruction, only the old x87 float-stack ones.

I was curious what "sequence" would end up being but my compiler is too old for that intrinsic. Even godbolt didn't help for gcc or clang but it did reveal that icc produced a call https://godbolt.org/z/a3EsKK4aY

AlotOfReading•1h ago
I'm pretty sure it's not faster, but it was fun to write:

    #include <stdint.h>

    float asin(float x) {
      union { float f; uint32_t u; } v;   /* bit-cast via type punning */
      v.f = 1.0f - (x < 0 ? -x : x);      /* x2 = 1 - |x| */
      float x2 = v.f;
      v.u = 0x5f3759df - (v.u >> 1);      /* fast inverse-sqrt magic */
      /* x2 * (1/sqrt(x2)) ~ sqrt(x2), so: pi/2 - pi/2*sqrt(1 - |x|) */
      float r = 1.5707964f - 1.5707964f * (x2 * v.f);
      return x < 0 ? -r : r;
    }
Courtesy of evil floating point bithacks.
def-pri-pub•52m ago
> floating point bithacks

The forbidden magic

chuckadams•49m ago
You brought Zalgo. I blame this decade on you.
moffkalast•42m ago
> float asinine(float x) {

FTFY :P

adampunk•16m ago
// what the fuck
LegionMammal978•57m ago
In general, I find that minimax approximation is an underappreciated tool, especially the quite simple Remez algorithm to generate an optimal polynomial approximation [0]. With some modifications, you can adapt it to optimize for either absolute or relative error within an interval, or even come up with rational-function approximations. (Though unfortunately, many presentations of the algorithm use overly-simple forms of sample point selection that can break down on nontrivial input curves, especially if they contain small oscillations.)

[0] https://en.wikipedia.org/wiki/Remez_algorithm

jason_s•39m ago
Not sure I would call Remez "simple"... it's all relative; I prefer Chebyshev approximation which is simpler than Remez.
stephencanon•13m ago
Ideally either one is just a library call to generate the coefficients. Remez can get into trouble near the endpoints of the interval for asin and require a little bit of manual intervention, however.
herf•14m ago
They teach a lot of Taylor/Maclaurin series in math classes (and trig functions are sometimes taught via CORDIC, another old method), but these are not used much in actual FPUs and libraries. Maybe we should update the curricula so people know better ways.
stephc_int13•42m ago
My favorite tool to experiment with math approximations is lolremez. And you can easily ask your LLM to do it for you.
glitchc•41m ago
The 4% improvement doesn't seem like it's worth the effort.

On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So, replacing one with the other will not confer much benefit, as evidenced by the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.

kstrauser•20m ago
> The 4% improvement doesn't seem like it's worth the effort.

People have gotten PhDs for smaller optimizations. I know. I've worked with them.

> instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs.

What's the x86-64 opcode for arcsin?

jason_s•40m ago
While I'm glad to see the OP got a good minimax solution at the end, it seems like the article missed clarifying one of the key points: error waveforms over a specified interval are critical, and if you don't see the characteristic minimax-like wiggle, you're wasting an easy opportunity for improvement.

Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.

I prefer Chebyshev approximation: https://www.embeddedrelated.com/showarticle/152.php which is often close enough to the more complicated Remez algorithm.

exmadscientist•20m ago
This line:

> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.

was kind of telling for me. I have some background in this sort of work, and long ago concluded that there is pretty much nothing you can do to improve on existing code unless you have some new hardware or domain constraint, you're just looking for something quick-n-dirty, or you're willing to invest research-paper levels of time and effort. So to see someone call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."

wolfi1•8m ago
Abramowitz/Stegun was updated in 2010 and now resides here: https://dlmf.nist.gov/
ok123456•4m ago
The Chebyshev approximation for asin is the sum over odd n of 4*T_n(x)/(pi*n*n); the even terms are 0.
