Frontpage

Scientists may have found a way to eliminate chromosome linked to Down syndrome

https://academic.oup.com/pnasnexus/article/4/2/pgaf022/8016019
146•MattSayar•3h ago•68 comments

GrapheneOS: a security-enhanced Android build

https://lwn.net/SubscriberLink/1030004/898017c7953c0946/
69•madars•3h ago•12 comments

Inter-Planetary Network Special Interest Group

https://www.ipnsig.org
89•OhMeadhbh•4h ago•20 comments

Positron – A next-generation data science IDE

https://positron.posit.co/
90•amai•3d ago•42 comments

New Aarch64 Back End

https://ziglang.org/devlog/2025/#2025-07-23
53•Bogdanp•3h ago•4 comments

I wasted weeks hand optimizing assembly because I benchmarked on random data

https://www.vidarholen.net/contents/blog/?p=1160
209•thunderbong•3d ago•64 comments

There is no memory safety without thread safety

https://www.ralfj.de/blog/2025/07/24/memory-safety.html
249•tavianator•9h ago•229 comments

AMD CEO sees chips from TSMC's US plant costing 5%-20% more

https://www.bloomberg.com/news/articles/2025-07-23/amd-ceo-su-sees-chips-from-us-tsmc-plant-costing-5-to-20-more
238•mfiguiere•1d ago•424 comments

A GPU Calculator That Helps Calculate What GPU to Use

https://calculator.inference.ai/
32•chlobunnee•2h ago•10 comments

Visa and Mastercard: The global payment duopoly (2024)

https://quartr.com/insights/edge/visa-and-mastercard-the-global-payment-duopoly
198•bilekas•3h ago•86 comments

Revisiting Moneyball

https://djpardis.medium.com/revisiting-moneyball-074fc2435b07
48•sebg•3h ago•10 comments

Why concatenative programming matters (2012)

http://evincarofautumn.blogspot.com/2012/02/why-concatenative-programming-matters.html
44•azhenley•3d ago•7 comments

RE#: High performance derivative-based regular expression matching (2024)

https://arxiv.org/abs/2407.20479
12•fanf2•3d ago•1 comment

PSA: SQLite WAL checksums fail silently and may lose data

https://avi.im/blag/2025/sqlite-wal-checksum/
234•avinassh•10h ago•111 comments

Air Force unit suspends use of Sig Sauer pistol after shooting death of airman

https://www.nhpr.org/nh-news/2025-07-23/sig-sauer-pistol-air-force-shooting-death
86•duxup•6h ago•147 comments

Use Your Type System

https://www.dzombak.com/blog/2025/07/use-your-type-system/
220•ingve•10h ago•218 comments

Vet is a safety net for the curl | bash pattern

https://github.com/vet-run/vet
172•mooreds•12h ago•161 comments

Intel CEO Letter to Employees

https://morethanmoore.substack.com/p/intel-ceo-letter-to-employees
159•fancy_pantser•4h ago•283 comments

Open Source Maintenance Fee

https://github.com/wixtoolset/issues/issues/8974
209•AndrewDucker•12h ago•150 comments

Covers as a way of learning music and code

https://ntietz.com/blog/covers-as-a-way-of-learning/
120•zdw•3d ago•66 comments

Superfunctions: A universal solution against sync/async fragmentation in Python

https://github.com/pomponchik/transfunctions
22•pomponchik•3d ago•23 comments

American sentenced for helping North Koreans get jobs at U.S. firms

https://fortune.com/2025/07/24/north-korean-it-workers-chapman-nike/
90•fortran77•4h ago•62 comments

Bus Bunching

https://www.futilitycloset.com/2025/07/12/bus-bunching/
47•surprisetalk•4d ago•54 comments

UK: Phone networks down: EE, BT, Three, Vodafone, O2 not working in mass outage

https://www.the-independent.com/tech/ee-bt-three-vodafone-o2-down-phone-networks-outage-latest-b2795260.html
187•oger•11h ago•81 comments

Mwm – The smallest usable X11 window manager

https://github.com/lslvr/mwm
121•daureg•3d ago•50 comments

Writing is thinking

https://www.nature.com/articles/s44222-025-00323-4
258•__rito__•3d ago•110 comments

The POSIX specification of vi

https://pubs.opengroup.org/onlinepubs/9799919799/utilities/vi.html
58•exvi•3d ago•18 comments

Show HN: Easy Python Time Parsing

https://github.com/felixnext/python-time-helper
14•felixnext•3d ago•2 comments

Thunder Compute (YC S24) Is Hiring a C++ Systems Engineer

https://www.ycombinator.com/companies/thunder-compute/jobs/DhML6Uf-c-systems-engineer
1•cpeterson42•13h ago

Building MCP servers for ChatGPT and API integrations

https://platform.openai.com/docs/mcp
47•kevinslin•4h ago•19 comments

SIMD Perlin Noise: Beating the Compiler with SSE (2014)

https://scallywag.software/vim/blog/simd-perlin-noise-i
60•homarp•3d ago

Comments

jesse__•1d ago
Author here, AMA :)
0points•1d ago
Nice write-up, and congratulations on the result! Since it's about Perlin and performance, have you had a look at OpenSimplex?

PS. bonsai looks really cool! Checking it out right now

jesse__•1d ago
I haven't looked at OpenSimplex. I will when I get around to doing a simplex implementation.

And thanks for the kind words!

Keyframe•1d ago
Pretty sweet! I'm mostly interested in how you measured the performance and zeroed in on one function. Was it pretty much perf with a histogram or a visualizer, or what?
jesse__•1d ago
I just called __rdtsc() before and after the noise gen once every iteration, and pushed the sample onto a fixed-size buffer .. after some N iterations (4k maybe, can't remember) of samples, computed min/max/avg.

There's a little project here that I used to benchmark in part 4

https://github.com/scallyw4g/bonsai_noise_bench
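
The loop being described is simple enough to sketch. A minimal, illustrative version (NoiseKernel is a placeholder for the function under test, not anything from the article; the repo above is the real harness):

    // Hypothetical rdtsc timing harness in the shape described above:
    // sample the cycle counter around the kernel, buffer N samples,
    // then report min/max/avg.
    #include <x86intrin.h>   // __rdtsc (x86 only)
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    volatile float Sink;  // keeps the placeholder kernel from being optimized away
    static void NoiseKernel() { Sink = Sink * 1.00001f + 0.5f; }

    int main() {
      const int N = 4096;  // sample count; the comment says "4k maybe"
      std::vector<uint64_t> samples;
      samples.reserve(N);
      for (int i = 0; i < N; ++i) {
        uint64_t begin = __rdtsc();
        NoiseKernel();
        samples.push_back(__rdtsc() - begin);
      }
      uint64_t lo = *std::min_element(samples.begin(), samples.end());
      uint64_t hi = *std::max_element(samples.begin(), samples.end());
      uint64_t avg = std::accumulate(samples.begin(), samples.end(), uint64_t{0}) / N;
      std::printf("min=%llu max=%llu avg=%llu cycles\n",
                  (unsigned long long)lo, (unsigned long long)hi,
                  (unsigned long long)avg);
    }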

jokoon•1d ago
you should post the result at the end

and yes, make a benchmark

(although I would not know how to make one, or what reference point to use)

what do you think about the FastNoiseLite implementation used in Godot?

jesse__•1d ago
I kinda did post results at the end of part 4 .. I beat the SOTA by 1.8x

There's a benchmark utility here: https://github.com/scallyw4g/bonsai_noise_bench

FastNoise2 is a high-quality library. Can't speak to FastNoiseLite .. never looked at it.

vlovich123•1d ago
Which compiler & optimization settings did you use? Out of curiosity, any idea why the compiler failed to auto-vectorize the loops?
jesse__•1d ago
Clang -O2

..

-O3 didn't seem to make any appreciable difference.

Re. the auto-vectorization, I really don't know. I didn't even read the assembly the compiler generated until at least halfway through the process. Generally I've found that you basically can't rely on the compiler auto-vectorizing anything, ever, if it actually matters.

addaon•1d ago
Memories. As a personal project back in... 2003?... I decided to do something similar, implement 4D Perlin Noise in Altivec assembly. The only problem was that I had a G3 iBook; so I would write one instruction of assembly, then write a C function to interpret that assembly, building an interpreter for a very selective subset of PPC w/ Altivec that ran (slooooowly) on the G3. As I recall I got it down to ~200 instructions, and it worked perfectly the first time I ran it on a G4, which was pretty rewarding. Took me more than half a day, though. On an unrelated note, I got an internship with Apple's performance team that summer.
rincebrain•1d ago
Did you profile the results with different compilers?

The last time I tried doing this kind of micro-optimization for fun, I ended up bundling actual assembly files, because the generated assembly for intrinsics was so variable in performance across compilers that it was the only way to get consistent results on many platforms.

jesse__•1d ago
I only build the project this is embedded in with clang, so that's the only compiler I tested.
llm_nerd•1d ago
HN loves SIMD, and there is a "how I hand-crafted a SIMD optimization" post doing numbers on here regularly. They're fun posts, and it absolutely speaks to the fact that writing code that optimizing compilers can robustly and comprehensively turn into good SIMD branches is somewhat of a black art.

Which is why you, generally, shouldn't be doing either. You shouldn't rely upon the compiler to figure out your intentions, and you shouldn't be writing SIMD instructions directly unless you're writing a SIMD library or an optimizing compiler.

Instead you should reach for one of the many available libraries that not only force you into appropriately structuring your data and calls for SIMD goodness, they're massively more portable and powerful.

Google's Highway, for instance, lets you use its abstracted SIMD functions and provides the optimization whether your target is SSE2-4, AVX, AVX2, AVX-512, or AVX10, or you build for ARM NEON or SVE at any conceivable vector size, WASM's weird SIMD functions, RISC-V's RVV, and several more; when new widths and new options come out, the library adds the support and you might not have to change your code at all.
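
For a sense of what that looks like in practice, here is a minimal sketch in Highway's single-target (static dispatch) mode; MulArrays and its arguments are illustrative, not from the article:

    // One portable source; Highway maps these ops to SSE4, AVX2, NEON, etc.
    #include <cstddef>
    #include <hwy/highway.h>

    namespace hn = hwy::HWY_NAMESPACE;

    void MulArrays(const float* a, const float* b, float* out, size_t n) {
      const hn::ScalableTag<float> d;     // widest native float vector
      const size_t lanes = hn::Lanes(d);  // lane count decided per target
      size_t i = 0;
      for (; i + lanes <= n; i += lanes) {
        const auto va = hn::LoadU(d, a + i);
        const auto vb = hn::LoadU(d, b + i);
        hn::StoreU(hn::Mul(va, vb), d, out + i);
      }
      for (; i < n; ++i) out[i] = a[i] * b[i];  // scalar tail
    }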

There are loads of libraries like this (xsimd, EVE, SIMDe, etc). They all force you into thinking about structuring your code in a manner that is SIMDable -- instead of hoping the optimizing compiler will figure it out on its own -- and provide targeting for a vast trove of SIMD options without hand-writing for every option.

I was going to quickly rewrite the example in Highway just to demonstrate but the Perlin stuff seems to be missing or significantly restructured.

"But that is obvious and I'm mad that you commented this" - no, it isn't obvious whatsoever, and this "I hand-rolled some SSE now my app is super awesome look at the microbenchmark results on a very narrow, specific machine" content appears on here regularly, betraying a pretty big influence of beginners who don't know that it's almost certainly the wrong approach.

63•1d ago
This is a valuable viewpoint that lines up somewhat with some other discussion I've seen on the topic [0]. I'd like to see more posts about structuring code for the auto-vectorizer (with libraries or otherwise) rather than writing SIMD by hand. Do you have any documentation you'd recommend?

[0] https://matklad.github.io/2023/04/09/can-you-trust-a-compile...

jesse__•1d ago
I disagree pretty strongly with most of what you said, but I'd be very interested in seeing a Highway example and looking at the differences. Take a look through the comments; I left a link to the test bench I made, which contains all the code.
janwas•17h ago
Highway author here :) I'm curious what you disagree with, because it all sounds very sensible to me?
jesse__•10h ago
There's a lot to discuss.

First off, a number of statements are nonsense. Take, for example

> you shouldn't be writing SIMD instructions directly unless you're writing a SIMD library or an optimizing compiler.

Why would writing an optimizing compiler qualify as territory for directly writing SIMD code, but anything else is off the table? That makes no sense at all.

Furthermore, I was writing a library. It's just embedded in my game engine.

> Instead you should reach for one of the many available libraries

This blanket statement is only true in a narrow set of circumstances. In my mind, it requires that you ship on multiple architectures and probably multiple compilers. If you have narrower constraints, it's extremely easy to write your own wrappers (like I did) and not take a dependency. A good trade IMO. Furthermore, someone's got to write the libraries, so doing it yourself as a learning exercise has value.

> There are loads of libraries like this [...] and provide targeting for a vast trove of SIMD options without hand-writing for every option.

The original commenter seems to be under the impression that using a SIMD library would somehow have produced a better result. The fact is, the library code is super fucking boring. I barely mentioned it in the article because it's basically just boilerplate an LLM could probably spit out, first try. The interesting part of the series is the observation that you can precompute a matrix of intermediates and look them up, instead of recomputing them in the hot loop, effectively trading memory bandwidth for fewer instructions. A good trade for this algorithm, which saturates the instruction pipelines.
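
To make that trade concrete, a hedged sketch (an illustration, not the article's code): when noise is sampled on a regular grid, the Perlin fade curve only ever sees a fixed set of fractional offsets, so it can be tabulated once outside the hot loop:

    // Trade memory bandwidth for fewer instructions: tabulate the fade
    // curve f(t) = 6t^5 - 15t^4 + 10t^3 instead of evaluating it per sample.
    #include <array>
    #include <cstddef>

    constexpr std::size_t kSteps = 256;  // samples per unit cell (illustrative)

    constexpr float Fade(float t) {
      return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
    }

    std::array<float, kSteps> BuildFadeTable() {
      std::array<float, kSteps> table{};
      for (std::size_t i = 0; i < kSteps; ++i)
        table[i] = Fade(static_cast<float>(i) / kSteps);
      return table;
    }
    // Hot loop: one table load per lane replaces the five multiplies and
    // several adds of the polynomial, at the cost of cache traffic.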

The thing the original commenter does get right is the notion that thinking about data layout is important. But that has nothing to do with the library you're using .. you just have to do it. They seem to be conflating the use of a library with the act of writing wide code, as if you can't do one without the other, which is obviously false.

> I was going to quickly rewrite the example in Highway ..

Right. I'll believe this when I see it.

I could pick it apart more, but.. I think you get my drift.

llm_nerd•7h ago
>First off, a number of statements are nonsense.

100% of my original comment is absolutely and completely correct. Indisputably correct.

>Furthermore, I was writing a library.

Little misunderstandings like this pervade your take.

>seems to be under the impression that using a SIMD library would somehow have produced a better result.

To be clear, I wasn't speaking to you or for your benefit, or specifically to your exercise. You'll notice I didn't email a list of recommendations to you, because I do not care what you do or how you do it. I didn't address my comment to you.

I -- and I was abundantly clear on this -- was speaking to the random reader who might be considering optimizing their code with some hand-crafted SIMD. Following the path in this (and an endless chain of similar) submission(s) is usually ill-advised, generally speaking; I'm not addressing this specific project, but rather the average "I want to take advantage of SIMD in my code" consideration.

HN has a fetish for SIMD code recently and there is almost always a better approach than hand-crafting some SSE3 calls in one's random project.

>The original commenter seems to be under the impression that using a SIMD library would somehow have produced a better result.

Again, I could not care less about your project. But the average developer does care that their code runs optimally on a wide variety of platforms. You don't, but again, you and your project were tangential to my comment, which was general.

>The thing the original commentor does get right is the notion that thinking about data layout is important.

Aside from the entirety of my comment being correct, the point was that many of the SIMD tools and libraries force you down a path where you are coerced into such structures, versus relying upon the compiler to make the best of suboptimal structures. We've seen many times that people complain their compiler isn't vectorizing things they think it should; there is a middle ground between endlessly fighting with the compiler and hand-rolling SSE calls, one that not only supports much more hardware but also leads you down the path of best practices.

Which is of course why C++26 is getting std::simd.
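
For reference, a sketch of that style using the Parallelism TS v2 form that ships in libstdc++ today (C++26 moves it to std::simd proper; Scale is an illustrative name):

    // One source; the implementation picks the native vector width.
    #include <cstddef>
    #include <experimental/simd>

    namespace stdx = std::experimental;

    void Scale(float* data, std::size_t n, float factor) {
      using V = stdx::native_simd<float>;
      std::size_t i = 0;
      for (; i + V::size() <= n; i += V::size()) {
        V v(&data[i], stdx::element_aligned);        // vector load
        v *= factor;                                 // vector multiply
        v.copy_to(&data[i], stdx::element_aligned);  // vector store
      }
      for (; i < n; ++i) data[i] *= factor;          // scalar tail
    }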

Again, you are irrelevant to my comment. Your project is irrelevant to it. I know this is tough to stomach.

>Right. I'll believe this when I see it.

I actually cloned the project, but then this submission fell off the front page and it seemed not worth my time. Not to mention that it can't be built on macOS, which happened to be the machine I was on at the moment.

Because again, I don't care about your or your project, and my commentary was to the SIMD sideliners considering how to approach it.

>I could pick it apart more, but.. I think you get my drift.

None of your retorts are valid, and my comment stands as completely correct. The drift is that you feel defensive about a general comment because you did something different, which....eh.

twoodfin•1d ago
The year on this article should be (2024).
dragontamer•10h ago
SSE?

It made sense in 2014, when AVX was not yet widely deployed. Today you can double the throughput with AVX and probably get a 4x improvement with AVX-512.

GPU SIMD is also popular, but Perlin noise seems like too 'small' a problem to be worth traversing PCIe for. But if you had a GPU shader that needed Perlin noise as an input, I'd expect the GPU to easily use this methodology.

It is worth revisiting how different techniques worked out over the last decade. Strangely enough, CUDA code from 2014 would likely still be workable today (Perlin noise doesn't need the newer GPU instructions for 4x4 Float16 matrices or ray tracing).

OpenCL IMO is the wrong path though, and that's what many would have picked for GPU work in the 2014 era.

jesse__•10h ago
There are 4 parts.

Spoiler: I went to AVX and beat the state of the art by 1.8x

Also, doing it on the GPU is worth it if you do large batches.

dragontamer•9h ago
As far as fast RNGs go, Imma just plug an old, incomplete weekend project for ya....

https://github.com/dragontamer/AESRand

Especially because you are already in the AVX domain, a fast AVX RNG that uses like 3 registers should be useful to ya.

....

Yeah, I'm pretty sure aesenc these days has more throughput than multiply. (Edit: aesenc, at least a single round, is largely a 32-bit operation and thus has less complexity than a 64-bit multiply. Yeah, I know it's over 128 bits, but seriously, it's surprising how 'little' AES actually shuffles bits around per round.)

If you are fine with an inferior RNG, you probably can skip one or two of the instructions I used there. But the 'two rounds of AES' seems to be the minimum to pass PractRand or BigCrush.

-------

Today, AES on AVX-512 can perform 4x AES in parallel over all 512 bits. But the overall technique I used back then should allow for arbitrary skipping ahead as well. (Ex: thread #0 starts with iteration #0, thread #1 starts with iteration #1000000, etc., with consistency; because my increment function is simple 64-bit adds, a 64-bit multiply will skip forward easily.)

Alas, I don't think AESENC was ever ported to ymm registers, and thus your choices are 128-bit AESRand vs 512-bit AESRand.

jesse__•8h ago
Hey, thanks for the comment! I did actually take a look at the AES instructions and came to the conclusion that they are in fact faster than the hash I used, but I think I'd decided I would have to swizzle the data in a way that was a pain because of how the AES mixdown works (i.e. it mixes across lanes, so I would have to change the output pattern, if that makes sense).

Maybe I'll dust it off one day and try again. That seems like it could be an easy win.

dragontamer•6h ago
> but I think I'd decided I would have to swizzle the data in a way that was a pain because of how the AES mixdown works (i.e. it mixes across lanes, so I would have to change the output pattern, if that makes sense)

100% agree.

I solved this with a 64-bit x2 SIMD add instruction. State += 0x0305071113171923, which ensures a 1-bit 'carry bit' dependency as well so we have (barely) enough data mixing for lots of cool entropy effects.

Because this is an odd number (bottom bit is 1), it cycles every 2^64, which should be a sufficient cycle length for most simulations.

That 1-bit difference was enough to then pass PractRand and BigCrush.

Don't swizzle the bits. Just add a number across all 128-bits (as 2x 64-bit adds) and bam. We get a lot of lovely RNG properties thanks to AES mixing.
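
A hedged sketch of that shape (the linked AESRand repo is the authoritative version; the increment constant is the one quoted above):

    // 2x 64-bit add of an odd constant advances the counter (full 2^64
    // cycle per half, plus the 1-bit carry dependency), then two AES
    // rounds mix; per the comment, two rounds are the minimum to pass
    // PractRand/BigCrush. Requires AES-NI (-maes).
    #include <emmintrin.h>   // _mm_add_epi64, _mm_set1_epi64x
    #include <wmmintrin.h>   // _mm_aesenc_si128

    struct AesRandSketch {
      __m128i state = _mm_setzero_si128();

      __m128i Next() {
        const __m128i inc = _mm_set1_epi64x(0x0305071113171923LL);
        state = _mm_add_epi64(state, inc);         // critical path: one add
        __m128i r = _mm_aesenc_si128(state, inc);  // AES round 1
        return _mm_aesenc_si128(r, inc);           // AES round 2
      }
    };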

It's not 'purely' aesenc. I did a few little tidbits that fixed all the problems of AES data mixing.

------

The real fun part is that the latency/dependency limitation on my code is this Add instruction. The AES stuff is done in parallel later and thus easily parallelizes to modern 4x512-bit AES as is available on Zen5. (Maybe the compilers won't see it yet, but it's bloody obvious for humans to see it IMO).

I.e., the critical path of my code is:

    simd-add state, 0x030507.....
State gets SSA'd (renamed) by the out-of-order machinery on the processor, and thus future iterations of the RNG loop can execute in parallel.