AVX-512: First Impressions on Performance and Programmability

https://shihab-shahriar.github.io//blog/2026/AVX-512-First-Impressions-on-Performance-and-Programmability/
24•shihab•5d ago

Comments

fithisux•4d ago
What I get from this article is that the original intent of the C language still holds true.

Use C as a common platform denominator without crazy optimizations (like tcc). If you need performance, specialize: C gives you the tools to call assembly (or to use compiler intrinsics or even inline assembly).

A complex compiler doing crazy optimizations is, in my opinion, not worth it.

pjmlp•4d ago
> In CPU world there is a desire to shield programmers from those low-level details, but I think there are two interesting forces at play now-a-days that’ll change it soon. On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.

The problem is that not all programming languages expose SIMD, and even when they do, it is only a portable subset. Additionally, the skills required to use SIMD properly aren't something everyone is comfortable with.

I certainly am not. I still managed to get by with MMX and early SSE, and I can manage shading languages, but that is about it.

camel-cdr•4d ago
> The answer, if it’s not obvious from my tone already:), is 8%.

Not if the data is small and in cache.

> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.

I think the best way to do this is to duplicate sum_r and count 16 times, so each lane has a separate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction across the 16 copies of each bucket.
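A minimal sketch of that per-lane duplication, written as portable scalar C++ that models the 16 lanes instead of using AVX-512 intrinsics. Everything here is an assumption for illustration: the names `kLanes`, `kBuckets`, `Totals` and `accumulate_privatized`, the bucket count of 8, and the input layout are hypothetical and not taken from the article.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sizes: 16 lanes of 32 bits in a 512-bit register,
// and an assumed (made-up) number of distinct bucket ids.
constexpr std::size_t kLanes = 16;
constexpr std::size_t kBuckets = 8;

struct Totals {
    std::array<float, kBuckets> sum_r{};
    std::array<std::uint32_t, kBuckets> count{};
};

// Scalar model of lane-private accumulation. Element i is treated as if
// it sat in lane i % kLanes, and each lane writes only to its own copy
// of every bucket, so two lanes can never update the same slot.
Totals accumulate_privatized(const std::vector<std::uint32_t>& bucket,
                             const std::vector<float>& r) {
    assert(bucket.size() == r.size());

    // Bucket-major layout: the 16 lane copies of one bucket are contiguous,
    // so the final reduction reads one vector-sized row per bucket.
    std::array<float, kBuckets * kLanes> priv_sum{};
    std::array<std::uint32_t, kBuckets * kLanes> priv_count{};

    for (std::size_t i = 0; i < bucket.size(); ++i) {
        assert(bucket[i] < kBuckets);
        const std::size_t lane = i % kLanes;
        const std::size_t slot = bucket[i] * kLanes + lane;
        priv_sum[slot] += r[i];
        priv_count[slot] += 1;
    }

    // After the loop: sum the 16 lane copies of each bucket.
    Totals t;
    for (std::size_t b = 0; b < kBuckets; ++b) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            t.sum_r[b] += priv_sum[b * kLanes + lane];
            t.count[b] += priv_count[b * kLanes + lane];
        }
    }
    return t;
}
```

In a vectorised version the same `bucket * 16 + lane` index would feed a gather, add, scatter sequence. The cost is 16x the accumulator storage, so this only pays off while the number of buckets is small enough to stay in cache. Note that the floating-point sums are added in a different order than in a plain scalar loop, so results can differ in the last bits.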

shihab•4d ago
Yeah, N is big enough that the entire data set isn't in the cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing around a 94%+ L1d cache hit rate.

chillitom•44m ago
The initial example takes array pointers without the __restrict__ keyword/extension, so the compiler might assume they could alias the same memory and will generate defensive code.

It would be interesting to see if auto-vectorization performs better with that addition.
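For reference, a minimal sketch of what adding the qualifier looks like. The kernel `scale_add` and its parameters are hypothetical, not the article's loop, and `__restrict__` is the GCC/Clang spelling (MSVC uses `__restrict`). Whether it changes the generated code depends on the compiler and flags: without it, compilers often still vectorise but guard the vector loop with a runtime overlap check.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel: out[i] = a[i] * k + b[i].
// __restrict__ promises the compiler that out, a and b never overlap,
// so it does not have to assume a store to out[i] can change a[] or b[].
void scale_add(float* __restrict__ out,
               const float* __restrict__ a,
               const float* __restrict__ b,
               float k, std::size_t n) {
    // Cheap debug check of the promise; it only catches identical
    // start addresses, not partial overlap.
    assert(out != a && out != b);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * k + b[i];
    }
}
```

Calling it with overlapping buffers is undefined behaviour, so the qualifier is a contract the caller has to honour, not just a hint.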

chillitom•35m ago
Also, telling the compiler that the float* pointers are aligned would be a good move.

auto aligned_p = std::assume_aligned<16>(p);

Remnant44•18m ago
Which, honestly, shouldn't be necessary today with AVX-512. There's essentially no reason to prefer the aligned load/store instructions over the unaligned ones: if the actual pointer is unaligned it will function correctly at half the throughput, while if it is aligned you will get the same performance as the aligned-only load.

No reason for the compiler to balk at vectorizing unaligned data these days.

ecesena•21m ago
If you have the opportunity, try out a Zen 5. Significant improvements.

See also https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teard...
