WWII Veteran Recalls Discovering a Nazi Concentration Camp [video]

https://www.youtube.com/watch?v=LGWHf8Pe320
1•thomassmith65•5m ago•0 comments

Science is almost ready to "redefine the second" with this new research

https://www.neowin.net/news/science-is-almost-ready-to-redefine-the-second-with-this-new-research/
1•Bluestein•8m ago•0 comments

Estimating the carbon footprint of ChatGPT inference

https://charmindustrial.com/blog/can-i-use-chatgpt-if-i-care-about-the-climate
2•n2parko•11m ago•0 comments

Hygiene Hypothesis

https://en.wikipedia.org/wiki/Hygiene_hypothesis
1•thunderbong•11m ago•0 comments

The AI Mirage

https://www.theatlantic.com/technology/archive/2025/07/why-are-computers-still-so-dumb/683524/
1•outrun86•20m ago•0 comments

I Once Thought Europeans Lived as Well as Americans. Not Anymore

https://www.thefp.com/p/i-once-thought-europeans-lived-as-well-americans
1•petermcneeley•23m ago•0 comments

No. The C++ mascot is not a diseased rat named Keith

https://lunduke.locals.com/post/5111104/no-the-c-mascot-is-not-a-diseased-rat-named-keith
1•ChadNauseam•27m ago•0 comments

Journalist Karen Hao on Sam Altman, OpenAI and the "Quasi-Religious" Push for AI [video]

https://www.youtube.com/watch?v=s4hZz9Vd0lY
2•mgh2•35m ago•0 comments

A curated directory for developers to discover and showcase tech products

https://devhub.best
1•allentown521•35m ago•1 comments

Python Maps

https://github.com/symmy596/PythonMaps
2•fzliu•37m ago•0 comments

Show HN: Rate Reddit – before you get your feelings hurt

https://ratereddit.com
1•rodgetech•37m ago•0 comments

The Inerter: A Retrospective

https://www.annualreviews.org/content/journals/10.1146/annurev-control-053018-023917
2•teleforce•38m ago•0 comments

China Moves Forward with $167bn, 70 Gigawatt Dam

https://www.bloomberg.com/news/articles/2025-07-21/china-moves-ahead-with-167-billion-tibet-mega-dam-despite-risks
3•master_crab•45m ago•1 comments

AI model converts hospital records into text for better emergency care decisions

https://medicalxpress.com/news/2025-07-ai-hospital-text-emergency-decisions.html
1•PaulHoule•51m ago•0 comments

The future of climate change may not be what you think

https://www.readtangle.com/future-of-climate-change/
1•debo_•53m ago•2 comments

Show HN: NetXDP – Kernel-Level DDoS Protection and Traffic Manager with eBPF/XDP

2•gaurav1086•1h ago•0 comments

HTTP/1.1 Must Die – The Desync Endgame Begins

https://http1mustdie.com/
3•pabs3•1h ago•0 comments

The Epic Battle for AI Talent–With Exploding Offers, Secret Deals and Tears

https://www.wsj.com/tech/ai/meta-ai-recruiting-mark-zuckerberg-sam-altman-140d5861
1•brandonb•1h ago•0 comments

Hi guys, any thought on this project?

https://founder-hub-waitlist.vercel.app/
3•PaulKHO•1h ago•6 comments

Geocities Backgrounds

https://pixelmoondust.neocities.org/archives/archivedtiles
1•marcodiego•1h ago•0 comments

How Higher education failed America's poor

https://www.washingtonpost.com/opinions/2025/07/20/college-degree-value-poor-inequality/
8•pseudolus•1h ago•5 comments

this let you deploy your LLM agents into production with one click

https://agentainer.io/
1•cyw•1h ago•1 comments

Stem cells prioritize wound healing over hair growth

https://www.cell.com/cell-metabolism/fulltext/S1550-4131(25)00266-9
1•bookofjoe•1h ago•0 comments

Using Virtual Machines on macOS/Linux with Tart

https://developer.mamezou-tech.com/en/blogs/2024/02/12/tart-vm/
2•srid•1h ago•0 comments

Ask HN: What is the biggest waste of money?

5•alganet•1h ago•12 comments

Transfer.it – effortless file sharing, powered by MEGA

https://blog.mega.io/introducing-transfer-it
2•dotcoma•1h ago•2 comments

Maybe(?) Composable Continuation in C

https://old.reddit.com/r/C_Programming/comments/1m55ojy/maybe_composable_continuation_in_c/
1•Trung0246•1h ago•0 comments

Log by time, not by count

https://johnscolaro.xyz/blog/log-by-time-not-by-count
14•JohnScolaro•1h ago•8 comments

Thingiverse is cracking down on gun-related models using a new automated system

https://www.tomshardware.com/3d-printing/ghost-gun-proliferation-spurs-crackdown-at-thingverse-the-worlds-largest-3d-printer-model-design-repository-lawmakers-also-ask-3d-printer-vendors-to-create-ai-based-systems-to-detect-and-block-gun-prints
3•MrMember•1h ago•0 comments

China breakthrough in indium selenide (InSe) wafers with perfect stoichiometry

https://news.cgtn.com/news/2025-07-19/China-develops-new-method-to-mass-produce-high-quality-semiconductors-1F8iTEyEwVi/p.html
5•david927•1h ago•1 comments

FFmpeg devs boast of another 100x leap thanks to handwritten assembly code

https://www.tomshardware.com/software/the-biggest-speedup-ive-seen-so-far-ffmpeg-devs-boast-of-another-100x-leap-thanks-to-handwritten-assembly-code
180•harambae•6h ago

Comments

shmerl•5h ago
Still waiting for Pipewire + xdg desktop portal screen / window capture support in ffmpeg CLI. It's been dragging feet forever with it.
Aardwolf•5h ago
The article sometimes says 100x, other times it says 100% speed boost. E.g. it says "boosts the app’s ‘rangedetect8_avx512’ performance by 100.73%." but the screenshot shows 100.73x.

100x would be a 9900% speed boost, while a 100% speed boost would mean it's 2x as fast.

Which one is it?

pizlonator•5h ago
The ffmpeg folks are claiming 100x not 100%. Article probably has a typo
k_roy•2h ago
That would be quite the percentage difference with 100x
MadnessASAP•5h ago
100x for the single function, 100% (2x) for the whole filter.
torginus•4h ago
I'd guess the function operates on 8-bit values, judging from the name. If the previous implementation was scalar, a double-pumped AVX512 implementation can process 128 elements at a time, making the 100x speedup plausible.
ethan_smith•3h ago
It's definitely 100x (or 100.73x) as shown in the screenshot, which represents a 9973% speedup - the article text incorrectly uses percentage notation in some places.
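The arithmetic in this sub-thread can be written out as a small sketch. The helper names `multiplier_to_percent` and `percent_to_multiplier` are made up for illustration; the only assumption is the usual convention that "N% faster" means the new speed is (1 + N/100) times the old one.

```c
/* Convert a speedup multiplier ("2x as fast") into a percentage speed
 * boost. Hypothetical helper: 2x -> 100%, 100x -> 9900%. */
double multiplier_to_percent(double multiplier)
{
    return (multiplier - 1.0) * 100.0;
}

/* Inverse conversion: a 100% boost is 2x, a 9900% boost is 100x. */
double percent_to_multiplier(double percent)
{
    return 1.0 + percent / 100.0;
}
```

Under that convention, `multiplier_to_percent(100.73)` comes out at roughly 9973, while `percent_to_multiplier(100.73)` is only about 2.0073, which is the gap between the screenshot and the article text.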
pavlov•4h ago
Only for x86 / x86-64 architectures (AVX2 and AVX512).

It’s a bit ironic that for over a decade everybody was on x86 so SIMD optimizations could have a very wide reach in theory, but the extension architectures were pretty terrible (or you couldn’t count on the newer ones being available). And now that you finally can use the new and better x86 SIMD, you can’t depend on x86 ubiquity anymore.

Aurornis•4h ago
AVX512 is a set of extensions. You can’t even count on an AVX512 CPU implementing all of the AVX512 instructions you want to use, unless you stick to the foundation instructions.

Modern encoders also have better scaling across threads, though not infinite. I was in an embedded project a few years ago where we spent a lot of time trying to get the SoC’s video encoder working reliably until someone ran ffmpeg and we realized we could just use several of the CPU cores for a better result anyway

AaronAPU•4h ago
When I spent a decade doing SIMD optimizations for HEVC (among other things), it was sort of a joke to compare the assembly versions to plain C, because you’d get some ridiculous multipliers like 100x. It is pretty misleading; what it really means is it was extremely inefficient to begin with.

The devil is in the details: microbenchmarks typically call the same function a million times in a loop, and everything gets cached, reducing the overhead to sheer CPU cycles.

But that’s not how it’s actually used in the wild. It might be called once in a sea of many many other things.

You can at least go out of your way to create a massive test region of memory to prevent the cache from being so hot, but I doubt they do that.
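A rough sketch of the difference between a hot-cache and a colder microbenchmark loop, in C. Everything here is hypothetical: `max_u8` is a stand-in kernel, not FFmpeg code, and `bench_hot` / `bench_cold` are illustrative names. Timing is left to the caller, and a real harness would also need to stop the compiler from hoisting the repeated call out of the hot loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in kernel (hypothetical): largest byte in a buffer. */
static uint8_t max_u8(const uint8_t *buf, size_t len)
{
    uint8_t m = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] > m)
            m = buf[i];
    return m;
}

/* Hot-cache pattern: the same chunk is processed on every iteration,
 * so after the first pass the data is served from cache. */
uint64_t bench_hot(const uint8_t *region, size_t chunk, size_t iters)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < iters; i++)
        acc += max_u8(region, chunk);
    return acc; /* returned so the work is not trivially dead code */
}

/* Colder pattern: walk through a large region one chunk at a time and
 * wrap around at the end. If the region is much larger than the
 * last-level cache, most iterations touch data that is not cached. */
uint64_t bench_cold(const uint8_t *region, size_t region_len,
                    size_t chunk, size_t iters)
{
    if (chunk == 0 || region_len < chunk)
        return 0;
    size_t nchunks = region_len / chunk;
    uint64_t acc = 0;
    for (size_t i = 0; i < iters; i++)
        acc += max_u8(region + (i % nchunks) * chunk, chunk);
    return acc;
}
```

Timing both functions over the same number of iterations gives two numbers for one kernel. The gap between them is the part of a headline multiplier that depends on the cache being warm.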

torginus•4h ago
Sorry for the derail, but it sounds like you have a ton of experience with SIMD.

Have you used ISPC, and what are your thoughts on it?

I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand, as regular compilers suck at auto-vectorizing, especially as this has never been the case with GPU kernels.

almostgotcaught•2h ago
> Have you used ISPC

No professional kernel writer uses Auto-vectorization.

> I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand

You feel it's ridiculous because you've been sold a myth/lie (abstraction). In reality the details have always mattered.

CyberDildonics•15m ago
ISPC is a lot different from C++ compiler auto vectorization and it works extremely well. Have you tried it or not? If so where does it actually fall down? It warns you when doing slow stuff like gathers and scatters.
capyba•1h ago
Personally I’ve never been able to beat gcc or icx autovectorization by using intrinsics; often I’m slower by a factor of 1.5-2x.

Do you have any wisdom you can share about techniques or references you can point to?

jesse__•17m ago
I recently finished a 4 part series about vectorizing perlin noise.. from the very basics up to beating the state-of-the-art by 1.8x

https://scallywag.software/vim/blog/simd-perlin-noise-i

izabera•2h ago
ffmpeg is not too different from a microbenchmark, the whole program is basically just: while (read(buf)) write(transform(buf))
fuzztester•1h ago
the devil is in the details (of the holy assembly).

thus sayeth the lord.

praise the lord!

yieldcrv•2h ago
> what it really means is it was extremely inefficient to begin with

I care more about the outcome than the underlying semantics; to me that's kind of a given.

jauntywundrkind•3h ago
Kind of reminds me of Sound Open Firmware (SOF), which can compile with either unoptimized gcc, or using the proprietary Cadence XCC compiler that can use the Xtensa HiFi SIMD intrinsics.

https://thesofproject.github.io/latest/introduction/index.ht...

tombert•3h ago
Actually a bit surprised to hear that assembly is faster than optimized C. I figured that compilers are so good nowadays that any gains from hand-written assembly would be infinitesimal.

Clearly I'm wrong on this; I should probably properly learn assembly at some point...

mhh__•2h ago
Compilers are extremely good considering the amount of crap they have to churn through but they have zero information (by default) about how the program is going to be used so it's not hard to beat them.
haiku2077•2h ago
If anyone is curious to learn more, look up "profile-guided optimization" which observes the running program and feeds that information back into the compiler
mananaysiempre•2h ago
Looking at the linked patches, you’ll note that the baseline (ff_detect_range_c) [1] is bog-standard scalar C code while the speedup is achieved in the AVX-512 version (ff_detect_rangeb_avx512) [2] of the same computation. FFmpeg devs prefer to write straight assembly using a library of vector-width-agnostic macros they maintain, but at a glance the equivalent code looks to be straightforwardly expressible in C with Intel intrinsics if that’s more your jam. (Granted, that’s essentially assembly except with a register allocator, so the practical difference is limited.) The vectorization is most of the speedup, not the assembly.

To a first approximation, modern compilers can’t vectorize loops beyond the most trivial (say a dot product), and even that you’ll have to ask for (e.g. gcc -O3, which in other cases is often slower than -O2). So for mathy code like this they can easily be a couple dozen times behind in performance compared to wide vectors (AVX/AVX2 or AVX-512), especially when individual elements are small (like the 8-bit ones here).
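For reference, the "trivial" case mentioned above looks something like the following sketch. The function name `dot_u32` is made up, and unsigned integers are used on purpose: a floating-point sum generally will not be auto-vectorized by GCC unless reordering of the additions is allowed (e.g. with `-ffast-math`). Whether a given compiler actually vectorizes this loop is an assumption to verify, for example with GCC's `-fopt-info-vec` report or by reading the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Integer dot product: a simple reduction loop of the kind that
 * gcc -O3 is generally able to auto-vectorize. `restrict` promises the
 * compiler that the two arrays do not overlap. Unsigned arithmetic
 * wraps on overflow, so the result is well defined for any input. */
uint32_t dot_u32(const uint32_t *restrict a,
                 const uint32_t *restrict b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```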

Very tight scalar code, on modern superscalar CPUs... You can outcode a compiler by a meaningful margin, sometimes (my current example is a 40% speedup). But you have to be extremely careful (think dependency chains and execution port loads), and the opportunity does not come often (why are you writing scalar code anyway?..).

[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346725.h...

[2] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346726.h...

kasper93•2h ago
Moreover the baseline _c function is compiled with -march=generic and -fno-tree-vectorize on GCC. Hence it's the best case comparison for handcrafted AVX512 code. And while it is obviously faster and that's very cool, boasting the 100x may be misinterpreted by outside readers.

I was commenting there with some suggested changes and you can find more performance comparisons [0].

For example with small adjustment to C and compiling it for AVX512:

  after (gcc -ftree-vectorize --march=znver4)
  detect_range_8_c:                                      285.6 ( 1.00x)
  detect_range_8_avx2:                                   256.0 ( 1.12x)
  detect_range_8_avx512:                                 107.6 ( 2.65x)

Also I argued that it may be a little bit misleading to post a comparison without stating the compiler and flags used for said comparison [1].

P.S. There is related work to enable -ftree-vectorize by default [2]

[0] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346813.h...

[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346794.h...

[2] https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346439.h...

brigade•1h ago
It's AVX512 that makes the gains, not assembly. This kernel is simple enough that it wouldn't be measurably faster than C with AVX512 intrinsics.

And it's 100x because a) min/max have single instructions in SIMD vs cmp+cmov in scalar and b) it's operating in u8 precision so each AVX512 instruction does 64x min/max. So unlike the unoptimized scalar that has a throughput under 1 byte per cycle, the AVX512 version can saturate L1 and L2 bandwidth. (128B and 64B per cycle on Zen 5.)

But, this kernel is operating on an entire frame; if you have to go to L3 because it's more than a megapixel then the gain should halve (depending on CPU, but assuming Zen 5), and the gain decreases even more if the frame isn't resident in L3.
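To make the scalar side of that comparison concrete, here is a minimal sketch of a min/max scan over 8-bit samples, plus the lane-count arithmetic. This is not FFmpeg's `ff_detect_range_c`; the names `range_u8`, `detect_range_scalar` and `lanes_per_vector` are hypothetical and only illustrate the shape of the computation described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: smallest and largest sample seen. */
typedef struct {
    uint8_t min;
    uint8_t max;
} range_u8;

/* Scalar baseline: one compare (plus a conditional update) per bound
 * for every byte. For an empty buffer the result is min=255, max=0. */
range_u8 detect_range_scalar(const uint8_t *pix, size_t len)
{
    range_u8 r = { UINT8_MAX, 0 };
    for (size_t i = 0; i < len; i++) {
        uint8_t v = pix[i];
        if (v < r.min)
            r.min = v;
        if (v > r.max)
            r.max = v;
    }
    return r;
}

/* Number of elements one vector instruction handles at once:
 * 512-bit registers with 8-bit elements give 64 lanes. */
unsigned lanes_per_vector(unsigned vector_bits, unsigned element_bits)
{
    return vector_bits / element_bits;
}
```

With `lanes_per_vector(512, 8)` equal to 64, a single vector min or max instruction replaces 64 iterations of one of the scalar comparisons, which is where a two-orders-of-magnitude gap over unvectorized C can come from.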

saati•43m ago
The AVX2 version was still 64x faster than the C one, so AVX-512 is just a 50% improvement over that. Hand-vectorized assembly is very much the key to the gains.
mafuy•1h ago
If you ever dabble more closely in low level optimization, you will find the first instance of the C compiler having a brain fart within less than an hour.

Random example: https://stackoverflow.com/questions/71343461/how-does-gcc-no...

The code in question was called quadrillions of times, so this actually mattered.

MobiusHorizons•1h ago
Almost all performance critical pieces of C/C++ libraries (including things as seemingly mundane as strlen) use specialized handwritten assembly. Compilers are good enough for most people most of the time, but that’s only because most people aren’t writing software that is worth optimizing to this level from a financial perspective.
jesse__•20m ago
It's extremely easy to beat the compiler by dropping down to SIMD intrinsics. I recently wrote a 4 part .. guide? ..

https://scallywag.software/vim/blog/simd-perlin-noise-i

cpncrunch•3h ago
Article is unclear what will actually be affected. It mentions "rangedetect8_avx512" and calls it an obscure function. So, what situations is it actually used for, and what is the real-time improvement in performance for the entire conversion process?
brigade•1h ago
It's not conversion. Rather, this filter is used for video where you don't know whether the pixels are video or full range, or whether the alpha is premultiplied, and it determines that information. Usually so you can tag it correctly in metadata.

And the function in question is specifically for the color range part.
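As a hedged illustration of what the color range part could look like once the minimum and maximum sample values are known: for 8-bit luma, limited ("video") range nominally spans 16 to 235, while full range spans 0 to 255. The sketch below is an assumption about the general idea, not the logic of the actual filter, and the names `range_guess` and `guess_luma_range` are invented.

```c
#include <stdint.h>

/* Hypothetical classification result. */
typedef enum {
    RANGE_LIMITED_POSSIBLE, /* all samples fit inside 16..235 */
    RANGE_MUST_BE_FULL      /* at least one sample lies outside 16..235 */
} range_guess;

/* Classify 8-bit luma from its observed extremes. Samples outside the
 * nominal limited range imply full range; samples inside it are
 * consistent with limited range but do not prove it. */
range_guess guess_luma_range(uint8_t min, uint8_t max)
{
    if (min < 16 || max > 235)
        return RANGE_MUST_BE_FULL;
    return RANGE_LIMITED_POSSIBLE;
}
```

Real content can overshoot the nominal limits slightly, so a practical detector would likely need some tolerance; that part is left out of this sketch.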

cpncrunch•59m ago
It's still unclear from your explanation how it's actually used in practice. I run thousands of ffmpeg conversions every day, so it would be useful to know how/if this is likely to help me.

Are you saying that it's run once during a conversion as part of the process? Or that it's a specific flag that you give, it then runs this function, and returns output on the console?

(Either of those would be a one-time affair, so would likely result in close to zero speed improvement in the real world).

brigade•43m ago
This is a new filter that hasn’t even been committed yet, it only runs if explicitly specified, and would only ever be specified by someone that already knows that they don’t know the characteristics of their video.

So you wouldn’t ever run this.

ivanjermakov•3h ago
Related: ffmpeg's guide to writing assembly: https://news.ycombinator.com/item?id=43140614