
Prefix sum: 20 GB/s (2.6x baseline)

https://github.com/ashtonsix/perf-portfolio/tree/main/delta
62•ashtonsix•4h ago

Comments

tonetegeatinst•3h ago
Wonder if PTX programming for a GPU would accelerate this.
almostgotcaught•3h ago
Lol do you think "PTX programming" is some kind of trick path to perf? It's just inline asm. Sometimes it's necessary but most of the time "CUDA is all you need":

https://github.com/b0nes164/GPUPrefixSums

ashtonsix•2h ago
If the data is already in GPU memory, yes. Otherwise you'll be limited by the DRAM<->VRAM memory bottleneck.

When we consider that delta coding (and its family) is typically applied as one step in a series of CPU-first transforms, and benefits from L1-L3 caching, we find CPU throughput pulls far ahead of GPU-based approaches for typical workloads.

This note holds for all GPU-based approaches, not just PTX.

_zoltan_•2h ago
What is the typical workload you speak of, where CPUs are better?

We've been implementing GPU support in Presto/Velox for analytical workloads and I've yet to see a use case where we wouldn't pull ahead.

The DRAM-VRAM memory bottleneck isn't really a bottleneck on GH/GB platforms (you can pull 400+ GB/s across the C2C NVLink), and on NVL8 systems like the typical A100/H100 deployments out there, doing real workloads where the data is coming over the network links, you're toast without GPUDirect RDMA.

ashtonsix•2h ago
By typical I imagined adoption within commonly-deployed TSDBs like Prometheus, InfluxDB, etc.

GB/GH are actually ideal targets for my code: both architectures integrate Neoverse V2 cores, the same core I developed for. They are superchips with 144/72 CPU cores respectively.

The perf numbers I shared are for one core, so multiply the numbers I gave by 144/72 to get expected throughput on GB/GH. As you (apparently?) have access to this hardware I'd sincerely appreciate if you could benchmark my code there and share the results.

_zoltan_•1h ago
GB is CPU+2xGPU.

GH is readily available to anybody at $1.50 per hour on Lambda; GB is harder to get, and we're just beginning to experiment on it.

ashtonsix•1h ago
Each Grace CPU has multiple cores: https://www.nvidia.com/en-gb/data-center/grace-cpu-superchip

This superchip (might be different to whichever you're referring to) has 2 CPUs (144 cores): https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip...

tmostak•57m ago
Even without NVLink C2C, on a GPU with 16x PCIe 5.0 lanes to the host you have 128 GB/s in theory and 100+ GB/s in practice of bidirectional bandwidth (half that in each direction), so you still come out ahead with pipelining.

Of course prefix sums are often used within a series of other operators, so if these are already computed on GPU, you come out further ahead still.

ashtonsix•28m ago
Haha... GPUs are great. But do you mean to suggest we should swap a single ARM core for a top-line GPU with 10k+ cores and compare numbers on that basis? Surely not.

Let's consider this in terms of throughput-per-$ so we have a fungible measurement unit. I think we're all agreed that this problem's bottleneck is the host memory<->compute bus so the question is: for $1 which server architecture lets you pump more data from memory to a compute core?

It looks like you can get an H100 GPU with 16x PCIe 5.0 (128 GB/s theoretical, 100 GB/s realistic) for $1.99/hr from RunPod.

With an m8g.8xlarge instance (32 ARM CPU cores) you should get much better RAM<->CPU throughput (175 GB/s realistic) for $1.44/hr from AWS.

bassp•1h ago
Yes! There’s a canonical algorithm called the “Blelloch scan” for prefix sum (aka prefix scan, because you can generalize “sum” to “any binary associative function”) that’s very GPU friendly. I have… fond is the wrong word, but “strong” memories of implementing it in a parallel programming class :)

Here’s a link to a pretty accessible writeup, if you’re curious about the details: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...

ashtonsix•1h ago
Mm, I used that exact writeup as a reference to implement this algorithm in WebGL 3 years ago: https://github.com/ashtonsix/webglc/blob/main/src/kernel/sca...

It even inspired the alternative "transpose" method I describe in the OP README.

nullbyte•2h ago
This code looks like an alien language to me. Or maybe I'm just rusty at C.
ashtonsix•2h ago
The weirdness probably comes from heavy use of "SIMD intrinsics" (Googleable term). These are functions with a 1:1 correspondence to assembly instructions, used for processing multiple values per instruction.
mananaysiempre•1h ago
SIMD intrinsics are less C and more assembly with overlong mnemonics and a register allocator, so even reading them is something of a separate skill. Unlike the skill of achieving meaningful speedups by writing them (i.e. low-level optimization), it’s nothing special, but expect to spend a lot of time jumping between the code and the reference manuals[1,2] at first.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

[2] https://developer.arm.com/architectures/instruction-sets/int...

yogishbaliga•2h ago
Way back, I used delta encoding for storing posting lists (the inverted index of a search engine), and I experimented with using GPUs for decoding them. It turned out that, as another reply mentioned, copying the posting lists from CPU memory to GPU memory took way too long. If a posting list is static it can be copied to GPU memory once, which makes decoding faster, but there is still the bottleneck of copying the result back into CPU memory.

Nvidia's unified memory architecture may improve on this, as the same memory can be shared between CPU and GPU.

Certhas•2h ago
AMD has had unified memory for ages in HPC and for a while now in the Strix Halo systems. I haven't had the chance to play with one yet, but I have high hopes for some of our complex simulation workloads.
ashtonsix•1h ago
Oh neat. I have some related unpublished SOTA results I want to release soon: PEF/BIC-like compression ratios, with faster boolean algebra than Roaring bitmaps.
hughw•25m ago
The shared memory architecture doesn't eliminate copying the data across to the device. Edit: or back.
Galanwe•1h ago
While the results look impressive, I can't help but think "yeah but had you stored an absolute value every X deltas instead of just a stream of deltas, you would have had a perfectly scalable parallel decoding"
ashtonsix•1h ago
I just did a mini-ablation study for this (prefix sum). By getting rid of the cross-block carry (16 values), you can increase perf from 19.85 to 23.45 GB/s: the gain is modest as most performance is lost on accumulator carry within the block.

An absolute value every 16 deltas would undermine compression: a greater interval would lose even the modest performance gain, while a smaller interval would completely lose the compressibility benefits of delta coding.

It's a different matter, although there is definitely plausible motivation for absolute values every X deltas: query/update locality (mini-partition-level). You wouldn't want to transcode a huge number of values to access/modify a small subset.

jdonaldson•58m ago
While that sounds like a dealbreaker, I can't help but think "yeah, but if a decoding method took advantage of the prefix in a similarly scalable way, one would reap the same benefits".