frontpage.

Personalizing esketamine treatment in TRD and TRBD

https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1736114
1•PaulHoule•45s ago•0 comments

SpaceKit.xyz – a browser‑native VM for decentralized compute

https://spacekit.xyz
1•astorrivera•1m ago•1 comment

NotebookLM: The AI that only learns from you

https://byandrev.dev/en/blog/what-is-notebooklm
1•byandrev•1m ago•1 comment

Show HN: An open-source starter kit for developing with Postgres and ClickHouse

https://github.com/ClickHouse/postgres-clickhouse-stack
1•saisrirampur•2m ago•0 comments

Game Boy Advance d-pad capacitor measurements

https://gekkio.fi/blog/2026/game-boy-advance-d-pad-capacitor-measurements/
1•todsacerdoti•2m ago•0 comments

South Korean crypto firm accidentally sends $44B in bitcoins to users

https://www.reuters.com/world/asia-pacific/crypto-firm-accidentally-sends-44-billion-bitcoins-use...
1•layer8•3m ago•0 comments

Apache Poison Fountain

https://gist.github.com/jwakely/a511a5cab5eb36d088ecd1659fcee1d5
1•atomic128•5m ago•1 comment

Web.whatsapp.com appears to be having issues syncing and sending messages

http://web.whatsapp.com
1•sabujp•5m ago•2 comments

Google in Your Terminal

https://gogcli.sh/
1•johlo•7m ago•0 comments

Shannon: Claude Code for Pen Testing: #1 on Github today

https://github.com/KeygraphHQ/shannon
1•hendler•7m ago•0 comments

Anthropic: Latest Claude model finds more than 500 vulnerabilities

https://www.scworld.com/news/anthropic-latest-claude-model-finds-more-than-500-vulnerabilities
1•Bender•11m ago•0 comments

Brooklyn cemetery plans human composting option, stirring interest and debate

https://www.cbsnews.com/newyork/news/brooklyn-green-wood-cemetery-human-composting/
1•geox•12m ago•0 comments

Why the 'Strivers' Are Right

https://greyenlightenment.com/2026/02/03/the-strivers-were-right-all-along/
1•paulpauper•13m ago•0 comments

Brain Dumps as a Literary Form

https://davegriffith.substack.com/p/brain-dumps-as-a-literary-form
1•gmays•13m ago•0 comments

Agentic Coding and the Problem of Oracles

https://epkconsulting.substack.com/p/agentic-coding-and-the-problem-of
1•qingsworkshop•14m ago•0 comments

Malicious packages for dYdX cryptocurrency exchange empty user wallets

https://arstechnica.com/security/2026/02/malicious-packages-for-dydx-cryptocurrency-exchange-empt...
1•Bender•14m ago•0 comments

Show HN: I built a <400ms latency voice agent that runs on a 4GB VRAM GTX 1650

https://github.com/pheonix-delta/axiom-voice-agent
1•shubham-coder•15m ago•0 comments

Penisgate erupts at Olympics; scandal exposes risks of bulking your bulge

https://arstechnica.com/health/2026/02/penisgate-erupts-at-olympics-scandal-exposes-risks-of-bulk...
4•Bender•15m ago•0 comments

Arcan Explained: A browser for different webs

https://arcan-fe.com/2026/01/26/arcan-explained-a-browser-for-different-webs/
1•fanf2•17m ago•0 comments

What did we learn from the AI Village in 2025?

https://theaidigest.org/village/blog/what-we-learned-2025
1•mrkO99•17m ago•0 comments

An open replacement for the IBM 3174 Establishment Controller

https://github.com/lowobservable/oec
1•bri3d•20m ago•0 comments

The P in PGP isn't for pain: encrypting emails in the browser

https://ckardaris.github.io/blog/2026/02/07/encrypted-email.html
2•ckardaris•22m ago•0 comments

Show HN: Mirror Parliament where users vote on top of politicians and draft laws

https://github.com/fokdelafons/lustra
1•fokdelafons•22m ago•1 comment

Ask HN: Opus 4.6 ignoring instructions, how to use 4.5 in Claude Code instead?

1•Chance-Device•24m ago•0 comments

We Mourn Our Craft

https://nolanlawson.com/2026/02/07/we-mourn-our-craft/
1•ColinWright•26m ago•0 comments

Jim Fan calls pixels the ultimate motor controller

https://robotsandstartups.substack.com/p/humanoids-platform-urdf-kitchen-nvidias
1•robotlaunch•30m ago•0 comments

Exploring a Modern SMPTE 2110 Broadcast Truck with My Dad

https://www.jeffgeerling.com/blog/2026/exploring-a-modern-smpte-2110-broadcast-truck-with-my-dad/
1•HotGarbage•30m ago•0 comments

AI UX Playground: Real-world examples of AI interaction design

https://www.aiuxplayground.com/
1•javiercr•31m ago•0 comments

The Field Guide to Design Futures

https://designfutures.guide/
1•andyjohnson0•31m ago•0 comments

The Other Leverage in Software and AI

https://tomtunguz.com/the-other-leverage-in-software-and-ai/
1•gmays•33m ago•0 comments

Prefix sum: 20 GB/s (2.6x baseline)

https://github.com/ashtonsix/perf-portfolio/tree/main/delta
89•ashtonsix•3mo ago

Comments

tonetegeatinst•3mo ago
Wonder if PTX programming for a GPU would accelerate this.
almostgotcaught•3mo ago
Lol do you think "PTX programming" is some kind of trick path to perf? It's just inline asm. Sometimes it's necessary but most of the time "CUDA is all you need":

https://github.com/b0nes164/GPUPrefixSums

ashtonsix•3mo ago
If the data is already in GPU memory, yes. Otherwise you'll be limited by the DRAM<->VRAM memory bottleneck.

When we consider that delta coding (and its family) is typically applied as one step in a series of CPU-first transforms and benefits from L1-L3 caching, we find CPU throughput pulls far ahead of GPU-based approaches for typical workloads.

This note holds for all GPU-based approaches, not just PTX.

_zoltan_•3mo ago
What is this typical workload you speak of, where CPUs are better?

We've been implementing GPU support in Presto/Velox for analytical workloads and I'm yet to see a use case where we wouldn't pull ahead.

The DRAM-VRAM memory bottleneck isn't really a bottleneck on GH/GB platforms (you can pull 400+GB/s across the C2C NVLink), and on NVL8 systems like the typical A100/H100 deployments out there, doing real workloads, where the data is coming over the network links, you're toast without using GPUDirect RDMA.

ashtonsix•3mo ago
By typical I imagined adoption within commonly-deployed TSDBs like Prometheus, InfluxDB, etc.

GB/GH are actually ideal targets for my code: both architectures integrate Neoverse V2 cores, the same core I developed for. They are superchips with 144/72 CPU cores respectively.

The perf numbers I shared are for one core, so multiply the numbers I gave by 144/72 to get expected throughput on GB/GH. As you (apparently?) have access to this hardware, I'd sincerely appreciate it if you could benchmark my code there and share the results.

_zoltan_•3mo ago
GB is CPU+2xGPU.

GH is readily available to anybody at $1.50 per hour on Lambda; GB is harder to get and we're only just beginning to experiment with it.

ashtonsix•3mo ago
Each Grace CPU has multiple cores: https://www.nvidia.com/en-gb/data-center/grace-cpu-superchip

This superchip (might be different to whichever you're referring to) has 2 CPUs (144 cores): https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip...

tmostak•3mo ago
Even without NVLink C2C, on a GPU with 16x PCIe 5.0 lanes to the host you have 128 GB/s theoretical and 100+ GB/s practical bidirectional bandwidth (half that in each direction), so you still come out ahead with pipelining.

Of course prefix sums are often used within a series of other operators, so if these are already computed on GPU, you come out further ahead still.

ashtonsix•3mo ago
Haha... GPUs are great. But do you mean to suggest we should swap a single ARM core for a top-line GPU with 10k+ cores and compare numbers on that basis? Surely not.

Let's consider this in terms of throughput-per-$ so we have a fungible measurement unit. I think we're all agreed that this problem's bottleneck is the host memory<->compute bus, so the question is: for $1, which server architecture lets you pump more data from memory to a compute core?

It looks like you can get an H100 GPU with 16x PCIe 5.0 (128 GB/s theoretical, 100 GB/s realistic) for $1.99/hr from RunPod.

With an m8g.8xlarge instance (32 ARM CPU cores) you should get much better RAM<->CPU throughput (175 GB/s realistic) for $1.44/hr from AWS.
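
Taking those list prices at face value, that works out to roughly 100 / 1.99 ≈ 50 GB/s per dollar-hour for the H100 versus 175 / 1.44 ≈ 122 GB/s per dollar-hour for the Graviton instance.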

_zoltan_•3mo ago
GH200 is $1.50/hr at Lambda and can do 450 GB/s to the GPU. Seems still cheaper?
bassp•3mo ago
Yes! There's a canonical algorithm called the "Blelloch scan" for prefix sum (aka prefix scan, because you can generalize "sum" to "any binary associative function") that's very GPU-friendly. I have… fond is the wrong word, but "strong" memories of implementing it in a parallel programming class :)

Here’s a link to a pretty accessible writeup, if you’re curious about the details: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
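
For readers who haven't seen it, here is a minimal single-threaded C sketch of the Blelloch up-sweep/down-sweep structure (an exclusive scan over a power-of-two array; names here are illustrative, and on a GPU each level's inner loop is what runs in parallel):

  #include <stddef.h>
  #include <stdint.h>

  // Exclusive prefix sum via Blelloch's up-sweep/down-sweep; n must be a power of two.
  void blelloch_scan(uint32_t *a, size_t n) {
      // Up-sweep (reduce): build partial sums in a balanced tree.
      for (size_t stride = 1; stride < n; stride *= 2)
          for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
              a[i] += a[i - stride];
      // Clear the root, then down-sweep to distribute the partial sums.
      a[n - 1] = 0;
      for (size_t stride = n / 2; stride >= 1; stride /= 2)
          for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
              uint32_t t = a[i - stride];
              a[i - stride] = a[i];
              a[i] += t;
          }
  }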

ashtonsix•3mo ago
Mm, I used that exact writeup as a reference to implement this algorithm in WebGL 3 years ago: https://github.com/ashtonsix/webglc/blob/main/src/kernel/sca...

It even inspired the alternative "transpose" method I describe in the OP README.

TinkersW•3mo ago
Your average non-shared-memory GPU communicates with the CPU over PCIe, which is dogshit slow, like 100x slower than DRAM.

I can upload an average of about 3.7 MB per millisecond to my GPU (PCIe gen 3, x8), but it can be spiky and sometimes take longer than you might expect.

By comparison, a byte-based AVX2 prefix scan can pretty much run at the speed of DRAM, so there is never any reason to transfer to the GPU.

nullbyte•3mo ago
This code looks like an alien language to me. Or maybe I'm just rusty at C.
ashtonsix•3mo ago
The weirdness probably comes from heavy use of "SIMD intrinsics" (Googleable term). These are functions with a 1:1 correspondence to assembly instructions, used for processing multiple values per instruction.
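
For illustration, a toy sketch (not code from the repo) of one such intrinsic doing four additions with a single instruction:

  #include <arm_neon.h>

  // vaddq_u32 maps 1:1 to ADD Vd.4S, Vn.4S, Vm.4S: four 32-bit adds at once.
  void add4(const uint32_t *a, const uint32_t *b, uint32_t *out) {
      uint32x4_t va = vld1q_u32(a);       // load 4 values from a
      uint32x4_t vb = vld1q_u32(b);       // load 4 values from b
      vst1q_u32(out, vaddq_u32(va, vb));  // 4 additions in one instruction
  }
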
mananaysiempre•3mo ago
SIMD intrinsics are less C and more assembly with overlong mnemonics and a register allocator, so even reading them is something of a separate skill. Unlike the skill of achieving meaningful speedups by writing them (i.e. low-level optimization), it’s nothing special, but expect to spend a lot of time jumping between the code and the reference manuals[1,2] at first.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

[2] https://developer.arm.com/architectures/instruction-sets/int...

ack_complete•3mo ago
This is partially due to the compromises of mapping vector intrinsics into C (with C++ only being marginally better). In a more vector-oriented language, such as shader languages, this:

  s1 = vaddq_u32(s1, vextq_u32(z, s1, 2));
  s1 = vaddq_u32(s1, vdupq_laneq_u32(s0, 3));
would be more like this:

  s1.zw += s1.xy;
  s1 += s0.w;
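
For context, a rough reconstruction (not the repository's actual code) of where those two lines sit: the tail of a four-lane in-register inclusive scan, assuming z is a zero vector and s0 holds the previous block's already-scanned values:

  #include <arm_neon.h>

  // Inclusive prefix sum of the 4 lanes of s1, then carry in the running
  // total of the previous block (held in lane 3 of s0).
  uint32x4_t scan4_with_carry(uint32x4_t s1, uint32x4_t s0) {
      uint32x4_t z = vdupq_n_u32(0);
      s1 = vaddq_u32(s1, vextq_u32(z, s1, 3));    // shift up 1 lane, add
      s1 = vaddq_u32(s1, vextq_u32(z, s1, 2));    // shift up 2 lanes, add
      s1 = vaddq_u32(s1, vdupq_laneq_u32(s0, 3)); // add previous block's total
      return s1;
  }
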
mananaysiempre•3mo ago
To be fair, even in standard C11 you can do a bit better than the CPU manufacturer’s syntax

  #define vaddv(A) _Generic((A), \
      int8x8_t:    vaddv_s8,     \
      uint8x8_t:   vaddv_u8,     \
      int8x16_t:   vaddvq_s8,    \
      uint8x16_t:  vaddvq_u8,    \
      int16x4_t:   vaddv_s16,    \
      uint16x4_t:  vaddv_u16,    \
      int16x8_t:   vaddvq_s16,   \
      uint16x8_t:  vaddvq_u16,   \
      int32x2_t:   vaddv_s32,    \
      uint32x2_t:  vaddv_u32,    \
      float32x2_t: vaddv_f32,    \
      int32x4_t:   vaddvq_s32,   \
      uint32x4_t:  vaddvq_u32,   \
      float32x4_t: vaddvq_f32,   \
      int64x2_t:   vaddvq_s64,   \
      uint64x2_t:  vaddvq_u64,   \
      float64x2_t: vaddvq_f64)(A)
while in GNU C you can in fact use normal arithmetic and indexing (but not swizzles) on vector types.
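
A small illustration of that GNU C extension (a sketch; vector_size and lane indexing are GCC/Clang-specific):

  #include <stdint.h>

  // GNU C vector extension: ordinary operators apply lane-wise, no intrinsics needed.
  typedef uint32_t u32x4 __attribute__((vector_size(16)));

  u32x4 add_and_scale(u32x4 a, u32x4 b) {
      u32x4 sum = a + b;   // lane-wise add
      sum[0] += sum[3];    // plain indexing of individual lanes
      return sum * 2;      // the scalar is broadcast across lanes
  }
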
yogishbaliga•3mo ago
Way back in time, I used delta encoding for storing posting lists (the inverted index for a search index). I experimented with using GPUs for decoding the posting lists. It turned out that, as another reply mentioned, copying the posting list from CPU memory to GPU memory was taking way too long. If the posting list is static, it can be copied to GPU memory once, which makes decoding faster. But there is still the bottleneck of copying the result back into CPU memory.

Nvidia's unified memory architecture may make this better, as the same memory can be shared between CPU and GPU.

Certhas•3mo ago
AMD has had unified memory for ages in HPC and for a while now in the Strix Halo systems. I haven't had the chance to play with one yet, but I have high hopes for some of our complex simulation workloads.
ashtonsix•3mo ago
Oh neat. I have some related unpublished SOTA results I want to release soon: PEF/BIC-like compression ratios, with faster boolean algebra than Roaring Bitsets.
hughw•3mo ago
The shared memory architecture doesn't eliminate copying the data across to the device. Edit: or back.
yogishbaliga•3mo ago
If it is unified memory, the CPU can access the result of GPU processing without copying it to CPU memory (theoretically).
hughw•3mo ago
If the CPU touches an address mapped to the GPU, doesn't it fault a page into the CPU address space? I mean, the program doesn't do anything special, but a page gets faulted in, I believe.
Galanwe•3mo ago
While the results look impressive, I can't help but think "yeah, but had you stored an absolute value every X deltas instead of just a stream of deltas, you would have had perfectly scalable parallel decoding".
ashtonsix•3mo ago
I just did a mini-ablation study for this (prefix sum). By getting rid of the cross-block carry (16 values), you can increase perf from 19.85 to 23.45 GB/s: the gain is modest as most performance is lost on accumulator carry within the block.

An absolute value every 16 deltas would undermine compression: a greater interval would lose even the modest performance gain, while a smaller interval would completely lose the compressibility benefits of delta coding.

It's a different matter, although there is definitely plausible motivation for absolute values every X deltas: query/update locality (mini-partition-level). You wouldn't want to transcode a huge number of values to access/modify a small subset.

zamadatix•3mo ago
I think what GP is saying is you can drop an absolute value every e.g. 8192 elements (or even orders of magnitude more if you're actually processing GBs of elements) and this frees you to compute the blocks in parallel threads in a dependency-free manner. The latency for a block would still be bound by the single-core rate, but the throughput of a stream is likely memory-bound after 2-3 cores. It still hurts the point of doing delta compression, but not nearly as badly as every 16 values would.

Even if one is willing to adopt such an encoding scheme, you'd still want to optimize what you have here anyway. It also doesn't help, as mentioned, if the goal is actually latency of small streams rather than throughput of large ones.
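
A sketch of that scheme (a hypothetical layout, not the repo's format: one absolute anchor per 8192-element block, so blocks decode independently; the OpenMP pragma stands in for whatever threading you'd actually use):

  #include <stddef.h>
  #include <stdint.h>

  #define BLOCK 8192  /* one absolute anchor stored per block */

  // deltas[i] = x[i] - x[i-1] within a block; anchors[b] = first value of
  // block b. No block depends on another, so blocks decode in parallel.
  void decode_blocks(const int32_t *deltas, const int32_t *anchors,
                     int32_t *out, size_t n) {
      size_t nblocks = (n + BLOCK - 1) / BLOCK;
      #pragma omp parallel for
      for (size_t b = 0; b < nblocks; b++) {
          size_t lo = b * BLOCK;
          size_t hi = lo + BLOCK < n ? lo + BLOCK : n;
          int32_t acc = anchors[b];
          out[lo] = acc;
          for (size_t i = lo + 1; i < hi; i++) {
              acc += deltas[i];
              out[i] = acc;
          }
      }
  }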

ashtonsix•3mo ago
Oh right. That's sensible enough. Makes total sense to parallelise across multiple cores.

I wouldn't expect a strictly linear speed-up due to contention on the memory bus, but it's not as bad as flat-lining after engaging 2-3 cores. On most AWS Graviton instances you should be able to pull ~5.5 GB/s per-core even with all cores active, and that becomes less of a bottleneck when considering you'll typically run a sequence of compute operations between memory round-trips (not just delta).

jdonaldson•3mo ago
While that sounds like a dealbreaker, I can't help but think "yeah, but if a decoding method took advantage of the prefix in a similarly scalable way, one would reap the same benefits".
shoo•3mo ago
fgiesen has a 3-part series of blog posts from 2018, "reading bits in far too many ways", focusing on the problem of reading data encoded with a variable-bit-length code.

Part 3 explores a straightforward idea that can be used to improve performance of bitstream decoding - decode multiple streams at once.

> if decoding from a single bitstream is too serial, then why not decode from multiple bitstreams at once?

> So how many streams should you use? It depends. At this point, for anything that is even remotely performance-sensitive, I would recommend trying at least N=2 streams. Even if your decoder has a lot of other stuff going on (computations with the decoded values etc.), bitstream decoding tends to be serial enough that there’s many wasted cycles otherwise, even on something relatively narrow like a dual-issue in-order machine.

> If you’re using multiple streams, you need to decide how these multiple streams get assembled into the final output bitstream. If you don’t have any particular reason to go with a fine-grained interleaving, the easiest and most straightforward option is to concatenate the sub-streams

For delta coding, perhaps instead of resetting the delta encoding stream with an absolute value every X deltas, for a small value of X, this suggests we could use a much larger value of X, i.e. partition the input into coarse chunks and delta encode each, and then rewrite our decoder to decode a pair (or more) of the resulting streams at once.

Unclear if that would help at all here, or if there's already enough independent work for the CPU to keep busy with.
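
Applied to delta coding, that might look like the following sketch (hypothetical: the input has been split into two independently delta-encoded halves, so the two accumulator chains can overlap on a superscalar core):

  #include <stddef.h>
  #include <stdint.h>

  // Decode two independently delta-encoded streams in lockstep. Each stream
  // carries its own serial accumulator chain, so the CPU can overlap them.
  void decode_two_streams(const int32_t *d0, const int32_t *d1,
                          int32_t *out0, int32_t *out1, size_t n) {
      int32_t acc0 = 0, acc1 = 0;  // or start from stored anchors
      for (size_t i = 0; i < n; i++) {
          acc0 += d0[i];  out0[i] = acc0;
          acc1 += d1[i];  out1[i] = acc1;
      }
  }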

Part 1 https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far...

Part 2 https://fgiesen.wordpress.com/2018/02/20/reading-bits-in-far...

Part 3 https://fgiesen.wordpress.com/2018/09/27/reading-bits-in-far...

jasonthorsness•3mo ago
"We achieve 19.8 GB/s prefix sum throughput—1.8x faster than a naive implementation and 2.6x faster than FastPFoR"

"FastPFoR is well-established in both industry and academia. However, on our target platform (Graviton4, SIMDe-compiled) it benchmarks at only ~7.7 GB/s, beneath a naive scalar loop at ~10.8 GB/s."

I thought the first bit was a typo but it was correct; the naive approach was faster than a "better" method. Another demonstration of how actually benchmarking on the target platform is important!

coolThingsFirst•3mo ago
I had the opportunity (misfortune) to click on that resume and it could use some aesthetic improvements. :)

Fun project btw, HPC is always interesting.