
Helion: A high-level DSL for performant and portable ML kernels

https://pytorch.org/blog/helion/
150•jarbus•3mo ago

Comments

dachworker•3mo ago
I'm super excited to give this one a spin. It seems like a neat idea, Triton, but simpler and with automatic autotuning. My head is spinning with options right now. I love how everyone was hyping up CUDA this and CUDA that a couple of years ago, and now CUDA is all but irrelevant. There's now so many different and opinionated takes on how you should write high performant accelerator cluster code. I love it.

It's also kind of ironic that right now in 2025, we have all this diversity in tooling, but at the same time, the ML architecture space has collapsed entirely and everyone is just using transformers.

almostgotcaught•3mo ago
> and now CUDA is all but irrelevant.

Lol this is so wrong it's cringe.

> There's now so many different and opinionated takes on how you should write high performant accelerator cluster code. I love it.

There are literally only 2: SIMT (ie the same as it always was) and tiles (ie Triton). That's it. Helion is just Triton with more auto-tuning (Triton already has auto-tuning).
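
To make the SIMT-versus-tiles distinction concrete, here is a minimal plain-Python sketch of a vector add written both ways. It is only an analogy: the function names, the block size, and the sequential loops standing in for parallel launches are all assumptions for illustration, not CUDA or Triton code.

```python
def simt_add(x, y, out, thread_id):
    """SIMT style: each logical thread handles one scalar element."""
    if thread_id < len(out):  # bounds guard, as a per-thread kernel would need
        out[thread_id] = x[thread_id] + y[thread_id]


def tile_add(x, y, out, block_id, block_size):
    """Tile style: each program instance handles a whole block of elements."""
    start = block_id * block_size
    stop = min(start + block_size, len(out))  # clamp the last, partial tile
    out[start:stop] = [a + b for a, b in zip(x[start:stop], y[start:stop])]


def launch_simt(x, y):
    out = [None] * len(x)
    for thread_id in range(len(x)):  # stands in for a grid of parallel threads
        simt_add(x, y, out, thread_id)
    return out


def launch_tiles(x, y, block_size=4):
    out = [None] * len(x)
    num_blocks = (len(x) + block_size - 1) // block_size  # ceiling division
    for block_id in range(num_blocks):  # stands in for a grid of program instances
        tile_add(x, y, out, block_id, block_size)
    return out
```

In the first version the programmer reasons about one element per thread; in the second, about one block per program instance, leaving the mapping of elements to threads inside a block to the compiler.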

the__alchemist•3mo ago
Even for non-ML things like chem simulations: CUDA (and cuFFT) are more pleasant to use than Vulkan Compute and vkFFT.
ozgrakkurt•3mo ago
I just learned the graphics API of Vulkan; I can’t imagine anything being less pleasant than Vulkan.
porridgeraisin•3mo ago
Yeah, it's quite something. If anyone wants a preview, here's the triangle hello world in Vulkan: https://gist.github.com/Overv/7ac07356037592a121225172d7d78f...

But then again, I've heard that it's this low level because it's meant for engine developers.

simlevesque•3mo ago
Oh wow that's horrible.
anvuong•3mo ago
Really? How low is this level actually? Because I remember my OpenGL class' professor did this in less than 50 lines.
porridgeraisin•3mo ago
Imagine writing GlCreateContext yourself, for starters, as has been done in the link I posted.
embedding-shape•3mo ago
> CUDA that a couple of years ago, and now CUDA is all but irrelevant

What? CUDA won't be irrelevant for years. Even if all the competitors figure out the holy grail, the ecosystem doesn't suddenly migrate overnight. People learning CUDA today will continue to find jobs and opportunities across the sector for the near future without any worries.

> but at the same time, the ML architecture space has collapsed entirely and everyone is just using transformers.

That's also not true. The ML space is still growing, and lots of things are happening outside of Transformers, but it requires you to actually look and pay attention, not just browse the HN and r/localllama frontpage.

Overall, these do not seem to be the sentiments of someone inside the ML space, but rather an onlooker's perspective.

pjmlp•3mo ago
In what alternative reality is that the case?
brap•3mo ago
Asking as someone who is really out of the loop: how much of ML development these days touches these “lower level” parts of the stack? I’d expect that by now most of the work would be high level, and the infra would be mostly commoditized.
brrrrrm•3mo ago
a recent wave of interest in bitwise equivalent execution had a lot of kernels at this level get pumped out.

new attention mechanisms also often need new kernels to run at any reasonable rate

there's definitely a breed of frontend-only ML dev that dominates the space, but a lot of novel exploration needs new kernels

embedding-shape•3mo ago
> how much of ML development these days touches these “lower level” parts of the stack? I’d expect that by now most of the work would be high level

Every time the high-level architectures of models change, there are new lower-level optimizations to be done. Even recent releases like GPT-OSS add new areas for improvement, like MXFP4, that require the lower-level parts to be created and optimized.

westurner•3mo ago
How often do hardware optimizations get created for lower level optimization of LLMs and Tensor physics? How reconfigurable are TPUs? Are there any standardized feature flags for TPUs yet?

Is TOPS/Whr a good efficiency metric for TPUs and for LLM model hosting operations?

From https://news.ycombinator.com/item?id=45775181 re: current TPUs in 2025; "AI accelerators" :

> How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU, Tenstorrent's, and AMD's NPU designs?

almostgotcaught•3mo ago
this is like 5 different questions all across the landscape - what exactly do you think answers will do for you?

> How often do hardware optimizations get created for lower level optimization of LLMs and Tensor physics?

LLMs? all the time? "tensor physics" (whatever that is) never

> How reconfigurable are TPUs?

very? as reconfigurable as any other programmable device?

> Are there any standardized feature flags for TPUs yet?

have no idea what a feature flag is in this context nor why they would be standardized (there's only one manufacturer/vendor/supplier of TPUs).

> Is TOPS/Whr a good efficiency metric for TPUs and for LLM model hosting operations?

i don't see why it wouldn't be? you're just asking is (stuff done)/(energy consumed) a good measure of efficiency to which the answer is yes?

westurner•3mo ago
> have no idea what a feature flag is in this context nor why they would be standardized (there's only one manufacturer/vendor/supplier of TPUs).

X86, ARM, and RISC have all standardized on feature flags which can be reviewed on Linux with /proc/cpuinfo or with dmidecode.

  cat /proc/cpuinfo | grep -E '^processor|Features|^BogoMIPS|^CPU'

There are multiple TPU vendors. I listed multiple AI accelerator TPU products in the comment you are replying to.

> How reconfigurable are TPUs?

TIL Google's TPUs are reconfigurable with OCS Optical Circuit Switches that can be switched between for example 3D torus or twisted torus configurations.

(FWIW also, quantum libraries mostly have Line qubits and Lattice qubits. There is a recent "Layer Coding" paper; to surpass Surface Coding.)

But classical TPUs;

I had already started preparing a response to myself to improve those criteria, and then, paraphrasing from 2.5pro:

> Don't rank by TOPS/wHr alone; rank by TOPS/wHr @ [Specific Precision]. Don't rank by Memory Bandwidth alone; rank by Effective Bandwidth @ [Specific Precision].

Hardware Rank criteria for LLM hosting costs:

Criterion 1: EGB (Effective Generative Bandwidth) = Memory Bandwidth (GB/s) / Precision (Bytes)

Criterion 2: GE (Generative Efficiency) = EGB / Total Board Power (Watts)

Criterion 3: TTFT Potential = Raw TOPS @ Prompt Precision

LLM hosting metrics: Tokens Per Second (TPS) for throughput, Time to First Token (TTFT) for latency, and Tokens Per Joule for efficiency.
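
As a rough illustration of the ranking criteria above, here is a minimal Python sketch that computes EGB, GE, and tokens per joule. The `Accelerator` class, the function names, and the two devices with their numbers are hypothetical placeholders, not measurements of any real hardware.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    memory_bandwidth_gb_s: float  # peak memory bandwidth in GB/s
    board_power_watts: float      # total board power in watts


def effective_generative_bandwidth(acc, precision_bytes):
    """Criterion 1: memory bandwidth divided by bytes per weight."""
    return acc.memory_bandwidth_gb_s / precision_bytes


def generative_efficiency(acc, precision_bytes):
    """Criterion 2: EGB divided by total board power."""
    return effective_generative_bandwidth(acc, precision_bytes) / acc.board_power_watts


def tokens_per_joule(tokens_per_second, board_power_watts):
    """Hosting efficiency: a watt is a joule per second, so TPS / W = tokens / J."""
    return tokens_per_second / board_power_watts


def rank_by_efficiency(accelerators, precision_bytes):
    """Sort devices from most to least generatively efficient at one precision."""
    return sorted(
        accelerators,
        key=lambda acc: generative_efficiency(acc, precision_bytes),
        reverse=True,
    )


# Hypothetical devices, made up purely for illustration.
device_a = Accelerator("device_a", memory_bandwidth_gb_s=1000.0, board_power_watts=250.0)
device_b = Accelerator("device_b", memory_bandwidth_gb_s=2000.0, board_power_watts=800.0)
ranking = rank_by_efficiency([device_b, device_a], precision_bytes=2)  # 2 bytes = 16-bit
```

With these made-up numbers, the device with the lower raw bandwidth ranks first because it draws far less power, which is the point of normalizing at a fixed precision rather than ranking by raw bandwidth.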

almostgotcaught•3mo ago
> There are multiple TPU vendors

There are not - TPU is literally a Google trademark:

> Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google.

https://en.wikipedia.org/wiki/Tensor_Processing_Unit

The rest of what you're talking about is irrelevant

westurner•2mo ago
"A Brief Guide of xPU for AI Accelerators" https://www.sigarch.org/a-brief-guide-of-xpu-for-ai-accelera...

NPU: Neural Processing Unit: https://en.wikipedia.org/wiki/Neural_processing_unit

Coprocessor: https://en.wikipedia.org/wiki/Coprocessor

anvuong•3mo ago
There are some not-so-niche communities, like the FlashAttention and LinearFlashAttention repos. New code/optimizations get committed on a weekly basis. They find a couple of percent here and there all the time. How useful their kernels actually are in terms of producing good results remains to be seen, but their implementations are often much better (in FLOPS) than what was proposed in the original papers.

It's just like game optimization: cache-friendliness and memory hierarchy-awareness are huge in attention mechanisms. But programming the backward pass in these lower-level stacks is definitely not fun; tensor calculus breaks my brain.

markush_•3mo ago
Interesting choice from PyTorch to release yet another DSL. On the positive side, it's one more point in the design space; on the other hand, it's even more difficult to choose the right technology among Triton, Gluon, CuTe, ThunderKittens and a few others.
chillee•3mo ago
I think unlike Gluon/CuTe/ThunderKittens (which distinguish themselves from Triton by being lower level giving you more control, thus being less performance portable and harder to write), Helion distinguishes itself from Triton by being higher level and easier to write.

IMO, this is something that makes sense for PyTorch to release, as "neutral ground" in the industry.

bobajeff•3mo ago
It's good to see more effort for making things not device specific but I only see benchmarks for NVIDIA B200 and AMD MI350X. Also what's the experience of using one of these Python DSLs like? Are the tools good enough to make code completion, jump to definition, setting breakpoints, watching variables, copying as expression etc. nice?
saagarjha•3mo ago
Generally you are unlikely to get Python-level debugging for code that is going to run on GPUs.
giovannibonetti•3mo ago
That's Mojo's selling point.

https://www.modular.com/mojo

doctorpangloss•3mo ago
Is contributing to Triton so bad? It looks like the blocker is usually LLVM.
saagarjha•3mo ago
It’s not that bad, but I’m not sure why this is relevant?
uoaei•3mo ago
Tangential question related to the example kernel: in GPU programming is it idiomatic/standard to initialize the out array as zeros rather than empty? are the performance savings negligible?
saagarjha•3mo ago
It saves a kernel launch and memory bandwidth for a fill kernel. If you’re going to overwrite the data anyway, why bother?
porridgeraisin•3mo ago
They have made it empty only.

>> out = torch.empty([m, n], dtype=x.dtype, device=x.device)

The accumulator has been initialized to zero, since well, they have to add stuff into it.

>> acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)

> idiomatic

No, as far as I have seen, they generally try not to initialize if it's not necessary.

> overhead

There is the memory bandwidth point, as you might expect. But additionally, when using high-level interfaces like PyTorch, writing torch.zeros(512, 512) launches a whole kernel (tens of microseconds) just for that line. So that's CPU -> GPU -> back to CPU, and then it does the next line, where it goes to the GPU again and uses that memory. So in these cases you make sure to avoid it if it's in a hot path. Ideally you want the second kernel to do the initialization itself. When you write CUDA C++ yourself, this is how you typically do it. Helion, being a compiler, might be doing this optimization, but runtime-based torch clearly can't.
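
The empty-output, zeroed-accumulator pattern discussed above can be sketched in plain Python. This is a stand-in for the tiled matmul under discussion, not Helion code: the `tiled_matmul` name, the list-of-lists layout, and the tile size are assumptions for illustration.

```python
def tiled_matmul(x, y, tile=2):
    """Multiply x (m x k) by y (k x n), one output tile at a time."""
    m, k = len(x), len(x[0])
    n = len(y[0])
    # "Empty" output: the placeholders are never read, only overwritten,
    # so there is no reason to spend a separate pass filling them with zeros.
    out = [[None] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            i1, j1 = min(i0 + tile, m), min(j0 + tile, n)
            # The accumulator must start at zero because products are added into it.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, k, tile):
                k1 = min(k0 + tile, k)
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        for kk in range(k0, k1):
                            acc[i - i0][j - j0] += x[i][kk] * y[kk][j]
            # Each output element is written exactly once, from the finished accumulator.
            for i in range(i0, i1):
                out[i][j0:j1] = acc[i - i0]
    return out
```

Because every element of `out` is overwritten by the store at the end of its tile, zero-filling it first would be wasted work, whereas `acc` cannot skip initialization.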

darknoon•3mo ago
The developers also gave a talk about Helion on GPU Mode: https://www.youtube.com/watch?v=1zKvCLuvUYc
bwfan123•3mo ago
I don't get the point of Helion as compared to its alternatives like Gluon.

For best performance I would presume one needs low-level access to hardware knobs. And these kernel primitives are written one time and reused. So, what is the point of a DSL that dumbs things down as a wrapper around Triton?

krapht•3mo ago
Funny, I feel the same way about Triton. Performant Triton looks like CUDA (but with tiles!) except it's ten times harder to debug since it doesn't have the tooling NVIDIA provides.

If I had to run on AMD I'd rather deal with their hipify tooling.

saagarjha•3mo ago
Performant Triton programs are usually simpler and shorter than their CUDA equivalents. This alone makes it easier to write, and I would argue that it helps with debugging too because the model provides a lot more guarantees on how your code executes. That said, some of the tooling is notably poor (such as cuda-gdb support).
krapht•3mo ago
Agree on shorter, disagree on simpler. The hard part of understanding GPU code is knowing the reasons why algorithms are the way they are. For example, why we do a split-k decomposition when doing a matrix multiplication, or why are we loading this particular data into shared memory at this particular time, with some overlapping subset into registers.

Getting rid of the for loop over an array index doesn't make it easier to understand the hard parts. Losing the developer perf and debug tooling is absolutely not worth the tradeoff.

For me I'd rather deal with Jax or Numba, and if that still wasn't enough, I would jump straight to CUDA.

It's possible I'm an old fogey with bias, though. It's true that I've spent a lot more time with CUDA than with the new DSLs on the block.
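
For readers unfamiliar with the split-k decomposition mentioned in the comment above, here is a minimal plain-Python sketch. The usual motivation is that when the output matrix is small relative to the reduction dimension K, tiling only over the output leaves too little independent work, so K is split into slices that can be computed independently and then summed. The function names and the sequential loop over slices are assumptions for illustration; on a GPU each slice would run in parallel.

```python
def matmul_k_slice(x, y, k_start, k_stop):
    """Partial product of x and y using only rows/columns k_start..k_stop-1 of K."""
    m, n = len(x), len(y[0])
    partial = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for kk in range(k_start, k_stop):
                partial[i][j] += x[i][kk] * y[kk][j]
    return partial


def split_k_matmul(x, y, num_splits=2):
    """Compute x @ y by splitting the K dimension and reducing the partial results."""
    m, n, k = len(x), len(y[0]), len(y)
    chunk = max(1, (k + num_splits - 1) // num_splits)  # ceiling division
    # Each slice is independent work; this loop stands in for parallel execution.
    partials = [
        matmul_k_slice(x, y, start, min(start + chunk, k))
        for start in range(0, k, chunk)
    ]
    # Final reduction across slices: the extra cost that split-k pays for parallelism.
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]
```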

saagarjha•3mo ago
I don’t think it is possible to write high performance code without understanding how the hardware works. I just think staring at code that coalesces your loads or swizzles your layouts for the hundredth time is a waste of screen space, though. Just let the compiler do it and when it gets it wrong then you can bust out the explicit code you were going to write in CUDA, anyway.
chillee•3mo ago
What's the point of Triton compared to Gluon? What's the point of PyTorch compared to Triton?

One of the main values of Triton is that it significantly expanded the scope of folks who can write kernels - I think Helion could expand the scope even more.

singularity2001•3mo ago
Anything, as long as I don't have to touch proprietary CUDA and mpx
saagarjha•3mo ago
You’ll need an execution backend.
jarbus•3mo ago
I posted this 5 days ago, how did this resurface?
koolba•3mo ago
https://news.ycombinator.com/item?id=26998308
mshockwave•3mo ago
Is it normal to spend 10 minutes on tuning nowadays? Do we need to spend another 10 minutes upon changing the code?
anvuong•3mo ago
You mean autotune? I think 10 minutes is pretty normal; torch.compile(mode='max-autotune') can be much slower than that for large models.
Mars008•3mo ago
Add to that, it can be done just once by developers before distribution, for the major hardware targets. The configs are saved and then selected on the client side.
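
The tune-once, reuse-later idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how Helion or torch.compile implement autotuning: `autotune`, `TunedKernel`, `blocked_sum`, and the `block_size` parameter are hypothetical names, and the "kernel" is an ordinary CPU function.

```python
import time


def autotune(kernel, configs, args, repeats=3):
    """Time the kernel under each candidate config and return the fastest one."""
    best_config, best_time = None, float("inf")
    for config in configs:
        elapsed = float("inf")
        for _ in range(repeats):  # keep the best of a few runs to reduce noise
            start = time.perf_counter()
            kernel(*args, **config)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config


class TunedKernel:
    """Tunes on the first call for a given input size, then reuses the saved config."""

    def __init__(self, kernel, configs):
        self.kernel = kernel
        self.configs = configs
        self.cache = {}  # input sizes -> chosen config; could be saved and shipped

    def __call__(self, *args):
        key = tuple(len(a) for a in args)
        if key not in self.cache:
            self.cache[key] = autotune(self.kernel, self.configs, args)
        return self.kernel(*args, **self.cache[key])


def blocked_sum(values, block_size=1):
    """Toy kernel: sum a list in blocks; block_size is the tunable parameter."""
    total = 0
    for start in range(0, len(values), block_size):
        total += sum(values[start:start + block_size])
    return total


tuned_sum = TunedKernel(blocked_sum, [{"block_size": b} for b in (1, 8, 64)])
```

The first call to `tuned_sum` pays the search cost; later calls with the same input size go straight to the cached config, which is the part a developer could precompute and distribute.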
ballpug•3mo ago
Compiling a kernel after assemblage in low-level object oriented languages either uses stable kernel or the cargo fuzzed raw_spinlock code.

Helion abstracts syntax and design for calculating λ-functions, which converts language in a kernel config.

a-dub•3mo ago
numba for gpu kernels... cool!
sega_sai•3mo ago
I switched from PyTorch to JAX just before Triton appeared. Does anyone know how JAX compares to this autotuning machinery in PyTorch? I know JAX does JIT, but I don't have a good intuition for whether JIT is better than this type of autotuning.
yarri•3mo ago
Pallas is the Triton equivalent in JAX land. There are some old auto tuning prototypes if you search for Pallas, like this https://github.com/jax-ml/jax-triton/pull/108
maknee•3mo ago
How does this compare against other DSLs?
chillee•3mo ago
If you think of Triton as a "baseline", most other DSLs are lower-level than Triton, whereas this is higher-level.