frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
289•theblazehen•2d ago•95 comments

Software Engineering Is Back

https://blog.alaindichiappari.dev/p/software-engineering-is-back
20•alainrk•1h ago•11 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
34•AlexeyBrin•1h ago•5 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
15•onurkanbkrc•1h ago•1 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
717•klaussilveira•16h ago•218 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
978•xnx•21h ago•562 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
94•jesperordrup•6h ago•35 comments

France's homegrown open source online office suite

https://github.com/suitenumerique
4•nar001•35m ago•2 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
138•matheusalmeida•2d ago•36 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
74•videotopia•4d ago•11 comments

Ga68, a GNU Algol 68 Compiler

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
16•matt_d•3d ago•4 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
46•helloplanets•4d ago•46 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
242•isitcontent•16h ago•27 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
242•dmpetrov•16h ago•128 comments

Cross-Region MSK Replication: K2K vs. MirrorMaker2

https://medium.com/lensesio/cross-region-msk-replication-a-comprehensive-performance-comparison-o...
4•andmarios•4d ago•1 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
344•vecti•18h ago•153 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
510•todsacerdoti•1d ago•248 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
393•ostacke•22h ago•101 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
309•eljojo•19h ago•192 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
361•aktau•22h ago•187 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
437•lstoll•22h ago•286 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
32•1vuio0pswjnm7•2h ago•31 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
73•kmm•5d ago•11 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
26•bikenaga•3d ago•13 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
98•quibono•4d ago•22 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
278•i5heu•19h ago•227 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
43•gmays•11h ago•14 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1088•cdrnsf•1d ago•469 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
312•surprisetalk•3d ago•45 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
36•romes•4d ago•3 comments
Open in hackernews

Multiplatform Matrix Multiplication Kernels

https://burn.dev/blog/sota-multiplatform-matmul/
86•homarp•6mo ago

Comments

raphaelty•6mo ago
Very interesting, willing to try burn
nathanielsimard•6mo ago
One of the author here, don't hesitate if you have any question or comment!
burnt-resistor•6mo ago
Reminds me of ye olden days when kernel transforms were merely weighted multiplicative and/or additive matrixes applied to every point in the source arriving at pixel data in the target. Blur, sharpen, color channel filter, color swap, invert, etc. An extremely diagonalizable problem suitable for massive parallelism and concurrent calculation because there is little/no dependency on prior calculations.
almostgotcaught•6mo ago
I'm sorry this is a low brow comment but this is the dumbest thing you can do in this space:

> Unit (thread in CUDA, invocation in Vulkan/Wgpu): the smallest execution entity performing computations.

> Plane (warp in CUDA, subgroup in Vulkan/Wgpu): a group of (typically 32) units executing in lockstep and able to share data efficiently through registers.

> Cube (thread block in CUDA, workgroup in Vulkan/Wgpu): a group of units that execute on the same SM, sharing memory and able to synchronize

It's already bad enough that the vendors themselves insisted on different names but why in the bejesus would you rename these concepts and diverge from literally all existing naming conventions when you're providing middleware. Ie when using your tool I'm still going to reference NVIDIA's or AMD's docs to understand how the hardware actually works. Like do you really think otherwise - that your thing is gonna be end of the line???

FYI the word warp isn't random techno babble but is actually a very clever pun that actually fits very well conceptually:

https://en.m.wikipedia.org/wiki/Warp_and_weft

nathanielsimard•6mo ago
Using the naming from one of the existing API would put too much bias towards that API. It started as a WebGPU project early on, but some features are not present so mixing terms wasn't ideal. We're also working on extending CubeCL to CPU, so we want terms not only tied to the GPU word.
almostgotcaught•6mo ago
Thread, group, workgroup.

There you go you've hit basically two of 3 completely (AMD and Vulkan) and are close enough to CUDA that people would get it.

I have no idea what a plane connotes and a cube literally gives a distinct enough picture from block that I will be continuously reminding myself of the mapping.

What you did was pointless - you assigned new words to objects that you don't own and now your conceptual framework is askew from the actual underlying (true) conceptual framework.

> CubeCL to CPU

There is zero affinity between GPU programing models and multicore CPU programing models. If you don't believe me go ask the OpenMP people how they're doing supporting GPUs.

nathanielsimard•6mo ago
Well we can agree to disagree, CubeCL also has the concept of instruction parallelism, which would be used to target simd instructions on CPU. Our algorithms are normally flexible on both the plane size and the line size, adapting to the hardware with comptime logique. You are free to dislike the naming, but imo a mix of multiple APIs is worse than something new.
almostgotcaught•6mo ago
> Our algorithms are normally flexible on both the plane size and the line size

Congrats - I have no idea what this means lol.

syl20bnr•6mo ago
It will make more sense once you start using CubeCL. There's now a CubeCL book available: https://burn.dev/books/cubecl/.

It does come with some mental overhead, but let’s be honest, there’s no objectively “good” choice here without introducing bias toward a specific vendor API.

Learning the core concepts takes effort, but if CubeCL is useful for your work, it’s definitely worth it.

gyrovagueGeist•6mo ago
For people who are interested Kokkos (a C++ library for writing portable kernels) also has a naming scheme for hierarchical parallelism. They use ThreadTeam, Thread (for individual threads within a group), and ThreadVector (for per thread SIMD).

Just commenting to share, personally I have no naming preference but the hierarchal abstractions in general are incredibly useful.

sroussey•6mo ago
Why unit instead of point?

Unit, plane (as vs train), and cube?

Or point, plane, cube (1d, 2d, 3d)?

nathanielsimard•6mo ago
I don't recall the reason why, point is a valid name.
kevindamm•6mo ago
Actually, points are zero dimensional, lines are one dimensional.
threeducks•6mo ago
Relevant XKCD: https://xkcd.com/927/
airstrike•6mo ago
burn is awesome
Lerc•6mo ago
Has there been much research into slightly flawed matrix multiplications?

If you have a measure of correctness, and a measure of performance. Is there a maximum value of correctness per some unit of processing that exists below a full matrix multiply

Obviously it can be done with precision, since that is what floating point is. But is there anything where you can save x% of computation and have fewer than x% incorrect values in a matrix multiplications?

Gradient descent wouldn't really care about a few (Reliably) dud values.

wuubuu•6mo ago
Randomized matrix sketching is one way to get at this (see https://arxiv.org/abs/2302.11474), the problem is hardware is heavily optimized for dense multiplies so what you save in flops doesn't translate to real runtime speeds ups.
WithinReason•6mo ago
If you do it in 8-bit it's usually 2x as fast as 16 bit on Tensorcores
MayeulC•6mo ago
Well, approximate computing seems to be a superset of the field you describe here, with many different approaches, including analog computation. As you say, some algorithms care a bit less about precision, especially for LSBs.
kolinko•6mo ago
I did research on vector-matrix last year:

https://kolinko.github.io/effort/

For semi-random weights you cam get down to 20-30% multiplications/mem reads and maintain ~0.98 cosine similarity output between the approximated and full result.

As far as LLM inference goes, the speedup from removing multiplications is at best comparable to the speedup of quantisation (that is - you get at best similar KL divergence score whether you remove calculations or quantise).

apitman•6mo ago
Could something like this be done in WebGPU?
nathanielsimard•6mo ago
CubeCL supports WebGPU and can be used with wasm!
semessier•6mo ago
I had bet that matmult would be in transformer-optimized hardware costing a fraction of GPUs first class in torch 2 years ago with no reason to use GPUs any more. Wrong.
almostgotcaught•6mo ago
> matmult would be in transformer-optimized hardware

It is... it's in GPUs lol

> first class in torch

It is

> costing a fraction of GPUs

Why would anyone give you this for cheaper than GPUs lol?

atty•6mo ago
I think they’re referring to hardware like TPUs and other ASICs. Which also exist, of course :)
almostgotcaught•6mo ago
Sure but GPUs literally have MMA engines now
gchadwick•6mo ago
The real bottleneck is the memory, optimize your matmul architecture all you like whilst you still have it connected to a big chunk of HBM memory (or whatever your chosen high bandwidth memory is) you can only do so much.

So really GPU v not GPU (e.g. TPU) doesn't matter a whole lot if you've got fundamentally the same memory architecture.

burnt-resistor•6mo ago
GPUs came about because of the need for faster float 4x4 and 3x3 matrix, and 3 and 4 vector math ops like multiply, multiply-accumulate, and such, and faster pushing of pixels with things like texture mapping. All hail OpenGL and dual Voodoo2 SLI. ;)
saagarjha•6mo ago
Seems kind of CUTLASS-inspired in terms of how its API is designed. I'm curious how they plan to expose more interesting operations, though, since CUTLASS gives you the ability to write custom epilogs and tiling patterns, and I'm not sure their API is expressive enough to do this.
Archit3ch•6mo ago
Metal support is most welcome (and rare in similar projects)!

Typically, you are not doing a matrix multiplication for the sake of it, but as part of a broader algorithm (e.g. a simulation). Without fusing those other operations into the MatMul kernel, you are leaving performance on the table. How will the Burn devs address this?