How to tile matrix multiplication (2023)

https://alvinwan.com/how-to-tile-matrix-multiplication/
82•pbd•4mo ago

Comments

slwvx•4mo ago
See https://en.wikipedia.org/wiki/Block_matrix#Multiplication
GolDDranks•4mo ago
There is something off with the explanation.

At first, there are 16 fetches per row x column, 1024 in total. Then, it is observed that an input row needs to be fetched only once per output row, reducing the amount to 8 fetches per row, plus 8 per row x column, 8 * 8 + 8 * 64 = 576 in total. Both approaches require keeping 16 numbers in registers.

But then it is claimed that by doing one quadrant at a time, all that is needed is 64 fetches per quadrant or 256 fetches in total. But that assumes we can keep 4 rows and 4 columns, 8 numbers per row or column = 64 numbers in registers! If we can only keep 16 numbers like above, each row of the quadrant is going to take 40 fetches, and we get 160 fetches per quadrant or 640 fetches in total, a pessimization from 576 fetches!
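
The counts in this comment can be reproduced with a few lines of arithmetic. The sketch below assumes 8x8 inputs (inferred from 1024 = 64 * 16) and uses illustrative variable names that appear neither in the post nor in the comment. It only tallies fetches and does not model real hardware.

```python
# Fetch counts for C = A @ B with 8x8 matrices. The size is inferred from the
# numbers in the comment; every name here is illustrative.
N = 8        # matrix dimension
Q = N // 2   # quadrant dimension (4)

# Naive: every output element fetches one full row and one full column.
naive = N * N * (N + N)                      # 64 * 16

# Row reuse: fetch each row of A once per output row, then one column of B
# per output element. Needs 16 numbers in registers (8 row + 8 column).
row_reuse = N * N + N * N * N                # 64 + 512

# Quadrants with unlimited registers: per quadrant, fetch 4 rows and
# 4 columns of 8 numbers each and keep all 64 numbers resident.
quadrant_unlimited = 4 * (Q * N + Q * N)     # 4 * 64

# Quadrants with only 16 registers: per quadrant row, fetch the row once (8)
# and then each of the 4 columns (4 * 8).
per_quadrant_row = N + Q * N                 # 40
quadrant_16_regs = 4 * Q * per_quadrant_row  # 4 * 160

print(naive, row_reuse, quadrant_unlimited, quadrant_16_regs)
# prints: 1024 576 256 640
```

Under these assumptions the quadrant scheme only beats row reuse when all 64 inputs of a quadrant fit in registers at once.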

alvinwan•4mo ago
That’s a valid point - I’m assuming infinite register capacity at that point in the post.

The next section discusses what you’re talking about, e.g., how to deal with finite register/shared capacity by splitting the k dimension. I’ll mention the shared/register memory limitation sooner to clear up the confusion.

imtringued•4mo ago
The overall problem with your blog post is that it is beating around the bush rather than getting to the point. It feels like the blog post is explaining tiling in reverse order of what is needed to understand it.

"How effective is tiling?" and "Why tiling is so fast" should be at the end, while the key section "Why there's a limit to tiling", which should be front and center, is in the middle, followed by a subversion of the entire concept in "How to sidestep tiling limits".

It's also incredibly jarring to read this:

"Wondering how we were able to reduce memory usage "for free"? Indeed, the reduction wasn't free. In fact, we paid for this reduction a different way — by incurring more writes."

This is, again, completely backwards. If you don't have a cache at all, you'll have to write everything out to DRAM every single time. The opposite is also true. Imagine you had an infinite number of registers. Every addition operation will accumulate into a register, which is a write operation. Hence, the number of write operations doesn't change.

Really, the main points should be in this order:

1. Matrix multiplication works best with square or almost square matrices.
2. Registers and SRAM (including caches) are limited, forcing you to process matrices of finite size (aka tiles).
3. The memory hierarchy means that the biggest matrix you can store gets bigger at each successive level of the hierarchy.
4. You can split matrix multiplication using inner and outer products.
5. Outer products take few inputs and have many outputs/accumulators; inner products take many inputs and have few outputs/accumulators.
6. You want to calculate the biggest outer product you can get away with, since this significantly reduces the memory needed to store inputs and maximizes the number of cycles spent doing calculations. Once you hit the limit, you want to reuse the accumulator, so you calculate inner products of outer products.
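
One possible reading of the points about outer products and accumulator reuse, as a minimal pure-Python sketch. The function name `tiled_matmul`, the `tile` parameter, and the dictionary standing in for registers are all assumptions made for illustration, not something taken from the blog post or from any library.

```python
def tiled_matmul(A, B, tile=2):
    """Compute A @ B for lists of lists, one accumulator tile at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            rows = range(i0, min(i0 + tile, n))
            cols = range(j0, min(j0 + tile, m))
            # Accumulator tile: a stand-in for registers.
            acc = {(i, j): 0 for i in rows for j in cols}
            for p in range(k):
                # One outer product: len(rows) + len(cols) inputs feed
                # len(rows) * len(cols) multiply-accumulates.
                a_col = [A[i][p] for i in rows]
                b_row = [B[p][j] for j in cols]
                for i, a in zip(rows, a_col):
                    for j, b in zip(cols, b_row):
                        acc[(i, j)] += a * b
            # The tile is written out once, after the whole k loop.
            for (i, j), value in acc.items():
                C[i][j] = value
    return C


print(tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# prints: [[19, 22], [43, 50]]
```

Each step of the `p` loop is one outer product, and summing those steps into the same accumulator is the "inner products of outer products" structure: a larger `tile` means more work per fetched input, up to whatever the accumulator storage can hold.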

alvinwan•4mo ago
I see, thanks for the feedback - the current blog post’s flow certainly isn’t optimal. I’ll try reordering to eliminate jarring bits and see how it flows.
epistasis•4mo ago
Whenever block matrix multiplication comes up, it's fun to revisit Strassen's algorithm, which runs in less than O(n^3) time (roughly O(n^2.81)).

Normal block multiplication works like:

    [ A11  A12 ] [ B11  B12 ] = [ A11*B11 + A12*B21  A11*B12 + A12*B22 ] = [ C11  C12 ]
    [ A21  A22 ] [ B21  B22 ]   [ A21*B11 + A22*B21  A21*B12 + A22*B22 ]   [ C21  C22 ]

Which takes 8 matrix multiplications on the sub-blocks. But by cleverly defining only 7 different matrix multiplications on top of block additions and subtractions, like:

    M3 = A11 * (B12 - B22)

You can make the C blocks out of just additions and subtractions of the 7 different matrix multiplications.

https://en.wikipedia.org/wiki/Strassen_algorithm
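
For concreteness, here is a minimal sketch of a single level of Strassen in pure Python, using the seven standard products from the Wikipedia article linked above. The helper names (`add`, `sub`, `mul`, `split`, `strassen_one_level`) are made up for this example, and it assumes square matrices with an even dimension. A real implementation would recurse and fall back to an ordinary kernel at small sizes.

```python
def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def mul(X, Y):
    # Plain product, used for the seven block multiplications.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def split(X):
    # Split a square matrix of even dimension into four quadrants.
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen_one_level(A, B):
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven block multiplications instead of eight.
    M1 = mul(add(A11, A22), add(B11, B22))
    M2 = mul(add(A21, A22), B11)
    M3 = mul(A11, sub(B12, B22))
    M4 = mul(A22, sub(B21, B11))
    M5 = mul(add(A11, A12), B22)
    M6 = mul(sub(A21, A11), add(B11, B12))
    M7 = mul(sub(A12, A22), add(B21, B22))
    # The C blocks need only additions and subtractions of the M's.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


print(strassen_one_level([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# prints: [[19, 22], [43, 50]]
```

The trade is one fewer block multiplication for 18 block additions and subtractions, which is also where the extra rounding error in floating point comes from.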

As far as I know this is not useful in the major GPU libraries for saving bandwidth, but I have never bothered to spend the time to figure out why. It must have something to do with the ratio of bandwidth to FLOPs, which is way past my knowledge of GPUs.

adgjlsfhk1•4mo ago
The tricky parts with Strassen are that it requires some fairly large changes to your looping strategy, and that it decreases accuracy. It also only helps once you are compute-bound rather than bandwidth-bound, and GPUs have lots of compute.
pkhuong•4mo ago
> only helps once you are compute rather than bandwidth bound

Asymptotically, I don't think Strassen performs Theta(n^3) memory operations in sub-n^3 time.

jansenmac•4mo ago
See also http://ulaff.net/