
Highly efficient matrix transpose in Mojo

https://veitner.bearblog.dev/highly-efficient-matrix-transpose-in-mojo/
110•timmyd•17h ago

Comments

sestep•16h ago
I'm not an expert in this space, but is this meaningful? I'd assume that it's more common to fuse together transposition with an operation that precedes or follows it (e.g. matmul), which should be far more efficient than materializing the entire transposition in memory if it's just an intermediate value.
musebox35•6h ago
Matrix transpose is a canonical example of a memory-bound operation and is often used to showcase optimization in a particular programming language or library. See for example the CUTLASS matrix transpose tutorial from Jay Shah of the Flash Attention 3 paper: https://research.colfax-intl.com/tutorial-matrix-transpose-i...
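
To make the exercise concrete: a transpose does one read and one write per element and no arithmetic at all, so it is purely a memory-throughput problem. A minimal naive CUDA sketch (illustrative names and layout, not the post's kernel):

    // Naive transpose of a rows x cols row-major matrix.
    // Reads are coalesced, writes are strided (or vice versa),
    // so one side of the transfer wastes bandwidth.
    __global__ void transpose_naive(const float* in, float* out,
                                    int rows, int cols) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
        if (x < cols && y < rows) {
            out[x * rows + y] = in[y * cols + x];
        }
    }

Optimizing it is then entirely about fixing the strided side of the access pattern.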
saagarjha•2h ago
Unfortunately the issue (alluded to in the blog post you linked) is that transposes do absolutely no work but memory loads. Sure, they test that you can swizzle your accesses, but modern accelerators are all about pipelining and feeding matrix multiply units, which is considerably harder than loading from memory as fast as possible. Actually, even the Mojo post barely beats CUDA for most of its kernels, because you can hit memory bandwidth for transpose on the latest hardware using techniques from 5-10 years ago. This is definitely not true for more interesting operations.
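
For context, the classic technique in question is the tiled shared-memory transpose: stage a tile through shared memory so both the global read and the global write are coalesced. A sketch, using padding to dodge bank conflicts where the post uses swizzling (tile size illustrative):

    #define TILE 32

    __global__ void transpose_tiled(const float* in, float* out,
                                    int rows, int cols) {
        // +1 padding avoids shared-memory bank conflicts.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < cols && y < rows)
            tile[threadIdx.y][threadIdx.x] = in[y * cols + x];

        __syncthreads();

        // Swap block coordinates so the write is also coalesced.
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < rows && y < cols)
            out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }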
musebox35•10m ago
I totally agree that the resulting kernel will rarely be useful. I just wanted to highlight that it is a commonly used educational exercise to showcase how to optimize for memory throughput. If the post showed how to fuse a transpose + rmsnorm epilogue onto a gemm, then the kernel would be more functional but the blog post would be much harder to follow for newcomers.

Jay Shah’s later articles contain examples that involve epilogue fusion. IMHO, understanding how to write an efficient transpose helps with following the more involved ones.

colesantiago•16h ago
Does anyone use Mojo in production at all or are even hiring for Mojo?
melodyogonna•3h ago
Modular (the company behind Mojo) uses it in production. I imagine that if they have any clients then those also use Mojo in production - albeit indirectly - since all the GPU kernels used by Modular are written in Mojo.
htrp•16h ago
Left unsaid, the 14% improvement in performance came at the cost of increasing dev time by 35%
bravetraveler•16h ago
Reminds me of this, lol:

> "From the moment I understood the weakness of my flesh, it disgusted me. I craved the strength and certainty of steel."

14% all the time vs 35% some of the time

edit: Closing numbers are far less impressive than those buried in the middle of the post. Confusing; bye everyone

vlan121•16h ago
Mojo's compiler is closed source. That's a big no-no.
dgurchenkov•13h ago
I work on Mojo. The whole compiler, runtime, etc. will get open sourced, most likely within a year. It's just a matter of time and of us getting all the required work done.

https://docs.modular.com/mojo/faq/#open-source

almostgotcaught•12h ago
> runtime

Are you talking about your libc equivalent or MAX?

xiphias2•2m ago
"will get open sourced" means closed source; the parent wrote the same.
voronar•16h ago
Mr. Mojo Risin'
arjvik•16h ago
Where's the 14%? Looks like their final kernels show a 0.14% improvement of Mojo over the equivalent CUDA kernel?
77pt77•16h ago
It looks that way because it does.

> (2771.35 / 2775.49 - 1) * 100 = -0.14916285052369131300

Flagged.

timmyd•15h ago
Updated the title to the original. I did base the numbers on

"This kernel archives 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) which is still impressive

jsnell•16h ago
The "Switching to Mojo gave a 14% improvement over CUDA" title is editorialized, the original is "Highly efficient matrix transpose in Mojo".

Also, the improvement is 0.14%, not 14%, making the editorialized linkbait particularly egregious.

baal80spam•16h ago
0.14% is within the limits of statistical error. So this is a nothing-"article".
jsnell•16h ago
I don't think that's fair. The article promised a highly efficient kernel and seems to have delivered exactly that, which isn't "nothing". My beef is entirely with the submitted title.
jebarker•16h ago
Yeah, it seems like the blog post is just meant to be an example of how to do something in Mojo and not a dunk on CUDA.
timmyd•13h ago
FWIW I didn't take the blog as a dunk on CUDA, just as an impressive outcome from the blog writer in Mojo. It's awesome to see this on Hopper - if it makes things go faster, that's awesome.
atomicapple•16h ago
I think the OP based the title off of "This kernel achieves 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) and not the final kernels, for whatever reason.
timmyd•13h ago
[op here] To be clear: yes, there are 3 kernels - you can see them in the GitHub repo linked at the end of the article. These are:

transpose_naive - Basic implementation with TMA transfers

transpose_swizzle - Adds swizzling optimization for better memory access patterns

transpose_swizzle_batched - Adds thread coarsening (batch processing) on top of swizzling
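
For anyone unfamiliar with thread coarsening: the generic CUDA version of the idea has each thread move several elements per tile, e.g. a 32x8 thread block striding over a 32x32 tile (a textbook sketch, not the Mojo kernel):

    #define TILE 32
    #define BLOCK_ROWS 8

    // Coarsened tiled transpose of an n x n matrix (n a multiple of
    // TILE): each thread moves TILE/BLOCK_ROWS elements, amortizing
    // index arithmetic and synchronization across them.
    __global__ void transpose_coarsened(const float* in, float* out, int n) {
        __shared__ float tile[TILE][TILE + 1];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        for (int j = 0; j < TILE; j += BLOCK_ROWS)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * n + x];
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        for (int j = 0; j < TILE; j += BLOCK_ROWS)
            out[(y + j) * n + x] = tile[threadIdx.x][threadIdx.y + j];
    }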

Performance comparison with CUDA: The Mojo implementations achieve bandwidths of:

transpose_naive: 1056.08 GB/s (32.0025% of max)

transpose_swizzle: 1437.55 GB/s (43.5622% of max)

transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)

via the GitHub repo: simveit/efficient_transpose_mojo

Comparing to the CUDA implementations mentioned in the article:

Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s

Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s

Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s

So there is a highly efficient matrix transpose in Mojo.

All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).
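
(Checking the arithmetic: 1056.08 / 875.46 ≈ 1.206 and 1437.55 / 1251.76 ≈ 1.148, while 2775.49 / 2771.35 ≈ 1.0015. The "% of max" figures imply a theoretical peak of roughly 3300 GB/s - e.g. 2775.49 / 0.841056 ≈ 3300 - consistent with Hopper-class HBM bandwidth.)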

The "flag" here seemed innapropriate given that its true this implementation is indeed faster, and certainly the final iteration could be improved on further. It wasn't wrong to say 14% or even 20%.

jsnell•12h ago
Users of the site only have one control available: the flag. There's no way to object only to the title but not to the post, and despite what you say, that title hit the trifecta: not the original title, factually incorrect, and clickbait. So I'm not that surprised it got flagged (even if I did not flag it myself).

Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.

timmyd•12h ago
thanks jsnell - I did, and they appreciated the comment above and unflagged it. I appreciate it!
noracists•16h ago
slop
londons_explore•13h ago
Why do we ever need to transpose a matrix?

Isn't it better to simply combine the transposition with whatever next operation one wishes to do with the matrix?

throwawayabcdef•12h ago
The next operation might need the data in column-major order to read it fast, so you might have to transpose first. And these may be concurrent stages of a processing pipeline.
viraptor•11h ago
Now I'm curious: how many times do you have to fully read the matrix on the GPU before the total cost of reading columns exceeds that of a one-off transpose followed by sequential row reads? I know it depends on lots of things; I'm after a rough estimate.
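
A crude model, assuming truly strided column reads and an optimized transpose running near peak: if strided column reads achieve a fraction f of peak bandwidth, then n column-major passes cost n/f in units of one full coalesced pass, while transposing first costs 2 + n passes (one read plus one write for the transpose, then n row-major reads). The transpose wins once n/f > 2 + n, i.e. n > 2f/(1-f). With f ≈ 1/8 (one 4-byte float used per 32-byte sector) that gives n > 2/7, so under these assumptions even a single subsequent full read would justify it - though as the reply below notes, tiling usually makes the strided case far less bad than this.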
saagarjha•2h ago
It's quite rare. Usually problems are tiled anyway and you can amortize the cost of having data in the "wrong" layout by loading coalesced in whatever is the best layout for your data and then transposing inside your tile, which gives you access to much faster memory.
hogepodge•12h ago
You're right that a good graph compiler will do this for you. There still may be times, such as when you're interfacing with another library, where you'll need to switch a matrix between row-major and column-major layouts.
meindnoch•17m ago
Serious linear algebra libraries accept a flag that tells them whether elements are column-major or row-major.
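
In cuBLAS, for instance, transposition is exposed as an operation flag, and an out-of-place transpose is typically spelled with cublasSgeam (a sketch; error handling omitted):

    #include <cublas_v2.h>

    // C = alpha * op(A) + beta * op(B); with alpha = 1, beta = 0 and
    // CUBLAS_OP_T this computes C = A^T without a hand-written kernel.
    void transpose_with_cublas(cublasHandle_t handle, const float* dA,
                               float* dC, int rows, int cols) {
        const float alpha = 1.0f, beta = 0.0f;
        // dA holds a rows x cols row-major matrix; dC receives the
        // cols x rows row-major transpose.
        cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    rows, cols,        // dimensions of C in column-major terms
                    &alpha, dA, cols,  // leading dimension of A
                    &beta, dC, rows,   // B is ignored since beta = 0
                    dC, rows);         // C and its leading dimension
    }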
fulafel•6h ago
This could make Mojo look even better, as it would be more compute-heavy and the last-step thread reduction would be less relevant.
almostgotcaught•12h ago
As someone said below - you'd never write just a transpose kernel - it'll be fused into something else.
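
To illustrate, fusing can be as simple as changing the indexing of the store in whatever kernel produces the data. A toy example with a made-up elementwise op (not anything from the post):

    // Toy fusion: apply an elementwise op and store the result
    // transposed, so no separate transpose pass is needed.
    __global__ void relu_transposed(const float* in, float* out,
                                    int rows, int cols) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < cols && y < rows)
            out[x * rows + y] = fmaxf(in[y * cols + x], 0.0f);
    }

In a production kernel the transposed store would itself go through shared-memory tiling, as sketched earlier in the thread, to stay coalesced.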
saagarjha•2h ago
Look, the frontier AI companies need something other than reversing binary trees to give interview candidates.
melodyogonna•5h ago
I wonder if there is a reason for not using the high-level abstractions provided by Modular.
saagarjha•2h ago
Most interesting algorithms (e.g. with dynamic shapes, mixed computation) are typically better scheduled by hand.
saagarjha•2h ago
> This kernel achieves a bandwidth of 1056.08 GB/s, which is faster than the 875.46 GB/s we achieved using CUDA. I believe the reason to be that we use the PTX API for TMA transfers in Mojo.

I can't say for sure because I couldn't find the CUDA kernel but I kind of doubt this is true. You can hit memory bandwidth on Hopper without using TMA at all, which is mostly designed for accelerating asynchronous copies and reducing memory pressure. If all you are doing is a transpose you don't need any of this to go fast (though it might simplify your indexing code…?)

iandanforth•1h ago
I'm probably just ignorant, but shouldn't the graphic of the tiled transpose have the green vector column-oriented in the final matrix?

Taurine and aging: Is there anything to it?

https://www.science.org/content/blog-post/taurine-and-aging-there-anything-it
18•etiam•48m ago•3 comments

Low-Level Optimization with Zig

https://alloc.dev/2025/06/07/zig_optimization
124•Retro_Dev•5h ago•34 comments

The FAIR Package Manager: Decentralized WordPress infrastructure

https://joost.blog/path-forward-for-wordpress/
126•twapi•8h ago•26 comments

The time bomb in the tax code that's fueling mass tech layoffs

https://qz.com/tech-layoffs-tax-code-trump-section-174-microsoft-meta-1851783502
939•booleanbetrayal•2d ago•593 comments

Researchers develop ‘transparent paper’ as alternative to plastics

https://japannews.yomiuri.co.jp/science-nature/technology/20250605-259501/
295•anigbrowl•15h ago•162 comments

Why We're Moving on from Nix

https://blog.railway.com/p/introducing-railpack
6•mooreds•1h ago•0 comments

How we decreased GitLab repo backup times from 48 hours to 41 minutes

https://about.gitlab.com/blog/2025/06/05/how-we-decreased-gitlab-repo-backup-times-from-48-hours-to-41-minutes/
449•immortaljoe•21h ago•188 comments

Gander (YC F24) Is Hiring Founding Engineers and Interns

https://www.ycombinator.com/companies/gander/jobs/vwkK1FC-founding-engineer
1•arjanguglani•49m ago

A year of funded FreeBSD development

https://www.daemonology.net/blog/2025-06-06-A-year-of-funded-FreeBSD.html
291•cperciva•17h ago•87 comments

Getting Past Procrastination

https://spectrum.ieee.org/getting-past-procastination
137•WaitWaitWha•9h ago•65 comments

Why are smokestacks so tall?

https://practical.engineering/blog/2025/6/3/why-are-smokestacks-so-tall
101•azeemba•11h ago•26 comments

Sharing everything I could understand about gradient noise

https://blog.pkh.me/p/42-sharing-everything-i-could-understand-about-gradient-noise.html
88•ux•21h ago•3 comments

The Illusion of Thinking: Understanding the Limitations of Reasoning LLMs [pdf]

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
224•amrrs•18h ago•121 comments

Medieval Africans had a unique process for purifying gold with glass (2019)

https://www.atlasobscura.com/articles/medieval-african-gold
107•mooreds•14h ago•55 comments

Uber Just Reinvented the Bus Again

https://www.wired.com/story/uber-just-reinvented-the-bus-again/
8•beardyw•1h ago•3 comments

Reverse Engineering Cursor's LLM Client

https://www.tensorzero.com/blog/reverse-engineering-cursors-llm-client/
37•paulwarren•9h ago•3 comments

Falsehoods programmers believe about aviation

https://flightaware.engineering/falsehoods-programmers-believe-about-aviation/
310•cratermoon•14h ago•133 comments

NASA delays next flight of Boeing's alternative to SpaceX Dragon

https://theedgemalaysia.com/node/758199
39•bookmtn•9h ago•28 comments

Sandia turns on brain-like storage-free supercomputer

https://blocksandfiles.com/2025/06/06/sandia-turns-on-brain-like-storage-free-supercomputer/
182•rbanffy•21h ago•68 comments

A tool for burning visible pictures on a compact disc surface

https://github.com/arduinocelentano/cdimage
4•carlesfe•4h ago•0 comments

A masochist's guide to web development

https://sebastiano.tronto.net/blog/2025-06-06-webdev/
227•sebtron•23h ago•32 comments

Show HN: AI game animation sprite generator

https://www.godmodeai.cloud/ai-sprite-generator
95•lyogavin•17h ago•71 comments

Odyc.js – A tiny JavaScript library for narrative games

https://odyc.dev
218•achtaitaipai•23h ago•49 comments

I Read All of Cloudflare's Claude-Generated Commits

https://www.maxemitchell.com/writings/i-read-all-of-cloudflares-claude-generated-commits/
147•maxemitchell•14h ago•110 comments

Smalltalk, Haskell and Lisp

https://storytotell.org/smalltalk-haskell-and-lisp
93•todsacerdoti•15h ago•40 comments

Workhorse LLMs: Why Open Source Models Dominate Closed Source for Batch Tasks

https://sutro.sh/blog/workhorse-llms-why-open-source-models-win-for-batch-tasks
76•cmogni1•18h ago•22 comments

Wendelstein 7-X sets new fusion record

https://www.heise.de/en/news/Wendelstein-7-X-sets-new-fusion-record-10422955.html
160•doener•4d ago•35 comments

Windows 10 spies on your use of System Settings (2021)

https://www.michaelhorowitz.com/Windows10.spying.onsettings.php
97•userbinator•5h ago•97 comments

Too Many Open Files

https://mattrighetti.com/2025/06/04/too-many-files-open
129•furkansahin•21h ago•99 comments