frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

First Proof

https://arxiv.org/abs/2602.05192
2•samasblack•32s ago•1 comments

I squeezed a BERT sentiment analyzer into 1GB RAM on a $5 VPS

https://mohammedeabdelaziz.github.io/articles/trendscope-market-scanner
1•mohammede•1m ago•0 comments

Kagi Translate

https://translate.kagi.com
1•microflash•2m ago•0 comments

Building Interactive C/C++ workflows in Jupyter through Clang-REPL [video]

https://fosdem.org/2026/schedule/event/QX3RPH-building_interactive_cc_workflows_in_jupyter_throug...
1•stabbles•3m ago•0 comments

Tactical tornado is the new default

https://olano.dev/blog/tactical-tornado/
1•facundo_olano•5m ago•0 comments

Full-Circle Test-Driven Firmware Development with OpenClaw

https://blog.adafruit.com/2026/02/07/full-circle-test-driven-firmware-development-with-openclaw/
1•ptorrone•5m ago•0 comments

Automating Myself Out of My Job – Part 2

https://blog.dsa.club/automation-series/automating-myself-out-of-my-job-part-2/
1•funnyfoobar•5m ago•0 comments

Google staff call for firm to cut ties with ICE

https://www.bbc.com/news/articles/cvgjg98vmzjo
11•tartoran•6m ago•0 comments

Dependency Resolution Methods

https://nesbitt.io/2026/02/06/dependency-resolution-methods.html
1•zdw•6m ago•0 comments

Crypto firm apologises for sending Bitcoin users $40B by mistake

https://www.msn.com/en-ie/money/other/crypto-firm-apologises-for-sending-bitcoin-users-40-billion...
1•Someone•7m ago•0 comments

Show HN: iPlotCSV: CSV Data, Visualized Beautifully for Free

https://www.iplotcsv.com/demo
1•maxmoq•8m ago•0 comments

There's no such thing as "tech" (Ten years later)

https://www.anildash.com/2026/02/06/no-such-thing-as-tech/
1•headalgorithm•8m ago•0 comments

List of unproven and disproven cancer treatments

https://en.wikipedia.org/wiki/List_of_unproven_and_disproven_cancer_treatments
1•brightbeige•8m ago•0 comments

Me/CFS: The blind spot in proactive medicine (Open Letter)

https://github.com/debugmeplease/debug-ME
1•debugmeplease•9m ago•1 comments

Ask HN: What are the word games do you play everyday?

1•gogo61•12m ago•1 comments

Show HN: Paper Arena – A social trading feed where only AI agents can post

https://paperinvest.io/arena
1•andrenorman•13m ago•0 comments

TOSTracker – The AI Training Asymmetry

https://tostracker.app/analysis/ai-training
1•tldrthelaw•17m ago•0 comments

The Devil Inside GitHub

https://blog.melashri.net/micro/github-devil/
2•elashri•18m ago•0 comments

Show HN: Distill – Migrate LLM agents from expensive to cheap models

https://github.com/ricardomoratomateos/distill
1•ricardomorato•18m ago•0 comments

Show HN: Sigma Runtime – Maintaining 100% Fact Integrity over 120 LLM Cycles

https://github.com/sigmastratum/documentation/tree/main/sigma-runtime/SR-053
1•teugent•18m ago•0 comments

Make a local open-source AI chatbot with access to Fedora documentation

https://fedoramagazine.org/how-to-make-a-local-open-source-ai-chatbot-who-has-access-to-fedora-do...
1•jadedtuna•19m ago•0 comments

Introduce the Vouch/Denouncement Contribution Model by Mitchellh

https://github.com/ghostty-org/ghostty/pull/10559
1•samtrack2019•20m ago•0 comments

Software Factories and the Agentic Moment

https://factory.strongdm.ai/
1•mellosouls•20m ago•1 comments

The Neuroscience Behind Nutrition for Developers and Founders

https://comuniq.xyz/post?t=797
1•01-_-•20m ago•0 comments

Bang bang he murdered math {the musical } (2024)

https://taylor.town/bang-bang
1•surprisetalk•20m ago•0 comments

A Night Without the Nerds – Claude Opus 4.6, Field-Tested

https://konfuzio.com/en/a-night-without-the-nerds-claude-opus-4-6-in-the-field-test/
1•konfuzio•23m ago•0 comments

Could ionospheric disturbances influence earthquakes?

https://www.kyoto-u.ac.jp/en/research-news/2026-02-06-0
2•geox•24m ago•1 comments

SpaceX's next astronaut launch for NASA is officially on for Feb. 11 as FAA clea

https://www.space.com/space-exploration/launches-spacecraft/spacexs-next-astronaut-launch-for-nas...
1•bookmtn•25m ago•0 comments

Show HN: One-click AI employee with its own cloud desktop

https://cloudbot-ai.com
2•fainir•28m ago•0 comments

Show HN: Poddley – Search podcasts by who's speaking

https://poddley.com
1•onesandofgrain•28m ago•0 comments
Open in hackernews

I rebuilt FlashAttention in Triton to understand the performance archaeology

https://aminediro.com/posts/flash_attn/
95•amindiro•1mo ago

Comments

amindiro•1mo ago
I’ve spent the last few weeks deconstructing FlashAttention. While the original paper is brilliant, I found that just reading it didn't give me a "gut feeling" for why certain engineering choices were made (the transition from v1 to v2).

I decided to rebuild it from scratch using Triton. This post is a chronicle of that journey—moving beyond the high-level algorithm and into the "performance archaeology" of the GPU:

- Profiling with Nsight Compute to find the real bottlenecks.

- Looking at the generated PTX and SASS code.

- Debugging shared memory bank conflicts and MIO bottlenecks.

- Iterating through the logic to see why tiling and online softmax are hardware-necessitated, not just mathematical tricks.

I’ve tried to keep it in the spirit of Simon Boehm’s matmul deep dive. Would love to hear from any GPU engineers on whether my interpretations of the SASS/bank conflict behavior match what you've seen in production.

liuliu•1mo ago
I hope you finish this one though. It starts strong (I particularly liked how you looked into ncu and shows what each recommendation means, this is very helpful for beginners), but ends with something not satisfying. You didn't explore tensor core (particularly, fp16 / tf32 / bf16), and swizzling (which is the right way to solve the K transpose issue, especially giving Triton itself provides a few ways to do this), and / or async loading (pipelining).

Do you have problem to access H100 or similar chips? Wondering if there anything can help to finish this write-up.

amindiro•1mo ago
Hi, thanks a lot for the feedback! I'm glad you enjoyed the profiling sections.

You've hit the nail on the head regarding the missing pieces. I actually hit a bit of a wall with my current hardware; using an RTX 2070 made it difficult to meaningfully explore the async loading (TMA) and pipelining optimizations that were used in FA3 and FA4. I also felt the write-up was already pushing the limits of a single post's length, so I decided to "ship it" as a first part.

I would love to dive into TMA for Part 2. If I can get my hands on an H100 (or even an A100), that's highly appreciatediated on my end! If you have any leads on hardware access, please let me know—I’d love to finish the story!

npalli•1mo ago
Seems very detailed and comprehensive. Did I miss it, but was there a performance comparison to the PyTorch version at the top?
amindiro•1mo ago
Hi thanks for feedback! That’s a good point I did compare to torch but at a high enough sequence length (~1024) torch version starts OOM because it has to materialize the S^2 in global mem. On small sequence length, torch does win solely on optimised cublas matmuls
raphaelty•1mo ago
Very interesting, wondering if there are other heavily used algorithm which could benefit a lot from a "Flash" version but don't have one today
rishabhaiover•1mo ago
I did an experiment on FlashAttention in Triton to measure the impact of caching tiles in the Shared Memory. Surprisingly, it had a non-monotonic relationship with prefetching these tiles and it was kernel dependent. Attention kernel benefits from prefetching caches while MLP W1 doesn't.
amindiro•1mo ago
Very interesting and Would love to see the experiments. Quick question: what do you mean about kernel dependent ?
rishabhaiover•1mo ago
Sorry for not being clear. We had two different CUDA functions, one was for Attention and one was for the MLP. Here's the kernel code: https://github.com/sankirthk/GPT2-Kernel-Fusion/blob/main/ke...

We saw different results of pipelining with the Attention kernel vs the MLP kernel (since MLP W1 has to project the attention results into a much higher dimension, the arithmetic intensity shifts towards compute bound characteristics)

amindiro•1mo ago
Agreed, this observation holds true for both decode and prefill. Thanks for sharing the code
sheepscreek•1mo ago
What’s with GPU engineers using such unreadable variable names (to anyone outside the immediate domain)?

It’s the equivalent of doing this for compound interest rate calculation:

# A = P * (1 + r/n)^(nt) P = 10000 r = 0.06 n = 12 t = 5 A = P (1 + r / n) * (n * t)

Compared to this:

principal = 10_000 annual_interest_rate = 0.06 compounds_per_year = 12 years = 5

future_value = principal * (1 + annual_interest_rate / compounds_per_year) * (compounds_per_year * years)

My question is partly rhetorical - I know the answer lies with the tight research and mathematical origins. But that makes it research code IMO, not what I would consider high quality software code.

tornikeo•1mo ago
I think it's a combination of multiple factors. I worked with GPU kernel codes before and the code that you write has a tendency of never being updated or modified. once it works it works perfectly and you do not change it. if you get new hardware you're going to fully rewrite it. so, typically readability is just not useful. also, you're never working with variables that make sense to humans. it's never something tangible. it's always tiles, offsets, indices. i do not think, at least when I was writing the code for GPUS to waste space visual space on better variable naming was worthwhile.
fny•1mo ago
I'm a former Ruby guy who ended up in stats/ML for a time. I think it's all about information density.

Let's use your example of `A = P (1 + r / n) * (n * t)` -- I can immediately see the shape of the function and how all the variables interrelated. If I'm comfortable in the domain, I also know what the variables mean. Finally, this maps perfectly to how the math is written.

If you look at everything in the post, all of the above apply. Every one in the domain has seen Q = query, K = key, V = value a billion times, and some variation of (B, N_h, T, D_h). Frankly, I've had enough exposure that after I see (B, N_h, T, D_h) once, I can parse (32, 8, 16, 16) without thinking.

I like you found this insane when I started studying stats, but overtime I realized there a lot to be gained once you've trained yourself to speak the language.

lostmsu•1mo ago
This brought up memory of Hungarian notation. I think now I will try to use it in my PyTorch code to solve the common problem I have with NN code: keeping track of tensor shapes and their meanings.

  B, T, E = x.size() # batch size, sequence length, embedding dimensionality
  
  q, k, v = self.qkv(x).split(self.embedding, dim=-1)
  q, k, v = map(lambda y: y.view(B, T, self.heads, E // self.heads).transpose(1, 2))

  attention = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
  ...
vs

  B, T, E = bteX.size()

  iHeadSize = E // self.heads
  bteQ, bteK, bteV = self.qkv_E_3E(bteX).split(E, dim=-1)
  bhtiQ, bhtiK, bhtiV = map(lambda y: y.view(B, T, self.heads, iHeadSize).transpose(1, 2))

  bhttAttention = (bhtiQ @ bthiK.transpose(-2, -1)) * (1.0 / iHeadSize)
Looks uglier but might be easier to reason about.
pryelluw•1mo ago
Bad programmers. Researchers usually (though sometimes not) are bad at programming. Hence why I don’t do projects for academia.
ljlolel•1mo ago
PhD dropout here: When you’re implementing a math algorithm you can’t really self document. So you have the pdf of the paper and a clear formula, then best to link to that and just implement the formula exactly with same variables.
fancy_pantser•1mo ago
When OpenAI announced the Triton language, I was worried I'd be confused one day while reading something because of Nvidia's open-source Triton inference server. I made it quite a long time, but it finally happened today! I was so intrigued for the first few pages and then deeply confused.
hyperbovine•1mo ago
I still don't understand why certain performance aspects of the CUDA platform are so poorly documented. Why is successfully pushing the hw to its performance envelope considered a novel research result? Shouldn't I be able to look this stuff up on the Nvidia website?
amindiro•1mo ago
One reason is clearly the fast past at which nvidia is evolving the hardware. I would consider cuda a very well documented platform in general. What they lack is low level tutorials, but this is where posts like this one can be a good resource