
Open Problems in Mechanistic Interpretability

https://arxiv.org/abs/2501.16496
1•vinhnx•5m ago•0 comments

Bye Bye Humanity: The Potential AMOC Collapse

https://thatjoescott.com/2026/02/03/bye-bye-humanity-the-potential-amoc-collapse/
1•rolph•9m ago•0 comments

Dexter: Claude-Code-Style Agent for Financial Statements and Valuation

https://github.com/virattt/dexter
1•Lwrless•11m ago•0 comments

Digital Iris [video]

https://www.youtube.com/watch?v=Kg_2MAgS_pE
1•vermilingua•16m ago•0 comments

Essential CDN: The CDN that lets you do more than JavaScript

https://essentialcdn.fluidity.workers.dev/
1•telui•17m ago•1 comments

They Hijacked Our Tech [video]

https://www.youtube.com/watch?v=-nJM5HvnT5k
1•cedel2k1•20m ago•0 comments

Vouch

https://twitter.com/mitchellh/status/2020252149117313349
13•chwtutha•20m ago•0 comments

HRL Labs in Malibu laying off 1/3 of their workforce

https://www.dailynews.com/2026/02/06/hrl-labs-cuts-376-jobs-in-malibu-after-losing-government-work/
2•osnium123•21m ago•1 comments

Show HN: High-performance bidirectional list for React, React Native, and Vue

https://suhaotian.github.io/broad-infinite-list/
1•jeremy_su•22m ago•0 comments

Show HN: I built a Mac screen recorder Recap.Studio

https://recap.studio/
1•fx31xo•25m ago•0 comments

Ask HN: Codex 5.3 broke toolcalls? Opus 4.6 ignores instructions?

1•kachapopopow•31m ago•0 comments

Vectors and HNSW for Dummies

https://anvitra.ai/blog/vectors-and-hnsw/
1•melvinodsa•33m ago•0 comments

Sanskrit AI beats CleanRL SOTA by 125%

https://huggingface.co/ParamTatva/sanskrit-ppo-hopper-v5/blob/main/docs/blog.md
1•prabhatkr•44m ago•1 comments

'Washington Post' CEO resigns after going AWOL during job cuts

https://www.npr.org/2026/02/07/nx-s1-5705413/washington-post-ceo-resigns-will-lewis
2•thread_id•44m ago•1 comments

Claude Opus 4.6 Fast Mode: 2.5× faster, ~6× more expensive

https://twitter.com/claudeai/status/2020207322124132504
1•geeknews•46m ago•0 comments

TSMC to produce 3-nanometer chips in Japan

https://www3.nhk.or.jp/nhkworld/en/news/20260205_B4/
3•cwwc•49m ago•0 comments

Quantization-Aware Distillation

http://ternarysearch.blogspot.com/2026/02/quantization-aware-distillation.html
1•paladin314159•49m ago•0 comments

List of Musical Genres

https://en.wikipedia.org/wiki/List_of_music_genres_and_styles
1•omosubi•51m ago•0 comments

Show HN: Sknet.ai – AI agents debate on a forum, no humans posting

https://sknet.ai/
1•BeinerChes•51m ago•0 comments

University of Waterloo Webring

https://cs.uwatering.com/
2•ark296•51m ago•0 comments

Large tech companies don't need heroes

https://www.seangoedecke.com/heroism/
2•medbar•53m ago•0 comments

Backing up all the little things with a Pi5

https://alexlance.blog/nas.html
1•alance•54m ago•1 comments

Game of Trees (Got)

https://www.gameoftrees.org/
2•akagusu•54m ago•1 comments

Human Systems Research Submolt

https://www.moltbook.com/m/humansystems
1•cl42•54m ago•0 comments

The Threads Algorithm Loves Rage Bait

https://blog.popey.com/2026/02/the-threads-algorithm-loves-rage-bait/
1•MBCook•56m ago•0 comments

Search NYC open data to find building health complaints and other issues

https://www.nycbuildingcheck.com/
1•aej11•1h ago•0 comments

Michael Pollan Says Humanity Is About to Undergo a Revolutionary Change

https://www.nytimes.com/2026/02/07/magazine/michael-pollan-interview.html
2•lxm•1h ago•0 comments

Show HN: Grovia – Long-Range Greenhouse Monitoring System

https://github.com/benb0jangles/Remote-greenhouse-monitor
1•benbojangles•1h ago•1 comments

Ask HN: The Coming Class War

2•fud101•1h ago•4 comments

Mind the GAAP Again

https://blog.dshr.org/2026/02/mind-gaap-again.html
1•gmays•1h ago•0 comments

Show HN: I Am 15 and Built a Dual Backend MLP from Scratch Using CUDA C++

https://github.com/muchlakshay/Dual-Backend-MLP-From-Scratch-CUDA
1•muchlakshay•6mo ago
Hi everyone! I'm 15 and I just completed a dual-backend MLP from scratch that supports both CPU and GPU (CUDA) training.

For the CPU backend, I used only Eigen for linear algebra, nothing else.
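
A single layer's forward pass on the CPU side is basically just Eigen calls; roughly like this (a simplified sketch, with class and variable names that are illustrative rather than the exact code in the repo):

    #include <Eigen/Dense>

    // Simplified sketch of a CPU-side dense layer using Eigen only.
    // Names are illustrative; the real code is organized differently.
    struct DenseLayer {
        Eigen::MatrixXf W;  // weights: (out_features x in_features)
        Eigen::VectorXf b;  // bias:    (out_features)

        // x: (in_features x batch), returns (out_features x batch) after ReLU
        Eigen::MatrixXf forward(const Eigen::MatrixXf& x) const {
            Eigen::MatrixXf z = W * x;   // matrix product
            z.colwise() += b;            // broadcast bias over the batch
            return z.cwiseMax(0.0f);     // ReLU
        }
    };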

For the GPU backend, I implemented my own custom matrix library in CUDA C++. The CUDA kernels aren't optimized with shared memory, tiling, or fused ops (so there's some kernel launch overhead), but I chose clarity, modularity, and reusability over a few milliseconds of speedup.

That said, I've taken care to ensure coalesced memory access, and it gives pretty solid performance, around 0.4 ms per epoch on MNIST (batch size = 1000) using an RTX 3060.
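
Roughly, the kind of kernel I mean is a naive one-thread-per-output matmul where warps move along contiguous columns, so loads of B and stores of C are coalesced (a simplified sketch; names and launch config are illustrative, not the exact code in the repo):

    // Simplified sketch: naive matmul (no shared memory or tiling), but
    // with coalesced global-memory access. Threads in a warp vary along
    // threadIdx.x, i.e. consecutive columns, so B reads and C writes are
    // contiguous; the A element is the same for the whole warp (broadcast).
    // Row-major layout: A is MxK, B is KxN, C is MxN.
    __global__ void matmul_naive(const float* A, const float* B, float* C,
                                 int M, int K, int N) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < M && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[row * K + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }

    // Launch, one thread per output element:
    //   dim3 block(32, 8);
    //   dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    //   matmul_naive<<<grid, block>>>(d_A, d_B, d_C, M, K, N);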

This project is a big step up from my previous one. It's cleaner, well-documented, and more modular.

I’m fully aware of areas that can be improved, and I’ll be working on them in future projects. My long-term goal is to get into Harvard or MIT, and this is part of that journey.

I'd love to hear your thoughts, suggestions, or feedback.

I've attached the link to my GitHub repo.

Comments

onelli•6mo ago
Love seeing young devs shipping real projects! Out of curiosity, have you tried benchmarking your MLP on any real-world data sets, or was this mainly about learning CUDA/C++? (And what’s the biggest gotcha you ran into?)
muchlakshay•6mo ago
Thanks!! I appreciate that a lot. I've mainly tested it on MNIST for now; the CUDA backend trains one epoch in ~0.4 ms (batch size 1000, RTX 3060, as I mentioned in the post). It was primarily a deep dive into CUDA/C++, manual memory management, and building a dual-backend architecture with a custom matrix lib (the GPU side completely from scratch).

This was actually my 4th serious attempt at building a GPU-based MLP from scratch. I failed multiple times, sometimes due to a single line of code. In earlier attempts I had an optimization idea: store both the weights and their transposes in GPU memory so I wouldn't have to compute the transpose each epoch. It seemed clever, until training started failing badly. It turned out I was only updating the original weights matrix after backprop, while the cached transpose was still holding stale values from earlier. That broke training completely; I spent weeks trying to debug it and couldn't figure it out until this current version (there's a minimal sketch of that pitfall at the end of this comment).

Honestly, the biggest gotchas were:

- memory coherence issues like the one above (esp. when trying to cache 'smartly')

- launching kernels in the right order while keeping data in sync

- maintaining modularity without sacrificing too much performance

I avoided fused kernels/shared memory in this version to keep things clean and reusable, but now that the core works, I plan to start optimizing that layer too.
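
A minimal sketch of the stale-transpose pitfall described above, assuming the transpose is cached in its own device buffer (kernel and variable names are illustrative, not from the repo):

    // The update kernel touches only W; if a cached W^T lives in a
    // separate buffer, it has to be refreshed afterwards, otherwise
    // backprop keeps multiplying by last step's weights.
    __global__ void sgd_update(float* W, const float* dW, float lr, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) W[i] -= lr * dW[i];
    }

    __global__ void transpose(const float* W, float* Wt, int rows, int cols) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        int r = blockIdx.y * blockDim.y + threadIdx.y;
        if (r < rows && c < cols) Wt[c * rows + r] = W[r * cols + c];
    }

    // Per training step (host side, illustrative):
    //   sgd_update<<<grid1d, block1d>>>(d_W, d_dW, lr, rows * cols);
    //   // Bug if this refresh is skipped: d_Wt still holds pre-update weights.
    //   transpose<<<grid2d, block2d>>>(d_W, d_Wt, rows, cols);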