For the CPU backend, I used Eigen for linear algebra and nothing else.
For the GPU backend, I implemented my own matrix library in CUDA C++. The CUDA kernels aren’t optimized with shared memory, tiling, or fused ops (so there’s some kernel launch overhead), but I chose clarity, modularity, and reusability over a few milliseconds of speedup.
That said, I've taken care to ensure coalesced memory access, and performance is pretty solid: around 0.4 ms per epoch on MNIST (batch size = 1000) on an RTX 3060.
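For a flavor of what "unfused but coalesced" means in practice, here's a minimal sketch of the kind of element-wise kernel involved. The names (addBias, launchAddBias) are illustrative, not the repo's actual API:

```cpp
// Minimal sketch of an unfused element-wise kernel with coalesced access.
// Names here are illustrative, not the repo's actual API.
#include <cuda_runtime.h>

// Row-major matrix stored contiguously: element (r, c) lives at r * cols + c.
// Consecutive threads get consecutive global indices, so each warp reads and
// writes consecutive addresses -- the pattern the hardware coalesces.
__global__ void addBias(float* out, const float* in, const float* bias,
                        int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols) {
        out[idx] = in[idx] + bias[idx % cols];  // bias broadcast per column
    }
}

// Host-side launch: one thread per element, 256 threads per block.
void launchAddBias(float* d_out, const float* d_in, const float* d_bias,
                   int rows, int cols) {
    int n = rows * cols;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addBias<<<blocks, threads>>>(d_out, d_in, d_bias, rows, cols);
}
```

One thread per element keeps the kernel trivially reusable across layer shapes, which is the modularity tradeoff mentioned above.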
This project is a big step up from my previous one: it's cleaner, better documented, and more modular.
I’m fully aware of areas that can be improved, and I’ll be working on them in future projects. My long-term goal is to get into Harvard or MIT, and this is part of that journey.
Would love to hear your thoughts, suggestions, or feedback.
I've attached the link to my GitHub repo.
onelli•7h ago
muchlakshay•6h ago
Honestly, the biggest gotchas were:

- memory coherence issues like above (esp. when trying to cache 'smartly')
- launching kernels in the right order while keeping data in sync (see the sketch after this list)
- maintaining modularity without sacrificing too much performance
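On the ordering/sync point: kernels issued to the same stream already run in launch order, so the real trap is host code reading results before the device has caught up. A minimal self-contained sketch (scale/shift are stand-ins, not kernels from the repo):

```cpp
// Sketch of the kernel-ordering/sync gotcha (kernel names are hypothetical).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

__global__ void shift(float* x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    // Both launches go to the default stream, so the GPU runs them in
    // issue order -- no explicit sync needed between dependent kernels.
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    shift<<<(n + 255) / 256, 256>>>(d_x, 1.0f, n);

    // The trap is on the host side: launches return immediately, so the
    // host must wait (here via a synchronizing copy on the default
    // stream) before reading results.
    float first;
    cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] = %f\n", first);  // 0 * 2 + 1 = 1

    cudaFree(d_x);
    return 0;
}
```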
I avoided fused kernels/shared memory in this version to keep things clean and reusable, but now that the core works, I plan to start optimizing that layer too.
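For what it's worth, the shared-memory version I'd move toward is the standard tiled matmul pattern. A textbook sketch (not code from the repo), assuming row-major storage:

```cpp
// Standard shared-memory tiled matmul sketch (textbook pattern, not repo code).
// C (M x N) = A (M x K) * B (K x N), all row-major.
#define TILE 16

__global__ void matmulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March tiles of A and B through shared memory; each element is loaded
    // from global memory once per tile instead of once per multiply.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done reading before the next tile overwrites
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

The win is bandwidth: global memory traffic drops by roughly a factor of TILE, which is usually the first big step up from a naive kernel.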