For the CPU backend, I used Eigen for linear algebra and nothing else.
For the GPU backend, I implemented my own matrix library in CUDA C++. The CUDA kernels aren’t optimized with shared memory, tiling, or fused ops (so there’s some kernel launch overhead), but I chose clarity, modularity, and reusability over a few milliseconds of speedup.
That said, I've taken care to ensure coalesced memory access, and performance is pretty solid: around 0.4 ms per epoch on MNIST (batch size = 1000) on an RTX 3060.
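For a flavor of what "unfused but coalesced" means in practice, here's a minimal sketch of the kind of element-wise kernel involved. The names (addBias, launchAddBias) are illustrative, not the repo's actual API:

```cpp
// Minimal sketch of an unfused element-wise kernel with coalesced access.
// Names here are illustrative, not the repo's actual API.
#include <cuda_runtime.h>

// Row-major matrix stored contiguously: element (r, c) lives at r * cols + c.
// Consecutive threads get consecutive global indices, so each warp reads and
// writes consecutive addresses -- the pattern the hardware coalesces.
__global__ void addBias(float* out, const float* in, const float* bias,
                        int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols) {
        out[idx] = in[idx] + bias[idx % cols];  // bias broadcast per column
    }
}

// Host-side launch: one thread per element, 256 threads per block.
void launchAddBias(float* d_out, const float* d_in, const float* d_bias,
                   int rows, int cols) {
    int n = rows * cols;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addBias<<<blocks, threads>>>(d_out, d_in, d_bias, rows, cols);
}
```

One thread per element keeps the kernel trivially reusable across layer shapes, which is the modularity tradeoff mentioned above.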
This project is a big step up from my previous one: it's cleaner, better documented, and more modular.
I’m fully aware of areas that can be improved, and I’ll be working on them in future projects. My long-term goal is to get into Harvard or MIT, and this is part of that journey.
Would love to hear your thoughts, suggestions, or feedback.
I've attached the link to my GitHub repo.
onelli•7h ago
muchlakshay•6h ago
Honestly, the biggest gotchas were:

- memory coherence issues like above (esp. when trying to cache 'smartly')
- launching kernels in the right order while keeping data in sync (see the sketch after this list)
- maintaining modularity without sacrificing too much performance
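On the ordering/sync point: kernels issued to the same stream already run in launch order, so the real trap is host code reading results before the device has caught up. A minimal self-contained sketch (scale/shift are stand-ins, not kernels from the repo):

```cpp
// Sketch of the kernel-ordering/sync gotcha (kernel names are hypothetical).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

__global__ void shift(float* x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    // Both launches go to the default stream, so the GPU runs them in
    // issue order -- no explicit sync needed between dependent kernels.
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    shift<<<(n + 255) / 256, 256>>>(d_x, 1.0f, n);

    // The trap is on the host side: launches return immediately, so the
    // host must wait (here via a synchronizing copy on the default
    // stream) before reading results.
    float first;
    cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] = %f\n", first);  // 0 * 2 + 1 = 1

    cudaFree(d_x);
    return 0;
}
```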
I avoided fused kernels/shared memory in this version to keep things clean and reusable, but now that the core works, I plan to start optimizing that layer too.
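For what it's worth, the shared-memory version I'd move toward is the standard tiled matmul pattern. A textbook sketch (not code from the repo), assuming row-major storage:

```cpp
// Standard shared-memory tiled matmul sketch (textbook pattern, not repo code).
// C (M x N) = A (M x K) * B (K x N), all row-major.
#define TILE 16

__global__ void matmulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March tiles of A and B through shared memory; each element is loaded
    // from global memory once per tile instead of once per multiply.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done reading before the next tile overwrites
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

The win is bandwidth: global memory traffic drops by roughly a factor of TILE, which is usually the first big step up from a naive kernel.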