Hey everyone! As a fun side project before college, I built an ahead-of-time automatic differentiation library from scratch (the only dependency I allowed was numpy, for tensor storage). Inspired by tinygrad, I decided to support as few primitive operations as possible.
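To give a flavor of the core trick, here's a tiny eager sketch of reverse-mode autodiff (my library traces the graph and differentiates it ahead of time instead, and every name below is made up for illustration, not my real API):

    import numpy as np

    class Tensor:
        def __init__(self, data, parents=()):
            self.data = np.asarray(data, dtype=np.float32)  # numpy only for storage
            self.grad = np.zeros_like(self.data)
            self.parents = parents
            self.backward_fn = None  # pushes this node's grad to its parents

        def __add__(self, other):
            out = Tensor(self.data + other.data, (self, other))
            def backward_fn():
                self.grad += out.grad
                other.grad += out.grad
            out.backward_fn = backward_fn
            return out

        def __mul__(self, other):
            out = Tensor(self.data * other.data, (self, other))
            def backward_fn():
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out.backward_fn = backward_fn
            return out

        def backward(self):
            # Topo-sort the graph, then apply the chain rule in reverse.
            order, seen = [], set()
            def visit(t):
                if id(t) not in seen:
                    seen.add(id(t))
                    for p in t.parents:
                        visit(p)
                    order.append(t)
            visit(self)
            self.grad = np.ones_like(self.data)
            for t in reversed(order):
                if t.backward_fn:
                    t.backward_fn()

    # d(x*y + x)/dx = y + 1 = 4, d(x*y + x)/dy = x = 2
    x, y = Tensor(2.0), Tensor(3.0)
    z = x * y + x
    z.backward()
    print(x.grad, y.grad)  # 4.0 2.0

Everything else (sub, div, matmul, softmax, ...) decomposes into a small set of primitives like these, tinygrad-style, so the backend only has to know how to emit kernels for a handful of ops.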
The only backend I support for now is OpenCL (so I can run it on my Mac), but I do want to expand to NVIDIA CUDA (or even emit PTX directly?).
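For a sense of what the backend is doing under the hood, here's roughly how a single elementwise kernel gets launched (I'm using pyopencl here purely to keep the demo short; the kernel source and buffer names are illustrative):

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()  # picks/prompts for an OpenCL device
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    # Naive elementwise add: one work-item per output element.
    prg = cl.Program(ctx, """
    __kernel void add(__global const float *a,
                      __global const float *b,
                      __global float *out) {
        int i = get_global_id(0);
        out[i] = a[i] + b[i];
    }
    """).build()

    a = np.random.rand(1024).astype(np.float32)
    b = np.random.rand(1024).astype(np.float32)
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)
    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_buf)
    assert np.allclose(out, a + b)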
I have some preliminary results: on a 1-2 layer Transformer, my library was able to outpace PyTorch 2.0, but at larger sizes PyTorch outperforms it by a lot (my OpenCL kernels are very naive).
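To be concrete about "naive": my matmul is essentially the textbook one-work-item-per-output-element kernel, something like the sketch below. Every work-item re-reads a full row of A and a full column of B from global memory, with no tiling or __local-memory reuse, which is exactly where the hand-tuned kernels PyTorch dispatches to pull ahead as sizes grow.

    # Illustrative only -- not my exact kernel source.
    NAIVE_MATMUL = """
    __kernel void matmul(__global const float *A,
                         __global const float *B,
                         __global float *C,
                         const int N) {
        int row = get_global_id(0);
        int col = get_global_id(1);
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc += A[row * N + k] * B[k * N + col];  // no data reuse across work-items
        C[row * N + col] = acc;
    }
    """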
Would love to hear your thoughts/questions!
Thanks
Eshaan B.
eshaanb@stanford.edu if you want to reach out!