It's not an activation function: it combines the learnable weights of a linear projection (a matrix-vector multiplication) with the clamping behavior of an activation function in a single operation.
My personal issue with the proposal is that it essentially doubles the amount of memory needed on-chip.
A yat-product GEMV now needs to keep running totals of both the inner product and the norm of the input vector. That's a big cost increase for something that might not improve performance all that much.
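For concreteness, here's a rough sketch of what I mean (my own reading: I'm assuming the yat-product takes the form y_i = (w_i . x)^2 / (||w_i - x||^2 + eps); yat_gemv, eps, and the loop structure are all just illustrative):

    import numpy as np

    def yat_gemv(W, x, eps=1e-6):
        # Assumed form: y_i = (w_i . x)^2 / (||w_i - x||^2 + eps).
        # ||w_i - x||^2 = ||w_i||^2 + ||x||^2 - 2 * (w_i . x), so the kernel
        # has to carry the squared input norm alongside the usual dot-product
        # accumulator; that's the extra on-chip state being discussed.
        n_out, n_in = W.shape
        x_sq = float(x @ x)                  # extra running total: ||x||^2
        y = np.empty(n_out)
        for i in range(n_out):
            dot = 0.0                        # usual GEMV accumulator: w_i . x
            w_sq = 0.0                       # row norm (could be precomputed once)
            for j in range(n_in):
                dot += W[i, j] * x[j]
                w_sq += W[i, j] * W[i, j]
            dist_sq = w_sq + x_sq - 2.0 * dot
            y[i] = (dot * dot) / (dist_sq + eps)
        return y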
mlnomadpy•3d ago
I was able to create a new kernel that lets you learn non-linearity without using activation functions, which makes the models white-box and avoids information loss.
MiniGPT with huggingface datasets streaming: https://www.kaggle.com/code/skywolfmo/yat-nnx-minigpt-finewe...
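Roughly, the idea in toy numpy form (a simplified sketch of one possible yat-style layer, not the notebook's actual kernels; yat_layer and eps are just illustrative names). Each layer is already nonlinear in its input, so stacking them needs no ReLU/GELU:

    import numpy as np

    def yat_layer(x, W, eps=1e-6):
        # One possible yat-style layer: (x . w_i)^2 / (||x - w_i||^2 + eps).
        dots = x @ W.T                                          # (batch, n_out)
        dist = ((x[:, None, :] - W[None, :, :]) ** 2).sum(-1)   # ||x - w_i||^2
        return dots ** 2 / (dist + eps)                         # nonlinear on its own

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(2, 16))
    out = yat_layer(yat_layer(x, W1), W2)    # no activation function anywhere
    print(out.shape)                         # (4, 2)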
rytill•3h ago
To my knowledge they’re a negligible portion of the total compute during training or inference and work well to provide non-linearity.
Very open to learning more.
russfink•3h ago