I love these optimization tales. Memory throughput bottlenecks (extremely common, perhaps more so than they seem) are my favorite to tackle - there are often some juicy optimizations to be had there.
Do model weights have any spatial locality that can be exploited? If so, there are some more general pre-compression techniques that might be interesting to try, e.g. bitshuffle is one I've worked with (https://github.com/kiyo-masui/bitshuffle).
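To make the idea concrete, here's a pure-Python sketch of the bitshuffle transform (the real library does this with SIMD and is vastly faster): gather bit 7 of every byte, then bit 6, and so on, so that each "bit plane" sits contiguously. When neighboring values are correlated, the upper planes turn into long runs of identical bits that a generic codec squeezes well. The random-walk test data is just a hypothetical stand-in for spatially correlated weights:

```python
import random
import zlib

def bitshuffle(data: bytes) -> bytes:
    """Transpose the byte array's bit matrix: output all the bit-7s,
    then all the bit-6s, ... (one contiguous plane per bit position)."""
    n = len(data)
    out = bytearray(n)
    pos = 0  # absolute bit index in the output stream
    for b in range(7, -1, -1):          # one pass per bit plane
        for byte in data:
            if (byte >> b) & 1:
                out[pos >> 3] |= 1 << (7 - (pos & 7))
            pos += 1
    return bytes(out)

def bitunshuffle(data: bytes) -> bytes:
    """Inverse transform: scatter the bit planes back into bytes."""
    n = len(data)
    out = bytearray(n)
    pos = 0
    for b in range(7, -1, -1):
        for i in range(n):
            if data[pos >> 3] & (1 << (7 - (pos & 7))):
                out[i] |= 1 << b
            pos += 1
    return bytes(out)

# A bounded random walk mimics spatially correlated data: neighbors
# differ by at most 1, so high bit planes change slowly.
random.seed(0)
v, walk = 128, bytearray()
for _ in range(8192):
    v = min(255, max(0, v + random.choice((-1, 0, 1))))
    walk.append(v)
walk = bytes(walk)

assert bitunshuffle(bitshuffle(walk)) == walk  # lossless roundtrip
print("raw      compressed:", len(zlib.compress(walk)))
print("shuffled compressed:", len(zlib.compress(bitshuffle(walk))))
```

How much the shuffle helps depends entirely on how correlated the data actually is - for truly random bytes it does nothing.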
Another fun fact: in some scenarios (depends a lot on CPU and memory characteristics), gzip+memcpy+gunzip can be faster end-to-end than just memcpy. I forget where I first heard this but my familiarity comes from the blosc compression library.
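The intuition is that if the data compresses well, you move far fewer bytes across the memory bus, and a fast decompressor can make up the difference. A rough sketch of how you'd measure it (Python-level timing is crude, and whether the compressed path actually wins depends on your CPU, codec, and memory bandwidth - don't read the printed numbers as a general verdict):

```python
import time
import zlib

def bench(fn, reps=5):
    """Return the best-of-N wall time for fn()."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# A highly compressible 16 MiB payload stands in for cold data in RAM.
payload = bytes(range(64)) * (1 << 18)

# Compress once, ahead of time - the scenario is "data already lives
# compressed in memory; is copy+inflate faster than a plain copy?"
packed = zlib.compress(payload, 1)
assert zlib.decompress(packed) == payload

plain = bench(lambda: bytes(payload))             # ~memcpy of the raw bytes
inflate = bench(lambda: zlib.decompress(packed))  # move less, then inflate

print(f"plain copy: {plain * 1e3:.1f} ms")
print(f"decompress: {inflate * 1e3:.1f} ms "
      f"({len(packed) / len(payload):.1%} of original bytes moved)")
```

In practice this trick pays off with very fast codecs (blosc pairs bitshuffle with LZ4-class compressors for exactly this reason); zlib here is just the stdlib stand-in.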