We have considered open-sourcing some of our optimized inference libraries in the future, but have not yet come to a decision on this.
Also, if you need a rough intuition as to why this is possible: this entire inference stack was built for exactly one model, so we can tune the whole framework accordingly.
We believe our improvements would hold on BF16, but let me check.
ntonozzi•1h ago
It is also a bit weird that they are not incorporating speculative decoding; that seems like a critical performance optimization, especially for decode-heavy workloads.
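For readers unfamiliar with the technique, here is a minimal sketch of why speculative decoding helps decode-heavy workloads: a cheap draft model proposes a chunk of `k` tokens, and the expensive target model verifies the chunk, accepting the longest prefix that matches its own greedy choices. The `target_next` and `draft_next` callables below are hypothetical stand-ins for real models, and the per-position verification calls would be a single batched forward pass in a real system.

```python
# Minimal sketch of greedy speculative decoding (toy, not any vendor's
# actual implementation). target_next / draft_next are hypothetical
# stand-ins: each maps a token prefix to the model's next greedy token.
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[int]], int],  # expensive target model
    draft_next: Callable[[List[int]], int],   # cheap draft model
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,                               # tokens drafted per step
) -> List[int]:
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. Draft k tokens cheaply, one at a time.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify with the target model. In a real system all k
        #    positions are scored in one batched forward pass; here we
        #    call target_next per position for clarity.
        accepted = 0
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted += 1
            else:
                # First mismatch: keep the target's token and stop.
                draft = draft[:i] + [expected]
                accepted = i + 1
                break
        tokens.extend(draft[:accepted])
        produced += accepted
    return tokens[: len(prompt) + max_new_tokens]

if __name__ == "__main__":
    # Toy demo: both "models" emit an incrementing sequence, so every
    # draft chunk is accepted and decode advances k tokens per target step.
    nxt = lambda seq: (seq[-1] + 1) % 100
    print(speculative_decode(nxt, nxt, [0], max_new_tokens=10))
```

When the draft model agrees with the target often, each expensive verification step yields up to `k` tokens instead of one, which is why the technique matters most when decode dominates the workload.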
lukebechtel•1h ago