I'd be interested in feedback from people who have written transformer implementations before. Are there any implementation "tricks" that I'm missing (e.g., a cleaner KV cache for PyTorch/JAX, or RoPE tricks)?