Hello HN, I’ve turned my Master’s research on stabilizing very deep Transformers into an open-source PyTorch library called AION-Torch. Instead of a fixed residual connection, it uses an adaptive residual that compares the energy of a block’s input and output (roughly, their norms) and dials the residual strength up or down to keep activations and gradients stable. On my small setup (an RTX 4060), it seemed to help very deep Transformer stacks keep gradients under control and reach lower loss without special tuning.
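To make the mechanism concrete, here is a minimal sketch of the idea, not the exact code in the repo; the mean-squared "energy", the learnable gain, and the eps stabilizer are all simplifications of my own:

    import torch
    import torch.nn as nn

    class AdaptiveResidual(nn.Module):
        """Sketch: scale the residual branch by a gate driven by the
        energy ratio of the block's input vs. its output."""
        def __init__(self, init_gain: float = 1.0, eps: float = 1e-6):
            super().__init__()
            # Learnable base gain; the energy ratio modulates it per forward pass.
            self.gain = nn.Parameter(torch.tensor(init_gain))
            self.eps = eps

        def forward(self, x: torch.Tensor, fx: torch.Tensor) -> torch.Tensor:
            # "Energy" here = mean squared activation over all non-batch dims.
            e_in = x.pow(2).mean(dim=tuple(range(1, x.dim())), keepdim=True)
            e_out = fx.pow(2).mean(dim=tuple(range(1, fx.dim())), keepdim=True)
            # Shrink the residual branch when the block output is much more
            # energetic than its input, so the sum stays well-scaled even
            # in very deep stacks.
            alpha = self.gain * torch.sqrt(e_in / (e_out + self.eps))
            return x + alpha * fx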
The repo has a drop-in AionResidual module, some basic tooling to log what’s happening inside the network during training, and small examples showing how to plug it into existing models. I’d love feedback on whether the idea makes sense beyond toy setups, how you would benchmark it against standard residuals or DeepNorm on real tasks, and whether the API feels natural to people who train larger models.
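To give a concrete sense of the "drop-in" part, usage looks roughly like this. This is a simplified sketch: the import path and call signature here are illustrative rather than the exact API, and the repo examples show the real thing:

    import torch
    import torch.nn as nn
    # Illustrative import; the actual path/signature in AION-Torch may differ.
    from aion_torch import AionResidual

    class Block(nn.Module):
        def __init__(self, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.residual = AionResidual()  # replaces the plain `x + sublayer(x)`

        def forward(self, x):
            h = self.norm(x)
            h, _ = self.attn(h, h, h, need_weights=False)
            # Adaptive residual in place of the usual fixed sum:
            return self.residual(x, h)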