I built FactorizedAttention - a new attention mechanism based on
the GWO framework. Instead of simple QK^T dot products, it uses
factorized quadratic forms to model higher-order token interactions.
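To give a flavor of what "factorized quadratic form" means here, a simplified sketch is below. It is not the exact code in the repo; it just shows one way to realize the idea, as a low-rank bilinear term added on top of the usual dot product (the class name, U, V, and rank are placeholders):

    import torch
    import torch.nn as nn

    class FactorizedBilinearScore(nn.Module):
        # Standard attention scores are q @ k^T. Here a low-rank bilinear
        # term q^T (U V^T) k is added, with U, V of shape (d_head, r) and
        # r << d_head, so the extra interaction matrix stays cheap.
        def __init__(self, d_head: int, rank: int = 8):
            super().__init__()
            self.U = nn.Linear(d_head, rank, bias=False)
            self.V = nn.Linear(d_head, rank, bias=False)
            self.scale = d_head ** -0.5

        def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
            # q, k: (batch, heads, seq, d_head)
            dot = q @ k.transpose(-2, -1)                        # plain QK^T
            bilinear = self.U(q) @ self.V(k).transpose(-2, -1)   # q^T U V^T k
            return (dot + bilinear) * self.scale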
Testing on GPT-2 small + LoRA fine-tuning:
Math reasoning: 3.4% PPL improvement
Competitive programming: 3.2%
Python code: 1.9%
The bigger gains on reasoning tasks suggest the approach helps
with complex relationships. Still early stage (only GPT-2 small),
but the results are encouraging. Happy to answer questions! Code + repro steps in the repo.
mynti•1h ago
Cool idea! I had a look at the code and have been wondering about the sigmoid gating: it is used to add some of the q_struct and k_struct into the original key and query. But I wonder why this gating is independent of the input? I would have expected the gating to depend on the input, so that if the model sees something more complex it can pull in more of this information (or something along those lines). But it is just a fixed, learnable parameter per layer, or am I mistaken? What is the intuition behind this?
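To make sure I'm reading it right, the gating looked to me roughly like the snippet below (my simplified paraphrase, not the actual repo code; the names are made up, and the same pattern would apply to k and k_struct):

    import torch
    import torch.nn as nn

    class StaticStructGate(nn.Module):
        # One learnable scalar per layer, shared across every token.
        def __init__(self):
            super().__init__()
            self.gate_logit = nn.Parameter(torch.zeros(1))

        def forward(self, q, q_struct):
            g = torch.sigmoid(self.gate_logit)  # same gate no matter the input
            return q + g * q_struct             # mix structural query into q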
umjunsik132•52m ago
For this initial version, I kept the gating static to keep the model as simple as possible while validating the core idea. Making the gate dynamic based on the input is a great suggestion for the next step, and I agree it could lead to better performance. I really appreciate the feedback.
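Roughly, the input-dependent version I have in mind would look something like this (just a sketch with placeholder names, nothing committed yet; the analogous change applies to k_struct):

    import torch
    import torch.nn as nn

    class DynamicStructGate(nn.Module):
        # Gate computed per token from the hidden state, not a fixed scalar.
        def __init__(self, d_model: int):
            super().__init__()
            self.W_gate = nn.Linear(d_model, 1)

        def forward(self, q, q_struct, x):
            # x: (batch, seq, d_model) hidden states -> g: (batch, seq, 1)
            g = torch.sigmoid(self.W_gate(x))
            return q + g * q_struct  # each token decides how much structure to mix in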