Making GPT-2 better at math reasoning with a new attention mechanism

https://github.com/Kim-Ai-gpu/FactorizedAttention
3•umjunsik132•3mo ago

Comments

umjunsik132•3mo ago
Hi HN, author here.

I built FactorizedAttention - a new attention mechanism based on the GWO framework. Instead of simple QK^T dot products, it uses factorized quadratic forms to model higher-order token interactions.
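Roughly, you can think of it as adding a low-rank bilinear term on top of the usual dot-product score. The single-head sketch below is one simplified way to read that, not the exact code in the repo; the projection names and the rank value are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedScoreAttention(nn.Module):
    """Single-head attention with a low-rank (factorized) quadratic
    correction added to the usual QK^T score. Illustrative sketch only."""

    def __init__(self, d_model: int, rank: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Factorized form: score += (qU)(kV)^T, i.e. a d_model x d_model
        # interaction matrix represented as the product of two rank-r factors.
        self.U = nn.Linear(d_model, rank, bias=False)
        self.V = nn.Linear(d_model, rank, bias=False)
        self.d_model = d_model
        self.rank = rank

    def forward(self, x, attn_mask=None):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Standard scaled dot-product term.
        scores = q @ k.transpose(-2, -1) / self.d_model ** 0.5
        # Low-rank quadratic-form term modelling extra pairwise structure.
        scores = scores + self.U(q) @ self.V(k).transpose(-2, -1) / self.rank ** 0.5
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v
```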

Testing on GPT-2 small + LoRA fine-tuning:

Math reasoning: 3.4% PPL improvement

Competitive programming: 3.2%

Python code: 1.9%

The bigger gains on reasoning tasks suggest the approach helps with complex relationships. Still early stage (only GPT-2 small), but the results are encouraging. Happy to answer questions! Code + repro steps in the repo.
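For anyone who wants a starting point before opening the repo: the fine-tuning side is the standard GPT-2 small + LoRA recipe. The snippet below is a generic sketch using Hugging Face peft; the rank, alpha, and target modules are placeholders, not the exact values behind the numbers above.

```python
from transformers import GPT2LMHeadModel
from peft import LoraConfig, get_peft_model

# Placeholder hyperparameters, shown only to illustrate the setup.
model = GPT2LMHeadModel.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```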

mynti•3mo ago
Cool idea! I had a look at the code and have been wondering about the sigmoid gating: it is used to mix some of the q_struct and k_struct into the original key and query. But why is this gating independent of the input? I would have expected it to depend on the input, so that if the model sees something more complex it can pull in more of this information (or something along those lines). As far as I can tell it is just a fixed, learnable parameter per layer, or am I mistaken? What is the intuition behind this?
umjunsik132•3mo ago
For this initial version, I kept the gating static to keep the model as simple as possible while validating the core idea. Making the gate dynamic based on the input is a great suggestion for the next step, and I agree it could lead to better performance. I really appreciate the feedback!
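To make the distinction concrete, here is a minimal sketch of the two options (illustrative names and shapes, not the actual repo code):

```python
import torch
import torch.nn as nn

class StaticGate(nn.Module):
    """Current approach as described above: one learnable scalar per layer,
    squashed with a sigmoid, independent of the input."""

    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 to start

    def forward(self, q, q_struct):
        g = torch.sigmoid(self.gate)
        return q + g * q_struct


class DynamicGate(nn.Module):
    """Suggested variant: gate computed from the token representation itself,
    so 'harder' tokens can pull in more of the structural term."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, 1)

    def forward(self, q, q_struct):
        g = torch.sigmoid(self.gate_proj(q))  # per-token gate in (0, 1)
        return q + g * q_struct
```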