Cool stuff. In a recent podcast Karpathy was talking about this too. He sees it as the next "target": models that don't memorise facts, since those can be looked up in an oracle, but still keep the "reasoning" qualities.
From Spikes to Heavy Tails: Unveiling the Spectral Evolution of Neural Networks (https://openreview.net/pdf?id=DJHB8eBUnt)
andy12_•4h ago
1. Run the model once across a dataset to estimate loss curvature per MLP weight matrix via K-FAC (activation/gradient covariances).
2. Decompose each weight matrix into curvature-ordered components; low-curvature directions correspond most to verbatim memorization, higher curvature to shared/general mechanisms.
3. Edit by dropping the low-curvature subspace and keeping only the top directions (rough sketch below).
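For concreteness, here is a minimal sketch of what steps 2-3 could look like for a single weight matrix, assuming the two K-FAC covariance factors have already been estimated. The function name, the quantile thresholding, and keep_frac are my own illustration of the idea, not the paper's code:

```python
import numpy as np

def curvature_edit(W, A_cov, G_cov, keep_frac=0.5):
    """Drop the low-curvature subspace of one MLP weight matrix W.

    A_cov: (d_in, d_in) covariance of the layer's input activations.
    G_cov: (d_out, d_out) covariance of gradients w.r.t. the layer's output.
    K-FAC approximates the loss curvature for W as G_cov (kron) A_cov, so the
    curvature along the (i, j) Kronecker direction is g_vals[i] * a_vals[j].
    """
    a_vals, a_vecs = np.linalg.eigh(A_cov)   # symmetric PSD factors
    g_vals, g_vecs = np.linalg.eigh(G_cov)

    # Coefficients of W in the Kronecker eigenbasis.
    C = g_vecs.T @ W @ a_vecs

    # Rank every direction by its curvature and keep only the top fraction
    # (per the comment above, flat directions carry the verbatim memorization).
    curvature = np.outer(g_vals, a_vals)
    threshold = np.quantile(curvature, 1.0 - keep_frac)
    C_kept = np.where(curvature >= threshold, C, 0.0)

    # Map back to weight space with the low-curvature subspace zeroed out.
    return g_vecs @ C_kept @ a_vecs.T
```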
vessenes•4h ago
Now, about the paper: that’s super interesting. I imagine the dream here is to distil this down into a “reasoning” core. Or maybe to reclaim space for more generalization. Lots of interesting use cases.
getnormality•4h ago
I think you may have accidentally switched low and high in #2, no? The abstract speaks of high curvature as associated with memorization:
> curvature for memorized training points is much sharper than non memorized
radarsat1•3h ago
getnormality•3h ago
Say you have a smooth but highly flexible model y = f(x) and some data points you are fitting with a machine learning algorithm. For whatever reason, the algorithm decides it wants to reduce training error by interpolating one specific point, (x0, y0), without hurting training error on nearby points. The direct, guaranteed-successful way to do this is to force f(x0) = y0 exactly by adding a Dirac delta at x0 and leaving the rest of f exactly as-is. But a differentiable model can't do that, since it would create a discontinuity. The next best thing such a model can actually do is replace the Dirac delta with a smooth but very narrow bump (e.g. a Gaussian). That narrow bump inevitably has extremely high curvature at x0: its slope is zero at the peak, yet it has to merge back into the neighborhood around x0 within a very short distance.
Think of driving: if you have to change lanes in a very short distance, you're going to have to steer hard. Steering is curvature.
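If numbers help, here is a tiny sketch of that claim (my own illustration, not from the paper): a Gaussian bump's curvature at its centre is -amplitude/sigma^2, so the narrower the bump you need to hit (x0, y0) without disturbing the neighbours, the sharper the curvature at x0.

```python
import numpy as np

def bump(x, x0=0.0, amp=1.0, sigma=0.1):
    # Smooth stand-in for the Dirac delta: a narrow Gaussian bump at x0.
    return amp * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

for sigma in (1.0, 0.1, 0.01):
    h = sigma / 100  # small step for a finite-difference second derivative
    curv = (bump(h, sigma=sigma) - 2 * bump(0.0, sigma=sigma) + bump(-h, sigma=sigma)) / h ** 2
    print(f"sigma={sigma}: curvature at x0 ~ {curv:.1f} (analytic -1/sigma^2 = {-1.0 / sigma ** 2:.1f})")
```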
woadwarrior01•2h ago
andy12_•2h ago
> In extending from studying per-example to bulk memorization, we propose a novel inversion of the previous interpretation of loss curvature: while individual memorized points are associated with high curvature, the direction of curvature varies across examples, meaning that, averaged across multiple examples, memorization directions are actually flatter than generalizing directions, which maintain a consistent moderate curvature across points
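A toy numeric illustration of that averaging argument (my own construction, not from the paper): give every example a sharp curvature peak along its own random direction ("memorization") plus a moderate peak along one shared direction ("general mechanism"). Averaged over examples, the shared direction stands out while the sharp-but-varying directions flatten out.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1000               # parameter dimension, number of examples
sharp, moderate = 100.0, 1.0   # per-example curvature magnitudes
shared = np.zeros(d)
shared[0] = 1.0                # the one direction every example shares

H_mean = np.zeros((d, d))
for _ in range(n):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)     # each example's own "memorization" direction
    H_mean += sharp * np.outer(u, u) + moderate * np.outer(shared, shared)
H_mean /= n

eigvals = np.linalg.eigvalsh(H_mean)                              # ascending order
print("top eigenvalue (shared direction dominates):", eigvals[-1])     # ~ moderate + sharp/d
print("typical averaged curvature:", np.median(eigvals))               # ~ sharp/d, i.e. flatter
```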
getnormality•2h ago
vatsachak•56m ago