https://arxiv.org/html/2506.13018v2 - Here's an interesting paper that can help inform how you might look at networks, especially in the context of lottery tickets, gauge quotients, permutations, and what gradient descent looks like in practice.
Kolmogorov-Arnold Networks are better at exposing gauge symmetry and operating in that space, but they aren't optimized for the hardware we have (mechinterp and other motivations might inspire new hardware, though). If you know what a layer's function should look like, ordered so that it resembles a smooth spline, you could initialize and freeze that layer's weights and force the rest of the network to learn within the context of your chosen ordering.
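Here is a minimal PyTorch sketch of that freezing idea; the toy MLP, the sinusoidal stand-in for a spline-like ordering, and the regression target are all my assumptions, not anything from the paper:

```python
import torch
import torch.nn as nn

# Sketch: initialize one layer from a chosen smooth function of the
# neuron index (a stand-in for a spline-like ordering), freeze it, and
# let gradient descent train only the remaining layers.
torch.manual_seed(0)
in_dim, hidden, out_dim = 16, 64, 1

model = nn.Sequential(
    nn.Linear(in_dim, hidden),
    nn.Tanh(),
    nn.Linear(hidden, out_dim),
)

with torch.no_grad():
    idx = torch.linspace(0, 1, hidden).unsqueeze(1)   # (hidden, 1)
    pos = torch.linspace(0, 1, in_dim).unsqueeze(0)   # (1, in_dim)
    model[0].weight.copy_(torch.sin(2 * torch.pi * (idx + pos)))
    model[0].bias.zero_()

# Freeze the chosen layer; the rest of the network learns around it.
for p in model[0].parameters():
    p.requires_grad = False

opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-2)

x = torch.randn(128, in_dim)
y = x.sum(dim=1, keepdim=True)   # arbitrary toy target
for _ in range(100):
    opt.zero_grad()
    ((model(x) - y) ** 2).mean().backward()
    opt.step()
```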
The number of "valid" configurations for a layer is large, especially if you have more neurons in the layer than you need, and the number of subsequent layer configurations is much larger than you'd think. The lottery ticket hypothesis is just circling that phenomenon without formalizing it - some surprisingly large percentage of possible configurations will approximate the function you want a network to learn. It doesn't necessarily gain you advantages in achieving the last 10% , and there could be counterproductive configurations that collapse before reaching an optimal configuration.
There are probably optimizer strategies that can exploit initializations of certain types, for different classes of activation functions, and achieve better performance for given architectures. All of that is probably open to formal treatment using the existing mathematics of gauge-invariant systems and gauge quotients, with different layer configurations sitting as points on gauge orbits in high-dimensional weight space.
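The simplest of those gauge symmetries is plain permutation of hidden units. A small numerical check (PyTorch, a toy two-layer MLP; my example, not from the thread) that permuting the hidden units, along with the matching rows and columns of the adjacent weight matrices, leaves the function unchanged, i.e. both weight settings sit on the same gauge orbit:

```python
import torch
import torch.nn as nn

# Permutation gauge: permuting hidden units (rows of W1, entries of b1,
# columns of W2) gives a different point in weight space that computes
# exactly the same function.
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))

perm = torch.randperm(32)
permuted = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
with torch.no_grad():
    permuted[0].weight.copy_(net[0].weight[perm])      # permute rows of W1
    permuted[0].bias.copy_(net[0].bias[perm])          # permute b1
    permuted[2].weight.copy_(net[2].weight[:, perm])   # permute columns of W2
    permuted[2].bias.copy_(net[2].bias)

x = torch.randn(16, 8)
print(torch.allclose(net(x), permuted(x), atol=1e-6))  # True: same orbit
```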
It'd be really cool if you could throw twice as many neurons as you need into a model, randomly initialize a bunch of times until you get a winning ticket, then distill the remainder down to your intended parameter count, and train from there as normal.
It's more complex with architectures like transformers, but you're not dealing with a combinatorial explosion with the LTH - more like a little combinatorial flash flood, and if you engineer around it, it can actually be exploited.
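A hedged sketch of the over-provision-then-shrink workflow described two paragraphs up; "distill" is read loosely here as selecting the strongest hidden units rather than knowledge distillation, and the scoring rule, widths, and toy task are all assumptions:

```python
import torch
import torch.nn as nn

# Build a hidden layer twice as wide as the target, train briefly, keep
# the half of the hidden units with the largest outgoing-weight norm,
# and copy them into a model of the intended width before normal training.
def make_mlp(hidden):
    return nn.Sequential(nn.Linear(10, hidden), nn.ReLU(), nn.Linear(hidden, 1))

torch.manual_seed(0)
target_width = 32
big = make_mlp(2 * target_width)

x = torch.randn(256, 10)
y = x[:, :1] ** 2   # arbitrary toy regression target

# Short "ticket-finding" phase on the over-provisioned model.
opt = torch.optim.SGD(big.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    ((big(x) - y) ** 2).mean().backward()
    opt.step()

# Score hidden units by outgoing-weight norm and keep the top half.
scores = big[2].weight.norm(dim=0)            # one score per hidden unit
keep = scores.topk(target_width).indices

small = make_mlp(target_width)
with torch.no_grad():
    small[0].weight.copy_(big[0].weight[keep])
    small[0].bias.copy_(big[0].bias[keep])
    small[2].weight.copy_(big[2].weight[:, keep])
    small[2].bias.copy_(big[2].bias)

# `small` now starts from the selected units and trains as normal.
```

The full version of the idea would also re-run the random initialization several times and keep the best-scoring draw; the loop above only shows a single draw.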
- you can solve neural networks in analytic form with a Hodge star approach* [0]
- if you use a picture to set the initial weights of your NN, you can see visually how much your choice of optimizer actually moves the weights: e.g. non-dualized optimizers look like they barely change anything, whereas dualized Muon changes the weights so much that you cannot recognize the originals [1] (rough sketch below)
*unfortunately, this is exponential in memory
[0] M. Pilanci, "From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford's Geometric Algebra and Convexity," https://arxiv.org/abs/2309.16512
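A minimal sketch of the picture-as-initial-weights trick from the second bullet. Everything here is an assumption rather than anything from [1]: a synthetic grayscale pattern stands in for a real image, plain SGD stands in for the optimizers being compared (Muon isn't in core PyTorch), and the before/after comparison is reduced to a single relative-change number, where the visual version would save both matrices as images:

```python
import torch
import torch.nn as nn

# Set a weight matrix from an image-like 2-D pattern, take some optimizer
# steps, and compare the weights before and after.
torch.manual_seed(0)
h = w = 64
row = torch.linspace(0, 1, h).unsqueeze(1)
col = torch.linspace(0, 1, w).unsqueeze(0)
image = 0.5 + 0.5 * torch.sin(8 * row) * torch.cos(8 * col)   # fake "photo"

layer = nn.Linear(w, h, bias=False)
with torch.no_grad():
    layer.weight.copy_(image)
before = layer.weight.detach().clone()

opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, y = torch.randn(128, w), torch.randn(128, h)
for _ in range(100):
    opt.zero_grad()
    ((layer(x) - y) ** 2).mean().backward()
    opt.step()

after = layer.weight.detach()
print("relative change:", ((after - before).norm() / before.norm()).item())
# Saving `before` and `after` as grayscale images makes the comparison visual.
```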
This would tie in with the observation that flat/shallow minima are easier to find with stochastic gradient descent, and that such weights generalise better.
It would be awesome to have a way of finding them in advance, but this is also just a case of avoiding pure DNNs because of their strong reliance on the initialization parameters.
Looking at transformers by comparison, you see a much weaker dependence of the model on its initial parameters. Does this mean the model is better or worse at learning, or just more stable?
[1] https://distill.pub/2020/circuits/branch-specialization/
I think the idea still holds. Although interest has shifted towards test-time scaling and thinking, researchers still care about architectures, e.g. the recently published Nemotron 3.
Can anyone give more updates on this direction of research, or more recent papers?
kingstnap•1d ago
https://youtu.be/WW1ksk-O5c0?list=PLCq6a7gpFdPgldPSBWqd2THZh... (timestamped)
At the timestamp they discuss how the original ICLR results only held for extremely tiny models and didn't carry over to larger ones. The fix is to train densely first for a few epochs; only then can you start increasing sparsity.
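A hedged sketch of that dense-then-sparse schedule using torch.nn.utils.prune; the tiny MLP, random data, and the particular pruning fractions are placeholders, not anything from the talk:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Train densely first, then ramp sparsity up gradually instead of
# pruning at initialization.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 100), nn.ReLU(), nn.Linear(100, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(512, 20), torch.randn(512, 1)

def train_epochs(n):
    for _ in range(n):
        opt.zero_grad()
        ((model(x) - y) ** 2).mean().backward()
        opt.step()

# 1) Dense warm-up: no pruning at all for the first few epochs.
train_epochs(50)

# 2) Gradually increase sparsity: each round prunes a further slice of
#    the lowest-magnitude weights that are still unpruned, then keeps training.
for _ in range(5):
    for layer in (model[0], model[2]):
        prune.l1_unstructured(layer, name="weight", amount=0.2)
    train_epochs(20)
```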
paulsutter•23h ago
Ioannu is saying the paper's idea for training a dense network doesn't work in non-toy networks (the paper's method for selecting promising weights early doesn't improve the network).
BUT the term "lottery ticket" refers to the true observation that a small subset of weights drives functionality (see all the pruning papers). It's great terminology because the winning subnetworks really are coincidences of the random initialization.
All that's been disproven is that paper's specific method of creating a dense network based on this observation.
laughingcurve•1d ago
Also, note to self: if I publish and disown my papers, shawn will interview me :)