The universal weight subspace hypothesis

358•lukeplato•2mo ago

Comments

CGMthrowaway•2mo ago

They compressed the compression? Or identified an embedding that can "bootstrap" training with a headstart ?

Not a technical person just trying to put it in other words.

vlovich123•2mo ago

They identified that the compressed representation has structure to it that could potentially be discovered more quickly. It’s unclear if it would also make it easier to compress further but that’s possible.

mapontosevenths•2mo ago

To use an analogy: Imagine a spreadsheet with 500 smoothie recipes one in each row, each with a dozen ingredients as the columns.

Now imagine you discover that all 500 are really just the same 11 base ingredients plus something extra.

What they've done here is use SVD, (which is normally used for image compression and noise reduction), to find that "base recipe". Now we can reproduce those other recipes by only recording the one igredient that differs.

More interestingly it might tell us something new about smoothies in general to know that they all share a common base. Maybe we can even build a simpler base using this info.

At least in theory. The code hasn't actually been released yet.

https://toshi2k2.github.io/unisub/#key-insights

CGMthrowaway•2mo ago

Yeah that's pretty much how I understood it. Good analogy. We are finding the French Mother Sauce. Reading the comments it seems everyone is still clear on the practical implications of that.

canjobear•2mo ago

What's the relationship with the Platonic Representation Hypothesis?

MarkusQ•2mo ago

From what I can tell, they are very closely related (i.e. the shared representational structures would likely make good candidates for Platonic representations, or rather, representations of Platonic categories). In any case, it seems like there should be some sort of interesting mapping between the two.

brillcleaner•2mo ago

My first thought was that this was somehow distilling universal knowledge. Platonic ideals. Truth. Beauty. Then I realized- this was basically just saying that given some “common sense”, the learning essence of a model is the most important piece, and a lot of learned data is garbage and doesn’t help with many tasks. That’s not some ultimate truth, that’s just optimization. It’s still a faulty LLM, just more efficient for some tasks.

unionjack22•2mo ago

I hope someone much smarter than I answers this. I’ve been noticing an uptick platonic and neo-platonic discourse in the zeitgeist and am wondering if we’re converging on something profound.

nowittyusername•2mo ago

I've been noticing that as well....

altairprime•2mo ago

Same hat, except 18 months later, assuming it survives peer review, reproduction, etc. (or: "The newer one proposes evidence that appears to support the older one.")

https://arxiv.org/abs/2405.07987

kacesensitive•2mo ago

interesting.. this could make training much faster if there’s a universal low dimensional space that models naturally converge into, since you could initialize or constrain training inside that space instead of spending massive compute rediscovering it from scratch every time

odyssey7•2mo ago

Or an architecture chosen for that subspace or some of its properties as inductive biases.

bigbuppo•2mo ago

Wouldn't this also mean that there's an inherent limit to that sort of model?

rhaen•2mo ago

Not strictly speaking? A universal subspace can be identified without necessarily being finite.

As a really stupid example: the sets of integers less than 2, 8, 5, and 30 can all be embedded in the set of integers less than 50, but that doesn’t require that the set of integer is finite. You can always get a bigger one that embeds the smaller.

scotty79•2mo ago

> Wouldn't this also mean that there's an inherent limit to that sort of model?

If all need just 16 dimensions if we ever make one that needs 17 we know we are making progress instead of running in circles.

moelf•2mo ago

you can always make a new vector that's orthogonal to all the ones currently used and see if the inclusion improves performance on your tasks

scotty79•2mo ago

> see if the inclusion improves performance on your tasks

Apparently it doesn't at least not in our models with our training applied to our tasks.

So if we expand one of those 3 things and notice that 17-th vector makes a difference then we are having progress.

markisus•2mo ago

On the contrary, I think it demonstrates an inherent limit to the kind of tasks / datasets that human beings care about.

It's known that large neural networks can even memorize random data. The number of random datasets is unfathomably large, and the weight space of neural networks trained on random data would probably not live in a low dimensional subspace.

It's only the interesting-to-human datasets, as far as I know, that drive the neural network weights to a low dimensional subspace.

moelf•2mo ago

>instead of spending massive compute rediscovering it from scratch every time

it's interesting that this paper was discovered by JHU, not some groups from OAI/Google/Apple, considering that the latter probably have spent 1000x more resource on "rediscovering"

tsurba•2mo ago

You can show for example that siamese encoders for time-series, with MSE loss on similarity, without a decoder, will converge to the the same latent space up to orthogonal transformations (as MSE is kinda like gaussian prior which doesn’t distinguish between different rotations).

Similarly I would expect that transformers trained on the same loss function for predicting the next word, if the data is at all similar (like human language), would converge to approx the same space. And to represent that same space probably weights are similar, too. Weights in general seem to occupy low-dimensional spaces.

All in all, I don’t think this is that surprising, and I think the theoretical angle should be (have been?) to find mathematical proofs like this paper https://openreview.net/forum?id=ONfWFluZBI

VikingCoder•2mo ago

I find myself wanting genetic algorithms to be applied to try to develop and improve these structures...

But I always want Genetic Algorithms to show up in any discussion about neural networks...

EvanAnderson•2mo ago

I have a real soft spot for the genetic algorithm as a result of reading Levy's "Artificial Life" when I was a kid. The analogy to biological life is more approachable to my poor math education than neural networks. I can grok crossover and mutation pretty easily. Backpropagation is too much for my little brain to handle.

DennisP•2mo ago

I do too, and for the same reasons. Levy's book had a huge impact on me in general.

nrhrjrjrjtntbt•2mo ago

Backprop is learnable through karpathy videos but it takes a lot of patience. The key thing is the chain rule. Get that and the rest is mostly understanding what the bulk operations on tensors are doing (they are usually doing something simple enough but so easy to make mistakes)

embedding-shape•2mo ago

> Backpropagation is too much for my little brain to handle.

I just stumbled upon a very nice description of the core of it, right here: https://www.youtube.com/watch?v=AyzOUbkUf3M&t=133s

Almost all talks by Geoffrey Hinton (left side on https://www.cs.toronto.edu/~hinton/) are in very approachable if you're passingly familiar with some ML.

acjohnson55•2mo ago

You can definitely understand backpropagation, you just gotta find the right explainer.

On a basic level, it's kind of like if you had a calculation for aiming a cannon, and someone was giving you targets to shoot at 1 by 1, and each time you miss the target, they tell you how much you missed by and what direction. You could tweak your calculation each time, and it should get more accurate if you do it right.

Backpropagation is based on a mathematical solution for how exactly you make those tweaks, taking advantage of some calculus. If you're comfortable with calculus you can probs understand it. If not, you might have some background knowledge to pick up first.

bob1029•2mo ago

My entire motivation for using GAs is to get away from back propagation. When you aren't constrained by linearity and chain rule of calculus, you can approach problems very differently.

For example, evolving program tapes is not something you can back propagate. Having a symbolic, procedural representation of something as effective as ChatGPT currently is would be a holy grail in many contexts.

VikingCoder•2mo ago

In grad school, I wrote an ant simulator. There was a 2D grid of squares. I put ant food all over it, in hard-coded locations. Then I had a neural network for an ant. The inputs were "is there any food to the left? to the diagonal left? straight ahead? to the diagonal right? to the right?" The outputs were "turn left, move forward, turn right."

Then I had a multi-layer network - I don't remember how many layers.

Then I was using a simple Genetic Algorithm to try to set the weights.

Essentially, it was like breeding up a winner for the snake game - but you always know where all of the food is, and the ant always started in the same square. I was trying to maximize the score for how many food items the ant would eventually find.

In retrospect, it was pretty stupid. Too much of it was hard-coded, and I didn't have near enough middle layers to do anything really interesting. And I was essentially coming up with a way to not have to do back-propagation.

At the time, I convinced myself I was selecting for instinctive knowledge...

And I was very excited by research that said that, rather than having one pool of 10,000 ants...

It was better to have 10 islands of 1,000 ants, and to occasionally let genetic information travel from one island to another island. The research claimed the overall system would converge faster.

I thought that was super cool, and made me excited that easy parallelism would be rewarded.

I daydream about all of that, still.

CalChris•2mo ago

I'm the same but with vector quantization.

altairprime•2mo ago

That would be an excellent use of GA and all the other 'not based on training a network' methods, now that we have a target and can evaluate against it!

joquarky•2mo ago

I got crazy obsessed with EvoLisa¹ back in the day and although there is nothing in common between that algorithm and those that make up training an LLM, I can't help but feel like they are similar.

¹ https://www.rogeralsing.com/2008/12/07/genetic-programming-e...

dcrimp•2mo ago

I've been messing around with GA recently, esp indirect encoding methods. This paper seems in support of perspectives I've read while researching. In particular, that you can decompose weight matrices into spectral patterns - similar to JPEG compression and search in compressed space.

Something I've been interested in recently is - I wonder if it'd be possible to encode a known-good model - some massive pretrained thing - and use that as a starting point for further mutations.

Like some other comments in this thread have suggested, it would mean we can distill the weight patterns of things like attention, convolution, etc. and not have to discover them by mutation - so - making use of the many phd-hours it took to develop those patterns, and using them as a springboard. If papers like this are to be believed, more advanced mechanisms may be able to be discovered.

api•2mo ago

I immediately started thinking that if there are such patterns maybe they capture something about the deeper structure of the universe.

EvanAnderson•2mo ago

On a hike this weekend my daughter and I talked about the similarities of the branching and bifurcating patterns in the melting ice on a pond, the branches of trees, still photos of lightning, the circulatory system, and the filaments in fractals.

api•2mo ago

Find some images of the entire huge scale structure of the universe. It looks a bit like… a brain.

What does this mean? Probably not nothing, but probably not “the cosmos is the mind of god.” It probably means that we live in a universe that tends to produce repeating nested patterns at different scales.

But maybe that’s part of what makes it possible to evolve or engineer brains that can understand it. If it had no regularity there’d be no common structural motifs.

EvanAnderson•2mo ago

Similar feeling here re: "mind of God". I interpret these patterns as a very simple property of mathematics producing complex-looking patterns and evolution exploiting that complexity. Evolution is the ultimate procedural content generation machine.

kroaton•2mo ago

I thought that was Houdini.

mwkaufma•2mo ago

(Finds a compression artifact) "Is this the meaning of consciousness???"

ibgeek•2mo ago

They are analyzing models trained on classification tasks. At the end of the day, classification is about (a) engineering features that separate the classes and (b) finding a way to represent the boundary. It's not surprising to me that they would find these models can be described using a small number of dimensions and that they would observe similar structure across classification problems. The number of dimensions needed is basically a function of the number of classes. Embeddings in 1 dimension can linearly separate 2 classes, 2 dimensions can linearly separate 4 classes, 3 dimensions can linearly separate 8 classes, etc.

mlpro•2mo ago

The analysis is on image classification, LLMs, Diffusion models, etc.

farhanhubble•2mo ago

Would you see a lower rank subspace if the learned weights were just random vectors?

imtringued•1mo ago

This is a good point, but I think this only works for D*A, where D=Sigma is a diagonal matrix with learnable parameters. It probably doesn't work for a full singular value decomposition (SVD) UDV^T.

Basically, what if we're not actually "training" the model, but rather the model was randomly initialized and the learning algorithm is just selecting the vectors that happen to point into the right direction? A left multiplication of the form D*A with a diagonal matrix is equivalent to multiplying each row in A with the corresponding diagonal element. Low values mean the vector in question was a lottery blank and unnecessary. High values means that this turns out to be correct vector, yay!

But this trivial explanation doesn't work for the full SVD, because you now have a right multiplication U*D. This means each column gets multiplied against the corresponding diagonal element. Both the column in U and row vector in V^T have to perfectly coincide to make the "selection" theory work, which is unlikely to be true for small models, which happen to work just fine.*

AIorNot•2mo ago

Interesting - I wonder if this ties into the Platonic Space Hypothesis recently being championed by computational biologist Mike Levin

E.g

https://youtu.be/Qp0rCU49lMs?si=UXbSBD3Xxpy9e3uY

https://thoughtforms.life/symposium-on-the-platonic-space/

e.g see this paper on Universal Embeddings https://arxiv.org/html/2505.12540v2

"The Platonic Representation Hypothesis [17] conjectures that all image models of sufficient size have the same latent representation. We propose a stronger, constructive version of this hypothesis for text models: the universal latent structure of text representations can be learned and, furthermore, harnessed to translate representations from one space to another without any paired data or encoders.

In this work, we show that the Strong Platonic Representation Hypothesis holds in practice. Given unpaired examples of embeddings from two models with different architectures and training data, our method learns a latent representation in which the embeddings are almost identical"

Also from the OP's Paper we see this on statement:

"Why do these universal subspaces emerge? While the precise mechanisms driving this phenomenon remain an open area of investigation, several theoretical factors likely contribute to the emergence of these shared structures.

First, neural networks are known to exhibit a spectral bias toward low frequency functions, creating a polynomial decay in eigenvalues that concentrates learning dynamics into a small number of dominant directions (Belfer et al., 2024; Bietti et al., 2019).

Second, modern architectures impose strong inductive biases that constrain the solution space: convolutional structures inherently favor local, Gabor-like patterns (Krizhevsky et al., 2012; Guth et al., 2024), while attention mechanisms prioritize recurring relational circuits (Olah et al., 2020; Chughtai et al., 2023).

Third, the ubiquity of gradient-based optimization – governed by kernels that are largely invariant to task specifics in the infinite-width limit (Jacot et al., 2018) – inherently prefers smooth solutions, channeling diverse learning trajectories toward shared geometric manifolds (Garipov et al., 2018).

If these hypotheses hold, the universal subspace likely captures fundamental computational patterns that transcend specific tasks, potentially explaining the efficacy of transfer learning and why diverse problems often benefit from similar architectural modifications."

unionjack22•2mo ago

Dr. Levin’s work is so fascinating. Glad to see his work referenced. If anyone wishes to learn more while idle or commuting, check out Lex Friedman’s podcast episode with him linked above

altairprime•2mo ago

For those trying to understand the most important parts of the paper, here's what I think is the most significant two statements, subquoted out of two (consecutive) paragraphs midway through the paper:

> we selected five additional, previously unseen pretrained ViT models for which we had access to evaluation data. These models, considered out-of-domain relative to the initial set, had all their weights reconstructed by projecting onto the identified 16-dimensional universal subspace. We then assessed their classification accuracy and found no significant drop in performance

> we can replace these 500 ViT models with a single Universal Subspace model. Ignoring the task-variable first and last layer [...] we observe a requirement of 100 × less memory, and these savings are prone to increase as the number of trained models increases. We note that we are, to the best of our knowledge, the first work, to be able to merge 500 (and theoretically more) Vision Transformer into a single universal subspace model. This result implies that hundreds of ViTs can be represented using a single subspace model

So, they found an underlying commonality among the post-training structures in 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; and, they demonstrated that the commonality could in every case be substituted for those in the original models with no loss of function; and noted that they seem to be the first to discover this.

For a tech analogy, imagine if you found a bzip2 dictionary that reduced the size of every file compressed by 99%, because that dictionary turns out to be uniformly helpful for all files. You would immediately open a pull request to bzip2 to have the dictionary built-in, because it would save everyone billions of CPU hours. [*]

[*] Except instead of 'bzip2 dictionary' (strings of bytes), they use the term 'weight subspace' (analogy not included here[**]) — and, 'file compression' hours becomes 'model training' hours. It's just an analogy.

[**] 'Hilbert subspaces' is just incorrect enough to be worth appending as a footnote[***].

[***] As a second footnote.

westoncb•2mo ago

> So, they found an underlying commonality among the post-training structures in 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; and, they demonstrated that the commonality could in every case be substituted for those in the original models with no loss of function; and noted that they seem to be the first to discover this.

Could someone clarify what this means in practice? If there is a 'commonality' why would substituting it do anything? Like if there's some subset of weights X found in all these models, how would substituting X with X be useful?

I see how this could be useful in principle (and obviously it's very interesting), but not clear on how it works in practice. Could you e.g. train new models with that weight subset initialized to this universal set? And how 'universal' is it? Just for like like models of certain sizes and architectures, or in some way more durable than that?

altairprime•2mo ago

Prior to this paper, no one knew that X existed. If this paper proves sound, then now we know that X exists at all.

No matter how large X is, one copy of X baked into the OS / into the silicon / into the GPU / into CUDA, is less than 50+177+8 copies of X baked into every single model. Would that permit future models to be shipped with #include <X.model> as line 1? How much space would that save us? Could X.model be baked into chip silicon so that we can just take it for granted as we would the mathlib constant "PI"? Can we hardware-accelerate the X.model component of these models more than we can a generic model, if X proves to be a 'mathematical' constant?

Given a common X, theoretically, training for models could now start from X rather than from 0. The cost of developing X could be brutal; we've never known to measure it before. Thousands of dollars of GPU per complete training at minimum? Between Google, Meta, Apple, and ChatGPT, the world has probably spent a billion dollars recalculating X a million times. In theory, they probably would have spent another billion dollars over the next year calculating X from scratch. Perhaps now they won't have to?

We don't have a lot of "in practice" experience here yet, because this was first published 4 days ago, and so that's why I'm suggesting possible, plausible, ways this could help us in the future. Perhaps the authors are mistaken, or perhaps I'm mistaken, or perhaps we'll find that the human brain has X in it too. As someone who truly loathes today's "AI", and in an alternate timeline would have completed a dual-major CompSci/NeuralNet degree in ~2004, I'm extremely excited to have read this paper, and to consider what future discoveries and optimizations could result from it.

EDIT:

Imagine if you had to calculate 3.14159 from basic principles every single time you wanted to use pi in your program. Draw a circle to the buffer, measure it, divide it, increase the memory usage of your buffer and resolution of your circle if necessary to get a higher precision pi. Eventually you want pi to a billion digits, so every time your program starts, you calculate pi from scratch to a billion digits. Then, someday, someone realizes that we've all been independently calculating the exact same mathematical constant! Someone publishes Pi: An Encyclopedia (Volume 1 of ∞). It becomes inconceivably easier to render cones and spheres in computer graphics, suddenly! And then someone invents radians, because now that we can map 0..360° onto 0..τ, and no one predicted radians at all but it's incredibly obvious in hindsight.

We take for granted knowledge of things like Pi, but there was a time when we did not know it existed at all. And then for a long time it was 3. And then someone realized the underlying commonality of every circle and defined it plainly, and now we have Pi Day, and Tau Day, because not only do we know it exists, but we can argue about it. How cool is that! So if someone has discovered a new 'constant', then that's always a day of celebration in my book, because it means that we're about to see not only things we consider "possible, but difficult" to instead be "so easy that we celebrate their existence with a holiday", but also things that we could never have remotely dreamed of before we knew that X existed at all.

(In less tangible analogies, see also: postfix notation which was repeatedly invented for decades (by e.g. Dijkstra) as a programming advance, or the movie "Arrival" (2019) as a linguistic advance, or the BLIT Parrot (don't look!) as a biological advance. :)

AIchemist•2mo ago

If even remotely fact what you suggest here, I see two antipodal trajectories the authors secretly huddled and voted on:

1. As John Napier, who freely, generously, gifted his `Mirifici' for the benefit of all.

2. Here we go, patent trolls, have at it. OpenAI, et al burning midnight oil to grab as much real estate on this to erase any (even future?) debt stress, deprecating the AGI Philospher's Stone to first owning everything conceivable from a new miraculous `my precious' ring, not `open', closed.

farhanhubble•2mo ago

It might we worth it to use that subset to initialize the weights of future models but more importantly you could save a huge number of computational cycles by using the lower dimensional weights at the time of inference.

westoncb•2mo ago

Ah interesting, I missed that possibility. Digging a little more though my understanding is that what's universal is a shared basis in weight space, and particular models of same architecture can express their specific weights via coefficients in a lower-dimensional subspace using that universal basis (so we get weight compression, simplified param search). But it also sounds like to what extent there will be gains during inference is in the air?

Key point being: the parameters might be picked off a lower dimensional manifold (in weight space), but this doesn't imply that lower-rank activation space operators will be found. So translation to inference-time isn't clear.

farhanhubble•2mo ago

My understanding differs and I might be wrong. Here's what I inferred:

Let's say you finetune a Mistral-7B. Now, there are hundreds of other fine-tuned Mistral-7B's, which means it's easy to find the universal subspace U of the weights of all these models combined. You can then decompose the weights of your specific model using U and a coefficient matrix C specific to your model. Then you can convert any operation of the type `out=Wh` to `out=U(C*x)` Both U and C are much smaller dimension that W and so the number of matrix operations as well as the memory required is drastically lower.

scotty79•2mo ago

"16 dimensions is all you need" ... to do human achievable stuff at least

scotty79•2mo ago

16 seems like a suspiciously round number ... why not 17 or 13? ... is this just result of some bug in the code they used to do their science?

or is it just that 16 was arbitrarily chosen by them as close enough to the actual minimal number of dimensions necessary?

altairprime•2mo ago

There’s lots of hockey stick charts in the paper that might answer this visually, if that’s of interest.

woopsn•2mo ago

It's a little arbitrary. Look at the graph on page 6, there's no steep gap in the spectrum there. 16 just about the balance point

moi2388•2mo ago

But there is a steep gap in the spectrum at 16 on page 7

yorwba•1mo ago

That's the spectrum of LoRAs, which are LoW RAnk by design.

moi2388•1mo ago

Yes. But from their paper: “In our analysis, we present compelling empirical evidence for the existence of universal subspaces within LoRA adapters across different modalities and tasks.”

I also don’t understand what they write under figure 2, since resnet50 has 50 layers, not 31.

N_Lens•2mo ago

If models naturally occupy shared spectral subspaces, this could dramatically reduce

- Training costs: We might discover these universal subspaces without training thousands of models

- Storage requirements: Models could share common subspace representations

tsurba•2mo ago

Edit: actually this paper is the canonical reference (?): https://arxiv.org/abs/2007.00810 models converge to same space up to a linear transformation. Makes sense that a linear transformation (like PCA) would be able to undo that transformation.

Similarly I would expect that transformers trained on the same loss function for predicting the next word, if the data is at all similar (like human language), would converge to approx the same space, up to some, likely linear, transformations. And to represent that same space probably weights are similar, too. Weights in general seem to occupy low-dimensional spaces.

All in all, I don’t think this is that surprising, and I think the theoretical angle should be (have been?) to find mathematical proofs like this paper https://openreview.net/forum?id=ONfWFluZBI

They also have a previous paper (”CEBRA”) published in Nature with similar results.

masteranza•2mo ago

It's basically way better than LoRA under all respects and could even be used to speed up inference. I wonder whether the big models are not using it already... If not we'll see a blow up in capabilities very, very soon. What they've shown is that you can find the subset of parameters responsible for transfer of capability to new tasks. Does it apply to completely novel tasks? No, that would be magic. Tasks that need new features or representations break the method, but if it fits in the same domain then the answer is "YES".

Here's a very cool analogy from GPT 5.1 which hits the nail in the head in explaining the role of subspace in learning new tasks by analogy with 3d graphics.

  Think of 3D character animation rigs:
  
   • The mesh has millions of vertices (11M weights).
  
   • Expressions are controlled via:
  
   • “smile”
  
   • “frown”
  
   • “blink”
  
  Each expression is just:
  
  mesh += α_i \* basis_expression_i
  
  Hundreds of coefficients modify millions of coordinates.

mlpro•2mo ago

It does seem to be working for novel tasks.

topspin•2mo ago

> Does it apply to completely novel tasks? No, that would be magic.

Are there novel tasks? Inside the limits of physics, tasks are finite, and most of them are pointless. One can certainly entertain tasks that transcend physics, but that isn't necessary if one merely wants an immortal and indomitable electronic god.

janalsncm•2mo ago

Within the context of this paper, novel just means anything that’s not a vision transformer.

odyssey7•2mo ago

Now that we know about this, that the calculations in the trained models follow some particular forms, is there an approximation algorithm to run the models without GPUs?

nothrowaways•2mo ago

What if all models are secretly just fine tunes of llama?

nextworddev•2mo ago

The central claim, or "Universal Weight Subspace Hypothesis," is that deep neural networks, even when trained on completely different tasks (like image recognition vs. text generation) and starting from different random conditions, tend to converge to a remarkably similar, low-dimensional "subspace" in their massive set of weights.

nothrowaways•2mo ago

> Principal component analysis of 200 GPT2, 500 Vision Transformers, 50 LLaMA- 8B, and 8 Flan-T5 models reveals consistent sharp spectral decay - strong evidence that a small number of weight directions capture dominant variance despite vast differences in training data, objectives, and initialization.

Isn't it obvious?

stingraycharles•2mo ago

Well intuitively it makes sense that within each independent model, a small number of weights / parameters are very dominant, but it’s still super interesting that these can be swapped between all the models without loss of performance.

It isn’t obvious that these parameters are universal across all models.

mlpro•2mo ago

Not really. If the models are trained on different dataset - like one ViT trained on satellite images and another on medical X-rays - one would expect their parameters, which were randomly initialized to be completely different or even orthogonal.

crooked-v•2mo ago

Now I wonder how much this "Universal Subspace" corresponds to the same set of scraped Reddit posts and pirated books that apparently all the bigcorps used for model training. Is it 'universal' because it's universal, or because the same book-pirating torrents got reused all over?

energy123•2mo ago

Every vision task needs edge/contrast/color detectors and these should be mostly the same across ViTs, needing only a rotation and scaling in the subspace. Likewise with language tasks and encoding the basic rules of language which are the same regardless of application. So it is no surprise to see intra-modality shared variation.

The surprising thing is inter-modality shared variation. I wouldn't have bet against it but I also wouldn't have guessed it.

I would like to see model interpretability work into whether these subspace vectors can be interpreted as low level or high level abstractions. Are they picking up low level "edge detectors" that are somehow invariant to modality (if so, why?) or are they picking up higher level concepts like distance vs. closeness?

TheOtherHobbes•2mo ago

It hints there may be common higher-level abstraction and compression processes in human consciousness.

The "human" part of that matters. This is all human-made data, collected from human technology, which was created to assist human thinking and experience.

So I wonder if this isn't so much about universals or Platonic ideals. More that we're starting to see the outlines of the shapes that define - perhaps constrict - our own minds.

levocardia•2mo ago

This general idea shows up all over the place though. If you do 3D scans on thousands of mammal skulls, you'll find that a few PCs account for the vast majority of the variance. If you do frequency domain analysis of various physiological signals...same thing. Ditto for many, many other natural phenomena in the world. Interesting (maybe not surprising?) to see it in artificial phenomena as well

vintermann•2mo ago

It's almost an artifact of PCA. You'll find "important" principal components everywhere you look. It takes real effort to construct a dataset where you don't. That doesn't mean though, for instance, that throwing away the less important principal components of an image is the best way to compress an image.

horsepatties•2mo ago

I hope that this leads to more efficient models. And it’s intuitive- it seems as though you could find the essence of a good model and a model reduced to that essence would be more efficient. But, this is theoretical. I can also theorize flying cars- many have, it seems doable and achievable, but yet I see no flying cars on my way to work.

hn_throwaway_99•2mo ago

I read the abstract (not the whole paper) and the great summarizing comments here.

Beyond the practical implications of this (i.e. reduced training and inference costs), I'm curious if this has any consequences for "philosophy of the mind"-type of stuff. That is, does this sentence from the abstract, "we identify universal subspaces capturing majority variance in just a few principal directions", imply that all of these various models, across vastly different domains, share a large set of common "plumbing", if you will? Am I understanding that correctly? It just sounds like it could have huge relevance to how various "thinking" (and I know, I know, those scare quotes are doing a lot of work) systems compose their knowledge.

gedy•2mo ago

It could, though maybe "just" in a similar way that human brains are the same basic structure.

themaxice•2mo ago

Somewhat of a tangent, but if you enjoy the philosophy of AI and mathematics, I highly recommend reading Gödel, Escher, Bach: an Eternal Golden Braid by D. Hofstadter. It is primarily about the Incompleteness Theorem, but does touch on AI and what we understand as being an intelligence

inciampati•2mo ago

The authors study a bunch of wild low rank fine tunes and discover that they share a common... low rank! ... substructure which is itself base model dependent. Humans are (genetically) the same. You need only a handful of PCs to represent the cast majority of variation. But that's because of our shared ancestry. And maybe the same thing is going on here.

modeless•2mo ago

This seems confusingly phrased. When they say things like "500 Vision Transformers", what they mean is 500 finetunes of the same base model, downloaded from the huggingface accounts of anonymous randos. These spaces are only "universal" to a single pretrained base model AFAICT. Is it really that surprising that finetunes would be extremely similar to each other? Especially LoRAs?

I visited one of the models they reference and huggingface says it has malware in it: https://huggingface.co/lucascruz/CheXpert-ViT-U-MultiClass

mlpro•2mo ago

Why would they be similar if they are trained on very different data? Also, trained from scratch models are also analyzed, imo.

modeless•2mo ago

They are trained on exactly the same data in the same order with the same optimizer because they are literally the same base model. With a little fine tuning added on top.

I see now that they did one experiment with trained from scratch models. They trained five Resnet-50s on five disjoint datasets of natural images, most quite small. And IIUC they were able to, without further training, combine them into one "universal" model that can be adapted to have only somewhat worse performance on any one of the five datasets (actually one of them is pretty bad) using only ~35 adaptation parameters. Which is kind of cool I guess but I also don't find it that surprising?

I don't expect that you'd get the same finding at large scale in LLMs trained from scratch on disjoint and dissimilar data with different optimizers etc. I would find that surprising. But it would be very expensive to do that experiment so I understand why they weren't able to.

mlpro•2mo ago

They are not trained on the same data. Even a skim of the paper shows very disjoint data.

The LLMs are finetuned on very disjoint data. I checked some are on Chinese and other are for Math. The pretrained model provides a good initialization. I'm convinced.

godelski•2mo ago

I think there's two maybe subtle, but key concepts you're missing.

  1) "pertaining"
  2) architecture

1) Yes, they're trained on different data but "tune" implies most of the data is identical. So it should be surprising if the models end up significantly different.

2) the architecture and training methods matter. As a simple scenario to make things a bit easier to understand let's say we have two models with identical architectures and we'll use identical training methods (e.g. optimizer, learning rate, all that jazz) but learn on different data. Also to help so you can even reproduce this on your own let's train one on MNIST (numbers) and the other in FashionMNIST (clothing).

Do you expect these models to have similar latent spaces? You should! This is because despite the data being very different visually there are tons of implicit information that's shared (this is a big reason we do tuning in the first place!). One of the most obvious things you'll see is subnetworks that do edge detection (there's a famous paper showing this with convolutions but transformers do this too, just in a bit different way). The more similar the data (orders shouldn't matter too much with modern training methods but it definitely influences things) the more similar this will be too. So if we trained on LAION we should expect it to do really well on ImageNet because even if there aren't identical images (there are some) there are the same classes (even if labels are different)[0].

If you think a bit here you'll actually realize that some of this will happen even if you change architectures because some principles are the same. Where the architecture similarity and training similarity really help is that they bias features being learned at the same rate and in the same place. But this idea is also why you can distill between different architectures, not just by passing the final output but even using intermediate information.

To help, remember that these models converge. Accuracy jumps a lot in the beginning then slows. For example you might get 70% accuracy in a few epochs but need a few hundred to get to 90% (example numbers). So ask yourself "what's being learned first and why?" A lot will make more sense if you do this.

[0] I have a whole rant on the indirect of saying "zero shot" on ImageNet (or COCO) when trained in things like LAION or JFT. It's not zero shot because ImageNet is in distribution! We wouldn't say "we zero shotted the test set" smh

augment_me•2mo ago

The trained from scratch models are similar because CNN's are local and impose a strong inductive bias. If you train a CNN for any task of recognizing things, you will find edge detection filters in the first layers for example. This can't happen for attention the same way because its a global association, so the paper failed to find this using SVD and just fine-tuned existing models instead.

daemonologist•2mo ago

I agree - the results on the finetunes are not very surprising. The trained-from-scratch ResNets (Figure 2 and Section 3.2.1) are definitely more interesting, though somewhat limited in scope.

In any case, my impression is that this is not immediately more useful than a LoRA (and is probably not intended to be), but is maybe an avenue for further research.

augment_me•2mo ago

I don't think its that surprising actually. And I think the paper in general completely oversells the idea.

The ResNet results hold from scratch because strict local constraints (e.g., 3x3 convolutions) force the emergence of fundamental signal-processing features (Gabor/Laplacian filters) regardless of the dataset. The architecture itself enforces the subspace.

The Transformer/ViT results rely on fine-tunes because of permutation symmetry. If you trained two ViTs from scratch, "Attention Head 4" in Model A might be functionally identical to "Head 7" in Model B, but mathematically orthogonal.

Because the authors' method (SVD) lacks a neuron-alignment step, scratch-trained ViTs would not look aligned. They had to use pre-trained models to ensure the weights shared a coordinate system. Effectively, I think that they proved that CNNs converge due to it's arch, but for Transformers, they mostly just confirmed that fine-tuning doesn't drift far from the parent model.

rhaps0dy•2mo ago

Thank you for saving me a skim

swivelmaster•2mo ago

You’ve explained this in plain and simple language far more directly than the linked study. Score yet another point for the theory that academic papers are deliberately written to be obtuse to laypeople rather than striving for accessibility.

bmacho•2mo ago

Vote for the Party that promises academic grants for people that write 1k character long forum posts for the laypeople instead of other experts of the field.

swivelmaster•2mo ago

I’m not sure that’s something we get to vote on.

eru•2mo ago

On the margin, you can let anything influence your voting decision.

swivelmaster•2mo ago

File under "technically true but not particularly useful"

eru•1mo ago

Well, it's not like voting is particularly useful in the first place.

rocqua•2mo ago

I don't think the parent post is complaining that academics are writing proposals (e.g as opposed to people with common sense). Instead, it seems to me that he is complaining that academics are writing proposals and papers to impress funding committees and journal editors, and to some extend to increase their own clout among their peers. Instead of writing to communicate clearly and honestly to their peers, or occasionally to laymen.

And this critique is likely not aimed at academics so much as the systems and incentives of academia. This is partially on the parties managing grants (caring much more about impact and visibility than actually moving science forwards, which means everyone is scrounging for or lying about low hanging fruit). It is partially on those who set (or rather maintain) the culture at academic institutions of gathering clout by getting 'impactful' publications. And those who manage journals also share blame, by trying to defend their moat, very much hamming up "high impact", and aggressively rent-seeking.

shoubidouwah•2mo ago

and hope for a president that can do both

mapt•2mo ago

We have this already. It's called an abstract. Some do it better than others.

Perhaps we need to revisit the concept and have a narrow abstract and a lay abstract, given how niche science has become.

mlpro•2mo ago

I think its very surprising, although I would like the paper to show more experiments (they already have a lot, i know).

The ViT models are never really trained from scratch - they are always finetuned as they require large amounts of data to converge nicely. The pretraining just provides a nice initialization. Why would one expect two ViT's finetuned on two different things - image and text classification end up in the same subspace as they show? I think this is groundbreaking.

I don't really agree with the drift far from the parent model idea. I think they drift pretty far in terms of their norms. Even the small LoRA adapters drift pretty far from the base model.

Havoc•2mo ago

Looks like both mistral and llamas per text but yeah incredibly underwhelming for „universal“

markisus•2mo ago

Each fine tune drags the model weights away from the base model in a certain direction.

Given 500 fine tune datasets, we could expect the 500 drag directions to span a 500 dimensional space. After all, 500 random vectors in a high dimensional space are likely to be mutually orthogonal.

The paper shows, however, that the 500 drag directions live in a ~40 dimensional subspace.

Another way to say it is that you can compress fine tune weights into a vector of 40 floats.

Imagine if, one day, fine tunes on huggingface were not measured in gigabytes, megabytes, or even kilobytes. Suppose you started to see listings like 160 bytes. Would that be surprising?

I’m leaving out the detail that the basis direction vectors themselves would have to be on your machine and each basis direction is as big as the model itself. And I’m also taking for granted that the subspace dimension will not increase as the number of fine tune datasets increases.

I agree that the authors decision to use random models on hugging face is unfortunate. I’m hopeful that this paper will inspire follow up works that train large models from scratch.

mapontosevenths•2mo ago

Agreed. What's surprising here to me isn't that the fine tunes are compressible, it's the degree to which they're compressible. It seems like very little useful new information is being added by the fine-tune.

They're using SVD to throw away almost all of the "new information" and apparently getting solid results anyhow. Which of course raises interesting questions if replicable. The code doesn't seem to have been released yet though.

farhanhubble•1mo ago

Yeah but it also made me think if deep down neural networks are curated random basis vectors, like in random projections.

tech_ken•2mo ago

This is an important clarification; from the abstract and title I was super confused how they identified a "subspace" that could be consistently identified across model structures (I was assuming they meant that they saw stability in the dimension of the weight subspace or something), but if they're just referring to one model class that clears things up substantially. It's definitely also a much weaker result IMO, basically just confirming that the model's loss function has a well-posed minima, which...duh? I mean I guess I'm glad someone checked that, but called it "the universal weight subspace hypothesis" seems a bit dramatic.

alyxya•2mo ago

I’ve had a hard time parsing what exactly the paper is trying to explain. So far I’ve understood that their comparison seems to be models within the same family and same weight tensor dimensions, so they aren’t showing a common subspace when there isn’t a 1:1 match between weight tensors in a ViT and GPT2. The plots showing the distribution of principal component values presumably does this on every weight tensor, but this seems to be an expected result that the principal component values shows a decaying curve like a log curve where only a few principal components are the most meaningful.

What I don’t get is what is meant by a universal shared subspace, because there is some invariance regarding the specific values in weights and the directions of vectors in the model. For instance, if you were doing matrix multiplication with a weight tensor, you could swap two rows/columns (depending on the order of multiplication) and all that would do is swap two values in the resulting product, and whatever uses that output could undo the effects of the swap so the whole model has identical behavior, yet you’ve changed the direction of the principal components. There can’t be fully independently trained models that share the exact subspace directions for analogous weight tensors because of that.

seeknotfind•2mo ago

Yeah, it sounds platonic the way it's written, but it seems more like a hyped model compression technique.

RandyOrion•2mo ago

> From their project page:

> We analyze over 1,100 deep neural networks—including 500 Mistral-7B LoRAs and 500 Vision Transformers. We provide the first large-scale empirical evidence that networks systematically converge to shared, low-dimensional spectral subspaces, regardless of initialization, task, or domain.

I instantly thought of muon optimizer which provides high-rank gradient updates and Kimi-k2 which is trained using muon, and see no related references.

The 'universal' in the title is not that universal.

lucid-dev•2mo ago

Pretty funny if you ask me. Maybe we can start to realize now: "The common universal subspace between human individuals makes it easier for all of them to do 'novel' tasks so long as their ego and personality doesn't inhibit that basic capacity."

And that: "Defining 'novel' as 'not something that you've said before even though your using all the same words, concepts, linguistic tools, etc., doesn't actually make it 'novel'"

Point being, yeah duh, what's the difference between what any of these models are doing anyway? It would be far more surprising if they discovered a *different* or highly-unique subspace for each one!

Someone gives you a magic lamp and the genie comes out and says "what do you wish for"?

That's still the question. The question was never "why do all the genies seem to be able to give you whatever you want?"

tsurba•2mo ago

Many discriminative models converge to same representation space up to a linear transformation. Makes sense that a linear transformation (like PCA) would be able to undo that transformation.

https://arxiv.org/abs/2007.00810

Without properly reading the linked article, if thats all this is, not a particularly new result. Nevertheless this direction of proofs is imo at the core of understanding neural nets.

mlpro•2mo ago

It's about weights/parameters, not representations.

tsurba•2mo ago

True, good point, maybe not a straightforward consequence to extend to weights.

Atlas667•2mo ago

Imagine collectively trying to recreate a human brain with semiconductors so capitalists can save money by not having to employ as many people

tim333•2mo ago

There are other reasons beyond the employment thing. Understanding how the mind works maybe.

Atlas667•2mo ago

Understand the mind to then exploit it.

Why else would they put so much money into something if not to try and get more out of it?

Capitalists' morals are driven by their social position. To them this is right becauae its rewarding. To us its an akin abomination we create that destroys us

But the problem isnt inherently tech. Its how society is structured around it that allows it to be used against us.

tim333•1mo ago

A lot of the basic research including this paper is out of academia rather than business.

Atlas667•1mo ago

Do you know how a lot of research is funded?

Simplita•2mo ago

Curious if this connects with the sparse subnetwork work from last year. There might be an overlap in the underlying assumptions.

augment_me•2mo ago

I think the paper in general completely oversells the idea of "universality".

For CNNs, the 'Universal Subspace' is simply the strong inductive bias (locality) forcing filters into standard signal processing shapes (Laplacian/Gabor) regardless of the data. Since CNNs are just a constrained subset of operations, this convergence is not that surprising.

For Transformers, which lack these local constraints, the authors had to rely on fine-tuning (shared initialization) to find a subspace. This confirms that 'Universality' here is really just a mix of CNN geometric constraints and the stability of pre-training, rather than a discovered intrinsic property of learning.

sigbottle•2mo ago

For me at least, I wasn't even under the impression that this was a possible research angle to begin with. Crazy stuff that people are trying, and very cool too!

tempestn•2mo ago

After reading the title I'm disappointed this isn't some new mind-bending theory about the relativistic nature of the universe.

zkmon•2mo ago

Something tells me this is probably as important as the "attention is all you need".

pmkary•2mo ago

Plato's forms finally being proven...

statusfailed•2mo ago

I saw a similar (I think!) paper "Grassmannian Optimization Drives Generalization in Overparameterized DNN" at OPT-ML at neurips last week[0]

This is a little outside my area, but I think the relevant part of that abstract is "Gradient-based optimization follows horizontal lifts across low-dimensional subspaces in the Grassmannian Gr(r, p), where r p is the rank of the Hessian at the optimum"

I think this question is super interesting though: why can massively overparametrised models can still generalise?

[0]: https://opt-ml.org/papers/2025/paper90.pdf

zkmon•2mo ago

So, while the standard models are like herbivores grazing on the internet data, they built a model that is a carnivore or a predator species trained on other models? Sounds like an evolution of the species.

IAmBroom•2mo ago

If I can understand your metaphor, it's probably not sophisticated enough to be relevant.

- I know what I do not know.

-- I do not know AI.

hagsdp00•1mo ago

I have been trying to reproduce ("vibecoded" with some care) their results for the 500 loras part which I am familiar with and unfortunately can not see that drop at rank 16 that they show in their Figure and use for further claims. Looking forward to their code :)

mlpro•1mo ago

Read the paper end to end today. I think its the most outrageous ideas of 2025 - at least amongst the papers I've read. So counterintuitive initially and yet so intuitive. Personally, kinda hate the implications. But, a paper like this was definitely needed.

SectorC: A C Compiler in 512 bytes

The F Word

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Tiny C Compiler

Speed up responses with fast mode

Software factories and the agentic moment

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

The Waymo World Model

First Proof

Show HN: A luma dependent chroma compression algorithm (image compression)

Al Lowe on model trains, funny deaths and working with Disney

I write games in C (yes, C)

Italy Railways Sabotaged

Vocal Guide – belt sing without killing yourself

Start all of your commands with a comma (2009)

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Reinforcement Learning from Human Feedback

Selection Rather Than Prediction

72M Points of Interest

A Fresh Look at IBM 3270 Information Display System

The AI boom is causing shortages everywhere else

Coding agents have replaced every framework I used

France's homegrown open source online office suite

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Learning from context is harder than we thought

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

SectorC: A C Compiler in 512 bytes

The F Word

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Tiny C Compiler

Speed up responses with fast mode

Software factories and the agentic moment

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

The Waymo World Model

First Proof

Show HN: A luma dependent chroma compression algorithm (image compression)

Al Lowe on model trains, funny deaths and working with Disney

I write games in C (yes, C)

Italy Railways Sabotaged

Vocal Guide – belt sing without killing yourself

Start all of your commands with a comma (2009)

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Reinforcement Learning from Human Feedback

Selection Rather Than Prediction

72M Points of Interest

A Fresh Look at IBM 3270 Information Display System

The AI boom is causing shortages everywhere else

Coding agents have replaced every framework I used

France's homegrown open source online office suite

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Learning from context is harder than we thought

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

The universal weight subspace hypothesis

Comments