frontpage.

The universal weight subspace hypothesis

https://arxiv.org/abs/2512.05117
110•lukeplato•3h ago•38 comments

Kroger acknowledges that its bet on robotics went too far

https://www.grocerydive.com/news/kroger-ocado-close-automated-fulfillment-centers-robotics-grocer...
85•JumpCrisscross•3h ago•78 comments

Icons in Menus Everywhere – Send Help

https://blog.jim-nielsen.com/2025/icons-in-menus/
272•ArmageddonIt•7h ago•109 comments

Jepsen: NATS 2.12.1

https://jepsen.io/analyses/nats-2.12.1
293•aphyr•8h ago•106 comments

Horses: AI progress is steady. Human equivalence is sudden

https://andyljones.com/posts/horses.html
197•pbui•3h ago•119 comments

The Lost Machine Automats and Self-Service Cafeterias of NYC (2023)

https://www.untappedcities.com/automats-cafeterias-nyc/
33•walterbell•2h ago•15 comments

OSHW: Small tablet based on RK3568 and AMOLED screen

https://oshwhub.com/oglggc/rui-xin-wei-rk3568-si-ceng-jia-li-chuang-mian-fei-gong-yi
29•thenthenthen•5d ago•3 comments

Strong earthquake hits northern Japan, tsunami warning issued

https://www3.nhk.or.jp/nhkworld/en/news/20251209_02/
272•lattis•12h ago•135 comments

AMD GPU Debugger

https://thegeeko.me/blog/amd-gpu-debugging/
217•ibobev•11h ago•36 comments

Scientific and Technical Amateur Radio

https://destevez.net/
25•gballan•2h ago•2 comments

Let's put Tailscale on a jailbroken Kindle

https://tailscale.com/blog/tailscale-jailbroken-kindle
243•Quizzical4230•11h ago•57 comments

Latency Profiling in Python: From Code Bottlenecks to Observability

https://quant.engineering/latency-profiling-in-python.html
16•rundef•6d ago•2 comments

Hunting for North Korean Fiber Optic Cables

https://nkinternet.com/2025/12/08/hunting-for-north-korean-fiber-optic-cables/
226•Bezod•10h ago•58 comments

Microsoft increases Office 365 and Microsoft 365 license prices

https://office365itpros.com/2025/12/08/microsoft-365-pricing-increase/
277•taubek•13h ago•326 comments

IBM to acquire Confluent

https://www.confluent.io/blog/ibm-to-acquire-confluent/
352•abd12•13h ago•283 comments

Has the cost of building software dropped 90%?

https://martinalderson.com/posts/has-the-cost-of-software-just-dropped-90-percent/
185•martinald•8h ago•334 comments

Trials avoid high risk patients and underestimate drug harms

https://www.nber.org/papers/w34534
78•bikenaga•8h ago•31 comments

Show HN: Fanfa – Interactive and animated Mermaid diagrams

https://fanfa.dev/
71•bairess•4d ago•16 comments

Cassette tapes are making a comeback?

https://theconversation.com/cassette-tapes-are-making-a-comeback-yes-really-268108
47•devonnull•4d ago•57 comments

Microsoft Download Center Archive

https://legacyupdate.net/download-center/
130•luu•3d ago•15 comments

Paramount launches hostile bid for Warner Bros

https://www.cnbc.com/2025/12/08/paramount-skydance-hostile-bid-wbd-netflix.html
261•gniting•13h ago•249 comments

AI should only run as fast as we can catch up

https://higashi.blog/2025/12/07/ai-verification/
118•yuedongze•9h ago•114 comments

A series of tricks and techniques I learned doing tiny GLSL demos

https://blog.pkh.me/p/48-a-series-of-tricks-and-techniques-i-learned-doing-tiny-glsl-demos.html
146•ibobev•10h ago•16 comments

Launch HN: Nia (YC S25) – Give better context to coding agents

https://www.trynia.ai/
94•jellyotsiro•10h ago•68 comments

Deep dive on Nvidia circular funding

https://philippeoger.com/pages/deep-dive-into-nvidias-virtuous-cycle
282•jeanloolz•8h ago•159 comments

We collected 10k hours of neuro-language data in our basement

https://condu.it/thought/10k-hours
96•nee1r•10h ago•58 comments

Legion Health (YC S21) is hiring a founding engineer (SF, in-person)

1•the_danny_g•10h ago

Nova Programming Language

https://nova-lang.net
90•surprisetalk•12h ago•47 comments

No more O'Reilly subscriptions for me

https://zerokspot.com/weblog/2025/12/05/no-more-oreilly-subscriptions-for-me/
124•speckx•11h ago•114 comments

Intel 8086 Microcode Explorer

https://nand2mario.github.io/8086_microcode.html
20•todsacerdoti•4d ago•3 comments

The universal weight subspace hypothesis

https://arxiv.org/abs/2512.05117
107•lukeplato•3h ago

Comments

CGMthrowaway•2h ago
They compressed the compression? Or identified an embedding that can "bootstrap" training with a head start?

Not a technical person, just trying to put it in other words.

vlovich123•2h ago
They identified that the compressed representation has structure to it that could potentially be discovered more quickly. It’s unclear if it would also make it easier to compress further but that’s possible.
canjobear•2h ago
What's the relationship with the Platonic Representation Hypothesis?
MarkusQ•2h ago
From what I can tell, they are very closely related (i.e. the shared representational structures would likely make good candidates for Platonic representations, or rather, representations of Platonic categories). In any case, it seems like there should be some sort of interesting mapping between the two.
unionjack22•2h ago
I hope someone much smarter than I answers this. I’ve been noticing an uptick in Platonic and neo-Platonic discourse in the zeitgeist and am wondering if we’re converging on something profound.
altairprime•18m ago
Same hat, except 18 months later, assuming it survives peer review, reproduction, etc. (or: "The newer one proposes evidence that appears to support the older one.")

https://arxiv.org/abs/2405.07987

kacesensitive•2h ago
Interesting.. this could make training much faster if there’s a universal low-dimensional space that models naturally converge into, since you could initialize or constrain training inside that space instead of spending massive compute rediscovering it from scratch every time.
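
A rough sketch of what "constrain training inside that space" could mean mechanically, in PyTorch (the basis here is random and purely illustrative; the paper's claim is that a shared basis recovered from many already-trained models exists):

  import torch
  import torch.nn as nn

  class SubspaceLinear(nn.Module):
      """Linear layer whose weight matrix is confined to a fixed k-dimensional subspace."""
      def __init__(self, d_in, d_out, k=16):
          super().__init__()
          flat = d_in * d_out
          # Fixed, non-trainable basis for the weight subspace (random here; a
          # "universal" basis would instead be estimated from many trained models).
          basis = torch.linalg.qr(torch.randn(flat, k)).Q
          self.register_buffer("basis", basis)          # (flat, k), never updated
          self.alpha = nn.Parameter(torch.zeros(k))     # only k trainable coefficients
          self.bias = nn.Parameter(torch.zeros(d_out))
          self.d_in, self.d_out = d_in, d_out

      def forward(self, x):
          w = (self.basis @ self.alpha).view(self.d_out, self.d_in)
          return x @ w.T + self.bias

  layer = SubspaceLinear(128, 64, k=16)
  opt = torch.optim.SGD([layer.alpha, layer.bias], lr=1e-2)  # gradients only touch alpha and bias
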
odyssey7•1h ago
Or an architecture chosen for that subspace or some of its properties as inductive biases.
bigbuppo•42m ago
Wouldn't this also mean that there's an inherent limit to that sort of model?
rhaen•23m ago
Not strictly speaking? A universal subspace can be identified without necessarily being finite.

As a really stupid example: the sets of integers less than 2, 8, 5, and 30 can all be embedded in the set of integers less than 50, but that doesn’t require that the set of integers is finite. You can always get a bigger one that embeds the smaller.

VikingCoder•2h ago
I find myself wanting genetic algorithms to be applied to try to develop and improve these structures...

But I always want Genetic Algorithms to show up in any discussion about neural networks...

EvanAnderson•2h ago
I have a real soft spot for the genetic algorithm as a result of reading Levy's "Artificial Life" when I was a kid. The analogy to biological life is more approachable to my poor math education than neural networks. I can grok crossover and mutation pretty easily. Backpropagation is too much for my little brain to handle.
DennisP•1h ago
I do too, and for the same reasons. Levy's book had a huge impact on me in general.
nrhrjrjrjtntbt•30m ago
Backprop is learnable through Karpathy videos but it takes a lot of patience. The key thing is the chain rule. Get that, and the rest is mostly understanding what the bulk operations on tensors are doing (they are usually doing something simple enough, but it’s easy to make mistakes).
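
For anyone who wants the chain rule made concrete, here is a tiny hand-rolled sketch (NumPy, one neuron per layer, nothing more):

  import numpy as np

  x, target = 0.5, 1.0
  w1, w2 = 0.3, -0.8

  # forward pass
  h = np.tanh(w1 * x)              # hidden activation
  y = w2 * h                       # output
  loss = 0.5 * (y - target) ** 2

  # backward pass: the chain rule, applied one step at a time
  dloss_dy = y - target
  dloss_dw2 = dloss_dy * h                    # dL/dw2 = dL/dy * dy/dw2
  dloss_dh = dloss_dy * w2                    # dL/dh  = dL/dy * dy/dh
  dloss_dw1 = dloss_dh * (1 - h ** 2) * x     # tanh'(z) = 1 - tanh(z)^2

  print(dloss_dw1, dloss_dw2)
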
embedding-shape•9m ago
> Backpropagation is too much for my little brain to handle.

I just stumbled upon a very nice description of the core of it, right here: https://www.youtube.com/watch?v=AyzOUbkUf3M&t=133s

Almost all talks by Geoffrey Hinton (left side on https://www.cs.toronto.edu/~hinton/) are very approachable if you're passingly familiar with some ML.

CalChris•1h ago
I'm the same but with vector quantization.
api•2h ago
I immediately started thinking that if there are such patterns maybe they capture something about the deeper structure of the universe.
EvanAnderson•1h ago
On a hike this weekend my daughter and I talked about the similarities of the branching and bifurcating patterns in the melting ice on a pond, the branches of trees, still photos of lightning, the circulatory system, and the filaments in fractals.
mwkaufma•2h ago
(Finds a compression artifact) "Is this the meaning of consciousness???"
ibgeek•1h ago
They are analyzing models trained on classification tasks. At the end of the day, classification is about (a) engineering features that separate the classes and (b) finding a way to represent the boundary. It's not surprising to me that they would find these models can be described using a small number of dimensions and that they would observe similar structure across classification problems. The number of dimensions needed is basically a function of the number of classes. Embeddings in 1 dimension can linearly separate 2 classes, 2 dimensions can linearly separate 4 classes, 3 dimensions can linearly separate 8 classes, etc.
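
A toy version of that counting argument (my construction, not from the paper): put one class in each orthant of a d-dimensional embedding, and d sign tests - one linear boundary per axis - are enough to tell all 2^d classes apart.

  import numpy as np

  d = 3                                       # 3 dimensions -> up to 2^3 = 8 classes
  rng = np.random.default_rng(0)

  # one class centered in each orthant, coordinates +/-1 according to the class's bit pattern
  centers = np.array([[((b >> i) & 1) * 2 - 1 for i in range(d)] for b in range(2 ** d)])
  points = centers[:, None, :] + 0.2 * rng.standard_normal((2 ** d, 50, d))

  # d linear (sign) decisions recover all 2^d labels
  bits = (points > 0).astype(int)
  predicted = (bits * (2 ** np.arange(d))).sum(axis=-1)
  print((predicted == np.arange(2 ** d)[:, None]).mean())   # ~1.0
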
mlpro•22m ago
The analysis is on image classification, LLMs, Diffusion models, etc.
farhanhubble•1h ago
Would you see a lower rank subspace if the learned weights were just random vectors?
AIorNot•1h ago
Interesting - I wonder if this ties into the Platonic Space Hypothesis recently being championed by computational biologist Mike Levin

E.g

https://youtu.be/Qp0rCU49lMs?si=UXbSBD3Xxpy9e3uY

https://thoughtforms.life/symposium-on-the-platonic-space/

e.g see this paper on Universal Embeddings https://arxiv.org/html/2505.12540v2

"The Platonic Representation Hypothesis [17] conjectures that all image models of sufficient size have the same latent representation. We propose a stronger, constructive version of this hypothesis for text models: the universal latent structure of text representations can be learned and, furthermore, harnessed to translate representations from one space to another without any paired data or encoders.

In this work, we show that the Strong Platonic Representation Hypothesis holds in practice. Given unpaired examples of embeddings from two models with different architectures and training data, our method learns a latent representation in which the embeddings are almost identical"

Also from the OP's Paper we see this on statement:

"Why do these universal subspaces emerge? While the precise mechanisms driving this phenomenon remain an open area of investigation, several theoretical factors likely contribute to the emergence of these shared structures.

First, neural networks are known to exhibit a spectral bias toward low frequency functions, creating a polynomial decay in eigenvalues that concentrates learning dynamics into a small number of dominant directions (Belfer et al., 2024; Bietti et al., 2019).

Second, modern architectures impose strong inductive biases that constrain the solution space: convolutional structures inherently favor local, Gabor-like patterns (Krizhevsky et al., 2012; Guth et al., 2024), while attention mechanisms prioritize recurring relational circuits (Olah et al., 2020; Chughtai et al., 2023).

Third, the ubiquity of gradient-based optimization – governed by kernels that are largely invariant to task specifics in the infinite-width limit (Jacot et al., 2018) – inherently prefers smooth solutions, channeling diverse learning trajectories toward shared geometric manifolds (Garipov et al., 2018).

If these hypotheses hold, the universal subspace likely captures fundamental computational patterns that transcend specific tasks, potentially explaining the efficacy of transfer learning and why diverse problems often benefit from similar architectural modifications."

unionjack22•1h ago
Dr. Levin’s work is so fascinating. Glad to see it referenced here. If anyone wishes to learn more while idle or commuting, check out Lex Fridman’s podcast episode with him, linked above.
altairprime•1h ago
For those trying to understand the most important parts of the paper, here are what I think are the two most significant statements, subquoted out of two (consecutive) paragraphs midway through the paper:

> we selected five additional, previously unseen pretrained ViT models for which we had access to evaluation data. These models, considered out-of-domain relative to the initial set, had all their weights reconstructed by projecting onto the identified 16-dimensional universal subspace. We then assessed their classification accuracy and found no significant drop in performance

> we can replace these 500 ViT models with a single Universal Subspace model. Ignoring the task-variable first and last layer [...] we observe a requirement of 100 × less memory, and these savings are prone to increase as the number of trained models increases. We note that we are, to the best of our knowledge, the first work, to be able to merge 500 (and theoretically more) Vision Transformer into a single universal subspace model. This result implies that hundreds of ViTs can be represented using a single subspace model

So, they found an underlying commonality among the post-training structures in 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; and, they demonstrated that the commonality could in every case be substituted for those in the original models with no loss of function; and noted that they seem to be the first to discover this.

For a tech analogy, imagine if you found a bzip2 dictionary that reduced the size of every file compressed by 99%, because that dictionary turns out to be uniformly helpful for all files. You would immediately open a pull request to bzip2 to have the dictionary built-in, because it would save everyone billions of CPU hours. [*]

[*] Except instead of 'bzip2 dictionary' (strings of bytes), they use the term 'weight subspace' (analogy not included here[**]) — and, 'file compression' hours becomes 'model training' hours. It's just an analogy.

[**] 'Hilbert subspaces' is just incorrect enough to be worth appending as a footnote[***].

[***] As a second footnote.
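
A toy sketch of what the projection step might look like mechanically (synthetic weight vectors standing in for real checkpoints; this is my reading of the recipe, not the paper's code):

  import numpy as np

  n_models, n_weights, k = 500, 10_000, 16
  rng = np.random.default_rng(0)

  # pretend the trained models really do live near a shared k-dimensional subspace
  true_basis = np.linalg.qr(rng.standard_normal((n_weights, k)))[0]
  coeffs = rng.standard_normal((n_models, k))
  W = coeffs @ true_basis.T + 0.01 * rng.standard_normal((n_models, n_weights))

  # recover the shared subspace by PCA over the stacked, flattened weights
  mean = W.mean(axis=0)
  _, _, Vt = np.linalg.svd(W - mean, full_matrices=False)
  U = Vt[:k].T                                    # (n_weights, k) recovered basis

  # reconstruct a "new" model by projecting its weights onto the recovered subspace
  w_new = rng.standard_normal(k) @ true_basis.T   # unseen model from the same subspace
  w_rec = mean + U @ (U.T @ (w_new - mean))
  print(np.linalg.norm(w_new - w_rec) / np.linalg.norm(w_new))   # small relative error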

westoncb•41m ago
> So, they found an underlying commonality among the post-training structures in 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; and, they demonstrated that the commonality could in every case be substituted for those in the original models with no loss of function; and noted that they seem to be the first to discover this.

Could someone clarify what this means in practice? If there is a 'commonality' why would substituting it do anything? Like if there's some subset of weights X found in all these models, how would substituting X with X be useful?

I see how this could be useful in principle (and obviously it's very interesting), but not clear on how it works in practice. Could you e.g. train new models with that weight subset initialized to this universal set? And how 'universal' is it? Just for models of certain sizes and architectures, or in some way more durable than that?

altairprime•26m ago
Prior to this paper, no one knew that X existed. If this paper proves sound, then now we know that X exists at all.

No matter how large X is, one copy of X baked into the OS / into the silicon / into the GPU / into CUDA, is less than 50+177+8 copies of X baked into every single model. Would that permit future models to be shipped with #include <X.model> as line 1? How much space would that save us? Could X.model be baked into chip silicon so that we can just take it for granted as we would the mathlib constant "PI"? Can we hardware-accelerate the X.model component of these models more than we can a generic model, if X proves to be a 'mathematical' constant?

Given a common X, theoretically, training for models could now start from X rather than from 0. The cost of developing X could be brutal; we've never known to measure it before. Thousands of dollars of GPU per complete training at minimum? Between Google, Meta, Apple, and ChatGPT, the world has probably spent a billion dollars recalculating X a million times. In theory, they probably would have spent another billion dollars over the next year calculating X from scratch. Perhaps now they won't have to?

We don't have a lot of "in practice" experience here yet, because this was first published 4 days ago, and so that's why I'm suggesting possible, plausible, ways this could help us in the future. Perhaps the authors are mistaken, or perhaps I'm mistaken, or perhaps we'll find that the human brain has X in it too. As someone who truly loathes today's "AI", and in an alternate timeline would have completed a dual-major CompSci/NeuralNet degree in ~2004, I'm extremely excited to have read this paper, and to consider what future discoveries and optimizations could result from it.

EDIT:

Imagine if you had to calculate 3.14159 from basic principles every single time you wanted to use pi in your program. Draw a circle to the buffer, measure it, divide it, increase the memory usage of your buffer and the resolution of your circle if necessary to get a higher-precision pi. Eventually you want pi to a billion digits, so every time your program starts, you calculate pi from scratch to a billion digits. Then, someday, someone realizes that we've all been independently calculating the exact same mathematical constant! Someone publishes Pi: An Encyclopedia (Volume 1 of ∞). It suddenly becomes inconceivably easier to render cones and spheres in computer graphics! And then someone invents radians, because now we can map 0..360° onto 0..τ; no one predicted radians at all, but it's incredibly obvious in hindsight.

We take for granted knowledge of things like Pi, but there was a time when we did not know about Pi at all. And then someone realized the underlying commonality of every circle and defined it simply for us, and now we have Pi Day, and Tau Day. How cool is that! So if someone has discovered a new 'constant', then that's always a day of celebration in my book, because it means that we're about to see not only things we consider "possible, but difficult" to instead be "so easy that we celebrate their existence with a holiday", but also things that we could never have remotely dreamed of before we knew that X existed at all.

farhanhubble•11m ago
It might be worth it to use that subset to initialize the weights of future models, but more importantly you could save a huge number of computational cycles by using the lower-dimensional weights at the time of inference.
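
As a back-of-the-envelope illustration of the inference-time point (generic low-rank trick, not necessarily the paper's exact construction): if a weight matrix factors through a small rank r, you never need to materialize or multiply by the full matrix.

  import numpy as np

  d_in, d_out, r = 4096, 4096, 16
  rng = np.random.default_rng(0)
  A = rng.standard_normal((d_out, r))    # tall factor
  B = rng.standard_normal((r, d_in))     # wide factor, W = A @ B
  x = rng.standard_normal(d_in)

  y_full = x @ (A @ B).T                 # ~2 * d_in * d_out ≈ 33.5M FLOPs per token
  y_low = (x @ B.T) @ A.T                # ~2 * r * (d_in + d_out) ≈ 0.26M FLOPs per token
  print(np.allclose(y_full, y_low))      # same result, ~128x cheaper
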
pagekicker•1h ago
I asked Grok to visualize this:

https://grok.com/share/bGVnYWN5_463d51c8-d473-47d6-bb1f-6666...

*Caption for the two images:*

Artistic visualization of the universal low-parameter subspaces discovered in large neural networks (as described in “The Unreasonable Effectiveness of Low-Rank Subspaces,” arXiv:2512.05117).

The bright, sparse linear scaffold in the foreground represents the tiny handful of dominant principal directions (often ≤16 per layer) that capture almost all of the signal variance across hundreds of independently trained models. These directions form a flat, low-rank “skeleton” that is remarkably consistent across architectures, tasks, and random initializations.

The faint, diffuse cloud of connections fading into the dark background symbolizes the astronomically high-dimensional ambient parameter space (billions to trillions of dimensions), almost all of whose directions carry near-zero variance and can be discarded with negligible loss in performance. The sharp spectral decay creates a dramatic “elbow,” leaving trained networks effectively confined to this thin, shared, low-dimensional linear spine floating in an otherwise vast and mostly empty void.

100721•1h ago
Acting as a pass-through for LLMs is logically equivalent to wiring up a bot account.
masteranza•1h ago
It's basically way better than LoRA in all respects and could even be used to speed up inference. I wonder whether the big models are not using it already... If not, we'll see a blow-up in capabilities very, very soon. What they've shown is that you can find the subset of parameters responsible for transfer of capability to new tasks. Does it apply to completely novel tasks? No, that would be magic. Tasks that need new features or representations break the method, but if it fits in the same domain then the answer is "YES".

Here's a very cool analogy from GPT 5.1 which hits the nail on the head in explaining the role of the subspace in learning new tasks, by analogy with 3D graphics.

  Think of 3D character animation rigs:

   • The mesh has millions of vertices (11M weights).
   • Expressions are controlled via sliders: “smile”, “frown”, “blink”.

  Each expression is just:

  mesh += α_i * basis_expression_i

  Hundreds of coefficients modify millions of coordinates.
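
The same rig analogy rendered in a few lines of NumPy (toy numbers, purely illustrative):

  import numpy as np

  n_vertices, n_expressions = 1_000_000, 3
  rng = np.random.default_rng(0)

  neutral = rng.standard_normal(n_vertices)                  # the full "mesh"
  basis = rng.standard_normal((n_expressions, n_vertices))   # smile, frown, blink directions

  alpha = np.array([0.7, 0.0, 0.2])    # three coefficients...
  mesh = neutral + alpha @ basis       # ...move a million coordinates at once
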
mlpro•23m ago
It does seem to be working for novel tasks.
odyssey7•1h ago
Now that we know about this, that the calculations in the trained models follow some particular forms, is there an approximation algorithm to run the models without GPUs?
nothrowaways•51m ago
What if all models are secretly just fine tunes of llama?
nextworddev•49m ago
The central claim, or "Universal Weight Subspace Hypothesis," is that deep neural networks, even when trained on completely different tasks (like image recognition vs. text generation) and starting from different random conditions, tend to converge to a remarkably similar, low-dimensional "subspace" in their massive set of weights.
nothrowaways•41m ago
> Principal component analysis of 200 GPT2, 500 Vision Transformers, 50 LLaMA-8B, and 8 Flan-T5 models reveals consistent sharp spectral decay - strong evidence that a small number of weight directions capture dominant variance despite vast differences in training data, objectives, and initialization.

Isn't it obvious?

stingraycharles•31m ago
Well, intuitively it makes sense that within each independent model, a small number of weights / parameters are very dominant, but it’s still super interesting that these can be swapped between all the models without loss of performance.

It isn’t obvious that these parameters are universal across all models.

mlpro•24m ago
Not really. If the models are trained on different datasets - like one ViT trained on satellite images and another on medical X-rays - one would expect their parameters, which were randomly initialized, to be completely different or even orthogonal.