The M or B game breaks down when you play with someone who knows obscure people you've never heard of. Either you can't recognize their references, or your sense of "semantic distance" differs from theirs. The solution is to match knowledge levels: experts play with experts, generalists with generalists.
The same applies to decoding ancient texts: if ancient civilizations focused on completely different concepts than we do today, our modern semantic models won't help us understand their writing.
It just assumes that your answers are going to be reasonably bread-like or reasonably Mussolini-like, and doesn't think laterally at all.
It just kept asking me about varieties of baked goods.
edit: It did much better after I added some extra explanation -- that it could be anything, that it may be very unlike either choice, and that it shouldn't try to narrow down too quickly.
If you used word2vec directly, it would be exactly the right thing to play this game with. Those embeddings exist inside an LLM, but the LLM is trained to respond like text found online, not to play this game.
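A minimal sketch of what "playing the game with embeddings directly" would look like: compare a guess's vector against each anchor by cosine similarity and report which anchor it's closer to. The 3-d vectors below are hypothetical stand-ins; a real version would look these words up in a trained word2vec model instead.

```python
import math

# Hypothetical toy embeddings standing in for real word2vec vectors.
EMB = {
    "bread":     [0.9, 0.1, 0.0],
    "mussolini": [0.0, 0.2, 0.9],
    "baguette":  [0.8, 0.2, 0.1],
    "stalin":    [0.1, 0.1, 0.8],
}

def cosine(u, v):
    # Standard cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def closer_anchor(word, a="bread", b="mussolini"):
    # Return whichever anchor the word's embedding is nearer to.
    return a if cosine(EMB[word], EMB[a]) >= cosine(EMB[word], EMB[b]) else b

print(closer_anchor("baguette"))  # with these toy vectors: bread
print(closer_anchor("stalin"))    # with these toy vectors: mussolini
```

With real embeddings you'd also get a graded signal ("warmer/colder") from the similarity values themselves, which is exactly what the game needs.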
1. Not all models are equally efficient. We already have many methods to perform universal search (e.g., Levin's, Hutter's, and Schmidhuber's versions), but they are painfully slow despite being optimal in a narrow sense that doesn't extrapolate well to real-world performance.
2. Solomonoff induction is only optimal for infinite data (i.e., it can be used to create a predictor that asymptotically dominates any other algorithmic predictor). As far as I can tell, the problem remains totally unsolved for finite data, due to the additive constant that results from the question: which universal model of computation should be applied to finite data? You can easily construct a Turing machine that is universal and perfectly reproduces the training data, yet nevertheless dramatically fails to generalize. No one has made a strong case for any specific natural prior over universal Turing machines (and if you try to define some measure to quantify the "size" of a Turing machine you realize this method starts to fail once the number of transition tables becomes large enough to start exhibiting redundancy).
> One explanation for why this game works is that there is only one way in which things are related
There is not; relatedness is a completely non-transitive relationship.
On another point: suppose you keep the same vocabulary but permute the meanings of the words. The neural network will still learn relationships, completely different ones, and its representation may converge toward a better compression for that set of words, but I'm dubious that this new compression scheme will resemble the previous one (?)
I would say that given an optimal encoding of the relationships, we can achieve extreme compression, but not all encodings lead to the same compression in the end.
If I add 'bla' between every word in a text, that is easy to compress. But if I instead add an increasing sequence of words between each word, the meaning is still there, yet the compression will not be the same, as the network will try to generate the words in between.
(thinking out loud)
tyronehed•1h ago
When we arrive at AGI, you can be certain it will not contain a Transformer.
jxmorris12•1h ago
I once saw a LessWrong post claiming that the Platonic Representation Hypothesis doesn't hold when you only embed random noise, as opposed to natural images: http://lesswrong.com/posts/Su2pg7iwBM55yjQdt/exploring-the-p...
blibble•25m ago
of course it matters
if I supply the ants in my garden with instructions on how to build tanks and stealth bombers they're still not going to be able to conquer my front room