Even if existing theory is inadequate, would an operating theory not be beneficial?
Or is the mystique, combined with guess-and-check drudgery, a form of job security?
For instance, we do not have consensus on what a theory should accomplish - should it provide convergence bounds/capability bounds? Should it predict optimal parameter counts/shapes? Should it allow more efficient calculation of optimal weights? Does it need to do these tasks in linear time?
Even materials science in metals is still cycling through theoretical models after thousands of years of making steel and other alloys.
There is an enormous amount of theory used in the various parts of building models, there just isn't an overarching theory at the very most convenient level of abstraction.
It almost has to be this way. If there was some neat theory, people would use it and build even more complex things on top of it in an experimental way and then so on.
See? Everything lives in the manifold.
For a great visualization of the Manifold Hypothesis, I cannot recommend this video enough: https://www.youtube.com/watch?v=pdNYw6qwuNc
It helps to visualize how the activation functions, biases and weights (linear transformations) stretch the high-dimensional space so that the data are pushed toward extremes and end up on a low-dimensional object embedded in that high-dimensional space (the manifold), where they are trivial to classify or separate.
Gaining an intuition about this process makes some deep learning practices much easier to understand.
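To make the "stretch the space until the classes separate" picture concrete, here is a minimal sketch. It assumes a toy dataset of two concentric rings and a hand-picked nonlinear map; a real network would learn its transformation from data rather than being handed one, and the names `ring` and `lift` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """Sample n points near a circle of the given radius (small radial noise)."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r = radius + rng.uniform(-0.1, 0.1, size=n)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

inner = ring(1.0, 200)  # class 0: no straight line in 2D separates it from class 1
outer = ring(3.0, 200)  # class 1

def lift(points):
    """Hand-picked nonlinear map (x, y) -> (x, y, x^2 + y^2)."""
    sq = np.sum(points ** 2, axis=1, keepdims=True)
    return np.concatenate([points, sq], axis=1)

inner3, outer3 = lift(inner), lift(outer)

# After the lift, the flat plane z = 4 separates the two classes.
threshold = 4.0
separable = bool(np.all(inner3[:, 2] < threshold) and np.all(outer3[:, 2] > threshold))
print(separable)
```

The point is only that a nonlinear deformation can turn a tangled arrangement into one that a single linear cut handles.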
I think these 'intuitions' are an after-the-fact thing, meaning AFTER deep learning comes up with a method, researchers in other fields of science notice the similarities between the deep learning approach and their (possibly decades old) methods. Here's an example where the author discovers that GPT is really the same computational problems he has solved in physics before:
https://ondrejcertik.com/blog/2023/03/fastgpt-faster-than-py...
It seems that the bottleneck algorithm in GPT-2 inference is matrix-matrix multiplication. For physicists like us, matrix-matrix multiplication is very familiar, *unlike other aspects of AI and ML* [emphasis mine]. Finding this familiar ground inspired us to approach GPT-2 like any other numerical computing problem.
Note: Matrix-matrix multiplication is basic mathematics, and not remotely interesting as physics. Although, to try to see it from the author's perspective, it is pulling tools out of the same toolbox (extremely well developed and studied in its own right) as computational physics does. It is a little funny, although not too surprising, that a computational physics guy would look at some linear algebra code and immediately see the similarity.
Edit: actually, thinking a little more, it is basically absurd to believe that somebody has had a career in computational physics without knowing they are relying heavily on the HPC/scientific computing/numerical linear algebra toolbox. So, I think they are just using that to help with the narrative for the blog post.
Many statistical thermodynamics ideas were reinvented in ML.
Same is true for mirror descent. It was independently discovered by Warmuth and his students as Bregman divergence proximal minimization, or as a special case would have it, exponential gradient algorithms.
One can keep going.
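For readers who have not met the exponentiated gradient algorithm, a minimal sketch follows. It assumes a fixed linear loss over the probability simplex; the `costs` vector, step count and learning rate are made-up values for illustration.

```python
import numpy as np

def exponentiated_gradient(grad_fn, dim, steps=200, lr=0.5):
    """Mirror descent on the probability simplex with the negative-entropy mirror map."""
    w = np.full(dim, 1.0 / dim)      # start from the uniform distribution
    for _ in range(steps):
        g = grad_fn(w)
        w = w * np.exp(-lr * g)      # multiplicative (exponentiated) update
        w = w / w.sum()              # renormalize back onto the simplex
    return w

# Hypothetical per-coordinate costs; the loss is the linear function <costs, w>.
costs = np.array([0.9, 0.2, 0.5, 0.7])
w = exponentiated_gradient(lambda w: costs, dim=len(costs))
print(w.round(3))
```

The update multiplies rather than adds, which is what the entropy-based Bregman divergence buys you compared with plain gradient descent followed by a Euclidean projection.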
It's led me to wonder about the origin of the probability distributions in stat-mech. Physical randomness is mostly a fiction (outside maybe quantum mechanics) so probability theory must be a convenient fiction. But objectively speaking, where then do the probabilities in stat-mech come from? So far, I've noticed that the (generalised) Boltzmann distribution serves as the bridge between probability theory and thermodynamics: It lets us take non-probabilistic physics and invent probabilities in a useful way.
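As a concrete reference point, the Boltzmann distribution assigns each state a probability proportional to exp(-E / (k_B T)). The sketch below assumes a made-up three-level system and sets k_B to 1 for convenience; it is the same computation ML people call a softmax over negative energies.

```python
import numpy as np

def boltzmann(energies, temperature, k_B=1.0):
    """Probabilities p_i proportional to exp(-E_i / (k_B * T))."""
    e = np.asarray(energies, dtype=float)
    logits = -e / (k_B * temperature)
    logits = logits - logits.max()   # shift for numerical stability; cancels on normalization
    weights = np.exp(logits)
    return weights / weights.sum()   # dividing by the partition function

energies = [0.0, 1.0, 2.0]           # hypothetical energy levels
p_cold = boltzmann(energies, temperature=0.1)
p_hot = boltzmann(energies, temperature=100.0)
print(p_cold.round(4), p_hot.round(4))
```

At low temperature nearly all the probability sits on the lowest-energy state; at high temperature the distribution approaches uniform.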
It can be circular if one defines equilibrium to be that situation when all the micro-states are equally occupied. One way out is to define equilibrium in temporal terms - when the macro-states are not changing with time.
For example one can use a non-uniform prior over the micro-states. If that prior happens to be in the Darmois-Koopman family that implicitly means that there are some non explicitly stated constraints that bind the micro-state statistics.
It is primarily linear algebra, calculus, probability theory and statistics, secondarily you could add something like information theory for ideas like entropy, loss functions etc.
But really, if "manifolds" had never been invented/conceptualized, we would still have deep learning now, it really made zero impact on the actual practical technology we are all using every day now.
Deep learning in its current form relates to a hypothetical underlying theory as alchemy does to chemistry.
In a few hundred years the Inuktitut speaking high schoolers of the civilisation that comes after us will learn that this strange word “deep learning” is a left over from the lingua franca of yore.
Essentially all practical models are discovered by trial and error and then "explained" after the fact. In many papers you read a few paragraphs of derivation followed by a simpler formulation that "works better in practice". E.g., diffusion models: here's how to invert the forward diffusion process, but actually we don't use this, because gradient descent on the inverse log likelihood works better. For bonus points the paper might come up with an impressive name for the simple thing.
In most other fields you would not get away with this. Your reviewers would point this out and you'd have to reformulate the paper as an experience report, perhaps with a section about "preliminary progress towards theoretical understanding". If your theory doesn't match what you do in practice - and indeed many random approaches will kind of work (!) - then it's not a good theory.
Often, they do (and then they are called "sheaves").
In fact not all ML models treat data as manifolds. Nearest neighbors, decision trees don’t require the manifold assumption and actually work better without it.
An integer lattice can only be a manifold in a trivial sense (dimension 0). But not for any positive dimensions.
Coming up with an idea for how something works, by applying your expertise, is the fundamental foundation of intelligence, learning, and was behind every single advancement of human understanding.
People thinking is always a good thing. Thinking about the unknown is better. Thinking with others is best, and sharing those thoughts isn't somehow bad, even if they're not complete.
Even with LLMs, there's no real mystery about why they work so well - they produce human-like input continuations (aka "answers") because they are trained to predict continuations of human-generated training data. Maybe we should be a bit surprised that the continuation signal is there in the first place, but given that it evidently is, it's no mystery that LLMs are able to use it - just testimony to the power of the Transformer as a predictive architecture, and of course to gradient descent as a cold unthinking way of finding an error minimum.
Perhaps you meant how LLMs work, rather than why they work, but I'm not sure there's any real mystery there either - the transformer itself is all about key-based attention, and we now know that training a transformer seems to consistently cause it to leverage attention to learn "induction heads" (using pairs of adjacent attention heads) that are the main data finding/copying primitive they use to operate.
Of course knowing how an LLM works in broad strokes isn't the same as knowing specifically how it is working in any given case, how it is transforming a specific input layer by layer to create the given output, but that seems a bit like saying that because I can't describe - precisely - why you had pancakes for breakfast, we don't know how the brain works.
Physics is just applied mathematics
Chemistry is just applied physics
Biology is just applied chemistry
It doesn’t work very well.
Of course. Now, to actually deeply understand what is happening with these constructs, we will use topology. Topological insights will without doubt then inform the next generations of this technology.
Neural Networks consist almost exclusively of two parts, numerical linear algebra and numerical optimization.
Even if you reject the abstract topological description. Numerical linear algebra and optimization couldn't be any more directly applicable.
The attention mechanism is not a stretching of the manifold, but is trained to be able to measure distances in the manifold surface, which is stretched and deformed (or transformed?) in the feed-forward layers.
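For reference, a minimal sketch of scaled dot-product attention in NumPy, assuming a single head and random inputs. Note that the dot product is a similarity score rather than a true distance, so "measuring distances" should be read loosely.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # pairwise similarity between positions
    weights = softmax(scores, axis=-1)    # each row is a distribution over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))               # 5 token vectors of width 8 (made-up sizes)
out, weights = attention(X, X, X)         # self-attention without learned projections
print(out.shape, weights.sum(axis=1))
```

Everything here is a matrix product plus a row-wise softmax; the learned parts in a real model are the projections that produce Q, K and V.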
I've always been hopeful that some algebraic topology master would dig into this question and it'd provide some better design principles for neural nets. which activation functions? how much to fan in/out? how many layers?
In general it's a nice idea, but the blog post is very fluffy, especially once it connects the idea to reasoning. There is serious technical work in this area (e.g. https://arxiv.org/abs/1402.1869) that has expanded this idea and made it more concrete.
"Everything lives on a manifold"
"If you are trying to learn a translation task — say, English to Spanish, or Images to Text — your model will learn a topology where bread is close to pan, or where that picture of a cat is close to the word cat."
This is everything that topology is not about: a notion of points being "close" or "far." If we have some topological space in which two points are "close," we can stretch the space so as to get the same topological space, but with the two points now "far". That's the whole point of the joke that the coffee cup and the donut are the same thing.
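A tiny numerical illustration of that stretching argument, assuming three made-up points on the real line and the map x -> x**3 (a continuous bijection with a continuous inverse, so the line stays topologically the same):

```python
import numpy as np

def stretch(x):
    """x -> x**3: a homeomorphism of the real line onto itself."""
    return x ** 3

points = np.array([0.0, 2.0, 3.0])   # a, b, c
moved = stretch(points)              # 0, 8, 27

def nearest_to_b(p):
    """Return which of a or c is closer to the middle point b."""
    a, b, c = p
    return "a" if abs(b - a) < abs(c - b) else "c"

before = nearest_to_b(points)        # b is closer to c
after = nearest_to_b(moved)          # b is now closer to a
order_preserved = bool(np.all(np.diff(moved) > 0))
print(before, after, order_preserved)
```

The ordering of the points survives the deformation, but which neighbour counts as "close" does not, which is why closeness is a metric notion rather than a topological one.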
Instead, the entire thing seems to be a real-world application of something like algebraic geometry. We want to look for something like an algebraic variety the points are near. It's all about geometry and all about metrics between points. That's what it seems like to me, anyway.
100 percent true.
I can only hope that in an article that is about two things, i) topology and ii) deep learning, the evident confusions are contained within one of them -- topology, only.
You then mean Deep Learning has a lot in common with differential geometry and manifolds in general. That I will definitely agree with. DG and manifolds have far richer and informative structure than topology.
> I'm personally pretty convinced that, in a high enough dimensional space, this is indistinguishable from reasoning
I actually have journaled extensively about this and even written some on Hacker News about it with respect to what I've been calling probabilistic reasoning manifolds:
> This manifold is constructed via learning a decontextualized pattern space on a given set of inputs. Given the inherent probabilistic nature of sampling, true reasoning is expressed in terms of probabilities, not axioms. It may be possible to discover axioms by locating fixed points or attractors on the manifold, but ultimately you're looking at a probabilistic manifold constructed from your input set.
> But I don't think you can untie this "reasoning" from your input data. It's possible you will find "meta-reasoning", or similar structures found in any sufficiently advanced reasoning manifold, but these highly decontextualized structures might be entirely useless without proper recontextualization, necessitating that a reasoning manifold is trained on input whose patterns follow learnable underlying rules, if the manifold is to be useful for processing input of that kind.
> Decontextualization is learning, decomposing aspects of an input into context-agnostic relationships. But recontextualization is the other half of that, knowing how to take highly abstract, sometimes inexpressible, context-agnostic relationships and transform them into useful analysis in novel domains
Full comment: https://news.ycombinator.com/item?id=42871894
In which case, I cannot understand "true reasoning is expressed in terms of probabilities, not axioms".
One of the features of reasoning is that it does not operate in this way. It's highly implausible animals would have been endowed with no ability to operate non-probabilistically on propositions represented by them, since this is essential for correct reasoning -- and a relatively trivial capability to provide.
Eg., "if the spider is in boxA, then it is not everywhere else" and so on
We don't do logic itself, we just create logic from certainty as part of verbal reasoning. It's our messy internal inference of likelihoods that causes us to pause and think, or dash forward with confidence, and convincing others is the only place we need things like "theorems".
This is the only way I can square things like intuition, writing to formalize thoughts, verbal argument, etc, with the fact that people are just so mushy all the time.
This naive cynicism about our mental capacities is a product of this credulity about statistical AI. If one begins with an earnest study of animal intelligence, in order to describe it, it disappears. It's exactly and only a project of the child playing with his lego, certain that great engineering projects have little use for any more than stacking bricks.
Logical propositions ("2+2=4 regardless of my certainty about it") seem a long way from necessary or sufficient to survival for animals. A fuzzy heatmap of "where is prey going" or "How many prey over there" is much closer to necessary and sufficient. The fact that measurements or senses can update those estimates is a long way from a logical deduction.
Something more like probability factor graph can do it, without the pernicious use of "concepts" or too much need for implication, which is sticky and overly rigorous.
That's all I have to say, and I doubt we'll find middle ground.
Any validation of a theory is inherently statistical, as you must sample your environment with some level of precision across spacetime, and that level of precision correlates to the known accuracy of hypotheses. In other words, we can create axiomatic systems of logic, but ultimately any attempt to compare them to reality involves empirical sampling.
Unlike classical physics, our current understanding of quantum physics essentially allows for anything to be "possible" at large enough spacetime scales, even if it is never actually "probable". For example, quantum tunneling, where a quantum system might suddenly overcome an energy barrier despite lacking the required energy.
Every day when I walk outside my door and step onto the ground, I am operating on a belief that gravity will work the same way every time, that I won't suddenly pass through the Earth's crust or float into the sky. We often take such things for granted, as axiomatic, but ultimately all of our reasoning is based on statistical correlations. There is the ever-minute possibility that gravity suddenly stops working as expected.
> if the spider is in boxA, then it is not everywhere else
We can't even physically prove that. There's always some level of uncertainty which introduces probability into your reasoning. It's just convenient for us to say, "it's exceedingly unlikely in the entire age of the universe that a macroscopic spider will tunnel from Box A to Box B", and apply non-probabilistic heuristics.
It doesn't remove the probability, we just don't bother to consider it when making decisions because the energy required for accounting for such improbabilities outweighs the energy saved by not accounting for them.
As mentioned in my comment, there's also the possibility that universal axioms may be recoverable as fixed points in a reasoning manifold, or in some other transformation. If you view these probabilities as attractors on some surface, fixed points may represent "axioms" that are true or false under any contextual transformation.
A proposition is not a prediction. A prediction is either an estimate of the value of some quantity ("the dumb ML meaning of prediction") or a proposition which describes a future scenario. We can trivially enumerate propositions that do not describe future scenarios, eg., 2 + 2 = 4.
Uncertainty is a property of belief attitudes towards propositions, it isn't a feature of their semantic content. A person doesn't mean anything different by "2 + 2 = 4" if they are 80 or 90% sure of it.
> We can't even physically prove that.
Irrelevant. Our minds are not constrained by physical possibility, necessarily so, as we know very little about what is physically possible. I can imagine an arbitrary number of cases, arising out of logical manipulation of propositions, that are not physically possible. (E.g., "Superman can lift any building. The empire state building is so-and-so a kind of building. Imagine(Superman lifting the empire state building)").
The infinite variety of our imagination is a trivial consequence of non-probabilistic operations on propositions, it's incomprehensibly implausible as a consequence of merely probabilistic ones.
That nature seems to have endowed minds with discrete operations, that these are empirical in operation across very wide classes of reasoning, including imagination, that these seem trivial for nature to provide (etc.) render the notion that they don't exist highly highly implausible.
There is nothing lacking explanation here. The relevant mental processes we have to hand are fairly obvious and fairly easy to explain.
It's an obvious act of credulity to try to find some way to make the latest trinkets of the recent rich into some sort of miracle. All of these projects of "incredible abstraction" follow around these hype cycles, turning lead into gold: if x "is really" y, and y "is really" z, and ..., then x is amazin! This piles towers of ever more general, hollowed-out words on top of each other until the most trivial thing sounds like a wonder.
Why would animals need to evolve 100% correct reasoning if probabilistically correct reasoning suffices? If probabilistic reasoning is cheaper in terms of energy then correct reasoning is a disadvantage.
Topology is whatever little structure remains in geometry after you throw away distances, angles and orientations and allow all sorts of non-tearing stretchings. It's the bare minimum that remains valid after such violent deformations.
While the notion of topology is definitely useful in machine learning, scale, distance, angles, etc. usually provide lots of essential information about the data.
If you want to distinguish between a tabby cat and a tiger it would be an act of stupidity to ignore scale.
Topology is useful especially when you cannot trust lengths, distances and angles, and have to allow for arbitrary deformations. That happens, but to claim deep learning is applied topology is absurd, almost stupid.
But...you can't. The input data lives on a manifold that you cannot 'trust'. It doesn't mean anything a priori that an image of a coca-cola can and an image of a stopsign live close to each other in pixel space. The neural network applies all of those violent transformations you are talking about.
Only in a desperate sales pitch or a desperate research grant proposal. There are of course some situations where certain measurements are untrustworthy, but to claim that this is the common case is very snake-oily.
It is not very frequent that measurements become untrustworthy only because of some unknown smooth transformation (this is the case that purely topological methods deal with). Random noise will also do that for you.
Not disputing the fact that sometimes metrics cannot be trusted entirely, but to go to a topological approach seems extreme. One should use as much of the relevant non-topological information as possible.
As the hackneyed example goes, a topological method would not be able to distinguish between a cup and a donut. For that you would need to trust non-topological features such as distances and angles. Deep learning methods can indeed differentiate between cop-nip and coffee mugs.
BTW I am completely on-board with the idea that data often looks as if it has been sampled from an unknown, potentially smooth, possibly non-Euclidean manifold and then corrupted by noise. In such cases recovering that manifold from noisy data is a very worthy cause.
In fact that is what most of your blogpost is about. But that's differential geometry and manifolds, which have structure far richer than a topology. For example they may have tangent planes, a Riemannian metric or a symplectic form etc. A topological method would throw all of that away and focus on topology.
Dogs have fur. Dogs are an example of a furry animal. But dogs and furs are not the same thing although they may appear in the same text often.
Topology is a traditional as well as an active branch of applied and pure mathematics, well, Physics too.
It has tons of text books printed on it, has several active journals and conferences dealing with it. https://www.amazon.com/s?k=Topology&sprefix=topology+%2Caps%...
Surprise, surprise ...not ...has an extensive Wikipedia page.
https://en.m.wikipedia.org/wiki/Topology
Math magazines for high schoolers have articles on it. Colleges offer multiple courses on it. Some of those courses would be mandatory for a degree in even undergrad mathematics.
If one wants to do graduate studies then one can do a Masters or a PhD in Topology, well in one of its many branches.
It's also not a new kid on the block. It goes back to ... analysis situs ... further back to Leibniz, although it began to crystallize formally after Poincaré.
If someone wants to use the phrase 'differential calculus' to mean something else in their love letters and sweet nothings, that's absolutely fine :) but in Maths (and Machine Learning, well, with quality of peer reviewing this might soon be iffy) it has a well established and unambiguous meaning.
Note because of its shared beginning at the feet of Leibniz, comparing it with calculus is not an unfair comparison.
"Topology is a branch of mathematics concerned with geometric properties preserved under continuous deformation (stretching without tearing or gluing)"
That is indeed the established meaning of topology in mathematics and the blog post was on applied mathematics. That it may mean something else in other contexts is irrelevant.
I rest my case. LOL.
> The most common uses of "topology", whenever used to convey a geometry-related idea, is in the more general sense meaning "surfaces"
Erm, citation please, because if it was true, wouldn't the Wikipedia pages have talked about that general sense meaning first ?
Alternatively, I would say, take a breath. Is this hill really the one worth dying on ? There are better ones. Have a good day and if work permits, get yourself a juicy topology book, it can be interesting, if presented well.
What if the models capable of CoT aren't and will never be, regardless of topological manipulation, capable of processes that could be considered AGI? For example, human intelligence (the closest thing we know to AGI) requires extremely complex sensory and internal feedback loops and continuous processing unlike autoregressive models' discrete processing.
As a layman, this matches my intuition that LLMs are not at all in the same family of systems as the ones capable of generating intelligence or consciousness.
> For example, human intelligence (the closest thing we know to AGI) requires extremely complex sensory and internal feedback loops and continuous processing unlike autoregressive models' discrete processing.
I've done a fair bit of connectomics research and I think that this framing elides the ways in which neural networks and biological networks are actually quite similar. For example, in mice olfactory systems there is something akin to a 'feature vector' that appears based on which neurons light up. Specific sets of neurons lighting up means 'chocolate' or 'lemon' or whatever. More generally, it seems like neuronal representations are somewhat similar to embedding representations, and you could imagine constructing an embedding space based on what neurons light up where. Everything on top of the embeddings is 'just' processing.
In short, direct manifold learning is not really tractable as an algorithmic approach. The most powerful set of tools and theoretical basis for AI has sprung from statistical optimization theory (SGD, information-theoretical loss minimization, etc.). The fact that data is on a manifold is a tautological footnote to this approach.
However, we still have much to learn about the topology of the brain and its functional connectivity. In the coming years, we are likely to discover new architectures — both internal within individual layers/nodes and in the ways specialized networks connect and interact with each other.
Additionally, the brain doesn’t rely on a single network, but rather on several ones — often referred to as the "Big 7" — that operate in parallel and are deeply interconnected. Some of these include the Default Mode Network (DMN), the Central Executive Network (CEN) or the Limbic Network, among others. In fact, a single neuron can be part of multiple networks, each serving different functions.
We have not yet been able to fully replicate this complexity in artificial systems, and there is still much to learn from, and be inspired by, these "network topologies".
So, "Topology is all you need" :-)
Advances and insights sometimes lie dormant for decades or more before someone else picks them up and does something new.
You can say any advance or insight is just lying dormant, it doesn't mean anything unless you can specifically articulate why it still has potential. I haven't made any claims on the future of the intersection of deep learning and topology, I was pointing out that it's been anything but dormant given the interest in it but it hasn't led anywhere.
More generally, in my experience as an AI researcher, an understanding of the geometry of data leads directly to changes in model architecture. Though people disparage that as "trial and error" it is far more directed than people on the outside give it credit for.
So then the question becomes what's the difference between Graph Theory and Applied Topology? Graphs operate on discrete structures and topology is about a continuous space. Otherwise they're very closely related.
But the higher order bit is that AI/ML and Deep Learning in particular could do a better job of learning from and acknowledging prior art from related fields. Reusing older terminology instead of inventing new.
Hard disagree.
I tried really hard to use topology as a way to understand neural networks, for example in these follow ups:
- https://colah.github.io/posts/2014-10-Visualizing-MNIST/
- https://colah.github.io/posts/2015-01-Visualizing-Representa...
There are places I've found the topological perspective useful, but after a decade of grappling with trying to understand what goes on inside neural networks, I just haven't gotten that much traction out of it.
I've had a lot more success with:
* The linear representation hypothesis - The idea that "concepts" (features) correspond to directions in neural networks.
* The idea of circuits - networks of such connected concepts.
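To make "features correspond to directions" concrete, here is a minimal sketch on synthetic data. It assumes activations are Gaussian noise plus a fixed shift along one hidden direction when the concept is present, and it recovers that direction with a difference of class means. This is just one simple probe chosen for illustration, not a description of the methods in the writing linked below.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 500                       # made-up activation width and sample count

true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic "activations": the concept shifts them along true_dir.
off = rng.normal(size=(n, d))
on = rng.normal(size=(n, d)) + 4.0 * true_dir

# Estimate the feature direction as the normalized difference of class means.
est_dir = on.mean(axis=0) - off.mean(axis=0)
est_dir /= np.linalg.norm(est_dir)
cosine = float(est_dir @ true_dir)

# Reading the feature is a single dot product followed by a threshold.
midpoint = 0.5 * (on.mean(axis=0) + off.mean(axis=0)) @ est_dir
acc = 0.5 * (np.mean(on @ est_dir > midpoint) + np.mean(off @ est_dir <= midpoint))
print(round(cosine, 3), round(float(acc), 3))
```

If a concept really is a direction, then detecting it reduces to projecting onto that direction, which is what makes the hypothesis so convenient when it holds.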
Some selected related writing:
- https://distill.pub/2020/circuits/zoom-in/
- https://transformer-circuits.pub/2022/mech-interp-essay/inde...
- https://transformer-circuits.pub/2025/attribution-graphs/bio...
- LLMs are basically just slightly better `n-gram` models
- The idea of "just" predicting the next token, as if next-token-prediction implies a model must be dumb
(I wonder if this [1] popular response to Karpathy's RNN [2] post is partly to blame for people equating language neural nets with n-gram models. The stochastic parrot paper [3] also somewhat equates LLMs and n-gram models, e.g. "although she primarily had n-gram models in mind, the conclusions remain apt and relevant". I guess there was a time where they were more equivalent, before the nets got really really good)
[1] https://nbviewer.org/gist/yoavg/d76121dfde2618422139
[2] https://karpathy.github.io/2015/05/21/rnn-effectiveness/
The whole discourse of "stochastic parrots" and "do models understand" and so on is deeply unhealthy because it should be scientific questions about mechanism, and people don't have a vocabulary for discussing the range of mechanisms which might exist inside a neural network. So instead we have lots of arguments where people project meaning onto very fuzzy ideas and the argument doesn't ground out to scientific, empirical claims.
Our recent paper reverse engineers the computation neural networks use to answer in a number of interesting cases (https://transformer-circuits.pub/2025/attribution-graphs/bio... ). We find computation that one might informally describe as "multi-step inference", "planning", and so on. I think it's maybe clarifying for this, because it grounds out to very specific empirical claims about mechanism (which we test by intervention experiments).
Of course, one can disagree with the informal language we use. I'm happy for people to use whatever language they want! I think in an ideal world, we'd move more towards talking about concrete mechanism, and we need to develop ways to talk about these informally.
There was previous discussion of our paper here: https://news.ycombinator.com/item?id=43505748
I.e. those higher concepts are kept in mind as a goal. It is healthy: it keeps the aim alive.
I've expressed this on here before, but it feels like the everyday reception of LLMs has been so damaged by the general public having just gotten a basic grasp on the existence of machine learning.
It's only an analogy, but it does suggest at least that the interesting details of the dynamics aren't embedded in the topology of the system. It's more complicated than that.
Re linear representation hypothesis, surely it depends on the architecture? GANs, VAEs, CLIP, etc. seem to explicitly model manifolds. And even simple models will, due to optimization pressure, collapse similar-enough features into the same linear direction. I suppose it's hard to reconcile the manifold hypothesis with the empirical evidence that simple models will place similar-ish features in orthogonal directions, but surely that has more to do with the loss that is being optimized? In Toy Models of Superposition, you're using an MSE which effectively makes the model learn an autoencoder regression / compression task. Makes sense then that the interference patterns between co-occurring features would matter. But in a different setting, say a contrastive loss objective, I suspect you wouldn't see that same interference minimization behavior.
I don't think circuits is specific to transformers? Our work in the Transformer Circuits thread often is, but the original circuits work was done on convolutional vision models (https://distill.pub/2020/circuits/ )
> Re linear representation hypothesis, surely it depends on the architecture? GANs, VAEs, CLIP, etc. seem to explicitly model manifolds
(1) There are actually quite a few examples of seemingly linear representations in GANs, VAEs, etc (see discussion in Toy Models for examples).
(2) Linear representations aren't necessarily in tension with the manifold hypothesis.
(3) GANs/VAEs/etc modeling things as a latent gaussian space is actually way more natural if you allow superposition (which requires linear representations) since central limit theorem allows superposition to produce Gaussian-like distributions.
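Point (3) can be illustrated with a quick simulation. It assumes many independent, sparsely active binary features that are all read into one neuron through fixed random weights; the sizes and the 10% activation rate are made up. Excess kurtosis is used as a rough "how Gaussian is this" check, since it is 0 for a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples = 500, 5000

w = rng.normal(size=n_features)                          # one neuron's weights
active = rng.random((n_samples, n_features)) < 0.1       # sparse binary features
neuron = active.astype(float) @ w                        # superposed activation

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

single = excess_kurtosis(active[:, 0].astype(float))     # one sparse feature alone
mixed = excess_kurtosis(neuron)                          # sum over many features
print(round(single, 2), round(mixed, 2))
```

A single sparse feature is far from Gaussian, while the sum of many of them looks close to Gaussian, which is the central limit theorem doing the work.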
O neat, I haven't read that far back. Will add it to the reading list.
To flesh this out a bit, part of why I find circuits less compelling is because it seems intuitive to me that neural networks more or less smoothly blend 'process' and 'state'. As an intuition pump, a vector x matrix matmul in an MLP can be viewed as changing the basis of an input vector (ie the weights act as a process) or as a way to select specific pieces of information from a set of embedding rows (ie the weights act as state).
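Both readings of the same matmul can be seen in a few lines, assuming a small made-up weight matrix:

```python
import numpy as np

W = np.arange(12, dtype=float).reshape(4, 3)   # 4 rows, each of width 3

# "Weights as state": a one-hot vector picks out a stored row.
one_hot = np.array([0.0, 0.0, 1.0, 0.0])
selected = one_hot @ W                          # identical to W[2]

# "Weights as process": a dense vector is mapped to new coordinates,
# i.e. a weighted combination of all the rows.
dense = np.array([0.5, -1.0, 2.0, 0.0])
mixed = dense @ W

print(selected, mixed)
```

The operation is the same in both cases; only the structure of the input vector decides whether it reads as a lookup or as a transformation.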
There are architectures that try to separate these out with varying degrees of success -- LSTMs and ResNets seem to have a more clear throughline of 'state' with various 'operations' that are applied to that state in sequence. But that seems really architecture-dependent.
I will openly admit though that I am very willing to be convinced by the circuits paradigm. I have a background in molecular bio and there's something very 'protein pathways' about it.
> Linear representations aren't necessarily in tension with the manifold hypothesis.
True! I suppose I was thinking about a 'strong' form of linear representations, which is something like: features are represented by linear combinations of neurons that display the same repulsion-geometries as observed in Toy Models, but that's not what you're saying / that's me jumping a step too far.
> GANs/VAEs/etc modeling things as a latent gaussian space is actually way more natural if you allow superposition
Superposition is one of those things that has always been so intuitive to me that I can't imagine it not being a part of neural network learning.
But I want to make sure I'm getting my terminology right -- why does superposition necessarily require the linear representation hypothesis? Or, to be more specific, does [individual neurons being used in combination with other neurons to represent more features than neurons] necessarily require [features are linear compositions of neurons]?
Note this happens in "uniform superposition". In reality, we're almost certainly in very non-uniform superposition.
One key term to look for is "feature manifolds" or "multi-dimensional features". Some discussion here: https://transformer-circuits.pub/2024/july-update/index.html...
(Note that the term "strong linear representation" is becoming a term of art in the literature referring to the idea that all features are linear, rather than just most or some.)
> I want to make sure I'm getting my terminology right -- why does superposition necessarily require the linear representation hypothesis? Or, to be more specific, does [individual neurons being used in combination with other neurons to represent more features than neurons] necessarily require [features are linear compositions of neurons]?
When you say "individual neurons being used in combination with other neurons to represent more features than neurons", that's a way one might _informally_ talk about superposition, but doesn't quite capture the technical nuance. So it's hard to know the full scope of what you intend. All kinds of crazy things are possible if you allow non-linear features, and it's not necessarily clear what a feature would mean.
Superposition, in the narrow technical sense of exploiting compressed sensing / high-dimensional spaces, requires linear representations and sparsity.
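A toy sketch of that narrow sense, under assumptions chosen purely for illustration (random unit directions, 1024 features in 512 dimensions, two active features, a plain dot-product readout). It is not the setup from Toy Models; it only shows that sparse features stored linearly along nearly-orthogonal directions can be read back out even when there are more features than dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims, n_active = 1024, 512, 2  # more features than dimensions

# One random unit direction per feature; in high dimensions these are nearly orthogonal.
directions = rng.standard_normal((n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Sparsity: only a couple of features are active at once.
active = rng.choice(n_features, size=n_active, replace=False)
features = np.zeros(n_features)
features[active] = 1.0

# Linear representation: the hidden vector is the sum of the active directions.
hidden = features @ directions

# Linear readout: project the hidden vector back onto every feature direction.
readout = directions @ hidden
recovered = np.argsort(readout)[-n_active:]  # indices with the largest readout
```

The active features read out near 1 while the inactive ones only pick up small interference terms, so `recovered` should match `active`. Making the input dense (many active features) makes the interference add up, which is why sparsity is part of the requirement.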
I should probably read the updates more. Not enough time in the day. But yea the way you're describing feature manifolds and multidimensional features, especially the importance of linearity-in-properties and not necessarily linearity-in-dimensions, makes a lot of sense.
> but doesn't quite capture the technical nuance. So it's hard to know the full scope of what you intend.
Fair, I'm only passingly familiar with compressed sensing so I'm not sure I could offer a more technical definition without, like, a much longer conversation! But it's good to know in the future that in a technical sense linear representations and superposition are dependent.
For anyone interested in these, may I also suggest learning about normalizing flows? (They are the broader class that flow matching falls under.) They are networks that learn coordinate changes, so the connection to geometry/topology is much more obvious. Of course the downside of flows is that you're stuck with a constant dimension (well... sorta), but I still think they can help you understand a lot more of what's going on because you are working in a more interpretable environment.
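As a minimal sketch of the "coordinate change" idea, here is the simplest possible flow: a one-dimensional affine map with a fixed shift and scale. The function names are made up for illustration and nothing is learned here; a real flow stacks many invertible layers and trains their parameters. The density comes from the change-of-variables rule: evaluate the base density at the inverse-mapped point and add the log of the Jacobian.

```python
import math


def standard_normal_logpdf(z):
    """Log density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, shift, scale):
    """Log density of x = shift + scale * z with z ~ N(0, 1)."""
    z = (x - shift) / scale           # inverse coordinate change
    log_det = -math.log(abs(scale))   # log |dz/dx|
    return standard_normal_logpdf(z) + log_det


def normal_logpdf(x, mean, std):
    """Closed-form Gaussian log density, used only as a cross-check."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


flow_value = affine_flow_logpdf(1.3, shift=0.5, scale=2.0)
direct_value = normal_logpdf(1.3, mean=0.5, std=2.0)
```

The two values agree because an affine change of coordinates applied to a standard normal is just a Gaussian with that mean and standard deviation. Since the map has to be invertible, the input and output dimensions must match, which is the constant-dimension limitation mentioned above.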
Latent spaces may or may not have useful topology, so this idea is inherently wrong, and builds the wrong type of intuition. Different neural nets will result in different feature space understanding of the same data, so I think it's incorrect to believe you're determining intrinsic geometric properties from a given neural net. I don't think people should throw around words carelessly because all that does is increase misunderstanding of concepts.
In general, manifolds can help discern useful characteristics about the feature space, and may have useful topological structures, but trying to impose an idea of "topology" on this is a stretch. Moreover, the kind of basic examples used in this blog post don't help prove the author's point. Maybe I am misunderstanding this author's description of what they mean, but this idea of manifold learning is nothing new.
Yeah deep learning is applied topology, it's also applied geometry, and probably applied algebra and I wouldn't be surprised if it was also applied number theory.
It's also for this reason that I think new knowledge is discoverable from within LLMs.
I imagine having a topographic map of some island that has only been explored partially by humans. But if I know the surrounding topography, I can make pretty accurate guesses about the areas I haven't been. And I think the same thing can be applied to certain areas of human knowledge, especially when represented as text or symbolically.
For example, let's take a look at the graph data structure. A graph has a set of stored objects (vertices) and a set of stored relations between the vertices (edges). In this way, a graph defines a topology in discrete form.
Let's take a look at the network data structure, which is closely related to the graph. It is very much the same idea, but it additionally has a value stored in every edge. A network has a set of objects (vertices) and a set of relations between the objects (edges), while the edges also hold edge values. So it is also a form of topology because the network defines the relations between the abstract objects.
In this light, you can view a graph as a neural network with {0, 1} weights. The graph edge is either present or absent, hence {0, 1} values only. The network structure, however, can hold any assigned value in every edge, so every connection between objects (neurons) can be characterized not only by its presence, but also by edge-assigned values (weights). Now we get the full model of a neural network. And yes, it is built upon topology in its discrete form.
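A small sketch of that correspondence, using a made-up three-vertex example (the edges and weight values are arbitrary). The graph is an adjacency matrix with entries in {0, 1}; the network keeps the same edges but stores a value on each; pushing a signal through either one is a single matrix product, like a linear layer without an activation function.

```python
import numpy as np

# Graph: adjacency matrix with entries in {0, 1}. Directed edges 0->1, 0->2, 1->2.
adjacency = np.array([
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])

# Network: the same edges, each now holding a (hypothetical) weight.
weights = np.array([
    [0.0, 0.5, -1.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
])

# Same discrete topology: a weight is non-zero exactly where an edge exists.
same_topology = np.array_equal(weights != 0, adjacency == 1)

# Send a unit signal out of vertex 0 along the edges.
signal = np.array([1.0, 0.0, 0.0])
graph_step = signal @ adjacency   # reaches vertices 1 and 2 with value 1
network_step = signal @ weights   # reaches the same vertices, scaled by edge weights
```

Both steps reach the same vertices; only the magnitudes differ, which is the sense in which the weights refine a fixed connectivity pattern.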
esafak•6h ago
Topological transformation of the manifold happens during training too. That makes me wonder: how does the topology evolve during training? I imagine it violently changing at first before stabilizing, followed by geometric refinement. Here are some relevant papers:
* Topology and geometry of data manifold in deep learning (https://arxiv.org/abs/2204.08624)
* Topology of Deep Neural Networks (https://jmlr.org/papers/v21/20-345.html)
* Persistent Topological Features in Large Language Models (https://arxiv.org/abs/2410.11042)
* Deep learning as Ricci flow (https://www.nature.com/articles/s41598-024-74045-9)
theahura•4h ago
If you've ever played with GANs or VAEs, you can actually answer this question! And the answer is more or less 'yes'. You can look at GANs at various checkpoints during training and see how different points in the high dimensional space move around (using tools like UMAP / TSNE).
> I imagine it violently changing at first before stabilizing, followed by geometric refinement
Also correct, though the violent changing at the beginning is also influenced by the learning rate and the choice of optimizer.
esafak•4h ago