If you think about it for some time then you’ll come to realise transformers are autoencoders on steroids. A small input space is expanded onto a big manifold and contracted again.
Now, suppose you want to impose a function to regulate the output of an autoencoder. It’s actually pretty obvious that you need exactly one layer to do so… f(manifold).
I feel like every couple months someone comes back around and rediscovers how capacity works in neural networks one way or another
earthnail•16m ago
Took me a short time to understand what you mean with "autoencoders on steroids", but I believe you mean they are autoencoders with an inverse bottleneck - an intermediate representation that isn't smaller, but that's much larger than the input space. Is my understanding of your comment correct?
usernametaken29•3m ago
Kind of. Autoencoders don’t need to have an embedding that’s smaller than the input. Their only requirement is that they compress information and thus create reconstruction loss. Typically however they are not trained this way because they don’t converge.. transformers do the same thing, but they can squeeze much more bits of information through one pass because the way they are designed. This holds true even for decoder only networks because they’re still doing the same thing
soraki_soladead•14m ago
I might be misunderstanding your point but this conflates the distinguishing features of each. you mention expansion but autoencoders canonically compress their inputs. autoencoders have an explicit encoder and decoder. most transformers we interact with these days (LLMs) are decoder only. the manifold isn't typically something the model is applied to directly. we apply the function/model to the latent representations. those are what live on the manifold.
usernametaken29•26m ago
earthnail•16m ago
usernametaken29•3m ago
soraki_soladead•14m ago