If I understand this correctly, there are three major problems with LLMs right now.
1. LLMs reduce a very high-dimensional vector space into a very low-dimensional vector space. Since we don't know what the dimensions in the low-dimensional vector space mean, we can only check that the outputs are correct most of the time.
What research is happening to resolve this?
2. LLMs use written texts to facilitate this reduction. So they don't learn from reality, but from what humans have written down about reality.
It seems like Keen Technologies tries to avoid this issue by using (simple) robots with sensors for training instead of human text. That seems like a much slower process, but it could yield more accurate models in the long run.
3. LLMs hold internal state as a vector that reflects the meaning and context of the "conversation". That explains why the quality of responses deteriorates in longer conversations: if one vector is "stamped over" again and again, the meaning of the first "stamps" gets blurred.
Are there alternative ways of holding state, or is the only way around this to back up that state vector at every point and revert if things go awry?
(1) While studying the properties of the mathematical objects produced is important, I don't think we should understand the situation you describe as a problem to be solved. In old supervised machine learning methods, human beings were tasked with defining the rather crude 'features' of relevance in a data/object domain, so each dimension had some intuitive significance (often binary 'is tall', 'is blue' etc). The question now is really about learning the objective geometry of meaning, so the dimensions of the resultant vector don't exactly have to be 'meaningful' in the same way -- and, counter-intuitive as it may seem, this is progress. Now the question is of the necessary dimensionality of the mathematical space in which semantic relations can be preserved -- and meaning /is/ in some fundamental sense the resultant geometry.
(2) This is where the 'Platonic hypothesis' research [1] is so fascinating: empirically we have found that the learned structures from text and image converge. This isn't saying we don't need images and sensor robots, but it appears we get the best results when training across modalities (language and image, for example). This is really fascinating for how we understand language. While any particular text might get things wrong, the language that human beings have developed over however many thousands of years really does seem to do a good job of breaking out the relevant possible 'features' of experience. The convergence of models trained from language and image suggests a certain convergence between what is learnable from sensory experience of the world and the relations that human beings have slowly come to know through the relations between words.
[1] https://phillipi.github.io/prh/ and https://arxiv.org/pdf/2405.07987
I've never really challenged that text is a suitable stand-in for important bits of reality. I worry instead about meta-limitations of text: can we reliably scale our training corpus without accreting incestuous slop from other models?
Sensory bots would seem to provide a convenient way out of this problem, but I'm not well-read enough to know.
What do you mean? There is an embedding size that is maintained constant from the first layer to the last. Embedding lookup, N x transformer layers, softmax - all three of them have the same dimension.
Maybe you mean LoRA is "reducing a high-dimensional vector space into a lower-dimensional vector space"
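A minimal PyTorch sketch of that shape invariance (the layer type and sizes here are illustrative, not taken from the article):

    import torch
    import torch.nn as nn

    d_model, vocab_size, n_layers = 512, 50_000, 6

    embed = nn.Embedding(vocab_size, d_model)          # token id -> d_model vector
    blocks = nn.ModuleList([
        nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        for _ in range(n_layers)
    ])
    unembed = nn.Linear(d_model, vocab_size)           # d_model vector -> vocab logits

    x = embed(torch.randint(0, vocab_size, (1, 16)))   # shape (1, 16, 512)
    for block in blocks:
        x = block(x)                                   # shape stays (1, 16, 512) at every layer
    logits = unembed(x)                                # shape (1, 16, 50000), then softmax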
> This blog post is recommended for desktop users.
That said, there is a lot of content here that could have been mobile-friendly with very little effort. The first image, of embeddings, is a prime example. It has been a very long time since I've seen any online content, let alone a blog post, that requires a desktop browser. Simple incompetence? Not so much: whoever wrote this wanted to kick sand in the user's face.
>If your ears are more important than your eyes, you can listen to the podcast version of this article generated by NotebookLM.
It looks like an LLM would read it to you; I wonder if one could have made it mobile-friendly.
Lots of console errors with the likes of "Content-Security-Policy: The page’s settings blocked an inline style (style-src-elem) from being applied because it violates the following directive: “style-src 'self'”." etc...
https://app.vidyaarthi.ai/ai-tutor?session_id=C2Wr46JFIqslX7...
Our goal is to make abstract concepts more intuitive and interactive — kind of like a "learning-by-doing" approach. Would love feedback from folks here.
(Not trying to self-promote — just sharing a related learning tool we’ve put a lot of thought into.)
Tokens are a form of compression, and working on an uncompressed representation would require more memory and more processing power.
First, the embedding typically uses thousands of dimensions.
Then, the value along each dimension is represented with a floating-point number, which takes 16 bits (though it can be smaller with more aggressive quantization).
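As a rough back-of-the-envelope (the 4096-dimensional width below is an assumption, not a figure from the thread):

    d_model = 4096              # assumed embedding width; varies by model
    bytes_per_value = 2         # fp16
    internal = d_model * bytes_per_value
    raw_text = 4                # a typical token is only a few bytes of text
    print(internal, raw_text)   # 8192 bytes internally vs ~4 bytes of raw text per token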
My point was that you compared how the LLM represents a token internally versus how “English” transmits a word. That’s a category error.
To be pedantic, we can't feed humans ASCII directly, we have to convert it to images or sounds first.
> My original question was about that: why can't we just feed the LLMs ascii, and let it figure out how it wants to encode that internally, __implicitly__? I.e., we just design a network and feed it ascii, as opposed to figuring out an encoding in a separate step and feeding it tokens in that encoding.
That could be done by having only 256 tokens, one for each possible byte, plus perhaps a few special-use tokens like "end of sequence". But it would be much less efficient.
In either case, tokens are a necessary part of LLMs. They need a differentiable representation in order to be trained effectively. High-dimensional embeddings are differentiable and can usefully represent the "meaning" of a token.
In other words, the representation of "cat" in an LLM must be something that can be gradually nudged towards "kitten", or "print", or "excavator", or other possible meanings. This is doable with the large vector representation, but such operation makes no sense when you try to represent the meaning directly in ASCII.
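A sketch of what that byte-level scheme could look like (the names and the 512-dim width are made up for illustration). Every byte still gets a learned embedding vector that gradients can nudge; the sequences are just much longer than with a BPE vocabulary:

    import torch
    import torch.nn as nn

    VOCAB_SIZE = 256 + 1      # one token id per byte value, plus an end-of-sequence id
    EOS = 256

    def bytes_to_ids(text: str) -> torch.Tensor:
        return torch.tensor(list(text.encode("utf-8")) + [EOS])

    embed = nn.Embedding(VOCAB_SIZE, 512)   # each byte maps to a trainable 512-dim vector

    ids = bytes_to_ids("cat")               # tensor([ 99,  97, 116, 256])
    vectors = embed(ids)                    # shape (4, 512); this is what training nudges
    print(ids.tolist(), vectors.shape)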
(However, there seems to be some serious back-button / browser-history hijacking on this page. Just scrolling down the page appends a ton to my browser history, which is lame.)
So someone, at some point, thought this was a feature
This is an optimization that many vector DBs use in retrieval, since it is typically much faster to compute Euclidean distance than cosine similarity.
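The shortcut only works when the vectors are unit-normalized (which most embedding models do, but that's an assumption here); a quick numpy check:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=128), rng.normal(size=128)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # unit-normalize both vectors

    cosine = a @ b
    euclidean_sq = np.sum((a - b) ** 2)

    # For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so ranking by either gives the same order.
    print(np.isclose(euclidean_sq, 2 - 2 * cosine))       # True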
In d dimensions you can have d vectors that are mutually orthogonal.
Interestingly this means that for sequence lengths up to d, you can have precise positional targeting attention. As soon as you go to longer sequences that's no longer universally possible.
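A toy numpy illustration, with the identity matrix standing in for d mutually orthogonal key vectors:

    import numpy as np

    d = 8
    keys = np.eye(d)                       # d mutually orthogonal keys, one per position

    query = keys[3]                        # query aligned with position 3's key
    scores = keys @ query                  # 1 at position 3, 0 everywhere else
    attn = np.exp(scores) / np.exp(scores).sum()
    print(attn.argmax())                   # 3: the query singles out exactly that position

    # With more than d positions the keys can no longer all be mutually orthogonal,
    # so no query can isolate a single position this cleanly.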
[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...
One problem with this technique is that the model wasn't trained with intermediate layers being mapped to logits in the first place, so it's not clear why the LM head should be able to map them to anything sensible. But alas, like everything in DL research, they threw science at the wall and a bit of it stuck.
I've come up with so many failed analogies at this point that I've lost count (the concept of fast and slow clocks representing the positional index / angular rotation is the closest I've come so far).
Long context is almost always some form of RoPE in practice (often YaRN these days). We can't confirm this with the closed-source frontier models, but given that all the long context models in the open weight domain are absolutely encoding positional data, coupled with the fact that the majority of recent and past literature corroborates its use, we can be reasonably sure they're using some form of it there as well.
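To make the "fast and slow clocks" analogy above concrete, here is a toy sketch of the rotation RoPE applies (dimensions and positions are arbitrary):

    import numpy as np

    def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
        # Rotate consecutive feature pairs by position-dependent angles.
        d = x.shape[-1]
        out = x.astype(float).copy()
        for i in range(0, d, 2):
            theta = pos / base ** (i / d)   # low i: fast clock; high i: slow clock
            c, s = np.cos(theta), np.sin(theta)
            out[i], out[i + 1] = x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c
        return out

    q = np.ones(8)
    # The two dot products are equal: they depend only on the relative offset (2),
    # not on the absolute positions, which is the property that makes RoPE attractive.
    print(np.dot(rope(q, 5), rope(q, 7)), np.dot(rope(q, 105), rope(q, 107)))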
EDIT: there is a recent paper that addresses the sequence-modeling problem in another way, but it's somewhat orthogonal to the above, as they change the tokenization method entirely: https://arxiv.org/abs/2507.07955
My intuition for NoPE was that the presence of the causal mask provides enough of a signal to implicitly distinguish token position. If you imagine the flow of information in the transformer network, tokens later on in the sequence "absorb" information from the hidden states of previous tokens, so in this sense you can imagine information flowing "down (depth) and to the right (token position)", and you could imagine the network learning a scheme to somehow use this property to encode position.
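A toy illustration of that intuition (this is just the standard causal mask, not the NoPE paper's construction):

    import numpy as np

    seq_len = 5
    causal_mask = np.tril(np.ones((seq_len, seq_len)))   # row t may attend to positions 0..t only
    print(causal_mask.sum(axis=1))                       # [1. 2. 3. 4. 5.]

    # Each row sees a different number of tokens, so even with uniform attention the
    # averaged hidden state differs by position: one crude channel through which
    # position can leak into the network without any explicit positional encoding.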
NoPE never really took off more broadly in modern architecture implementations. We haven't seen anyone successfully reproduce the proposed solution to the long context problem presented in the paper (tuning the scaling factor in the attention softmax).
There was a recent paper back in December[1] about the idea of positional information arising from the similarity of nearby embeddings. It's again in that common research bucket of "never reproduced", but interesting. It does sound similar in spirit to the NoPE idea you mentioned of the causal mask providing some amount of position signal, i.e. we don't necessarily need to adjust the embeddings explicitly for the same information to be learned (TBD whether that proves out long term).
This all goes back to my original comment, though, about how challenging it is to communicate this idea to AI/ML neophytes. I don't think skipping the concept of positional information actually makes these systems easier to comprehend, since it's critically important to how we model language, but it's also really complicated to explain in terms of implementation.
That's a really interesting three-word noun-phrase. Is it a term-of-art, or a personal analogy?
LLM embeddings are so abstract and so far removed from any human-interpretable or statistical counterpart that even as the embeddings contain more information, that information becomes less accessible to humans.
[1] https://papers.nips.cc/paper_files/paper/2014/hash/b78666971...
They should be a really big deal! Though I can see why trying to comprehend a 1,000-dimensional vector space might be intimidating.
Also, the results were not great. Are there any good embedding API providers?
- JinaAI (https://jina.ai/embeddings/): v3 and v4 performed well in my testing.
- Google's Gemini-001 model (https://ai.google.dev/gemini-api/docs/models#gemini-embeddin...).
Overall, both were surpassed by Qwen3-8b (https://huggingface.co/Qwen/Qwen3-Embedding-8B).
Note, this was specifically regarding English and Code embedding generation/retrieval, with reranking.
Something like https://projector.tensorflow.org/
just type a word in, select UMAP projection.
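If you'd rather do something similar offline, here is a rough sketch using the umap-learn package (the random matrix below is only a stand-in for real word embeddings):

    import numpy as np
    import umap

    embeddings = np.random.default_rng(0).normal(size=(500, 300))  # 500 "words", 300 dims each
    coords_2d = umap.UMAP(n_components=2).fit_transform(embeddings)
    print(coords_2d.shape)   # (500, 2): points you can scatter-plot and explore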
So out of interest: During inference, the embedding is simply a lookup table "token ID -> embedding vector". Mathematically, you could represent this as encoding the token ID as a (very very long) one-hot vector, then passing that through a linear layer to get the embedding vector. The linear layer would contain exactly the information from the lookup table.
My question: Is this also how the embeddings are trained? I.e. just treat them as a linear layer and include them in the normal backpropagation of the model?
(If you're curious about the details, there's an example of making indexing differentiable in my minimal deep learning library here: https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...)
If N is the vocab size and L is the sequence length, you'd need to create an N x L matrix and multiply it with the embedding matrix. But since the N x L matrix is sparse, with only a single 1 per column, it makes sense to represent it internally as just one number per column: the index at which the 1 sits. At that point, if you define multiplication by this matrix specially, it basically becomes indexing with that number.
And just like you write a special forward pass, you can write a special backward pass so that backpropagation would reach it.
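A quick PyTorch sketch of that equivalence (tiny sizes, purely for illustration):

    import torch
    import torch.nn.functional as F

    vocab_size, d_model = 8, 4
    emb = torch.nn.Embedding(vocab_size, d_model)
    token_ids = torch.tensor([3, 5])

    lookup = emb(token_ids)                          # what frameworks actually do
    one_hot = F.one_hot(token_ids, vocab_size).float()
    matmul = one_hot @ emb.weight                    # the "linear layer over one-hots" view

    print(torch.allclose(lookup, matmul))            # True: identical vectors
    # In training, the gradient w.r.t. emb.weight is nonzero only for the rows that
    # were actually indexed, which is exactly what the sparse backward pass computes.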
As embeddings pass through the various layers, you can see what contribution each transformer layer makes to classification. Holes of different dimensions form (1-d, 2-d, 3-d...), and each tells you something about the shape of the data (the embeddings) as it traverses the network. It can help in reducing layers / reducing backprop; some layers are more important than others.
You will see none of this using Vietoris-Rips!
In case you want to play with and visually understand the traditional PE:
Encoders like BERT produce better results for embeddings because they look at the whole sentence, while GPTs look from left to right:
Imagine you're trying to understand the meaning of a word in a sentence, and you can read the entire sentence before deciding what that word means. For example, in "The bank was steep and muddy," you can see "steep and muddy" at the end, which tells you "bank" means the side of a river (aka riverbank), not a financial institution. BERT works this way - it looks at all the words around a target word (both before and after) to understand its meaning.
Now imagine you have to understand each word as you read from left to right, but you're not allowed to peek ahead. So when you encounter "The bank was..." you have to decide what "bank" means based only on "The" - you can't see the helpful clues that come later. GPT models work this way because they're designed to generate text one word at a time, predicting what comes next based only on what they've seen so far.
Here is another link from Hugging Face, about ModernBERT, which has more info: https://huggingface.co/blog/modernbert
Also worth a look: neoBERT https://huggingface.co/papers/2502.19587
E.g. Er macht das Fenster. vs Er macht das Fenster auf.
(He makes the window. vs He opens the window.)
No I don't think it makes any noticeable difference :)
Encoder-decoders are not in vogue.
Encoders are favored for classification, extraction (eg, NER and extractive QA) and information retrieval.
Decoders are favored for text generation, summarization and translation.
Recent research (see, eg, the Ettin paper: https://arxiv.org/html/2507.11412v1) seems to confirm the previous understanding that encoders are indeed better for “encoder tasks” and vice versa.
Fundamentally, both are transformers and so an encoder could be turned into a decoder or a decoder could be turned into an encoder.
The design difference comes down to bidirectional attention (ie, all tokens can attend to all other tokens) versus autoregressive attention (ie, the current token can only attend to the previous tokens).
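In code, that whole difference can be expressed as the attention mask (a toy 4-token example):

    import numpy as np

    seq_len = 4
    bidirectional = np.ones((seq_len, seq_len))             # encoder: every token attends to every token
    autoregressive = np.tril(np.ones((seq_len, seq_len)))   # decoder: token t attends to positions 0..t

    print(bidirectional)
    print(autoregressive)
    # Swapping one mask for the other (training objective aside) is essentially what
    # turning an encoder into a decoder, or vice versa, amounts to architecturally.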
A nice side effect of the diffusion mode is that its natural reliance on the bidirectional attention of the encoder layers provides much more flexible (and, critically, context-aware) understanding. As mentioned, later words can easily modulate earlier words, as with "bank [of the river]" / "bank [in the park]" / "bank [got robbed]", or the modern classic: telling an agent it did something wrong and expecting it to learn in-context from the mistake. (In practice, decoder-only models basically just get polluted by that, so you have to rewind the conversation, because the later correction has literally no way of backwards-affecting the problematic tokens.)
That said, the recent surge in training "reasoning" models to use thinking tokens that often get cut out of later conversation context, via a reinforcement-learning process that's not merely RLHF/preference conditioning, is actually quite related: discrete denoising diffusion models can be trained with an RL-style scheme during pretraining, where a training step is given the outcome goal and a masked version of it as the input query, and the model is then trained to manage the work done in the individual denoising steps on its own until it eventually produces the outcome goal, crucially without prescribing any order for filling in the masked tokens or how many to fill in at each step.
A recent paper on the matter: https://openreview.net/forum?id=MJNywBdSDy