If I understand this correctly, there are three major problems with LLMs right now.
1. LLMs reduce a very high-dimensional vector space into a very low-dimensional vector space. Since we don't know what the dimensions in the low-dimensional vector space mean, we can only check that the outputs are correct most of the time. (A small code sketch of this follows below the list.)
What research is happening to resolve this?
2. LLMs use written texts to facilitate this reduction. So they don't learn from reality, but from what humans have written down about reality.
It seems like Keen Technologies tries to avoid this issue by using (simple) robots with sensors for training instead of human text. That is a much slower process, but it could yield more accurate models in the long run.
3. LLMs hold internal state as a vector that reflects the meaning and context of the "conversation". That would explain why the quality of responses deteriorates in longer conversations: if one vector is "stamped over" again and again, the meaning of the first "stamps" gets blurred.
Are there alternative ways of holding state, or is the only way around this to back up that state vector at every point and revert if things go awry?
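A minimal sketch of what point 1 looks like in practice, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (both just example choices on my part, not something from the thread):

    # Point 1: a whole sentence gets mapped to one fixed-size, comparatively
    # low-dimensional vector whose individual components have no obvious meaning.
    # Library and model choice are assumptions for this sketch.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vec = model.encode("The bank was steep and muddy.")

    print(vec.shape)  # (384,) -- one vector for the whole sentence
    print(vec[:5])    # individual dimensions are not human-interpretable

We can inspect the 384 numbers, but not read meaning off any single one of them, which is why evaluation falls back to checking whether outputs look right.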
> This blog post is recommended for desktop users.
That said, there is a lot of content here that could have been mobile-friendly with very little effort. The first image, of embeddings, is a prime example. It has been a very long time since I've seen any online content, let alone a blog post, that requires a desktop browser.

Lots of console errors with the likes of "Content-Security-Policy: The page’s settings blocked an inline style (style-src-elem) from being applied because it violates the following directive: “style-src 'self'”." etc...
https://app.vidyaarthi.ai/ai-tutor?session_id=C2Wr46JFIqslX7...
Our goal is to make abstract concepts more intuitive and interactive — kind of like a "learning-by-doing" approach. Would love feedback from folks here.
(Not trying to self-promote — just sharing a related learning tool we’ve put a lot of thought into.)
Tokens are a form of compression, and working on an uncompressed representation would require more memory and more processing power.
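As a rough illustration (tiktoken and the cl100k_base encoding are just one concrete tokenizer, my choice of example):

    # A tokenizer compresses a stream of characters into far fewer integer IDs,
    # and those IDs are what the model actually processes.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Tokens are a form of compression."
    ids = enc.encode(text)

    print(len(text))  # 33 characters
    print(len(ids))   # noticeably fewer tokens (roughly 7 with this encoding)
    print(ids)        # the compressed representation the model works on

Fewer positions per sentence means shorter sequences to attend over, which is where the memory and compute savings come from.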
(However, there seems to be some serious back-button / browser history hijacking on this page. Just scrolling down the page appends a ton to my browser history, which is lame.)
dust42•1h ago
Encoders like BERT produce better results for embeddings because they look at the whole sentence, while GPTs look from left to right:
Imagine you're trying to understand the meaning of a word in a sentence, and you can read the entire sentence before deciding what that word means. For example, in "The bank was steep and muddy," you can see "steep and muddy" at the end, which tells you "bank" means the side of a river (aka riverbank), not a financial institution. BERT works this way - it looks at all the words around a target word (both before and after) to understand its meaning.
Now imagine you have to understand each word as you read from left to right, but you're not allowed to peek ahead. So when you encounter "The bank was..." you have to decide what "bank" means based only on "The" - you can't see the helpful clues that come later. GPT models work this way because they're designed to generate text one word at a time, predicting what comes next based only on what they've seen so far.
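To make that concrete, here is a hedged sketch using Hugging Face's fill-mask pipeline (bert-base-uncased is my choice of example model, not something from the article):

    # BERT can use the words *after* the blank ("steep and muddy") to decide
    # what goes in it -- a left-to-right model has no access to that right-hand
    # context at prediction time.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    for pred in unmasker("The [MASK] was steep and muddy."):
        print(pred["token_str"], round(pred["score"], 3))
    # The top guesses are driven by the right-hand context (terrain-like words),
    # which is only possible because BERT attends in both directions.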
Here is a link, also from huggingface, about ModernBERT, which has more info: https://huggingface.co/blog/modernbert
Also worth a look: neoBERT https://huggingface.co/papers/2502.19587
ubutler•1h ago
Encoder-decoders are not in vogue.
Encoders are favored for classification, extraction (eg, NER and extractive QA) and information retrieval.
Decoders are favored for text generation, summarization and translation.
Recent research (see, eg, the Ettin paper: https://arxiv.org/html/2507.11412v1 ) seems to confirm the previous understanding that encoders are indeed better for “encoder tasks” and vice versa.
Fundamentally, both are transformers and so an encoder could be turned into a decoder or a decoder could be turned into an encoder.
The design difference comes down to bidirectional (ie, all tokens can attend to all other tokens) versus autoregressive attention (ie, the current token can only attend to the previous tokens).
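A tiny sketch of that mask difference (PyTorch, my own illustration rather than anything from the paper):

    # Bidirectional (encoder): every token may attend to every other token.
    # Autoregressive (decoder): token i may only attend to tokens 0..i.
    import torch

    seq_len = 5
    bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    print(bidirectional_mask.int())  # all ones
    print(causal_mask.int())         # lower-triangular

Same transformer block either way; swapping the mask (and retraining) is essentially what turns an encoder into a decoder or vice versa.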