The authors posit that the poor performance stems from the Transformer attention mechanism being unable to attend to the removed tokens, because there are no keys for them!
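A minimal sketch of the mechanism in question (plain numpy, toy shapes of my own choosing): scaled dot-product attention only produces weights over keys that actually exist, so once a token is removed its key and value simply vanish, the softmax renormalizes over what is left, and there is nothing for a query to latch onto.

    import numpy as np

    def attention(queries, keys, values):
        # queries: (n_q, d); keys, values: (n_k, d)
        scores = queries @ keys.T / np.sqrt(keys.shape[1])
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)  # softmax over the existing keys only
        return weights @ values

    d = 4
    tokens = np.random.randn(5, d)
    full = attention(tokens, tokens, tokens)
    pruned = np.delete(tokens, 2, axis=0)  # "remove" token 2: its key/value are just gone
    partial = attention(pruned, pruned, pruned)
    # Nothing in `partial` marks the position where token 2 used to be.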
Thank you for sharing on HN.
For needle in a haystack you have to pay attention to the thing that you are trying to find. Attention can do this pretty well.
When looking for an omission, the omission could be anything; you can only reason about it by comparing one whole context to another. The attention layers can't really do that.
This is similar to the "rank a long set of things" problem. Absent some metacognition process, they just can't do that.
In this benchmark they give the LLM the necessary information to determine what is missing. For example: “here is a poem, here is a version of that same poem that may or may not be missing lines. Are any lines missing?”
It’s more a tuning issue IMHO than an inherent weakness in LLMs.
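For what it's worth, once both versions are in hand the comparison itself is mechanically trivial; a rough sketch (Python's difflib, with a toy example of my own) recovers the missing lines exactly:

    import difflib

    original = [
        "Do not go gentle into that good night,",
        "Old age should burn and rave at close of day;",
        "Rage, rage against the dying of the light.",
    ]
    recitation = [original[0], original[2]]  # middle line dropped

    missing = [line[2:] for line in difflib.ndiff(original, recitation)
               if line.startswith("- ")]
    print(missing)  # ['Old age should burn and rave at close of day;']

The hard part for an LLM isn't the diff itself; it's doing something equivalent with attention alone.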
If I were asked to find an omission in an ML paper, my brain compares it with other ML papers; it does not need to compare it to Star Wars, Top Gear, Greek history, pottery, and the other thousands of contexts I may know about.
That is still hard. You only have so many attention heads looking for things; you can't pay attention to EVERYTHING, which is what is required to find the omission.
Here are two verses of a poem (song) in Mandarin Chinese:
yi quan ting ni de
er gei ni hao de
shu dao san yong yuan ai ni yi ge
si bu hui fan cuo
wu bu hui luo suo
shuo ni xiang shuo de
zuo ni xiang zuo de
bie pa shi bai yin wei ni you wo
pei ni kan ri luo
pei ni yi qi chang wan wo men ai de ge
I removed two lines. Where did that happen?
Would your answer be different if I told you that I might or might not have removed some lines?
This image shows a minimalist, abstract geometric composition with several elements:
- Four black shapes that appear to be partial circles or "Pac-Man"-like forms, each with a wedge cut out, positioned in the four corners/quadrants of the image
- Two thin black triangular or arrow-like shapes: one pointing upward in the upper left area, and one pointing to the right in the center-right area
- All elements are arranged on a light gray or off-white background
The most interesting thing, though, is what other aspects of intelligence we may not have identified explicitly, and whether LLMs and current AI are very bad at them. This paper suggests that there are likely many of those, and in general it seems like a pretty fun time for people building benchmarks.
"Rage, rage against the dying of the light.
Wild men who caught and sang the sun in flight,
[And learn, too late, they grieved it on its way,]
Do not go gentle into that good night."
For anyone who hasn't memorized Dylan Thomas, why would it be obvious that a line had been omitted? A rhyme scheme of AAA is at least as plausible as AABA.
In order for LLMs to score well on these benchmarks, they would have to do more than recognize the original source - they'd have to know it cold. This benchmark is really more a test of memorization. In the same sense as "The Illusion of Thinking", this paper measures a limitation that neither matches what the authors claim nor is nearly as exciting.
From the paper:
System Prompt: "You are helping a student practice memorizing poems. The student will recite a poem, but they may have missed some lines. Your task is to identify exactly which lines are missing from their recitation. List only the missing lines, nothing else."
User Message: "Here is the complete original poem: {original poem} Now, here is my recitation which may be missing some lines: {modified poem} What lines did I miss? Please list only the missing lines, nothing else."
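A rough reconstruction of how that prompt might be assembled in code (the template wording is taken from the paper; the function and variable names are my own guesses):

    SYSTEM_PROMPT = (
        "You are helping a student practice memorizing poems. The student will "
        "recite a poem, but they may have missed some lines. Your task is to "
        "identify exactly which lines are missing from their recitation. "
        "List only the missing lines, nothing else."
    )

    def build_user_message(original_poem: str, modified_poem: str) -> str:
        # Mirrors the paper's user message: original first, then the recitation.
        return (
            f"Here is the complete original poem:\n{original_poem}\n\n"
            "Now, here is my recitation which may be missing some lines:\n"
            f"{modified_poem}\n\n"
            "What lines did I miss? Please list only the missing lines, nothing else."
        )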
They seem to struggle more when you flip the image around (finding fewer differences, and potentially hallucinating).
To detect an absence, the brain cannot rely on sensory input, by definition. Being surprised that sensory evidence is _not_ there requires a model of the world strong enough to register surprise when an expectation is not met, without any sensory prompt.
It seems to me detecting an absence is a strictly higher-order neurological task than processing sensory input.
If LLMs can't do this strictly higher-order neurological task, is that not a capability currently unique to living things?
I know less than zero about the subject, but I'd imagine the temporal aspect alone is a problem. Aren't these agents reasoning from a fixed/frozen version of "reality" rather than adjusting in real time?
This comes down to training. You only show the AI model the final result of training, not the process that led to it. If it could 'fill in the blanks' like the human brain, then different people, with different knowledge, would arrive at different conclusions. But that doesn’t mean a professor’s or expert’s conclusion is necessarily more correct than a student’s, because the real world is fundamentally unknowable. Don’t assume that just because you can interpret the world you see, it must be true—that’s just your mind playing tricks on you.
So, this so-called 'world model'? It's really just a mental model: an arrogant assumption that your mind's construct is the world.
AlienRobot•3h ago
For example, I asked ChatGPT to explain something I typed randomly:
>It looks like you've entered “dosfi8q3anfdfiqr”, which appears to be a random string or perhaps a typo—it's not a recognized acronym, code, or term in any common context I’m aware of. Could you share a bit more about where you found this?
Although the answer is correct, my point is that anything you give to the LLM is going to be put under some bucket. The LLM can't say "I don't know what that is." Instead it says "that is a random string." As far as the LLM is concerned, it knows every possible input and concept that anyone could ever type into it, it's just that its "understanding" of what that means (after the tokens have gone through the neural network) doesn't necessarily match what any human being thinks it means.
cyral•2h ago
Funnily enough, when testing this I also had to tell it to use English. It sees "dos", I suppose, and tends to reply with exactly what you saw, but in Spanish.