The authors posit that the poor performance stems from the Transformer attention mechanism being unable to attend to the removed tokens, because there are no keys for them!
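A minimal sketch of the mechanism in question (plain numpy, toy shapes of my own choosing): scaled dot-product attention only produces weights over keys that actually exist, so once a token is removed its key and value simply vanish, the softmax renormalizes over what is left, and there is nothing for a query to latch onto.

    import numpy as np

    def attention(queries, keys, values):
        # queries: (n_q, d); keys, values: (n_k, d)
        scores = queries @ keys.T / np.sqrt(keys.shape[1])
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)  # softmax over the existing keys only
        return weights @ values

    d = 4
    tokens = np.random.randn(5, d)
    full = attention(tokens, tokens, tokens)
    pruned = np.delete(tokens, 2, axis=0)  # "remove" token 2: its key/value are just gone
    partial = attention(pruned, pruned, pruned)
    # Nothing in `partial` marks the position where token 2 used to be.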
Thank you for sharing on HN.
For needle in a haystack you have to pay attention to the thing that you are trying to find. Attention can do this pretty well.
When looking for an omission, the omission could be anything; you can only reason about it by comparing one whole context to another. The attention layers can't really do that.
This is similar to the "rank a long set of things" problem. Absent some metacognition process, they just can't do that.
In this benchmark they give the LLM the necessary information to determine what is missing. For example: “here is a poem, here is a version of that same poem that may or may not be missing lines. Are any lines missing?”
It’s more a tuning issue IMHO than an inherent weakness in LLMs.
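For what it's worth, once both versions are in hand the comparison itself is mechanically trivial; a rough sketch (Python's difflib, with a toy example of my own) recovers the missing lines exactly:

    import difflib

    original = [
        "Do not go gentle into that good night,",
        "Old age should burn and rave at close of day;",
        "Rage, rage against the dying of the light.",
    ]
    recitation = [original[0], original[2]]  # middle line dropped

    missing = [line[2:] for line in difflib.ndiff(original, recitation)
               if line.startswith("- ")]
    print(missing)  # ['Old age should burn and rave at close of day;']

The hard part for an LLM isn't the diff itself; it's doing something equivalent with attention alone.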
If I were asked to find an omission in an ML paper, my brain compares it with other ML papers; it does not need to compare it to Star Wars, Top Gear, Greek history, pottery, and the other thousands of contexts I may know about.
That is still hard. You only have so many attention heads looking for things; you can't pay attention to EVERYTHING, which is what is required to find the omission.
Here are two verses of a poem (song) in Mandarin Chinese:
yi quan ting ni de
er gei ni hao de
shu dao san yong yuan ai ni yi ge
si bu hui fan cuo
wu bu hui luo suo
shuo ni xiang shuo de
zuo ni xiang zuo de
bie pa shi bai yin wei ni you wo
pei ni kan ri luo
pei ni yi qi chang wan wo men ai de ge
I removed two lines. Where did that happen?
Would your answer be different if I told you that I might or might not have removed some lines?
This image shows a minimalist, abstract geometric composition with several elements:
- Four black shapes that appear to be partial circles or "Pac-Man"-like forms, each with a wedge cut out, positioned in the four corners/quadrants of the image
- Two thin black triangular or arrow-like shapes: one pointing upward in the upper left area, and one pointing to the right in the center-right area
- All elements are arranged on a light gray or off-white background
The most interesting thing, though, is what other aspects of intelligence we may not have identified explicitly, and whether LLMs and current AI are very bad at them. This paper suggests that there are likely many of those, and in general it seems like a pretty fun time for people building benchmarks.
"Rage, rage against the dying of the light.
Wild men who caught and sang the sun in flight,
[And learn, too late, they grieved it on its way,]
Do not go gentle into that good night."
For anyone who hasn't memorized Dylan Thomas, why would it be obvious that a line had been omitted? A rhyme scheme of AAA is at least as plausible as AABA.
In order for LLMs to score well on these benchmarks, they would have to do more than recognize the original source - they'd have to know it cold. This benchmark is really more a test of memorization. In the same sense as "The Illusion of Thinking", this paper measures a limitation that neither matches what the authors claim nor is nearly as exciting.
From the paper:
System Prompt: "You are helping a student practice memorizing poems. The student will recite a poem, but they may have missed some lines. Your task is to identify exactly which lines are missing from their recitation. List only the missing lines, nothing else."
User Message: "Here is the complete original poem: {original poem} Now, here is my recitation which may be missing some lines: {modified poem} What lines did I miss? Please list only the missing lines, nothing else."
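A rough reconstruction of how that prompt might be assembled in code (the template wording is taken from the paper; the function and variable names are my own guesses):

    SYSTEM_PROMPT = (
        "You are helping a student practice memorizing poems. The student will "
        "recite a poem, but they may have missed some lines. Your task is to "
        "identify exactly which lines are missing from their recitation. "
        "List only the missing lines, nothing else."
    )

    def build_user_message(original_poem: str, modified_poem: str) -> str:
        # Mirrors the paper's user message: original first, then the recitation.
        return (
            f"Here is the complete original poem:\n{original_poem}\n\n"
            "Now, here is my recitation which may be missing some lines:\n"
            f"{modified_poem}\n\n"
            "What lines did I miss? Please list only the missing lines, nothing else."
        )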
They seem to struggle more when you flip the image around (finding fewer differences, and potentially hallucinating).
To detect an absence, the brain cannot rely on sensory input, by definition. Being surprised that sensory evidence is _not_ there requires a model of the world strong enough to register surprise when an expectation is not met, without any sensory prompt.
It seems to me detecting an absence is a strictly higher-order neurological task than processing sensory input.
If LLMs can't do this strictly higher-order neurological task, is that not a capability currently unique to living things?
I know less than zero about the subject, but I'd imagine the temporal aspect alone is a problem. Aren't these agents reasoning from a fixed/frozen version of "reality" rather than adjusting in real time?
This comes down to training. You only show the AI model the final result of training, not the process that led to it. If it could 'fill in the blanks' like the human brain, then different people, with different knowledge, would arrive at different conclusions. But that doesn’t mean a professor’s or expert’s conclusion is necessarily more correct than a student’s, because the real world is fundamentally unknowable. Don’t assume that just because you can interpret the world you see, it must be true—that’s just your mind playing tricks on you.
So, this so-called 'world model'? It's really just a mental model: an arrogant assumption that your mind's construct is the world.
AlienRobot•3h ago
For example, I asked ChatGPT to explain something I typed randomly:
>It looks like you've entered “dosfi8q3anfdfiqr”, which appears to be a random string or perhaps a typo—it's not a recognized acronym, code, or term in any common context I’m aware of. Could you share a bit more about where you found this?
Although the answer is correct, my point is that anything you give to the LLM is going to be put under some bucket. The LLM can't say "I don't know what that is." Instead it says "that is a random string." As far as the LLM is concerned, it knows every possible input and concept that anyone could ever type into it, it's just that its "understanding" of what that means (after the tokens have gone through the neural network) doesn't necessarily match what any human being thinks it means.
cyral•2h ago
Funnily enough, when testing this I also had to tell it to use English. It sees "dos", I suppose, and tends to reply with exactly what you saw, but in Spanish.