AbsenceBench: Language models can't tell what's missing

2025-06-20 22:26 | arxiv.org

Abstract: Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models break down unexpectedly (AbsenceBench).
From: Harvey Yiyun Fu
[v1] Fri, 13 Jun 2025 03:38:29 UTC (5,538 KB)



Comments

  • By birdfood 2025-06-20 23:19 | 5 replies

    Perhaps related: after watching a talk by Gerald Sussman, I loaded an image of the Kanizsa triangle into Claude and asked it a pretty vague question to see if it could “see” the inferred triangle. It recognised the image and went straight into giving me a summary about it. So I rotated the image 90 degrees and tried in a new conversation; it didn't recognise the image and got the number of elements incorrect:

    This image shows a minimalist, abstract geometric composition with several elements:

    - Four black shapes that appear to be partial circles or "Pac-Man" like forms, each with a wedge cut out, positioned in the four corners/quadrants of the image
    - Two thin black triangular or arrow-like shapes - one pointing upward in the upper left area, and one pointing to the right in the center-right area
    - All elements are arranged on a light gray or off-white background

    • By latentsea 2025-06-21 0:59 | 3 replies

      I guess they will now just rotate all the images in the training data 90 degrees too to fill this kind of gap.

      • By recursivecaveat 2025-06-21 1:17 | 2 replies

        Everything old is new again: in the AlexNet paper that kicked off the deep learning wave in 2012, they describe horizontally flipping every image as a cheap form of data augmentation. Though now that we expect models to actually read text, that seems potentially counter-productive. Rotations are similar, in that you'd hope the model would learn heuristics such as that the sky is almost always at the top.
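
        A minimal numpy sketch of that kind of augmentation (a toy of mine, not the AlexNet pipeline): each transform yields an extra training sample from the same image.

          import numpy as np

          def augment(image):
              return [
                  image,
                  np.fliplr(image),      # horizontal flip, as in the AlexNet paper
                  np.rot90(image, k=1),  # 90-degree rotation
                  np.rot90(image, k=3),  # 270-degree rotation
              ]

          img = np.arange(12).reshape(3, 4)  # stand-in for an image
          print([a.shape for a in augment(img)])  # [(3, 4), (3, 4), (4, 3), (4, 3)]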

        • By latency-guy2 2025-06-21 2:16

          At least from when I was still doing this kind of work, the look angle / platform angle of the scatterer signal (radar) mattered more than rotation, but rotation was a simple way to get quite a few more samples. It never stopped being relevant :)

        • By bonoboTP 2025-06-21 9:00

          That's called data augmentation. It was common already before AlexNet. And it never stopped being common; it's still commonly done.

      • By mirekrusin 2025-06-21 5:55 | 1 reply

        That's how you train a neural network with synthetic data so it extracts actual meaning.

        That's how humans also learn, e.g. adding numbers. First there is naive memorization, followed by more examples until you get it.

        LLM training seems to be falling into the memorization trap because models are extremely good at it, orders of magnitude better than humans.

        IMHO what is missing in the training process is feedback explaining the wrong answer. What we're currently doing with training is leaving that understanding as an "exercise for the reader". We're feeding correct answers to specific, individual examples, which promotes memorization.

        What we should be doing in post-training is ditch direct backpropagation on the next token. Instead, let the model finish its wrong answer, append an explanation of why it's wrong, and continue backpropagation for the final answer - now with the explanation in context to guide it to the right place in understanding.
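
        A toy sketch of what one such training example could look like (the names and numbers are made up; there is no real training code here, only the shape of the data):

          # Build a corrective example: the model's wrong draft and a critique stay
          # in the context; the loss would apply only to the final answer tokens.
          def build_corrective_example(prompt, model_answer, critique, correct_answer):
              return {
                  "context": prompt + model_answer + "\n" + critique + "\n",
                  "target": correct_answer,
              }

          example = build_corrective_example(
              prompt="Q: 17 + 25 = ",
              model_answer="32",  # the model's wrong draft
              critique="That is wrong: 7 + 5 carries a 1, so the sum is 42.",
              correct_answer="42",
          )
          print(example["context"])
          print("train on:", example["target"])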

        What all of this means is that current models are largely underutilized and unnecessarily bloated; they contain way too much memorized information. Making a model larger is an easy, quick illusion of improvement. Models need to be squeezed more, and more focus needs to go toward the training flow itself.

        • By atwrk 2025-06-21 14:05 | 1 reply

          > That's how humans also learn, e.g. adding numbers. First there is naive memorization, followed by more examples until you get it.

          Just nitpicking here, but this isn't how humans learn numbers. They start at birth with competency up to about 3 or 5 and expand from that. So they can already work with quantities of varying size (i.e. they know which is more, the four apples on the left or the five on the right, and they also know what happens if I take one apple from the left and put it with the others on the right), and then they learn the numbers. So yes, they learn the numbers through memorization, but only the signs/symbols, not the numeric competency itself.

          • By mirekrusin 2025-06-21 16:52 | 1 reply

            Turtles all the way down: things like the meaning of "more" are also memorized, initially as e.g. "I want more food", then refined with time. E.g. a kid saying "he's more than me" is corrected by explaining that there needs to be some qualifier for a measurable quantity, e.g. "he's more tall (taller) than me" or "he is more fast (faster) than me".

            Using different modalities (like images, videos, voice/sounds instead of pure text) is interesting as well, as it helps complete the meaning, adds a sense of time, etc.

            I don't think we're born with any concepts at all, it's all quite chaotic initially with consistent sensory inputs that we use to train/stabilise our neural network. Newborns for example don't even have a concept of separation between "me and the environment around me"; it's learned.

            • By atwrk 2025-06-23 8:22 | 1 reply

              > I don't think we're born with any concepts at all, it's all quite chaotic initially with consistent sensory inputs that we use to train/stabilise our neural network.

              That is exactly the thing that doesn't seem to be true, or at least it is considered outdated in neuroscience. We very much have some concepts that are innate, and all other concepts are learned in relation to the things that are already there in our brains - at birth, mostly sensorimotor stuff. We decidedly don't learn new concepts from scratch, only in relation to already acquired concepts.

              So our brains work quite a bit differently than LLMs, despite the neuron metaphor used there.

              And regarding your food example, the difference I was trying to point out: for LLMs, the word and the concept are the same thing. For humans they are different things that are also learned differently. The memorization part (mostly) only affects the word, not the concept behind it. What you described was only the learning of the word "tall" - the child in your example already knew that the other person was taller than them; it just didn't know how to talk about that.

              • By mirekrusin 2025-06-23 12:42

                The name "LLM" became a misnomer once we started directly adding different modalities. In that sense "word and concept" are not the same thing, because a multimodal LLM can express a concept in e.g. an image as well as a sentence.

      • By littlestymaar 2025-06-21 6:23 | 2 replies

        And it will work.

        I just wish the people who believe LLMs can actually reason and generalize would see that they don't.

        • By ben_w 2025-06-21 16:18 | 1 reply

          If that were evidence that current AI doesn't reason, then the Thatcher effect would be evidence that humans don't: https://en.wikipedia.org/wiki/Thatcher_effect

          LLMs may or may not "reason", for certain definitions of the word (there are many), but this specific thing doesn't differentiate them from us.

          • By t-3 2025-06-21 18:16

            Being tricked by optical illusions is more about the sensory apparatus and image processing faculties than reasoning, but detecting optical illusions is definitely a reasoning task. I doubt it's an important enough task to train into general models though.

        • By latentsea 2025-06-21 6:41

          At this point I think all reasoning really means is having seen enough of the right training data to make the correct inferences, and they're just missing some training data.

    • By Workaccount2 2025-06-21 2:31 | 1 reply

      Show any LLM a picture of a dog with 5 legs and watch it be totally unable to count.

      • By pfdietz 2025-06-21 4:21

        Or watch them channel Abraham Lincoln.

    • By JohnKemeny 2025-06-21 8:03

      "We really don't know how to compute":

      Oct 2011, 30 comments.

      https://news.ycombinator.com/item?id=3163473

      Strange Loop video:

      July 2011, 36 comments.

      https://news.ycombinator.com/item?id=2820118

    • By iknownothow 2025-06-21 18:24

      As far as I can tell, the paper covers text documents only. Therefore your example doesn't quite apply.

      It is well known that LLMs have a ways to go when it comes to processing images like they process text or audio.

      I don't think there's any well-performing multimodal model that accepts image pixels directly. Most vision capabilities are hacks or engineered in. An image undergoes several processing steps, and each processor's outputs are fed to the transformer as tokens. This may happen in one network, but there are non-transformer networks involved. Examples of preprocessing:

      * OCR
      * CNNs (2D pattern recognizers) with different zooms, angles, slices, etc.
      * Others, maybe, too?

    • By akomtu 2025-06-21 1:54 | 1 reply

      To generalise this idea: if we look at a thousand points that more or less fill a triangle, we'll instantly recognize the shape. IMO, this simple example reveals what intelligence is really about. We spot the triangle because so much complexity - a thousand points - fits into a simple, low-entropy geometric shape. What we call IQ is the ceiling of complexity of patterns that we can notice. For example, the thousand dots may in fact represent corners of a 10-dimensional cube, rotated slightly - an easy pattern to see for a 10-d mind.

      • By saithound 2025-06-21 5:38 | 2 replies

        Cool. Since ChatGPT 4o is actually really good at this particular shape identification task, what, if anything, do you conclude about its intelligence?

        • By akomtu 2025-06-21 18:30

          Recognizing triangles isn't that impressive. The real question is the ceiling of complexity of the patterns it can identify in data. Give it a list of randomly generated xyz coords that fall on a geometric shape, or a list of points that sample the Earth's trajectory around the Sun. Will it tell you that it's an ellipse? Will it derive Newton's second law? Will it notice the deviation from the ellipse and find the rule explaining it?
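
          For instance, one way to generate that kind of test input (a toy ellipse sample; the constants are arbitrary):

            import numpy as np

            rng = np.random.default_rng(0)
            t = rng.uniform(0.0, 2.0 * np.pi, size=40)
            a, b = 3.0, 2.0  # semi-axes of the ellipse
            x = a * np.cos(t) + rng.normal(0.0, 0.01, t.shape)
            y = b * np.sin(t) + rng.normal(0.0, 0.01, t.shape)
            for xi, yi in zip(x, y):
                print(f"{xi:.3f}, {yi:.3f}")  # paste these points into a prompt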

        • By JohnKemeny 2025-06-21 7:59 | 1 reply

          The entire point here is that LLMs and image recognition software are not managing this task, so they're not really good at this particular shape identification task.

          • By saithound 2025-06-21 11:40

            No, the post's article is not about the sort of shape identification task discussed by GP. Or indeed any image recognition task: it's a paper about removed context in language.

            Fwiw, I did test GP's task on ChatGPT 4o directly before writing my comment. It is as good at it as any human.

  • By cs702 2025-06-20 22:55 | 4 replies

    Interesting. Even the most recent models perform relatively poorly when asked to identify which information in a context has been removed, given access to both the original and edited contexts.

    The authors posit that poor performance is due to the fact that the attention mechanism of Transformers cannot attend to the removed tokens, because there are no keys for them!

    Thank you for sharing on HN.

    • By yorwba 2025-06-21 5:06 | 1 reply

      There are keys to attend to; they're just in the original text instead of the modified one. Since the model receives both as input, it could theoretically attend to those keys.

      For the attention mechanism, there isn't much difference between

        Original: {shared prefix} {removed part} {shared suffix}
        Modified: {shared prefix} {shared suffix}
      
      And

        Original: {shared prefix} {shared suffix}
        Modified: {shared prefix} {added part} {shared suffix}
      
      I think you could implement an algorithm for this in RASP (a language for manually programming transformers) roughly like this:

      1. The first layer uses attention to the "Original:" and "Modified:" tokens to determine whether the current token is in the original or modified parts.

      2. The second layer has one head attend equally to all original tokens, which averages their values, and another head attends equally to all modified tokens, averaging them as well. The averages are combined by computing their difference.

      3. The third layer attends to tokens that are similar to this difference, which would be the ones in the {removed part}/{added part}.

      The only ordering-dependent part is whether you compute the difference as original_average - modified_average or the other way around.

      If a model can detect additions but not removal, that would show that it is capable of learning this or a similar algorithm in principle, but wasn't trained on enough removal-style data to develop the necessary circuitry.
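
      A toy numpy sketch of steps 2 and 3 (bag-of-words vectors stand in for attention values; this is just an illustration of the idea, not actual RASP and not a trained transformer):

        import numpy as np

        def one_hot(tokens, vocab):
            index = {w: i for i, w in enumerate(vocab)}
            vecs = np.zeros((len(tokens), len(vocab)))
            for row, tok in enumerate(tokens):
                vecs[row, index[tok]] = 1.0
            return vecs

        original = "the quick brown fox jumps over the lazy dog".split()
        modified = "the quick brown fox over the lazy dog".split()  # "jumps" removed
        vocab = sorted(set(original) | set(modified))

        orig_vecs = one_hot(original, vocab)
        mod_vecs = one_hot(modified, vocab)

        # Step 2: average each segment's values, then take the difference.
        diff = orig_vecs.mean(axis=0) - mod_vecs.mean(axis=0)

        # Step 3: score original tokens by similarity to that difference.
        scores = orig_vecs @ diff
        print("most likely removed token:", original[int(np.argmax(scores))])  # jumps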

      • By ironmanszombie 2025-06-21 22:16 | 1 reply

        Thanks for the breakdown. I am far from knowledgeable about AI, but I was wondering why a simple comparison can't work? It can definitely be coded, as you have beautifully demonstrated.

        • By yorwba 2025-06-22 4:16

          A simple comparison between which two vectors?

    • By cyanydeez 2025-06-20 23:38 | 2 replies

      For vision models, I wonder if they can train on things like photo negatives, rotated images, etc. Or madlib-like sentences where a Q/A is like "the _____ took first place in the horse show."

      • By bearseascape 2025-06-21 0:18

        The madlib-like sentences approach is actually how masked token prediction works! It was one of the pretraining tasks for BERT, but nowadays I think all (?) LLMs are trained with next token prediction instead.
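
        A string-level toy contrast of the two objectives (no real tokenizer or model involved):

          sentence = "the cat sat on the mat"

          # Masked token prediction (BERT-style): hide a token, predict it from both sides.
          masked_input, masked_target = "the cat [MASK] on the mat", "sat"

          # Next-token prediction (GPT-style): predict each token from its left context.
          tokens = sentence.split()
          causal_pairs = [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

          print(masked_input, "->", masked_target)
          print(causal_pairs[:2])  # [('the', 'cat'), ('the cat', 'sat')]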

      • By latency-guy2 2025-06-21 2:21

        For photo negatives - it usually doesn't matter. I am not up to date with what the vision folks are doing at these companies, but images are usually single channel, and for regular images more likely than not greyscale. Otherwise, for the radar folks, they're in the complex domain, and those are not RGB-based images at all but scatterer-defined.

        Additional channels being recognized in training usually didn't matter for the experiments and models I used to deal with before 2022, and if they were, certainly did not matter for colors. Then again, the work I was doing was on known (and some additional confusers) classes for object detection and classification where the color pretty much didn't matter in the first place.

    • By usaar333 2025-06-21 2:48 | 1 reply

      They don't seem to use any recent top models: no Opus, no o3, no Gemini 2.5 Pro.

      • By cs702 2025-06-22 13:29

        It seems they used the most recent models available as of March 2025.

    • By jug 2025-06-21 0:42

      And yet, there are some notable differences between them, so now that there’s a benchmark and attention given to this issue, I wonder how much better they can get. Because obviously something can be done.

  • By yousif_123123 2025-06-20 23:41 | 1 reply

    This is very interesting.

    1. The authors mention the attention mechanism being perhaps unable to attend to the location of gaps, since the gaps aren't tokens. But I would've expected a good LLM transformer to at least get close to the gap location. I don't understand why the architecture is mathematically less suitable for that; it could attend to a region that may contain gaps. I wonder if fine-tuning on a task like this could help?

    2. Shorter inputs with fewer omissions were harder to solve. That is not completely surprising: for a human doing this task, a single missing word would be harder to notice, and similarly one missing line would be harder to spot than ten. But it is still interesting for an LLM to have this problem.

    3. Reasoning models do better, as they can write out the documents and potentially solve this easily. It is still very surprising that this doesn't lead to 100% accuracy; this should be a trivial task. Like the paper says, a trivial program can be written to solve this (see the sketch below). Perhaps ChatGPT (or a similar agent) could read this paper while training, and know to write and run Python when solving an issue like this.
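
    For reference, a sketch of such a trivial program (a toy of mine, not the paper's evaluation code), using a plain line diff:

      import difflib

      # Diff the original against the edited document and report the lines
      # that were removed (the example strings below are made up).
      def find_removed_lines(original, modified):
          matcher = difflib.SequenceMatcher(
              a=original.splitlines(), b=modified.splitlines()
          )
          removed = []
          for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
              if tag in ("delete", "replace"):
                  removed.extend(matcher.a[i1:i2])
          return removed

      original = "Roses are red\nViolets are blue\nSugar is sweet\nAnd so are you"
      modified = "Roses are red\nSugar is sweet\nAnd so are you"
      print(find_removed_lines(original, modified))  # ['Violets are blue']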

    The most interesting thing, though, is what other aspects of intelligence we may not have identified explicitly, and whether LLMs and current AI are very bad at them. This paper suggests that there are likely many of those, and it seems like a pretty fun time in general for people building benchmarks.
