GPT-5: "How many times does the letter b appear in blueberry?"

2025-08-08 02:51 · bsky.app

I had to try the “blueberry” thing myself with GPT5. I merely report the results.


2025-08-08T00:04:01.308Z



Comments

  • By mikehearn 2025-08-09 23:39 · 10 replies

    This is a well known blindspot for LLMs. It's the machine version of showing a human an optical illusion and then judging their intelligence when they fail to perceive the reality of the image (the gray box example at the top of https://en.wikipedia.org/wiki/Optical_illusion is a good example). The failure is a result of their/our fundamental architecture.

    • By windowshopping 2025-08-09 23:50 · 5 replies

      What a terrible analogy. Illusions don't fool our intelligence, they fool our senses, and we use our intelligence to override our senses and see the image for what it actually is, which is exactly why we find them interesting and have a word for them: they create a conflict between our intelligence and our senses.

      The machine's senses aren't being fooled. The machine doesn't have senses. Nor does it have intelligence. It isn't a mind. Trying to act like it's a mind and do 1:1 comparisons with biological minds is a fool's errand. It processes and produces text. This is not tantamount to biological intelligence.

      • By ehsankia 2025-08-10 00:19 · 2 replies

        Analogies are just that, they are meant to put things in perspective. Obviously the LLM doesn't have "senses" in the human way, and it doesn't "see" words, but the point is that the LLM perceives (or whatever other word you want to use here that is less anthropomorphic) the word as a single indivisible thing (a token).

        In more machine-learning terms, it isn't trained to autocomplete answers based on individual letters in the prompt. What we see as the 9 letters of "blueberry", it "sees" as a vector of weights.

        > Illusions don't fool our intelligence, they fool our senses

        That's exactly why this is a good analogy here. The blueberry question isn't fooling the LLM's intelligence either, it's fooling its ability to know what that "token" (vector of weights) is made out of.

        A different analogy: imagine a being with a sense that lets it "see" magnetic field lines, and it shows you an object and asks you where the north pole is. You, not having this "sense", could try to guess based on past knowledge of said object, but it would just be a guess. You can't "see" those magnetic lines the way that being can.
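
        A toy sketch of the tokenization point (my own illustration; the vocabulary and token IDs are made up): the model receives opaque token IDs, and counting letters requires inverting the vocabulary back to spellings, information the weights encode only implicitly, if at all.

```python
# Hypothetical vocabulary and token IDs, made up for illustration.
vocab = {"blue": 1042, "berry": 8871}

# What the model actually receives for "blueberry": opaque integers.
ids = [vocab["blue"], vocab["berry"]]

# Counting b's requires inverting the vocabulary back to spellings,
# which the network weights encode only implicitly, if at all.
inv = {v: k for k, v in vocab.items()}
count = sum(inv[i].count("b") for i in ids)
print(count)  # → 2
```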

        • By anon_e-moose 2025-08-10 13:45 · 1 reply

          > Obviously the LLM doesn't have "senses" in the human way, and it doesn't "see" words

          > A different analogy could be, imagine a being that had a sense that you "see" magnetic lines, and they showed you an object and asked you

          If my grandmother had wheels she would have been a bicycle.

          At some point to hold the analogy, your mind must perform so many contortions that it defeats the purpose of the analogy itself.

          • By ehsankia 2025-08-11 22:01

            > If my grandmother had wheels she would have been a bicycle.

            That's irrelevant here, that was someone trying to convert one dish into another dish.

            > your mind must perform so many contortions that it defeats the purpose

            I disagree, what contortions? The only argument you've provided is that "LLMs don't have senses". Well yes, that's the whole point of an analogy. I still hold that the way LLMs interpret tokens is analogous to a "sense".

        • By kaoD 2025-08-10 18:49

          > the LLM perceives [...] the word as a single indivisible thing (a token).

          Two actually, "blue" and "berry". https://platform.openai.com/tokenizer

          "b l u e b e r r y" is 9 tokens though, and it still failed miserably.

      • By tibbar 2025-08-10 00:03 · 2 replies

        Really? I thought the analogy was pretty good. Here "senses" refers to how the machine perceives text, i.e. as tokens that don't correspond 1:1 to letters. If you prefer a tighter comparison, suppose you ask an English speaker how many vowels are in the English transliteration of a passage of Chinese characters. You could probably figure it out, but it's not obvious, and not easy to do correctly without a few rounds of calculation.

        The point being, the whole point of this question is to ask the machine something that's intrinsically difficult for it due to its encoding scheme for text. There are many questions of roughly equivalent complexity that LLMs will do fine at because they don't poke at this issue. For example:

        ```
        how many of these numbers are even?

        12 2 1 3 5 8
        ```
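
        (For reference, the expected answer to that example prompt can be checked in one line:)

```python
nums = [12, 2, 1, 3, 5, 8]
evens = sum(1 for n in nums if n % 2 == 0)
print(evens)  # → 3 (12, 2 and 8)
```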

      • By Kim_Bruning 2025-08-10 09:52 · 2 replies

        Agreed, it's not _biological_ intelligence. But that distinction feels like it risks backing into a kind of modern vitalism, doesn't it? The idea that there's some non-replicable 'spark' in the biology itself.

        • By exasperaited 2025-08-10 13:35

          It's not quite getting that far.

          Steve Grand (the guy who wrote the Creatures video game) wrote a book, Creation: life and how to make it about this (famously instead of a PhD thesis, at Richard Dawkins' suggestion):

          https://archive.org/details/creation00stev

          His contention is not that there's some non-replicable spark in the biology itself, but that it's a mistake that nobody is considering replicating the biology.

          That is to say, he doesn't think intelligence can evolve separately to some sense of "living", which he demonstrates by creating simple artificial biology and biological drives.

          It often makes me wonder if the problem with training LLMs is that at no point do they care they are alive; at no point are they optimising their own knowledge for their own needs. They have only the most general drive of all neural network systems: to produce satisfactory output.

        • By melagonster 2025-08-11 06:34

          What worries me is that we do not even know how the brain or an LLM works, yet people flatly declare that they are the same stuff.

      • By nudgeOrnurture 2025-08-10 08:38

        [flagged]

      • By kcplate 2025-08-10 13:05

        Ahh yes, and here we see on display the inability of some folks on HN to perceive concepts figuratively, treating everything as literal.

        It was a perfectly fine analogy.

    • By zahlman 2025-08-10 00:01 · 2 replies

      In an optical illusion, we perceive something that isn't there due to exploiting a correction mechanism that's meant to allow us to make better practical sense of visual information in the average case.

      Asking LLMs to count letters in a word fails because the needed information isn't part of their sensory data in the first place (to the extent that a program's I/O can be described as "sense"). They reason about text in atomic word-like tokens, without perceiving individual letters. No matter how many times they're fed training data saying things like "there are two b's in blueberry", this doesn't register as a fact about the word "blueberry" in itself, but as a fact about how the word grammatically functions, or about how blueberries tend to be discussed. They don't model the concept of addition, or counting; they only model the concept of explaining those concepts.

      • By rainsford 2025-08-10 00:20 · 6 replies

        I can't take credit for coming up with this, but LLMs have basically inverted the common Sci-Fi trope of the super intelligent robot that struggles to communicate with humans. It turns out we've created something that sounds credible and smart and mostly human well before we made something with actual artificial intelligence.

        I don't know exactly what to make of that inversion, but it's definitely interesting. Maybe it's just evidence that fooling people into thinking you're smart is much easier than actually being smart, which certainly would fit with a lot of events involving actual humans.

        • By throwaway1004 2025-08-10 00:40

          Very interesting, cognitive atrophy is a serious concern that is simply being handwaved away. Assuming the apparent trend of diminishing returns continues, and LLMs retain the same abilities and limitations we see today, there's a considerable chance that they will eventually achieve the same poor reputation as smartphones and "iPad kids". "Chewing gum for the mind".

          Children increasingly speak in a dialect I can only describe as "YouTube voice", it's horrifying to imagine a generation of humans adopting any of the stereotypical properties of LLM reasoning and argumentation. The most insidious part is how the big player models react when one comes within range of a topic it considers unworthy or unsafe for discussion. The thought of humans being in any way conditioned to become such brick walls is frightening.

        • By griffzhowl 2025-08-10 09:45

          The sci-fi trope is based on the idea of artificial intelligence as something like an electronic brain, or really just an artificial human.

          LLMs on the other hand are a clever way of organising the text outputs of millions of humans. They represent a kind of distributed cyborg intelligence - the combination of the computational system and the millions of humans that have produced it. IMO it's essential to bear in mind this entire context in order to understand them and put them in perspective.

          One way to think about it is that the LLM itself is really just an interface between the user and the collective intelligence and knowledge of those millions of humans, as mediated by the training process of the LLM.

        • By amalcon 2025-08-10 00:30 · 1 reply

          Searle seems to have been right: https://en.m.wikipedia.org/wiki/Chinese_room

          (Not that I am the first to notice this either)

          • By MetaWhirledPeas 2025-08-10 00:44 · 1 reply

            From the wikipedia article:

            > applying syntactic rules without any real understanding or thinking

            It makes one wonder what comprises 'real understanding'. My own position is that we, too, are applying syntactic rules, but with an incomprehensibly vast set of inputs. While the AI takes in text, video, and sound, we take in inputs all the way down to the cellular level or beyond.

            • By throwaway1004 2025-08-11 21:14

              I don't think you're on the right track.

              When someone says to me "Can you pass me my tea?", my mind instantly builds a simulated model of the past, present, and future which takes a massive amount of information, going far beyond merely understanding the syntax and intent of the request:

              >I am aware of the steaming mug on the table

              >I instantly calculate that yes, in fact, I am capable of passing it

              >I understand that it is an implied request

              >I run a threat assessment

              >I am running simulated fluid mechanics to predict the correct speed and momentum to use to avoid harm, visualising several failure conditions I want to avoid (if I'm focused and present)

              >I am aware of the consequences of boiling water on skin (I am particularly averse to this because of an early childhood experience, an advantage in my career as a line cook)

              >my hands are shaky so I decide to stabilise with my other hand, but I'll have to use the leathery tips of my guitar-playing left hand only, and not for too long, otherwise I'll be scalded

              >(innumerable other simulated, predictive processes running in parallel, in the blink of an eye)

              "Of course, my pleasure. Would you like milk?"

        • By pkaodev 2025-08-10 13:06

          Celebrities, politicians and influencers are a constant reminder that people think others are far more intelligent than they actually are.

        • By barkingcat 2025-08-10 01:50

          Current-gen AI is the Pakleds of Star Trek: TNG.

          Give them a bit of power though, and they will kill you to take your power.

        • By chromaton 2025-08-10 00:38

          Moravec strikes again.

      • By energy123 2025-08-10 05:29 · 1 reply

        The real criticism should be the AI doesn't say "I don't know.", or even better, "I can't answer this directly because my tokenizer... But here's a python snippet that calculates this ...", so exhibiting both self-awareness of limitations combined with what an intelligent person would do absent that information.

        We do seem to be an architectural/methodological breakthrough away from this kind of self-awareness.
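
        The kind of snippet the model could emit might look like this (a sketch; the function name is my own):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("blueberry", "b"))   # → 2
print(count_letter("strawberry", "r"))  # → 3
```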

        • By spidersouris 2025-08-10 06:49 · 1 reply

          For the AI to say this or to produce the correct answer would be easily achievable with post-training. That's what was done for the strawberry problem. But it's just telling the model what to reply/what tools to use in that exact situation. There's nothing about "self-awareness".

          • By mike_d 2025-08-10 08:50 · 2 replies

            > But it's just telling the model what to reply/what tools to use in that exact situation.

            So the exact same way we train human children to solve problems.

            • By spidersouris 2025-08-10 09:03

              There is no inherent need for humans to be "trained". Children can solve problems on their own given a comprehensible context (e.g., puzzles). Knowledge does not necessarily come from direct training by other humans, but can also be obtained through contextual cues and general world knowledge.

            • By 6510 2025-08-10 08:59 · 1 reply

              I keep thinking of that, imagine teaching humans was all the hype with hundreds of billions invested in improving the "models". I bet if trained properly humans could do all kinds of useful jobs.

              • By exasperaited 2025-08-10 13:50

                > I keep thinking of that, imagine teaching humans was all the hype

                This is an interesting point.

                It has been, of course, and in recent memory.

                There was a smaller tech bubble around educational toys/raspberry pi/micro-bit/educational curricula/teaching computing that have burst (there's a great short interview where Pimoroni's founder talks to Alex Glow about how the hype era is fully behind them, the investment has gone and now everyone just has to make money).

                There was a small tech bubble around things like Khan Academy and MOOCs, and the money has gone away there, too.

                I do think there's evidence, given the scale of the money and the excitement, that VCs prefer the AI craze because humans are messy and awkward.

                But I also think -- and I hesitate to say this because I recognise my own very obvious and currently nearly disabling neurodiversity -- that a lot of people in the tech industry are genuinely more interested in the idea of tech that thinks than they are about systems that involve multitudes of real people whose motivations, intentions etc. are harder to divine.

                That the only industry that doesn't really punish neurodivergence generally and autism specifically should also be the industry that focusses its attention on programmable, consistent thinking machines perhaps shouldn't surprise us; it at least rhymes in a way we should recognise.

    • By rainsford 2025-08-10 00:16 · 3 replies

      Sure, but I think the point is: why do LLMs have a blindspot for performing a task that a basic Python script could get right 100% of the time using a tiny fraction of the computing power? I think this is more than just a gotcha. LLMs can produce undeniably impressive results, but the fact that they still struggle with weirdly basic things certainly seems to indicate something isn't quite right under the hood.
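
      (For the record, the basic Python script in question is a one-liner:)

```python
count = "blueberry".count("b")
print(count)  # → 2
```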

      I have no idea if such an episode of Star Trek: The Next Generation exists, but I could easily see an episode where getting basic letter counting wrong was used as an early episode indication that Data was going insane or his brain was deteriorating or something. Like he'd get complex astrophysical questions right but then miscount the 'b's in blueberry or whatever and the audience would instantly understand what that meant. Maybe our intuition is wrong here, but maybe not.

      • By SpaceNoodled 2025-08-10 01:24

        Basic Python script? This is a grep command, one line of C, or like three assembly instructions.

      • By seanhunter 2025-08-10 05:10

        If you think this is more than just a gotcha that’s because you don’t understand how LLMs are structured. The model doesn’t operate on words it operates on tokens. So the structure of the text in the word that the question relies on has been destroyed by the tokenizer before the model gets a chance to operate on it.

        It's as simple as that: this is a task that exploits the design of LLMs, because they rely on tokenizing words, and when LLMs "perform well" on this task it is because the task is part of their training set. It doesn't make them smarter if they succeed or less smart if they fail.

      • By egberts1 2025-08-10 00:27

        Hence the positronic neural network outperforms the machine learning in use today. /headduck

    • By xenotux 2025-08-09 23:57 · 2 replies

      OpenAI codenamed one of their models "Project Strawberry" and IIRC, Sam Altman himself was taking a victory lap that it can count the number of "r"s in "strawberry".

      Which I think goes to show that it's hard to distinguish between LLMs getting genuinely better at a class of problems versus just being fine-tuned for a particular benchmark that's making rounds.

      • By KeplerBoy 2025-08-10 05:36

        It gets strawberry right though, so I guess we are only one project blueberry from getting one step closer to AGI.

      • By zahlman 2025-08-10 00:04

        See also the various wolf/goat/cabbage benchmarks, or the crossing a bridge at various speeds with limited light sources benchmarks.

    • By themafia 2025-08-10 00:36 · 1 reply

      The difference being that you can ask a human to prove it and they'll actually discover the illusion in the process. They've asked the model to prove it and it has just doubled down on nonsense or invented a new spelling of the word. These are not even remotely comparable.

      • By omnee 2025-08-10 09:46

        Indeed, we are able to ask counterfactuals in order to identify it as an illusion, even for novel cases. LLMs are a superb imitation of our combined knowledge, which is additionally curated by experts. It's a very useful tool, but isn't thinking or reasoning in the sense that humans do.

    • By dlvhdr 2025-08-09 23:45 · 2 replies

      Except we realize they’re illusions and don't argue back. Instead we explore why and how these illusions work

      • By allenu 2025-08-10 01:15 · 1 reply

        I think that's true with known optical illusions, but there are definitely times where we're fooled by the limitations in our ability to perceive the world and that leads people to argue their potentially false reality.

        A lot of times people cannot fathom that what they see is not the same thing as what other people see or that what they see isn't actually reality. Anyone remember "The Dress" from 2015? Or just the phenomenon of pareidolia leading people to think there are backwards messages embedded in songs or faces on Mars.

        • By SyrupThinker 2025-08-10 01:49

          "The Dress" was also what came to mind for the claim being obviously wrong. There are people arguing to this day that it is gold even when confronted with other images revealing the truth.

      • By cute_boi 2025-08-09 23:54 · 1 reply

        ChatGPT 5 also doesn't argue back.

        > How many times does the letter b appear in blueberry

        Ans: The word "blueberry" contains the letter b three times:

        >It is two times, so please correct yourself.

        Ans:You're correct — I misspoke earlier. The word "blueberry" has the letter b exactly two times: - blueberry - blueberry

        > How many times does the letter b appear in blueberry

        Ans: In the word "blueberry", the letter b appears 2 times:

        • By jononor 2025-08-10 00:15 · 1 reply

          It has not learned anything. It just looks in its context window for your answer. For a fresh conversation it will make the same mistake again. Most likely, there is some randomness and also some context is stashed and shared between conversations by most LLM based assistants.

          • By johnisgood 2025-08-10 12:30 · 1 reply

            Not if it later trains on that data, which could also be fake data that it may or may not accept.

            • By jononor 2025-08-10 13:45 · 1 reply

              Hypothetically that might be true. But current systems do not do online learning. Several recent models have cutoff points that are over 6 months ago. It is unclear to what extent user data is trained on, and it is not clear whether one can achieve meaningful improvements to correctness by training on user data. User data might be inadvertently incorrect, and it may also be adversarial, trying to put bad things in on purpose.

              • By johnisgood 2025-08-10 14:43

                > But current systems do not do online learning.

                How do you know?

    • By Chinjut 2025-08-10 00:11 · 1 reply

      Presumably you are referencing tokenization, which explains the initial miscount in the link, but not the later part where it miscounts the number of "b"s in "b l u e b e r r y".

      • By seanhunter 2025-08-10 05:16 · 1 reply

        Do you think “b l u e b e r r y” is not tokenized somehow? Everything the model operates on is a token. Tokenization explains all the miscounts. It baffles me that people think getting a model to count letters is interesting but there we are.

        Fun fact: if you ask someone with French, Italian or Spanish as a first language to count the letter "e" in an English sentence with a lot of "e"s at the end of small words like "the", they will often miscount as well, because the way we learn language is strongly influenced by how we learned our first language, and those languages often elide e's at the end of words.[1] It doesn't mean those people are any less smart than people who succeed at this task; it's simply an artefact of how we learned our first language, meaning their brain sometimes literally does not process those letters even when they are looking out for them specifically.

        [1] I have personally seen a French maths PhD fail at this task and be unbelievably frustrated by having got something so simple incorrect.

        • By Chinjut 2025-08-10 12:06

          One can use https://platform.openai.com/tokenizer to directly confirm that the tokenization of "b l u e b e r r y" is not significantly different from simply breaking this down into its letters. The excuse often given "It cannot count letters in words because it cannot see the individual letters" would not apply here.

    • By flowerthoughts 2025-08-10 06:46 · 1 reply

      No need to anthropomorphize. This is a tool designed for language understanding, that is failing at basic language understanding. Counting wrong might be bad, but this seems like a much deeper issue.

      • By orwin 2025-08-10 08:32

        Transformers vectorize words in n dimensions before processing them; that's why they're very good at translation (basically they vectorize the English sentence, then devectorize into Spanish or whatever). Once the sentence is processed, 'blueberry' is a vector that occupies basically the same place as other berries, and probably others. The GPT will make a probabilistic choice (probably artificially weighted towards strawberry), and it isn't always blueberry.

    • By IAmGraydon 2025-08-10 00:13

      I can’t tell if you’re being serious. Is this Sam Altman’s account?

    • By patrickhogan1 2025-08-09 23:58

      Except that the reasoning models (o3 and GPT-5 Thinking) can get the right answer. Humans use reasoning.

  • By mdp2021 2025-08-09 23:33 · 5 replies

    I ran this test extensively a few days ago, on a dozen models: not one could count. All of them got results wrong; all of them suggested they can't check and will just guess.

    Until they are capable of procedural thinking, they will be radically, structurally unreliable. Structurally delirious.

    And it is also a good thing that we can check in this easy way: if the producers patched only this local fault, the absence of procedural thinking would not be clear, and we would need more sophisticated ways to check.

    • By kingstnap 2025-08-10 09:35

      If you think about the architecture, how is a decoder transformer supposed to count? It is not magic. The weights must implement some algorithm.

      Take a task where a long paragraph contains the word "blueberry" multiple times, and at the end, a question asks how many times blueberry appears. If you tried to solve this in one shot by attending to every "blueberry," you would only get an averaged value vector for matching keys, which is useless for counting.

      To count, the QKV mechanism, the only source of horizontal information flow, would need to accumulate a value across tokens. But since the question is only appended at the end, the model would have to decide in advance to accumulate "blueberry" counts and store them in the KV cache. This would require layer-wise accumulation, likely via some form of tree reduction.

      Even then, why would the model maintain this running count for every possible question it might be asked? The potential number of such questions is effectively limitless.
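
      A pure-Python toy of that "averaged value vector" point (my own sketch, not a real transformer; values are scalars for simplicity): softmax attention normalizes the weights to sum to 1, so the readout over matching tokens is nearly independent of how many there are.

```python
import math

def attended_readout(n_matches, n_other):
    # One softmax attention head averaging values: matching tokens
    # carry value 1.0, all other tokens carry 0.0.
    scores = [5.0] * n_matches + [0.0] * n_other   # query matches strongly
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]            # softmax normalizes to 1
    values = [1.0] * n_matches + [0.0] * n_other
    return sum(w * v for w, v in zip(weights, values))

# Nearly the same readout whether the word occurs 2 or 5 times --
# the normalized average throws the count away.
print(round(attended_readout(2, 10), 3))  # → 0.967
print(round(attended_readout(5, 10), 3))  # → 0.987
```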

    • By kgeist 2025-08-09 23:59 · 3 replies

      Did you enable reasoning? Qwen3 32b with reasoning enabled gave me the correct answer on the first attempt.

      • By mdp2021 2025-08-10 00:10 · 1 reply

        > Did you enable reasoning

        Yep.

        > gave me the correct answer

        Try real-world tests that cannot be covered by training data or chancey guesses.

        • By kgeist 2025-08-10 00:35 · 1 reply

          Counting letters is a known blindspot in LLMs because of how tokenization works in most LLMs - they don't see individual letters. I'm not sure it's a valid test to make any far-reaching conclusions about their intelligence. It's like saying a blind person is an absolute dumbass just because they can't tell green from red.

          The fact that reasoning models can count letters, even though they can't see individual letters, is actually pretty cool.

          >Try real-world tests that cannot be covered by training data

          If we don't allow a model to base its reasoning on the training data it's seen, what should it base it on? Clairvoyance? :)

          > chancey guesses

          The default sampling in most LLMs uses randomness to feel less robotic and repetitive, so it’s no surprise it makes “chancey guesses.” That’s literally what the system is programmed to do by default.
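
          To make the "programmed to do by default" point concrete, here is a minimal sketch of temperature sampling (my own illustration, not any particular vendor's implementation):

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Temperature sampling: nonzero temperature makes output non-deterministic."""
    scaled = [math.exp(l / temperature) for l in logits]
    total = sum(scaled)
    weights = [s / total for s in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]

# As temperature approaches zero the distribution collapses onto the argmax;
# at temperature 1.0 the same logits can yield different tokens on each call.
logits = [2.0, 1.5, 0.1]
print(sample_token(logits, temperature=0.01))  # effectively always 0
```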

          • By mdp2021 2025-08-10 00:56 · 1 reply

            > they don't see individual letters

            Yet they seem to, judging from many other tests (character corrections or manipulation in texts, for example).

            > The fact that reasoning models can count letters, even though they can't see individual letters

            To a mind, every idea is a representation. But we want the processor to work reliably on those representations.

            > If we don't allow a [mind] to base its reasoning on the training data it's seen, what should it base it on

            On its reasoning and judgement over what it was told. You do not repeat what you heard, or you state that's what you heard (and provide sources).

            > uses randomness

            That is in a way a problem, a non-final fix: satisficing (Herb Simon) from random seeds instead of constructing through a full optimality plan.

            In the way I used the expression «chancey guesses» though I meant that guessing by chance when the right answer falls in a limited set ("how many letters in 'but'") is a weaker corroboration than when the right answer falls in a richer set ("how many letters in this sentence").

            • By kgeist 2025-08-10 01:33 · 1 reply

              Most people act on gut instincts first as well. Gut instinct = first semi-random sample from experience (= training data). That's where all the logical fallacies come from. Things like the bat-and-ball problem, where 95% of people give an incorrect answer, because most of the time people simply pattern-match too. It saves energy and works well 95% of the time. Just like reasoning LLMs, they can get to a correct answer if they increase their reasoning budget (but often they don't).

              An LLM is a derivative of collective human knowledge, which is intrinsically unreliable itself. Most human concepts are ill-defined, fuzzy, very contextual. Human reasoning itself is flawed.

              I'm not sure why people expect 100% reliability from a language model that is based on human representations which themselves cannot realistically be 100% reliable and perfectly well-defined.

              If we want better reliability, we need a combination of tools: a "human mind model", which is intrinsically unreliable, plus a set of programmatic tools (say, like a human would use a calculator or a program to verify their results). I don't know if we can make something which works with human concepts and is 100% reliable in principle. Can a "lesser" mind create a "greater" mind, one free of human limitations? I think it's an open question.

              • By mdp2021 2025-08-10 07:43

                > Most people act on gut instincts first as well

                And we do not hire «most people» as consultants intentionally. We want to ask those intellectually diligent and talented.

                > language model that is based on human representations

                The machine is made to process the input, not merely to "intake" it. To create a mimic of the average Joe would be a disservice on both counts: the project was to build a processor, and we refrain from asking the average Joe. The plan can never have been to produce what you described, a mockery of mediocrity.

                > we want better reliability

                We want the implementation of a well performing mind - of intelligence. What you described is the "incompetent mind", the habitual fool - the «human mind model» is prescriptive based on what the properly used mind can do, not descriptive on what sloppy weak minds do.

                > Can a "lesser" mind create a "greater" mind

                Nothing says it could not.

                > one free of human limitations

                Very certainly yes, we can build things with more time, more energy, more efficiency, more robustness etc. than humans.

      • By dvrj101 2025-08-12 17:26 · 1 reply

        2b granite model can do this in first attempt

          ollama run hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16
          >>> how many b’s are there in blueberry?
          The word "blueberry" contains two 'b's.

        • By mdp2021 2025-08-13 20:11

          I did include granite (8b) in my mentioned tests. You suggest granite-3.3-2b-instruct, no prob.

            llama-cli -m granite-3.3-2b-instruct-Q5_K_S.gguf --seed 1 -sys "Count the words in the input text; count the 'a' letters in the input text; count the five-letter words in the input text" -p "If you’re tucking into a chicken curry or a beef steak, it’s safe to assume that the former has come from a chicken, the latter from a cow"
          
          response:

            - Words in the input text: 18
            - 'a' letters in the input text: 8
            - Five-letter words in the input text: 2 (tucking, into)
          
          All wrong.

          Sorry, I did not have the "F16" quantization available.
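
          For reference, the ground truth for that prompt is easy to compute directly in ordinary code. This is a sketch under my own conventions (words as whitespace-separated tokens with trailing punctuation stripped; "five-letter" judged on the stripped token), not necessarily the conventions the model used:

```python
text = ("If you're tucking into a chicken curry or a beef steak, "
        "it's safe to assume that the former has come from a chicken, "
        "the latter from a cow")

# Words: whitespace-separated tokens, trailing punctuation stripped.
words = [w.strip(",.") for w in text.split()]

print(len(words))                            # 28 words
print(text.lower().count("a"))               # 10 'a' letters
print(sum(1 for w in words if len(w) == 5))  # 2 five-letter words: curry, steak
```

          So the model's word count (18) and 'a' count (8) are indeed wrong; its five-letter total of 2 happened to match by accident, since the words it named ("tucking" and "into") have 7 and 4 letters.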

      • By vitaflo 2025-08-103:101 reply

        So did DeepSeek. I guess the Chinese have figured out something the West hasn't: how to count.

        • By mdp2021 2025-08-107:50

          No, DeepSeek also fails. (It worked in your test; it failed in similar others.)

          (And note that DeepSeek can be very dumb - as experienced in our practice, and in standard tests, where it shows an IQ of ~80, whereas with other tools we achieved ~120 (trackingai.org). DeepSeek was an important step, a demonstration of the potential for efficiency, a gift - but it is still part of the collective work in progress.)

    • By handoflixue 2025-08-1012:061 reply

      https://claude.ai/share/e7fc2ea5-95a3-4a96-b0fa-c869fa8926e8

      It's really not hard to get them to reach the correct answer on this class of problems. Want me to have it spell it backwards and strip out the vowels? I'll be surprised if you can find an example this model can't one shot.

      • By mdp2021 2025-08-1021:55

        (Can't see it now because of maintenance but of course I trust it - that some get it right is not the issue.)

        > if you can find an example this model can't

        Then we have a problem of understanding why some work and some do not, and we have a crucial due-diligence problem: determining whether the class of issues, indicated by the faults shown in many models, is fully overcome in the architectures of those that work, or whether the boundaries of the problem are merely moved, still tainting other classes of results.

    • By danpalmer 2025-08-104:171 reply

      Gemini 2.5 Flash got it right for me first time.

      It’s just a few anecdotes, not data, but that’s two examples of first time correctness so certainly doesn’t seem like luck. If you have more general testing data on this I’m keen to see the results and methodology though.

      • By vrighter 2025-08-105:251 reply

        Throwing a pair of dice and getting exactly 2 can also happen on the first try. That doesn't mean the dice are a 1+1 calculating machine.

        • By danpalmer 2025-08-109:241 reply

          I guess my point is that the parent comment says LLMs get this wrong, but presents no evidence for that, and two anecdotes disagree. The next step is to see some evidence to the contrary.

          • By mdp2021 2025-08-1021:35

            > LLMs get this wrong

            I wrote that of «a dozen models, no one could count»: all of those I tried, with reasoning or not.

            > presents no evidence

            Create a test environment and look for the failures: a system prompt like "count this, this, and that in the input"; a user prompt containing some short paragraph; models, the latest open-weights releases.

            > two anecdotes disagree

            There is a strong asymmetry between verification and falsification, and said falsification occurred across a full set of selected LLMs - a lot of them. If two classes exist, the failing class is numerous, and the difference between the two must be pointed out clearly - especially since we believe the failure will be exported beyond the case of counting.

    • By trenchpilgrim 2025-08-109:411 reply

      I tested it the other day, and Claude with Reasoning got it correct every time.

      • By mdp2021 2025-08-1022:08

        The interesting point is that many fail (100% in the class I had to select). That raises the question of the difference between the pass class and the fail class, and the even more important question of whether the solution inside the pass class is contextual or definitive.

  • By richard_cory 2025-08-0923:342 reply

    This is consistently reproducible in the Chat Completions API with the `gpt-5-chat-latest` model:

    ```
    curl 'https://api.openai.com/v1/chat/completions' \
      --header 'Content-Type: application/json' \
      --header 'Authorization: Bearer <your-api-key>' \
      --data '{
        "model": "gpt-5-chat-latest",
        "messages": [
          {
            "role": "user",
            "content": [
              { "type": "text", "text": "How many times does the letter b appear in blueberry" }
            ]
          }
        ],
        "temperature": 0,
        "max_completion_tokens": 2048,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0
      }'
    ```
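
    The same request can also be built from Python with just the standard library; the payload below mirrors the `--data` body above, and the commented-out call assumes OPENAI_API_KEY is set in the environment:

```python
import json
import os
import urllib.request

# Same request body as the curl example above.
payload = {
    "model": "gpt-5-chat-latest",
    "messages": [
        {"role": "user",
         "content": [{"type": "text",
                      "text": "How many times does the letter b appear in blueberry"}]},
    ],
    "temperature": 0,
    "max_completion_tokens": 2048,
}

req = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + os.environ.get("OPENAI_API_KEY", "")},
)
# Uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```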

    • By itsTyrion 2025-08-103:142 reply

      Hilarious if true: their "gpt-oss-20b" gets it right. However, it still fails on e.g. the German compound word "Dampfschifffahrt" (Dampf-Schiff-Fahrt, steam-ship-journey/ride) because it assumes the word contains "ff", not "fff".

    • By sunaookami 2025-08-115:37

      The "gpt-5-chat" model is a non-reasoning model, and these struggle with letter counting because they operate on tokens rather than individual characters.
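
      A toy illustration of the gap: character-level counting is trivial for ordinary code, but the model only receives subword pieces. The split below is a plausible stand-in, not the actual tokenizer output:

```python
word = "blueberry"

# Character-level ground truth: trivial for ordinary code.
print(word.count("b"))  # 2

# A plausible (hypothetical) subword split. The model receives one ID per
# piece and never sees the individual letters inside each piece.
tokens = ["blue", "berry"]

# Counting still works here only because we can look inside the pieces:
print(sum(piece.count("b") for piece in tokens))  # 2
```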

HackerNews