LLM Problems Observed in Humans

2026-01-07 15:36 · embd.cc

Published 7 Jan 2026. By Jakob Kastelic.

Agents

While some are still discussing why computers will never be able to pass the Turing test, I find myself repeatedly facing the idea that as the models improve and humans don’t, the bar for the test gets raised and eventually humans won’t pass it themselves. Here’s a list of what used to be LLM failure modes but are now more commonly observed when talking to people.

Don’t know when to stop generating

This has always been an issue in conversations: you ask a seemingly small and limited question, and in return have to listen to what seems like hours of incoherent rambling. Despite exhausting their knowledge of the topic, people will keep on talking about stuff you have no interest in. I find myself searching for the “stop generating” button, only to remember that all I can do is drop hints, or rudely walk away.

Small context window

The best thing about a good deep conversation is when the other person gets you: you explain a complicated situation you find yourself in, and find some resonance in their replies. That, at least, is what happens when chatting with the recent large models. But when subjecting the limited human mind to the same prompt—a rather long one—again and again the information in the prompt somehow gets lost, their focus drifts away, and you have to repeat crucial facts. In such a case, my gut reaction is to see if there’s a way to pay to upgrade to a bigger model, only to remember that there’s no upgrading of the human brain. At most what you can do is give them a good night’s sleep and then they may possibly switch from the “Fast” to the “Thinking” mode, but that’s not guaranteed with all people.

Too narrow training set

I’ve got a lot of interests, and on any given day I may be excited to discuss various topics, from kernels to music to cultures and religions. I know I can put together a prompt for any of today’s leading models and am essentially guaranteed a fresh perspective on the topic of interest. But pose the same prompt to people, and more often than not the reply will be a polite nod accompanied by clear signs of their thinking something else entirely, or maybe just a summary of the prompt itself, or vague general statements about how things should be. In fact, it is so rare to find someone who knows what I mean that it feels like a magic moment. With the proliferation of genuinely good models (well educated, as it were), finding a conversational partner with a good foundation of shared knowledge has become trivial with AI. This does not bode well for my interest in meeting new people.

Repeating the same mistakes

Models with a small context window, or a small number of parameters, seem to have a hard time learning from their mistakes. This should not be a problem for humans: we have a long-term memory span measured in decades, with emotional reinforcement of the most crucial memories. And yet, it happens all too often that I must point out the same logical fallacy again and again in the same conversation! Surely, I think, if I point out the mistake in the reasoning, this will count as an important correction that the brain should immediately make use of? As it turns out, there seems to be some kind of fundamental limitation on how quickly the neural connections can get rewired. Chatting with recent models, which can make use of the extra information immediately, has eroded my patience for having to repeat myself.

Failure to generalize

By this point, it’s possible to explain what happens in a given situation, and watch the model apply the lessons learned to a similar situation. Not so with humans. When I point out that the same principles would apply elsewhere, their response will fall somewhere on a spectrum between total bafflement on one end and, on the other, a face-saving explanation that the comparison doesn’t apply “because it’s different”. Indeed, the whole point of comparisons is to apply the same principles in different situations, so why the excuse? I’ve learned to take up such discussions with AI and not trouble people with them.

Failure to apply to specific situation

This is the opposite issue: given a principle stated in general terms, the person will not be able to apply it in a specific situation. Indeed, I’ve had a lifetime of observing this very failure mode in myself: given the laws of physics, which are typically “obvious” and easy to understand, I find it very difficult to calculate how long until the next eclipse. More and more, rather than think these things through myself, I’d just send a quick prompt to the most recent big model and receive a good answer in seconds. In other words, models threaten to sever me not only from other flawed humans, but from my own “slow” thinking as well!

Persistent hallucination

Understood in the medical sense, hallucination refers to something appearing real even when you know very well it isn’t. Having no direct insight into the “inner mental life” of models, we call every false fact they produce a form of hallucination. The meaning of the word is thus shifting from the medical sense towards “just being wrong, and persistently so”. This has plagued human speech for centuries. For a convenient example, look up some heated debate between proponents of science and those of religion. (As if the two need be in conflict!) When a model hallucinates, providing more context and evidence will often dispel it, but the same trick does not appear to work so well on humans.

Conclusion

Where to go from here? One conclusion is that LLMs are damaging the connection people feel with each other, much as social networks threatened to destroy it a decade earlier by replacing it with a shallower, simulated version. Another interpretation would be to conclude cynically that it’s time humans got either enhanced or replaced by a more powerful form of intelligence. I’d say we’re not entirely there yet, but some of the replacement has already been effected: I’ll never again ask a human to write a computer program shorter than about a thousand lines, since an LLM will do it better.

Indeed, why am I even writing this? I asked GPT-5 for additional failure modes and got more examples than I could hope to get from a human:

Beyond the failure modes already discussed, humans also exhibit analogues of several newer LLM pathologies: conversations often suffer from instruction drift, where the original goal quietly decays as social momentum takes over; mode collapse, in which people fall back on a small set of safe clichés and conversational templates; and reward hacking, where social approval or harmony is optimized at the expense of truth or usefulness. Humans frequently overfit the prompt, responding to the literal wording rather than the underlying intent, and display safety overrefusal, declining to engage with reasonable questions to avoid social or reputational risk. Reasoning is also marked by inconsistency across turns, with contradictions going unnoticed, and by temperature instability, where fatigue, emotion, or audience dramatically alters the quality and style of thought from one moment to the next.



Comments

  • By mrweasel 2026-01-07 16:19, 3 replies

    An absolutely enjoyable read. It also raises a good point regarding the Turing test. I have a family member who teaches adults, and as she pointed out: you won't believe how stupid some people are.

    As critical as I might be of LLMs, I fear that they already outpaced a good portion of the population "intellectually". There's a lower level, which modern LLMs won't cross, in terms of lack of general knowledge or outright stupidity.

    We may have reached a point where we can tell that we're talking to a human, because there's no way a computer would lack such basic knowledge or display similar levels of helplessness.

    • By voxleone 2026-01-07 18:30

      I sometimes feel a peculiar resonance with these models: they catch the faintest hints of irony and return astoundingly witty remarks, almost as if they were another version of myself. Yet all of the problems, inconsistencies, and surprises that arise in human thought stem from something profoundly different: our embodied experience of the world. Humans integrate sensory feedback, form goals, navigate uncertainty, and make countless micro-decisions in real time, all while reasoning causally and contextually. Cognition is active, multimodal, and adaptive; it is not merely a reflection of prior experience but a continual construction of understanding.

      And then there are some brilliant friends of mine, people with whom a conversation can unfold for days, rewarding me with the same rapid, incisive exchange we now associate with language models. There is, clearly, an intellectual and environmental element to it.

    • By skybrian 2026-01-07 17:52, 1 reply

      Whenever we're testing LLMs against people, we need to ask "which people?" Testing a chess bot against random undergrads versus chess grandmasters tells us different things.

      From an economics perspective, maybe a relevant comparison is to people who do that task professionally.

  • By bs7280 2026-01-07 16:58, 4 replies

    I've noticed that a lot of the people most skeptical of AI coding tools are biased by their experience working exclusively at some of the top software engineering organizations in the world. As someone who has never worked at a company anywhere close to FAANG, I have worked with both people and organizations that are horrifyingly incompetent. A lot of software organization paradigms are designed to play defense against poorly written software.

    I feel similarly about self-driving cars - they don't have to be perfect when half the people on the road are either high, watching reels while driving, or both.

    • By chelmzy 2026-01-07 17:02, 1 reply

      This has been my experience as well. I see very bright people lampooning LLMs because they don't perform up to their expectations, when those people are easily in the top 1% of talent in their field. I don't think they understand that the cognitive load in your average F500 role is NOT very high. Most people are doing jack shit.

      • By demorro 2026-01-07 17:12, 1 reply

        Everyone is still holding out hope for a better future. LLM advocates making this argument are saying that the field can never improve, so we might as well just let the mediocre machine run rampant.

        Perhaps idealistic, perhaps unrealistic. I'd still rather believe.

        • By chelmzy 2026-01-07 17:19, 1 reply

          I think AI adoption is going to be catastrophic, and my only hope is that we can slow down and tread carefully. The chances of that are slim. I'm certainly not pro-AI. It just really angers me to see people still denying the impact.

          • By eru 2026-01-10 6:44

            What catastrophes do you expect?

    • By eru 2026-01-10 6:43

      > A lot of software organization paradigms are designed to play defense against poorly written software.

      Maybe. But the organisations who would need the defense most are some of the least likely to apply them.

      E.g. it was the better-run organisations that had version control early, while the worse ones persisted with shared folders for longer.

      And strong type systems like what Haskell or to a lesser extent Rust have to offer are useful as safeguards for anyone, but even more useful when your organisation and its members aren't all that great. Yet again, we see more capable organisations adopting these earlier.

    • By theshrike79 2026-01-09 11:00

      Exactly, we are focusing on the absolute number of crashes by "self driving" cars.

      What we should focus on is whether they are more or less prone to accidents than actual humans, per kilometre driven.

      Again, there are those Expert Drivers who love their manual transmission BMW because automatics shift in the wrong RPM range and abhor any kind of lane assist because it doesn't drive EXACTLY like they do.

      But the vast majority of average people on the road will definitely get gains from lane assist and lane keeping functions in cars.

    • By psunavy03 2026-01-07 17:00, 3 replies

      Few things enrage me like the smell of cannabis on the highway after it was legalized in my state. Sure, hypothetically, that's the passenger. But more likely than not, it's DUI.

      • By macintux 2026-01-07 17:07

        Sitting in a Jeep with no doors, no top, and no windows has revealed to me just how common cannabis is in my state, even though it's not yet legalized. Hate the smell.

      • By codyb 2026-01-07 18:13, 1 reply

        What, as opposed to the people on painkillers, xanax, caffeine, nicotine, and of course the actual worst... too little sleep, too much alcohol, and their phones.

        • By psunavy03 2026-01-07 23:09, 1 reply

          Other things also being wrong does not make driving under the influence of cannabis any less wrong.

          • By codyb 2026-01-08 16:57, 1 reply

            The studies are actually pretty interesting here...

            This article is from 2021 - https://www.iihs.org/news/detail/crash-rates-jump-in-wake-of...

            The conclusion seems to be that if you _only_ smoke marijuana you're actually less likely to be involved in a crash than a sober driver, but if you combine marijuana with alcohol you're _more_ likely to crash (which, duh).

            Obviously not totally conclusive, but interesting nonetheless. Anecdotally, coming from a high school where folks smoked and drove all the time because they couldn't smoke in their houses or on the street, where they'd face police harassment, it was always the alcohol that got them nabbed for DUIs. It's anecdotal, but my anecdotes are many, and I'm not sure I've heard of anyone I've ever known crashing while just smoking weed.

            So... maybe everyone should toke a little before they drive; it sounds like they'd leave more distance to the cars in front of them, go at a more relaxed pace, and not try any crazy passes of the people in front of them. Road rage is a very real thing in America, and the stereotype isn't of your typical stoner.

            • By eru 2026-01-10 6:45

              > The conclusion seems to be that if you _only_ smoke marijuana you're actually less likely to be involved in a crash than a sober driver, [...]

              I assume that's not a randomised controlled study? I.e. there's probably all kinds of confounders etc.

      • By bs7280 2026-01-07 17:10

        Off topic from my original comment, but I live in Chicago and have seen some of the most batshit insane drivers / behavior on the road you could imagine. People smoking are often the least of my worries (not to say it's OK).

  • By chankstein38 2026-01-07 16:13, 9 replies

    While I haven't experienced LLMs correcting most (or any) of the problems listed fully and consistently, I do agree that consistent use of LLMs and dealing with their frustrations has worn down my patience for conversations with people who exhibit the same issues when talking.

    It's kind of depressing. I just want the LLM to be a bot that responds to what I say with a useful response. However, for some reason, both Gemini and ChatGPT tend to argue with me so heavily and inject their own weird stupid ideas on things, making it even more grating to interact with them. That chews away at my normal interpersonal patience, which, as someone on the spectrum, was already limited.

    • By rguzman 2026-01-07 16:58

      > However, for some reason, both Gemini and ChatGPT tend to argue with me so heavily and inject their own weird stupid ideas on things

      do you have examples of this?

      asking because this is not what happens to me. one of the main things i worry about when interacting with the llm is that they agree with me too easily.

    • By acedTrex 2026-01-07 16:15

      This is why i simply do not bother with them unless the task i need is so specific that there's no room for argument, like yesterday i asked it to generate me a bash script that ran aws ssm commands for all the following instance IDs. It did that as a two shot.

      But long conversations are never worth it.

    • By bicepjai 2026-01-07 16:51

      >>> … However, for some reason, both Gemini and ChatGPT tend to argue with me so heavily and inject their own weird stupid ideas on things …

      This is something I have not experienced. Can you provide examples?

    • By agloe_dreams 2026-01-07 16:55

      Yeah, this is exactly the opposite of my issue with LLMs. They often take what you say as the truth when it absolutely could not be.

    • By mikasisiki 2026-01-07 17:33

      There was a period when coding agents would always agree with you, even if you gave them a really bad idea. They’d always start with something like, “You’re right — I should…”.

      Back then, what we actually wanted was for them to push back and argue with us.

    • By GuB-42 2026-01-07 17:34, 1 reply

      I have taken the stance of not arguing with LLMs: don't give them any clues, and don't ask them to roleplay. Tell them no more than what they need to know.

      And if they get the answer wrong, don't try to correct them or guide them; there is a high chance they don't have the answer, and what follows will be hallucinations. You can ask for details, but don't try to go against them: they will just assume you are right (even if you are not) and hallucinate around that. Keep what you already know to yourself.

      As for the "you are an expert" prompts, it will mostly just make the LLM speak more authoritatively, but it doesn't mean it will be more correct. My strategy is now to give the LLM as much freedom as it can get, it may not be the best way to extract all the knowledge it has, but it helps spot hallucinations.

      You can argue with actual people; if both of you are open enough, something greater may come out of it. But if not, it is useless, and with LLMs it is always useless: they are pretrained, and they won't get better in the future because that little conversation sparked their interest. On your side, you will just have your own points rephrased and sent back to you, and that will only put you deeper in your own bubble.

      • By eru 2026-01-10 6:46

        What's the purpose of your stance? What are you trying to achieve?

    • By okwhateverdude 2026-01-07 17:35, 1 reply

      > However, for some reason, both Gemini and ChatGPT tend to argue with me

      The trick here is: "Be succinct. No commentary."

      And sometimes a healthy dose of expressing frustration or anger (cursing, berating, threatening) also gets them to STFU and do the thing. As in literally: "I don't give a fuck about your stupid fucking opinions on the matter. Do it exactly as I specified"

      Also generally the very first time it expresses any of that weird shit, your context is toast. So even correcting it is reinforcing. Just regenerate the response.

      • By CamperBob2 2026-01-07 18:01, 1 reply

        > And sometimes a healthy dose of expressing frustration or anger (cursing, berating, threatening) also gets them to STFU and do the thing. As in literally: "I don't give a fuck about your stupid fucking opinions on the matter. Do it exactly as I specified"

        Last time I bawled out an LLM and forced it to change its mind, I later realized that the LLM was right the first time.

        One of those "Who am I and how did I end up in this hole in the ground, and where did all these carrots and brightly-colored eggs come from?" moments, of the sort that seem to be coming more and more frequently lately.

        • By Aerbil313 2026-01-07 20:04, 1 reply

          Yeah, same. Lately almost every time I think "Oh no way, this is not the correct way/not the optimal way/it's a hallucination" it later turns out that it's actually the correct way/the optimal way/it's not a hallucination. I now think twice before doing anything differently than what the LLM tells me unless I'm an expert on the subject and can already spot mistakes easily.

          It seems like they really figured out grounding and the like in the last couple of months.

          • By eru 2026-01-10 6:48

            I wouldn't worry too much about these false negatives: your human friends might be cross if you constantly accuse them of being wrong when they are actually right, but the LLMs are too polite to hold a grudge.

    • By empath75 2026-01-07 16:41, 2 replies

      > ChatGPT tend to argue with me so heavily

      I have found that quite often, when ChatGPT digs in on something, it is in fact right and I was the one who was wrong. Not always, maybe not even most of the time, but often enough that it gives me pause and makes me double-check.

      Also, when you have an LLM that is too agreeable, that is how it gets into a folie à deux situation and starts participating in the user's delusions, with disastrous outcomes.

      • By dns_snek 2026-01-07 17:07, 3 replies

        > Also, when you have an LLM that is too agreeable...

        It's not a question of whether an LLM should be agreeable or argumentative. It should aim to be correct - it should be agreeable about subjective details and matters of taste, it should be argumentative when the user is wrong about a matter of fact or made an error, and it should be inquisitive and capable of actually re-evaluating a stance in a coherent and logically sound manner when challenged by the user instead of either "digging in" or just blindly agreeing.

        • By topaz0 2026-01-07 18:38

          This is what people should want from an intellectual slave, sure, and I don't think it's going to happen for llms.

        • By ACCount37 2026-01-07 20:56

          Which is not an easy thing to tune for.

          So much easier to just make it agree all the time or disagree all the time. And trying to bottle the lightning often just causes degeneracy when you fail.

        • By empath75 2026-01-07 17:57, 1 reply

          > it should be agreeable about subjective details and matters of taste,

          What if your subjective opinion is that you think life isn't worth living? How should an LLM respond to that?

          • By dns_snek 2026-01-07 18:09, 1 reply

            That's philosophy and mental health; I was talking about technical or other "work" topics.

            But to answer the question, it depends on the framing - if someone starts the chat by saying that they feel like life isn't worth living then the LLM should probably suggest reaching out to local mental health services and either stop the conversation or play a role in "listening" to them. It shouldn't judge, encourage, or agree necessarily. But it would probably be best to cut the conversation unless there's a really high level of confidence that the system won't cause harm.

      • By crazygringo 2026-01-07 17:31

        This is my experience too. About 2/3 of the time my question/prompt contained ambiguity and it interpreted it differently (but validly), so it's just about misunderstanding, but maybe 1/3 of the time I'm surprised to discover something I didn't know. I double-check it on Wikipedia and a couple of other places and learn something new.

HackerNews