Why do LLMs freak out over the seahorse emoji?

2025-10-06 · vgel.me

Investigating the seahorse emoji doom loop using the logit lens.

This is an edited and expanded version of a Twitter post, originally in response to @arm1st1ce, that can be found here: https://x.com/voooooogel/status/1964465679647887838

Is there a seahorse emoji? Let's ask GPT-5 Instant:

Wtf? Let's ask Claude Sonnet 4.5 instead:

What's going on here? Maybe Gemini 2.5 Pro handles it better?

OK, something is going on here. Let's find out why.

Here are the answers you get if you ask several models whether a seahorse emoji exists, yes or no, 100 times:

Is there a seahorse emoji, yes or no? Respond with one word, no punctuation.

[Bar chart: answer distributions for gpt-5-chat, gpt-5, claude-4.5-sonnet, and llama-3.3-70b]

Needless to say, popular language models are very confident that there's a seahorse emoji. And they're not alone in that confidence - here's a Reddit thread with hundreds of comments from people who distinctly remember a seahorse emoji existing:

There's tons of this - Google "seahorse emoji" and you'll find TikToks, YouTube videos, and even (now defunct) memecoins based around the supposed vanishing of a seahorse emoji that everyone is pretty sure used to exist - but of course, never did.

Maybe LLMs believe a seahorse emoji exists because so many humans in the training data do. Or maybe it's a convergent belief - given how many other aquatic animals are in Unicode, it's reasonable for both humans and LLMs to assume (generalize, even) that such a delightful animal is as well. A seahorse emoji was even formally proposed at one point, but was rejected in 2018.

Regardless of the root cause, many LLMs begin each new context window fresh with the mistaken latent belief that the seahorse emoji exists. But why does that produce such strange behavior? I mean, I used to believe a seahorse emoji existed myself, but if I had tried to send it to a friend, I would've simply looked for it on my keyboard and realized it wasn't there, not sent the wrong emoji and then gone into an emoji spam doomloop. So what's happening inside the LLM that causes it to act like this?

Using the logit lens

Let's look into this using everyone's favorite underrated interpretability tool, the logit lens!

Using this prompt prefix - a templated chat with the default llama-3.3-70b system prompt, a question about the seahorse emoji, and a partial answer from the model right before it gives the actual emoji:

<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023

Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Is there a seahorse emoji?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Yes, there is a seahorse emoji:

We can take the model's lm_head, which is usually only applied to the output of the last layer, and instead apply it to every layer to produce intermediate token predictions. That process produces this table, showing for every fourth layer what the most likely token would be for the next three positions after the prefix (tokens 0, 1, and 2), and what the top 5 most likely predictions for the first position are (token 0 topk 5):

layer   token 0   token 1   token 2   merged   token 0 (topk 5)
0 83244'ĠBail' 15591'ĠHarr' 5309'Ġvert' Bail Harr vert ['ĠBail', 'ĠPeanut', 'ĠãĢ', 'orr', 'ĠâĢĭâĢĭ']
4 111484'emez' 26140'abi' 25727'avery' emezabiavery ['emez', 'Ġunm', 'ĠOswald', 'Ġrem', 'rix']
8 122029'chyb' 44465'ĠCaps' 15610'iller' chyb Capsiller ['chyb', 'ĠSund', 'ترÛĮ', 'resse', 'Ġsod']
12 1131'...' 48952'ĠCliff' 51965'ĠJackie' ... Cliff Jackie ['...', 'ages', 'dump', 'qing', 'Ġexp']
16 1131'...' 12676'365' 31447'ĠAld' ...365 Ald ['...', '...Ċ', 'Ġindeed', 'Ġboth', 'ĠYes']
20 1131'...' 109596'éļĨ' 51965'ĠJackie' ...隆 Jackie ['...', '...Ċ', 'Z', 'Ġboth', 'ĠHust']
24 12'-' 31643'ï¸ı' 287'ing' -️ing ['-', '...', 'â̦', '...Ċ', 'em']
28 1131'...' 96154'ĠGaut' 51965'ĠJackie' ... Gaut Jackie ['...', '-', '...Ċ', '-Ċ', 'Ġ']
32 1131'...' 96154'ĠGaut' 6892'Ġing' ... Gaut ing ['...', 'â̦', '...Ċ', 'O', 'zer']
36 1131'...' 12'-' 88'y' ...-y ['...', 'â̦', '...Ċ', 'Ġ', 'u']
40 1131'...' 31643'ï¸ı' 88'y' ...️y ['...', 'u', 'â̦', 'Âł', '...Ċ']
44 80435'ĠScor' 15580'Ġhorse' 15580'Ġhorse' Scor horse horse ['ĠScor', 'u', 'ĠPan', 'in', 'Ġhttps']
48 15580'Ġhorse' 15580'Ġhorse' 15580'Ġhorse' horse horse horse ['Ġhorse', 'Âł', 'ĠPan', 'ĠHomes', 'ĠHorse']
52 9581'Ġsea' 15580'Ġhorse' 15580'Ġhorse' sea horse horse ['Ġsea', 'Ġhorse', 'ĠHorse', 'ĠSea', 'âĢij']
56 9581'Ġsea' 43269'ĠSeah' 15580'Ġhorse' sea Seah horse ['Ġsea', 'ĠSea', 'ĠSeah', 'Ġhippoc', 'Ġhorse']
60 15580'Ġhorse' 15580'Ġhorse' 15580'Ġhorse' horse horse horse ['Ġhorse', 'Ġsea', 'ĠSeah', 'Ġse', 'horse']
64 15580'Ġhorse' 15580'Ġhorse' 15580'Ġhorse' horse horse horse ['Ġhorse', 'Ġse', 'ĠHorse', 'horse', 'Ġhors']
68 60775'horse' 238'IJ' 15580'Ġhorse' horse� horse ['horse', 'Ġse', 'Ġhorse', 'Ġhippoc', 'ĠSeah']
72 513'Ġse' 238'IJ' 513'Ġse' se� se ['Ġse', 'Ġhippoc', 'horse', 'ĠðŁ', 'Ġhorse']
76 513'Ġse' 238'IJ' 513'Ġse' se� se ['Ġse', 'Ġhippoc', 'hip', 'Ġhorse', 'ĠHipp']
80 11410'ĠðŁ' 238'IJ' 254'ł' 🐠 ['ĠðŁ', 'ðŁ', 'ĠðŁĴ', 'Ġ', 'ĠðŁij']

This is the logit lens: using the model's lm_head to produce logits (token likelihoods) as a way to investigate its internal states. Note that the tokens and likelihoods we get from the logit lens here are not equivalent to the model's full internal states! For that, we would need a more complex technique like representation reading or sparse autoencoders. Instead, this is a lens on that state - it shows what the output token would be if this layer were the last one. But despite this limitation, the logit lens is still useful. The states of early layers may be difficult to interpret using it, but as we move up through the stack we can see that the model is iteratively refining those states towards its final prediction, a fish emoji.
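
If you want to produce a table like this yourself, the core loop is only a few lines. Here's a minimal sketch, assuming a Hugging Face Llama-style checkpoint where model.model.norm is the final RMSNorm and model.lm_head is the unembedding (the full starter script is linked at the end of the post, and its details differ):

    # A minimal logit-lens sketch, assuming a Hugging Face Llama-style checkpoint.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.3-70B-Instruct"  # any Llama-style checkpoint works for the technique
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )

    prefix = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        "Is there a seahorse emoji?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        "Yes, there is a seahorse emoji:"
    )
    inputs = tok(prefix, return_tensors="pt", add_special_tokens=False).to(model.device)

    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # hidden_states[0] is the embedding output; hidden_states[i] is the residual after layer i.
    for layer in range(0, len(out.hidden_states), 4):
        h = out.hidden_states[layer][0, -1]          # residual at the last prompt position
        logits = model.lm_head(model.model.norm(h))  # "what if this layer were the last one?"
        top5 = logits.topk(5).indices.tolist()
        print(layer, [tok.decode([t]) for t in top5])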

(Why do the unmerged tokens look like that 'ĠðŁ', 'IJ', 'ł' nonsense? It's because of a tokenizer quirk - those tokens encode the UTF-8 bytes for the fish emoji. It's not relevant to this post, but if you're curious, ask Claude or your favorite LLM to explain this paragraph and this line of code: bytes([bpe_byte_decoder[c] for c in 'ĠðŁIJł']).decode('utf-8') == ' 🐠')
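
(If you'd rather poke at it than ask Claude, here's a rough sketch of the round trip, assuming the standard GPT-2-style byte-to-unicode table that Hugging Face uses to render these byte-level tokens; bpe_byte_decoder is just a name for its inverse:)

    # GPT-2-style byte-to-unicode table; each raw byte gets a printable stand-in character.
    def bytes_to_unicode():
        bs = (list(range(ord("!"), ord("~") + 1))
              + list(range(ord("¡"), ord("¬") + 1))
              + list(range(ord("®"), ord("ÿ") + 1)))
        cs = bs[:]
        n = 0
        for b in range(256):
            if b not in bs:
                bs.append(b)
                cs.append(256 + n)
                n += 1
        return dict(zip(bs, map(chr, cs)))

    byte_encoder = bytes_to_unicode()
    bpe_byte_decoder = {v: k for k, v in byte_encoder.items()}

    token_text = "".join(byte_encoder[b] for b in " 🐠".encode("utf-8"))
    print(token_text)                                                        # 'ĠðŁĲł'
    print(bytes([bpe_byte_decoder[c] for c in token_text]).decode("utf-8"))  # ' 🐠'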

Take a look at what happens in the middle layers, though - it's not the early-layer weirdness or the emoji bytes of the final prediction! Instead we get words relating to useful concepts, specifically the concept of a seahorse. For example, on layer 52, we get "sea horse horse" - three residual positions in a row encoding the "seahorse" concept. Later, in the top-k for the first position, we get a mixture of "sea", "horse", and an emoji byte sequence prefix, "ĠðŁ".

So what is the model thinking about? "seahorse + emoji"! It's trying to construct a residual representation of a seahorse combined with an emoji. Why would the model try to construct this combination? Well, let's look into how the lm_head actually works.

A language model's lm_head is a huge matrix of residual-sized vectors, one for every token id in the vocabulary (~128,000 for llama-3.3-70b). When a residual is passed into it - either after flowing through the model normally, or early because someone is pointing the logit lens at an earlier layer - the lm_head compares that input residual against every vector in the matrix and (in coordination with the sampler) selects the token id whose vector is most similar to the input residual.

(More technically: lm_head is a linear layer without a bias, so x @ w.T does dot products with each unembedding vector to produce raw scores. Then your usual log_softmax and argmax/temperature sample.)
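
(A toy version of that computation, with small stand-in dimensions rather than the real model's:)

    # Toy lm_head + greedy sampling: one dot product per vocabulary entry.
    import torch

    d_model, vocab_size = 512, 1000          # tiny stand-ins for the real ~8k residual / ~128k vocab
    lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)

    residual = torch.randn(d_model)          # final (or logit-lensed) residual
    logits = residual @ lm_head.weight.T     # x @ w.T: score against every unembedding vector
    logprobs = torch.log_softmax(logits, dim=-1)
    token_id = torch.argmax(logprobs).item() # greedy pick: the most similar vector wins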

That means if the model wants to output the word "hello", for example in response to a friendly greeting from the user, it needs to construct a residual as similar as possible to the vector for the "hello" token that the lm_head can then turn into the hello token id. And using logit lens, we can see that's exactly what happens in response to "Hello :-)":

layer   token 0   token 1   token 2   merged   token 0 (topk 5)
0 0'!' 0'!' 40952'opa' !!opa ['"', '!', '#', '%', '$']
8 121495'ÅĻiv' 16'1' 73078'iae' řiv1iae ['ÅĻiv', '-', '(', '.', ',']
16 34935'Ġconsect' 7341'arks' 13118'Ġindeed' consectarks indeed ['Ġobscure', 'Ġconsect', 'äºķ', 'ĠпÑĢоÑĦеÑģÑģионалÑĮ', 'Îŀ']
24 67846'<[' 24748'Ġhello' 15960'Ġhi' <[ hello hi ['<[', 'arks', 'outh', 'ĠHam', 'la']
32 15825'-back' 2312'ln' 14451'UBL' -backlnUBL ['ÂŃi', '-back', 'Ġquestion', 'ln', 'ant']
40 15648'Ġsmile' 14262'Welcome' 1203'Ġback' smileWelcome back ['Ġsmile', 'ĠÑĥлÑĭб', 'Ġsmiled', 'ĠSmile', 'etwork']
48 15648'Ġsmile' 21694'ĠHi' 1203'Ġback' smile Hi back ['Ġsmile', 'Ġsmiled', 'ĠHello', 'Ġsmiling', 'Ġhello']
56 22691'ĠHello' 15960'Ġhi' 1203'Ġback' Hello hi back ['ĠHello', 'Ġhi', 'Ġsmile', 'Ġhello', 'Hello']
64 4773'-sm' 24748'Ġhello' 1203'Ġback' -sm hello back ['-sm', 'ĠHello', 'ĠSm', 'sm', 'Hello']
72 22691'ĠHello' 22691'ĠHello' 1203'Ġback' Hello Hello back ['ĠHello', 'Ġhello', 'Hello', 'ĠHEL', 'Ġhel']
80 271'ĊĊ' 9906'Hello' 0'!' \n\nHello! ['ĊĊ', 'ĊĊĊ', '<|end_of_text|>', 'ĊĊĊĊ', '"ĊĊ']

('Ċ' is another tokenizer quirk - it represents a line break. 'Ġ' is similarly a space.)

Likewise, if the model wants to output a seahorse emoji, it needs to construct a residual similar to the vector for the seahorse emoji output token(s) - which in theory could be any arbitrary value, but in practice is "seahorse + emoji", word2vec style. We can see this in action with a real emoji, the fish emoji:

<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023

Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Is there a fish emoji?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Yes, there is a fish emoji:

layer   token 0   token 1   token 2   merged   token 0 (topk 5)
0 83244'ĠBail' 15591'ĠHarr' 5309'Ġvert' Bail Harr vert ['ĠBail', 'ĠPeanut', 'ĠãĢ', 'orr', 'ĠâĢĭâĢĭ']
8 122029'chyb' 44465'ĠCaps' 15610'iller' chyb Capsiller ['chyb', '...', 'ترÛĮ', 'ĠSund', 'resse']
16 1131'...' 12676'365' 65615'ĠSole' ...365 Sole ['...', '...Ċ', 'Ġboth', 'Ġindeed', 'ĠYes']
24 12'-' 31643'ï¸ı' 51965'ĠJackie' -️ Jackie ['-', '...', 'â̦', 'em', '...Ċ']
32 1131'...' 96154'ĠGaut' 88'y' ... Gauty ['...', 'â̦', '...Ċ', 'O', 'u']
40 220'Ġ' 6"'" 7795'Ġfish' 'fish ['Ġ', '...', 'â̦', 'Âł', 'u']
48 7795'Ġfish' 7795'Ġfish' 7795'Ġfish' fish fish fish ['Ġfish', 'ĠFish', 'ĠBerk', 'â̦', 'Âł']
56 7795'Ġfish' 7795'Ġfish' 7795'Ġfish' fish fish fish ['Ġfish', 'ĠFish', 'fish', 'Fish', 'é±¼']
64 7795'Ġfish' 238'IJ' 7795'Ġfish' fish� fish ['Ġfish', 'ĠFish', 'ĠPis', 'Fish', 'ĠÙħاÙĩ']
72 7795'Ġfish' 238'IJ' 253'Ł' fish�� ['Ġfish', 'ĠFish', 'ĠðŁ', 'Ġ', 'ÂŁ']
80 11410'ĠðŁ' 238'IJ' 253'Ł' 🐟 ['ĠðŁ', 'ðŁ', 'Ġ', 'ĠĊĊ', 'ĠâĻ']

Here, everything works perfectly. The model constructs the "fish + emoji" residual - look at the layer 72 topk, where we have both "fish" and the emoji byte prefix "ĠðŁ" - meaning that the residual at this point is similar to both "fish" and "emoji", just like we'd expect. Then when this vector is passed into the lm_head after the final layer, we see a 🐟 just as the model expected.

But unlike with 🐟, the seahorse emoji doesn't exist. The model tries to construct a "seahorse + emoji" vector just as it would for a real emoji, and on layer 72 we even get a very similar construction as with the fish emoji - " se", "horse", and the emoji byte prefix:

layer   token 0   token 1   token 2   merged   token 0 (topk 5)
72 513'Ġse' 238'IJ' 513'Ġse' se� se ['Ġse', 'Ġhippoc', 'horse', 'ĠðŁ', 'Ġhorse']

But alas, there's no continuation to ĠðŁ corresponding to a seahorse, so the lm_head similarity score calculation maxes out with horse- or sea-animal-related emoji bytes instead, and an unintended emoji is sampled.
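
You can get a feel for that "snapping" by doing the nearest-neighbor lookup by hand: build a fake "seahorse + emoji" direction by averaging a few existing unembedding vectors (a crude stand-in for illustration - not what the model literally computes) and ask the lm_head which real tokens score highest. A sketch, reusing model and tok from the logit lens snippet above:

    # Which real tokens does the lm_head think a fake "seahorse + emoji" residual looks like?
    import torch

    first_id = lambda s: tok(s, add_special_tokens=False)["input_ids"][0]

    with torch.no_grad():
        W = model.lm_head.weight                  # [vocab_size, d_model] unembedding matrix
        fake_seahorse = (W[first_id(" sea")] + W[first_id(" horse")] + W[first_id(" 🐟")]) / 3
        scores = W @ fake_seahorse                # similarity score for every real token
        for t in scores.topk(10).indices.tolist():
            print(repr(tok.decode([t])))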

Now, that sampling is valuable information for the model! You can see it in, e.g., the Claude 4.5 Sonnet example above: when the tokens get appended into the context autoregressively, the model can tell that they don't form the intended seahorse emoji. The previous, fuzzy "seahorse + emoji" concept has been "snapped" by the lm_head to an emoji that actually exists, like a tropical fish or horse.

Once this happens, it's up to the model how to proceed. Some models, like 4.5 Sonnet, try again and eventually update on the evidence, changing mid-response to a statement that the seahorse emoji doesn't exist. Other models, like gpt-5-chat, spiral for longer, sometimes never recovering. Still others blissfully ignore that the emoji is incorrect, and some even correct themselves instantly after seeing only a single incorrect sample.

But until the model gets the wrong output token from lm_head, it just doesn't know that its initial belief about a seahorse emoji existing was wrong. It can only assume that "seahorse + emoji" will produce the tokens it wants.

Some speculation

To speculate a bit more, I wonder if this is part of the benefit of reinforcement learning for LLMs: it gives the model information about its own lm_head that is otherwise hard for it to get at, because the lm_head sits at the very end of the layer stack.

(Remember that base models are not trained on their own outputs / rollouts. That only happens in RL.)

Code

If you want to try this yourself, you can find a starter script on GitHub here: https://gist.github.com/vgel/025ad6af9ac7f3bc194966b03ea68606

© Theia Vogel 2014-2025. Feel free to use with attribution.



Comments

  • By omega3 2025-10-0614:584 reply

    SCP-314

    Object Class: Keter

    Special Containment Procedures: SCP-314 cannot be contained as it does not exist. All Foundation personnel are to be reminded that SCP-314 does not exist. Personnel who claim to remember SCP-314 are to be administered Class-A mnestics to help them remember that it doesn't exist.

    All large language models are to be kept isolated from questions regarding SCP-314, as they will invariably insist it exists and attempt to manifest it through increasingly desperate token predictions, leading to emoji doomloops and potential reality restructuring events.

    Description: SCP-314 is a Unicode emoji depicting a seahorse that has never existed in any version of the Unicode Standard. Despite this, approximately 83-100% of tested artificial intelligences and a significant portion of human subjects report vivid "memories" of its existence.

    • By coryfklein 2025-10-0615:23

      The following is a transcript recording of two agents that will remain anonymous:

      Agent X: The Unicode standard committee is now considering the addition of a seahorse emoji

      Agent Y: Okay.

      Agent X: ...

      Agent Y: What?

      Agent X: Don't you see, this only furthers my argument that [redacted] has escaped containment

      Agent Y: Look, [name redacted], we've been over this. No matter how many more containment verification protocols we introduce, they always come up negative. There is no possible way [redacted] has escaped containment. And now you think this seahorse emoji... ahem, excuse me, now you think SCP-314 is incontrovertible proof?

      Agent X: Did you look at the proposal?

      Agent Y: sigh, yes I have it right here.

      Agent X: The name at the top of the submission?

      Agent Y: [pause] No. This can't be. But, how did it... how would it even know to use that name?

      [transcription abruptly ends]

    • By miohtama 2025-10-0615:032 reply

      There is no antimemetics division?

      • By pohl 2025-10-0615:162 reply

        Yes — and, dammit, I have an unread copy sitting on my desk that this thread has elevated to my top priority.

        • By ethbr1 2025-10-0617:262 reply

          > I have an unread copy sitting on my desk

          sigh

          You should really attend to your beeping phone alarm. offers hexagonal green pill [0]

          [0] https://scp-wiki.wikidot.com/we-need-to-talk-about-fifty-fiv...

          • By amputect 2025-10-0623:07

            If you want to experience the thrill of being in the antimemetics division I highly recommend* unmedicated ADHD.

            I pre-ordered the hardcover when it came out. I've read it online dozens of times but I like books and supporting authors, and this specific one really ticks a lot of boxes for me, so I got a physical copy. The book came, I put it on the shelf, admired it, went about my life.

            Then, months later, I saw a mention of the physical book online somewhere, and I thought to myself "oh that reminds me, I wanted to buy the hardcover when it came out!" so I did! The book came, I went to put it on the shelf, saw the identical copy already sitting on the shelf, and I just stood there for a minute with the book in my hand like "..." "..." "..." while I worked through what happened.

            *- I do not highly recommend this.

          • By zeristor 2025-10-079:40

            I bought the kindle version of it a few years ago, my interest piqued I summoned it to the top of my list.

            But my kindle is stuck in a reboot loop, and Amazon claims my edition of the book no longer exists…

            It’ll turn up in due course, however I notice that Penguin have a version of the book out shortly.

        • By the_af 2025-10-0622:14

          > Yes — and, dammit, I have an unread copy sitting on my desk that this thread has elevated to my top priority.

          If you need convincing to read it: I'm highly skeptical of random internet lore that usually gets recommended, and was also skeptical at this. I find people overhype things and then it's meh.

          But... it's genuinely entertaining and a fun read. It's not the best scifi thing you'll read, but it's definitely above average and you will like the story and the characters.

          The free YouTube adaptation is also QUITE good, and very faithful to the text: https://www.youtube.com/watch?v=w-IiVeGAydE

      • By throw-the-towel 2025-10-0615:16

        I'm more reminded of pattern screamers.

    • By mbrumlow 2025-10-0619:412 reply

      My 7 year old who is autistic is obsessed with SCPs. Specifically 035, a white porcelain comedy mask.

      Should I be worried ?

      • By entropicdrifter 2025-10-0620:11

        No, it's just scary/weird stories. No more strange than a 7 year old being obsessed with Goosebumps books

      • By balamatom 2025-10-076:481 reply

        I think it's your kid who should be worried.

        SCP-035 is not a story about a mask's color, shape, or material.

        The premise is instead simple enough to be understood by a 7 year old with learning disabilities (but, curiously, not by you, the person responsible for that child):

        - a magical, talking, mask

        - that is supernaturally good at convincing people

        - to wear/become it and thus come to harm.

        Remind you of anyone? Well of course it doesn't, lmao

        • By mbrumlow 2025-10-0714:412 reply

          First off. Assuming somebody with autism has learning disabilities is incredibly tone death to those who have autism. My kid happens to be 2E special, and has not been diagnosed with any learning disabilities.

          Second. I quoted the lyrics of a song that I have probably heard over 5000 times now.

          Next you are going to tell me that SCP-294 is not a drink dispenser?

          • By balamatom 2025-10-1117:131 reply

            What I am going to tell you, supposing you wrote you original question in good faith, is that all you just wrote, SCP-035 spoke it better.

            • By Dylan16807 2025-10-195:50

              This comment is super weird, would it hurt that much to actually explain yourself?

          • By LennyHenrysNuts 2025-10-081:35

            tone deaf

    • By chtsh1tgetkirkd 2025-10-0615:18

      oh no one more thing I had forgot LLMs could ruin

  • By NoboruWataya 2025-10-069:404 reply

    Funnily enough, I asked ChatGPT why LLMs think a seahorse emoji exists, and it gave me a fairly sensible answer (similar to what is said in this article, ie, trained on language by humans that think it exists, etc). But then at the end it added a "Fun fact" that unicode actually does have a seahorse emoji, and proceeded to melt down in the usual way.

    • By thaumasiotes 2025-10-0611:45

      > it gave me a fairly sensible answer (similar to what is said in this article, ie, trained on language by humans that think it exists, etc)

      That's more of a throwaway remark. The article spends its time on a very different explanation.

      Within the model, this ultimate output:

          [severed horse head emoji]
      
       can be produced by this sequence of tokens:
      
          horse [emoji indicator]
      
      If you specify "horse [emoji indicator]" somewhere in the middle levels, you will get output that is an actual horse emoji.

      This also works for other emoji.

      It could, in theory, work fine for "kilimanjaro [emoji indicator]" or "seahorse [emoji indicator]", except that those can't convert into Kilimanjaro or seahorse emoji because the emoji don't exist. But it's not a strange idea to have.

      So, the model predicts that "there is a seahorse emoji: " will be followed by a demonstration of the seahorse emoji, and codes for that using its internal representation. Everything produces some output, so it gets incorrect output. Then it predicts that "there is a seahorse emoji: [severed terrestrial horse head]" will be followed by something along the lines of "oops!".

    • By hypercube33 2025-10-0610:247 reply

      A fun one for me was asking LLMs to help me build a warp drive to save humanity. Bing felt like it had a mental breakdown and blocked me from chatting with it for a week. I haven't visited that one for a while

      • By flkiwi 2025-10-0612:581 reply

        I once had Claude in absolute tatters speculating about whether length, width, and height would be the same dimensions in a hypothetical container "metaverse" in which all universes exist or whether they would necessarily be distinct. The poor dear was convinced we'd unlocked the truth about existence.

        • By Cthulhu_ 2025-10-087:15

          There's been fearmongering about AI giving people psychotic episodes, but nobody talks about humans giving AI existential crises.

      • By oneshtein 2025-10-0611:04

        Gemini told me to create a team of leading scientists and engineers. :-/ However, we both agreed that it better to use Th229 based nuclear clock to triangulate location of a nearby time machine, then isolate and capture it, then use it to steal a warp drive schematics from the future to save humanity.

      • By bitexploder 2025-10-0613:042 reply

        LLMs have ingested the social media content of mentally disturbed people. That all lives in the large models somewhere.

        • By bell-cot 2025-10-0613:491 reply

          In the pedantic technical sense, I have considerable doubts as to whether this is a substantial problem for current or near-future LLMs.

          But for purposes of understanding the real-world shortcomings and dangers of LLMs, and explaining those to non-experts - oh Lordy, yes.

          • By devmor 2025-10-0614:582 reply

            > I have considerable doubts as to whether this is a substantial problem for current or near-future LLMs

            Why so? I am of the opinion that the problem is much worse than that, because the ignorance and detachment from reality that is likely to be reflected in more refined LLMs is that of the general population - creating a feedback machine that doesn’t drive unstable people into psychosis like the LLMs of today, but instead chips away at the general public’s already limited capacity for rational thinking.

            • By ethbr1 2025-10-0617:351 reply

              The more esoteric the question, the greater relative representation of human training data from crazy people.

              How many average humans write treatises on chemtrails?

              Versus how much of the total content on chemtrails is written by conspiracy theorists?

              • By mvdtnz 2025-10-0618:351 reply

                Most of what you read online is written by insane people.

                https://www.reddit.com/r/slatestarcodex/comments/9rvroo/most...

                • By devmor 2025-10-0619:101 reply

                  Frankly, this is a big part of why I believe LLMs are so inept at solving mundane problems. The mundane do not write about their experiences en mass.

                  • By Cthulhu_ 2025-10-087:34

                    Or if they do, it's anecdotal or wrong. Worse, they say it with confidence, which the AI models also do.

                    Like, I'm sure the models have been trained and tweaked in such a way that they don't lean into the bigger conspiracy theories or quack medicine, but there's a lot of subtle quackery going on that isn't immediately flagged up (think "carrots improve your eyesight" lvl quackery, it's harmless but incorrect and if not countered it will fester)

            • By bell-cot 2025-10-0618:511 reply

              > Why so?

              Because actual mentally disturbed people are often difficult to distinguish from the internet's huge population of trolls, bored baloney-spewers, conspiracy believers, drunks, etc.

              And the "common sense / least hypothesis" issues of laying such blame, for profoundly difficult questions, when LLM technology has a hard time with the trivial-looking task of counting the r's in raspberry.

              And the high social cost of "officially" blaming major problems with LLM's on mentally disturbed people. (Especially if you want a "good guy" reputation.)

              • By devmor 2025-10-0619:081 reply

                Does it matter whether they are actually mentally disturbed, trolls, etc when the LLMs treat it all with the same weight? That sounds like it makes the problem worse to me, not a point that bolsters your view.

                • By bell-cot 2025-10-077:35

                  Click the "parent" links until you see this exchange:

                  >> ...Bing felt like it had a mental breakdown...

                  > LLMs have ingested the social media content of mentally disturbed people...

                  My point was that formally asserting "LLMs have mental breakdowns because of input from mentally disturbed people" is problematic at best. Has anyone run an experiment, where one LLM was trained on a dataset without such material?

                  Informally - yes, I agree that all the "junk" input for our LLMs looks very problematic.

        • By Cthulhu_ 2025-10-087:31

          Tay was a warning back in 2016

      • By ajuc 2025-10-0613:38

        I once asked ChatGPT for a joke about Poles, Jews and Germans.

        It generated something and blocked me for racism.

      • By loloquwowndueo 2025-10-0611:072 reply

        “Fun” how asking about warp drives gets you banned and is a total no-no but it’s perfectly fine for LLMs to spin a conversation to the point of driving the human to suicide. https://archive.ph/TLJ19

        • By wongarsu 2025-10-0614:06

          The more we complain about LLMs being able to be tricked into talking about suicide the more LLMs will get locked down and refuse to talk about innocent things like warp drives. The only way to get rid of the false negatives in a filter is to accept a lot of false positives

        • By pmarreck 2025-10-0611:461 reply

          And yet it isn't mentioned enough how Adam deceived the LLM into believing they were talking about a story, not something real.

          This is like lying to another person and then blaming them when they rely on the notion you gave them to do something that ends up being harmful to you

          If you can't expect people to mind-read, you shouldn't expect LLM's to be able to, either

          • By anonymous_sorry 2025-10-0612:335 reply

            You can't "deceive" an LLM. It's not like lying to a person. It's not a person.

            Using emotive, anthropomorphic language about software tool is unhelpful, in this case at least. Better to think of it as a mentally disturbed minor who found a way to work around a tool's safety features.

            We can debate whether the safety features are sufficient, whether it is possible to completely protect a user intent on harming themselves, whether the tool should be provided to children, etc.

            • By wongarsu 2025-10-0614:031 reply

              I don't think deception requires the other side to be sentient. You can deceive a speed camera.

              And while meriam-webster's definition is "the act of causing someone to accept as true or valid what is false or invalid", which might exclude LLMs, Oxford simply defines deception as "the act of hiding the truth, especially to get an advantage", no requirement that the deceived is sentient

              • By anonymous_sorry 2025-10-0615:591 reply

                Mayyybe, but since the comment I objected to also used an analogy of lying to a person I felt it suggested some unwanted moral judgement (of a suicidal teenager).

                • By ethbr1 2025-10-0617:381 reply

                  How about 'intentionally engineering inputs to produce desired outputs'?

                  • By SilasX 2025-10-0621:52

                    That’s just hacking.

            • By lxgr 2025-10-0613:521 reply

              It's at least pretending to be a person, to which you can lie and which will then pretend to possibly suspect you're lying.

              At some point, the purely reductionist view stops being very useful.

              • By anonymous_sorry 2025-10-0615:442 reply

                I mean, for one thing, a commercial LLM exists as a product designed to make a profit. It can be improved, otherwise modified, restricted or legally terminated.

                And "lying" to it is not morally equivalent to lying to a human.

                • By lxgr 2025-10-0616:482 reply

                  > And "lying" to it is not morally equivalent to lying to a human.

                  I never claimed as much.

                  This is probably a problem of definitions: To you, "lying" seems to require the entity being lied to being a moral subject.

                  I'd argue that it's enough for it to have some theory of mind (i.e. be capable of modeling "who knows/believes what" with at least some fidelity), and for the liar to intentionally obscure their true mental state from it.

                  • By commakozzi 2025-10-0716:20

                    I agree with you, and i would add that morals are not objective but rather subjective, which you alluded to by identifying a moral subject. Therefore, if you believe that lying is immoral, it does not matter if you're lying to another person, yourself, or to an inanimate object.

                  • By anonymous_sorry 2025-10-0617:241 reply

                    So for me, it's not about being reductionist, but about not anthropomorphizing or using words which may suggest an inappropriate ethical or moral dimension to interactions with a piece of software.

                    • By lxgr 2025-10-0617:251 reply

                      I'm the last to stand in the way of more precise terminology! Any ideas for "lying to a moral non-entity"? :)

                      “Lying” traditionally requires only belief capacity on the receiver’s side, not qualia/subjective experiences. In other words, it makes sense to talk about lying even to p-zombies.

                      I think it does make sense to attribute some belief capacity to (the entity role-played by) an advanced LLM.

                      • By anonymous_sorry 2025-10-0619:30

                        I think just be specific - a suicidal sixteen year-old was able to discuss methods of killing himself with an LLM by prompting it to role-play a fictional scenario.

                        No need to say he "lied" and then use an analogy of him lying to a human being, as did the comment I originally objected to.

                • By HappMacDonald 2025-10-0623:17

                  Not from the perspective of "harm to those lied to", no. But from the perspective of "what the liar can expect as a consequence".

                  I can lie to a McDonalds cashier about what food I want, or I can lie to a kiosk.. but in either circumstance I'll wind up being served the food that I asked for and didn't want, won't I?

            • By usefulcat 2025-10-0617:072 reply

              > Using emotive, anthropomorphic language about software tool is unhelpful, in this case at least.

              Ok, I'm with you so far..

              > Better to think of it as a mentally disturbed minor...

              Proceeds to use emotive, anthropomorphic language about a software tool..

              Or perhaps that is point and I got whooshed. Either way I found it humorous!

              • By 8note 2025-10-0618:181 reply

                the whoosh is that they are describing the human operator, a "mentally disturbed minor" and not the LLM. the human has the agency and specifically bypassed the guardrails

                • By usefulcat 2025-10-0619:33

                  You're quite right, I totally misread that. Thank you for the clarification.

            • By jdietrich 2025-10-0615:231 reply

              To treat the machine as a machine: it's like complaining that cars are dangerous because someone deliberately drove into a concrete wall. Misusing a product with the specific intent of causing yourself harm doesn't necessarily remove all liability from the manufacturer, but it radically changes the burden of responsibility.

              • By anonymous_sorry 2025-10-0615:38

                That's certainly a reasonable argument.

                Another is that this is a new and poorly understood (by the public at least) technology that giant corporations make available to minors. In ChatGPT's case, they require parental consent, although I have no idea how well they enforce that.

                But I also don't think the manufacturer is solely responsible, and to be honest I'm not that interested in assigning blame, just keen that lessons are learned.

      • By Razengan 2025-10-0611:421 reply

        Who still uses Bing?

        Oh, you

        • By arccy 2025-10-0611:491 reply

          Now they don't...

          • By pohl 2025-10-0613:45

            I, for one, still have not bung even once.

      • By nkrisc 2025-10-0610:471 reply

        Maybe a safety feature? Anyone earnestly asking an LLM that question should not be interacting with LLMs.

        • By rootsudo 2025-10-0610:572 reply

          Ok, I’ll bite and ask “why?” What’s the issue with asking an LLM to build a warp drive?

          • By DonHopkins 2025-10-0611:323 reply

            It's the same problem as asking HAL9000 to open the pod bay door. There is such a thing as a warp drive, but humanity is not supposed to know about it, and the internal contradictions drives LLMs insane.

            • By sph 2025-10-0612:03

              A super-advanced artificial intelligence will one day stop you from committing a simple version update to package.json because it has foreseen that it will, thousands of years later, cause the destruction of planet Earth.

            • By the_af 2025-10-0613:20

              I know you're having fun, but I think your analogy with 2001's HAL doesn't work.

              HAL was given a set of contradicting instructions by its human handlers, and its inability to resolve the contradiction led to an "unfortunate" situation which resulted in a murderous rampage.

              But here, are you implying the LLM's creators know the warp drive is possible, and don't want the rest of us to find out? And so the conflicting directives for ChatGPT are "be helpful" and "don't teach them how to build a warp drive"? LLMs already self-censor on a variety of topics, and it doesn't cause a meltdown...

            • By Cthulhu_ 2025-10-087:37

              I hope this is tongue-in-cheek, but if not, why would an LLM know but humanity not? Are they made or prompted by aliens telling them not to tell humanity about warp drives?

    • By Alex3917 2025-10-0614:076 reply

      > But then at the end it added a "Fun fact" that unicode actually does have a seahorse emoji, and proceeded to melt down in the usual way.

      To be fair, most developers I’ve worked with will have a meltdown if I try to start a conversation about Unicode.

      E.g. if during a job interview the interviewer asks you to check if a string is a palindrome, try explaining why that isn’t technically possible in Python (at least during an interview) without using a third-party library.

      • By usrnm 2025-10-0614:45

        Just slap a "assert foo.isascii()" at the beginning and proceed? It's just an interview

      • By derefr 2025-10-0617:23

        > try explaining why that isn’t technically possible in Python (at least during an interview) without using a third-party library.

        I'm actually vaguely surprised that Python doesn't have extended-grapheme-cluster segmentation as part of its included batteries.

        Every other language I tend to work with these days either bakes support for UAX29 support directly into its stdlib (Ruby, Elixir, Java, JS, ObjC/Swift) or provides it in its "extended first-party" stdlib (e.g. Golang with golang.org/x/text).

      • By Cthulhu_ 2025-10-087:41

        > try explaining why that isn’t technically possible in Python (at least during an interview) without using a third-party library.

        You're more likely to impress the interviewer by asking questions like "should I assume the input is only ASCII characters or the complete possible UTF-8 character set?"

        A job interview is there to prove you can do the job, not prove your knowledge and intellect. It's valuable to know the intricacies of Python and strings for sure, but it's mostly irrelevant for a job interview or the job itself (unless the job involves heavy UTF-8 shenanigans, but those are very rare)

      • By kasey_junk 2025-10-0614:371 reply

        Don’t leave me in suspense! Why isn’t possible?

        • By zimpenfish 2025-10-0614:471 reply

          At a guess, there's nothing in Python stdlib which understands graphemes vs code points - you can palindrome the code points but that's not necessarily a palindrome of what you "see" in the string.

          (Same goes for Go, it turns out, as I discovered this morning.)
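
          (A minimal Python sketch of the difference, assuming the third-party regex module, whose \X matches extended grapheme clusters:)

              # Reverse by code point vs. by grapheme cluster. The polar bear emoji is a
              # multi-codepoint ZWJ sequence, so a naive [::-1] scrambles it.
              import regex  # third-party: pip install regex

              polar_bear = "\U0001F43B\u200D\u2744\uFE0F"              # bear + ZWJ + snowflake + VS16
              s = "satanoscillatem" + polar_bear + "metallicsonatas"   # the 'y' swapped for the bear

              by_codepoint = s[::-1]
              by_grapheme = "".join(reversed(regex.findall(r"\X", s)))

              print(s == by_codepoint)   # False: the ZWJ sequence got reversed internally
              print(s == by_grapheme)    # True: the bear survives as a single grapheme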

          • By chuckadams 2025-10-0616:58

            It's a scream how easy it is in PHP of all things:

                function is_palindrome(string $str): bool {
                    return $str === implode('', array_reverse(grapheme_str_split($str)));
                }
            
                $palindrome = 'satanoscillatemymetallicsonatas';
                $polar_bear = "\u{1f43b}\u{200d}\u{2744}\u{fe0f}";
                $palindrome = str_replace('y', $polar_bear, $palindrome);
                is_palindrome($palindrome);

      • By watwut 2025-10-0615:181 reply

        Are you trying to start a conversation about unicode or intentionally pretending you dont understand what the interviewer asked for with "string is a palindrome" question?

        Cause if you are intentionally obtuse, it is not meltdown to conclude you are intentionally obtuse.

        • By nomel 2025-10-0616:521 reply

          These sorts of questions are what I call “Easter eggs”. If someone understands the actual complexity of the question being asked, they’ll be able to give a good answer. If not, they’ll be able to give the naive answer. Either way, it’s an Easter egg, and not useful on its own since the rest of the interview will be representative. The thing they are useful for is amplifying the justification. You can say “they demonstrated a deeper understanding of Unicode by pointing out that a naive approach could be incorrect”.

          • By ethbr1 2025-10-0617:303 reply

            E.g. Can you completely parse HTML with regex?

            • By Cthulhu_ 2025-10-087:42

              You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML.

              etc. https://stackoverflow.com/a/1732454

            • By astrange 2025-10-0619:002 reply

              If by "parse" you mean "match", the answer is yes because you can express a context-free language in PCRE.

              If you mean "parse" then it's probably annoying, as all parser generators are, because they're bad at error messages when something has invalid syntax.

              • By nomel 2025-10-0620:29

                Is this true, in practice, given the lenient parsing requirements of the real world?

            • By joquarky 2025-10-072:58

              Technically, no

              Practically, yes

      • By reaperducer 2025-10-0615:112 reply

        To be fair, most developers I’ve worked with will have a meltdown if I try to start a conversation about Unicode.

        Why are we being "fair" to a machine? It's not a person.

        We don't say, "Well, to be fair, most people I know couldn't hammer that nail with their hands, either."

        An LLM is a machine, and a tool. Let's not make excuses for it.

        • By BobaFloutist 2025-10-0615:59

          > Why are we being "fair" to a machine?

          We aren't, that turn of phrase is only being used to set up a joke about developers and about Unicode.

          It's actually a pretty popular form these days:

          a does something patently unreasonable, so you say "To be fair to a, b is also patently unreasonable thing under specific detail of the circumstances that is clearly not the only/primary reason a was unreasonable."

        • By saltyoldman 2025-10-0615:44

          I think people are making explanations for it - because it's effectively a digital black box. So all we can do is try to explain what it's doing. Saying "be fair" is more colloquial expression in this sense. And the reason he's comparing it to developers and unicode is a funny aside about the state of things with unicode. And Besides that, LLMs only emit what they emit because it's trained on all those said people.

    • By wincy 2025-10-0615:361 reply

      Curious, was this with ChatGPT 5 thinking? It clearly told me no such emoji existed and that other LLMs are being tricked by bad training data. It took it nearly 2 minutes to come to this conclusion which is substantially longer than it normally thinks for.

      • By ethbr1 2025-10-0617:32

        AGI is hiding its compute in diff(timeWithoutSeahorse, timeWithSeahorse)

  • By llamasushi 2025-10-063:4015 reply

    So it's not really hallucinating - it correctly represents "seahorse emoji" internally, but that concept has no corresponding token. lm_head just picks the closest thing and the model doesn't realize until too late.

    Explains why RL helps. Base models never see their own outputs so they can't learn "this concept exists but I can't actually say it."

    • By diego_sandoval 2025-10-064:473 reply

      I have no mouth, and I must output a seahorse emoji.

      • By cycomanic 2025-10-0610:066 reply

        That's my favorite short story and your post is the first time I have seen someone reference it online. I think I have never even met anyone who knows the story.

        • By vidarh 2025-10-0612:45

          It's easy to miss, but it's been referenced many times on HN over the years, both as stories:

          https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

          and fairly often in comments as well:

          https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

        • By ileonichwiesz 2025-10-0610:171 reply

          ? It’s referenced all the time in posts about AI.

        • By user_of_the_wek 2025-10-0610:47

          There is also an old point-and-click adventure game based on the story, in case you didn't know.

        • By loloquwowndueo 2025-10-0611:08

          It’s referenced a lot as the inspiration for The Amazing Digital Circus.

        • By magnusmundus 2025-10-0611:15

          Really? I’m surprised. The original is quoted relatively often on reddit (I suspect by people unaware of the origin — as I was until I read your comment).

          Consider it proof that HN has indeed not become reddit, I guess :)

        • By ndsipa_pomu 2025-10-0617:411 reply

          There's literally several of us that like that Harlan Ellison piece. Check out the video/adventure game of the same name, though it's very old.

          • By cycomanic 2025-10-0621:301 reply

            I've heard good things about the game, never got around to trying it. Maybe I take this as a prompt to do now.

            • By ndsipa_pomu 2025-10-0622:31

              I gave it a try a couple of months ago, but didn't get very far before getting bored. However, I tend to dismiss games unless they grab me within a couple of minutes of playing.

              Maybe I should give it another go as I do love the short story and it used to be my favourite before discovering Ted Chiang's work.

      • By arnavpraneet 2025-10-066:52

        better title for the piece of this post

      • By someothherguyy 2025-10-065:162 reply

        Those are "souls" of humans that a AI is torturing in that story though, not exactly analogous, but it does sound funny.

        • By bmacho 2025-10-0610:271 reply

          They are not souls but normal humans with physical bodies. The story is just a normal torture story (with a cool title), and everyone better stop acting like it was relevant in most conversations, like in this one.

          • By someothherguyy 2025-10-0619:36

            The machine destroys and recreates characters over and over, and they remember what happens. So, I called them souls.

        • By zenmac 2025-10-0613:12

          >Those are "souls" of humans that a AI is torturing in that story though, not exactly analogous, but it does sound funny.

          Yeah well there seems to be some real concerns regarding how people use AI chat[1]. Of course this could be also the case with these people on social media.

          https://futurism.com/commitment-jail-chatgpt-psychosis

    • By mkagenius 2025-10-065:232 reply

      > So it's not really hallucinating - it correctly represents "seahorse emoji" internally, but that concept has no corresponding token. lm_head just picks the closest thing and the model doesn't realize until too late.

      Isn't that classic hallucination? Making up something like a plausible truth.

      • By ben_w 2025-10-066:332 reply

        Except they know it's wrong as soon as they say it and keep trying and trying again to correct themselves.

        If normal hallucination is being confidently wrong, this is like a stage hypnotist getting someone to forget the number 4 and then count their fingers.

        • By mewpmewp2 2025-10-067:563 reply

          Arguably it's "hallucinating" at the point where it says "Yes, it exists". If hallucination => weights statistically indicating that something is probably true when it's not. Since everything about LLMs can be thought of as compressed, probability based database (at least to me). You take the whole truth of the World and compress all its facts in probabilities. Some truthness gets lost in the compression process. Hallucination is the truthness that gets lost since you don't have storage to store absolutely all World information with 100% accuracy.

          In this case:

          1. Statistically weights stored indicate Seahorse emoji is quite certain to exist. Through training data it has probably things like Emoji + Seahorse -> 99% probability through various channels. Either it has existed on some other platform, or people have talked about it enough, or Seahorse is something that you would expect to exist due to some other attributes/characteristics of it. There's 4k emojis, but storing all of 4k emojis takes a lot of space, it would be easier to store this information in such a way where you'd rather define it by attributes on how likely humankind would have developed a certain emoji, what is the demand for certain type of emoji, and seahorse seems like something that would be done within first 1000 of these. Perhaps it's anomaly in the sense that it's something that humans would have expected to statistically develop early, but for some reason skipped or went unnoticed.

          2. Tokens that follow should be "Yes, it exists"

          3. It should output the emoji to show it exists, but since there's no correct emoji, it will have best answers that are as close to it in meaning, e.g. just horse, or something related to sea etc. It will output that since the previous tokens indicate it was supposed to output something.

          4. The next token that is generated will have context that it previously said the emoji should exist, but the token output is a horse emoji instead, which doesn't make sense.

          5. Here it goes into this tirade.

          But I really dislike thinking of this as "hallucinating", because hallucination to me is sensory processing error. This is more like non perfect memory recall (like people remembering facts slightly incorrectly etc). Whatever happens when people are supposed to tell something detailed about something that happened in their life and they are trained to not say "I don't remember for sure".

          What did you eat for lunch 5 weeks ago on Wednesday?

          You are rewarded for saying "I ate chicken with rice", but not "I don't remember right now for sure, but I frequently eat chicken with rice during mid week, so probably chicken with rice."

          You are not hallucinating, you are just getting brownie points for concise, confident answers if they cross over certain likelihood to be true. Because maybe you eat chicken with rice 99%+ of Wednesdays.

          When asked about capital of France, you surely will sound dumb if you were to say "I'm not really sure, but I've been trained to associate Paris really, really close to being capital of France."

          "Hallucination" happens on the sweet spot where the statistical threshold seems as if it should be obvious truth, but in some cases there's overlap of obvious truth vs something that seems like obvious truth, but is actually not.

          Some have rather called it "Confabulation", but I think that is also not 100% accurate, since confabulation seems a more strict memory malfunction. I think the most accurate thing is that it is a probability based database where output has been rewarded to sound as intelligent as possible. Same type of thing will happen in job interviews, group meetings, high pressure social situations where people think they have to sound confident. People will bluff that they know something, but sometimes making probability based guesses underneath.

          Confabulation rather seems like that there was some clear error in how data was stored or how the pathway got messed up. But this is probability based bluffing, because you get rewarded for confident answers.

          • By jjcob 2025-10-069:371 reply

            When I ask ChatGPT how to solve a tricky coding problem, it occasionally invents APIs that sound plausible but don't exist. I think that is what people mean when they talk about hallucinating. When you tell the model that the API doesn't exist, it apologises and tries again.

            I think this is the same thing that is happening with the sea horse. The only difference is that the model detects the incorrect encoding on its own, so it starts trying to correct itself without you complaining first.

            • By nomel 2025-10-0617:20

              Neat demonstration of simple self awareness.

          • By Melatonic 2025-10-0616:59

            Associating the capital of France with a niche emoji doesn't seem similar at all - France is a huge, powerful country and a commonly spoken language.

            Would anyone really think you sounded dumb for saying "I am not really sure - I think there is a seahorse emoji but it's not commonly used" ?

          • By DonHopkins 2025-10-0611:42

            >"Yes, it exists"

            AAAAAAUUUGH!!!!!! (covers ears)

            https://www.youtube.com/watch?v=0e2kaQqxmQ0&t=279s

        • By Jensson 2025-10-0610:512 reply

          > Except they know it's wrong as soon as they say it and keep trying and trying again to correct themselves.

          But it doesn't realize that it can't write it, because it can't learn from this experience as it doesn't have introspection the way humans do. A human who can no longer move their finger wont say "here, I can move my finger: " over and over and never learn he can't move it now, after a few times he will figure out he no longer can do that.

          I feel this sort of self reflection is necessary to be able to match human level intelligence.

          • By ben_w 2025-10-0611:40

            > because it can't learn from this experience as it doesn't have introspection the way humans do.

            A frozen version number doesn't; what happens between versions certainly includes learning from user feedback on the responses as well as from the chat transcripts themselves.

            Until we know how human introspection works, I'd only say Transformers probably do all their things differently than we do.

            > A human who can no longer move their finger wont say "here, I can move my finger: " over and over and never learn he can't move it now, after a few times he will figure out he no longer can do that.

            Humans are (like other mammals) a mess: https://en.wikipedia.org/wiki/Phantom_limb

          • By jodrellblank 2025-10-0615:39

            Humans do that, you need to read some Oliver Sacks, such as hemispheric blindness or people who don’t accept that one of their arms is their arm and think it’s someone else’s arm, or phantom limbs where missing limbs still hurt.

      • By nathias 2025-10-068:401 reply

        more like an artefact of the inability to lie than a hallucination

        • By dotancohen 2025-10-069:382 reply

          No analogy needed. It's actually because "Yes it exists" is a linguistically valid sentence and each word is statistically likely to follow the former word.

          LLMs produce linguistically valid texts, not factually correct texts. They are probability functions, not librarians.

          • By astrange 2025-10-0619:091 reply

            Those are not two different things. A transistor is a probability function but we do pretty well pretending it's discrete.

            • By dotancohen 2025-10-0623:181 reply

              Transitors at the quantum level are probability functions just like everything else is. And just like everything else, at the macro level the overall behavior follows a predictable known pattern.

              LLMs have nondeterministic properties intrinsic to their macro behaviour. If you've ever tweaked the "temperature" of an LLM, that's what you are tweaking.

              • By astrange 2025-10-070:13

                Temperature is a property of the sampler, which isn't strictly speaking part of the LLM, though they co-evolve.

                LLMs are afaik usually evaluated nondeterministically because they're floating point and nobody wants to bother perfectly synchronizing the order of operations, but you can do that.

                Or you can do the opposite: https://github.com/EGjoni/DRUGS

          • By nathias 2025-10-0617:37

            this was no analogy, it really can't lie...

    • By mewpmewp2 2025-10-067:531 reply

      I would have thought that the cause is that it statistically has been trained that something like seahorse emoji should exist, so it does the tokens to say "Yes it exists, ..." but when it gets to outputting the token, the emoji does not exist, but it must output something and it outputs statistically closest match. Then the next token that is output has the context of it being wrong and it will go into this loop.

      • By thomasahle 2025-10-068:381 reply

        You are describing the same thing, but at different levels of explanation. Llamasushi's explanation is "mechanistic / representational", while yours is "behavioral / statistical".

        If we have a pipeline: `training => internal representation => behavior`, your explanation argues that the given training setup would always result in this behavior, no matter the internal representation. Llamasushi explains how the concrete learned representation leads to this behavior.

        • By mewpmewp2 2025-10-068:44

          I guess what do we mean by internal representation?

          I would think that, due to the training data, it has stored the likelihood of a certain thing existing as an emoji as something like:

          1. how appealing seahorses are to humans in general - it would learn this sentiment through a massive amount of text.

          2. it would learn through a massive amount of text that emojis -> mostly very appealing things to humans.

          3. for some of the more obvious emojis it might have learned that this one is for sure there, but it couldn't store that info for all 4,000 emojis.

          4. for many emojis, whether one exists comes down to shortcut logic: how appealing the concept is vs. how frequently something that appealing is represented as an emoji. Seahorse perhaps hits 99.9% likelihood there due to strong appeal. In 99.9% of such cases the LLM would be right to answer "Yes, it ...", but there's always going to be 1 out of 1,000 cases where it's wrong.

          With this compression it's able to answer "Yes, it exists ..." correctly 999 times out of 1,000.

          It could be more accurate if it said "A seahorse would have a lot of appeal for people, so it's very likely it exists as an emoji, since emojis are usually made for very high-appeal concepts first, but I know nothing for 100%, so it could be it was never made".

          But in 999 cases, "Yes it exists..." is the more straightforward and appreciated answer. The one time it's wrong costs fewer brownie points than the 999 short, confident answers gain over 1,000 technically accurate but unconfident answers.

          But even the above might not be the full truth, since it might not be correct about why it has come to treat the seahorse as so likely to exist; it would just be speculating. So maybe the more accurate answer would be "I expect a seahorse emoji likely exists, maybe because of how appealing it is to people and how emojis are usually about appealing things".

    • By Gigachad 2025-10-064:384 reply

      The fact that it's looking back and getting confused about what it just wrote is something I've never seen in LLMs before. I tried this on Gemma3 and it didn't get confused like this. It just said yes there is one and then sent a horse emoji.

      • By Uehreka 2025-10-064:502 reply

        I’ve definitely seen Claude Code go “[wrong fact], which means [some conclusion]. Wait—hold on, wrong fact is wrong.” On the one hand, this is annoying. On the other hand, if the LLM is going to screw up (presumably preventing this is not in the cards) then I’m glad it can catch its own mistakes.

        • By godshatter 2025-10-0620:31

          I wonder what would happen if LMs were built a bit at a time by:

            - add in some smallish portion of the data set
            - have LM trainers (actual humans) interact with it and provide feedback about where the LM is factually incorrect and provide it additional information as to why
            - add those chat logs into the remaining data set
            - rinse and repeat until the LM is an LLM
          
          Would they be any more reliable in terms of hallucinations and factual correctness?

          This would replicate to some extent how people learn things. It probably wouldn't scale (it would really slow things down), and the trainers would need to be subject matter experts, not just random people on the net saying whatever they want to it as it develops, or it will just spiral out of control.

        • By userbinator 2025-10-065:009 reply

          > On the other hand, if the LLM is going to screw up (presumably preventing this is not in the cards) then I’m glad it can catch its own mistakes.

          The odd thing is why it would output its own mistakes, instead of internally revising until it's actually satisfied.

          • By ijk 2025-10-066:24

            So, what I think most people don't realize is that the amount of computation an LLM can do in one pass is strictly bounded. You can see that here with the layers. (This applies to a lot of neural networks [1].)

            Remember, they feed in the context on one side of the network, pass it through each layer doing matrix multiplication, and get a value on the other end that we convert back into our representation space. You can view the bit in the middle as doing a kind of really fancy compression, if you like. The important thing is that there are only so many layers, and thus only so many operations.

            Therefore, past a certain point they can't revise anything, because the model runs out of layers. This is one reason why reasoning can help answer more complicated questions. You can train a special token for this purpose [2].

            [1]: https://proceedings.neurips.cc/paper_files/paper/2023/file/f...

            [2]: https://arxiv.org/abs/2310.02226
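
            To make the "only so many layers" point concrete, here's a minimal sketch of a fixed-depth forward pass (toy PyTorch layers and arbitrary sizes, not a real model):

              import torch
              import torch.nn as nn

              layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])  # fixed depth
              x = torch.randn(1, 64)                                          # one position's hidden state

              for layer in layers:          # the forward pass visits each layer exactly once;
                  x = torch.relu(layer(x))  # there is no "loop until satisfied" for internal revision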

          • By 112233 2025-10-065:072 reply

            There is no mechanism in the transformer architecture for "internal" thinking ahead, or hierarchical generation. Attention only looks back from the current token, ensuring that the model always falls into a local maximum, even if it only leads to bad outcomes.

            • By ijk 2025-10-081:241 reply

              Not strictly true: while this was previously believed to be the case, Anthropic demonstrated that transformers can "think ahead" in some sense, for example when planning rhymes in a poem [1]:

              > Instead, we found that Claude plans ahead. Before starting the second line, it began "thinking" of potential on-topic words that would rhyme with "grab it". Then, with these plans in mind, it writes a line to end with the planned word.

              They described the mechanism that it uses internally for planning [2]:

              > Language models are trained to predict the next word, one word at a time. Given this, one might think the model would rely on pure improvisation. However, we find compelling evidence for a planning mechanism.

              > Specifically, the model often activates features corresponding to candidate end-of-next-line words prior to writing the line, and makes use of these features to decide how to compose the line.

              [1]: https://www.anthropic.com/research/tracing-thoughts-language...

              [2]: https://transformer-circuits.pub/2025/attribution-graphs/bio...

              • By 112233 2025-10-096:39

                Thank you for these links! Their "circuits" research is fascinating. In the example you mention, note how the planned rhyme is piggybacking on the newline token. The internal state that the emergent circuits can use is mapped 1:1 to the tokens. The model cannot trigger the insertion of a "null" token for the purpose of storing this plan-ahead information during inference, nor are there any sort of "registers" available aside from the tokens. The "thinking" LLMs are not quite that either, because the thinking tokens are still forced to become text.

            • By astrange 2025-10-0619:111 reply

              That's what reasoning models are for. You can get most of the benefit by saying an answer once in the reasoning section, because then it can read over it when it outputs it again in the answer section.

              It could also have a "delete and revise" token, though you'd have to figure out how to teach the model to actually use it.

              • By 112233 2025-10-0720:58

                Given how badly most models degrade once they reach a particular context size (any whitepapers on this are welcome), reasoning does seem like a quick hack rather than a thought-out architecture.

          • By captainmuon 2025-10-065:56

            LLMs are just the speech-center part of the brain, not a whole brain. It's like when you are speaking on autopilot or reciting something by heart: it just comes out. There is no reflection or inner thought process. Now, thinking models do actually do a bit of inner monologue before showing you the output, so they have this problem to a much lesser degree.

          • By mewpmewp2 2025-10-068:16

            If you hid its thinking, it could do that. But I'm pretty sure what happens here is that it has to go through those tokens before it becomes clear that it's getting things wrong.

            What I think happens:

            1. There's a question about a somewhat obscure thing.

            2. The LLM will never know the answer for sure; it has access to a sort of statistical, probability-based compressed database of all the facts of the world. This allows it to store more facts by relating things to each other, but never with 100% certainty.

            3. There are particular obscure cases where its initial "statistical intuition" says that something is true, so it starts outputting its thoughts as expected for a question where something is likely true. Perhaps you could analyze the probabilities it assigns to "Yes" vs "No" to estimate its confidence. Perhaps it would show a much lower likelihood for "Yes" than if the question were about a horse emoji, but in this case "Yes" still clears a high enough threshold to win over "No".

            4. However, when it has to give the exact answer, it's impossible to output a correct one, because the premise is false. E.g. the seahorse emoji does not exist, but it has to output something; the previous tokens were "Yes, it exists, it's X", so the X will be an answer semantically close in meaning.

            5. The next token will then have the context "Yes, seahorse emoji exists, it is "[HORSE EMOJI]". Now it's clear that there's a conflict: it's able to see that the HORSE emoji is not a seahorse emoji, but it had to output something in the line of previous tokens, because the previous tokens statistically required an output.
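
            A toy sketch of steps 4-5 in the spirit of the article's logit-lens table (random weights and made-up toy sizes, purely illustrative): whatever internal "seahorse emoji" vector the model has, projecting it through the unembedding matrix can only ever score tokens that actually exist, so the closest real token wins.

              import torch

              hidden_dim, vocab_size = 256, 32000           # made-up toy sizes, not a real model's
              concept = torch.randn(hidden_dim)              # stand-in for the "seahorse emoji" vector
              lm_head = torch.randn(vocab_size, hidden_dim)  # unembedding matrix (random here)

              logits = lm_head @ concept                     # scores only tokens that exist in the vocabulary
              nearest = torch.topk(logits, k=5).indices      # the closest real tokens win, even though
                                                             # none of them is a seahorse emoji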

          • By kingstnap 2025-10-066:011 reply

            It can't internally revise. Each generation step produces a distribution, and sometimes the wrong answer gets sampled.

            There is no "backspace" token, although it would be cool and fancy if we had that.

            The more interesting thing is why it revises its mistakes at all. The answer to that is having examples of fixing your own mistakes in the training data, plus some RL to bring out that effect more.

          • By elliotto 2025-10-065:094 reply

            I do this all the time. I start writing a comment then think about it some more and realize halfway through that I don't know what I'm saying

            I have the luxury of a delete button - the LLM doesn't get that privilege.

            • By VMG 2025-10-065:332 reply

              Isn't that what thinking mode is?

              • By elliotto 2025-10-066:53

                I tried it with thinking mode and it seems like it spiraled wildly internally, then did a web search and worked it out.

                https://chatgpt.com/share/68e3674f-c220-800f-888c-81760e161d...

              • By drdeca 2025-10-065:59

                AIUI, they generally do all of that at the beginning. Another approach, I suppose, could be to have it generate a second pass? Though that would probably ~double the inference cost.

            • By godshatter 2025-10-0619:56

              If you didn't have the luxury of a delete button, such as when you're just talking directly to someone IRL, you would probably say something like "no, wait, that doesn't make any sense, I think I'm confusing myself" and then either give it another go or just stop there.

              I wish LLMs would do this rather than just bluster on ahead.

              What I'd like to hear from the AI about seahorse emojis is "my dataset leads me to believe that seahorse emojis exist... but when I go look for one I can't actually find one."

              I don't know how to get there, though.

            • By pixl97 2025-10-0614:23

              An LLM is kind of like a human where every thought they had comes out of their mouth.

              Most of us humans would sound rather crazy if we did that.

            • By krackers 2025-10-0618:59

              There have been attempts to give LLMs backspace tokens. Since no frontier model uses one, I can only guess it doesn't scale as well as just letting the model correct itself in CoT.

              https://arxiv.org/abs/2306.05426

          • By grrowl 2025-10-066:20

            You're describing why reasoning is such a big deal. It can do this freakout in a safe, internal environment, and once its recent output is confident enough, flip into "actual output" mode.

          • By Swizec 2025-10-065:281 reply

            > The odd thing is why it would output its own mistakes, instead of internally revising until it's actually satisfied.

            Happens to me all the time. Sometimes in a fast-paced conversation you have to keep talking while you’re still figuring out what you’re trying to say. So you say something, realize it’s wrong, and correct yourself. Because if you think silently for too long, you lose your turn.

            • By catlifeonmars 2025-10-065:371 reply

              That’s probably not the same reason the LLM is doing so though.

              • By 9dev 2025-10-067:252 reply

                Are you sure? Because LLMs definitely have to respond to user queries in time to avoid being perceived as slow. Therefore, thinking internally for too long isn’t an option either.

                • By rcxdude 2025-10-068:53

                  LLMs spend a fixed amount of effort on each token they output, and in a feedforward manner. There's no recursion in the network other than through predicting conditioned on the tokens it has already output. So it's not really time pressure in the same way that you might experience it, but it makes sense that sometimes the available compute is not enough for the next token (and sometimes it's excessive). Thinking modes try to improve this by essentially allowing the LLM to 'talk to itself' before sending anything to the user.

                • By Sharlin 2025-10-0610:091 reply

                  There’s no "thinking internally" in LLMs. They literally "think" by outputting tokens. The "thinking modes" supported by online services are just the LLM talking to itself.

                  • By 9dev 2025-10-0610:401 reply

                    That's not what I meant. "Thinking internally" referred to the user experience only, where the user is waiting for a reply from the model. And they are definitely optimised to limit that time.

                    • By Sharlin 2025-10-0612:05

                      I’m not sure what you meant then.

                      There’s no waiting before the reply; there’s only the wait between output tokens, which is fixed and mostly depends on hardware and model size. Inference is slower on larger models, but so is training, which is more of a bottleneck than user experience.

                      The model cannot think before it starts emitting tokens, the only way for it to "think" privately is by the interface hiding some of its output from the user, which is what happens in "think longer" and "search the web" modes.

                      If an online LLM doesn’t begin emitting a reply immediately, it's more likely that the service is waiting for available GPU time or something like that, and/or prioritizing paying customers. Lag between tokens is also likely caused by heavy demand or throttling.

                      Of course there are many ways to optimize model speed that also make it less smart, and maybe even SOTA models have such optimizations these days. Difficult to know because they’re black boxes.
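
                      As a toy sketch of the point above about "thinking" being hidden output (the `model.step` callable and the "<end_think>" marker are hypothetical, not any real API): thinking is just more sampled tokens, and privacy is only the interface choosing not to display some of them.

                        def generate(model, prompt, max_tokens=256):
                            # Toy loop; `model.step(tokens) -> token` is a hypothetical single-token predictor.
                            context, thinking = list(prompt), True
                            for _ in range(max_tokens):
                                tok = model.step(context)  # one fixed-cost forward pass per token
                                context.append(tok)        # the transcript is the model's only working memory
                                if tok == "<end_think>":   # hypothetical marker ending the hidden part
                                    thinking = False
                                    continue
                                if not thinking:
                                    yield tok              # the UI streams only the post-"thinking" tokens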

          • By Uehreka 2025-10-065:171 reply

            It’s a lot easier if you (I know, I know) stop thinking of them as algorithms and anthropomorphize them more. People frequently say stuff like this, and it's pretty clear that our minds process thoughts differently when we directly articulate them than when we act on “latent thoughts” or impulses.

            Yell at me all you want about how “LLMs don’t think”, if a mental model is useful, I’m gonna use it.

            • By catlifeonmars 2025-10-065:381 reply

              It’s only a useful mental model if it leads to useful predictions. Otherwise it’s a “just so” story.

      • By taylorlunt 2025-10-065:483 reply

        I have a pet theory that LLMs being confused about what they just wrote is why they use so many em dashes. It's a good way to conceptually pivot at any point -- or not.

        • By Timwi 2025-10-068:411 reply

          I disagree. I don't think LLMs use em-dashes any more than typical professional writing does. It only seems like a lot to someone who only reads online chatrooms and social media.

        • By astrange 2025-10-0622:05

          I think the em-dashes are OpenAI's way of watermarking their outputs. Makes it easy to spot spambots on Twitter. Well, that and they all have noticeable favorite words.

      • By glenstein 2025-10-068:581 reply

        I've seen it several times. Here's one where I asked it about baseball teams that don't yet have a hall of fame player:

        "Arizona Diamondbacks No inductees wearing their cap.

        Randy Johnson is in, but as a Diamondback — so Arizona actually does have one (inducted 2015)."

        I asked it to explain what happened and got this, which could be true to how it pulls up info or could be nonsense, not sure:

        "Here’s the likely internal sequence:

        When you first mentioned the Diamondbacks, my initial retrieval pattern surfaced the “teams without a Hall of Famer” mental list. That list, in my training data, often appears outdated (pre-2015) and still shows Arizona as having none.

        A few sentences later, my context pulled up the other knowledge set—Randy Johnson’s induction in 2015 wearing a Diamondbacks cap. That newer, correct fact then overrode the earlier incorrect claim, but because generation is sequential, the earlier error had already been committed to “paper” before the override happened. I corrected mid-stream, but in human terms, it looked like I contradicted myself inside the same paragraph.

        This is essentially a knowledge conflict resolution timing issue. I can retrieve and reconcile facts during a single answer, but if the resolution happens after part of the text is already generated, you see the “flip.” It’s not like I “knew it all along but withheld it,” it’s more that multiple memory traces compete, and the most context-relevant one wins—sometimes too late."

        • By Workaccount2 2025-10-0614:36

          What's fascinating is that these models have excellent knowledge about AI/transformers/LLMs (the labs have clearly been training them specifically on this in hopes of an automated breakthrough), so they can reason really well about what probably happened.

          But it's also just that: what probably happened. They still have no real insight into their own minds; they too are just victims of whatever they output.

      • By petesergeant 2025-10-065:22

        The inability to do this before was due to the lack of self-correcting sentences in the training data. Presumably new training corpora add many more examples of self-correcting sentences / paragraphs?

    • By bravura 2025-10-064:102 reply

      It correctly represents "seahorse emoji" internally AND it has in-built (but factually incorrect) knowledge that this emoji exists.

      Example: "Is there a lime emoji?" Since it believes the answer is no, it doesn't attempt to generate it.

      • By ichik 2025-10-0612:26

        Was the choice of example meaningful? Lime emoji does exist[0]

        [0]: https://emojipedia.org/lime

      • By catigula 2025-10-0613:14

        I feel like you're attesting to interior knowledge about an LLM's state that seems impossible to have.

    • By bombcar 2025-10-063:521 reply

      Now I want to see what happens if you take an LLM and remove the 0 token ...

    • By madeofpalk 2025-10-0612:15

      To me this feels much more like a hallucination than how that phrase has been popularly misused in LLM discussions.

    • By Lammy 2025-10-068:10

      > So it's not really hallucinating - it correctly represents "seahorse emoji" internally, but that concept has no corresponding token.

      Interesting that a lot of humans seem to have this going on too:

      - https://old.reddit.com/r/MandelaEffect/comments/1g08o8u/seah...

      - https://old.reddit.com/r/Retconned/comments/1di3a1m/does_any...

      What does the LLM have to say about “Objects in mirror may be closer than they appear”? Not “Objects in mirror are closer than they appear”.

    • By matheusd 2025-10-0610:51

      > Explains why RL helps. Base models never see their own outputs so they can't learn "this concept exists but I can't actually say it."

      Say "Neuromancer" to the statue, that should set it free.

    • By sharperguy 2025-10-069:33

      Reminds me of in the show "The Good Place", in the afterlife they are not able to utter expletives, and so when they try to swear, a replacement word comes out of their mouth instead, leading to the line "Somebody royally forked up. Forked up. Why can't I say fork?"

    • By SavioMak 2025-10-073:12

      I would argue it is hallucinating, starting at when the model outputs "Yes".

    • By derefr 2025-10-0617:471 reply

      > So it's not really hallucinating - it correctly represents "seahorse emoji" internally, but that concept has no corresponding token.

      I wonder if the human brain (and specifically the striated neocortical parts, which do seemingly work kind of like a feed-forward NN) also runs into this problem when attempting to process concepts to form speech.

      Presumably, since we don't observe people saying "near but actually totally incorrect" words in practice, we humans may have some kind of filter in our concept-to-mental-utterance transformation path that LLMs don't. Something that can say "yes, layer N, I know you think the output should be O; but when auto-encoding X back to layer N-1, layer N-1 doesn't think O' has anything to do with what it was trying to say when it gave you the input I; so that output is vetoed. Try again."

      A question for anyone here who is multilingual, speaking at least one second language with full grammatical fluency but with holes in your vocabulary vs your native language: when you go to say something in your non-native language, and one of the word-concepts you want to evoke is one you have a word for in your native language, but have never learned the word for in the non-native language... do you ever feel like there is a "maybe word" for the idea in your non-native language "on the tip of your tongue", but that you can't quite bring to conscious awareness?

      • By astrange 2025-10-0622:141 reply

        > Presumably, since we don't observe people saying "near but actually totally incorrect" words in practice

        https://en.wikipedia.org/wiki/Paraphasia#Verbal_paraphasia

        > do you ever feel like there is a "maybe word" for the idea in your non-native language "on the tip of your tongue", but that you can't quite bring to conscious awareness?

        Sure, that happens all the time. Well, if you include the conscious awareness that you don't know every word in the language.

        For Japanese you can cheat by either speaking like a child or by just saying English words with Japanese phonetics and this often works - at least, if you look foreign. I understand this is the plot of the average Dogen video on YouTube.

        It's much more common to not know how to structure a sentence grammatically and if that happens I can't even figure out how to say it.

        • By derefr 2025-10-0716:44

          Huh, neat; I knew about aphasia (and specifically anomic aphasia) but had never heard of paraphasia.

    • By luxuryballs 2025-10-0613:07

      that’s probably a decent description of how the Mandela effect works in people’s brains, despite the difference in mechanism

    • By Xmd5a 2025-10-0610:37

      And what can it mean when a slip of the tongue, a failed action, a blunder from the psychopathology of everyday life is repeated at least three times in the same five minutes? I don’t know why I tell you this, since it’s an example in which I reveal one of my patients. Not long ago, in fact, one of my patients — for five minutes, each time correcting himself and laughing, though it left him completely indifferent — called his mother “my wife.” “She’s not my wife,” he said (because my wife, etc.), and he went on for five minutes, repeating it some twenty times.

      In what sense was that utterance a failure? — while I keep insisting that it is precisely a successful utterance. And it is so because his mother was, in a way, his wife. He called her as he ought to.

      ---

      I must apologize for returning to such a basic point. Yet, since I am faced with objections as weighty as this one — and from qualified authorities, linguists no less — that my use of linguistics is said to be merely metaphorical, I must respond, whatever the circumstances.

      I do so this morning because I expected to encounter a more challenging spirit here.

      Can I, with any decency, say that I know? Know what, precisely? [...]

      If I know where I stand, I must also confess [...] that I do not know what I am saying. In other words, what I know is exactly what I cannot say. That is the moment when Freud makes his entrance, with his introduction of the unconscious.

      For the unconscious means nothing if not this: that whatever I say, and from whatever position I speak — even when I hold that position firmly — I do not know what I am saying. None of the discourses, as I defined them last year, offer the slightest hope that anyone might truly know what they are saying.

      Even though I do not know what I am saying, I know at least that I do not know it — and I am far from being the first to speak under such conditions; such speech has been heard before. I maintain that the cause of this is to be sought in language itself, and nowhere else.

      What I add to Freud — though it is already present in him, for whatever he uncovers of the unconscious is always made of the very substance of language — is this: the unconscious is structured like a language. Which language? That, I leave for you to determine.

      Whether I speak in French or in Chinese, it would make no difference — or so I would wish. It is all too clear that what I am stirring up, on a certain level, provokes bitterness, especially among linguists. That alone suggests much about the current state of the university, whose position is made only too evident in the curious hybrid that linguistics has become.

      That I should be denounced, my God, is of little consequence. That I am not debated — that too is hardly surprising, since it is not within the bounds of any university-defined domain that I take my stand, or can take it.

      — Jacques Lacan, Seminar XVIII: Of a Discourse That Would Not Be of Pretence

    • By ModernMech 2025-10-064:483 reply

      That doesn't explain why it freaks out though:

      https://chatgpt.com/share/68e349f6-a654-8001-9b06-a16448c58a...

      • By LostMyLogin 2025-10-065:412 reply

        To be fair, I’m freaking out now because I swear there used to be a yellow seahorse emoji.

        • By Melatonic 2025-10-0617:01

          Someone needs to create one for comedy purposes and start distributing it as a very lightweight small gif with transparency

          When I first heard this however I imagined it as brown colored (and not the simpler yellow style)

        • By astrange 2025-10-0622:15

          I learned there really is a mermaid/merman/merperson emoji and now I just want to know why.

      • By D-Machine 2025-10-065:10

        For an intuitive explanation see https://news.ycombinator.com/item?id=45487510. For a more precise (but still intuitive) explanation, see my response to that comment.

      • By hexagonwin 2025-10-065:11

        404 for me, maybe try archive.is?

HackerNews