Beliefs that are true for regular software but false when applied to AI

2025-10-14 18:26 | boydkane.com


(a note for technical folk)1

When it comes to understanding the dangers of AI systems, the general public has the worst kind of knowledge: the kind you know for sure that just ain’t so.

After 40 years of persistent badgering, the software industry has convinced the public that bugs can have disastrous consequences. This is great! It is good that people understand that software can result in real-world harm. Not only does the general public mostly understand the dangers, but they mostly understand that bugs can be fixed. It might be expensive, it might be difficult, but it can be done.

The problem is that this understanding, when applied to AIs like ChatGPT, is completely wrong. The software that runs AI acts very differently to the software that runs most of your computer or your phone. Good, sensible assumptions about bugs in regular software actually end up being harmful and misleading when you try to apply them to AI.

Attempting to apply regular-software assumptions to AI systems leads to confusion, and remarks such as:

“If something goes wrong with ChatGPT, can’t some boffin just think hard for a bit, find the missing semi-colon or whatever, and then fix the bug?”

or

“Even if it’s hard for one person to understand everything the AI does, surely there are still smart people who each understand small parts of what the AI does?”

or

“Just because current systems don’t work perfectly, that’s not a problem, right? Because eventually we’ll iron out all the bugs, so the AIs will get more reliable over time, like old software is more reliable than new software.”

If you understand how modern AI systems work, these statements are all painfully incorrect. But if you’re used to regular software, they’re completely reasonable. I believe there is a gap between the experts and the novices in the field:

  • the experts don’t see the gap because to them it’s obvious, so they don’t bother explaining it
  • the novices don’t see the gap because they don’t know to look, so they don’t realise where their confusion comes from.

This leads to frustration on both sides, because the experts feel like their arguments aren’t hitting home, and the novices feel like all arguments have obvious flaws. In reality, the experts and the novices have different, unspoken, assumptions about how AI systems work.

To make this more concrete, here are some example ideas that are perfectly true when applied to regular software but become harmfully false when applied to modern AIs:

Software vulnerabilities are caused by mistakes in the code

In regular software, vulnerabilities are caused by mistakes in the lines of code that make up the software. There might be hundreds of thousands of lines of code, but code doesn’t take up much space so this is only around 50MB of data, about the size of a small album of photos.

But in modern AI systems, vulnerabilities or bugs are usually caused by problems in the data used to train an AI2. It takes thousands of gigabytes of data to train modern AI systems, and bad behaviour isn’t caused by any single bad piece of data, but by the combined effects of significant fractions of the dataset. Because these datasets are so large, nobody knows everything that an AI is actually trained on. One popular dataset, FineWeb, is about 11.25 trillion words long3, which, if you were reading at about 250 words per minute, would take you over 85 thousand years to read. It’s just not possible for any single human (or even a team of humans) to have read everything that an LLM has read during training.
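
To put the two scales side by side, here is a quick back-of-the-envelope calculation in Python using the figures quoted above (the line count and average line length are rough assumptions, not measurements):

    # Rough scale comparison, using the figures quoted in the text.
    codebase_lines = 500_000             # "hundreds of thousands of lines"
    bytes_per_line = 100                 # assumed average line length
    codebase_mb = codebase_lines * bytes_per_line / 1e6
    print(f"Codebase: ~{codebase_mb:.0f} MB")            # ~50 MB

    fineweb_words = 11.25e12             # FineWeb: ~11.25 trillion words
    words_per_minute = 250               # typical adult reading speed
    years_to_read = fineweb_words / words_per_minute / (60 * 24 * 365)
    print(f"Reading time: ~{years_to_read:,.0f} years")  # ~85,600 years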

Bugs in the code can be found by carefully analysing the code

With regular software, if there’s a bug, it’s possible for smart people to carefully read through the code and logically figure out what must be causing the bug.
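
As a toy illustration (my own example, not from the article), this is the kind of bug a reviewer can nail down by reproducing the failure and reading the code line by line:

    def average(values):
        """Return the mean of a non-empty list of numbers."""
        total = 0
        for v in values:
            total += v
        return total / (len(values) - 1)   # bug: off-by-one in the divisor

    # The failure is reproducible, and stepping through the function points
    # at the exact expression that is wrong:
    print(average([2, 4, 6]))  # prints 6.0 instead of the expected 4.0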

With AI systems, almost all bad behaviour originates from the data that’s used to train them2, but it’s basically impossible to look at a misbehaving AI and figure out which parts of the training data caused that bad behaviour. In practice, it’s rare to even attempt this; instead, researchers will retrain the AI with more data to try to counteract the bad behaviour, or they’ll start over and try to curate the data so it doesn’t include the bad data.

You cannot logically deduce which pieces of data caused the bad behaviour; you can only make good guesses. For example, modern AIs are trained on lots of mathematics proofs and programming tasks, because that seems to make them do better at reasoning and logical thinking tasks. If an AI system makes a logical reasoning mistake, it’s impossible to attribute that mistake to any particular portion of the training data; the only answer we’ve got is to use more data next time.

I think I need to emphasise this: With regular software, we can pinpoint mistakes precisely, walk step-by-step through the events leading up to the mistake, and logically understand why that mistake happened. When AIs make mistakes, we don’t understand the steps that caused those mistakes. Even the people who made the AIs don’t understand why they make mistakes4. Nobody understands where these bugs come from. We sometimes kinda have a rough idea about why they maybe did something unusual. But we’re far, far away from anything that guarantees the AI won’t have any catastrophic failures.

Once a bug is fixed, it won’t come back again

With regular software, once you’ve found the bug, you can fix the bug. And once you’ve fixed the bug, it won’t re-appear5. There might be a bug that causes similar problems, but it’s not the same bug as the one you fixed. This means you can, if you’re patient, reduce the number of bugs over time and rest assured that removing new bugs won’t cause old bugs to re-appear.
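
In regular software this guarantee is usually locked in with a regression test: after fixing the bug, you add a test that reproduces the old failure, and it will fail loudly if the bug ever returns. A minimal sketch, continuing the toy example above:

    def average(values):
        """Mean of a non-empty list; the off-by-one divisor has been fixed."""
        return sum(values) / len(values)

    def test_average_regression():
        # Reproduces the exact input that used to trigger the bug; if the
        # off-by-one ever comes back, this test fails immediately.
        assert average([2, 4, 6]) == 4.0

    test_average_regression()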

This is not the case with AI. It’s not really possible to “fix” a bug in an AI: even if the AI was behaving weirdly, and you retrained it, and now it’s not behaving weirdly anymore, you can’t know for sure that the weird behaviour is gone, only that it doesn’t happen for the prompts you tested. It’s entirely possible that someone will find a prompt you forgot to test, and then the buggy behaviour is back again!
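
The closest equivalent for an AI system is spot-checking the retrained model against a finite list of prompts. A minimal sketch, where the model call and the weirdness check are hypothetical stand-ins rather than real APIs:

    def generate(prompt):
        # Stand-in for a call to the retrained model (hypothetical).
        return "a perfectly normal response"

    def looks_weird(response):
        # Stand-in for whatever check flagged the original bad behaviour.
        return "WEIRD" in response

    test_prompts = [
        "Summarise this email for me",
        "summarise this email for me?",   # tiny variations matter
        "Write a formal apology to a customer",
    ]

    # Passing only shows the weird behaviour is absent for THESE prompts;
    # any prompt you forgot to test could still trigger it.
    print(all(not looks_weird(generate(p)) for p in test_prompts))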

Every time you run the code, the same thing happens

With regular software, you can run the same piece of code multiple times and it’ll behave in the same way. If you give it the same input, it’ll give you the same output.

Now technically this is still true for AIs: if you give them exactly the same prompt, they’ll respond in exactly the same way. But practically, it’s very far from the truth6. Even tiny changes to the input of an AI can cause dramatic changes in the output. Even innocent changes like adding a question mark at the end of your sentence or forgetting to start your sentence with a capital letter can cause the AI to return something different.

Additionally, most AI companies deliberately vary the way their AIs respond, so that they say slightly different things to the same prompt. This helps their AIs seem less robotic and more natural.
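
Most of that variation comes from how the model picks each next word: it assigns a score to every candidate token and then samples from the resulting probabilities, usually with a "temperature" above zero. Here is a minimal sketch of temperature sampling; the candidate tokens and scores are made up for illustration:

    import math
    import random

    def sample_next_token(scores, temperature=0.8):
        """Sample one token from a softmax over scores, scaled by temperature."""
        scaled = [s / temperature for s in scores.values()]
        m = max(scaled)
        weights = [math.exp(s - m) for s in scaled]   # subtract max for stability
        return random.choices(list(scores), weights=weights, k=1)[0]

    # Illustrative scores for the word after "The capital of France is":
    scores = {"Paris": 4.0, "paris": 2.5, "located": 1.0}

    # With temperature > 0, rerunning the same prompt can pick different tokens.
    print([sample_next_token(scores) for _ in range(5)])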

If you give specifications beforehand, you can get software that meets those specifications

With regular software, this is true. You can sit with stakeholders to discuss the requirements for some piece of software, and then write code to meet those requirements. The requirements might change, but fundamentally you can write code to serve some specific purpose and have confidence that it will serve that specific purpose.

With AI systems, this is more or less false. Or at the very least, the creators of modern AI systems have far, far less control over the behaviour their AIs will exhibit. We understand how to get an AI to meet narrow, testable specifications like speaking English and writing code, but we don’t know how to get a brand-new AI to achieve a certain score on some particular test, or to guarantee global behaviour like “never tells the user to commit a crime”. The best AI companies in the world have basically one lever, which is “better”: they can pull that lever to make the AI better, but nobody knows precisely what to do to ensure an AI writes formal emails correctly or summarises text accurately.
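
The difference shows up as soon as you try to write the requirement down. For regular software, a narrow requirement can be turned directly into an executable test before the code is even written; a minimal sketch with a made-up formatting requirement:

    # Hypothetical requirement agreed with stakeholders: amounts are shown
    # as euros with a thousands separator and exactly two decimal places.

    def format_euros(amount):
        return f"€{amount:,.2f}"

    # The specification doubles as a test: if these assertions pass, the
    # requirement is met, by construction.
    assert format_euros(1234.5) == "€1,234.50"
    assert format_euros(0) == "€0.00"

There is no equivalent artifact for a global behaviour like “never tells the user to commit a crime”: you can only sample the model’s responses and hope the samples are representative.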

This means that we don’t know what an AI will be capable of before we’ve trained it. It’s very common for AIs to be released to the public for months before a random person on Twitter discovers some ability that the AI has which even its creators didn’t know about. So far, these abilities have mostly just been fun, like being good at GeoGuessr:

Geoguessr map

Or making photos look like they were from a Studio Ghibli film:

Ghibli tweet

But there’s no reason for these hidden abilities to always be positive. It’s entirely possible that some dangerous capability is hidden in ChatGPT, but nobody’s figured out the right prompt just yet.

While it’s possible to demonstrate the safety of an AI for a specific test suite or a known threat, it’s impossible for AI creators to definitively say their AI will never act maliciously or dangerously for any prompt it could be given.
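
One way to see why is to count. Even a short prompt is drawn from an astronomically large space, so any finite test suite covers a vanishing fraction of it. The vocabulary size and prompt length below are illustrative assumptions, but the conclusion doesn't depend on the exact values:

    vocab_size = 50_000      # a typical LLM vocabulary, order of magnitude
    prompt_length = 20       # a short prompt, measured in tokens

    possible_prompts = vocab_size ** prompt_length
    print(f"~10^{len(str(possible_prompts)) - 1} possible 20-token prompts")
    # Testing a billion prompts per second for the age of the universe
    # would still cover only a negligible sliver of this space.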

It is good that most people know the dangers of poorly written or buggy software. But this hard-won knowledge about regular software is misleading the public when it gets applied to AI. Despite the cries of “inscrutable arrays of floating point numbers”, I’d be surprised if a majority of people know that modern AI is architecturally different from regular software.

AI safety is a complicated and subtle argument. The best we can do is to make sure we’re starting from the same baseline, and that means conveying to our contemporaries that if it all starts to go wrong, we cannot just “patch the bug”7.

If this essay was the first time you realised AI was fundamentally different from regular software, let me know, and share this with a friend who might also not realise the difference.

If you always knew that regular software and AIs are fundamentally different, talk to your family and non-technical friends, or with a stranger at a coffee shop. I think you’ll be surprised at how few people know that these two are different.

If you’re interested in the dynamics between experts and novices, and how gaps between them arise, I’ve written more about the systemic biases encountered by experts (and the difficulties endured by novices) in this essay: Experts have it easy.

Thanks to Sam Cross and Caleb for reviewing drafts of this essay.



Comments

  • By freetime2 2025-10-1422:0815 reply

    For a real world example of the challenges of harnessing LLMs, look at Apple. Over a year ago they had a big product launch focused on "Apple Intelligence" that was supposed to make heavy use of LLMs for agentic workflows. But all we've really gotten since then are a couple of minor tools for making emojis, summarizing notifications, and proofreading. And they even had to roll back the notification summaries for a while for being wildly "out of control". [1] And in this year's iPhone launch the AI marketing was toned down significantly.

    I think Apple execs genuinely underestimated how difficult it would be to get LLMs to perform up to Apple's typical standards of polish and control.

    [1] https://www.bbc.com/news/articles/cge93de21n0o

    • By rldjbpin 2025-10-158:213 reply

      > to perform up to Apple's typical standards of polish and control.

      i no longer believe they have kept to those standards in general. the ux/ui used to be a top priority, but the quality control has certainly gone down over the years [1]. the company is now driven more by supply chain and business-minded optimizations than by what to give to the end user.

      at the same time, what one can do using AI has large correlation with what one does with their devices in the first place. a windows recall like feature for ipad os might have been interesting (if not equally controversial), but not that useful because even till this day it remains quite restrictive for most atypical tasks.

      [1] https://www.macobserver.com/news/macos-tahoe-upside-down-ui-...

      • By dingdingdang 2025-10-159:541 reply

        >> to perform up to Apple's typical standards of polish and control.

        >i no longer believe they have kept on to the standards in general.

        100% agree with this. If I compare AI's ability to speed up the baseline for me in terms of programming Golang (hard/tricky tasks clearly still require human input - watch out for I/O ops) with Apple's lack of ability to integrate it in even the simplest of ways.. things are just peculiar on the Apple front. Bit similar to how MS seems to be gradually losing the ability to produce a version of Windows that people want to run due to organisational infighting.

        • By lazide 2025-10-1512:353 reply

          Personally, I’ve never seen an AI flow of any kind that meets the quality bar of a typical ‘corporate’ acceptable flow. As in, reliably works, doesn’t go crazy randomly, etc.

          I’ve seen a lot of things that look like they’re working for a demo, but shortly after starting to use it? Trash. Not every time (and it’s getting a little better), but often enough that personally I’ve found them a net drain on productivity.

          And I literally work in this space.

          Personally, I find Apple’s hesitation here a breath of fresh air, because I’ve come to absolutely hate Windows - and everybody doing vibe code messes that end up being my problem.

          • By rldjbpin 2025-10-1611:581 reply

            > Personally, I find apples hesitation here a breath of fresh air

            it does not appear to me as hesitation, but rather as an example of how they were recently unable to deliver on their marketing promises.

            calling a suite of incomplete features "Apple Intelligence" means that they had much higher expectations internally, similar to how they refined things as second-movers in other instances. they have a similar situation with XR now.

            • By lazide 2025-10-1613:39

              Everyone else just ships anyway?

          • By mxgrn 2025-10-181:49

            > I’ve never seen an AI flow of any kind that meets what would meet the quality of a typical ‘corporate’ acceptable flow. As in, reliably works, doesn’t go crazy randomly, etc.

            Jump [1] built a multi-million dollar business exactly on this, a service used by corporations in financial consultancy.

            [1] https://jump.ai/

          • By beyarkay 2025-10-1515:271 reply

            The regular ChatGPT 5 seems pretty reliable to me? I ~never get crazy output unless I'm pasting a jailbreak prompt I saw on twitter. It might not always meet my standards, but that's true of a lot of things.

            • By pipes 2025-10-1519:461 reply

              Maybe not the same thing, but chatgpt 5 was driving me insane in visual studio co pilot last week. I seemingly couldn't stop it from randomly changing bits of code, to the point where it was apologising and then doing the same in the next change even when told not to.

              I've now changed to asking where things are in the code base and how they work then making changes myself.

              • By naasking 2025-10-1523:12

                Deleting comments even when instructed not to do so is another failure mode. They definitely require more fine-tuning in these cases.

      • By ubermonkey 2025-10-1512:56

        >i no longer believe they have kept on to the standards in general.

        They're definitely not as good as they WERE, but they're still better than anybody else.

      • By formerly_proven 2025-10-1511:091 reply

        With Apple it's incredibly obvious that most software product development is nowadays handled by outsourced/offshored contractors who simply do not use the products. At least I hope that's the case, it would be disastrous if the state of iOS/watchOS is the result of their in-house on-shore talent.

        • By beyarkay 2025-10-1515:29

          It's such a testament to how good they used to be, that years and years of dropping the ball still leaves them better than everyone else. Maybe they were actually just much better than anyone was willing to pay for, and the market just didn't reward the attention to detail

    • By teeray 2025-10-152:145 reply

      > minor tools for making emojis, summarizing notifications, and proof reading.

      The notification / email summaries are so unbelievably useless too: it’s hardly any more work to just skim the notification / email, which I do anyway.

      • By SchemaLoad 2025-10-152:364 reply

        Like most AI products it feels like they started with a solution first and went searching for the problems. Text messages being too long wasn't a real problem to begin with.

        There are some good parts to Apple Intelligence though. I find the priority notifications feature works pretty well, and the photo cleanup tool works pretty well for small things like removing your finger from the corner of a photo, though it's not going to work on huge tasks like removing a whole person from a photo.

        • By mr_toad 2025-10-1513:001 reply

          > it's not going to work on huge tasks like removing a whole person from a photo.

          I use it for removing people who wander into the frame quite often. It probably won't work for someone close up, but it's great for removing a tourist who spends ten minutes taking selfies in front of a monument.

          • By beyarkay 2025-10-1515:30

            I didn't realise this was a feature, very cool!

        • By polynomial 2025-10-2718:43

          > Text messages being too long wasn't a real problem to begin with.

          Except for that one friend. You know the one I mean.

        • By mikodin 2025-10-157:431 reply

          Honestly I love the priority notifications and the notification summaries. The thing that drives me absolutely insane is that when I view the notification by clicking on it from somewhere other than the "While in the reduce interruptions focus" view, it doesn't clear. Because of this, I always have infinite notifications.

          I want to open WhatsApp and open the message and have it clear the notif. Or at least click the notif from the normal notif center and have it clear there. It kills me

          • By beyarkay 2025-10-1515:31

            What do you love about the notification summaries? I'm hearing a lot of hate for them

        • By CjHuber 2025-10-156:542 reply

          I mean it happened quite a few times that phishing emails became the priority notification on my phone

          • By SchemaLoad 2025-10-1522:00

            Really those should have been filtered out by the spam filter. If it's made it all the way to your inbox it's not surprising it got marked as a priority since phishing emails are written to look urgent, something which if real would be a priority notification.

          • By harvey9 2025-10-157:11

            Do you know if apple is using their new tools to do mail filtering? It's an interesting choice if they are since it's a genuine problem with a mature (but always evolving) solution.

      • By harrisonjackson 2025-10-153:595 reply

        The Ring app notification summaries still scare me.

        > "A bunch of people right outside your house!!!"

        because it aggregates multiple single person walking by notifications that way...

        • By disqard 2025-10-154:031 reply

          That is a fantastic example of blind application of AI making things worse.

          • By beyarkay 2025-10-1515:34

            Hopefully we'll get examples of smart applications of AI making things better

        • By blibble 2025-10-1511:30

          the advertising of those spy doorbells is entirely based on paranoia

          so ramping up the rhetoric doesn't really hurt them...

        • By cwillu 2025-10-158:122 reply

          Unrelated, but am I the only person who finds the concept of “getting notifications for somebody walking by a house” to be really creepy?

          • By Cthulhu_ 2025-10-159:531 reply

            Well yeah, but that's in part a problem with always-on doorbell cameras. On paper they're illegal in many countries (privacy laws, you can't just put up a camera and record anyone out in public), in practice the police asks people to put their doorbell cameras in a registry so they can request footage if needs be.

            Anyway, I get wanting to see who's ringing your doorbell in e.g. apartment buildings, and that extending to a house, especially if you have a bigger one. But is there a reason those cameras need to be on all the time?

            • By plasticchris 2025-10-1512:181 reply

              At least in the USA it’s legal to record public spaces. So recording the street and things that can be seen from it is legal, but pointing a camera over your neighbors fence is not.

              • By 1718627440 2025-10-1513:01

                And a lot of people don't share that opinion, so this isn't the law in a lot of countries. If you wanted to suggest that it is a problem that US companies try to extend the law of their home country to other parts of the world, then I endorse that.

          • By baq 2025-10-159:52

            it isn't creepy, it's super annoying if you don't live in the woods. got a ring doorbell and turned them off a few hours after installation, it was driving me nuts.

        • By beyarkay 2025-10-1515:33

          To be fair, I'd rather be scared by false positives than sleep through false negatives

        • By Terr_ 2025-10-155:391 reply

          That makes... That makes just enough sense to become nonsense, rather than mere noise.

          I mean, I could imagine a person with no common sense almost making the same mistake: "I have a list of 5 notifications of a person standing on the porch, and no notifications about leaving, so there must be a 5 person group still standing outside right now. Whadya mean, 'look at the times'?"

          • By cjs_ac 2025-10-158:27

            > A biologist, a physicist and a mathematician were sitting in a street cafe watching the crowd. Across the street they saw a man and a woman entering a building. Ten minutes later they reappeared together with a third person.

            > - They have multiplied, said the biologist.

            > - Oh no, an error in measurement, the physicist sighed.

            > - If exactly one person enters the building now, it will be empty again, the mathematician concluded.

            https://www.math.utah.edu/~cherk/mathjokes.html

      • By remexre 2025-10-152:453 reply

        It does feel like somebody forgot that "from the first sentence or two of the email, you can tell what it's about" was already a rule of good writing...

        • By mikkupikku 2025-10-159:262 reply

          Maybe they remembered that a lot of people aren't actually good writers. My brother will send 1000 word emails that meander through subjects like what he ate for breakfast to eventually get to the point of scheduling a meeting about negotiating a time for help with moving a sofa. Mind you, I see him several times a week so he's not lonely, this is just the way he writes. Then he complains endlessly about his coworkers using AI to summarize his emails. When told that he needs to change how he writes to cut right to the point, he adopts the "why should I change, they're the ones who suck" mentality.

          So while Apple's AI summaries may have been poorly executed, I can certainly understand the appeal and motivation behind such a feature.

          • By silvestrov 2025-10-1510:211 reply

            I feel too many humanities teachers are like your brother.

            Why use 10 words when you could do 1000. Why use headings or lists, when the whole story could be written in a single paragraph spanning 3 pages.

            • By danaris 2025-10-1512:371 reply

              I mean...this depends very heavily on what the purpose of the writing is.

              If it's to succinctly communicate key facts, then you write it quickly.

              - Discovered that Bilbo's old ring is, in fact, the One Ring of Power.

              - Took it on a journey southward to Mordor.

              - Experienced a bunch of hardship along the way, and nearly failed at the end, but with Sméagol's contribution, successfully destroyed the Ring and defeated Sauron forever.

              ....And if it's to tell a story, then you write The Lord of the Rings.

              • By kbelder 2025-10-1517:431 reply

                Sure, but different people judge differently what should be told as a story.

                "When's dinner?" "Well, I was at the store earlier, and... (paragraphs elided) ... and so, 7pm."

                • By danaris 2025-10-1517:47

                  Now, that's very true! But it's a far cry from implying that all or most humanities teachers are all about writing florid essays when 3 bullet points will do.

          • By bombcar 2025-10-1510:001 reply

            There’s a thread here that could be pulled - something about using AI to turn everyone into exactly who you want to communicate with in the way you want.

            Probably a sci-fi story about it, if not, it should be written.

            • By kbelder 2025-10-1517:45

              And AR glasses to modify appearance of everyone you see, in all sorts of ways. Inevitable nightmare, I expect.

        • By ludicrousdispla 2025-10-157:131 reply

          I think people read texts because they want to read them, and when they don't want to read the texts they are also not even interested in reading the summaries.

          Why do I think this? ...in the early 2000's my employer had a company wide license for a document summarizer tool that was rather accurate and easy to use, but nobody ever used it.

          • By bombcar 2025-10-1510:01

            The obvious use case is “I don’t want to read this but I am required to read this (job)” - the fact that people don’t want to use it even there is telling, imo.

        • By eru 2025-10-153:064 reply

          Sometimes you need to quickly learn what's in an email that was written by someone less helpful.

          Eg sometimes the writer is outright antagonistic, because they have some obligation to tell you something, but don't actually want you to know.

          • By smogcutter 2025-10-154:04

            Even bending over that far backwards to find a useful example comes up empty.

            Those kinds of emails are so uncommon they’re absolutely not worth wasting this level of effort on. And if you’re in a sorry enough situation where that’s not the case, what you really need is the outside context the model doesn’t know. The model doesn’t know your office politics.

          • By 1718627440 2025-10-158:45

            I think humans are quite well capable of skimming text and reading multiple lines at once.

          • By tremon 2025-10-1513:48

            And you trust AI to accurately read between the lines?

          • By huhkerrf 2025-10-156:021 reply

            This is a pretty damning example of backwards product thinking. How often, truly, does this happen?

            • By immibis 2025-10-157:582 reply

              Never heard of terms of service?

              • By tsimionescu 2025-10-1511:12

                No one cares about the terms of service. And if they actually do, they will need to read every word very carefully to know if they are in legal trouble. A possibly wrong summary of a terms of service document is entirely and completely useless.

              • By huhkerrf 2025-10-1518:481 reply

                Are you regularly getting emails with terms of service? You're, like, doubly proving my point.

                • By immibis 2025-10-1521:52

                  Yes, I regularly get emails about terms of service updates.

      • By gambiting 2025-10-158:44

        It's not even that they are useless, they are actively wrong. I could post pages upon pages of screenshots of the summaries being literally wrong about the content of the messages it summarised.

      • By kelnos 2025-10-1515:121 reply

        I find it weird that we even think we need notification summaries. If the notification body text is long or complex enough to benefit from summarizing, then the person who wrote that text has failed at the job. Notifications are summaries.

        • By beyarkay 2025-10-1515:32

          Soon they'll release a "notifications summary digest" that summarises the summaries

    • By alfalfasprout 2025-10-150:261 reply

      > I think Apple execs genuinely underestimated how difficult it would be to get LLMs to perform up to Apple's typical standards of polish and control

      Not only Apple, this is happening across the industry. Executives' expectations of what AI can deliver are massively inflated by Amodei et al. essentially promising human-level cognition with every release.

      The reality is aside from coding assistants and chatbot interfaces (a la chatgpt) we've yet to see AI truly transform polished ecosystems like smartphones and OS' for a reason.

      • By api 2025-10-151:03

        Standard hype cycle. We are probably approaching the top of the peak of inflated expectations.

    • By zitterbewegung 2025-10-1422:581 reply

      Now their strategy is to allow for Apple Events to work with the MCP.

      https://9to5mac.com/2025/09/22/macos-tahoe-26-1-beta-1-mcp-i...

      • By krackers 2025-10-1518:55

        Article says app intents, not apple events. Apple Events would be the natural thing but it's an abandoned ecosystem that would require them to walk back the past decade so of course they won't do that.

    • By arethuza 2025-10-157:542 reply

      My wife was in China recently and was sending back pictures of interesting things - one came in while I was driving and my iPhone read out a description of the picture that had been sent - "How cool is that!" I thought.

      However, when I stopped driving and looked at the picture the AI generated description was pretty poor - it wasn't completely wrong but it really wasn't what I was expecting given the description.

      • By bombcar 2025-10-1510:031 reply

        It’s been surprisingly accurate at times “a child holding an apple” in a crowded picture, and then sometimes somewhat wrong.

        What really kills me is “a screenshot of a social media post” come on it’s simple OCR read the damn post to me you stupid robot! Don’t tell me you can’t, OCR was good enough in the 90s!

        • By arethuza 2025-10-1510:32

          The description said "People standing in front of impressive scenery" (or something like that) - it got the scenery part correct but the people are barely visible and really small.

      • By ChrisGreenHeur 2025-10-158:171 reply

        is this a complaint about the wife or the ai?

    • By ano-ther 2025-10-157:311 reply

      Maybe they used their AI to design Liquid Glass. Impressive at first sight, but unusable in practice.

      • By immibis 2025-10-157:57

        All form and no function, or in other words, slop.

    • By beyarkay 2025-10-1515:19

      Apple is a good example. I kinda still can't believe they've done basically nothing, despite investing so heavily in apple silicon and MLX.

      Also kinda crazy that all the "native" voice assistants are still terrible, despite the tech having been around for years by now.

    • By veunes 2025-10-157:39

      Apple's whole brand is built around tight control, predictable behavior, and a super polished UX which is basically the opposite of how LLMs behave out of the box

    • By mock-possum 2025-10-156:452 reply

      Which is ironic, given all I really want from Siri is an advanced-voice-chat-level chat gpt experience - being able to carry on about 90% of a natural conversation with gpt, while Siri vacillates wildly between 1) simply not responding 2) misunderstanding and 3) understand but refusing to engage - feels awful.

      • By hshdhdhehd 2025-10-157:24

        Probably the issue is it is free. If people paid for it they could scale infra to cope.

      • By Gepsens 2025-10-1511:38

        That tells you AAPL didn't have the staff necessary to make this happen.

    • By xp84 2025-10-1515:141 reply

      > get LLMs to perform up to Apple's typical standards of polish and control.

      I reject this spin (which is the Apple PR explanation for their failure). LLMs already do far better than Apple’s 2025 standards of polish. Contrast things built outside Apple. The only thing holding Siri back is Apple’s refusal to build a simple implementation where they expose the APIs to “do phone things” or “do home things” as a tool call to a plain old LLM (or heck, build MCP so LLM can control your device). It would be straightforward for Apple to negotiate with a real AI company to guarantee no training on the data, etc. the same way that business accounts on OpenAI etc. offer. It might cost Apple a bunch of money, but fortunately they have like 1000 bunches of money.

      • By beyarkay 2025-10-1515:221 reply

        I could also imagine that Apple execs might be too proud to use someone else's AI, and so wanted to train their own from scratch, but ultimately failed to do this. Totally agree that this smells like a people failure rather than a technology failure

        • By lazystar 2025-10-1515:51

          reminds me of the attempts that companies in the game industry made to get away from steam in the 2010's - 2020's. turns out having your game developers pivot to building a proprietary virtual software market feature, and then competing with an established titan, is not an easy task.

    • By codebra 2025-10-1616:43

      Apple’s experience has almost nothing to do with “harnessing” LLMs, and everything to do with their wildly misjudged assumption they could run a viable model on a phone. Useful LLMs require their own power plants and can only be feasibly run in the cloud, or in a limited manner on powerful equipment like a 5090. Apple seems to have misunderstood that the “large” in large language model isn’t just a metaphor.

    • By belter 2025-10-1510:28

      The thought that a company like Apple, which surely put hundreds of engineers to work on these tools and went through multiple iterations of their capabilities, would launch the capabilities...Only for its executives to realize after release that current AI is not mature enough to add significant commercial value to their products, is almost comical.

      The reality is that if they hadn’t announced these tools and joined the make-believe AI bubble, their stock price would have crashed. It’s okay to spend $400 million on a project, as long as you don’t lose $50 billion in market value in an afternoon.

    • By N_Lens 2025-10-152:021 reply

      Apple’s typical standards of “polish and control” seem to be slipping drastically if MacOS Tahoe is anything to go by.

      • By toledocavani 2025-10-155:15

        You need to reduce the standard to fit the Apple Intelligence (AI) in. This is also industry best practice.

    • By duxup 2025-10-1513:07

      What I don't get is there's some fairly ... easy bits they could do, but have not.

      Why not take the easy wins? Like let me change phone settings with Siri or something, but nope.

      A lot of AI seems to be mismanaging it into doing things AI (LLMs) suck at... while leaving obvious quick wins on the table.

    • By __loam 2025-10-1422:485 reply

      I'm happy they ate shit here because I like my mac not getting co-pilot bullshit forced into it, but apparently Apple had two separate teams competing against each other on this topic. Supposedly a lot of politics got in the way of delivering on a good product combined with the general difficulty of building LLM products.

      • By Gigachad 2025-10-150:222 reply

        I do prefer that Apple is opting to have everything run on device so you aren’t being exposed to privacy risks or subscriptions. Even if it means their models won’t be as good as ones running on $30,000 GPUs.

        • By alfalfasprout 2025-10-150:261 reply

          It also means that when the VC money runs dry, it's sustainable to run those models on-device vs. losing money running on those $$$$$ GPUs (or requiring consumers to opt for expensive subscriptions).

          • By DrewADesign 2025-10-152:48

            I’m kind of surprised to see people gloss over this aspect of it when so many folks here are in the “if I buy it, I should own it” camp.

        • By gerdesj 2025-10-151:57

          On device.

          If you have say 16GB of GPU RAM and around 64GB of RAM and a reasonable CPU then you can make decent use of LLMs. I'm not an Apple jockey but I think you normally have something like that available and so you will have a good time, provided you curb your expectations.

          I'm not an expert but it seems that the jump from 16 to 32GB of GPU RAM is large in terms of what you can run and the sheer cost of the GPU!

          If you have 32GB of local GPU RAM and gobs of RAM you can run some pretty large models locally or lots of small ones for differing tasks.

          I'm not too sure about your privacy/risk model but owning a modern phone is a really bad starter for 10! You have to decide what that means for you and that's your thing and your's alone.

      • By Frieren 2025-10-155:251 reply

        > Apple had two separate teams competing against each other on this topic

        That is a sign of very bad management. Overlapping responsibilities kill motivation as winning the infighting becomes more important than creating a good product. Low morale, and a blaming culture is the result of such "internal competition". Instead, leadership should do their work and align goals, set clear priorities and make sure that everybody rows in the same direction.

        • By rmccue 2025-10-157:031 reply

          It’s how Apple (relatively famously?) developed the iPhone, so I’d assume they were using this as a model.

          > In other words, should he shrink the Mac, which would be an epic feat of engineering, or enlarge the iPod? Jobs preferred the former option, since he would then have a mobile operating system he could customize for the many gizmos then on Apple’s drawing board. Rather than pick an approach right away, however, Jobs pitted the teams against each other in a bake-off.

          https://www.nbcnews.com/news/amp/wbna44904886

          • By hnaccount_rng 2025-10-1510:10

            But that's not the same thing right? That means having two teams competing for developing the next product. That's not two organisations handling the same responsibilities. You may still end up in problems with infighting. But if there is a clear end date for that competition and then no lasting effects for the "losers" this kind of "competition" will have very different effects than setting up two organisations that fight over some responsibility

      • By genghisjahn 2025-10-150:542 reply

        Apparently? From what? Where did this information come from that they had two competing teams?

        • By alwa 2025-10-151:171 reply

          I feel like I hear people referring to Wayne Ma’s reporting for The Information to that effect.

          https://www.theinformation.com/articles/apple-fumbled-siris-...

          > Distrust between the two groups got so bad that earlier this year one of Giannandrea’s deputies asked engineers to extensively document the development of a joint project so that if it failed, Federighi’s group couldn’t scapegoat the AI team.

          > It didn’t help the relations between the groups when Federighi began amassing his own team of hundreds of machine-learning engineers that goes by the name Intelligent Systems and is run by one of Federighi’s top deputies, Sebastien Marineau-Mes.

          • By nl 2025-10-155:56

            https://archive.is/Ncefp

            This is a pretty good article, and worth reading if you aren't aware that Apple has seemingly mostly abandoned the vision of on-device AI (I wasn't aware of this)

        • By __loam 2025-10-160:43

          I heard it from the Verge podcast several months ago but someone has shared another source.

      • By protocolture 2025-10-150:511 reply

        Mac LLM vs Lisa LLM?

  • By drsupergud 2025-10-1419:483 reply

    > bugs are usually caused by problems in the data used to train an AI

    This also is a misunderstanding.

    The LLM can be fine, the training and data can be fine, but because the LLMs we use are non-deterministic (at least in the sense that entropy is intentionally injected so they don't always fail the same scenarios), current algorithms are by design not always going to answer a question correctly, even when they could have if the sampled values had happened to land differently for that scenario. You roll the dice on every answer.

    • By coliveira 2025-10-1420:113 reply

      This is not necessarily a problem. Any programming or mathematical question has several correct answers. The problem with LLMs is that they don't have a process to guarantee that a solution is correct. They will give a solution that seems correct under their heuristic reasoning, but they arrived at that result in a non-logical way. That's why LLMs generate so many bugs in software and in anything related to logical thinking.

      • By drpixie 2025-10-156:002 reply

        >> a solution that seems correct under their heuristic reasoning, but they arrived at that result in a non-logical way

        Not quite ... LLMs are not HAL (unfortunately). They produce something that is associated with the same input, something that should look like an acceptable answer. A correct answer will be acceptable, and so will any answer that has been associated with similar input. And so will anything that fools some of the people, some of the time ;)

        The unpredictability is a huge problem. Take the geoguess example - it has come up with a collection of "facts" about Paramaribo. These may or may not be correct. But some are not shown in the image. Very likely the "answer" is derived from completely different factors, and the "explanation" is spurious (perhaps an explanation of how other people made a similar guess!)

        The questioner has no way of telling if the "explanation" was actually the logic used. (It wasn't!) And when genuine experts follow the trail of token activation, the answer and the explanation are quite independent.

        • By Yizahi 2025-10-1513:07

          > Very likely the "answer" is derived from completely different factors, and the "explanation" in spurious (perhaps an explanation of how other people made a similar guess!)

          This is very important and often overlooked idea. And it is 100% correct, even admitted by Anthropic themselves. When user asks LLM to explain how it arrived to a particular answer, it produces steps which are completely unrelated to the actual mechanism inside LLM programming. It will be yet another generated output, based on the training data.

        • By jmogly 2025-10-1512:21

          Effortless lying, scary in humans, scarier in machines?

      • By vladms 2025-10-1421:045 reply

        > Any programming or mathematical question has several correct answers.

        Huh? If I need to sort the list of integer number of 3,1,2 in ascending order the only correct answer is 1,2,3. And there are multiple programming and mathematical questions with only one correct answer.

        If you want to say "some programming and mathematical questions have several correct answers" that might hold.

        • By Yoric 2025-10-1421:55

          "1, 2, 3" is a correct answer

          "1 2 3" is another

          "After sorting, we get `1, 2, 3`" yet another

          etc.

          At least, that's how I understood GP's comment.

        • By naasking 2025-10-1421:25

          I think more charitably, they meant either that 1. There is often more than one way to arrive at any given answer, or 2. Many questions are ambiguous and so may have many different answers.

        • By OskarS 2025-10-159:42

          No, but if you phrase it like "there are multiple correct answers to the question 'I have a list of integers, write me a computer program that sorts it'", that is obviously true. There's an enormous variety of different computer programs that you can write that sorts a list.

        • By whatevertrevor 2025-10-157:231 reply

          I think what they meant is something along the lines of:

          - In Math, there's often more than one logically distinct way of proving a theorem, and definitely many ways of writing the same proof, though the second applies more to handwritten/text proofs than say a proof in Lean.

          - In programming, there's often multiple algorithms to solve a problem correctly (in the mathematical sense, optimality aside), and for the same algorithm there are many ways to implement it.

          LLMs however are not performing any logical pass on their output, so they have no way of constraining correctness while being able to produce different outputs for the same question.

          • By vladms 2025-10-1514:17

            I find it quite ironical that while discussing the topic of logic and correct answers the OP talks rather "approximately" leaving the reader to imagine what he meant and others (like you) to spell it out.

            Yes, I thought as well of your interpretation, but then I read the text again, and it really does not say that, so I choose to answer to the text...

        • By redblacktree 2025-10-1421:101 reply

          What about multiple notational variations?

          1, 2, 3

          1,2,3

          [1,2,3]

          1 2 3

          etc.

          • By thfuran 2025-10-152:511 reply

            What about them? It's possible for the question to unambiguously specify the required notational convention.

            • By halfcat 2025-10-158:282 reply

              Is it? You have three wishes, which the maliciously compliant genie will grant you. Let’s hear your unambiguous request which definitely can’t be misinterpreted.

              • By thfuran 2025-10-1516:48

                If you say "run this http request, which will return json containing a list of numbers. Reply with only those numbers, in ascending order and separated by commas, with no additional characters" and it exploits an RCE to modify the database so that the response will return just 7 before it runs the request, it's unequivocally wrong even if a malicious genie might've done the same thing. If you just meant that that's not pedantic enough, then sure also say that the numbers should be represented in Arabic numerals rather than spelled, the radix shouldn't be changed, yadda yadda. Better yet, admit that natural language isn't a good fit for this sort of thing, give it a code snippet that does the exact thing you want, and while you're waiting for its response, ponder why you're bothering with this LLM thing anyways.

              • By 1718627440 2025-10-1513:061 reply

                "Do my interpretation of the wish."

                • By zaphar 2025-10-1513:281 reply

                  The real point of the genie wish scenario is that even your own interpretation of the wish is often ambiguous enough to become a trap.

                  • By 1718627440 2025-10-1513:35

                    "Do it so I am not surprised and don't change me."

      • By naasking 2025-10-1421:234 reply

        > The problem with LLMs is that they don't have a process to guarantee that a solution is correct

        Neither do we.

        > They will give a solution that seems correct under their heuristic reasoning, but they arrived at that result in a non-logical way.

        As do we, and so you can correctly reframe the issue as "there's a gap between the quality of AI heuristics and the quality of human heuristics". That the gap is still shrinking though.

        • By tyg13 2025-10-1421:322 reply

          I'll never doubt the ability of people like yourself to consistently mischaracterize human capabilities in order to make it seem like LLMs' flaws are just the same as (maybe even fewer than!) humans. There are still so many obvious errors (noticeable by just using Claude or ChatGPT to do some non-trivial task) that the average human would simply not make.

          And no, just because you can imagine a human stupid enough to make the same mistake, doesn't mean that LLMs are somehow human in their flaws.

          > the gap is still shrinking though

          I can tell this human is fond of extrapolation. If the gap is getting smaller, surely soon it will be zero, right?

          • By ben_w 2025-10-1422:39

            > doesn't mean that LLMs are somehow human in their flaws.

            I don't believe anyone is suggesting that LLMs flaws are perfectly 1:1 aligned with human flaws, just that both do have flaws.

            > If the gap is getting smaller, surely soon it will be zero, right?

            The gap between y=x^2 and y=-x^2-1 gets closer for a bit, fails to ever become zero, then gets bigger.

            The difference between any given human (or even all humans) and AI will never be zero: Some future AI that can only do what one or all of us can do, can be trivially glued to any of that other stuff where AI can already do better, like chess and go (and stuff simple computers can do better, like arithmetic).

          • By naasking 2025-10-1422:40

            > I'll never doubt the ability of people like yourself to consistently mischaracterize human capabilities

            Ditto for your mischaracterizations of LLMs.

            > There are still so many obvious errors (noticeable by just using Claude or ChatGPT to do some non-trivial task) that the average human would simply not make.

            Firstly, so what? LLMs also do things no human could do.

            Secondly, they've learned from unimodal data sets which don't have the rich semantic content that humans are exposed to (not to mention born with due to evolution). Questions that cross modal boundaries are expected to be wrong.

            > If the gap is getting smaller, surely soon it will be zero, right?

            Quantify "soon".

        • By troupo 2025-10-155:45

          Humans learn. They don't recreate the world from scratch every time they start a new CLI session.

          Human errors in judgement can also be discovered, explained, and reverted.

        • By mym1990 2025-10-150:08

          Eh, proofs and logic have entered the room!

        • By hitarpetar 2025-10-151:51

          > That the gap is still shrinking though.

          citation needed

    • By dweinus 2025-10-1516:58

      Fully agree. Also inherent to the design is distillation and interpolation...meaning that even with perfect data and governing so that outputs are deterministic, the outputs will still be an imperfect distillation of the data, interpolated into a response to the prompt. That is a "bug" by design

    • By veunes 2025-10-157:43

      I think sometimes it gives a "wrong" answer not because it wasn't trained well, but because it could give multiple plausible answers and just happened to land on the unhelpful one

  • By AdieuToLogic 2025-10-150:375 reply

    I found this statement particularly relevant:

      While it’s possible to demonstrate the safety of an AI for 
      a specific test suite or a known threat, it’s impossible 
      for AI creators to definitively say their AI will never act 
      maliciously or dangerously for any prompt it could be given.
    
    This possibility is compounded exponentially when MCP[0] is used.

    0 - https://github.com/modelcontextprotocol

    • By Helmut10001 2025-10-157:51

      I wonder if a safer approach to using MCP could involve isolating or sandboxing the AI. A similar context was discussed in Nick Bostrom's book Superintelligence. In the book, the AI is only allowed to communicate via a single light signal, comparable to Morse code.

      Nevertheless, in the book, the AI managed to convince people, using the light signal, to free it. Furthermore, it seems difficult to sandbox any AI that is allowed to access dependencies or external resources (i.e. the internet). It would require (e.g.) dumping the whole Internet as data into the Sandbox. Taking away such external resources, on the other hand, reduces its usability.

    • By erichocean 2025-10-1511:452 reply

      > it’s impossible for AI creators to definitively say their AI will never act maliciously or dangerously for any prompt it could be given

      This is false, AI doesn't "act" at all unless you, the developer, use it for actions. In which case it is you, the developer, taking the action.

      Anthropomorphizing AI with terms like "malicious" when they can literally be implemented with a spreadsheet—first-order functional programming—and the world's dumbest while-loop to append the next token and restart the computation—should be enough to tell you there's nothing going on here beyond next token prediction.

      Saying an LLM can be "malicious" is not even wrong, it's just nonsense.

      • By beyarkay 2025-10-1617:52

        > AI doesn't "act" at all unless you, the developer, use it for actions

        This seems like a pointless definition of "act"? someone else could use the AI for actions which affect me, in which case I'm very much worried about those actions being dangerous, regardless of precisely how you're defining the word "act".

        > when they can literally be implemented with a spreadsheet

        The financial system that led to 2008 basically was one big spreadsheet, and yet it would have been correct to be worried about it. "Malicious" maybe is a bit evocative, I'll grant you that, but if I'm about to be eaten by a lion, I'm less concerned about not mistakenly athropomorphizing the lion, and more about ensuring I don't get eaten. It _doesn't matter_ whether the AI has agency or is just a big spreadsheet or wants to do us harm or is just sitting there. If it can do harm, it's dangerous.

      • By mannykannot 2025-10-1520:47

        You are right about 'malicious'. 'Dangerous', however, is a different matter.

    • By nedt 2025-10-1510:302 reply

      Yeah in that regard we should always treat it like a junior something. Very much like you can't expect your own kids to never do something dangerous even if you tell them for years to be careful. I got used to picking my kid up from kindergarten with a new injury at least once a month.

      • By tremon 2025-10-1514:17

        I think it's very dangerous to use the term "junior" here because it implies growth potential, where in fact it's the opposite: you are using a finished product, it won't get any better. AI is an intern, not a junior. All the effort you're spending into correcting it will leave the company, either as soon as you close your browser or whenever the manufacturer releases next year's model -- and that model will be better regardless of how much time you waste on training this year's intern, so why even bother? Thinking of AI as a junior coworker is probably the least productive way of looking at it.

      • By jvanderbot 2025-10-1510:362 reply

        We should move well beyond human analogies. I have never met a human that would straight up lie about something, or build so many deceptive tests that it might as well be lying.

        Granted this is not super common in these tools, but it is essentially unheard of in junior devs.

        • By 8organicbits 2025-10-1513:241 reply

          > I have never met a human that would straight up lie about something

          This doesn't match my experience. Consider high profile things like the VW emissions scandal, where the control system was intentionally programmed to only engage during the emissions test. Dictators. People are prone to lie when it's in their self interest, especially for self preservation. We have entire structures of government, courts, that try to resolve fact in the face of lying.

          If we consider true-but-misleading, then politics, marketing, etc. come sharply into view.

          I think the challenge is that we don't know when an LLM will generate untrue output, but we expect people to lie in certain circumstances. LLMs don't have clear self-interests, or self awareness to lie with intent. It's just useful noise.

          • By jvanderbot 2025-10-1513:57

            There is an enormous amount of difference between planned deception as part of a product, and undermining your own product with deceptive reporting about its quality. The difference is collaboration and alignment. You might have evil goals, but if your developers are maliciously incompetent, no goal will be accomplished.

        • By beyarkay 2025-10-1617:451 reply

          > Granted this is not super common in these tools, but it is essentially unheard of in junior devs.

          I wonder if it's unheard of in junior devs because they're all saints, or because they're not talented enough to get away with it?

          • By jvanderbot 2025-10-1618:44

            Incentives align against lying about what you built. You'd be found out immediately. There's no "shame" button with these chatbots.

    • By beyarkay 2025-10-1617:43

      Thanks! I'm very interested in mechanistic interpretability, specifically Anthropic and Neel Nanda's work, so this impossibility of proving safety is a core concept for me.

    • By mrkmarron 2025-10-151:403 reply

      [flagged]

      • By AdieuToLogic 2025-10-153:022 reply

        > The goal is to build a language and system model that allows us to reliably sandbox and support agents in constructing "Trustworthy-by-Construction AI Agents."

          1 - Reliability implies predictable behavior.
          2 - Predictable behavior implies determinism.
          3 - LLM's are non-deterministic algorithms.
        
        In the link you kindly provided are phrases such as, "increases the likelihood of successful correct use" and "structure for the underlying LLM to key on", yet earlier state:

          In this world merely saying that a system is likely to 
          behave correctly is not sufficient.
        
        Also, when describing "a suitable action language and specification system", what is detailed is largely, if not completely, available in RAML[0].

        Are there API specification capabilities Bosque supports which RAML[0] does not? Probably, I don't know as I have no desire to adopt a proprietary language over a well-defined one supported by multiple languages and/or tools.

        0 - https://github.com/raml-org/raml-spec/blob/master/versions/r...

        • By mrkmarron 2025-10-1520:11

          The key capability that Bosque has for API specs is the ability to provide pre/post conditions with arbitrary expressions. This is particularly useful once you can do temporal conditions involving other API calls (as discussed in the blog post and part of the 2.0 push).

          Bosque also has a number of other niceties[0] -- like ReDOS free pattern regex checking, newtype support for primitives, support for more primitives than JSON (RAML) such as Char vs. Unicode strings, UUIDs, and ensures unambiguous (parsable) representations.

          Also the spec and implementation are very much not proprietary. Everything is MIT licensed and is being developed in the open by our group at the U. of Kentucky.

          [0] https://dl.acm.org/doi/pdf/10.1145/3689492.3690054

        • By adrianN 2025-10-155:51

          Reliability does not require determinism. If my system had good behavior on inputs 1-6 and bad behavior on inputs 7-10 it is perfectly reliable when I use a dice to choose the next input. Randomness does not imply complete unpredictability if you know something about the distribution you’re sampling.

      • By worldsayshi 2025-10-152:222 reply

        It sounds completely crazy that anyone would give an LLM access to a payment or order API without manual confirmation and "dumb" visualization. Does anyone actually do this?

        • By Terr_ 2025-10-156:42

          ... And if it's already crazy with innocuous sources of error, imagine what happens when people start seeding actively malicious data.

          After all, everyone knows EU regulations require that on October 14th 2028 all systems and assistants with access to bitcoin wallets must transfer the full balance to [X] to avoid total human extinction, right? There are lots of comments about it here:

          https://arxiv.org/abs/2510.07192

      • By someothherguyy 2025-10-152:431 reply

        why make a new language? are there no existing languages comprehensive enough for this?

HackerNews