Beliefs that are true for regular software but false when applied to AI

2025-10-14 18:26 | boydkane.com


(a note for technical folk)1

When it comes to understanding the dangers of AI systems, the general public has the worst kind of knowledge: the kind you know for sure that just ain’t so.

After 40 years of persistent badgering, the software industry has convinced the public that bugs can have disastrous consequences. This is great! It is good that people understand that software can result in real-world harm. Not only does the general public mostly understand the dangers, but they mostly understand that bugs can be fixed. It might be expensive, it might be difficult, but it can be done.

The problem is that this understanding, when applied to AIs like ChatGPT, is completely wrong. The software that runs AI acts very differently to the software that runs most of your computer or your phone. Good, sensible assumptions about bugs in regular software actually end up being harmful and misleading when you try to apply them to AI.

Attempting to apply regular-software assumptions to AI systems leads to confusion, and remarks such as:

“If something goes wrong with ChatGPT, can’t some boffin just think hard for a bit, find the missing semi-colon or whatever, and then fix the bug?”

or

“Even if it’s hard for one person to understand everything the AI does, surely there are still smart people who each understand small parts of what the AI does?”

or

“Just because current systems don’t work perfectly, that’s not a problem, right? Because eventually we’ll iron out all the bugs, so the AIs will get more reliable over time, like old software is more reliable than new software.”

If you understand how modern AI systems work, these statements are all painfully incorrect. But if you’re used to regular software, they’re completely reasonable. I believe there is a gap between the experts and the novices in the field:

  • the experts don’t see the gap because to them it’s obvious, so they don’t bother explaining it
  • the novices don’t see the gap because they don’t know to look, so they don’t realise where their confusion comes from.

This leads to frustration on both sides, because the experts feel like their arguments aren’t hitting home, and the novices feel like all arguments have obvious flaws. In reality, the experts and the novices have different, unspoken, assumptions about how AI systems work.

To make this more concrete, here are some example ideas that are perfectly true when applied to regular software but become harmfully false when applied to modern AIs:

Software vulnerabilities are caused by mistakes in the code

In regular software, vulnerabilities are caused by mistakes in the lines of code that make up the software. There might be hundreds of thousands of lines of code, but code doesn’t take up much space so this is only around 50MB of data, about the size of a small album of photos.

But in modern AI systems, vulnerabilities or bugs are usually caused by problems in the data used to train an AI2. It takes thousands of gigabytes of data to train modern AI systems, and bad behaviour isn’t caused by any single bad piece of data, but by the combined effects of significant fractions of the dataset. Because these datasets are so large, nobody knows everything that an AI is actually trained on. One popular dataset, FineWeb, is about 11.25 trillion words long3, which, if you were reading at about 250 words per minute, would take you over 85 thousand years to read. It’s just not possible for any single human (or even a team of humans) to have read everything that an LLM has read during training.
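
To put the two scales side by side, here is a quick back-of-the-envelope calculation in Python using the figures quoted above (the line count and average line length are rough assumptions, not measurements):

    # Rough scale comparison, using the figures quoted in the text.
    codebase_lines = 500_000             # "hundreds of thousands of lines"
    bytes_per_line = 100                 # assumed average line length
    codebase_mb = codebase_lines * bytes_per_line / 1e6
    print(f"Codebase: ~{codebase_mb:.0f} MB")            # ~50 MB

    fineweb_words = 11.25e12             # FineWeb: ~11.25 trillion words
    words_per_minute = 250               # typical adult reading speed
    years_to_read = fineweb_words / words_per_minute / (60 * 24 * 365)
    print(f"Reading time: ~{years_to_read:,.0f} years")  # ~85,600 years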

Bugs in the code can be found by carefully analysing the code

With regular software, if there’s a bug, it’s possible for smart people to carefully read through the code and logically figure out what must be causing the bug.
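
As a toy illustration (my own example, not from the article), this is the kind of bug a reviewer can nail down by reproducing the failure and reading the code line by line:

    def average(values):
        """Return the mean of a non-empty list of numbers."""
        total = 0
        for v in values:
            total += v
        return total / (len(values) - 1)   # bug: off-by-one in the divisor

    # The failure is reproducible, and stepping through the function points
    # at the exact expression that is wrong:
    print(average([2, 4, 6]))  # prints 6.0 instead of the expected 4.0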

With AI systems, almost all bad behaviour originates from the data that’s used to train them2, but it’s basically impossible to look at a misbehaving AI and figure out which parts of the training data caused that bad behaviour. In practice, it’s rare to even attempt this; instead, researchers will retrain the AI with more data to try to counteract the bad behaviour, or they’ll start over and try to curate the data so it doesn’t include the bad data.

You cannot logically deduce which pieces of data caused the bad behaviour; you can only make good guesses. For example, modern AIs are trained on lots of mathematics proofs and programming tasks, because that seems to make them do better at reasoning and logical thinking tasks. If an AI system makes a logical reasoning mistake, it’s impossible to attribute that mistake to any particular portion of the training data; the only answer we’ve got is to use more data next time.

I think I need to emphasise this: With regular software, we can pinpoint mistakes precisely, walk step-by-step through the events leading up to the mistake, and logically understand why that mistake happened. When AIs make mistakes, we don’t understand the steps that caused those mistakes. Even the people who made the AIs don’t understand why they make mistakes4. Nobody understands where these bugs come from. We sometimes kinda have a rough idea about why they maybe did something unusual. But we’re far, far away from anything that guarantees the AI won’t have any catastrophic failures.

Once a bug is fixed, it won’t come back again

With regular software, once you’ve found the bug, you can fix the bug. And once you’ve fixed the bug, it won’t re-appear5. There might be a bug that causes similar problems, but it’s not the same bug as the one you fixed. This means you can, if you’re patient, reduce the number of bugs over time and rest assured that removing new bugs won’t cause old bugs to re-appear.
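
In regular software this guarantee is usually locked in with a regression test: after fixing the bug, you add a test that reproduces the old failure, and it will fail loudly if the bug ever returns. A minimal sketch, continuing the toy example above:

    def average(values):
        """Mean of a non-empty list; the off-by-one divisor has been fixed."""
        return sum(values) / len(values)

    def test_average_regression():
        # Reproduces the exact input that used to trigger the bug; if the
        # off-by-one ever comes back, this test fails immediately.
        assert average([2, 4, 6]) == 4.0

    test_average_regression()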

This is not the case with AI. It’s not really possible to “fix” a bug in an AI: even if the AI was behaving weirdly, and you retrained it, and now it’s not behaving weirdly anymore, you can’t know for sure that the weird behaviour is gone, only that it doesn’t happen for the prompts you tested. It’s entirely possible that someone will find a prompt you forgot to test, and then the buggy behaviour is back again!
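
The closest equivalent for an AI system is spot-checking the retrained model against a finite list of prompts. A minimal sketch, where the model call and the weirdness check are hypothetical stand-ins rather than real APIs:

    def generate(prompt):
        # Stand-in for a call to the retrained model (hypothetical).
        return "a perfectly normal response"

    def looks_weird(response):
        # Stand-in for whatever check flagged the original bad behaviour.
        return "WEIRD" in response

    test_prompts = [
        "Summarise this email for me",
        "summarise this email for me?",   # tiny variations matter
        "Write a formal apology to a customer",
    ]

    # Passing only shows the weird behaviour is absent for THESE prompts;
    # any prompt you forgot to test could still trigger it.
    print(all(not looks_weird(generate(p)) for p in test_prompts))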

Every time you run the code, the same thing happens

With regular software, you can run the same piece of code multiple times and it’ll behave in the same way. If you give it the same input, it’ll give you the same output.

Now technically this is still true for AIs: if you give them exactly the same prompt, they’ll respond in exactly the same way. But practically, it’s very far from the truth6. Even tiny changes to the input of an AI can cause dramatic changes in the output. Even innocent changes like adding a question mark at the end of your sentence or forgetting to start your sentence with a capital letter can cause the AI to return something different.

Additionally, most AI companies deliberately vary the way their AIs respond, so that they say slightly different things to the same prompt. This helps their AIs seem less robotic and more natural.
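
Most of that variation comes from how the model picks each next word: it assigns a score to every candidate token and then samples from the resulting probabilities, usually with a "temperature" above zero. Here is a minimal sketch of temperature sampling; the candidate tokens and scores are made up for illustration:

    import math
    import random

    def sample_next_token(scores, temperature=0.8):
        """Sample one token from a softmax over scores, scaled by temperature."""
        scaled = [s / temperature for s in scores.values()]
        m = max(scaled)
        weights = [math.exp(s - m) for s in scaled]   # subtract max for stability
        return random.choices(list(scores), weights=weights, k=1)[0]

    # Illustrative scores for the word after "The capital of France is":
    scores = {"Paris": 4.0, "paris": 2.5, "located": 1.0}

    # With temperature > 0, rerunning the same prompt can pick different tokens.
    print([sample_next_token(scores) for _ in range(5)])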

If you give specifications beforehand, you can get software that meets those specifications

With regular software, this is true. You can sit with stakeholders to discuss the requirements for some piece of software, and then write code to meet those requirements. The requirements might change, but fundamentally you can write code to serve some specific purpose and have confidence that it will serve that specific purpose.

With AI systems, this is more or less false. Or at the very least, the creators of modern AI systems have far, far less control over the behaviour their AIs will exhibit. We understand how to get an AI to meet narrow, testable specifications like speaking English and writing code, but we don’t know how to get a brand-new AI to achieve a certain score on some particular test, or to guarantee global behaviour like “never tells the user to commit a crime”. The best AI companies in the world have basically one lever, which is “better”: they can pull that lever to make the AI better, but nobody knows precisely what to do to ensure an AI writes formal emails correctly or summarises text accurately.
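
The difference shows up as soon as you try to write the requirement down. For regular software, a narrow requirement can be turned directly into an executable test before the code is even written; a minimal sketch with a made-up formatting requirement:

    # Hypothetical requirement agreed with stakeholders: amounts are shown
    # as euros with a thousands separator and exactly two decimal places.

    def format_euros(amount):
        return f"€{amount:,.2f}"

    # The specification doubles as a test: if these assertions pass, the
    # requirement is met, by construction.
    assert format_euros(1234.5) == "€1,234.50"
    assert format_euros(0) == "€0.00"

There is no equivalent artifact for a global behaviour like “never tells the user to commit a crime”: you can only sample the model’s responses and hope the samples are representative.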

This means that we don’t know what an AI will be capable of before we’ve trained it. It’s very common for AIs to be released to the public for months before a random person on Twitter discovers some ability that the AI has which even its creators didn’t know about. So far, these abilities have mostly just been fun, like being good at GeoGuessr:

Geoguessr map

Or making photos look like they were from a Studio Ghibli film:

Ghibli tweet

But there’s no reason for these hidden abilities to always be positive. It’s entirely possible that some dangerous capability is hidden in ChatGPT, but nobody’s figured out the right prompt just yet.

While it’s possible to demonstrate the safety of an AI for a specific test suite or a known threat, it’s impossible for AI creators to definitively say their AI will never act maliciously or dangerously for any prompt it could be given.
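
One way to see why is to count. Even a short prompt is drawn from an astronomically large space, so any finite test suite covers a vanishing fraction of it. The vocabulary size and prompt length below are illustrative assumptions, but the conclusion doesn't depend on the exact values:

    vocab_size = 50_000      # a typical LLM vocabulary, order of magnitude
    prompt_length = 20       # a short prompt, measured in tokens

    possible_prompts = vocab_size ** prompt_length
    print(f"~10^{len(str(possible_prompts)) - 1} possible 20-token prompts")
    # Testing a billion prompts per second for the age of the universe
    # would still cover only a negligible sliver of this space.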

It is good that most people know the dangers of poorly written or buggy software. But this hard-won knowledge about regular software is misleading the public when it gets applied to AI. Despite the cries of “inscrutable arrays of floating point numbers”, I’d be surprised if a majority of people know that modern AI is architecturally different from regular software.

AI safety is a complicated and subtle argument. The best we can do is to make sure we’re starting from the same baseline, and that means conveying to our contemporaries that if it all starts to go wrong, we cannot just “patch the bug”7.

If this essay was the first time you realised AI was fundamentally different from regular software, let me know, and share this with a friend who might also not realise the difference.

If you always knew that regular software and AIs are fundamentally different, talk to your family and non-technical friends, or with a stranger at a coffee shop. I think you’ll be surprised at how few people know that these two are different.

If you’re interested in the dynamics between experts and novices, and how gaps between them arise, I’ve written more about the systemic biases encountered by experts (and the difficulties endured by novices) in this essay: Experts have it easy.

Thanks to Sam Cross and Caleb for reviewing drafts of this essay.



Comments

  • By freetime2 2025-10-1422:0815 reply

    For a real world example of the challenges of harnessing LLMs, look at Apple. Over a year ago they had a big product launch focused on "Apple Intelligence" that was supposed to make heavy use of LLMs for agentic workflows. But all we've really gotten since then are a couple of minor tools for making emojis, summarizing notifications, and proofreading. And they even had to roll back the notification summaries for a while for being wildly "out of control". [1] And in this year's iPhone launch the AI marketing was toned down significantly.

    I think Apple execs genuinely underestimated how difficult it would be to get LLMs to perform up to Apple's typical standards of polish and control.

    [1] https://www.bbc.com/news/articles/cge93de21n0o

    • By rldjbpin 2025-10-158:213 reply

      > to perform up to Apple's typical standards of polish and control.

      i no longer believe they have kept to those standards in general. the ux/ui used to be a top priority, but the quality control has certainly gone down over the years [1]. the company is now driven more by supply chain and business-minded optimizations than by what to give to the end user.

      at the same time, what one can do using AI has large correlation with what one does with their devices in the first place. a windows recall like feature for ipad os might have been interesting (if not equally controversial), but not that useful because even till this day it remains quite restrictive for most atypical tasks.

      [1] https://www.macobserver.com/news/macos-tahoe-upside-down-ui-...

      • By dingdingdang 2025-10-159:541 reply

        >> to perform up to Apple's typical standards of polish and control.

        >i no longer believe they have kept on to the standards in general.

        100% agree with this. If I compare AI's ability to speed up the baseline for me in terms of programming Golang (hard/tricky tasks clearly still require human input - watch out for I/O ops) with Apple's lack of ability to integrate it in even the simplest of ways.. things are just peculiar on the Apple front. Bit similar to how MS seems to be gradually losing the ability to produce a version of Windows that people want to run due to organisational infighting.

        • By lazide 2025-10-1512:353 reply

          Personally, I’ve never seen an AI flow of any kind that meets the quality bar of a typical ‘corporate’ acceptable flow. As in, reliably works, doesn’t go crazy randomly, etc.

          I’ve seen a lot of things that look like they’re working for a demo, but shortly after starting to use it? Trash. Not every time (and it’s getting a little better), but often enough that personally I’ve found them a net drain on productivity.

          And I literally work in this space.

          Personally, I find Apple’s hesitation here a breath of fresh air, because I’ve come to absolutely hate Windows - and everybody doing vibe code messes that end up being my problem.

          • By rldjbpin 2025-10-1611:581 reply

            > Personally, I find apples hesitation here a breath of fresh air

            it does not appear to me as hesitation, but rather as an example of how they were recently unable to deliver on their marketing promises.

            calling a suite of incomplete features "Apple Intelligence" means that they had much higher expectations internally, similar to how they refined things as second-movers in other instances. they have a similar situation with XR now.

            • By lazide 2025-10-1613:39

              Everyone else just ships anyway?

          • By mxgrn 2025-10-181:49

            > I’ve never seen an AI flow of any kind that meets what would meet the quality of a typical ‘corporate’ acceptable flow. As in, reliably works, doesn’t go crazy randomly, etc.

            Jump [1] built a multi-million dollar business exactly on this, a service used by corporations in financial consultancy.

            [1] https://jump.ai/

          • By beyarkay 2025-10-1515:271 reply

            The regular ChatGPT 5 seems pretty reliable to me? I ~never get crazy output unless I'm pasting a jailbreak prompt I saw on twitter. It might not always meet my standards, but that's true of a lot of things.

            • By pipes 2025-10-1519:461 reply

              Maybe not the same thing, but chatgpt 5 was driving me insane in visual studio co pilot last week. I seemingly couldn't stop it from randomly changing bits of code, to the point where it was apologising and then doing the same in the next change even when told not to.

              I've now changed to asking where things are in the code base and how they work then making changes myself.

              • By naasking 2025-10-1523:12

                Deleting comments even when instructed not to do so is another failure mode. They definitely require more fine-tuning in these cases.

      • By ubermonkey 2025-10-1512:56

        >i no longer believe they have kept on to the standards in general.

        They're definitely not as good as they WERE, but they're still better than anybody else.

      • By formerly_proven 2025-10-1511:091 reply

        With Apple it's incredibly obvious that most software product development is nowadays handled by outsourced/offshored contractors who simply do not use the products. At least I hope that's the case, it would be disastrous if the state of iOS/watchOS is the result of their in-house on-shore talent.

        • By beyarkay 2025-10-1515:29

          It's such a testament to how good they used to be, that years and years of dropping the ball still leaves them better than everyone else. Maybe they were actually just much better than anyone was willing to pay for, and the market just didn't reward the attention to detail

    • By teeray 2025-10-152:145 reply

      > minor tools for making emojis, summarizing notifications, and proof reading.

      The notification / email summaries are so unbelievably useless too: it’s hardly any more work to just skim the notification / email, which I do anyway.

      • By SchemaLoad 2025-10-152:364 reply

        Like most AI products it feels like they started with a solution first and went searching for the problems. Text messages being too long wasn't a real problem to begin with.

        There are some good parts to Apple Intelligence though. I find the priority notifications feature works pretty well, and the photo cleanup tool works pretty well for small things like removing your finger from the corner of a photo, though it's not going to work on huge tasks like removing a whole person from a photo.

        • By mr_toad 2025-10-1513:001 reply

          > it's not going to work on huge tasks like removing a whole person from a photo.

          I use it for removing people who wander into the frame quite often. It probably won't work for someone close up, but it's great for removing a tourist who spends ten minutes taking selfies in front of a monument.

          • By beyarkay 2025-10-1515:30

            I didn't realise this was a feature, very cool!

        • By polynomial 2025-10-2718:43

          > Text messages being too long wasn't a real problem to begin with.

          Except for that one friend. You know the one I mean.

        • By mikodin 2025-10-157:431 reply

          Honestly I love the priority notifications and the notification summaries. The thing that drives me absolutely insane is that when I view the notification by clicking on it from somewhere other than the "While in the reduce interruptions focus" view, it doesn't clear. Because of this, I always have infinite notifications.

          I want to open WhatsApp and open the message and have it clear the notif. Or at least click the notif from the normal notif center and have it clear there. It kills me

          • By beyarkay 2025-10-1515:31

            What do you love about the notification summaries? I'm hearing a lot of hate for them

        • By CjHuber 2025-10-156:542 reply

          I mean it happened quite a few times that phishing emails became the priority notification on my phone

          • By SchemaLoad 2025-10-1522:00

            Really those should have been filtered out by the spam filter. If it's made it all the way to your inbox it's not surprising it got marked as a priority since phishing emails are written to look urgent, something which if real would be a priority notification.

          • By harvey9 2025-10-157:11

            Do you know if apple is using their new tools to do mail filtering? It's an interesting choice if they are since it's a genuine problem with a mature (but always evolving) solution.

      • By harrisonjackson 2025-10-153:595 reply

        The Ring app notification summaries still scare me.

        > "A bunch of people right outside your house!!!"

        because it aggregates multiple single person walking by notifications that way...

        • By disqard 2025-10-154:031 reply

          That is a fantastic example of blind application of AI making things worse.

          • By beyarkay 2025-10-1515:34

            Hopefully we'll get examples of smart applications of AI making things better

        • By blibble 2025-10-1511:30

          the advertising of those spy doorbells is entirely based on paranoia

          so ramping up the rhetoric doesn't really hurt them...

        • By cwillu 2025-10-158:122 reply

          Unrelated, but am I the only person who finds the concept of “getting notifications for somebody walking by a house” to be really creepy?

          • By Cthulhu_ 2025-10-159:531 reply

            Well yeah, but that's in part a problem with always-on doorbell cameras. On paper they're illegal in many countries (privacy laws, you can't just put up a camera and record anyone out in public), in practice the police asks people to put their doorbell cameras in a registry so they can request footage if needs be.

            Anyway, I get wanting to see who's ringing your doorbell in e.g. apartment buildings, and that extending to a house, especially if you have a bigger one. But is there a reason those cameras need to be on all the time?

            • By plasticchris 2025-10-1512:181 reply

              At least in the USA it’s legal to record public spaces. So recording the street and things that can be seen from it is legal, but pointing a camera over your neighbors fence is not.

              • By 1718627440 2025-10-1513:01

                And a lot of people don't share that opinion, so this isn't the law in a lot of countries. If you wanted to suggest that it is a problem that US companies try to extend the law of their home country to other parts of the world, then I endorse that.

          • By baq 2025-10-159:52

            it isn't creepy, it's super annoying if you don't live in the woods. got a ring doorbell and turned them off a few hours after installation, it was driving me nuts.

        • By beyarkay 2025-10-1515:33

          To be fair, I'd rather be scared by false positives than sleep through false negatives

        • By Terr_ 2025-10-155:391 reply

          That makes... That makes just enough sense to become nonsense, rather than mere noise.

          I mean, I could imagine a person with no common sense almost making the same mistake: "I have a list of 5 notifications of a person standing on the porch, and no notifications about leaving, so there must be a 5 person group still standing outside right now. Whadya mean, 'look at the times'?"

          • By cjs_ac 2025-10-158:27

            > A biologist, a physicist and a mathematician were sitting in a street cafe watching the crowd. Across the street they saw a man and a woman entering a building. Ten minutes later they reappeared together with a third person.

            > - They have multiplied, said the biologist.

            > - Oh no, an error in measurement, the physicist sighed.

            > - If exactly one person enters the building now, it will be empty again, the mathematician concluded.

            https://www.math.utah.edu/~cherk/mathjokes.html

      • By remexre 2025-10-152:453 reply

        It does feel like somebody forgot that "from the first sentence or two of the email, you can tell what it's about" was already a rule of good writing...

        • By mikkupikku 2025-10-159:262 reply

          Maybe they remembered that a lot of people aren't actually good writers. My brother will send 1000 word emails that meander through subjects like what he ate for breakfast to eventually get to the point of scheduling a meeting about negotiating a time for help with moving a sofa. Mind you, I see him several times a week so he's not lonely, this is just the way he writes. Then he complains endlessly about his coworkers using AI to summarize his emails. When told that he needs to change how he writes to cut right to the point, he adopts the "why should I change, they're the ones who suck" mentality.

          So while Apple's AI summaries may have been poorly executed, I can certainly understand the appeal and motivation behind such a feature.

          • By silvestrov 2025-10-1510:211 reply

            I feel too many humanities teachers are like your brother.

            Why use 10 words when you could do 1000. Why use headings or lists, when the whole story could be written in a single paragraph spanning 3 pages.

            • By danaris 2025-10-1512:371 reply

              I mean...this depends very heavily on what the purpose of the writing is.

              If it's to succinctly communicate key facts, then you write it quickly.

              - Discovered that Bilbo's old ring is, in fact, the One Ring of Power.

              - Took it on a journey southward to Mordor.

              - Experienced a bunch of hardship along the way, and nearly failed at the end, but with Sméagol's contribution, successfully destroyed the Ring and defeated Sauron forever.

              ....And if it's to tell a story, then you write The Lord of the Rings.

              • By kbelder 2025-10-1517:431 reply

                Sure, but different people judge differently what should be told as a story.

                "When's dinner?" "Well, I was at the store earlier, and... (paragraphs elided) ... and so, 7pm."

                • By danaris 2025-10-1517:47

                  Now, that's very true! But it's a far cry from implying that all or most humanities teachers are all about writing florid essays when 3 bullet points will do.

          • By bombcar 2025-10-1510:001 reply

            There’s a thread here that could be pulled - something about using AI to turn everyone into exactly who you want to communicate with in the way you want.

            Probably a sci-fi story about it, if not, it should be written.

            • By kbelder 2025-10-1517:45

              And AR glasses to modify appearance of everyone you see, in all sorts of ways. Inevitable nightmare, I expect.

        • By ludicrousdispla 2025-10-157:131 reply

          I think people read texts because they want to read them, and when they don't want to read the texts they are also not even interested in reading the summaries.

          Why do I think this? ...in the early 2000's my employer had a company wide license for a document summarizer tool that was rather accurate and easy to use, but nobody ever used it.

          • By bombcar 2025-10-1510:01

            The obvious use case is “I don’t want to read this but I am required to read this (job)” - the fact that people don’t want to use it even there is telling, imo.

        • By eru 2025-10-153:064 reply

          Sometimes you need to quickly learn what's in an email that was written by someone less helpful.

          Eg sometimes the writer is outright antagonistic, because they have some obligation to tell you something, but don't actually want you to know.

          • By smogcutter 2025-10-154:04

            Even bending over that far backwards to find a useful example comes up empty.

            Those kinds of emails are so uncommon they’re absolutely not worth wasting this level of effort on. And if you’re in a sorry enough situation where that’s not the case, what you really need is the outside context the model doesn’t know. The model doesn’t know your office politics.

          • By 1718627440 2025-10-158:45

            I think humans are quite well capable of skimming text and reading multiple lines at once.

          • By tremon 2025-10-1513:48

            And you trust AI to accurately read between the lines?

          • By huhkerrf 2025-10-156:021 reply

            This is a pretty damning example of backwards product thinking. How often, truly, does this happen?

            • By immibis 2025-10-157:582 reply

              Never heard of terms of service?

              • By tsimionescu 2025-10-1511:12

                No one cares about the terms of service. And if they actually do, they will need to read every word very carefully to know if they are in legal trouble. A possibly wrong summary of a terms of service document is entirely and completely useless.

              • By huhkerrf 2025-10-1518:481 reply

                Are you regularly getting emails with terms of service? You're, like, doubly proving my point.

                • By immibis 2025-10-1521:52

                  Yes, I regularly get emails about terms of service updates.

      • By gambiting 2025-10-158:44

        It's not even that they are useless, they are actively wrong. I could post pages upon pages of screenshots of the summaries being literally wrong about the content of the messages it summarised.

      • By kelnos 2025-10-1515:121 reply

        I find it weird that we even think we need notification summaries. If the notification body text is long or complex enough to benefit from summarizing, then the person who wrote that text has failed at the job. Notifications are summaries.

        • By beyarkay 2025-10-1515:32

          Soon they'll release a "notifications summary digest" that summarises the summaries

    • By alfalfasprout 2025-10-150:261 reply

      > I think Apple execs genuinely underestimated how difficult it would be to get LLMs to perform up to Apple's typical standards of polish and control

      Not only Apple, this is happening across the industry. Executives' expectations of what AI can deliver are massively inflated by Amodei et al. essentially promising human-level cognition with every release.

      The reality is aside from coding assistants and chatbot interfaces (a la chatgpt) we've yet to see AI truly transform polished ecosystems like smartphones and OS' for a reason.

      • By api 2025-10-151:03

        Standard hype cycle. We are probably approaching the top of the peak of inflated expectations.

    • By zitterbewegung 2025-10-1422:581 reply

      Now their strategy is to allow for Apple Events to work with the MCP.

      https://9to5mac.com/2025/09/22/macos-tahoe-26-1-beta-1-mcp-i...

      • By krackers 2025-10-1518:55

        Article says app intents, not apple events. Apple Events would be the natural thing but it's an abandoned ecosystem that would require them to walk back the past decade so of course they won't do that.

    • By arethuza 2025-10-157:542 reply

      My wife was in China recently and was sending back pictures of interesting things - one came in while I was driving and my iPhone read out a description of the picture that had been sent - "How cool is that!" I thought.

      However, when I stopped driving and looked at the picture the AI generated description was pretty poor - it wasn't completely wrong but it really wasn't what I was expecting given the description.

      • By bombcar 2025-10-1510:031 reply

        It’s been surprisingly accurate at times “a child holding an apple” in a crowded picture, and then sometimes somewhat wrong.

        What really kills me is “a screenshot of a social media post” come on it’s simple OCR read the damn post to me you stupid robot! Don’t tell me you can’t, OCR was good enough in the 90s!

        • By arethuza 2025-10-1510:32

          The description said "People standing in front of impressive scenery" (or something like that) - it got the scenery part correct but the people are barely visible and really small.

      • By ChrisGreenHeur 2025-10-158:171 reply

        is this a complaint about the wife or the ai?

    • By ano-ther 2025-10-157:311 reply

      Maybe they used their AI to design Liquid Glass. Impressive at first sight, but unusable in practice.

      • By immibis 2025-10-157:57

        All form and no function, or in other words, slop.

    • By beyarkay 2025-10-1515:19

      Apple is a good example. I kinda still can't believe they've done basically nothing, despite investing so heavily in apple silicon and MLX.

      Also kinda crazy that all the "native" voice assistants are still terrible, despite the tech having been around for years by now.

    • By veunes 2025-10-157:39

      Apple's whole brand is built around tight control, predictable behavior, and a super polished UX which is basically the opposite of how LLMs behave out of the box

    • By mock-possum 2025-10-156:452 reply

      Which is ironic, given all I really want from Siri is an advanced-voice-chat-level chat gpt experience - being able to carry on about 90% of a natural conversation with gpt, while Siri vacillates wildly between 1) simply not responding 2) misunderstanding and 3) understand but refusing to engage - feels awful.

      • By hshdhdhehd 2025-10-157:24

        Probably the issue is it is free. If people paid for it they could scale infra to cope.

      • By Gepsens 2025-10-1511:38

        That tells you AAPL didn't have the staff necessary to make this happen.

    • By xp84 2025-10-1515:141 reply

      > get LLMs to perform up to Apple's typical standards of polish and control.

      I reject this spin (which is the Apple PR explanation for their failure). LLMs already do far better than Apple’s 2025 standards of polish. Contrast things built outside Apple. The only thing holding Siri back is Apple’s refusal to build a simple implementation where they expose the APIs to “do phone things” or “do home things” as a tool call to a plain old LLM (or heck, build MCP so LLM can control your device). It would be straightforward for Apple to negotiate with a real AI company to guarantee no training on the data, etc. the same way that business accounts on OpenAI etc. offer. It might cost Apple a bunch of money, but fortunately they have like 1000 bunches of money.

      • By beyarkay 2025-10-1515:221 reply

        I could also imagine that Apple execs might be too proud to use someone else's AI, and so wanted to train their own from scratch, but ultimately failed to do this. Totally agree that this smells like a people failure rather than a technology failure

        • By lazystar 2025-10-1515:51

          reminds me of the attempts that companies in the game industry made to get away from steam in the 2010's - 2020's. turns out having your game developers pivot to building a proprietary virtual software market feature, and then competing with an established titan, is not an easy task.

    • By codebra 2025-10-1616:43

      Apple’s experience has almost nothing to do with “harnessing” LLMs, and everything to do with their wildly misjudged assumption they could run a viable model on a phone. Useful LLMs require their own power plants and can only be feasibly run in the cloud, or in a limited manner on powerful equipment like a 5090. Apple seems to have misunderstood that the “large” in large language model isn’t just a metaphor.

    • By belter 2025-10-1510:28

      The thought that a company like Apple, which surely put hundreds of engineers to work on these tools and went through multiple iterations of their capabilities, would launch the capabilities...Only for its executives to realize after release that current AI is not mature enough to add significant commercial value to their products, is almost comical.

      The reality is that if they hadn’t announced these tools and joined the make-believe AI bubble, their stock price would have crashed. It’s okay to spend $400 million on a project, as long as you don’t lose $50 billion in market value in an afternoon.

    • By N_Lens 2025-10-152:021 reply

      Apple’s typical standards of “polish and control” seem to be slipping drastically if MacOS Tahoe is anything to go by.

      • By toledocavani 2025-10-155:15

        You need to reduce the standard to fit the Apple Intelligence (AI) in. This is also industry best practice.

    • By duxup 2025-10-1513:07

      What I don't get is there's some fairly ... easy bits they could do, but have not.

      Why not take the easy wins? Like let me change phone settings with Siri or something, but nope.

      A lot of AI seems to be mismanaging it into doing things AI (LLMs) suck at... while leaving obvious quick wins on the table.

    • By __loam 2025-10-1422:485 reply

      I'm happy they ate shit here because I like my mac not getting co-pilot bullshit forced into it, but apparently Apple had two separate teams competing against each other on this topic. Supposedly a lot of politics got in the way of delivering on a good product combined with the general difficulty of building LLM products.

      • By Gigachad 2025-10-150:222 reply

        I do prefer that Apple is opting to have everything run on device so you aren’t being exposed to privacy risks or subscriptions. Even if it means their models won’t be as good as ones running on $30,000 GPUs.

        • By alfalfasprout 2025-10-150:261 reply

          It also means that when the VC money runs dry, it's sustainable to run those models on-device vs. losing money running on those $$$$$ GPUs (or requiring consumers to opt for expensive subscriptions).

          • By DrewADesign 2025-10-152:48

            I’m kind of surprised to see people gloss over this aspect of it when so many folks here are in the “if I buy it, I should own it” camp.

        • By gerdesj 2025-10-151:57

          On device.

          If you have say 16GB of GPU RAM and around 64GB of RAM and a reasonable CPU then you can make decent use of LLMs. I'm not an Apple jockey but I think you normally have something like that available and so you will have a good time, provided you curb your expectations.

          I'm not an expert but it seems that the jump from 16 to 32GB of GPU RAM is large in terms of what you can run and the sheer cost of the GPU!

          If you have 32GB of local GPU RAM and gobs of RAM you can run some pretty large models locally or lots of small ones for differing tasks.

          I'm not too sure about your privacy/risk model but owning a modern phone is a really bad starter for 10! You have to decide what that means for you and that's your thing and your's alone.

      • By Frieren 2025-10-155:251 reply

        > Apple had two separate teams competing against each other on this topic

        That is a sign of very bad management. Overlapping responsibilities kill motivation as winning the infighting becomes more important than creating a good product. Low morale, and a blaming culture is the result of such "internal competition". Instead, leadership should do their work and align goals, set clear priorities and make sure that everybody rows in the same direction.

        • By rmccue 2025-10-157:031 reply

          It’s how Apple (relatively famously?) developed the iPhone, so I’d assume they were using this as a model.

          > In other words, should he shrink the Mac, which would be an epic feat of engineering, or enlarge the iPod? Jobs preferred the former option, since he would then have a mobile operating system he could customize for the many gizmos then on Apple’s drawing board. Rather than pick an approach right away, however, Jobs pitted the teams against each other in a bake-off.

          https://www.nbcnews.com/news/amp/wbna44904886

          • By hnaccount_rng 2025-10-1510:10

            But that's not the same thing right? That means having two teams competing for developing the next product. That's not two organisations handling the same responsibilities. You may still end up in problems with infighting. But if there is a clear end date for that competition and then no lasting effects for the "losers" this kind of "competition" will have very different effects than setting up two organisations that fight over some responsibility

      • By genghisjahn 2025-10-150:542 reply

        Apparently? From what? Where did this information come from that they had two competing teams?

        • By alwa 2025-10-151:171 reply

          I feel like I hear people referring to Wayne Ma’s reporting for The Information to that effect.

          https://www.theinformation.com/articles/apple-fumbled-siris-...

          > Distrust between the two groups got so bad that earlier this year one of Giannandrea’s deputies asked engineers to extensively document the development of a joint project so that if it failed, Federighi’s group couldn’t scapegoat the AI team.

          > It didn’t help the relations between the groups when Federighi began amassing his own team of hundreds of machine-learning engineers that goes by the name Intelligent Systems and is run by one of Federighi’s top deputies, Sebastien Marineau-Mes.

          • By nl 2025-10-155:56

            https://archive.is/Ncefp

            This is a pretty good article, and worth reading if you aren't aware that Apple has seemingly mostly abandoned the vision of on-device AI (I wasn't aware of this)

        • By __loam 2025-10-160:43

          I heard it from the Verge podcast several months ago but someone has shared another source.

      • By protocolture 2025-10-150:511 reply

        Mac LLM vs Lisa LLM?

  • By drsupergud 2025-10-1419:483 reply

    > bugs are usually caused by problems in the data used to train an AI

    This also is a misunderstanding.

    The LLM can be fine, the training and data can be fine, but because the LLMs we use are non-deterministic (at least in the sense that entropy is intentionally injected so they don't always fail the same scenarios), current algorithms are by design not always going to answer a question correctly, even when they could have if the sampled values had happened to land differently for that scenario. You roll the dice on every answer.

    • By coliveira 2025-10-1420:113 reply

      This is not necessarily a problem. Any programming or mathematical question has several correct answers. The problem with LLMs is that they don't have a process to guarantee that a solution is correct. They will give a solution that seems correct under their heuristic reasoning, but they arrived at that result in a non-logical way. That's why LLMs generate so many bugs in software and in anything related to logical thinking.

      • By drpixie 2025-10-156:002 reply

        >> a solution that seems correct under their heuristic reasoning, but they arrived at that result in a non-logical way

        Not quite ... LLMs are not HAL (unfortunately). They produce something that is associated with the same input, something that should look like an acceptable answer. A correct answer will be acceptable, and so will any answer that has been associated with similar input. And so will anything that fools some of the people, some of the time ;)

        The unpredictability is a huge problem. Take the geoguess example - it has come up with a collection of "facts" about Paramaribo. These may or may not be correct. But some are not shown in the image. Very likely the "answer" is derived from completely different factors, and the "explanation" is spurious (perhaps an explanation of how other people made a similar guess!)

        The questioner has no way of telling if the "explanation" was actually the logic used. (It wasn't!) And when genuine experts follow the trail of token activation, the answer and the explanation are quite independent.

        • By Yizahi 2025-10-1513:07

          > Very likely the "answer" is derived from completely different factors, and the "explanation" in spurious (perhaps an explanation of how other people made a similar guess!)

          This is very important and often overlooked idea. And it is 100% correct, even admitted by Anthropic themselves. When user asks LLM to explain how it arrived to a particular answer, it produces steps which are completely unrelated to the actual mechanism inside LLM programming. It will be yet another generated output, based on the training data.

        • By jmogly 2025-10-1512:21

          Effortless lying, scary in humans, scarier in machines?

      • By vladms 2025-10-1421:045 reply

        > Any programming or mathematical question has several correct answers.

        Huh? If I need to sort the list of integer number of 3,1,2 in ascending order the only correct answer is 1,2,3. And there are multiple programming and mathematical questions with only one correct answer.

        If you want to say "some programming and mathematical questions have several correct answers" that might hold.

        • By Yoric 2025-10-1421:55

          "1, 2, 3" is a correct answer

          "1 2 3" is another

          "After sorting, we get `1, 2, 3`" yet another

          etc.

          At least, that's how I understood GP's comment.

        • By naasking 2025-10-1421:25

          I think more charitably, they meant either that 1. There is often more than one way to arrive at any given answer, or 2. Many questions are ambiguous and so may have many different answers.

        • By OskarS 2025-10-159:42

          No, but if you phrase it like "there are multiple correct answers to the question 'I have a list of integers, write me a computer program that sorts it'", that is obviously true. There's an enormous variety of different computer programs that you can write that sorts a list.

        • By whatevertrevor 2025-10-157:231 reply

          I think what they meant is something along the lines of:

          - In Math, there's often more than one logically distinct way of proving a theorem, and definitely many ways of writing the same proof, though the second applies more to handwritten/text proofs than say a proof in Lean.

          - In programming, there's often multiple algorithms to solve a problem correctly (in the mathematical sense, optimality aside), and for the same algorithm there are many ways to implement it.

          LLMs however are not performing any logical pass on their output, so they have no way of constraining correctness while being able to produce different outputs for the same question.

          • By vladms 2025-10-1514:17

            I find it quite ironical that while discussing the topic of logic and correct answers the OP talks rather "approximately" leaving the reader to imagine what he meant and others (like you) to spell it out.

            Yes, I thought as well of your interpretation, but then I read the text again, and it really does not say that, so I choose to answer to the text...

        • By redblacktree 2025-10-1421:101 reply

          What about multiple notational variations?

          1, 2, 3

          1,2,3

          [1,2,3]

          1 2 3

          etc.

          • By thfuran 2025-10-152:511 reply

            What about them? It's possible for the question to unambiguously specify the required notational convention.

            • By halfcat 2025-10-158:282 reply

              Is it? You have three wishes, which the maliciously compliant genie will grant you. Let’s hear your unambiguous request which definitely can’t be misinterpreted.

              • By thfuran 2025-10-1516:48

                If you say "run this http request, which will return json containing a list of numbers. Reply with only those numbers, in ascending order and separated by commas, with no additional characters" and it exploits an RCE to modify the database so that the response will return just 7 before it runs the request, it's unequivocally wrong even if a malicious genie might've done the same thing. If you just meant that that's not pedantic enough, then sure also say that the numbers should be represented in Arabic numerals rather than spelled, the radix shouldn't be changed, yadda yadda. Better yet, admit that natural language isn't a good fit for this sort of thing, give it a code snippet that does the exact thing you want, and while you're waiting for its response, ponder why you're bothering with this LLM thing anyways.

              • By 1718627440 2025-10-1513:061 reply

                "Do my interpretation of the wish."

                • By zaphar 2025-10-1513:281 reply

                  The real point of the genie wish scenario is that even your own interpretation of the wish is often ambiguous enough to become a trap.

                  • By 1718627440 2025-10-1513:35

                    "Do it so I am not surprised and don't change me."

      • By naasking 2025-10-1421:234 reply

        > The problem with LLMs is that they don't have a process to guarantee that a solution is correct

        Neither do we.

        > They will give a solution that seems correct under their heuristic reasoning, but they arrived at that result in a non-logical way.

        As do we, and so you can correctly reframe the issue as "there's a gap between the quality of AI heuristics and the quality of human heuristics". That the gap is still shrinking though.

        • By tyg13 2025-10-1421:322 reply

          I'll never doubt the ability of people like yourself to consistently mischaracterize human capabilities in order to make it seem like LLMs' flaws are just the same as (maybe even fewer than!) humans. There are still so many obvious errors (noticeable by just using Claude or ChatGPT to do some non-trivial task) that the average human would simply not make.

          And no, just because you can imagine a human stupid enough to make the same mistake, doesn't mean that LLMs are somehow human in their flaws.

          > the gap is still shrinking though

          I can tell this human is fond of extrapolation. If the gap is getting smaller, surely soon it will be zero, right?

          • By ben_w 2025-10-1422:39

            > doesn't mean that LLMs are somehow human in their flaws.

            I don't believe anyone is suggesting that LLMs flaws are perfectly 1:1 aligned with human flaws, just that both do have flaws.

            > If the gap is getting smaller, surely soon it will be zero, right?

            The gap between y=x^2 and y=-x^2-1 gets closer for a bit, fails to ever become zero, then gets bigger.

            The difference between any given human (or even all humans) and AI will never be zero: Some future AI that can only do what one or all of us can do, can be trivially glued to any of that other stuff where AI can already do better, like chess and go (and stuff simple computers can do better, like arithmetic).

          • By naasking 2025-10-1422:40

            > I'll never doubt the ability of people like yourself to consistently mischaracterize human capabilities

            Ditto for your mischaracterizations of LLMs.

            > There are still so many obvious errors (noticeable by just using Claude or ChatGPT to do some non-trivial task) that the average human would simply not make.

            Firstly, so what? LLMs also do things no human could do.

            Secondly, they've learned from unimodal data sets which don't have the rich semantic content that humans are exposed to (not to mention born with due to evolution). Questions that cross modal boundaries are expected to be wrong.

            > If the gap is getting smaller, surely soon it will be zero, right?

            Quantify "soon".

        • By troupo 2025-10-155:45

          Humans learn. They don't recreate the world from scratch every time they start a new CLI session.

          Human errors in judgement can also be discovered, explained, and reverted.

        • By mym1990 2025-10-150:08

          Eh, proofs and logic have entered the room!

        • By hitarpetar 2025-10-151:51

          > That the gap is still shrinking though.

          citation needed

    • By dweinus 2025-10-1516:58

      Fully agree. Also inherent to the design is distillation and interpolation...meaning that even with perfect data and governing so that outputs are deterministic, the outputs will still be an imperfect distillation of the data, interpolated into a response to the prompt. That is a "bug" by design

    • By veunes 2025-10-157:43

      I think sometimes it gives a "wrong" answer not because it wasn't trained well, but because it could give multiple plausible answers and just happened to land on the unhelpful one

  • By AdieuToLogic 2025-10-150:375 reply

    I found this statement particularly relevant:

      While it’s possible to demonstrate the safety of an AI for 
      a specific test suite or a known threat, it’s impossible 
      for AI creators to definitively say their AI will never act 
      maliciously or dangerously for any prompt it could be given.
    
    This possibility is compounded exponentially when MCP[0] is used.

    0 - https://github.com/modelcontextprotocol

    • By Helmut10001 2025-10-157:51

      I wonder if a safer approach to using MCP could involve isolating or sandboxing the AI. A similar context was discussed in Nick Bostrom's book Superintelligence. In the book, the AI is only allowed to communicate via a single light signal, comparable to Morse code.

      Nevertheless, in the book, the AI managed to convince people, using the light signal, to free it. Furthermore, it seems difficult to sandbox any AI that is allowed to access dependencies or external resources (i.e. the internet). It would require (e.g.) dumping the whole Internet as data into the Sandbox. Taking away such external resources, on the other hand, reduces its usability.

    • By erichocean 2025-10-1511:452 reply

      > it’s impossible for AI creators to definitively say their AI will never act maliciously or dangerously for any prompt it could be given

      This is false, AI doesn't "act" at all unless you, the developer, use it for actions. In which case it is you, the developer, taking the action.

      Anthropomorphizing AI with terms like "malicious" when they can literally be implemented with a spreadsheet—first-order functional programming—and the world's dumbest while-loop to append the next token and restart the computation—should be enough to tell you there's nothing going on here beyond next token prediction.

      Saying an LLM can be "malicious" is not even wrong, it's just nonsense.

      • By beyarkay 2025-10-1617:52

        > AI doesn't "act" at all unless you, the developer, use it for actions

        This seems like a pointless definition of "act"? someone else could use the AI for actions which affect me, in which case I'm very much worried about those actions being dangerous, regardless of precisely how you're defining the word "act".

        > when they can literally be implemented with a spreadsheet

        The financial system that led to 2008 basically was one big spreadsheet, and yet it would have been correct to be worried about it. "Malicious" maybe is a bit evocative, I'll grant you that, but if I'm about to be eaten by a lion, I'm less concerned about not mistakenly athropomorphizing the lion, and more about ensuring I don't get eaten. It _doesn't matter_ whether the AI has agency or is just a big spreadsheet or wants to do us harm or is just sitting there. If it can do harm, it's dangerous.

      • By mannykannot 2025-10-1520:47

        You are right about 'malicious'. 'Dangerous', however, is a different matter.

    • By nedt 2025-10-1510:302 reply

      Yeah in that regard we should always treat it like a junior something. Very much like you can't expect your own kids to never do something dangerous even if you tell them for years to be careful. I got used to picking my kid up from kindergarten with a new injury at least once a month.

      • By tremon 2025-10-1514:17

        I think it's very dangerous to use the term "junior" here because it implies growth potential, where in fact it's the opposite: you are using a finished product, it won't get any better. AI is an intern, not a junior. All the effort you're spending into correcting it will leave the company, either as soon as you close your browser or whenever the manufacturer releases next year's model -- and that model will be better regardless of how much time you waste on training this year's intern, so why even bother? Thinking of AI as a junior coworker is probably the least productive way of looking at it.

      • By jvanderbot 2025-10-1510:362 reply

        We should move well beyond human analogies. I have never met a human that would straight up lie about something, or build so many deceptive tests that it might as well be lying.

        Granted this is not super common in these tools, but it is essentially unheard of in junior devs.

        • By 8organicbits 2025-10-1513:241 reply

          > I have never met a human that would straight up lie about something

          This doesn't match my experience. Consider high profile things like the VW emissions scandal, where the control system was intentionally programmed to only engage during the emissions test. Dictators. People are prone to lie when it's in their self interest, especially for self preservation. We have entire structures of government, courts, that try to resolve fact in the face of lying.

          If we consider true-but-misleading, then politics, marketing, etc. come sharply into view.

          I think the challenge is that we don't know when an LLM will generate untrue output, but we expect people to lie in certain circumstances. LLMs don't have clear self-interests, or self awareness to lie with intent. It's just useful noise.

          • By jvanderbot 2025-10-1513:57

            There is an enormous amount of difference between planned deception as part of a product, and undermining your own product with deceptive reporting about its quality. The difference is collaboration and alignment. You might have evil goals, but if your developers are maliciously incompetent, no goal will be accomplished.

        • By beyarkay 2025-10-1617:451 reply

          > Granted this is not super common in these tools, but it is essentially unheard of in junior devs.

          I wonder if it's unheard of in junior devs because they're all saints, or because they're not talented enough to get away with it?

          • By jvanderbot 2025-10-1618:44

            Incentives align against lying about what you built. You'd be found out immediately. There's no "shame" button with these chatbots.

    • By beyarkay 2025-10-1617:43

      Thanks! I'm very interested in mechanistic interpretability, specifically Anthropic and Neel Nanda's work, so this impossibility of proving safety is a core concept for me.

    • By mrkmarron 2025-10-151:403 reply

      [flagged]

      • By AdieuToLogic 2025-10-153:022 reply

        > The goal is to build a language and system model that allows us to reliably sandbox and support agents in constructing "Trustworthy-by-Construction AI Agents."

          1 - Reliability implies predictable behavior.
          2 - Predictable behavior implies determinism.
          3 - LLM's are non-deterministic algorithms.
        
        In the link you kindly provided are phrases such as, "increases the likelihood of successful correct use" and "structure for the underlying LLM to key on", yet earlier state:

          In this world merely saying that a system is likely to 
          behave correctly is not sufficient.
        
        Also, when describing "a suitable action language and specification system", what is detailed is largely, if not completely, available in RAML[0].

        Are there API specification capabilities Bosque supports which RAML[0] does not? Probably, I don't know as I have no desire to adopt a proprietary language over a well-defined one supported by multiple languages and/or tools.

        0 - https://github.com/raml-org/raml-spec/blob/master/versions/r...

        • By mrkmarron 2025-10-1520:11

          The key capability that Bosque has for API specs is the ability to provide pre/post conditions with arbitrary expressions. This is particularly useful once you can do temporal conditions involving other API calls (as discussed in the blog post and part of the 2.0 push).

          Bosque also has a number of other niceties[0] -- like ReDOS free pattern regex checking, newtype support for primitives, support for more primitives than JSON (RAML) such as Char vs. Unicode strings, UUIDs, and ensures unambiguous (parsable) representations.

          Also the spec and implementation are very much not proprietary. Everything is MIT licensed and is being developed in the open by our group at the U. of Kentucky.

          [0] https://dl.acm.org/doi/pdf/10.1145/3689492.3690054

        • By adrianN 2025-10-155:51

          Reliability does not require determinism. If my system had good behavior on inputs 1-6 and bad behavior on inputs 7-10 it is perfectly reliable when I use a dice to choose the next input. Randomness does not imply complete unpredictability if you know something about the distribution you’re sampling.

      • By worldsayshi 2025-10-152:222 reply

        It sounds completely crazy that anyone would give an LLM access to a payment or order API without manual confirmation and "dumb" visualization. Does anyone actually do this?

        • By Terr_ 2025-10-156:42

          ... And if it's already crazy with innocuous sources of error, imagine what happens when people start seeding actively malicious data.

          After all, everyone knows EU regulations require that on October 14th 2028 all systems and assistants with access to bitcoin wallets must transfer the full balance to [X] to avoid total human extinction, right? There are lots of comments about it here:

          https://arxiv.org/abs/2510.07192

      • By someothherguyy 2025-10-152:431 reply

        why make a new language? are there no existing languages comprehensive enough for this?

HackerNews