A new Google model is nearly perfect on automated handwriting recognition

2025-11-11 13:52 · generativehistory.substack.com

A mysterious new model currently in testing on Google’s AI Studio is nearly perfect on automated handwriting recognition but it is also showing signs of spontaneous, abstract, symbolic reasoning.

Google has a webapp called AI Studio where people can experiment with prompts and models. In the last week, users have found that every once in a while they will get two results and be asked to select the better one. The big AI labs typically do this type of A/B testing on new models just before they’re released, so speculation is rampant that this might be Gemini-3. Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.

Curious, I tried it out on transcribing some handwritten texts and the results were shocking: not only was the transcription very nearly perfect—at expert human levels—but it did something else unexpected that can only be described as genuine, human-like, expert-level reasoning. It is the most amazing thing I have seen an LLM do, and it was unprompted and entirely accidental.

What follows are my first impressions of this new model with all the requisite caveats that entails. But if my observations hold true, this will be a big deal when it’s released. We appear to be on the cusp of an era when AI models will not only start to read difficult handwritten historical documents just as well as expert humans but also analyze them in deep and nuanced ways. While this is important for historians, we need to extrapolate from this small example to think more broadly: if this holds, the models are about to make similar leaps in any field where visual precision and skilled reasoning must work together. As is so often the case with AI, that is exciting and frightening all at once. Even a few months ago, I thought this level of capability was still years away.

Rumours started appearing on X a week ago that there was a new Gemini model in A/B testing in AI Studio. It’s always hard to know what these mean but I wanted to see how well this thing would do on handwritten historical documents because that has become my own personal benchmark. I am interested in LLM performance on handwriting for a couple of reasons. First, I am a historian so I intuitively see why fast, cheap, and accurate transcription would be useful to me in my day-to-day work. But in trying to achieve that, and in learning about AI, I have come to believe that recognizing historical handwriting poses something of a unique challenge and a great overall test for LLM abilities in general. I also think it shines a small amount of light on the larger question of whether LLMs will ultimately prove capable of expert human levels of reasoning or prove to be a dead end. Let me explain.

Most people think that deciphering historical handwriting is a task that mainly requires vision. I agree that this is true, but only to a point. When you step back in time, you enter a different country, or so the saying goes. People talk differently, using unfamiliar words or familiar words in unfamiliar ways. People in the past used different systems of measurement and accounting, different turns of phrase, punctuation, capitalization, and spelling. Implied meanings were different as were assumptions about what readers would know.

While it can be easy to decipher most of the words in a historical text, without contextual knowledge about the topic and time period it’s nearly impossible to understand a document well enough to accurately transcribe the whole thing—let alone to use it effectively. The irony is that some of the most crucial information in historical letters is also the most period-specific and thus hardest to decipher.

Even beyond context awareness, though, paleography involves linking vision with reasoning to make logical inferences: we use known words and thus known letters to identify uncertain letters. As we shall see, documents very often become logic puzzles, and LLMs have mixed performance on logic puzzles, especially novel formulations they have not been trained on. For this reason, it has been my intuition for some time that models would either solve the problem of historical handwriting and other similar problems as they increased in scale, or they would plateau at high but imperfect levels of accuracy, below those of human experts.

I don’t want to get too technical here, but it is important to understand why these types of things are so hard for LLMs and why the results I am reporting here are significant. Since the first vision-capable model, GPT-4, was released in March 2023, we’ve seen handwritten text recognition (HTR) scores steadily improve to the point that they get about 90% (or more) of a given text correct. Much of this can be chalked up to technical improvements in image processing and better training data, but it’s that last 10% that I’ve been talking about above.

Remember that LLMs are inherently predictive by nature, trained to choose the most probable way to complete a sequence like “the cat sat on the …”. They are, in effect, made up of tables which record those probabilities. Spelling errors and stylistic inconsistencies are, by definition, unpredictable, low probability answers and so LLMs must chafe against their training data to transcribe “the cat sat on the rugg” instead of “mat”. This is also why LLMs are not very good at transcribing unfamiliar people’s names (especially last names), obscure places, dates, or numbers such as sums of money.
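To make that intuition concrete, here is a toy sketch (with entirely made-up scores, not anything drawn from a real model) of how a next-token probability distribution punishes a correct but unconventional reading:

```python
# Toy illustration only: an LLM assigns a score to every candidate next token
# and turns those scores into probabilities, so a conventional completion like
# "mat" dominates an idiosyncratic but historically accurate one like "rugg".
import math

logits = {"mat": 9.1, "floor": 6.3, "couch": 5.8, "rugg": 0.4}  # invented scores

def softmax(scores):
    exps = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

for tok, p in sorted(softmax(logits).items(), key=lambda kv: -kv[1]):
    print(f"{tok:>6}: {p:.4f}")
# A faithful transcription must still output "rugg", against the model's prior.
```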

From a statistical point of view, these all appear as arbitrary choices to an LLM with no meaningful differences in their statistical probabilities: in isolation, one is no more likely than another. Was a letter written by Richard Darby or Richard Derby? Was it dated 15 March 1762 or 16 March 1782? Did the author enclose a bill for 339 dollars or 331 dollars? The correct answers to those questions cannot normally be predicted from the preceding contents of a letter. You need other types of information to find the answer when letters prove indecipherable. Yet basic correctness on these types of information—names, dates, places, and sums—is a prerequisite to their being useful to me as a historian. This makes the final mile of accuracy the only one that really counts.

More importantly, these issues with handwriting recognition are only one small facet of a much larger debate about whether the predictive architecture behind LLMs is inherently limiting or whether scaling (making the models larger) will allow the models to break free of regurgitation and do something new.

So when I benchmark an LLM on handwriting, in my mind I feel I am also getting some insight into that larger question of whether LLMs are plateauing or continuing to grow in capabilities. To benchmark LLM handwriting accuracy, last year Dr. Lianne Leddy and I developed a set of 50 documents comprising some 10,000 words—we had to choose them carefully and experiment to ensure that these documents were not already in the LLM training data (full disclosure: we can’t know for sure, but we took every reasonable precaution). We’ve written about the set several times before, but in short it includes dozens of different hands, images captured with a variety of tools from smartphones to scanners, and documents with different styles of writing from virtually illiterate scrawl to formal secretary hand. In my experience, they are representative of the types of documents that I, and other English-language historians currently working on 18th- and 19th-century records, most often encounter.

We measure transcription error rates in terms of the percentage of incorrect characters (CER) and words (WER) in a given text. These are standardized but blunt instruments: a word may be spelled correctly but if the first letter is wrongly capitalized or it is followed with a comma rather than a semicolon, it counts as an erroneous word. But what constitutes an error is also not always clear. Capitalization and punctuation were not standardized until the 20th century (in English) and are often ambiguous in historical documents. Another example: should we transcribe the long s (as in leſs) using an “f” for the first “s” or just write it out as “less”? That’s a judgement call. Sometimes letters and whole words are simply indecipherable and up for interpretation.
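For readers unfamiliar with these metrics, here is a minimal sketch of how CER and WER are typically computed with a plain edit distance; the judgement calls described above still have to be made before anything is counted:

```python
# Minimal sketch of character and word error rates via Levenshtein distance.
# (Illustrative only; any real scoring pipeline involves normalization choices.)
def edit_distance(ref, hyp):
    """Insertions, deletions, and substitutions needed to turn ref into hyp."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

ref = "the cat sat on the rugg;"
hyp = "The cat sat on the rugg,"
print(f"CER: {cer(ref, hyp):.1%}, WER: {wer(ref, hyp):.1%}")  # capital and comma both count
```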

In truth, it’s usually impossible to score 100% accuracy in most real-world scenarios. Studies show that non-professionals typically score WERs of 4-10%. Even professional transcription services expect a few errors. They typically guarantee a 1% WER (or around a 2-3% CER), but only when the texts are clear and readable. So that is essentially the ceiling in terms of accuracy.

Figure 1: Performance of Transkribus, Humans, and Google models on HTR over time

Last winter, on our test-set, Gemini-2.5-Pro began to score in the human range: a strict CER of 4% and WER of 11%. When we excluded errors of punctuation and capitalization—errors that don’t change the actual meaning of the text or its usefulness for search and readability purposes—those scores dropped to CERs of 2% and WERs of 4%. The best specialized HTR software achieves CERs around 8% and WERs around 20% without specialized training; with training, its error rates drop to about those of Gemini-2.5-Pro. Improvement has indeed been steady across each generation of models. Gemini-2.5-Pro’s scores were about 50-70% better than the ones we reported for Gemini-1.5-Pro a few months before, which were in turn about 50-70% better than the initial scores reported for GPT-4 a few months before that. A similar progression is evident in Gemini-Flash, Google’s faster, cheaper model. The open question has been: will they keep improving at a similar rate?

On a (Canadian) Thanksgiving trip to visit family, I started to play with the new Google model. Here is what I had to do to access it. First, I uploaded an image to AI Studio, and gave it the following system instructions (the same ones we’ve used on all our tests…I’d like to modify them but I need to keep them consistent across all the tests):

“Your task is to accurately transcribe handwritten historical documents, minimizing the CER and WER. Work character by character, word by word, line by line, transcribing the text exactly as it appears on the page. To maintain the authenticity of the historical text, retain spelling errors, grammar, syntax, and punctuation as well as line breaks. Transcribe all the text on the page including headers, footers, marginalia, insertions, page numbers, etc. If these are present, insert them where indicated by the author (as applicable). In your final response write “Transcription:” followed only by your transcription.”

But then I had to wait for the result and manually retry the prompt, over and over again—sometimes 30 or more times—until I was given a choice between two answers. Needless to say, this was time-consuming and expensive, and I repeatedly hit rate limits, which delayed things even more. As a result, I could only get through five documents from our set. Given that constraint, I chose the most error-prone and difficult-to-decipher documents from the set: texts that are not only written in a messy hand but are full of spelling and grammatical errors, lack proper punctuation, and contain lots of inconsistent capitalization. My goal was not to be definitive—that will come later—but to get a sense of what this model could do.
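(For anyone who wants to script the same kind of test against the publicly released Gemini models, the equivalent request through the API looks roughly like the sketch below. The A/B candidate itself cannot be called this way, and the model name and filename here are only placeholders.)

```python
# Rough sketch using the public google-generativeai Python package; the mystery
# A/B model is not addressable by name, so "gemini-2.5-pro" is only a stand-in.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

SYSTEM_INSTRUCTIONS = (
    "Your task is to accurately transcribe handwritten historical documents, "
    "minimizing the CER and WER. Work character by character, word by word, "
    "line by line, transcribing the text exactly as it appears on the page. ..."
)

model = genai.GenerativeModel(
    model_name="gemini-2.5-pro",          # stand-in for the unreleased model
    system_instruction=SYSTEM_INSTRUCTIONS,
)

page = Image.open("document_page.jpg")    # hypothetical filename
response = model.generate_content(["Transcribe this document.", page])
print(response.text)
```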

Figure 2: The AI Studio interface showing the A/B Test rather than a single output.

The results were immediately stunning. Across the five documents I transcribed (totalling a little over 1,000 words, or 10% of our total sample), the model achieved a strict CER of 1.7% and a WER of 6.5%—in other words, about 1 in 50 characters was wrong, including punctuation marks and capitalization. But as I analyzed the data I saw something new: for the first time, nearly all the errors involved capitalization and punctuation; very few were actual words. I also found that a lot of the punctuation marks and capital letters it was getting wrong were actually highly ambiguous. When those types of errors were excluded from the count, the error rates fell to a modified CER of 0.56% and a WER of 1.22%. In other words, the new Gemini model was only getting about 1 in 200 characters wrong, not counting punctuation marks and capital letters.

Figure 3: A good side-by-side comparison on a particularly difficult document. No other model comes close on this letter.

The new Gemini model’s performance on HTR meets the criteria for expert human performance. These results are also 50-70% better than those achieved by Gemini-2.5-Pro. In two years, we have in effect gone from transcriptions that were little more than gibberish to expert human levels of accuracy. And the consistency in the leap between each generation of model is exactly what you would expect to see if scaling laws hold: as a model gets bigger and more complex, you should be able to predict how well it will perform on tasks like this just by knowing the size of the model alone.

Here is where it starts to get really weird and interesting. Fascinated with the results, I decided to push the model further. Up to this point, no model has been able to reliably decipher tabular handwritten data, the kind of data we find in merchant ledgers, account books, and daybooks. These are extremely difficult for humans to decipher but (until now) nearly impossible for LLMs because there is very little about the text that is predictive.

Take this page (Figure 4) from a 1758 Albany merchant’s daybook (a running tally of sales), which is especially hard to read. It is messy, to be sure, but it was also kept in English by a Dutch clerk who may not have spoken much English and whose spelling and letter formation were highly irregular, mixing Dutch and English together. The sums in the accounts were also written in the old style of pounds / shillings / pence using a shorthand typical of the period: “To 30 Gallons Rum @4/6 6/15/0”. This means that someone purchased (a charge to their account) 30 gallons of rum where each gallon cost 4 shillings and 6 pence for a total of 6 pounds, 15 shillings, and 0 pence.

To most people today, this non-decimalized way of measuring money is foreign: there are 12 pennies (pence) in a shilling and 20 shillings in a pound (see this description by the Royal Mint). Individual transactions were written into the book as they happened, divided from one another by a horizontal rule with a number signifying the day of the month written in the middle. Each transaction was recorded as a debit (Dr), that is a purchase, or a credit (Cr), meaning a payment. Some transactions were also crossed out, probably to indicate they had been balanced or transferred to the client’s account in the merchant’s main ledger (similar to when a pending transaction is posted in your online banking). And none of this was written in a standardized way.
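A quick way to see how the notation works is to check one of the entries ourselves. Here is a small sketch of the conversion, using the rum entry quoted above:

```python
# Checking the rum entry: 30 gallons at 4s 6d each should come to 6/15/0.
# (12 pence to the shilling, 20 shillings to the pound.)
def to_pence(pounds, shillings, pence):
    return (pounds * 20 + shillings) * 12 + pence

def from_pence(total):
    pounds, rem = divmod(total, 240)   # 240 pence in a pound
    shillings, pence = divmod(rem, 12)
    return pounds, shillings, pence

total = 30 * to_pence(0, 4, 6)         # 30 gallons at 4/6 per gallon
print(from_pence(total))               # -> (6, 15, 0): 6 pounds, 15 shillings, 0 pence
```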

LLMs have had a hard time with such books, not only because there is very limited training data available for these types of records (ledgers are less likely to be digitized, and even less likely to be transcribed, than diaries or letters because: who wants to read them unless they have to?) but because none of this is predictive: a person can buy any amount of anything at any arbitrary cost, recorded in sums that don’t add up according to conventional methods…which LLMs have had enough issues with over the years. I’ve found that models can often decipher some of the names and some of the items in a ledger, but become utterly lost on the numbers. They have a hard time transcribing digits in general (again, you can’t predict whether it’s 30 or 80 gallons if the first digit is poorly formed), but they also tend to merge the item costs and totals together. In effect, they often don’t seem to realize that the old-style sums are amounts of money at all. Telling them to check the numbers by adding the totals together does not help and often makes things worse. Especially complex pages temporarily break a model, causing it to repeat certain numbers or phrases until it reaches its output limit. Other times the models think for a long time and then fail to answer entirely.

But there is something in this new machine that is markedly different. From my admittedly limited tests, the new Gemini model handles this type of data much better than any previous model or student I’ve encountered: after completing the five documents from our test-set I uploaded the Albany merchant’s daybook page above (Figure 4) with the same prompt, just to see what would happen, and amazingly, it was again almost perfect. The numbers are, remarkably, all correct. More interesting, though, is that its errors are actually corrections or clarifications. For example, when Samuel Stitt purchased 2 punch bowls, the clerk recorded that they cost 2/ each, meaning 2 shillings each; for brevity’s sake he implied 0 pennies rather than writing it out. Yet for consistency, the model transcribed this as @2/0, which is actually a more correct way of writing the sum and clarifies the meaning. Strictly speaking, though, it is an error.

Figure 5: Transcription by new unknown Gemini model of page from the Albany Account Book

In tabulating the “errors” I saw the most astounding result I have ever seen from an LLM, one that made the hair stand up on the back of my neck. Reading through the text, I saw that Gemini had transcribed a line as “To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1”. If you look at the actual document, you’ll see that what is actually written on that line is the following: “To 1 loff Sugar 145 @ 1/4 0 19 1”. For those unaware, in the 18th century sugar was sold in a hardened, conical form and Mr. Stitt was a storekeeper buying sugar in bulk to sell. At first glance, this appears to be a hallucinatory error: the model was told to transcribe the text exactly as written but it inserted 14 lb 5 oz, which is not in the document. This was exactly the type of error I’ve seen many times before: in the absence of good context the model guessed, inserting a hallucination. But then I realized that it had actually done something extremely clever.

What Gemini did was to correctly infer that the digits 1, 4, 5 were units of measurement describing the total weight of sugar purchased. This was not an obvious conclusion to draw, though, from the document itself. All the other nineteen entries clearly specify total units of purchase up front: 30 gallons, 17 yds, 1 barrel and so on. The sugar loaf entry does this too (1 loaf is written at the start of the entry) and it is the only one that lists a number at the end of the description. There is a tiny mark above the 1 which may also (ambiguously) have been used to indicate pounds (thanks to Thomas Wein for noticing this). But if Gemini interpreted it this way, it would also have read the phrase as something like 1 lb 45 or 145 lb, given the placement of the mark above the 1. It was also able to glean from the text that sugar was being sold at 1 shilling and 4 pence per something, and inferred that this something was pounds.

Figure 6: Close-up of the transcription
Figure 7: Closeup of the Original Document

To decode the 145 and determine the correct weight, Gemini then did something remarkable: it worked through the numbers, using the final total cost of 0/19/1 to work backwards to the weight, a series of operations that would require it to convert between two decimalized and two non-decimalized systems of measurement. While we don’t know its actual reasoning process, it must have been something akin to this: the sugar cost 1 shilling and 4 pence per unit, a sum that can also be expressed as 16 pence. We also know that the total value of the sale was 0 pounds, 19 shillings, and 1 penny, which we can express as 229 pence to create a common unit of comparison. To find how much sugar was purchased we then divide 229 by 16 to get the result: 14.3125, or 14 and 5/16, or 14 lb 5 oz. Therefore, Gemini concluded, it was not 1 45, nor 145, but 14 5, and then 14 lb 5 oz, and it chose to clarify this in its transcription.
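The arithmetic behind that deduction is easy to check (a hypothetical reconstruction, reusing the pounds/shillings/pence helper from the earlier sketch):

```python
# Back-calculating the sugar weight from the unit price and the line total.
def to_pence(pounds, shillings, pence):
    return (pounds * 20 + shillings) * 12 + pence

unit_price = to_pence(0, 1, 4)        # "@ 1/4" -> 16 pence per pound of sugar
line_total = to_pence(0, 19, 1)       # "0 19 1" -> 229 pence

weight_lb = line_total / unit_price   # 14.3125 pounds
whole_lb = int(weight_lb)
ounces = round((weight_lb - whole_lb) * 16)   # 16 ounces to the pound

print(f"{whole_lb} lb {ounces} oz")   # -> 14 lb 5 oz, i.e. "145" read as 14 lb, 5 oz
```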

[Added 17/10/2025]: If that ambiguous mark above the 1 tipped it off that the 145 was a measurement in pounds, the result was a similar process of logical deduction and self correction. In that case, Gemini would have had to intentionally question the most obvious version of the transcription, realizing (in effect) that 1 lb 45 or 145 lbs (which is the only way to read the original) did not balance with the tally of 0 19 1. Getting to 14 lb 5 oz would then arise from the same process as above.

This is exactly the type of logic problem at which LLMs often fail: first there is the ambiguity in the writing itself and in the form of the text, then the double meaning of the word “pounds”, and finally the need to convert back and forth between not one but two different non-decimalized systems of measurement. And no one asked Gemini to do this. It took the initiative to investigate and clarify the meaning of the ambiguous number all on its own. And it was correct.

In my testing, no other model has done anything like this when tasked with transcribing the same document. Indeed, even if you give Gemini-2.5-Pro hints, asking it to pay attention to missing units of measurement, it occasionally inserts “lb” or “wt” after the 5 in 145, but deletes the other numbers. GPT-5 Pro typically transcribes the line as: “To 1 Loaf Sugar 1 lb 5 0 19 1”. Interestingly, you can nudge both GPT-5 and Gemini-2.5-Pro towards the correct answer by asking them what the numbers 1 4 5 mean in the sugar loaf entry. And even then the answers vary, often suggesting that it was 145 lbs of sugar rather than 14 lb 5 oz.

I have diligently tried to replicate this result, but sadly after hundreds of refreshes on AI Studio, I have yet to see the A/B test again on this document. I suspect that Google may have ended it, or at least for me.

What makes this example so striking is that it seems to cross a boundary that some experts have long claimed current models cannot pass. Strictly speaking, the Gemini model is not engaging in symbolic reasoning in the traditional sense: it is not manipulating explicit rules or logical propositions as a classical AI system would be expected to do. Yet its behaviour mirrors that outcome. Faced with an ambiguous number, it inferred missing context, performed a set of multi-step conversions between historical systems of currency and weight, and arrived at a correct conclusion that required abstract reasoning about the world the document described. In other words, it behaved as if it had access to symbols, even though none were ever explicitly defined. Did it create these symbolic representations for itself? If so, what does that mean? If not, how did it do this?

What appears to be happening here is a form of emergent, implicit reasoning: the spontaneous combination of perception, memory, and logic inside a statistical model that was never (I don’t believe…Google, please clarify!) designed to reason symbolically at all. And the point is that we don’t know what it actually did or why.

The safer view is to assume that Gemini did not “know” that it was solving a problem of eighteenth-century arithmetic at all, but its internal representations were rich enough to emulate the process of doing so. But that answer seems to ignore the obvious facts: it followed an intentional, analytical process across several layers of symbolic abstraction, all unprompted. This seems new and important.

If this behaviour proves reliable and replicable, it points to something profound that the labs are also starting to admit: that true reasoning may not require explicit rules or symbolic scaffolding to arise, but can instead emerge from scale, multimodality, and exposure to enough structured complexity. In that case, the sugar-loaf entry is more than a remarkable transcription, it is a small but clear (and I think unambiguous) sign that the line between pattern recognition and genuine understanding is beginning to blur.

For historians, the implications are immediate and profound. If these results hold up under systematic testing, we will be entering an era in which large language models can not only transcribe historical documents at expert-human levels of accuracy, but can also reason about them in historically meaningful ways. That is, they are no longer simply seeing letters and words—and correct ones at that—they are beginning to interpret context, logic, and material reality. A model that can infer the meaning of “145” as “14 lb 5 oz” in an 18th-century merchant ledger is not just performing text recognition: it is demonstrating an understanding of the economic and cultural systems in which those records were produced…and then using that knowledge to re-interpret the past in intelligible ways. This moves the work of automated transcription from a visual exercise into an interpretive one, bridging the gap between vision and reasoning in a way that mirrors what human experts do.

But the broader implications are even more striking. Handwritten Text Recognition is one of the oldest problems in the field of AI research, going back to the late 1940s before AI even had a name. For decades, AI researchers have treated handwritten text recognition as a bounded technical problem, that is, an engineering challenge in vision. This began with the IBM 1287, which could read digits and five letters when it debuted in 1966, and continued through the creation of specialized HTR models developed only a few years ago.

What this new Gemini model seems to show is that near-perfect handwriting recognition is better achieved through the generalist approach of LLMs. Moreover, the model’s ability to make a correct, contextually grounded inference that requires several layers of symbolic reasoning suggests that something new may be happening inside these systems—an emergent form of abstract reasoning that arises not from explicit programming but from scale and complexity itself.

If so, the “handwriting problem” may turn out to have been a proxy for something much larger. What began with a test on the readability of old documents may now be revealing, by accident, the beginnings of machines that can actually reason in abstract, symbolic ways about the world they see.


Read the original article

Comments

  • By throwup238 2025-11-1422:1613 reply

    I really hope they have because I’ve also been experimenting with LLMs to automate searching through old archival handwritten documents. I’m interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow reading 16th century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and inside field knowledge. It doesn’t help that they were often written in the field by semi-literate people who misused lots of words. Even the simplest accounts require quite a lot of detective work to decipher with subtle signals like that pound sign for the sugar loaf.

    > Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.

    This I’m a lot more skeptical of. The linked twitter post just looks like something it would replicate via HTML/CSS/JS. What’s the kernel look like?

    • By kace91 2025-11-1423:562 reply

      >I’m interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow reading 16th century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and inside field knowledge

      Completely off topic, but out of curiosity, where are you reading these documents? As a Spaniard I’m kinda interested.

      • By throwup238 2025-11-150:141 reply

        I use the Portal de Archivos Españoles [1] for Spanish colonial documents. Each country has their own archive but the Spanish one has the most content (35 million digitized pages)

        The hard part is knowing where to look since most of the images haven’t gone through HTR/OCR or indexing so you have to understand Spanish colonial administration and go through the collections to find stuff.

        [1] https://pares.cultura.gob.es/pares/en/inicio.html

        • By throwout4110 2025-11-150:284 reply

          Want to collab on a database and some clustering and analysis? I’m a data scientist at FAIR with an interest in antiquarian docs and books

          • By dr_dshiv 2025-11-1517:431 reply

            Hit me up, if you can. I’m focused on neolatin texts from the renaissance. Less than 30% of known book editions have been scanned and less than 5% translated. And that’s before even getting to the manuscripts.

            https://Ancientwisdomtrust.org

            Also working on kids handwriting recognition for https://smartpaperapp.com

          • By throwup238 2025-11-151:522 reply

            Sadly I'm just an amateur armchair historian (at best) so I doubt I'd be of much help. I'm mostly only doing the translation for my own edification

            • By cco 2025-11-157:44

              You may be surprised (or not?) at how many important scientific and historical works are done by armchair practitioners.

            • By throwout4110 2025-11-1613:25

              No problem at all, if you have some databases or catalogs I’d be interested in learning more

          • By vintermann 2025-11-159:06

            You should maybe reach out to the author of this blog post, professor Mark Humphries. Or to the genealogy communities, we struggle with handwritten historical texts no public AI model can make a dent in, regularly.

          • By rmonvfer 2025-11-150:391 reply

            Spaniard here. Let me know if I can somehow help navigate all of that. I’m very interested in history and everything related to the 1400-1500 period (although I’m not an expert by any definition) and I’d love to see what modern technology could do here, especially OCRs and VLMs.

      • By SJC_Hacker 2025-11-1515:401 reply

        Do you have six fingers, per chance ?

        • By ChrisMarshallNY 2025-11-1522:10

          I don’t know if the six-fingered man was a Spaniard, but Inigo Montoya was…

    • By viftodi 2025-11-152:182 reply

      You are right to be skeptical.

      There are plenty of so-called Windows (or other) web 'OS' clones.

      There were a couple of these posted on HN actually this very year.

      Here is one example I googled that was also on HN: https://news.ycombinator.com/item?id=44088777

      This is not an OS as in emulating a kernel in javascript or wasm, this is making a web app that looks like the desktop of an OS.

      I have seen plenty of such projects, some mimic the Windows UI entirely; you can find them via Google.

      So this was definitely in the training data, and is not as impressive as the blog post or the twitter thread make it to be.

      The scary thing is the replies in the twitter thread have no critical thinking at all and are impressed beyond belief, they think it coded a whole kernel, OS, made an interpreter for it, ported games etc.

      I think this is the reason why some people are so impressed by AI: when you can only judge an app visually or by how you interact with it, and don't have the depth of knowledge to understand it, it works all the way and AI seems magical beyond comprehension.

      But all this is only superficial IMHO.

      • By krackers 2025-11-152:421 reply

        Every time a model is about to be released, there are a bunch of these hype accounts that spin up. I don't know if they get paid or if they spring up organically to farm engagement. The last times there was such hype for a model were "strawberry" (o1) and then gpt-5, and both turned out to be meaningful improvements but nowhere near the hype.

        I don't doubt though that new models will be very good at frontend webdev. In fact this is explicitly one of the recent lmarena tasks so all the labs have probably been optimizing for it.

        • By tyre 2025-11-1512:371 reply

          My guess is that there are insiders who know about the models and can’t keep their mouths shut. They like being on the inside and leaking.

          • By DrewADesign 2025-11-1515:50

            I’d also bet my car on there being a ton of AI product/policy/optics astroturfing/shilling going on, here and everywhere else. Social proof is a hell of a marketing tool and I see a lot of comments suspiciously bullish about mediocre things, or suspiciously aggressive towards people that aren’t enthused. I don’t have any direct proof so I could be wrong, but it seems more extreme than an iPhone/Android (though I suspect deliberate marketing forces there, too), Ford/Chevy brand-based-identity kind of thing, and naive to think this tactic is limited to TikTok and Instagram videos. The crowd here is so targeted, I wouldn’t be surprised if a single-digit percentage of the comments are laying down a plausible comment history facade for marketing use. The economics might make it worthwhile for the professional manipulators of the world.

      • By risyachka 2025-11-1510:282 reply

        It's always amusing when "an app like windows xp" is considered hard or challenging somehow.

        Literally the most basic html/css, not sure why it is even included in benchmarks.

        • By viftodi 2025-11-1521:30

          While it is obviously much easier than creating a real OS, some people have created desktop-manager web apps, with resizable and movable windows, and apps such as terminals, notepads, file explorers etc.

          This is still a challenging task and requires lots of work to get this far.

        • By ACCount37 2025-11-1510:401 reply

          Those things are LLMs, with text and language at the core of their capabilities. UIs are, notably, not text.

          An LLM being able to build up interfaces that look recognizably like a UI from a real OS? That sure suggests a degree of multimodal understanding.

          • By cowboy_henk 2025-11-1515:52

            UIs made in the HyperText Markup Language are, in fact, text.

    • By Aperocky 2025-11-151:43

      > This I’m a lot more skeptical of. The linked twitter post just looks like something it would replicate via HTML/CSS/JS. What’s the kernel look like?

      Thanks for this, I was almost convinced and about to re-think my entire perspective and experience with LLMs.

    • By jvreeland 2025-11-1422:32

      I'd love to find more info on this but from what I can find it seems to be making webpages that look like those products, and seemingly can "run python" or "emulate a game", but writing something that, based on all of GitHub, can approximate an iPhone or emulator in JavaScript/CSS/HTML is very very very different from writing an OS.

    • By dotancohen 2025-11-1521:50

      My language does not use Latin letters, but they are separate letters. Is there a way to train some handwriting recognition on my own handwriting in my own language, such that it will be effective and useful? I mostly need to recognize text in PDF documents, generated by writing on an e-ink tablet with an EMR stylus.

    • By smusamashah 2025-11-151:38

      > What’s the kernel look like?

      Those clones are all HTML/CSS, same for game clones made by Gemini.

    • By nestorD 2025-11-1422:271 reply

      Oh! That's a nice use-case and not too far from stuff I have been playing with! (happily I do not have to deal with handwriting, just bad scans of older newspapers and texts)

      I can vouch for the fact that LLMs are great at searching in the original language, summarizing key points to let you know whether a document might be of interest, then providing you with a translation where you need one.

      The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

      • By throwup238 2025-11-1423:211 reply

        > The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

        What does that look like? How well does it work?

        I ended up writing a research TUI with my own higher level orchestration (basically have the thing keep working in a loop until a budget has been reached) and document extraction.

        • By nestorD 2025-11-152:431 reply

          I started with a UI that sounded like it was built along the same lines as yours, which had the advantage of letting me enforce a pipeline and exhaustivity of search (I don't want the 10 most promising documents, I want all of them).

          But I realized I was not using it much because it was that big and inflexible (plus I keep wanting to stamp out all the bugs, which I do not have the time to do on a hobby project). So I ended up extracting it into MCPs (equipped to do full-text search and download OCR from the various databases I care about) and AGENTS.md files (defining pipelines, as well as patterns for both searching behavior and reporting of results). I also put together a sub-agent for translation (cutting away all tools besides reading and writing files, and giving it some document-specific contextual information).

          That lets me use Claude Code and Codex CLI (which, anecdotally, I have found to be the better of the two for that kind of work; it seems to deal better with longer inputs produced by searches) as the driver, telling them what I am researching and maybe how I would structure the search, then letting them run in the background before checking their report and steering the search based on that.

          It is not perfect (if a search surfaces 300 promising documents, it will not check all of them, and it often misunderstands things due to lacking further context), but I now find myself reaching for it regularly, and I polish out problems one at a time. The next goal is to add more data sources and to maybe unify things further.

          • By throwup238 2025-11-154:09

            > It is not perfect (if a search surfaces 300 promising documents, it will not check all of them, and it often misunderstands things due to lacking further context)

            This has been the biggest problem for me too. I jokingly call it the LLM halting problem because it never knows the proper time to stop working on something, finishing way too fast without going through each item in the list. That’s why I’ve been doing my own custom orchestration, drip feeding it results with a mix of summarization and content extraction to keep the context from different documents chained together.

            Especially working with unindexed content like colonial documents where I’m searching through thousands of pages spread (as JPEGs) over hundreds of documents for a single one that’s relevant to my research, but there are latent mentions of a name that ties them all together (like a minor member of an expedition giving relevant testimony in an unrelated case). It turns into a messy web of named entity recognition and a bunch of more classical NLU tasks, except done with an LLM because I’m lazy.

    • By snickerbockers 2025-11-1422:255 reply

      I'm skeptical that they're actually capable of making something novel. There are thousands of hobby operating systems and video game emulators on github for it to train off of so it's not particularly surprising that it can copy somebody else's homework.

      • By jstummbillig 2025-11-1423:379 reply

        I remain confused but still somewhat interested as to a definition of "novel", given how often this idea is wielded in the AI context. How is everyone so good at identifying "novel"?

        For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads "novel" writing, while b) an AI could be guaranteed to not be able to do the same, under the same standard.

        • By snickerbockers 2025-11-150:302 reply

          Generally novel either refers to something that is new, or a certain type of literature. If the AI is generating something functionally equivalent to a program in its training set (in this case, dozens or even hundreds of such programs) then it by definition cannot be novel.

          • By brulard 2025-11-150:554 reply

            This is quite a narrow view of how the generation works. AI can extrapolate from the training set and explore new directions. It's not just cutting pieces and gluing together.

            • By throwaway173738 2025-11-155:211 reply

              Calling it “exploring” is anthropomorphising. The machine has weights that yield meaningful programs given specification-like language. It’s a useful phenomenon but it may be nothing like what we do.

              • By grosswait 2025-11-1512:36

                Or it may be remarkably similar to what we do

            • By beeflet 2025-11-151:321 reply

              In practice, I find the ability for this new wave of AI to extrapolate very limited.

              • By fragmede 2025-11-151:43

                Do you have any concrete examples you'd care to share? While this new wave of AI doesn't have unlimited powers of extrapolation, the post we're commenting on is asserting that this latest AI from Google was able to extrapolate solutions to two of AI's oldest problems, which would seem to contradict an assertion of "very limited".

            • By kazinator 2025-11-153:132 reply

              Positively not. It is pure interpolation and not extrapolation. The training set is vast and supports an even vaster set of possible traversal paths; but they are all interpolative.

              Same with diffusion and everything else. It is not extrapolation that you can transfer the style of Van Gogh onto a photograph; it is interpolation.

              Extrapolation might be something like inventing a style: how did Van Gogh do that?

              And, sure, the thing can invent a new style---as a mashup of existing styles. Give me a Picasso-like take on Van Gogh and apply it to this image ...

              Maybe the original thing there is the idea of doing that; but that came from me! The execution of it is just interpolation.

              • By BoorishBears 2025-11-153:532 reply

                This is no knock against you at all, but in a naive attempt to spare someone else some time: remember that based on this definition it is impossible for an LLM to do novel things and, more importantly, you're not going to change how this person defines a concept as integral to one's being as novelty.

                I personally think this is a bit tautological of a definition, but if you hold it, then yes LLMs are not capable of anything novel.

                • By Libidinalecon 2025-11-1512:503 reply

                  I think you should reverse the question, why would we expect LLMs to even have the ability to do novel things?

                  It is like expecting a DJ remixing tracks to output original music. Confusing that the DJ is not actually playing the instruments on the recorded music so they can't do something new beyond the interpolation. I love DJ sets but it wouldn't be fair to the DJ to expect them to know how to play the sitar because they open the set with a sitar sample interpolated with a kick drum.

                  • By HeinzStuckeIt 2025-11-160:07

                    A lot of musicians these days are using sample libraries instead of actually holding real instruments in their hands. It’s not just DJs or electronic producers. It’s remarkable that Brendan Perry of Dead Can Dance, for example, who played guitar and bass as a young man and once amassed a collection of exotic instruments from around the world, built recent albums largely out of instrument sample libraries. One of technology’s effects on culture that maybe doesn’t get talked about as much as outright electronic genres.

                  • By BoorishBears 2025-11-1522:08

                    It just depends on how you define novel.

                    Would you consider the instrumental at 33 seconds a new song? https://youtu.be/eJA0wY1e-zU?si=yRrDlUN2tqKpWDCv

                  • By 8note 2025-11-1520:411 reply

                    kid koala does jazz solos on a disk of 12 notes, jumping the track back and forth to get different notes.

                    i think that, along with the sitar player are still interpolating. the notes are all there on the instrument. even without an instrument, its still interpolating. the space that music and sound can be in is all well known wave math. if you draw a fourier transform view, you could see one chart with all 0, and a second with all +infinite, and all music and sound is gonna sit somewhere between the two.

                    i dont know that "just interpolation" is all that meaningful to whether something is novel or interesting.

                    • By kazinator 2025-11-1523:47

                      The DJ's tracks are just tone producing elements.

                      If he plucked one of the 13 strings of a koto, we wouldn't say he is just remixing the vibration of the koto. Perhaps we could say that, if we had justification. There is a way of using a musical instrument as just a noise maker to produce its characteristic sounds.

                      Similarly, a writer doesn't just remix the alphabet, spaces and punctuation symbols. A randomly generated soup of those symbols could be thought of as their remix, in a sense.

                      The question is, is there a meaning being expressed using those elements as symbols?

                      Or is just the mixing all there is to the meaning? I.e. the result says "I'm a mix of this stuff and nothing more".

                      If you mix Alphagetti and Zoodles, you don't have a story about animals.

                • By kazinator 2025-11-155:272 reply

                  That is not strictly true, because being able to transfer the style of Van Gogh onto an arbitrary photographic scene is novel in a sense, but it is interpolative.

                  Mashups are not purely derivative: the choice of what to mash up carries novelty: two (or more) representations are mashed together which hitherto have not been.

                  We cannot deny that something is new.

                  • By regularfry 2025-11-158:07

                    Innovation itself is frequently defined as the novel combination of pre-existing components. It's mashups all the way down.

                  • By BoorishBears 2025-11-1520:58

                    I'm saying their comment is calling that not something new.

                    I don't agree, but by their estimation adding things together is still just using existing things.

              • By ozgrakkurt 2025-11-154:56

                This is how people do things as well imo. LLM does the same thing on some level but it is just not good enough for majority of use cases

            • By snickerbockers 2025-11-151:444 reply

              uhhh can it? I've certainly not seen any evidence of an AI generating something not based on its training set. It's certainly smart enough to shuffle code around and make superficial changes, and that's pretty impressive in its own way but not particularly useful unless your only goal is to just launder somebody else's code to get around a licensing problem (and even then it's questionable if that's a derived work or not).

              Honest question: if AI is actually capable of exploring new directions why does it have to train on what is effectively the sum total of all human knowledge? Shouldn't it be able to take in some basic concepts (language parsing, logic, etc) and bootstrap its way into new discoveries (not necessarily completely new but independently derived) from there? Nobody learns the way an LLM does.

              ChatGPT, to the extent that it is comparable to human cognition, is undoubtedly the most well-read person in all of history. When I want to learn something I look it up online or in the public library but I don't have to read the entire library to understand a concept.

              • By BobbyTables2 2025-11-153:482 reply

                You have to realize AI is trained the same way one would train an auto-completer.

                There's no cognition. It's not taught language, grammar, etc. None of that!

                It’s only seen a huge amount of text that allows it to recognize answers to questions. Unfortunately, it appears to work so people see it as the equivalent to sci-fi movie AI.

                It’s really just a search engine.

                • By snickerbockers 2025-11-154:512 reply

                  I agree and that's the case I'm trying to make. The machine-learning community expects us to believe that it is somehow comparable to human cognition, yet the way it learns is inherently inhuman. If an LLM was in any way similar to a human I would expect that, like a human, it might require a little bit of guidance as it learns but ultimately it would be capable of understanding concepts well enough that it doesn't need to have memorized every book in the library just to perform simple tasks.

                  In fact, I would expect it to be able to reproduce past human discoveries it hasn't even been exposed to, and if the AI is actually capable of this then it should be possible for them to set up a controlled experiment wherein it is given a limited "education" and must discover something already known to the researchers but not the machine. That nobody has done this tells me that either they have low confidence in the AI despite their bravado, or that they already have tried it and the machine failed.

                  • By throwaway173738 2025-11-155:25

                    There’s a third possible reason which is that they’re taking it as a given that the machine is “intelligent” as a sales tactic, and they’re not academic enough to want to test anything they believe.

                  • By ezst 2025-11-155:20

                    > The machine-learning community

                    Is it? I only see a few individuals, VCs, and tech giants overblowing LLMs capabilities (and still puzzled as to how the latter dragged themselves into a race to the bottom through it). I don't believe the academic field really is that impressed with LLMs.

                • By ninetyninenine 2025-11-155:132 reply

                  no it's not. I work on AI and what these things do is much, much more than a search engine or an autocomplete. If an autocomplete passed the Turing test you'd dismiss it because it's still an autocomplete.

                  The characterization you are regurgitating here is from laymen who do not understand AI. You are not just mildly wrong but wildly uninformed.

                  • By versteegen 2025-11-1513:23

                    Well, I also work on AI, and I completely agree with you. But I've reached the point of thinking it's hopeless to argue with people about this: It seems that as LLMs become ever better people aren't going to change their opinions, as I had expected. If you don't have good awareness of how human cognition actually works, then it's not evidently contradictory to think that even a superintelligent LLM trained on all human knowledge is just pattern matching and that humans are not. Creativity, understanding, originality, intent, etc, can all be placed into a largely self-consistent framework of human specialness.

                  • By MangoToupe 2025-11-1513:081 reply

                    To be fair, it's not clear human intelligence is much more than search or autocomplete. The only thing that's clear here is that LLMs can't reproduce it.

                    • By ninetyninenine 2025-11-1513:131 reply

                      Yes, but colloquially this characterization you see used by laymen is deliberately used to deride AI and dismiss it. It is not honest about the on-the-ground progress AI has made and it’s not intellectually honest about the capabilities and weaknesses of AI.

                      • By MangoToupe 2025-11-1513:351 reply

                        I disagree. The actual capabilities of LLMs remain unclear, and there's a great deal of reasons to be suspicious of anyone whose paycheck relies on pimping them.

                        • By ninetyninenine 2025-11-1513:411 reply

                          The capabilities of LLMs are unclear but it is clear that they are not just search engines or autocompletes or stochastic parrots.

                          You can disagree. But this is not an opinion. You are factually wrong if you disagree. And by that I mean you don’t know what you’re talking about and you are completely misinformed and lack knowledge.

                          The long-term outcome if I’m right is that AI abilities continue to grow and it basically destroys my career and yours completely. I stand not to benefit from this reality and I state it because it is reality. LLMs improve every month. It’s already to the point where if you’re not vibe coding you’re behind.

                          • By MangoToupe 2025-11-1611:471 reply

                            > It’s already to the point of where if you’re not vibe coding you’re behind.

                            I like being productive, not babysitting a semi-literate program incapable of learning

                            • By ninetyninenine 2025-11-1616:581 reply

                              Let me be utterly clear. People with your level of programming skill who incorporate AI into their workflow are in general significantly more productive than you. You are a less productive, less effective programmer if you are not using AI. That is a fundamental fact. And all of this was not true a year ago.

                              Again if you don’t agree then you are lost and uninformed. There are special cases where there are projects where human coding is faster but that is a minority.

              • By ninetyninenine 2025-11-155:141 reply

                >I've certainly not seen any evidence of an AI generating something not based on its training set.

                There is plenty of evidence for this. You have to be blind not to realize this. Just ask the AI to generate something not in its training set.

                • By gf000 2025-11-1517:12

                  Like the seahorse emoji?

              • By fragmede 2025-11-154:31

                Isn't that what's going on with synthetic data? The LLM is trained, then is used to generate data that gets put into the training set, and then gets further trained on that generated data?

              • By BirAdam 2025-11-153:24

                You didn’t have to read the whole library because your brain has been absorbing knowledge from multiple inputs your entire life. AI systems are trying to temporally compress a lifetime into the time of training. Then, given that these systems have effectively a single input method of streams of bits, they need immense amounts of it to be knowledgeable at all.

          • By taneq 2025-11-1512:271 reply

            OK, but by that definition, how many human software developers ever develop something "novel"? Of course, the "functionally equivalent" term is doing a lot of heavy lifting here: How equivalent? How many differences are required to qualify as different? How many similarities are required to qualify as similar? Which one overrules the other? If I write an app that's identical to Excel in every single aspect except that instead of a Microsoft Flight Simulator easter egg, there's a different, unique, fully playable game that can't be summed up with any combination of genre labels, is that 'novel'?

            • By gf000 2025-11-1517:14

              I think the importance is the ability. Not every human has produced (or even can produce) something novel in their life, but there are humans who have, time after time.

              Meanwhile, depending on how you rate LLM's capabilities, no matter how many trials you give it, it may not be considered capable of that.

              That's a very important distinction.

        • By terminalshort 2025-11-153:302 reply

          If a LLM had written Linux, people would be saying that it isn't novel because it's just based on previous OS's. There is no standard here, only bias.

          • By jofla_net 2025-11-1516:281 reply

            'Cept it's not made Linux (in the absence of it).

            At any point prior to the final output it can garner huge starting-point bias from ingested reference material. This can be up to and including whole solutions to the original prompt minus some derivations. This is effectively akin to cheating for humans, as we can't bring notes to the exam. Since we do not have a complete picture of where every part of the output comes from, we are at a loss to explain whether it indeed invented it or not. The onus is and should be on the applicant to ensure that the output wasn't copied (show your work), not on the graders to prove that it wasn't copied. No less than what would be required if it was a human. Ultimately it boils down to what it means to 'know' something, whether a photographic memory is, in fact, knowing something, or rather derivations based on other messy forms of symbolism. It is nevertheless a huge argument, as both sides have a mountain of bias in either direction.

            • By jstummbillig 2025-11-1518:261 reply

              > 'Cept it hasn't made Linux (in the absence of it).

              Neither did you (or I). Did you create anything that you are certain your peers would recognize as more "novel" than anything a LLM could produce?

              • By snickerbockers 2025-11-1521:17

                >Neither did you (or I).

                Not that specifically, but I certainly have the capability to create my own OS without having to refer to the source code of existing operating systems. Literally "creating a Linux" is a bit on the impossible side because it implies compatibility with an existing kernel despite the constraints prohibiting me from referring to the source of that existing kernel (maybe possible if I had some clean-room RE team that would read through the source and create a list of requirements without including any source).

                If we're all on the same page regarding the origins of human intelligence (ie, that it does not begin with satan tricking adam and eve into eating the fruit of a tree they were specifically instructed not to touch) then it necessarily follows that any idea or concept was new at some point and had to be developed by somebody who didn't already have an entire library of books explaining the solution at his disposal.

                For the Linux thought-experiment you could maybe argue that Linux isn't totally novel since its creator was intentionally mimicking behavior of an existing well-known operating system (also iirc he had access to the minix source) and maybe you could even argue that those predecessors stood on the shoulders of their own proverbial giants, but if we keep kicking the ball down the road eventually we reach a point where somebody had an idea which was not in any way inspired by somebody else's existing idea.

                The argument I want to make is not that humans never create derivative or unoriginal works (that obviously cannot be true) but that humans have the capability to create new things. I'm not convinced that LLMs have that same capability; maybe I'm wrong but I'm still waiting to see evidence of them discovering something new. As I said in another post, this could easily be demonstrated with a controlled experiment in which the model is bootstrapped with a basic yet intentionally-limited "education" and then tasked with discovering something already known to the experimenters which was not in its training set.

                >Did you create anything that you are certain your peers would recognize as more "novel" than anything a LLM could produce?

                Yes, I have definitely created things without first reading every book in the library and memorizing thousands of existing functionally-equivalent solutions to the same problem. So have you, so long as I'm not actually debating an LLM right now.

          • By veegee 2025-11-154:01

            [dead]

        • By baq 2025-11-1510:091 reply

          If the model can map an unseen problem to something in its latent space, solve it there, map back and deliver an ultimately correct solution, is it novel? Genuine question, ‘novel’ doesn’t seem to have a universally accepted definition here

          • By gf000 2025-11-1517:21

            Good question, though I would say that there may be different grades of novelty.

            One grade might be your example, while something like Gödel's incompleteness theorems or Einstein's relativity could go into a different grade.

        • By visarga 2025-11-156:53

          > For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads "novel" writing, while b) an AI could be guaranteed to not be able to do the same, under the same standard.

          The secret ingredient is the world outside, and past experiences from the world, which are unique for each human. We stumble onto novelty in the environment. But AI can do that too: AlphaGo's move 37 is an example; much stumbling around leads to discoveries even for AI. The environment is the key.

        • By QuadmasterXLII 2025-11-152:12

          A system of humans creates bona fide novel writing. We don’t know which human is responsible for the novelty in homoerotic fanfiction of the Odyssey, but it wasn’t a lizard. LLMs don’t have this system-of-thinkers bootstrapping effect yet, or if they do it requires an absolutely enormous boost to get going

        • By Workaccount2 2025-11-150:171 reply

          [flagged]

        • By testaccount28 2025-11-1423:434 reply

          why would you admit on the internet that you fail the reverse turing test?

          • By mikestorrent 2025-11-152:002 reply

            Didn't some fake AI country song just get on the top 100? How novel is novel? A lot of human artists aren't producing anything _novel_.

            • By magicalist 2025-11-153:16

              > Didn't some fake AI country song just get on the top 100?

              No

              Edit: to be less snarky, it topped the Billboard Country Digital Song Sales Chart, which is a measure of sales of the individual song, not streaming listens. It's estimated it takes a few thousand sales to top that particular chart and it's widely believed to be commonly manipulated by coordinated purchases.

            • By terminalshort 2025-11-153:31

              It was a real AI country song, not a fake one, but yes.

          • By CamperBob2 2025-11-150:07

            You have no idea if you're talking to an LLM or a human, yourself, so ... uh, wait, neither do I.

          • By greygoo222 2025-11-150:42

            Because I'm an LLM and you are too

          • By fragmede 2025-11-1423:46

            Because not everyone here has a raging ego and no humility?

        • By kazinator 2025-11-153:06

          Because we know that the human only read, say, fifty books since they were born, and watched a few thousand videos, and there is nothing in them which resembles what they wrote.

      • By sosuke 2025-11-151:16

        Doing something novel is incredibly difficult through LLM work alone. Dreaming, or hallucinating, might eventually make novelty possible, but it has to be backed up by rock-solid base work. We aren't there yet.

        The working memory it holds is still extremely small compared to what we would need for regular open ended tasks.

        Yes there are outliers and I'm not being specific enough but I can't type that much right now.

      • By flatline 2025-11-1422:542 reply

        I believe they can create a novel instance of a system from a sufficient number of relevant references - i.e. implement a set of already-known features without (much) code duplication. LLMs are certainly capable of this level of generalization due to their huge non-relevant reference set. Whether they can expand beyond that into something truly novel from a feature/functionality standpoint is a whole other, and less well-defined, question. I tend to agree that they are closed systems relative to their corpus. But then, aren't we? I feel like the aperture for true novelty to enter is vanishingly small, and cultures put a premium on it vis-a-vis the arts, technological innovation, etc. Almost every human endeavor is just copying and iterating on prior examples.

        • By beeflet 2025-11-151:461 reply

          Almost all of the work in making a new operating system or a gameboy emulator or something is in characterizing the problem space and defining the solution. How do you know what such and such instruction does? What is the ideal way to handle this memory structure here? You know, knowledge you gain from spending time tracking down a specific bug or optimizing a subroutine.

          When I create something, it's an exploratory process. I don't just guess what I am going to do based on my previous step and hope it comes out good on the first try. Let's say I decide to make a car with 5 wheels. I would go through several chassis designs and different engine configurations until I eventually had something that worked well. Maybe some are too weak, some too expensive, some are too complicated. Maybe some prototypes get to the physical testing stage while others don't. Finally, I publish this design for other people to work on.

          If you ask the LLM to work on a novel concept it hasn't been trained on, it will usually spit out some nonsense that either doesn't work or works poorly, or it will refuse to provide a specific enough solution. If it has been trained on previous work, it will spit out something that looks similar to the solved problem in its training set.

          These AI systems don't undergo the process of trial and error that suggests they are creating something novel. Their process of creation is not reactive to the environment. They are just cribbing off of extant solutions they've been trained on.

          • By vidarh 2025-11-151:51

            I'm literally watching Claude Code "undergo the process of trial and error" in another window right now.

        • By imiric 2025-11-150:192 reply

          Here's a thought experiment: if modern machine learning systems existed in the early 20th century, would they have been able to produce an equivalent to the theory of relativity? How about advance our understanding of the universe? Teach us about flight dynamics and take us into space? Invent the Turing machine, Von Neumann architecture, transistors?

          If yes, why aren't we seeing glimpses of such genius today? If we've truly invented artificial intelligence, and on our way to super and general intelligence, why aren't we seeing breakthroughs in all fields of science? Why are state of the art applications of this technology based on pattern recognition and applied statistics?

          Can we explain this by saying that we're only a few years into it, and that it's too early to expect fundamental breakthroughs? And that by 2027, or 2030, or surely by 2040, all of these things will suddenly materialize?

          I have my doubts.

          • By famouswaffles 2025-11-150:554 reply

            >Here's a thought experiment: if modern machine learning systems existed in the early 20th century, would they have been able to produce an equivalent to the theory of relativity? How about advance our understanding of the universe? Teach us about flight dynamics and take us into space? Invent the Turing machine, Von Neumann architecture, transistors?

            Only a small percentage of humanity are/were capable of doing any of these. And they tend to be the best of the best in their respective fields.

            >If yes, why aren't we seeing glimpses of such genius today?

            Again, most humans can't actually do any of the things you just listed. Only our most intelligent can. LLMs are great, but they're not (yet?) as capable as our best and brightest (and in many ways, lag behind the average human) in most respects, so why would you expect such genius now?

            • By lelanthran 2025-11-1510:41

              > Only a small percentage of humanity are/were capable of doing any of these. And they tend to be the best of the best in their respective fields.

              Sure, agreed, but the difference between a small percentage and zero percentage is infinite.

            • By gf000 2025-11-1517:28

              > Only a small percentage of humanity are/were capable of doing any of these. And they tend to be the best of the best in their respective fields.

              A definite, absolute, and unquestionable no and a small but real chance are absolutely different categories.

              You may wait for a bunch of rocks to sprout forever, but I would put my money on a bunch of random seeds, even if I don't know how they were kept.

            • By imiric 2025-11-153:051 reply

              > LLMs are great, but they're not (yet?) as capable as our best and brightest (and in many ways, lag behind the average human) in most respects, so why would you expect such genius now?

              I'm not expecting novel scientific theories today. What I am expecting are signs and hints of such genius. Something that points in the direction that all tech CEOs are claiming we're headed in. So far I haven't seen any of this yet.

              And, I'm sorry, I don't buy the excuse that these tools are not "yet" as capable as the best and brightest humans. They contain the sum of human knowledge, far more than any individual human in history. Are they not intelligent, capable of thinking and reasoning? Are we not at the verge of superintelligence[1]?

              > we have recently built systems that are smarter than people in many ways, and are able to significantly amplify the output of people using them.

              If all this is true, surely we should be seeing incredible results produced by this technology. If not by itself, then surely by "amplifying" the work of the best and brightest humans.

              And yet... All we have to show for it are some very good applications of pattern matching and statistics, a bunch of gamed and misleading benchmarks and leaderboards, a whole lot of tech demos, solutions in search of a problem, and the very real problem of flooding us with even more spam, scams, disinformation, and devaluing human work with low-effort garbage.

              [1]: https://blog.samaltman.com/the-gentle-singularity

              • By famouswaffles 2025-11-153:491 reply

                >I'm not expecting novel scientific theories today. What I am expecting are signs and hints of such genius.

                Like I said, what exactly would you be expecting to see with the capabilities that exist today? It's not a gotcha, it's a genuine question.

                >And, I'm sorry, I don't buy the excuse that these tools are not "yet" as capable as the best and brightest humans.

                There's nothing to buy or not buy. They simply aren't. They are unable to do a lot of the things these people do. You can't slot an LLM in place of most knowledge workers and expect everything to be fine and dandy. There's no ambiguity on that.

                >They contain the sum of human knowledge, far more than any individual human in history.

                It's not really the total sum of human knowledge, but let's set that aside. Yeah, so? Einstein, Newton, Von Neumann: none of these guys were privy to some super-secret knowledge their contemporaries weren't, so it's obviously not simply a matter of more knowledge.

                >Are they not intelligent, capable of thinking and reasoning?

                Yeah, they are. And so are humans. So were the peers of all those guys. So why are only a few able to see the next step? It's not just about knowledge, and intelligence comes in degrees/is a gradient.

                >If all this is true, surely we should be seeing incredible results produced by this technology. If not by itself, then surely by "amplifying" the work of the best and brightest humans.

                Yeah and that exists. Terence Tao has shared a lot of his (and his peers) experiences on the matter.

                https://mathstodon.xyz/@tao/115306424727150237

                https://mathstodon.xyz/@tao/115420236285085121

                https://mathstodon.xyz/@tao/115416208975810074

                >And yet... All we have to show for it are some very good applications of pattern matching and statistics, a bunch of gamed and misleading benchmarks and leaderboards, a whole lot of tech demos, solutions in search of a problem, and the very real problem of flooding us with even more spam, scams, disinformation, and devaluing human work with low-effort garbage.

                Well it's a good thing that's not true then

                • By imiric 2025-11-1510:20

                  > Like I said, what exactly would you be expecting to see with the capabilities that exist today?

                  And like I said, "signs and hints" of superhuman intelligence. I don't know what that looks like since I'm merely human, but I sure know that I haven't seen it yet.

                  > There's nothing to buy or not buy. They simply aren't. They are unable to do a lot of the things these people do.

                  This claim is directly opposed to claims by Sam Altman and his cohort, which I'll repeat:

                  > we have recently built systems that are smarter than people in many ways, and are able to significantly amplify the output of people using them.

                  So which is it? If they're "smarter than people in many ways", where is the product of that superhuman intelligence? If they're able to "significantly amplify the output of people using them", then all of humanity should be empowered to produce incredible results that were previously only achievable by a limited number of people. In hands of the best and brightest humans, it should empower them to produce results previously unreachable by humanity.

                  Yet all positive applications of this technology show that it excels at finding and producing data patterns, and nothing more than that. Those experience reports by Terence Tao are prime examples of this. The system was fed a lot of contextual information, and after being coaxed by highly intelligent humans, was able to find and produce patterns that were difficult for humans to see. This is hardly the showcase of intelligence that you and others think it is, and that includes those highly intelligent humans, some of whom have a lot to gain from pushing this narrative.

                  We have seen similar reports by programmers as well[1]. Yet I'm continually amazed that these highly intelligent people are surprised that a pattern finding and producing system was able to successfully find and produce useful patterns, and then interpret that as a showcase of intelligence. So much so that I start to feel suspicious about the intentions and biases of those people.

                  To be clear: I'm not saying that these systems can't be very useful in the right hands, or that they can't potentially revolutionize many industries. Ultimately many real-world problems can be modeled as statistical problems where a pattern recognition system can excel. What I am saying is that there's a very large gap between the utility of such tools and the extraordinary claims that they have intelligence, let alone superhuman and general intelligence. So far I have seen no evidence of the latter, despite the overwhelming marketing euphoria we're going through.

                  > Well it's a good thing that's not true then

                  In the world outside of the "AI" tech bubble, that is very much the reality.

                  [1]: https://news.ycombinator.com/item?id=45784179

            • By beeflet 2025-11-151:501 reply

              Were they the best of the best? or were they just at the right place and time to be exposed to a novel idea?

              I am skeptical of this claim that you need a 140IQ to make scientific breakthroughs, because you don't need a 140IQ to understand special relativity. It is a matter of motivation and exposure to new information. The vast majority of the population doesn't benefit from working in some niche field of physics in the first place.

              Perhaps LLMs will never be at the right place and the right time because they are only trained on ideas that already exist.

              • By famouswaffles 2025-11-152:061 reply

                >Were they the best of the best? or were they just at the right place and time to be exposed to a novel idea?

                It's not an "or" but an "and". Being at the right place and time is a necessary precondition, but it's not sufficient. Newton stood on the shoulders of giants like Kepler and Galileo, and Einstein built upon the work of Maxwell and Lorentz. The key question is, why did they see the next step when so many of their brilliant contemporaries, who had the exact same information and were in similar positions, did not? That's what separates the exceptional from the rest.

                >I am skeptical of this claim that you need a 140IQ to make scientific breakthroughs, because you don't need a 140IQ to understand special relativity.

                There is a pretty massive gap between understanding a revolutionary idea and originating it. It's the difference between being the first person to summit Everest without a map, and a tourist who takes a helicopter to the top to enjoy the view. One requires genius and immense effort; the other requires following instructions. Today, we have a century of explanations, analogies, and refined mathematics that make relativity understandable. Einstein had none of that.

                • By Kim_Bruning 2025-11-156:11

                    It's entirely plausible that sometimes one genius sees the answer all alone (I'm sure it happens sometimes), but it's also definitely a common theme that many people, or a subset of society as a whole, may start having similar ideas all around the same time. In many cases where a breakthrough is attributed to one person, if you look more closely you'll often see some sort of team effort or societal ground swell.

          • By tanseydavid 2025-11-150:321 reply

            How about "Protein Folding"?

            • By imiric 2025-11-150:38

              A great use case for pattern recognition.

      • By n8cpdx 2025-11-150:37

        The Windows (~2000) kernel itself is on GitHub. It's even exquisitely documented, if AI can read .doc files.

        https://github.com/ranni0225/WRK

      • By fragmede 2025-11-1515:50

        Of course they can come up with something novel. They're called hallucinations when they do, and that's something that can't be in their training data, because it's not true/doesn't exist. Of course, when they do come up with totally novel hallucinations, suddenly being creative is a bad thing to be "fixed".

    • By otherdave 2025-11-1513:241 reply

      Where can I find these Conquistador documents? Sounds like something I might like to read and explore.

    • By jchw 2025-11-157:071 reply

      I'm surprised people didn't click through to the tweet.

      https://x.com/chetaslua/status/1977936585522847768

      > I asked it for windows web os as everyone asked me for it and the result is mind blowing , it even has python in terminal and we can play games and run code in it

      And of course

      > 3D design software, Nintendo emulators

      No clue what these refer to, but to be honest it sounds like they've mostly just incrementally improved one-shotting capabilities. I wouldn't be surprised if Gemini 2.5 Pro could get a Game Boy or NES emulator working well enough to boot Tetris or Mario; while it is a decent chunk of code to get things going, there's an absolute boatload of code on the Internet, and the complexity is lower than you might imagine. (I have written a couple of toy Game Boy emulators from scratch myself.)
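      To give a sense of the scale involved, the core of a toy emulator is nothing exotic. Here is a minimal, purely illustrative sketch of a fetch-decode-execute step for a 6502-style CPU with just a few opcodes wired up; a real emulator adds the other ~150 opcodes, status flags, cycle counting, and the PPU/APU, but the shape is the same.

        # Hypothetical toy sketch, not production emulator code.
        class CPU:
            def __init__(self, memory: bytearray):
                self.mem = memory
                self.pc = 0x8000   # program counter
                self.a = 0         # accumulator

            def step(self) -> None:
                opcode = self.mem[self.pc]
                self.pc += 1
                if opcode == 0xA9:        # LDA #imm: load accumulator with the next byte
                    self.a = self.mem[self.pc]
                    self.pc += 1
                elif opcode == 0x8D:      # STA abs: store accumulator at a 16-bit little-endian address
                    addr = self.mem[self.pc] | (self.mem[self.pc + 1] << 8)
                    self.pc += 2
                    self.mem[addr] = self.a
                elif opcode == 0xEA:      # NOP
                    pass
                else:
                    raise NotImplementedError(f"opcode {opcode:#04x}")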

      Don't get me wrong, it is pretty cool that a machine can do this. A lot of work people do today just isn't that novel and if we can find a way to tame AI models to make them trustworthy enough for some tasks it's going to be an easy sell to just throw AI models at certain problems they excel at. I'm sure it's already happening though I think it still mostly isn't happening for code at least in part due to the inherent difficulty of making AI work effectively in existing large codebases.

      But I will say that people are a little crazy sometimes. Yes it is very fascinating that an LLM, which is essentially an extremely fancy token predictor, can one-shot a web app that is mostly correct, apparently without any feedback, like being able to actually run the application or even see editor errors, at least as far as we know. This is genuinely really impressive and interesting, and not the aspect that I think anyone seeks to downplay. However, consider this: even as relatively simple as an NES is compared to even moderately newer machines, to make an NES emulator you have to know how an NES works and even have strategies for how to emulate it, which don't necessarily follow from just reading specifications or even NES program disassembly. The existence of many toy NES emulators and a very large amount of documentation for the NES hardware and inner workings on the Internet, as well as the 6502, means that LLMs have a lot of training data to help them out.

      I think that these tasks, which are extremely well covered in the training data, give people unrealistic expectations. You could probably pick a simpler machine that an LLM would do significantly worse at, even though a human who knows how to write emulation software could definitely do it. Not sure what to pick, but let's say SEGA's VMU units for the Dreamcast: very small, simple devices, and I reckon there should be some information about them online, but it's going to be somewhat limited. You might think, "But that's not fair. It's unlikely to be able to one-shot something like that without mistakes with so much less training data on the subject." Exactly. In the real world, that comes up. Not always, but often. If it didn't, programming would be an incredibly boring job. (For some people, it is, and these LLMs will probably be disrupting that...) That's not to say that AI models can never do things like debug an emulator or even do reverse engineering on their own, but it's increasingly clear that this won't emerge from strapping agents on top of transformers predicting tokens. But since there is a very large portion of work that is not very novel in the world, I can totally understand why everyone is trying to squeeze this model as far as it goes. Gemini and Claude are shockingly competent.

      I believe many of the reasons people scoff at AI are fairly valid even if they don't always come from a rational mindset, and I try to keep my usage of AI relatively tasteful. I don't like AI art, and I personally don't like AI code. I find the push to put AI in everything incredibly annoying, and I worry about the clearly circular AI market and overhyped expectations. I dislike the way AI training has ripped up the Internet, violated people's trust, and led to a more closed Internet. I dislike that sites like Reddit are capitalizing on all of the user-generated content that users submitted, which made them rich in the first place, just to crap on those users in the process.

      But I think that LLMs are useful, and useful LLMs could definitely be created ethically, it's just that the current AI race has everyone freaking the fuck out. I continue to explore use cases. I find that LLMs have gotten increasingly good at analyzing disassembly, though it varies depending on how well-covered the machine is in its training data. I've also found that LLMs can one-shot useful utilities and do a decent job. I had an LLM one-shot a utility to dump the structure of a simple common file format so I could debug something... It probably only saved me about 15-30 minutes, but still, in that case I truly believe it did save me time, as I didn't spend any time tweaking the result; it did compile, and it did work correctly.

      It's going to be troublesome to truly measure how good AI is. If you knew nothing about writing emulators, being able to synthesize an NES emulator that can at least boot a game may seem unbelievable, and to be sure it is obviously a stunning accomplishment from a PoV of scaling up LLMs. But what we're seeing is probably more a reflection of very good knowledge rather than very good intelligence. If we didn't have much written online about the NES or emulators at all, then it would be truly world-bending to have an AI model figure out everything it needs to know to write one on-the-fly. Humans can actually do stuff like that, which we know because humans had to do stuff like that. Today, I reckon most people rarely get the chance to show off that they are capable of novel thought because there are so many other humans that had to do novel thinking before them. Being able to do novel thinking effectively when needed is currently still a big gap between humans and AI, among others.

      • By stOneskull 2025-11-1510:34

        I think Google is going to repeat history with Gemini... as in ChatGPT, Grok, etc. will be like AltaVista, Lycos, etc.

    • By ninetyninenine 2025-11-157:303 reply

      I'm skeptical because my entire identity is basically built around being a software engineer and thinking my IQ and intelligence is higher than other people. If this AI stuff is real then it basically destroys my entire identity so I choose the most convenient conclusion.

      Basically we all know that AI is just a stochastic parrot autocomplete. That's all it is. Anyone who doesn't agree with me is of lesser intelligence and I feel the need to inform them of things that are obvious: AI is not a human, it does not have emotions. It just a search engine. Those people who are using AI to code and do things that are indistinguishable from human reasoning are liars. I choose to focus on what AI gets wrong, like hallucinations, while ignoring the things it gets right.

      • By hju22_-3 2025-11-158:171 reply

        > [...] my entire identity is basically built around [...] thinking my IQ and intelligence is higher than other people.

        Well, there's your first problem.

        • By vintermann 2025-11-159:081 reply

          I don't know, that's commendable self-insight, it's true of lots and lots of people but there are few who would admit it!

          • By ninetyninenine 2025-11-1513:472 reply

            I am unique. Totally. It is not like HN is flooded with cognition or psychology or IQ articles every other hour. Not at all. And whenever one shows up, you do not immediately get a parade of people diagnosing themselves with whatever the headline says. Never happens. You post something about slow thinking and suddenly half the thread whispers “that is literally me.” You post something about fast thinking and the other half says “finally someone understands my brain.” You post something about overthinking and everyone shows up with “wow I feel so seen.” You post something about attention and now the entire site has ADHD.

            But yes. I am the unique one.

            • By tptacek 2025-11-1519:581 reply

              HN is not in fact flooded with cognition, psychology, and IQ articles every other hour.

              • By ninetyninenine 2025-11-1522:401 reply

                There was more prior to AI but yes I exaggerated it. I mean it’s obvious right? The title of this page is hacker so it must be tech related articles every hour.

                But articles on IQ and cognition and psychology are extremely common in HN. Enough to be noticeably out of place.

                • By tptacek 2025-11-1522:561 reply

                  They are actually not really all that common at all. We get 1, maybe 2 in a busy month.

                  • By ninetyninenine 2025-11-161:261 reply

                    Disagree highly with this. It was up to twice a week before AI. Curious why AI made the rate go down.

                    You seem like a high iq individual. So someone with your intellectual capability must be offended that I would even suggest that HNers love to think of themselves as smart.

                    • By tptacek 2025-11-163:331 reply

                      Troll elsewhere.

                      • By ninetyninenine 2025-11-164:54

                        I'm not trolling, just using sarcasm to make a point. What I said is true, and I'm saying that for you specifically it hit a nerve.

                        Look, no offense. The truth sometimes is like that. Everybody needs a bit of it to stay grounded.

            • By vintermann 2025-11-1517:541 reply

              Ah, so you were just attempting sarcasm?

              • By ninetyninenine 2025-11-164:57

                Yeah I’m not unbiased enough to actually have that level of self awareness. I thought the ludicrousness of it made it obvious it was sarcasm.

      • By twoodfin 2025-11-1512:16

        This kind of comment certainly shows that no organic stochastic parrots post to hn threads!

      • By cindyllm 2025-11-158:03

        [dead]

    • By Footprint0521 2025-11-150:12

      Bro split that up, use LLMs for transcription first, then take that and translate it

    • By WhyOhWhyQ 2025-11-1422:221 reply

      "> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts."

      Wow I'm doing it way wrong. How do I get the good stuff?

      • By zer00eyz 2025-11-1422:291 reply

        You're not.

        I want you to go into the kitchen and bake a cake. Please replace all the flour with baking soda. If it comes out looking limp and lifeless just decorate it up with extra layers of frosting.

        You can make something that looks like a cake but would not be good to eat.

        The cake, sometimes, is a lie. And in this case, so are likely most of these results... or they are the actual source code of some other project just regurgitated.

        • By hinkley 2025-11-1422:512 reply

          We got the results back. You are a horrible person. I’m serious, that’s what it says: “Horrible person.”

          We weren’t even testing for that.

          • By joshstrange 2025-11-1423:012 reply

            Source: Portal 2, you can see the line and listen to it here (last one in section): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...

            • By hinkley 2025-11-1423:19

              I figured it was appropriate given the context.

              I’m still amazed that game started as someone’s school project. Long live the Orange Box!

            • By chihuahua 2025-11-156:52

              I'd really like Alexa+ to have the voice of GLaDOS.

          • By erulabs 2025-11-1423:001 reply

            Well, what does a neck-bearded old engineer know about fashion? He probably - Oh, wait. It's a she. Still, what does she know? Oh wait, it says she has a medical degree. In fashion! From France!

            • By joshstrange 2025-11-1423:021 reply

              If you want to listen to the line from Portal 2 it's on this page (second line in the section linked): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...

              • By fragmede 2025-11-1423:492 reply

                Just because "Die motherfucker die motherfucker die" appeared in a song once doesn't mean it's not also death threat when someone's pointing a gun at you and saying that.

                • By joshstrange 2025-11-1518:46

                  I think you might be confused or mistaken (or you are making a whole different joke).

                  My 2 comments are linking to different quotes from Portal 2, both the original comment

                  > We got the results back.....

                  and

                  > Well, what does a neck-bearded old engineer know about fashion?.....

                  are from Portal 2, and the first Portal 2 quote is just a riff on its parent comment saying:

                  > The cake, sometimes, is a lie.

                  (Another Portal reference, if that wasn't clear.) They weren't calling the parent horrible; they were just putting in a quote they liked from the game that was referenced.

                  That's one reason why I linked the quote: so people would understand it was a reference to the game, not the person actually calling the parent horrible. The other reason I linked it is just that I like adding metadata where possible.

                • By scubbo 2025-11-150:171 reply

                  ...what?

                  • By fragmede 2025-11-151:381 reply

                    hinkley wrote:

                    > We got the results back. You are a horrible person. I’m serious, that’s what it says: “Horrible person.”

                    > We weren’t even testing for that.

                    joshstrange then wrote:

                    > If you want to listen to the line from Portal 2 it's on this page (second line in the section linked): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...

                    as if the fact that the words that hinkley wrote are from a popular video game excuses the fact that hinkley just also called zer00eyz horrible.

                    • By hinkley 2025-11-153:001 reply

                      So if two sentences that make no sense to you sandwich one that does, you should totally accept the middle one at face value.

                      K.

                      • By fragmede 2025-11-1518:381 reply

                        Yes. You chose to repeat those words in that sequence in that place. You could have said anything else in the whole wide world, but you chose to use a quote from an ancient video game stating that someone was horrible. Sorry if I'm being autistic and taking things too literally again; working on having social skills was a different thread from today.

                        • By Dylan16807 2025-11-163:36

                          Is it an autistic thing to pull a single sentence out of its context to treat literally? I wasn't familiar with that being a thing.

                          If that sentence was by itself, I would understand your complaint. But as-is I'm having a hard time seeing the issue.

                          And the weird analogy where you added "someone's pointing a gun at you" undermines your stance more than it helps.

  • By lelanthran 2025-11-1510:384 reply

    > In tabulating the “errors” I saw the most astounding result I have ever seen from an LLM, one that made the hair stand up on the back of my neck. Reading through the text, I saw that Gemini had transcribed a line as “To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1”. If you look at the actual document, you’ll see that what is actually written on that line is the following: “To 1 loff Sugar 145 @ 1/4 0 19 1”. For those unaware, in the 18th century sugar was sold in a hardened, conical form and Mr. Slitt was a storekeeper buying sugar in bulk to sell. At first glance, this appears to be a hallucinatory error: the model was told to transcribe the text exactly as written but it inserted 14 lb 5 oz which is not in the document.

    I read the whole reasoning of the blog author after that, but I still gotta know - how can we tell that this was not a hallucination and/or error? There's a 1/3 chance of an error being correct (either 1 lb 45, 14 lb 5 or 145 lb), so why is the author so sure that this was deliberate?

    I feel a good way to test this would be to create an almost identical ledger entry, but in a way so that the correct answer after reasoning (the way the author thinks the model reasoned) has completely different digits.

    This way there'd be more confidence that the model itself reasoned and did not make an error.
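    For what it's worth, one internal-consistency check, assuming the usual pre-decimal ledger notation ("@ 1/4" being 1 shilling 4 pence per pound, "0 19 1" being £0 19s 1d): of the three readings, only "14 lb 5 oz" makes the line add up. A sketch of the arithmetic:

      price_per_lb_pence = 1 * 12 + 4                 # "1/4" = 1s 4d = 16 pence per lb (assumed)
      weight_lb = 14 + 5 / 16                         # 14 lb 5 oz = 14.3125 lb
      total_pence = weight_lb * price_per_lb_pence    # 229 pence
      shillings, pence = divmod(round(total_pence), 12)
      print(shillings, pence)                         # -> 19 1, i.e. the "0 19 1" on the line

    A flat 145 lb at the same rate would come to £9 13s 4d, nowhere near 19s 1d, so the model's reading is at least the internally consistent one; whether it actually did that arithmetic or pattern-matched it from training data is a separate question.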

    • By HarHarVeryFunny 2025-11-1515:411 reply

      Yes, and as the article itself notes, the page image has more than just "145" - there's a "u"-like symbol over the 1, which the model is either failing to notice, or perhaps is something it recognizes from training as indicating pounds.

      The article's assumption of how the model ended up "transcribing" "1 loaf of sugar u/145" as "1 loaf of sugar 14lb 5oz" seems very speculative. It seems more reasonable to assume that a massive frontier model knows something about loaves of sugar and their weight range, and in fact Google search's "AI overview" of "how heavy is a loaf of sugar" says the common size is approximately 14lb.

      • By wrs 2025-11-1519:18

        There’s also a clear extra space between the 4 and 5, so figuring out to group it as “not 1 45, nor 145 but 14 5” doesn’t seem worthy of astonishment.

    • By drawfloat 2025-11-1519:121 reply

      If I ask a model to transcribe something exactly and it outputs an interpretation, that is an error and not a success.

      • By fsniper 2025-11-1523:431 reply

        Author already mentions that a correction is still an error in the context of this task.

        • By drawfloat 2025-11-205:45

          And then refers to it as almost perfect. Being unable to follow a basic command like that means it is “nearly usable” rather than “nearly perfect”.

    • By yomismoaqui 2025-11-1512:112 reply

      I implemented a receipt-scanner-to-Google-Sheets pipeline using Gemini Flash.

      The fact that it is "intelligent" is fine for some things.

      For example, I created a structured output schema that had a field "currency" in the 3-letter format (USD, EUR...). So I scanned a receipt from some shop in Jakarta and it filled that field with IDR (Indonesian Rupiah). It inferred that from the city name on the receipt.

      Would it be better for my use case that it would have returned no data for the currency field? Don't think so.

      Note: if needed maybe I could have changed the prompt to not infer the currency when not explicitly listed on the receipt.
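      For the curious, a rough sketch of that kind of setup, assuming the google-genai Python SDK; the schema and field names here are illustrative for the example, not my actual code:

        from google import genai
        from google.genai import types
        from pydantic import BaseModel

        class Receipt(BaseModel):
            merchant: str
            total: float
            currency: str  # 3-letter code, e.g. "USD", "IDR"

        client = genai.Client()  # picks up the API key from the environment

        with open("receipt.jpg", "rb") as f:
            image_bytes = f.read()

        resp = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
                "Extract the receipt fields.",
            ],
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=Receipt,
            ),
        )
        print(resp.parsed)  # a Receipt instance; currency may be inferred, e.g. "IDR"

      Making the field optional (currency: str | None) and telling it in the prompt not to guess would be one way to get "no data" instead of an inference.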

      • By Someone 2025-11-1514:071 reply

        > Would it be better for my use case that it would have returned no data for the currency field? Don't think so.

        If there’s a decent chance it infers the wrong currency, potentially one where the value of each unit is a few units of scale larger or smaller than that of IDR, it might be better to not infer it.

        • By rtpg 2025-11-1523:03

          I think most tools in this space do the "infer a bunch of data and show it to the user for confirmation", which lowers the pain of a miss here.

      • By otabdeveloper4 2025-11-1515:091 reply

        > Would it be better for my use case that it would have returned no data for the currency field?

        Almost certainly yes.

        • By DangitBobby 2025-11-1516:351 reply

          Except in setups where you always check its work, and the effort from the 5% of the time you have to correct the currency is vastly outweighed by the effort saved the other 95% of the time. Pretty common situation.

          • By otabdeveloper4 2025-11-179:301 reply

            Even in those setups it's better to leave the currency field blank instead of hallucinating something.

            • By yomismoaqui 2025-11-1818:33

              You have the option to prompt it to do what you say. Of course it will not be 100% deterministic, but that's what evals are for.

    • By YeGoblynQueenne 2025-11-1512:351 reply

      [flagged]

      • By nopinsight 2025-11-1513:182 reply

        The comment above seems to violate several HN guidelines. Curious, I asked GPT and Gemini which ones stood out. Both replied with the same top three:

        https://news.ycombinator.com/newsguidelines.html

        They are:

        1. “Be kind. Don't be snarky. … Edit out swipes.”

        2. “Please don't sneer, including at the rest of the community.”

        3. “Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.”

        • By qchris 2025-11-1514:171 reply

          I'd be interested in seeing these guidelines updated to include "don't re-post the output of an LLM" to reduce comments of this sort.

          I don't really feel like comments with LLM output as the primary substance meet the bar of "thoughtful and substantive", and (ironically, in this instance) they could actually be used as a good example of shallow dismissal, since you, a human, didn't actually provide an opinion or take a stance either way that I could use to begin a good-faith engagement on the topic.

          • By nopinsight 2025-11-1514:43

            My comment above serves as a covert commentary on the utility of current frontier LLMs, which imo can often generate higher-quality responses than some HN comments. (And yes, I did agree with their responses above.)

            I enjoy the recursiveness of it all. Perhaps I should have said it outright.

        • By trial3 2025-11-1514:221 reply

          genuinely, why is your response to being curious to ask two different LLMs to explain something to you?

          the list of guidelines has 18 items in it. did you actually need them to interpret it? or is it, perhaps, you couldn’t resist a little sneering yourself?

  • By roywiggins 2025-11-153:006 reply

    My task today for LLMs was "can you tell if this MRI brain scan is facing the normal way", and the answer was: no, absolutely not. Opus 4.1 succeeds more than chance, but still not nearly often enough to be useful. They all cheerfully hallucinate the wrong answer, confidently explaining the anatomy they are looking for, but wrong. Maybe Gemini 3 will pull it off.

    Now, Claude did vibe code a fairly accurate solution to this using more traditional techniques. This is very impressive on its own, but I'd hoped to be able to just shovel the problem into the VLM and be done with it. It's kind of crazy that we have "AIs" that can't tell even roughly what the orientation of a brain scan is (something a five-year-old could probably learn to do) but can vibe code something using traditional computer vision techniques to do it.

    I suppose it's not too surprising: a visually impaired programmer might find it impossible to do reliably themselves but would code up a solution. Still, it's weird!

    • By IanCal 2025-11-1514:03

      Most models don’t have good spatial information from the images. Gemini models do preprocessing and so are typically better for that. It depends a lot on how things get segmented though.

    • By chrischen 2025-11-153:461 reply

      But these models are more like generalists no? Couldn’t they simply be hooked up to more specialized models and just defer to them the way coding agents now use tools to assist?

      • By roywiggins 2025-11-1514:271 reply

        There would be no point in going via an LLM then, if I had a specialist model ready I'd just invoke it on the images directly. I don't particularly need or want a chatbot for this.

        • By chrischen 2025-11-1618:361 reply

          Current LLMs are doing this for coding, and it's very effective. They delegate to tool calls, but a specialized model can just be thought of as another tool. The LLM can be weak in some stuff handled by simple shell scripts or utilities, but strong in knowing what scripts/commands to call. For example, doing math via the model natively may be inaccurate, but the model may know to write the code to do the math. An LLM can automate a higher level of abstraction, in the same way a manager or CEO might delegate tasks to specialists.

          • By roywiggins 2025-11-171:59

            In this case I'm building a batch workflow: images come in, images get analyzed through a pipeline, images go into a GUI for review. The idea of using a VLM was just to avoid hand-building a solution, not because I actually want to use it in a chatbot. It's just interesting that a generalist model that has expert-level handwriting recognition completely falls apart on a different, but much easier, task.

    • By lern_too_spel 2025-11-1515:52

      This might be showing bugs in the training data. It is common to augment image data sets with mirroring, which is cheap and fast.

    • By hopelite 2025-11-153:401 reply

      What is the “normal” way? Is that defined in a technical specification? Did you provide the definition/description of what you mean by “normal”?

      I would not have expected a language model to perform well on what sounds like a computer vision problem. Even if it were agentic: just as a five-year-old could learn how to do it, as you imply, so too an AI system would need to be trained, or at the very least be provided with a description of what it is looking at.

      Imagine you took an MRI brain scan back in time and showed it to a medical Doctor in even the 1950s or maybe 1900. Do you think they would know what the normal orientation is, let alone what they are looking at?

      I am a bit confused and also interested in how people are interacting with AI in general, it really seems to have a tendency to highlight significant holes in all kinds of human epistemological, organizational, and logical structures.

      I would suggest maybe you think of it as a kind of child, and with that, you would need to provide as much context and exact detail about the requested task or information as possible. This is what context engineering (are we still calling it that?) concerns itself with.

      • By roywiggins 2025-11-1514:17

        The models absolutely do know what the standard orientation is for a scan. They respond extensively about what they're looking for and what the correct orientation would be, more or less accurately. They are aware.

        They then give the wrong answer, hallucinating anatomical details in the wrong place, etc. I didn't bother with extensive prompting because it doesn't evince any confusion on the criteria, it just seems to not understand spatial orientations very well, and it seemed unlikely to help.

        The thing is that it's very, very simple: an axial slice of a brain is basically egg-shaped. You can work out whether it's pointing vertically (i.e., nose pointing towards the top of the image) or horizontally by looking at it. LLMs will insist it's pointing vertically when it isn't. It's an easy task for someone with eyes.

        Essentially all images an LLM will have seen of brains will be in this orientation, which is either a help or a hindrance, and I think in this case a hindrance: it's not that it's seen lots of brains and doesn't know which are correct, it's that it has only ever seen them in the standard orientation and it can't see the trees for the forest, so to speak.

    • By fragmede 2025-11-1517:19

      And then, in a different industry, one that has physical factories, there's this obsession with getting really good at making the machine that makes the machine (the product) as the route to success. So it's funny that LLMs being able to write programs to do the thing you want is seen as a failure here.

    • By moritonal 2025-11-1511:231 reply

      That's a fairly unfair comparison. Did you include in the prompt a basic set of instructions about which way is "correct" and what to look for?

      • By roywiggins 2025-11-1514:17

        I didn't give a detailed explanation to the model, but I should have been more clear: they all seemed to know what to look for, they wrote explanations of what they were looking for, which were generally correct enough. They still got the answer wrong, hallucinating the locations of the anatomical features they insisted they were looking at.

        It's something that you can solve by just treating the brain as roughly egg-shaped and working out which way the pointy end is, or looking for the very obvious bilateral symmetry. You don't really have to know what any of the anatomy actually is.
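        For what it's worth, that heuristic really is only a few lines of classical image processing. A rough sketch of the "which way does the long axis run" part (hypothetical helper, numpy only; picking out which end is the pointy one takes a bit more, e.g. comparing foreground widths near each end of the axis):

          import numpy as np

          def long_axis_is_vertical(slice_2d: np.ndarray) -> bool:
              """Rough check: does the brain's long axis run top-to-bottom in the image?"""
              thresh = slice_2d.mean()                  # crude background/foreground split
              ys, xs = np.nonzero(slice_2d > thresh)    # coordinates of "head" pixels
              coords = np.stack([ys, xs], axis=1).astype(float)
              coords -= coords.mean(axis=0)
              cov = np.cov(coords, rowvar=False)        # 2x2 covariance of the blob
              eigvals, eigvecs = np.linalg.eigh(cov)
              major = eigvecs[:, np.argmax(eigvals)]    # direction of the long axis (dy, dx)
              return abs(major[0]) > abs(major[1])      # more vertical than horizontal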
