Comments

  • By DanMcInerney 2025-06-1021:096 reply

    I'm really hoping GPT-5 is a larger jump in metrics than the last several releases we've seen, like Claude 3.5 to Claude 4 or o3-mini-high to o3-pro. Although I will preface that with the fact that I've been building agents for about a year now, and despite the benchmarks only showing slight improvement, I have seen that each new generation feels noticeably better at exactly the same tasks I gave the previous generation.

    It would be interesting if there were a model that was specifically trained on task-oriented data. It's my understanding they're trained on all data available, but I wonder if a model can be fine-tuned or given some kind of reinforcement learning on breaking down general tasks into specific implementations. Essentially an agent-specific model.

    • By codingwagie 2025-06-1021:236 reply

      I'm seeing big advances that aren't shown in the benchmarks: I can simply build software now that I couldn't build before. The level of complexity that I can manage and deliver is higher.

      • By IanCal 2025-06-1114:38

        A really important thing is the distinction between performance and utility.

        Performance can improve linearly while utility is massively jumpy. For some people and tasks, performance keeps improving but stays "interesting but pointless" until it hits some threshold, and then suddenly you can do things with it.

      • By shmoogy 2025-06-1021:28

        Yeah I kind of feel like I'm not moving as fast as I did, because the complexity and features grow - constant scope creep due to moving faster.

      • By protocolture 2025-06-123:11

        I am finding that my ability to use it to code aligns almost perfectly with the increase in context window size.

      • By kevinqi 2025-06-124:24

        yeah, the benchmarks are just a proxy. o3 was a step change where I started to really be able to build stuff I couldn't before

      • By alightsoul 2025-06-110:371 reply

        Mind giving some examples?

        • By motorest 2025-06-115:022 reply

          Not OP, but a couple of days ago I managed to vibe-code my way through a small app that pulled data from a few services and ran a few validation checks. By itself it's not very impressive, but my input was literally "this is what the responses from endpoints A, B and C look like. This field included somewhere in A must appear somewhere in the response from B, and the response from C must feature this and that from responses A and B. If the responses include links, check that they exist". To my surprise, it generated everything in one go. No retries or Agent-mode churn needed. In the not-so-distant past this would have required progressing through smaller steps, and I had to write tests to nudge Agent mode not to mess up. Not today.
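
          The generated script was roughly this shape (a minimal sketch from memory; the endpoint URLs and field names below are made-up placeholders, not the real ones):

            import requests

            # Placeholder endpoints; the real ones were internal services.
            ENDPOINT_A = "https://example.com/api/a"
            ENDPOINT_B = "https://example.com/api/b"
            ENDPOINT_C = "https://example.com/api/c"

            def fetch_json(url):
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
                return resp.json()

            def collect_links(obj):
                # Recursively gather anything that looks like a URL from a JSON payload.
                if isinstance(obj, dict):
                    return [u for v in obj.values() for u in collect_links(v)]
                if isinstance(obj, list):
                    return [u for item in obj for u in collect_links(item)]
                if isinstance(obj, str) and obj.startswith(("http://", "https://")):
                    return [obj]
                return []

            def main():
                a = fetch_json(ENDPOINT_A)
                b = fetch_json(ENDPOINT_B)
                c = fetch_json(ENDPOINT_C)

                # Cross-checks between the responses (field names are hypothetical).
                assert a["record_id"] == b["source_id"], "field from A not present in B"
                assert c["origin_id"] == a["record_id"], "C does not reference A"
                assert c["batch"] == b["batch"], "C does not reference B"

                # Verify that any links in the responses actually resolve.
                dead = [u for u in collect_links(a) + collect_links(b) + collect_links(c)
                        if requests.head(u, timeout=10, allow_redirects=True).status_code >= 400]
                assert not dead, f"dead links: {dead}"
                print("all checks passed")

            if __name__ == "__main__":
                main()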

          • By corysama 2025-06-123:13

            I’m wrapping up doing literally the same thing. I did it step-by-step. But, for me there was also a process of figuring out how it should work.

          • By alightsoul 2025-06-116:161 reply

            what tools did you use?

            • By motorest 2025-06-1110:021 reply

              > what tools did you use?

              Nothing fancy. Visual Studio Code + Copilot, agent mode, a couple prompt files, and that's it.

              • By munksbeer 2025-06-129:33

                Do you mind me asking which language you use and whether you have any esoteric constraints in the apps you build? We use Java in a monorepo, and have a fully custom-rolled framework on top of which we build our apps. Do you find vibe coding works OK with those sorts of constraints, or do you just end up with a generic app?

      • By iLoveOncall 2025-06-127:382 reply

        Okay, but this has everything to do with the tooling and nothing to do with the models.

        • By efunnekol 2025-06-1212:26

          I mostly disagree with this.

          I have been using 'aider' as my go to coding tool for over a year. It basically works the same way that it always has: you specify all the context and give it a request and that goes to the model without much massaging.

          I can see a massive improvement in results with each new model that arrives. I can do so much more with Gemini 2.5 or Claude 4 than I could do with earlier models and the tool has not really changed at all.

          I will agree that for the casual user, the tools make a big difference. But if you took the tool of today and paired it with a model from last year, it would go in circles

        • By mofeien 2025-06-127:471 reply

          Can you explain why?

          • By iLoveOncall 2025-06-128:332 reply

            You can write projects with LLMs thanks to tools that can analyze your local project's context, and those tools didn't exist a year ago.

            You could use Cursor, Windsurf, Q CLI, Claude Code, whatever else with Claude 3 or even an older model and you'd still get usable results.

            It's not the models which have enabled "vibe coding", it's the tools.

            An additional proof of that is that the new models focus more and more on coding in their releases, and other fields have not benefited at all from the supposed model improvements. That wouldn't be the case if improvements were really due to the models and not the tooling.

            • By eru 2025-06-129:301 reply

              You need a certain quality of model to make 'vibe coding' work. For example, I think even with the best tooling in the world, you'd be hard pressed to make GPT 2 useful for vibe coding.

              • By iLoveOncall 2025-06-129:462 reply

                I'm not claiming otherwise. I'm just saying that people say "look what we can do with the new models" when they're completely ignoring the fact that the tooling has improved a hundredfold (or rather, there was no tooling at all and now there is).

                • By signatoremo 2025-06-1211:531 reply

                  That contradicts what you said earlier -- "this has everything to do with the tooling and nothing to do with the models".

                  • By iLoveOncall 2025-06-1217:06

                    Clearly nobody is talking about GPT-2 here, but I posit that you would have a perfectly reasonable "vibe coding" experience with models like the initial ChatGPT one, provided you have all the tools we have today.

                • By eru 2025-06-129:55

                  OK, no objections from me there.

            • By broast 2025-06-1211:571 reply

              ChatGPT itself has gotten much better at producing and reading code since a year ago, in my experience.

              • By pyman 2025-06-1311:19

                They're using a specific model for that, and since they can't access private GitHub repos the way Microsoft can, they rely on code shared by devs, which keeps growing every month.

    • By energy123 2025-06-1021:301 reply

      That would require AIME 2024 going above 100%.

      There was always going to be diminishing returns in these benchmarks. It's by construction. It's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace.

      Benchmark space is just a proxy for what we care about, but don't confuse it for the actual destination.

      If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch and observe greater than linear improvements, and forget that these easier benchmarks exist.

      • By croddin 2025-06-1022:011 reply

        There is still plenty of room for growth on the ARC-AGI benchmarks. ARC-AGI 2 is still <5% for o3-pro and ARC-AGI 1 is only at 59% for o3-pro-high:

        "ARC-AGI-1: * Low: 44%, $1.64/task * Medium: 57%, $3.18/task * High: 59%, $4.16/task

        ARC-AGI-2: * All reasoning efforts: <5%, $4-7/task

        Takeaways: * o3-pro in line with o3 performance * o3's new price sets the ARC-AGI-1 Frontier"

        - https://x.com/arcprize/status/1932535378080395332

        • By saberience 2025-06-1022:302 reply

          I’m not sure the ARC-AGI ones are interesting benchmarks: for one, they are image-based, and for two, most people I show them to have issues understanding them, and in fact I had issues understanding them.

          Given that the models don’t even see the versions we get to see, it doesn’t surprise me they have issues with these. It’s not hard to make benchmarks so hard that neither humans nor LLMs can do them.

          • By nipah 2025-06-1121:173 reply

            "most people I show them too have issues understanding them, and in fact I had issues understanding them" ??? those benchmarks are so extremely simple they have basically 100% human approval rates, unless you are saying "I could not grasp it immediately but later I was able to after understanding the point" I think you and your friends should see a neurologist. And I'm not mocking you, I mean seriously, those are tasks extremely basic for any human brain and even some other mammals to do.

            • By viraptor 2025-06-1210:191 reply

              > so extremely simple they have basically 100% human approval rates

              Are you thinking of a different set? Arc-agi-2 has average 60% success for a single person and questions require only 2 out of 9 correct answers to be accepted. https://docs.google.com/presentation/d/1hQrGh5YI6MK3PalQYSQs...

              > and even some other mammals to do.

              No, that's not the case.

              • By nipah 2025-06-1214:101 reply

                No, I think I saw the graphs on someone's channel, but maybe I misinterpreted the results. To be fair, though, my point never depended on 100% of the participants being right 100% of the questions; there are innumerable factors that could affect your performance on those tests, including the pressure. The AI also had access to the same lenient conventions, so it should be "fair" in this sense.

                Either way, there's something fishy about this presentation. It says "ARC-AGI-1 WAS EASILY BRUTE-FORCIBLE", but when o3 initially "solved" most of it, the co-founder of ARC Prize said: "Despite the significant cost per task, these numbers aren't just the result of applying brute force compute to the benchmark. OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain." He was saying confidently that it would not be a result of brute-forcing the problems. And it was not the first time: "ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training."

                Now they are saying ARC-AGI-2 is not brute-forcible, so what is happening there? They didn't provide any reasoning for why one was brute-forcible and the other is not, nor for how they are so sure about that. They "recognized" before that it could be brute-forced, but in a much weaker way, by explicitly stating it would need "unlimited resources and time" to solve. And they are using the non-brute-forceability in this presentation as a point in its favor.

                --- Also, I mentioned mammals because those problems are of a kind that mammals, and even other animals, need to solve in the real world in a variety of situations. I'm not saying that they would literally be able to take the test and solve it, nor that they would understand it is a test, but that they have to solve problems of a similar nature in reality. Naturally this point has its own limits, but it's not as easily dismissed as you tried to do.

                • By viraptor 2025-06-1215:491 reply

                  > my point never depended on 100% of the participants being right 100% of the questions

                  You told someone that their reasoning is so bad they should get checked by a doctor. Because they didn't find the test easy, even though it averages 60% score per person. You've been a dick to them while significantly misrepresenting the numbers - just stop digging.

                  • By nipah 2025-06-1215:57

                    The second test scores 60%; the first was way higher. And I specifically said "unless you are saying 'I could not grasp it immediately but later I was able to after understanding the point', I think you and your friends should see a neurologist", to which this person did not respond. I saw the tests and solved some; I suspect the variability here is more a question of methodology than an inherent problem with those people. I also never stated that my point depended on those people scoring 100% on the tests. Even if it is in fact extremely easy (and it is: the objective of the benchmark is literally to make tests that most humans can easily beat but that are hard for an AI), variability will still exist and people with different perceptions will skew the results; this is expected. "Significantly misrepresenting the numbers" is also a stretch: I only mentioned the numbers one time in my point, and most of it was about the inherent nature (or at least the intended nature) of the tests.

                    So, at the extreme, if he was not able to understand them at all, and this was not just a matter of initially grasping the problem, my point was that this could indicate a neurological or developmental problem, given the nature of the tasks. It's not a question of "you need to get all of them right"; his point was that he was unable to understand them, that they confused him at the level of comprehension.

            • By saberience 2025-06-1210:031 reply

              lol, 100% approval rates? No, they don’t.

              Also, mammals? What mammals could even understand we were giving them a test?

              Have you seen them or shown them to average people? I’m sure the people who write them understand them, but if you show these problems to average people on the street they are completely clueless.

              This is a classic case of some PhD AI guys making a benchmark and not really considering what average people are capable of.

              Look, these insanely capable AI systems can’t do these problems but the boys in the lab can do them; what a good benchmark.

              • By nipah 2025-06-1214:11

                quoting my own previous response: > Also, I mentioned mammals because those problems are of a kind that mammals, and even other animals, need to solve in the real world in a variety of situations. I'm not saying that they would literally be able to take the test and solve it, nor that they would understand it is a test, but that they have to solve problems of a similar nature in reality. Naturally this point has its own limits, but it's not as easily dismissed as you tried to do.

                ---

                > Have you seen them or shown them to average people? I’m sure the people who write them understand them but if you show these problems to average people in the street they are completely clueless.

                I can show them to people in my family; I'll do it today and come back with the answer. It's the best way of testing that out.

            • By clbrmbr 2025-06-127:291 reply

              You may be of above-average intelligence. Those challenges are like classic IQ tests, and I bet performance on them has a significant distribution among humans.

              • By achierius 2025-06-128:122 reply

                No, they've done testing against samples from the general population.

                • By yorwba 2025-06-1210:25

                  The ARC-AGI-2 paper https://arxiv.org/pdf/2505.11831#figure.4 uses a non-representative sample, success rate differs widely across participants and "final ARC-AGI-2 test pairs were solved, on average, by 75% of people who attempted them. The average test-taker solved 66% of tasks they attempted. 100% of ARC-AGI-2 tasks were solved by at least two people (many were solved by more) in two attempts or less."

                  Certainly those non-representative humans are much better than current models, but they're also far from scoring 100%.

                • By cubefox 2025-06-1212:25

                  The original ARC-AGI test was much easier than the recent v2.

          • By HDThoreaun 2025-06-1113:181 reply

            ARC-AGI is the closest any widely used benchmark comes to an IQ test; it's straight logic/reasoning. Looking at the problem set, it's hard for me to choose a better benchmark for "when this is better than humans, we have AGI".

            • By saberience 2025-06-1210:151 reply

              There are humans who cannot do ARC-AGI, though, so how does an LLM not being able to do it mean that LLMs don’t have general intelligence?

              LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.

              But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?

              That must mean most humans on this planet aren’t generally intelligent too.

              • By HDThoreaun 2025-06-1210:362 reply

                > LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.

                I don't think memorizing stuff is the same as being smart. https://en.wikipedia.org/wiki/Chinese_room

                > But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?

                Yes. Being intelligent is about recognizing patterns, and that's what ARC-AGI tests. It tests the ability to learn. A lot of people are not very smart.

                • By ben_w 2025-06-1213:33

                  > I don't think memorizing stuff is the same as being smart. https://en.wikipedia.org/wiki/Chinese_room

                  I agree. The problem I have with the Chinese Room thought experiment is this: just as the human who mechanically reads books to answer questions they don't understand does not themselves know Chinese, likewise no neuron in the human brain knows how the brain works.

                  The intelligence, such as it is, is found in the process that generated the structure — of the translation books in the Chinese room, of the connectome in our brains, and of the weights in an LLM.

                  What comes out of that process is an artefact of intelligence, and that artefact can translate Chinese or whatever.

                  Because all current AI take a huge number of examples to learn anything, I think it's fair to say they're not particularly intelligent — but likewise, they can to an extent make up for being stupid by being stupid very very quickly.

                  But: this definition of intelligence doesn't really fit "can solve novel puzzles", as there's a lot of room for getting good at that by memorising lots of things that puzzle-creators tend to do.

                  And any mind (biological or synthetic) must learn patterns before getting started: the problem of induction* is that no finite number of examples is ever guaranteed to be sufficient to predict the next item in a sequence; there is always an infinite set of other possible solutions in general (though in reality bounded by 2^n, where n = the number of bits required to express the universe in any given state).

                  I suspect, but cannot prove, that biological intelligence learns from fewer examples for a related reason, that our brains have been given a bias by evolution towards certain priors from which "common sense" answers tend to follow. And "common sense" is often wrong, c.f. Aristotelian physics (never mind Newtonian) instead of QM/GR.

                  * https://en.wikipedia.org/wiki/Problem_of_induction

                • By saberience 2025-06-1313:283 reply

                  The LLMs are not just memorising stuff, though: they solve math and physics problems better than almost every person alive, problems they've never seen before. They write code which has never been seen before, better than something like 95% of active software engineers.

                  I love how the bar for "are LLMs smart" just goes up every few months.

                  In a year it will be, well, LLMs didn't create totally breakthrough new Quantum Physics, it's still not as smart as us... lol

                  • By HDThoreaun 2025-06-145:08

                    All code has been seen before; that's why LLMs are so good at writing it.

                    I agree things are looking up for LLMs, but the semantics do matter here. In my experience LLMs are still pretty bad at solving novel problems (like ARC-AGI-2), which is why I do not believe they have much intelligence. They seem to have started doing it a little, but are still mostly regurgitating.

                  • By bjourne 2025-06-1319:47

                    Well... there are two perspectives: LLMs are smarter than we thought, or people are stupider than we thought.

    • By XCSme 2025-06-1022:324 reply

      I remember the saying that from 90% to 99% is a 10x increase in accuracy, but 99% to 99.999% is a 1000x increase in accuracy.

      Even though it's a large 10% increase first, then only a 0.999% increase.

      • By zmgsabst 2025-06-124:001 reply

        Sometimes it’s nice to frame it the other way, eg:

        90% -> 1 error per 10

        99% -> 1 error per 100

        99.99% -> 1 error per 10,000

        That can help to see the growth in accuracy, when the numbers start getting small (and why clocks are framed as 1 second lost per…).

        • By XCSme 2025-06-1213:562 reply

          Still, for the human mind it doesn't make intuitive sense.

          I guess it's the same problem with the mind not intuitively grasping the concept of exponential growth and how fast it grows.

          • By pixl97 2025-06-1216:52

            The lily pad example of the lake being half full on the 29th day out of 30 is also a good one.

          • By XCSme 2025-06-1213:59

            ChatGPT quick explanation:

            Humans struggle with understanding exponential growth due to a cognitive bias known as *Exponential Growth Bias (EGB)*—the tendency to underestimate how quickly quantities grow over time. Studies like Wagenaar & Timmers (1979) and Stango & Zinman (2009) show that even educated individuals often misjudge scenarios involving doubling, such as compound interest or viral spread. This is because our brains are wired to think linearly, not exponentially, a mismatch rooted in evolutionary pressures where linear approximations were sufficient for survival.

            Further research by Tversky & Kahneman (1974) explains that people rely on mental shortcuts (heuristics) when dealing with complex concepts. These heuristics simplify thinking but often lead to systematic errors, especially with probabilistic or nonlinear processes. As a result, exponential trends—such as pandemics, technological growth, or financial compounding—often catch people by surprise, even when the math is straightforward.

      • By bobbylarrybobby 2025-06-124:33

        I think the proper way to compare probabilities/proportions is by odds ratios. 99:1 vs 99999:1. (So a little more than 1000x.) This also lets you talk about “doubling likelihood”, where twice as likely as 1/2=1:1 is 2:1=2/3, and twice as likely again is 4:1=4/5.
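
        Spelled out in code, just restating the arithmetic above:

          def odds(p):
              # Odds in favor: p : (1 - p), expressed as a single ratio.
              return p / (1 - p)

          print(odds(0.99))     # ~99     -> 99:1
          print(odds(0.99999))  # ~99999  -> 99999:1, a bit over 1000x the odds at 99%

          # "Twice as likely" in odds terms:
          print(odds(1 / 2))    # ~1 -> 1:1
          print(odds(2 / 3))    # ~2 -> 2:1
          print(odds(4 / 5))    # ~4 -> 4:1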

      • By jsjohnst 2025-06-123:42

        The saying goes:

        From 90% to 99% is a 10x reduction in error rate, but 99% to 99.999% is a 1000x decrease in error rates.

      • By AtlasBarfed 2025-06-1213:53

        What's the required computation power for those extra 9s? Is it linear, poly, or exponential?

        Imo we got to the current state by harnessing GPUs for a 10-20x boost over CPUs. Well, and cloud parallelization, which is ?100x?

        ASIC is probably another 10x.

        But the training data may need to vastly expand, and that data isn't going to 10x. It's probably going to degrade.

    • By littlestymaar 2025-06-126:562 reply

      > I'm really hoping GPT-5 is a larger jump in metrics than the last several releases we've seen, like Claude 3.5 to Claude 4 or o3-mini-high to o3-pro.

      This kind of expectation explains why there hasn't been a GPT-5 so far, and why we get a dumb numbering scheme instead for no reason.

      At least Claude eventually decided not to care anymore and release Claude 4 even if the jump from 3.7 isn't particularly spectacular. We're well into the diminishing returns at this point, so it doesn't really make sense to postpone the major version bump, it's not like they're going to make a big leap again anytime soon.

      • By indigo945 2025-06-129:06

        I have tried Claude 4.0 for agentic programming tasks, and it really outperforms Claude 3.7 by quite a bit. I don't follow the benchmarks - I find them a bit pointless - but anecdotally, Claude 4.0 can help me in a lot of situations where 3.7 would just flounder, completely misunderstand the problem and eventually waste more of my time than it saves.

        Besides, I do think that Google Gemini 2.0 and its massively increased token memory was another "big leap". And that was released earlier this year, so I see no sign of development slowing down yet.

      • By Voloskaya 2025-06-128:421 reply

        > We're well into the diminishing returns at this point

        Scaling laws, by definition, have always had diminishing returns because it's a power-law relationship with compute/params/data, but I am assuming you mean diminishing beyond what the scaling laws predict.

        Unless you know the scale of e.g. o3-pro vs GPT-4, you can't definitively say that.

        Because of that power law relationship, it requires adding a lot of compute/params/data to see a big jump, rule of thumb is you have to 10x your model size to see a jump in capabilities. I think OpenAI has stuck with the trend of using major numbers to denote when they more than 10x the training scale of the previous model.

        * GPT-1 was 117M parameters.

        * GPT-2 was 1.5B params (~10x).

        * GPT-3 was 175B params (~100x GPT-2 and exactly 10x Turing-NLG, the biggest previous model).

        After that it becomes more blurry as we switched to MoEs (and stopped publishing); the scaling laws for parameters apply to monolithic models, not really to MoEs.

        But looking at compute we know GPT-3 was trained on ~10k V100, while GPT-4 was trained on a ~25k A100 cluster, I don't know about training time, but we are looking at close to 10x compute.

        So to train a GPT-5-like model, we would expect ~250k A100, or ~150k B200 chips, assuming same training time. No one has a cluster of that size yet, but all the big players are currently building it.

        So OpenAI might just be reserving GPT-5 name for this 10x-GPT-4 model.
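
        As a toy illustration of that power-law relationship (the exponent and constant below are made up for readability, not the published scaling-law fits):

          # Kaplan-style scaling law: loss falls as a power of compute, L(C) = (C0 / C) ** alpha
          alpha, C0 = 0.05, 1.0

          def loss(compute):
              return (C0 / compute) ** alpha

          for c in (1, 10, 100, 1_000, 10_000):
              print(f"{c:>6}x compute -> loss {loss(c):.3f}")

          # Each 10x of compute multiplies the loss by the same constant factor (~0.89 here),
          # so every equal-feeling capability jump needs roughly 10x more resources than the last.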

        • By littlestymaar 2025-06-1318:35

          > but I am assuming you mean diminishing beyond what the scaling laws predict.

          You're assuming wrong; in fact, focusing on scaling laws underestimates the rate of progress, as there is also a steady stream of algorithmic improvements.

          But still, even though hardware and software progress, we are facing diminishing returns and that means that there's no reason to believe that we will see another leap as big as GPT-3.5 to GPT-4 in a single release. At least until we stumble upon radically new algorithms that reset the game.

          I don't think it makes any economic sense to wait until you have your “10x model” when you can release 2 or 3 incremental models in the meantime, at which point your “10x” becomes an incremental improvement in itself.

    • By avereveard 2025-06-129:35

      There's a new set of metrics that capture advances better than MMLU or its Pro version, but nothing is yet as standardized, and crucially very few have a hidden test set to keep gains from coming from benchmark-directed fine-tuning.

    • By jstummbillig 2025-06-1022:162 reply

      It's hard to be 100% certain, but I am 90% certain that the benchmarks leveling off, at this point, should tell us that we are really quite dumb and simply not very good at either using or evaluating the technology (yet?).

      • By motorest 2025-06-115:12

        > (...) at this point, should tell us that we are really quite dumb and simply not very good at either using or evaluating the technology (yet?).

        I don't know about that. I think it's mainly because nowadays LLMs can output very inconsistent results. In some applications they can generate surprisingly good code, but in the same session they can also misstep and shit the bed while following a prompt for small changes. For example, sometimes I still get responses that outright delete critical code. I'm talking about things like asking "extract this section of your helper method into a new method" and in response the LLM deletes the app's main function. This doesn't happen all the time, or even in the same session for the same command. How does one verify these things?

      • By alightsoul 2025-06-110:38

        either that or the improvements aren't as large as before.

  • By chad1n 2025-06-1021:035 reply

    The guys in the other thread who said that OpenAI might have quantized o3, and that's how they reduced the price, might be right. This o3-pro might be the actual o3-preview from the beginning, and the o3 might be just a quantized version. I wish someone would benchmark all of these models to check for drops in quality.

    • By simonw 2025-06-1021:128 reply

      That's definitely not the case here. The new o3-pro is slow - it took two minutes just to draw me an SVG of a pelican riding a bicycle. o3-preview was much faster than that.

      https://simonwillison.net/2025/Jun/10/o3-pro/

      • By teruakohatu 2025-06-123:02

        Do you think a cycling pelican is still a valid cursory benchmark? By now surely discussions about it are in the training set.

        There are quite a few on Google Image search.

        On the other hand they still seem to struggle!

      • By FergusArgyll 2025-06-110:002 reply

        Wow! pelican benchmark is now saturated

        • By esperent 2025-06-115:43

          Not until I can count the feathers, ask for a front view of the same pelican, then ask for it to be animated, all still using SVG.

        • By dtech 2025-06-127:34

          I wonder how much of that is because it's getting more and more included in training data.

          We now need to start using walruses riding rickshaws.

      • By CamperBob2 2025-06-1022:032 reply

        Would you say this is the best cycling pelican to date? I don't remember any of the others looking better than this.

        Of course by now it'll be in-distribution. Time for a new benchmark...

        • By jstummbillig 2025-06-1022:182 reply

          I love that we are in the timeline where we are somewhat seriously evaluating probably superhuman intelligence by its ability to draw an SVG of a cycling pelican.

          • By CamperBob2 2025-06-1022:251 reply

            I still remember my jaw hitting the floor when the first DALL-E paper came out, with the baby daikon radish walking a dog. How the actual fuck...? Now we're probably all too jaded to fully appreciate the next advance of that magnitude, whatever that turns out to be.

            E.g., the pelicans all look pretty cruddy including this one, but the fact that they are being delivered in .SVG is a bigger deal than the quality of the artwork itself, IMHO. This isn't a diffusion model, it's an autoregressive transformer imitating one. The wonder isn't that it's done badly, it's that it's happening at all.

            • By datameta 2025-06-124:43

              This makes me think of a reduction gear as a metaphor. At a high enough ratio, the torque is enormous but being put toward barely perceptible movement. There is the huge amount of computation happening to result in SVG that resembles a pelican on a bicycle.

          • By cdblades 2025-06-1112:441 reply

            I don't love that this is the conversation and when these models bake-in these silly scenarios with training data, everyone goes "see, pelican bike! super human intelligence!"

            The point is never the pelican. The point is that if a thing has information about pelicans, and has information about bicycles, then why can't it combine those ideas? Is it because it's not intelligent?

            • By CamperBob2 2025-06-1114:561 reply

              "I'm taking this talking dog right back to the pound. It told me to go long on AAPL. Totally overhyped"

              • By johnmaguire 2025-06-123:151 reply

                Just because it's impressive doesn't mean it has "super human intelligence" though.

                • By CamperBob2 2025-06-1218:33

                  Well, it certainly came up with a better-looking SVG pelican than this human could have.

        • By simonw 2025-06-1023:55

          I like the Gemini 2.5 Pro ones a little more: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

      • By AstroBen 2025-06-1022:19

        That's one good looking pelican

      • By torginus 2025-06-129:41

        This made me think of the 'draw a bike' experiment, where people were asked to draw a bike from memory, and were surprisingly bad at recreating how the parts fit together in a sensible manner:

        https://road.cc/content/blog/90885-science-cycology-can-you-...

        ChatGPT seems to perform better than most, but with notable missing elements (where's the chain, or the handlebars?). I'm not sure whether those are due to a lack of understanding or artistic liberties taken by the model.

      • By eru 2025-06-129:32

        Well, that might be more of a function of how long they let it 'reason' than anything intrinsic to the model?

      • By Terretta 2025-06-112:33

        > It's only available via the newer Responses API

        And in ChatGPT Pro.

    • By torginus 2025-06-129:09

      I've wondered if some kind of smart pruning is possible during evaluation.

      What I mean by that is: if a neuron implements a sigmoid function and its input weights are 10, 1, 2, 3, then if the first input is active, evaluating the other ones is practically pointless, since they barely change the result, which recursively means the inputs of the neurons feeding those skipped inputs are pointless as well.

      I have no idea how feasible or practical it is to implement such an optimization at full network scale, but I think it's interesting to think about.
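
      A toy sketch of the idea in plain Python (note it's only approximately safe: the skipped terms do shift the pre-activation slightly, and the shortcut breaks if the remaining weights are large, which is probably part of why doing this at full network scale is hard):

        import math

        def sigmoid(x):
            return 1.0 / (1.0 + math.exp(-x))

        def neuron_early_exit(inputs, weights, saturation=8.0):
            # Accumulate weighted inputs, but stop once the running pre-activation is
            # already deep in the sigmoid's flat region; the remaining terms can only
            # change the output by a negligible amount (an approximation, not an identity).
            total = 0.0
            for x, w in zip(inputs, weights):
                total += x * w
                if abs(total) > saturation:
                    break
            return sigmoid(total)

        # Weights 10, 1, 2, 3 with the first input active: after the first term the
        # pre-activation is already 10, so the remaining inputs are skipped.
        print(neuron_early_exit([1, 1, 1, 1], [10, 1, 2, 3]))  # ~0.99995
        print(sigmoid(10 + 1 + 2 + 3))                         # ~0.9999999, nearly identical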

    • By gkamradt 2025-06-1021:26

      o3-pro is not the same as the o3-preview that was shown in Dec '24. OpenAI confirmed this for us. More on that here: https://x.com/arcprize/status/1932535380865347585

    • By weinzierl 2025-06-1021:36

      Is there a way to figure out likely quantization from the output? I mean, does quantization degrade output quality in ways that are different from modifications of other model properties (e.g. size or distillation)?

    • By hapticmonkey 2025-06-126:242 reply

      What a great future we are building. If AI is supposed to run everything, everywhere... then there will be 2, maybe 3, AI companies. And nobody outside those companies knows how they work.

      • By eru 2025-06-131:29

        What makes you think so? So far, many new AI companies are sprouting and many of them seem to be able to roughly match the state-of-the-art very quickly. (But pushing the frontier seems to be harder.)

        From the evidence we have so far, it does not look like there's any natural monopoly (or even natural oligopoly) in AI companies. Just the opposite. Especially with open-weight models, or even more so with fully open-source models.

      • By jsjohnst 2025-06-1211:51

        > And nobody outside those companies knows how they work.

        I think you meant to say:

        And nobody knows how they work.

  • By manmal 2025-06-1020:234 reply

    The benchmarks don’t look _that_ much better than o3. Does that mean Pro models are just incrementally better than base models, or are we approaching the higher end of a sigmoid function, with performance gains leveling off?

    • By lhl 2025-06-1110:373 reply

      I've been using o3 extensively since release (and a lot of Deep Research). I also use a lot of Claude and Gemini 2.5 Pro (most of the time, for code, I'll let all of them go at it and iterate on my favorite results).

      So far I've only used o3-pro a bit today, and it's a bit too heavy to use interactively (fire it off, revisit in 10-15 minutes), but it seems to generate much cleaner/more well organized code and answers.

      I feel like the benchmarks aren't really doing a good job at capturing/reflecting capabilities atm. eg, while Claude 4 Sonnet appears to score about as well as Opus 4, in my usage Opus is always significantly better at solving my problem/writing the code I need.

      Beyond especially complex/gnarly problems, I feel like a lot of the different models are all good enough and it comes down to reliability. For example, I've stopped using Claude for work, basically because multiple times now it has completely eaten my prompts and even artifacts it generated. Also, it hits limits ridiculously fast (and does so even on network/resource failures).

      I use 4.1 as my workhorse for code interpreter work (creating graphs/charts w/ matplotlib, basic df stuff, converting tables to markdown) as it's just better integrated than the others and so far I haven't caught 4.1 transposing/having errors with numbers (which I've noticed w/ 4o and Sonnet).

      Having tested most of the leading-edge open and closed models a fair amount, 4.5 is still my current preferred model to actually talk to/make judgement calls with (particularly for translations). Again, not reflected in benchmarks, but 4.5 is the only model that gives me the feeling I had when first talking to Opus 3 (e.g., of actual fluid intelligence, and a pleasant personality that isn't overly sycophantic) - Opus 4 is a huge regression in that respect for me.

      (I also use Codex, Roo Code, Windsurf, and a few other API-based tools, but tbh, OpenAI's ChatGPT UI is generally better for how I leverage the models in my workflow.)

      • By petesergeant 2025-06-123:502 reply

        I wonder if we'll start to see artisanal benchmarks. You -- and I -- have preferred models for certain tasks. There's a world in which we start to see how things score on the "simonw chattiness index", and come to rely on smaller more specific benchmarks I think

        • By lhl 2025-06-124:49

          Yeah, I think personalized evals will definitely be a thing. Besides reviewing way too much Arena, WildChat and having now seen lots of live traces firsthand, there's a wide range of LLM usage (and preferences), which really don't match my own tastes or requirements, lol.

          For the past year or two, I've had my own personal 25 question vibe-check I've used on new models to kick the tires, but I think the future is something both a little more rigorous and a little more automated (something like LLM Jury w/ an UltraFeedback criteria based off of your own real world exchanges and then BTL ranked)? A future project...

        • By HDThoreaun 2025-06-1215:51

          I think its more likely that we move away from benchmarks and towards more of a traditional reviewer model. People will find LLM influencers whose takes they agree with and follow them to keep up with new models.

      • By manmal 2025-06-1114:101 reply

        Thanks for your input, very appreciated. Just in case you didn’t mean Claude Code, it’s really good in my experience and mostly stable. If something fails, it just retries and I don’t notice it much. Its autonomous discovery and tool use is really good and I‘m relying more and more on it.

        • By lhl 2025-06-124:431 reply

          For the Claude issues, I'm referring to the claude.ai frontend. While I use some Codex, Aider, and other agentic tools, I found Claude Code not to my taste - for my uses it tended to burn a lot of tokens and gave relatively mediocre results, but I know it works well for others, so YMMV.

          • By mwigdahl 2025-06-1213:06

            If you're happy with your current tools that's good, but if not, and if you haven't tried Claude Code recently, you might give it a retry. I'm not sure what all they've been changing, but it burns a lot fewer tokens for me on tasks now than it did when I first started using it, with better results.

    • By petesergeant 2025-06-123:452 reply

      I am starting to feel like hallucination is a fundamentally unsolvable problem with the current architecture, and is going to keep squeezing the benchmarks until something changes.

      At this point I don't need smarter general models for my work, I need models that don't hallucinate, that are faster/cheaper, and that have better taste in specific domains. I think that's where we're going to see improvements moving forward.

      • By OccamsMirror 2025-06-125:121 reply

        If you could actually teach these models things, not just in the current context but as learning that persists over time, that would alleviate a lot of the issues with hallucination. Imagine being able to say "that method doesn't exist, don't recommend it again", then give it the documentation and have it absorb that information permanently; that would fundamentally change how we interact with these models. But can that work for models hosted for everyone to use at once?

        • By petesergeant 2025-06-127:381 reply

          There are an almost infinite number of things that can be hallucinated, though. You can't maintain a list of scientific papers or legal cases that don't exist! Hallucinations (almost certainly) aren't specific falsehoods that need to be erased...

          • By jsjohnst 2025-06-1211:58

            The level of hallucination with o3 is no different from the level of hallucination from most (all?) human sources, in my experience. Yes, you definitely need to cross-check, but you need to do that for literally everything else too, so it feels a bit redundant to keep preaching it as if it were a failing of the model and not just an inherent property of all free sharing of information between two parties.

      • By varjag 2025-06-1210:111 reply

        The hallucination rate from o3 onward appears to be very low, to the point that I rarely have to check.

        • By petesergeant 2025-06-1211:12

          This doesn't match my experience, so if I were you I'd absolutely keep checking.

    • By dyauspitr 2025-06-1020:501 reply

      Don’t they have a full fledged version of o4 somewhere internally at this point?

      • By ankit219 2025-06-1021:06

        They do, it seems. o1 and o3 were based on the same base model; o4 is going to be based on a newer (and perhaps smarter) base model.

    • By bachittle 2025-06-1020:281 reply

      it's the same model as o3, just with thinking tokens turned up to the max.

      • By Tiberium 2025-06-1020:512 reply

        That's simply not true, it's not just "max thinking budget o3" just like o1-pro wasn't "max thinking budget o1". The specifics are unknown, but they might be doing multiple model generations and then somehow picking the best answer each time? Of course that's a gross simplification, but some assume that they do it this way.
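
        For what it's worth, the speculated pattern is something like best-of-n sampling with a picker on top. Purely a guess at the shape of it, with placeholder functions, not anything OpenAI has documented:

          import random

          def generate(prompt):
              # Placeholder for a single reasoning-model generation.
              return f"candidate answer {random.randint(0, 9)} for: {prompt}"

          def score(prompt, answer):
              # Placeholder for a grader/reward model (or a self-consistency vote).
              return random.random()

          def best_of_n(prompt, n=8):
              candidates = [generate(prompt) for _ in range(n)]
              return max(candidates, key=lambda ans: score(prompt, ans))

          print(best_of_n("prove the triangle inequality"))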

        • By firejake308 2025-06-1021:45

          > "We also introduced OpenAI o3-pro in the API—a version of o3 that uses more compute to think harder and provide reliable answers to challenging problems"

          Sounds like it is just o3 with higher thinking budget to me

        • By cdblades 2025-06-1020:531 reply

          > That's simply not true, it's not just "max thinking budget o3"

          > The specifics are unknown, but they might...

          Hold up.

          > but some assume that they do it this way.

          Come on now.

          • By MallocVoidstar 2025-06-1021:072 reply

            Good luck finding the tweet (I can't) but at least one OpenAI engineer has said that o1-pro was not just 'o1 thinking longer'.
