Gemini-2.5-pro-preview-06-05

2025-06-05 16:44 · 349 points, 230 comments · deepmind.google

Gemini 2.5 Pro is our most advanced model for complex tasks. With thinking built in, it showcases strong reasoning and coding capabilities.

| | Gemini 2.5 Pro | OpenAI o3 | OpenAI o4-mini | Claude Opus 4 | Grok 3 Beta | DeepSeek R1 |
|---|---|---|---|---|---|---|
| Input price ($/1M tokens, no caching) | $1.25 ($2.50 > 200k tokens) | $10.00 | $1.10 | $15.00 | $3.00 | $0.55 |
| Output price ($/1M tokens) | $10.00 ($15.00 > 200k tokens) | $40.00 | $4.40 | $75.00 | $15.00 | $2.19 |
| Reasoning & knowledge: Humanity's Last Exam (no tools) | 21.6% | 20.3% | 14.3% | 10.7% | — | 14.0%* |
| Science: GPQA diamond, single attempt | 86.4% | 83.3% | 81.4% | 79.6% | 80.2% | 81.0% |
| GPQA diamond, multiple attempts | — | — | — | 83.3% | 84.6% | — |
| Mathematics: AIME 2025, single attempt | 88.0% | 88.9% | 92.7% | 75.5% | 77.3% | 87.5% |
| AIME 2025, multiple attempts | — | — | — | 90.0% | 93.3% | — |
| Code generation: LiveCodeBench (1/1/2025–5/1/2025), single attempt | 69.0% | 72.0% | 75.8% | 51.1% | — | 70.5% |
| Code editing: Aider Polyglot | 82.2% (diff-fenced) | 79.6% (diff) | 72.0% (diff) | 72.0% (diff) | 53.3% (diff) | 71.6% |
| Agentic coding: SWE-bench Verified, single attempt | 59.6% | — | — | 72.5% | — | — |
| SWE-bench Verified, multiple attempts | 67.2% | 49.4% | 68.1% | 79.4% | — | 57.6% |
| Factuality: SimpleQA | 54.0% | 48.6% | 19.3% | — | 43.6% | 27.8% |
| Factuality: FACTS grounding | 87.8% | 69.6% | 62.1% | 77.7% | 74.8% | — |
| Visual reasoning: MMMU, single attempt | 82.0% | 82.9% | 81.6% | 76.5% | 76.0% | no MM support |
| MMMU, multiple attempts | — | — | — | — | 78.0% | no MM support |
| Image understanding: Vibe-Eval (Reka) | 67.2% | — | — | — | — | no MM support |
| Video understanding: VideoMMMU | 83.6% | — | — | — | — | no MM support |
| Long context: MRCR v2 (8-needle), 128k average | 58.0% | 57.1% | 36.3% | — | 34.0% | — |
| MRCR v2 (8-needle), 1M pointwise | 16.4% | no support | no support | no support | no support | no support |
| Multilingual performance: Global MMLU (Lite) | 89.2% | — | — | — | — | — |
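As a rough sanity check on how the per-token prices above turn into per-request cost, here is a small sketch. The function name, the request sizes, and the assumption that the higher ">200k tokens" rate is keyed off prompt length are illustrative, not from the article:

```python
# Gemini 2.5 Pro prices from the table above, in dollars per 1M tokens.
# Assumption (hypothetical): the ">200k tokens" rate applies to both the
# input and output side once the prompt exceeds 200k tokens.
def gemini_25_pro_cost(input_tokens: int, output_tokens: int) -> float:
    long_context = input_tokens > 200_000
    in_rate = 2.50 if long_context else 1.25
    out_rate = 15.00 if long_context else 10.00
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A 10k-token prompt with a 2k-token answer is about 3 cents:
small = gemini_25_pro_cost(10_000, 2_000)    # 0.0125 + 0.02 = $0.0325
# A 500k-token prompt with an 8k-token answer is about $1.37:
large = gemini_25_pro_cost(500_000, 8_000)   # 1.25 + 0.12 = $1.37
```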


Comments

  • By johnfn 2025-06-05 17:02 (14 replies)

    Impressive seeing Google notch up another ~25 ELO on lmarena, on top of the previous #1, which was also Gemini!

    That being said, I'm starting to doubt the leaderboards as an accurate representation of model ability. While I do think Gemini is a good model, having used both Gemini and Claude Opus 4 extensively in the last couple of weeks I think Opus is in another league entirely. I've been dealing with a number of gnarly TypeScript issues, and after a bit Gemini would spin in circles or actually (I've never seen this before!) give up and say it can't do it. Opus solved the same problems with no sweat. I know that that's a fairly isolated anecdote and not necessarily fully indicative of overall performance, but my experience with Gemini is that it would really want to kludge on code in order to make things work, where I found Opus would tend to find cleaner approaches to the problem. Additionally, Opus just seemed to have a greater imagination? Or perhaps it has been tailored to work better in agentic scenarios? I saw it do things like dump the DOM and inspect it for issues after a particular interaction by writing a one-off playwright script, which I found particularly remarkable. My experience with Gemini is that it tries to solve bugs by reading the code really really hard, which is naturally more limited.

    Again, I think Gemini is a great model, I'm very impressed with what Google has put out, and until 4.0 came out I would have said it was the best.

    • By joshmlewis 2025-06-05 17:38 (7 replies)

      o3 is still my favorite over even Opus 4 in most cases. I've spent hundreds of dollars on AI code gen tools in the last month alone and my ranking is:

      1. o3 - it's just really damn good at nuance, getting to the core of the goal, and writing the closest thing to quality production-level code. The only negatives are its cutoff window and cost, especially given its love of tools. That's not usually a big deal for the Rails projects I work on, but sometimes it is.

      2. Opus 4 via Claude Code - also really good and is my daily driver because o3 is so expensive. I will often have Opus 4 come up with the plan and first pass and then let o3 critique and make a list of feedback to make it really good.

      3. Gemini 2.5 Pro - haven't tested this latest release but this was my prior #2 before last week. Now I'd say it's tied or slightly better than Sonnet 4. Depends on the situation.

      4. Sonnet 4 via Claude Code - it's not bad but needs a lot of coaching and oversight to produce really good code. It will definitely produce a lot of code if you just let it go do its thing, but you don't get quality, concise, thoughtful code without more specific prompting and revisions.

      I'm also extremely picky and a bit OCD with code quality and organization in projects down to little details with naming, reusability, etc. I accept only 33% of suggested code based on my Cursor stats from last month. I will often revert and go back to refine the prompt before accepting and going down a less than optimal path.

      • By spaceman_2020 2025-06-05 18:22 (1 reply)

        I use o3 a lot for basic research and analysis. I also find the deep research tool really useful for even basic shopping research

        Like just today, it made a list of toys for my toddler that fit her developmental stage and play style. Would have taken me 1-2 hrs of browsing multiple websites otherwise

        • By jml78 2025-06-06 0:04

          Gemini deep research runs circles around OpenAI deep research. It goes way deeper and uses way more sources.

      • By throwaway314155 2025-06-05 17:41 (3 replies)

        It's interesting you say that because o3, while being a considerable improvement over OpenAI's other models, still doesn't match the performance of Opus 4 and Gemini 2.5 Pro by a long shot for me.

        However, o3 resides in the ChatGPT app, which is still superior to the other chat apps in many ways, particularly the internet search implementation works very well.

        • By svachalek 2025-06-05 17:52 (5 replies)

          If you're coding through chat apps you're really behind the times. Try an agent IDE or plugin.

          • By joshmlewis 2025-06-05 18:07 (2 replies)

            Yeah, exactly. For everyone who might not know, the chat apps add lots of complex system prompting to handle and shape personality, tone, general usability, etc. IDEs also do this (Claude Code is about the closest to a "bare" model that you can get), but they are at least guiding the model's behavior to be really good at coding tasks. Another reason is the Agent feature that IDEs have had for a few months now, which gives the model the ability to search/read/edit files across your codebase. You may not like the idea of this and it may feel like losing control, but it's the future. After months of using it I've learned how to get it to do what I want, but I think a lot of people who try it once and stop get frustrated that it does something dumb and just assume it's not good. That's a practice and skill problem, not a model problem.

            • By jona777than 2025-06-05 23:51

              This has been my experience. It has been something I’ve had to settle into. After some reps, it is becoming more difficult to imagine going back to regular old non-assisted coding sessions that aren’t purely for hobby.

              Your model rankings are spot on. I’m hesitant to make the jump to top tier premium models as daily drivers, so I hang out with sonnet 4 and/or Gemini 2.5 pro for most of the day (max mode in Cursor). I don’t want to get used to premium quality coming that easy, for some reason. I completely align with the concise, thoughtful code being worth it though. I’m having to do that myself using tier 2 models. I still use o3 periodically for getting clarity of thought or troubleshooting gnarly bugs that Claude gets caught looping on.

              How would you compare Cursor to Claude Code? I’m yet to try the latter.

            • By Workaccount2 2025-06-05 18:24 (3 replies)

              IDEs are intimidating to non-tech people.

              I'm surprised there isn't a VibeIDE yet that is purpose-built to make it possible for your grandmother to execute code output by an LLM.

              • By dragonwriter 2025-06-05 18:29

                > I'm surprised there isn't a VibeIDE yet that is purpose-built to make it possible for your grandmother to execute code output by an LLM.

                The major LLM chat interfaces often have code execution built in, so there kind of is, it just doesn't look like what an SWE thinks of as an IDE.

              • By joshmlewis 2025-06-05 18:36

                I have not used them, but I feel like there are tools like Replit, Lovable, etc. that are for that audience. I totally agree IDEs are intimidating for non-technical people though. Claude Code is pretty cool in that way: it's one command to install and pretty easy to get started with.

              • By dieortin 2025-06-06 23:52

          • By joshvm 2025-06-05 21:28

            An important caveat here: yes, for coding. Apps are fine for coming up with one-liners, or doing other research. I haven't found the quality of IDE-based code to be significantly better than what ChatGPT would suggest, but it's very useful to ask questions when the model has access to the code and can prompt you to run tests which rely on local data (or even attached hardware). I really don't trust YOLO mode, so I manually approve terminal calls.

            My impression (with Cursor) is that you need to practice some sort of LLM-first design to get the best out of it. Either vibe code your way from the start, or be brutal about limiting what changes the agent can make without your approval. It does force you to be very atomic about your requests, which isn't a bad thing, but writing a robust spec for the prompt is often slower than writing the code by hand and asking for a refactor. As soon as kipple, for lack of a better word, sneaks into the code, it's a reinforcing signal to the agent that it can add more.

            It's definitely worth paying the $20 and playing with a few different clients. The rabbit hole is pretty deep and there's still a ton of prompt engineering suggestions from the community. It encourages a lot of creative guardrails, like using pre-commit to provide negative feedback when the model does something silly like try to write a 200 word commit message. I haven't tried JetBrains' agent yet (Junie), but that seems like it would be a good one to explore as well since it presumably integrates directly with the tooling.

          • By throwaway314155 2025-06-05 20:05 (1 reply)

            I think this is debatable. But I've used Cursor and various extensions for VS Code. They're all fine (but cursor can fuck all the way off for stealing the `code` shell integration from VS Code) but you don't _need_ an IDE as Claude Code has shown us (currently my primary method of vibe coding).

            It's mostly about the cost though. Things are far more affordable in the various apps/subscriptions. Token-priced APIs can get very expensive very quickly.

            • By hirako2000 2025-06-05 23:12 (1 reply)

              We are trading tokens and mental health for time?

              I used Cursor well over a year ago. It gave me a headache. It was very immature. I used Cursor again more recently: the headache intensity increased. It's not Cursor, it's the senseless loops hoping for the LLM to spit out something somewhat correct. Revisiting the prompt. Trying to become an elite in language protocols because we need that machine to understand us.

              Leaving aside the headache and its side effects: it isn't clear we hadn't already maxed out on productivity-tool efficiency. Autocomplete. Indexed, searchable docs on a second screen rather than having to turn the pages of some reference book. Etc., etc.

              I'm convinced at this stage that we've already traded too far. So far beyond the optimal balance that these aren't diminishing returns; it's an absolute decline.

              Engineers need to spend more time thinking.

              I'm convinced that engineers, if they were to choose, would throw this thing out and make space for more drawing boards, would take a 5-minute Solitaire break every hour. Or take a walk.

              For some reason, the constant pressure to go faster eventually takes its toll.

              It feels right to see thousands of lines of code written up by this thing. It feels aligned with the inadequate way we've been measured.

              Anyway. It can get expensive and this is by design.

              • By throwaway314155 2025-06-05 23:53 (1 reply)

                > We are trading tokens and mental health for time?

                I have bipolar disorder. This makes programming incredibly difficult for me at times. Almost all the recent improvements to code generation tooling have been a tremendous boon for me. Coding is now no longer this test of how frustrated I can get over the most trivial of tasks. I just ask for what I want precisely and treat responses like a GitHub PR where mistakes may occur. In general (and for the trivial tasks I'm describing) Claude Code will generate correct, good code (I inform it very precisely of the style I want, and tell it to use linters/type-checkers/formatters after making changes) on the first attempt. No corrections needed.

                tl;dr - It's been nothing but a boon for this particular mentally ill person.

                • By hirako2000 2025-06-06 20:21

                  If your handicap makes coding difficult, perhaps another profession would suit you better.

                  Now if AI assistance allows you to perform well, then that is a different story and I take my advice back, of course.

                  There is a lot to say, positive things, about how LLMs enable people to perform tasks that would otherwise be impossible for them. Whether due to handicaps or simply lacking the abilities, or the opportunity to train.

                  My comment was on the impact on "healthy" individuals, who remain the majority of the population. And I only spoke for myself; I have no clue, maybe it is just me or due to how I use the thing. Thanks for sharing your experience though. I had not considered that what might be a concern for the majority might very well be an enabler for others.

          • By baw-bag 2025-06-05 19:55 (1 reply)

            I am really struggling with this. I tried Cline with both OpenAI and Claude, with very weird results. Often burning through credits to get nowhere, or just running out of context. I just got Cursor for a try, so I can't say anything on that yet.

            • By joshmlewis 2025-06-05 19:58 (2 replies)

              It's a skill that takes some persistence and trial and error. Happy to chat with you about it if you want to send me an email.

              • By Vetch 2025-06-05 22:20

                There is skill to it but that's certainly not the only relevant variable involved. Other important factors are:

                Language: Syntax errors rise, and a common form is the syntax of a more common language bleeding through.

                Domain: Less so than what humans deem complex, quality is more strongly controlled by how much code and documentation there is for a domain. Interestingly, in a less common subdomain a model will often revert to a more common approach (for example, working on shaders for a game that takes place in a cylindrical geometry requires a lot more hand-holding than on a plane). It's usually not that they can't do it, but that they require much more involved prompting to get the context appropriately set up, and then managing the drift back to default, more common patterns. Related are decisions with long-term consequences. LLMs are pretty weak at these. In humans this one comes with experience, so it's rare and an instance of low coverage.

                Dates: Related is reverting to obsolete API patterns.

                Complexity: While not as dominant as domain coverage, complexity does play a role, with the likelihood of error rising with complexity.

                This means if you're at the intersection of multiple of these (such as a low coverage problem in a functional language), agent mode will likely be too much of a waste for you. But interactive mode can still be highly productive.

              • By baw-bag 2025-06-05 20:00

                I really appreciate that. I will see how I get on and may well give you a shout. Thank you!

          • By PeterStuer 2025-06-06 10:56

            Depends. For devops chat is quite nice as the exploration/understanding is key, not just writing out the configs.

        • By jorvi 2025-06-05 18:15 (1 reply)

          What's most annoying about Gemini 2.5 is that it is obnoxiously verbose compared to Opus 4, both in explaining the code it wrote and in the number of lines and comments it adds, to the point where the output is often 2-3x that of Opus 4.

          You can obviously alleviate this by asking it to be more concise but even then it bleeds through sometimes.

          • By joshmlewis 2025-06-05 18:47 (1 reply)

            Yes this is what I mean by conciseness with o3. If prompted well it can produce extremely high level quality code that blows me away at times. I've also had several instances now where I gave it slightly wrong context and other models just butchered a solution with dozens of lines for the proposed fix which I could tell wasn't right and then after reverting and asking o3, it immediately went searching for another file I hadn't included and fixed it in one line. That kind of, dare I say independent thinking, is worth a lot when dealing with complex codebases.

            • By jorvi 2025-06-06 11:59

              Personally I still am of the opinion current LLMs are more of a very advanced autocomplete.

              I'm reminded of the guy who posted that he fed his entire project codebase to an AI; it refactored everything, modularizing it but still reducing the file count from 20 to 12. "It was glorious to see. Nothing worked of course, but glorious nonetheless."

              In the future I can certainly see it get better and better, especially because code is a hard science that reduces down to control flow logic which reduces down to math. It's a much more narrow problem space than, say, poetry or visuals.

        • By joshmlewis 2025-06-05 17:45 (2 replies)

          What languages and IDE do you use it with? I use it in Cursor, mainly with Max reasoning on. I spent around $300 on token-based usage for o3 alone in May, while still only accepting around 33% of suggestions. I made a post on X about this the other day, but I expect that rejection rate to go down significantly by the end of this year at the rate things are going.

          • By drawnwren 2025-06-05 18:07 (1 reply)

            Very strange. I find reasoning has very narrow usefulness for me. It's great to get a project in context or to get oriented in the conversation, but on long conversations I find reasoning starts to add way too much extraneous stuff and get distracted from the task at hand.

            I think my coding model ranking is something like Claude Code > Claude 4 raw > Gemini > big gap > o4-mini > o3

            • By joshmlewis 2025-06-05 18:34 (1 reply)

              Claude Code isn't a model in itself. By default it routes some requests to Opus 4 and some to Sonnet 4, but mostly Sonnet 4 unless you explicitly set a model.

          • By throwaway314155 2025-06-05 18:22

            I'm using it with Python, VS Code (not integrated with Claude, just basic Copilot) and Claude Code. For Gemini I'm using AI Studio with repomix to package my code into a single file; I copy files over manually in that workflow.

            All subscription based, not per token pricing. I'm currently using Claude Max. Can't see myself exhausting its usage at this rate but who knows.

      • By vendiddy 2025-06-05 21:32

        I find o3 to be the clearest thinker as well.

        If I'm working on a complex problem and want to go back and forth on software architecture, I like having o3 research prior art and have a back and forth on trade-offs.

        If o3 was faster and cheaper I'd use it a lot more.

        I'm curious what your workflows are!

      • By monkpit 2025-06-05 18:08

        Have you used Cline with opus+sonnet? Do you have opinions about Claude code vs cline+api? Curious to hear your thoughts!

      • By jonplackett 2025-06-05 23:03 (1 reply)

        How do you find o3 vs o4-mini?

        • By joshmlewis 2025-06-06 3:07 (1 reply)

          For coding at least, I don't bother with anything less than the top thinking models. They do have their place for some tasks in agentic systems, but time is money and I don't want to waste time trying to corral less skilled models when there are more powerful ones available.

          • By jonplackett 2025-06-06 9:00

            I have the same logic but opposite conclusion - o3 just takes SO LONG to respond that I often just use o4-mini

      • By pqdbr 2025-06-05 18:25 (2 replies)

        How do you choose which model to use with Claude Code?

        • By joshmlewis 2025-06-05 18:33 (1 reply)

          I have the Max $200 plan so I set it to Opus until it limits me to Sonnet 4 which has only happened in two out of a few dozen sessions so far. My rule of thumb in Cursor is it's worth paying for the Max reasoning models for pretty much every request unless it's stupid simple because it produces the best code each time without any funny business you get with cheaper models.

          • By sunshinerag 2025-06-05 21:01 (1 reply)

            You can use the max plan in cursor? I thought it didn’t support calls via api and only worked in Claude code?

            • By symbolicAGI 2025-06-05 23:35

              I launch Claude Code in VS Code (similar to Cursor): > claude

              Then I use the /login command, which opens a browser window to log into Claude Max.

              You can confirm Claude Max billing going forward in VS Code/Claude Code: /cost

              "With your Claude Max subscription, no need to monitor cost — your subscription includes Claude Code usage"

        • By jasonjmcghee 2025-06-06 14:36

          In case you're asking for the literal command...

          /model

      • By VeejayRampay 2025-06-05 18:45 (1 reply)

        we need to stop it with the anecdotal evidence presented by one random dude

    • By batrat 2025-06-05 20:36

      What I like about Gemini is the search function, which is very, very good compared to others. I was blown away when I asked it to compose an email to a company that was sending spam to our domain. It searched for and found not only the abuse email of the hosting company but all the info about the domain and the host (MX servers, IP owners, datacenters, etc.). Also, when I wanted to convert a research paper into a podcast, it did it instantly, and it's fun to listen to.

    • By baq 2025-06-05 18:48 (1 reply)

      I’ve been giving the same tasks to claude 4 and gemini 2.5 this week and gemini provided correct solutions and claude didn’t. These weren’t hard tasks either, they were e.g. comparing sql queries before/after rewrite - Gemini found legitimate issues where claude said all is ok.

    • By Szpadel 2025-06-05 17:39

      In my experience this depends highly on the case. For some cases Gemini crushed my problem, but on the next it got stuck and couldn't figure out a simple bug.

      The same with o3 and Sonnet (I haven't tested 4.0 much yet to have an opinion).

      I feel that we need better parallel evaluation support, where you could evaluate all the top models and decide which one provided the best solution.

    • By varunneal 2025-06-05 17:33 (1 reply)

      Have you tried o3 on those problems? I've found o3 to be much more impressive than Opus 4 for all of my use cases.

      • By johnfn 2025-06-05 19:53

        To be honest, I haven't, because the "This model is extremely expensive" popup on Cursor makes me a bit anxious - but given the accolades here I'll have to give it a shot.

    • By cwbriscoe 2025-06-05 20:58

      I haven't tried all of the favorites, just what is available with Jetbrains AI, but I can say that Gemini 2.5 is very good with Go. I guess that makes sense in a way.

    • By zamadatix 2025-06-05 18:58 (2 replies)

      I think the only way to be particularly impressed with new leading models lately is to hold the opinion all of the benchmarks are inaccurate and/or irrelevant and it's vibes/anecdotes where the model is really light years ahead. Otherwise you look at the numbers on e.g. lmarena and see it's claiming a ~16% preference win rate for gpt-3.5-turbo from November of 2023 over this new world-leading model from Google.

      • By johnfn 2025-06-05 19:58 (1 reply)

        Not sure I follow - Gemini has an Elo of 1470, GPT-3.5-turbo is at 1206, which is a ~82% win rate. https://chatgpt.com/share/6841f69d-b2ec-800c-9f8c-3e802ebbc0...

        • By zamadatix 2025-06-07 1:08

          gpt-3.5-turbo-1106 from November 2023 was 1170, 1206 is for the March variant.

          Change that and you get ~84%, flip the order (i.e. the win rate of GPT-3.5 is ~16%). I.e. the point is a two year old model still wins far too often to be excited about each new top model for the last two years, not that the two year old model is better.
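          The win rates being traded back and forth here follow the standard Elo expected-score formula; a minimal sketch, using the ratings quoted above:

```python
# Elo expected score: P(A beats B) = 1 / (1 + 10 ** ((R_B - R_A) / 400))
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 300-point gap (1470 vs the 1170 November 2023 gpt-3.5-turbo figure)
# gives roughly an 85/15 split:
p = elo_win_prob(1470, 1170)  # ~0.849
```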

      • By Workaccount2 2025-06-05 20:36 (1 reply)

        People can ask whatever they want on LMArena, so a question like "List some good snacks to bring to work" might elicit a win for an old/tiny/deprecated model simply because it lists the snack the user liked more.

        • By AstroBen 2025-06-05 21:41

          are you saying that's a bad way to judge a model? Not sure why we'd want ones that choose bad snacks

    • By tempusalaria 2025-06-05 17:31

      I agree. I find Claude easily the best model, at least for programming, which is the only thing I use LLMs for.

    • By lispisok 2025-06-05 18:00

      >That being said, I'm starting to doubt the leaderboards as an accurate representation of model ability

      Goodhart's law applies here just like everywhere else. Much more so given how much money these companies are dumping into making these models.

    • By Alifatisk 2025-06-05 19:24 (1 reply)

      > after a bit Gemini would spin in circles or actually (I've never seen this before!) give up and say it can't do it

      No way, is there any way to see the dialog or recreate this scenario!?

      • By johnfn 2025-06-05 19:48

        The chat was in Cursor, so I don't know a way to provide a public link, but here is the last paragraph that it output before I (and it) gave up. I honestly could have re-prompted it from scratch and maybe it would have gotten it, but at this point I was pretty sure that even if it did, it was going to make a total mess of things. Note that it was iterating on a test failure and had spun through multiple attempts at this point:

        > Given the persistence of the error despite multiple attempts to refine the type definitions, I'm unable to fix this specific TypeScript error without a more profound change to the type structure or potentially a workaround that might compromise type safety or accuracy elsewhere. The current type definitions are already quite complex.

        The two prior paragraphs, in case you're curious:

        > I suspect the issue might be a fundamental limitation or bug in how TypeScript is resolving these highly recursive and conditional types when they are deeply nested. The type system might be "giving up" or defaulting to a less specific type ({ __raw: T }) prematurely.

        > Since the runtime logic seems to be correctly hydrating the nested objects (as the builder.build method recursively calls hydrateHelper), the problem is confined to the type system's ability to represent this.

        I found, as you can see in the first of the prior two paragraphs, that Gemini often wanted to claim that the issue was on TypeScript's side for some of these more complex issues. As proven by Opus, this simply wasn't the case.

    • By AmazingTurtle 2025-06-05 18:22

      For bulk data extraction on personal real-life data, I found that even gpt-4o-mini outperforms the latest Gemini models in both quality and cost. I would use reasoning models, but their JSON schema responses differ from the non-reasoning models, as in: they cannot deal with union types for optional fields when using strict schemas... anyway.

      idk whats the hype about gemini, it's really not that good imho

    • By tymonPartyLate 2025-06-05 19:29 (3 replies)

      I just realized that Opus 4 is the first model that produced "beautiful" code for me. Code that is simple, easy to read, not polluted with comments, no unnecessary crap, just pretty, clean and functional. I had my first "wow" moment with it in a while. That being said it occasionally does something absolutely stupid. Like completely dumb. And when I ask it "why did you do this stupid thing", it replies "oh yeah, you're right, this is super wrong, here is an actual working, smart solution" (proceeds to create brilliant code)

      I do not understand how those machines work.

      • By diggan 2025-06-05 21:13 (1 reply)

        > Code that is simple, easy to read, not polluted with comments, no unnecessary crap, just pretty, clean and functional

        I get that with most of the better models I've tried, although I'd personally favor OpenAI's models overall. I think a good system prompt is probably the best way there, rather than relying on some "innate" "clean code" behavior of specific models. This is a snippet of what I use today for coding guidelines: https://gist.github.com/victorb/1fe62fe7b80a64fc5b446f82d313...

        > That being said it occasionally does something absolutely stupid. Like completely dumb

        That's a bit tougher, but you have to carefully read through exactly what you said and try to figure out what might have led it down the wrong path, or what you could have said in the first place for it to avoid that. Try to work it into your system prompt, then slowly build up your system prompt so every one-shot gets closer and closer to being perfect on the first try.

      • By Tostino 2025-06-05 19:57

        My issue is that every time I've attempted to use Opus 4 to solve any problem, I would burn through my usage cap within a few minutes and not have solved the problem yet, because it misunderstood things about the context and I hadn't gotten the prompt quite right yet.

        With Sonnet, at least I don't run out of usage before I actually get it to understand my problem scope.

      • By simon1ltd 2025-06-05 20:51

        I've also experienced the same, except it produced the same stupid code all over again. I usually use one model (doesn't matter which) until it starts chasing its tail, then I feed the code to a different model to have it fix the mistakes made by the first one.

    • By tomr75 2025-06-05 21:45

      How does it have access to the DOM? Are you using it with Cursor/browser MCP?

  • By chollida1 2025-06-05 18:08 (11 replies)

    I'd start to worry about OpenAI, from a valuation standpoint. The company has some serious competition now and is arguably no longer the leader.

    It's going to be interesting to see how easily they can raise more money. Their valuation is already in the $300B range. How much larger can it get, given their relatively paltry revenue at the moment and ever-rising costs for hardware and electricity?

    If the next generation of LLMs needs new data sources, then Facebook and Google seem well positioned there. OpenAI, on the other hand, seems likely to lose the race for proprietary data sets, as unlike those other two, they don't have another business that generates such data.

    When they were the leader in both research and in user facing applications they certainly deserved their lofty valuation.

    What is new money coming into OpenAI getting now?

    At even a $300B valuation, a typical Wall Street analyst would want to value them at 2x sales, which would mean they'd expect OpenAI to have $150B in annual sales to account for this valuation when they go public.

    Or at an extremely lofty P/E ratio of, say, 100, that would be $3B in annual earnings, which analysts would have to expect to double each year for the next 10 or so years, à la AMZN in the 2000s, to justify this valuation.

    They seem to have boxed themselves into a corner where it will be painful to go public, assuming they can ever figure out the nonprofit/profit issue their company has.

    Congrats to Google here, they have done great work and look like they'll be one of the biggest winners of the AI race.

    • By jstummbillig 2025-06-0521:206 reply

      There is some serious confusion about the strength of OpenAIs position.

      "chatgpt" is a verb. People have no idea what Claude or Gemini are, and they will not be interested unless something absolutely fantastic happens. Being a little better will do absolutely nothing to convince normal people to change products (the small moat ChatGPT has simply by virtue of chat history is probably enough from a convenience standpoint; add memories and no obvious path to export/import, and you are done here).

      All that OpenAI would have to do to eventually be worth their valuation is to optimize and not become offensively bad to their, what, 500 million active users. And, if we assume the current paradigm everyone is working with is here to stay, why would they? Instead of leading (as they have done so far, for the most part), they can at any point simply do what others have successfully resorted to and copy with a slight delay. People won't care.

      • By aeyes 2025-06-0521:313 reply

        Google has a text input box on google.com, as soon as this gives similar responses there is no need for the average user to use ChatGPT anymore.

        I already see lots of normal people share screenshots of the AI Overview responses.

        • By jstummbillig 2025-06-0522:081 reply

          You are skipping over the part where you need to bring normal people, especially young normal people, back to google.com for them to see anything at all there. Hundreds of millions of them don't go there anymore.

        • By paxys 2025-06-060:40

          > as soon as this gives similar responses

          And when is that going to be? Google clearly has the ability to convert google.com into a ChatGPT clone today if they wanted to. They already have a state of the art model. They have a dozen different AI assistants that no one uses. They have a pointless AI summary on top of search results that returns garbage data 99% of the time. It's been 3+ years and it is clear now that the company is simply too scared to rock the boat and disrupt its search revenue. There is zero appetite for risk, and soon it'll be too late to act.

        • By askafriend 2025-06-0522:10

          As the other poster mentioned, young people are not going there. What happens when they grow up?

      • By candiddevmike 2025-06-0521:41

        ChatGPT is going to be Kleenex'd. They wasted their first mover advantage. Replace ChatGPT's interface with any other LLM and most users won't be able to tell the difference.

      • By ComplexSystems 2025-06-0523:521 reply

        "People have no idea what claude or gemini are"

        One well-placed ad campaign could easily change all that. Doesn't hurt that Google can bundle Gemini into Android.

        • By jstummbillig 2025-06-0611:521 reply

          If it were that simple to sway markets through marketing, we would see Pepsi/Coca-Cola or McDonalds/BurgerKing swing like crazy all the time from "one well-placed ad campaign" to the next. We do not.

          • By ComplexSystems 2025-06-084:43

            Thanks to well-placed ad campaigns, people have a very good idea what Pepsi, Coca-Cola, McDonald's and Burger King are. They also know what Siri is. And it would be similarly easy to establish Gemini as a household name.

      • By chollida1 2025-06-060:41

        ChatGPT has no moat of any kind, though.

        I can switch tomorrow to use gemini or grok or any other llm, and I have, with zero switching cost.

        That means one stumble on the next foundational model and their market share drops in half in like 2 months.

        Now the same is true for the other llms as well.

      • By potatolicious 2025-06-0522:34

        I think this pretty substantially overstates ChatGPT's stickiness. Just because something is widely (if not universally) known doesn't mean it's universally used, or that such usage is sticky.

        For example, I had occasion to chat with a relative who's still in high school recently, and was curious what the situation was in their classrooms re: AI.

        tl;dr: LLM use is basically universal, but ChatGPT is not the favored tool. The favored tools are LLMs/apps specifically marketed as study/homework aids.

        It seems like the market is fine with seeking out specific LLMs for specific kinds of tasks, as opposed to some omni-LLM one-stop shop that does everything. The market has already, and rapidly, moved beyond ChatGPT.

        Not to mention I am willing to bet that Gemini has radically more usage than OpenAI's models simply by virtue of being plugged into Google Search. There are distribution effects, I just don't think OpenAI has the strongest position!

        I think OpenAI has some first-mover advantage, I just don't think it's anywhere near as durable (nor as large) as you're making it out to be.

      • By lizardking 2025-06-0620:04

        Xerox was a verb too

    • By PantaloonFlames 2025-06-0522:57

      > At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.

      Oops I think you may have flipped the numerator and the denominator there, if I’m understanding you. Valuation of 300B , if 2x sales, would imply 150B sales.

      Probably your point still stands.
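
      For what it's worth, a quick sketch of the price-to-sales arithmetic being discussed (the 2x multiple is the illustrative figure from the parent comment, not a real analyst target):

      ```python
      # Back-of-envelope valuation math from the thread.
      valuation = 300e9        # $300B valuation, as stated above
      ps_multiple = 2          # hypothetical 2x price-to-sales multiple

      # Dividing valuation by the multiple gives implied sales:
      implied_sales = valuation / ps_multiple   # $150B, not $600B

      # Same direction for the P/E figure mentioned earlier:
      pe_ratio = 100           # the "extremely lofty" P/E of 100
      implied_earnings = valuation / pe_ratio   # $3B in annual earnings

      print(f"implied sales: ${implied_sales / 1e9:.0f}B")
      print(f"implied earnings: ${implied_earnings / 1e9:.0f}B")
      ```

      Multiplying sales by the multiple gives the valuation, so going the other way means dividing, which is where the $600B figure flipped.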

    • By jadbox 2025-06-0518:241 reply

      Currently I only find OpenAI to be clearly better for image generation: like illustrations, comics, or photo editing for home project ideation.

      • By bufferoverflow 2025-06-061:44

        And open-source Flux.1 Kontext is already better than it.

    • By energy123 2025-06-0518:401 reply

      Even if they're winning the AI race, their search business is still going to be cannibalized, and it's unclear if they'll be able to extract any economic rents from AI thanks to market competition. Of course they have no choice but to compete, but they probably would have preferred the pre-AI status quo of unquestioned monopoly and eyeballs on ads.

      • By xmprt 2025-06-0520:48

        Historically, many companies have failed by not adapting to new technologies while trying to protect their core business (e.g. Kodak, Blockbuster, BlackBerry, Intel). I applaud Google for going against their instincts and actively trying to disrupt their cash cow to gain an advantage in the AI race.

    • By orionsbelt 2025-06-0520:271 reply

      I think it’s too early to say they are not the leader given they have o3 pro and GPT 5 coming out within the next month or two. Only if those are not impressive would I start to consider that they have lost their edge.

      Although it does feel likely that at minimum, they are neck and neck with Google and others.

    • By sebzim4500 2025-06-0520:51

      >At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.

      What? Apple has a revenue of 400B and a market cap of 3T

    • By Rudybega 2025-06-0518:151 reply

      I think OpenAI has projected 12.7B in revenue this year and 29.4B in 2026.

      Edit: I am dumb, ignore the second half of my post.

      • By eamag 2025-06-0518:181 reply

        isn't P/E about earnings, not revenue?

        • By Rudybega 2025-06-0518:21

          You are correct. I need some coffee.

    • By ketzo 2025-06-0518:123 reply

      OpenAI has already forecast $12B in revenue by the end of this year.

      I agree that Google is well-positioned, but the mindshare/product advantage OpenAI has gives them a stupendous amount of leeway

      • By Workaccount2 2025-06-0518:402 reply

        The hurdle for OpenAI is going to be on the profit side. Google has its own hardware acceleration and its own data centers. OpenAI has to pay a monopolist for hardware acceleration and is beholden to another tech giant for data centers. Never mind that Google can customize its hardware specifically for its models.

        The only way for OpenAI to really get ahead on solid ground is to discover some sort of absolute game changer (new architecture, new algorithm) and manage to keep it bottled away.

        • By geodel 2025-06-0519:003 reply

          OpenAI has partnered with Jony Ive now, and they are going to have the thinnest data centers with the thinnest servers mounted on the thinnest racks. And since everything is so thin, servers can just whisper to each other instead of communicating via fat cables.

          I think that will be the game changer OpenAI will show us soon.

          • By falloon 2025-06-0521:01

            All servers will have a single thunderbolt port.

          • By gotoeleven 2025-06-0617:22

            Yep and I heard the servers will only have two USB-C ports for all I/O, but of course dongles will be available.

        • By diggan 2025-06-0521:161 reply

          > OpenAI has to pay a monopolist for hardware acceleration and beholden to another tech giant for data centers.

          Don't they have a data center in progress as we speak? Seems by now they're planning on building not just one huge data center in Texas, but more in other countries too.

          • By geodel 2025-06-063:411 reply

            Well, that data center is just going to be full of Nvidia GPUs, hence the "pay a monopolist" part.

            • By diggan 2025-06-0610:39

              Guess the part I put in quotes should have had an "or" instead of an "and" there.

      • By chollida1 2025-06-0518:15

        Agreed, it's the doubling of that each year for the next 4-5 years that I see as being difficult.

      • By VeejayRampay 2025-06-0518:471 reply

        the leeway comes from the grotesque fanboyism the company benefits from

        they haven't been number one for quite some time and still people can't stop presenting them as the leaders

        • By ketzo 2025-06-0521:161 reply

          People said much the same thing about Apple for decades, and they’re a $3T company; not a bad thing to have fans.

          Plus, it’s a consumer product; it doesn’t matter if people are “presenting them as leaders”, it matters if hundreds of millions of totally average people will open their computers and use the product. OpenAI has that.

          • By aryehof 2025-06-066:22

            Actually, their speculative value is about 3 trillion. Their book value is around 68 billion. Their speculative value might be halved (or more) overnight based on the whim of the economy, markets and opinion. A company isn't actually worth its speculative value.

    • By raincole 2025-06-0521:19

      > At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.

      Even Google doesn't have $600B revenue. Sorry, it sounds like numbers pulled from someone's rear.

    • By qeternity 2025-06-0519:061 reply

      > At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.

      Lmfao where did you get this from? Microsoft has less than half of that revenue, and is valued > 10x than OpenAI.

      Revenue is not the metric by which these companies are valued...

      • By Yizahi 2025-06-0612:50

        The difference between Microsoft and OAI is that Microsoft can spend a lump sum of money on Excel, and a fraction of that on its support, and then sell it indefinitely with almost no additional cost. MS can add a million new Excel users tomorrow and that would be almost pure profit. (I'm simplifying a lot.)

        OAI, on the other hand, must spend a lot of additional money for every single new user, both free and paid. Adding a million new OAI users tomorrow would blow a gigantic red hole in the profits (on top of the existing losses). OAI has little or no benefit of scale, unlike other industries.

        I have no knowledge about corporate valuations, but I strongly suspect that OAI valuation need to include this issue.

    • By Oleksa_dr 2025-06-0521:591 reply

      I was tempted by the ratings and immediately paid for a subscription to Gemini 2.5. Half an hour later, I canceled the subscription and got a refund. This is the laziest and stupidest LLM. What it was supposed to do, it told me to do on my own. And when analyzing simple short documents, it pulled in completely strange documents from the Internet unrelated to the topic. Even local LLMs (3B) were not so stupid and lazy.

      • By sigmoid10 2025-06-0523:16

        Exactly my experience as well. I don't get why people here now seem to blindly take every new gamed benchmark as some harbinger of OpenAI's imminent downfall. Google is still way behind in day-to day personal and professional use for me.

  • By vthallam 2025-06-0516:575 reply

    As if 3 different preview versions of the same model weren't confusing enough, the last two dates are 05-06 and 06-05. They could have held off for a day :)

    • By tomComb 2025-06-0517:141 reply

      Since those days are ambiguous anyway, they would have had to hold off until the 13th.

      In Canada, a third of the dates we see are British, and another third are American, so it’s really confusing. Thankfully y-m-d is now a legal format and seems to be gaining ground.

      • By layer8 2025-06-0517:531 reply

        > they would have had to hold off until the 13th.

        06-06 is unambiguously after 05-06 regardless of date format.

        • By Sammi 2025-06-0610:01

          The problem is that I mentally just panic and abort without even trying when I see 06-06 and 05-06. The ambiguity just switches my brain off.

    • By dist-epoch 2025-06-0517:382 reply

      > the last two dates are 05-06 and 06-05

      They are clearly trolling OpenAI's 4o and o4 models.

      • By oezi 2025-06-0518:55

        Don't repeat the same mistake if you want to troll somebody.

        It makes you look even more stupid.

      • By fragmede 2025-06-0518:52

        ChatGPT itself suggests better names than that!

    • By declan_roberts 2025-06-0517:291 reply

      Engineers are surprisingly bad at naming things!

      • By jacob019 2025-06-0517:421 reply

        I rather like date codes as versions.

        • By atom058 2025-06-1119:46

          But it's not clear how to interpret the date code: 05-06 could be 5th June or 6th May; same story for 06-05. Very confusing due to American-style date formatting. Version numbers are at least sequential, with a bigger number meaning a later version.
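
          For illustration, a small Python sketch of the ambiguity: the same string parses to two different dates depending on which convention you assume.

          ```python
          from datetime import datetime

          s = "05-06"  # is this May 6 (MM-DD) or 5 June (DD-MM)?

          # US-style month-day reading:
          as_us = datetime.strptime(s, "%m-%d")  # month=5, day=6
          # European-style day-month reading:
          as_eu = datetime.strptime(s, "%d-%m")  # month=6, day=5

          print(as_us.month, as_us.day)  # 5 6
          print(as_eu.month, as_eu.day)  # 6 5
          ```

          An unambiguous scheme like ISO 8601 (YYYY-MM-DD, e.g. 2025-06-05) sidesteps the problem entirely, which is presumably why the full model names carry the year.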

    • By UncleOxidant 2025-06-0518:56

      At what point will they move from Gemini 2.5 pro to Gemini 2.6 pro? I'd guess Gemini 3 will be a larger model.

HackerNews