Gemini 2.5 Flash

2025-04-17 19:03 · developers.googleblog.com

Gemini / Google AI Studio

Gemini 2.5 Flash ai.dev

Today we are rolling out an early version of Gemini 2.5 Flash in preview through the Gemini API via Google AI Studio and Vertex AI. Building upon the popular foundation of 2.0 Flash, this new version delivers a major upgrade in reasoning capabilities, while still prioritizing speed and cost. Gemini 2.5 Flash is our first fully hybrid reasoning model, giving developers the ability to turn thinking on or off. The model also allows developers to set thinking budgets to find the right tradeoff between quality, cost, and latency. Even with thinking off, developers can maintain the fast speeds of 2.0 Flash while still improving performance.

Our Gemini 2.5 models are thinking models, capable of reasoning through their thoughts before responding. Instead of immediately generating an output, the model can perform a "thinking" process to better understand the prompt, break down complex tasks, and plan a response. On complex tasks that require multiple steps of reasoning (like solving math problems or analyzing research questions), the thinking process allows the model to arrive at more accurate and comprehensive answers. In fact, Gemini 2.5 Flash performs strongly on Hard Prompts in LMArena, second only to 2.5 Pro.

Comparison table showing price and performance metrics for LLMs
2.5 Flash has comparable metrics to other leading models for a fraction of the cost and size.

2.5 Flash continues to lead as the model with the best price-to-performance ratio.

Gemini 2.5 Flash price-to-performance comparison
Gemini 2.5 Flash adds another model to Google’s Pareto frontier of cost to quality.*

We know that different use cases have different tradeoffs in quality, cost, and latency. To give developers flexibility, we’ve enabled setting a thinking budget that offers fine-grained control over the maximum number of tokens a model can generate while thinking. A higher budget allows the model to reason further to improve quality. Importantly, though, the budget sets a cap on how much 2.5 Flash can think, but the model does not use the full budget if the prompt does not require it.

Plot graphs show improvements in reasoning quality as thinking budget increases
Improvements in reasoning quality as thinking budget increases.

The model is trained to know how long to think for a given prompt, and therefore automatically decides how much to think based on the perceived task complexity.

If you want to keep the lowest cost and latency while still improving performance over 2.0 Flash, set the thinking budget to 0. You can also choose to set a specific token budget for the thinking phase using a parameter in the API or the slider in Google AI Studio and in Vertex AI. The budget can range from 0 to 24576 tokens for 2.5 Flash.
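
As an illustration, here is a minimal sketch of turning thinking off entirely (budget 0), using the same google-genai SDK and preview model name as the full example later in this post:

from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")

# thinking_budget=0 disables thinking for the lowest cost and latency
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="How many provinces does Canada have?",
    config=genai.types.GenerateContentConfig(
        thinking_config=genai.types.ThinkingConfig(thinking_budget=0)
    )
)
print(response.text)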

The following prompts demonstrate how much reasoning may be used in 2.5 Flash’s default mode.


Prompts requiring low reasoning:

Example 1: “Thank you” in Spanish

Example 2: How many provinces does Canada have?


Prompts requiring medium reasoning:

Example 1: You roll two dice. What’s the probability they add up to 7?

Example 2: My gym has pickup hours for basketball between 9-3pm on MWF and between 2-8pm on Tuesday and Saturday. If I work 9-6pm 5 days a week and want to play 5 hours of basketball on weekdays, create a schedule for me to make it all work.
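
(For reference, Example 1 above reduces to a short counting argument: six of the 36 equally likely outcomes sum to 7, so the probability is 6/36 = 1/6.)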


Prompts requiring high reasoning:

Example 1: A cantilever beam of length L=3m has a rectangular cross-section (width b=0.1m, height h=0.2m) and is made of steel (E=200 GPa). It is subjected to a uniformly distributed load w=5 kN/m along its entire length and a point load P=10 kN at its free end. Calculate the maximum bending stress (σ_max).

Example 2: Write a function evaluate_cells(cells: Dict[str, str]) -> Dict[str, float] that computes the values of spreadsheet cells.

Each cell contains:

  • Either a number (e.g. "3")
  • Or a formula like "=A1 + B1 * 2" using +, -, *, / and other cells.

Requirements:

  • Resolve dependencies between cells.
  • Handle operator precedence (*/ before +-).
  • Detect cycles and raise ValueError("Cycle detected at <cell>").
  • No eval(). Use only built-in libraries.
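
For reference, Example 1's beam problem has a standard closed-form answer; the maximum bending moment occurs at the fixed support, and the rectangle's elastic section modulus gives the stress:

M_max = w·L²/2 + P·L = (5)(3²)/2 + (10)(3) = 52.5 kN·m
S = b·h²/6 = (0.1)(0.2²)/6 ≈ 6.67e-4 m³
σ_max = M_max / S = 52.5 kN·m / 6.67e-4 m³ ≈ 78.8 MPa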
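
And for Example 2, a minimal sketch of one possible solution (an illustration of the task, not the model's output): resolve cell references recursively with cycle detection, and parse formulas with a small recursive-descent parser so that * and / bind tighter than + and -:

import re
from typing import Dict, List

def evaluate_cells(cells: Dict[str, str]) -> Dict[str, float]:
    values: Dict[str, float] = {}  # resolved results
    in_progress = set()            # cells on the current resolution path

    def resolve(name: str) -> float:
        if name in values:
            return values[name]
        if name in in_progress:
            raise ValueError(f"Cycle detected at {name}")
        in_progress.add(name)
        raw = cells[name].strip()
        values[name] = parse(raw[1:]) if raw.startswith("=") else float(raw)
        in_progress.remove(name)
        return values[name]

    def parse(expr: str) -> float:
        # tokens: cell references, numbers, operators, parentheses
        tokens: List[str] = re.findall(r"[A-Za-z]+\d+|\d+(?:\.\d+)?|[()+\-*/]", expr)
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def take():
            nonlocal pos
            pos += 1
            return tokens[pos - 1]

        def factor() -> float:  # number | cell reference | (expr) | unary minus
            tok = take()
            if tok == "(":
                val = addsub()
                take()  # consume the closing ")"
                return val
            if tok == "-":
                return -factor()
            if re.fullmatch(r"[A-Za-z]+\d+", tok):
                return resolve(tok)
            return float(tok)

        def term() -> float:  # * and / bind tighter than + and -
            val = factor()
            while peek() in ("*", "/"):
                val = val * factor() if take() == "*" else val / factor()
            return val

        def addsub() -> float:  # + and - at the lowest precedence
            val = term()
            while peek() in ("+", "-"):
                val = val + term() if take() == "+" else val - term()
            return val

        return addsub()

    return {name: resolve(name) for name in cells}

For instance, evaluate_cells({"A1": "2", "B1": "=A1 * 3 + 1"}) returns {"A1": 2.0, "B1": 7.0}, and a self-referencing cell raises ValueError("Cycle detected at A1").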


Start building with Gemini 2.5 Flash today

Gemini 2.5 Flash with thinking capabilities is now available in preview via the Gemini API in Google AI Studio and in Vertex AI, and in a dedicated dropdown in the Gemini app. We encourage you to experiment with the thinking_budget parameter and explore how controllable reasoning can help you solve more complex problems.

from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="You roll two dice. What’s the probability they add up to 7?",
    config=genai.types.GenerateContentConfig(
        thinking_config=genai.types.ThinkingConfig(
            thinking_budget=1024
        )
    )
)

print(response.text)

Find detailed API references and thinking guides in our developer docs or get started with code examples from the Gemini Cookbook.

We will continue to improve Gemini 2.5 Flash, with more coming soon, before we make it generally available for full production use.


*Model pricing is sourced from Artificial Analysis & Company Documentation



Comments

  • By zoogeny 2025-04-1721:0220 reply

    Google making Gemini 2.5 Pro (Experimental) free was a big deal. I haven't tried the more expensive OpenAI models so I can't even compare, only to the free models I have used of theirs in the past.

    Gemini 2.5 Pro is so much of a step up (IME) that I've become sold on Google's models in general. It not only is smarter than me on most of the subjects I engage with it, it also isn't completely obsequious. The model pushes back on me rather than contorting itself to find a way to agree.

    100% of my casual AI usage is now in Gemini and I look forward to asking it questions on deep topics because it consistently provides me with insight. I am building new tools with the mind to optimize my usage to increase its value to me.

    • By jeeeb 2025-04-1722:058 reply

      After comparing Gemini Pro and Claude Sonnet 3.7 coding answers side by side a few times, I decided to cancel my Anthropic subscription and just stick to Gemini.

      • By blueyes 2025-04-1723:059 reply

        One of the main advantages Anthropic currently has over Google is the tooling that comes with Claude Code. It may not generate better code, and it has a lower complexity ceiling, but it can automatically find and search files, and figure out how to fix a syntax error fast.

        • By bayarearefugee 2025-04-180:365 reply

          As another person that cancelled my Claude and switched to Gemini, I agree that Claude Code is very nice, but beyond some initial exploration I never felt comfortable using it for real work because Claude 3.7 is far too eager to overengineer half-baked solutions that extend far beyond what you asked it to do in the first place.

          Paying real API money for Claude to jump the gun on solutions invalidated the advantage of having a tool as nice as Claude Code, at least for me, I admit everyone's mileage will vary.

          • By neuah 2025-04-1813:192 reply

              Exactly my experience as well. Started out loving it but it almost moves too fast - building in functionality that I might want eventually but isn't yet appropriate for where the project is in terms of testing, or is just in completely the wrong place in the architecture. I try to give very direct and specific prompts but it still has the tendency to overreach. Of course it's likely that with more use I will learn better how to rein it in.

            • By Hugsun 2025-04-1814:001 reply

              I've experienced this a lot as well. I also just yesterday had an interesting argument with claude.

                It put an expensive API call inside a useEffect hook. I wanted the call elsewhere and it fought me on it pretty aggressively. Instead of removing the call, it started changing comments and function names to say that the call was just loading already fetched data from a cache (which was not true). I could not find a way to tell it to remove that API call from the useEffect hook; it just wrote more and more motivated excuses in the surrounding comments. It would have been very funny if it weren't so expensive.

              • By freedomben 2025-04-1814:232 reply

                  Geez, I'm not one of the people who think AI is going to wake up and wipe us out, but experiences like yours do give me pause. Right now the AI isn't in the driver's seat and can only assert itself through verbal expression, but I know it's only a matter of time. We already saw Cursor themselves get a taste of this. To be clear, I'm not suggesting the AI is sentient and malicious - I don't believe that at all. I think it's been trained/programmed/tuned to do this, though not intentionally, but the nature of these tools is they will surprise us.

                • By Jensson 2025-04-1815:301 reply

                  > but the nature of these tools is they will surprise us

                  Models used to do this much much more than now, so what it did doesn't surprise us.

                  The nature of these tools is to copy what we have already written. It has seen many threads where developers argue and dig in, they try to train the AI not to do that but sometimes it still happens and then it just roleplays as the developer that refuses to listen to anything you say.

                  • By elcritch 2025-04-1911:09

                    I almost fear more that we'll create Bender from Futurama than some superintelligent enlightened AGI. It'll probably happen after Grok AI gets snuck some beer into its core cluster or something absurd.

                • By arrowsmith 2025-04-1814:281 reply

                  > We already saw Cursor themselves get a taste of this.

                  Sorry what do you mean by this?

                  • By tempoponet 2025-04-1814:55

                    Earlier this week a Cursor AI support agent told a user they could only use Cursor on one machine at a time, causing the user to cancel their subscription.

            • By davem007 2025-04-285:43

              Agreed - no matter what prompt I try, including asking Claude to promise not to implement code unless we agree on requirements and design, and to repeat that promise regularly, it jumps the gun and implements (actually hallucinates) solutions way too soon. I changed to Gemini as a result.

          • By roygbiv2 2025-04-182:49

            I wanted some powershell code to do some sharepoint uploading. It created a 1000 line logging module that allowed me to log things at different levels like info, debug, error etc. Not really what I wanted.

          • By tough 2025-04-1820:211 reply

            Open Codex (a Codex fork) supports Gemini and OpenRouter providers: https://github.com/ymichael/open-codex

            Google models on the CLI are great.

            • By mark_l_watson 2025-04-1917:13

              +1 Open Codex is very nice. Yesterday I was using it with Gemini APIs and also using a local model using Ollama running on my laptop.

              I added a very short chapter on setting this up (direct link to my book online): https://leanpub.com/ollama/read#using-the-open-codex-command...

              This morning I tweaked my Open Codex config to also try gemma3:27b-it-qat - and Google’s open source small model is excellent: runs fast enough for a good dev experience, with very good functionality.

          • By btbuildem 2025-04-1815:12

            "Don't be a keener. Do not do anything I did not ask you to do" are def part of my prompts when using Claude

          • By Sonnigeszeug 2025-04-1815:28

            What's your setup/workflow then?

            Any IDE integration?

        • By igor47 2025-04-182:052 reply

          I've switched to aider with the --watch-files flag. Being able to use models in nvim with no additional tooling is pretty sweet

          • By aitchnyu 2025-04-186:37

            Typing `//use this as reference ai` in one file and `//copy this row to x ai!` in another, and it will add those functions/files to context and act in both places. Although I wish Aider would write `working on your request...` under my comment; for now I have to keep the Aider window in sight. Autocomplete, "add to context", and "enter your instructions" in other apps feel clunky.

          • By mediaman 2025-04-183:02

            That's really cool. I've been looking for a nicer solution to use with nvim.

        • By mrinterweb 2025-04-1817:481 reply

          I don't understand the appeal of investing in learning and adapting your workflow to use an AI tool that is so tightly coupled to a single LLM provider, when there are other great AI tools available that are not locked to a single LLM provider. I would guess Aider is the closest thing to Claude Code, but you can use pretty much any LLM.

          The LLM field is moving so fast that what is the leading frontier model today, may not be the same tomorrow.

          Pricing is another important consideration. https://aider.chat/docs/leaderboards/

          • By smallnamespace 2025-04-1819:33

            All the AI tools end up converging on a similar workflow: type what you want and interrupt if you're not getting what you want.

        • By vladmdgolam 2025-04-1817:251 reply

          There are at least 10 projects currently aiming to recreate Claude Code, but for Gemini. For example, geminicodes.co by NotebookLM’s founding PM Raiza Martin

          • By wcallahan 2025-04-195:39

            Tried Gemini Codes yesterday, as well as anon-kode and anon-codex. Gemini Codes is already broken and appears to be rather brittle (she disclosures as much), and the other two appear to still need some prompt improvements or someone adding vector embedding for them to be useful?

            Perhaps someone can merge the best of Aider and codex/claude code now. Looking forward to it.

        • By energy123 2025-04-180:463 reply

          Google needs to fix their Gemini web app at a basic level. It's slow, gets stuck on "Show thinking", and rejects 200k-token prompts that are sent one shot. AI Studio is in much better shape.

          • By Graphon1 2025-04-181:331 reply

            But have you tried any other interfaces for Gemini? Like the Gemini Code Assistant in VSCode? Or Gemini-backed Aider?

            • By roygbiv2 2025-04-182:51

              Have you tried them? Which one is fairly simple but just works?

          • By shrisukhani 2025-04-1819:43

            +1 on this. Improving Gemini apps and live mode will go such a long way for them. Google actually has the best model line-up now but the apps and APIs hold them back so much.

          • By johnisgood 2025-04-1810:191 reply

            I hate how I can copy paste long text into Claude (becomes a pasted text) and it is accepted, but in Gemini it is limited.

            • By Workaccount2 2025-04-1813:491 reply

              You can paste it in a text file and upload that. A little annoying compared to claude, but does work.

              • By johnisgood 2025-04-1814:491 reply

                Thanks, will give it a try.

                • By xbmcuser 2025-04-1818:291 reply

                    Uploading files on Google is now great. I uploaded my Python script and the text data files I was using the script to process. I asked it how best to optimize the code. It actually ran the Python code on the data files, then recommended changes, then when prompted ran the script again to show the new results. At first I thought it might be hallucinating, but no, the data was correct.

                  • By johnisgood 2025-04-1818:33

                    Yeah "they" run Python code now quite well. They generate some output using Python "internally" (albeit shows you the code).

        • By mogili 2025-04-181:571 reply

          I use roo code with Gemini to get similar results for free

          • By ssd532 2025-04-187:281 reply

            Do its agentic features work with any API? I had tried this (or Cline) and it was clear that they work effectively only with Claude's tooling support.

            • By piizeus 2025-04-217:47

              Yes, any API key is allowed. You can also assign different LLMs to different modes (architect, code, ask, debug, etc.), which is great for cost optimization.

        • By julianeon 2025-04-184:326 reply

          Related:

          Only Claude (to my knowledge) has a desktop app which can directly, and usually quite intelligently, modify files and create repos on your desktop. It's the only "agentic" option among the major players.

          "Claude, make me an app which will accept Stripe payments and sell an ebook about coding in Python; first create the app, then the ebook."

          It would take a few passes but Claude could do this; obviously you can't do that with an API alone. That capability alone is worth $30/month in my opinion.

          • By xvinci 2025-04-186:351 reply

            Maybe I am not understanding something here.

            But there are third party options available that do the very same thing (e.g. https://aider.chat/ ) which allow you to plug in a model (or even a combination thereof, e.g. DeepSeek as architect and Claude as code writer) of your choice.

            Therefore the advantage of the model provider providing such a thing doesn't matter, no?

            • By jm547ster 2025-04-1815:161 reply

              Aider is not agentic - it is interactive by design. Copilot agent mode and Cline would be better comparisons.

              • By tough 2025-04-1820:27

                OpenAI launched Codex 2 days ago; there are already open forks that support other providers too.

                There are also Claude Code proxies to run it on local LLMs.

                You can just do things.

          • By int_19h 2025-04-187:17

            A first party app, sure, but there's no shortage of third party options. Cursor, Windsurf/Codeium etc. Even VSCode has agent mode now.

          • By dingnuts 2025-04-1814:39

            > first create the app, then the ebook."

            > It would take a few passes but Claude could do this;

            I'm sorry but absolutely nothing I've seen from using Claude indicates that you could give it a vague prompt like that and have it actually produce anything worth reading.

            Can it output a book's worth of bullshit with that prompt? Yes. But if you think "write a book about Python" is where we are in the state of the art in language models in terms of the prompt you need to get a coherent product, I want some of whatever you are smoking because that has got to be the good shit

          • By indexerror 2025-04-184:511 reply

            OpenAI just released Codex, which is basically the same as Claude Code.

            • By hiciu 2025-04-185:231 reply

              It looks the same, but for some reason Claude Code is much more capable. Codex got lost in my source code and hallucinated a bunch of stuff; Claude on the same task just went to town, burned money and delivered.

              Of course, this is only my experience and codex is still very young. I really hope it becomes as capable as Claude.

              • By rockwotj 2025-04-187:281 reply

                Part of it is probably that Claude is just better at coding than what OpenAI has available. I am considering trying to hack support for Gemini into Codex and play around with it.

          • By thrdbndndn 2025-04-184:58

            Copilot agent mode?

          • By breakitmakeit 2025-04-185:06

            [dead]

        • By WiSaGaN 2025-04-180:45

          Also the "project" feature in claude improves experience significantly for coder, where you can customize your workflow. Would be great if gemini has this feature.

        • By mdhb 2025-04-1818:58

          Firebase Studio is the Google equivalent

      • By onlyrealcuzzo 2025-04-1723:452 reply

        Yes, IME, Anthropic seemed to be ahead of Google by a decent amount with Sonnet 3.5 vs 1.5 Pro.

        However, Sonnet 3.7 seemed like a very small increase, whereas 2.5 Pro seemed like quite a leap.

        Now, IME, Google seems to be comfortably ahead.

        2.5 Pro is a little slow, though.

        I'm not sure which model Google uses for the AI answers on search, but I find myself using Search for a lot of things I might ask Gemini (via 2.5 Pro) if it was as fast as Search's AI answers.

        • By dmix 2025-04-180:041 reply

          How is the speed of Gemini vs 3.7?

          • By benhurmarcel 2025-04-187:011 reply

            I use both, Gemini 2.5 Pro is significantly slower than Claude 3.7.

            • By rockwotj 2025-04-187:32

              Yeah, I have read Gemini Pro 2.5 is a much bigger model.

      • By mamp 2025-04-1723:131 reply

        I've been using Gemini 2.5 and Claude 3.7 for Rust development and I have been very impressed with Claude, which wasn't the case for some architectural discussions where Gemini impressed with its structure and scope. OpenAI 4.5 and o1 have been disappointing in both contexts.

        Gemini doesn't seem to be as keen to agree with me so I find it makes small improvements where Claude and OpenAI will go along with initial suggestions until specifically asked to make improvements.

        • By yousif_123123 2025-04-1812:151 reply

          I have noticed Gemini not accepting an instruction to "leave all other code the same but just modify this part" on code that included use of an alpha API with a different interface than what Gemini knows is the correct current API. No matter how I prompted 2.5 Pro, I couldn't get it to respect my use of the alpha API; it would just think I must be wrong.

          So I think patterns from the training data are still overriding some actual logic/intelligence in the model. Or the Google assistant fine-tuning is messing it up.

          • By Workaccount2 2025-04-1813:57

            I have been using gemini daily for coding for the last week, and I swear that they are pulling levers and A/B testing in the background. Which is a very google thing to do. They did the same thing with assistant, which I was a pretty heavy user of back in the day (I was driving a lot).

      • By jessep 2025-04-182:171 reply

        I have had a few epic refactoring failures with Gemini relative to Claude.

        For example: I asked both to change a bunch of code into functions to pass into a `pipe` type function, and Gemini truly seemed to have no idea what it was supposed to do, and Claude just did it.

        Maybe there was some user error or something, but after that I haven’t really used Gemini.

        I’m curious whether people who are using Gemini and loving it are using it mostly for one-shotting, or if they’re working with it more closely like a pair programmer? I could buy that it could maybe be good at one but bad at the other?

        • By Asraelite 2025-04-186:191 reply

          This has been my experience too. Gemini might be better for vibe coding or architecture or whatever, but Claude consistently feels better for serious coding. That is, when I know exactly how I want something implemented in a large existing codebase, and I go through the full cycle of implementation, refinement, bug fixing, and testing, guiding the AI along the way.

          It also seems to be better at incorporating knowledge from documentation and existing examples when provided.

          • By int_19h 2025-04-187:191 reply

            My experience has been exactly the opposite - Sonnet did fine on trivial tasks, but couldn't e.g. fix a bug end-to-end (from bug description in the tracker to implementing the fix and adding tests) properly because it couldn't understand how the relevant code worked, whereas Gemini would consistently figure out the root cause and write decent fix & tests.

            Perhaps this is down to specific tools and their prompts? In my case, this was Cursor used in agent mode.

            Or perhaps it's about the languages involved - my experiments were with TypeScript and C++.

            • By Asraelite 2025-04-187:262 reply

              > Gemini would consistently figure out the root cause and write decent fix & tests.

              I feel like you might be using it differently to me. I generally don't ask AI to find the cause of a bug, because it's quite bad at that. I use it to identify relevant parts of the code that could be involved in the bug, and then I come up with my own hypotheses for the cause. Then I use AI to help write tests to validate these hypotheses. I mostly use Rust.

              • By int_19h 2025-04-187:401 reply

                I used to use them mostly in "smart code completion" mode myself until very recently. But with all the AI IDEs adding agentic mode, I was curious to see how well that fares if I let it drive.

                  And we aren't talking about trivial bugs here. For TypeScript, the most impressive bug it handled to date was an async race condition due to a missing await causing a property to be overwritten with an invalid value. For that one I actually had to do some manual debugging and tell it what I observed, but given that info, it was able to locate the problem in the code all by itself, fix it correctly, and come up with a way to test it as well.

                For C++, the codebase in question was gdb, the bug was a test issue, and it correctly found problematic code based solely on the test log (but I had to prod it a bit in the right direction for the fix).

                I should note that this is Gemini Pro 2.5 specifically. When I tried Google's models previously (for all kinds of tasks), I was very unimpressed - it was noticeably worse than other SOTA models, so I was very skeptical going into this. Indeed, I started with Sonnet precisely because my past experience indicated that it was the best option, and I only tried Gemini after Sonnet fumbled.

                • By Asraelite 2025-04-188:101 reply

                  I use it for basically everything I can, not just code completion, including end-to-end bug fixes when it makes sense. But most of the time even the current Gemini and Claude models fail with the hard things.

                  It might be because most bugs that you would encounter in other languages don't occur in the first place in Rust because of the stronger type system. The race condition one you mentioned wouldn't be possible for example. If something like that would occur, it's a compiler error and the AI fixes it while still in the initial implementation stage by looking at the linter errors. I also put a lot of effort into trying to use coding patterns that do as much validation as possible within the type system. So in the end all that's left are the more difficult bugs where a human is needed to assist (for now at least, I'm confident that the models are only going to get better).

                  • By int_19h 2025-04-1812:35

                    Race conditions can span across processes (think async process communication).

                    That said I do wonder if the problems you're seeing are simply because there isn't that much Rust in the training set for the models - because, well, there's relatively little of it overall when you compare it to something like C++ or JS.

              • By elcritch 2025-04-1911:54

                  I've found that I need to point it to the right bit of logs or test output and narrow its attention by selectively adding to its context. Claude 3.7 at least works well this way. If you don't, it'll fumble around. Gemini hasn't worked as well for me though.

                  I partly wonder if different people's prompt styles will lead to better results with different models.

      • By yieldcrv 2025-04-1815:21

        I also cancelled my Anthropic yesterday, not because of Gemini but because it was the absolute worst time for Anthropic to limit their Pro plan to upsell their Max plan when there is so much competition out there

        Manus.im also does code generation in a nice UI, but I’ll probably be using Gemini and Deepseek

        No Moat strikes again

      • By Graphon1 2025-04-181:391 reply

        Just curious, what tool do you use to interface with these LLMs? Cursor? or Aider? or...

        • By speedgoose 2025-04-185:411 reply

          I’m on GitHub Copilot with VS Code Insiders, mostly because I don’t have to subscribe to one more thing.

          They’re pretty quick to let you use the latest models nowadays.

          • By nicr_22 2025-04-1815:32

            I really like the open source Cline extension. It supports most of the model APIs, just need to copy/paste an API key.

      • By sleiben 2025-04-1813:19

        Same here. Especially for native app development with Swift I had way better results, so I just stuck with Gemini-2.5-*

      • By wcarss 2025-04-1722:247 reply

        Google has killed so many amazing businesses -- entire industries, even, by giving people something expensive for free until the competition dies, and then they enshittify hard.

        It's cool to have access to it, but please be careful not to mistake corporate loss leaders for authentic products.

        • By gexla 2025-04-183:55

          It's not free. And it's legit one of the best models. And it was a Google employee who was among the authors of the paper that's most recognized as kicking all this off. They give somewhat limited access in AIStudio (I have only hit the limits via API access, so I don't know what the chat UI limits are.) Don't they all do this? Maybe harder limits and no free API access. But I think most people don't even know about AIStudio.

        • By JPKab 2025-04-1722:513 reply

          True. They are ONLY good when they have competition. The sense of complacency that creeps in is so obvious as a customer.

          To this day, the Google Home (or is it called Nest now?) speaker is the only physical product I've ever owned where it lost features over time. I used to be able to play the audio of a Youtube video (like a podcast) through it, but then Google decided that it was very very important that I only be able to play a Youtube video through a device with a screen, because it is imperative that I see a still image when I play a longform history podcast.

          Obviously, this is a silly and highly specific example, but it is emblematic of how they neglect or enshittify massive swathes of their products as soon as the executive team loses interest and puts their A team on some shiny new object.

          • By bitpush 2025-04-1723:311 reply

            The experience on Sonos is terrible. There are countless examples of people sinking thousands of dollars into the Sonos ecosystem, and the new app update has rendered them useless.

            • By nl 2025-04-1810:27

              It's mostly fixed now (5 room Sonos setup here). It's also a lot better at not dropping speakers off its network

          • By average_r_user 2025-04-189:33

            I'm experiencing the same problem with my Google Home ecosystem. One day I can turn off the living room lights with the simple phrase "Turn off Living Room Lights," and then randomly for two straight days it doesn't understand my command

          • By freedomben 2025-04-1814:28

            Preach it my friend. For years on the Google Home Hub (or Nest Hub or whatever) I could tell it to "favorite my photo" of what is on the screen. This allowed me to incrementally build a great list of my favorite photos on Google Photos and added a ton of value to my life. At some point that broke, and now it just says, "Sorry, I can't do that yet". Infuriating

        • By pdntspa 2025-04-181:57

          The usage limit for experimental gets used up pretty fast in a vibe-coding situation. I found myself setting up an API account with billing enabled just to keep going.

        • By bredren 2025-04-1722:58

          (Public) corporate loss leaders? Cause they are all likely corporate.

          Also, Anthropic is also subsidizing queries, no? The new “5x” plan illustrative of this?

          No doubt anthropic’s chat ux is the best right now, but it isn’t so far ahead on that or holding some UX moat that I can tell.

        • By lxgr 2025-04-1821:44

          How would I know if it’s useful to me without being able to trial it?

          Googles previous approach (Pro models available only to Gemini Advanced subscribers, and Advanced trials can’t be stacked with Google One paid storage, or rather they convert the already paid storage portion to a paid, much shorter Advanced subscription!) was mind-bogglingly stupid.

          Having a free tier on all models is the reasonable option here.

        • By mark_l_watson 2025-04-1722:581 reply

          In this case, Google is a large investor in Anthropic.

          I agree that giving away access to expensive models long term is not a good idea on several fronts. Personally, I subscribe to Gemini Advanced and I pay for using the Gemini APIs.

          EDIT: a very good deal, at $10/month is https://apps.abacus.ai/chatllm/ that gives you access to almost all commercial models as well as the best open weight models. I have never come close at all to using my monthly credits with them. If you like to experiment with many models the service is a lot of fun.

          • By F7F7F7 2025-04-180:013 reply

            The problem with tools like this is that somewhere in the chain between you and the LLM are token reducing “features”. Whether it’s the system prompt, a cheaper LLM middleman, or some other cost saving measure.

            You’ll never know what that something is. For me, I can’t help but think that I’m getting an inferior service.

            • By revnode 2025-04-181:381 reply

              You can self host something like https://big-agi.com/ and grab your own keys from various providers. You end up with the above, but without the pitfalls you mentioned.

              • By mark_l_watson 2025-04-1813:58

                  big-AGI does look cool, and supports a different use case. ABACUS.AI takes your $10/month and gives you credits that go towards their costs of using OpenAI, Anthropic, Gemini, etc. Use of smaller open models uses very few credits.

                  They also support an application development framework that looks interesting, but I have never used it.

            • By mark_l_watson 2025-04-1813:54

              You might be correct about cost savings techniques in their processing pipeline. But they also add functionality: they bake web search into all models which is convenient. I have no affiliation with ABACUS.AI, I am just a happy customer. They currently let me play with 25 models.

            • By freedomben 2025-04-1814:29

              If anyone from Kagi is on, I'd love to know, does Kagi do that?

        • By bossyTeacher 2025-04-1819:28

          Just look at Chrome to see Bard/Gemini's future. HN folks didn't care about Chrome then but cry about Google's increasingly hostile development of Chrome.

          Look at Android.

          HN behaviour is more like a kid who sees the candy, wants the candy and eats as much as it can without worrying about the damaging effect that sugar will have on their health. Then, the diabetes diagnosis arrives and they complain

    • By fsndz 2025-04-1722:223 reply

      More and more people are coming to the realisation that Google is actually winning at the model level right now.

      • By zaphirplane 2025-04-184:484 reply

          What’s with the Google cheer squad in this thread? Usually it’s "Google lost its way and is evil."

        Can’t be employees cause usually there is a disclaimer

        • By brailsafe 2025-04-197:51

            It's good to be aware of the likelihood of astroturfing. Every time there's a new thread like this for one of the companies, there's a suspicious amount of similar plausible praise in an otherwise (sometimes brutally) skeptical forum.

          The best way to shift or create consensus is to make everyone think everyone else's opinion has already shifted, and that consensus is already there. Emperor's new clothes etc..

        • By pjerem 2025-04-185:18

          Google can be evil and release impressive language models. The same way as Apple releasing incredible hardware with good privacy while also being a totally insufferable and arrogant company.

        • By bitpush 2025-04-1823:04

          Gemini 2.5 is genuinely impressive.

        • By crowbahr 2025-04-1811:33

          Google employees only have to disclaimer when they're identified as Google employees.

          So shit like "as a googler" requires "my opinions are my own yadda yadda"

      • By MagicMoonlight 2025-04-1819:401 reply

        I haven’t met a single person that uses Gemini. Companies are using Copilot and individuals are using ChatGPT.

        Also, why would I want Google to spy on my AI usage? They’re evil.

        • By fsndz 2025-04-1820:311 reply

            why is Google more evil than, say, OpenAI?

          • By nickserv 2025-04-197:54

            They're not more evil, it's probably a tie. What they do have is considerably more reach and power. As a result they're much more dangerous.

      • By orangesun 2025-04-2411:32

        Until they ditch it or make it ad-ridden like everything else.

    • By teleforce 2025-04-1723:284 reply

      >obsequious

        Thanks for the new word, I had to look it up.

      "obedient or attentive to an excessive or servile degree"

        Apparently it means an AI that mindlessly follows your logic and instructions without reasoning and articulation is not good enough.

      • By tkgally 2025-04-1723:473 reply

        Another useful word in this context is “sycophancy,” meaning excessive flattery or insincere agreement. Amanda Askell of Anthropic has used it to describe a trait they try to suppress in Claude:

        https://youtube.com/watch?v=ugvHCXCOmm4&t=10286

        • By davidsainez 2025-04-180:431 reply

          The second example she uses is really important. You (used to) see this a lot in stackoverflow where an inexperienced programmer asks how to do some convoluted thing. Sure, you can explain how to do the thing while maintaining their artificial constraints. But much more useful is to say "you probably want to approach the problem like this instead". It is surely a difficult problem and context dependent.

        • By snthpy 2025-04-186:571 reply

          Interesting that Americans appear to hold their AI models to a higher standard than their politicians.

          • By brookst 2025-04-1812:001 reply

            Different Americans.

            • By syndeo 2025-04-1817:341 reply

              Lots of folks in tech have different opinions than you may expect. Many will either keep quiet or play along to keep the peace/team cohesion, but you really never know if they actually agree deep down.

              Their career, livelihoods, ability to support their families, etc. are ultimately on the line, so they'll pay lip service if they have to. Consider it part of the job at that point; personal beliefs are often left at the door.

              • By brookst 2025-04-1913:38

                Not just tech. I spent some time on a cattle ranch (long story) and got to know some people pretty well. Quite a few confided interests and opinions they would never share at work, where the culture also has strong expectations of conformity.

      • By zoogeny 2025-04-1723:442 reply

        It's a bit of a fancy way to say "yes man". Like in corporations or politics, if a leader surrounds themselves with "yes men".

        A synonym would be sycophantic which would be "behaving or done in an obsequious way in order to gain advantage." The connotation is the other party misrepresents their own opinion in order to gain favor or avoid disapproval from someone of a higher status. Like when a subordinate tries to guess what their superior wants to hear instead of providing an unbiased response.

        I think that accurately describes my experience with some LLMs due to heavy handed RLHF towards agreeableness.

        In fact, I think obsequious is a better word since it doesn't have the cynical connotation of sycophant. LLMs don't have a motive and obsequious describes the behavior without specifying the intent.

        • By klondike_klive 2025-04-2021:05

          The only reason I know the meaning of both words is because they occur in the lyrics of the Motorhead song "Orgasmatron"

        • By teleforce 2025-04-180:301 reply

            Yes, those are the first two words that came to my mind when I read the meaning. The Gen Z word now, I think, is "simp".

          • By zoogeny 2025-04-181:161 reply

            Yeah, it is very close. But I feel simp has a bit of a sexual feel to it. Like a guy who does favors for a girl expecting affection in return, or donates a lot of money to an OnlyFans or Twitch streamer. I also see simp used where we used to call it white-knighting (e.g. "to simp for").

            Obsequious is a bit more general. You could imagine applying it to a waiter or valet who is annoyingly helpful. I don't think it would feel right to use the word simp in that case.

            In my day we would call it sucking up. A bit before my time (would sound old timey to me) people called it boot licking. In the novel "Catcher in the Rye", the protagonist uses the word "phony" in a similar way. This kind of behavior is universally disliked so there is a lot slang for it.

            • By snthpy 2025-04-186:55

              Thanks, as an old timer TIL about simp.

      • By sans_souse 2025-04-185:571 reply

        I wonder if anyone here will know this one; I learned the word "obsequious" over a decade ago while working the line at a restaurant. I used to listen to the 2p2 (2 plus 2) poker podcasts during prep and they had a regular feature with David Sklansky (iirc) giving tips, stories, advice etc. This particular one he simply gave the word "obsequious" and defined it later. I remember my sous chef and I were debating what it could mean and I guessed it right. I still can't remember what it had to do with poker, but that's beside the point.

        Maybe I can locate it

        • By sicromoft 2025-04-1816:07

          I didn't hear that one but I am a fan of Sklansky. And I also have a very vivid memory of learning the word, when I first heard the song Turn Around by They Might Be Giants. The connection with the song burned it into my memory.

      • By nemomarx 2025-04-1723:341 reply

        I think here it's referring to a common problem where the AI agrees with your position too easily, and/or instantly changes its answer if you tell it the answer is wrong (therefore providing no stable true answer if you ask it something about a fact)?

        Also the slightly over cheery tone maybe.

        • By lylah69 2025-04-1811:22

          I like to do this with Claude. It takes 5 back & forths to get an uncertain answer.

          Is there a way to tackle this?

    • By m3kw9 2025-04-1723:04

      Using Claude Code and Codex CLI and then Aider with Gemini 2.5 Pro: Aider is much faster because you feed in the files instead of using tools that start doing who knows what, spending 10x the tokens. I tried a relatively simple refactor which needed around 7 files changed; only Aider with 2.5 got it, and in the first shot, whereas both Codex and Claude Code completely fumbled it.

    • By dr_kiszonka 2025-04-1721:253 reply

      I was a big fan of that model but it has been replaced in AI Studio by its preview version, which, by comparison, is pretty bad. I hope Google makes the release version much closer to the experimental one.

      • By zoogeny 2025-04-1721:31

        I can confirm the model name in Run Settings has been updated to "Gemini 2.5 Pro Preview ..." when it used to be "Gemini 2.5 Pro (Experimental) ...".

        I cannot confirm if the quality is downgraded since I haven't had enough time with it. But if what you are saying is correct, I would be very sad. My big fear is the full-fat Gemini 2.5 Pro will be prohibitively expensive, but a dumbed down model (for the sake of cost) would also be saddening.

      • By gundmc 2025-04-184:13

        The AI Studio product lead said on Twitter that it is exactly the same model just renamed for clarity when pricing was announced

      • By dieortin 2025-04-180:42

        The preview version is exactly the same as the experimental one afaik

    • By PerusingAround 2025-04-1721:22

      This comment is exactly my experience; I feel as if I had written it myself.

    • By jofzar 2025-04-180:511 reply

      My work doesn't have access to 2.5 pro and all these posts are just making me want it so much more.

      I hate how slow things are sometimes.

      • By basch 2025-04-182:111 reply

        Can’t you just go into AI Studio with any free Gmail account?

        • By sciurus 2025-04-182:172 reply

          For many workplaces, it's not just that they don't pay for a service, it's that using it is against policy. If I tried to paste some code into ChatGPT, for example, our data loss prevention spyware would block it and I'd soon be having an uncomfortable conversation with our security team.

          (We do have access to GitHub Copilot)

          • By Atotalnoob 2025-04-182:211 reply

            Good news then, your GitHub admins can enable Gemini for you without issue.

            • By d1sxeyes 2025-04-185:24

              “Without issue” is an optimistic perspective on how this works in many organisations.

          • By Anna234 2025-04-182:20

            [dead]

    • By rgoulter 2025-04-184:39

      The 1 million token context window also means you can just copy/paste so much source code or log output.

    • By _blk 2025-04-1814:182 reply

      Have you tried Grok 3? It's a bit verbose for my taste even when prompted to be brief but answers seem better/more researched and less opinionated. It's also more willing to answer questions where the other models block an answer.

      • By zoogeny 2025-04-1818:141 reply

        I have not tried any of the Grok models but that is probably because I am rarely on X.

        I have to admit I have a bias where I think Google is "business" while Grok is for lols. But I should probably take the time to assess it, since I would prefer to have an opinion based on experience rather than vibes.

        • By _blk 2025-04-234:33

          Hehe, Google used to be cool when I studied but now it's just advertising and zero customer contact. But sure, business (or maybe big corp) fits now where it used to be more of a startup enterprise back in the late 2000s. We worked with them on a project in 2018 and given the hard hiring processes they at least used to have, I was surprised to find that even inexperienced interns did better than what we got back. I'm sure it was a one-off case but the after-taste is still bitter. On the Grok side, I think of Tesla self driving. They've been doing AI for a long time. I used to own one and they never got to what they promised (self driving by the end of 2019) but it was really good for my taste. I still think of it as the best car I ever owned even though I went back to a gas guzzler when moving to the US. But hey, only 13mpg and it still uses less electricity than a Tesla;)

      • By fuzzylightbulb 2025-04-1816:341 reply

        A lot of people don't want to patronize the businesses of an unabashed Nazi sympathizer. There are more important things in life than model output quality.

        • By _blk 2025-04-231:22

          Awesome. Very mature. Moral supremacist

    • By UltraSane 2025-04-180:151 reply

      I had a very interesting long debate/discussion with Gemini 2.5 Pro about the Synapse-Evolve bank debacle among other things. It really feels like debating a very knowledgeable and smart human.

      • By rat9988 2025-04-1812:432 reply

        You didn't have a debate, you just researched a question.

        • By UltraSane 2025-04-1820:00

          All right, Mr. Pedantic. Very complex linear algebra created a very convincing illusion of a debate. You happy now?

          But good LLMs will take a position and push back at your arguments.

        • By zoogeny 2025-04-1818:171 reply

          One man's debate is another man's research.

          • By rat9988 2025-04-1818:52

            Indeed, but research isn't necessarily a debate. In this case, it was not.

    • By MetaWhirledPeas 2025-04-1816:122 reply

      > 100% of my casual AI usage is now in Gemini and I look forward to asking it questions on deep topics because it consistently provides me with insight.

      It's probably great for lots of things but it doesn't seem very good for recent news. I asked it about recent accusations around xAI and methane gas turbines and it had no clue what I was talking about. I asked the same question to Grok and it gave me all sorts of details.

      • By arizen 2025-04-1819:23

        This was my experience as well.

        Gemini performs best on coding tasks, while giving underwhelming responses on recent news.

        Grok was just OK for coding tasks but, being linked to X, provided the best responses on recent events.

      • By ramesh31 2025-04-1816:152 reply

        >It's probably great for lots of things but it doesn't seem very good for recent news.

        You are missing the point here. The LLM is just the “reasoning engine” for agents now. Its corpus of facts is meaningless, and shouldn’t really be relied upon for anything. But in conjunction with a tool-calling agentic process, with access to the web, what you described is now trivially doable. Single-shot LLM usage is not really anything anyone should be doing anymore.

        • By MetaWhirledPeas 2025-04-1818:141 reply

          > You are missing the point here.

          I'm just discussing the GP's topic of casual use. Casual use implies heading over to an already-hosted prompt and typing in questions. Implementing my own 'agentic process' does not sound very casual to me.

          • By ramesh31 2025-04-1818:25

            > Implementing my own 'agentic process' does not sound very casual to me.

            It really is though. This can be as simple as using Claude desktop with a web search tool.

        • By darksaints 2025-04-1816:331 reply

          That’s all fine and dandy, but if you google anything related to llm agents, you get 1000 answers to 100 questions, companies hawking their new “visual programming” agent composers, and a ton of videos of douchebags trying to be the Steve Jobs of AI. The concept I’m sure is fine, but execution of agentic anything is still the Wild Wild West and nobody knows what they’re really doing.

    • By goshx 2025-04-1723:14

      Same here! It is borderline stubborn at times and I need to prove it wrong. Still, it is the best model to use with Cursor, in my experience.

    • By i_love_retros 2025-04-182:561 reply

      Why is it free / so cheap (I seem to be getting charged a few cents a day using it with aider so not free but still crazy cheap compared to sonnet)

      • By brendanfinan 2025-04-183:221 reply

        we know how Google makes money

        • By d1sxeyes 2025-04-185:261 reply

          Give it a few months and it will ignore all your questions and just ask if you’ve watched Rampart.

          • By disgruntledphd2 2025-04-187:16

            To be fair, Google do have a cost advantage here as they've built their own hardware.

    • By casey2 2025-04-1810:33

      It's a big deal, but not in the way that you think. A race to the bottom is humanity's best defense against fast takeoff.

    • By redox99 2025-04-184:08

      I've had many disappointing results with gemini 2.5 pro. For general queries possibly involving search, chatgpt and grok work better for me.

      For code, gemini is very buggy in cursor, so I use Claude 3.7. But it might be partly cursor's fault.

    • By crossroadsguy 2025-04-184:45

      One difference, and imho that’s a big difference — you can’t use any of Google’s chatbots/models without being logged in, unlike ChatGPT.

    • By instagraham 2025-04-1811:59

      obsequious is such a nice word for this context, only possible in the AI age.

      I'd find the same word improper to describe human beings - other words like plaintive, obedient and compliant often do the job better and are less obscure.

      Here it feels like a word whose time has come.

    • By cjohnson318 2025-04-1721:231 reply

      Yeah, my wife pays for ChatGPT, but Gemini is fine enough for me.

      • By qwertox 2025-04-1722:033 reply

        Just be aware that if you don't add a key (and set up billing) you're granting Google the right to train on your data. To have persons read them and decide how to use them for training.

        • By Graphon1 2025-04-181:401 reply

          > To have persons read them and decide how to use them for training.

          Not that I have any actual insight, but doesn't it seem more likely that it will not be a human, but a model? Models training models.

          • By qwertox 2025-04-186:54

            > To help with quality and improve our products, human reviewers may read, annotate, and process your API input and output. Google takes steps to protect your privacy as part of this process. This includes disconnecting this data from your Google Account, API key, and Cloud project before reviewers see or annotate it. Do not submit sensitive, confidential, or personal information to the Unpaid Services.

        • By energy123 2025-04-180:481 reply

          I thought if you turn off App Activity then that's good enough to protect your data?

        • By HDThoreaun 2025-04-1814:17

          Unless you have the enterprise sub of OpenAI, they're training on your data too.

    • By thebiggening 2025-04-281:33

      Tremendous spam. Just wonderful. Comments too.

  • By simonw 2025-04-180:163 reply

    An often overlooked feature of the Gemini models is that they can write and execute Python code directly via their API.

    My llm-gemini plugin supports that: https://github.com/simonw/llm-gemini

      uv tool install llm
      llm install llm-gemini
      llm keys set gemini
      # paste key here
      llm -m gemini-2.5-flash-preview-04-17 \
        -o code_execution 1 \
        'render a mandelbrot fractal in ascii art'
    
    I ran that just now and got this: https://gist.github.com/simonw/cb431005c0e0535343d6977a7c470...

    They don't charge anything extra for code execution, you just pay for input and output tokens. The above example used 10 input, 1,531 output which is $0.15/million for input and $3.50/million output for Gemini 2.5 Flash with thinking enabled, so 0.536 cents (just over half a cent) for this prompt.

    • By pantsforbirds 2025-04-1818:09

      Saw a full example in a few commands using uv, thought "wow, I bet that Simon guy from Twitter would love this" ... it's already him.

    • By blahgeek 2025-04-180:392 reply

      > An often overlooked feature of the Gemini models is that they can write and execute Python code directly via their API.

      Could you elaborate? I thought function calling is a common feature among models from different providers

      • By simonw 2025-04-181:001 reply

        The Gemini API runs the Python code for you as part of your single API call, without you having to handle the tool call request yourself.
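
          With the google-genai SDK, enabling that looks roughly like this (a sketch; I'm assuming the types.Tool / types.ToolCodeExecution names from the current SDK):

            from google import genai
            from google.genai import types

            client = genai.Client(api_key="GEMINI_API_KEY")
            response = client.models.generate_content(
                model="gemini-2.5-flash-preview-04-17",
                contents="render a mandelbrot fractal in ascii art",
                # enable the built-in code execution tool; the model writes and
                # runs Python server-side within this one API call
                config=types.GenerateContentConfig(
                    tools=[types.Tool(code_execution=types.ToolCodeExecution())]
                ),
            )
            print(response.text)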

        • By tempaccount420 2025-04-183:181 reply

          This is so much cheaper than re-prompting each tool use.

          I wish this was extended to things like: you could give the model an API endpoint that it can call to execute JS code, and the only requirement is that your API has to respond within 5 seconds (maybe less actually).

          I wonder if this is what OpenAI is planning to do in the upcoming API update to support tools in o3.

          • By danpalmer 2025-04-187:571 reply

              I imagine there wouldn’t be much of a cost to the provider on the API call there, so much longer times may be possible. It’s not like this would hold up the LLM in any way; execution would get suspended while the call is made and the TPU/GPU would serve another request.

            • By suchar 2025-04-1820:05

                They need to keep the KV cache to avoid prompt reprocessing, so they would need to move it to RAM/NVMe during longer API calls to use the GPU for another request.

      • By WiSaGaN 2025-04-180:481 reply

        This common feature requires the user of the API to implement the tool; in this case, the user is responsible for running the code the API outputs. The post you replied to suggests that Gemini will run the code for the user behind the API call.
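
        For contrast, here is a minimal sketch of the client-side loop that conventional function calling requires; call_model is a stub standing in for any real chat-completions-style API:

          import subprocess

          def call_model(messages):
              # Stub: a real API would return the model's next message here.
              return {"tool_call": {"name": "run_python", "code": "print(2 + 2)"}}

          def run_python(code: str) -> str:
              # The client, not the provider, executes the model's code.
              return subprocess.run(["python", "-c", code],
                                    capture_output=True, text=True).stdout

          messages = [{"role": "user", "content": "what is 2 + 2?"}]
          reply = call_model(messages)
          if "tool_call" in reply:
              output = run_python(reply["tool_call"]["code"])
              messages.append({"role": "tool", "content": output})
              reply = call_model(messages)  # a second round trip, billed again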

        • By tempoponet 2025-04-1814:59

          That was how I read it as well, as if it had a built-in lambda type service in the cloud.

          If we're just talking about some API support to call python scripts, that's pretty basic to wire up with any model that supports tool use.

    • By throaway920181 2025-04-1818:22

      I wish Gemini could do this with Go. It generates plenty of junk/non-parseable code and I have to feed it the error messages and hope it properly corrects it.

  • By arnaudsm 2025-04-1719:3218 reply

    Gemini Flash models have the least hype, but in my experience in production they have the best bang for the buck and the best multimodal tooling.

    Google is silently winning the AI race.

    • By Nihilartikel 2025-04-1722:174 reply

      100% agree. I had Gemini flash 2 chew through thousands of points of nasty unstructured client data and it did a 'better than human intern' level conversion into clean structured output for about $30 of API usage. I am sold. 2.5 pro experimental is a different league though for coding. I'm leveraging it for massive refactoring now and it is almost magical.

      • By jdthedisciple 2025-04-1722:3710 reply

        > thousands of points of nasty unstructured client data

        What I always wonder in these kinds of cases is: What makes you confident the AI actually did a good job since presumably you haven't looked at the thousands of client data yourself?

        For all you know it made up 50% of the result.

        • By mediaman 2025-04-183:201 reply

          This was solved a hundred years ago.

          It's the same problem factories have: they produce a lot of parts, and it's very expensive to put a full operator or more on a machine to do 100% part inspection. And the machines aren't perfect, so we can't just trust that they work.

          So starting in the 1920s, Walter Shewhart and W. Edwards Deming came up with Statistical Process Control. We accept the quality of the product produced based on the variance we see in samples, and how they measure against upper and lower control limits.

          Based on that, we can estimate a "good parts rate" (which later got used in ideas like Six Sigma to describe the probability of bad parts being passed).

          The software industry was built on determinism, but now software engineers will need to learn the statistical methods created by engineers who have forever lived in the stochastic world of making physical products.
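
          For instance, a minimal p-chart sketch; the sample size and defect counts below are made up for illustration:

            import math

            # 3-sigma control limits for a defect proportion (a p-chart).
            def control_limits(p_bar: float, n: int) -> tuple[float, float]:
                sigma = math.sqrt(p_bar * (1 - p_bar) / n)
                return max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)

            n = 200                        # outputs spot-checked per batch
            defects = [7, 5, 9, 6, 30, 8]  # hypothetical defects per batch
            p_bar = sum(defects) / (n * len(defects))
            lcl, ucl = control_limits(p_bar, n)
            for i, d in enumerate(defects):
                rate = d / n
                status = "ok" if lcl <= rate <= ucl else "OUT OF CONTROL"
                print(f"batch {i}: defect rate {rate:.3f} [{status}]")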

          • By thawawaycold 2025-04-186:432 reply

            I hope you're being sarcastic. SPC is necessary because mechanical parts have physical tolerances and manufacturing processes are affected by unavoidable statistical variations; it is beyond idiotic to be provided with a machine that can execute deterministic, repeatable processes and then throw that all into the gutter for mere convenience, justified simply because "the time is ripe for SWEs to learn statistics".

            • By int_19h 2025-04-187:221 reply

              We don't know how to implement a "deterministic, repeatable process" that can look at a bug in a repo and implement a fix end-to-end.

              • By thawawaycold 2025-04-187:351 reply

                That is not what OP was talking about, though.

                • By rorytbyrne 2025-04-189:231 reply

                  LLMs are literally stochastic, so the point is the same no matter what the example application is.

                  • By warkdarrior 2025-04-1818:26

                    Humans are literally stochastic, so the point is the same no matter what the example application is.

            • By perching_aix 2025-04-1818:09

              The deterministic, repeatable process of human (and now machine) judgement and semantic processing?

        • By tominous 2025-04-1723:121 reply

          In my case I had hundreds of invoices in a not-very-consistent PDF format which I had contemporaneously tracked in spreadsheets. After data extraction (pdftotext + OpenAI API), I cross-checked against the spreadsheets, and for any discrepancies I reviewed the original PDFs and old bank statements.

          The main issue I had was that it was surprisingly hard to get the model to consistently strip commas from dollar values, which broke the CSV output I asked for. I gave up on prompt-engineering it to perfection, and just looped around it with a regex check (sketched below).

          Otherwise, accuracy was extremely good and it surfaced a few errors in my spreadsheets over the years.
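
          A hypothetical sketch of such a regex pass (pattern and sample row invented for illustration):

            import re

            # Strip thousands separators inside dollar amounts so they stop
            # breaking the CSV columns.
            money = re.compile(r"\$?\d{1,3}(?:,\d{3})+(?:\.\d{2})?")

            def clean_amounts(line: str) -> str:
                return money.sub(lambda m: m.group(0).replace(",", ""), line)

            print(clean_amounts("2024-03-01,Invoice 42,$1,234.56"))
            # -> 2024-03-01,Invoice 42,$1234.56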

          • By jofzar 2025-04-181:04

            I hope there is a future where CSV commas don't screw up data. I know it will never happen, but it's a nightmare.

            Everyone has a story of a CSV formatting nightmare.

        • By Nihilartikel 2025-04-180:33

          For what it's worth, I did check over many hundreds of them. Formatted things for side by side comparison and ordered by some heuristics of data nastiness.

          It wasn't a one-shot deal at all. I found the ambiguous modalities in the data and hand-corrected examples to include in the prompt. After about 10 corrections and some exposition about the cases it seemed to misunderstand, it got really good. Edit: not too different from a feedback loop with an intern ;)

        • By summerlight 2025-04-1723:161 reply

          Though the same logic can be applied everywhere, right? Even if it's done by human interns, you either need to audit everything to be 100% confident or just have some trust in them.

          • By andrei_says_ 2025-04-1818:341 reply

            Not the same logic because interns can make meaning out of the data - that’s built-in error correction.

            They also remember what they did - if you spot one misunderstanding, there’s a chance they’ll be able to check all similar scenarios.

            Comparing the mechanics of an LLM to human intelligence shows deep misunderstanding of one, the other, or both - if done in good faith of course.

            • By summerlight 2025-04-1822:141 reply

              Not sure why you're trying to conflate intellectual-capability problems into this and complicate the argument. The problem layout is the same: you delegate work to someone, so you cannot understand all the details. This creates a fundamental tension between trust and confidence. The parameters might differ with intellectual capability, but whomever you delegate to, you cannot evade this trade-off.

              BTW, not sure if you have experience of delegating work to human interns or new grads and being rewarded with disastrous results? I've done that multiple times and don't trust anyone too much. This is why we typically develop review processes, guardrails, etc.

              • By andrei_says_ 2025-04-222:47

                > not sure if you have experience of delegating work to human interns or new grads and being rewarded with disastrous results?

                Oh yes I have ;)

                Which is why I always explain the why behind the task.

        • By FooBarWidget 2025-04-185:05

          You can use AI to verify its own work. Last time I split a C++ header file into header + implementation file. I noticed some code got rewritten in a wrong manner, so I asked it to compare the new implementation file against the original header file, but to do so one method at a time. For each method, say whether the code is exactly the same and has the same behavior, ignoring superficial syntax changes and renames. Took me a few times to get the prompt right, though.

        • By golergka 2025-04-1723:08

          Many types of data have very easily checkable aggregates. Think accounting books.

        • By jofzar 2025-04-181:02

          It also depends on what you are using the data for; if it's for non-precise, data-informed decisions, then it's fine. Especially if you're looking for "vibe"-based direction before dedicating time to "actually" process the data for confirmation.

          $30 to get a view into data that would take at least X hours of someone's time is actually super cheap, especially if the decision from that result is whether or not to invest the X hours to confirm it.

        • By pamplemoose 2025-04-1723:11

          You take a sample and check.

        • By visarga 2025-04-184:50

          In my professional opinion they can extract data at 85-95% accuracy.

      • By tcgv 2025-04-1816:20

        > I'm leveraging it for massive refactoring now and it is almost magical.

        Can you share more about your strategy for "massive refactoring" with Gemini?

        Like the steps in general for processing your codebase, and even your main goals for the refactoring.

      • By roygbiv2 2025-04-184:15

        Isn't it better to get Gemini to create a tool to format the data? Or was the data in such a state that that would have been impossible?

      • By cdelsolar 2025-04-182:452 reply

        What tool are you using 2.5-pro-exp through? Cline? Or the browser directly?

        • By Nihilartikel 2025-04-183:49

          For 2.5 pro exp I've been attaching files into AIStudio in the browser in some cases. In others, I have been using vscode's Gemini Code Assist which I believe recently started using 2.5 Pro. Though at one point I noticed that it was acting noticeably dumber, and over in the corner, sure enough it warned that it had reverted to 2.0 due to heavy traffic.

          For the bulk data processing I just used the python API and Jupyter notebooks to build things out, since it was a one-time effort.

        • By manmal 2025-04-186:10

          Copilot experimental (needs VSCode Insiders) has it. I've thought about trying aider --watch-files though; it also works with multiple files.

    • By statements 2025-04-1719:491 reply

      Absolutely agree. Granted, it is task dependent. But when it comes to classification and attribute extraction, I've been using 2.0 Flash with huge success across massive datasets. It would not even be viable cost-wise with other models.

      • By sethkim 2025-04-1721:49

        How "huge" are these datasets? Did you build your own tooling to accomplish this?

    • By bhl 2025-04-1723:00

      It's cheap but also lazy. It sometimes generates empty strings or empty arrays for tool calls, and then I just re-route the request to a stronger model for the tool call.

      I've spent a lot of time on prompts and tool-calls to get Flash models to reason and execute well. When I give the same context to stronger models like 4o or Gemini 2.5 Pro, it's able to get to the same answers in less steps but at higher token cost.

      Which is to be expected: more guardrails for smaller, weaker models. But then it's a tradeoff; no easy way to pick which models to use.

      Instead of SQL optimization, it's now model optimization.

    • By spruce_tips 2025-04-1719:431 reply

      I have a high-volume task I wrote an eval for, and was pleasantly surprised at 2.0 Flash's cost-to-value ratio, especially compared to GPT-4.1 mini/nano.

      accuracy | input price ($/M tokens) | output price ($/M tokens)

      Gemini Flash 2.0 Lite: 67% | $0.075 | $0.30

      Gemini Flash 2.0: 93% | $0.10 | $0.40

      GPT-4.1-mini: 93% | $0.40 | $1.60

      GPT-4.1-nano: 43% | $0.10 | $0.40

      Excited to try out 2.5 Flash.

      • By jay_kyburz 2025-04-1720:426 reply

        Can I ask a serious question: what task are you doing where it's OK to get a 7% error rate? I can't get my head around how this can be used.

        • By 16bytes 2025-04-1721:53

          There are tons of AI/ML use-cases where 7% is acceptable.

          Historically speaking, if you had a 15% word error rate in speech recognition, it would generally be considered useful. 7% would be performing well, and <5% would be near the top of the market.

          Typically, your error rate just needs to be below the usefulness threshold and in many cases the cost of errors is pretty small.

        • By omneity 2025-04-1720:481 reply

          In my case, I have workloads like this where it’s possible to verify the correctness of the result after inference, so any success rate is better than 0 as it’s possible to identify the “good ones”.

          • By nonethewiser 2025-04-180:102 reply

            Aren’t you basically just saying you are able to measure the error rate? I mean that’s good, but already a given in this scenario where hes reporting the 7% error rate.

            • By jsnell 2025-04-181:02

              No. If you're able to verify correctness of individual items of work, you can accept the 93% of verified items as-is and send the remaining 7% to some more expensive slow path.

              That's very different from just knowing the aggregate error rate.
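
              A toy sketch of that verify-then-escalate pattern; both "models" here are stand-in stubs:

                def cheap_model(x: int) -> int:
                    return x * 2 if x != 3 else 7    # wrong on one input

                def expensive_model(x: int) -> int:
                    return x * 2                     # assumed always correct

                def verify(x: int, y: int) -> bool:
                    return y == x + x                # cheap to check

                def solve(x: int) -> int:
                    y = cheap_model(x)
                    # Accept verified cheap answers; escalate the rest.
                    return y if verify(x, y) else expensive_model(x)

                print([solve(x) for x in range(5)])  # -> [0, 2, 4, 6, 8]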

            • By yjftsjthsd-h 2025-04-1820:31

              No, it's anything that's harder to write than verify. A simple example is a logic puzzle; it's hard to come up with a solution, but once you have a possible answer it's really easy to check it. In fact, it can be easier to vet multiple answers and tell the machine to try again than solve it once manually.

        • By spruce_tips 2025-04-1720:46

          Low-stakes text classification, but it's something that needs to be done and couldn't be done in reasonable time frames or at reasonable price points by humans.

        • By muzani 2025-04-186:06

          I expect some manual correction after the work is done. I actually mentally counted all the times I pressed backspace while writing this paragraph, and it comes down to 45. I'm not counting the next paragraph or changing the number.

          Humans make a ton of errors as well. I didn't even notice how many I was making here until I started counting. AI is super useful to just get a first draft out, not for the final work.

        • By sroussey 2025-04-1820:15

          You could be OCRing a page that includes a summation line, then add up all the numbers and check against the sum.
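
          For example, with made-up numbers:

            line_items = [123.45, 67.89, 10.00]  # hypothetical OCR'd rows
            ocr_total = 201.34                   # the OCR'd summation line
            # If the parts don't add up to the stated total, re-extract.
            assert abs(sum(line_items) - ocr_total) < 0.01, "OCR mismatch"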

        • By dist-epoch 2025-04-1721:001 reply

          [flagged]

          • By wavewrangler 2025-04-1722:15

            Yeah, general propaganda and psyops are actually more effective around 12% - 15%, we find it is more accurate to the user base, thus is questioned less for standing out more /s

    • By ramesh31 2025-04-1722:111 reply

      >”Google is silently winning the AI race.”

      It’s not surprising. What was surprising honestly was how they were caught off guard by OpenAI. It feels like in 2022 just about all the big players had a GPT-3 level system in the works internally, but SamA and co. knew they had a winning hand at the time, and just showed their cards first.

      • By wkat4242 2025-04-1722:502 reply

        True and their first mover advantage still works pretty well. Despite "ChatGPT" being a really uncool name in terms of marketing. People remember it because they were the first to wow them.

        • By kaoD 2025-04-1820:011 reply

          How is ChatGPT bad in terms of marketing? It's recognizable and rolls off the tongue in many many many languages.

          Gemini is what sucks from a marketing perspective. Generic-ass name.

          • By simonw 2025-04-1820:121 reply

            Generative Pre-trained Transformer is a horrible term to have an acronym for.

            • By kaoD 2025-04-1820:251 reply

              Do you think the mass market thinks GPT is an acronym? It's just a name. Currently synonymous with AI.

              Ask anyone outside the tech bubble about "Gemini" though. You'll get astrology.

              • By wkat4242 2025-04-1820:32

                True I guess they treat it just like SMS.

                I still think they'd have taken off more if they'd given it a catchy name from the start and made the interface a bit more consumer friendly.

        • By golergka 2025-04-1723:11

          It feels more authentically engineer-coded.

    • By ghurtado 2025-04-1720:40

      I know it's a single data point, but yesterday I showed it a diagram of my fairly complex MicroPython program (including RP2-specific features, DMA, and PIO) and it was able to describe in detail not just the structure of the program, but also exactly what it does and how it does it. This is before seeing a single line of code, just going by boxes and arrows.

      The other AIs I have shown the same diagram to, have all struggled to make sense of it.

    • By redbell 2025-04-1721:35

      > Google is silently winning the AI race

      Yep, I agree! This convinced me: https://news.ycombinator.com/item?id=43661235

    • By rvz 2025-04-1720:12

      Google has been winning the AI race ever since DeepMind was properly put to use developing their AI models, instead of the Google AI team that built Bard.

    • By Layvier 2025-04-1719:382 reply

      Absolutely. So many use cases for it, and it's so cheap/fast/reliable

      • By SparkyMcUnicorn 2025-04-1719:46

        And stellar OCR performance. Flash 2.0 is cheaper and more accurate than AWS Textract, Google Document AI, etc.

        Not only in benchmarks[0], but in my own production usage.

        [0] https://getomni.ai/ocr-benchmark

      • By danielbln 2025-04-1719:45

        I want to use these almost-too-cheap-to-meter models like Flash more. What are some interesting use cases for them?

    • By russellbeattie 2025-04-1722:12

      I have to say, I never doubted it would happen. They've been at the forefront of AI and ML for well over a decade. Their scientists were the authors of the "Attention is all you need" paper, among thousands of others. A Google Scholar search produces endless results. There just seemed to be a disconnect between the research and product areas of the company. I think they've got that worked out now.

      They're getting their ass kicked in court though, which might be making them much less aggressive than they would be otherwise, or at least quieter about it.

    • By no_wizard 2025-04-1722:352 reply

      I remember everyone saying it's a two-horse race between Google and OpenAI; then DeepSeek happened.

      Never count out the possibility of a dark horse competitor ripping the sod right out from under them.

    • By 42lux 2025-04-1719:441 reply

      The API is free, and it's great for everyday tasks. So yes, there is no better bang for the buck.

      • By drusepth 2025-04-1719:475 reply

        Wait, the API is free? I thought you had to use their web interface for it to be free. How do you use the API for free?

        • By dcre 2025-04-1719:502 reply

          You can get an API key and they don't bill you. Free tier rate limits for some models (even decent ones like Gemini 2.0 Flash) are quite high.

          https://ai.google.dev/gemini-api/docs/pricing

          https://ai.google.dev/gemini-api/docs/rate-limits#free-tier

          • By NoahZuniga 2025-04-1721:011 reply

            The rate limits I've encountered with free API keys have been way lower than the limits advertised.

            • By jmacd 2025-04-1723:351 reply

              I agree. I found it unusable for anything but casual usage due to the rate limiting. I wonder if I am just missing something?

              • By tempthrow 2025-04-188:08

                I think it's the small TPM limits. I'll be way under the 10-30 requests per minute while using Cline, but it appears that the input tokens count towards the rate limit so I'll find myself limited to one message a minute if I let the conversation go on for too long, ironically due to Gemini's long context window. AFAIK Cline doesn't currently offer an option to limit the context explosion to lower than model capacity.

          • By nolok 2025-04-186:481 reply

            I'm pretty sure that's a Google Maps level of free, where once they're in control they will bill it massively.

            • By dcre 2025-04-1812:44

              There is no reason to expect the other entrants in the market to drop out and give them monopoly power. The paid tier is also among the cheapest. People say it's because they built their own inference hardware and are genuinely able to serve it cheaper.

        • By spruce_tips 2025-04-1719:49

          Create an API key and don't set up billing. Pretty low rate limits, and they use your data.

        • By midasz 2025-04-1720:03

          I use Gemini 2.5 pro experimental via openrouter in my openwebui for free. Was using sonnet 3.7 but I don't notice much difference so just default to the free thing now.

        • By mlboss 2025-04-1719:49

          using aistudio.google.com

    • By GaggiX 2025-04-1720:19

      Flash models are really good even for an end user because of how fast they are and how well they perform.

    • By xnx 2025-04-1719:52

      Shhhh. You're going to give away the secret weapon!

    • By paulcole 2025-04-1723:17

      > Google is silently winning the AI race.

      It’s not clear to me what either the “race” or “winning” is.

      I use ChatGPT for 99% of my personal and professional use. I’ve just gotten used to the interface and quirks. It’s a good consumer product that I like to pay $20/month for and use. My work doesn’t require much in the way of monthly tokens but I just pay for the OpenAI API and use that.

      Is that winning? Becoming the de facto “AI” tool for consumers?

      Or is the race to become what’s used by developers inside of apps and software?

      The race isn’t to have the best model (I don’t think) because it seems like the 3rd best model is very very good for many people’s uses.

    • By belter 2025-04-1719:371 reply

      > Google is silently winning the AI race.

      That is what we keep hearing here... I cancelled my account for the last Gemini, and can't help noticing the new one they are offering for free...

      • By arnaudsm 2025-04-1719:402 reply

        Sorry I was talking of B2B APIs for my YC startup. Gemini is still far behind for consumers indeed.

        • By JeremyNT 2025-04-1720:441 reply

          I use Gemini almost exclusively as a normal user. What am I missing out on that they are far behind on?

          It seems shockingly good and I've watched it get much better up to 2.5 Pro.

          • By arnaudsm 2025-04-1721:045 reply

            Mostly brand recognition, and the earlier Geminis had more refusals.

            As a consumer, I also really miss the Advanced voice mode of ChatGPT, which is the most transformative tech in my daily life. It's the only frontier model with true audio-to-audio.

            • By jorvi 2025-04-180:45

              > and the earlier Geminis had more refusals.

              It's more that almost every company is running a classifier on their web chat's output.

              It isn't actually the model refusing; rather, if the classifier hits a threshold, it'll swap the model's output with "Sorry, let's talk about something else."

              This is most apparent with DeepSeek. If you use their web chat with V3 and then jailbreak it, you'll get uncensored output but it is then swapped with "Let's talk about something else" halfway through the output. And if you ask the model, it has no idea its previous output got swapped, and you can even ask it to build on its previous answer. But if you use the API, you can push it pretty far with a simple jailbreak.

              These classifiers are virtually always run on a separate track, meaning you cannot jailbreak them.

              If you use an API, you only have to deal with the inherent training data bias, neutering by tuning and neutering by pre-prompt. The last two are, depending on the model, fairly trivial to overcome.

              I still think the first big AI company that has the guts to say "our LLM is like a pen and brush, what you write or draw with it is on you" and publishes a completely unneutered model will be the one to take a huge slice of marketshare. If I had to bet on anyone doing that, it would be xAI with Grok. And by not neutering it, the model will perform better in SFW tasks too.

            • By Jensson 2025-04-1811:24

              > and the earlier Geminis had more refusals.

              You can turn those off; Google lets you decide how much it censors, and you can turn it off completely.

              It has separate sliders for sexually explicit, hate, dangerous, and harassment content (see the sketch below). It is by far the best at this, since sometimes you want those refusals/filters.
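
              In API terms those sliders are per-category safety settings; a sketch with the google-genai SDK, assuming the current enum names:

                from google import genai
                from google.genai import types

                client = genai.Client(api_key="YOUR_KEY")
                config = types.GenerateContentConfig(
                    safety_settings=[
                        types.SafetySetting(category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
                                            threshold="BLOCK_NONE"),
                        types.SafetySetting(category="HARM_CATEGORY_HATE_SPEECH",
                                            threshold="BLOCK_NONE"),
                        types.SafetySetting(category="HARM_CATEGORY_DANGEROUS_CONTENT",
                                            threshold="BLOCK_ONLY_HIGH"),
                        types.SafetySetting(category="HARM_CATEGORY_HARASSMENT",
                                            threshold="BLOCK_ONLY_HIGH"),
                    ],
                )
                # Pass config to client.models.generate_content(...) as usual.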

            • By whistle650 2025-04-182:11

              Have you tried the Gemini Live audio-to-audio in the free Gemini iOS app? I find it feels far more natural than ChatGPT Advanced Voice Mode.

            • By wavewrangler 2025-04-1722:21

              What do you mean "miss"? You don't have the budget to keep something you truly miss for $20? What am I missing here? I don't mean to criticize, I am just curious. I would reword but I have to go.

            • By what_ever 2025-04-1723:54

              What is true audio-to-audio in this case?

        • By int_19h 2025-04-187:24

          They used to be, but not anymore, not since Gemini Pro 2.5. Their "deep research" offering is the best available on the market right now, IMO - better than both ChatGPT and Claude.

    • By Fairburn 2025-04-1719:42

      Sorry, but no. Gemini isn't the fastest horse yet, and its use within their ecosystem means it isn't geared to the masses outside of their bubble. They are not leading the race, but they are a contender.

    • By gambiting 2025-04-1720:043 reply

      In my experience they are as dumb as a bag of bricks. The other day I asked "can you edit a picture if I upload one"

      And it replied "sure, here is a picture of a photo editing prompt:"

      https://g.co/gemini/share/5e298e7d7613

      It's like "baby's first AI". The only good thing about it is that it's free.

      • By ghurtado 2025-04-1720:451 reply

        > in my experience they are as dumb as a bag of bricks

        In my experience, anyone that describes LLMs using terms of actual human intelligence is bound to struggle using the tool.

        Sometimes I wonder if these people enjoy feeling "smarter" when the LLM fails to give them what they want.

        • By mdp2021 2025-04-1721:14

          If those people are a subset of those who demand actual intelligence, they will very often feel frustrated.

      • By JFingleton 2025-04-1720:243 reply

        Prompt engineering is a thing.

        Learning how to "speak llm" will give you great results. There's loads of online resources that will teach you. Think of it like learning a new API.

        • By gambiting 2025-04-185:44

          This was using Gemini on my phone - which both Samsung and Google advertise as "just talk to it".

        • By abletonlive 2025-04-180:34

          For now. One would hope that this is a transitory moment in LLMs and that we can just use intuition in the future.

        • By asadotzler 2025-04-180:365 reply

          LLM's whole thing is language. They make great translators and perform all kinds of other language tasks well, but somehow they can't interpret my English language prompts unless I go to school to learn how to speak LLM-flavored English?

          WTF?

          • By th0ma5 2025-04-1816:26

            You have the right perspective. All of these people hand-waving away the core issue don't realize their own biases. Some of the best of these things tout as much as 97% accuracy on tasks, but if a person were randomly wrong in 3% of what they said, you'd call an ambulance, and no doctor would be able to diagnose their condition. (The kinds of errors that people make with brain injuries are a major diagnostic tool, and the error patterns are known for the major types of common injuries... Conversely, there is no way to tell within an LLM system whether any specific token is correct, and its incorrectness is not even categorizable.)

          • By pplante 2025-04-181:24

            I like to think of my interactions with an LLM like I'm explaining a request to a junior engineer or non-engineering person. You have to be more verbose with someone who has zero context in order for them to execute a task correctly. The LLM only has the context you provided, so it fails hard, like an inexperienced junior engineer would at a complicated task.

          • By int_19h 2025-04-187:25

            It's a natural language processor, yes. It's not AGI. It has numerous limitations that have to be recognized and worked around to make use of it. Doesn't mean that it's not useful, though.

          • By JFingleton 2025-04-187:03

            They are not humans - so yeah I can totally see having to "go to school" to learn how to interact with them.

      • By nowittyusername 2025-04-1720:47

        It's because Google hasn't realized the value of training the model on information about its own capabilities and metadata. My biggest pet peeve about Google and the way they train these models.

HackerNews