Are LLM merge rates not getting better?

2026-03-12 11:49 · entropicthoughts.com


[figure: no-swe-bench-improvement.jpg]

I was reading the metr article on how llm code passes tests much more often than it is of mergeable quality. They look at the performance of llms doing programming when the success criterion is “passes all tests” and compare it to when the success criterion is “would get approved by the maintainer”. Unsurprisingly, llm performance is much worse under the more stringent success criterion. Their 50 % success horizon moves from 50 minutes down to 8 minutes.

As part of this they have included figures such as this one:

[figure: swebench-01.png]

But there’s something about it that strikes me as odd. Let’s look only at the more valuable data, the merge rates.

[figure: swebench-02.png]

What line best characterises this data? metr suggested something that slopes slightly upwards. But here’s what I see:

[figure: swebench-03.png]

At some point toward the end of 2024 we may have had a step up in ability, but this plot shows no evidence of any actual improvement in merge rates since early 2025.

Fisher warns us against eyeballing plots, so let’s make it more formal. We’ll use leave-one-out cross-validation and compare the linear slope suggested by metr against the step function the plot hints at.
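The procedure can be sketched in a few lines of Python. Everything here is illustrative: the data points, the breakpoint, and the exact model forms are stand-ins I made up, not metr's measurements.

```python
import numpy as np

# Hypothetical stand-in data: release dates (months since the start of
# the plot) and merge rates. Made up to illustrate the procedure.
t = np.array([0, 2, 5, 8, 11, 14, 17, 20], dtype=float)
y = np.array([0.02, 0.03, 0.04, 0.12, 0.10, 0.13, 0.11, 0.12])

def loo_brier(t, y, fit_predict):
    """Leave-one-out CV: fit on all points but one, predict the
    held-out point, and average the squared errors."""
    errs = []
    for i in range(len(t)):
        mask = np.arange(len(t)) != i
        pred = fit_predict(t[mask], y[mask], t[i])
        errs.append((y[i] - pred) ** 2)
    return float(np.mean(errs))

def linear(t_tr, y_tr, t0):
    # Gentle slope: ordinary least-squares line.
    slope, intercept = np.polyfit(t_tr, y_tr, 1)
    return slope * t0 + intercept

def step(t_tr, y_tr, t0, breakpoint=7.0):
    # Piecewise constant: one mean before the breakpoint, one after.
    side = (t_tr >= breakpoint) if t0 >= breakpoint else (t_tr < breakpoint)
    return y_tr[side].mean()

def constant(t_tr, y_tr, t0):
    # A single mean across the whole timespan.
    return y_tr.mean()

for name, model in [("linear", linear), ("step", step), ("constant", constant)]:
    print(f"{name:9s} {loo_brier(t, y, model):.4f}")
```

Each model is refit once per held-out point, so no data point gets to vouch for its own prediction.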

Model                  Brier score
Gentle upward slope         0.0129
Piecewise constant          0.0117

The Brier score is a form of squared error, thus lower is better. This means the step function has more predictive power (“fits better”) than the linear slope. For fun, we can also fit a function that is completely constant across the entire timespan. That happens to get the best Brier score.
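As a reminder of what is being scored, here is a minimal sketch of the Brier score itself, on made-up binary merge/no-merge outcomes:

```python
def brier(predicted_probs, outcomes):
    # Mean squared difference between a forecast probability and the
    # observed 0/1 outcome; lower is better.
    return sum((p - o) ** 2
               for p, o in zip(predicted_probs, outcomes)) / len(outcomes)

# Illustrative: forecasting 0.1 for ten PRs, of which one was merged.
print(brier([0.1] * 10, [0] * 9 + [1]))  # ≈ 0.09
```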

Model                  Brier score
Gentle upward slope         0.0129
Piecewise constant          0.0117
Constant function           0.0100

Stop and think about what this means: the two models that predict constant merge rates over the latter half of the plot are more accurate than the linear growth trend.1 And more accurate than a logistic trend, since linear in log-odds is nearly linear in probability for this range of values. This corroborates what we eyeballed in the plots: the merge rate has not increased in the latter half of this plot.
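The claim that linear in log-odds is nearly linear in probability at these merge rates is easy to check numerically. A sketch with illustrative values centred on a 10 % merge rate:

```python
import math

def logistic(x):
    return 1 / (1 + math.exp(-x))

# A trend that is linear in log-odds, compared with its straight-line
# (first-order Taylor) approximation around p = 0.10. The shifts are
# chosen to cover roughly the probability band seen in the plot.
x0 = math.log(0.10 / 0.90)           # log-odds of p = 0.10
for dx in [-0.4, -0.2, 0.0, 0.2, 0.4]:
    p = logistic(x0 + dx)            # the logistic trend
    lin = 0.10 + 0.10 * 0.90 * dx    # tangent line, slope p * (1 - p)
    print(f"shift {dx:+.1f}: logistic {p:.4f}  linear {lin:.4f}")
```

The two columns stay within about half a percentage point of each other, which is why distinguishing a logistic from a linear trend here buys nothing.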

This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?

Post scriptum: I have heard claims that in the four months between the end of the metr plot and today there has been another step in capability (with newer Anthropic and Google models), but we have no clear evidence of that, because nobody has measured merge rates as carefully as metr has for the models past Sonnet 4.5.

There may have been a clear step up in capability. But on the other hand, people made the same claim throughout 2025 as well, and as we see now, it wasn’t true then. During 2025, the gap between buzz and actual performance was larger than we thought. Is the same true now? I don’t know. But that’s what I find interesting.



Comments

  • By aerhardt 2026-03-12 13:04 (6 replies)

    I feel that two things are true at the same time:

    1) Something happened during 2025 that made the models (or crucially, the wrapping terminal-based apps like Claude Code or Codex) much better. I only type in the terminal anymore.

    2) The quality of the code is still quite often terrible. Quadruple-nested control flow abounds. Software architecture in rather small scopes is unsound. People say AI is “good at front end” but I see the worst kind of atrocities there (a few days ago Codex 5.3 tried to inject a massive HTML element with a CSS ::before hack, rather than properly refactoring the markup)

    Two forces feel true simultaneously but in permanent tension. I still cannot make up my mind and see the synthesis in the dialectic: where this is truly going, whether we're meaningfully moving forward or mostly moving in circles.

    • By leoedin 2026-03-13 10:06

      This matches my experience too. The models write code that would never pass a review normally. Mega functions, "copy and pasted" code with small changes, deep nested conditionals and loops. All the stuff we've spent a lot of time trying to minimise!

      You could argue it's OK because a model can always fix it later. But the problem comes when there are subtle logic bugs and it's basically impossible to understand. Or fixing the bug in one place doesn't fix it in the 10 other places where almost the same code exists.

      I strongly suspect that LLMs, like all technologies, are going to follow an S curve of capability. The question is where in that S curve we are right now.

    • By zx8080 2026-03-13 10:08 (1 reply)

      > People say AI is “good at front end” but I see the worst kind of atrocities there

      It's almost universal to say "AI is great at X" when one is not a professional in X. It's because that's how AI is designed: to output tokens according to stats, not logic, not semantics, not meaning: stats.

      • By contextfree 2026-03-13 18:28

        Reading discussions online and comparing them to my own experience makes me feel crazy, because I've found today's LLMs and agents to be seemingly good at everything except writing code. Including everything else in software engineering around code (debugging, reviewing, reading code, brainstorming architecture, etc.) as well as discussing various questions in the humanities and sciences where I'm a dilettante. But whenever I've asked them to generate any substantial amount of code, beyond a few lines to demonstrate usage of some API I'm unfamiliar with, the results have always been terrible and I end up either throwing it out or rewriting almost all of it myself and spending more time than if I'd just written it myself from the start.

        It's occurred to me that maybe this just shows that I'm better at writing code and/or worse at everything else than I'd realized.

    • By jygg4 2026-03-12 13:44

      The models lose the ability to inject subtle and nuanced stuff as they scale up, is what I've observed.

    • By orwin 2026-03-12 13:08 (1 reply)

      > People say AI is “good at front end”

      I only say that because I'm a shit frontend dev. Honestly, I'm not that bad anymore, but I'm still shit, and the AI will probably generate better code than I will.

      • By jygg4 2026-03-12 13:55 (1 reply)

        As long as humans are needed to review code, it sounds like your role evolves toward prompting and reviewing.

        Which is akin to driving a car - the motor vehicle itself doesn’t know where to go. It requires you to prompt via steering and braking etc, and then to review what is happening in response.

        That’s not necessarily a bad thing: reviewing code ultimately matters most, as long as what is produced is more often than not correct and legible. Now this is a different issue, for which there isn’t a consensus across software engineers.

        • By cicko 2026-03-13 7:06 (2 replies)

          I don't think that reviewing code is as important as reviewing results. Nobody reviews the IL or assembly code when they write in higher-level languages. It's the end result that matters in most cases.

          • By aix1 2026-03-13 8:13 (2 replies)

            But we don't evolve IL or assembly code as the system evolves. We regenerate it from scratch every time.

            It is therefore not important whether some intermediate version of that low-level code was completely impossible to understand.

            It is not so with LLM-written high-level code. More often than not, it does need to be understood and maintained by someone or something.

            These days, I mainly focus on two things in LLM code reviews:

            1. Making sure unit tests have good coverage of expected behaviours.

            2. Making sure the model is making sound architectural decisions, to avoid accumulating tech debt that'll need to be paid back later. It's very hard to check this with unit tests.

            • By nitwit005 2026-03-13 18:00

              We get stuck reviewing the output assembly when it's broken, and that does happen from time to time. The reason it doesn't happen often is that generation of assembly follows strict rules, which people have tried their best to test. That's not the behavior we're going to get out of an LLM.

            • By contextfree 2026-03-13 18:20

              Yes, prompts aren't analogous to higher-level code, they're analogous to wizards or something like that which were always rightly viewed with suspicion.

          • By rienbdj 2026-03-13 7:39

            But those are close to deterministic.

    • By naruhodo 2026-03-13 3:51 (1 reply)

      > 1) Something happened during 2025 that made the models (or crucially, the wrapping terminal-based apps like Claude Code or Codex) much better. I only type in the terminal anymore.

      I have heard it said that the change was better context management and compression.

      • By bbatha 2026-03-13 4:46

        A lot of enhancements came on the model side which in many ways enabled context engineering.

        200k and now 1M contexts. Better context management was enabled by improvements in structured outputs/tool calling at the model level. Also, reasoning models really upped the game: "plan" mode wouldn't work well without them.

  • By wongarsu 2026-03-12 12:57 (2 replies)

    I don't find this very compelling. If you look at the actual graph they are referencing but never showing [1], there is a clear improvement from Sonnet 3.7 -> Opus 4.0 -> Sonnet 4.5. This is just hidden in their graph because they are only looking at the number of PRs that are mergeable with no human feedback whatsoever (a high standard even for humans).

    And even if we were to agree that that's a reasonable standard, GPT 5 shouldn't be included. There is only one data point for all OpenAI models, and it is more indicative of the performance of OpenAI models (and the harness used) than of any progression. Once you exclude it, the data matches what you would expect from a logistic model: improvements have slowed down, but not stopped.

    1: https://metr.org/assets/images/many-swe-bench-passing-prs-wo...

    • By yorwba 2026-03-12 13:14 (2 replies)

      Yes, I think this is basically an instance of the "emergent abilities mirage." https://arxiv.org/abs/2304.15004

      If you measure completion rate on a task where a single mistake can cause a failure, you won't see noticeable improvements on that metric until all potential sources of error are close to being eliminated, and then if they do get eliminated it causes a sudden large jump in performance.

      That's fine if you just want to know whether the current state is good enough on your task of choice, but if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
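The threshold effect described above can be sketched with a toy model, assuming independent per-step errors; the step count and error rates are purely illustrative:

```python
# If a task needs n sequential steps and each step independently fails
# with probability e, the whole-task success rate is (1 - e) ** n.
# Steady per-step improvement then shows up as an abrupt jump in
# end-to-end success rather than gradual gains.
def completion_rate(per_step_error, n_steps):
    return (1 - per_step_error) ** n_steps

for e in [0.20, 0.10, 0.05, 0.02, 0.01]:
    print(f"per-step error {e:.2f}: 50-step task succeeds "
          f"{completion_rate(e, 50):.1%} of the time")
```

Halving the per-step error from 20 % to 10 % barely moves the task-level metric; halving it from 2 % to 1 % nearly doubles it.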

      • By thesz 2026-03-13 0:04 (2 replies)

          > until all potential sources of error are close to being eliminated
        
        This is what PSP/TSP did: one has to (continually) review one's own work to identify the most frequent sources of (user-facing) defects.

          >  if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
        
        This is also one of the tenets of PSP/TSP. If you have a task with an estimate longer than a day (8 hours), break it down.

        This is fascinating. The LLM community is discovering PSP/TSP rules that were laid down more than twenty years ago.

        What the LLM community misses is that in PSP/TSP it is the individual software developer who is responsible for figuring out what they need to look after.

        What I see is LLM users trying to harness LLMs against what they perceive as errors. It's not that the LLMs are learning; it is that their users are trying to strong-arm them with prompts.

        • By aspenmartin 2026-03-13 13:21

          I don’t know that it’s fair to characterize the LLM community as ignorant and rediscovering PSP/TSP. I in fact see this as programmers rediscovering survival analysis, and most LLM folks I know have learned these perspectives from that lens. Could be wrong about PSP; maybe things are more nuanced. But what is there that isn’t already covered by foundational statistics?

        • By maest 2026-03-13 4:56 (1 reply)

          What is PSP/TSP?

          • By kqr 2026-03-13 5:08

            One of many ways people have branded the idea of process improvement for software engineering.

      • By Bombthecat 2026-03-12 21:36 (1 reply)

        That's how the public perceives it, though.

        It's useless and never gets better, until it suddenly, unexpectedly, gets good enough.

        • By ForHackernews 2026-03-12 22:43 (1 reply)

          My robo-chauffeur kept crashing into different things until one day he didn't.

          • By Mielin 2026-03-13 7:52

            A robot vacuum is allowed to crash into things and is still quite useful. You add bumpers, maybe some sort of proximity sensors, to make the crashes less damaging. It is safe by construction: it can't harm humans because it is too small.

            Things have improved a bit? Now robot shelves become a possibility. Map everything, use more sensors, restrict humans to a particular area. Still quite useful. It is safe by the design of the areas, where humans rarely walk among robots.

            Improved further? Now we can do a food-delivery robot. Slow down a bit, use many more sensors, think extra hard about how to make it safer. Add a flag on a flagpole. Rounded body. Collisions are probably going to happen. Make the robot lighter than humans so that the robot takes more damage than the human in a collision. Humans are vulnerable to falling over, so make the robot height just right to grab onto to regain balance, somewhere near waist height.

            Something like that... Now I wish this were an actual progression requirement for a robo-taxi company before they start releasing robo-taxis onto our streets. But at least we do it as mankind: algorithm improvements and safety solutions still benefit the whole chain. And the benefit to humanity grows even when the technology is not quite good enough for one particular task.

    • By roxolotl 2026-03-12 13:05

      I don't know; that graph, to me, shows Sonnet 4.5 as worse than 3.7. Maybe the automated grader is finding code breakages in 3.7 and not breaking that out? But I'd much prefer to add code in a different style to my codebase than code that breaks other code. And even ignoring that, the pass rate is almost identical between the two models.

  • By curiouscube 2026-03-12 12:44 (1 reply)

    There is a decent case for this thesis to hold true, especially if we look at the shift in training regimes and benchmarking over the last 1-2 years. Frontier labs don't seem to really push pure size/capability anymore; it's an all-in focus on agentic AI, which is mainly complex post-training regimes.

    There are good reasons why they don't or can't do simple parameter upscaling anymore, but still, it makes me bearish on AGI, since it's a slow but massive shift in goal-setting.

    In practice this still doesn't mean 50 % of white collar can't be automated though.

    • By lich_king 2026-03-12 21:40 (3 replies)

      > In practice this still doesn't mean 50 % of white collar can't be automated though.

      Let me ask you this, though: if we wanted to, what percentage of white collar jobs could have been automated or eliminated prior to LLMs?

      Meta has nearly 80k employees to basically run two websites and three mobile apps. There were 18k people working at LinkedIn! Many big tech companies are massive job programs with some product on the side. Administrative business partners, program managers, tech writers, "stewards", "champions", "advocates", 10-layer-deep reporting chains... engineers writing cafe menu apps and pet programming languages... a team working on in-house typefaces... the list goes on.

      I can see AI producing shifts in the industry by reducing demand for meaningful work, but I doubt the outcome here is mass unemployment. There's an endless supply of bs jobs as long as the money is flowing.

      • By jmalicki 2026-03-12 23:22 (1 reply)

        Meta has 80k employees to run the world's most massive engine of commerce through advertising and matching consumers to products.

        They build generative AI tools so people can make ads more easily.

        They have some of the most sophisticated tracking out there. They have shadow profiles on nearly everyone. Have you visited a website? You have a shadow profile even if you don't have a Facebook account. They know who your friends are based on who you are near. They know what stores you visit when.

        Large fractions of their staff are making imperceptible changes to ads tracking and feed ranking that are making billions of dollars of marginal revenue.

        What draws you in as a consumer is a tiny tip of the iceberg of what they actually do.

        • By kreyenborgi 2026-03-13 7:51

          So like parent said, mostly bs jobs that would improve the product if removed </s>

      • By ehnto 2026-03-13 0:01

        There are many reasons why we are seeing cuts economically, but the fact that it is possible to make such large cuts is because there were way too many people working at these companies. They had so much cheap money that they over-hired, now money isn't so cheap and they need to reduce headcount. AI need not enter the conversation to get to that point.

      • By suttontom 2026-03-12 22:54 (2 replies)

        This is unfair and dismissive of many roles. Coordination in a massive, technically complex company that has to adhere to laws and regulations is a critical role. I don't get why people shit on certain roles (I'm a SWE). Our PgMs reduce friction and help us be more productive and focused. Technical writers produce customer-facing content and code, and have nothing to do with supporting internal bureaucracy. There are arguments against this in Bullshit Jobs but do you think companies pay PgMs or HR employees hundreds of thousands of dollars a year out of the goodness of their own hearts? Or maybe they actually help the business?

        • By lich_king 2026-03-12 23:18 (2 replies)

          You realize that the reason you need to manage this organizational complexity is largely because the organization is so huge?...

          The reality is that you could run LinkedIn with far, far fewer people. You probably need fewer than 100 for core engineering, and likely fewer than 1,000 overall if you include compliance, sales, and so on; a lot of overseas compliance work is outsourced to consulting firms anyway, so it's not like you have a team of lawyers in every country in the world.

          Before there was so much money in the system, we used to run companies that way. Two decades ago, I worked for a company that had tens of millions of users, maintained its own complex nationwide infra (no AWS back then), and had 400 full-time employees. That made coordination problems a lot easier too. We didn't need ten layers of people and project management because there just weren't that many of us.

          • By jmalicki 2026-03-12 23:24 (1 reply)

            When doubling the number of employees can triple your revenue, you do it.

            Keeping a website running with high uptime is not the goal. Maximizing revenue and profit is. The extra people aren't waste, they're what drive the incremental imperceptible changes that make these companies profitable.

            • By 0x3f 2026-03-13 8:17 (1 reply)

              This seems like a just-so story.

              • By jmalicki 2026-03-13 13:27 (1 reply)

                You can see it happen in reverse with X/Twitter.

                Did reducing waste affect the user experience or uptime of Twitter? Not really.

                But advertising revenues plummeted, because those extra employees were mostly not about the user experience or keeping the website up, they were about servicing the advertisers that brought the company revenue.

                • By 0x3f 2026-03-13 17:32

                  I thought advertising revenues plummeted mostly for content/optics/PR reasons, not ad-buyer-facing feature reasons.

          • By ahtihn 2026-03-13 7:24 (1 reply)

            And how much revenue did that company bring in compared to something like Meta?

            Maybe there's a correlation there?

            • By 0x3f 2026-03-13 8:18 (1 reply)

              I think the person you're replying to is perfectly aware of the correlation, considering it was a primary feature of their comment.

              • By ahtihn 2026-03-13 16:39 (1 reply)

                Not really? The main point of their comment is that companies could be much smaller based on their experience at a much smaller company.

                I'm implying that big companies couldn't make as much money as they do without all the employees they have.

                • By 0x3f 2026-03-13 22:03

                  Their last paragraph seems to acknowledge the correlation, but flips your assumed causal direction. I.e., they seem to be implying that the excess money causes the complexity.

        • By slopinthebag 2026-03-13 5:06

          It's also because as you increase organisational complexity, you need to manage it somehow, which generally means hiring more people to do that. And then you need to hire people to manage those new managers. Ad infinitum. The increased complexity begets more complexity.

          It sort of reminds me of The Collapse of Complex Societies by Joseph Tainter. These companies are their own microcosms of a complex society and I bet we will see mass layoffs in the future, not from AI but from those companies collapsing into a more sustainable state.

HackerNews