Measuring AI Ability to Complete Long Tasks

2025-07-05 11:40 • spectrum.ieee.org

By 2030, AI will greatly outperform humans in some complex intellectual tasks. Discover how LLMs are doubling their capabilities every seven months.

Benchmarking large language models presents some unusual challenges. For one, the main purpose of many LLMs is to provide compelling text that’s indistinguishable from human writing. And success in that task may not correlate with metrics traditionally used to judge processor performance, such as instruction execution rate.

But there are solid reasons to persevere in attempting to gauge the performance of LLMs. Otherwise, it’s impossible to know quantitatively how much better LLMs are becoming over time—and to estimate when they might be capable of completing substantial and useful projects by themselves.

[Figure: Scatter plot showing a negative correlation between success rate and task-messiness score. Large language models are more challenged by tasks that have a high “messiness” score. Credit: Model Evaluation & Threat Research]

That was a key motivation behind work at Model Evaluation & Threat Research (METR). The organization, based in Berkeley, Calif., “researches, develops, and runs evaluations of frontier AI systems’ ability to complete complex tasks without human input.” In March, the group released a paper called Measuring AI Ability to Complete Long Tasks, which reached a startling conclusion: According to a metric it devised, the capabilities of key LLMs are doubling every seven months. This realization leads to a second conclusion, equally stunning: By 2030, the most advanced LLMs should be able to complete, with 50 percent reliability, a software-based task that takes humans a full month of 40-hour workweeks. And the LLMs would likely be able to do many of these tasks much more quickly than humans, taking only days, or even just hours.

An LLM Might Write a Decent Novel by 2030

Such tasks might include starting up a company, writing a novel, or greatly improving an existing LLM. The availability of LLMs with that kind of capability “would come with enormous stakes, both in terms of potential benefits and potential risks,” AI researcher Zach Stein-Perlman wrote in a blog post.

At the heart of the METR work is a metric the researchers devised called “task-completion time horizon.” It’s the amount of time human programmers would take, on average, to do a task that an LLM can complete with some specified degree of reliability, such as 50 percent. A plot of this metric for some general-purpose LLMs going back several years [main illustration at top] shows clear exponential growth, with a doubling period of about seven months. The researchers also considered the “messiness” factor of the tasks, with “messy” tasks being those that more resembled ones in the “real world,” according to METR researcher Megan Kinniment. Messier tasks were more challenging for LLMs [smaller chart, above].
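The extrapolation above is simple exponential growth. The sketch below works through the arithmetic under the article’s stated assumptions (a seven-month doubling period, a target of one month of 40-hour workweeks); the one-hour starting horizon is an illustrative assumption, not a figure from the paper.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling period reported by METR


def horizon_after(months_elapsed: float, start_horizon_hours: float) -> float:
    """Time horizon (in human-hours, at 50% reliability) after elapsed months."""
    return start_horizon_hours * 2 ** (months_elapsed / DOUBLING_MONTHS)


def months_until(target_hours: float, start_horizon_hours: float) -> float:
    """Months for the horizon to grow from the starting value to the target."""
    return DOUBLING_MONTHS * math.log2(target_hours / start_horizon_hours)


# Target: one month of 40-hour workweeks, i.e. 4 weeks x 40 hours = 160 hours.
one_month_of_work = 4 * 40

# Assumed (hypothetical) starting point: a 1-hour task horizon today.
print(months_until(one_month_of_work, start_horizon_hours=1.0))  # ~51.3 months
```

With that assumed starting point, the target is reached in roughly four and a quarter years, which is consistent with the article’s “by 2030” framing.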

If the idea of LLMs improving themselves strikes you as having a certain singularity-robocalypse quality to it, Kinniment wouldn’t disagree with you. But she does add a caveat: “You could get acceleration that is quite intense and does make things meaningfully more difficult to control without it necessarily resulting in this massively explosive growth,” she says. It’s quite possible, she adds, that various factors could slow things down in practice. “Even if it were the case that we had very, very clever AIs, this pace of progress could still end up bottlenecked on things like hardware and robotics.”



Comments

  • By fendy3002 2025-07-05 12:45 (2 replies)

    Because I believe the Pareto principle (https://en.wikipedia.org/wiki/Pareto_principle) applies to most aspects of computing, I believe it will apply in this case too, and I find that it tracks with the progress of LLMs/AIs.

    Breaking past 80% accuracy and solving the remaining 20% of problems will be the main challenge for the next generation (or the one after that) of LLMs, not to mention the work still needed to bring down computing costs.

    EDIT: That said, solving 80% of problems with 80% accuracy and significant time savings is a solution worth considering, though we need to stay skeptical, because the remaining 20% may get much worse if the 80% was solved with poor quality.

    • By Yoric 2025-07-05 13:35

      There is a big difference between LLMs and most other tech improvements, though: with most technologies that I can think of that solve 80% of the problem, it's easy to find out whether the technology works. When you're working with an LLM, though, it's really hard to know whether the answer is correct/usable or not.

  • By timr 2025-07-05 12:53 (1 reply)

    For those people who won’t read anything more than the headline, this is a silly paper based on a metric that considers only “task completion time” at “a specified degree of reliability, such as 50 percent” for “human programmers”.

    Then, in a truly genius stroke of AI science, the current article extrapolates this to infinity and beyond, while hand-waving away the problem of “messiness”, which clearly calls the extrapolation into question:

    > At the heart of the METR work is a metric the researchers devised called “task-completion time horizon.” It’s the amount of time human programmers would take, on average, to do a task that an LLM can complete with some specified degree of reliability, such as 50 percent. A plot of this metric for some general-purpose LLMs going back several years [main illustration at top] shows clear exponential growth, with a doubling period of about seven months. The researchers also considered the “messiness” factor of the tasks, with “messy” tasks being those that more resembled ones in the “real world,” according to METR researcher Megan Kinniment. Messier tasks were more challenging for LLMs [smaller chart, above]

    • By dang 2025-07-05 13:22 (1 reply)

      What would be a more accurate and neutral headline?

      • By Y_Y 2025-07-05 13:41 (2 replies)

        The paper and blog posts referenced are both called "Measuring AI Ability to Complete Long Tasks"; this might do better.

        • By timr 2025-07-05 14:42 (1 reply)

          Agreed. Or "AI models are getting faster", which seems defensible.

          • By recursivecaveat 2025-07-05 16:03

            They're not saying that the models are getting faster. They're saying that the models are becoming capable at all of completing tasks that take humans longer and longer. The task completion time for humans is a proxy for complexity of the task, or some notion of how far the model can get without human intervention.

        • By dang 2025-07-06 11:57

          Ok, belatedly changed. Thanks!

  • By untitled2 2025-07-05 12:52

    A classic mistake is assuming that if 1 worker produces 10 products a day, 10 workers will produce 100. In fact, what one software developer can do in a week, ten will do in a year. Copypasta can be fast and very inaccurate today -- it will be faster and much more inaccurate later.

HackerNews