
I know there are companies that are highly productive with AI, including ours. However, AI skeptics ask for real studies, and all the studies available now show no real gains.
Many won't care unless you show them an actual study.
So my question is: are there any actual studies of the companies that make it work with AI?
DORA released a report last year: https://dora.dev/research/2025/dora-report/
The gains are a ~17% increase in individual effectiveness, but with ~9% extra instability.
In my experience using AI-assisted coding for a bit longer than 2 years, the benefit is close to what DORA reported (maybe a bit higher, around 25%). Nothing close to an average of 2x, 5x, 10x. There's a 10x in some very specific tasks, but also a negative factor in others, as seemingly trivial but high-impact bugs get to production that would normally have been caught very early in development or in code reviews.
Obviously it depends on what one does. Using AI to build a UI to share cat pictures has a different risk appetite than building a payments backend.
The full report can be found here: https://services.google.com/fh/files/misc/2025_state_of_ai_a...
That 17% increase is in self-reported effectiveness. The software delivery throughput only went up 3%, at a cost of that 9% extra instability. So you can build 3% faster with 9% more bugs, if I'm reading those numbers right.
Those aren't even percentage increases, but standardized effect sizes. So if you take an individual survey respondent and all you know is that they self-reported higher AI usage, you can predict their self-reported individual effectiveness slightly more accurately, but most of the variation will be due to unrelated factors.
The question that people are actually interested in, "After adopting this specific AI tool, will there be a noticeable impact on measures we care about?" is not addressed by this model at all, since they do not compare individual respondents' answers over time, nor is there any attempt to establish causality.
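To make the effect-size point concrete, here's a minimal sketch in Python of what a standardized effect size such as Cohen's d is. The survey numbers are invented for illustration, not DORA's data; the point is only that the result is a difference in group means measured in pooled standard deviations, not a percentage.

    # Standardized effect size (Cohen's d). Numbers are made up,
    # purely to illustrate the units, not to reproduce DORA's model.
    import math
    import statistics

    # Self-reported effectiveness (1-7 scale) by AI-usage group
    low_ai  = [4, 5, 3, 4, 5, 4, 3, 5]
    high_ai = [5, 5, 4, 6, 5, 4, 5, 6]

    n1, n2 = len(low_ai), len(high_ai)
    s1, s2 = statistics.stdev(low_ai), statistics.stdev(high_ai)
    pooled_sd = math.sqrt(
        ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    )

    # d is in standard-deviation units: an effect size of 0.17 means
    # the groups differ by 0.17 pooled SDs, which is not the same
    # thing as "17% more productive".
    d = (statistics.mean(high_ai) - statistics.mean(low_ai)) / pooled_sd
    print(f"Cohen's d = {d:.2f}")

So an effect size of 0.17 on a survey scale says the high-usage group scores slightly higher on average; it does not translate into a 17% productivity gain.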
And a 3% difference is at the "new coffee in the office is kinda shit and developers are annoyed" level of difference.
I think for myself, it's close to 25% if I only take my role as a dev. If I take my 'senior' role it's less, because I spend way more time in reviews or in prod incident meetings.
Three months ago, with Opus 4.5, I would have said that the productivity improvement was ~10% for my whole team.
I now have to contradict myself: juniors and even experienced new hires with little domain knowledge don't improve as fast as they used to. After 8 months, I still have to write new tasks/issues the way I would for someone we just hired. I still catch the same issues in reviews that we caught three months ago.
Basically, experience doesn't improve productivity as fast as it used to. On easy stuff it doesn't matter: for things like frontend changes, the productivity gains are extremely high, probably 10x. And on specific subjects like red teaming, where a quantity of small tools beats an integrated solution, I think it can be even better than that.
But I'm in a netsec tooling team, we do hard automation work to solve hard engineering issues, and that is starting to be a problem if juniors don't level up fast.
For me it is a 2x or 5x or something, but "high-impact bugs get to production that would normally have been caught very early in development or in code reviews" is what takes it back down to a 1.5x.
There are genuinely weeks where I go 5x though, and others where I go 0.5x.
It's not that valuable to assess the current state, i.e. what the impact of using AI is today. From personal experience, the overall impact on productivity felt negative a couple of years ago, might be positive now, and will be positive in a couple of years. That means by assessing the current impact we're just finding where we are on that change curve. If we accept that the trend is happening, then we know at some point it will pass (or has passed) the threshold where our companies will fall behind if they're not using it. We also know it takes a while to get up to speed and make the most of it, so the earlier we start the better. The counterargument is that we could wait for a later wave to jump on, but that's risky, and the only potential reward is a small short-term productivity gain.
Of course, if stability is part of what you're supposed to be delivering, then you can't be 17% more effective.
Self-reported productivity does not equate to actual productivity. People have all sorts of biases that make such assessments fairly pointless. They only gauge how you feel about your productivity, which is not necessarily a bad thing, but it doesn't mean you're actually more productive.
To extend on this: measuring productivity for any kind of complex work was already difficult before LLMs, so there's no reason to think we have better measures now.
You need broad economic measurements, not individual- or company-specific ones. And that takes a long time, plus there's a lot of noise in the data right now (war, for example).
We're incapable of putting an accurate, standardized value on developer productivity, yet there often seems to be consensus among senior engineers about who the high performers and low performers are. I can certainly tell with the people I work with.
We are definitely not. Point at a problem, and measure the cost of solving it. That's developer productivity.
We only avoid doing it at scale because it's expensive. In particular if we want the measurement to generalise out of sample.
(In particular in this case, where once we're done, proponents will claim our data is too old to be a useful guide to tomorrow.)
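As a sketch of what "measure the cost of solving it" could look like in practice: hand matched tasks to two groups and compare time to a merged, working change. Everything below is hypothetical, invented numbers and a toy setup, not data from any actual study.

    # Hypothetical measurement: same task pool, two groups, compare
    # the cost (hours to a merged, passing change). Numbers invented.
    import statistics

    hours_with_ai    = [3.5, 8.0, 2.0, 6.5, 4.0, 12.0]
    hours_without_ai = [5.0, 7.5, 4.0, 9.0, 4.5, 10.0]

    ratio = statistics.mean(hours_without_ai) / statistics.mean(hours_with_ai)
    print(f"Mean cost ratio (without/with AI): {ratio:.2f}x")

    # The expensive parts are everything this toy version leaves out:
    # matching task difficulty, getting enough samples, and following
    # up on rework so the cost includes fixing what shipped.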
> Point at a problem, and measure the cost of solving it.
The problem with this is that AI will create worse code that is going to cause more problems in the future, but the measurements won’t take that into account.
Yes.
If only we could measure teams against themselves, against others, and against some kind of baseline, but we don't, AFAIK.
Lines of code pushed ... obviously /s
Unironically, AI evaluating the impact of those lines might be getting close to a metric that would measure output better than having everyone print out their last 6 months of work for the new boss to look at.
Or it might be horribly bad at it, like nearly every other problem people claim "AI might be good at".