I’m furious. I’m really angry. I’m angry in a knocking down sandcastles and punching Daniel LaRusso in the face and talking smack about him to his girl kind of way.
I’m not an angry person generally, but I can’t stand what’s happening to my industry.
I know software development. I’ve been doing it for 25 years, maybe even 28 years if you count market research tabulation on amber monochrome screens. Yes, I’m old. I’m a middle-aged programming nerd. My entire life and personal identity are wrapped up in this programming thing for better or worse. I thrive off the dopamine hits from shipping cool things.
I was an early adopter of AI coding and a fan until maybe two months ago, when I read the METR study and suddenly developed serious doubts. In that study, the authors discovered that developers were unreliable narrators of their own productivity. They thought AI was making them 20% faster, but it was actually making them 19% slower. This shocked me because I had just told someone the week before that I thought AI was only making me about 25% faster, and I was bummed it wasn’t a higher number. I was only off by 5 percentage points from the developers’ own incorrect estimates.
This was unsettling. It was impossible not to question whether I too was an unreliable narrator of my own experience. Was I hoodwinked by the screens of code flying by, with no way of quantifying whether all that reading and reviewing of code actually took more time than just doing the thing myself?
So, I started testing my own productivity using a modified methodology from that study. I’d take a task and estimate how long it would take to code if I were doing it by hand, and then I’d flip a coin: heads I’d use AI, tails I’d just do it myself. Then I’d record when I started and when I ended. That would give me the delta between my estimate and the actual time, and I could use the delta to build AI vs no AI charts and look for trends. I ran that for six weeks, recording all that data, and do you know what I discovered?
I discovered that the data isn’t statistically significant at any meaningful level. I would need to record new data points for another four months just to prove whether AI was speeding me up or slowing me down at all. It’s too neck and neck.
That lack of differentiation between the groups is really interesting, though. Yes, it’s a limited sample and could be chance, but so far AI appears to slow me down by a median of 21%, roughly in line with the 19% from the METR study. I can say definitively that I’m not seeing any massive increase in speed (e.g., 2x) using AI coding tools. If I were, the results would be statistically significant and the study would be over.
That’s really disappointing.
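For anyone who wants to try the coin-flip protocol themselves, here is a minimal sketch of how it could be logged and analyzed. It is not my actual tooling: the `Task` record, the relative-delta definition (actual minus estimate, divided by estimate), and the choice of a permutation test on the difference of medians are all assumptions made for illustration.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    estimate_min: float  # by-hand estimate, written down BEFORE the coin flip
    used_ai: bool        # coin flip result: True = heads = use AI
    actual_min: float    # measured time from start to finish


def flip_coin(rng: random.Random) -> bool:
    """Heads (True) means use AI; tails (False) means code it by hand."""
    return rng.random() < 0.5


def relative_delta(task: Task) -> float:
    """(actual - estimate) / estimate; 0.21 means 21% slower than estimated."""
    return (task.actual_min - task.estimate_min) / task.estimate_min


def median_gap(tasks: list[Task]) -> float:
    """Median relative delta of AI tasks minus that of by-hand tasks."""
    ai = [relative_delta(t) for t in tasks if t.used_ai]
    hand = [relative_delta(t) for t in tasks if not t.used_ai]
    return statistics.median(ai) - statistics.median(hand)


def permutation_p_value(tasks: list[Task], n_shuffles: int = 10_000,
                        seed: int = 0) -> float:
    """Two-sided permutation test (an assumed choice of test).

    Shuffles the AI / by-hand labels and counts how often the shuffled
    gap is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(median_gap(tasks))
    labels = [t.used_ai for t in tasks]
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        shuffled = [Task(t.name, t.estimate_min, label, t.actual_min)
                    for t, label in zip(tasks, labels)]
        if abs(median_gap(shuffled)) >= observed:
            hits += 1
    return hits / n_shuffles
```

You would call `flip_coin` once per task after writing down the estimate, append a `Task` when you finish, and run `permutation_p_value` on the whole log. Both groups need at least one task before `median_gap` will work. A large p-value means the log can't yet tell the AI and by-hand groups apart, which is exactly the neck-and-neck situation described above.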
I wish the AI coding dream were true. I wish I could make every dumb coding idea I ever had a reality. I wish I could make a fretboard learning app on Monday, a Korean trainer on Wednesday, and a video game on Saturday. I’d release them all. I’d drown the world in a flood of shovelware like the world had never seen. Well, I would — if it worked.
It turns out, though, and I’ve collected a lot of data on this, it doesn’t just not work for me, it doesn’t work for anyone, and I’m going to prove that.
But first, let’s examine how extreme and widespread these productivity claims are. Cursor’s tagline is “Built to make you extraordinarily productive.” Claude Code’s is “Build Better Software Faster.” GitHub Copilot’s is “Delegate like a boss.” Google says their LLMs make their developers 25% faster. OpenAI makes their own bombastic claims about their coding efficiencies and studies. And my fellow developers themselves are no better, with 14% claiming they’re seeing a 10x increase in output due to AI.
“Delegate like a boss” – GitHub Copilot
These claims wouldn't matter if the topic weren't so deadly serious. Tech leaders everywhere are buying into the FOMO, convinced their competitors are getting massive gains they're missing out on. This drives them to rebrand as AI-First companies, justify layoffs with newfound productivity narratives, and lowball developer salaries under the assumption that AI has fundamentally changed the value equation.
And yet, despite the most widespread adoption one could imagine, these tools don’t work.