
At Octomind, we build AI agents - but our code is still mostly written by humans. We love LLMs and use them everywhere we can, from our product to our internal workflows. But despite all the hype, we’re nowhere near “agents writing most of our code.”
We have good reasons for not jumping on the bandwagon of companies like Anthropic (where AI reportedly generates 80% of the code), Microsoft (30%), or Google (25%) just yet.

A few vital things are “not there” yet. Here’s why it matters - and what it takes to actually close the gap.
We’ve been using Cursor, Claude Code, and Windsurf for months, but none of us can honestly say they’ve boosted our productivity in a meaningful way (say, 20% or more). Sure, tab completions are often solid, and I’ve had some success getting them to generate unit tests - especially when there are existing tests to copy from (like new routes).
Still, that’s nowhere near the 80%+ efficiency gains others claim. So, driven by equal parts FOMO and curiosity, my colleague Fabio and I decided to spend the past week implementing a roadmap feature entirely with AI.
Before diving in, we combed through the documentation for our preferred tools to make sure we weren’t missing anything useful. We also updated our Cursor rules and CLAUDE.md file to inject updated knowledge about our product and our dev workflow, enabled BugBot for AI code reviews and went to work.
At Octomind, we build an agent-powered end-to-end testing platform. Our tests aren’t tied to branches - they live centrally in our system, which doesn’t support branch-specific versions of test cases. That works fine until you start using branch deployments.
Picture a SaaS app with three tests: login, create post, edit post. The app under test is developed with branch deployments for each pull request (PR). Now imagine a branch that changes the login flow - say, it adds 2FA. The existing login test (which only checks username + password) will now fail, blocking the pipeline for that PR.
At the moment, you’ve got two options:
- Update the shared login test to match the branch - and break it for every other branch (and for main) until the PR merges.
- Skip or ignore the failing test for that PR and merge anyway.
Neither is great. One blocks others; the other breaks trust in your merge.
To combat this, we wanted to extend the concept of branches to our tests. When a branch is created, you can spawn a branch-specific copy of a test. That copy runs only for that branch and can be edited freely. When the branch merges, the copy becomes the new default.
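To make that concrete, here’s a minimal sketch of the shape we had in mind - the model and function names below are my illustration for this post, not our actual schema:

```typescript
// Sketch only: a test case optionally belongs to a branch and points back
// at the default test it was forked from. Field names are hypothetical.
interface TestCase {
  id: string;
  name: string;
  steps: string[]; // simplified; real steps are richer objects
  branchName: string | null; // null = the default (main) version
  forkedFromId: string | null; // id of the default test this copy overrides
}

// When a branch changes a flow, spawn a branch-specific copy of the test.
function forkForBranch(defaultTest: TestCase, branchName: string): TestCase {
  return {
    ...defaultTest,
    id: crypto.randomUUID(),
    branchName,
    forkedFromId: defaultTest.id,
  };
}

// When the branch merges, the copy becomes the new default.
function promoteOnMerge(fork: TestCase): TestCase {
  return { ...fork, branchName: null, forkedFromId: null };
}
```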
We figured this feature should be doable in about a week with two developers.
As the first iteration, we let the agents roam. We did not expect this to work perfectly, but we wanted to see where it’s at.
We’ve got a decent-sized monorepo, so “just dump everything into context” isn’t an option. We take testing seriously and have guardrails the AI can use to check its own output.
So I wrote a detailed brief and attached the files it needed to the context. This wasn’t a ‘tiny prompt performs miracles’ thing - I iterated on the prompt until it was as specific as possible. Within ~5 minutes, the agent produced a plan with 11 reasonable TODOs:

We hit `run` and that’s where it went sideways. The agent started cranking out lines, but stumbled on basics any dev would breeze through - like regenerating the Prisma client after a database schema change (yes, the Cursor rules spelled it out).
I went back and forth with it several times for clarification. It reported success with the message: “The feature should now work correctly! The fork button should be functional, and you'll be able to see forked test cases using the new filter. 🎉” - while the feature did not, in fact, work. The misses were the kind that wouldn’t trip up any human dev, and I’m not even talking about bugs here - just things that immediately stick out and would not have happened to a developer. Yes, we tried multiple rounds of “still doesn’t work - you forgot to actually use the new button” and similar nudges. The result: a 2,000-line PR that needs review and rework almost everywhere.
I decided to start over. We never expected this to just work - these agents are supposedly better at smaller features anyway. So I stepped back and thought through how I’d actually build it myself, step by step. This kind of ideation is where I genuinely do like using an LLM - not to code, but to bounce approaches around in planning mode.
Then I had it make a plan again, this time for just the first piece: loading the correct test from the database given an execution URL.
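To give a sense of what “loading the correct test” means here, this is roughly the selection logic I was describing - a sketch only, reusing the TestCase shape from the earlier snippet, with a made-up URL convention and the database access left out:

```typescript
// Assumption for this sketch: branch deployments encode the branch name in
// the subdomain, e.g. https://feature-2fa.preview.example.com - "app" is
// the made-up hostname of the default deployment.
function branchFromExecutionUrl(executionUrl: string): string | null {
  const subdomain = new URL(executionUrl).hostname.split(".")[0] ?? "";
  return subdomain === "app" || subdomain === "" ? null : subdomain;
}

// Prefer the branch-specific copy of a test; fall back to the default.
// Reuses the TestCase shape from the earlier sketch.
function pickTestForExecution(
  executionUrl: string,
  defaultTest: TestCase,
  forks: TestCase[],
): TestCase {
  const branch = branchFromExecutionUrl(executionUrl);
  if (branch === null) return defaultTest;
  return forks.find((t) => t.branchName === branch) ?? defaultTest;
}
```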
The agent produced another long, seemingly sensible plan. I let it run - expecting to step in more often this time. After burning through several contexts’ worth of chat, I had another pull request.
About 1,200 lines of code, just for this one part. Does it work? Well… it typechecks. Beyond that, who knows.
In the spirit of the experiment, I handed the PR to BugBot for review. It left four comments - one of them pointing out that transaction handling was broken. That’s not great for data consistency, so I fed the comment back into the chat.
The AI’s response:
“Of course, massive oversight on my part, I apologize. Let me fix it: …”
Apology accepted, I guess. It did patch a few places, and when I asked if it saw any more issues, it replied:
Recommendation
- Fix Transaction Consistency: Ensure all database calls within getFullTestCasesWithPrerequisites use the same transaction client. (and … 7 more things)
All of this was in a single file. The only context it needed was the database schema - literally the thing it was just working on.
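For reference, the pattern BugBot was pointing at is the usual one for Prisma’s interactive transactions: every query inside the callback has to go through the transaction client it hands you. A minimal sketch - the model and field names are placeholders, not our real schema:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function getFullTestCasesWithPrerequisites(ids: string[]) {
  return prisma.$transaction(async (tx) => {
    // Every call below uses `tx`. Mixing in the global `prisma` client here
    // would run those queries outside the transaction and silently break
    // atomicity - which is exactly what the review comment was about.
    const testCases = await tx.testCase.findMany({
      where: { id: { in: ids } },
    });

    const prerequisiteIds = testCases
      .map((t) => t.prerequisiteId)
      .filter((id): id is string => id !== null);

    const prerequisites = await tx.testCase.findMany({
      where: { id: { in: prerequisiteIds } },
    });

    return { testCases, prerequisites };
  });
}
```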
It also has a habit of making life easy for itself:

It’ll check for linter errors, but only by running `head -30` and some regex filters, so it convinces itself everything’s fine.
Confidently marking half-done work as finished, apologizing for “massive oversights,” fixing things only to break others (look up the German word verschlimmbessern), and completely ignoring our existing designs and UX is not even the worst part.
Let’s say the agent can now ship medium-complexity features with little help. And let’s even assume we’ve fixed the “wait 3 minutes, review 1000 lines of output” problem by turning developers into orchestrators of agents instead of coders. That’s the dream many LinkedIn posts are selling.
Even then, a huge issue remains: I lose my mental model of the codebase.
Right now, I know how changing one part affects another, where bugs tend to hide, and how the data model behaves. That intuition disappears when the AI keeps dropping thousand-line PRs that might even get auto-merged. When a teammate does that, I can trust they made thoughtful trade-offs, and I’ll pick up context while reviewing or building on top of it. With AI, that learning loop is gone.
So when a tough bug or edge-case feature comes up - the kind the AI still can’t handle - it feels like I’m seeing the codebase for the first time. Maybe I skimmed some reviews of features I know the AI can do by itself (unless I ran auto-review tools like CodeRabbit), but that’s nowhere near the understanding that comes from actually interacting with the code yourself.
Until I can trust the AI completely, I need to keep my own mental model alive. Otherwise, every time I need to do something myself feels like joining a new company.
AI is currently bad at judging its own performance. Constant replies like these are only annoying if you let them get to you, but boy, is it hard not to:


I asked it to assess its own ability before implementing the feature, and it responded with:
MODERATE CONFIDENCE - I can implement this correctly, but with some caveats:
Overall Assessment: I can implement this task correctly, but it would require careful attention to detail, thorough testing, and potentially some clarification on the missing methods and business logic. The complexity is manageable but not trivial.
That sounds like a human engineer’s self-assessment, because the model is parroting human training data. The problem is that it shouldn’t be judging its own ability with text written by humans (at least until recently) about human abilities - because it is not a human.
And that’s the core issue: the model has no concept of its limits. You only find out if it can do the task by letting it try. An intern can say, “I’ve never done this.” An LLM doing so is very unlikely.
Worse, on our smaller follow-up task it graded itself even higher:
COMPLEXITY: HIGH - This is a substantial refactoring
MY ABILITY: HIGH CONFIDENCE - I can definitely implement this correctly because:
The task is definitely implementable and I have high confidence I can complete it correctly - it's just a matter of following the detailed plan step by step and implementing all the missing pieces.
AI absolutely has a place in a developer’s toolbox. I use ChatGPT or Cursor’s Ask mode daily - to brainstorm, debug, or get unstuck on smaller problems. Tab completions? They’re right about 80% of the time, which is good enough to keep them on. I even let AI handle things like writing unit tests for clean interfaces or refactoring small snippets. Wrapping a loop in `Promise.allSettled` is boring for me - but trivial and instant for AI. It’s also great at recreating well-known patterns from scratch - like traversing a tree structure.
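To be concrete about the kind of task I mean, here’s a minimal sketch of that loop-to-`Promise.allSettled` refactor - `notifyUser` is a made-up stand-in for whatever async call the loop was making:

```typescript
// Hand-off-friendly refactor: sequential await-in-a-loop becomes concurrent
// calls whose individual failures don't abort the whole batch.
async function notifyAll(
  userIds: string[],
  notifyUser: (id: string) => Promise<void>,
): Promise<void> {
  // Before (sequential; the first rejection stops everything):
  //   for (const id of userIds) {
  //     await notifyUser(id);
  //   }

  // After: run concurrently and collect per-item outcomes.
  const results = await Promise.allSettled(userIds.map((id) => notifyUser(id)));
  const failed = results.filter((r) => r.status === "rejected");
  if (failed.length > 0) {
    console.warn(`${failed.length}/${userIds.length} notifications failed`);
  }
}
```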
And for non-technical users, automation powered by AI can be a massive unlock. That’s literally what we work on at Octomind: automating a technical task, but within well-defined boundaries using specialized agents. They don’t write the whole codebase; they handle narrow, observable parts of it where output constraints keep them in check.
Other focused tools can deliver similar value. And sure, maybe one day AI will truly handle everything it’s being credited with today (whether that’s LLMs or something beyond them).
But we’re not there yet - and more people are starting to admit it.
Every time I read a post about this, none of the prompts are shared, and when I do review the actual commands and how the AI is working, it makes me realize that the person driving is not experienced at it. AIs will make a best attempt - you can see this by looking at the reasoning/thinking output - and the temperature is usually fairly moderate (typically around 0.4-0.8), so heavy "creative liberties" will be taken. You need to account for that: show it the right and wrong way to do things. I don't usually use agents or AI for things that are one-offs but not copy & paste, or for deep-thinking / critical tasks that require human thought, where AI wouldn't be able to do it.
For all the other trivial things, I can delegate those out to it and expect junior results when I give it sub-optimal guidance; with solid and/or extensive guidance, though, I can get adequate to near-perfect results.
Another dimension that really matters here is the actual model used - not every model is the same.
Also, if the AI does something wrong, have it assess why things went wrong, revert to the previous checkpoint, and integrate that into the plan.
You're driving - you are ultimately in control, so learn to drive. It's a tool: it can be adjusted, you can modify the output, you can revert, and you can also just not use it. But if you do actually learn how to use it, you'll find it can speed up your process. It is not a cure-all, though; it's good in certain situations, just like a hammer.
On the other hand, when people who claim success with AI share their prompts, I see all the same misses and flaws that keep me from fully buying in. The authors, though, seem to gloss over these errors and claim wild success. Their prompts never actually seem that different from the ones that fail me as well.
It seems like “you’re not doing it correctly” is just a rationalization to protect the pro-AI person’s established opinion.
Nobody does it correctly - AI or not.
It’s about breaking the problem down into epics, tasks, and acceptance criteria that are reviewed. Review the written code and adjust as needed.
Tests… a lot of tests.
I had a similar experience a couple of months ago where I decided to give it a go and "vibe code" a small TUI to get a feel for the workflow.
I used Claude Code, and while the end result works (kinda), I noticed I was less satisfied with the process. More importantly, I now had to review "someone else's" code instead of writing it myself; I had no idea of the internal workings of the application, and it felt like starting at day one on a new codebase. It shifted my way of working from thinking/writing to reviewing/giving feedback, which for me personally is way less mentally stimulating and rewarding.
There were def. some "a-ha" moments where CC came up with suggestions I wouldn't have thought of myself, but those were only a small fraction of the total output - and there's def. a dopamine hit from seeing all that code being spit out so fast.
Used as a prototyping tool to quickly test an idea seems like a good use case, but there should be better tooling around taking that prototype, splitting it into manageable parts, and sharing the reasoning behind it, so I can then rework it with enough understanding to move it forward.
For now I've decided to stick to code completion, writing unit tests, commit messages, refactoring short snippets, and CHANGELOG updates. It does fairly well on all of those small, very focused tasks, and the time saved on them ends up being a net positive.
> Used as a prototyping tool to quickly test an idea seems like a good use case, but there should be better tooling around taking that prototype, splitting it into manageable parts, and sharing the reasoning behind it, so I can then rework it with enough understanding to move it forward.
This would be amazing. I think Claude Code is a great prototyping tool, but I agree - you don't really learn your codebase. I think that's okay for a prototype, though, if you just want to see whether the idea works at all. Then you can restart, as you say, with some scaffolding to implement it better.
This has always been true. The difference is that now more people are admitting it. While you could argue that LLMs have junior-level capabilities, they definitely do not have junior-level self-reflection, self-awareness, or self-anything. They fundamentally don't learn, where learning means being significantly less likely to fail at a task class x after being taught about it. They don't even have the ability to ask for help: these agents just choose to generate unusable code over stopping and asking for help or guidance, which implies they are unable to tell their limits - skill-wise, knowledge-wise, etc.
Frankly, I have been highly concerned seeing all the transformer hype in here when the gains people claim cannot be reliably replicated everywhere.
The financial incentives to make transformer tech work as it is being sold (even when it might not be cost-effective) need close attention, because to me it looks a bit too much like blockchain or big data.