Levels of Agentic Engineering

2026-03-10 8:48 · www.bassimeledath.com

AI's coding ability is outpacing our ability to wield it effectively. That's why all the SWE-bench score maxxing isn't syncing with the productivity metrics engineering leadership actually cares about. When Anthropic's team ships a product like Cowork in 10 days and another team can't move past a broken POC using the same models, the difference is that one team has closed the gap between capability and practice and the other hasn't.

That gap doesn't close overnight. It closes in levels. 8 of them. Most of you reading this are likely past the first few, and you should be eager to reach the next one because each subsequent level is a huge leap in output, and every improvement in model capability amplifies those gains further.

The other reason you should care is the multiplayer effect. Your output depends more than you'd think on the level of your teammates. Say you're a level 7 wizard, raising several solid PRs with your background agents while you sleep. If your repo requires a colleague's approval before merge, and that colleague is on level 2, still manually reviewing PRs, that stifles your throughput. So it is in your best interest to pull your team up.

From talking to several teams and individuals practicing AI-assisted coding, here's the progression of levels I've seen play out, imperfectly sequential:

The 8 Levels of Agentic Engineering

Levels 1 & 2: Tab Complete and Agent IDE

I'll address these two zippily, mostly for posterity. Skim freely.

Tab completion is where it started. GitHub Copilot kicked off the movement. Press tab, autocomplete code. Probably long forgotten by many and skipped entirely by new entrants to agentic engineering. It favored experienced devs who could adeptly sketch out the skeleton of their code before AI filled in the blanks.

AI-focused IDEs like Cursor changed the game by connecting chat to your codebase, making multi-file edits dramatically easier. But the ceiling was always context. The model could only help with what it could see, and annoyingly often, it was either not seeing the right context or seeing too much of the wrong context.

Most people at this level are also experimenting with plan mode in their coding agent of choice: translating a rough idea into a structured step-by-step plan for the LLM, iterating on that plan, and then triggering the implementation. It works well at this stage, and it's a reasonable way to maintain control. In later levels, though, we'll see less dependence on plan mode.

Level 3: Context Engineering

Now the fun stuff. Buzz phrase of the year in 2025, context engineering became a thing when models got reliably good at following a reasonable number of instructions with just the right amount of context. Noisy context was just as bad as underspecified context, so the effort was in improving the information density of each token. "Every token needs to fight for its place in the prompt" was the mantra.

Same message, fewer tokens — information density was the name of the game (source: humanlayer/12-factor-agents)

In practice, context engineering touches more surface area than people realize. It's your system prompt and rules files (.cursorrules, CLAUDE.md). It's how you describe your tools, because the model reads those descriptions to decide which ones to call. It's managing conversation history so a long-running agent doesn't lose the plot ten turns in. It's deciding which tools to even expose per turn, because too many options overwhelm the model just like they overwhelm people.

You don't hear as much about context engineering these days. The scale has tipped in favor of models that forgive noisier context and reason through messier terrain (larger context windows help too). Still, being mindful of what eats up context remains relevant. A few examples of where it still bites:

  • Smaller models are more context-sensitive. Voice applications often use smaller models, and context size also correlates with time to first token, which affects latency.
  • Token-heavy tools and modalities. MCPs like Playwright and image inputs burn through tokens fast, pushing you into "compact session" state in Claude Code way sooner than you'd expect.
  • Agents with access to dozens of tools, where the model spends more tokens parsing tool schemas than doing useful work.

The broader point is that context engineering hasn't gone away; it's just evolved. The focus has shifted from filtering out bad context to making sure the right context is present at the right time. That shift is what sets up level 4.

Level 4: Compounding Engineering

Context engineering improves the current session. Compounding engineering improves every session after it. Popularized by Kieran Klaassen, compounding engineering was the inflection point at which not just I but many others realized that "vibe coding" could do far more than prototyping.

It's a plan, delegate, assess, codify loop. You plan the task with enough context for the LLM to succeed. You delegate it. You assess the output. And then, crucially, you codify what you learned: what worked, what broke, what pattern to follow next time.

The compounding loop: plan, delegate, assess, codify — each cycle makes the next one better

The magic is in that codify step. LLMs are stateless. If they re-introduce a dependency you explicitly removed yesterday, they'll do it again tomorrow unless you tell them not to. The most common way to close that loop is updating your CLAUDE.md (or equivalent rules file) so the lesson is baked into every future session. A word of caution: the instinct to codify everything into your rules file can backfire (too many instructions are as good as none). The better move is to create a setting where the LLM can easily discover useful context on its own, for example by maintaining an up-to-date docs/ folder (more on this in Level 7).
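The loop is simple enough to sketch. Below is a minimal, hypothetical illustration: `delegate` and `assess` stand in for a real agent call and your own review, and the `CLAUDE.md` filename just follows the rules-file convention above.

```python
from pathlib import Path

RULES = Path("CLAUDE.md")  # the rules file every future session reads

def codify(lesson: str) -> None:
    # Append the lesson so stateless future sessions inherit it,
    # skipping duplicates so the rules file doesn't bloat.
    existing = RULES.read_text() if RULES.exists() else ""
    if lesson not in existing:
        RULES.write_text(existing + f"- {lesson}\n")

def compounding_cycle(task: str, delegate, assess) -> bool:
    plan = f"Task: {task}"        # plan: frame the task with enough context
    output = delegate(plan)       # delegate: hand the plan to the agent
    ok, lesson = assess(output)   # assess: inspect what came back
    if lesson:
        codify(lesson)            # codify: bake the takeaway into the rules
    return ok
```

The point isn't the code; it's that `codify` runs every cycle, so session N+1 starts where session N left off.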

Practitioners of compounding engineering are usually hyper-aware of the context being fed to their LLM. When an LLM makes a mistake, they instinctively think about missing context before blaming the model's competence. That instinct is what makes levels 5 through 8 possible.

Level 5: MCP and Skills

Levels 3 and 4 solve for context. Level 5 solves for capability. MCPs and custom skills give your LLM access to your database, your APIs, your CI pipeline, your design system, Playwright for browser testing, Slack for notifications. Instead of just thinking about your codebase, the model can now act on it.

There's no shortage of good material on MCPs and skills already, so I won't rehash what they are. But here are some examples of how I use them: my team shares a PR review skill that we've all iterated on (and still do) that conditionally launches subagents depending on the nature of the PR. One handles integration safety with the database. Another runs complexity analysis to flag redundancies or overengineering. Another checks prompt health to ensure our prompts follow the team's standard format. It also runs linters like Ruff.

Why invest this much in a review skill? Because as agents start producing PRs at volume, human review becomes the bottleneck, not the quality gate. Latent Space makes a compelling case that code review as we know it is dead. Automated, consistent, skill-driven review is what replaces it.

On the MCP side, I use the Braintrust MCP so my LLM can query evaluation logs and make changes directly. I use DeepWiki MCP to give my agent access to documentation for any open-source repo without manually pulling it into context.

Once multiple people on your team are writing their own versions of the same skill, it's worth consolidating into a shared registry. Block (my condolences) has a great write-up on this: they built an internal skills marketplace with over 100 skills and curated bundles for specific roles and teams. Skills get the same treatment as code: pull requests, reviews, version history.

One more trend worth calling out: it's becoming common for LLMs to use CLI tools instead of MCPs (and it seems like every company is shipping one: Google has a Workspace CLI, and Braintrust is launching one soon). The reason is token efficiency. MCP servers inject full tool schemas into context on every turn whether the agent uses them or not. CLIs flip this: the agent runs a targeted command, and only the relevant output enters the context window. I use agent-browser heavily for exactly this reason instead of the Playwright MCP.

Quick pause before we go further. Levels 3 through 5 are the building blocks for everything that follows. LLMs are unpredictably good at some things and bad at others, and you need to develop an intuition for where those edges are before stacking more automation on top. If your context is noisy, your prompts are under- or misspecified, or your tools are poorly described, levels 6 through 8 just amplify the mess.

Level 6: Harness Engineering & Automated Feedback Loops

This is where the rocket really takes off.

Context engineering is about curating what the model sees. Harness engineering is about building the entire environment, tooling, and feedback loops that let agents do reliable work without you intervening. Give the agent the feedback loop, not just the editor.

OpenAI's Codex harness — a full observability stack wired into the agent so it can query, correlate, and reason about its own output (source: OpenAI)

OpenAI's Codex team wired Chrome DevTools, observability tooling, and browser navigation into the agent runtime so it could take screenshots, drive UI paths, query logs, and validate its own fixes. Given a single prompt, the agent can reproduce a bug, record a video, and implement a fix. Then it validates by driving the app, opens a PR, responds to review feedback, and merges, escalating only when judgment is required. The agent doesn't just write code. It can see what the code produces and iterate on it, the same way a human would.

My team builds voice and chat agents for tech troubleshooting, so I built a CLI tool called converse that lets any LLM chat with our backend endpoint and have turn-by-turn conversations. The LLM makes code changes, uses converse to test conversations against the live system, and iterates. Sometimes these self-improvement loops run for several hours on end. This is especially powerful when the outcome is verifiable: the conversation must follow this flow, or call these tools in these situations (e.g., escalation to a human agent).

The concept that enables this is backpressure: automated feedback mechanisms (type systems, tests, linters, pre-commit hooks) that let agents detect and correct mistakes without human intervention. If you want autonomy, you need backpressure. Otherwise you end up with a slop machine. This extends to security too. Vercel's CTO makes the case that agents, the code they generate, and your secrets should live in separate trust domains, because a prompt injection buried in a log file can trick an agent into exfiltrating your credentials if everything shares one security context. Security boundaries are backpressure: they constrain what an agent can do when it goes off the rails, not just what it should do.
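Mechanically, backpressure is just a retry loop wrapped around your automated gates. Here's a minimal sketch, with the agent call and the checks injected as plain functions; in a real harness these would shell out to your test runner, linter, and type checker:

```python
def backpressure_loop(attempt, checks, max_iters=5):
    # attempt(feedback) asks the agent to write or revise code;
    # checks is a list of (name, predicate) gates: tests, linters, types.
    feedback = None
    for _ in range(max_iters):
        attempt(feedback)
        failures = [name for name, passed in checks if not passed()]
        if not failures:
            return True   # every gate green: the work can surface
        feedback = "Failing checks: " + ", ".join(failures)
    return False          # backpressure exhausted: escalate to a human
```

The gates detect mistakes, the feedback string closes the loop, and the iteration cap is what keeps a failing agent from becoming a slop machine.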

Two principles that sharpen this further:

  • Design for throughput, not perfection. When perfection is required per commit, agents pile on the same bug and overwrite each other's fixes. Better to tolerate small non-blocking errors and do a final quality pass before release. We do the same for our human colleagues.
  • Constraints > instructions. Step-by-step prompting ("do A, then B, then C") is increasingly outdated. In my experience, defining boundaries works better than giving checklists, because agents fixate on the list and ignore anything not on it. The better prompt is "here's what I want, work on it until you pass all these tests."

The other half of harness engineering is making sure the agent can navigate your repo without you. OpenAI's approach: keep AGENTS.md to roughly 100 lines that serve as a table of contents pointing to structured docs elsewhere, and make documentation freshness part of CI rather than relying on ad hoc updates that go stale.

Once you've built all of this, a natural question emerges: if the agent can verify its own work, navigate the repo, and correct its mistakes without you, why do you need to be in the chair at all?

Heads up, for folks in the early levels, this next section may sound alien (but hey, bookmark and come back to it).

Level 7: Background Agents

Hot take: plan mode is dying.

Boris Cherny, creator of Claude Code, still starts 80% of tasks in plan mode today. But with each new model generation, the one-shot success rate after planning keeps climbing. I think we're approaching the point where plan mode as a separate human-in-the-loop step fades away. Not because planning doesn't matter, but because models are getting good enough to plan well on their own. Big caveat: this only works if you've done the work in levels 3 through 6. If your context is clean, your constraints are explicit, your tools are well-described, and your feedback loops are tight, the model can plan reliably without you reviewing it first. If you haven't done that work, you'll still need to babysit the plan.

To be clear, planning as a general practice isn't going away. It's just changing shape. For newer practitioners, plan mode remains the right entry point (as described in Levels 1 and 2). But for complex features at Level 7, "planning" looks less like writing a step-by-step outline and more like exploration: probing the codebase, prototyping options in worktrees, mapping the solution space. And increasingly, background agents are doing that exploration for you.

This matters because it's exactly what unlocks background agents. If an agent can generate a solid plan and execute without needing you to sign off, it can run asynchronously while you do something else. That's the critical shift from "multiple tabs I'm juggling" to "work that's happening without me."

The Ralph loop is the popular entry point: an autonomous agent loop that runs a coding CLI repeatedly until all PRD items are complete, where each iteration spawns a fresh instance with clean context. In my experience, getting the Ralph loop right is hard, and any under- or misspecification of the PRD comes back to bite. It's a little too fire-and-forget.

You can run multiple Ralph loops in parallel, but the more agents you spin up, the more you notice where your time actually goes: coordinating them, sequencing work, checking output, nudging things along. You're not writing code anymore. You've become a middle manager. You need an orchestrator agent that handles the dispatch so you can stay focused on intent, not logistics.

Dispatch launching 5 workers across 3 models in parallel — your session stays lean while agents do the work

The tool I've been using heavily for this is Dispatch, a Claude Code skill I built that turns your session into a command center. You stay in one clean session while workers do the heavy lifting in isolated contexts. The dispatcher plans, delegates, and tracks, so your main context window is preserved for orchestration. When a worker gets stuck, it surfaces a clarifying question rather than silently failing.

Dispatch runs locally, which makes it ideal for rapid development where you want to stay close to the work: faster feedback, easier to debug interactively, and no infrastructure overhead. Ramp's Inspect is the complementary approach for longer-running, more autonomous work: each agent session spins up in a cloud-hosted sandboxed VM with the full development environment. A PM spots a UI bug, flags it in Slack, and Inspect picks it up and runs with it while your laptop is closed. The tradeoff is operational complexity (infrastructure, snapshotting, security), but you get scale and reproducibility that local agents can't match. I'd say use both (local and cloud background agents).

One pattern that's been surprisingly powerful at this level: use different models for different jobs. The best engineering teams aren't staffed with clones. They're staffed with people who think differently, trained by different experiences, bringing different strengths. The same logic applies to LLMs. These models were post-trained differently and have meaningfully different dispositions. I routinely dispatch Opus for implementation, Gemini for exploratory research, and Codex for review, and the cumulative output is stronger than any single model working alone. Think wisdom of crowds, but for code.

Critically, you also need to decouple the implementer from the reviewer. I've learned this the hard way too many times: if the same model instance implements and evaluates its own work, it's biased. It will gloss over issues and tell you all tasks are complete when they aren't. It's not malice, it's the same reason you don't grade your own exam. Have a different model (or a different instance with a review-specific prompt) do the review pass. Your signal quality goes way up.
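The decoupling is easy to enforce structurally. A hypothetical sketch where `implementer` and `reviewer` are separate model calls (different models, or at least different instances with a review-specific prompt):

```python
def cross_review(task, implementer, reviewer, max_rounds=3):
    # The model that writes the patch never judges it; the reviewer
    # returns (approved, feedback) and its feedback drives revisions.
    feedback = None
    for _ in range(max_rounds):
        patch = implementer(task, feedback)
        approved, feedback = reviewer(patch)
        if approved:
            return patch
    return None  # repeated rejections: hand the task to a human
```

The round cap matters: without it, a biased or confused pair can ping-pong forever, and the human escalation path is what keeps signal quality high.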

Background agents also open the floodgates for combining your CI with AI. Once agents can run without a human in the chair, trigger them from your existing infrastructure. A docs bot that regenerates documentation on every merge and raises a PR to update CLAUDE.md (we do this and it's a huge time saver). A security reviewer that scans PRs and opens fixes. A dependency bot that actually upgrades packages and runs the test suite rather than just flagging them. Good context, compounding rules, capable tools, and automated feedback loops, now running autonomously.

Level 8: Autonomous Agent Teams

Nobody has mastered this level yet, though a few are pushing into it. It's the active frontier.

In Level 7, you have an orchestrator LLM dispatching work to worker LLMs in a hub-and-spoke pattern. Level 8 removes that bottleneck. Agents coordinate with each other directly, claiming tasks, sharing findings, flagging dependencies, and resolving conflicts without routing everything through a single orchestrator.

Claude Code's experimental Agent Teams feature is an early implementation: multiple instances work in parallel on a shared codebase, where teammates operate in their own context windows and communicate directly with each other. Anthropic used 16 parallel agents to build a C compiler from scratch that can compile Linux. Cursor ran hundreds of concurrent agents for weeks to build a web browser from scratch and migrate their own codebase from Solid to React.

But look closely and you'll see the seams. Cursor found that without hierarchy, agents became risk-averse and churned without progress. Anthropic's agents kept breaking existing functionality until a CI pipeline was added to prevent regressions. Everyone experimenting at this level says the same thing: multi-agent coordination is a hard problem and nobody is near optimal yet.

I honestly don't think the models are ready for this level of autonomy for most tasks. And even if they were smart enough, they're still too slow and too token-hungry for it to be economical outside of moonshot projects like compilers and browser builds (impressive, but far from clean). For the work most of us do day to day, Level 7 is where the leverage is. I wouldn't be surprised if Level 8 becomes the prevailing pattern eventually, but right now Level 7 is where I'd put my energy (unless you're Cursor and the breakthrough is the business).

Level ?

The inevitable what's-next question.

Once you're adept at orchestrating agent teams without much friction, there's no reason the interface has to stay text-only. Voice-to-voice (thought-to-thought, maybe?) interaction with your coding agent — conversational Claude Code, not just voice-to-text input — is a natural next step. Look at your app, describe a sequence of changes out loud, and watch them happen in front of you.

There's a crowd chasing the perfect one-shot: state what you want and the AI composes it flawlessly in a single pass. The problem is that this presupposes we humans know exactly what we want. We don't. We never have. Software has always been iterative, and I think it always will be. It's just going to get much easier, stretch well beyond plain text interactions, and be a heck of a lot faster.

So: what level are you on? And what are you doing to get to the next one?


Comments

  • By krackers 2026-03-11 4:12 (7 replies)

    What level is copy pasting snippets into the chatgpt window? Grug brained level 0? I sort of prefer it that way (using it as an amped up stackoverflow) since it forces me to decompose things in terms of natural boundaries (manual context management as it were) and allows me to think in terms of "what properties do I need this function to have" rather than just letting copilot take the wheel and glob the entire project in the context window.

    • By ddxv 2026-03-11 12:37 (1 reply)

      I still do this too for tough projects in languages I know. Too many times getting burned thinking 'wow it one shot that!' only to end up debugging later.

      I let agents run wild on frontend JS because I don't know it well and trust them (and an output I can look at).

      • By tracker1 2026-03-11 17:22 (1 reply)

        IMO, the front end results are REALLY hit and miss... I mostly use it to scaffold if I don't really care because the full UI is just there to test a component, or I do a fair amount of the work mixed. I wish it was better at working with some of the UI component libraries with mixed environments. Describing complex UX and having it work right are really not there yet.

        • By ddxv 2026-03-13 3:51

          Yes, I think what I left off my sentence was that I trust AI on frontend more than myself. In backend and data processing, where I know more, I can't handle its constant hallucinations. I also feel like hallucinations in data pipelines are way more problematic for me. They take a long time to "fix" and can be quite easy to miss, imagine a mean of a mean or something that is 'mostly' right (thus harder to catch) but factually incorrect.

    • By Lord-Jobo 2026-03-11 14:47 (1 reply)

      1.8: chat ide the slow way :)

      This is also where I do most of my AI use. It’s the safe spot where I’m not going to accidentally send proprietary info to an unknown number of eyeballs (computer or human).

      It’s also just cumbersome enough that I’m not relying on it too much and stunting my personal ability growth. But I’m way more novice than most on here.

      • By tracker1 2026-03-11 17:19

        I've found it's easy enough to have AI scaffold a working demo environment around a single component/class that I'm actually working on, then I can copy the working class/component into my "real" application. I'm in a pretty locked down environment, so using a separate computer and letting the AI scaffold everything around what I'm working on is pretty damned nice, since I cannot use it in the environment or on the project itself.

        For personal projects, I'm able to use it a bit more directly, but would say I'm using it around 5/6 level as defined here... I've leaned on it a bit for planning stages, which helps a lot... not sure I trust swarms of automated agents, though it's pretty much the only way you're going to use the $200 level on Claude effectively... I've hit the limits on the $100 only twice in the past month, I downgraded after my first month. And even then, it just forced me to take a break for an hour.

    • By giancarlostoro 2026-03-11 15:01

      I think you bring up a good point, it falls under Chat IDE, but it's the "lowest" tier if you will. Nothing wrong with it, a LOT of us started this way.

    • By vorticalbox 2026-03-11 12:50

      I do this too with the chatgpt mac app. It has a "pop out" feature it binds to option + space, then I just ask away.

    • By antonvs 2026-03-11 9:03

      I find the CLI agents a decent middle ground between the extremes you describe. There’s a reason they’ve gained some popularity.

    • By branoco 2026-03-11 12:37

      anything, if it brings the results

    • By waynesonfire 2026-03-11 4:45

      Your technique doesn't keep the kool-aid flowing. Shut up. /s

      The more I try to use these tools to push up this "ladder" the more it becomes clear the technology is no more than a 10x better Google search.

  • By mzg 2026-03-10 18:38 (6 replies)

    As a lowly level 2 who remains skeptical of these software “dark factories” described at the top of this ladder, what I don’t understand is this:

    If software engineering is enough of a solved problem that you can delegate it entirely to LLM agents, what part of it remains context-specific enough that it can’t be better solved by a general-purpose software factory product? In other words, if you’re a company that is using LLMs to develop non-AI software, and you’ve built a sufficient factory to generate that software, why don’t you start selling the factory instead of whatever you were selling before? It has a much higher TAM (all of software)

    • By 2001zhaozhao 2026-03-10 20:24 (4 replies)

      Why sell the factory when you can create automated software cloner companies that make millions off of instantly copying promising startups as soon as they come out of stealth?

      If you could get a dark factory working when others don't have one, you can make much more money using it than however much you can make selling it

      • By jochem9 2026-03-12 7:37

        ASML has a near monopoly on the most advanced chip machines. They maintain that by 'just' being the most advanced and having lots of patents.

        They haven't branched off into making chips themselves. They keep their focus on selling the factories.

        I think they haven't, because ASML itself doesn't have production lines. Every machine is one off. It even gets delivered with a team of engineers to keep it running.

        The same probably holds true for software factories: the best ones are assembled by the smartest people (wielding AI in ways most of us don't). They are not in the business to produce software at scale, they are in the business to ensure others can do that using increasingly advanced software factories.

        This relies on the premise that such a factory cannot produce a more advanced factory without significant human intervention (e.g. high ingenuity and/or lots of elbow grease). If this doesn't hold true, then we are in for some interesting times x100.

      • By antonvs 2026-03-10 21:48 (3 replies)

        Producing the software is only a small part of the picture when it comes to generating revenue.

        So far, we haven’t seen much to suggest that LLMs can (yet) replace sales and most of the related functions.

        • By DrScientist 2026-03-11 10:10 (1 reply)

          Was listening to a radio programme recently with 3 entrepreneurs talking about being entrepreneurs.

          In relation to sales, there were two gems. For direct to consumer type companies - influencers are where it's at right now especially during bootstrap phase - and they were talking about trying to keep marketing budget under 20% of sales.

          Another, who is mostly in the VC business, finds the best way to gain traction for his startups is to create controversy - ie anything to be talked about.

          In both cases you are trying to be talked about - either by directly paying for people to do that, or by providing entertainment value so people talk about you.

          You could argue that both of those activities are already being automated - and the nice thing about sales is that there is a fairly direct feedback loop you can actively learn from.

          • By AdamN 2026-03-11 11:57

            Yeah I really would like to know how many bots are on reddit (and on particular subreddits/threads) and also how many are here!

            The interesting thing though is that the bots are just cheaper versions of real human influencers. So nothing has changed aside from scale (and speed) - the underlying mechanisms of paying for word of mouth is the same as it's been for a long time.

        • By jillesvangurp 2026-03-11 7:19 (1 reply)

          You can do a lot of work with agents to remove a lot of manual work around the sales process. Sales is a lot of grinding on leads, contacts, follow ups, etc. And a lot of that is preparation work (background research, figuring out who to talk to, who the customer is, etc.), making sure follow ups are scheduled appropriately, etc.

          You still should talk to people yourself and be very careful with communicating AI slop, cold outreach and other things that piss off more people than they get into your funnel. But a lot of stuff around this can be automated or at least supported by LLMs.

          Most of the success with sales is actually having something that people want to buy. That sounds easy. But it's actually the hardest part of selling software. Getting there is a bit of a journey.

          I've built a lot of stuff that did not sell well. These are hard earned lessons. I see a lot of startups fall into this trap. You can waste years on product development and many people do. Until it starts selling, it won't matter. Sales is not a person you hire to do it for you: you have to be able to sell it yourself. If you can't, nobody else will be able to either. Founder sales is crucial. Step back from that once it runs smoothly, not before.

          Use AI to your advantage here. We use it for almost everything. SEO, wording stuff on our website, competitor analysis, staying on top of our funnel, analyzing and sharpening our pitches, preparing responses to customer questions and demands, criticizing and roasting our own pitches and ideas, etc. Confirmation bias is going to be your biggest blind spot. And we also use LLMs to work on the actual product. This stuff is a lot of work. If you can afford a ten person team to work on this, great. But many startups have to prove themselves before the funding for that happens. And when it does, hiring lots of people isn't necessarily a good allocation of resources given you can automate a lot of it now. I'd recommend hiring fewer but better people.

          • By antonvs 2026-03-11 8:57 (1 reply)

            Your points are all valid, but it doesn’t really change the situation that was being discussed: an AI company trying to enter completely new markets just because they can write software for it is hardly some sort of automatic win. They’re much more likely to fail than succeed.

            I mentioned sales and marketing but there’s a whole lot more as well. Basically, it involves creating an entire subsidiary. Perhaps the time will come when that can be mostly done by a team of AI agents, but right now that’s a big hurdle in practice.

            • By DrScientist 2026-03-11 10:32 (2 replies)

              It does raise the question of where in the future will companies compete.

              What's the balance going to be between, 'connecting customers to product' and 'making differentiated product'?

              In theory, if customers have perfect information (ignoring that a very large part of sales is emotional), then the former part will disappear. However the rise of the internet, and perhaps AI agents shopping on your behalf, hasn't really made much of a dent there [1] - marketing, in all its forms, is still huge business - and you could argue still expanding (cf google).

              [1] Perhaps because of the huge importance of the emotional component. Perhaps also because in many areas of manufacturing you've reached a product plateau already - is there much space to make a better cup and plate?

              • By majormajor 2026-03-11 15:52

                There's also a world where "all companies have access to the software factory so sales and entrepreneurship in software disappears entirely."

                But in that scenario it's hard to see where the unwinding stops. What are these other companies doing and which parts of it actually need humans if the "agents" are that good? Marketing? No. Talking to customers? No. Support? No. Financial planning and admin? No. Manufacturing? Some, for now. Shipping physical goods? For now. What else...

                At some point where even are your customers?

              • By pixl97 2026-03-11 14:33

                >It does raise the question of where in the future will companies compete.

                Exactly where current companies compete, rent seeking, IP control, and legal machinations.

                Hence you'll see a few giant lumbering dinosaurs control most of the market, and a few more nimble companies make successful releases until they either get crushed by or snapped up by the larger companies, or become large companies themselves.

        • By bandrami 2026-03-11 8:01

          I mean, until we've at least been through a full lifecycle with its TCO, we can't really say LLMs have replaced producing the software.

      • By tkiolp4 2026-03-10 23:04

        That’s not true. Even if we assume LLMs can generate the code needed to support the next Facebook, one still has to buy/rent tons of hardware (virtual or bare metal), put tons of money into marketing, break the network effect, and pay for third-party services for monitoring, alerting and whatnot. That’s money, and LLMs don’t help with that.

      • By whattheheckheck 2026-03-10 21:11

        Too bad they can't.

    • By hakanderyal 2026-03-10 19:16

      We are not there yet. While there are teams applying dark factory models to specific domains with self-reported success, it's yet to be proven, or generalizable enough to apply everywhere.

    • By glhast 2026-03-10 19:44 · 2 replies

      Also a measly level 2er. I'm curious what kind of project truly needs an autonomous agent team Ralph-looping out 10,000 LOC per hour? Harness-maxxing seems like a competitive pursuit in its own right, existing outside the task of delivering software to customers.

      Feels like K8s cult, overly focused on the cleverness of _how_ something is built versus _what_ is being built.

      • By maxdo 2026-03-11 7:14 · 1 reply

        Essentially any enterprise software, for example - surprisingly, anything that needs to be custom-tailored rather than scaled for millions of views, i.e. anything with high context.

        The YouTubes of this world will not enjoy it; they will use the rules of scale for billions of users.

        Every dashboard chart, security review system, Jira, ERP, CRM, LMS, chatbot, you name it. Anything that benefits from customization per smaller unit (a company, a group of people, or even more so an individual, like a CEO or a CxO group) will win from such software.

        Levels 6 and 7 are essentially the death of enterprise software.

        • By pixl97 2026-03-11 14:39

          >Levels 6 and 7 are essentially the death of enterprise software

          Enterprise software that you sell, or enterprise software you use internally?

          The amount of self-created, self-used software in enterprises is staggering; that software will still exist and will still have a massive maintenance cost. So maybe we need a better definition of enterprise software here - externally sold software, say? Also, a huge amount of that software still has regulatory requirements, so someone will have to sign off on it. Maybe that will be internal certification, but very often there is separation of duties on things like that, where it's easier for it to come from a different company.

      • By cheevly 2026-03-10 23:00 · 1 reply

        Software that is otherwise not feasible for humans to build by hand.

    • By pydry 2026-03-10 18:50

      I have the same question about people who sell "get rich with real estate" seminars.

    • By dist-epoch 2026-03-10 19:19

      Codex and Claude Code are these (proto)factories you talk about - almost every programmer uses them now.

      And when they become fully dark factories, yes, a LOT of software companies will just disappear - they will be dis-intermediated by Codex/Claude Code.

  • By vidimitrov 2026-03-10 20:18 · 8 replies

    Level 4 is where I see the most interesting design decisions get made, and also where most practitioners take a shortcut that compounds badly later.

    When the author talks about "codifying" lessons, the instinct for most people is to update the rules file. That works fine for conventions - naming patterns, library preferences, relatively stable stuff. But there's a different category of knowledge that rules files handle poorly: the why behind decisions. Not what approach was chosen, but what was rejected and why the tradeoff landed where it did.

    "Never use GraphQL for this service" is a useful rule to have in CLAUDE.md. What's not there: that GraphQL was actually evaluated, got pretty far into prototyping, and was abandoned because the caching layer had been specifically tuned for REST response shapes, and the cost of changing that was higher than the benefit for the team's current scale. The agent follows the rule. It can't tell when the rule is no longer load-bearing.

    The place where this reasoning fits most naturally is git history - decisions and rejections captured in commit messages, versioned alongside the code they apply to. Good engineers have always done this informally. The discipline to do it consistently enough that agents can actually retrieve and use it is what's missing, and structuring it for that purpose is genuinely underexplored territory.

    At level 7, this matters more than people expect. Background agents running across sessions with no human-in-the-loop have nothing to draw on except whatever was written down. A stale rules file in that context doesn't just cause mistakes - it produces confident mistakes.
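
    A sketch of the distinction being drawn - the file name follows the comment's CLAUDE.md example, and the commit wording is hypothetical:

    ```text
    # CLAUDE.md (the rule - the "what")
    - Never use GraphQL for this service; use REST.

    # Commit message (the "why", versioned alongside the code)
    Reject GraphQL for this service

    We prototyped a GraphQL gateway and abandoned it: the caching
    layer is tuned specifically for REST response shapes, and the
    cost of reworking it exceeds the benefit at our current scale.
    Revisit if the caching layer is ever rebuilt.
    ```

    The rule alone tells an agent what to do; the commit message lets it (or a human) see when the constraint stops being load-bearing.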

    • By sd9 2026-03-10 23:05 · 1 reply

      I had a hunch that this comment was LLM-generated, and the last paragraph confirmed it. Kudos for managing to get so many upvotes though.

      "Where most [X] [Y]" is an up-and-coming LLM trope, which seems to have surfaced fairly recently. I have no idea why, considering most claims of that form are based on no data whatsoever.

      • By solarkraft 2026-03-10 23:33 · 1 reply

        It’s still an insightful and well written comment, but the LLM-ness does make me wonder whether this part was actually human-intended or just LLM filler:

        > The discipline to do it consistently enough that agents can actually retrieve and use it is what's missing, and structuring it for that purpose is genuinely underexplored territory

        Because I somewhat agree that discipline may be missing, but I don’t believe it to be a groundbreaking revelation that it’s actually quite easy to tell the LLM to put key reasoning that you give it throughout the conversation into the commits and issue it works on.

        • By svstoyanovv 2026-03-11 0:10 · 3 replies

          Suppose you spend months deeply researching a niche topic. You make your own discoveries, structure your own insights, and feed all of this tightly curated, highly specific context into an LLM. You essentially build a custom knowledge base and train the model on your exact mental framework.

          Is this fundamentally different from using a ghostwriter, an editor, or a highly advanced compiler? If I am doing the heavy lifting of context engineering and knowledge discovery, it feels restrictive to say I shouldn't utilize an LLM to structure the final output. Yet, the internet still largely views any AI-generated text as inherently "un-human" or low-effort.

          • By nothrabannosir 2026-03-11 11:10 · 1 reply

            I would ignore any HN content written by a ghost writer or editor. I guess I would flag compiler output but I’m not sure we’re talking about the same thing?

            I’m on the internet for human beings. I already read a newspaper for editors and books for ghostwriters.

            Not for long though - HN is dying. Just hanging around here waiting for the next thing, I guess…

            • By pixl97 2026-03-11 14:48

              Sorry man, the internet has died and is not being replaced by anything but an authoritarian nightmare.

              My only guess is that if you want actual humans, you'll have to do this IRL. Of course, we as humans have gotten used to the 24/7 availability and scale of the internet, so this is going to be a problem, as those meetings won't provide the hyperactive environment we want.

              Any other digital system will be gamed in one way or another.

          • By sd9 2026-03-11 0:15 · 1 reply

            The problem is: the structure of LLM outputs generally makes everything sound profound. It’s very hard to tell quickly whether a comment has actual signal or is just well-written bullshit.

            And because the cost of generating the comments is so low, there’s no longer an implicit stamp of approval from the author. It used to be the case that you could engage with a comment in good faith, because you knew somebody had spent effort creating it, so they must believe it’s worth the time. Even on a semi-anonymous forum like HN, that used to be a reliable signal.

            So a lot of the old heuristics just don’t work on LLM-generated comments, and in my experience 99% of them turn out to be worthless. So the new heuristic is to avoid them and point them out to help others avoid them.

            I would much rather just read the prompt.

            • By blackcatsec 2026-03-11 0:24

              I hadn't considered this so eloquently with LLM text output, but you're right. "LLMs make everything sound profound" and "well-written bullshit".

              This has severe ramifications for internet communications in general on forums like HN and others, where it seems LLM-written comments are sneaking in pretty much everywhere.

              It's also very, very dangerous :/ Because the structure of the writing falsely implies authority and trust where there shouldn't be, or where it's not applicable.

          • By solarkraft 2026-03-12 8:50 · 1 reply

            I really don’t mind this in principle (in fact it could help me out a lot). The problem is that the LLM often skews meaning by making up filler-phrases and it becomes hard to tell what you actually mean and what your LLM made up.

            • By svstoyanovv 2026-03-12 12:10

              Yeah, you need to be aware of hallucinations for sure. Today, for example, I used all the curated knowledge to give some structure to what I was working on, looked at examples of deep research, and brainstormed ideas around it - but I am the verifier and the steerer. 99% of the ideas were total BS IMO, but it inspired me on wording: what to use and how to combine things to achieve something simple and understandable.

              One idea that I haven't tried but will is to create a soul.md capturing my writing style, etc., to see the result (which will be an interesting experiment).

              But if you think about it, LLMs are good at generic stuff. Then you start curating context, using context engineering to structure and give form to that context - but that is your expertise, your knowledge, and your insights (if they are not synthetic, that is). So now you have something tailored to your needs, something that can be used for brainstorming, idea generation, and filtration - if we see this as a pyramid starting from the most expanded and generic, going down to specifics and things that only you can take and merge as solutions on your mental main branch. So now you have data and knowledge feeding and shaping the responses the LLM generates for you in the current session, maybe with the harness as well (harnesses are not doing a great job so far at being real connectors).

              Of course, we are far from AI working on autopilot with my knowledge, but I have become faster at generating ideas, forming new knowledge, testing it, verifying it, and deciding whether something stays synthetic or whether I should go deeper to discover more and explore its cases to form a new deep connection.

              So, is this something where I used an LLM just to generate the content for me? Or have I amplified myself and used the LLM to structure the response (maybe because this is not my primary language and I need it to form more in-depth sentences)?

    • By redhale 2026-03-11 10:07

      It is for this reason that I usually keep an "adr" folder in my repo to capture Architecture Decision Record documents in markdown. These allow the agent to get the "why" when it needs to. Useful for humans too.

      The challenge is really crafting your main agent prompt such that the agent only reads the ADRs when absolutely necessary. Otherwise they muddy the context for simple inside-the-box tasks.
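
      For reference, a minimal ADR skeleton in the widely used Nygard style - the field names are conventional, the placeholders illustrative:

      ```markdown
      # ADR-NNN: <decision title>

      ## Status
      Accepted | Superseded by ADR-MMM

      ## Context
      What forced the decision, and which alternatives were
      evaluated and rejected (and why).

      ## Decision
      The approach chosen.

      ## Consequences
      What becomes easier or harder, and when to revisit.
      ```

      The Context and Consequences sections are what carry the "why" an agent can't recover from the code alone.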

    • By sirtaj 2026-03-11 4:27 · 1 reply

      I have a skill and template for adding ADRs to the documentation for this purpose.

    • By mwcz 2026-03-12 0:44

      Is it a mad dream to wish that wasm never gets DOM access, and that instead a less memory-hungry dynamic representation of web pages is invented, usable only by wasm? Yeah, it's a mad dream. But it's also maddening that I can effortlessly open a 100 MB PDF while a browser can barely handle a 10 MB HTML document.

    • By smallnix 2026-03-10 22:28 · 1 reply

      A good rule would then be to capture such reasoning - at least when it's made during the session with the agent - in the commit messages the agent creates.
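
      One concrete way to nudge this, assuming git - the template file name and field wording are illustrative; `commit.template` is a standard git config option:

      ```text
      # .gitmessage - enable with: git config commit.template .gitmessage
      <summary line>

      Why:
        <the constraint or problem that forced this change>

      Rejected alternatives:
        <what was considered or tried, and why it lost>
      ```

      An agent instructed to fill in these fields leaves behind exactly the kind of rationale a later session can retrieve with `git log`.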


HackerNews