
Kiro, Spec-kit, Bmad, Tessl, and other SDD frameworks turn business analysts into Markdown reviewers. Isn't there a more agile way to use Coding Agents?
Spec-Driven Development (SDD) revives the old idea of heavy documentation before coding — an echo of the Waterfall era. While it promises structure for AI-driven programming, it risks burying agility under layers of Markdown. This post explores why a more iterative, natural-language approach may better fit modern development.
Coding assistants are intimidating: instead of an IDE full of familiar menus and buttons, developers are left with a simple chat input. How can we ensure that the code is correct with so little guidance?

To help people write good software with coding assistants, the open-source community designed a clever way to guide a coding agent. Based on an initial prompt and a few instructions, an LLM generates product specifications, an implementation plan, and a detailed list of tasks. Each document depends on the previous one, and users can edit the documents to refine the spec.
Eventually, these documents are handed over to a coding agent (Claude Code, Cursor, Copilot, you name it). The agent, now properly guided, should write solid code that satisfies the business requirements.
This approach is called Spec-Driven Development (SDD), and several toolkits can get you started, to name a few: Kiro, GitHub's spec-kit, Bmad, and Tessl.
If you want a comparison of these tools, I recommend the excellent article Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl by Birgitta Böckeler.
What does a spec look like? It’s essentially a bunch of Markdown files. Here’s an example using GitHub’s spec-kit, where a developer wanted to display the current date on a time-tracking app, resulting in 8 files and 1,300 lines of text.
Here’s another example using Kiro for a small feature (adding a “referred by” field to contacts in Atomic CRM):
Requirements.md - Design.md - Tasks.md
At first glance, these documents look relevant. But the devil is in the details. Once you start using SDD, a few shortcomings become clear:
Most coding agents already have a plan mode and a task list. In most cases, SDD adds little benefit. Sometimes, it even increases the cost of feature development.
To be fair, SDD helps agents stay on task and occasionally spots corner cases developers might miss. But the trade-off (spending 80% of your time reading instead of thinking) is, in my opinion, not worth it.
Maybe SDD doesn’t help much today because the toolkits are still young and the document prompts need refinement. If that’s the case, we just need to wait a few months until they improve.
But my personal opinion is that SDD is a step in the wrong direction. It tries to solve the wrong problem:
“How do we remove developers from software development?”
It does so by replacing developers with coding agents and guarding those agents with meticulous planning.
In that sense, SDD reminds me of the Waterfall model, which required massive documentation before coding so that developers could simply translate specifications into code.

But developers haven’t been mere executors for a long time, and Big Design Up Front has proven to fail most of the time because it piles up hypotheses. Software development is fundamentally a non-deterministic process, so planning doesn’t eliminate uncertainty (see the classic No Silver Bullet paper).
Also, who is SDD really for? You must be a business analyst to catch errors during the requirements phase, and a developer to catch errors during design. As such, it doesn’t solve the problem it claims to address (removing developers), and it can only be used by the rare individuals who master both trades. SDD repeats the same mistake as No Code tools, which promise a “no developer” experience but actually require developers to use them.
Agile methodologies solved the problem of non-deterministic development by trading predictability for adaptability. I believe they show us a path where coding agents can help us build reliable software, without drowning in Markdown.
Give a coding agent a simple enough problem, and it won’t go off the rails. Instead of translating complex requirements into complex design documents, we should split complex requirements into multiple simple ones.
I’ve successfully used coding agents to build fairly complex software without ever looking at the code, by following a simple approach inspired by the Lean Startup methodology:
Here’s an example: this 3D sculpting tool with adaptive mesh, which I built with Claude Code in about 10 hours:
I didn’t write any spec. I just added small features one by one, correcting the software when the agent misunderstood me or when my own idea didn’t work well. You can see my instructions in the coding session logs: they’re often short and vague, and sometimes lead to dead ends, but that’s fine. When implementing simple ideas is cheap, building in small increments is the fastest way to converge toward a good product.
Agile methodologies freed us from the bureaucracy of waterfall. They showed that close collaboration between product managers and developers eliminates the need for design documents. Coding agents supercharge Agile, because we can literally write the product backlog and see it being built in real time—no mockups needed!
This approach has one drawback compared to Spec-Driven Development: it doesn’t have a name. “Vibe coding” sounds dismissive, so let’s call it Natural Language Development.
I do have one frustration, though: coding agents use text, not visuals. Sometimes I want to point to a specific zone, but browser automation tools aren’t good enough (I’m looking at you, Playwright MCP Server). So if we need new tools to make coding agents more powerful, I think the focus should be on richer visual interactions.
Agile methodologies killed the specification document long ago. Do we really need to bring it back from the dead?
Spec-Driven Development seems born from the minds of CS graduates who know their project management textbooks by heart and dream of removing developers from the loop. I think it’s a missed opportunity to use coding agents to empower a new breed of developers, those who use natural language and build software iteratively.
Let me end with an analogy: coding agents are like the invention of the combustion engine. Spec-Driven Development keeps them confined to locomotives, when we should be building cars, planes, and everything in between. Oh, and just like combustion engines, we should use coding agents sparingly if we care about the environment.
One of the biggest productivity improvements I've had as a developer was to make a habit of planning all my work upfront. Specifically, when I pick up a ticket, I break it down into a big bullet point list of TODOs. Doing it this way leads to a better design, to dealing with inter-ticket dependencies upfront, to clarifying the spec upfront (which yes, is part of your job as a senior developer), and most valuable of all it allows me to get into flow state much more regularly when I am programming.
It's not a surprise to me that this approach also helps AI coding agents to work more effectively, as in-depth planning is essentially moving the thinking upfront.
(I wrote more about this here: https://liampulles.com/jira-tickets.html)
Waterfall gets an unnecessarily bad rap.
Nobody who delivers any system professionally thinks it’s a bad thing to plan out and codify every piece of the problem you’re trying to solve.
That’s part of what waterfall advocates for. Write a spec, and decompose to tasks until you can implement each piece in code.
Where the model breaks - and what software developers rightly hate - is unnecessarily rigid specifications.
If your project’s acceptance criteria are bound by a spec that has tasked you with the impossible, while simultaneously being impossible to change, then you, the dev, are screwed. This is doubly true in cases where you might not get to implementing the spec until months after the spec has been written - in which case, the spec has calcified into something immutable in stakeholders’ minds.
Agile is frequently used by weak product people and lousy project managers as an excuse to “figure it out when we get there”. It puts off any kind of strategic planning or decision making until the last possible second.
I’ve lost track of the number of times that this has caused rework in projects I’ve worked on.
> That’s part of what waterfall advocates for. Write a spec, and decompose to tasks until you can implement each piece in code.
That's what agile advocates for too. The difference is purely in how much spec you write before you start implementing.
Waterfall says specify the whole milestone up front before developing. Agile says create the minimum viable spec before implementing, then get back to iterating on the spec straight after putting it into a customer's hands.
Waterfall doesn't really get a bad rap it doesn't deserve. The longer those feedback loops are, the more scope you have for fucking up and not dealing with it quickly enough.
I don’t think this whole distinction between waterfall and agile really exists. They are more like caricatures of what really happens. You have always had leaders who could guide a project in a reasonable way, plan as much as necessary, respond to changes, and keep everything on track. And you have people who did the opposite. There are plenty of agile teams that refuse to respond to changes because “the sprint is already planned”, which then causes other teams to get stuck waiting for the changes they need. Or you have the next 8 sprints planned out in detail with no way to make changes.
In the end, there is project management that can keep a project on track while also being able to adapt to change, and there is project management that can't and chooses to hide behind some bureaucratic process. It has always existed and will keep existing no matter what you call it.
> The difference is purely in how much spec you write before you start implementing.
Ah, and therein lies the problem.
I’ve seen companies frequently elect “none at all” as the right amount of spec to write.
I’d rather have far too many specs than none.
> last possible second
Most of the people you describe here will try to start changes at the last possible second, and since our estimates are always wrong and preemptions always happen, they start all changes too late to avoid the consequences of waiting too long. It is the worst of all worlds, because the solution and the remediation are both rushed, leading to tech debt piling up instead of being paid down.
No battle plan survives contact with the enemy. But waterfall is not just a battle plan, it’s an entire campaign. And the problem comes both from trying to define problems we have little in house experience with, and then the sunk cost fallacy of having to redo all that “work” of project definition when reality and the customers end up not working the way we planned.
And BTW, trying to maintain the illusion of that plan results in many abstractions leaking. It creates impedance mismatches in the code and those always end up multiplying the difficulty of implementing new features. This is a major source of Business and Product not understanding why implementing a feature is so hard. It seems like it should just fit in with the existing features, but those features are all a house of cards built on an abstraction that is an outright fabrication.
I wrote this, a few years ago, about being careful to avoid "concrete galoshes"[0].
I've found that it's a balancing act; like so many things in software development. We can't rush in, willy-nilly, but it's also possible to kill the project by spending too much time, preparing (think "The Anal-Retentive Chef" skits, from Saturday Night Live).
Also, I have found that "It Depends" is an excellent mantra for life, in general, and software development, in specific.
I think having LLM-managed specs might be a good idea, as it reduces the overhead required to maintain them.
> Also, I have found that "It Depends" is an excellent mantra for life, in general, and software development, in specific.
Yeah, agree! Also something like "moderation is key": many things can be enjoyed, but enjoy them too much, or do them too much, and they kind of stop being so effective/good. The Swedes even have a specific word for something that isn't too much and isn't too little: "Lagom".
It can be applied to almost anything in life, from TDD, to extreme programming, and "waterfall vs agile", or engaging in kinks, or consumption of drugs, or...
I feel like things go bad when people either do none of something, or way too much of it. Finding the balance, the sweet spot, that's where the magic happens.
I for one really like developing and co-managing specs with an LLM.
I think it’s a great conversational tool for evaluating and shaking out weak points in a design at an early phase.
> One of the biggest productivity improvements I've had as a developer was to make a habit of planning all my work upfront. Specifically, when I pick up a ticket, I break it down into a big bullet point list of TODOs.
You're describing Agile.
How an individual developer chooses to process a single ticket is completely unrelated to agile or waterfall. Agile is about structuring your work over a project so that you get a happy customer who ends up with what they actually needed, not what they thought they wanted when they signed the contract, which turned out two months later to be completely not what they needed.
I agree with you. I think this is the Plan step in a "Plan-Act-Reflect" loop.
We usually do up front design for individual epics. And if we cannot predict what much of the work will be, first we do a spike to sit with the code and any new libraries long enough to guess what the surface area is.
All of these guesses will be wrong of course, but you want to end up with a diagram of how things were, how we would like them to be if time and existing code were no object, and what we decided to do given the limitations of the existing system, and resources (skill and time).
That second diagram informs what you do any time the third fails to match reality. If you fall back to what is, you will likely paint yourself into an architectural corner that will last for years. If you move toward the ideal or at least perpendicular, then there is no backpedaling.
Just to further clarify what I said above: What I talk about here specifically is for the developer who picks up a ticket to flesh out and plan the ticket before diving into code.
This doesn't say anything about what is appropriate for larger project planning. I don't have much experience doing project planning, so I'd look to others for opinions on that.
This is how I approach my stories as well. I used to call this “action plan” way before it became fashionable with the rise of AI agents.
It helps me not only reduce the complexity into more manageable chunks but also go back to the business team to smooth out the rough edges that would otherwise require a rework after review.
It doesn’t seem like you read the article, which argues against this sort of pre-planning.
The article itself advocates that one should "split complex requirements into multiple simple ones." So I don't disagree here, at least I don't think I do.
If we have a differing interpretation of what the article is motivating for, then please take the opportunity to contemplate an additional perspective and let it enrich your own.
This all seems like a re-hash of the top-down vs bottom-up arguments from before the 90’s (which were never resolved either).
There are two extremes, having everything you do planned up front, and literally planning nothing and just doing stuff.
The power of agile is supposed to be "I don't need to figure this out now, I'll figure it out based on experimentation" which doesn't mean nothing at all is planned.
If you're not planning a mission to Jupiter, you don't need every step planned out before you start. But in broad strokes it's also good to have a plan.
The optimum is to have some recorded shape of the work to come but to give yourself space to change your mind based on your experiences and to plan the work so you can change the plan.
The backlash against waterfall is the result of coming up with very detailed plans before you start, having those plans change constantly during the whole project requiring you to throw away large amounts of completed work, and when you find things that need to change, not being able to because management has decided on The Plan (which they will decide something new on later, but you can't change a thing).
For some decisions, the best time to plan is up front, for other decisions the best time to design is while you're implementing. There's a balance and these things need to be understood by everybody, but they are generally not.
I like to split out exploration and discovery (research) as a third step (the first step) in the process. Before a plan can be devised, research needs to be conducted. The more time that passes between research, planning, and execution, the greater the likelihood of rework or failure.
The best time to plan is dependent on how stable/unstable the environment is.
There are some things that just can't be discovered until you're in the middle of things. Try it one way and discover that another way is much better.
That is, without spending 10x to 100x the time up front to get it right the first time. But if you're not building space ships or nuclear reactors, it's so much faster and better to just do it and figure out things along the way. So much time spent planning and guessing about the future is time wasted and that's why early stage startups can do something in a week that would take an old enterprise 5 years.
"no plan survives first contact with the enemy"
And that's the source of agile and why too much planning is just wasting time for management to have something to do. No I don't know exactly what I'm going to do or when it'll be done, and if you leave it like that I'll get it done faster.
I agree. "Plans are worthless, but planning is invaluable." Its the process of thinking thru things, identifying risks, and having a few possible backup plans.
I vibe coded for months but switched to spec driven development in the last 6 months
I'm also old enough to have started my career learning the Rational Unified Process and then progressed through XP, agile, scrum, etc.
My process is I spend 2-3 hours writing a "spec" focusing on acceptance criteria and then by the end of the day I have a working, tested next version of a feature that I push to production.
I don't see how using a spec has made me less agile. My iteration takes 8 hours.
However, I see tons of useless specs. A spec is not a prompt. It's an actual definition of how to tell if something is behaving as intended or not.
People are notoriously bad at thinking about correctness in each scenario which is why vibe coding is so big.
People defer thinking about what correct and incorrect actually looks like for a whole wide scope of scenarios and instead choose to discover through trial and error.
I get 20x ROI on well defined, comprehensive, end to end acceptance tests that the AI can run. They fix everything from big picture functionality to minor logic errors.
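To make that concrete, here's a minimal sketch of what one such AI-runnable acceptance test could look like, reusing the "referred by" field from the Atomic CRM example above. The `ContactStore` class is a stand-in I invented so the snippet is self-contained; a real acceptance test would drive the deployed app end to end.

```python
# Hedged sketch: one spec point captured as a test an agent can run and iterate against.
# ContactStore is a hypothetical in-memory stand-in for the real system.

class ContactStore:
    def __init__(self):
        self._contacts = {}
        self._next_id = 1

    def create(self, name):
        contact = {"id": self._next_id, "name": name, "referred_by": None}
        self._contacts[self._next_id] = contact
        self._next_id += 1
        return contact

    def update(self, contact_id, **fields):
        self._contacts[contact_id].update(fields)

    def get(self, contact_id):
        return self._contacts[contact_id]


def test_referred_by_is_persisted():
    # Spec: "Given an existing contact, setting 'referred_by' is persisted
    # and returned on the next read."
    store = ContactStore()
    contact = store.create("Ada")
    store.update(contact["id"], referred_by="Grace")
    assert store.get(contact["id"])["referred_by"] == "Grace"
```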
Seems like you are all just redefining what spec and waterfall mean.
A spec was from a customer where it would detail every feature. They would be huge, but usually lack enough detail or be ambiguous. They would be signed off by the customer and then you'd deliver to the spec.
It would contain months, if not years, worth of work. Then after all this work the end product would not meet the actual customer needs.
A day's work is not a spec. It's a ticket's worth of work, which is agile.
Agile is an iterative process where you deliver small chunks of work and the customer course corrects as regular intervals. Commonly 3/4 week sprints, made up of many tickets that take hours or days, per course correct.
Generally each sprint had a spec, and each ticket had a spec. But it sounds like until now you've just been winging it, with vague definitions per feature. It's very common, especially where the PO or PM are bad at their job. Or the developer is informally acting as PO.
Now you're making specs per ticket, you're just now doing what many development teams already do. You're just bizarrely calling it a new process.
It's like watching someone point at a bicycle and insist it's a rocketship.
A customer generally provides requirements (the system should do...) which are translated into a spec (the module/function/method should do...). The set of specs map to requirements. Requirements may be derived from or represented by user stories, and specs may or may not be developed in an agile way or written down ahead of time. Whether you have or derive requirements and specs is entirely orthogonal to development methodology. People need to get away from the idea that having specs is any more than a formal description of what the code should do.
The approach we take is the specs are developed from the tests and tests exercise the spec point in its entirety. That is, a test and a spec are semantically synonymous within the code base. Any interesting thing we're playing with is using the specs alongside the signatures to have an LLM determine when the spec is incomplete.
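A rough illustration of that idea, under an invented convention where the test's docstring is the spec point, so the two can't drift apart and an LLM can later be asked whether the spec text still covers the signature:

```python
# Hypothetical convention: the docstring of each test IS the spec point it exercises.

def is_even(n: int) -> bool:
    return n % 2 == 0


def test_is_even():
    """SPEC-042: is_even(n) returns True exactly when n is divisible by 2,
    including zero and negative integers."""
    assert is_even(0)
    assert is_even(-4)
    assert not is_even(7)
```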
A spec consists of three different kinds of requirements: functional requirements, non-functional requirements, and constraints. It’s supposed to fully describe how the product responds to the context and the desires of stakeholders.
The problem I see a lot with Agile is that people over-focus on functional requirements in the form of user stories. Which in your case would be statements like “X should do…”
I don't necessarily disagree, but can you give an example of a non functional requirement that influences the design?
I always find the distinction between the two fuzzy (because many non-functional requirements can be argued to be functional requirements) but the list here is useful for the discussion: https://en.wikipedia.org/wiki/Non-functional_requirement
Take things like "capacity". When building a system, you may have a functional requirement like "User can retrieve imagery data if authorized" (that is the function of the system). A non-functional requirement might be how many concurrent users the system can handle at a time. This will influence your design because different system architectures/designs will support different levels of usage, even though the usage (the task of getting imagery to analyze or whatever) is the same whether it handles one user at a time or one million.
Yeah, that aligns with my thinking that such a view has rather a narrow view of a "function".
I'll probably be proven wrong eventually, but my main thought about spec driven dev with LLMs is that it introduces an unreliable compiler. It will produce different results every time it is run, and it's up to the developer to review the changes, which just seems like a laborious, error-prone task.
You don't need this type of work to be deterministic. It doesn't really matter if the LLM names a function "IsEven" vs "IsNumberEven".
Have you ever written the EXACT same code twice?
> it introduces an unreliable compiler.
So then by definition so are humans. If compiling is "taking text and converting it to code", that's literally us.
> it's up to the developer to review the changes. Which just seems like a laborious error prone task.
There are trade-offs to everything. Have you ever worked with an off-shore team? They tend to produce worse code and have 1% of the context the LLM does. I'd much rather review LLM-written code than "I'm not even the person you hired because we're scamming the system" developers.
You want it to be as close to deterministic as possible to reduce the risk of the LLM doing something crazy like deleting a feature or functionality. Sure, the idea is for reviews to catch it but it's easier to miss there when there is a lot of noise. I agree that it's very similar to an offshore team that's just focused on cranking out code versus caring about what it does.
Why would you want to rerun it? In that context a human is also an unreliable compiler. Put two humans on the task and you will get two different results. Even putting the same human on the same task again will yield something different. LLMs producing unreliable output that can't be reproduced is definitely a problem but not in this case.
Humans are unreliable compilers, but good devs are able to "think outside of the box" in terms of using creative ways to protect against their human foibles, while LLMs can't.
If I get a nonsensical requirement I push back. If I see some risky code I will think of some way to make it less risky.
Might be misunderstanding the workflow here, but I think if a change request comes and I alter the spec, I'd need to re run the llm bit that generates the code?
You'd want to have the alteration reference existing guides to the current implementation.
I haven't jumped in headfirst to the "AI revolution", but I have been systematically evaluating the tooling against various use cases.
The approach that tends to have the best result for me combines a collection of `RFI` (request for implementation) markdown documents that describe the work to be done with "guide" documents.
The guide documents need to keep getting updated as the code changes. I do this manually but probably the more enthusiastic AI workflow users would make this an automated part of their AI workflow.
It's important to keep the guides brief. If they get too long they eat context for no good reason. When LLMs write for humans, they tend to be very descriptive. When generating the guide documents, I always add an instruction to tell the LLM to "be succinct and terse", followed by "don't be verbose". This makes the guides into valuable high-density context documents.
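For illustration only, the kind of instruction described above might be wrapped up like this when asking the model to refresh a guide; the wording and the helper function are my own guess, not any real tool's API:

```python
# Hedged sketch of a prompt template for regenerating a terse guide document.
GUIDE_STYLE_RULES = (
    "Be succinct and terse. Don't be verbose. "
    "Prefer short bullet points over paragraphs. "
    "Keep only facts a coding agent needs in order to modify this module."
)

def guide_prompt(module_name: str, source_listing: str) -> str:
    """Build the instruction sent to the LLM to refresh a guide doc."""
    return (
        f"Update the guide document for `{module_name}` so it matches the current code below.\n"
        f"{GUIDE_STYLE_RULES}\n\n{source_listing}"
    )
```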
The RFIs are then used in a process. For complex problems, I first get the LLM to generate a design doc, then an implementation plan from that design doc, then finally I ask it to implement it while referencing the RFI, design doc, impl doc, and relevant guide docs as context.
If you're altering the spec, you wouldn't ask it to regen from scratch, but use the guide documents to compute the changes needed to implement the alteration.
I'm using claude code primarily.
Hm, maybe it's me who misunderstands the workflow. In that case I agree with you.
That said, I think the non-determinism when rerunning a coding task is actually pretty useful when you're trying to brainstorm solutions. I quite often rerun the same prompt multiple times (with slight modifications or using different models) and write down the implementation details that I like before writing the final prompt. When I'm not happy with the throwaway solutions at all I reconsider the overall specification.
However, the same non-determinism has also made me "lose" a solution that I threw out and where the real prompt actually performed worse. So nowadays I try to make it a habit to stash the throwaway solutions just in case. There's probably something in Cursor where you can dig out things you backtracked on but I'm not a power user.
You would need to rerun the LLM, but you wouldn't necessarily need to rebuild the codebase from scratch.
You can provide the existing spec, the new spec, and the existing codebase all as context, then have the LLM modify the codebase according to the updates to the spec.
"Look at the differences between v1 and v2 of spec.md and make a plan to implement the changes" would be another pretty common approach
No, this is the right take. Spec driven development is good, but having loose markdown "specs" that leave a bunch up to the discretion of the LLM is bad. The right approach is a project spec DSL that agents write, which can be compiled via codegen in a more controlled way.
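One way to picture such a DSL (purely a sketch under my own assumptions, not any existing tool): the agent emits structured, machine-checkable spec objects instead of loose Markdown, and a deterministic generator expands them into stubs the agent must then fill in:

```python
# Hypothetical mini spec DSL plus deterministic codegen, for illustration only.
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    name: str
    method: str
    path: str
    responses: list[int] = field(default_factory=lambda: [200])

@dataclass
class Spec:
    feature: str
    endpoints: list[Endpoint]

def generate_stubs(spec: Spec) -> str:
    """Expand the spec into handler stubs in a controlled, repeatable way."""
    lines = [f"# Feature: {spec.feature}"]
    for ep in spec.endpoints:
        lines.append(f"def {ep.name}():  # {ep.method} {ep.path}")
        lines.append(f"    raise NotImplementedError  # must answer with one of {ep.responses}")
    return "\n".join(lines)

spec = Spec(
    feature="Add 'referred by' to contacts",
    endpoints=[Endpoint("patch_contact", "PATCH", "/contacts/{id}", [200, 404])],
)
print(generate_stubs(spec))
```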
Could I see one of your specs as an example?
Same. I fancy myself a decent technical communicator and architect. I write specs which consist of giant lists of acceptance criteria, on my phone, lying in bed...
Kick that over to some agents to bash on, check in and review here and there, maybe a little mix of vibe and careful corrections by me, and it's done!
Usually in less time, but! any time an agent is working on work shit, I'm working on my race car... so it's a win win win to me. I'm still using my brain, no longer slogging through awful "human centered" programming languages, more time for my hobbies.
Isn't that the dream?
Now, to crack this research around generative gibber-lang programming... 90% of our generative code problems are related to the programming languages themselves. Intended for humans, optimized for human interaction, speed, and parsing. Let the AIs design, speak, write, and run the code. All I care about is that the program passes my tests and does what I intended. I do not care if it has indents, or other stupid dogmatic aspects of what makes one language equally usable to any other, but no "my programming language is better!", who cares. Loving this era.
> People defer thinking about what correct and incorrect actually looks like for a whole wide scope of scenarios and instead choose to discover through trial and error.
LLMs are _still_ terrible at deriving even the simplest of logical entailment. I've had the latest and greatest Claude and GPT derive 'B instead of '(not B) from '(and A (not B)) when 'A and 'B are anything but the simplest of English sentences. I shudder to think what they decide the correct interpretation of a spec written in prose is.
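For what it's worth, the propositional fact itself is trivial to check mechanically, which is part of the point; here's a throwaway brute-force check (mine, not the commenter's):

```python
# Brute-force check that (A and not B) entails (not B) but does not entail B.
from itertools import product

def entails(premise, conclusion):
    """True if every assignment satisfying the premise also satisfies the conclusion."""
    return all(conclusion(a, b)
               for a, b in product([True, False], repeat=2)
               if premise(a, b))

premise = lambda a, b: a and not b
print(entails(premise, lambda a, b: not b))  # True:  (A and not B) entails (not B)
print(entails(premise, lambda a, b: b))      # False: (A and not B) does not entail B
```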
I would love to see a prompt where it fails such a thing. Do you have an example?
Lisp quotes are confusing in prose.
Still better than my coworkers ...
This article is for those who have already made up their minds that "spec-based development" isn't for them.
I believe (and practice) that spec-based development is one of the future methodologies for developing projects with LLMs. At least it will be one of the niches.
The author thinks about specs as waterfall. I think about them as a context entrypoint for LLMs. Given enough info about the project (including user stories, tech design requirements, filesystem structure and meaning, core interfaces/models, functions, etc.), the LLM will be able to build sufficient initial context for the solution and expand it by reading files and grepping text. And the most interesting part is that you can make the LLM keep the context/spec/project file updated each time it updates the project. Voilà: now you are agile again: just keep iterating on the context/spec/project.
This is the key, with test driven dev sprinkled in.
You provide basic specs and can work with LLMs to create thorough test suites that cover the specs. Once specs are captured as tests, the LLM can no longer hallucinate.
I model this as "grounding". Just like you need to ground an electrical system, you need to ground the LLM to reality. The tests do this, so they are REQUIRED for all LLM coding.
Once a framework is established, you require tests for everything. No code is written without tests. These can also be perf tests. They need solid metrics in order to output quality.
The tests provide context and documentation for future LLM runs.
This is also the same way I'd handle foreign teams that, at no fault of their own, would often output subpar code. It was mainly because of a lack of cultural context, communication misunderstandings, and no solid metrics to measure against.
Our main job with LLMs now as software engineers is a strange sort of manager, with a mix of solutions architect, QA director, and patterns expertise. It is actually a lot of work and requires a lot of human people to manage, but the results are real.
I have been experimenting with how meta I can get with this, and the results have been exciting. At one point, I had well over 10 agents working on the same project in parallel, following several design patterns, and they worked so fast I could no longer follow the code. But with layers of tests, layers of agents auditing each other, and isolated domains with well defined interfaces (just as I would expect in a large scale project with multiple human teams), the results speak for themselves.
I write all this to encourage people to take a different approach. Treat the LLMs like they are junior devs or a foreign team speaking a different language. Remember all the design patterns used to get effective use out of people regardless of these barriers. Use them with the LLMs. It works.
> You provide basic specs and can work with LLMs to create thorough test suites that cover the specs. Once specs are captured as tests, the LLM can no longer hallucinate.
Except when it decides to remove all the tests, change their meaning to make them pass, or write something not in the spec. Hallucinations are not a problem of the input given; they're in the foundations of LLMs, and so far nobody has solved them. Thinking it won't happen can and will have really bad outcomes.
It doesn't matter because use of version control is mandatory. When you see things missing or bypassed, audit-instructed LLMs detect these issues and roll-back changes.
I like to keep domains with their own isolated workspaces and git repos. I am not there yet, but I plan on making a sort of local-first gitflow where agents have to pull the codebase, make a new branch, make changes, and submit pull requests to the main codebase.
I would ultimately like to make this a oneliner for agents, where new agents are sandboxed with specific tools and permissions cloning the main codebase.
Fresh-context agents then can function as code reviewers, with escalation to higher tier agents (higher tier = higher token count = more expensive to run) as needed.
In my experience, with correct prompting, LLMs will self-correct when exposed to auditors.
If mistakes do make it through, it is all version controlled, so rolling back isn't hard.
This is the right flow. As agents get better, work will move from devs orchestrating in ides/tuis to reactive, event driven orchestration surfaced in VCS with developers on the loop. It cuts out the middleman and lets teams collaboratively orchestrate and steer.
You can solve this easily by having a separate agent write the tests, and not giving the implementing agent write permission on test files.
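A hedged sketch of how that separation might be enforced mechanically; the `tests/` path and the `main` branch convention are assumptions for illustration, not a feature of any particular agent:

```python
# Reject an implementation branch that touches protected test files.
import subprocess
import sys

PROTECTED_PREFIX = "tests/"

def changed_files() -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", "main...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

if __name__ == "__main__":
    violations = [f for f in changed_files() if f.startswith(PROTECTED_PREFIX)]
    if violations:
        print("Implementing agent modified protected test files:", violations)
        sys.exit(1)
```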
> Once specs are captured as tests, the LLM can no longer hallucinate.
Tests are not a correctness proof. I can’t trust LLMs to correctly reason about their code, and tests are merely a sanity check; they can’t verify that the reasoning behind the code was correct.
They do not need to be correctness proofs. With appropriate prompting and auditing, the tests allow the LLM see if the code functions as expected and iterates. It also serves as functionality documentation and audit documentation.
I also actually do not care if it reasons properly. I care about results that eventually stabilize on a valid solution. These results do not need to be based on "thinking"; they can be experimentally derived. Agents can own whatever domain they work in, and acquire results with whatever methods they choose given the constraints they are subject to. I measure results by validating via e2e tests, penetration testing, and human testing.
I also measure via architecture agents and code review agents that validate adherence to standards. If standards are violated a deeper audit is conducted, if it becomes a pattern, the agent is modified until it stabilizes again.
This is more like numerical methods of relaxation. You set the edge conditions / constraints, then iterate the system until it stabilizes on a solution. The solution in this case, however, is meta, because you are stabilizing on a set of agents that can stabilize on a solution.
Agents don't "reason" or "think", and I don't need to trust them. I trust only results.
The point is that tests generally only test specific inputs and circumstances. They are a heuristic, but don’t generalize to all possible states and inputs. It’s like probing a mathematical function on some points, where the results being correct on the probed points doesn’t mean the function will yield the desired result on all points of its domain. If the tests are the only measure, they become the target.
The value of a good developer is that they generalize over all possible inputs and states. That’s something current LLMs can’t be trusted to do (yet?).
Not relevant.
Hallucinations don't matter if the mechanics of the pipeline mitigate them. In other words, at a systems level, you can mitigate hallucinations. The agent level noise is not a concern.
This is no different from CPU design or any other noisy system. Transistors are not perfect and there is always error, so you need error correction. At a transistor level, CPUs are unreliable. At a systems level, they are clean and reliable.
This is no different. The stochastic noisiness of individual agents can be mitigated with redundancy, constraints, and error correction at a systems level.
What is this tripe? It even reads exactly like a response I would expect from a bad AI prompt.
I think that your sort of thinking will not age well. I wish you luck.
But do you understand the problem and its context well enough to write tests for the solution?
Take Prolog and logic programming. It's all about describing the problem and its context and letting the solver find the solution. Try writing your specs in pseudo-Prolog code and you will be surprised by all the missing information you're leaving up to chance.
I am not writing the tests, LLMs are.
My objective is to write prompts for LLMs that can write prompts for LLMs that can write code.
When there is a problem downstream in the descendant hierarchy, it is a failure of the parent LLM's prompts, so I correct it at the highest level and allow it to trickle down.
This eventually resolves into a stable configuration with domain expertise towards whatever function I require, in whatever language is best suited for the task.
If I have to write tests manually, I have already failed. It doesn't matter how skilled I am at coding or capable I am at testing. It is irrelevant. Everything that can be automated should be automated, because it is a force amplifier.
> Giving enough info about the project (including user stories, tech design requirements, filesystem structure and meaning, core interfaces/models, functions, etc)
What's not waterfall about this is lost on me.
Sounds to me like you're arguing waterfall is fine if each full run is fast/cheap enough, which could happen with LLMs and simple enough projects. [0]
Agile was offering incremental spec production, which had the tremendous advantage of accumulating knowledge incrementally as well. It might not be a good fit for LLMs, but revising the definition to make it fit doesn't help IMHO.
[0] Reminds me that reducing the project scopes to smaller runs was also a well established way to make waterfall bearable.
Waterfall with short iteration time is not possible by definition.
You might as well say agile is still waterfall; what are sprints if not waterfall with a 2 week iteration time? And Kanban is just a collection of independent waterfalls... It's not a useful definition of waterfall.
Just as most agile projects aren't Agile, most waterfall projects weren't strict Waterfall as it was preached.
That being said, when for instance you had a project that should take 2 years and involve a dozen teams, you'd try to cut it into 3 or 4 phases, even if it would only be "released" and fully tested at the end of it all. At least if your goal was to have it see the light in a reasonable time frame.
Where I worked we also did integration runs at given checkpoints to be able to iron out issues earlier in the process.
PS: on agile, the main specificity I'm seeing is the ability to infinitely extend a project, as the scope and specs are typically set on the go. Which is a feature if you're a contractor for a project. You can't do that with waterfall.
Most shops have a mix of pre-planning and on-the-go speccing to get a realistic process.
> Waterfall with short iteration time is not possible by definition.
What definition would that be?
Regardless, at this point it's all semantics. What I care about is how you do stuff, not the label you assign and in my book writing specs to ground the LLM is a good idea. And I don't even like specs, but in this instance, it works.
> What's not waterfall about this is lost on me.
Exactly. There is a spec, but there is no waterfall required to work on and maintain it. The author dismissed spec-based development exactly because they saw a resemblance to waterfall. But waterfall isn't required for spec-centric development.
> There is a spec, but there is no waterfall required to work and maintain it.
The problem with waterfall is not that you have to maintain the spec, but that a spec is the wrong way to build a solution. So, it doesn't matter if the spec is written by humans or by LLMs.
I don't see the point of maintaining a spec for LLMs to use as context. They should be able to grep and understand the code itself. A simple readme or a design document, which already should exist for humans, should be enough.
> I don't see the point of maintaining a spec for LLMs to use as context. They should be able to grep and understand the code itself.
“I don’t see the point of maintaining a documentation for developers. They should be able to grep and understand the code itself”
“I don’t see the point of maintaining tests for developers. They should be able to grep and understand the code itself”
“I don’t see the point of compilers/linters for developers. They should be able to grep and find issues themselves”
The thing is that the parallels you are drawing are to things that are very explicitly not the source of the code, but exist alongside it. Code is the ultimate truth. Documentation is a more humane way to describe it. Tests are there to ensure that what is there is what we want. And linters are there to warn us of specific errors. None of these create code.
To go from spec to code requires a lot of decisions (each introducing technical debt). Automating the process removes control over those decisions and over the ultimate truth that is the code. But why can't the LLM retain a trace of the decisions, so that it presents control points to alter the results? Instead, it's always a rewrite from scratch.
> “I don’t see the point of maintaining a documentation for developers. They should be able to grep and understand the code itself”
I cannot believe that this comment was made in good faith, when I clearly wrote above that documentation should already exist for humans:
> A simple readme or a design document, which already should exist for humans, should be enough.
I see rapid, iterative Waterfall.
The downfall of Waterfall is that there are too many unproven assumptions in too long of a design cycle. You don't get to find out where you were wrong until testing.
If you break a waterfall project into multiple, smaller, iterative Waterfall processes (a sprint-like iteration), and limit the scope of each, you start to realize some of the benefits of Agile while providing a rich context for directing LLM use during development.
Comparing this to agile is missing the point a bit. The goal isn't to replace agile, it's to find a way that brings context and structure to vibe coding to keep the LLM focused.
"rapid, iterative Waterfall" is a contradiction. Waterfall means only one iteration. If you change the spec after implementation has started, then it's not waterfall. You can't change the requirements, you can't iterate.
Then again, Waterfall was never a real methodology; it was a straw man description of early software development. A hyperbole created only to highlight why we should iterate.
> Then again, Waterfall was never a real methodology; it was a straw man description of early software development. A hyperbole created only to highlight why we should iterate.
If only this were accurate. Royce's chart (at the beginning of the paper, what became Waterfall, but not what he recommended by the end of the paper) has been adopted by the DOD. They're slowly moving away from it, but it's used on many real-world projects and fails about as spectacularly as you'd expect. If projects deliver on-time, it's because they blow up their budget and have people work long days and weekends for months or years at a time. If it delivers on budget, it's because they deliver late or cut out features. Either way, the pretty plan put into the presentations is not met.
People really do (and did) think that the chart Royce started with was a good idea. They're not competent, but somehow they got into positions in management to force this stupidity.
I would maybe argue that there is a sweet spot of how much you feed in (with some variability depending on the task). I tend to keep my initial instructions succinct, then build them up iteratively. Others write small novels of instructions before they start, which I personally don't like as much. I don't always know what I don't know, so speccing ahead in great detail can sometimes be detrimental.
Agree. I don't use the term "spec" as it was used with "spec-based development" before LLMs. There, details were required to be defined upfront. With LLMs you can start with a vague spec, missing some sections, and clarify it over iterations.
The sweet spot will be a moving target. LLMs' built-in assumptions and ways of expanding concepts will change as LLMs develop. So best practices will change with the LLMs' capabilities. The same set of instructions, not too detailed, was handled so much better by Sonnet 4 than by Sonnet 3 in my experience. Sonnet 3.5 was for me the breaking point which showed that context-based LLM development is a feasible strategy.
Yes I think specs as the context entry point is a great framing.
The word "spec" is a bit overloaded and I think we're all using it to define many things. There's a high-level spec and there are detailed component-level specs all of which kind of co-exist.
I would simply replace LLM with agent in your reasoning, in the sense that you'll need a strong preprocessing step and multiple iterations to exploit such complete specs.
There is sense in your words, especially in the context of the modern-day vocabulary.
I thought about the concept of this sort of methodology before "agent" (which I would define as "side effects with LLM integration") was marketed into community vocabulary. And I'm still rigidly sticking to what I consider "basics". Hope that does not impede understanding.
I had a small embedded project and I did it > 70% using LLMs. This is exactly how I did it. Specs are great for grounding the LLM. Coding with LLMs is going to mean relying more on process since you can't fully trust them. It means writing specs, writing small models to validate, writing tests, and a lot of code review to understand what the heck it's doing.