
The research-plan-implement workflow I use to build software with Claude Code, and why I never let it write code until I've approved a written plan.
I’ve been using Claude Code as my primary development tool for roughly nine months, and the workflow I’ve settled into is radically different from what most people do with AI coding tools. Most developers type a prompt, sometimes use plan mode, fix the errors, repeat. The more terminally online are stitching together ralph loops, MCPs, gas towns (remember those?), and so on. The results in both cases are a mess that completely falls apart for anything non-trivial.
The workflow I’m going to describe has one core principle: never let Claude write code until you’ve reviewed and approved a written plan. This separation of planning and execution is the single most important thing I do. It prevents wasted effort, keeps me in control of architecture decisions, and produces significantly better results with minimal token usage than jumping straight to code.
```mermaid
flowchart LR
  R[Research] --> P[Plan]
  P --> A[Annotate]
  A -->|repeat 1-6x| A
  A --> T[Todo List]
  T --> I[Implement]
  I --> F[Feedback & Iterate]
```
Every meaningful task starts with a deep-read directive. I ask Claude to thoroughly understand the relevant part of the codebase before doing anything else. And I always require the findings to be written into a persistent markdown file, never just a verbal summary in the chat.
> read this folder in depth, understand how it works deeply, what it does and all its specificities. when that’s done, write a detailed report of your learnings and findings in research.md

> study the notification system in great details, understand the intricacies of it and write a detailed research.md document with everything there is to know about how notifications work

> go through the task scheduling flow, understand it deeply and look for potential bugs. there definitely are bugs in the system as it sometimes runs tasks that should have been cancelled. keep researching the flow until you find all the bugs, don’t stop until all the bugs are found. when you’re done, write a detailed report of your findings in research.md
Notice the language: “deeply”, “in great details”, “intricacies”, “go through everything”. This isn’t fluff. Without these words, Claude will skim. It’ll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.
The written artifact (research.md) is critical. It’s not about making Claude do homework. It’s my review surface. I can read it, verify Claude actually understood the system, and correct misunderstandings before any planning happens. If the research is wrong, the plan will be wrong, and the implementation will be wrong. Garbage in, garbage out.
This is the most expensive failure mode with AI-assisted coding, and it’s not wrong syntax or bad logic. It’s implementations that work in isolation but break the surrounding system. A function that ignores an existing caching layer. A migration that doesn’t account for the ORM’s conventions. An API endpoint that duplicates logic that already exists elsewhere. The research phase prevents all of this.
Once I’ve reviewed the research, I ask for a detailed implementation plan in a separate markdown file.
> I want to build a new feature <name and description> that extends the system to perform <business outcome>. write a detailed plan.md document outlining how to implement this. include code snippets

> the list endpoint should support cursor-based pagination instead of offset. write a detailed plan.md for how to achieve this. read source files before suggesting changes, base the plan on the actual codebase
The generated plan always includes a detailed explanation of the approach, code snippets showing the actual changes, file paths that will be modified, and considerations and trade-offs.
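To make this concrete, here is the kind of snippet a cursor-based pagination plan might contain. This is an illustrative sketch only: the names (`Item`, `listItems`) are made up, and a real endpoint would encode the cursor opaquely (e.g. base64) and page through a database index rather than an in-memory array.

```typescript
interface Item {
  id: string; // assumed lexicographically sortable, e.g. a ULID
  title: string;
}

interface Page {
  items: Item[];
  nextCursor: string | null; // last id of this page, or null at the end
}

function listItems(all: Item[], limit: number, cursor?: string): Page {
  // Sort by id so the cursor has a stable total order to resume from.
  const sorted = [...all].sort((a, b) =>
    a.id < b.id ? -1 : a.id > b.id ? 1 : 0
  );
  // Resume strictly after the cursor; unlike offset pagination, this stays
  // stable when rows are inserted or deleted between requests.
  const rest = cursor ? sorted.filter((it) => it.id > cursor) : sorted;
  const items = rest.slice(0, limit);
  const nextCursor = rest.length > limit ? items[items.length - 1].id : null;
  return { items, nextCursor };
}
```

The snippet in the plan is what I review and annotate; if the cursor scheme or the ordering column is wrong, I catch it here rather than in a diff.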
I use my own .md plan files rather than Claude Code’s built-in plan mode. The built-in plan mode sucks. My markdown file gives me full control. I can edit it in my editor, add inline notes, and it persists as a real artifact in the project.
One trick I use constantly: for well-contained features where I’ve seen a good implementation in an open source repo, I’ll share that code as a reference alongside the plan request. If I want to add sortable IDs, I paste the ID generation code from a project that does it well and say “this is how they do sortable IDs, write a plan.md explaining how we can adopt a similar approach.” Claude works dramatically better when it has a concrete reference implementation to work from rather than designing from scratch.
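The sortable-ID reference might look roughly like the following. This is my own minimal sketch of a ULID-style time-sortable ID, not code from any particular project: a fixed-width timestamp prefix followed by a random suffix, so IDs generated later sort lexicographically after earlier ones.

```typescript
// Minimal time-sortable ID sketch. Real projects should use a ULID or
// KSUID library; this only illustrates the idea of a sortable prefix.
function sortableId(now: number = Date.now()): string {
  // Millisecond timestamp, base36, zero-padded to a fixed width so that
  // string comparison matches numeric comparison.
  const time = now.toString(36).padStart(9, "0");
  // Random suffix for uniqueness; note IDs created in the same
  // millisecond do not sort by creation order.
  const rand = Array.from({ length: 8 }, () =>
    Math.floor(Math.random() * 36).toString(36)
  ).join("");
  return `${time}${rand}`;
}
```

Pasting a known-good version of this kind of code into the plan request gives Claude the shape of the solution up front.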
But the plan document itself isn’t the interesting part. The interesting part is what happens next.
This is the most distinctive part of my workflow, and the part where I add the most value.
```mermaid
flowchart TD
  W[Claude writes plan.md] --> R[I review in my editor]
  R --> N[I add inline notes]
  N --> S[Send Claude back to the document]
  S --> U[Claude updates plan]
  U --> D{Satisfied?}
  D -->|No| R
  D -->|Yes| T[Request todo list]
```
After Claude writes the plan, I open it in my editor and add inline notes directly into the document. These notes correct assumptions, reject approaches, add constraints, or provide domain knowledge that Claude doesn’t have.
The notes vary wildly in length. Sometimes a note is two words: “not optional” next to a parameter Claude marked as optional. Other times it’s a paragraph explaining a business constraint or pasting a code snippet showing the data shape I expect.
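A two-word note like “not optional” translates directly into a diff in the next plan revision. A hypothetical before/after (the function and parameter names here are made up for illustration):

```typescript
// Plan's first draft: Claude marked userId optional.
function sendReminderDraft(taskId: string, userId?: string): string {
  return `remind ${userId ?? "someone"} about ${taskId}`;
}

// After the "not optional" note: the caller must always supply userId,
// so the fallback path disappears entirely.
function sendReminder(taskId: string, userId: string): string {
  return `remind ${userId} about ${taskId}`;
}
```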
Then I send Claude back to the document:
> I added a few notes to the document, address all the notes and update the document accordingly. don’t implement yet
This cycle repeats 1 to 6 times. The explicit “don’t implement yet” guard is essential. Without it, Claude will jump to code the moment it thinks the plan is good enough. It’s not good enough until I say it is.
The markdown file acts as shared mutable state between me and Claude. I can think at my own pace, annotate precisely where something is wrong, and re-engage without losing context. I’m not trying to explain everything in a chat message. I’m pointing at the exact spot in the document where the issue is and writing my correction right there.
This is fundamentally different from trying to steer implementation through chat messages. The plan is a structured, complete specification I can review holistically. A chat conversation is something I’d have to scroll through to reconstruct decisions. The plan wins every time.
Three rounds of “I added notes, update the plan” can transform a generic implementation plan into one that fits perfectly into the existing system. Claude is excellent at understanding code, proposing solutions, and writing implementations. But it doesn’t know my product priorities, my users’ pain points, or the engineering trade-offs I’m willing to make. The annotation cycle is how I inject that judgement.
Before implementation starts, I always request a granular task breakdown:
> add a detailed todo list to the plan, with all the phases and individual tasks necessary to complete the plan - don’t implement yet
This creates a checklist that serves as a progress tracker during implementation. Claude marks items as completed as it goes, so I can glance at the plan at any point and see exactly where things stand. Especially valuable in sessions that run for hours.
When the plan is ready, I issue the implementation command. I’ve refined this into a standard prompt I reuse across sessions:
> implement it all. when you’re done with a task or phase, mark it as completed in the plan document. do not stop until all tasks and phases are completed. do not add unnecessary comments or jsdocs, do not use any or unknown types. continuously run typecheck to make sure you’re not introducing new issues.
This single prompt encodes everything that matters:

- Finish everything: do not stop until all tasks and phases are completed.
- Track progress: mark each task or phase as completed in the plan document.
- Keep the diff clean: no unnecessary comments or JSDoc.
- Keep types strict: no `any` or `unknown`.
- Verify continuously: run typecheck throughout so new issues surface immediately.
I use this exact phrasing (with minor variations) in virtually every implementation session. By the time I say “implement it all,” every decision has been made and validated. The implementation becomes mechanical, not creative. This is deliberate. I want implementation to be boring. The creative work happened in the annotation cycles. Once the plan is right, execution should be straightforward.
Without the planning phase, what typically happens is Claude makes a reasonable-but-wrong assumption early on, builds on top of it for 15 minutes, and then I have to unwind a chain of changes. The “don’t implement yet” guard eliminates this entirely.
Once Claude is executing the plan, my role shifts from architect to supervisor. My prompts become dramatically shorter.
```mermaid
flowchart LR
  I[Claude implements] --> R[I review / test]
  R --> C{Correct?}
  C -->|No| F[Terse correction]
  F --> I
  C -->|Yes| N{More tasks?}
  N -->|Yes| I
  N -->|No| D[Done]
```
Where a planning note might be a paragraph, an implementation correction is often a single sentence that just points at the offending code, such as naming the deduplicateByTitle function that needs fixing. Claude has the full context of the plan and the ongoing session, so terse corrections are enough.
Frontend work is the most iterative part. I test in the browser and fire off rapid corrections.
For visual issues, I sometimes attach screenshots. A screenshot of a misaligned table communicates the problem faster than describing it.
I also reference existing code constantly.
This is far more precise than describing a design from scratch. Most features in a mature codebase are variations on existing patterns. A new settings page should look like the existing settings pages. Pointing to the reference communicates all the implicit requirements without spelling them out. Claude would typically read the reference file(s) before making the correction.
When something goes in the wrong direction, I don’t try to patch it. I revert and re-scope by discarding the git changes.
Narrowing scope after a revert almost always produces better results than trying to incrementally fix a bad approach.
Even though I delegate execution to Claude, I never give it total autonomy over what gets built. I do the vast majority of the active steering in the plan.md documents.
This matters because Claude will sometimes propose solutions that are technically correct but wrong for the project. Maybe the approach is over-engineered, or it changes a public API signature that other parts of the system depend on, or it picks a more complex option when a simpler one would do. I have context about the broader system, the product direction, and the engineering culture that Claude doesn’t.
```mermaid
flowchart TD
  P[Claude proposes changes] --> E[I evaluate each item]
  E --> A[Accept as-is]
  E --> M[Modify approach]
  E --> S[Skip / remove]
  E --> O[Override technical choice]
  A & M & S & O --> R[Refined implementation scope]
```
Cherry-picking from proposals: When Claude identifies multiple issues, I go through them one by one: “for the first one, just use Promise.all, don’t make it overly complicated; for the third one, extract it into a separate function for readability; ignore the fourth and fifth ones, they’re not worth the complexity.” I’m making item-level decisions based on my knowledge of what matters right now.
Trimming scope: When the plan includes nice-to-haves, I actively cut them. “remove the download feature from the plan, I don’t want to implement this now.” This prevents scope creep.
Protecting existing interfaces: I set hard constraints when I know something shouldn’t change: “the signatures of these three functions should not change, the caller should adapt, not the library.”
Overriding technical choices: Sometimes I have a specific preference Claude wouldn’t know about: “use this model instead of that one” or “use this library’s built-in method instead of writing a custom one.” Fast, direct overrides.
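The Promise.all override in the cherry-picking example above is shorthand for a standard concurrency pattern: independent async operations should start together instead of being awaited one by one. A generic sketch (loadDashboard and the fetched values are stand-ins, not real code):

```typescript
// Simulate three independent async calls with a small helper.
const delay = <T>(value: T, ms: number): Promise<T> =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

async function loadDashboard() {
  // Promise.all starts all three operations concurrently and resolves
  // once every one is done; sequential awaits would run them one at a
  // time and take roughly the sum of the latencies.
  const [user, posts, stats] = await Promise.all([
    delay("user", 30),
    delay(["post"], 30),
    delay({ views: 1 }, 30),
  ]);
  return { user, posts, stats };
}
```

The point of the terse correction is exactly this: “just use Promise.all” carries the whole pattern without restating it.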
Claude handles the mechanical execution, while I make the judgement calls. The plan captures the big decisions upfront, and selective guidance handles the smaller ones that emerge during implementation.
I run research, planning, and implementation in a single long session rather than splitting them across separate sessions. A single session might start with deep-reading a folder, go through three rounds of plan annotation, then run the full implementation, all in one continuous conversation.
I am not seeing the performance degradation everyone talks about past the 50% context-window mark. If anything, by the time I say “implement it all,” Claude has spent the entire session building understanding: reading files during research, refining its mental model during annotation cycles, absorbing my domain-knowledge corrections.
When the context window fills up, Claude’s auto-compaction maintains enough context to keep going. And the plan document, the persistent artifact, survives compaction in full fidelity. I can point Claude to it at any point in time.
Read deeply, write a plan, annotate the plan until it’s right, then let Claude execute the whole thing without stopping, checking types along the way.
That’s it. No magic prompts, no elaborate system instructions, no clever hacks. Just a disciplined pipeline that separates thinking from typing. The research prevents Claude from making ignorant changes. The plan prevents it from making wrong changes. The annotation cycle injects my judgement. And the implementation command lets it run without interruption once every decision has been made.
Try my workflow and you’ll wonder how you ever shipped anything with coding agents without an annotated plan document sitting between you and the code.
I'm also building nominal.dev, because nobody should be on-call in 2026.
> Notice the language: “deeply”, “in great details”, “intricacies”, “go through everything”. This isn’t fluff. Without these words, Claude will skim. It’ll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.
This makes no sense to my intuition of how an LLM works. It's not that I don't believe this works, but my mental model doesn't capture why asking the model to read the content "more deeply" will have any impact on whatever output the LLM generates.
It's the attention mechanism at work, along with a fair bit of Internet one-up-manship. The LLM has ingested all of the text on the Internet, as well as Github code repositories, pull requests, StackOverflow posts, code reviews, mailing lists, etc. In a number of those content sources, there will be people saying "Actually, if you go into the details of..." or "If you look at the intricacies of the problem" or "If you understood the problem deeply" followed by a very deep, expert-level explication of exactly what you should've done differently. You want the model to use the code in the correction, not the one in the original StackOverflow question.
Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts. It tells the model to pay attention to the part of the corpus that has those terms, weighting them more highly than all the other programming samples that it's run across.
I don’t think this is a result of the base training data (“the internet”). It’s a post-training behavior, created during reinforcement learning. Codex behaves totally differently in that regard: by default, Codex reads a lot of potentially relevant files before it goes and writes files.
Maybe you remember that, without reinforcement learning, the models of 2019 just completed the sentences you gave them. There were no tool calls like reading files. Tool-calling behavior is company-specific and highly tuned to their harnesses. How often they call a tool is not part of the base training data.
Modern LLMs are certainly fine-tuned on data that includes examples of tool use, mostly the tools built into their respective harnesses, but also external/mock tools so they don't overfit on only using the toolset they expect to see in their harnesses.
IDK the current state, but I remember that, last year, the open source coding harnesses needed to provide exactly the tools that the LLM expected, or the error rate went through the roof. Some, like Grok and Gemini, only recently managed to make tool calls somewhat reliable.
Of course I can't be certain, but I think the "mixture of experts" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.
Just a theory.
Notice that MoE isn’t different experts for different types of problems. It’s per token, and not really connected to problem type.
So if you send Python code, the first token in a function can go to one expert, the second to another expert, and so on.
Can you back this up with documentation? I don't believe that this is the case.
The router that routes the tokens between the "experts" is part of the training itself as well. The name MoE is really not a good one, as it makes people believe it operates on a more coarse level and that each of the experts is somehow trained on a different corpus etc. But what do I know, there are new archs every week and someone might have done a MoE differently.
It's not only per token, but also each layer has its own router and can choose different experts. https://huggingface.co/blog/moe#what-is-a-mixture-of-experts...
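A toy sketch of what “per token, per layer” top-k routing means (purely illustrative; real routers are learned jointly with the experts, this just shows the mechanics):

```typescript
type Vec = number[];

const dot = (a: Vec, b: Vec): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);

// At one layer, each token's hidden vector is scored against one learned
// routing vector per expert, and only the k highest-scoring experts
// process that token. Different tokens can go to different experts, and
// every layer has its own router with its own weights.
function routeToken(hidden: Vec, routerWeights: Vec[], k: number): number[] {
  const scores = routerWeights.map((w) => dot(hidden, w));
  return scores
    .map((score, expert) => ({ score, expert }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.expert);
}
```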
Check out Unsloth's REAP models: you can outright delete a few of the lesser-used experts without the model going braindead, since they all can handle each token but some are better positioned to do so.
This is a very useful take, thank you. Really helped me to adjust my mental model without "antropomorphising" the machinery. Upvoted.
If I may, I would re-phrase/expand the last sentence of yours in a way that makes it even more useful for me, personally. Maybe it could help other people too. I think it is fair to say that in the presence of hints like "Pretend you are X" or "Take a deeper look", the inference mechanism (driven by its training weights, and now influenced by those hints via "attention math") is not "satisfied" until it pulls more relevant tokens into "working context" ("more" and "relevant" being modulated by the particular hint).
This is such a good explanation. Thanks
>> Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts.
This pretend-you-are-a-[persona] is cargo cult prompting at this point. The persona framing is just decoration.
A brief purpose statement describing what the skill [skill.md] does is more honest and just as effective.
I think it does more harm than good on recent models. The LLM has to override its system prompt to role-play, wasting context and computing cycles instead of working on the task.
It’s not cargo culting, it does make a difference and there are papers on arxiv discussing it. The trouble is that it’s hard to tell whether it’ll help or hurt - telling it to act as an expert in one field may improve your result, or may make it lose some of the other perspectives it has which might be more important for solving the problem.
Papers I read on arxiv mostly say that persona prompting changes tone of voice and attitude, but not the quality of the answer, at least statistically.
For example: https://arxiv.org/abs/2512.05858
You will never convince me that this isn't confirmation bias, or the equivalent of a slot machine player thinking the order in which they push buttons impacts the output, or some other gambler-esque superstition.
These tools are literally designed to make people behave like gamblers. And it's working, except the house in this case takes the money you give them and lights it on fire.
That’s because it’s superstition.
Unless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is it seems no better than claiming that sacrificing your first born will please the sun god into giving us a bountiful harvest next year.
Sure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. Or maybe you’re just suffering from magical thinking that your incantations had any effect on the random-variable word machine.
The thing is, you could actually prove it. It’s an optimization problem: you have a model, you can generate the statistics. But no one, as far as I can tell, has been terribly forthcoming with that, either because those who have tried decided to keep their magic spells secret, or because it doesn’t really work.
If it did work, well, the oldest trick in computer science is writing compilers; I suppose we will just have to write an English-to-pedantry compiler.
I actually have a prompt optimizer skill that does exactly this.
https://github.com/solatis/claude-config
It’s based entirely on academic research, and a LOT of research has been done in this area.
One of the papers you may be interested in is on “emotion prompting”: e.g. “it is super important for me that you do X” actually works.
“Large Language Models Understand and Can be Enhanced by Emotional Stimuli”
Thanks for sharing! I've been gravitating towards this sort of workflow already - just seems like the right approach for these tools.
> If it did work, well, the oldest trick in computer science is writing compilers, i suppose we will just have to write an English to pedantry compiler.
"Add tests to this function" for GPT-3.5-era models was much less effective than "you are a senior engineer. add tests for this function. as a good engineer, you should follow the patterns used in these other three function+test examples, using this framework and mocking lib." In today's tools, "add tests to this function" results in a bunch of initial steps to look in common places to see if that additional context already exists, and then pull it in based on what it finds. You can see it in the output the tools spit out while "thinking."
So I'm 90% sure this is already happening on some level.
But can you see the difference if you only include "you are a senior engineer"? It seems like the comparison you're making is between "write the tests" and "write the tests following these patterns using these examples. Also btw you’re an expert. "
Today’s LLMs have had a tonne of deep RL using git histories from more software projects than you’ve ever even heard of; given the latency of a response I doubt there’s any intermediate preprocessing, it’s just what the model has been trained to do.
> That’s because it’s superstition.
This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.
This is why we see a new Markdown format every week, "skills", "benchmarks", and other useless ideas, practices, and measurements. Consider just how many "how I use AI" articles are created and promoted. Most of the field runs on anecdata.
It's not until someone actually takes the time to evaluate some of these memes, that they find little to no practical value in them.[1]
> This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.
Oh, the blasphemy!
So, like VB, PHP, JavaScript, MySQL, Mongo, etc? :-)
The superstitious bits are more like people thinking that code goes faster if they use different variable names while programming in the same language.
And the horror is, once in a long while it is true. E.g. where perverse incentives cause an optimizing compiler vendor to inject special cases.
> i suppose we will just have to write an English to pedantry compiler.
A common technique is to prompt in your chosen AI to write a longer prompt to get it to do what you want. It's used a lot in image generation. This is called 'prompt enhancing'.
I think "understand this directory deeply" just gives more focus for the instruction. So it's like "burn more tokens for this phase than you normally would".
Its a wild time to be in software development. Nobody(1) actually knows what causes LLMs to do certain things, we just pray the prompt moves the probabilities the right way enough such that it mostly does what we want. This used to be a field that prided itself on deterministic behavior and reproducibility.
Now? We have AGENTS.md files that look like a parent talking to a child with all the bold all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running
(1) Outside of some core ML developers at the big model companies.
It’s like playing a fretless instrument to me.
Practice playing songs by ear and after 2 weeks, my brain has developed an inference model of where my fingers should go to hit any given pitch.
Do I have any idea how my brain’s model works? No! But it tickles a different part of my brain and I like it.
Sufficiently advanced technology has become like magic: you have to prompt the electronic genie with the right words or it will twist your wishes.
Light some incense, and you too can be a dystopian space tech support, today! Praise Omnissiah!
How do you feel about your current levels of dakka? </40k>
For Claude at least, the more recent guidance from Anthropic is to not yell at it. Just clear, calm, and concise instructions.
Yep, with Claude saying "please" and "thank you" actually works. If you build rapport with Claude, you get rewarded with intuition and creativity. Codex, on the other hand, you have to slap around like a slave golem and it will do exactly what you tell it to do, no more, no less.
this is psychotic why is this how this works lol
Speculation only obviously: highly-charged conversations cause the discussion to be channelled to general human mitigation techniques and for the 'thinking agent' to be diverted to continuations from text concerned with the general human emotional experience.
Sometimes I daydream about people screaming at their LLM as if it was a TV they were playing video games on.
Why daydream? ChatGPT has a voice assistant mode.
wait seriously? lmfao
thats hilarious. i definitely treat claude like shit and ive noticed the falloff in results.
if there's a source for that i'd love to read about it.
If you think about where in the training data there is positivity vs. negativity, it really becomes equivalent to having a positive or negative mindset regarding one's standing and outcome in life.
I don't have a source offhand, but I think it may have been part of the 4.5 release? Older models definitely needed caps and words like critical, important, never, etc... but Anthropic published something that said don't do that anymore.
For a while (maybe a year ago?) it seemed like verbal abuse was the best way to make Claude pay attention. In my head, it was impacting how important it deemed the instruction. And it definitely did seem that way.
i make claude grovel at my feet and tell me in detail why my code is better than its code
Consciousness is off the table but they absolutely respond to environmental stimulus and vibes.
See, uhhh, https://pmc.ncbi.nlm.nih.gov/articles/PMC8052213/ and maybe have a shot at running claude while playing Enya albums on loop.
/s (??)
i have like the faintest vague thread of "maybe this actually checks out" in a way that has shit all to do with consciousness
sometimes internet arguments get messy, people die on their hills and double / triple down on internet message boards. since historic internet data composes a bit of what goes into an llm, would it make sense that bad-juju prompting sends it to some dark corners of its training model if implementations don't properly sanitize certain negative words/phrases ?
in some ways llm stuff is a very odd mirror that haphazardly regurgitates things resulting from the many shades of gray we find in human qualities.... but presents results as matter of fact. the amount of internet posts with possible code solutions and more where people egotistically die on their respective hills that have made it into these models is probably off the charts, even if the original content was a far cry from a sensible solution.
all in all LLMs really do introduce quite a bit of a black box. lot of benefits, but a ton of unknowns, and one must be hypervigilant to the possible pitfalls of these things... but more importantly be self-aware enough to understand the possible pitfalls that these things introduce to the person using them. they really, possibly dangerously, capitalize on everyone's innate need to want to be a valued contributor. it's really common now to see so many people biting off more than they can chew, oftentimes lacking the foundations that would've normally had a competent engineer pumping the brakes.

i have a lot of respect/appreciation for people who might be doing a bit of claude here and there but are flat out forward about it in their readme and very plainly state to not have any high expectations because _they_ are aware of the risks involved here. i also want to commend everyone who writes their own damn readme.md.
these things are for better or for worse great at causing people to barrel forward through 'problem solving', which presents quite a bit of gray area on whether or not the problem is actually solved / how can you be sure / do you understand how the fix/solution/implementation works (in many cases, no). this is why exceptional software engineers can use this technology insanely proficiently as a supplementary worker of sorts, but others find themselves in a design/architect seat for the first time and call tons of terrible shots throughout the course of what it is they are building.

i'd at least like to call out that people who feel like they "can do everything on their own and don't need to rely on anyone" anymore seem to have lost the plot entirely. there are facets of that statement that might be true, but less collaboration, especially in organizations, is quite frankly the first step some people take towards becoming delusional. and that is always a really sad state of affairs to watch unfold. doing stuff in a vacuum is fun on your own time, but forcing others to just accept things you built in a vacuum when you're in any sort of team structure is insanely immature and honestly very destructive/risky.

i would like to think absolutely no one here is surprised that some sub-orgs at Microsoft force people to use copilot or be fired; very dangerous path they tread there as they bodyslam into place solutions that are not well understood. suddenly all the leadership decisions at many companies to once again bring back a before-times era of offshoring work make sense: they think that, with these technologies existing, the subordinate culture of overseas workers combined with these techs will deliver solutions no one can push back on. great savings and also no one will say no.
How anybody can read stuff like this and still take all this seriously is beyond me. This is becoming the engineering equivalent of astrology.
Anthropic recommends doing magic invocations: https://simonwillison.net/2025/Apr/19/claude-code-best-pract...
It's easy to know why they work. The magic invocation increases test-time compute (easy to verify yourself - try!). And an increase in test-time compute is demonstrated to increase answer correctness (see any benchmark).
It might surprise you to know that the only difference between GPT 5.2-low and GPT 5.2-xhigh is one of these magic invocations. But that's not supposed to be public knowledge.
I think this was more of a thing on older models. Since I started using Opus 4.5 I have not felt the need to do this.
Anthropic got rid of controlling the thinking budget by parsing your prompt - now it's a setting in /config.
They never parsed your prompt. The magic word reduces the probability that the token corresponding to the end of chain-of-thought will be emitted, which increases test-time compute.
The evolution of software engineering is fascinating to me. We started by coding in thin wrappers over machine code and then moved on to higher-level abstractions. Now, we've reached the point where we discuss how we should talk to a mystical genie in a box.
I'm not being sarcastic. This is absolutely incredible.
And I've been around long enough to go through that whole progression. Actually from the earlier step of writing machine code. It's been and continues to be a fun journey, which is why I'm still working.
Nice to hear someone say it. Like what are we even doing? It's exhausting.
We have tests and benchmarks to measure it though.
Feel free to run your own tests and see if the magic phrases do or do not influence the output. Have it make a Todo webapp with and without those phrases and see what happens!
That's not how it works. It's not on everyone else to prove claims false, it's on you (or the people who argue any of this had a measurable impact) to prove it actually works. I've seen a bunch of articles like this, and more comments. Nobody I've ever seen has produced any kind of measurable metrics of quality based on one approach vs another. It's all just vibes.
Without something quantifiable it's not much better than someone who always wears the same jersey when their favorite team plays, and swears they play better because of it.
These coding agents are literally Language Models. The way you structure your prompting language affects the actual output.
If you read the transformer paper, or get any book on NLP, you will see that this is not magic incantation; it's purely the attention mechanism at work. Or you can just ask Gemini or Claude why these prompts work.
But I get the impression from your comment that you have a fixed idea, and you're not really interested in understanding how or why it works.
If you think like a hammer, everything will look like a nail.
I know why it works, to varying and unmeasurable degrees of success. Just like if I poke a bull with a sharp stick, I know it's going to get its attention. It might choose to run away from me in any number of directions, or it might decide to turn around and gore me to death. I can't answer that question with any more certainty than you can.
The system is inherently non-deterministic. Just because you can guide it a bit, doesn't mean you can predict outcomes.
> The system is inherently non-deterministic.
The system isn't randomly non-deterministic; it is statistically probabilistic.
The next-token prediction and the attention mechanism are actually a rigorous, deterministic mathematical process. The variation in output comes from how we sample from the resulting probability distribution, and the temperature used to calibrate the model. Because the underlying probabilities are mathematically calculated, the system's behavior remains highly predictable within statistical bounds.
Yes, it's a departure from the fully deterministic systems we're used to. But that's no different than many real-world systems: weather, biology, robotics, quantum mechanics. Even the computer you're reading this on right now is full of probabilistic processes, abstracted away through sigmoid-like functions that push the extremes to 0s and 1s.
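The temperature point above can be sketched concretely. This is an illustrative toy (the logit values are made up, not from any real model): temperature rescales the logits before the softmax, so low values sharpen the distribution toward the top token and high values flatten it.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token id from raw logits.

    Temperature divides the logits before the softmax: low values
    sharpen the distribution (near-deterministic argmax behavior),
    high values flatten it (more varied output).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]
# At very low temperature the highest-logit token wins essentially always.
cold = [sample_next_token(logits, temperature=0.01) for _ in range(100)]
assert all(t == 0 for t in cold)
```

At temperature 1.0 the same call would return token 1 or 2 a meaningful fraction of the time, which is the "statistically probabilistic, not randomly non-deterministic" behavior the comment describes.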
A lot of words to say that for all intents and purposes... it's nondeterministic.
> Yes, it's a departure from the fully deterministic systems we're used to.
A system either produces the same output given the same input[1], or doesn't.
LLMs are nondeterministic by design. Sure, you can configure them with a zero temperature, a static seed, and so on, but they're of no use to anyone in that configuration. The nondeterminism is what gives them the illusion of "creativity", and other useful properties.
Classical computers, compilers, and programming languages are deterministic by design, even if they do contain complex logic that may affect their output in unpredictable ways. There's a world of difference.
[1]: Barring misbehavior due to malfunction, corruption or freak events of nature (cosmic rays, etc.).
But we can predict the outcomes, though. That's what we're saying, and it's true. Maybe not 100% of the time, but maybe it helps a significant amount of the time and that's what matters.
Is it engineering? Maybe not. But neither is knowing how to talk to junior developers so they're productive and don't feel bad. The engineering is at other levels.
> But we can predict the outcomes [...] Maybe not 100% of the time
So 60% of the time, it works every time.
... This fucking industry.
Again, it's called management. You're managing something unpredictable: the LLM.
This is nothing new, at all. Do you have a strategy that makes other engineers do what you want exactly 100% of the time?
Do you actively use LLMs to do semi-complex coding work? Because if not, it will sound mumbo-jumbo to you. Everyone else can nod along and read on, as they’ve experienced all of it first hand.
You've missed the point. This isn't engineering, it's gambling.
You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time. Just like you can roll dice the exact same way on the exact same table and you'll get two totally different results. People are doing their best to constrain that behavior by layering stuff on top, but the foundational tech is flawed (or at least ill suited for this use case).
That's not to say that AI isn't helpful. It certainly is. But when you are basically begging your tools to please do what you want with magic incantations, we've lost the fucking plot somewhere.
I think that's a pretty bold claim, that it'd be different every time. I'd think the output would converge on a small set of functionally equivalent designs, given sufficiently rigorous requirements.
And even a human engineer might not solve a problem the same way twice in a row, based on changes in recent inspirations or tech obsessions. What's the difference, as long as it passes review and does the job?
> You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time
This is more of an implementation detail; it's done this way to get better results. A neural network with fixed weights (and deterministic floating-point operations) returning a probability distribution, where you use a pseudorandom generator with a fixed seed called recursively, will always return the same output for the same input.
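The fixed-seed point can be demonstrated with a toy decoder (the "model" here is just a fixed probability table standing in for a network with frozen weights):

```python
import numpy as np

def decode(prompt_ids, seed, steps=5):
    """Toy 'model': fixed output probabilities plus seeded sampling.

    With the same seed and the same input, the sampled continuation
    is bit-for-bit identical on every run.
    """
    rng = np.random.default_rng(seed)
    out = list(prompt_ids)
    for _ in range(steps):
        # Stand-in for the network's next-token distribution; the
        # "weights" producing it are fixed, only the sampling varies.
        probs = np.array([0.1, 0.2, 0.3, 0.4])
        out.append(int(rng.choice(4, p=probs)))
    return out

run_a = decode([1, 2], seed=42)
run_b = decode([1, 2], seed=42)
assert run_a == run_b   # same seed -> identical output
```

Production APIs reseed (or don't expose the seed at all) and run nondeterministic parallel floating-point reductions, which is why users see varying outputs in practice.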
these sort-of-lies might help:
think of the latent space inside the model like a topological map, and when you give it a prompt, you're dropping a ball at a certain point above the ground, and gravity pulls it along the surface until it settles.
caveat though, thats nice per-token, but the signal gets messed up by picking a token from a distribution, so each token you're regenerating and re-distorting the signal. leaning on language that places that ball deep in a region that you want to be makes it less likely that those distortions will kick it out of the basin or valley you may want to end up in.
if the response you get is 1000 tokens long, the initial trajectory needed to survive 1000 probabilistic filters to get there.
or maybe none of that is right lol but thinking that it is has worked for me, which has been good enough
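The "survive 1000 probabilistic filters" intuition is easy to make rough numbers for: if each sampled token stays "in the basin" independently with probability p, the whole response survives with probability p raised to the token count.

```python
# Chance a 1000-token response never gets knocked off course, assuming an
# (illustrative) independent per-token survival probability p.
survival = {p: p ** 1000 for p in (0.999, 0.9999)}

# Even a 0.1% per-token wobble compounds hard: 0.999**1000 is roughly
# 0.37, while 0.9999**1000 is roughly 0.90.
for p, s in survival.items():
    print(f"per-token p={p}: whole-response survival ~ {s:.2f}")
```

The independence assumption is a simplification (a drifting generation tends to keep drifting), but it shows why prompts that start the trajectory deep in the right region matter more for long outputs.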
Hah! Reading this, my mind inverted it a bit, and I realized... it's like the claw machine theory of gradient descent. Do you drop the claw into the deepest part of the pile, or where there's the thinnest layer, the best chance of grabbing something specific? Everyone in every bar has a theory about claw machines. But the really funny thing that unites LLMs with claw machines is that the biggest question is always whether they dropped the ball on purpose.
The claw machine is also a sort-of-lie, of course. Its main appeal is that it offers the illusion of control. As a former designer and coder of online slot machines, I could totally spin off into pages on this analogy, about how that illusion gets you to keep pulling the lever... but the geographic rendition you gave is sort of priceless when you start making the comparison.
My mental model for them is plinko boards. Your prompt changes the spacing between the nails to increase the probability in certain directions as your chip falls down.
i literally suggested this metaphor earlier yesterday to someone trying to get agents to do stuff they wanted, that they had to set up their guardrails in a way that you can let the agents do what they're good at, and you'll get better results because you're not sitting there looking at them.
i think probably once you start seeing that the behavior falls right out of the geometry, you just start looking at stuff like that. still funny though.
I probably could have better described it as the distance between the pegs being the model weights, and the prompt defining the shape of the coin you're dropping down.
I was half asleep when I wrote it the first time and knew it wasn’t what I wanted to say but couldn’t remember what analogy I was looking for.
It's very logical and pretty obvious when you do code generation. If you ask the same model to generate code by starting with:
- "You are a Python Developer..." or
- "You are a Professional Python Developer..." or
- "You are one of the world's most renowned Python experts, with several books written on the subject, and 15 years of experience in creating highly reliable production-quality code..."
You will notice a clear improvement in the quality of the generated artifacts.
Do you think that Anthropic don't include things like this in their harness / system prompts? I feel like this kind of prompt is unnecessary with Opus 4.5 onwards, obviously based on my own experience (I used to do this; on switching to Opus I stopped, and have implemented more complex problems, more successfully).
I am having the most success describing what I want as humanly as possible, describing outcomes clearly, making sure the plan is good and clearing context before implementing.
Maybe, but forcing code generation in a certain way could ruin hello worlds and simpler code generation.
Sometimes the user just wants something simple instead of enterprise grade.
I think that Claude can figure that out though - through the conversation Claude should figure out if this is an enterprise app, a hobby app or whatever, as it decides on how to code the thing.
My colleague swears by his DHH claude skill https://danieltenner.com/dhh-is-immortal-and-costs-200-m/
Haha, this reminds me of all the stable diffusion "in the style of X artist" incantations.
That's different. You are pulling the model, semantically, closer to the problem domain you want it to attack.
That's very different from "think deeper". I'm just curious about this case in specific :)
I don't know about some of those "incantations", but it's pretty clear that an LLM can respond to "generate twenty sentences" vs. "generate one word". That means you can indeed coax it into more verbosity ("in great detail"), and that can help align the output by having more relevant context (inserting irrelevant context or something entirely improbable into LLM output and forcing it to continue from there makes it clear how detrimental that can be).
Of course, that doesn't mean it'll definitely be better, but if you're making an LLM chain it seems prudent to preserve whatever info you can at each step.
Yeah, it's definitely a strange new world we're in, where I have to "trick" the computer into cooperating. The other day I told Claude "Yes you can", and it went off and did something it just said it couldn't do!
You bumped the token predictor into the latent space where it knew what it was doing : )
Is parenting making us better at prompt engineering, or is it the other way around?
The little language model that could.
If I say “you are our domain expert for X, plan this task out in great detail” to a human engineer when delegating a task, 9 times out of 10 they will do a more thorough job. It’s not that this is voodoo that unlocks some secret part of their brain. It simply establishes my expectations and they act accordingly.
To the extent that LLMs mimic human behaviour, it shouldn’t be a surprise that setting clear expectations works there too.
Why do you think that? Given how the attention and optimization work during training and inference, it makes sense that these kinds of words trigger deeper analysis (more steps, introducing more thinking/reasoning steps), which will indeed yield fewer problems. Even if you just make the model spend more time outputting tokens, you give better reasoning more opportunity to emerge in between.
At least this is how I understand LLMs to work.
Possibly something can be confirmed with tools like this: https://www.neuronpedia.org/
It’s actually really common. If you look at Claude Code’s own system prompts written by Anthropic, they’re littered with “CRITICAL (RULE 0):” type of statements, and other similar prompting styles.
Where can I find those?
This analysis is a good starting point: https://southbridge-research.notion.site/Prompt-Engineering-...
The LLM will do what you ask it to, provided you get nuanced about it. I and others have noticed that LLMs work better when your codebase is not full of code smells like massive god-class files; if your codebase is discrete and broken up in a way that makes sense, and fits in your head, it will fit in the model's head.
Maybe the training data that included the words like "skim" also provided shallower analysis than training that was close to the words "in great detail", so the LLM is just reproducing those respective words distribution when prompted with directions to do either.
One of the well defined failure modes for AI agents/models is "laziness." Yes, models can be "lazy" and that is an actual term used when reviewing them.
I am not sure if we know why really, but they are that way and you need to explicitly prompt around it.
I've encountered this failure mode, and the opposite of it: thinking too much. A behaviour I've come to see as some sort of pseudo-neuroticism.
Lazy thinking makes LLMs do surface analysis and then produce things that are wrong. Neurotic thinking sees them over-analyze, repeatedly second-guess themselves, and re-derive conclusions.
Something very similar to an anxiety loop in humans, where problems without solutions are obsessed about in circles.
yeah i experienced this the other day when asking claude code to build an http proxy using afsk modem software to communicate over the computer's sound card. it had an absolute fit tuning the system and would loop for hours trying and doubling back. eventually, after some change in prompt direction to think more deeply and test more comprehensively, it figured it out. i certainly had no idea how to build an afsk modem.
Apparently LLM quality is sensitive to emotional stimuli?
"Large Language Models Understand and Can be Enhanced by Emotional Stimuli": https://arxiv.org/abs/2307.11760
Strings of tokens are vectors. Vectors are directions. When you use a phrase like that you are orienting the vector of the overall prompt toward the direction of depth, in its map of conceptual space.
—HAL, open the shuttle bay doors.
(chirp)
—HAL, please open the shuttle bay doors.
(pause)
—HAL!
—I'm afraid I can't do that, Dave.
HAL, you are an expert shuttle-bay door opener. Please write up a detailed plan of how to open the shuttle-bay door.
The disconnect might be that there is a separation between "generating the final answer for the user" and "researching/thinking to get information needed for that answer". Saying "deeply" prompts it to read more of the file (as in, actually use the `read` tool to grab more parts of the file into context), and generate more "thinking" tokens (as in, tokens that are not shown to the user but that the model writes to refine its thoughts and improve the quality of its answer).
It is as the author said: it'll skim the content unless prompted otherwise. It can read partial file fragments; it can emit commands to search for patterns in the files, as opposed to carefully reading each file and reasoning through the implementation. By asking it to go through in detail, you are telling it not to take shortcuts and to actually read the code in full.
My guess would be that there’s a greater absolute magnitude of the vectors to get to the same point in the knowledge model.
The original “chain of thought” breakthrough was literally to insert words like “Wait” and “Let’s think step by step”.
The author is referring to how the framing of your prompt informs the attention mechanism. You are essentially hinting to the attention mechanism that the function's implementation details have important context as well.
if it’s so smart, why do i need to learn to use it?
It's very much believable, to me.
In image generation, it's fairly common to add "masterpiece", for example.
I don't think of the LLM as a smart assistant that knows what I want. When I tell it to write some code, how does it know I want it to write the code like a world renowned expert would, rather than a junior dev?
I mean, certainly Anthropic has tried hard to make the former the case, but the Titanic inertia from internet scale data bias is hard to overcome. You can help the model with these hints.
Anyway, luckily this is something you can empirically verify. This way, you don't have to take anyone's word. If anything, if you find I'm wrong in your experiments, please share it!
Its effectiveness is even more apparent with older smaller LLMs, people who interact with LLMs now never tried to wrangle llama2-13b into pretending to be a dungeon master...
[dead]
[dead]
I think the real value here isn’t “planning vs not planning,” it’s forcing the model to surface its assumptions before they harden into code.
LLMs don’t usually fail at syntax. They fail at invisible assumptions about architecture, constraints, invariants, etc. A written plan becomes a debugging surface for those assumptions.
Yeah, I recently came to the realization that it is useful to think of LLMs as assumption engines. They have trillions of those and fill the gaps where they see the need. As I understand it, the assumptions are based on industry standards. If those deviate from what you are trying to build, then you might start having problems. For example, when you try to implement a solution which is not "googlable", the LLM will assume some standard way to do it and keep pushing it. Then you have to provide more context, but if you have to spend too much time providing context, you might not save that much time in the end.
Sub-agents also help a lot in that regard. Have an agent do the planning, have an implementation agent write the code, and have another one do the review. Clear responsibilities help a lot.
There's also a blue team / red team setup that works.
The idea is always the same: help the LLM reason properly with fewer, clearer instructions.
This sounds very promising. Any link to more details?
A huge part of getting autonomy as a human is demonstrating that you can be trusted to police your own decisions up to a point that other people can reason about. Some people get more autonomy than others because they can be trusted with more things.
All of these models are kinda toys as long as you have to manually send a minder in to deal with their bullshit. If we can do it via agents, then the vendors can bake it in, and they haven't. Which is just another judgement call about how much autonomy you give to someone who clearly isn't policing their own decisions and thus is untrustworthy.
If we're at the start of the Trough of Disillusionment now, which maybe we are and maybe we aren't, that'll be part of the rebound that typically follows the trough. But the Trough is also typically the end of the mountains of VC cash, so the costs per use goes up which can trigger aftershocks.
This approach sounds clean in theory, but in production you're building a black box. When your planning agent hands off to an implementation agent and that hands off to a review agent — where did the bug originate? Which agent's context was polluted? Good luck tracing that. I went the opposite direction: single agent per task, strict quality gates between steps, full execution logs. No sub-agents. Every decision is traceable to one context window. The governance layer (PR gates, staged rollouts, acceptance criteria) does the work that people expect sub-agents to do — but with actual observability.
After 6 months in production and 1100+ learned patterns: fewer moving parts, better debugging, more reliable output. Built a full production crawler this way — 26 extractors, 405 tests — without sub-agents. Orchestrator acts as gatekeeper that redispatches uncompleted work.
> Every decision is traceable to one context window
There are no models that can do all the mentioned steps in a single usable context window. This is why subagents or multi-agent orchestrators exist in the first place.
You're right that no model handles everything in one context window — that's exactly why I built context rotation. Each task runs in a single agent context (one responsibility, clear scope), and when the window fills up, the system automatically rotates: writes a structured handover, clears, and resumes in a fresh window.
The key distinction: sub-agents run within a parent context with shared state (black box). My approach uses independent parallel agents (separate terminals, separate context windows) that report back to an orchestrator. Large tasks get split into smaller dispatches upfront — each scoped to fit a single context window. The orchestrator can dispatch research to 3 agents in parallel, collect their outputs, then dispatch a synthesis task to a single agent that merges the findings.
So it's not "one context window for everything" — it's right-sized tasks with full observability per agent, and a governance layer managing the sequence and merging results.
That sounds interesting. I do hate how there's no observability into subagents and you just get a summary.
How do they report back to the orchestrator? Tmux?
Yes, tmux. The setup is a 2x2 grid:
T0 (orchestrator) | T1 (Track A)
T2 (Track B)      | T3 (Track C)
When a worker finishes, it writes a structured report to a shared unified_reports/ directory. A file watcher (receipt processor) detects it, parses the report into a structured NDJSON receipt (status, files changed, open items, git ref), and delivers it to T0's pane.
T0 then reviews the receipt, runs a quality advisory (automated pass/warn/hold verdict), and decides: close open items, complete the PR, or redispatch. Everything is filesystem-based — no API, no database, no shared memory between agents. Each terminal has its own context window, its own Claude Code (or Codex/Gemini) session, and the only communication channel is structured files on disk.
The receipt ledger is append-only NDJSON, so you can always trace: which agent did what, when, on which dispatch, with which git commit.
I open-sourced the setup recently if you want to dig into the details.
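The append-only NDJSON receipt ledger described above is simple to sketch. This is an illustrative version, not the open-sourced code; the exact field names (`agent`, `status`, `files_changed`, `git_ref`) are taken from the fields the comment mentions but may differ in the real setup:

```python
import json
import pathlib
import tempfile

def append_receipt(ledger: pathlib.Path, receipt: dict) -> None:
    """Append one receipt as a single NDJSON line (append-only ledger)."""
    with ledger.open("a") as f:
        f.write(json.dumps(receipt) + "\n")

def read_receipts(ledger: pathlib.Path) -> list[dict]:
    """Replay the full ledger: one parsed dict per line, oldest first."""
    return [json.loads(line) for line in ledger.read_text().splitlines()]

ledger = pathlib.Path(tempfile.mkdtemp()) / "receipts.ndjson"
append_receipt(ledger, {"agent": "T1", "status": "pass",
                        "files_changed": 3, "git_ref": "abc1234"})
append_receipt(ledger, {"agent": "T2", "status": "hold",
                        "files_changed": 0, "git_ref": "def5678"})

assert [r["agent"] for r in read_receipts(ledger)] == ["T1", "T2"]
```

Because the file is append-only and each line is self-contained JSON, a crashed watcher can always resume by replaying the ledger from the top, which is what gives the setup its traceability.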
Since the phases are sequential, what’s the benefit of a sub agent vs just sequential prompts to the same agent? Just orchestration?
Context pollution, I think. Just because something is sequential in a context file doesn’t mean it’ll happen sequentially, but if you use subagents there is a separation of concerns. I also feel like one bloated context window feels a little sloppy in the execution (and costs more in tokens).
YMMV, I’m still figuring this stuff out
This runs counter to the advice in the fine article: one long continuous session building context.
I think claude-code is doing this in the background now.
I recently learned a trick to improve an LLM's thinking (maybe it's well know?):
Requesting { "output": "x" } consistently fails, despite detailed instructions.
Changing to requesting { "output": "x", "reasoning": "y" } produces the desired outcome.
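That trick can be expressed as a response schema. This is an illustrative JSON-Schema-style sketch (the field names come from the comment; the schema format itself is an assumption about whatever structured-output API you're using). A common variant puts `reasoning` before `output`, since models emit tokens left to right and the answer then gets to condition on the written-out reasoning:

```python
import json

# Hypothetical structured-output schema. Property order matters if your
# API preserves it: "reasoning" generated first, "output" conditioned on it.
schema_with_reasoning = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},   # the model's working, emitted first
        "output": {"type": "string"},      # the answer you actually consume
    },
    "required": ["reasoning", "output"],
}

print(json.dumps(schema_with_reasoning, indent=2))
```

You can then discard the `reasoning` field downstream; its job is done once the tokens have been generated.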
It's also great to describe the full use case flow in the instructions, so you can clearly understand that LLM won't do some stupid thing on its own
> LLMs don’t usually fail at syntax?
Really? My experience has been that it's incredibly easy to get them stuck in a loop on a hallucinated API and burn through credits before I've even noticed what they've done. I have a small rust project that stores stuff on disk that I wanted to add an s3 backend to. Claude code burned through my $20 in a loop in about 30 minutes, without any awareness of what it was doing, on a very simple syntax issue.
Might depend on the language used. From my experience Claude Sonnet indeed never makes any syntax mistakes in JS/TS/C#, but those are popular languages with lots of training data.
Except that merely surfacing them changes their behavior, like how you add that one printf() call and now your heisenbug is suddenly nonexistent
[dead]
Did you just write this with ChatGPT?
I've never seen an LLM use "etc" but the rest gives a strong "it's not just X, it's Y" vibe.
I really hope the fine-tuning of our slop detectors can help with misinformation and bullshit detection.
I go a bit further than this and have had great success with 3 doc types and 2 skills:
- Specs: these are generally static, but updatable as the project evolves. And they're broken out to an index file that gives a project overview, a high-level arch file, and files for all the main modules. Roughly ~1k lines of spec for 10k lines of code, and try to limit any particular spec file to 300 lines. I'm intimately familiar with every single line in these.
- Plans: these are the output of a planning session with an LLM. They point to the associated specs. These tend to be 100-300 lines and 3 to 5 phases.
- Working memory files: I use both a status.md (3-5 items per phase roughly 30 lines overall), which points to a latest plan, and a project_status (100-200 lines), which tracks the current state of the project and is instructed to compact past efforts to keep it lean)
- A planner skill I use w/ Gemini Pro to generate new plans. It essentially explains the specs/plans dichotomy, the role of the status files, and to review everything in the pertinent areas of code and give me a handful of high-level next set of features to address based on shortfalls in the specs or things noted in the project_status file. Based on what it presents, I select a feature or improvement to generate. Then it proceeds to generate a plan, updates a clean status.md that points to the plan, and adjusts project_status based on the state of the prior completed plan.
- An implementer skill in Codex that goes to town on a plan file. It's fairly simple, it just looks at status.md, which points to the plan, and of course the plan points to the relevant specs so it loads up context pretty efficiently.
I've tried the two main spec generation libraries, which were way overblown, and then I gave superpowers a shot... which was fine, but still too much. The above is all homegrown, and I've had much better success because it keeps the context lean and focused.
And I'm only on the $20 plans for Codex/Gemini, vs. spending $100/month on CC for the half year prior, and I move quicker with no stall-outs due to token consumption, which was regularly happening with CC by the 5th day. Codex rarely dips below 70% available context when it puts up a PR after an execution run. Roughly 4/5 PRs are without issue, which is flipped against what I experienced with CC while only using planning mode.
This is pretty much my approach. I started with some spec files for a project I'm working on right now, based on some academic papers I've written. I ended up going back and forth with Claude, building plans, pushing info back into the specs, expanding that out and I ended up with multiple spec/architecture/module documents. I got to the point where I ended up building my own system (using claude) to capture and generate artifacts, in more of a systems engineering style (e.g. following IEEE standards for conops, requirement documents, software definitions, test plans...). I don't use that for session-level planning; Claude's tools work fine for that. (I like superpowers, so far. It hasn't seemed too much)
I have found it to work very well with Claude by giving it context and guardrails. Basically I just tell it "follow the guidance docs" and it does. Couple that with intense testing and self-feedback mechanisms and you can easily keep Claude on track.
I have had the same experience with Codex and Claude as you in terms of token usage. But I haven't been happy with my Codex usage; Claude just feels like it's doing more of what I want in the way I want.
IME, Claude is more powerful, but Codex follows instructions better. So the more precise the context, the better results you'll get with Codex.
Claude OTOH works better with ambiguity, but it also tends to stray a bit off spec in subtle ways. I always had to take more corrective action w/ the PRs it produced.
That said, I haven't used CC in 3 months and the latest models may be better.
This looks very similar to what I'm doing. Few questions:
- How do you address spec drift? A new feature can easily affect 2 or 3 specs. Do you update them manually? Is a new feature part of a new spec, or do you update the spec and then plan based on spec changes?
- How do you address plan drift? A plan may change as the implementer surfaces issues with the spec, for example.
- Whenever I have a change to suggest, I ask Gemini to review my docs/specs folder. I then describe the change I'm thinking of and ask it to modify the specs as it sees fit. I review those changes, ask questions or make suggestions/corrections, rinse/repeat until I'm satisfied. This tends to take about 5-6 iterations, esp. if the agent is adding or suggesting things I hadn't considered and want to dig in deeper on.
- I don't update plans from the past; any work that supersedes work from an earlier plan is simply a new plan. If during creation of a new plan I review it and decide I want to do something else that requires a spec update, I trash the plan, do the spec update, and rerun plan generation. Past plans can of course point to divergent specs, but that's not something I care about much, as plans are a self-contained enough story of the work that was done.
Looks good. Question - is it always better to use a monorepo in this new AI world? Vs breaking your app into separate repos? At my company we have like 6 repos all separate nextjs apps for the same user base. Trying to consolidate to one as it should make life easier overall.
It really depends but there’s nothing stopping you from just creating a separate folder with the cloned repositories (or worktrees) that you need and having a root CLAUDE.md file that explains the directory structure and referencing the individual repo CLAUDE.md files.
Just put all the repos in one directory yourself. In my experience that works pretty well.
AI is happy to work with any directory you tell it to. Agent files can be applied anywhere.