Claude Sonnet 4.5 is the best coding model in the world. It's the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math.
Code is everywhere. It runs every application, spreadsheet, and software tool you use. Being able to use those tools and reason through hard problems is how modern work gets done.
Claude Sonnet 4.5 makes this possible. We're releasing it along with a set of major upgrades to our products. In Claude Code, we've added checkpoints—one of our most requested features—that save your progress and allow you to roll back instantly to a previous state. We've refreshed the terminal interface and shipped a native VS Code extension. We've added a new context editing feature and memory tool to the Claude API that lets agents run even longer and handle even greater complexity. In the Claude apps, we've brought code execution and file creation (spreadsheets, slides, and documents) directly into the conversation. And we've made the Claude for Chrome extension available to Max users who joined the waitlist last month.
We're also giving developers the building blocks we use ourselves to make Claude Code. We're calling this the Claude Agent SDK. The infrastructure that powers our frontier products—and allows them to reach their full potential—is now yours to build with.
This is the most aligned frontier model we’ve ever released, showing large improvements across several areas of alignment compared to previous Claude models.
Claude Sonnet 4.5 is available everywhere today. If you’re a developer, simply use claude-sonnet-4-5 via the Claude API. Pricing remains the same as Claude Sonnet 4, at $3/$15 per million tokens.
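As a rough illustration of what the quoted pricing means in practice, here is a minimal sketch that estimates the cost of a single request. It assumes that "$3/$15 per million tokens" means $3 per million input tokens and $15 per million output tokens; the function name and the example token counts are hypothetical, and the sketch ignores any discounts or surcharges that may apply.

```python
# Assumption: $3 per million input tokens, $15 per million output tokens.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the list-price cost of one request in US dollars."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical request: 200k tokens of context in, 8k tokens out.
print(estimate_cost_usd(200_000, 8_000))
```

Output tokens cost five times as much as input tokens under this reading, so long generated responses can dominate the bill even when the prompt is much larger.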
Claude Sonnet 4.5 is state-of-the-art on the SWE-bench Verified evaluation, which measures real-world software coding abilities. Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.

Claude Sonnet 4.5 is being released under our AI Safety Level 3 (ASL-3) protections, as per our framework that matches model capabilities with appropriate safeguards. These safeguards include filters called classifiers that aim to detect potentially dangerous inputs and outputs—in particular those related to chemical, biological, radiological, and nuclear (CBRN) weapons.
These classifiers might sometimes inadvertently flag normal content. We’ve made it easy for users to continue any interrupted conversations with Sonnet 4, a model that poses a lower CBRN risk. We've already made significant progress on these false positives, reducing them by a factor of ten since we originally described them, and by a factor of two since Claude Opus 4 was released in May. We’re continuing to make the classifiers more discerning.
We've spent more than six months shipping updates to Claude Code, so we know what it takes to build and design AI agents. We've solved hard problems: how agents should manage memory across long-running tasks, how to handle permission systems that balance autonomy with user control, and how to coordinate subagents working toward a shared goal.
Now we’re making all of this available to you. The Claude Agent SDK is the same infrastructure that powers Claude Code, but it shows impressive benefits for a very wide variety of tasks, not just coding. As of today, you can use it to build your own agents.
We built Claude Code because the tool we wanted didn’t exist yet. The Agent SDK gives you the same foundation to build something just as capable for whatever problem you're solving.
We’re releasing a temporary research preview alongside Claude Sonnet 4.5, called "Imagine with Claude".
In this experiment, Claude generates software on the fly. No functionality is predetermined; no code is prewritten. What you see is Claude creating in real time, responding and adapting to your requests as you interact.
It's a fun demonstration showing what Claude Sonnet 4.5 can do—a way to see what's possible when you combine a capable model with the right infrastructure.
"Imagine with Claude" is available to Max subscribers for the next five days. We encourage you to try it out on claude.ai/imagine.
We recommend upgrading to Claude Sonnet 4.5 for all uses. Whether you’re using Claude through our apps, our API, or Claude Code, Sonnet 4.5 is a drop-in replacement that provides much improved performance for the same price. Claude Code updates are available to all users. Claude Developer Platform updates, including the Claude Agent SDK, are available to all developers. Code execution and file creation are available on all paid plans in the Claude apps.
For complete technical details and evaluation results, see our system card, model page, and documentation. For more information, explore our engineering posts and research post on cybersecurity.
I had access to a preview over the weekend; I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/
It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of benchmarks out there that attempt to be more methodical than vibes).
It particularly shines when you try it on https://claude.ai/ using its brand new Python/Node.js code interpreter mode. Try this prompt and see what happens:
Checkout https://github.com/simonw/llm and run the tests with
pip install -e '.[test]'
pytest
I then had it iterate on a pretty complex database refactoring task, described in my post.
Kinda pointless listening to the opinions of people who've used previews, because it's not gonna be the same model you'll experience once it gets downgraded to be viable under mass use. The benchmarks influencers use are all in the training data now and tested internally, so any sort of testing like pelicans on bikes is just PR at this point.
Yeah I remember these GPT-5 demos from influencers like "it practically created a whole 3D modeller for me" and then once we got the real thing it sometimes looked like a dumbed down version of the previous iteration.
I learned that lesson from GPT-5, where the preview was weeks long and the models kept changing during that period.
This Claude preview lasted from Friday to Monday so I was less worried about major model changes. I made sure to run the pelican benchmark against the model after 10am on Monday (the official release date) just to be safe.
The only thing I published that I ran against the preview model was the Claude code interpreter example.
I continue not to worry about models having been trained to ace my pelican benchmark, because the models still suck at it. You really think Anthropic deliberately cheated on my benchmark and still only managed to produce this? https://static.simonwillison.net/static/2025/claude-sonnet-4...
Testing this, it's way more aggressive on throttling back than the previous model, and on message token lengths. It constantly stops in the middle of an action if it's not a simple request. I presume you did not have resource limitations during the preview?
No, the preview was effectively unlimited usage (for two days).
Well, if they produced a really really really good image for pelicans on bicycles and nothing else, then their cheating would be obvious, so it makes sense to cheat just a little bit, across the board (if we want to assume they're cheating).
Yesterday someone posted an example of the same prompt but changing it to a human, and it was basically trash; the example you've posted actually looks good, all things considered. So yeah, I do think it's something they train on, the same way they train on things in the benchmarks.
The easy way to tell is to try it yourself - run "Generate an SVG of a pelican riding a bicycle" and then try "Generate an SVG of an otter riding a skateboard" and see if the quality of the images seems similar.
How about a narwhal spacewalking from the ISS, with Earth visible below (specifically the Niger delta)?
https://claude.ai/public/artifacts/f3860a8a-2c7d-404f-978b-e...
Requesting an ‘extravagantly detailed’ version is quite impressive in the effort, if not quite the execution:
https://claude.ai/public/artifacts/f969805a-2635-4e30-8278-4...
Huh. Doesn't work over here.
---
SYSTEM:
Got it. Those are the setup commands for Simon Willison's LLM CLI tool, which lets you interact with language models from the command line.
The -e flag installs it in editable mode, and [test] pulls in the test dependencies. Are you running into any issues with the installation or tests? What are you working on with the LLM library?
USER: Correct. You run them using your code tooling.
SYSTEM:
Realized inability to execute requested Python commands.
The user is telling me to run these commands using my code tooling. They want me to actually execute this. Let me think about what tools I have available:
I have the repl tool which runs JavaScript code in a browser environment
I have artifacts for creating code/documents
I have web_search and web_fetch
The user wants me to:
Checkout https://github.com/simonw/llm
Run pip install -e '.[test]'
Run pytest
But I don't have a tool that can:
Clone git repositories
Run shell commands like pip or pytest
Execute Python code in a command-line environment
The repl tool is JavaScript-only and runs in a browser environment, not a server environment where I could run shell commands. I should clarify that I don't have the capability to run shell commands, install Python packages, or execute pytest. I can only run JavaScript code in a browser environment via the analysis tool. I need to be direct and clear about this limitation, following the user's preferences to eliminate filler and be straightforward.
I don't have tooling that can run shell commands, clone git repos, install Python packages, or execute pytest. My only code execution environment is a JavaScript browser REPL. I can't run pip, pytest, or any command-line tools. I can only execute JavaScript code in a sandboxed browser environment. What were you hoping to accomplish? Do you need help understanding the test setup, or were you expecting different capabilities?
When pressed:
---
My execution environment is a browser-based JavaScript sandbox. I don't have Python, Node.js shell access, pip, git, or pytest capabilities.
Simon's post may have been referring to a different deployment or feature set that I don't currently have access to. My only code execution is the JavaScript REPL I just demonstrated.
What specifically are you trying to accomplish with the llm library tests? I can help with that in other ways.
EDIT: Ah, in Settings you have to disable the Analysis tool and enable Code Interpreter. You cannot have both enabled.
Yes! Forgot to mention that, and it's weird that you can have one but not both enabled. I wrote more about that here: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
You need to enable the new code interpreter mode: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
Interesting. Enable those settings and the Claude on claude.ai becomes Claude Code, and it tries to run everything in the Claude container like it owns the machine. I don't want that.
> I told it to Give me a zip file of everything you have done so far—you can explore the contents of the file it made me in this Gist.
For those who don't have time to dig into the gist, did it work and do a good job? I assume yes to at least nominally working or you would have mentioned that, but any other thoughts on the solution it produced?
I was impressed. Integrating it into my project would require me to spend a bunch more time reviewing it, but it did a better job than I was expecting. The tests it wrote were a particularly useful exploration of the problem space: https://gist.github.com/simonw/f9d0f870e8d1af399a7f366a7f87b...
Ignoring the tests, the first change was adding a single parent id column and the second "more complex" refactoring added a few more hash columns to the table (after you'd specified that you wanted them, i.e. not an open-ended question).
It's a very impressive model, but I think we have different views on what is complex.
Had a similar issue - wanted to get all the files from the response without too much work, so I opened a new tab and vibe coded this in about 4 minutes. Tested it on exactly 1 case: a previous Sonnet 4.5 response, and worked well.
I was worried for a minute that the implementation wasn't production ready. Thankfully, Claude mentioned it right at the end.
Your notes on 4.5 were very interesting, but you asked it a question that only you/someone who already knows the code could ask. I don't though, so I asked it at a higher level: Claude, add tree-structured conversations to https://github.com/simonw/llm. Claude responded with a whole design doc, starting with database schema change (using the same column name even!). https://claude.ai/share/f8f0d02a-3bc1-4b48-b8c7-aa75d6f55021 As I don't know your code, that design doc looks cromulent, but you'd have to read it for yourself to decide how well it did with that higher-level ask.
I am curious how the sandbox handles potentially malicious code. For example, what would happen if someone tried to run something like a crypto miner or a DDoS script?
Code running in the Claude environment can't make outbound network connections to anything more than a tiny list of domains (GitHub, PyPI, NPM) so DDoS targets are limited.
If I was Anthropic I'd have measures in place to rate limit network traffic from individual containers.
This would be interesting to test in a controlled way, curious what the CPU/time/memory quotas per run, isolation method, and syscall filtering look like.
Would appreciate pointers to any documentation or tests.
new models are always magical, let's see how it feels after the cost cutting measures get implemented in 2-3 months.
safety/security patches
Could it pip install pygame and make a game playable in browser?
That's cool Didn't know it is capable of doing it!
Sorry, I downvoted this by mistake. Got confused between collapse and the down arrow (which I rarely use). Does someone know how to reverse a downvote? I do not see it.
Why did you have access to a preview?
I get access to previews from OpenAI, Anthropic and Gemini pretty often. They're usually accompanied by an NDA and an embargo date - in this case the embargo was 10am Pacific this morning.
I won't accept preview access if it comes with any conditions at all about what I can say about the model once the embargo has lifted.
Soooo that leaves xAI that had conditions
[flagged]
Why do you even care?
[flagged]
Simonw is a cheerful and straightforward AI journalist who likes to show and not just tell. He has done a good job aggregating and documenting the progress of LLM tools and models. As I understand it, OpenAI and Anthropic have both wisely decided to make sure he has up to date info because they know he'll write about it.
Thanks for all your work, Simon! You're my favorite journalist in this space and I really appreciate your tone.
Simon has a popular blog, but he's also co-creator of Django and very well-known in the Python community.
> As I understand it, OpenAI and Anthropic have both wisely decided to make sure he has up to date info because they know he'll write about it.
And the wisest part is if he writes something they don't like, they can cut off that advanced access.
As is the longstanding tradition in games journalism, travel journalism, and suchlike.
If they do that I'll go back to writing about them after they ship. Not a big loss for me at all.
I get it, you would trust yourself if you said that, but it doesn't really matter whether you say that or not. What counts for your ongoing credibility is whether you preface every future blog post with a note on whether you got special access, a special deal, or sponsorship, or the fact that you didn't get any of those things.
You're a reviewer. This is how reviewers stay credible. If you don't disclose your relationship with the thing or company you're reviewing, I'm probably better off assuming you're paid.
And if your NDA says you can't write that in your preface, then logically, it is impossible to write a credible review in the first place.
I recently started doing that: https://simonwillison.net/about/#disclosures and https://simonwillison.net/tags/disclosures/
awesome, thanks a lot that's important but ... sorry I just checked those, and I do think it's better to do it on a per-article basis, because a lot of your audience (I'm guessing) comes from external links, not browsing your website
this is (or should be) a pretty standard thing to do on youtube review channels (that I would trust), and it's not a bad thing to remind people of, on every occasion, plus it can function as a type of "canary" in cases of particularly restrictive NDAs
I like Simon, but he's not a journalist. A journalist would not have gone to OpenAI to glaze the GPT-5 release with Theo. I don't say this to discount Simon -- I appreciate his writing and analysis but a journalist, he isn't.
I don't call myself a journalist, partly because no publication is paying me to do any of this!
If I had an editor I imagine they would have talked me out of going to the OpenAI office for a mysterious product preview session with a film crew.
That's a fair point. I feel like he's more than a blogger and am not sure the best term!
An influencer.
AI blogger seems more appropriate than journalist.
are you aware of any "ai journalists"? Because simonw does great work, so perhaps blogger is what people should aspire towards?
I actually talk to journalists on the AI beat quite often - I've had good conversations with them at publications including The Economist and NY Times and Washington Post and ArsTechnica.
They're not going to write up detailed reviews of things like the new Claude code interpreter mode though, because that's not of interest to a general enough audience.
I don't have that restriction: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
Not sure what an AI journalist is supposed to be or do, but a lack of one does not promote someone who is not it automatically into the position.
Kylie Robison recently moved to Wired and is a solid "AI journalist".
Although she is indeed solid as an AI journalist, unfortunately she was recently let go for unknown reasons: https://www.kyliebytes.com/thank-god-i-got-fired/
Shoot that's what I get for staying off twitter and email for a week. Glad newsletters provide a little bit of a cushion these days but hopefully someone snaps her up.
You normally keep up with staffing updates for writers at random internet blogs? That is mind-blowing, I don't think I ever even read the name of the author of an article intentionally, and when I do it by mistake I forget it 2 webpages down the road.
i've never used twitter myself, but isn't that its purpose? follow people you like because of what they do and get informed by themselves about what happens behind the curtains. OP mentioned being off twitter, maybe they follow the author there and would've seen a tweet about it.
His "pelican riding a bicycle" tests are now a classic and AI shops are benchmaxxing for it
They need to benchmaxxx a whole lot harder, the illustrations still all universally suck!
I fully expect a model to output a SVG made up of 1000x1000 rectangles (i.e. pixels) representing a raster image of a beautifully hand-drawn pelican riding a bicycle any day now :)
I got an amazing result from ChatGPT a while back - an SVG with a perfect illustration of a pelican riding a bicycle.
It was suspiciously good in fact... so I downloaded the SVG file and found out it had generated a raster image with its image tool and then embedded it as base64 binary image data inside an SVG wrapper!
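For anyone who wants to check their own results for this trick, here is a minimal sketch that looks for raster images embedded in an SVG as `data:` URIs. It is only a heuristic under stated assumptions: the function name is hypothetical, it only inspects `<image>` elements with an `href` or `xlink:href` attribute, and it would not catch the "thousands of tiny rectangles" variant mentioned upthread.

```python
import xml.etree.ElementTree as ET

# Attribute name for the older xlink:href form, in ElementTree's {namespace}name notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def embedded_raster_images(svg_text: str) -> list[str]:
    """Return the MIME types of images embedded in the SVG as data: URIs."""
    root = ET.fromstring(svg_text)
    found = []
    for element in root.iter():
        # Strip the "{namespace}" prefix so namespaced and bare tags both match.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name != "image":
            continue
        href = element.get("href") or element.get(XLINK_HREF) or ""
        if href.startswith("data:image/"):
            # "data:image/png;base64,...." -> "image/png"
            mime = href[len("data:"):].split(";", 1)[0].split(",", 1)[0]
            found.append(mime)
    return found
```

An empty list means no embedded raster data was found by this check; a result such as `["image/png"]` suggests the SVG is mostly a wrapper around a bitmap.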
You’ll just have to move the goalpost then; perhaps it can be a multidimensional pelican saving the multiverse, or an invisible pelican that only you can see and critique.
How would that help, given that ChatGPT has apparently already figured out how to consistently and systematically game the benchmark by working in pixel space and only using SVG as a wrapper for a raster image?
FWIW, I could totally see a not hugely more advanced model using its native image generation capabilities and then running a vector extraction tool on it, maybe iteratively. (And maybe I would not consider that cheating, anymore, since at some point that probably resembles what humans do?)
ive got such pixelated rectangle SVG's a few times.
also with cursor, "write me a script that outputs X as an svg" it has given me rectangles a few times.
If they were testing that it'd work more often.
Other things you can ask that they're still clearly not optimizing for are ASCII art and directions between different locations. Complete fabrications 100% of the time.
Well, I definitely hope they aren't trying to teach LLMs directions between locations, given how idiotic a use of compute and parameter space that would be. We already have excellent AIs for route planning. What they ought to optimize for is, of course, finally teaching them to say they don't know, or just automatically opting to call a route-planning API if the user asks for directions.
Simon tends to write up reports of new LLM releases (with great community respect) and it's much easier with lead time if the provider is able to set up a preview endpoint.
I believe the criticism is that he's reporting on a pre-release LLM which isn't the same as the one you and I are going to be using a few weeks from now after they've downgraded it enough to work at scale.
The same reason YouTube reviewers and influencers get access to hardware or games before release. In this case, the person is a passionate blogger.
simonw is Simon Willison, who’s well known for a number of things. But these days, he’s well known for his AI centric blog and his tools. The AI companies give him early access to stuff.
If you want to keep up with AI progress and model updates, simonw is the man to follow!
They are an AI evangelist that told me I can replace any technical book created with an LLM.
They are a nice person.
You are correct, sir!
Could you please stop breaking the site guidelines? You've been doing it repeatedly, we've asked you to stop several times, and you haven't stopped yet.
We end up banning that sort of account, so if you would please review https://news.ycombinator.com/newsguidelines.html and stick to the rules from now on, that'd be good.
He is very interested in the topic and sometimes too enthusiastic, but saying he shills is not productive and is a lie.
Anecdotal evidence.
I have a fairly large web application with ~200k LoC.
Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI).
"implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay"
Sonnet 4.5 went really fast at ~3min. But what it built was broken and superficial. The code did not even manage to reuse already existing auth and started re-building auth server-side instead of looking how other API endpoints do it. Even re-prompting and telling it how it went wrong did not help much. No tests were written (despite the project rules requiring it).
GPT-5-Codex needed MUCH longer ~20min. Changes made were much more profound, but it implemented proper error handling, lots of edge cases and wrote tests without me prompting it to do so (project rules already require it). API calls ran smoothly. The entire feature worked perfectly.
My conclusion is clear: GPT-5-Codex is the clear winner, not even close.
I will take the 20mins every single time, knowing the work that has been done feels like work done by a senior dev.
The 3mins surprised me a lot and I was hoping to see great results in such a short period of time. But of course, a quick & dirty, buggy implementation with no tests is not what I wanted.
I'm not trying to be offensive here, feel the need to indicate that.
But that prompt leads me to believe that you're going to get rather 'random' results due to leaving SO much room for interpretation.
Also, in my experience, punctuation is important - particularly for pacing and grouping of logical 'parts' of a task and your prompt reads like a run on sentence.
Making a lot of assumptions here - but I bet if I were in your shoes and looking to write a prompt to start a task of a similar type that my prompt would have been 5 to 20x the length of yours (depending on complexity and importance) with far more detail, including overlapping of descriptions of various tasks (ie; potentially describing the same thing more than once in different ways in context/relation to other things to establish relation/hierarchy).
I'm glad you got what you needed - but these types of prompts and approaches are why I believe so many people think these models aren't useful.
You get out of them what you put into them. If you give them structured and well written requirements as well as a codebase that utilizes patterns you're going to get back something relative to that. No different than a developer - if you gave a junior coder, or some team of developers the following as a feature requirement: `implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay` then you can't really be mad when you don't get back exactly what you wanted.
edit: To put it another way - spend a few more minutes on the initial task/prompt/description of your needs and you're likely to get back more of what you're expecting.
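To make the "room for interpretation" point concrete, here is one possible reading of just the matching part of that feature request, sketched with Python's standard `difflib`. Everything here is an assumption that the one-line prompt does not specify: the function name, case-insensitive matching, the similarity cutoff of 0.6, and the result limit. The 2s delay, the UI wiring, and auth are left out entirely.

```python
import difflib


def fuzzy_search(query: str, titles: list[str], limit: int = 5, cutoff: float = 0.6) -> list[str]:
    """Return up to `limit` titles that approximately match `query`, best match first."""
    # Map lowercased titles back to their original spelling (first one wins on ties).
    by_lowercase = {}
    for title in titles:
        by_lowercase.setdefault(title.lower(), title)
    matches = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=limit, cutoff=cutoff
    )
    return [by_lowercase[match] for match in matches]


# Hypothetical titles; a typo in the query still finds the report.
titles = ["Quarterly report", "Onboarding conversation", "Budget review"]
print(fuzzy_search("quarterly reprot", titles))
```

Each of those defaults is a decision someone has to make, whether it is the prompt author, the model, or a reviewer after the fact.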
I think that is an interesting observation and I generally agree.
Your point about prompting quality is very valid and for larger features I always use PRDs that are 5-20x the prompt.
The thing is my "experiment" is one that represents a fairly common use case: this feature is actually pretty small and embeds into a pre-existing UI structure - in a larger codebase.
GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results. It not only works on first try, Codex is reliably better at understanding the context and doing the things that are common and best practice in professional SWE projects.
If I want to get something comparable out of Claude, I would have to spend at least 20mins preparing the prompt. If not more.
> The thing is my "experiment" is one that represents a fairly common use case
Valid as well. I guess I'm just nitpicking based on how much I see people saying these models aren't useful combined with seeing this example, triggered my "you're doing it wrong" mode :D
> GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results.
I have a reputation with family and co-workers of being quite verbose - this might be why I prefer Claude (though haven't tried Codex in the last month or so). I'm typically setting up context and spending a few minutes writing an initial prompt and iterating/adjusting on the approach in planning mode so that I _can_ just walk away (or tab out) and let it do its thing, knowing that I've already reviewed its approach and have a reasonable amount of confidence that it's taking an approach that seems logical.
I should start playing with codex again on some new projects I have in mind where I have an initial planning document with my notes on what I want it to do but nothing super specific - just to see what it can "one shot".
Yeah, as someone who has been using Claude Code for about 4 months now, I’ve adopted a “be super specific by default”-workflow. It works very well.
I typically use zen-mcp-server’s planning mode to scope out these tasks, refine and iterate on a plan, clear context, and then trigger the implementation.
There’s no way I would have considered “implement fuzzy search” a small feature request. I’m also paranoid about introducing technical debt / crappy code, as in my experience that is the #1 reason that LLMs typically work well for new projects but start to degrade after a while: there’s just a lot of spaghetti and debt built up over time.
I tend to tell claude to research what is already there, and think hard, and that gives me much better per-prompt results.
But you are right that codex does that all by default. I just get frustrated when I ask it something simple and it spends half an hour researching code first.
This makes me think that for simple things, we need to anti-prompt — tell the model to not overthink things.
Some do this by using tools like RepoPrompt to read entire files into GPT-5 Pro, and then using GPT-5 Pro to send the relevant context and work plan to Codex so that it can skip needing to poke around files. If you give it the context, it won't spend that time looking for it. But then you spend time with Pro, though it can ingest entire files at once instead of searching through them and provide a better plan for Codex.
It worked on the first try, but did it work on the second?
I noticed in conversations with LLMs, much of what they come up with is non-deterministic. You regenerate the message and it disappears.
That appears to be the basic operating principle of the current paradigm. And agentic programming repeats this dice roll, dozens or hundreds of times.
I don't know enough about statistics to say if that makes it better (converging on the averages?) or worse (context pollution, hallucinating, focusing on noise?), but it seems worth considering.
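One way to put rough numbers on the "repeated dice roll" worry is a toy model, sketched below. It assumes every step succeeds independently with the same probability and that a single failed step ruins the whole run; neither assumption is established for real agents, which can often notice and repair mistakes (for example by running tests), so treat this as a pessimistic back-of-the-envelope bound. The function name is hypothetical.

```python
def chance_all_steps_succeed(p_step: float, steps: int) -> float:
    """Probability that `steps` independent steps all succeed (toy model)."""
    return p_step ** steps


# Even a 99%-reliable step compounds quickly over a long run.
for steps in (1, 10, 50, 100):
    print(steps, round(chance_all_steps_succeed(0.99, steps), 3))
```

Under these assumptions, 100 steps at 99% reliability each finish cleanly only a bit more than a third of the time, which is one argument for checkpoints and tests between steps.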
I would think that to truly rank such things, you should run a few tests and look for a clear pattern. It's possible that something prompted Claude to take "the easy way" while ChatGPT didn't.
Your anecdata is not an "experiment" especially to derive such a broad conclusion :)
This would explain the LLM implementing the feature in a way you didn't prefer. But this does not explain why Sonnet would deliver a broken implementation that does not work in even the most basic sense.
Also, there is a threshold in which the time it takes to develop a prompt, allow the agent to run, review its output, and go through iterative loops to correct errors or implementation problems, can exceed the time it takes me (a lazy human) to achieve the same end result.
Pair this with the bypassing of the generation effect, reduced prefrontal dopamine, and increased working memory load (in part due to minimal motor-cognitive integration), and AI-generated code in contexts with legal and financial repercussions can be a much worse deal than using your own fingers.
> But this does not explain why Sonnet would deliver a broken implementation that does not work in even the most basic sense.
Depends not just on prompt but also the tooling / environment you use. Somebody using Claude Code CLI may get a totally different experience than somebody using CoPilot via VSC.
What do I mean by that? Look at how Copilot tries to save money by reading content only in small parts. Reading file X line 1-50, X line 51-100, ... And it starts working with this. Only if it finds a hint about something somewhere else, it will read in more context.
What I often see is that it misses context because it reads in such limited information, and if there is no hint in your code or code doc, it will stop there. Try to run a local test on the code, passes, done... While it technically broke your application.
Example: If I tell it to refactor an API, it never checks if that API is used anywhere else because it only reads in that API code. So I need to manually add to the prompt to remind it, "the API is used in the system". And then it does its searching /... Found 5 files, Read X line 1...
And plop, good working code ... So if you know this limitation, you can go very far with a basic $10 CoPilot Claude Agent usage.
Whereas a $200 Claude Code will give you a better experience out of the door, as it reads in a ton more. The same applies to GPT-5/Codex, which seems to be more willing to read in larger context of your project, thus resulting in less incomplete code.
This is just anecdotal from my point of view, but like with any LLM, hinting matters a lot. It's less about writing a full prompt with a ton of text but just including the right "do not forget about function name X, and module Y, and test Z". And Claude loves its hints on CoPilot because of that limited reading.
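One way to produce those hints without hunting by hand: a small script that lists where a symbol is used, so the file names can be pasted into the prompt. This is only a sketch; `find_usages`, `build_hint`, and the suffix list are made-up names for illustration, not part of Copilot or Claude Code.

```python
import re
from pathlib import Path


def find_usages(root, symbol, suffixes=(".py", ".js", ".ts")):
    """Return (relative_path, line_number) pairs where `symbol` appears as a whole word."""
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and file types we did not ask for.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.relative_to(root).as_posix(), number))
    return hits


def build_hint(symbol, hits):
    """Format the usages as a one-line reminder to paste into a prompt."""
    files = sorted({path for path, _ in hits})
    return f"Do not forget: `{symbol}` is also used in " + ", ".join(files)
```

Running `build_hint("get_user", find_usages(".", "get_user"))` before a refactor gives the agent the "somewhere else" pointer it would otherwise have to stumble on.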
> I bet if I were in your shoes and looking to write a prompt to start a task of a similar type that my prompt would have been 5 to 20x the length of yours
Why would you need such extensive prompting just to get the model to not re-implement authentication logic, for example? It already has access to all of the existing code, shouldn't it just take advantage of what's already there? A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
> shouldn't it just take advantage of what's already there?
It's not a good idea to have any coding agent put unnecessary amounts of lines into the context window in order to understand your code base.
Performance of all LLMs drops drastically when the context window is full or nearly full. The purpose of being more specific with your prompts is that you spend a few more tokens up front to make the task a lot more efficient and more likely to result in success.
At least that's how it is today. We're probably a breakthrough or two away from the type of vibe coding experience non-coders want. Or it may never happen, and the developers who have coding knowledge will be the only ones to fully utilize coding agents and it will only become more powerful over time.
I'm not sure exactly what you mean by the vibe coding experience non-coders want, but if it's one-shotting a buildable codebase off of an unspecific prompt, the major breakthrough would have to be brain-computer interfaces so the agent can literally read the user's mind.
If that same person approached a software development company with the same prompt without following up with any other details, they won't get good code back, either. You're not saying it, but this idea that in the future you can tell a computer something like "create photoshop" and get what you're expecting is an unrealistic dream that would need mind-reading or a major breakthrough and paradigm shift in understanding and interpreting language.
> the major breakthrough would have to be brain-computer interfaces so the agent can literally read the user's mind.
And even that would not be enough.
In reality, it would have to put the user to sleep and go through various dream scenarios to have the user's brain really build an internal model that is not there in the first place. No brain interface can help find what is not there.
We usually need interactions with reality to build the internal model of what we actually want step by step, especially for things we have not done before.
Even for info that is there, there's also a limit to fantasy or sci-fi brain scanning. The knowledge is not stored like in a RAM chip, even when it is there. You would have to simulate the brain to actually go through the relevant experiences to extract the information. Predicting the actual dynamic behavior of the brain would require some super-super sub-molecular level scan and then correctly simulating that, since what the neurons will actually do depends on much more than the basic wiring. Aaaaand you may get a different result depending on time of day, how well they slept, mood and when and what the person ate and what news they recently read, etc. :)
That is also not enough. An agent could build an application that functions, but you also need to have a well-designed underlying architecture if you want the application to be extensible and maintainable - something the original dreamer may not even be capable of - so perhaps a shared extended dream share with a Sr. architect is also needed. Oh wait .. I guess we're back to square 1 again? lol
Well, I don't have the context myself about what's happening in this example, though I don't see anything about auth myself.
And I left that window at 5-20x because, again, no real context. But unless I was already in the middle of a task and I was giving direction that there was already context for - my prompt is generally almost never _this_ short. (referring to the prompt in the top level comment)
> A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
It wouldn't be, given the additional context given by the author in a sibling comment to yours. But if you had specific expectations on the resulting code/functionality that 20x longer prompt is likely to save you time and energy in the back and forth adjustments you might have to make otherwise.
You're critiquing OP for not playing with how the models currently work (bad at gathering context on their own). Sure, if you bend over backwards and hop on one foot, you can get them to do what you want.
OP is critiquing the model as a product vs. the marketing promises. The model should be smart enough to gather context about the project and implement features properly on its own, if these models are ever going to 'write 90% of all code THIS YEAR' as people like the founder of Anthropic claim.
> but these types of prompts and approaches are why I believe so many people think these models aren't useful.
100% agree. The prompt is a 'yolo prompt'. For that task you need to give it points in what to do so it can deduce its task list, provide files or folders in context with @, tell it how to test the outcome so it knows it has succeeded and can close the feedback loop, and guide it in implementation either via memory or via context with which existing libs or methods it should call on.
For greenfield tasks and projects I even provide architectural structure, interfaces, etc.
After reading twitter, reddit and hn complaints about models and coding tools I've come to the same conclusion as you.
That fact is pretty useless to draw any useful conclusions from with one random not so great example. Yes, it's an experiment and we got a result. And now what? If I want reliable work results I would still go with the strategy of being as concrete as possible, because in all my AI activities, anything else lets the results be more and more random. Anything non-standard (like, you could copy & paste directly from a Google or SO result), no matter how simple, I better provide the base step by step algorithm myself and only leave actual implementation to the AI.
My parent said:
> For that task you need to give it points in what to do so it can deduce its task list, provide files or folders in context with @…
- and my point is that you do not have to give ChatGPT those things. GP did not, and they got the result they were seeking.
That you might get a better result from Claude if you prompt it 'correctly' is a fine detail, but not my point.
(I've no horse in this race. I use Claude Code and I'm not going to switch. But I like to know what's true and what isn't and this seems pretty clear.)
Sounds like you would be still writing your prompt when they already got a good implementation back from GPT.
I agree. The sibling responses also give some insight into why a lot of people don't get a lot of value out of coding agents.
But isn't the end goal to be able to get useful results without so much prompting?
I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
To me, the fact that so many people basically say "you are prompting it wrong" is a knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.
Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for an LLM IMO.
I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
> But isn't the end goal to be able to get useful results without so much prompting?
See below about context.
> I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
Movies != real life
> To me, the fact that so many people basically say "you are prompting it wrong" is a knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.
See below about context.
> Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for an LLM IMO.
Context is king.
> I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
What I'm understanding from your comments here is that you should just be able to give it broad statements and it should interpret that into functional results. Sure - that works incredibly well, if you provide the relevant context and the model is able to understand and properly associate it where needed.
But you're comparing the LLMs to humans (this is a problem, but not likely to stop so we might as well address it) - but _what_ humans? You ask if that prompt would be sufficient for a senior developer - absolutely, if that developer already has the _context_ of the project/task/features/etc. They can _infer_ what's not specified. But if you give that same prompt to a jr dev who maybe has access to the codebase and has poked around inside the working application once or twice but no real in depth experience with it - they're going to _infer_ different things. They might do great, they might fail spectacularly. Flip a coin.
So - with that prompt in the top level comment - if that LLM is provided excellent context (via AGENTS.md/attached files/etc) then it'll do great with that prompt, most likely. Especially if you aren't looking for specifics in the resulting feature outside of what you mentioned since it _will_ have to infer some things. But if you're just opening codex/CC without a good CLAUDE.md/AGENTS.md and feeding it a prompt like that you have to expect quite a bit of variance to what you get - exactly the same way you would a _human_ developer.
Your context and prompt are the project spec. You get out what you put in.
> But you're comparing the LLMs to humans
These things are being marketed as super intelligent magic answer machines. Judging them using the criteria the marketing teams have provided is completely reasonable.
> Movies != real life
Nobody claimed it was. This is about desires and expectations. The people charging money for these services and taking stacks of cash that would’ve otherwise been in devs’ paychecks while doing so haven’t even tried to temper those expectations. They made their beds…
> But you're comparing the LLMs to humans
Didn't the parent comment compare Sonnet vs Codex with GPT5?
Quick data point that I've been able to get LLMs (recently whatever one Claude gives me) to produce amazingly useful results for the purpose of understanding complex codebases, just by asking it to look at the code and tell me how it does xyz. No complicated long prompt. Basically exactly what I'd say to a human.
I have to agree with this assessment. I am currently going at the rate of 300-400 lines of spec for 1,000 LOC with Claude Code. Specs are AI-assisted also, otherwise you might go crazy. :-) Plus 2,000+ lines of AI-generated tests. Pretty restrictive, but then it works just fine.
No true Promptman.
When asking for change, there are the choices you know about and the ones you don't. I've gotten in the habit of describing some of the current state as well as my desired state, and using that to focus the LLM on the areas I'd like to have a stronger voice in.
Of course, I definitely appreciate when it makes choices that I don't know I need to make, and it chooses reasonable defaults.
I mean, I could say "make the visualization three columns", but there's a lot of ambiguity in that kind of thing, and the LLM is going to make a lot of choices about my intent.
Instead, "each team/quarter currently has a vertically stacked list of people assigned to that team, with two columns (staying on team, leaving team). change that to being three columns (entering team, staying on team, leaving team)."
As a bonus, it's much, much clearer to somebody reading the history later what the intent was.
Yes to punctuation
After all, attention is all you need :)
tbh, I don't really understand it well enough to be able to give a response here. But here's a real prompt I just used on a project copy/pasted:
```
Something that seems to have been a consistent gotcha when working with llm's on this project is that there's no specific `placement` column on the table that holds the 'results' data. Our race_class_section_results table has it's rows created in placement order - so it's inferred via the order relative to other records in the same race_class_section. But this seems to complicate things quite a bit at times when we have a specific record/entry and want to know it's placement - we have to query the rest of them and/or include joins and other complications if we want to filter results by the placements, etc.
Can you take a look at how this is handled, both with the querying of existing data by views/livewire components/etc and how we're storing/creating the records via the import processes and give me a
determination on whether you think it should be refactored to include a column on the database for `placement`? I think right now we've got 140,000 or so records on that table and it's got nearly
20 years worth of race records so I don't think we need to be too concerned with the performance of the table or added storage or anything. Think very hard, understand that this would be a rather
major refactor of the codebase (I assume, since it's likely used/referenced in _many_ places - thankfully though that most of the complicated queries it would be found in would be easily identified
by just doing a search of the codebase for the race_class_section_results table) and determine if that would be worth it for the ease of use/query simplification moving forward.
```
This comes with a rather developed CLAUDE.md that includes references to other .md documents that outline various important aspects of the application that should be brought into context when working in those areas.
This prompt was made in planning mode - the LLM will then dig into the code/application to understand things and, if needed, ask questions and give options to weigh before returning with a 'plan' on how to approach. I then iterate on that plan with it before eventually accepting a plan that it will then begin work on.
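To make the trade-off in that prompt concrete, here is a toy comparison using SQLite from Python. Only the table name and the `placement` column come from the prompt; the other columns (`id`, `race_class_section_id`, `entrant`) and the sample rows are assumptions for illustration, and the real schema is surely different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE race_class_section_results (
    id INTEGER PRIMARY KEY,                 -- rows are inserted in placement order
    race_class_section_id INTEGER NOT NULL,
    entrant TEXT NOT NULL                   -- hypothetical column
);
INSERT INTO race_class_section_results (race_class_section_id, entrant) VALUES
    (1, 'Avery'), (1, 'Blake'), (1, 'Casey'),
    (2, 'Devon'), (2, 'Ellis');
""")

# Inferred placement: count the rows in the same section created at or before this one.
INFERRED = """
SELECT COUNT(*) FROM race_class_section_results AS other
WHERE other.race_class_section_id = race_class_section_results.race_class_section_id
  AND other.id <= race_class_section_results.id
"""


def inferred_placement(entrant):
    """Look up one entrant's placement by querying its sibling rows."""
    row = conn.execute(
        f"SELECT ({INFERRED}) FROM race_class_section_results WHERE entrant = ?",
        (entrant,),
    ).fetchone()
    return row[0]


# One-off backfill into an explicit column, reusing the same inference.
conn.execute("ALTER TABLE race_class_section_results ADD COLUMN placement INTEGER")
conn.execute(f"UPDATE race_class_section_results SET placement = ({INFERRED})")


def winners():
    """With a stored column, filtering by placement needs no subquery or join."""
    rows = conn.execute(
        "SELECT entrant FROM race_class_section_results "
        "WHERE placement = 1 ORDER BY race_class_section_id"
    ).fetchall()
    return [entrant for (entrant,) in rows]
```

The stored column makes reads simpler, but the import process then has to keep it correct, which is the part of the refactor the prompt asks the model to weigh.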
that's kind of expected for me, but codex feels more like vibe coding tool and Claude code more like ai assisted development.
And I actually like Claude more because of that.
codex will indeed work more independently, but you will have a hard time when it isn't what you want. It will use a python script just to do simple edits in files (e.g. search and replace for unique code snippets in small files). When it's wrong, good luck convincing it (it already has some outdated info, like on the latest docker image releases, and convincing it that the Debian base changed was challenging). It uses context more effectively, but it will lack an explanation of why it is doing what it is doing; asking it to explain will just cause it to do something else without any word.
and of course lack of proper permissions for running commands. sandbox is cool but I do not want it to be able to commit, I want it to just edit files or I want to have some more control over what it does.
you can run codex as an mcp server. I prefer adding it to Claude and asking for a cooperative plan: codex will do great analysis and planning, and I can comfortably work with Claude on the code that matches my style
Try adding ultrathink to the prompt, hit enter, and play this song in the background:
[1] https://www.reddit.com/r/ClaudeAI/comments/1mgwohq/ultrathin...
Sidenote, Claude Code now has a thinking toggle. Maybe it did already, but that's new to me. I see better results by just leaving that toggle on.
Same experience here. In the last week I've successfully managed to build a complete C++20 XPath 1.0 parser with Codex, and am now onto supporting XPath 2.0. Codex has been nailing it time and again - the only caveat is that I have to use their cloud version as local execution is buggy.
Sonnet on the other hand gets tripped up constantly due to the complexity. I'm not seeing the improvement I was hoping for with 4.5, and it's just given up on attempting to implement support for date-time functionality. Codex has nailed the same task, yet Anthropic claim OpenAI have the inferior product?
I'm pretty sad about this as I'm gunning for Anthropic and would have loved to see them at least match Codex with this new release. If OpenAI stays on this course then Claude Code is toast without an important breakthrough. It doesn't help that they're also the more expensive product and have service quality issues pushing people to quit the platform.
I had a similar experience, not good enough yet to come back for the Claude max plan.
Sticking with ChatGPT pro sub and gpt5 codex on high.
Never hit pro quota yet, huge repo. Have multiple projects on the go locally and in cloud.
Feel like this is going to be the $1000 plan soon
I'm thinking about switching to ChatGPT Pro also. Any idea what maxes it out before I need to pay via the API instead? For context I'm using about 1b tokens a month so likely similar to you by the sounds of things.
On pro tier have not been able to trigger the usage cap.
Pro
Local tasks: Average users can send 300-1,500 messages every 5 hours with a weekly limit. Cloud tasks: Generous limits for a limited time. Best for: Developers looking to power their full workday across multiple projects.
Thank you, that's very helpful. I think I could get close to that in some coding sessions where I'm running multiple in parallel but I suspect it's very very rare. Even with token efficient gpt5-codex my OpenAI bill is quite high so I think I will switch to Pro now.
There are a few issues with posts like yours:
1. Different LLMs require different prompts and information
2. They ignore LLMs' non-determinism; you should run the experiment several times
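For point 2, the comparison could be as simple as a pass rate over repeated runs. A rough sketch: `run_trial` is a hypothetical callable that runs the agent on the task and returns whether the result passed your checks, and `make_fake_agent` is just a seeded stand-in so the snippet runs on its own.

```python
import random


def pass_rate(run_trial, trials):
    """Run the same task `trials` times and return the fraction that passed."""
    passed = sum(1 for _ in range(trials) if run_trial())
    return passed / trials


def make_fake_agent(success_probability, seed):
    """Hypothetical stand-in for 'run the agent, then run the test suite'."""
    rng = random.Random(seed)
    return lambda: rng.random() < success_probability


# Made-up success probabilities, only to show the shape of the comparison.
rate_a = pass_rate(make_fake_agent(0.8, seed=1), trials=20)
rate_b = pass_rate(make_fake_agent(0.4, seed=1), trials=20)
print(f"agent A: {rate_a:.0%}, agent B: {rate_b:.0%}")
```

A single run of each tool is one sample from each distribution; a handful of runs per tool at least shows whether the gap is consistent.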
Oh and I agree so much. I just shared a quick first observation in a real-world testing scenario (BTW re-ran Sonnet 4.5 with the same prompt, not much changed). I just keep seeing how LLM providers keep optimizing for benchmarks, but then I cannot reproduce their results in my projects.
I will keep trying, because Claude 4 generally is a very strong line of models. Anthropic had been on the AI coding throne for months before OpenAI, with GPT-5 and Codex CLI (and now GPT-5-Codex), dethroned them.
And sure I do want to keep them competing to make each other even better.
What would be the difference in prompts/info for Claude vs ChatGPT? Is this just based on anecdotal stuff or is there actually something I can refer to when writing prompts? I mostly use Claude, but don't really pay much attention to the exact wording of the prompts
I must be using Codex wrong. I evaluated it with a task to do a pretty simple, mechanical string replacement across many files (moving from a prop spread in JSX to standard props, but only when the value being spread is a subscript of an object). It simply couldn't do it, and it wasn't even close. It was getting the syntax wrong, trying to fix it by deleting the code, then moving on to other files. Sonnet 4.1 wasn't perfect, but I was able to point out its errors and it fixed them and avoided doing it again.
I will say, Claude does seem to need a verbose prompt. Often I'll write my prompts as tasks in Notion and have it pull them via MCP (which is nice, because it tracks and documents its work in the process). But once you've given it a few paragraphs about the why and how, my experience is that it's pretty self sufficient. Granted, I'm using Cursor and not CC; I don't know if that makes much of a difference.
Codex cannot fail, it contains multitudes beyond your imagining. Nay, it can only be failed. Continue internalizing that the problem is you, not the tool. Perhaps a small infusion of one trillion USD would unlock it and your potential?
Sincerely, TPG Dexoc, the living human person
My first thought was I bet I could get Sonnet to fix it faster because I got something back in 3 minutes instead of 20 minutes. You can prompt a lot of changes with a faster model. I'm new to Claude Code, so generally speaking I have no idea if I'm making sense or not.
Codex is like an outsourced contractor. You give them the spec and eventually you get a full feature to review.
Claude is more like a pair programmer, you iterate in short cycles and adjust.
I think Codex working for 20 mins uninterrupted is actually a strength. It’s not “slow” as critics sometimes say - it’s thorough and autonomous. I can actually walk away and get something else done around the house while it does my work for me.
I swear cc in June/July used to spend a lot more time on tasks and felt more thorough like codex does now. Hard to remember much past the last week in this world though.
Yes I agree. I think Anthropic nerfed it considerably.
Interesting, in my experience Claude usually does okay with the first pass, often gets the best visual/ui output, but cannot improve beyond that even with repeated prompts and is terrible at optimising, GPT almost the opposite.
It's also my experience that Claude loves to reimplement the wheel instead of reading code to look for an existing implementation of what it wants to do.
I've been working with Opus 4 on ultrathink quite a bit recently and did some quick tests with Sonnet 4.5, I'm fairly impressed, especially with its speed but I did feel it was a lot less strict with my rules, existing patterns, etc. compared to Opus 4.
Maybe it's better with a better CLAUDE.md structure? I don't use those a lot, just telling Opus to think got 'good enough' results I guess. Not sure.
I hope there's an Opus 4.5 coming out soon too. In the meantime I'll see if I can get it to do better with some extra prompting or I'll go back to Opus if I don't need the speedier responses.
Are you paying for ChatGPT pro? Does that include codex CLI usage?
Claude Code on max plan is why I use Sonnet/Opus, but if ChatGPT pro includes codex usage I may switch.
You can use codex CLI on a measly Plus plan
Claude needs a lot of context to keep it on rails IMHO. And, Claude Code is a closed source piece of crap agent framework.
I've tried codex with GPT-5 a little bit and I haven't figured out how to get it to not be annoying. codex just constantly tries to gaslight and argue with me. For example, I was debugging an OpenGL render pipeline that went black and codex insisted it must be because I was ssh'd into a headless server. It really makes me appreciate the Claude "You're absolutely right!"s. Anyway as you can tell, I haven't cracked working with codex. But at the moment it just messes everything up and ways I've learned to work with claude don't seem to translate.
Curious if Opus would have done it differently than GPT-5-Codex.
Also, trying GPT-5-Codex
Yes, I did.
I ran the test again, took Claude ~4mins this time. There was no error now with the auth, but the functionality was totally broken. It could not even find the most basic stuff that matches perfectly.
try also adding Serena mcp and make sure you did /init
Anecdotes are pointless in this domain.
I even added a disclaimer "anecdotal evidence". Believe me, I am not the biggest fan of Sam. I just happen to like the best tools available, have used most of the large models and always choose the one that works best - for me.
I haven't shouted into the void for a while. Today is as good a day as any other to do so.
I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connection between my will and the patterns of energy is so tenuous I almost don't feel like touching a computer again.
A lack of determinism comes from many places, but primarily: 1) The models change 2) The models are not deterministic 3) The history of tool use and chat input is not available as a first class artifact for use.
I would love to see a tool that logs the full history of all agents that sculpt a codebase, including the inputs to tools, tool versions and any other sources of entropy. Logging the seed into the RNGs that trigger LLM output would be the final piece that would give me confidence to consider using these tools seriously.
I write this now after what I am calling "AI disillusionment", a feeling where I feel so disconnected from my codebase I'd rather just delete it than continue.
Having a set of breadcrumbs would give me at least a modicum of confidence that the work was reproducible and not the product of some modern ghost, completely detached from my will.
Of course this would require actually owning the full LLM.
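A rough sketch of what such breadcrumbs could look like: an append-only log where each event (prompt, tool call, tool version, sampling seed) is chained to the previous one by a hash, so edits to the history are detectable. Everything here (the event kinds, field names, `append_event`, `verify`) is hypothetical and not an existing tool's format.

```python
import hashlib
import json


def _digest(body):
    """Hash an event body in a stable way (sorted keys, UTF-8)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log, kind, payload):
    """Append one event to the log, chaining it to the previous event's hash."""
    prev = log[-1]["hash"] if log else ""
    body = {"kind": kind, "payload": payload, "prev": prev}
    log.append({**body, "hash": _digest(body)})
    return log[-1]["hash"]


def verify(log):
    """Return True if no event was altered, removed, or reordered."""
    prev = ""
    for event in log:
        body = {"kind": event["kind"], "payload": event["payload"], "prev": event["prev"]}
        if event["prev"] != prev or event["hash"] != _digest(body):
            return False
        prev = event["hash"]
    return True


# Hypothetical session: all values below are invented for illustration.
session = []
append_event(session, "prompt", {"text": "add a placement column"})
append_event(session, "sampling", {"model": "example-model", "seed": 1234})
append_event(session, "tool_call", {"tool": "grep", "version": "3.11", "args": ["-rn", "placement"]})
print(verify(session))
```

A log like this records what happened; replaying it to the same output would still need a fixed model and a controllable sampler, which is the "owning the full LLM" caveat.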
> A lack of determinism comes from many places, but primarily: 1) The models change 2) The models are not deterministic...
Models themselves are deterministic. This is a huge pet peeve of mine, so excuse the tangent, but the appearance of nondeterminism comes from a few sources, and imho can be largely attributed to the probabilistic methods used to get appropriate context and enable timely responses. Here's an example of what I mean: a 52-card deck. The deck order is fixed once you shuffle it. Drawing "at random" is a probabilistic procedure on top of that fixed state. We do not call the deck probabilistic. We call the draw probabilistic. Another example: a pot of water heating on a stove. Its temperature follows deterministic physics. A cheap thermometer adds noisy, random error to each reading. We do not call the water probabilistic. We call the measurement probabilistic.
Theoretical physicists run into such problems, albeit far more complicated, and the concept for how they deal with them is called ergodicity. The models at the root of LLM's do exhibit ergodic behavior; the time average and the ensemble average of an observable are identical, i.e. the average response of a single model over a long duration and the average of many similar models at a fixed moment are equivalent.
The previous poster is correct for a very slightly different definition of the word "model". In context, I would even say their definition is the more correct one.
They are including the random sampler at the end of the LLM that chooses the next token. You are talking about up to, but not including, that point. But that just gives you a list of possible output tokens with values ("probabilities"), not a single choice. You can always just choose the best one, or you could add some randomness that does a weighted sample of the next token based on those values. From the user's perspective, that final sampling step is part of the overall black box that is running to give an output, and it's fair to define "the model" to include that final random step.
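A toy illustration of that split, assuming nothing about any real inference stack: the scores are fixed numbers standing in for a model's output, `greedy` always picks the top one, and `sample` does a weighted draw that is repeatable only if you control the seed.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Deterministic choice: always the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, rng, temperature=1.0):
    """Weighted random choice over the same scores, driven by `rng`."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


scores = [1.0, 3.0, 2.0]  # made-up scores for three candidate tokens
print(greedy(scores))
print([sample(scores, random.Random(0)) for _ in range(3)])  # same seed, same draw each time
unseeded = random.Random()
print([sample(scores, unseeded) for _ in range(3)])  # may differ between runs
```

Everything up to `softmax` is the deterministic part both posters agree on; the disagreement is only about whether the last function counts as "the model".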
but, to be fair, simply calling the sampler random is what gives people the impression like what OP is complaining about. which isn't entirely accurate, it's actually fairly bounded.
this plays back into my original comment, which you have to understand to know that the sampler, for all its "randomness" should only be seeing and picking from a variety of correct answers, i.e. the sample pool should only have all the acceptable answers to "randomly" pick from. so when there are bad or nonsensical answers that are different every time, it's not because the models are too random, it's because they're dumb and need more training. tweaking your architecture isn't going to fully prevent that.
The User:
The stove keeps burning me because I can't tell how hot it is, it feels random and the indicator light is broken.
You:
The most rigorous definition of temperature is that it is equal to the inverse of the rate of change of entropy with respect to internal energy, within a given volume V and particles N held constant. All accessible microstates are equiprobable over a long period of time, this is the very definition of ergodicity! Yet, because of the flow of entropy the observed macrostates will remain stable. Thus, we can say that the responses of a given LLM are...
The User:
I'm calling the doctor, and getting a new stove with an indicator light.
Well really, the reason why I gripe about it, to use your example, is that then they believe the indicator light malfunctioning is an intrinsic feature of stoves, so they throw their stove out and start cooking over campfires instead, tried and true, predictable, whatever that means.
I think my deck of cards example still holds.
You could argue I'm being uselessly pedantic, that could totally be the case, but personally I think that's cope to avoid having to think very hard.
Here is a definite scientific nail down and solve for non-determinism in LLM outputs (Mira Murati's new outfit but really credit the author)
https://bff531bb.connectionism.pages.dev/blog/defeating-nond...
Looks like it's probably https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
It's also a pet peeve of mine, enough that I actually wrote a blog about it
I share the sentiment. I would add that people I would like to see use LLMs for coding (and other technical purposes) tend to be jaded like you, and people I personally wouldn't want to see use LLMs for that, tend to be pretty enthusiastic
I've been building something like this, a markdown that tracks your prompts, and the code generated.
https://github.com/sutt/innocuous/blob/master/docs/dev-summa...
Check it out, I'd be curious of your feedback.
Maybe just take a weekend and build something by writing the code yourself. It's the feeling of pure creative power, it sounds like you've just forgotten what it was like.
Yeah, tbh I used to be a bit agentic coding tool-pilled, but over the past four months I've come to realize that if this industry evolves in a direction where I don't actually get to write code anymore, I'm just going to quit.
Code is the only good thing about the tech industry. Everything else is capitalist hellscape shareholder dystopia. Thinking on it, it's hilarious that any self-respecting coder is excited about these tools, because what you're excited for is a world where, now, at best, your entire job is managing unpredictable AI agents while sitting in meetings all day to figure out what to tell your AI agents to build. You don't get to build the product you want. You don't get to build it how you want. You'll be a middle manager that gets to orchestrate the arguments between the middle manager you already had and the inflexible computer.
You don't have to participate in a future you aren't interested in. The other day my boss asked me if I could throw Cursor at some task we've had backlogged for a while. I said "for sure my dude" then I just did it myself. It took me like four hours, and my boss was very impressed with how fast Cursor was able to do it, and how high quality the code was. He loves the Cursor metrics dashboard for "lines accepted" or whatever, every time he screenshares he has that tab open, so sometimes I task it on complicated nonsense tasks then just throw away the results. Seeing the numbers go up makes him happy, which makes my life easier, so it's a win-win. Our CTO is really proud of "what percentage of our code is AI written" but I'm fairly certain that even the engineers who use it in earnest actually commit, like, 5% of what Cursor generates (and many do not use it in earnest).
The sentiment shift I've observed among friends and coworkers has been insane over the past two months. Literally no one cares about it anymore. The usage is still there, but it's a lot more either my situation or just a "spray and pray" situation that creates a ton of disillusioned water cooler conversations.
This pretty much sums up my experience.
If you care about this so much why don't you use one of the open source OpenAI models? They're pretty good and give you the guarantees you want.
None of the open weight models are really as good as SOTA stuff, whatever their evals say. This might not actually manifest if the task at hand is simple enough, but once you hit the threshold it's really obvious.
> where I feel so disconnected from my codebase I'd rather just delete it than continue.
If you allow your codebase to grow unfamiliar, even unrecognisable to you, that's on you, not the AI. Chasing some illusion of control via LLM output reproducibility won't fix the systemic problem of you integrating code that you do not understand.
Who cares about the blame? It would just be useful if the tools were better at this task in many particular ways.
It's not blame, it's useful feedback. For a large application you have to understand what different parts are doing and how everything is put together, otherwise no amount of tools will save you.
The process of writing the code, thinking all the while, is how most humans learn a codebase. Integrating alien code sequentially disrupts this process, even if you understand individual components. The solution is to methodically work through the codebase, reading, writing, and internalizing its structure, and comparing that to the known requirements. And yet, if this is always required of you as a professional, what value did the LLM add beyond speeding up your typing while delaying the required thinking?
I completely agree.
And now imagine you'd have to rely on humans to build your software instead.
This is the question though, isn't it?
With sufficient structure and supervision, will a "team" of agents out-perform a team of humans?
Military, automotive and other industries have developed rigorous standards consisting of, among other things, detailed processes for developing software.
Can there be an AI waterfall? With sufficiently unambiguous, testable requirements, and a nice scaffolding of process, is it possible to achieve the dream of managers, and eliminate software engineers? My intuition is evenly split.