OpenAI Codex CLI: Lightweight coding agent that runs in your terminal

2025-04-16 17:24 · github.com


npm i -g @openai/codex

Codex demo GIF using: codex "explain this codebase to me"

Install globally:

npm install -g @openai/codex

Next, set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="your-api-key-here"

Note: This command sets the key only for your current terminal session. To make it permanent, add the export line to your shell's configuration file (e.g., ~/.zshrc).
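A quick way to do that (assuming zsh with its config at ~/.zshrc; substitute ~/.bashrc for bash):

```shell
# Append the export to the shell config so new sessions inherit the key
printf 'export OPENAI_API_KEY="your-api-key-here"\n' >> "$HOME/.zshrc"
# Load it into the current session as well
source "$HOME/.zshrc"
```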

Run interactively:

codex

Or, run with a prompt as input (and optionally in Full Auto mode):

codex "explain this codebase to me"
codex --approval-mode full-auto "create the fanciest todo-list app"

That’s it – Codex will scaffold a file, run it inside a sandbox, install any missing dependencies, and show you the live result. Approve the changes and they’ll be committed to your working directory.

Codex CLI is built for developers who already live in the terminal and want ChatGPT‑level reasoning plus the power to actually run code, manipulate files, and iterate – all under version control. In short, it’s chat‑driven development that understands and executes your repo.

  • Zero setup — bring your OpenAI API key and it just works!
  • Full auto-approval, while safe + secure by running network-disabled and directory-sandboxed
  • Multimodal — pass in screenshots or diagrams to implement features ✨

And it's fully open-source so you can see and contribute to how it develops!

Codex lets you decide how much autonomy the agent receives via the --approval-mode flag (or the interactive onboarding prompt):

| Mode | What the agent may do without asking | Still requires approval |
| --- | --- | --- |
| Suggest (default) | Read any file in the repo | All file writes/patches; all shell/Bash commands |
| Auto Edit | Read files; apply‑patch writes to files | All shell/Bash commands |
| Full Auto | Read/write files; execute shell commands | – |

In Full Auto every command is run network‑disabled and confined to the current working directory (plus temporary files) for defense‑in‑depth. Codex will also show a warning/confirmation if you start in auto‑edit or full‑auto while the directory is not tracked by Git, so you always have a safety net.

Coming soon: you’ll be able to whitelist specific commands to auto‑execute with the network enabled, once we’re confident in additional safeguards.

The hardening mechanism Codex uses depends on your OS:

  • macOS 12+ – commands are wrapped with Apple Seatbelt (sandbox-exec).

    • Everything is placed in a read‑only jail except for a small set of writable roots ($PWD, $TMPDIR, ~/.codex, etc.).
    • Outbound network is fully blocked by default – even if a child process tries to curl somewhere it will fail.
  • Linux – we recommend using Docker for sandboxing, where Codex launches itself inside a minimal container image and mounts your repo read/write at the same path. A custom iptables/ipset firewall script denies all egress except the OpenAI API. This gives you deterministic, reproducible runs without needing root on the host. You can read more in run_in_container.sh

Both approaches are transparent to everyday usage – you still run codex from your repo root and approve/reject steps as usual.

| Requirement | Details |
| --- | --- |
| Operating systems | macOS 12+, Ubuntu 20.04+/Debian 10+, or Windows 11 via WSL2 |
| Node.js | 22 or newer (LTS recommended) |
| Git (optional, recommended) | 2.23+ for built‑in PR helpers |
| RAM | 4 GB minimum (8 GB recommended) |

Never run sudo npm install -g; fix npm permissions instead.

| Command | Purpose | Example |
| --- | --- | --- |
| codex | Interactive REPL | codex |
| codex "…" | Initial prompt for interactive REPL | codex "fix lint errors" |
| codex -q "…" | Non‑interactive "quiet mode" | codex -q --json "explain utils.ts" |

Key flags: --model/-m, --approval-mode/-a, and --quiet/-q.

Codex merges Markdown instructions in this order:

  1. ~/.codex/instructions.md – personal global guidance
  2. codex.md at repo root – shared project notes
  3. codex.md in cwd – sub‑package specifics

Disable with --no-project-doc or CODEX_DISABLE_PROJECT_DOC=1.
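As an illustration of that precedence, the throwaway demo below appends later docs after earlier ones, so more specific guidance can refine the general (the paths and file contents are invented for the example; this is not Codex's actual merge code):

```shell
# Stand-in directories for two of the locations Codex checks
demo=$(mktemp -d)
mkdir -p "$demo/home/.codex" "$demo/repo"
echo "- personal global guidance" > "$demo/home/.codex/instructions.md"
echo "- shared project notes"     > "$demo/repo/codex.md"
# Concatenate in precedence order: global guidance first, project notes after
cat "$demo/home/.codex/instructions.md" "$demo/repo/codex.md"
```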

Run Codex head‑less in pipelines. Example GitHub Action step:

- name: Update changelog via Codex
  run: |
    npm install -g @openai/codex
    export OPENAI_API_KEY="${{ secrets.OPENAI_KEY }}"
    codex -a auto-edit --quiet "update CHANGELOG for next release"

Set CODEX_QUIET_MODE=1 to silence interactive UI noise.

Below are a few bite‑size examples you can copy‑paste. Replace the text in quotes with your own task.

| # | What you type | What happens |
| --- | --- | --- |
| 1 | codex "Refactor the Dashboard component to React Hooks" | Codex rewrites the class component, runs npm test, and shows the diff. |
| 2 | codex "Generate SQL migrations for adding a users table" | Infers your ORM, creates migration files, and runs them in a sandboxed DB. |
| 3 | codex "Write unit tests for utils/date.ts" | Generates tests, executes them, and iterates until they pass. |
| 4 | codex "Bulk‑rename *.jpeg → *.jpg with git mv" | Safely renames files and updates imports/usages. |
| 5 | codex "Explain what this regex does: ^(?=.*[A-Z]).{8,}$" | Outputs a step‑by‑step human explanation. |
| 6 | codex "Carefully review this repo, and propose 3 high impact well-scoped PRs" | Suggests impactful PRs in the current codebase. |
| 7 | codex "Look for vulnerabilities and create a security review report" | Finds and explains security bugs. |
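As a sanity check for example 5, the lookahead regex can be exercised directly from the shell (this assumes GNU grep, whose -P flag enables Perl-compatible lookaheads):

```shell
# ^(?=.*[A-Z]).{8,}$  – at least 8 characters, with at least one uppercase letter
echo "Passw0rd" | grep -qP '^(?=.*[A-Z]).{8,}$' && echo "match"
echo "password" | grep -qP '^(?=.*[A-Z]).{8,}$' || echo "no match (missing uppercase)"
```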
From npm (Recommended)

npm install -g @openai/codex
# or
yarn global add @openai/codex

Build from source

# Clone the repository and navigate to the CLI package
git clone https://github.com/openai/codex.git
cd codex/codex-cli

# Install dependencies and build
npm install
npm run build

# Run the locally‑built CLI directly
node ./dist/cli.js --help

# Or link the command globally for convenience
npm link

Codex looks for config files in ~/.codex/.

# ~/.codex/config.yaml
model: o4-mini # Default model
fullAutoErrorMode: ask-user # or ignore-and-continue

You can also define custom instructions:

# ~/.codex/instructions.md
- Always respond with emojis
- Only use git commands if I explicitly mention you should
How do I stop Codex from touching my repo?

Codex always runs in a sandbox first. If a proposed command or file change looks suspicious you can simply answer n when prompted and nothing happens to your working tree.

Does it work on Windows?

Not directly. Codex requires WSL2 on Windows – it is tested on macOS and Linux with Node ≥ 22.

Which models are supported?

Any model available via the Responses API. The default is o4-mini, but pass --model gpt-4o or set model: gpt-4o in your config file to override.

This project is under active development, and the code will likely change significantly. We'll update this notice once things stabilize!

More broadly, we welcome contributions – whether you are opening your very first pull request or you're a seasoned maintainer. At the same time we care about reliability and long‑term maintainability, so the bar for merging code is intentionally high. The guidelines below spell out what "high‑quality" means in practice and should make the whole process transparent and friendly.

  • Create a topic branch from main – e.g. feat/interactive-prompt.
  • Keep your changes focused. Multiple unrelated fixes should be opened as separate PRs.
  • Use npm run test:watch during development for super‑fast feedback.
  • We use Vitest for unit tests, ESLint + Prettier for style, and TypeScript for type‑checking.
  • Make sure all your commits are signed off with git commit -s ..., see Developer Certificate of Origin (DCO) for more details.
# Watch mode (tests rerun on change)
npm run test:watch

# Type‑check without emitting files
npm run typecheck

# Automatically fix lint + prettier issues
npm run lint:fix
npm run format:fix
  1. Start with an issue. Open a new one or comment on an existing discussion so we can agree on the solution before code is written.
  2. Add or update tests. Every new feature or bug‑fix should come with test coverage that fails before your change and passes afterwards. 100 % coverage is not required, but aim for meaningful assertions.
  3. Document behaviour. If your change affects user‑facing behaviour, update the README, inline help (codex --help), or relevant example projects.
  4. Keep commits atomic. Each commit should compile and the tests should pass. This makes reviews and potential rollbacks easier.
  • Fill in the PR template (or include similar information) – What? Why? How?
  • Run all checks locally (npm test && npm run lint && npm run typecheck). CI failures that could have been caught locally slow down the process.
  • Make sure your branch is up‑to‑date with main and that you have resolved merge conflicts.
  • Mark the PR as Ready for review only when you believe it is in a merge‑able state.
  1. One maintainer will be assigned as a primary reviewer.
  2. We may ask for changes – please do not take this personally. We value the work, we just also value consistency and long‑term maintainability.
  3. When there is consensus that the PR meets the bar, a maintainer will squash‑and‑merge.
  • Be kind and inclusive. Treat others with respect; we follow the Contributor Covenant.
  • Assume good intent. Written communication is hard – err on the side of generosity.
  • Teach & learn. If you spot something confusing, open an issue or PR with improvements.

If you run into problems setting up the project, would like feedback on an idea, or just want to say hi – please open a Discussion or jump into the relevant issue. We are happy to help.

Together we can make Codex CLI an incredible tool. Happy hacking! 🚀

All commits must include a Signed‑off‑by: footer.
This one‑line self‑certification tells us you wrote the code and can contribute it under the repo’s license.

# squash your work into ONE signed commit
git reset --soft origin/main # stage all changes
git commit -s -m "Your concise message"
git push --force-with-lease # updates the PR

We enforce squash‑and‑merge only, so a single signed commit is enough for the whole PR.

| Scenario | Command |
| --- | --- |
| Amend last commit | git commit --amend -s --no-edit && git push -f |
| GitHub UI only | Edit the commit message in the PR and add: Signed-off-by: Your Name <email@example.com> |

The DCO check blocks merges until every commit in the PR carries the footer (with squash this is just the one).
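What -s actually appends can be seen in a throwaway repo (the name, email, and file here are placeholders):

```shell
# Create a scratch repo and make one signed-off commit
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name  "Dev Example"
git config user.email "dev@example.com"
echo "hello" > file.txt && git add file.txt
git commit -q -s -m "demo: show the DCO footer"
# The commit message now ends with the self-certification footer
git log -1 --format=%B
```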

Have you discovered a vulnerability or have concerns about model output? Please e‑mail security@openai.com and we will respond promptly.

This repository is licensed under the Apache-2.0 License.



Comments

  • By gklitt 2025-04-16 20:37 (7 replies)

    I tried one task head-to-head with Codex o4-mini vs Claude Code: writing documentation for a tricky area of a medium-sized codebase.

    Claude Code did great and wrote pretty decent docs.

    Codex didn't do well. It hallucinated a bunch of stuff that wasn't in the code, and completely misrepresented the architecture - it started talking about server backends and REST APIs in an app that doesn't have any of that.

    I'm curious what went so wrong - feels like possibly an issue with loading in the right context and attending to it correctly? That seems like an area that Claude Code has really optimized for.

    I have high hopes for o3 and o4-mini as models so I hope that other tests show better results! Also curious to see how Cursor etc. incorporate o3.

    • By strangescript 2025-04-16 23:08 (3 replies)

      Claude Code still feels superior. o4-mini has all sorts of issues. o3 is better but at that point, you aren't saving money so who cares.

      I feel like people are sleeping on Claude Code for one reason or another. Its not cheap, but its by far the best, most consistent experience I have had.

      • By artdigital 2025-04-17 2:19 (2 replies)

        Claude Code is just way too expensive.

        These days I’m using Amazon Q Pro on the CLI. Very similar experience to Claude Code minus a few batteries. But it’s capped at $20/mo and won’t set my credit card on fire.

        • By aitchnyu 2025-04-17 5:46 (1 reply)

          Is it using one of these models? https://openrouter.ai/models?q=amazon

          Seems 4x costlier than my Aider+Openrouter. Since I'm less about vibes or huge refactoring, my (first and only) bill is <5 usd with Gemini. These models will halve that.

          • By artdigital 2025-04-17 11:46 (2 replies)

            No, Amazon Q is using Amazon Q. You can't change the model, it's calling itself "Q" and it's capped to $20 (Q Developer Pro plan). There is also a free tier available - https://aws.amazon.com/q/developer/

            It's very much a "Claude Code" in the sense that you have a "q chat" command line command that can do everything from changing files, running shell commands, reading and researching, etc. So I can say "q chat" and then tell it "read this repo and create a README" or whatever else Claude Code can do. It does everything by itself in an agentic way. (I didn't want to say like 'Aider' because the entire appeal of Claude Code is that it does everything itself, like figuring out what files to read/change)

            (It's calling itself Q but from my testing it's pretty clear that it's a variant of Claude hosted through AWS which makes sense considering how much money Amazon pumped into Anthropic)

            • By dingnuts 2025-04-17 15:44 (2 replies)

              > the entire appeal of Claude Code is that it does everything itself, like figuring out what files to read/change

              how is this appealing? I think I must be getting old because the idea of letting a language model run wild and run commands on my system -- that's unsanitized input! -- horrifies me! What do you mean, just let it change random files??

              I'm going to have to learn a new trade, IDK

              • By winrid 2025-04-17 18:38

                It shows you the diff and you confirm it, asks you before running commands, and doesn't allow accessing files outside the current dir. You can also tell it to not ask again and let it go wild, I've built full features this way and then just go through and clean it up a bit after.

              • By hmottestad 2025-04-18 6:32

                In the OpenAI demo of codex they said that it’s sandboxed.

                It only has access to files within the directory it’s run from, even if it calls tools that could theoretically access files anywhere on your system. Also had networking blocked, also in a sandboxes fashion so that things like curl don’t work either.

                I wasn’t particularly impressed with my short test of Codex yesterday. Just the fact that it managed to make any decent changes at all was good, but when it messed up the code it took a long time and a lot of tokens to figure out.

                I think we need fine tuned models that are good at different tasks. A specific fine tune for fixing syntax errors in Java would be a good start.

                In general it also needs to be more proactive in writing and running tests.

            • By aitchnyu 2025-04-17 11:55 (1 reply)

              I felt Sonnet 3.7 would cost at least $30 a month for light use. Did they figure out a way to offer it cheaper?

              • By nmcfarl 2025-04-17 12:29

                I don’t know what Amazon did - but I use Aider+Openrouter with Gemini 2.5 pro and it cost 1/6 of what sonnet 3.7 does. The aider leaderboard https://aider.chat/docs/leaderboards/ - includes relative pricing theses days.

        • By monsieurbanana 2025-04-17 8:21 (1 reply)

          > Upgrade apps in a fraction of the time with the Amazon Q Developer Agent for code transformation (limit 4,000 lines of submitted code per month)

          4k loc per month seems terribly low? Any request I make could easily go over that. I feel like I'm completely misunderstanding (their fault though) what they actually meant.

          Edit: No I don't think I'm misunderstanding, if you want to go over this they direct you to a pay-per-request plan and you are not capped at $20 anymore

          • By artdigital 2025-04-17 12:07 (1 reply)

            You are confusing Amazon Q in the editor (like "transform"), and Amazon Q on the CLI. The editor thing has some stuff that costs extra after exceeding the limit, but the CLI tool (that acts similar to Claude Code) is a separate feature that doesn't have this restriction. See https://aws.amazon.com/q/developer/pricing/?p=qdev&z=subnav&..., under "Console" see "Chat". The list is pretty accurate with what's "included" and what costs extra.

            I've been running this almost daily for the past months without any issues or extra cost. Still just paying $20

            • By monsieurbanana 2025-04-19 9:19 (1 reply)

              I see, thanks. The 4k limit for the gui still seems so low, but I might try the cli sometime.

              • By artdigital 2025-04-22 5:16

                Do try! The free tier doesn't cost anything and is enough to tinker around with. You don't even need an AWS account for it, it'll prompt you to create a new separate account specifically for Q

      • By ekabod 2025-04-17 0:21 (3 replies)

        "gemini 2.5 pro exp" is superior to Claude Sonnet 3.7 when I use it with Aider [1]. And it is free (with some high limit).

        [1]https://aider.chat/

        • By razemio 2025-04-17 5:39 (2 replies)

          Compared to cline, aider had no chance the last time I tried it (4 months ago). Has it really changed? Always thought cline is superior because it focuses on sonnet with all its bells and whistles, while aider tries to be a universal IDE coding agent which works well with all models.

          When I try gemini 2.5 pro exp with cline it does very well but often fails to use the tools provided by cline, which makes it way less expensive while failing random basic tasks sonnet does in its sleep. I pay the extra to save the time.

          Do not get me wrong. Maybe I am totally outdated with my opinion. It is hard to keep up these days.

          • By ekabod 2025-04-17 18:42

            I tried Cline, but I work faster using the command line style of Aider. Having the /run command to execute a script and having the console content added to the prompt, makes fixing bugs very fast.

          • By mstipetic 2025-04-17 6:15

            It has multiple edit modes, you have to pair them up properly

        • By jacooper 2025-04-17 0:23 (1 reply)

          Don't they train on your inputs if you use the free Ai studio api key?

          • By asadm 2025-04-17 0:27

            speaking for myself, I am happy to make that trade. As long as I get unrestricted access to latest one. Heck, most of my code now is written by gemini anyway haha.

        • By strangescript 2025-04-21 19:49

          I would use Aider if it had an agent mode. It needs to catch up with UX, frankly just have a mode that copies what claude code does.

      • By Aeolun 2025-04-16 23:25 (2 replies)

        > Its not cheap, but its by far the best, most consistent experience I have had.

        It’s too expensive for what it does though. And it starts failing rapidly when it exhausts the context window.

        • By jasonjmcghee 2025-04-17 0:27

          If you get a hang of controlling costs, it's much cheaper. If you're exhausting the context window, I'm not surprised you're seeing high cost.

          Be aware of the "cache".

          Tell it to read specific files, never use /compact (that'll bust cache, if you need to, you're going back and forth too much or using too many files at once).

          Never edit files manually during a session (that'll bust cache). THIS INCLUDES LINT.

          Have a clear goal in mind and keep sessions to as few messages as possible.

          Write / generate markdown files with needed documentation using claude.ai, and save those as files in the repo and tell it to read that file as part of a question.

          I'm at about ~$0.5-0.75 for most "tasks" I give it. I'm not a super heavy user, but it definitely helps me (it's like having a super focused smart intern that makes dumb mistakes).

          If i need to feed it a ton of docs etc. for some task, it'll be more in the few $, rather than < $1. But I really only do this to try some prototype with a library claude doesn't know about (or is outdated).

          For hobby stuff, it adds up - totally.

          For a company, massively worth it. Insanely cheap productivity boost (if developers are responsible / don't get lazy / don't misuse it).

        • By Implicated 2025-04-17 2:33 (6 replies)

          I keep seeing this sentiment and it's wild to me.

          Sure, it might cost a few dollars here and there. But what I've personally been getting from it, for that cost, is so far away from "expensive" it's laughable.

          Not only does it do things I don't want to do, in a _super_ efficient manner. It does things I don't know how to do - contextually, within my own project, such that when it's done I _do_ know how to do it.

          Like others have said - if you're exhausting the context window, the problem is you, not the tool.

          Example, I have a project where I've been particularly lazy and there's a handful of models that are _huge_. I know better than to have Claude read those models into context - that would be stupid. Rather - I tell it specifically what I want to do within those models, give it specific method names and tell it not to read the whole file, rather search for and read the area around the method definition.

          If you _do_ need it to work with very large files - they probably shouldn't be that large and you're likely better off refactoring those files (with Claude, of course) to abstract out where you can and reduce the line count. Or, if anything, literally just temporarily remove a bunch of code from the huge files that isn't relevant to the task so that when it reads it it doesn't have to pull all of that into context. (ie: Copy/paste the file into a backup location, delete a bunch of unrelated stuff in the working file, do your work with claude then 'merge' the changes to the backup file and copy it back)

          If a few dollars here and there for getting tasks done is "too expensive" you're using it wrong. The amount of time I'm saving for those dollars is worth many times the cost and the number of times that I've gotten unsatisfactory results from that spending has been less than 5.

          I see the same replies to these same complaints everywhere - people complaining about how it's too expensive or becomes useless with a full context. Those replies all state the same thing - if you're filling the context, you've already screwed it up. (And also, that's why it's so expensive)

          I'll agree with sibling commenters - have claude build documentation within the project as you go. Try to keep tasks silo'd - get in, get the thing done, document it and get out. Start a new task. (This is dependent on context - if you have to load up the context to get the task done, you're incentivized to keep going rather than dump and reload with a new task/session, thus paying the context tax again - but you also are going to get less great results... so, lesson here... minimize context.)

          100% of the time that I've gotten bad results/gone in circles/gotten hallucinations was when I loaded up the context or got lazy and didn't want to start new sessions after finishing a task and just kept moving into new tasks. If I even _see_ that little indicator on the bottom right about how much context is available before auto-compact I know I'm getting less-good functionality and I need to be careful about what I even trust it's saying.

          It's not going to build your entire app in a single session/context window. Cut down your tasks into smaller pieces, be concise.

          It's a skill problem. Not the tool.

          • By someothherguyy 2025-04-17 4:21 (3 replies)

            How can it be a skill problem when the tool itself is sold as being skilled?

            • By mirsadm 2025-04-17 5:06 (1 reply)

              You're using it wrong, you're using the wrong version etc etc insert all the excuses how it's never the tool but the users fault.

              • By Implicated 2025-04-17 6:47 (1 reply)

                If this is truly your perspective, you've already lost the plot.

                It's almost always the users fault when it comes to tools. If you're using it and it's not doing its 'job' well - it's more likely that you're using it wrong than it is that it's a bad tool. Almost universally.

                Right tool for the job, etc etc. Also important that you're using it right, for the right job.

                Claude Code isn't meant to refactor entire projects. If you're trying to load up 100k token "whole projects" into it - you're using it wrong. Just a fact. That's not what this tool is designed to do. Sure.. maybe it "works" or gets close enough to make people think that is what it's designed for, but it's not.

                Detailed, specific work... it excels, so wildly, that it's astonishing to me that these takes exist.

                In saying all of that, there _are_ times I dump huge amounts of context into it (Claude, projects, not Claude Code - cause that's not what it's designed for) and I don't have "conversations" with it in that manner. I load it up with a bunch of context, ask my question/give it a task and that first response is all you need. If it doesn't solve your concern, it should shine enough light that you now know how you want to address it in a more granular fashion.

                • By troupo 2025-04-17 7:35

                  The unpredictable non-deterministic black box with an unknown training set, weights and biases is behaving contrary to how it's advertised? The fault lies with the user, surely.

            • By mwigdahl 2025-04-17 5:54

              A junior developer is skilled too, but still requires a senior’s guidance to keep them focused and on track. Just because a tool has built in intelligence doesn’t mean it can read your intentions from nothing if you fail to communicate to it well.

            • By Implicated 2025-04-17 6:49 (1 reply)

              Serious question?

              Is it a tool problem or a skill problem when a surgeon doesn't know how to use a robotic surgery assistant/robot?

          • By threecheese 2025-04-17 15:02

            How can one develop this skill via trial and error if the cost is unknowably high? Before reasoning, it was less important when tokens are cheap, but mixing models, some models being expensive to use, and reasoning blowing up the cost, having to pay even five bucks to make a mistake sure makes the cost seem higher than the value. A little predictability here would go a long way in growing the use of these capabilities, and so one should wonder why cost predictability doesn’t seem to be important to the vendors - maybe the value isn’t there, or is only there for the select few that can intuit how to use the tech effectively.

          • By afletcher 2025-04-17 10:37

            Thanks for sharing. Are you able to control the context when using Claude Code, or are you using other tools that give you greater control over what context to provide? I haven't used Claude Code enough to understand how smart it is at deciding what context to load itself and if you can/need to explicitly manage it yourself.

          • By disqard 2025-04-17 3:49 (1 reply)

            This comment echoes my own experience with Claude. Especially the advice about only pulling in the context you need.

            I'm a paying customer and I know my time is sufficiently valuable that this kind of technology pays for itself.

            As an analogy, I liken it to a scribe (author's assistant).

            Your comment has lots of useful hints -- thanks for taking the time to write them up!

            • By Implicated 2025-04-17 6:50

              I like the scribe analogy. And, just like a scribe, my primary complaint with claude code isn't the cost or the context - but the speed. It's just so slow :D

          • By siva7 2025-04-17 5:24

            True. Matches my experience. It takes much effort to get really proficient with ai. It's like learning to ride a wild horse. Your senior dev skills will sure come handy in this ride but don't expect it to work like some google query

          • By Aeolun 2025-04-18 17:20

            > It's not going to build your entire app in a single session/context window.

            I mean, it was. Right up until it exhausted the context window. Then it suddenly required hand holding.

            If I wanted to do that I might as well use Cursor.

    • By ilaksh 2025-04-16 20:54 (1 reply)

      Did you try the same exact test with o3 instead? The mini models are meant for speed.

      • By gklitt 2025-04-16 21:57

        I want to but I’ve been having trouble getting o3 to work - lots of errors related to model selection.

    • By ksec 2025-04-17 13:37

      Sometimes I see areas where AI/LLM is absolutely crushing those jobs – a whole category will be gone in the next 5 to 10 years, as they are already at the 80–90% mark. They just need another 5–10% as they continue to improve, and they are already cheaper per task.

      Sometimes I see an area of AI/LLM where I think even a 10x efficiency improvement and 10x hardware resources – 100x in aggregate – would still be nowhere near good enough.

      The truth is probably somewhere in the middle. Which is why I dont believe AGI will be here any time soon. But Assisted Intelligence is no doubt in its iPhone moment and continue for another 10 years before hopefully another breakthrough.

    • By enether 2025-04-17 10:30 (1 reply)

      there was one post that detailed how those OpenAI models hallucinate and double down on their mistakes by "lying" - it speculated on a bunch of interesting reasons why this may be the case

      recommended read - https://transluce.org/investigating-o3-truthfulness

      I wonder if this is what's causing it to do badly in these cases

      • By victor9000 2025-04-21 16:56

        > I no longer have the “real” prime I generated during that earlier session... I produced it in a throw‑away Python process, verified it, copied it to the clipboard, and then closed the interpreter.

        AGI may well be on its way, as the model is mastering the fine art of bullshitting.

    • By kristopolous 2025-04-17 8:46

      Ever use Komment? They've been in the game a while. Looks pretty good

  • By swyx 2025-04-16 17:37 (2 replies)

    related demo/intro video: https://x.com/OpenAIDevs/status/1912556874211422572

    this is a direct answer to claude code which has been shipping furiously: https://x.com/_catwu/status/1903130881205977320

    and is not open source; there are unverified comments that they have DMCA'ed decompilations https://x.com/vikhyatk/status/1899997417736724858?s=46

    by total coincidence we're releasing our claude code interview later this week that touches on a lot of these points + why code agent CLIs are an actually underrated point in the SWE design space

    (TLDR you can use it like a linux utility - similar to @simonw's `llm` - to sprinkle intelligence in all sorts of things like CI/PR review without the overhead of buying a Devin or a Copilot SaaS)

    if you are a Claude Code (and now OAI Codex) power user we want to hear use cases - CFP closing soon, apply here https://sessionize.com/ai-engineer-worlds-fair-2025

    • By axkdev 2025-04-16 19:51 (1 reply)

      Hey! The weakest part of Claude Code I think is that it's closed source and locked to Claude models only. If you are looking for inspiration, Roo is the best tool atm. It offers far more interesting capabilities. Just to name some - user-defined modes, the built-in debug mode is great for debugging, architecture mode. You can, for example, ask it to summarize some part of the running task and start a new task with fresh context. And, unlike in Claude Code, in Roo the LLM will actually follow your custom instructions (seriously, guys, that Claude.md is absolutely useless)! The only drawback of Roo, in my opinion, is that it is NOT a cli.

      • By kristopolous 2025-04-16 22:09

        There's Goose. Plandex and Aider, too. Also, there's Kilo, a new fork of Roo.

    • By senko 2025-04-16 19:56

      I got confused, so to clarify to myself and others - codex is open source, claude code isn't, and the referenced decompilation tweets are for claude code.

  • By asadm 2025-04-16 22:45 (4 replies)

    These days, I usually paste my entire (or some) repo into gemini and then APPLY changes back into my code using this handy script I wrote: https://github.com/asadm/vibemode

    I have tried aider/copilot/continue/etc. But they lack in one way or the other.
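
The apply-back step such a workflow needs can be sketched in a few lines of Python. This is a hedged illustration, not vibemode's actual protocol: the `===== path =====` header format and the `apply_reply` function name are assumptions made up for this example.

```python
import re
from pathlib import Path

def apply_reply(reply: str, root: Path) -> list[Path]:
    """Write each '===== path =====' block from a model reply back to disk.

    Hypothetical reply format -- the real tool defines its own.
    """
    written = []
    header = re.compile(r"^===== (.+?) =====$", re.MULTILINE)
    parts = header.split(reply)
    # parts = [preamble, path1, body1, path2, body2, ...]
    for path, body in zip(parts[1::2], parts[2::2]):
        target = root / path.strip()
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(body.strip("\n") + "\n")
        written.append(target)
    return written
```

Asking the model to answer only in path-headed blocks like this makes the reply trivially machine-applyable, which is the property this workflow depends on.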

    • By jwpapi 2025-04-16 23:14 (2 replies)

      It's not just about saving money or making fewer mistakes; it's also about iteration speed. I can't believe this process is remotely comparable to aider.

      In aider everything is loaded in memory: I can add or drop files in the terminal, discuss in the terminal, switch models, every change is a commit, and run terminal commands with ! at the start.

      The full codebase is more expensive and slower than just the relevant files. I understand when you don't worry about the cost, but at reasonable size, pasting the full codebase can't really be a thing.

      • By asadm 2025-04-16 23:20

        I am at my 5th project in this workflow and these are of different types too:

        - an embedded project for esp32 (100k tokens)

        - visual inertial odometry algorithm (200k+ tokens)

        - a web app (60k tokens)

        - the tool itself mentioned above (~30k tokens)

        it has worked well enough for me. Other methods have not.

      • By t1amat 2025-04-17 1:14

        Use a tool like repomix (npm), which has extensions in some editors (at least VSCode) that can quickly bundle source files into a machine-readable format.

    • By brandall10 2025-04-16 22:48 (1 reply)

      Why not just select Gemini Pro 2.5 in Copilot with Edit mode? Virtually unlimited use without extra fees.

      Copilot used to be useless, but over the last few months has become quite excellent once edit mode was added.

      • By asadm 2025-04-16 22:55 (7 replies)

        copilot (and others) try to be too smart and do context reduction (to save their own wallets). I want the ENTIRETY of the files I attached in context, not a RAG-ed version of it.

        • By bredren 2025-04-16 23:40 (2 replies)

          This problem is real.

          Claude Projects, chatgpt projects, Sourcegraph Cody context building, MCP file systems, all of these are black boxes of what I can only describe as lossy compression of context.

          Each is incentivized to deliver ~”pretty good” results at the highest token compression possible.

          The best way around this I’ve found is to just own the web clients by including structured, concatenation related files directly in chat contexts.

          Self plug but super relevant: I built FileKitty specifically to aid this, which made HN front page and I’ve continued to improve:

          https://news.ycombinator.com/item?id=40226976

          If you can prepare your file-system context yourself using any workflow quickly, and pair it with appropriate additional context such as run output, problem description, etc., you can get excellent results, and you can pound away at an OpenAI or Anthropic subscription refining the prompt or updating the file context.

          I have been finding myself spending more time putting together prompt complexity for big, difficult problems that would not make sense to solve in the IDE.

          • By airstrike 2025-04-17 0:21

            > The best way around this I’ve found is to just own the web clients by including structured, concatenation related files directly in chat contexts.

            Same. I used to run a bash script that concatenates files I'm interested in and annotates their path/name to the top in a comment. I haven't needed that recently as I think the # of attachments for Claude has increased (or I haven't needed as many small disparate files at once)
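
A rough Python equivalent of the concatenation script described above (the `# ===== path =====` header style and the `bundle` name are illustrative, not the commenter's actual script):

```python
from pathlib import Path

def bundle(paths: list[str]) -> str:
    """Concatenate files, annotating each with its path in a leading comment line."""
    chunks = []
    for p in map(Path, paths):
        chunks.append(f"# ===== {p} =====\n{p.read_text()}")
    # A blank line between files keeps the boundaries obvious to the model.
    return "\n\n".join(chunks)
```

Piping the result straight to the clipboard (e.g. `pbcopy` on macOS) makes it a one-step paste into any web chat.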

          • By asadm 2025-04-16 23:53 (1 reply)

            filekitty is pretty cool!

            • By bredren 2025-04-17 5:20

              Thank you! I was glad to read your comments here and see your project.

              I have encountered this issue of reincorporation of LLM code recommendations back into a project so I’m interested in exploring your take.

              I told a colleague that I thought excellent use of copy paste and markdown were some of the chief skills of working with gen AI for code right now.

              This and context management are as important as prompting.

              It makes the details of the UI choices for copying web chat conversations or their segments so strangely important.

        • By nowittyusername 2025-04-16 23:16 (3 replies)

          I believe this is the root of the problem for all agentic coding solutions. They are gimping the full context through fancy function calling and tool use to reduce the full context that is being sent through the API. The problem with this is you can never know what context is actually needed for the problem to be solved in the best way. The funny thing is, this type of behavior actually leads many people to believe these models are LESS capable than they actually are, because people don't realize how restricted these models are behind the scenes by the developers. The good news is, we are entering the era of large context windows, and we will all see a huge performance increase in coding as a result of these advancements.

          • By pzo 2025-04-17 3:56

            OpenAI shared a chart about the performance drop with large context, like 500k tokens, etc. So you still want to limit the context, not only for the cost but for performance as well. You also probably want to limit context to speed up inference and get a response faster.

            I agree though that a lot of those agents are black boxes and hard to even learn how to best combine .rules, llms.txt, prd, mcp, web search, function call, memory. Most IDEs don't provide output where you can inspect final prompts etc to see how those are executed - maybe you have to use some MITMproxy to inspect requests etc but some tool would be useful to learn best practices.

            I will be trying more Roo Code and Cline since they're open source and you can at least see the system prompts etc.

          • By cynicalpeace 2025-04-17 0:11 (1 reply)

            This stuff is so easy to do with Cursor. Just pass in the approximate surface area of the context and it doesn't RAG anything if your context isn't too large.

            • By asadm 2025-04-17 0:19

              I haven't tried recently, but does it tell you if it RAG'ed or not, i.e. can I peek at the context it sent to the model?

          • By asadm 2025-04-16 23:33

            exactly. I understand the reason behind this but it's too magical for me. I just want dumb tooling between me and my LLM.

        • By thelittleone 2025-04-16 23:18 (1 reply)

          Regarding context reduction. This got me wondering. If I use my own API key, there is no way for the IDE or copilot provider to benefit other than the monthly sub. But if I am using their provided model with tokens from the monthly subscription, they are incentivized to charge me based on the tokens I submit to them, but then optimize that and pass on a smaller request to the LLM and get more margin. Is that what you are referring to?

          • By asadm 2025-04-16 23:39

            Yup, but there was also a good reason to do this. Models work better with smaller context. Which is why I rely on Gemini for this lazy/inefficient workflow of mine.

        • By brandall10 2025-04-16 23:08 (2 replies)

          FWIW, Edit mode gives the impression of doing this, vs. originally only passing the context visible from the open window.

          You can choose files to include and they don't appear to be truncated in any way. Though to be fair, I haven't checked the network traffic, but it appears to operate in this fashion from day to day use.

          • By bredren 2025-04-16 23:31

            I’d be curious to hear what actually goes through the network request.

          • By asadm 2025-04-16 23:31 (1 reply)

            I will try again, but last time I tried adding a folder to Edit mode and asking it to list the files it sees, it didn't list them all.

            • By brandall10 2025-04-17 0:25

              I like to use "Open Editors". That way, it's only the code I'm currently working on that is added to the context; it seems a more natural way to work.

        • By AaronAPU 2025-04-16 23:22 (1 reply)

          Is that why it’s so bad? I’ve been blown away by how bad it is. Never had a single successful edit.

          The code completion is chef's kiss though.

          • By asadm 2025-04-16 23:32

            probably but also most models start to lose it after a certain context size (usually 10-20k). Which is why I use gemini (via aistudio) for my workflow.

        • By siva7 2025-04-17 7:46

          Thanks, most people don't understand this fine difference. Copilot does RAG (as all other subscription-based agents like Cursor) to save $$$, and results with RAG are significantly worse than having the complete context window for complex tasks. That's also the reason why Chatgpt or Claude basically lie to the users when they market their file upload functions by not telling the whole story.

        • By MrBuddyCasino 2025-04-17 9:32

          Cline doesn’t do this - this is what makes it suitable for working with Gemini and its large context.

    • By fasdfasdf11234 2025-04-17 13:22 (2 replies)

      Isn't this similar to https://aider.chat/docs/usage/copypaste.html

      Just checked to see how it works. It seems that it does all that you are describing. The difference is in the way that it provides the files - it doesn't use xml format.

      If you wish you could /add * to add all your files.

      Also deducing from this mode it seems that any file that you add to aider chat with /add has its full contents added to the chat context.

      But hey I might be wrong. Did a limited test with 3 files in project.

      • By asadm 2025-04-17 13:44 (1 reply)

        that's correct. aider doesn't RAG on files, which is good. I don't use it because 1) the UI is slow and clunky and 2) using gemini 2.5 via API in this way (huge context window) is expensive but also heavily rate-limited at this point. No such issue when used via the aistudio UI.

        • By fasdfasdf11234 2025-04-17 13:49 (1 reply)

          You could use aider copy-paste with aistudio ui or any other web chat. You could use gemini-2.0-flash for the aider model that will apply the changes. But I understand your first point.

          I also understand having built your own tool to fit your own workflow, and being able to easily mold it to what you need.

          • By asadm 2025-04-17 14:01

            yup exactly. as weird workflows emerge it’s nicer to have your own weird tooling around this until we all converge to one optimal way.

    • By ramraj07 2025-04-16 22:59 (1 reply)

      I felt it loses track of things on really large codebases. I use 16x prompt to choose the appropriate files for my question and let it generate the prompt.

      • By asadm 2025-04-16 23:01

        do you mean gemini? I generally notice pretty great recall up to 200k tokens. It's ~OK after that.

HackerNews