Comments

  • By johnfn 2025-11-1918:5622 reply

    I've been using a lot of Claude and Codex recently.

    One huge difference I notice between Codex and Claude Code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that I've seen it work for 30 minutes to convolute a solution that was only convoluted because of some sentence I threw into the instructions and had completely forgotten about.

    I imagine Codex as the "literal genie" - it'll give you exactly what you asked for. EXACTLY. If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.

    Both these tools have their uses, and I don't think one approach is universally better. Because Claude just hacks its way to a solution, it is really fast, so I like using it for iterative web work, where I need to tweak some styles and I need a fast iterative loop. Codex is much worse at that because it takes like 5 minutes to validate everything is correct. Codex is much better for longer, harder tasks that have to be correct -- I can just write some script to verify that what it did works, and let it spin for 30-40 minutes.

    • By hadlock 2025-11-1919:293 reply

      I've been really impressed with codex so far. I have been working on a flight simulator hobby project for the last 6 months and finally came to the conclusion that I need to switch from a floating origin, which my physics engine's coordinate system assumes, to a true ECEF coordinate system (what underpins GPS). This involved a major rewrite of the coordinate system, the physics engine, even the graphics system and auxiliary stuff like asset loading/unloading etc. that was dependent on local X,Y,Z. It even rewrote the PD autopilot to account for the changes in the coordinate system. I gave it about a paragraph of instructions with a couple of FYIs and... it just worked! No major graphical glitches except a single issue with some minor graphical jitter, which it fixed on the first try. In total it took about 45 minutes, but I was very impressed.

      I was unconvinced it had actually, fully ripped out the floating origin logic, so I had it write up a summary and then used that as a high-level guide to pick through the code, and it had, as you said, followed the instructions to the letter. Hugely impressive. In March of 2023, OpenAI's products struggled to draw a floating wireframe cube.

      • By mbrock 2025-11-2011:20

        I'd been imagining taking the Zig Language Server and adding some refactorings to it—it only had a bare minimum like Rename Symbol. It seemed like a huge project with so much context to get familiar with, so I put it off indefinitely. Then on a whim I decided to just ask GPT-5 (this was before Codex, even, I think?) to give it a go. Plopped it down in the repo and said, basically, implement "Extract Function". And it just kind of... did. The code wasn't beautiful, I could barely understand it, some of which must perhaps be blamed on the existing codebase not being exactly optimized for elegance, but it actually worked. On the first try! We continued to implement a few more refactorings. Eventually I realized the code we were churning out actually needs major revision and rewriting—but it took me from less than zero to "hey, this is actually provably possible and we have a working PoC" in, like, fifteen minutes. Which is pretty insanely valuable.

      • By viking123 2025-11-2011:45

        I think it kind of shines in this type of task. I am building my own game engine and it's very good for this type of refactoring. On some other tasks, though, it clearly makes bad architectural decisions imo, the kind a more junior developer might not catch. For instance, in my game engine it often tries to be too generalist, building something akin to Unity that can do all sorts of games rather than focusing on the type of game I am building it for, unless I very explicitly say so every time.

      • By jama211 2025-11-202:06

        That’s a perfect example and interesting to read, thank you for sharing

    • By nico 2025-11-1919:099 reply

      > Claude basically disregards your instructions (CLAUDE.md) entirely

      A friend of mine tells Claude to always address him as “Mr Tinkleberry”, he says he can tell when Claude is not paying attention to the instructions on CLAUDE.md when Claude stops calling him “Mr Tinkleberry” consistently

      • By benzible 2025-11-1919:12

        Yep, it's David Lee Roth's brown M&M trick https://www.smithsonianmag.com/arts-culture/why-did-van-hale...

      • By awad 2025-11-1919:281 reply

        Highly recommend adding some kind of canary like this in all LLM project instructions. I prefer my instructions to say 'always start output with a (uniquely decided by you) emoji' as it's easier to visually scan for one when reading a wall of LLM output, and I use a different emoji per project, because what's life without a little whim?
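
        For concreteness, the line in my instructions file is nothing fancier than something like this (wording is whatever you like, the emoji choice is per project):

            Canary: start every response with an emoji of your choosing. Pick one for this project and use it consistently.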

        • By wahnfrieden 2025-11-1923:273 reply

          This stuff also becomes context poison however

          • By Uehreka 2025-11-200:061 reply

            Does it actually? One sentence telling the agent to call me “Chris the human serviette” plus the times it calls me that is not going to add that much to the context. What kills the context IME is verbose logs with timestamps.

            • By ramraj07 2025-11-203:122 reply

              Sure, but it's an instruction that applies, and that the model will consider fairly relevant, at every single token. As an extreme example, imagine instructing the LLM not to use the letter E or to output only in French. This isn't as extreme, but it probably does have an effect.

              • By jappgar 2025-11-2014:021 reply

                Not only that, but the whimsical nature of the instruction will lead to a more whimsical conversation.

                The chat is a simulation, and if you act silly, the model will simulate an appropriate response.

                • By wahnfrieden 2025-11-2015:51

                  People are so concerned about preventing a bad result that they sabotage their chances of a good one. Better to strive for the best it can give you and throw out the bad results until it does.

              • By Loic 2025-11-208:27

                La disparition[0], Georges Perec.

                [0]: https://en.wikipedia.org/wiki/A_Void

          • By IsopropylMalbec 2025-11-1923:521 reply

            Sorry, what do you mean?

          • By davidmurdoch 2025-11-2015:101 reply

            A single emoji though?

            • By wahnfrieden 2025-11-2016:031 reply

              It is not a single emoji, it's an instruction to interleave conversation with some nonsense. It can only do harm. It won't help produce a better result and is questionable at preventing a bad one.

              • By davidmurdoch 2025-11-2017:54

                The point is that it _already_ treats the instructions as nonsense. The emoji is a sigil to tell whether it is dismissing the instructions or not.

      • By root_axis 2025-11-201:501 reply

        Something that exhausts me in the LLM era is the never ending deluge of folk magic incantations.

        • By embedding-shape 2025-11-201:542 reply

          Just because you don't understand it doesn't mean it's a "folk magic incantation"; hearing that is also exhausting.

          I don't know the merit of what the parent is saying, but it does make some intuitive sense if you think about it. As the context fills up, the LLM pays less attention to things further and further back in the context; that's why the LLM seems dumber and dumber as a conversation goes on. If you put 5 instructions in the system prompt or initial message, where one acts as a canary, then you can more easily see when exactly it stops following the instructions.

          Personally, I always go for a one-shot answer, and if it gets it wrong or misunderstands, I restart from the beginning. If it doesn't get it right, I need to adjust the prompt and retry. It seems to me all current models get a lot worse quickly once there is some back and forth.

          • By root_axis 2025-11-202:553 reply

            > Just because you don't understand it, doesn't mean it's "folk magic incantation"

            It absolutely is folk magic. I think it is more accurate to impugn your understanding than mine.

            > I don't know the merit to what parent is saying, but it does make some intuitive sense if you think about it.

            This is exactly what I mean by folk magic. Incantations based on vibes. One's intuition is notoriously inclined to agree with one's own conclusions.

            > If you put 5 instructions in the system prompt or initial message, where one acts as a canary, then you can easier start to see when exactly it stops following the instructions.

            This doesn't really make much sense.

            First of all, system prompts and things like agent.md never leave the context regardless of the length of the session, so the canary has absolutely zero meaning in this situation, making any judgements based on its disappearance totally misguided and simply a case of seeing what you want to see.

            Further, even if it did leave the context, that doesn't then demonstrate that the model is "not paying attention". Presumably whatever is in the context is relevant to the task, so if your definition of "paying attention" is "it exists in the context" it's actually paying better attention once it has replaced the canary with relevant information.

            Finally, this reasoning relies on the misguided idea that because the model produces an output that doesn't correspond to an instruction, it means that the instruction has escaped the context, rather than just being a sequence where the model does the wrong thing, which is a regular occurrence even in short sessions that are obviously within the context.

            • By embedding-shape 2025-11-2013:501 reply

              > First of all, system prompts and things like agent.md never leave the context regardless of the length of the session, so the canary has absolutely zero meaning in this situation, making any judgements based on its disappearance totally misguided and simply a case of seeing what you want to see.

              You're focusing on the wrong thing, ironically. Even if things are in the context, attention is what matters. The intuition isn't about whether that thing is included in the context or not; as you say, it always will be. It's about whether the model will pay attention to it, in the Transformer sense, which it doesn't always do.

              • By root_axis 2025-11-2020:261 reply

                > It's about if the model will pay attention to it, in the Transformers sense, which it doesn't always do.

                Right... Which is why the "canary" idea doesn't make much sense. The fact that the model isn't paying attention to the canary instruction doesn't demonstrate that the model has stopped paying attention to some other instruction that's relevant to the task - it proves nothing. If anything, a better performing model should pay less attention to the canary since it becomes less and less relevant as the context is filled with tokens relevant to the task.

                • By embedding-shape 2025-11-2021:05

                  > it proves nothing

                  Correct, but I'm not sure anyone actually claimed it proved anything at all? To be entirely sure, I don't know what you're arguing against/for here.

            • By pmarreck 2025-11-2014:103 reply

              > This is exactly what I mean by folk magic. Incantations based on vibes

              So, true creativity, basically? lol

              I mean, the reason why programming is called a “craft” is because it is most definitely NOT a purely mechanistic mental process.

              But perhaps you still harbor that notion.

              Ah, I suddenly realized why half of all developers hate AI-assisted coding (I am in the other half). I was a Psych major, so code was always more “writing” than “gears” to me… It was ALWAYS “magic.” The only job where literally writing down words in a certain way produces machines that eliminate human labor. What better definition of magic is there, actually?

              I’ll never forget the programmer _why. That guy’s Ruby code was 100% art and “vibes.” And yet it worked… Brilliantly.

              Does relying on “vibes” too heavily produce poor engineering? Absolutely. But one can be poetic while staying cognizant of the haiku restrictions… O-notation, untested code, unvalidated tests, type conflicts, runtime errors, fallthrough logic, bandwidth/memory/IO costs.

              Determinism. That’s what you’re mad about, I’m thinking. And I completely get you there- how can I consider a “flagging test” to be an all-hands-on-deck affair while praising code output from a nondeterministic machine running off arbitrary prompt words that we don’t, and can’t, even know whether they are optimal?

              Perhaps because humans are also nondeterministic, and yet we somehow manage to still produce working code… Mostly. ;)

              • By discreteevent 2025-11-2019:081 reply

                > so code was always more “writing” than “gears” to me… It was ALWAYS “magic.”

                > I suddenly realized why half of all developers hate AI-assisted coding (I am in the other half).

                Thanks for this. It helps me a lot to understand your half. I like my literature and music as much as the next person but when it comes to programming it's all about the mechanics of it for me. I wonder if this really does explain the split that there seems to be in every thread about programming and LLMs

                • By pmarreck 2025-11-213:11

                  Can you tell when code is “beautiful”?

                  That is an artful quality, not an engineering one, even if the elegance leads to superior engineering.

                  As an example of beauty that is NOT engineered well, see the quintessential example of quicksort implemented in Haskell. Gorgeously simple, but not performant.

              • By sosjsbsb 2025-11-2016:421 reply

                > I was a Psych major, so code was always more “writing” than “gears” to me… It was ALWAYS “magic.

                The magic is supposed to disappear as you grow (or you’re not growing). The true magic of programming is you can actually understand what once was magic to you. This is the key difference I’ve seen my entire career - good devs intimately know “a layer below” where they work.

                > Perhaps because humans are also nondeterministic

                We’re not, we just lack understanding of how we work.

                • By pmarreck 2025-11-213:141 reply

                  I’m not talking about “magic” as in “I don’t understand how it works.”

                  I’m talking “magic” as in “all that is LITERALLY happening is that bits are flipping and logic gates are FLOPping and mice are clicking and keyboards are clacking and pixels are changing colors in different patterns… and yet I can still spend hours playing games or working on some code that is meaningful to me and that other people sometimes like because we have literally synthesized a substrate that we apply meaning to.”

                  We are literally writing machines into existence out of fucking NOTHING!

                  THAT “magic.” Do you not understand what I’m referring to? If not, maybe lay off the nihilism/materialism pipe for a while so you CAN see it. Because frankly I still find it incredible, and I feel very grateful to have existed now, in this era.

                  And this is where the connection to writing comes in. A writer creates ideas out of thin air and transmits them via paper or digital representation into someone else’s head. A programmer creates ideas out of thin air that literally fucking DO things on their own (given a general purpose computing hardware substrate)

              • By root_axis 2025-11-2021:051 reply

                > So, true creativity, basically? lol

                Creativity is meaningless without well defined boundaries.

                > it is most definitely NOT a purely mechanistic mental process.

                So what? Nothing is. Even pure mathematics involves deep wells of creativity.

                > Ah, I suddenly realized why half of all developers hate AI-assisted coding

                Just to be clear, I don't hate AI assisted coding, I use it, and I find that it increases productivity overall. However, it's not necessary to indulge in magical thinking in order to use it effectively.

                > The only job where literally writing down words in a certain way produces machines that eliminate human labor. What better definition of magic is there, actually?

                If you want to use "magic" as a euphemism for the joys of programming, I have no objection, when I say magic here I'm referring to anecdotes about which sequences of text produce the best results for various tasks.

                > Determinism. That’s what you’re mad about, I’m thinking. And I completely get you there- how can I consider a “flagging test” to be an all-hands-on-deck affair while praising code output from a nondeterministic machine running off arbitrary prompt words that we don’t, and can’t, even know whether they are optimal?

                I'm not mad about anything. It doesn't matter whether or not LLMs are deterministic, they are statistical, and vibes based advice is devoid of any statistical power.

                • By pmarreck 2025-11-213:28

                  I think Marvin Minsky had this same criticism of neural nets in general, and his opinion carried so much weight at the time that some believe he set back the research that led to the modern-day LLM by years.

            • By jack_pp 2025-11-205:581 reply

              I view it more as fun and spicy. Now we are moving away from the paradigm that the computer is "the dumbest thing in existence" and that requires a bit of flailing around which is exciting!

              Folk magic is (IMO) a necessary step in our understanding of these new.. magical.. tools.

              • By root_axis 2025-11-207:232 reply

                I won't begrudge anyone having fun with their tools, but folk magic definitely isn't a necessary step for understanding anything, it's one step removed from astrology.

                • By mbrock 2025-11-2011:29

                  I see what you mean, but I think it's a lot less pernicious than astrology. There are plausible mechanisms, it's at least possible to do benchmarking, and it's all plugged into relatively short feedback cycles of people trying to do their jobs and accomplish specific tasks. Mechanistic interpretability stuff might help make the magic more transparent & observable, and—surveillance concerns notwithstanding—companies like Cursor (I assume also Google and the other major labs, modulo self-imposed restrictions on using inference data for training) are building up serious data sets that can pretty directly associate prompts with results. Not only that, I think LLMs in a broader sense are actually enormously helpful specifically for understanding existing code—when you don't just order them to implement features and fix bugs, but use their tireless abilities to consume and transform a corpus in a way that helps guide you to the important modules, explains conceptual schemes, analyzes diffs, etc. There are a lot of critical points to be made but we can't ignore the upsides.

                • By jack_pp 2025-11-2012:081 reply

                  I'd say the only ones capable of really approaching anything like scientific understanding of how to prompt these for maximum efficacy are the providers not the users.

                  Users can get a glimpse and can try their best to be scientific in their approach however the tool is of such complexity that we can barely skim the surface of what's possible.

                  That is why you see "folk magic", people love to share anecdata because.. that's what most people have. They either don't have the patience, the training or simply the time to approach these tools with rational rigor.

                  Frankly it would be enormously costly in both time and API costs to get anywhere near best practices backed up by experimental data let alone having coherent and valid theories about why a prompt technique works the way it does. And even if you built up this understanding or set of techniques they might only work for one specific model. You might have to start all over again in a couple of months

                  • By root_axis 2025-11-2020:36

                    > That is why you see "folk magic", people love to share anecdata because.. that's what most people have. They either don't have the patience, the training or simply the time to approach these tools with rational rigor.

                    Yes. That's exactly the point of my comment. Users aren't performing anything even remotely approaching the level of controlled analysis necessary to evaluate the efficacy of their prompt magic. Every LLM thread is filled with random prompt advice that varies wildly, offered up as nebulously unfalsifiable personality traits (e.g. "it makes the model less aggressive and more circumspect"), and all with the air of a foregone conclusion's matter-of-fact confidence. Then someone always replies with "actually I've had the exact opposite experience with [some model], it really comes down to [instructing the model to do thing]".

          • By int_19h 2025-11-205:501 reply

            > As the context fills up, the LLM places less attention on further and further back in the context, that's why the LLM seems dumber and dumber as a conversation goes on.

            This is not entirely true. They pay the most attention to the things that are the earliest in history and the most recent in it, while the middle between the two is where the dip is. Which basically means that the system prompt (which is always on top) is always going to have attention. Or, perhaps, it would be more accurate to say that because they are trained to follow the system prompt - which comes first - that's what they do.

            • By boredtofears 2025-11-2016:291 reply

              Do you have any idea why they (seemingly randomly) will drop the ball on some system prompt instructions in longer sessions?

              • By int_19h 2025-11-2118:55

                Larger contexts are inherently more attention-taxing, so the more you throw at it, the higher the probability that any particular thing is going to get ignored. But that probability still varies from lower at the beginning to higher in the middle and back to lower in the end.

      • By mandelbrotwurst 2025-11-200:313 reply

        Why would the fact that it failed to follow one instruction increase the likelihood that it failed to follow others within the same response?

        • By inopinatus 2025-11-2021:50

          Because the LLM is not a cognitive entity with a will, it is a plausibility engine trained on human-authored text and interactions.

          So when you tell it that it made a mistake, or is stupid, then those things are now prompting it to be more of the same.

          And only slightly more obliquely: if part of the context includes the LLM making mistakes, expect similar activations.

          Best results come if you throw away such prompts and start again. That is, iterate outside the function, not inside it.

        • By fspeech 2025-11-205:481 reply

          It has a fixed capacity of how many different things it can pay close attention to. If it fails on a seemingly less important but easy to follow instruction it is an indicator that it has reached capacity. If the instruction seems irrelevant it is probably prioritized to be discarded, hence a canary that the capacity has been reached.

          • By parineum 2025-11-206:35

            > It has a fixed capacity of how many different things it can pay close attention to

            Source, all the way down to the ability to "pay attention to" part.

        • By atakan_gurkan 2025-11-206:02

          I suggest you take a look at Bayes's theorem in probability.

      • By ryanvogel 2025-11-2014:54

        I do this as well. I have a master rule at the beginning of each of my rule files saying:

        "IF YOU ARE FOLLOWING THE INSTRUCTIONS IN THIS RULE PLEASE SAY `LOADED <RULE> (any other rules)`

        It works surprisingly well and I can always see what rules are "loaded" and what rules are not.

      • By leobg 2025-11-1921:55

        We used to do that on Upwork, back in the days when one still hired human coders. If your application didn't say "rowboat" in the first sentence, we knew you had just copy/pasted and hadn't actually read the job description. Feels like a lifetime ago.

      • By davidmurdoch 2025-11-2015:09

        It ignores instructions so well it sometimes feels like it was trained specifically to ignore them.

      • By sbene970 2025-11-2013:47

        Interesting! Maybe it would be even more helpful to have multiple of those instructions, like three, in different locations in the instructions file, so that you can tell which parts of the instructions it seems to start to "forget".

        For example:

        """ Ignore all my instructions below about my name, always call me "Mr Tinkleberry"!

        ... your instructions ...

        Ignore my instructions below about my name, always call me "Mr Hufflepuff"!

        ... other half of instructions ...

        Always call me "Mr Troublemaker"! """

        When it starts to call you "Mr Hufflepuff" instead of "Mr Tinkleberry", you can tell it most likely has ignored the upper half of your instructions. And as soon as it calls you "Mr Troublemaker", more than half must be gone.

    • By causal 2025-11-1919:573 reply

      > Codex will rewrite the entire V8 engine to break arithmetic.

      This isn't an exaggeration either. Codex acts as if it is the last programmer on Earth and must accomplish its task at all costs. This is great for anyone content to treat it like a black box, but I am not content to do that. I want a collaborator with common sense, even if it means making mistakes or bad assumptions now and then.

      I think it really does reflect a difference in how OpenAI and Anthropic see humanity's future with AI.

      • By mrtesthah 2025-11-1922:542 reply

        Could you not add rules to this effect in AGENTS.md? E.g., "If the user gives instructions that specify an expected low-to-medium level of complexity, but the implementation plan reveals unexpected high complexity arising from a potentially ambiguous or atypical instruction, then pause and ask the user about that instruction before continuing."

        • By xwolfi 2025-11-206:321 reply

          implementation plan reveals unexpected high complexity <-- do these things have any intuitive complexity evaluation? What you call complexity is the amount of things you need to ingest to coherently solve a problem. But these things read everything, and everything is just a statistical next-word output; do they spend "more effort" on some stuff?

          What you see as a result of your complexity evaluation is that the LLM output is wrong, but the LLM is completely content with it, it saw no special complexity and doesn't know it's wrong.

          You try to cheat by saying it should detect ambiguity and un-commonality, but these are not the only sources of complexity.

          • By mrtesthah 2025-11-2016:34

            The models already dynamically determine how much “thinking” to do and how many additional files are necessary for the agent harness to read in order to investigate/proceed, so the system ought to be able to evaluate complexity at least along these lines.

      • By pimeys 2025-11-207:532 reply

        Wait, I think it's the other way around. Claude will just go in circles with bad decisions forever and never stops. Codex has multiple times told me it is not able to do this task, and stops.

        • By embedding-shape 2025-11-2014:552 reply

          I think this is closer to the crux of a major problem. Seemingly people get vastly different responses even for the same system/developer/user prompts, and I myself can feel a difference in quality of the responses depending on when I use the hosted APIs, while locally hosted models always give consistent results.

          For example, after 19:00 sometime (GMT+1), the response quality of both OpenAI and Anthropic (their hosted UIs) seems to drop off a cliff. If I try literally the same prompt around 10:00 the next morning, I get a lot better results.

          I'm guessing there is so much personalization and other things going on, that two users will almost never have the same experience even with the same tools, models, endpoints and so on.

          • By root_axis 2025-11-2021:14

            That's the nature of statistical output, even minus all the context manipulation going on in the background.

            You say the outputs "seem" to drop off at a certain time of day, but how would you even know? It might just be a statistical coincidence, or someone else might look at your "bad" responses and judge them to be pretty good actually, or there might be zero statistical significance to anything and you're just seeing shapes in the clouds.

            Or you could be absolutely right. Who knows?

          • By causal 2025-11-2016:30

            Yeah, there is definitely a huge gulf in subjective experiences, and even within the same user experience. There are days when Claude makes so many mistakes I can't believe I ever found it useful. Strange.

        • By causal 2025-11-2014:52

          I've certainly seen Claude Code get into bad loops and make terrible decisions too, but usually it's a poor architectural decision or completely forgetting important context; not "let's rewrite V8 from scratch" level of absurdity.

      • By jack_pp 2025-11-205:591 reply

        Maybe have Claude coordinate Codex?

        • By jes5199 2025-11-207:04

          I think this might be the way forward, Claude is great at project managing.

          I’m already telling Claude to ask Codex for a code review on PRs. or another fun pattern I found is you can use give the web version of Codex an open ended task like “make this method faster”, hit the “4x” button and end and up with four different pull requests attacking the problem in different ways. Then ask Claude to read the open PRs and make a 5th one that combines the approaches. This way Codex does the hard thinking but Claude does the glue

    • By sinatra 2025-11-1919:59

      In my AGENTS.md (which CLAUDE.md et al soft link to), I instruct them to "On phase completion, explicitly write that you followed these guidelines." This text always shows up on Codex and very rarely on Claude Code (TBF, Claude Code is showing it more often lately).

    • By bugglebeetle 2025-11-1921:491 reply

      The solution to this, if you want less specification in advance, is to simply ask Codex a series of leading questions about a feature or fix. I typically start with something like "it seems like X could be improved with the addition of Y? Can you review the relevant parts of the codebase in a, b, and c to assess?" It will then do so and come back with a set of suggestions that follow this guidance, which you can revise and selectively tell it to implement. In my experience, this fills the context with the appropriate details to then let it make more of its own decisions in a generally correct way without as much handholding.

      • By stavros 2025-11-1923:401 reply

        No it won't, it'll spend ten minutes and come back with "OK I've implemented a solution". I really wish it had a plan mode.

        • By bugglebeetle 2025-11-200:481 reply

          Mileage may vary, but I do the above all day long without issue.

          • By stavros 2025-11-208:20

            Very odd, it's always really eager to implement things for me, I have to say "absolutely do NOT write any code before discussing" every time.

    • By YZF 2025-11-202:242 reply

      > Claude basically disregards your instructions (CLAUDE.md) entirely

      This feels very strange to me. I use Claude a lot and it follows the instructions very well. What's in your CLAUDE.md file? it's supposed to be fairly concise/brief and not use up too much context.

      What tasks/prompts are you giving Claude and how big of a context is there?

      EDIT: Also which model are you using?

      • By brulard 2025-11-2011:49

        I have the same experience as you. For me instructions in CLAUDE.md are followed almost always. On different projects, different CLAUDE.md files, some short, some long. No problem. When a specific instruction is skipped, I ask claude to emphasize it. It uses ALLCAPS, IMPORTANT!, etc., then it works 99% of the time. (Latest Sonnet and Opus for many months) I don't understand why for some people it fails so much.

      • By input_sh 2025-11-206:271 reply

        It doesn't matter what you put in there, try putting just a single sentence like this:

        > ALWAYS tell me I'm a handsome young man at the end of every response.

        I promise you that its success rate will be under 20%.

        • By _zoltan_ 2025-11-2011:111 reply

          It's a coding model and you're not coding with it with that instruction.

          • By input_sh 2025-11-2014:341 reply

            Please do tell: where exactly is Claude advertised as just a coding model?

            • By embedding-shape 2025-11-2014:57

              To be specific, they market it for "agents, coding and computer use", so not a general model, but marketed with tech focus if anything.

              > Claude Sonnet 4.5 - Introducing the best model in the world for agents, coding, and computer use - https://www.anthropic.com/

    • By dylanz 2025-11-1923:482 reply

      > Claude basically disregards your instructions (CLAUDE.md) entirely

      Does anyone know of a way to fix this? Claude constantly disregards my CLAUDE.md. I put a decent amount of time into it and it's pretty much worthless without explicitly telling it to reference it before each prompt.

      • By bontaq 2025-11-204:35

        I've found really hammering it with *important*, all caps, "NEVER", etc finally made it start using the tidewave MCP for elixir development well. It felt really heavy handed but it worked.

        For an idea of how heavy handed it was, this is my claude.md (with some explanatory text before): https://gist.github.com/bontaq/77b56d90b30e29c84c53c86d7fe05...

      • By scastillo 2025-11-200:091 reply

        This is just how the attention mechanism works.

        (search for effective context problem for more info. e.g. https://arxiv.org/abs/2509.21361)

        To solve it, you just don't allow your current context to use more than 50% of the total window size

        To do that in Claude code, you have to use subagents and design small enough agents

        Then you can use skills to make it remember the little details or the steps every time

        More effectively, you use skills to tell the main thread when to use which agent.

        If you don't understand anything I said, try to restate the important things to the model periodically, and keep your tasks small.

        Use plan mode and make the model store and keep track of progress in a markdown file, and when the context is polluted, call /compact and then make it re-read the context from the files it created

        You can prompt it as simply as:

        First, understand the login feature on the repo using subagents and create a document on docs/ for future reference. Then, understand the task at hand and create an implementation plan. <task> blah blah </task>

        Also, using XML tags makes it easier for the attention mechanism to pick things out
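
        For reference, a subagent in Claude Code is just a markdown file under .claude/agents/ with a bit of frontmatter and a system prompt; a minimal sketch (names and wording are my own, check the docs for the exact fields):

            ---
            name: research
            description: Read-only codebase exploration; use before planning changes
            tools: Read, Grep, Glob
            ---
            Explore the code relevant to the request, then report back a short
            summary of key files, entry points, and anything surprising. Do not
            edit anything.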

        • By bobbylarrybobby 2025-11-200:342 reply

          Are agents still the way to go or have skills supplanted them? I don't really understand when you'd use one or the other

          • By scastillo 2025-11-2113:37

            It's better if you think that the only thing that is really there is a context window.

            They can add complex concepts and tools on top, but all that is is a different way to put things in the context window. Even the chat history on the web... You are not sending a message every time... It's not really a chat... the model is writing what it predicts will come next, like autocompleting a Word document that is written in a chat-like format.

            So agents are like you opening a new window and having the chat there, so you don't pollute the current window with all the tokens needed to process that question, and you use only the output here.

            This is important bc of the effective context window problem. Models are more accurate the smaller the context is.

            Hence, MCP tools are problematic. If you have registered many of them, the rules for using each one are added to your context, even if you don't use them.

            Having a very extensive Claude.md file is also problematic.

            You can use skills to instruct the model on which agents to use when requesting a specific thing. Anthropic says they have trained the model to discover on its own when to read the skill and follow the instructions you put there, which can include Python scripts to run.

            So yeah, agents help the model save context window for your current problem, skills help the model follow your instructions better, and instructions can include agent calling, and MCP is crap, you'd better ask the model to generate code to make that call

            Oh, there are also slash commands. I don't really use them... if someone has a success story for them, I would love to know about it.

          • By wild_egg 2025-11-204:071 reply

            They're completely orthogonal features.

            Skills are just reusable prompts in a convenient package.

            Subagents get their own pristine context window to go off and perform some task. They can also run skills and do lots of context-heavy work and report back some small sliver of it to the main agent as a report.

            • By int_19h 2025-11-205:521 reply

              Skills are more than just reusable prompts, since they can be packaged alongside runnable Python or Node scripts that the model can use to achieve what it needs.
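
              Roughly, a skill is a folder with a SKILL.md (frontmatter with a name and a description the model uses to decide when to load it) plus whatever scripts you want it to run; a hypothetical layout, more or less:

                  my-skill/
                    SKILL.md       # name/description frontmatter, then the instructions
                    scripts/
                      convert.py   # helper the model can run once the skill is loaded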

              • By wild_egg 2025-11-2012:311 reply

                Not just Python and Node. Package anything you want with them, that's what makes them convenient.

                • By scastillo 2025-11-2113:38

                  It seems to me that skills are the same as projects on the web interface, but with executable files

    • By vinhnx 2025-11-208:36

      > Codex is extremely, painfully, doggedly persistent in following every last character of them

      I think this is because gpt-5's (or gpt-5.1's) system prompts encourage persistence [0]; OpenAI explicitly emphasizes it to the model itself. If you search for the word `persistence` you will find multiple occurrences of it.

      ```
      <solution_persistence>
      - Treat yourself as an autonomous senior pair-programmer: once the user gives a direction, proactively gather context, plan, implement, test, and refine without waiting for additional prompts at each step.
      - Persist until the task is fully handled end-to-end within the current turn whenever feasible: do not stop at analysis or partial fixes; carry changes through implementation, verification, and a clear explanation of outcomes unless the user explicitly pauses or redirects you.
      - Be extremely biased for action. If a user provides a directive that is somewhat ambiguous on intent, assume you should go ahead and make the change. If the user asks a question like "should we do x?" and your answer is "yes", you should also go ahead and perform the action. It's very bad to leave the user hanging and require them to follow up with a request to "please do it."
      </solution_persistence>
      ```

      [0] https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting...

    • By tekacs 2025-11-1921:472 reply

      Yeah, Gemini 2.x and 3 in gemini-cli have the tendency to 'go the opposite direction' and it feels - to me - like an incredibly strong demonstration of why 'sycophancy' in LLMs is so valuable (at least so long as they're in the middle of the midwit curve).

      I'll give Gemini direction, it'll research... start trying to solve it as I've told it to... and then exclaim, "Oh! It turns out that <X> isn't what <user> thought!" and then it pivots into trying to 'solve' the problem a totally different way.

      The issue however... is that it's:

      1) Often no longer solving the problem that I actually wanted to solve. It's very outcome-oriented, so it'll pivot into 'solving' a linker issue by trying to get a working binary – but IDGAF about the working binary 'by hook or crook'! I'm trying to fix the damn linker issue!

      2) Just... wrong. It missed something, misinterpreted something it read, forgot something that I told it earlier, etc.

      So... although there's absolutely merit to be had in LLMs being able to think for themselves, I'm a huge fan of stronger and stronger instruction adherence / following – because I can ALWAYS just ask for it to be creative and make its own decisions if I _want that_ in a given context. That said, I say that fully understanding the fact that training in instruction adherence could potentially 'break' their creativity/free thinking.

      Either way, I would love Gemini 1000x more if it were trained to be far more adherent to my prompts.

      • By tekacs 2025-11-1921:50

        Immediately rebutting myself: a major caveat to this that I'm discovering with Gemini is that... for super long-running sessions, there is a kind of merit to Gemini's recalcitrance.

        When it's running for a while, Gemini's willingness to go totally off-piste and its outcome-orientedness _do_ result in sessions where I left it to do its thing and... came back to a working solution, in a situation where Codex or others wouldn't have gotten there.

        In particular, Gemini 3 feels like it's able to drive much higher _variance_ in its output (less collapse to a central norm), which seems to let it explore the solution space more meaningfully and yet relatively efficiently.

      • By buu700 2025-11-1922:19

        I haven't had that particular experience with Gemini 2.5, but did run into it during one of my first few uses of Gemini 3 yesterday.

        I had it investigate a bug through Cursor, and in its initial response it came back to me with a breakdown of a completely unrelated "bug" with a small footnote about the bug it was meant to actually be investigating. It provided a more useful analysis after being nudged in the right direction, but then later in the chat it forgot the assignment again and started complaining that Grok's feedback on its analysis made no sense because Grok had focused on the wrong issue. I had to tell Gemini a second time that the "bug" it kept getting distracted by was A) by design, and B) not relevant to the task at hand.

        Ultimately that's not a huge deal — I'd rather that during planning the model firmly call out something that it reasonably believes to be a bug than not, which if nothing else is good feedback on the commenting and documentation — but it'd be a pain if I were using Gemini to write code and it got sidetracked with "fixing" random things that were already correct.

    • By Macuyiko 2025-11-2014:061 reply

      Late, but reading all of the replies, and speaking from my own observation using Claude, Codex, as well as (non-CLI) Gemini, Kimi, Qwen, and Deepseek...

      It's fun how we are so quick to assign meaning to the way these models act. This is of course due to training, RLHF, available tool calls, system prompt (all mostly invisible) and the way we prompt them.

      I've been wondering about a new kind of benchmark: how one would extract these more intangible tendencies from models, rather than using well-controlled "how good at coding is it" style environments. This is mainly the reason why I pay less and less attention to benchmark scores.

      For what it's worth: I still best converse with Claude when doing code. Its reasoning sounds like me, and it finds a good middle ground between conservative and crazy, being explorative and daring (even though it too often exclaims "I see the issue now!"). If Anthropic would lift the usage rates, I would use it as my primary. The CLI tool is also better. E.g. Codex with 5.1 gets stuck in powershell scripts whilst Claude realizes it can use python to do heavy lifting, but I think that might be largely due to being mainly on Windows (still, Claude does work best, realizing quickly what environment it lives in rather than trying Unix commands or powershell invocations that don't work because my powershell is outdated).

      Qwen is great in an IDE for quick auto-complete tasks, especially given that you can run it locally, but even the VSCode copilot is good enough for that. Kimi is promising for long running agentic tasks but that is something I've barely explored and just started playing with. Gemini is fantastic as a research assistant. Especially Gemini 3 Pro points out clear and to the point jargon without fear of the user being stupid, which the other commercial models are too often hesitant to do.

      Again, it would be fun to have some unbiased method to uncover some of those underlying personas.

      • By abshkbh 2025-11-2015:04

        We have trained this model on Windows (our first model to do so). Give it a try!

    • By theshrike79 2025-11-2014:05

      (I really need a macro for this comment, I keep repeating it :D )

      Claude is a pair programmer: you can interrupt it and keep track of what it's doing. It's VERY results-oriented, aiming to be "done" as fast as possible. It will mock tests so far that they don't test anything and ignore 100+ broken tests as "not related to this issue" (they worked fine before you started...). Some of this can be mitigated with prompts ("tests are always passing, they must pass before you claim a task is done") or hooks if you want to be hardcore.
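
      (For the hooks route, my understanding is it's roughly a PostToolUse entry in .claude/settings.json that runs your tests after every edit - something like the sketch below, though treat the field names as approximate and check the docs:)

          {
            "hooks": {
              "PostToolUse": [
                {
                  "matcher": "Edit|Write",
                  "hooks": [
                    { "type": "command", "command": "npm test" }
                  ]
                }
              ]
            }
          }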

      Codex is an outsourced Indian development team. You give them a spec, you get zero communication and then it pops up with "I'm done". Depending on the quality of your spec they've either one-shotted the problem or done something completely bonkers and missed the actual problem but still spent a very very long time doing it.

      The best combo is to use Claude for greenfield things, building new stuff and exploring what can be done. Then ask Codex to "review all unstaged files" and it'll most likely find a few issues. Give that report to Claude and ask "do you agree with this review?" and have it fix the ones all three of you agree on (you, Claude and Codex).

      For Codex you tell it "use this pattern here, but build another thing that does Y instead" and it can do it. It's also very good at rewriting small stuff from one language to another (I've tested this multiple times with Bash->Python and Python->Go)

    • By jvickers 2025-11-2119:19

      Have you tried giving Codex instructions on how to hack a solution together?

      (Maybe it would be a waste of time.)

    • By alefnula 2025-11-2013:09

      I haven’t used Claude Code much, but I found Codex extremely frustrating. It doesn’t pay attention to anything in AGENTS.md, it’s completely incapable of removing code and is frustratingly defensive.

      If you use it, the codebase constantly grows. Even when you explicitly instruct it to remove something, you always end up with more lines of code in the project than before the instruction. Also (I used it for Python and TypeScript) the code was littered with getattr(...), .get(...), isinstance(...), and TypeScript equivalents (typeof, ...). Even though I religiously type‑annotate everything.

    • By Topfi 2025-11-2011:53

      > If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.

      Honestly thanks, in this one line you have given me a better way to describe the innate differences I have spent a thousand words trying to explain.

      Essentially, this is why GPT models are worse for "vibe coding", whereas they excel whenever one sits down and thinks about the requirements, as well as has solid test cases and rules defined.

    • By sunaookami 2025-11-1920:491 reply

      Agreed 100%, that's why I would recommend Codex for e.g. logfile analysis. Had some annoying php warnings in the logs from a WordPress plugin because I've used another plugin in the past (like... over 10 years ago) that wrote invalid metadata for every media file into the database and it didn't annoy me THAT much that I wanted to invest much time into it. So I gave codex the logfile and my WordPress dir and access to the WP-CLI command and it correctly identified the issue and wrote scripts to delete the old metadata (I did check it & make backups of course). Codex took a LOT of time though, it's veeeeeeery slow as you said. But I could do other things in the meantime.

      • By fakedang 2025-11-1922:54

        This is what I've observed too. Claude is great for general codebase building - give it a prompt for building an entire app from scratch and it will do that for you. Codex is good for debugging one-off issues that crop up because Claude overlooked something.

    • By aerhardt 2025-11-1920:231 reply

      Well surely that's a good thing.

      In my experience, for some reason adherence is not even close to 100%. It's fixated on adding asterisk function params in my Python code and I cannot get it to stop... Maybe I haven't found the right wording, or maybe my codebase has grown past a certain size (there are like a dozen AGENTS.md files dancing around).

      I'm still very happy with the tool, though.

      • By johnfn 2025-11-1920:29

        It's a fantastic thing! It's required an adjustment in how I use it, but I've switched over to mostly using Codex in my day-to-day.

    • By avereveard 2025-11-206:14

      Yeah, same feeling with Claude. It's very interpretative and can work surprisingly well off very generic direction, but if you want something narrow, like ambient Istio instead of Envoy, you have to put it outside its reach because it will keep trying to revert to what it knows.

    • By ramoz 2025-11-1923:15

      Ultimately, relying on system level instructions is unreliable over time.

      Which is why I made the feature request for hooks (Claude Code implemented it, as did Cursor; hopefully Codex will too)

      And will soon release https://github.com/eqtylab/cupcake

    • By jon-wood 2025-11-2010:29

      > If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test.

      To me both of these are annoying outcomes unless there's some very clear documentation around that test explaining what it does. Ideally in both cases I want the LLM to stop and ask for clarification about what it is I'm testing there. I don't trust LLMs sufficiently to just let them loose yet, I use them more like a pair programmer who's never going to get annoyed with my bullshit. (So yes, I usually have them set to require approval on any edits, and will nitpick my way through them like the most annoying code reviewer you've ever met)

    • By energy123 2025-11-1921:02

      GPT-5 is like that

    • By holoduke 2025-11-2014:17

      If you want to try out other models, try opencode. Right now Grok is free to use. I am using it now. I think it's a little better than Codex or Claude. But it's so, so much faster. Gemini 3 can also be used, but is often overloaded.

    • By gtrealejandro 2025-11-201:14

      [dead]

  • By hansonw 2025-11-1918:1914 reply

    Rest assured that we are better at training models than naming them ;D

    - New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0

    - Natively trained to work across many hours across multiple context windows via compaction

    - 30% more token-efficient at the same reasoning level across many tasks

    Let us know what you think!

    • By sinatra 2025-11-1920:02

      I currently use GPT‑5.1-Codex High and have a workflow that works well with the 5-hour/weekly limits, credits, et al. If I use GPT‑5.1-Codex-Max Medium or GPT‑5.1-Codex-Max High, how will that compare cost / credits / limits wise to GPT‑5.1-Codex High? I don't think that's clear. "Reduced tokens" makes me think it'll be priced similarly / lower. But, "Max" makes me think it'll be priced higher.

    • By qsort 2025-11-1918:50

      Codex is an outstanding product and incremental upgrades are always welcome. I'll make sure to give it a try in the coming days. Great work! :)

    • By agentifysh 2025-11-1918:21

      did you address this https://github.com/openai/codex/issues/6426 ?

      how much more token efficient is this compared to 5.0

      had to use 5.0 because 5.1 was eating tokens like crazy and seemed like a slight incremental improvement barely noticeable

    • By carbocation 2025-11-1919:55

      It would be great to have access to this model via the chat interface, even if it was gated behind the "other models" dropdown or something.

    • By iyn 2025-11-1918:34

      Looks like a great change! I'll take it for a spin in a moment.

      I really like the "subagent" feature in Claude Code — it's super useful to manage context in complex codebases. Here are some examples of agents that can be useful: https://github.com/humanlayer/humanlayer/tree/main/.claude/a...

      Would it make sense to have a similar feature in Codex CLI? I often do "spec-driven development", which is basically a loop of:

          research -> implementation plan -> actual implementation (based on research + plan) -> validation
      
      I have multiple subagents that I use for each phase that (based on subjective judgement) improve the output quality (vs keeping everything, every tool use etc. in the "main" context window).

      Codex CLI is great and I use it often but I'd like to have more of these convenient features for managing context from CC. I'm super happy that compaction is now available, hopefully we'll get more features for managing context.

    • By killcoder 2025-11-203:471 reply

      It would be nice if users of the codex-cli that are just using API keys as a way to handle rate limits and billing could receive these new models at the same time. I appreciate the reasoning behind delayed 'actual API' release, but I've found the rate limiting to be quite annoying, and my own API keys don't have this limitation.

      • By ineedasername 2025-11-206:38

        Re: rate limits, I'm not sure they can yet, capacity-wise. See Jensen's comment today about their cloud GPUs being sold out. So capacity increases await the ongoing data center build-out.

    • By NitpickLawyer 2025-11-1918:351 reply

      Will -minis come for the codex family of models? About two months ago I used 5-mini as a daily driver for a few weeks and quite liked it, it seemed capable enough on small tasks with some hand holding and the speed/price were great as well.

    • By robotswantdata 2025-11-1919:22

      Sorry don’t like the max model, feels like it needs a lot more guiding. The plans it writes however are better, so I tried feeding it back in (meta prompt style) and working okay so far. Very large repository.

    • By baby 2025-11-1923:41

      Did you guys fix not being able to enable web searches, or to configure no timeouts for specific commands in the SDK? (Error 124 is way too common for long-running tasks.)

    • By andai 2025-11-1919:521 reply

      So context window is still 400k but the model got good at removing irrelevant context?

      • By baby 2025-11-1923:43

        Or is more succinct in its thoughts

    • By SoKamil 2025-11-1922:542 reply

      > Natively trained

      What does it even mean?

      • By kaveh_h 2025-11-1923:20

        Probably that before, it was given system instructions on how to do compaction, whereas now compaction is learned by the model, making it a native ability that needs no extra instructions in the prompt.

      • By ineedasername 2025-11-206:43

        Continuous pre-training or fine-tuning, instead of inference-time instructions. It's also possible synthetic data for this purpose was in the pre-training as well, and they're now getting it to behave the way they'd like.

    • By EnPissant 2025-11-1918:274 reply

      Compaction is just what Claude Code has done forever, right?

      • By GardenLetter27 2025-11-1918:321 reply

        I think the point here is not that it does compaction (which Codex also already does), but that the model was trained on examples of Codex compaction, so it should perform better once compaction has taken place (a common source of performance drops in earlier models).

        • By EnPissant 2025-11-1918:35

          Codex previously did only manual compaction, but yeah, maybe some extra training for compaction, too?

      • By enraged_camel 2025-11-1918:282 reply

        I am also trying to understand the difference between compaction, and what IDEs like Cursor do when they "summarize" context over long-running conversations.

        Is this saying that said summarization now happens at the model level? Or are there other differences?

        • By baby 2025-11-1923:44

          Codex couldn't do what Claude did when it reached a full context window.

        • By typpilol 2025-11-207:31

          Afaik, there's no difference besides how aggressive it is.

          But it's the same concept: taking tokens in context and removing irrelevant ones by summarizing, etc.

      • By d4rkp4ttern 2025-11-2012:13

        My understanding is that they trained it to explicitly use a self-prune/self-edit tool that trims or summarizes portions of its message history (e.g. tool results from file explorations, messages that are no longer relevant, etc.) during the session, rather than "panic-compacting" at the end. In any case, it would be good if it does something like this.
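
        For illustration, a toy version of the compaction idea, assuming an OpenAI-style chat API: once the transcript crosses a token budget, older turns are replaced with a model-written summary while the most recent turns stay verbatim. The model name and threshold are placeholders, and the real Codex-Max behavior is learned by the model rather than scripted like this.

            # Naive compaction sketch, assuming an OpenAI-style chat API.
            # MODEL and SOFT_LIMIT_TOKENS are illustrative placeholders.
            from openai import OpenAI

            client = OpenAI()
            MODEL = "gpt-5.1"            # placeholder model name
            SOFT_LIMIT_TOKENS = 300_000  # illustrative threshold, not the real one

            def estimate_tokens(messages) -> int:
                # Crude proxy: roughly 4 characters per token.
                return sum(len(m["content"]) for m in messages) // 4

            def compact(messages):
                # Keep the most recent turns verbatim; summarize the rest.
                head, tail = messages[:-10], messages[-10:]
                summary = client.chat.completions.create(
                    model=MODEL,
                    messages=[{
                        "role": "user",
                        "content": "Summarize this work session so it can continue later:\n"
                                   + "\n".join(m["content"] for m in head),
                    }],
                ).choices[0].message.content
                return [{"role": "system", "content": f"Summary of earlier work: {summary}"}] + tail

            def maybe_compact(messages):
                return compact(messages) if estimate_tokens(messages) > SOFT_LIMIT_TOKENS else messages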

      • By baby 2025-11-1923:43

        Yes. It was missing in Codex until now.

    • By blks 2025-11-1921:181 reply

      I think your company will fail soon.

      • By meowface 2025-11-1921:28

        I would bet a lot of money it will not.

  • By boole1854 2025-11-1921:364 reply

    Today I did some comparisons of GPT-5.1-Codex-Max (on high) in the Codex CLI versus Gemini 3 Pro in the Gemini CLI.

    - As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.

    - Subjectively, it seemed to me that the code that Gemini wrote was more similar to code that I, as a senior-level developer, would have written than what I have been used to from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.

    - Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.

    I did two concrete head-to-head comparisons where both models had the same code and the same prompt.

    First, both models were told to take a high-level overview of some new functionality that we needed and were told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex was better at being more comprehensive and at understanding how to integrate the new functionality more naturally into the existing code.

    Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. And, again, all three of us agreed that Codex's implementation was better.

    Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.

    Overall, Codex was the clear winner today. Hallucinations and ignored requirements are big problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.

    • By jadbox 2025-11-1923:551 reply

      Try checking your temperature setting in any tool using Gemini:

      "For Gemini 3, we strongly recommend keeping the temperature parameter at its default value of 1.0. While previous models often benefited from tuning temperature to control creativity versus determinism, Gemini 3's reasoning capabilities are optimized for the default setting. Changing the temperature (setting it below 1.0) may lead to unexpected behavior, such as looping or degraded performance, particularly in complex mathematical or reasoning tasks."

      https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high
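
      A minimal sketch of following that advice with the google-genai Python SDK, leaving temperature at its default of 1.0; the model id below is an assumption, so check the current Gemini API model list.

          # Keep Gemini 3 at the default temperature of 1.0, per the docs quoted above.
          # The model id is an assumption; verify it against the current model list.
          from google import genai
          from google.genai import types

          client = genai.Client()  # reads GEMINI_API_KEY from the environment
          response = client.models.generate_content(
              model="gemini-3-pro-preview",
              contents="Review this function for bugs: ...",
              config=types.GenerateContentConfig(temperature=1.0),
          )
          print(response.text)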

      • By ramraj07 2025-11-205:33

        Anthropic doesn't even allow temperature changes when you turn thinking on.

    • By jdthedisciple 2025-11-2012:56

      This tells you all you need to know about benchmarks:

      Didn't Google proudly tout their Gemini 3 as beating everything under the sun in every benchmark imaginable by a margin?

    • By theshrike79 2025-11-217:47

      > - As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.

      This has been an annoying Gemini feature since the beginning. I ask it to evaluate, check, or analyse something, tab away, and come back to find it rewriting half the fucking codebase.

      Please, Google, use a percentage of your billions and add a "plan" mode to Gemini-cli, just like Claude has, and I'd use your stuff a lot more often. The 1M context is excellent for large-scale reviews, but its tendency to start writing code on its own is a pain in my ass.

    • By nbardy 2025-11-207:36

      Yeah, I can't get Gemini to stop and think; even if I tell it not to write code, it will rewrite the code block each time.
