
Discover the top games and apps for iPhone, including Claude by Anthropic, ChatGPT, Google Gemini, Temu: Shop Like a Billionaire on the App Store.
Obvious political reasons and implications aside, a clear quality gap opened up late last year when Opus 4.5 was released vs. GPT-5. Opus was obviously and demonstrably superior to any GPT-5 tier. The release of GPT-5.2 didn't improve matters, and then Opus 4.6 widened the gap further. Right now talking to GPT-5.2 Pro is 10x slower than chatting with Opus 4.6 and the output returned is, nevertheless, generally lower quality and more "sloppy."
What I'm getting at is that this could be, in part, because Claude is genuinely better at this point in time.
I just cancelled my OpenAI $200 sub yesterday because of all this, but sadly I can't agree.
Codex 5.3 Xhigh > Opus 4.6 in my work to this point.
Hoping for Opus 4.7 or whatever comes next to rectify this as I'm a bit annoyed over having to drop to a lower quality model.
Weirdly enough, I agree with both sides. Opus beats every version of GPT 5 as a chat interface, hands down. ChatGPT, at this point, is mostly me correcting its output style, cadence, behavior, etc, and consistently remaining dissatisfied, meanwhile Opus one-shots things I didn’t even think it could (Typst code). All that said, I do my programming in OpenAI’s Codex app for Mac. It has completely dominated Claude Code for me. I’ll only ever use Opus to check 5.3-Codex’s work. Very weird world we’re living in. I hope it gets even weirder once Deepseek does whatever they’ve been cooking.
whatever they've been cooking at deepseek, i don't think i'm going to let their coding agent run shell stuff on my computer unless they make it free or something
For coding, I agree, Codex-5.3 is the best out there.
But for the chat, I feel like ChatGPT got worse and worse.
Something very weird is going on; I just tried a free trial of Codex-5.3, and a significant fraction of what it gives me doesn't even compile (or in the case of python, run without crashing).
Unless I specifically say "use git", it won't bother using git; apparently saying "configure AGENTS.md to use best practices" isn't enough for it to (at least in this case) use git. If this were isolated I might put it down to bad luck, given the nature of LLMs, but I have been finding Codex uses the wrong approaches all over the place, also stops in the middle of tasks, skips some tasks entirely (sometimes while marking them as done, other times it just doesn't get around to them).
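In my experience, a blanket "best practices" line gets ignored; spelling out the git workflow explicitly in AGENTS.md works better. A minimal sketch of what I mean (the section name and wording are just an illustration, not anything Codex requires):

```markdown
## Version control
- Initialize a git repository if one does not already exist.
- Commit after each completed task with a short, descriptive message.
- Never mark a task done while the working tree is dirty.
```

Even with this, I'd still check that the commits actually landed before trusting the task list.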
I'd rank the output of Claude as similar to a junior with 1-3 years experience. It's not great, but it's certainly serviceable, a bit of tweaking even shippable. Codex… what I see is more like a student project. Or perhaps someone in the first month of their first job. Even the absolute worst human developers I've worked with after university weren't as bad as Codex, but several of them I'd rank worse than Claude.
Do you have instructions.md and docs.md files?
Also, I noticed it skips instructions if I steer it with prompts while it is doing stuff instead of queueing my instructions.
Are you using it on Xhigh?
I have not observed meaningful quality differences between the default (medium) and extra high. What does make a difference is to turn the metaphorical lights back on, and instead of vibe-coding (as in, don't even look at the code) actually examine what it did at each step (either at code level or QA) before allowing it to proceed to the next step.
OpenAI's 5.3 Codex model on xhigh still makes a huge number of mistakes, somewhere between 25-50% of commits, and it's still terrible at making its own plan, estimating how long tasks will take to complete, and recognising which tasks need to be subdivided*. Claude's model last November was better on both counts, even though it still wasn't IMO ready for true lights-off-no-code-check-needed-vibe-coding, it was making mistakes far less often and was scoping task complexity appropriately.
That said, given xhigh seems to be going through my token allowance far, far slower than on medium, I wouldn't be surprised if it turns out the Codex app itself is vibe coded and has mis-mapped that setting in some weird way. Either that or they've suddenly got a lot more spare capacity because of the boycotts.
* given the METR study, in the planning phase I ask all these models (Codex and Claude) to break down tasks into things that would take a junior developer 1-2 hours, but Codex will estimate 60 minutes for everything from "write 19 lines including comments to stub 3 empty methods in new class" to tasks I'd expect to take a senior 2 days.
What were you using it for? Claude is really good at agentic stuff. Pure coding, I can see Codex being better, but for the entire workflow, I'm not sure.
I use Codex purely for coding, and that's 90% of my use case for AI in general (10% using ChatGPT web for misc stuff). I pop out to Opus in Claude Code regularly to try to stay up on their relative performance, but so far the primary value I've been able to derive from CC is as a second set of eyes for code review / poking holes in plans. For primary planning / debugging / implementation Codex outclasses it atm sadly.
I use Opus 4.6 Fast-mode. It produces significantly better results in my work than any Codex 5.3 tier.
Me too. It's great that my employer pays for it and there's basically no budget, because this configuration is 10x more expensive than the regular default Sonnet.
Rapid iteration would possibly make up for the drop in quality, but I can't afford to use fast mode as I'm a contractor and pay for my own AI usage :(
Yep. For the past month I’ve been doing this thing where every time I need something from AI, I give Opus and Codex the same prompt. Opus is just better by a wide margin, especially on complex tasks. It uses tools quite a lot, taps into available MCP servers when it makes sense, and can think about repercussions down the line much better. Codex I feel is optimized for brevity, approaching terseness. Hard to put my finger on it but it’s never as thorough and it always misses important details.
Agree on the gap - in my own complex greenfield software dev spec test, Opus 4.6 blows Codex 5.3 out of the water, by a wide margin, both in UI and backend.
Massively better, and I cannot understand how many comments online say that they're comparable (other than paid actors, which now fits the right-wing angle that OpenAI takes, because right-wing paid online comments seem quite common overall).
I remember on the Opus 4.5 release date watching what it could do to the test app I wanted it to build and saying out loud to myself "oh shit" because of how much better it was at the conversation, planning, understanding, and building. Posts like this[0] say similar things: the Opus 4.5 release + Claude Code was the tipping point, the gap is widening, and Anthropic has infinitely more momentum and is going in the better direction with useful models that aren't fully aligned with bad actors.
No, it is because of the public perception of Anthropic holding a principled stance against allowing their software to pull the trigger and kill humans. ChatGPT still has the bigger brand-name recognition.
Anyone who's used both Claude and ChatGPT will instantly agree which is better, by a large margin. There's maybe a brand-recognition long tail, but those are more likely the rare occasional users on the free tier. Thus ChatGPT is becoming the shitty free AI app while Claude is what you use to get real work done. Time (in months) will tell how this goes.
If that's entirely the case, there could still be interesting implications, as people who switch to Claude are unlikely to switch back to ChatGPT in the near future. (If, that is, they regularly use LLMs for any technical or professional task.)
OpenAI had the first-mover advantage.
Sam squandered it.
I guarantee you the people downloading these apps aren't thinking about that. They use what works best.
How would most people know what works best? Most people are only using one.
I was going to say that I was surprised that enough "normal" users had heard about the Pentagon news story that it would make a difference.
Then I remembered that the app store rankings [1] seem to be based on activity from just the past day or so.
And so a lot of "plugged-in" users switching to Claude all at once then would be enough to briefly send Claude to #1, since the migration would be sizeable in comparison to the normal daily download baseline.
But we can also expect that this would probably be just a blip for a couple of days, as it's unlikely to make much difference in the baseline ratio for the general population.
I don't think you realize how mainstream the story has been. My grandfather, who is as non-tech as humanly possible, brought it up to me.
Agreed. My wife acts as my North Star for knowing whether tech news is in the mainstream. She doesn’t care about tech.
She said yesterday and Friday her instagram was full of people deleting ChatGPT and downloading/paying for Claude.
It’s definitely beyond the tech bubble.
This does remind me of the brief period after the TikTok annexation where people moved to Xiaohongshu (RedNote) .. for about three days.
It's much, much closer to the X -> Bluesky migration, I expect.
My girlfriend migrated to Claude from Gemini and she's not techie at all. She says she likes the answers Claude gives a lot more in general because Gemini is too dramatic. Claude is definitely beyond the tech sphere.
The real story here is "how the hell did Dick's Sporting Goods get to #3 and how can I get me some of their magic dust?"
Well, apparently they give you some minuscule rewards points for meeting step or activity goals. I told my wife, who was immediately intrigued. Pump out more apps that appeal to women.
This does not check out for me intuitively. You're telling me Dick's Sporting Goods, with 800 stores, was able to get to the top of the App Store leaderboard, which companies spend a lot of effort and money on, against the largest AI behemoths, by giving rewards for activity goals? I think it has to be something else.
UnitedHealthcare does that too and it's great because it's just straight cash on a Visa gift card. Like $1.25 a week for just doing your normal stuff and more for hitting other goals. Easy to rig too…
Must be pretty high turnover on these charts, they're not on there at all now. Which might mean the absolute numbers are smaller than you'd expect.