Codex writes better code than Claude Code. I moved everything back to Claude Code anyway.
TL;DR: I’ve run Claude Code and Codex side by side for months, and I still think Codex produces the better code. But I’ve just moved everything back to Claude Code, including the work projects that have lived on Codex since January. The reason isn’t quality, it’s iteration size. Claude Code works in short bursts I can check, fix and build on every few minutes. Codex wants to deliver the whole feature in one long run, and by the time it’s done there’s too much to review at once. It turns out the way a tool chunks the work matters more to me than how clean any single diff is.
Five tools in a few years
My history with AI coding only looks short until you remember how young the tech is. ChatGPT launched in November 2022, three and a half years ago, back when OpenAI was still mostly known for DALL-E making weird images. I’ve been coding with AI for nearly all of that. It started with pasting snippets into ChatGPT and copying the answers back into my editor. Then GitHub Copilot, before agent mode existed, which could write a few methods for you but couldn’t see across files. Then an early version of Claude Code, which is where the real shift happened, because for the first time the tool could work across a whole codebase instead of a cursor position.
In December 2025 Claude Code went through a bad patch, and early January was worse, so I jumped to Codex. For a while I ran both at the same time. Then I settled into a split that lasted months: Inselnova, my browser strategy game, stayed on Claude Code, and my work projects went to Codex. A few weeks ago I noticed I was getting much less done on the Codex side. Then Claude Fable 5 came out, and it’s great. I’ve kept Codex updated the whole time too, but a new Codex version never feels like a big jump, just an incremental step. Fable 5 felt like a jump. I stopped Codex and moved everything back.
So this isn’t a hot take from a week of testing. It’s two full switches, in both directions, with real projects on the line each time. It will also be out of date in a few weeks, because everything in this space is. Read it as my journey up to now, not a verdict.
The tools cost $100 to $200 a month
Worth saying before the comparison: these aren’t the $20 plans. At the level I use these tools, the subscriptions are $100 to $200 a month on both sides, and I’ve switched tiers multiple times as my usage changed. That’s not a spending habit, it’s the cost of learning a tech that makes the work much faster. On the $20 tier you hit the limits constantly, and saving the difference isn’t worth my time. When I was younger I could easily put $100 to $200 into a weekend of drinks. I’m an adult now, I don’t go out drinking anymore, so the money goes here instead.
At one point I gave up on ChatGPT entirely and unsubscribed. I went back, but not for the coding. The ChatGPT chat interface has a good memory, it knows my businesses by now, and it’s useful for business thinking in a way that has nothing to do with writing code. That’s its own subscription decision. The coding one is separate, and it’s the one this post is about.
Iterating keeps me under the rate limits
Rate limits are the most common complaint about Claude Code, and one report found it burned around four times more tokens than Codex on identical tasks. I don’t doubt the measurement. I’ve just never really lived it. On the $200 plan I’ve maxed out the limit a couple of times ever, and both were when I’d asked for something very generic, “search the web for x, then go do it” as a rough first pass. I rarely work like that. I’ve hit the Codex limit a few times more than that.
I think the loop is the reason. Because I’m iterating the way I always have, each burst is small, and I clear sessions between pieces of work instead of letting context pile up. Maxing out a context window is, I suspect, a vibe coder problem, for people who haven’t learned yet that you build a product in rounds. I use AI as an extension of how I already build. And the five hour window doesn’t bite either, because each round ends with me cleaning up and QA’ing what just landed, so there are natural breaks and gaps inside it. The same loop that keeps the review small keeps the meter slow.
Codex really does write better code
Let me concede the obvious point first, because I’m not going to pretend otherwise. Codex feels like it was trained on a higher quality body of code. Its first attempt is usually cleaner, it sticks to constraints more strictly, and it hallucinates less. My guess is that’s part of why OpenAI shipped it later than Anthropic shipped Claude Code. They were optimising for a different thing.
I should say the numbers don’t all back me up here. Composio spent 100+ hours with both and pointed at a survey of 500+ developers where 65% preferred Codex for daily work, yet when the output was reviewed blind, Claude Code’s code was rated cleaner two times out of three. Plenty of people experience the exact reverse of what I do. Sit with that for a second, because if the quality gap can flip depending on who’s looking at it, it’s a poor basis for picking a tool.
If the only measure was the quality of a single diff, Codex would win and this post would be over. But that’s not the unit that matters, at least not for me.
Ask Claude Code for the directory structure before any code
Where the gap shows most is scaffolding. Codex lays out a proper structure on its own. Claude Code, left alone, has a tendency to throw everything into a few files, sometimes just one. The fix is to make the structure part of the plan before any code exists.
I run Claude Code in plan mode and add one line: “show me a directory structure of the plan.” The way I see it, a framework is a collection of directories and file names. Controller, service, repo, model, validations, middleware. If you get the AI thinking like that, in the plan, in a skill, or in your AGENTS.md, Claude Code lays out the same structure Codex would have.
Tests are the same story with the roles swapped. Start coding with Claude Code today and, unless you ask, it typically won’t create tests. Codex writes and runs its own tests out of the box, which is a genuinely better default. But most of the time it throws them all into one flat directory instead of mirroring the structure of the code they cover. Both gaps dissolve the same way, with a line of guidance or a skill that carries your conventions. Neither tool’s default gaps are fixed costs. The fixes are habits you build into the harness once, and then you stop paying for them.
I’ve already paid. Months of building my game with Claude Code left me with a harness of skills and conventions, and that’s a big part of why the switch back was cheap. Wherever Claude Code falls short by default, the harness lifts it to the level Codex hits out of the box. So I get Codex’s structure with Claude Code’s loop, which is the combination I actually wanted all along.
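Conventions like these can also be checked mechanically, so a drift back to "everything in one file" or "all tests in one flat directory" shows up early. The sketch below is one minimal way to do that in Python, not something either tool ships. The layer names come from the list above; the `src` and `tests` roots, the `test_` file prefix, and the function names are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Layer directories from the list above. Hypothetical: adjust to your framework.
LAYERS = ("controller", "service", "repo", "model", "validations", "middleware")


def expected_test_path(source_path: str, src_root: str = "src", tests_root: str = "tests") -> str:
    """Map a source file to the test file that mirrors its location.

    Assumes sources live under `src_root` and tests use a `test_` prefix.
    """
    relative = PurePosixPath(source_path).relative_to(src_root)
    return str(PurePosixPath(tests_root) / relative.parent / f"test_{relative.name}")


def missing_tests(source_files: list[str], test_files: list[str]) -> list[str]:
    """Return the mirrored test paths that do not exist yet."""
    existing = set(test_files)
    expected = [expected_test_path(path) for path in source_files]
    return [path for path in expected if path not in existing]


def outside_layers(source_files: list[str], src_root: str = "src") -> list[str]:
    """Return source files that do not sit inside one of the known layer directories."""
    stray = []
    for path in source_files:
        parts = PurePosixPath(path).relative_to(src_root).parts
        # A file directly under src_root, or under an unknown directory, is flagged.
        if len(parts) < 2 or parts[0] not in LAYERS:
            stray.append(path)
    return stray
```

Fed with the file lists from a finished round, `outside_layers` flags code that landed outside the agreed structure and `missing_tests` lists the mirrored test files that still need writing. Either result can go straight back into the next prompt.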
The cup of tea problem
When I ran both side by side, the difference wasn’t in the diffs. It was in what I was doing while I waited. A Codex run would go off for ten minutes or more, and I’d sit there thinking, what do I work on now? Do I make another cup of tea? That gap, multiplied across a working day, is where my productivity went.
The obvious answer is to run multiple Codex sessions at once and fill the gap with parallel work. I never did. I don’t fully know why, but I never felt confident kicking off three long Codex runs at the same time, probably because each one was carrying so much undecided work. With Claude Code I parallelise without thinking about it, because each session is only ever a few minutes from its next checkpoint.
I build in 25% chunks. Codex wants to do the whole 100%.
The honest version of this post is that it’s about the software development lifecycle, not the models. I’ve always worked iteratively. When I start building something I roughly know where the goal is, but I want the freedom to take the journey there, because the path always bends and the result is better for it.
Claude Code matches that. It works in short bursts, and I inch toward done in maybe 25% chunks depending on the session. It runs for five minutes, tells me it’s ready, and I look at what it did. If it went wrong, I catch it there and redirect, while the mistake is still small. Each round I clean up, test and QA before moving to the next one. The review load arrives in pieces I can actually hold in my head.
Codex is the opposite shape. Before I ask it to do anything I need to be locked in and fully decided, because it’s going to run for a long time and I’m not going to interrupt it. Then it goes for the full 100% in one pass. And when it finally comes back, I have too much to QA at once. A big pile of mostly-good code is worse than four small piles of rougher code, because I stop reviewing properly somewhere around the middle of the big pile.
Here’s the whole comparison as I experience it:
| | Claude Code | Codex |
|---|---|---|
| How it moves | short bursts, maybe 25% of the feature per round | one long run at the whole feature |
| What I need before starting | a rough goal | every decision made up front |
| When it goes wrong | I catch it five minutes in and redirect | I find out at the end, with a lot to unpick |
| Review load | small, every round | one pile, too big to review properly |
| First-try code quality | rougher | cleaner |
| Scaffolding | needs to be asked for a directory structure | lays one out on its own |
| Tests | won’t write them unless asked | writes and runs its own, in one flat directory |
| Running sessions in parallel | without thinking about it | never felt confident doing it |
| When it can’t finish | quietly stops, you have to ask | tells you it didn’t get to the end |
Claude Code stops without telling you it’s not done
One thing I should be fair about, because it’s the flip side of those short bursts. When I give Codex a task it does the entire thing, and if it doesn’t get to the end it tells me. Claude Code doesn’t, unless you prompt it. I think it runs out of loops or something internally. It’s not about context length, it just kind of stops.
So you say “and now what?” and it thinks for a bit and tells you it’s done one task out of eight. The dangerous part is the excitement. It sounds finished, you think wow, I’m ready to test this, and if you’re not careful you go looking for the feature and find there’s nothing to click. It’s built the backend infrastructure and the tests, there’s nothing visual yet, and you’re 15% of the way there. The short loop only works if you treat every “done” as a claim to check, not a result.
The tool has to match how you already build
I’ve been writing software for over thirty years, and iterative development is the one habit that survived every language, framework and team in that time. Plan a little, build a little, look at it, adjust. A tool that needs the whole feature pinned down before it starts is asking me to work waterfall, and no amount of code quality makes up for that.
If you’re choosing between them right now, skip the benchmarks and ask yourself four questions instead:
- Do you know the full spec before you start, or does it form as you build? Locked in up front suits Codex. Forming as you go suits Claude Code.
- How big a diff can you honestly review in one sitting? Be truthful about this one. If the answer is small, you want short loops, whatever tool they come from.
- Will you actually run sessions in parallel? Short loops made that feel safe for me. Long ones never did, even after months.
- Do you have a harness yet? Skills, AGENTS.md, conventions. They close most of either tool’s default gaps, and they transfer when you switch.
That’s the takeaway, and it’s the one part of this post that won’t be out of date in a few weeks. When you’re picking an AI coding tool, the benchmark scores and the diff quality are the visible part. The part that decides whether you’re actually faster is the loop length, how much work the tool wants to take on before it checks back in with you. Pick the one whose loop matches yours.
I build Inselnova, a free browser strategy game, this way. Play here.