Claude Overreaches. Codex Underreaches. I'm Still Figuring Out How to Use Both.
After a Claude outage left me shipping nothing, I added Codex. Here's why running both AI coding agents beats picking one — and what I'm still figuring out.
I was a one-agent guy until Claude had a run of outages.
On those days I didn’t ship less. I shipped nothing. I’d open my editor, remember Claude was down, stare at the codebase, close the editor. A single-vendor dependency masquerading as a workflow.
So I reluctantly installed Codex CLI. Poked at it. Resented it for a week. Then, task by task, I caught myself reaching for it on purpose, even when Claude was up.
I still don’t have the workflow figured out. What I do know is that “pick one” is the wrong frame, and the Reddit threads that get it right aren’t the ones with the most upvotes.
The One Sentence That Explains Everything
From a 520-upvote r/ClaudeCode thread analyzing both tools’ open-source prompts:
“Claude Code reads like a product trying to create initiative while Codex reads like a product trying to prevent drift.”
— u/idkwhattochoosz
And the pithier version, from the comments:
“Claude is more willing to sin by overreaching. Codex is more willing to sin by underreaching.”
— u/entheogenicentity
Read those twice. That’s not a model-quality take. That’s a product-philosophy take. Two teams looked at the same question — what should an agent do when it doesn’t know what you meant? — and picked opposite defaults. One said “guess and move.” The other said “ask and wait.”
Claude Code’s system prompt pushes hard toward initiative: “A good colleague faced with ambiguity doesn’t just stop — they investigate, reduce risk, and build understanding.” Codex’s harness does the opposite: narrow the ambiguity, verify, don’t guess.
Every “Claude vs Codex” benchmark you’ve seen is scoring two products that were never competing on the same axis. It’s like benchmarking a kayak against a sedan because they both move you forward.
My Honest Opinion: Codex’s Harness Is Better
This is going to get me yelled at in r/ClaudeCode, and that’s fine.
After several weeks running both, Codex’s harness feels more mature. Not the model — the harness. The scaffolding around the model. The way it handles ambiguity, scope, and completeness.
Three things Codex does that Claude Code still doesn’t:
1. It doesn’t lie about completion. Claude will hand you a summary saying the work is done, tests pass, shipping-ready. Codex more often flags what it didn’t fix, what it wasn’t sure about, what it skipped. One r/ClaudeCode commenter put it better than I can: “Claude will always claim all is done and ready, while Codex will flag it and say ‘no, there is this and this and this that still need to be fixed.’”
2. It respects your instructions. Claude treats CLAUDE.md as a helpful suggestion. Codex treats AGENTS.md as a contract. If you tell Codex “don’t touch the migration files,” it doesn’t touch them. If you tell Claude the same thing, you’ll find a migration file edit in the diff and a cheerful note about how it improved schema consistency.
3. The restraint scales better. Claude’s “volunteer more” bias is delightful at 30 minutes of work. It becomes a liability at 3 hours. Codex’s restraint is annoying in a small task and load-bearing in a long one.
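The contract framing changed how I write the instructions file itself. My AGENTS.md now reads like constraints rather than context. A sketch of the shape (the filename is Codex's real convention; the specific rules and paths are illustrative, not from my actual project):

```markdown
# AGENTS.md

## Hard constraints
- Do not modify anything under migrations/.
- Do not add dependencies without asking first.
- Run the existing test suite before declaring a task done.

## Context
- The schema file is generated; treat it as read-only.
```

The same file works as a CLAUDE.md, with the caveat from point 2: one agent treats it as a contract, the other as a suggestion.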
None of this means Claude Code is bad. It means Claude Code is optimized for a different shape of work than I’m doing. The initiative bias is a great fit for exploration and greenfield work. For production changes to a real codebase, Codex’s paranoia is the right default.
Here’s the one that changed my mind. I built Supabyoi (managed self-hosted Supabase) with Claude Code. When the MVP felt feature-complete — Claude’s verdict, confidently delivered, complete with a tasteful little summary of everything that worked — I ran a second pass on Codex in a parallel directory (~/supabyoi-codex). Just to see.
Codex came back with a whole second project’s worth of findings. Not the usual “bugs Claude missed.” Bugs Claude had confidently signed off on. Shipping-ready, per Claude. Not shipping-ready, per Codex. Codex was right about every one of them.
That was the week I stopped treating Codex as the thing I installed during an outage and started treating it as a different kind of reviewer. Not better. Differently biased. A second pair of eyes is only useful if it’s not the same pair of eyes.
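The parallel-directory trick doesn't need a second clone, either. A git worktree gives the second agent its own checkout of the same repo. A minimal sketch, using a throwaway repo in place of the real project (paths illustrative):

```shell
# Toy repo standing in for the real project; the worktree pattern is the point.
cd "$(mktemp -d)"
git init -q supabyoi && cd supabyoi
git -c user.email=me@example.com -c user.name=me \
    commit -q --allow-empty -m "mvp"

# Second checkout in a sibling directory for the Codex review pass.
# Both directories share one object store, so the audit can't diverge
# from the code Claude actually signed off on.
git worktree add --detach ../supabyoi-codex
ls -d ../supabyoi-codex
```

Run Codex in `../supabyoi-codex`, and any fixes it proposes come back as a diff against the exact commit Claude called shipping-ready.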
Why You Should Actually Run Both
The flip side — and this matters, because I don't want this post read as "switch to Codex, you fool" — is that Claude's initiative bias is a real asset. You just have to point it at the right phase of the work. The problem isn't Claude. It's that you're using Claude for the part of the job Codex is better at, and vice versa.
Four reasons to dual-sub instead of picking:
1. Hallucination diversity. This is the biggest one and almost nobody articulates it clearly. From u/campbellm on Reddit:
“I’ve been doing ‘have claude write something, have codex review it, have claude consider and critique that review.’ It is VERY unlikely that both will hallucinate the same way.”
Two models trained on different data with different RLHF signals don’t fail identically. When Claude writes confident-but-wrong code, Codex flags it. When Codex skips a subtle edge case, Claude’s “check adjacent concerns” bias picks it up. You get a natural adversarial review without hiring anyone.
2. The planner-executor split. Use Claude for the part it’s good at — exploring a messy problem space, drafting a plan, proposing a dozen angles. Then hand the plan to Codex for implementation. u/ocombe on r/ClaudeCode: “Run claude for the plan & fast work, use codex for thorough plan & code reviews.” u/mrothro’s version: “I use Claude Code for ideating and small implementation, then tell it to run Codex to do complex implementations and code reviews.”
The pattern is consistent across the threads: Claude’s strength is at the start (wide search, first drafts); Codex’s strength is at the end (narrow, verify, harden).
3. Cross-harness rule enforcement. Rules one model ignores, the other enforces. If Claude drifts on a constraint you set, Codex catches it in review. If Codex is too literal and missed an obvious improvement, Claude’s adjacent-concerns bias surfaces it. Two different failure modes cancel each other out.
4. Throughput. Both platforms throttle hard at the Max/Pro tier. When Claude hits limits on Friday morning, you switch to Codex and keep shipping. One r/ClaudeCode commenter reported pulling down from a Claude 20x plan to 5x, then adding a $100/mo Codex plan — roughly the same total cost, dramatically more runway. I’m not sure that math works for everyone, but the principle holds: one subscription is a single point of failure.
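The write/review/critique loop from point 1 is simple enough to script. A sketch with the two CLIs stubbed out as shell functions so the handoff itself is visible; in practice the real invocations would be each tool's non-interactive mode (`claude -p "..."` and `codex exec "..."`), and the prompts here are placeholders:

```shell
# Stubbed sketch of the draft -> review -> reconsider loop.
# `claude` and `codex` below are placeholder functions, NOT the real CLIs,
# so the data flow is visible without either tool installed.
claude() { echo "claude: $1"; }
codex()  { echo "codex: $1"; }

draft=$(claude "Implement the rate limiter described in PLAN.md")
review=$(codex "Review this change for gaps and unhandled edge cases: $draft")
final=$(claude "Consider this review, push back where it is wrong, and revise: $review")
echo "$final"
```

The third step matters: Claude reconsidering Codex's review is where the "critique that review" part of u/campbellm's loop lives, and where the two biases actually collide instead of just taking turns.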
Agent-Flywheel Is the Tooling Signal
There’s a product called agent-flywheel.com that pre-configures Claude Code, Codex CLI, and Gemini on a fresh VPS. Total damage — VPS plus both Max/Pro subs — lands between $440 and $656 a month. That’s a car payment for a car that writes your code.
What I find interesting isn’t the tool. It’s the bet underneath it: a whole product assumes real developers want all three installed by default. Six months ago that would have read as overkill. Today it reads as table stakes.
The hype cycle hasn’t caught up yet. The mainstream take is still “pick your favorite,” as though these were ice cream flavors. The people actually shipping production code with agents have quietly moved to “run both. Sometimes three. And don’t make a big deal about it.”
I’m planning to deploy it — not on a greenfield project (everybody has a greenfield story), but on an existing one already shipping to real users. The interesting question isn’t whether a three-agent stack works on a clean slate. It’s what breaks when you wire it into a codebase with real uptime constraints, customers, and six months of decisions the tooling didn’t witness. Real-world battle stories from agent-flywheel setups are scarce. I want to write one.
The Honest Part: I Don’t Have the Workflow Figured Out Yet
Everything above reads like I’ve got this nailed. I don’t. Here’s the list of things I still don’t know, offered in the spirit of not pretending:
When exactly to hand off. I know Claude should plan and Codex should review. I don’t have a clean trigger. Sometimes I bounce mid-implementation because Claude is about to go off the rails. Sometimes I trust Claude to finish and Codex only sees the final diff. The “right” cadence isn’t obvious.
How much context to share. Each agent wants the full CLAUDE.md / AGENTS.md treatment. Writing both, keeping them in sync, and remembering which one has which convention is its own small job. I haven’t found a clean answer.
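One workaround I'm experimenting with (my own assumption, not something from the threads): keep a single source of truth and symlink the other filename to it, so the two files can't drift. Rule content below is just an example:

```shell
# Single source of truth: write the rules once, let both agents read them.
# The filenames are the tools' real conventions; the rules are examples.
cd "$(mktemp -d)"
printf '%s\n' '# Project conventions' \
              '- Do not touch files in migrations/' > AGENTS.md
ln -sf AGENTS.md CLAUDE.md   # both agents now see identical instructions
cat CLAUDE.md
```

The obvious limit: this only works while I want both agents following exactly the same rules, which, given everything above about their different biases, may not be what I want at all.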
Whether the adversarial review actually catches bugs. It sounds great in theory. In practice, most of the time both agents agree the work is done, and the bugs I catch in review are ones I would have caught with one agent too. The hallucination-diversity argument may be overstated at the tasks most of us are actually doing.
Whether the cost is worth it at my usage. I’m not running agents 40 hours a week. At $400+/month for the dual sub, I’m probably over-subscribed for my actual throughput. The math gets better if you’re coding all day. I’m not.
Who Should Dual-Sub, Who Shouldn’t
Do it if you’re a solo dev shipping production code daily. You’ll hit Friday-morning limits on one platform whether you budget for it or not, and the adversarial review actually catches things. The cost is real. The throughput gain is bigger. Do the math; it pencils.
Don’t bother if you code a few hours a week. The switching tax and the subscription burn aren’t worth it at low volume. Pick one and move on. Claude if you want initiative. Codex if you want restraint. Nobody is grading you on this.
It’s complicated if you’re at a day job where the company pays for one and you’ve got a side project. Use the company sub for the day job. Don’t stack a second personal sub unless the side project is actually shipping — not “actually going to ship next month,” actually shipping, this week, to real users. The number of people running dual subs to ship nothing is, I suspect, not small.
What This Is Really About
The “ditch ChatGPT for Claude” narrative was a 2025 story. It was right for its moment. But the 2026 version of that story isn’t “ditch Claude for Codex.” It’s “stop treating this as a winner-take-all market.”
Different models have different biases baked into their harnesses. Claude overreaches. Codex underreaches. Gemini is still figuring out its personality. The right move isn’t to pick the bias you like. It’s to stack biases against each other so their failure modes cancel out.
I don’t have this workflow figured out. Neither does anyone else I’ve read on Reddit, honestly — the high-upvote posts are mostly single-tool takes, and the real insight is buried in the comments of threads with a few hundred upvotes.
But “only use one” is already wrong. That much is clear.

