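Claude Code vs. Codex: Choosing the Best AI Coding Assistant

Long read · March 11, 2026

I used Claude Code for months, then moved to Codex. I have now switched back to Claude, and the reason has nothing to do with benchmarks. I also tested both on the same task.

In this article:

I discuss the different aspects of Claude Code and Codex,

the difference between the two flagship models powering them, Opus 4.6 vs. GPT-5.3-Codex,

what really changes your AI coding experience,

and a small case study in which I used both for the same task: building a RAG pipeline.

Fair warning: this article takes about 12 minutes to read. If you are about to commit $200/month to either tool, I think those 12 minutes are well invested.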

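Opus 4.6 vs. GPT-5.3-Codex: Task-Completion Time Horizon

One reliable way to compare Codex and Claude Code is to look at their underlying flagship models and the task-completion time horizon measured for each.

The comparison asks: how long a task can this model reliably complete? The task-completion time horizon is the task duration (measured by how long a human expert needs) at which the model is predicted to succeed with a given level of reliability. So a model with a "2-hour time horizon at 50%" means: give it a task that would take a skilled human 2 hours, and the AI succeeds about half the time.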
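To make the definition concrete, here is a minimal sketch that assumes success probability follows a logistic curve in the logarithm of task length. That curve shape, the `slope` parameter, and the function names are illustrative assumptions, not details taken from the study. The sketch shows why the horizon at 80% reliability is always shorter than the horizon at 50%.

```python
import math


def success_probability(task_minutes, horizon_50_minutes, slope=1.0):
    """Predicted success rate; equals 0.5 exactly when the task length is the 50% horizon.

    Assumes a logistic curve in log2(task length). `slope` is a hypothetical
    parameter controlling how fast success drops as tasks get longer.
    """
    x = math.log2(task_minutes) - math.log2(horizon_50_minutes)
    return 1.0 / (1.0 + math.exp(slope * x))


def horizon_at(reliability, horizon_50_minutes, slope=1.0):
    """Task length (minutes) at which predicted success equals `reliability`."""
    # Invert reliability = 1 / (1 + exp(slope * x)), where x = log2(t / h50).
    x = math.log((1.0 - reliability) / reliability) / slope
    return horizon_50_minutes * 2.0 ** x


h50 = 120  # the "2-hour time horizon at 50%" from the example above
print(success_probability(120, h50))   # 0.5 by construction
print(horizon_at(0.8, h50) < h50)      # True: the 80% horizon is shorter
```

With a different assumed `slope` the 80% horizon moves, which is one reason two models can be far apart at 50% and closer at 80%.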

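For this study, the researchers used the appropriate scaffold for each model, including Claude Code and Codex. So while the focus is on the model rather than the scaffold, the results also give a rough idea of how reliable the scaffolds are. They tell us which of these coding agents can handle longer, harder tasks.

The study's chart shows a big gap between Opus 4.6 and GPT-5.3-Codex. Opus 4.6 reaches 50% success on tasks of about 12 hours, while for GPT-5.3-Codex the figure is 5 hours and 50 minutes. At 80% success, the gap between the two models narrows.

This clearly indicates a gap between the two models and, by extension, between Claude Code and Codex in how well they tackle difficult tasks. It may not translate directly to the kind of tasks you use them for, so keep that in mind.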

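Claude Code is Faster, but Speed Doesn't Matter That Much

Claude is famously faster than Codex, but working with coding agents is a long-term process.

Suppose one agent finishes a task in half the time but then costs you 10 minutes of debugging, while another spends longer on the implementation and needs no babysitting afterwards. The extra implementation time is 100% worth it.

This is not to say that either Claude Code or Codex makes more mistakes. It is a general idea to keep in the back of your mind when you evaluate the agents yourself or hear people flex their agent's coding speed.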

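The Task Matters for Agents

Codex and Claude Code perform differently depending on the coding task. In an AI engineering task one might outperform the other, while in a web development task that same model might be obliterated.

Which coding tasks are better suited to Codex or to Claude Code?

This has not been studied well.

For example, it is not clear which one to use for low-level programming. Ideally, you would test both in a simple, verifiable setup before going all-in. But spending $300 to $400 on both at once is not feasible for most people.

Fully reviewing both agents across a variety of coding tasks is an interesting area of research, but it is not trivial, because these agents and the models powering them change drastically every few months.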

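How Each Came to Exist

Claude Code started as a side project by @bcherny at Anthropic, who built a terminal prototype that could interact with the Claude API, read files, and run some bash commands.

By day five, half the internal team was using it. Claude Code was then released as a research preview on February 24, 2025, using Claude 3.7 Sonnet. It took some time to be mass-adopted by developers, and Anthropic later released a VS Code extension for it as well.

OpenAI, on the other hand, announced the original Codex model as a 12B-parameter GPT-3 model fine-tuned on GitHub code, which eventually powered the first version of GitHub Copilot. The new Codex is an entirely different product, though.

Codex CLI first launched on April 16, 2025, as a terminal agent, and has evolved with better models ever since. The latest GPT-5.3-Codex (February 5, 2026) is described by OpenAI as "the first model that helped create itself."

@GergelyOrosz has two very interesting interviews with the developers of Claude Code and Codex, covering their tech stacks, how they develop the tools, and how each one got started. You can learn a lot from these two interviews.

👉 How Codex is Built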

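Tech Stacks and Powering Models

Claude Code is written in TypeScript and uses React with Ink to render the terminal UI. It ships as a single Bun executable (Anthropic acquired Bun in December 2025 for this reason). The Opus and Sonnet models it uses also support a 1M-token context window.

The Codex CLI is written in Rust, for performance, correctness, and portability. OpenAI even hired the maintainer of Ratatui (a Rust TUI library) for the team.

Both CLI tools are essentially thin wrappers around the models they call through the API. I have noticed some small "glitches" when working with the Claude Code CLI that I hardly noticed with Codex, and I think that may be related to their tech stacks.

However, these glitches are nothing more than mildly annoying; they do not really affect your coding experience.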

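Benchmarks Are Close, but with Nuances: Token Economics

The biggest performance difference is not accuracy but token efficiency. A comprehensive review of Opus vs. Codex by Morph shows an interesting gap.

Claude Code uses 3.2–4.2x more tokens than Codex on identical tasks. On a Figma plugin build, Codex consumed 1.5M tokens compared to Claude's 6.2M.

If this is true, it means that for the same subscription price, you are more likely to hit token limits sooner with Claude Code.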
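As a quick worked example, the sketch below reproduces the ratio from the Figma plugin figures and shows what it would mean under a fixed token budget. The 50M-token budget is a purely hypothetical number chosen for illustration; the article does not state the actual limits of either subscription.

```python
# Token counts reported for the Figma plugin build.
CODEX_TOKENS = 1_500_000
CLAUDE_TOKENS = 6_200_000

ratio = CLAUDE_TOKENS / CODEX_TOKENS
print(f"Claude Code used {ratio:.1f}x the tokens of Codex")


def tasks_within_budget(budget_tokens, tokens_per_task):
    """How many whole tasks of a given size fit into a token budget."""
    return budget_tokens // tokens_per_task


# Hypothetical budget, NOT a real plan limit.
BUDGET = 50_000_000
print(tasks_within_budget(BUDGET, CODEX_TOKENS))   # tasks at Codex's consumption
print(tasks_within_budget(BUDGET, CLAUDE_TOKENS))  # tasks at Claude Code's consumption
```

The ratio for this single task lands inside the reported 3.2–4.2x range, so under any equal budget the Claude Code side would run out after roughly a quarter as many comparable tasks.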

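The Feeling Matters the Most

Claude feels like a senior developer doing the work for you, while Codex is more like a contractor you hand tasks to and then come back to collect the results.

This is the common way developers describe the difference.

Claude Code reportedly has a strong interactive feel, along with the deep reasoning quality expected of Opus. It asks you questions, shows its reasoning, and explains its approach. Although this was not the case in my single comparison experiment, I can confirm it from my many months of using Claude.

Codex is known for its first-attempt accuracy on straightforward tasks, which comes at the cost of slightly slower implementation.

With all that said, the difference in behavior shrinks considerably once you spell out what you want in AGENTS.md. If you state that the model must confirm the implementation plan with you before going off guns blazing, it will do so, regardless of whether you are using the "senior developer" agent or the "contractor" agent.

This is not to say that the agents are not different. They are.

But the difference is not as exaggerated as you commonly hear on X.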

Quick Numbers

On the VS Code Marketplace, Claude Code has 6.1M installs with a 4/5 rating, while Codex has 5.4M installs with a 3.5/5 rating.

On GitHub, Claude Code has approximately 65–72K stars and Codex has about 64K stars.

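Why I'm Moving Back to Claude Code for Now

Anthropic's Ecosystem Pulls Hard

Choosing between Codex and Claude Code is not just about coding. A subscription to either one is a subscription to the whole Anthropic or OpenAI ecosystem, and that is something you may want to consider.

I personally believe Claude is becoming a very hot ecosystem, similar to Apple's. There is now Claude Cowork, Claude Chat, and Claude Code. Anthropic also seems to be slowly building a safer, tamer version of OpenClaw (your proactive personal agent) through the Claude app, with the small pieces being rolled out gradually.

On OpenAI's side, I am not seeing anything enticing at the moment. Aside from Codex, everything else seems dull. I do not feel an ecosystem, just fragmented bits and pieces for which better alternatives exist.

I already use Claude Chat rather than ChatGPT. For me, ChatGPT is borderline unusable at this point compared to Opus. The UI, the tone of the chat, and the model selection: none of them really encourages me to use ChatGPT.

So, given that I use Claude Chat frequently, plan to tinker with Cowork, and see no deal-breaking improvement in migrating from Claude Code to Codex, going back to Claude Code and cutting a $200/month subscription out of my expenses was an easy choice.

This has become a major factor for me, and it strongly influenced my decision to move back to Claude.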

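Pricing

The pricing for Claude Code and Codex is basically the same:

Entry: $20/month for both

Power user: Claude Code has a Max 5x plan priced at $100/month

Heavy user: $200/month for both

Where Claude Code really shines is that it offers a $100/month middle tier rather than a crazy jump from $20 straight to $200. I believe the Max 5x plan ($100/month) is adequate for most developers.

So in a sense, Claude Code can be cheaper in practice, because it lets you pick a cheaper plan that fits you rather than forcing you up the pricing ladder.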

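Skills and Plugins: The Developer Ecosystem

Because skills are compatible between Claude Code and Codex, you will not notice a difference whichever one you use. However, most skill hubs and repos are named after Claude Code, which can be a little confusing.

The same holds for many other things. Most of the posts about coding agents you see on Reddit, X, or blogs are about Claude Code rather than Codex, even though the same principles apply to both. That tells you something about its popularity and community size.

Codex launched support for skills and plugins much later than Claude Code. Plugins are also less compatible than skills, and because plugin support in Codex is so recent, not many are available.

That said, many developers, including me, do not use plugins at all. So unless you specifically need support for various plugins, this is not something to worry about or base your decision on.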

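RAG Pipeline: A Case Study

For the comparison, I chose a task that can be assessed quantitatively. The problem with building a landing page, for example, is that it is a qualitative task: one person may think a landing page looks cool while another calls it purple-gradient slop.

So I chose to build a simple RAG pipeline, since the accuracy of the generated answers can be expressed in numbers.

If you want to run a similar comparison yourself, other good ideas include training a vision model, fine-tuning an LLM, or measuring the performance of a low-level program.

Building a retrieval pipeline is a common task for an AI engineer, and plausibly something you would use Claude Code or Codex for at work. I asked both coding agents to build me a RAG Q&A pipeline for research papers. The workflow is simple:

Take a number of papers and extract their text.

Chunk the contents into smaller pieces.

Embed each chunk into a vector space.

When a user asks a question, find the chunk embeddings closest to the embedding of the question.

Retrieve those close chunks in their original form (not their embeddings).

Use that context to answer the user's question.

This task is simple enough to implement in one session, but it contains intricate details that massively influence the output:

- what chunking strategy to use
- how to embed the chunks
- which vector store to choose
- how to handle the confidence of which chunk is closer to the query
- whether to rephrase the user's query to help find more similar chunks, and so on.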
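The workflow above can be sketched in a few lines of standard-library Python. This is a toy illustration, not either agent's implementation: `embed` is a bag-of-words stand-in for a real embedding model, the fixed-size word chunker and the `min_score` cutoff are assumptions, and the final LLM call is left out so that only the prompt is built.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk(text, max_words=40):
    """Naive fixed-size chunking by word count (no overlap)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(documents, max_words=40):
    """Store each chunk's original text next to its vector."""
    return [(piece, embed(piece)) for doc in documents for piece in chunk(doc, max_words)]


def retrieve(index, question, k=5):
    """Return the k (score, original chunk text) pairs closest to the question."""
    q = embed(question)
    scored = sorted(((cosine(q, vec), piece) for piece, vec in index), reverse=True)
    return scored[:k]


def build_prompt(index, question, k=5, min_score=0.1):
    """Build the LLM prompt, or return None so the caller can send a fallback response."""
    hits = [(score, piece) for score, piece in retrieve(index, question, k) if score >= min_score]
    if not hits:
        return None
    context = "\n\n".join(piece for _, piece in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "FAISS is a library for similarity search over dense vectors.",
    "Bananas are rich in potassium.",
]
index = build_index(docs)
print(build_prompt(index, "What is FAISS used for?"))
print(build_prompt(index, "zebra quantum"))  # None: no evidence, so fall back
```

Each of the design questions in the list maps to one function here: swapping `chunk`, `embed`, the index structure, or the `min_score` gate changes the answers the pipeline produces.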

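The Experiment Setup

I took 5 research papers from the past week's @huggingface daily papers and created a test dataset (size = 100) of questions and ground-truth answers, which I later used to test how good the Claude and Codex implementations are.

For both coding agents, I specified the following:

Build a Python RAG pipeline

Process all PDFs using `PyMuPDF`

Choose a good chunking strategy for this use case

Create embeddings and a persistent local vector index (your choice)

Generate final answers with `llama-3.1-8b-instant`

If no sufficient evidence is found, do not hallucinate; return a fallback response

For both Codex and Claude Code, I used the best, most popular models available by default: gpt-5.3-codex and Opus 4.6, both set to High effort (the degree of reasoning). Neither had an AGENTS.md.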

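How They Implemented the Pipeline

I did not notice any real difference in how each agent thinks about the task, other than that Codex is more verbose in explaining its plan and what it is about to do. Claude simply writes the files and executes the commands without saying much about it.

Codex also took longer than Claude to finish the task.

More importantly, Claude tested the script end-to-end and made sure the pipeline was ready to use.

Codex, on the other hand, finished the implementation but did not test or run the program, and instead told me to pip install the requirements and run the script. Unsurprisingly, I hit an error when running the script, which Codex then fixed. Claude's script ran with no problems whatsoever.

I have noticed this pattern with Codex: it tends to leave much of the hands-on work or environment setup to you rather than doing it itself.

While Codex tells you about an environment problem or implementation difficulty and then acts on it, Claude tends to take the liberty of fixing it outright, which can be good or bad depending on your preference.

I have also noticed that in a new session, Codex's time to first token can be as high as a minute, while it is much shorter for Claude Code.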

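Claude Code vs. Codex Implementation

Both coding agents took surprisingly similar approaches:

They both chose the same embedding model, all-MiniLM-L6-v2

They both set k=5 for Top-K retrieval

Both restricted the LLM in the system prompt to use only the provided context

This is where their approaches differed:

Vector storage: Claude Code chose ChromaDB as the vector DB, while Codex went for FAISS, a lower-level similarity search library that is more memory-efficient and faster.

Chunking: Claude Code used recursive character splitting. It tries \n\n first, then \n, then ".", and finally " ". The target is 1000 characters per chunk with a 200-character overlap. Codex used sentence-level splitting with a word budget, filling chunks up to 220 words with a 40-word overlap. In other words, Claude Code splits by structure (paragraphs → lines → sentences → words) and measures in characters, while Codex splits into sentences first and then packs them into word-budget bins. Codex's approach respects sentence boundaries and avoids mid-sentence cuts, but 220 words may be too small for this context (academic text).

Retrieval: Both select the Top-5 chunks. Claude Code returns raw L2 distances, while Codex returns inner-product (cosine) scores.

Confidence: Claude Code applies a single threshold to the best L2 distance (>1.2 = irrelevant) and then checks the average distance to separate low-confidence from well-grounded chunks. Codex uses a multi-criteria scheme with three tiers: strong, moderate, and insufficient.

Code architecture:

Claude Code: flat functions, constants in each module, no input validation for model consistency.

Codex: an OOP pipeline class, centralized config, dataclasses, an argparse CLI, and model consistency validation.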
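To illustrate the two chunking styles compared above, here is a minimal sketch of each. Neither agent's actual code is shown in this article, so both functions are reconstructions under assumptions: the sentence regex, the function names, and the merge logic are hypothetical, and the recursive splitter omits the 200-character overlap for brevity.

```python
import re


def pack_sentences(text, max_words=220, overlap_words=40):
    """Sentence-first chunking: pack whole sentences into word-budget bins.

    A single sentence longer than the remaining budget can still push a
    chunk over max_words; this sketch does not split such sentences.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_words:]  # carry the tail forward as overlap
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def merge(pieces, max_chars, sep):
    """Greedily rejoin adjacent pieces while the result fits in max_chars."""
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, max_chars=1000, separators=("\n\n", "\n", ". ", " ")):
    """Structure-first chunking: try coarse separators before finer ones.

    Separators that fall on a chunk boundary are dropped, and overlap is omitted.
    """
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = []
            for part in text.split(sep):
                pieces.extend(recursive_split(part, max_chars, separators[i + 1:]))
            return merge(pieces, max_chars, sep)
    # No separator left: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


sample = " ".join(f"Sentence {i} is here." for i in range(100))
print(len(pack_sentences(sample, max_words=50, overlap_words=10)))
print(len(recursive_split(sample, max_chars=300)))
```

The word-budget packer never cuts inside a sentence but its chunk starts can land mid-sentence because of the overlap; the recursive splitter keeps paragraphs together when they fit and only falls back to finer separators when they do not.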

Codex's pipeline is clearly better engineered and more configurable. In large, more serious codebases, this is critical.
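The retrieval and confidence items above use different score types, which makes the two pipelines' thresholds hard to compare directly. The sketch below relates them under two assumptions: the embeddings are unit-normalized, and the reported distance is the plain (not squared) Euclidean distance. Some vector stores report squared L2 instead, in which case the mapping differs. The `gate` function mimics the single-threshold style described for Claude Code's pipeline; its `max_mean` value is hypothetical, since the article only gives the 1.2 cutoff.

```python
import math


def cosine_from_l2(distance):
    """For unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - distance ** 2 / 2.0


def l2_from_cosine(cos):
    """Inverse mapping, again assuming unit-length vectors."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos))


def gate(distances, max_best=1.2, max_mean=1.0):
    """Classify a retrieval result from its L2 distances (smaller is closer).

    max_best follows the >1.2 = irrelevant rule from the article;
    max_mean is a hypothetical value for the average-distance check.
    """
    if not distances or min(distances) > max_best:
        return "insufficient"
    mean = sum(distances) / len(distances)
    return "well_grounded" if mean <= max_mean else "low_confidence"


print(round(cosine_from_l2(1.2), 2))  # cosine similarity equivalent to the 1.2 cutoff
print(gate([1.3, 1.4]))               # best chunk is too far away
print(gate([0.5, 0.9]))               # close on average
```

Under these assumptions, an L2 cutoff of 1.2 corresponds to a fairly permissive cosine similarity, which is consistent with the looser gating noted in the results.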

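Results

Using gpt-5.4 as the LLM-as-a-judge, the answers of both pipelines were compared on four criteria: Correctness, Completeness, Relevance, and Conciseness.

Of the 100 questions, Claude Code won 42, Codex won 33, and 25 were ties. Claude won mostly because of its looser confidence gating, and possibly also a slightly higher generation temperature (0.2 vs. 0.1 in Codex's pipeline).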
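One way to gauge how much weight a 42-to-33 split can carry is an exact two-sided sign test on the non-tied questions. This is an illustrative check under the assumption that each question is an independent trial; it is not an analysis the experiment itself included.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test: ties are dropped, the null is a 50/50 split."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


claude_wins, codex_wins, ties = 42, 33, 25
win_share = claude_wins / (claude_wins + codex_wins)
p_value = sign_test_p_value(claude_wins, codex_wins)

print(f"Claude Code share of decided questions: {win_share:.0%}")
print(f"two-sided sign test p-value: {p_value:.2f}")
```

The p-value comes out well above the conventional 0.05 level, so on 100 questions a lead of this size is compatible with the two pipelines being about equally good.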

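A Pinch of Salt

This was a very simple setup, and I was mostly curious to see the different approaches the two coding agents take when implementing the same closed-ended task. In a professional setting, it is the developer who makes the calls on the overall architecture: the chunking method, the vector DB, the retrieval strategy, and so on. Developing such systems professionally also requires much more testing and iterative improvement, with more reliable test sets and verification.

Still, it is quite expected that a junior developer without much experience building RAG pipelines would leave these decisions to the AI.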

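Just Pick One

I do not think there is any terminally wrong decision here, whether you choose Claude Code or Codex. Both offer strong models compared to the rest of the landscape and get the job done to a similar degree.

Two major factors for me have been the Anthropic ecosystem and the $100/month pricing tier. Even if I had to bump that tier up to $200/month, I would still stick with Anthropic's Claude Code for the former reason.

The most important thing is what you use these scaffolds for and how you use them.

That determines which one is better for you more than any benchmark does. There is no clear answer other than your gut telling you which one feels better after you have tested both.

There are developers like @steipete who swear by Codex, and there is a community that believes Opus is simply unrivaled by OpenAI's models.

I think both are correct at the same time, simply because their workflows with these coding agents, and their "taste", are different.

If you are unsure which one to go with, I suggest trying the $20/month version of each in the programming field that is relevant to you, preferably on several verifiable tasks.

Finally, keep in mind that, like everything else in AI, this landscape changes drastically every few months. You might prefer one of them now, but three months later the agent's behavior may have drifted, or a new model may have hit the market.

Very few things in AI have definitive global answers, and this subject is not one of them ;)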


Source

https://x.com/i/article/2030946053629915136