The Hidden Architecture of Claude Code, Codex, and Pi

The interesting thing about the major coding agent harnesses isn’t their feature lists. The feature lists are converging, and have been since at least 2025. Everyone has shell access. Everyone has file editing. Everyone has test integration. If you compared Claude Code, Codex, and Pi by ticking off features, you’d come away with the impression that they’re roughly the same product. They are absolutely not.

They embody three meaningfully different philosophies about how a coding agent should be structured, and the differences explain why developers tend to settle firmly on one rather than rotating among them.

Feature parity hides philosophical divergence

Before going further: I’m describing these systems as they appear to a builder reading their public behavior, documentation, and source-available components. Internal architecture details are inevitably partial. The point isn’t to audit them — it’s to use them as concrete examples of distinct design schools.

Three harnesses, three bets about where intelligence lives

Claude Code: minimal kernel, rich skills, conservative autonomy.

The Claude Code design philosophy reads as if it were built top-down from the principles in earlier posts in this series. The harness ships a small, opinionated set of core tools — read, write, edit, search, bash, glob, grep — and treats almost everything else as a skill that the model can discover and load when needed. The skill library is the procedural substrate; the core tools are the kernel. The model isn’t given a giant tool catalog to wade through. It’s given a small set of primitives and a way to discover specialized procedures.
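To make the kernel-versus-skills split concrete, here is a minimal sketch in Python. It is not Claude Code’s implementation: the `Harness` and `Skill` classes, the catalog format, and the `release-notes` skill are all hypothetical, and only the list of core tool names comes from the description above. The model initially sees one line per skill, and the full procedure enters context only when the skill is loaded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Core tool names as listed in the text; everything else here is hypothetical.
CORE_TOOLS = ("read", "write", "edit", "search", "bash", "glob", "grep")


@dataclass
class Skill:
    name: str
    description: str         # short summary the model sees up front
    load: Callable[[], str]  # returns the full procedure text, only when asked


@dataclass
class Harness:
    skills: Dict[str, Skill] = field(default_factory=dict)
    loaded: Dict[str, str] = field(default_factory=dict)

    def register(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def catalog(self) -> List[str]:
        """What the model sees initially: core tools plus one line per skill."""
        lines = [f"tool: {name}" for name in CORE_TOOLS]
        lines += [f"skill: {s.name} - {s.description}" for s in self.skills.values()]
        return lines

    def load_skill(self, name: str) -> str:
        """Pull the full procedure into context on demand, caching the result."""
        if name not in self.loaded:
            self.loaded[name] = self.skills[name].load()
        return self.loaded[name]


harness = Harness()
harness.register(Skill(
    "release-notes",
    "Draft release notes from git history",
    lambda: "1. Run git log since last tag\n2. Group commits by type",
))
```

The design choice worth noticing is that `catalog()` stays short no matter how long the individual procedures are, which is what keeps the model from wading through a giant tool catalog.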

Autonomy is bounded by default. Each meaningful action — file write, shell command, git operation — is visible to the user and, depending on the risk profile, gated by approval. The default isn’t “run wild and ask forgiveness.” The default is “show your work, propose the change, wait for the obvious go-ahead.” This is the collaborative-loop pattern from the previous post, taken seriously as a structural choice.
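A risk-gated approval loop of this kind might look roughly like the following. This is a sketch under assumptions: the `Risk` levels, the `ACTION_RISK` mapping, and the `gated_run` function are names I made up for illustration, not anything taken from a real harness.

```python
from enum import Enum
from typing import Callable


class Risk(Enum):
    LOW = 1     # e.g. reading a file
    MEDIUM = 2  # e.g. editing a file in the workspace
    HIGH = 3    # e.g. shell commands, git operations


# Hypothetical mapping; a real harness would define its own risk profile.
ACTION_RISK = {
    "read": Risk.LOW,
    "edit": Risk.MEDIUM,
    "write": Risk.MEDIUM,
    "bash": Risk.HIGH,
    "git": Risk.HIGH,
}


def gated_run(action: str, detail: str,
              execute: Callable[[], str],
              approve: Callable[[str], bool],
              threshold: Risk = Risk.MEDIUM) -> str:
    """Show every action; require approval at or above the risk threshold."""
    risk = ACTION_RISK.get(action, Risk.HIGH)  # unknown actions count as high risk
    proposal = f"[{risk.name}] {action}: {detail}"
    print(proposal)  # the action is always visible to the user
    if risk.value >= threshold.value and not approve(proposal):
        return "skipped: user declined"
    return execute()
```

In use, `approve` would be a prompt to the human. Raising `threshold` to `Risk.HIGH` loosens the gate without hiding anything, since the proposal is printed either way.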

The implicit bet here is that the substrate is more important than the model’s eagerness. A model that wants to be helpful can be made dramatically more useful by surrounding it with a structured environment of skills, careful context, and reviewable actions, even if it’s less aggressive about taking initiative. This bet is, in my view, basically correct — and the production usage data supports it.

Codex: research-oriented, runtime-flexible, more autonomous defaults.

Codex (and the broader category of CLI-first agentic platforms it represents) is designed for builders rather than end users. The harness is configurable, extensible, and pluggable. You can swap out the model. You can swap out the planner. You can write your own runtime. You can integrate your own tool surface. The bet is that the design space is too large, and changing too fast, for any single opinionated harness to be right for everyone — so the right move is to make the harness modular and let users assemble their own.

The autonomy defaults skew higher than Claude Code’s. Out of the box, a Codex agent will take more steps without intervention, explore more aggressively, and treat the user as a director rather than a continuous collaborator. This makes sense given the user profile: builders running experiments, researchers comparing approaches, teams doing automated coding workflows where high-friction approval steps would defeat the point.
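One way to picture the modular, higher-autonomy design is as a loop whose parts are all injected. The sketch below is illustrative only: the `Model` and `Planner` protocols, the `Agent` class, and the `max_steps` default are assumptions of mine, not Codex’s actual interfaces or settings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


class Model(Protocol):
    def complete(self, prompt: str) -> str: ...


class Planner(Protocol):
    def next_step(self, goal: str, history: List[str]) -> str: ...


@dataclass
class Agent:
    model: Model                            # swappable
    planner: Planner                        # swappable
    tools: Dict[str, Callable[[str], str]]  # user-supplied tool surface
    max_steps: int = 20                     # assumed default: many unattended steps

    def run(self, goal: str) -> List[str]:
        history: List[str] = []
        for _ in range(self.max_steps):
            step = self.planner.next_step(goal, history)
            if step == "done":
                break
            tool_name, _, arg = step.partition(" ")
            tool = self.tools.get(tool_name)
            # Fall back to the model when no tool matches the step.
            result = tool(arg) if tool else self.model.complete(step)
            history.append(f"{step} -> {result}")
        return history


# Toy components to show the swap points; any object with the right methods fits.
class EchoModel:
    def complete(self, prompt: str) -> str:
        return f"model answered: {prompt}"


class ScriptedPlanner:
    def __init__(self, steps: List[str]):
        self.steps = steps

    def next_step(self, goal: str, history: List[str]) -> str:
        return self.steps[len(history)] if len(history) < len(self.steps) else "done"


agent = Agent(EchoModel(), ScriptedPlanner(["shell ls", "summarize"]),
              tools={"shell": lambda arg: f"ran {arg}"})
```

Nothing in `run` asks the user for permission. The only brake is the step budget, which is the kind of default that suits automated workflows and leaves the supervision policy to whoever assembles the pieces.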

The trade-off is that configurability is also a responsibility. With great runtime flexibility comes the requirement that you actually understand what you’re configuring. Codex-style systems reward sophisticated users and punish naive ones. Claude Code, by comparison, is opinionated in ways that protect users from configurations that would harm them.

Pi: small kernel, small built-in skill set, skills acquired at runtime.

Pi takes a third bet. The system prompt is small. The set of skills baked into the harness is small. What makes Pi distinct is that the skill set is not the limit — the agent extends its own capabilities at runtime, pulling in or composing new skills as a task demands them. Where Claude Code curates a rich skill library up front and Codex hands the user a configuration surface, Pi reaches for what it needs while it’s working.

The trajectory still matters — Pi-style agents tend to surface plans, dependencies, and intermediate steps the way a project board would — but the more interesting property is that two consecutive runs of the same harness can have substantially different skill loadouts. The procedural substrate isn’t shipped with the binary; it accretes around the task.
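The shifting-loadout property is easy to show in miniature. In this sketch, `REMOTE_SKILLS` stands in for wherever new skills come from at runtime, and `RuntimeAgent` is a hypothetical class. None of it is drawn from Pi’s source.

```python
from typing import Callable, Dict, List, Tuple

Skill = Callable[[str], str]

# Hypothetical stand-in for the source of runtime skills
# (a registry, a package index, or code the agent composes itself).
REMOTE_SKILLS: Dict[str, Skill] = {
    "parse_csv": lambda arg: f"parsed {arg}",
    "plot": lambda arg: f"plotted {arg}",
    "lint": lambda arg: f"linted {arg}",
}


class RuntimeAgent:
    def __init__(self) -> None:
        # Small built-in set; everything else is acquired while working.
        self.skills: Dict[str, Skill] = {"read": lambda arg: f"read {arg}"}

    def acquire(self, name: str) -> Skill:
        """Fetch a missing skill on first use and keep it for the rest of the run."""
        if name not in self.skills:
            if name not in REMOTE_SKILLS:
                raise LookupError(f"no skill available for {name!r}")
            self.skills[name] = REMOTE_SKILLS[name]
        return self.skills[name]

    def run(self, plan: List[Tuple[str, str]]) -> List[str]:
        return [self.acquire(name)(arg) for name, arg in plan]

    def loadout(self) -> List[str]:
        return sorted(self.skills)


data_run = RuntimeAgent()
data_run.run([("read", "data.csv"), ("parse_csv", "data.csv"), ("plot", "totals")])

code_run = RuntimeAgent()
code_run.run([("read", "main.py"), ("lint", "main.py")])
```

Both agents start from the same one-skill kernel, yet `data_run.loadout()` and `code_run.loadout()` differ once the tasks are finished, because each run picked up only what its plan called for.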

The trade-off is predictability. A harness that grows its skill set at runtime is harder to reason about than one whose tools are fixed at install time. Pi-style systems reward tasks where the long tail of capability matters more than reproducibility — exploratory work, broad research, anything where “the agent figured out how to do X mid-task” is a feature rather than a bug.

The differences are upstream of features

What’s striking when you look at all three together is that the features are largely the same. All three can read files, write files, run shells, execute code, integrate with git, run tests. The differences are upstream of features — they’re choices about how to structure those features. Specifically:

Where intelligence is concentrated. Claude Code puts it in a curated skill library. Codex puts it in the user’s configuration. Pi puts it in skills acquired at runtime.
How autonomy is calibrated. Claude Code: low and supervised. Codex: high and configurable. Pi: medium and observable.
What the trajectory looks like. Claude Code: incremental, reviewed step by step. Codex: whatever the user designed. Pi: a plan executed against a skill loadout that may shift mid-task.
Who the user is. Claude Code: developers who want a fast collaborator. Codex: builders who want a platform to assemble against. Pi: teams who want an agent that can reach for capability it didn’t ship with.

These are not feature-list distinctions. They’re philosophy distinctions, expressed in design. And they explain why different teams reach for different harnesses for the same nominal task. The “best” harness is genuinely a function of how the user wants to work, not just what they want done.

There’s a useful exercise here: look at a coding agent you’ve used, and ask which of these design schools it’s drawing from. Is the harness minimal, with a curated skill library doing the work? Is it a configurable platform that the user is expected to assemble? Does it pick up new skills while the task is running? The answers usually map cleanly onto one of these three poles, even for harnesses that don’t make their philosophy explicit.

The poles are harness philosophies, not models

The deeper observation is that the field is converging on a small number of stable design philosophies for coding agents, each with internal consistency, each making different trade-offs. This is the sign of a maturing field. Three years ago, the design space was wide open and chaotic. Now it has poles. The poles aren’t models; they’re harness philosophies. That’s the post-model era visible in the development tools we use every day.

The next post takes the side of one of these schools more aggressively — the small-prompts-rich-environment school — and argues that the future belongs to the thin reasoning kernel over the rich procedural substrate, not the other way around.