The Rise of the Agent Harness

Once you accept that an agent isn’t a chatbot with tools, you start needing a word for the thing that an agent actually is. “Loop” is too small. “Framework” is too vague. “Application” is misleading, because the application is whatever the agent is doing, not the system running it. The term that stuck — at least among people building these systems — is harness.

A harness is everything around the model that isn’t the model. Specifically, it’s the runtime that drives the loop, the context manager that decides what the model sees on each turn, the permission system that gates what actions can be taken, the execution environment where those actions actually happen, the memory and state machinery that persists across turns, and the recovery logic that handles failure. None of these are the model. All of them shape what the model does, often more than the model itself.

The reason the harness emerged as a distinct concept is that teams kept rebuilding the same five or six pieces, badly, on top of every model. Once Claude Code and Cursor and OpenHands and a handful of others started shipping mature versions of these pieces, it became clear that there was a real artifact here — one with its own design space, its own failure modes, and its own engineering discipline. The harness is to an agent what an OS is to a program: not the thing doing the work, but the thing without which the work can’t happen.

What actually sits around the model

It’s worth walking through what’s in a harness, because once you see the pieces individually you can never go back to thinking the model is the whole story.

The loop. Every agent runs inside a loop. The loop reads the current state, asks the model what to do next, executes the model’s chosen action, observes the result, updates the state, and goes around again. This sounds simple but is full of design decisions. How does the loop know when to stop? How does it handle a model that returns a malformed action? How many actions per turn? Does it batch tool calls or serialize them? The loop is the most consequential thirty lines of code in the system.
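To make those design questions concrete, here is a minimal sketch of such a loop. Everything in it is an illustrative assumption rather than any particular harness’s API: the `Action` and `State` types, the `"finish"` convention for stopping, and the `scripted_model` stand-in that replaces a real model call. It takes one action per turn, stops on either a finish action or a step budget, and feeds malformed actions back as an error instead of crashing.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    tool: str            # name of the tool to call, or "finish" to stop (hypothetical convention)
    argument: str = ""


@dataclass
class State:
    history: list[str] = field(default_factory=list)
    done: bool = False


def run_loop(
    choose_action: Callable[[State], Optional[Action]],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> State:
    state = State()
    for _ in range(max_steps):                 # step budget: one answer to "when to stop"
        action = choose_action(state)          # ask the model what to do next
        if action is None or (action.tool != "finish" and action.tool not in tools):
            # Malformed or unknown action: record it so the model can see the error next turn.
            state.history.append("error: malformed action")
            continue
        if action.tool == "finish":
            state.done = True
            break
        result = tools[action.tool](action.argument)                              # execute
        state.history.append(f"{action.tool}({action.argument!r}) -> {result}")   # observe, update
    return state


def scripted_model(state: State) -> Optional[Action]:
    # Stand-in for a real model: replays a fixed script, including one malformed turn.
    script = [Action("echo", "hello"), None, Action("finish")]
    step = len(state.history)
    return script[step] if step < len(script) else Action("finish")


final = run_loop(scripted_model, {"echo": str.upper})
print(final.done, final.history)
```

Even at this size the choices are visible: the loop serializes tool calls, treats a bad action as an observation rather than a fatal error, and will give up after `max_steps` whether or not the model ever says it is finished.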

Context management. The model gets a context window each turn. Something has to decide what goes in it. The naive approach is to dump everything — the full history, all available tools, every file the agent has seen — and let the model figure it out. This works for trivial tasks and falls apart immediately under real load. A good harness aggressively curates: it summarizes long histories, surfaces only relevant tools, fetches only relevant files, and pushes everything else into retrievable storage. Context management is to harnesses what query planning is to databases. The model can’t compensate for a bad job here.
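One simple form of curation is a token budget: always keep the system prompt, keep the newest messages that fit, and replace everything older with a marker. The sketch below assumes a crude word-count stand-in for tokenization and a hypothetical `build_context` helper. A real harness would use the model’s tokenizer and would likely replace the marker with an actual summary.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words. Not a real tokenizer.
    return len(text.split())


def build_context(system: str, history: list[str], budget: int) -> list[str]:
    remaining = budget - count_tokens(system)
    kept: list[str] = []
    for message in reversed(history):          # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break                              # everything older than this is dropped
        kept.append(message)
        remaining -= cost
    kept.reverse()                             # restore chronological order
    dropped = len(history) - len(kept)
    context = [system]
    if dropped:
        # Placeholder where a summary of the dropped turns would go.
        # Its own token cost is ignored here for brevity.
        context.append(f"[{dropped} earlier messages omitted]")
    return context + kept


context = build_context(
    system="You are helpful",
    history=["a b c d", "e f", "g h i"],
    budget=8,
)
print(context)
```

With a budget of 8, the system prompt takes 3 tokens, the two newest messages take the remaining 5, and the oldest message is replaced by the omission marker.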

Tool surface. This is the set of things the agent can do. The shape of this surface (what’s available, how it’s described, what arguments it takes, what it returns) is a huge determinant of agent behavior. A tool surface that’s too small leaves the agent unable to make progress. One that’s too large creates choice overload and slows everything down. Good harnesses think hard about tool granularity (one big tool versus many small ones), naming (the model picks tools partly based on names), and result formatting (the model has to read these results). MCP, the Model Context Protocol, exists largely to make tool surfaces composable across systems.
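A rough sketch of what “shape” means in practice: each tool carries a name, a description, and declared parameters, and the harness decides both how the surface is rendered for the model and what comes back when a call goes wrong. The `Tool` type and the `describe` and `call` helpers are hypothetical names for illustration. This is not MCP’s schema or any library’s API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str                      # the model picks tools partly by name
    description: str               # what the model reads when choosing
    parameters: dict[str, str]     # argument name -> short description
    run: Callable[..., str]        # results are strings the model has to read


def describe(tools: list[Tool]) -> str:
    # Render the tool surface as text for the model's context.
    lines = []
    for tool in tools:
        args = ", ".join(tool.parameters)
        lines.append(f"{tool.name}({args}): {tool.description}")
    return "\n".join(lines)


def call(tools: list[Tool], name: str, **kwargs: str) -> str:
    by_name = {t.name: t for t in tools}
    if name not in by_name:
        # Errors are formatted for the model too: say what went wrong and what exists.
        return f"error: unknown tool {name!r}; available: {sorted(by_name)}"
    tool = by_name[name]
    missing = [p for p in tool.parameters if p not in kwargs]
    if missing:
        return f"error: {name} missing arguments {missing}"
    return tool.run(**kwargs)


word_count = Tool(
    name="count_words",
    description="Count whitespace-separated words in a text.",
    parameters={"text": "the text to count"},
    run=lambda text: str(len(text.split())),
)

print(describe([word_count]))
print(call([word_count], "count_words", text="a b c"))
```

The error strings matter as much as the happy path: they are the only signal the model gets for correcting a bad call on the next turn.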

Permissions and execution are where it looks like an OS

Permissions. Every tool call is, in principle, an action the agent is taking in the world. Some of those actions are reversible and cheap (reading a file). Some are not (sending an email, deleting a database row, making a payment). A harness has to know the difference and gate accordingly. Permission systems range from simple (“ask the user before any write operation”) to elaborate (capability-based, with explicit grants, revocation, and audit). This is one of the parts of the harness that genuinely matters for safety, and one of the most underinvested across the industry.
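A minimal sketch of the simple end of that range: classify each tool by risk, allow reads, ask before anything else unless the user has granted it explicitly, deny anything unknown, and log every decision. The tool names, the `Risk` levels, and the `decide` function are assumptions made for illustration, not a description of how any shipping harness does it.

```python
from enum import Enum


class Risk(Enum):
    READ = 1           # reversible and cheap
    WRITE = 2          # changes state
    IRREVERSIBLE = 3   # e.g. sending an email or making a payment


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"        # pause and ask the user
    DENY = "deny"


# Hypothetical classification of a small tool surface.
TOOL_RISK = {
    "read_file": Risk.READ,
    "write_file": Risk.WRITE,
    "send_email": Risk.IRREVERSIBLE,
}


def decide(tool: str, granted: set[str], audit: list[tuple[str, str]]) -> Decision:
    risk = TOOL_RISK.get(tool)
    if risk is None:
        decision = Decision.DENY               # unknown tools are denied by default
    elif risk is Risk.READ or tool in granted:
        decision = Decision.ALLOW              # reads, or tools with an explicit grant
    else:
        decision = Decision.ASK
    audit.append((tool, decision.value))       # every decision is recorded
    return decision


log: list[tuple[str, str]] = []
print(decide("read_file", set(), log))
print(decide("send_email", set(), log))
print(decide("send_email", {"send_email"}, log))
```

Revocation in this sketch is just removing a name from `granted`. A capability-based system would attach scope and expiry to each grant instead of using a bare set of tool names.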

Execution environment. Where do tool calls actually run? A shell? A browser? A code interpreter? A sandbox? An MCP server? A harness has to host these environments, manage their lifecycle, isolate them from each other and from the host system, and pipe their output back into the loop. This is the part of the harness that looks most like an OS, because it largely is one. Anyone who has tried to run agents that generate and execute code on shared infrastructure has discovered that “execution environment” is shorthand for “everything systems administrators worry about, plus more.”
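The lifecycle and output-piping parts can be sketched with nothing but the standard library: run the code in a child process, in a throwaway working directory, with a timeout and an output cap. The `run_python` helper is a hypothetical name, and this is emphatically not a sandbox. A child process still has the host’s filesystem and network, so real isolation needs containers, VMs, or similar.

```python
import subprocess
import sys
import tempfile


def run_python(code: str, timeout_s: float = 5.0, max_output: int = 2000) -> str:
    # Each call gets a fresh working directory that is deleted afterwards.
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],   # -I: isolated mode, ignores user site and env vars
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,                    # runaway code is killed, not waited on
            )
        except subprocess.TimeoutExpired:
            return f"error: timed out after {timeout_s}s"
    output = proc.stdout + proc.stderr
    if len(output) > max_output:
        # Cap what flows back into the loop so one noisy call cannot flood the context.
        output = output[:max_output] + "\n[output truncated]"
    return f"exit={proc.returncode}\n{output}"


print(run_python("print(2 + 2)"))
```

Note that the result is shaped for the loop rather than for a human: exit code first, output capped, and a timeout reported as an ordinary string the model can react to.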

Memory and recovery are where the field is least settled

Memory and state. Agents that operate over time need memory that operates over time: working memory inside a single trajectory, episodic memory across trajectories, and semantic memory about the user, the project, and the domain.

Harnesses differ enormously in how they handle this. Some externalize everything into a context budget. Some maintain durable stores. Some defer the problem entirely. Memory is the area where the field is least settled and where most of the interesting research over the next few years is going to happen.
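Since nothing here is settled, the following is only one possible shape, assuming the three tiers named above map onto three plain containers and that a JSON file is enough for the durable part. The `Memory` class and its methods are hypothetical. Real systems vary widely and often add retrieval on top.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Memory:
    working: list[str] = field(default_factory=list)      # current trajectory only
    episodes: list[str] = field(default_factory=list)     # summaries of past trajectories
    facts: dict[str, str] = field(default_factory=dict)   # semantic: user, project, domain

    def end_trajectory(self, summary: str) -> None:
        # Working memory is discarded; only a summary survives as an episode.
        self.episodes.append(summary)
        self.working.clear()

    def save(self, path: Path) -> None:
        # Only the durable tiers are persisted.
        path.write_text(json.dumps({"episodes": self.episodes, "facts": self.facts}))

    @classmethod
    def load(cls, path: Path) -> "Memory":
        data = json.loads(path.read_text())
        return cls(episodes=data["episodes"], facts=data["facts"])


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "memory.json"
    memory = Memory()
    memory.working.append("ran the test suite, 1 failure")
    memory.facts["language"] = "Python"
    memory.end_trajectory("fixed the failing test")
    memory.save(store)
    restored = Memory.load(store)

print(restored)
```

The interesting decisions are the ones this sketch dodges: who writes the summary, what gets promoted to a fact, and how any of it is selected back into the context window.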

Recovery and verification. Things go wrong. Tool calls fail. Models hallucinate. Steps produce wrong results that look right. A harness needs to notice and respond. The patterns are still being worked out — retry-with-context, self-criticism, programmatic checks, human-in-the-loop checkpoints — but the harnesses that ship the best agents are the ones that take this layer seriously, treating it as engineering rather than as something the model will figure out.
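Two of those patterns, retry-with-context and programmatic checks, combine naturally: run a check on each candidate result, and on failure feed the reason back into the next attempt. The sketch below uses hypothetical names throughout (`attempt_with_checks`, `fake_model`, `must_be_integer`), with `fake_model` standing in for a real model call.

```python
from typing import Callable, Optional


def attempt_with_checks(
    produce: Callable[[list[str]], str],
    check: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> tuple[Optional[str], list[str]]:
    feedback: list[str] = []
    for _ in range(max_attempts):
        candidate = produce(feedback)          # the producer sees all earlier failures
        problem = check(candidate)             # programmatic check: None means it passed
        if problem is None:
            return candidate, feedback
        feedback.append(f"attempt {candidate!r} rejected: {problem}")
    return None, feedback                      # out of attempts: escalate, e.g. to a human


def fake_model(feedback: list[str]) -> str:
    # Stand-in for a model: wrong on the first try, right once it has feedback.
    return "42" if feedback else "4O"


def must_be_integer(candidate: str) -> Optional[str]:
    return None if candidate.isdigit() else "not an integer"


result, notes = attempt_with_checks(fake_model, must_be_integer)
print(result, notes)
```

The check only catches what it was written to catch. A result that is wrong but well-formed passes straight through, which is why the other patterns on the list exist.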

The harness is the product now

What’s striking about this list is how little of it depends on the specific model underneath. You could swap Claude for GPT or Gemini and the harness would barely change. The harness is the part of the system you actually own, and increasingly, the part that determines whether your agent is any good. The model is the engine, but a Formula One engine in a shopping cart is still a shopping cart.

This is also why the agent-building field has started to look like systems engineering rather than ML engineering. The interesting problems are not “how do we get the model to be smarter” but “how do we structure context, permissions, tools, and recovery so that the smart-enough model already in front of us can do real work.” That shift in focus is the practical consequence of the harness becoming a first-class object. The model used to be the product. The harness is the product now.