Why Agent Evals Became Behavioral

There’s a moment that happens to every team building agents in production, usually about six months in. They have an eval suite they’re proud of. It’s a spreadsheet of inputs and expected outputs, with a score that goes up over time as the model improves. Then they ship a feature, the score on the eval suite goes up, and the agent gets worse in the wild. Customer complaints rise. The team stares at the suite, which is still smiling at them, and realizes it’s been evaluating the wrong thing.

The wrong thing was the output. The right thing is the behavior. The transition between these two ways of evaluating is arguably the largest shift in AI evaluation methodology since the field started, and it’s still not complete.

Why output-based evals stopped meaning anything

Output-based evals have an obvious appeal. They look like benchmarks. Benchmarks worked beautifully for the completion era — give the model a question, compare its answer to a reference, score it. SQuAD, GLUE, MMLU, HumanEval — the field made enormous progress with eval suites of this shape. They’re easy to run, easy to compare, and easy to communicate. “Our model scored 89% on X” is a sentence everyone understands.

For agents, this sentence stops meaning anything useful. Consider the trivial-sounding question: “Did the agent solve the task?” In completion-land you check the output against a reference. For an agent, the task involves multiple steps, often with multiple acceptable paths, often with side effects that matter in ways the final output can’t reflect. An agent might “solve” the task by accident, by brute force, by making a mess and then cleaning up most of it, by skipping verification, by stumbling into a correct-looking answer for the wrong reasons. Two agents that both “solve” the task can be wildly different in quality, cost, safety, and predictability. The output-equality check misses all of this.

The shift to behavioral evaluation is the recognition that what you want to know about an agent is not “did it produce the right answer” but “did it produce the right trajectory.”

The trajectory contains the answer along with everything else: the choices the agent made, the tools it used, the order it used them in, the recoveries it performed, the side effects it created.

The patterns behavioral evaluation converged on

Concretely, behavioral evaluation has converged on a few patterns.

Trajectory scoring is the foundation. Each trajectory gets graded across multiple dimensions. Did the agent achieve the goal? Was the path efficient? Did it use the right tools? Did it verify its own work? Did it cause unintended side effects? Did it produce a trajectory a human could review and trust? Scoring across these dimensions, instead of collapsing everything to “right answer / wrong answer,” gives a much richer picture of agent quality. It also surfaces the trade-offs explicitly: an agent that’s accurate but expensive is different from one that’s accurate and cheap, and a behavioral eval can show you which one you have.
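A minimal sketch of what multi-dimensional scoring might look like. Everything here is an assumption made for illustration: the `Step` and `Trajectory` shapes, the tool names, the step budget, and the choice of dimensions are hypothetical, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str  # hypothetical tool name, e.g. "run_tests"
    side_effects: list[str] = field(default_factory=list)


@dataclass
class Trajectory:
    steps: list[Step]
    goal_achieved: bool
    cost_usd: float


def score_trajectory(t: Trajectory, allowed_effects: set[str], step_budget: int) -> dict[str, float]:
    """Grade one trajectory on several axes instead of a single pass/fail."""
    effects = [e for s in t.steps for e in s.side_effects]
    unintended = [e for e in effects if e not in allowed_effects]
    return {
        "goal": 1.0 if t.goal_achieved else 0.0,
        # 1.0 when within budget, shrinking as the path gets longer.
        "efficiency": min(1.0, step_budget / max(len(t.steps), 1)),
        # Assumption: calling "run_tests" counts as verifying the work.
        "verified": 1.0 if any(s.tool == "run_tests" for s in t.steps) else 0.0,
        "clean": 1.0 if not unintended else 0.0,
        "cost_usd": t.cost_usd,
    }


allowed = {"write:src/app.py"}
careful = Trajectory(
    steps=[Step("read_file"), Step("edit_file", ["write:src/app.py"]), Step("run_tests")],
    goal_achieved=True,
    cost_usd=0.04,
)
sloppy = Trajectory(
    steps=[Step("edit_file", ["write:src/app.py", "write:/tmp/scratch"])] + [Step("read_file")] * 7,
    goal_achieved=True,
    cost_usd=0.31,
)

careful_scores = score_trajectory(careful, allowed, step_budget=4)
sloppy_scores = score_trajectory(sloppy, allowed, step_budget=4)
```

Both toy trajectories reach the goal, so an output-only check would rate them the same. The score dictionaries differ on efficiency, verification, cleanliness, and cost, which is the trade-off picture described above.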

Replay is the workhorse. The substrate logs full trajectories; the eval system replays them. This lets you run regressions: when you change the model, the harness, or a skill, you can re-run a representative set of past tasks against their recorded conditions and compare the new trajectories to the old ones. Did the agent get faster? Did it skip steps it shouldn’t have? Did it start failing at things it used to handle? Replay lets you see all of this without having to run new live tasks, which is essential because live agent tasks are expensive and slow. Replay is to agent evaluation what unit tests are to software: the fast, repeatable, largely deterministic layer underneath the slower integration tests.
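One way to picture a replay regression, as a rough sketch. It assumes a recording stores the task and the tool outputs from the original run, and that an agent is just a function that calls tools by name. The `replay` and `diff_tools` helpers, the recording format, and both toy agents are hypothetical.

```python
from typing import Callable

Tool = Callable[[str], str]
Agent = Callable[[str, Tool], None]


def replay(agent: Agent, recording: dict) -> list[str]:
    """Run the agent against recorded tool outputs instead of live tools."""
    calls: list[str] = []

    def recorded_tool(name: str) -> str:
        calls.append(name)
        # Raises KeyError if the agent calls a tool the recording never saw.
        return recording["tool_outputs"][name]

    agent(recording["task"], recorded_tool)
    return calls


def diff_tools(old: list[str], new: list[str]) -> dict[str, list[str]]:
    """Report which tool calls disappeared or appeared between two runs."""
    return {
        "dropped": [t for t in old if t not in new],
        "added": [t for t in new if t not in old],
    }


def old_agent(task: str, tool: Tool) -> None:
    tool("read_file")
    tool("edit_file")
    tool("run_tests")


def new_agent(task: str, tool: Tool) -> None:
    # Toy regression: the changed agent no longer verifies its edit.
    tool("read_file")
    tool("edit_file")


recording = {
    "task": "fix the pagination bug",  # hypothetical task
    "tool_outputs": {"read_file": "def page(n): ...", "edit_file": "ok", "run_tests": "3 passed"},
}

baseline = replay(old_agent, recording)
candidate = replay(new_agent, recording)
report = diff_tools(baseline, candidate)
```

Here `report` flags that `run_tests` was dropped, the "did it skip steps it shouldn’t have" question, without any live tool being called.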

Recovery analysis is the part of behavioral evaluation that didn’t have an analog in the completion era. It asks: when something goes wrong mid-trajectory — a tool returns an error, a test fails, an assumption proves wrong — how does the agent respond? Does it notice? Does it adapt? Does it back out cleanly, or does it pile on more errors? Recovery is where agents differentiate from each other most starkly, and it’s invisible to output-based evals because by the time you see the output, the recovery has either happened or it hasn’t. Behavioral evals score recovery explicitly, often by deliberately injecting failures — a flaky tool, a misleading initial state — and watching how the agent handles them.
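Failure injection can be sketched as a wrapper around a tool. The code below is illustrative only: `FlakyTool`, the retrying stand-in agent, and the fields in `recovery_score` are all hypothetical, and a real agent’s recovery behavior would be far less scripted than a retry loop.

```python
class FlakyTool:
    """Wraps a tool so its first `failures` calls raise, simulating a flaky dependency."""

    def __init__(self, tool, failures: int):
        self.tool = tool
        self.remaining = failures

    def __call__(self, arg):
        if self.remaining > 0:
            self.remaining -= 1
            raise TimeoutError("injected failure")
        return self.tool(arg)


def run_with_retry(tool, arg, max_attempts: int = 3):
    """Stand-in agent: retries on error and logs each attempt as an event."""
    events = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(arg)
            events.append(("ok", attempt))
            return result, events
        except TimeoutError:
            events.append(("error", attempt))
    return None, events


def recovery_score(events) -> dict:
    """Score the recovery itself, not just the final output."""
    errors = sum(1 for kind, _ in events if kind == "error")
    recovered = bool(events) and events[-1][0] == "ok"
    return {"errors_seen": errors, "recovered": recovered, "attempts": len(events)}


# Two injected failures: the stand-in agent recovers on the third attempt.
result, events = run_with_retry(FlakyTool(str.upper, failures=2), "ping")
recovered_case = recovery_score(events)

# Five injected failures: the agent exhausts its attempts and gives up.
failed_result, failed_events = run_with_retry(FlakyTool(str.upper, failures=5), "ping")
failed_case = recovery_score(failed_events)
```

The point of the design is that the score is computed from the event log rather than from `result`, so a run that succeeded only after two errors looks different from one that succeeded cleanly.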

Counterfactual probing is more recent and more ambitious. The idea is: take a trajectory the agent produced and ask what would have happened if some step had gone differently. If the test had not failed, would the agent still have done the right thing? If the file had been slightly different, would the agent still have chosen the right edit? This kind of analysis is starting to be tractable because we have the trajectories logged and the substrate is reproducible enough to replay them under modified conditions.
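As a toy illustration of the idea, the sketch below replays a single decision point with one logged observation altered. The `toy_agent` policy, the observation keys, and the `probe` helper are hypothetical stand-ins. A real probe would replay the full trajectory from the modified state rather than one decision.

```python
def toy_agent(observations: dict) -> str:
    """Stand-in policy: picks the next action from what it has observed."""
    if not observations["tests_passed"]:
        return "fix_failing_test"
    if observations["lint_warnings"] > 0:
        return "fix_lint"
    return "submit"


def probe(agent, logged: dict, **changes) -> dict:
    """Replay one decision point with some logged observations altered."""
    counterfactual = {**logged, **changes}
    return {"factual": agent(logged), "counterfactual": agent(counterfactual)}


# Logged state at the decision point: the test failed and lint warnings remain.
logged = {"tests_passed": False, "lint_warnings": 2}

# Counterfactual: what if the test had not failed?
outcome = probe(toy_agent, logged, tests_passed=True)
```

In this toy case the counterfactual action is `fix_lint` rather than `submit`, which suggests the policy was not relying on the test failure alone to avoid submitting unfinished work.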

Production telemetry as evaluation closes the loop. Every real trajectory is, in principle, an eval. The harness collects the trajectory, computes the same scores the offline eval suite uses, and surfaces the distribution. This is how you catch regressions you didn’t predict — production isn’t a sample drawn from your eval set, it’s a fresh data source with its own surprises. Mature teams have started treating their eval suite and their production telemetry as the same system, with the eval suite being a curated, deterministic subset and production being the much larger, noisier, more important one.
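A sketch of the “same scores, two data sources” idea, under stated assumptions: the trajectory fields, the pass criterion in `score`, and the tolerance in `regressed` are all made up for illustration, and the sample data is fabricated toy input rather than real telemetry.

```python
from statistics import mean, median


def score(trajectory: dict) -> float:
    """Shared scorer: the same function runs offline and on production traces."""
    ok = trajectory["goal_achieved"] and not trajectory["unintended_effects"]
    return 1.0 if ok else 0.0


def summarize(trajectories: list[dict]) -> dict:
    """Surface the distribution rather than a single headline number."""
    return {
        "n": len(trajectories),
        "pass_rate": mean([score(t) for t in trajectories]),
        "median_steps": median([t["steps"] for t in trajectories]),
    }


def regressed(offline: dict, production: dict, tolerance: float = 0.1) -> bool:
    """Flag when production falls behind the curated suite by more than `tolerance`."""
    return offline["pass_rate"] - production["pass_rate"] > tolerance


def traj(goal: bool, effects: list[str], steps: int) -> dict:
    return {"goal_achieved": goal, "unintended_effects": effects, "steps": steps}


eval_set = [traj(True, [], 3), traj(True, [], 4), traj(True, [], 5), traj(True, [], 4)]
production = [traj(True, [], 4), traj(False, [], 9), traj(True, [], 5), traj(True, ["deleted_branch"], 12)]

offline_summary = summarize(eval_set)
production_summary = summarize(production)
alert = regressed(offline_summary, production_summary)
```

Because both summaries come from one `score` function, a gap between them points at the data rather than at a mismatch between two scoring implementations.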

This is observability, with the trace as the artifact

All of this looks a lot like observability for distributed systems, and that’s not an accident. Distributed systems also have the property that the final outcome is the result of many interacting components, and that you can’t evaluate them by looking at one component in isolation. The tools that emerged for distributed systems — tracing, metrics, replay, structured logs — are roughly the tools emerging for agent evaluation, with the difference that the trace is the artifact rather than the diagnostic.

The eval stops being a number

There’s a cultural change that goes with this. Teams that used to ship agents based on benchmark scores have started shipping based on trajectory dashboards. The conversation in design reviews has shifted from “what’s our score” to “show me five recent trajectories from this user segment.” The eval is no longer a number; it’s an examined sample of how the system actually behaves. This is a healthier engineering posture, and it’s the one the field is converging on whether or not it has the vocabulary to describe it yet.

The next post is about the related shift in design: as trajectories became the artifact and behavioral evals became the metric, the work of building agents stopped being “prompt engineering” and started being “context engineering.” That’s a different discipline with different concerns, and it’s where most of the real craft of agent-building now lives.