For most of the LLM era, intelligence was measured one completion at a time. You sent a prompt; you got a response; you graded the response. Benchmarks were lists of question-answer pairs. Demos were screenshots. The thing being evaluated, and the thing being shipped, was a single string of model output.
This is no longer the unit that matters. With agents, the artifact is a trajectory — the full sequence of decisions, tool calls, observations, recoveries, and final actions that the model produces in pursuit of a goal. The completion is a frame; the trajectory is the film. And once you start thinking in trajectories, almost everything about how you build, evaluate, and reason about AI systems shifts.
Two agents, same diff, different trajectories
The clearest way to see this is to look at two systems that produce the “same” answer and notice they’re not the same. Imagine asking two coding agents to fix a bug. Both eventually commit a patch that makes the failing test pass. Agent A reads the test, reads the relevant file, forms a hypothesis, makes a targeted edit, runs the test, sees it pass, and stops. Agent B reads the test, reads twelve unrelated files, makes seven edits across four files, runs the test, sees it fail, reverts everything, tries again, and on the third pass produces a patch that works.
The final diffs might be identical. The trajectories are not. If you only evaluate the output, A and B look equivalent. If you evaluate the trajectory, A is dramatically better — cheaper, faster, easier to review, less likely to have caused collateral damage, and more likely to generalize to the next bug.
The completion-era benchmarks couldn’t see this distinction. They couldn’t see anything that happened between prompt and response. So they couldn’t see whether the model had thrashed, guessed, given up and retried, hit dead ends, or worked methodically. They saw the answer key and graded the answer. For agents, this is like grading a chess player by whether they won, while ignoring how many illegal moves they tried first.
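To make the A-versus-B comparison concrete, here is a minimal sketch of what trajectory-level grading could record. Everything in it is an assumption made for illustration: the `Step` and `Trajectory` types, the step labels, the file names, and the placeholder diff are hypothetical, and agent B's path is a simplified version of the thrashing described above.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str        # "read", "edit", "test", or "revert" (hypothetical labels)
    target: str      # file path or test name
    ok: bool = True  # False marks a failed test run


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_diff: str = ""


def summarize(t: Trajectory) -> dict[str, int]:
    """Count the signals that output-only grading never sees."""
    return {
        "steps": len(t.steps),
        "files_read": len({s.target for s in t.steps if s.kind == "read"}),
        "files_edited": len({s.target for s in t.steps if s.kind == "edit"}),
        "failed_tests": sum(1 for s in t.steps if s.kind == "test" and not s.ok),
        "reverts": sum(1 for s in t.steps if s.kind == "revert"),
    }


DIFF = "- return x\n+ return x + 1\n"  # placeholder patch, identical for both agents
TEST = "test_bug.py"                   # hypothetical failing test

# Agent A: read the test, read the file, one targeted edit, one passing run.
agent_a = Trajectory(
    steps=[Step("read", TEST), Step("read", "parser.py"),
           Step("edit", "parser.py"), Step("test", TEST)],
    final_diff=DIFF,
)

# Agent B: wide reading, scattered edits, two failed passes, then success.
agent_b = Trajectory(
    steps=(
        [Step("read", TEST)]
        + [Step("read", f"unrelated_{i}.py") for i in range(12)]
        + [Step("edit", f"module_{i % 4}.py") for i in range(7)]
        + [Step("test", TEST, ok=False), Step("revert", "workspace")]
        + [Step("edit", "parser.py"), Step("test", TEST, ok=False),
           Step("revert", "workspace")]
        + [Step("edit", "parser.py"), Step("test", TEST)]
    ),
    final_diff=DIFF,
)

print(agent_a.final_diff == agent_b.final_diff)  # output-only view: equivalent
print(summarize(agent_a))
print(summarize(agent_b))
```

An output-only grader has just the first comparison to go on. The two summaries are where the agents come apart, and a real harness would likely add cost, latency, and token counts to the same record.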
The shift in unit reframes a lot of older debates. The question “is the model smart enough?” stops being meaningful in isolation. A model can be plenty smart for a particular completion and still be terrible at navigating a trajectory — because navigating requires planning, recovering from errors, choosing when to verify, knowing when to stop. These are not the same as producing a fluent paragraph. You can be excellent at the latter and bad at the former, and most chat-first models for most of the LLM era were exactly that.
Joint quality is what kills you, not per-step quality
The shift to trajectories also explains why so many “obvious” applications of LLMs underperformed when teams actually shipped them. A chatbot that answers questions correctly 95% of the time sounds great. A multi-step agent in which each step succeeds 95% of the time sounds great until you do the arithmetic: if the steps fail independently and nothing recovers from a failure, a ten-step trajectory completes only about 60% of the time (0.95^10 ≈ 0.599). The thing that matters isn’t the per-step quality. It’s the joint quality across the path, which is dominated by how the system handles the inevitable bad step.
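The arithmetic is easy to check. The sketch below assumes the simplest possible model: every step succeeds with the same probability, steps fail independently, and a single failure ends the trajectory. The function name `completion_probability` is made up for this example.

```python
def completion_probability(step_success: float, num_steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps and no recovery."""
    return step_success ** num_steps


# How quickly a 95% step success rate erodes as the path gets longer.
for n in (1, 5, 10, 20, 50):
    print(f"{n:>2} steps: {completion_probability(0.95, n):.1%}")
```

At ten steps the result is just under 60%, and it keeps falling as the path gets longer. Real agents violate the independence assumption in both directions, since errors can compound and recovery can rescue a bad step, so this is a baseline intuition rather than a prediction.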
The mathematics of trajectories produces a different set of engineering priorities than the mathematics of completions. In completion-land, the question is “how do I make each response a bit better?” In trajectory-land, the question is “how do I make the bad steps recoverable, and how do I keep the path short enough that the joint probability doesn’t collapse?” These have different answers. Better recovery often beats better steps. Shorter paths often beat smarter paths. Knowing when to stop often beats knowing what to do next.
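A toy model shows why recovery and path length can matter more than per-step quality. It assumes that a failed step is always detected, that each retry is an independent attempt with the same success rate, and that the numbers (95%, 97%, ten steps, six steps) are illustrative rather than measured. The name `path_success` is hypothetical.

```python
def path_success(step_success: float, num_steps: int, retries: int = 0) -> float:
    """Chance of finishing the path when each step may be retried.

    Assumes failures are always detected and retries are independent,
    which is optimistic for real agents.
    """
    # A step fails for good only if the first attempt and every retry fail.
    effective = 1 - (1 - step_success) ** (retries + 1)
    return effective ** num_steps


baseline = path_success(0.95, 10)                # ten steps, no recovery
better_steps = path_success(0.97, 10)            # a better model, no recovery
one_retry = path_success(0.95, 10, retries=1)    # same model, one retry per step
shorter_path = path_success(0.95, 6)             # same model, fewer steps

for name, p in [("baseline", baseline), ("better steps", better_steps),
                ("one retry", one_retry), ("shorter path", shorter_path)]:
    print(f"{name:>12}: {p:.1%}")
```

Under these assumptions, a single retry per step lifts completion further than raising step quality from 95% to 97%, and cutting the path from ten steps to six lands close to the better-model result. The hard part in practice is the assumption doing most of the work here: noticing that a step went wrong.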
Trajectories demand real instrumentation
The move from completions to trajectories is also why the agent era brought back a kind of engineering discipline that prompt engineering had let atrophy. When the unit is a single completion, you can iterate by editing prose. When the unit is a trajectory of fifty tool calls across a workspace, you need real instrumentation.
You need to be able to replay trajectories. You need to be able to diff them. You need to be able to ask “where did this one go wrong” and get a useful answer. The infrastructure that supports trajectories — tracing, replay, scoring, comparison — looks more like APM tooling for distributed systems than like the spreadsheet of test cases that sufficed for completions.
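As a small illustration of what diffing trajectories can mean, the sketch below finds the first step at which two recorded runs diverge. It assumes a trace is just an ordered list of tool calls. The `Event` type, the tool names, and the file names are hypothetical, and a real tracing stack would also record observations, timing, and cost.

```python
from itertools import zip_longest
from typing import NamedTuple, Optional


class Event(NamedTuple):
    tool: str  # hypothetical tool name, e.g. "read" or "run_tests"
    arg: str   # the main argument to the call


def first_divergence(a: list[Event], b: list[Event]) -> Optional[int]:
    """Return the index of the first step where two traces differ, or None."""
    # zip_longest pads the shorter trace with None, so a trace that simply
    # stops early is reported as diverging at its end.
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i
    return None


good_run = [
    Event("read", "test_bug.py"),
    Event("read", "parser.py"),
    Event("edit", "parser.py"),
    Event("run_tests", "test_bug.py"),
]
bad_run = [
    Event("read", "test_bug.py"),
    Event("read", "utils.py"),      # the wrong file: where this run went off course
    Event("edit", "utils.py"),
    Event("run_tests", "test_bug.py"),
    Event("edit", "parser.py"),
    Event("run_tests", "test_bug.py"),
]

index = first_divergence(good_run, bad_run)
print(index, bad_run[index])
```

The first divergence is only a starting point for "where did this one go wrong", since two runs can differ harmlessly, but it narrows a fifty-step trace down to one place to look.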
Intelligence is a property of the whole system
There’s a conceptual shift hiding inside all this that I think people underestimate. In the completion era, intelligence was treated as a property of the model — some models were smarter, others less so, and you tried to buy or fine-tune your way to a smarter one. In the agent era, intelligence is a property of the behavior, which is to say, a property of the model and its harness and its environment and its tools and its loop together. You can make a behavior dramatically smarter without changing the model at all, by changing what surrounds it. You can also make a smart model behave stupidly by surrounding it badly. The locus of intelligence is the trajectory, and the trajectory is produced by the whole system, not just the bit in the middle that does inference.
Once you internalize this, the next question becomes obvious: what does that whole system look like, and what do we call the parts of it that aren’t the model? That’s the territory the next several posts cover. We start by clarifying what an agent actually is, because there’s still a lot of confusion in the field between “agent” and “chatbot that calls a function.” Spoiler: they are not the same thing, and conflating them has cost a lot of teams a lot of money.