At some point in 2024, a comparison started circulating in agent-building circles: a modern agent harness looks less like a piece of application software and more like an operating system. The comparison was framed as a metaphor at first, then as an analogy, and by mid-2025 most people building serious harnesses had stopped treating it as either. It wasn’t merely a useful comparison; it was a description. Harnesses had become, structurally, miniature operating systems. The OS-iness wasn’t a coincidence of design taste; it was a consequence of the problem space.
The reason is straightforward once you list out what an agent has to do that a traditional application doesn’t. An agent has to schedule. It has to enforce permissions. It has to manage memory. It has to isolate untrusted code. It has to maintain state across many independent runs. It has to do all of this for a non-deterministic actor — the model — that may at any moment do something nobody anticipated.
These are exactly the concerns an OS exists to handle.
If you build a harness without thinking in OS terms, you end up reinventing every primitive an OS provides, badly and ad hoc. If you build one with OS thinking, you find yourself ahead of problems most teams hit only after their first production incident.
Where the analogy stops being metaphor
Let me walk through the specific places this analogy stops being metaphor.
Scheduling. An OS schedules processes onto CPUs. A harness schedules subtasks onto model calls. When an agent decomposes a goal into substeps, something has to decide what runs when. Serial execution is the obvious default but rarely optimal: some tasks could run in parallel, some should be batched, and some should be deferred until a precondition is met. Harnesses that take scheduling seriously can dispatch multiple tool calls concurrently, route different subtasks to different model sizes, and prioritize the substeps most likely to fail so that failure is detected early. Without explicit scheduling, agents run as single-threaded loops, paying full serial latency for independent work and spending model calls on steps an earlier check could have ruled out.
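To make the concurrency point concrete, here is a minimal Python sketch using `asyncio`. The tools are hypothetical stand-ins that simply sleep to simulate I/O latency, and the names `fake_tool`, `run_serial`, and `run_concurrent` are illustrative, not part of any harness API. Dispatching independent calls with `asyncio.gather` makes total latency roughly that of the slowest call rather than the sum of all of them.

```python
import asyncio
import time


async def fake_tool(name: str, latency: float) -> str:
    # Hypothetical tool: sleeps to simulate I/O (an HTTP call, a file read, a search).
    await asyncio.sleep(latency)
    return f"{name}: done"


async def run_serial(calls: list[tuple[str, float]]) -> list[str]:
    # One call at a time: total latency is the sum of individual latencies.
    return [await fake_tool(name, latency) for name, latency in calls]


async def run_concurrent(calls: list[tuple[str, float]]) -> list[str]:
    # Independent calls dispatched together: total latency is roughly the slowest call.
    # gather() returns results in the same order as the awaitables passed in.
    return list(await asyncio.gather(*(fake_tool(name, latency) for name, latency in calls)))


def timed(runner, calls):
    start = time.perf_counter()
    results = asyncio.run(runner(calls))
    return results, time.perf_counter() - start


calls = [("search", 0.2), ("read_file", 0.2), ("fetch_url", 0.2)]
serial_results, serial_s = timed(run_serial, calls)
concurrent_results, concurrent_s = timed(run_concurrent, calls)
print(f"serial: {serial_s:.2f}s  concurrent: {concurrent_s:.2f}s")
```

A real scheduler layers more on top (dependency tracking, rate limits, routing to different model sizes), but the core win is the same: independent work should not wait in line.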
Permissions. OS permissions models — users, groups, capabilities, ACLs — exist because programs cannot be trusted to act on the whole system. The same constraint applies to agents, and more sharply: agents are stochastic, prompt-injectable, and capable of taking actions the author didn’t anticipate. The right place to enforce “don’t email customers from this agent” is not inside a model instruction, where it can be ignored or overridden. It’s in the permission layer, where the email tool simply isn’t available, or is available only after a human approval. The permission model is also where you specify what an agent is allowed to read, which becomes a privacy concern as agents touch more user data.
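A minimal sketch of such a permission layer, assuming a per-agent policy made of an allowlist plus a set of tools that need human sign-off. `Policy`, `visible_tools`, `invoke`, and the `approve` callback are hypothetical names, and the tool bodies are placeholders. The point is where the check lives: in harness code the model cannot rewrite, not in the prompt.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tool implementations; real ones would call a ticket system or a mail server.
TOOLS: dict[str, Callable[..., str]] = {
    "read_ticket": lambda ticket_id: f"ticket {ticket_id}: printer on fire",
    "send_email": lambda to, body: f"sent to {to}",
}


class PermissionDenied(Exception):
    pass


@dataclass
class Policy:
    allowed: set[str]                                       # tools this agent may use at all
    needs_approval: set[str] = field(default_factory=set)  # allowed only with human sign-off


def visible_tools(policy: Policy) -> list[str]:
    # Tools outside the allowlist are never advertised to the model,
    # so no prompt (injected or otherwise) can talk the agent into calling them.
    return sorted(name for name in TOOLS if name in policy.allowed)


def invoke(policy: Policy, approve: Callable[[str, dict], bool], name: str, **args) -> str:
    # Enforcement happens in harness code, not in the model's instructions.
    if name not in policy.allowed:
        raise PermissionDenied(f"{name} is not available to this agent")
    if name in policy.needs_approval and not approve(name, args):
        raise PermissionDenied(f"{name} was not approved")
    return TOOLS[name](**args)


support_agent = Policy(allowed={"read_ticket"})
print(visible_tools(support_agent))  # ['read_ticket']
```

A support agent configured this way never sees `send_email` in its tool list, so no injected instruction can make it send mail. An agent with `send_email` in `needs_approval` can send only after the `approve` callback (a stand-in for a human review step) returns `True`.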
Memory management. OSes have an entire stack of memory abstractions: registers, caches, RAM, swap, disk. Each level is faster and smaller than the one below it. Harnesses are developing analogous hierarchies: model context is the working memory, recent trajectory summaries are the cache, episodic stores are RAM, vector databases are the disk. The OS analogy predicts (correctly) that you’ll spend a lot of engineering effort on the boundaries: what gets promoted up the hierarchy, what gets evicted, what gets indexed for later retrieval. That boundary work is the core of context engineering, and it really is the agent equivalent of memory management.
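The eviction boundary can be sketched as a toy two-tier memory: a bounded context window that spills its oldest items into an archive, plus a recall step that retrieves archived items as candidates to bring back. This is a sketch under loud assumptions: tokens are approximated by word counts, eviction is simple oldest-first, and keyword overlap stands in for embedding search in a vector database. `MemoryHierarchy` is a hypothetical class.

```python
from collections import deque


class MemoryHierarchy:
    """Toy two-tier memory: a bounded context window backed by an archive.

    Illustrative assumptions: tokens are approximated by whitespace-split words,
    eviction is oldest-first, and keyword overlap stands in for vector search.
    """

    def __init__(self, context_budget: int):
        self.context_budget = context_budget
        self.context: deque[str] = deque()  # working memory: what the model sees
        self.archive: list[str] = []        # "disk": evicted items, retrievable later

    @staticmethod
    def tokens(text: str) -> int:
        return len(text.split())

    def used(self) -> int:
        return sum(self.tokens(m) for m in self.context)

    def add(self, message: str) -> None:
        self.context.append(message)
        # Evict oldest-first until the window fits the budget (always keep the newest item).
        while self.used() > self.context_budget and len(self.context) > 1:
            self.archive.append(self.context.popleft())

    def recall(self, query: str, k: int = 1) -> list[str]:
        # Retrieve archived items (candidates to promote back into context),
        # ranked by word overlap with the query; items with no overlap are dropped.
        q = set(query.lower().split())

        def overlap(m: str) -> int:
            return len(q & set(m.lower().split()))

        ranked = sorted(self.archive, key=overlap, reverse=True)
        return [m for m in ranked[:k] if overlap(m) > 0]


mem = MemoryHierarchy(context_budget=10)
mem.add("user prefers metric units")
mem.add("deploy target is staging cluster")
mem.add("run the test suite now")
print(list(mem.context))
print(mem.recall("which units does the user prefer"))
```

With a ten-word budget, the third message pushes the first one into the archive, and the recall query retrieves it again. A harness might summarize evicted items instead of archiving them verbatim, or evict by relevance rather than age; those are policy choices at the same boundary.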
Isolation and drivers come from the same playbook
Isolation. OSes evolved process isolation, namespaces, and sandboxes because programs cannot be trusted not to interfere with each other. Agents need the same. Multi-tenant agent platforms have to ensure that one user’s agent can’t read another user’s files. Coding agents that execute generated code need to run that code in a container, jail, or VM. Agents that browse the web need a browser sandbox. The primitives harness builders end up reaching for are borrowed almost wholesale from systems engineering: containers, microVMs, gVisor, bubblewrap.
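As the smallest possible illustration of the isolation idea, the sketch below runs generated Python in a separate process with a scratch working directory, an empty environment (so API keys in environment variables are not inherited), a wall-clock timeout, and a CPU-time rlimit. It is POSIX-only (`resource` and `preexec_fn` are unavailable on Windows) and is emphatically not a security boundary: the child can still read the filesystem and open network connections, which is exactly the gap containers, microVMs, gVisor, or bubblewrap close. `run_untrusted` is a hypothetical helper.

```python
import resource
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_s: float = 5.0, cpu_s: int = 2) -> tuple[int, str, str]:
    """Run generated Python in a separate, resource-limited child process.

    POSIX-only sketch. Process separation plus rlimits is NOT a security boundary:
    the child can still read the filesystem and use the network.
    """

    def limit_resources() -> None:
        # Runs in the child just before exec: cap CPU seconds so a runaway loop is killed.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_s, cpu_s))

    with tempfile.TemporaryDirectory() as scratch:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: ignore PYTHON* env vars and user site-packages
                cwd=scratch,                         # throwaway working directory
                env={},                              # no inherited secrets in environment variables
                capture_output=True,
                text=True,
                timeout=timeout_s,                   # wall-clock limit enforced by the parent
                preexec_fn=limit_resources,
            )
        except subprocess.TimeoutExpired:
            return -1, "", "timed out"
    return proc.returncode, proc.stdout, proc.stderr


print(run_untrusted("print(6 * 7)"))
```

A well-behaved snippet comes back with exit code 0 and its output on stdout; an infinite loop comes back with a nonzero code once the CPU limit or the timeout trips.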
Drivers and devices. An OS abstracts hardware devices behind drivers so applications don’t have to know about specific chipsets. A harness abstracts capabilities behind tool definitions so agents don’t have to know about specific APIs. MCP, in its current form, is more or less a driver model for agent capabilities — a way to plug arbitrary services into a harness with a common interface. The terminology hasn’t fully caught up to this, but the pattern is identical.
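To show what a driver-style interface looks like, here is a sketch with two unrelated backends behind the same shape (a name, a description, and a JSON Schema for inputs) and a registry that exposes only those uniform descriptions to the model. The shape loosely mirrors how MCP servers list tools (`name`, `description`, `inputSchema`), but this is not the MCP SDK or wire protocol; `Tool`, `Registry`, `WeatherTool`, and `NotesTool` are hypothetical.

```python
from typing import Any, Protocol


class Tool(Protocol):
    # The "driver interface": every capability exposes the same members,
    # whatever service actually sits behind it.
    name: str
    description: str
    input_schema: dict[str, Any]  # JSON Schema for the call's arguments

    def call(self, args: dict[str, Any]) -> str: ...


class WeatherTool:
    # Hypothetical backend; a real driver would call a weather HTTP API here.
    name = "get_weather"
    description = "Current weather for a city."
    input_schema = {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    }

    def call(self, args: dict[str, Any]) -> str:
        return f"Sunny in {args['city']}"


class NotesTool:
    # A different "device" with an unrelated backend: an in-memory dict.
    name = "save_note"
    description = "Store a short note under a key."
    input_schema = {
        "type": "object",
        "properties": {"key": {"type": "string"}, "text": {"type": "string"}},
        "required": ["key", "text"],
    }

    def __init__(self) -> None:
        self.notes: dict[str, str] = {}

    def call(self, args: dict[str, Any]) -> str:
        self.notes[args["key"]] = args["text"]
        return f"saved {args['key']}"


class Registry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def catalog(self) -> list[dict[str, Any]]:
        # What the model sees: uniform descriptions, no backend details.
        return [
            {"name": t.name, "description": t.description, "inputSchema": t.input_schema}
            for t in self._tools.values()
        ]

    def call(self, name: str, args: dict[str, Any]) -> str:
        # The harness routes by name; the caller never touches the backend directly.
        return self._tools[name].call(args)


registry = Registry()
registry.register(WeatherTool())
registry.register(NotesTool())
print([entry["name"] for entry in registry.catalog()])
```

Adding a capability means writing one more class with the same members; nothing on the model's side of the interface changes.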
System calls. OSes give applications a stable interface to privileged operations through system calls. Harnesses give agents a stable interface to side effects through tool calls. The principle is the same: untrusted code can only affect the world by asking the trusted layer to do something for it, and the trusted layer enforces policy. This is also why “tool calls” should be thought of as the trust boundary in agent systems. The model is the untrusted code; the tool definitions are the syscall surface; the harness is the kernel.
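The syscall framing suggests a concrete discipline: treat every tool call the model emits as an untrusted request, and have the harness parse it, check it against a declared signature, apply policy, and only then execute. A minimal sketch, assuming the model emits calls as JSON of the form `{"tool": ..., "args": {...}}`; the syscall table, the `/workspace/` path rule, and `kernel_dispatch` are all hypothetical.

```python
import json
import posixpath
from typing import Any, Callable

# Hypothetical syscall table: tool name -> (expected argument types, handler).
SYSCALLS: dict[str, tuple[dict[str, type], Callable[..., Any]]] = {
    "read_file": ({"path": str}, lambda path: f"<contents of {path}>"),
    "add": ({"a": int, "b": int}, lambda a, b: a + b),
}

# Hypothetical policy: the agent may only read under this directory.
ALLOWED_PREFIX = "/workspace/"


def kernel_dispatch(raw_model_output: str) -> dict[str, Any]:
    """Treat a model-emitted tool call as an untrusted request; honor it only if it passes checks."""
    try:
        request = json.loads(raw_model_output)
        name, args = request["tool"], request["args"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"error": "malformed tool call"}

    if not isinstance(name, str) or name not in SYSCALLS:
        return {"error": f"unknown tool {name!r}"}
    expected, handler = SYSCALLS[name]

    # Validate argument names and types before anything touches the world.
    if not isinstance(args, dict) or set(args) != set(expected):
        return {"error": f"expected arguments {sorted(expected)}"}
    for key, typ in expected.items():
        if not isinstance(args[key], typ):
            return {"error": f"argument {key!r} must be {typ.__name__}"}

    # Policy check: normalize first so "/workspace/../etc/passwd" cannot slip through.
    if name == "read_file" and not posixpath.normpath(args["path"]).startswith(ALLOWED_PREFIX):
        return {"error": "permission denied"}

    return {"result": handler(**args)}


print(kernel_dispatch('{"tool": "add", "args": {"a": 2, "b": 3}}'))
print(kernel_dispatch('{"tool": "read_file", "args": {"path": "/workspace/../etc/passwd"}}'))
```

Refusals come back as data rather than exceptions, so the model can see them and adjust, while nothing it writes can bypass the checks.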
Logging and auditing. OSes log extensively because you can’t troubleshoot or audit a system if you didn’t write down what happened. Harnesses are converging on the same conclusion: every tool call, every model invocation, and every state change gets logged. The trajectory is the audit trail. A whole sub-discipline is emerging around how to store, search, and analyze these trajectories, and it is starting to look a lot like the systems we built around OS logs decades ago.
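A minimal sketch of a trajectory log, assuming an append-only JSON Lines file with one event per line. `TrajectoryLog` and the event kinds (`model_call`, `tool_call`, `state_change`) are illustrative, and the sequence counter is per-instance; a restart-safe version would resume numbering from the last line in the file.

```python
import json
import time
from pathlib import Path
from typing import Any, Iterator, Optional


class TrajectoryLog:
    """Append-only JSON Lines audit trail: one event per line, never rewritten in place."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.seq = 0  # per-instance counter; a restart-safe version would resume from the file

    def record(self, kind: str, **data: Any) -> None:
        # kind is an illustrative label such as "model_call", "tool_call", or "state_change".
        self.seq += 1
        event = {"seq": self.seq, "ts": time.time(), "kind": kind, "data": data}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def events(self, kind: Optional[str] = None) -> Iterator[dict[str, Any]]:
        # Replay the trajectory in order, optionally filtered by event kind.
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                if kind is None or event["kind"] == kind:
                    yield event
```

Because writes only ever append, the history is never edited in place, which is what makes it usable as an audit trail; one JSON object per line keeps it greppable and easy to load into whatever search or analytics system sits downstream.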
The agent is the process, the harness is the OS
Once you’ve internalized all this, “agent” starts looking like the wrong noun. An agent isn’t a thing — it’s a process, in the OS sense. The agent runs on a harness, which is the OS. Different processes (different tasks, different users, different goals) run on the same harness simultaneously, isolated from each other, managed by the same scheduling and permission machinery. The harness is the durable artifact; agents are ephemeral.
We are in the late 1970s of harnesses
This reframing has consequences for how we should be building. Harness design ought to draw on decades of OS literature. We have books on scheduling. We have books on permission systems. We have books on memory hierarchies and on isolation primitives. We don’t need to invent any of this from scratch, and the teams that have noticed have a substantial head start on the teams that are still treating their harness as “the LangChain code we wrote last quarter.”
It also has consequences for where the field is going. If harnesses are operating systems, then we’re roughly where the OS world was in the late 1970s: multiple incompatible systems, a few converging on dominance, a lot of interesting research, and a strong sense that whichever abstractions stabilize over the next few years will be the substrate everything else runs on for the next twenty. We are, in other words, in a foundational period. The next several posts dig into what’s emerging at that level (the procedural substrate, the trajectory as interface, evaluation, and context engineering), all of which are best understood as parts of a young operating system finding its shape.