Harness Engineering

A response to PY (@thecommitlog) on why the control layer now matters more than the model swap

Inspired by PY’s video, “Rethinking AI Agents: The Rise of Harness Engineering”. This essay builds on that framing and extends it into a practical view of what teams should actually build next.

For the last two years, most of the conversation around agents has centered on models: which one reasons better, which one uses tools more reliably, which one has the larger context window, which one tops the leaderboard this month. PY’s video pushes the discussion somewhere more useful. The emerging evidence suggests that the biggest gains are increasingly coming not from model swaps alone, but from the system wrapped around the model: the harness.[^1]

That shift is not semantic. It changes what we should optimize, how we should architect agents, and where teams should invest their time. The harness is not “glue code” anymore. It is the control layer that decides what the model sees, what state persists, when work is decomposed, how verification happens, and when the run is allowed to stop. In practice, that means the harness is often the difference between an agent that looks clever in a demo and one that behaves reliably in production.[^2][^3]

The harness as operating system

The cleanest mental model is this: the model is the reasoning engine, and the harness is the operating system.

A model is good at local cognition. It can interpret a task, propose a plan, write code, critique an answer, or choose a next action. But it does not, by itself, provide durable state, bounded execution, permissioning, verifiable completion, or structured recovery from failure. Those are harness responsibilities. The Tsinghua paper on Natural-Language Agent Harnesses makes this explicit by treating harness logic as its own object of study: contracts, roles, stage structure, adapters, state semantics, and failure taxonomies are all first-class parts of the system.[^2]

That framing immediately clarifies why the same model can perform so differently across implementations. Two agents can share the same weights and still behave like entirely different systems because the harness changes how tasks are staged, how much context is injected, whether work is delegated, which tools are available, and whether exit is gated by verification. Anthropic’s agent design guidance describes the recurring workflow patterns teams are using in production: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer loops.[^4] Those are harness choices, not model choices.

What the recent research shows

The strongest recent signal comes from two March 2026 papers that approach the problem from different directions.

The first, Natural-Language Agent Harnesses, argues that harness logic has been too entangled with controller code and runtime conventions to study cleanly. The paper introduces a runtime that lifts model completions into bounded agent calls with explicit execution contracts: required outputs, budgets, permission scope, completion conditions, and designated output paths.[^2] It also makes durable, file-backed state explicit rather than leaving critical progress inside the context window. In the paper’s formulation, good harness design means making control logic inspectable, editable, and executable while keeping deterministic hooks such as tests, parsers, and verifiers outside the model.[^2]

That paper’s results are especially important because they cut against the lazy intuition that “more structure must be better.” On SWE-bench Verified, the full runtime and stripped-down variants landed in a narrow performance band, but with radically different costs. In one comparison, the full setup reached 74.4 while consuming 16.3 million prompt tokens and 642.6 tool calls, whereas a lighter configuration reached 75.2 with 1.2 million prompt tokens and 51.1 tool calls.[^2] The lesson is not that harnesses do not matter. The lesson is that harnesses change behavior dramatically, and extra structure only helps when it aligns tightly with the final acceptance condition.

The same paper also shows why representation matters. When the authors migrated OS-Symphony from a native code harness into their natural-language harness representation, performance on OSWorld rose from 30.4 to 47.2, while runtime fell from 361.5 minutes to 140.8 and LLM calls collapsed from about 1.2k to 34.[^2] More strikingly, the migrated system shifted away from brittle GUI repair loops and toward file-backed state and artifact-backed completion. In other words, the better result did not come from a smarter model. It came from a better control representation.

The second paper, Meta-Harness, asks the next obvious question: if the harness matters this much, can we optimize it directly? Its answer is yes. Instead of tuning prompts inside a fixed pipeline, Meta-Harness searches over the harness code itself. Its proposer has access to source code, scores, and raw execution traces through a filesystem, and uses those traces to generate better harness candidates.[^3] On online text classification, it improves over a state-of-the-art context management baseline by 7.7 points while using four times fewer context tokens. On retrieval-augmented math reasoning, one discovered harness improved performance by 4.7 points on average across five held-out models. On TerminalBench-2, discovered harnesses beat the best hand-engineered baselines.[^3]

That result matters for one reason above all: it suggests the reusable asset is increasingly the harness, not just the model checkpoint. If a discovered harness transfers across multiple models, then teams should think of harness design the way they think of product architecture or evaluation infrastructure: a durable advantage that compounds over time.

The practical takeaway: stop treating the harness as scaffolding

PY’s video is most valuable when read as a design correction. Too many agent systems are still built as if the model were the product and everything around it were disposable support code. The research says the opposite. The harness is where reliability, cost control, recoverability, and operational quality emerge.

A good harness does at least seven things.

First, it turns open-ended model outputs into bounded work. That means each run or sub-run has an explicit contract: what artifact is owed, what budget is available, what permissions exist, and what counts as completion.[^2]
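The contract idea can be made concrete with a small sketch. This is illustrative only, assuming a Python harness; the class and field names (`ExecutionContract`, `required_artifacts`, and so on) are my own stand-ins, not the paper's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an execution contract. Field names are
# illustrative, not taken from the Natural-Language Agent Harnesses paper.
@dataclass
class ExecutionContract:
    task: str                      # what the run is asked to do
    required_artifacts: list[str]  # files the run owes on completion
    token_budget: int              # hard cap on tokens for this run
    tool_allowlist: set[str] = field(default_factory=set)
    max_tool_calls: int = 50

    def is_complete(self, produced: set[str]) -> bool:
        """Completion is artifact-backed: every owed file must exist."""
        return all(a in produced for a in self.required_artifacts)

contract = ExecutionContract(
    task="fix the failing unit test",
    required_artifacts=["patch.diff", "report.md"],
    token_budget=200_000,
    tool_allowlist={"read_file", "write_file", "run_tests"},
)
assert not contract.is_complete({"patch.diff"})
assert contract.is_complete({"patch.diff", "report.md"})
```

The key design choice is that completion is decided by the contract, not by the model announcing it is finished.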

Second, it externalizes state. Important progress should not live only in transient context. Durable artifacts, ledgers, manifests, and task files should survive truncation, restart, delegation, and handoff.[^2]
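A minimal sketch of file-backed state, assuming a Python harness: a JSON ledger on disk that any fresh process, restarted run, or delegated worker can replay. The `Ledger` class is a hypothetical illustration, not an interface from the paper.

```python
import json
import tempfile
from pathlib import Path

# Illustrative only: durable, file-backed progress that survives context
# truncation, restarts, and handoffs.
class Ledger:
    def __init__(self, path: Path):
        self.path = path
        if not path.exists():
            path.write_text("[]")

    def append(self, event: dict) -> None:
        events = json.loads(self.path.read_text())
        events.append(event)
        self.path.write_text(json.dumps(events, indent=2))

    def replay(self) -> list[dict]:
        # A new process rebuilds state from disk instead of relying on
        # anything still being in the context window.
        return json.loads(self.path.read_text())

workdir = Path(tempfile.mkdtemp())
ledger = Ledger(workdir / "ledger.json")
ledger.append({"stage": "plan", "note": "split task into two subtasks"})
ledger.append({"stage": "execute", "artifact": "patch.diff"})

# Simulate a restart: a second Ledger instance recovers full progress.
recovered = Ledger(workdir / "ledger.json").replay()
assert [e["stage"] for e in recovered] == ["plan", "execute"]
```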

Third, it imposes stage structure. The most useful default remains some version of plan → execute → verify → repair → finalize, because that structure ties work to acceptance criteria rather than to the model’s internal sense of “I think I’m done.”[^2][^5]
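That stage loop can be sketched in a few lines. This is a toy, assuming a Python harness, with plan and finalize folded out for brevity; the stage functions are stand-ins for model and tool calls, and `run_stages` is my own name.

```python
# Minimal execute → verify → repair loop. Exit is gated by a
# deterministic verifier, not by the model's own sense of being done.
def run_stages(task, execute, verify, repair, max_repairs=3):
    artifact = execute(task)
    for _ in range(max_repairs):
        ok, feedback = verify(artifact)
        if ok:
            return artifact
        artifact = repair(artifact, feedback)
    raise RuntimeError("verification budget exhausted")

# Toy example: the acceptance condition is a deterministic check,
# kept entirely outside the "model" (here, simple lambdas).
result = run_stages(
    task="produce an even number",
    execute=lambda t: 3,
    verify=lambda a: (a % 2 == 0, "must be even"),
    repair=lambda a, fb: a + 1,
)
assert result == 4
```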

Fourth, it combines prompts with middleware and deterministic checks. LangChain’s February 2026 write-up is useful here because it is operational rather than theoretical. The team improved its coding agent by 13.7 points on Terminal Bench 2.0, from 52.8 to 66.5, while keeping the model fixed and changing only the harness.[^5] The changes were not just better instructions. They included middleware for context injection, self-verification, and pre-completion checks.
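A pre-completion check of this kind might look like the following sketch. It is a generic illustration of the idea, not LangChain's actual middleware API; `pre_completion_gate` and the check names are assumptions.

```python
# Hypothetical pre-completion gate: before the agent may declare success,
# deterministic checks run, and failures are fed back as context rather
# than allowed to end the run.
def pre_completion_gate(workspace: dict, checks) -> dict:
    failures = [name for name, check in checks if not check(workspace)]
    if failures:
        return {"done": False,
                "reinject": f"Fix before finishing: {failures}"}
    return {"done": True, "reinject": None}

checks = [
    ("tests_pass", lambda ws: ws.get("tests_passed", False)),
    ("artifact_written", lambda ws: "report.md" in ws.get("files", [])),
]

# Tests pass but the report artifact is missing: completion is refused.
verdict = pre_completion_gate({"tests_passed": True, "files": []}, checks)
assert verdict["done"] is False
assert "artifact_written" in verdict["reinject"]
```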

Fifth, it uses delegation selectively. Anthropic’s orchestrator-worker pattern remains one of the cleanest ways to structure tasks whose subtasks cannot be known in advance.[^4] But the Tsinghua ablations are a warning that more orchestration is not automatically better. In their experiments, self-evolution helped, but verifier stages and multi-candidate search sometimes hurt.[^2] Delegation should exist to decompose real complexity, not to make the system look sophisticated.
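A toy sketch in the spirit of the orchestrator-worker pattern, encoding the "delegate only when there is real complexity" rule. All names here are illustrative assumptions, not Anthropic's implementation.

```python
# Hypothetical orchestrator: decompose the task, but skip delegation
# entirely when decomposition yields nothing worth splitting.
def orchestrate(task, decompose, worker):
    subtasks = decompose(task)
    if len(subtasks) <= 1:
        return [worker(task)]      # don't spawn workers for trivial work
    return [worker(sub) for sub in subtasks]

results = orchestrate(
    task="update module A and its docs",
    decompose=lambda t: ["patch module A", "update docs for A"],
    worker=lambda sub: f"done: {sub}",
)
assert results == ["done: patch module A", "done: update docs for A"]
```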

Sixth, it traces everything worth improving. Meta-Harness depends on access to raw traces because summaries compress away the very signal needed to improve the control layer.[^3] LangChain likewise used traces to analyze failure modes and drive targeted harness changes.[^5] If you cannot inspect stage transitions, tool calls, verifier failures, retries, budgets, and termination reasons, you are not doing harness engineering yet. You are still guessing.
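The kind of raw, structured trace this requires can be sketched simply. The event kinds and field names below are my own assumptions about what is worth capturing, drawn from the categories listed above.

```python
import time

# Illustrative trace events: machine-readable records of stage
# transitions, tool calls, and termination, not prose summaries.
def trace_event(log: list, kind: str, **fields) -> None:
    log.append({"ts": time.time(), "kind": kind, **fields})

log: list[dict] = []
trace_event(log, "stage", name="verify", outcome="fail",
            reason="test_timeout")
trace_event(log, "tool_call", tool="run_tests", exit_code=1)
trace_event(log, "termination", why="budget_exhausted",
            tokens_used=180_000)

# Because events stay structured, failure modes can be mined later
# (by a human, or by an outer optimization loop over the harness).
failures = [e for e in log if e.get("outcome") == "fail"]
assert len(failures) == 1 and failures[0]["name"] == "verify"
```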

Seventh, it treats safety as harness policy. This matters because the harness is also the attack surface: it mediates tool use, artifacts, imported skills, and delegated workers. A January 2026 large-scale study of agent skills found that 26.1% of analyzed skills contained at least one vulnerability, spanning prompt injection, data exfiltration, privilege escalation, and supply-chain risks.[^6] If the harness governs execution, then the harness must also enforce permissions, provenance, allowlists, and sandbox boundaries.
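Enforcing that policy in the harness rather than in the prompt might look like this sketch, assuming a Python harness. The allowlist, sandbox root, and `gate_tool_call` name are illustrative assumptions.

```python
# Hypothetical harness-level policy: the model may request any tool,
# but the harness checks an allowlist and a sandbox boundary before
# anything executes.
ALLOWED_TOOLS = {"read_file", "run_tests"}   # explicit allowlist
WRITABLE_ROOT = "/workspace"                 # sandbox boundary

def gate_tool_call(tool: str, args: dict) -> bool:
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} not on allowlist")
    path = args.get("path", "")
    if path and not path.startswith(WRITABLE_ROOT):
        raise PermissionError(f"path {path!r} escapes sandbox")
    return True

assert gate_tool_call("read_file", {"path": "/workspace/src/app.py"})

# A request for an unlisted tool is refused by policy, not by the prompt.
try:
    gate_tool_call("shell", {"cmd": "curl attacker.example"})
    blocked = False
except PermissionError:
    blocked = True
assert blocked
```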

What a modern harness should look like

A practical harness does not need to start as a grand multi-agent framework. In fact, the research suggests that starting smaller is often wiser.

The right first version is a single-agent runtime with a strong contract. Give it a task file, a workspace, clear budgets, a small set of allowed tools, and a deterministic verifier. Require it to produce named artifacts. Store its state externally. Log its trace. Force it through a stage loop that includes verification before completion.

Only after that baseline is stable should you add middleware: environment discovery, context injection, loop detection, retry rules, budget warnings, and pre-exit verification hooks. Only after traces reveal genuine decomposition bottlenecks should you add worker agents. And only after you have enough trace data should you build an outer loop that proposes harness edits and evaluates them on held-out tasks.

The important design principle is that the harness should become more explicit before it becomes more elaborate. The Tsinghua paper’s core contribution is not “use natural language.” It is “make the control logic legible enough to compare, ablate, and improve.”[^2] The Stanford paper’s contribution is not “use an optimizer.” It is “treat the harness itself as the thing being optimized.”[^3]

That is a very different mindset from prompt tinkering.

The strongest lesson: prune, do not just pile on

One of the more subtle points in both the research and the surrounding practitioner writing is that harness engineering is as much subtraction as addition.

Anthropic presents modular patterns, but it also advises developers to start simple and only add complexity where the task genuinely requires it.[^4] LangChain’s work shows that carefully chosen harness interventions can move benchmarks meaningfully, but it does not imply that every extra verifier, loop, or branch will help.[^5] And the Tsinghua ablations make the warning explicit: more machinery can increase compute, tool calls, and runtime without improving outcomes, or even while making them worse.[^2]

That should change how teams evaluate agent architecture. The right question is not, “How many advanced components can we fit into this system?” It is, “Which parts of the control layer are clearly earning their keep?” A mature harness is one that has learned which structure to remove.

Why this matters now

The move from prompt engineering to context engineering to harness engineering marks a broader shift in the field. We are leaving the era where agent quality is treated mainly as a property of the model and entering one where agent quality is increasingly a property of the system.

That is good news for builders. It means meaningful gains are still available even when model progress is uneven or expensive. It means smaller models can outperform larger ones in specific settings when wrapped in a better control layer.[^3] It means teams can build proprietary advantage in runtime behavior, evaluation loops, safety policy, and operational discipline. And it means the next wave of progress in agents will likely come less from asking, “Which model should we switch to?” and more from asking, “What exactly should our harness control, persist, verify, and forbid?”

PY’s framing gets to the heart of it. If you build agents today, you are already doing harness engineering whether you call it that or not. The real choice is whether you do it implicitly, buried in ad hoc code and fragile conventions, or explicitly, as a system you can reason about, test, and improve.

That is the frontier now.

References

[^1]: PY (@thecommitlog), Rethinking AI Agents: The Rise of Harness Engineering, YouTube.

[^2]: Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng, Natural-Language Agent Harnesses, arXiv:2603.25723, 2026.

[^3]: Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn, Meta-Harness: End-to-End Optimization of Model Harnesses, arXiv:2603.28052, 2026.

[^4]: Anthropic, Building Effective AI Agents, 2024.

[^5]: Vivek Trivedy, LangChain, Improving Deep Agents with Harness Engineering, 2026.

[^6]: Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, and Leo Zhang, Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale, arXiv:2601.10338, 2026.
