TL;DR — Harness engineering is everything around your AI agent except the model: memory, tools, permissions, hooks, observability. LangChain gained 13.7 benchmark points by changing only the harness (52.8% to 66.5%, same model). Most developers only have Layer 1 (CLAUDE.md). Production needs all 5.

Two lines of config. Same AI model. Completely different reliability:

```shell
# CLAUDE.md approach (can be ignored)
"Never delete production database tables."
# Claude reads this, weighs it against 200K tokens of context, may ignore it.

# Hook approach (always enforced)
# PreToolUse hook: command contains "DROP TABLE" + env=production → exit 2 → BLOCKED.
```

The first is advice. The second is enforcement.

One lives in a markdown file that competes with thousands of other tokens for the model’s attention. The other is a shell script that runs before every command and cannot be bypassed. The gap between these two approaches is the gap most teams don’t know exists.
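To make the enforcement side concrete, here is a minimal sketch of the guard logic. The `should_block` helper and its inputs are hypothetical; a real PreToolUse hook receives the tool call as JSON on stdin and exits with code 2 to block it.

```shell
# Sketch of the enforcement idea (hypothetical helper, not a real Claude Code API):
# refuse destructive SQL whenever the environment is production.
should_block() {
  local command="$1" env="$2"
  if [[ "$env" == "production" && "$command" == *"DROP TABLE"* ]]; then
    echo "BLOCKED: destructive SQL against production" >&2
    return 2   # in a real PreToolUse hook, exit code 2 blocks the tool call
  fi
  return 0
}

should_block "psql -c 'DROP TABLE users;'" "production" || echo "rc=$?"   # prints rc=2
should_block "psql -c 'SELECT 1;'" "production" && echo "allowed"          # prints allowed
```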

That gap has a name now: harness engineering.


What is harness engineering? (And why prompt engineering isn’t enough)

Harness engineering is the discipline of building constraints, tools, feedback loops, and observability around an AI agent to make it reliable in production. The formula, popularized by LangChain and refined on Martin Fowler’s site: Agent = Model + Harness. The model is a commodity. The harness is your competitive advantage.

Mitchell Hashimoto, creator of Terraform and Ghostty, defined the core idea: anytime you find an agent makes a mistake, you engineer a solution so the agent never makes that mistake again. In Ghostty’s repository, each line in the AGENTS.md file corresponds to a specific past agent failure that’s now prevented (HumanLayer Blog, Mar 2026).

The industry has moved through three distinct eras:

| Era | Years | Focus | Key Question | Limitation |
|---|---|---|---|---|
| Prompt Engineering | 2022-2024 | Crafting better instructions | "How do I phrase this?" | Instructions get diluted in long contexts |
| Context Engineering | 2025 | Curating what the model sees | "What information does it need?" | Knowing isn't doing; context alone doesn't prevent bad actions |
| Harness Engineering | 2026 | Building systems around the agent | "What can it do, and what can't it?" | Emerging discipline, still being defined |

Prompt engineering shapes what the agent tries. Context engineering shapes what the agent knows. Harness engineering shapes what the agent can and cannot do.

Key insight: Harness engineering is the discipline of building constraints, tools, feedback loops, and observability around AI agents. The core formula: Agent = Model + Harness. The term emerged in early 2026, formalized by Birgitta Böckeler on Martin Fowler’s site (Apr 2026), OpenAI, and LangChain. Mitchell Hashimoto’s AGENTS.md pattern in Ghostty became one of the most cited examples of the practice.


How did LangChain gain 13.7 benchmark points without changing the model?

By improving three harness components, LangChain jumped from 52.8% to 66.5% on Terminal Bench 2.0 (a benchmark of 89 real-world terminal tasks) while keeping the same model, gpt-5.2-codex. They went from Top 30 to Top 5. No fine-tuning. No model swap. Just harness changes (LangChain Blog, Feb 2026).

Here are the three changes:

1. Context injection. LangChain’s LocalContextMiddleware maps the environment upfront and injects it directly into the agent’s context. Before this change, the agent wasted steps trying to understand its surroundings.

2. Self-verification loops. After each action, the agent verifies its output against task-specific criteria before moving on. Not just “run the tests.” The agent checks whether the output matches what the task actually asked for.

3. Compute allocation. This one is counterintuitive: running at maximum reasoning budget (xhigh) scored only 53.9%, while the high setting scored 63.6%. More compute caused timeouts that hurt overall performance. The harness needed to manage how much thinking the agent does, not just what it thinks about.
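The self-verification idea (change 2 above) can be sketched in a few lines. The `grep -q "OK"` predicate below is a hypothetical stand-in for real task-specific criteria, and the retry cap is arbitrary:

```shell
# Self-verification loop sketch: run an action, check its output against a
# task-specific predicate, retry a bounded number of times before giving up.
verify_with_retries() {
  local max_attempts=3 attempt output
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    output=$("$@" 2>&1)                      # the action under verification
    if echo "$output" | grep -q "OK"; then   # hypothetical success predicate
      echo "verified on attempt $attempt"
      return 0
    fi
  done
  echo "failed verification after $max_attempts attempts" >&2
  return 1
}

verify_with_retries echo "build OK"   # prints: verified on attempt 1
```

The important design point is the bounded retry: an unbounded loop is exactly the kind of runaway compute the xhigh result warns about.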

| Setting | Score | Notes |
|---|---|---|
| Before harness changes | 52.8% | Baseline, Top 30 |
| After harness changes (high reasoning) | 66.5% | Top 5, +13.7pp |
| Max reasoning (xhigh) | 53.9% | Worse than baseline, timeouts |

If you’re evaluating AI coding tools by comparing model benchmarks alone, you’re measuring the wrong variable.

Key insight: LangChain improved their coding agent from 52.8% to 66.5% on Terminal Bench 2.0 (+13.7 percentage points) by changing only the harness while keeping gpt-5.2-codex fixed. Running at maximum reasoning budget actually scored worse (53.9%) due to timeouts (LangChain Blog, Feb 2026).


What are the 5 layers of an AI agent harness?

A production harness has five layers: Memory, Tools, Permissions, Hooks, and Observability. Most developers I talk to in the Claude Code community have Layer 1 and maybe part of Layer 2. That leaves three layers of reliability on the table.

Here’s the complete map for Claude Code:

| Layer | What It Is | Problem It Solves | Claude Code Implementation |
|---|---|---|---|
| 1. Memory | Persistent context across sessions | Agent "forgets" your conventions every session | CLAUDE.md, MEMORY.md, .claude/commands/ |
| 2. Tools | Extended capabilities beyond built-ins | Agent can't access your APIs, databases, or services | MCP servers, custom tools |
| 3. Permissions | What the agent is allowed to do | Agent edits sensitive files or runs dangerous commands | settings.json allow/deny lists |
| 4. Hooks | Automated enforcement at lifecycle points | Instructions get ignored under context pressure | PreToolUse/PostToolUse hooks |
| 5. Observability | Knowing what the agent actually did | No visibility into agent decisions or cost | Session logs, cost tracking, action audit |

Think of it like your CI/CD pipeline. You built that infrastructure once, and the whole team benefits on every push. A harness works the same way for AI agent sessions.
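As a taste of Layer 3, permissions are just a small settings.json fragment using Claude Code's allow/deny pattern. Treat the specific rules below as illustrative, and check the settings reference for your version:

```json
{
  "permissions": {
    "allow": ["Bash(npm run lint)", "Bash(npm run test:*)"],
    "deny": ["Bash(curl:*)", "Read(.env)", "Read(./secrets/**)"]
  }
}
```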

OpenAI demonstrated this at scale. Their Codex team shipped roughly one million lines of production code, with zero lines written by human hands, over five months. Their harness included AGENTS.md files, reproducible dev environments, and mechanical invariants in CI. The work took roughly one-tenth of the time a human team would have needed (InfoQ, Feb 2026).

Each layer deserves its own deep-dive. For the full implementation blueprint, see 5 Layers of a Production-Ready Claude Code Harness.

Key insight: OpenAI’s Codex team shipped roughly one million lines of production code with zero human-written lines over five months, using a harness of AGENTS.md files, reproducible environments, and CI invariants. Throughput: 3.5 merged PRs per engineer per day (InfoQ, Feb 2026).


Where is your harness right now?

Most developers have a CLAUDE.md file and maybe a few MCP servers. That’s Layer 1 and part of Layer 2 out of five. Run this checklist to find out where you stand.

Answer yes or no:

| # | Question | Layer |
|---|---|---|
| 1 | Do you have a CLAUDE.md with project conventions and constraints? | Memory |
| 2 | Do you have MCP servers connecting Claude Code to external tools? | Tools |
| 3 | Do you have settings.json with explicit allow/deny lists? | Permissions |
| 4 | Do you have at least one PreToolUse hook that blocks dangerous actions? | Hooks |
| 5 | Can you see what Claude did in each session and how much it cost? | Observability |

Your score:

  • 1/5: You’re in the majority. Most developers stop at CLAUDE.md.
  • 2-3/5: Ahead of most. You’ve started building real infrastructure.
  • 4-5/5: Production-ready. You’re doing harness engineering whether you knew the name or not.

Be honest about question 4. If the answer is no, your agent can still rm -rf your project directory. CLAUDE.md says “don’t do that.” A hook actually prevents it.

Here’s why this matters: an ETH Zurich study (Feb 2026) tested context files across 138 real-world tasks from 12 Python repositories. Human-written context files improved agent success by only about 4%. LLM-generated ones actually reduced success by ~3% while increasing inference costs by over 20% (MarkTechPost, Feb 2026). Instructions alone aren’t enough. You need enforcement layers.

HumanLayer keeps their CLAUDE.md under 60 lines (HumanLayer Blog, Mar 2026). Fewer instructions, more hooks. That’s the direction.

Key insight: An ETH Zurich study tested context files across 138 real-world tasks from 12 Python repositories. Human-written context files improved agent success by only ~4%, while LLM-generated ones reduced success by ~3% and increased inference costs by over 20% (MarkTechPost, Feb 2026).




How do you start building a harness today?

You don’t need all 5 layers at once. Start with three high-impact changes that take less than 30 minutes total. Each one covers a different layer and gives you immediate improvement.

Quick Win 1: Create a MEMORY.md (5 minutes)

MEMORY.md is a lightweight index that points to where knowledge lives in your project. Unlike CLAUDE.md (which holds static rules), MEMORY.md tracks evolving state: recent decisions, architectural changes, active work.

Keep each entry under 150 characters:

MEMORY.md
```markdown
- [Auth](src/lib/auth/) — Clerk, not NextAuth. Migrated March 2026.
- [DB](prisma/schema.prisma) — PostgreSQL on Supabase. All queries via Prisma.
- [Deploy](docs/deploy.md) — Vercel preview for PRs, production on main.
- [Testing](vitest.config.ts) — Vitest unit, Playwright E2E. Min 80% coverage.
- [API](src/app/api/) — Server Actions preferred over API routes for mutations.
```

Quick Win 2: Add one PreToolUse guardrail hook (15 minutes)

This hook blocks Claude Code from editing sensitive files. Copy-paste ready:

.claude/hooks/block-sensitive-files.sh
```shell
#!/bin/bash
# Blocks edits to .env, credentials, and CI config
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')
SENSITIVE=('.env' 'credentials' '.github/workflows' 'secrets')
for pattern in "${SENSITIVE[@]}"; do
  if [[ "$FILE_PATH" == *"$pattern"* ]]; then
    echo "BLOCKED: Cannot edit sensitive file: $FILE_PATH" >&2
    exit 2
  fi
done
exit 0
```

Register it in .claude/settings.json:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "bash .claude/hooks/block-sensitive-files.sh"
          }
        ]
      }
    ]
  }
}
```
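Because hooks are plain scripts, you can sanity-check them before trusting them: pipe sample hook JSON into the script and inspect the exit code (2 = blocked, 0 = allowed). The standalone demo below recreates a jq-free variant of the script in a temp file so it runs anywhere; in your repo you would pipe into `.claude/hooks/block-sensitive-files.sh` directly.

```shell
# Standalone demo: write a jq-free variant of the guardrail to a temp file,
# then exercise it with sample hook JSON and check the exit codes.
HOOK=$(mktemp)
cat > "$HOOK" <<'EOF'
#!/bin/bash
INPUT=$(cat)
# crude jq-free extraction of .tool_input.file_path, for demo purposes only
FILE_PATH=$(echo "$INPUT" | sed -n 's/.*"file_path" *: *"\([^"]*\)".*/\1/p')
for pattern in '.env' 'credentials' '.github/workflows' 'secrets'; do
  [[ "$FILE_PATH" == *"$pattern"* ]] && { echo "BLOCKED: $FILE_PATH" >&2; exit 2; }
done
exit 0
EOF

echo '{"tool_input":{"file_path":".env.local"}}' | bash "$HOOK"; echo "exit: $?"   # exit: 2
echo '{"tool_input":{"file_path":"src/app.ts"}}' | bash "$HOOK"; echo "exit: $?"   # exit: 0
```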

For more guardrail patterns, see the Claude Code Hooks guide and the 17 Hook Events deep-dive.

Quick Win 3: Enable cost awareness (10 minutes)

Track what each session costs so you notice anomalies early. Boris Cherny, creator of Claude Code, calls verification “probably the most important thing” for quality: “Give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result” (X thread, 2026).

Cost tracking is the observability layer that tells you when a session is burning tokens without progress. Start simple: review ~/.claude/projects/ after each session to check what Claude did and how much it cost. For automated tracking, see How I Cut My Claude Code Bill by 60%.
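As a first step past eyeballing the directory, here is a rough rollup sketch. It assumes the session logs are JSONL files containing a numeric `costUSD` field; the log location and field names vary by Claude Code version, so verify against your own logs before relying on it.

```shell
# Rough cost rollup (assumes JSONL logs with a "costUSD" field; field name and
# log layout are version-dependent -- check your own logs before relying on this).
session_cost() {
  local dir="${1:-$HOME/.claude/projects}"
  grep -rho '"costUSD": *[0-9.]*' "$dir" 2>/dev/null \
    | grep -o '[0-9.]*$' \
    | awk '{ s += $1 } END { printf "%.2f\n", s + 0 }'
}

session_cost   # prints total USD across all logged sessions
```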

Key insight: Boris Cherny, creator of Claude Code, calls verification “probably the most important thing” for quality: “Give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result” (X thread, 2026).


Try it now: Pick one quick win above and implement it before your next Claude Code session. Quick Win 2 is copy-paste ready, so most of its 15 minutes is just registering and testing the hook.


FAQ

What is harness engineering?

Harness engineering is the discipline of building constraints, tools, feedback loops, and observability around AI agents to make them reliable in production. The formula: Agent = Model + Harness. The term emerged in early 2026, formalized by Birgitta Böckeler on Martin Fowler’s site, OpenAI, and LangChain.

What is the difference between harness engineering and prompt engineering?

Prompt engineering shapes what the agent tries. Context engineering shapes what the agent knows. Harness engineering shapes what the agent can and cannot do. They’re not replacements. They’re layers. A production AI workflow uses all three, but harness engineering provides the strongest reliability guarantees because it uses enforcement (hooks, permissions) rather than suggestions (prompts, context).

Do I need harness engineering for Claude Code?

Yes. Claude Code is itself a harness that Anthropic built around their model. But it’s the inner harness. You need an outer harness tailored to your project: CLAUDE.md for conventions, hooks for guardrails, MCP servers for tools, permissions for safety boundaries, and observability for cost control.

Is harness engineering only for Claude Code?

No. The principles apply to any AI coding agent: Cursor, GitHub Copilot, OpenAI Codex, Windsurf, Cline. Claude Code happens to offer the most programmable harness surface (17 hook events, MCP protocol, skills system), which is why examples here use it. The concepts transfer directly to other tools.