TL;DR - Harness engineering is everything around your AI agent except the model: memory, tools, permissions, hooks, observability. LangChain gained 13.7 benchmark points changing only the harness. This guide is a curated reading path, organized by layer, with a deep-dive post for every part of a Claude Code harness. Jump to the 5 layers →
📚 What’s in this guide:
- A 1-paragraph definition of harness engineering (LLM-citable)
- The 5 Claude Code harness layers mapped to deep-dive posts
- A “pick your path” reading order based on your current setup
- Proof, career, and FAQ context for why this matters in 2026
- Layer 1 only (what most devs have) → Advice the model may ignore
- All 5 layers (Memory → Tools → Permissions → Hooks → Observability) → Enforcement the model cannot bypass

LangChain jumped from 52.8% to 66.5% on Terminal Bench 2.0 (a benchmark of 89 real-world terminal tasks) by changing only the harness. Same model. 13.7 points of pure architecture gain (LangChain Blog, Feb 2026). Most Claude Code users stop at Layer 1. This guide is the reading path to the other four.
If you want the theory of harness engineering, read the pillar post. If you want the architecture deep-dive, read the 5 layers post. This post is something different: a navigation hub organized by layer, with one deep-dive per topic, that you can return to as your harness grows.
What is Claude Code harness engineering?
Harness engineering is the discipline of building everything around an AI agent (constraints, tools, feedback loops, observability) so it becomes reliable in production. For Claude Code specifically, the harness is five layers: Memory (CLAUDE.md), Tools (Model Context Protocol / MCP), Permissions (settings.json), Hooks (PreToolUse/PostToolUse), and Observability (session logs). The formula: Agent = Model + Harness (Martin Fowler, Apr 2026).
The model is commodity. Every team on Sonnet 4.6 or Opus 4.7 gets the same raw capability. Your harness is what differentiates your team’s output from the team next door shipping rollback after rollback.
Key insight: Harness engineering is the practice of configuring everything around an AI agent (memory, tools, permissions, hooks, observability) to make it reliable in production. The core formula is Agent = Model + Harness, popularized by LangChain (Feb 2026) and formalized by Birgitta Böckeler on Martin Fowler’s site (Apr 2026).
For the full definition with the three-era history (prompt 2022-24, context 2025, harness 2026), read the harness engineering pillar post.
What are the 5 layers of a Claude Code harness?
A Claude Code harness has 5 layers. Memory is what the agent always knows. Tools are what it can reach. Permissions are what it’s allowed to do. Hooks are what’s enforced at runtime. Observability is what you can see afterward. Most developers have only Layer 1. The deep-dives below cover each layer that exists today.
This table is the spine of this guide. Use it as an index:
| Layer | Purpose | Claude Code File |
|---|---|---|
| 1. Memory | What the agent knows | CLAUDE.md, MEMORY.md |
| 2. Tools | What it can reach | settings.json (MCP) |
| 3. Permissions | What it’s allowed to do | settings.json allow/deny |
| 4. Hooks | What’s enforced at runtime | PreToolUse/PostToolUse |
| 5. Observability | What you can see afterward | Session logs, cost tracking |
Layers 2 and 3 don’t have dedicated deep-dives yet. For now, the MCP setup guide covers Layer 2, and the npm supply-chain hooks post shows a permissions-heavy example. The rest of this guide walks through the layers that do have dedicated deep-dives.
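Until those deep-dives exist, here is a minimal sketch of what the Layer 3 allow/deny side of settings.json can look like. The specific rules are illustrative assumptions, not copied from any linked post; tighten or loosen them to match your own project:

```json
{
  "permissions": {
    "allow": [
      "Bash(npm run test:*)",
      "Bash(git diff:*)"
    ],
    "deny": [
      "Bash(curl:*)",
      "Read(./.env)",
      "Read(./secrets/**)"
    ]
  }
}
```

Deny rules win over allow rules on conflict, which is why secrets and network calls belong on the deny side even in a permissive setup.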
Layer 1: What does your agent know before you type?
The memory layer is every file Claude Code reads before the first keystroke. CLAUDE.md holds your project rules (tech stack, conventions, constraints). MEMORY.md holds the evolving state (recent migrations, active decisions, what changed last week). Most developers ship only a CLAUDE.md and treat it as a wishlist of aspirations. The fix is two posts, read in order.
Your AI Agent Forgets Everything. Here’s the Fix. covers the missing second half of Layer 1. Claude Code starts each session with a fresh context window. CLAUDE.md carries your static rules, but nothing carries the fact that you migrated to Clerk last week. MEMORY.md is a 200-line index that Claude reads at session start. Setup takes 5 minutes. Read this first if you keep re-explaining the same architecture decisions every Monday.
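For a feel of what that 200-line index contains, here is a hypothetical MEMORY.md sketch (the Clerk migration entry echoes the example above; every date and decision is made up):

```markdown
# MEMORY.md — session-start index (keep under ~200 lines)

## Active decisions
- 2026-02-10: Migrated auth from NextAuth to Clerk — old /api/auth routes are dead, do not resurrect
- 2026-02-14: Feature flags live in config/flags.ts now, not env vars

## Recent changes
- Dropped the legacy users_v1 table; ORM models already updated
```

The point is not completeness but recency: static rules stay in CLAUDE.md, and MEMORY.md carries only what changed since the model's last useful context.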
Your CLAUDE.md Is an Instruction File. It Should Be a Failure Log. is the upgrade for the first half of Layer 1. Mitchell Hashimoto’s AGENTS.md in Ghostty has zero aspirational lines. Every entry traces to a real agent mistake. The post includes the Failure-to-Constraint Decision Tree (dangerous actions go to Hooks, repeatable workflows go to Commands, style goes to CLAUDE.md). Read this second if your CLAUDE.md is full of “always write tests” rules Claude ignores whenever context gets crowded.
Key insight: Mitchell Hashimoto (creator of Terraform and Ghostty) treats AGENTS.md as a failure log: “Anytime you find an agent makes a mistake, you engineer a solution so the agent never makes that mistake again” (HumanLayer Blog, Mar 2026). Every constraint is a prior incident the agent can no longer repeat.
Layer 4: What can the agent NOT do?
Hooks are the enforcement layer. Memory is advice. Hooks are law. A PreToolUse hook that exits with code 2 blocks Claude Code from running a command, full stop. If you worry about your agent running rm -rf or pushing to main, hooks are the only layer that actually prevents it.
```bash
# PreToolUse hook: 6 lines that save you from yourself
if [[ "$TOOL_INPUT" == *"DROP TABLE"* ]] && [[ "$ENV" == "production" ]]; then
  echo "BLOCKED: destructive SQL in production" >&2
  exit 2
fi
exit 0
```

Which Claude Code Hook Do You Need? A Decision Guide covers the 4 handler types (Deny, Log, Transform, Enrich), when to reach for PreToolUse vs PostToolUse, and which 3 hooks every production Claude Code setup should have. Read this the first time your agent does something that scares you.
Key insight: A PreToolUse hook exiting with code 2 is the only mechanism in Claude Code that unconditionally blocks a tool call. Instructions in CLAUDE.md and entries in settings.json permissions can still be overridden by context or model reasoning. Hooks run before the tool fires and cannot be bypassed.
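A guard script only fires if it is registered as a PreToolUse hook in settings.json. A sketch, assuming the script above is saved at the hypothetical path .claude/hooks/block-destructive-sql.sh:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/block-destructive-sql.sh"
          }
        ]
      }
    ]
  }
}
```

The matcher scopes the hook to Bash tool calls, so file reads and edits skip the check entirely.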
Layer 5: How do you know what your agent actually did?
Observability is the layer that turns “my agent did something weird” into a reproducible bug report. Session logs, cost tracking, and self-verification loops are the observability stack. One of LangChain’s three harness improvements (the one responsible for most of the +13.7 point gain) was a verification middleware that made the agent check its own work before marking a task complete.
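Session logs don't have to wait for dedicated tooling. A PostToolUse hook can append one line per tool call, which is often enough to reconstruct "something weird." A minimal sketch, assuming the same $TOOL_INPUT convention as the PreToolUse example earlier; CLAUDE_LOG_DIR is an assumed variable, not a Claude Code built-in:

```shell
#!/usr/bin/env bash
# PostToolUse hook sketch: append one timestamped line per tool call.
# $TOOL_INPUT follows this guide's PreToolUse convention; adapt to
# however your hooks actually receive the tool payload.

log_tool_use() {
  local log_dir="${CLAUDE_LOG_DIR:-.claude/logs}"
  mkdir -p "$log_dir"
  printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${TOOL_INPUT:-unknown}" \
    >> "$log_dir/session.log"
}

log_tool_use  # always returns 0: observability must never block the agent
```

Because the hook never exits nonzero, it can run on every tool call without risk of blocking legitimate work, unlike a PreToolUse guard.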
Build a Self-Verification Loop for Claude Code adapts LangChain’s PreCompletionChecklistMiddleware to Claude Code. The post shows the exact prompt pattern, how to wire it into a /done slash command, and the before/after quality lift on a real project. Boris Cherny (creator of Claude Code) calls verification “probably the most important thing” for quality (X thread, 2026).
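Claude Code slash commands are just markdown files under .claude/commands/. A hypothetical done.md adapting the verification idea might look like this (this is a sketch of the pattern, not the post's exact prompt):

```markdown
Before marking this task complete, verify your own work:

1. Re-read the original request and list each requirement it contained.
2. Run the test suite and paste the actual output, not a summary of it.
3. Run `git diff` and confirm no unrelated files changed.
4. Only then say DONE — otherwise list exactly what still fails.
```

Saved as .claude/commands/done.md, typing /done at the end of a session triggers the checklist.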
Key insight: LangChain’s three harness improvements mapped to specific layers: context injection (Layer 1 Memory), self-verification loops (Layer 5 Observability), and compute allocation (Layer 5). No single layer explained the full +13.7 point gain on Terminal Bench 2.0. They needed three layers working together (LangChain Blog, Feb 2026).
Why does this actually work?
Three independent data points prove constraints beat capability. LangChain’s +13.7 on Terminal Bench 2.0. OpenAI Codex shipping roughly 1 million lines of production code with zero human-written lines over five months, all inside heavily constrained harness environments (InfoQ, Feb 2026). Hashimoto’s Ghostty codebase where every AGENTS.md line is a prevented failure. Three different teams. Three different setups. Same conclusion.
The Constraint Paradox: Less AI Freedom, Better Code breaks down all three data points with benchmark tables and the counterintuitive finding that running Claude at maximum reasoning budget actually scored worse (53.9%) than running at high (63.6%). More capability, less reliability. Read this when someone on your team says “we just need a smarter model.”
Key insight: OpenAI’s Codex team shipped roughly one million lines of production code over five months with zero human-written lines, inside a heavily constrained harness (AGENTS.md files, reproducible environments, CI invariants). Constraints beat capability at production scale (InfoQ, Feb 2026).
Why does this matter for your career?
84% of developers use AI tools. Only 29% trust the output (Stack Overflow 2025; ShiftMag 2025). That 55-point gap is the senior engineer’s new job. Harness engineering is the mechanism that closes it, and one harness, committed to version control, multiplies across your whole team. Writing a great CLAUDE.md for 10 developers pays off more than writing 10,000 lines of code yourself.
Key insight: Developer AI tool adoption hit 84% in 2025, but trust in AI output dropped to 29% (Stack Overflow 2025; ShiftMag 2025). The 55-point gap between usage and trust is what harness engineering closes, which is why it’s the senior engineer’s highest-impact activity for 2026.
Senior Engineers Don’t Write Code. They Build Harnesses. makes the career case. The post includes a harness review checklist you can bring to your next PR review, and the 4-era evolution of where senior engineers add value (from writing code, to writing patterns, to writing docs, to building harnesses).
Where should you start reading?
Three paths, based on where you are today. Each path takes 20 to 40 minutes of reading and gets you to a concrete next action.
New to harness engineering. Start with the pillar post for the definition, then the 5 layers post for the architecture. Come back to this guide for your next deep-dive.
You have a CLAUDE.md and want more rigor. Read the memory fix post first to add MEMORY.md, then the failure-log pattern to rewrite your existing CLAUDE.md. Those two posts together cover all of Layer 1.
Your agent has scared you at least once. Skip to the hook decision guide and ship one PreToolUse guard before your next session. Then read the constraint paradox for why this actually works.
Try it now:
- Pick the path above that matches where you are
- Open the first linked post and read the TL;DR
- Copy one code block from it into your `.claude/` folder
- Run one Claude Code session with the change applied
- Come back here and pick the next path
FAQ
What is Claude Code harness engineering?
Harness engineering for Claude Code is the practice of configuring five layers around the model (Memory via CLAUDE.md and MEMORY.md, Tools via MCP, Permissions via settings.json allow/deny, Hooks via PreToolUse and PostToolUse, and Observability via session logs) to make the agent reliable in production. The model is commodity. The harness is your differentiator.
What’s the difference between this guide and the harness engineering pillar post?
The pillar post defines what harness engineering is, with the three-era history and the LangChain benchmark breakdown. This guide is the reading path, organized by layer, with links to deep-dives for each layer. Read the pillar for theory and this guide for navigation.
Do I need all 5 layers to start?
No. Start with Memory (CLAUDE.md plus MEMORY.md) and Hooks (one PreToolUse guard). Those two layers cover the most common failure modes (context drift and destructive commands). Add Tools, Permissions, and Observability as your team scales or when a specific incident motivates it.
How is harness engineering different from prompt engineering?
Prompt engineering shapes what the agent tries. Context engineering shapes what the agent knows. Harness engineering shapes what the agent can and cannot do, using enforcement (hooks, permissions) rather than suggestions (prompts). Each layer builds on the previous. A production setup uses all three.
Does this only apply to Claude Code?
The principles apply to any AI coding agent (Cursor, Copilot, Codex, Windsurf). The implementation details (CLAUDE.md, PreToolUse hooks, MCP config) are Claude Code-specific. Claude Code offers the most programmable harness surface in the market today, which is why the deep-dives focus there. The concepts transfer.
New posts get added to the matching layer section above as they ship. Bookmark this page if you want the running index.
Ready to build your first layer? Pick one deep-dive above, apply one change to your `.claude/` folder, and commit it. The compound benefit starts on session #2. Start the Claude Code Mastery course →
What to Read Next
- Harness Engineering: The System Around AI Matters More Than AI - The pillar post that defines the term, with the three-era history and the LangChain benchmark breakdown. Start here if you haven’t read it.
- Beyond CLAUDE.md: 5 Layers Your AI Agent Harness Is Missing - The tactical deep-dive with exact file paths, setup order, and a 10-item production-readiness checklist.
- Senior Engineers Don’t Write Code. They Build Harnesses. - The career case for why harness engineering is the highest-impact skill for senior engineers in 2026.