HARNESS ENGINEERING

The discipline of building
AI agents that work in production.

Harness engineering is everything around the model: the state it tracks, the tools it can call, the controls that govern its behaviour, the verification that catches mistakes before they land, and the recovery that handles failures without crashing. The model decides. The harness governs.

88% of AI agent projects never reach production — the bottleneck is the harness, not the model
11 layers in a complete production harness: caller state through output and reviewer pass
~1.6% of a production agent is AI decision logic — the other 98.4% is harness infrastructure

What is harness engineering?

Harness engineering is the discipline of designing, building, and operating the infrastructure that wraps around an AI model to make it reliable in production. A harness is everything the model does not do on its own: the tools it can invoke, the memory it reads from, the state it updates, the controls that govern whether an action is allowed, the verification that checks the output, and the recovery logic that runs when something goes wrong.

The term draws from motorsport: a harness holds everything together under load, constrains what can move freely, and keeps the system in bounds when conditions get unpredictable. An AI agent harness does the same. Without it, the model is unrestrained — capable in demos, unreliable in production.

Harness engineering is the practice of designing that constraint layer deliberately. Not as an afterthought — not as glue code written after the model works — but as the primary engineering surface that determines whether the agent works at all.
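To make "the model decides, the harness governs" concrete, here is a minimal sketch of a harness loop. Every name in it (Harness, Action, stub_model, the journal strings) is hypothetical and chosen for illustration; it is not the Its Harness API, and it covers only a tool allowlist, one verification check, and retry-style recovery.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    tool: str
    args: dict


@dataclass
class Harness:
    # Hypothetical structure for illustration only.
    tools: dict[str, Callable[..., object]]
    allowed: set[str]
    verify: Callable[[object], bool]
    max_retries: int = 2
    journal: list[str] = field(default_factory=list)

    def run(self, decide: Callable[[list[str]], Action]) -> object:
        """Ask the model for an action, then govern how it executes."""
        for _ in range(self.max_retries + 1):
            action = decide(self.journal)            # the model decides
            if action.tool not in self.allowed:      # control: allowlist check
                self.journal.append(f"blocked:{action.tool}")
                continue
            try:
                result = self.tools[action.tool](**action.args)
            except Exception as exc:                 # recovery: record, then retry
                self.journal.append(f"failed:{action.tool}:{exc}")
                continue
            if self.verify(result):                  # verification before it lands
                self.journal.append(f"ok:{action.tool}")
                return result
            self.journal.append(f"rejected:{action.tool}")
        raise RuntimeError("harness exhausted retries; escalate to a human")


def stub_model(journal: list[str]) -> Action:
    # Stand-in for an LLM: tries a forbidden tool first, then a safe one.
    if not journal:
        return Action("delete_db", {})
    return Action("add", {"a": 2, "b": 3})


harness = Harness(
    tools={"add": lambda a, b: a + b, "delete_db": lambda: None},
    allowed={"add"},
    verify=lambda r: isinstance(r, int),
)
print(harness.run(stub_model))  # 5
print(harness.journal)          # ['blocked:delete_db', 'ok:add']
```

The stub model never changes; only the harness decides that delete_db is off limits and that the result of add is acceptable. That separation is the point of the sketch.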

The tool built for this discipline

Its Harness is an open-source visual canvas built entirely around harness engineering. The product is named for the discipline: draw the full 11-layer architecture, compile to LangGraph, CrewAI, Mastra, or Microsoft Agent Framework via FlowSpec, and trace every decision with Langfuse.

Harness engineering vs prompt engineering

Both matter. But they operate at different layers and solve different problems. In production, the returns on each are not equal.

Prompt engineering

  • Optimises the input to a single model call
  • Asks: "How should I phrase this?"
  • Scope: one call, one response
  • Reliability gain beyond a baseline: <3%
  • Breaks when state, tools, or multi-step execution enters the picture

Harness engineering

  • Designs the full execution loop the model operates within
  • Asks: "What system should govern what the model does?"
  • Scope: every call, every tool, every failure, every run
  • Reliability gain from harness-level changes: 28–47%
  • The layer that determines whether production agents ship at all

Harness engineering vs workflow building

A workflow routes prompts from node to node: input → LLM call → tool call → output. It gets data where it needs to go. Workflows are the right tool for deterministic, linear tasks where the inputs are clean and the failure modes are known.

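A harness governs the agent's full lifecycle. Where a workflow routes, a harness governs what the agent believes, what it is allowed to do, how it verifies its own outputs, how it recovers from failures, and what it learns for next time.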

Harnesses are the right tool for any agent operating in an environment where state, ambiguity, failure, and uncertainty are real. Most production use cases qualify.

The eleven layers of a production harness

You do not need all eleven layers for every use case. But each one addresses a specific failure mode that will surface in production if left unhandled. Start with the layers your agent needs today; add the rest as the failure modes appear.

Layer 1: Caller State. Constraints, clarification requests, and escalation propagation from the calling context into the agent's execution scope. Without it: the agent has no awareness of caller-level constraints or escalation signals.
Layer 2: World Model. Typed beliefs about the environment, contradiction detection, and generation_id tracking so the agent knows what it knows and when that knowledge was formed. Without it: the agent acts on stale or contradictory beliefs without knowing it.
Layer 3: Reasoning. An evidence store, hypotheses from four generation sources, and a value-of-information gate that fires before every action, so the agent gathers enough evidence before committing. Without it: the agent acts on the first plausible hypothesis rather than the best-supported one.
Layer 4: Control State. A five-tier state resolver (NORMAL → CAUTIOUS → RESTRICTED → BLOCKED → HALT), diagnostic health vectors that drive it, and deadlock detection that stops escalation loops. Without it: there are no brakes, and the agent escalates indefinitely when conditions degrade.
Layer 5: Planning. A task graph with six task states, parallel concurrency management, and dependency tracking so the agent coordinates work across multiple steps without collisions. Without it: multi-step tasks race, block, or duplicate work silently.
Layer 6: Execution. VOI-gated actions, a pre-execution review gate across five dimensions (safety, relevance, reversibility, resource, and alignment), and reversibility classification before any action fires. Without it: the agent takes irreversible actions without assessing the cost first.
Layer 7: Verification. Nine verification layers, an adequacy critic, and an adversarial reviewer pass that checks outputs before they land, catching errors the model itself would not flag. Without it: bad outputs reach users or downstream systems unchecked.
Layer 8: Recovery. Six named recovery strategies, a typed failure library, and local vs. global replanning scope so failures are handled systematically rather than crashing the entire run. Without it: one failed tool call ends the session instead of triggering a recovery path.
Layer 9: Memory. Context compression, an execution journal, and budget management so the agent fits within context limits without losing the thread of what it has done. Without it: long-running agents truncate their own history or exhaust context silently.
Layer 10 (optional): Learning. An experience store for cross-run structural reuse: the agent warm-starts from patterns that worked in previous runs rather than beginning from scratch each time. Without it: every run starts cold regardless of what prior runs discovered.
Layer 11: Output & Reviewer Pass. Output contract validation and a three-lens reviewer pass (correctness, completeness, safety) that signs off on the final response before it leaves the harness. Without it: outputs that pass verification still reach users without a final quality gate.
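As one worked example of a layer, the Control State resolver (Layer 4) can be sketched as a function from a health vector to one of the five tiers. The tier names come from the layer description above; the health signals, the thresholds, and the may_execute policy are assumptions made up for this sketch, and deadlock detection is left out.

```python
from dataclasses import dataclass
from enum import IntEnum


class ControlState(IntEnum):
    NORMAL = 0
    CAUTIOUS = 1
    RESTRICTED = 2
    BLOCKED = 3
    HALT = 4


@dataclass(frozen=True)
class HealthVector:
    # Hypothetical diagnostic signals, invented for this example.
    error_rate: float        # fraction of recent tool calls that failed
    contradictions: int      # unresolved contradictions in the world model
    budget_remaining: float  # fraction of token or cost budget left


def resolve(h: HealthVector) -> ControlState:
    """Return the most severe tier that any single signal demands."""
    tiers = [ControlState.NORMAL]
    if h.error_rate >= 0.5:
        tiers.append(ControlState.BLOCKED)
    elif h.error_rate >= 0.25:
        tiers.append(ControlState.RESTRICTED)
    elif h.error_rate >= 0.1:
        tiers.append(ControlState.CAUTIOUS)
    if h.contradictions >= 3:
        tiers.append(ControlState.RESTRICTED)
    elif h.contradictions >= 1:
        tiers.append(ControlState.CAUTIOUS)
    if h.budget_remaining <= 0.0:
        tiers.append(ControlState.HALT)
    return max(tiers)


def may_execute(state: ControlState, reversible: bool) -> bool:
    """Illustrative policy: irreversible actions need NORMAL;
    reversible actions may run up to and including RESTRICTED."""
    if reversible:
        return state <= ControlState.RESTRICTED
    return state == ControlState.NORMAL


state = resolve(HealthVector(error_rate=0.3, contradictions=1, budget_remaining=0.6))
print(state.name)                            # RESTRICTED
print(may_execute(state, reversible=True))   # True
print(may_execute(state, reversible=False))  # False
```

Taking the maximum over per-signal tiers means a single bad signal is enough to tighten control, which matches the "brakes" role the layer plays. A production resolver would likely also add hysteresis so the state does not flap between tiers.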

Design once. Compile to any framework.

Its Harness is the only open-source harness engineering software built specifically for this discipline. It covers the full loop — from drawing the architecture on a visual canvas to compiling and running it in production — without locking you into a single framework.

Step 1: Design (visual canvas)
Draw your harness architecture using 27 node types: 14 execution nodes and 13 harness nodes that implement the full 11-layer control architecture.
Node types: input · output · llm_call · tool_invoke · agent_role · agent_debate · condition · parallel_fork · parallel_join · hitl_breakpoint · memory_read · memory_write · subgraph · world_model · control_state · verify_gate · recovery · exp_store · reviewer_pass · +8 more
Step 2: Compile (FlowSpec)
The canvas exports to FlowSpec, a runtime-neutral, portable JSON format. One FlowSpec file runs on any supported framework without rewriting.
  • Open format, version-controlled
  • Framework-agnostic
  • Third-party node packs supported (@itsharness/nodes/…)
Step 3: Run (any framework)
The same FlowSpec compiles to LangGraph, CrewAI, Mastra, or Microsoft Agent Framework. Switch runtimes without touching the canvas design.
  • LangGraph (Python / JS)
  • CrewAI (Python)
  • Mastra (TypeScript)
  • MS Agent Framework (C# / Python / Java)
  • A2A protocol (any A2A runtime)
Step 4: Observe (Langfuse)
Every run is traced automatically. Spans extend to world model generation_id, control state transitions, recovery strategy changes, and reviewer-pass findings.
  • Harness-aware tracing
  • Compare runs across frameworks
  • HITL pause and resume
  • REST · MCP · A2A deployment
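The "design once, compile anywhere" idea can be illustrated with a toy version: one JSON graph, several adapters that read it. The spec shape below is an assumption made for this sketch and is not the real FlowSpec schema; only the node type names (input, llm_call, verify_gate, output) are taken from the node list above, and the two adapters emit plain strings where a real adapter would build framework objects.

```python
import json
from typing import Callable

# Hypothetical spec shape for illustration; NOT the real FlowSpec schema.
SPEC = json.loads("""
{
  "nodes": [
    {"id": "in", "type": "input"},
    {"id": "draft", "type": "llm_call"},
    {"id": "check", "type": "verify_gate"},
    {"id": "out", "type": "output"}
  ],
  "edges": [["in", "draft"], ["draft", "check"], ["check", "out"]]
}
""")


def execution_order(spec: dict) -> list[str]:
    """Topologically sort node ids so each node runs after its inputs."""
    pending = {node["id"]: 0 for node in spec["nodes"]}
    for _, dst in spec["edges"]:
        pending[dst] += 1
    ready = [nid for nid, count in pending.items() if count == 0]
    order: list[str] = []
    while ready:
        nid = ready.pop(0)
        order.append(nid)
        for src, dst in spec["edges"]:
            if src == nid:
                pending[dst] -= 1
                if pending[dst] == 0:
                    ready.append(dst)
    if len(order) != len(pending):
        raise ValueError("spec contains a cycle")
    return order


# Two toy adapters: same spec in, different runtime-specific artifact out.
def to_runtime_a(spec: dict) -> str:
    return " -> ".join(execution_order(spec))


def to_runtime_b(spec: dict) -> str:
    types = {node["id"]: node["type"] for node in spec["nodes"]}
    steps = enumerate(execution_order(spec), start=1)
    return "\n".join(f"step {i}: {types[nid]}" for i, nid in steps)


ADAPTERS: dict[str, Callable[[dict], str]] = {"a": to_runtime_a, "b": to_runtime_b}
print(ADAPTERS["a"](SPEC))  # in -> draft -> check -> out
```

Because both adapters depend only on the spec, switching runtime is a change of adapter key rather than a change to the design, which is the property the four steps above describe.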

Up and running in two commands.

Everything runs locally via Docker. No cloud account, no API key required to start — use Ollama to run a free local model if you prefer.

shell
# 1. Generate secrets and configure environment
./scripts/setup-env.sh

# 2. Start all nine services
docker compose up

  • Canvas (localhost:3000): visual harness editor
  • API (localhost:8000): compiles and runs FlowSpec
  • Langfuse (localhost:3001): observability dashboard

Apache 2.0. Source on GitHub: github.com/3IVIS/itsharness

Questions, answered.

What is harness engineering?

The discipline of designing, building, and operating the infrastructure that wraps around an AI model to make it reliable in production. The harness supplies everything the model does not do on its own — tool execution, memory, state management, control flow, verification, recovery, and observability. The model decides; the harness governs how that decision gets executed and what happens when it fails.

What is the difference between harness engineering and prompt engineering?

Prompt engineering optimises a single model call. Harness engineering designs the full execution loop across every call the agent makes — state, tools, verification, recovery, observability. In production, harness-level changes account for the large majority of agent reliability gains. Prompt refinement beyond a reasonable baseline accounts for a small fraction.

What is the difference between a harness and a workflow?

A workflow routes prompts from node to node. A harness governs the agent's full lifecycle — what it believes, what it is allowed to do, how it verifies its own outputs, how it recovers from failures, and what it learns for next time. Workflows suit deterministic linear tasks. Harnesses suit agents operating in environments where state, failure, and uncertainty are real.

What tools are available for harness engineering?

Its Harness is the only open-source visual tool built specifically for harness engineering. It provides a visual canvas with 27 node types, compiles to LangGraph, CrewAI, Mastra, or Microsoft Agent Framework via FlowSpec, and includes built-in Langfuse observability, HITL controls, and REST/MCP/A2A deployment. Apache 2.0, runs locally via Docker.

What is FlowSpec?

FlowSpec is the runtime-neutral JSON format at the centre of Its Harness. You design a harness once on the visual canvas and it compiles to a FlowSpec file. That single file then runs on LangGraph, CrewAI, Mastra, or Microsoft Agent Framework without rewriting. FlowSpec v0.2.0 is stable and open for third-party node packs (@itsharness/nodes/…).

Is harness engineering the same as MLOps?

No. MLOps is concerned with model performance over time — training pipelines, versioning, drift detection. Harness engineering is concerned with agent behaviour in real-time execution — the control flow, verification, recovery, and observability that governs what the agent does with each request. There is overlap in observability, but the disciplines address different layers of the stack.

How do I get started with harness engineering?

Clone github.com/3IVIS/itsharness, run ./scripts/setup-env.sh && docker compose up, and open the canvas on localhost:3000. Five ready-made harnesses are included to fork and build on: RAG Agent (memory read + semantic search), Content Moderation + Human Review (HITL pause on high-risk items), Parallel Risk Assessment (three specialist agents, fan-out/merge), Research Crew (multi-agent with tool approval), and Debate Agent (multi-agent debate exposed as an A2A agent). No cloud account required.

The open-source tool
for harness engineering.

Visual canvas, 27 node types, 4 framework adapters, built-in Langfuse observability, HITL controls, REST/MCP/A2A deployment, and the full 11-layer harness architecture — implemented and tested (379 tests). Apache 2.0. Runs locally via Docker.