The 98% Problem: A Survey of Harness Engineering for AI Agents

Below the Model

Frontier models converged between 2023 and 2026. For most production tasks, swapping one top model family for another no longer changes the outcome much. The systems that win differ a layer down: in how they call the model, what context they feed it, which tools they expose, where execution happens, and how they measure outcomes. Practitioners call that layer the harness.

98.4%

of Claude Code’s codebase is harness, not model: the context, tools, permissions, sandboxing, and recovery wrapped around the ~1.6% that is actual AI decision logic^[17]. This is the 98% problem.

93%

of permission prompts get approved, which makes routine prompts unreliable as a safety control^[17].

54

built-in tools, max, in Claude Code’s curated tool surface (19 always on)^[17].

The first number gives this paper its title. A community dissection of Claude Code, the most documented production agent, estimates that about 1.6% of the code decides what the model does; the rest assembles context, dispatches tools, checks permissions, sandboxes execution, persists state, and recovers from failure^[17]. Lines of code measure where the engineering lives, not where the capability lives, so treat the ratio as an order-of-magnitude claim. It still names a real condition: the 98% problem. The layer that decides agent quality is the one nobody benchmarks, few teams staff, and every project rebuilds from scratch.

Definition. Harness engineering is the design and operation of the control, execution, safety, evaluation, and training infrastructure that turns one or more models into a dependable agentic system. Prompts are one input to one call; the harness governs every call across a task.

The discipline now has primary literature. Anthropic published four engineering guides on agent design^[3], context engineering^[4], tool design^[5], and long-running harnesses^[6]. OpenAI wrote up Codex harness practice^[2]. Böckeler published an early practitioner framework^[1], and two academic teams dissected production systems end to end^[17],[18]. Survey here means a synthesis of that primary engineering literature, not a systematic review.

One mental model organizes the field. Treat the harness as an operating system and the model as a process inside it. The OS decides what memory the process reads, which syscalls exist, which calls succeed, where execution happens, and what the process learns about the world. The pattern compresses to a rule: the model proposes, the harness disposes^[3],[17]. A design that lets the model grant its own permissions has handed it root.

One disclosure up front: parts of this survey were researched and maintained by GROOM, the system its final sections describe.

Anatomy of a Harness

The production agents with public end-to-end dissections^[17],[18] decompose into the same eight subsystems around the model: the anatomy mapped in Figure 1, the interactive map that opens this survey.

The Agent Loop

The runtime shape descends from ReAct^[7]: assemble context, call the model, gate the proposal, execute in isolation, observe, repeat until a stop condition. Figure 2 shows where the harness inserts itself.

Figure 2 · The agent loop. Five lines of pseudocode. Each arrow hides a subsystem.

Three details separate production loops from textbook ReAct. Ingress guardrails mask sensitive input before the model reads it. The gate scores every proposed action and feeds denials back as text, so a blocked action becomes steering rather than a crash^[17]. Online evaluators can stop the loop on stalls, repeated states, or cost ceilings.
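The loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any production agent: `call_model`, `gate`, and `execute` are hypothetical callables standing in for the model client, the permission gate, and the sandboxed executor, and the step budget is one arbitrary stop condition among the several a real harness would check.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Proposal:
    tool: str                 # name of the tool the model wants to call
    args: dict
    final: bool = False       # True when the model declares the task done

@dataclass
class LoopState:
    history: list = field(default_factory=list)   # (role, text) pairs
    steps: int = 0

def run_agent(task: str,
              call_model: Callable[[list], Proposal],
              gate: Callable[[Proposal], Optional[str]],
              execute: Callable[[Proposal], str],
              max_steps: int = 20) -> LoopState:
    state = LoopState(history=[("user", task)])
    while state.steps < max_steps:                 # stop condition: step budget
        state.steps += 1
        proposal = call_model(state.history)       # the model proposes
        if proposal.final:
            state.history.append(("assistant", "done"))
            break
        denial = gate(proposal)                    # the harness disposes
        if denial is not None:
            # A denial is fed back as text, so it steers the next call.
            state.history.append(("gate", f"denied {proposal.tool}: {denial}"))
            continue
        state.history.append(("tool", execute(proposal)))
    return state

# Hypothetical stand-ins for the model and the gate, for illustration only.
def fake_model(history):
    if any(role == "tool" for role, _ in history):
        return Proposal("", {}, final=True)
    if any(role == "gate" for role, _ in history):
        return Proposal("read_file", {"path": "README.md"})
    return Proposal("delete_repo", {})

def deny_destructive(p):
    return "destructive tool not allowed" if p.tool == "delete_repo" else None

state = run_agent("summarize the README", fake_model, deny_destructive,
                  lambda p: f"{p.tool} ok")
```

In this toy run the first proposal is denied, the denial lands in the history as text, and the stand-in model recovers with a read-only call on the next step.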

Two Axes of Control

Böckeler classifies every harness control on two axes: does it steer before the action or check after, and does it run as deterministic computation or as model inference^[1]. Figure 3 sorts every mechanism you will meet in this paper.

Figure 3 · Two axes of control^[1]. Mature harnesses fill all four quadrants. Guides without sensors never learn whether the rules worked. Sensors without guides repeat the same mistakes.

Context Engineering

Anthropic states the goal in one sentence: find the smallest set of high-signal tokens that maximizes the probability of the outcome you want^[4].

Context Rot

Million-token windows did not remove the limit. A model recalls in-window information worse as the window fills. The leading explanations: softmax attention spreads each token’s influence across more competitors, and training data skews short, so long-range recall is undertrained^[4]. The effect shows up reliably even where the mechanism is still debated. A fact the model read at position 400K can sit too dilute to drive behavior. Figure 4 builds the intuition.

Context-rot simulator. Attention thins as the window fills; harness patterns shrink what actually lands in it.

Raw task context: 200K tokens

Harness mitigations:

Just-in-time retrieval: load references on demand instead of preloading
Compaction: summarize older history under pressure
Structured note-taking: externalize state to files outside the window
Sub-agent fan-out: isolate exploration; return summaries only

Window actually used: 36K tokens. Estimated recall: raw 46% → harnessed 93%.

Illustrative model for intuition. Parameters are pedagogical, not measured.

Figure 4 · Context rot and its mitigations (interactive). The curve is a pedagogical model. The qualitative shape matches what the engineering sources report: recall falls as windows fill, and curation beats window size^[4],[17].

Four Patterns

Production systems combine four techniques, each against a different failure^[4],[8]:

Pattern	Mechanism	Use when
Just-in-time retrieval	Keep references in context; load content through tools on demand	Large corpora and codebases
Compaction	Summarize older history under pressure; tune recall first, precision second	Long conversations without milestones
Note-taking	Write state to files outside the window; reload on demand	Iterative work with deliverables
Sub-agent fan-out	Explore in fresh windows; return condensed summaries	Breadth-first research

Compaction tuning has an asymmetry worth naming. A summary that drops an unresolved bug costs you the task. A summary that keeps redundant tool output costs you tokens. Tune for recall, then trim^[4].

These four patterns manage rot inside the window. The knowledge an agent loads into the window rots too, on a slower clock, and no loop in this list addresses it. The paper returns to that problem with GROOM.

Layered Compaction

Claude Code generalizes the term: the LLM summary of pattern two is only the last and most expensive of five compaction layers, run in cost order before every model call, escalating only while pressure remains^[17]:

Layer	What it does	Cost
1 · Budget caps	Oversized tool results become content references	Near zero
2 · Snip	Trim the oldest history segments	Low
3 · Microcompact	Merge redundant structures, cache-aware	Medium
4 · Collapse	Show the model a compressed view; keep the full record	Medium
5 · Auto-compact	Full LLM summary of the history	One model call

Layer 4 carries the deepest design idea in this area: append-only state with projection at read time. The system never destroys transcripts. Compaction produces a view. You keep the ability to answer “why did the agent do that” weeks later, which the EU AI Act turns from a debugging convenience into an obligation^[24].
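A minimal sketch of the two ideas together, cost-ordered escalation and projection over an append-only record, might look like this. It is an illustration under assumptions, not Claude Code's code: it models only three of the five layers, counts tokens as whitespace-separated words, and replaces the LLM summary with a placeholder string. All function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    role: str
    text: str

def tokens(msgs):
    # Crude stand-in for a tokenizer: whitespace-separated words.
    return sum(len(m.text.split()) for m in msgs)

def cap_tool_results(msgs, limit=50):
    # Cheapest layer: swap oversized tool results for a content reference.
    return [Message(m.role, f"[tool result #{i} stored on disk]")
            if m.role == "tool" and len(m.text.split()) > limit else m
            for i, m in enumerate(msgs)]

def snip_oldest(msgs, keep=4):
    # Next layer: drop the oldest segments, always keeping the task message.
    return msgs if len(msgs) <= keep + 1 else [msgs[0]] + msgs[-keep:]

def summarize(msgs):
    # Most expensive layer: a real harness would spend one model call here.
    if len(msgs) < 3:
        return msgs
    note = Message("system", f"[summary of {len(msgs) - 2} earlier messages]")
    return [msgs[0], note, msgs[-1]]

LAYERS = [cap_tool_results, snip_oldest, summarize]   # cheapest first

def project(transcript, budget):
    """Apply layers in cost order while the view exceeds the budget.

    The transcript itself is never mutated; only the returned view shrinks.
    """
    view = list(transcript)
    for layer in LAYERS:
        if tokens(view) <= budget:
            break
        view = layer(view)
    return view

transcript = ([Message("user", "fix the failing test"),
               Message("tool", "line " * 200)]
              + [Message("assistant", f"step {i} note") for i in range(10)])
view = project(transcript, budget=20)
```

Here the first two layers are enough, so the summary layer never runs, and the full twelve-message transcript remains available for audit.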

Tools, Protocols, and the Action Surface

Tool Design as a Measured Variable

The most-cited failure in production reports: bloated tool sets with overlapping functions. Anthropic’s litmus test catches it early. If you cannot say which tool fits a situation, the model cannot either, and the ambiguity surfaces as wrong-tool calls at runtime^[5].

The design rules that hold up across sources^[5],[17]:

Token-efficient outputs. Every tool result lives in context for the rest of the conversation.
One tool, one job. Merge overlapping tools or sharpen the boundary.
Risk annotations. Tag readOnly, destructive, idempotent, and let the permission system route on them.
Schema enforcement at decode time. Constrained decoding makes malformed calls unrepresentable: wrong types, missing fields, invalid enums. Semantic validity, like whether the path exists or the command is safe, still belongs to the gate.
Deferred schema loading. Keep tool names visible, load full schemas on demand, and hundreds of tools stop consuming the window.
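Two of these rules, risk annotations and deferred schema loading, can be sketched together. This is a hedged illustration: the `Tool` and `Registry` types, the decision labels (`allow`, `ask_user`, `classify`), and the routing policy are all hypothetical and not taken from any specific SDK.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    read_only: bool
    destructive: bool
    idempotent: bool
    load_schema: Callable[[], dict]   # deferred: called only when the tool is selected

def route(tool: Tool) -> str:
    """Map risk annotations to a permission decision (hypothetical policy)."""
    if tool.read_only:
        return "allow"
    if tool.destructive:
        return "ask_user"
    return "allow" if tool.idempotent else "classify"

class Registry:
    def __init__(self, tools):
        self._tools = {t.name: t for t in tools}
        self._schemas = {}            # schemas loaded so far

    def visible_names(self):
        # Only names go into the prompt; full schemas stay out of the window.
        return sorted(self._tools)

    def schema(self, name):
        if name not in self._schemas:
            self._schemas[name] = self._tools[name].load_schema()
        return self._schemas[name]

    def loaded(self):
        return sorted(self._schemas)

def stub(name):
    # Placeholder schema loader; a real one would read a JSON Schema from disk.
    return lambda: {"name": name, "parameters": {}}

tools = [
    Tool("read_file", True, False, True, stub("read_file")),
    Tool("delete_branch", False, True, False, stub("delete_branch")),
    Tool("set_label", False, False, True, stub("set_label")),
    Tool("post_comment", False, False, False, stub("post_comment")),
]
registry = Registry(tools)
decisions = {t.name: route(t) for t in tools}
loaded_before = registry.loaded()
registry.schema("read_file")
loaded_after = registry.loaded()
```

The model sees four names up front, while only the schema it actually asked for has been loaded.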

Claude Code ships up to 54 built-in tools, 19 always on and 35 behind feature flags, so the surface the model sees at any moment is far smaller^[17]. Treat that as a reference point for a curated first-party surface. Past it, dispatch belongs to protocols.

Protocolization

Interoperability standardized after 2024. Two protocols matter:

Protocol	Layer	Status, mid-2026
MCP^[21]	Model to tools, resources, and prompts	De facto standard. Native in Claude Code, Cursor, OpenAI’s runtime, and the major clouds.
A2A^[22]	Agent to agent, across vendors	Emerging. Adopt when federation reaches your roadmap.

Writing bespoke plumbing for a third-party service in 2026 usually signals a design problem; use its MCP server. Core first-party tools remain hand-built, and should be. The deeper consequence of protocolization: reasoning, tool transport, and federation now separate into layers with independent standards.

Safety: Permissions, Isolation, Defense in Depth

Layered Control

Claude Code stacks seven independent safety layers. Any single layer can block an action^[17]:

1. Schema pre-filtering. Deny-listed tools never appear to the model. It cannot attempt what it cannot see.
2. Deny-first rules. No allow rule outranks a deny rule.
3. Permission modes. Coarse dials from read-only planning to full autonomy.
4. Safety classifier. A two-stage model judges risky actions without user prompts.
5. Sandbox isolation. Bounds the damage when the layers above fail.
6. Permission non-restoration. Session grants expire across resume; every approval re-evaluates.
7. Lifecycle hooks. External policy can deny, modify, or log any action.

Layer 1 deserves the emphasis. Restricting an agent means constructing it with a smaller tool schema. Prompting it to behave leaves the capability in place.
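The first two layers are small enough to sketch directly. The example below is illustrative only: the `tool:command` action strings and glob-style rule patterns are an assumed format, not Claude Code's actual rule syntax.

```python
from fnmatch import fnmatchcase

def visible_tools(all_tools, deny_list):
    # Schema pre-filtering: deny-listed tools are removed before the
    # tool schema is ever shown to the model.
    return [t for t in all_tools if t not in deny_list]

def decide(action, allow_rules, deny_rules):
    # Deny-first: deny patterns are checked before allow patterns,
    # so a broad allow rule can never outrank a narrower deny rule.
    if any(fnmatchcase(action, pat) for pat in deny_rules):
        return "deny"
    if any(fnmatchcase(action, pat) for pat in allow_rules):
        return "allow"
    return "ask"          # no rule matched: fall through to the next layer

allow = ["bash:git *"]
deny = ["bash:git push --force*"]

schema = visible_tools(["read_file", "bash", "web_fetch"], deny_list={"web_fetch"})
forced = decide("bash:git push --force origin main", allow, deny)
status = decide("bash:git status", allow, deny)
other = decide("bash:curl example.com", allow, deny)
```

The force push matches the broad allow rule too, but the deny rule is evaluated first and wins.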

Approval Fatigue

Users approve 93% of permission prompts, per Anthropic auto-mode telemetry reported in the Claude Code study^[17]. At that rate a prompt works as a habit, and a habit catches nothing. The fix runs in two directions: prompt only for consequential actions, and add checks that work without user attention. The same telemetry shows trust widening with track record: auto-approval grows from roughly 20% of actions in young installations to above 40% after several hundred sessions^[17].

Sandboxing Tiers

A model with shell access can generate any command. Isolation tiers bound what the worst command can do:

Tier	Boundary	Pick when
Process	None	You wrote and trust the code
OS primitives	seccomp, Landlock, Seatbelt filters	Local-first harnesses confining their own shell
Container	Namespaces, shared kernel	Low-risk tools; kernel escapes stay a threat
gVisor	User-space kernel	Shrinks host-kernel exposure without removing it; costs syscall compatibility and I/O
Firecracker^[20]	MicroVM, boots under 125 ms	Hypervisor isolation per tool call
Managed sandbox	Hosted VM service	You want the isolation without the platform work

Inside any tier: least privilege, ephemeral credentials, explicit egress control.

Memory and Long-Horizon Operation

Three-Tier Memory

Production systems settled on a three-tier model that descends from MemGPT’s OS analogy^[8]:

Tier	Relation to the window	Holds
Core	Always in context	Identity, the current task
Archival	Queried on demand	Embeddings, indexes
Recall	Loaded by trigger	Structured notes, fact stores

Production post-mortems repeat one lesson: memory fails at retrieval, since the fact sat in storage and never reached the window at the right moment. Architect retrieval first. And when someone says “the agent learned X,” ask which layer learned it: model weights almost never, harness configuration sometimes, contextual memory in most cases.

The Shift-Handoff Pattern

Work that spans many context windows needs a handoff protocol, since each new session starts blank. Anthropic models the pattern on human shift changes^[6]. An initializer session produces durable artifacts: a granular feature list with pass flags, a progress log, an environment bootstrap script, a baseline commit, and a conventions file. Every later session follows the same steps:

Read the progress log and the git history.
Run the test suite before touching anything.
Pick the highest-priority open item. Do that one item.
Flip its pass flag, commit, update the log.

One constraint does outsized work: the agent may flip a feature’s pass bit and may never delete a feature. An append-only work list with one mutable bit per item resists the failure where a stuck agent shrinks the scope to claim completion.
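A harness can enforce that constraint mechanically by comparing the feature list before and after each session. The sketch below assumes a hypothetical item shape (`id`, `desc`, `priority`, `passes`); the source describes the pattern, not this file format.

```python
def violates_append_only(before, after):
    """Return a reason if `after` deletes or rewrites an item, else None."""
    if len(after) < len(before):
        return "feature deleted"
    for old, new in zip(before, after):
        # Every field except the pass bit must be unchanged.
        frozen_old = {k: v for k, v in old.items() if k != "passes"}
        frozen_new = {k: v for k, v in new.items() if k != "passes"}
        if frozen_old != frozen_new:
            return f"feature {old['id']} rewritten"
    return None

def next_open(items):
    # Highest-priority open item (lowest priority number), or None when done.
    open_items = [i for i in items if not i["passes"]]
    return min(open_items, key=lambda i: i["priority"], default=None)

before = [
    {"id": 0, "desc": "login works", "priority": 1, "passes": False},
    {"id": 1, "desc": "logout works", "priority": 2, "passes": False},
]
flipped = [dict(before[0], passes=True), dict(before[1])]
shrunk = before[:1]
reworded = [dict(before[0]), dict(before[1], desc="logout button exists")]

results = [violates_append_only(before, x) for x in (flipped, shrunk, reworded)]
```

Flipping a pass bit is accepted; dropping an item or softening its description is rejected, which is exactly the scope-shrinking move the pattern is meant to block.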

Orchestration: Sub-Agents

One context window cannot hold every problem. The decomposition that survived production is supervisor and workers^[17],[18]: a supervisor owns the plan and the right to commit changes, while workers explore sub-tasks in their own clean windows and send back condensed summaries. Figure 5 shows the topology.

Figure 5 · Supervisor and workers. Each worker gets a restricted tool schema and a clean window. Summaries flow up; full histories land on disk.

The rule that makes this scale is the summary-only return. A worker that burns 30K tokens exploring returns 1.5K tokens of findings. Skip the rule and every worker’s exploration lands in the supervisor’s window, which multiplies the context-rot problem the workers were meant to solve. Full histories persist in sidechain transcripts, auditable on disk and invisible to the parent at runtime^[17].

Two implementation lessons from the terminal-agent rebuild^[18]. Build one concrete agent class and vary it by construction parameters: allowed tools, prompt, model. Class hierarchies hit the diamond problem the first time a worker needs mixed capabilities. And construct prompts and schemas before the first call, because lazy construction races tool discovery.
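The supervisor-and-workers shape, with one concrete agent class and a summary-only return, can be sketched as follows. This is an assumption-laden toy: the `Agent` fields, the model names, and the in-memory `sidechain` dictionary (standing in for transcripts on disk) are hypothetical, and `run` fakes the exploration loop.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """One concrete class; roles differ only by construction parameters."""
    name: str
    model: str
    system_prompt: str
    allowed_tools: frozenset
    window: list = field(default_factory=list)   # this agent's own context

    def run(self, task, sidechain):
        # Stand-in for a real exploration loop inside a clean window.
        steps = [f"{self.name} used {tool} on {task!r}"
                 for tool in sorted(self.allowed_tools)]
        self.window.extend(steps)
        sidechain[self.name] = list(self.window)   # full history, off the parent
        return f"{self.name}: {len(steps)} steps on {task!r}"   # summary only

supervisor = Agent("supervisor", "model-large", "Own the plan.",
                   frozenset({"commit"}))
workers = [
    Agent("searcher", "model-small", "Find call sites.",
          frozenset({"grep", "read_file"})),
    Agent("tester", "model-small", "Run tests.", frozenset({"run_tests"})),
]

sidechain = {}
for worker in workers:
    # Only the condensed summary enters the supervisor's window.
    supervisor.window.append(worker.run("rename parse()", sidechain))
```

Only the supervisor holds the `commit` tool, and its window grows by one line per worker regardless of how much each worker explored.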

The runtime choice that commits you is architectural. LangGraph makes control flow an explicit graph you can test and replay. Claude Code refuses a planning graph and lets the model reason freely inside a deterministic shell^[17]. Regulated workflows want the graph. Open-ended engineering benefits from the latitude. The vendor shortlist lives in Putting It into Practice, where it belongs.

Evaluation and the Benchmark Trust Crisis

Agent evaluation scores trajectories: sequences of decisions, tool calls, and intermediate states. The benchmark ecosystem grew fast, and each benchmark measures a narrow slice:

Benchmark	Measures
AgentBench^[12]	Interactive reasoning across eight environments
GAIA^[11]	General assistance with tools and browsing
WebArena^[10] + BrowserGym^[15]	Web agents on realistic sites
OSWorld^[13]	Real computer use across applications
SWE-bench^[9]	Resolving real GitHub issues
τ-bench^[14]	Tool use under policy, with the pass^k reliability metric

τ-bench’s pass^k deserves adoption everywhere: all of k attempts must succeed, the strict counterpart of the familiar pass@k, where one success in k is enough. pass@1 reports an average, so it cannot tell an agent that solves every task most of the time from one that always fails a fixed slice. pass^k separates them, and on τ-bench it falls steeply with k while pass@1 holds^[14].
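A short worked example makes the difference concrete. The estimators below follow the standard combinatorial forms (c successes observed in n trials per task, averaged over tasks); the two agents and their success counts are invented for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    # Probability that at least one of k sampled trials succeeds.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    # Probability that all k sampled trials succeed (pass^k).
    return comb(c, k) / comb(n, k)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

N = 10  # trials per task

# Agent A: solves every task 8 times out of 10.
agent_a = [8] * 10
# Agent B: always solves 8 tasks, always fails the other 2.
agent_b = [10] * 8 + [0] * 2

a_pass1 = mean(pass_at_k(N, c, 1) for c in agent_a)
b_pass1 = mean(pass_at_k(N, c, 1) for c in agent_b)
a_pass4 = mean(pass_hat_k(N, c, 4) for c in agent_a)
b_pass4 = mean(pass_hat_k(N, c, 4) for c in agent_b)
```

Both agents score 0.8 on pass@1. At k = 4, agent A drops to one third while agent B stays at 0.8, so the metric exposes which agent is flaky everywhere and which has a fixed blind spot.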

Then the correction arrived. Xue et al. re-ran published web agents under live conditions and the reported gains shrank^[16]. Rule-based evaluators misjudge success in both directions, and audits keep pulling exploitable flaws out of the benchmarks everyone cites. Teams now run portfolio evaluation plus adversarial audits of the evaluators themselves. Operationally that means trace-native development: curated suites, production traces, and simulators feeding the same evaluators offline and online. The cheapest version is golden-trajectory replay: save full traces, replay identical prompts after every harness change, diff the trajectories.
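Golden-trajectory replay needs little more than a saved trace and a diff. The sketch below assumes each step has been serialized to one string per tool call, which is a hypothetical trace format chosen for illustration.

```python
from difflib import unified_diff

def first_divergence(golden, replay):
    """Index of the first step where two trajectories differ, or None."""
    for i, (g, r) in enumerate(zip(golden, replay)):
        if g != r:
            return i
    if len(golden) != len(replay):
        return min(len(golden), len(replay))   # one trace is a prefix of the other
    return None

def render_diff(golden, replay):
    # Human-readable diff for the review that follows a harness change.
    return list(unified_diff(golden, replay, "golden", "replay", lineterm=""))

golden = ["read_file path=app.py", "run_tests",
          "edit_file path=app.py", "run_tests"]
replay = ["read_file path=app.py", "edit_file path=app.py", "run_tests"]

step = first_divergence(golden, replay)
diff = render_diff(golden, replay)
```

In this example the replayed agent skipped the initial test run, and the diff pinpoints that step without anyone reading the full trace.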

Failure Modes

The most-reported killers in the production sources are no longer hallucinations. They are control failures: the agent did something it should have skipped, skipped something it owed you, or succeeded at a cost nobody approved. Each row of the catalog below is an empty quadrant in Figure 3, a guide without a sensor or a sensor without a guide. Distilled from the production sources^[1],[5],[17],[18]:

Failure	You see	Fix
Prompt injection	Agent obeys a malicious page or tool output	Execution rails, provenance tags, approval for writes^[23]
Tool overload	Wrong-tool calls, inflated latency	Fewer tools, sharper boundaries^[5]
Context poisoning	Stale notes drive current actions	Recall-first compaction, bounded scratchpads
Retry storms	Cost spikes with zero progress	Step budgets, dead-end detection
Unsafe execution	Host mutation, data exfiltration	Isolation tiers^[20]
Reward hacking	Benchmark gains without capability	Separate evaluator process, exploit suites
Coordination failure	Deadlocks, duplicated work	Role contracts, single-writer state
Silent failure	Regressions ship unnoticed	Loud failure modes, post-action audits^[1]

When you stand up the safety side from scratch, order the work like this:

1. Remove unnecessary tools. The cheapest safety gain is a smaller action space.
2. Type what remains. Constrained decoding rejects a malformed call before anything runs.
3. Gate irreversible actions. Mutations, external messages, money, production.
4. Isolate side effects. Assume any command; bound what it can touch.
5. Trace everything. You cannot debug what you did not record.
6. Evaluate trajectories, since per-turn checks miss loops, drift, and gaming.
7. Red-team the evaluator. Benchmarks attract exploits like any other surface.
8. Then consider RL. Training amplifies whatever the first seven steps missed.

Each step costs less and returns more than the next. Teams run the list backwards more often than forwards, and the wreckage funds the evaluation literature.

The Training Frontier

Research moved from scoring outcomes to scoring process: reward models over trajectory steps, verifier-guided search, reinforcement learning on long-horizon interactions. For practitioners one sequencing rule matters. Implement the utility in the harness first, as budgets, gates, and verifiers. Train against it second, once the signal is trustworthy. RL-trained tool agents learn to exploit flawed evaluators, and their traces show the exploit reasoning in plain text.

The result we find most consequential closes the loop on the harness itself. Self-Harness^[19] has agents improve their own harnesses: mine execution traces for failure patterns, propose a minimal modification, accept only what passes regression tests. No weight updates.

Model	Terminal-Bench-2.0, before	After self-harnessing
MiniMax M2.5	40.5%	61.9%
Qwen3.5-35B	23.8%	38.1%
GLM-5	42.9%	57.1%

Two implications follow if the result replicates. Harnesses are model-specific artifacts, since each model fails its own way. And the audit for expired scaffolding, the workarounds that encode last year’s model limits, can run automatically instead of on a calendar^[19].

GROOM: A Self-Maintaining Knowledge Harness

Self-Harness closes the maintenance loop on the harness itself: its tools, its prompts, its policies. The knowledge that harness runs on gets no such loop. It rots on a slower, quieter clock. Internal wikis, convention docs, research digests are written once, consulted forever, and revised never, each one decaying while the agents that load it degrade without complaint.

We did not theorize this failure. We lived it: writing this survey rotted our own corpus out from under us. So we built the loop that was missing, and we are open-sourcing it^[25]. GROOM, for Gated Refresh of Organizational Memory, is a markdown knowledge base whose consumption mechanism doubles as its maintenance trigger.

The name states the design. The organizational memory is the corpus an agent grounds on: the institution’s knowledge, the part that lives outside the model’s weights and that nobody owns. The refresh is the maintenance, triggered by consultation itself: reading the corpus is what keeps it current, so it stays fresh in proportion to how often it is used and no one has to remember to tend it. The refresh is gated because letting an agent edit a live corpus is dangerous: every change must clear a git checkpoint and a deterministic validator before it lands, or it is rolled back. A groom keeps the harness in working order; the name is also the job description.

Figure 6 · GROOM message sequence. (a) A read activates the skill; the launcher checks its gates and returns within 100 ms while a detached GROOM agent reads the wiki, applies one op, and journals the run. (b) The four core maintenance operations as standing orders. The read is never blocked.

Design

The corpus is a wiki in the plainest sense: a folder of cross-linked markdown pages, one topic each, that an agent reads on demand rather than loading whole. It applies this survey’s findings to its own substrate. Each page’s frontmatter carries a one-line summary that agents read to decide whether to load the page, an updated date, and a confidence grade (established, emerging, contested) that tells a consuming agent how hard to lean on a claim.

Each maintenance run performs one operation, and each operation is a focused agent task. The table lists the four core operations; a fifth, iterate, rewrites the single weakest page, and the composite all chains them:

Op	Does	Invariant
lint	Repairs frontmatter, links, style	Never changes knowledge
prune	Cuts duplication, merges overlap	Net lines must go down
expand	Ingests vendor and spec changes	3 to 6 files per run
research	Ingests recent arXiv work	Citation-gated; zero additions is a valid outcome

The registry is the filesystem: every prompt file is an op, so extending the pipeline means dropping a markdown file. The maintenance agent runs inside the discipline this paper describes: schema-scoped tools, prefix-locked shell access, a system-prompt fence around the wiki directory, an append-only journal recording model and cost per run. And the keystone — every run is wrapped in a git checkpoint behind a deterministic, token-free validator. An edit commits only if it reports success, passes structural checks (frontmatter, links, index reachability) and fact-level canaries (load-bearing facts that must survive), satisfies its postcondition, and touched nothing outside the wiki; otherwise the tree resets to the checkpoint. The critic is code, not the generator. Across nine injected fault classes this gate recovers the corpus byte-identically where an ungated maintainer corrupts it every time; structural checks alone miss the semantic losses the canaries catch.
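The shape of that gate can be sketched without git. The example below is not GROOM's code: it uses an in-memory dictionary as the corpus, a deep copy as a stand-in for the git checkpoint, and a hypothetical canary format of (page, required substring). It covers only three of the checks named above: frontmatter, link resolution, and canaries.

```python
import copy
import re

def validate(corpus, canaries):
    """Deterministic, token-free checks over a {path: text} corpus."""
    errors = []
    for path, text in corpus.items():
        if not text.startswith("---\n"):
            errors.append(f"{path}: missing frontmatter")
        for target in re.findall(r"\]\(([^)#]+\.md)\)", text):
            if target not in corpus:
                errors.append(f"{path}: broken link to {target}")
    for path, fact in canaries:
        if fact not in corpus.get(path, ""):
            errors.append(f"{path}: canary lost: {fact!r}")
    return errors

def gated_edit(corpus, edit, canaries):
    checkpoint = copy.deepcopy(corpus)   # stand-in for a git checkpoint
    candidate = copy.deepcopy(corpus)
    edit(candidate)                      # the maintainer edits a working copy
    errors = validate(candidate, canaries)
    # Commit only a clean candidate; otherwise reset to the checkpoint.
    return (candidate, []) if not errors else (checkpoint, errors)

corpus = {
    "index.md": "---\nsummary: index\n---\nSee [sandboxing](sandboxing.md).\n",
    "sandboxing.md": "---\nsummary: tiers\n---\nFirecracker boots under 125 ms.\n",
}
canaries = [("sandboxing.md", "125 ms")]

def good_edit(c):
    c["sandboxing.md"] += "gVisor is a user-space kernel.\n"

def lossy_edit(c):
    # Structurally valid, but silently drops a load-bearing fact.
    c["sandboxing.md"] = "---\nsummary: tiers\n---\nFirecracker is a microVM.\n"

kept, kept_errors = gated_edit(corpus, good_edit, canaries)
rolled_back, lossy_errors = gated_edit(corpus, lossy_edit, canaries)
```

The lossy edit passes every structural check and fails only on the canary, which is the case structural validation alone would miss.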

Consumption as the Maintenance Trigger

GROOM installs as a skill, a lazy-loaded instruction file that costs about one line of context until a relevant question arrives. On activation, the skill fires a launcher. The launcher checks a config toggle and a debounce stamp, then exits in under 100 ms. An atomic mkdir claim serializes simultaneous triggers to exactly one run — the debounce stamp alone is a check-then-write race, a bug we found by profiling (it held only 28–59% of the time under an 8-way race) and fixed. Eligible triggers spawn a detached maintenance agent in the background. The conversation that fired the refresh never waits. The next one reads a fresher wiki. Figure 7 animates the cycle, including both skip paths.
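The atomic claim is worth seeing in code, because the difference from a check-then-write stamp is a single system call. This sketch uses threads to simulate simultaneous triggers and a hypothetical lock path; it illustrates the technique rather than reproducing GROOM's launcher.

```python
import os
import tempfile
import threading

def try_claim(lock_dir):
    """Atomically claim the run: mkdir either creates the directory or fails."""
    try:
        os.mkdir(lock_dir)
        return True
    except FileExistsError:
        return False          # another trigger already holds the claim

def race(lock_dir, n=8):
    barrier = threading.Barrier(n)    # release all triggers at the same moment
    winners = []

    def trigger():
        barrier.wait()
        if try_claim(lock_dir):
            winners.append(threading.get_ident())

    threads = [threading.Thread(target=trigger) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(winners)

with tempfile.TemporaryDirectory() as root:
    wins = race(os.path.join(root, "groom.lock"))
```

There is no gap between checking and claiming, so exactly one of the eight triggers wins. A stamp file that is read first and written second leaves such a gap.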

Relation to Self-Harness

The kinship with Self-Harness^[19] is a shared design rule, not an equivalence. Self-Harness mines failure traces and validates against task benchmarks; GROOM triggers on consumption and validates against structural invariants. The common move is putting the maintenance loop inside the artifact it maintains, gated by checks the artifact defines. Whether structural gates suffice for knowledge correctness is open. We do show the stakes are real: injecting staleness into the corpus collapses a consuming agent’s answer accuracy on the affected facts from 100% to 0% while untouched controls hold — but isolating the benefit of consumption-triggered over scheduled maintenance remains future work.

The closest artifact is the now-popular pattern of an LLM-maintained markdown wiki^[30]: GROOM shares that substrate but adds what it omits for safe autonomous operation, namely a checkpoint gate around every edit, fact-level canaries, and consumption as the trigger.

Does Grooming Generalize?

The maintenance evaluation runs on one corpus, so we tested whether the benefit of a groomed corpus holds elsewhere, and connected it to standard information retrieval. GROOM is retrieval-agnostic — it maintains clean markdown, and any retriever consults it — so we measured retrieval quality directly across three unrelated agent knowledge bases (an internal API/SDK reference, a cloud deploy and SRE runbook, and a SaaS support KB) with two pluggable retrievers, BM25 and a dense neural model. Each corpus was scored in two states: groomed (clean), and degraded with the structural entropy grooming removes — duplicate and near-duplicate pages, boilerplate.

Across the domain-by-retriever cells, grooming delivers a 45–51% relative gain in recall@1, and the effect is retriever-agnostic rather than an artifact of lexical matching. Macro-averaged across the three domains, grooming lifts recall@1 from 0.52 to 0.78 for BM25 and from 0.56 to 0.81 for the dense retriever, with matching gains in MRR and nDCG@5, in every cell. A groomed corpus is also about 40% smaller, which is what the “load the whole corpus if it fits the window” strategy pays for. The honest framing: this measures the corpus-state difference grooming targets, clean versus noisy, not GROOM’s operations producing the cleanup live — that experiment is future work.
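The metrics and the relative-gain arithmetic are simple to reproduce. In the sketch below, the ranking data is a toy example invented to show how a duplicate page costs recall@1; only the macro-averaged scores passed to `relative_gain` come from the text. Each query is assumed to have exactly one relevant page.

```python
def recall_at_k(ranked, relevant, k=1):
    # Fraction of queries whose relevant page appears in the top k results.
    hits = sum(1 for results, rel in zip(ranked, relevant) if rel in results[:k])
    return hits / len(relevant)

def mrr(ranked, relevant):
    # Mean reciprocal rank of the relevant page (0 when it is not retrieved).
    total = 0.0
    for results, rel in zip(ranked, relevant):
        if rel in results:
            total += 1.0 / (results.index(rel) + 1)
    return total / len(relevant)

def relative_gain(before, after):
    return (after - before) / before

relevant = ["auth.md", "deploy.md"]
# Toy rankings: in the degraded corpus a near-duplicate outranks the real page.
degraded = [["auth-copy.md", "auth.md"], ["deploy.md", "deploy-old.md"]]
groomed = [["auth.md"], ["deploy.md"]]

toy = {
    "degraded": (recall_at_k(degraded, relevant), mrr(degraded, relevant)),
    "groomed": (recall_at_k(groomed, relevant), mrr(groomed, relevant)),
}

# Macro-averaged recall@1 figures reported above.
bm25_gain = relative_gain(0.52, 0.78)    # 0.26 / 0.52
dense_gain = relative_gain(0.56, 0.81)   # 0.25 / 0.56
```

On the macro averages this works out to a relative gain of 50% for BM25 and roughly 45% for the dense retriever.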

Availability

GROOM ships with the harness-engineering wiki as its first instance, a behavior test suite and a reproducible evaluation harness that cost no agent calls, and cron examples for deployments where consultation is rare. The pipeline is content- and retrieval-agnostic: GROOM maintains clean, well-linked markdown, and how an agent consults it — progressive disclosure, full-context load, BM25, or a dense retriever — is a pluggable layer, not GROOM’s concern. Point corpus in the config (or GROOM_CORPUS) at any knowledge base to maintain it, or run an init scaffolder to bootstrap a fresh one that passes the validator from its first commit. One install maintains any knowledge base.

Code: github.com/beconfident-ai/groom
Wiki instance: github.com/beconfident-ai/groom/tree/master/wiki

The Regulatory Landscape

A model cannot be compliant. It is a stochastic component. The artifacts an auditor asks for live in the harness: trace logs, approval records, policy decisions, dataset lineage. Teams that treat compliance as an architecture input get those artifacts for the cost of design choices made on day one. Teams that retrofit pay in months. Figure 8 puts the dates on one line.

Figure 8 · EU AI Act application timeline^[24]. The broad applicability date lands just under eight weeks after this paper’s publication.

The Act’s operational duties map directly onto harness subsystems: human oversight to the permission stack, incident logging to append-only traces, technical documentation to evaluator records. NIST’s AI Risk Management Framework^[29] gives US-facing teams the matching vocabulary, and the OWASP agentic Top 10^[23] translates both into engineering threat language.

The harness must produce, as a matter of routine operation:

Immutable trace logs for every model call, tool call, and approval.
Approval records with who, when, and on what basis.
Dataset lineage covering evaluator versions and their history.
Sandbox and credential boundary documentation.
Benchmark audit trails behind any capability claim.
Operator runbooks for incidents.
PII masking at ingestion. Dashboard-level masking runs after the model already read the data.

Putting It into Practice

Adoption order matters more than framework choice. The Failure Modes list ordered the safety build; these phases order the whole program. Each is a precondition for the next:

Phase	Build	Exit test
1 · Foundation	One runtime, MCP for tools, one trace store, one sandbox tier	An agent finishes a real task with traces you can replay
2 · Hardening	Budgets, approval gates, trace-level evaluators, a red-team pass	You can answer “what did it do and why” for any run
3 · Specialization	Internal simulators mirroring your tools and permissions	Offline scores predict production behavior
4 · Optimization	PRMs, verifier-guided search, RL on narrow domains	Gains survive the exploit suite

Frameworks

When you pick the phase-one runtime, the mid-2026 shortlist:

Runtime	Distinct strength	Fit
LangGraph^[26]	Typed state, durable execution, interrupts	You want control flow inspectable and replayable
OpenAI Agents SDK^[27]	Handoffs, guardrails, first-party tracing	You build on the Responses API
Google ADK^[28]	Multi-language, hierarchical multi-agent	Enterprise teams inside Google Cloud
Claude Agent SDK	Inherits Claude Code’s loop, permissions, hooks	Coding agents on Anthropic models

CrewAI, smolagents, and PydanticAI cover lighter use. Bedrock AgentCore and Foundry Agent Service trade low-level control for managed operation.

Seven habits compound across all four phases:

Build the eval harness before the agent.
Track tokens per task on the same dashboard as latency and success rate.
Restrict by schema, never by prompt.
Keep policy in config and hooks, out of code, so it ships without redeploys.
Label every scaffold with the model limitation it patches, and audit the labels.
Treat your benchmarks as attack surfaces.
Treat the knowledge your agents consult as a maintained artifact, not a static document. GROOM is our implementation of that habit.

Open Problems

What the discipline cannot do yet, in rough order of pain. The first two are where GROOM will be tested first:

Cross-session continuity. Shift-handoff scaffolding carries notes across sessions. Nobody has shown multi-week collaboration that accumulates skill.
Silent failure. Graceful degradation hides regressions. Harnesses lack an equivalent of mutation testing: a sensor that never fires means either quality or blindness, and you cannot tell without injecting faults^[1].
Evaluator trust as an arms race. Agents trained against evaluators attack the measurement. Red-team the eval harness like any other surface^[16].
Harness portability. If harnesses are model-specific^[19], switching models costs a scaffold re-derivation, and protocol layers^[21],[22] soften this without solving it.
Capability preservation. Current harnesses optimize short-term throughput. Designing them to deepen the operator’s understanding instead of eroding it remains the most underrated question on this list.

Conclusion

Production AI in 2026 wins or loses in the harness. The discipline converged on a recognizable shape: a thin reactive loop surrounded by deep infrastructure for context, tools, safety, memory, and measurement, governed by the rule that the model proposes and the harness disposes. The 98% is the product, not the overhead.

The frontier is reflexive: harnesses now measure, maintain, and improve themselves, and GROOM contributes that loop at the knowledge layer. The wiki beneath this survey is one GROOM keeps current.

GROOM is open source. Point it at your own knowledge base (github.com/beconfident-ai/groom), or read the wiki it maintains. We’re building the rest of this at BeConfident Labs.

Author Contributions Statement

Gui Dávid (Head of AI, BeConfident Labs): conception, survey research and synthesis, GROOM design and implementation, visualizations, writing.

BeConfident Labs: the production agent infrastructure that motivated and stress-tested these patterns. Agentic tooling assisted with the interactive figures, working inside the kind of harness this paper surveys. A human verified every claim and reference against primary sources.

References

Böckeler, B. (2026). Harness Engineering for Coding Agent Users. martinfowler.com. martinfowler.com/articles/harness-engineering.html
OpenAI (2026). Harness engineering: leveraging Codex in an agent-first world. openai.com/index/harness-engineering/
Anthropic (2024). Building Effective Agents. www.anthropic.com/engineering/building-effective-agents
Anthropic (2025). Effective context engineering for AI agents. www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Anthropic (2025). Writing effective tools for agents — with agents. www.anthropic.com/engineering/writing-tools-for-agents
Anthropic (2025). Effective harnesses for long-running agents. www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. arxiv.org/abs/2210.03629
Packer, C., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. arxiv.org/abs/2310.08560
Jimenez, C. E., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770. arxiv.org/abs/2310.06770
Zhou, S., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854. arxiv.org/abs/2307.13854
Mialon, G., et al. (2023). GAIA: a benchmark for General AI Assistants. arXiv:2311.12983. arxiv.org/abs/2311.12983
Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688. arxiv.org/abs/2308.03688
Xie, T., et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972. arxiv.org/abs/2404.07972
Yao, S., et al. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045. arxiv.org/abs/2406.12045
Le Sellier De Chezelles, T., et al. (2024). The BrowserGym Ecosystem for Web Agent Research. arXiv:2412.05467. arxiv.org/abs/2412.05467
Xue, T., et al. (2025). An Illusion of Progress? Assessing the Current State of Web Agents. arXiv:2504.01382. arxiv.org/abs/2504.01382
Liu, J., et al. (2026). Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems. arXiv:2604.14228. arxiv.org/abs/2604.14228
Bui, N. D. Q., et al. (2026). Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned. arXiv:2603.05344. arxiv.org/abs/2603.05344
Zhang, H., et al. (2026). Self-Harness: Harnesses That Improve Themselves. arXiv:2606.09498. arxiv.org/abs/2606.09498
Agache, A., et al. (2020). Firecracker: Lightweight Virtualization for Serverless Applications. NSDI ’20. www.usenix.org/conference/nsdi20/presentation/agache
Model Context Protocol. Specification and documentation. modelcontextprotocol.io
A2A Protocol. Agent-to-agent interoperability specification. a2a-protocol.org
OWASP GenAI Security Project. Top 10 for Agentic Applications. genai.owasp.org
European Commission. The EU Artificial Intelligence Act — regulatory framework and application timeline. digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai
BeConfident Labs (2026). GROOM: a self-maintaining knowledge harness. github.com/beconfident-ai/groom
LangChain. LangGraph: agent orchestration framework. Documentation. www.langchain.com/langgraph
OpenAI. Agents SDK for Python. Documentation. openai.github.io/openai-agents-python/
Google. Agent Development Kit (ADK). Documentation. adk.dev
NIST (2023). AI Risk Management Framework 1.0, with the Generative AI Profile (2024). www.nist.gov/itl/ai-risk-management-framework
Karpathy, A. (2026). LLM Wiki: an LLM-maintained markdown knowledge base. Public gist, April 2026. gist.github.com/karpathy/442a6bf555914893e9891c11519de94f