Below the Model

Frontier models converged between 2023 and 2026. For most production tasks, swapping one top model family for another no longer changes the outcome much. The systems that win differ a layer down: in how they call the model, what context they feed it, which tools they expose, where execution happens, and how they measure outcomes. Practitioners call that layer the harness.

98.4% of Claude Code’s codebase is harness, not model: the context, tools, permissions, sandboxing, and recovery wrapped around the ~1.6% that is actual AI decision logic[17]. This is the 98% problem.
93% of permission prompts get approved, which makes routine prompts unreliable as a safety control[17].
54 built-in tools, at most, in Claude Code’s curated tool surface (19 always on)[17].

The first number gives this paper its title. A community dissection of Claude Code, the most documented production agent, estimates that about 1.6% of the code decides what the model does; the rest assembles context, dispatches tools, checks permissions, sandboxes execution, persists state, and recovers from failure[17]. Lines of code measure where the engineering lives, not where the capability lives, so treat the ratio as an order-of-magnitude claim. It still names a real condition: the 98% problem. The layer that decides agent quality is the one nobody benchmarks, few teams staff, and every project rebuilds from scratch.

Definition. Harness engineering is the design and operation of the control, execution, safety, evaluation, and training infrastructure that turns one or more models into a dependable agentic system. Prompts are one input to one call; the harness governs every call across a task.

The discipline now has primary literature. Anthropic published four engineering guides on agent design[3], context engineering[4], tool design[5], and long-running harnesses[6]. OpenAI wrote up Codex harness practice[2]. Böckeler published an early practitioner framework[1], and two academic teams dissected production systems end to end[17],[18]. Survey here means a synthesis of that primary engineering literature, not a systematic review.

One mental model organizes the field. Treat the harness as an operating system and the model as a process inside it. The OS decides what memory the process reads, which syscalls exist, which calls succeed, where execution happens, and what the process learns about the world. The pattern compresses to a rule: the model proposes, the harness disposes[3],[17]. A design that lets the model grant its own permissions has handed it root.

One disclosure up front: parts of this survey were researched and maintained by GROOM, the system its final sections describe.

Anatomy of a Harness

We decompose the production agents with public end-to-end dissections[17],[18] into the same eight subsystems around the model. Figure 1 maps them. Click a subsystem for its job, the pattern that makes it work, and the failure you get when you skip it.

Context Engine

Curates the smallest set of high-signal tokens for each model call. Windows grew to 1M tokens, but context rot, the recall degradation that sets in as attention spreads across more tokens, did not vanish. Production systems run layered compaction: cheap trims first, full LLM summarization only under pressure, with full history preserved as an append-only record.

Key pattern: Append-only state, projection at read time. Compaction is a view, not a write.
Canonical failure: Compaction-as-truncation: chopping history destroys architectural decisions and unresolved bugs.
Figure 1 · Anatomy of a harness (interactive). Eight subsystems around a small, swappable model.

The Agent Loop

The runtime shape descends from ReAct[7]: assemble context, call the model, gate the proposal, execute in isolation, observe, repeat until a stop condition. Figure 2 shows where the harness inserts itself.

Loop stages: Assemble (compaction · retrieval) → Model call (proposes an action) → Gate (risk score · approval) → Execute (inside a sandbox) → Observe (append to trace). On denial, the reason returns as text and the model re-plans. Loop until: done · budget exhausted · evaluator veto · abort.
Figure 2 · The agent loop. Five lines of pseudocode. Each arrow hides a subsystem.

Three details separate production loops from textbook ReAct. Ingress guardrails mask sensitive input before the model reads it. The gate scores every proposed action and feeds denials back as text, so a blocked action becomes steering rather than a crash[17]. Online evaluators can stop the loop on stalls, repeated states, or cost ceilings.
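
The loop and its denial path can be sketched in a few lines. This is a minimal illustration, not Claude Code's implementation: `run_loop`, `Action`, and the three `toy_*` stubs are hypothetical names, the model is a hard-coded function, and a step budget is the only stop condition besides "done".

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    tool: str
    arg: str


@dataclass
class LoopResult:
    status: str                                  # "done" or "budget exhausted"
    trace: list = field(default_factory=list)


def run_loop(model: Callable[[list], Optional[Action]],
             gate: Callable[[Action], Optional[str]],
             execute: Callable[[Action], str],
             max_steps: int = 8) -> LoopResult:
    trace: list = []                             # append-only record of every step
    for _ in range(max_steps):
        context = list(trace)                    # assemble (a real harness compacts here)
        action = model(context)                  # the model proposes
        if action is None:                       # the model signals it is finished
            return LoopResult("done", trace)
        denial = gate(action)                    # the harness disposes
        if denial is not None:
            # The denial returns as text so the model can re-plan.
            trace.append(("denied", action.tool, denial))
            continue
        trace.append(("observed", action.tool, execute(action)))
    return LoopResult("budget exhausted", trace)


def toy_model(context):
    # Hypothetical stand-in for a model call: try a risky command, then adapt.
    if not context:
        return Action("shell", "rm -rf build")
    if context[-1][0] == "denied":
        return Action("read_file", "README.md")
    return None


def toy_gate(action):
    if action.tool == "shell":
        return "destructive shell commands need approval"
    return None


def toy_execute(action):
    return f"contents of {action.arg}"


result = run_loop(toy_model, toy_gate, toy_execute)
```

In this toy run the first proposal is denied, the denial text lands in the trace, and the second proposal goes through. Nothing crashes, and the trace records both steps.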

Two Axes of Control

Böckeler classifies every harness control on two axes: does it steer before the action or check after, and does it run as deterministic computation or as model inference[1]. Figure 3 sorts every mechanism you will meet in this paper.

Quadrants:
Static guides (steer before · deterministic): schemas · allowlists · convention files · sandbox config. Cost zero at runtime.
Inferred guides (steer before · model inference): skills · retrieved examples · agentic memory. Adapts, costs tokens.
Computational sensors (check after · deterministic): tests · type checkers · linters. Cheap, drift-free.
Inferential sensors (check after · model inference): LLM-as-judge · review subagents. Covers semantics, needs calibration.
Figure 3 · Two axes of control[1]. Mature harnesses fill all four quadrants. Guides without sensors never learn whether the rules worked. Sensors without guides repeat the same mistakes.

Context Engineering

Anthropic states the goal in one sentence: find the smallest set of high-signal tokens that maximizes the probability of the outcome you want[4].

Context Rot

Million-token windows did not remove the limit. A model recalls in-window information worse as the window fills. The leading explanations: softmax attention spreads each token’s influence across more competitors, and training data skews short, so long-range recall is undertrained[4]. The effect shows up reliably even where the mechanism is still debated. A fact the model read at position 400K can sit too dilute to drive behavior. Figure 4 builds the intuition.

Context-rot simulator: attention thins as the window fills; harness patterns shrink what actually lands in it. Axes: tokens in the window (log scale, 8K to 1M) against estimated recall (0% to 100%). Sample widget state: 200K-token window; window actually used 36K tokens; estimated recall 46% raw, 93% harnessed. Illustrative model for intuition; parameters are pedagogical, not measured.
Figure 4 · Context rot and its mitigations (interactive). The curve is a pedagogical model. The qualitative shape matches what the engineering sources report: recall falls as windows fill, and curation beats window size[4],[17].

Four Patterns

Production systems combine four techniques, each against a different failure[4],[8]:

Pattern | Mechanism | Use when
Just-in-time retrieval | Keep references in context; load content through tools on demand | Large corpora and codebases
Compaction | Summarize older history under pressure; tune recall first, precision second | Long conversations without milestones
Note-taking | Write state to files outside the window; reload on demand | Iterative work with deliverables
Sub-agent fan-out | Explore in fresh windows; return condensed summaries | Breadth-first research

Compaction tuning has an asymmetry worth naming. A summary that drops an unresolved bug costs you the task. A summary that keeps redundant tool output costs you tokens. Tune for recall, then trim[4].

These four patterns manage rot inside the window. The knowledge an agent loads into the window rots too, on a slower clock, and no loop in this list addresses it. The paper returns to that problem with GROOM.

Layered Compaction

Claude Code generalizes the term: the LLM summary of pattern two is only the last and most expensive of five compaction layers, run in cost order before every model call, escalating only while pressure remains[17]:

Layer | What it does | Cost
1 · Budget caps | Oversized tool results become content references | Near zero
2 · Snip | Trim the oldest history segments | Low
3 · Microcompact | Merge redundant structures, cache-aware | Medium
4 · Collapse | Show the model a compressed view; keep the full record | Medium
5 · Auto-compact | Full LLM summary of the history | One model call

Layer 4 carries the deepest design idea in this area: append-only state with projection at read time. The system never destroys transcripts. Compaction produces a view. You keep the ability to answer “why did the agent do that” weeks later, which the EU AI Act turns from a debugging convenience into an obligation[24].
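
A small sketch makes the "view, not a write" idea concrete. It assumes a whitespace word count as a stand-in for a tokenizer and implements only two illustrative layers; `Event`, `project`, and the layer functions are hypothetical names, not Claude Code internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    role: str
    text: str


def tokens(events):
    # Crude stand-in for a tokenizer: whitespace-separated words.
    return sum(len(e.text.split()) for e in events)


def cap_tool_results(events, limit=20):
    # Layer 1: oversized tool results become short content references.
    return [
        Event(e.role, f"[tool result #{i} stored, {len(e.text.split())} words]")
        if e.role == "tool" and len(e.text.split()) > limit else e
        for i, e in enumerate(events)
    ]


def snip_oldest(events, keep=4):
    # Layer 2: drop the oldest segments from the view.
    return events[-keep:]


LAYERS = [cap_tool_results, snip_oldest]         # cheapest first


def project(record, budget):
    """Build the view for one model call; the record itself is never modified."""
    view = list(record)
    for layer in LAYERS:
        if tokens(view) <= budget:
            break                                # stop escalating once pressure is gone
        view = layer(view)
    return view


record = [
    Event("user", "fix the login bug"),
    Event("tool", " ".join(["line"] * 50)),
    Event("assistant", "the null check is missing"),
]
view = project(record, budget=30)
```

Here the first layer alone relieves the pressure, so the second never runs. The 50-word tool result is still intact in `record`, which is what keeps the "why did the agent do that" question answerable later.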

Tools, Protocols, and the Action Surface

Tool Design as a Measured Variable

The most-cited failure in production reports: bloated tool sets with overlapping functions. Anthropic’s litmus test catches it early. If you cannot say which tool fits a situation, the model cannot either, and the ambiguity surfaces as wrong-tool calls at runtime[5].

The design rules that hold up across sources[5],[17]:

  • Token-efficient outputs. Every tool result lives in context for the rest of the conversation.
  • One tool, one job. Merge overlapping tools or sharpen the boundary.
  • Risk annotations. Tag readOnly, destructive, idempotent, and let the permission system route on them.
  • Schema enforcement at decode time. Constrained decoding makes malformed calls unrepresentable: wrong types, missing fields, invalid enums. Semantic validity, like whether the path exists or the command is safe, still belongs to the gate.
  • Deferred schema loading. Keep tool names visible, load full schemas on demand, and hundreds of tools stop consuming the window.
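
Two of these rules, risk annotations and deferred schema loading, fit in a short sketch. The registry layout, the field names, and the three routing outcomes are assumptions made for illustration; the source names the annotations but not a policy for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    summary: str
    schema: dict
    read_only: bool = False
    destructive: bool = False
    idempotent: bool = False


REGISTRY = {
    "read_file": Tool("read_file", "Read a file", {"path": "string"},
                      read_only=True, idempotent=True),
    "delete_file": Tool("delete_file", "Delete a file", {"path": "string"},
                        destructive=True),
    "write_file": Tool("write_file", "Write a file",
                       {"path": "string", "content": "string"}),
}


def route(tool_name):
    """Route on annotations rather than on the tool's name (hypothetical policy)."""
    tool = REGISTRY[tool_name]
    if tool.read_only:
        return "auto-allow"
    if tool.destructive:
        return "require approval"
    return "ask once per session"


def tool_index():
    # Deferred loading: only names and one-line summaries enter the window.
    return [f"{t.name}: {t.summary}" for t in REGISTRY.values()]


def load_schema(tool_name):
    # The full schema is fetched on demand, once the model picks the tool.
    return REGISTRY[tool_name].schema
```

With this split, the per-call cost of a large tool catalog is one short line per tool, and the permission system never has to parse tool names to guess at risk.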

Claude Code ships up to 54 built-in tools, 19 always on and 35 behind feature flags, so the surface the model sees at any moment is far smaller[17]. Treat that as a reference point for a curated first-party surface. Past it, dispatch belongs to protocols.

Protocolization

Interoperability standardized after 2024. Two protocols matter:

Protocol | Layer | Status, mid-2026
MCP[21] | Model to tools, resources, and prompts | De facto standard. Native in Claude Code, Cursor, OpenAI’s runtime, and the major clouds.
A2A[22] | Agent to agent, across vendors | Emerging. Adopt when federation reaches your roadmap.

Writing bespoke plumbing for a third-party service in 2026 usually signals a design problem; use its MCP server. Core first-party tools remain hand-built, and should be. The deeper consequence of protocolization: reasoning, tool transport, and federation now separate into layers with independent standards.

Safety: Permissions, Isolation, Defense in Depth

Layered Control

Claude Code stacks seven independent safety layers. Any single layer can block an action[17]:

  1. Schema pre-filtering. Deny-listed tools never appear to the model. It cannot attempt what it cannot see.
  2. Deny-first rules. No allow rule outranks a deny rule.
  3. Permission modes. Coarse dials from read-only planning to full autonomy.
  4. Safety classifier. A two-stage model judges risky actions without user prompts.
  5. Sandbox isolation. Bounds the damage when the layers above fail.
  6. Permission non-restoration. Session grants expire across resume; every approval re-evaluates.
  7. Lifecycle hooks. External policy can deny, modify, or log any action.

Layer 1 deserves the emphasis. Restricting an agent means constructing it with a smaller tool schema. Prompting it to behave leaves the capability in place.
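
Layers 1 and 2 are simple enough to sketch. The `tool:argument` glob syntax below is an assumption made for this example, not Claude Code's rule format, and `decide` and `visible_tools` are hypothetical names.

```python
from fnmatch import fnmatchcase

# Hypothetical rule syntax: "<tool>:<argument glob>".
DENY = ["shell:rm *", "web_fetch:*"]
ALLOW = ["shell:*", "read_file:*"]


def decide(tool, arg):
    call = f"{tool}:{arg}"
    if any(fnmatchcase(call, pattern) for pattern in DENY):
        return "deny"        # deny is checked first, so no allow rule can override it
    if any(fnmatchcase(call, pattern) for pattern in ALLOW):
        return "allow"
    return "ask"


def visible_tools(all_tools):
    # Schema pre-filtering: a tool whose every call is denied is never shown.
    return [t for t in all_tools if f"{t}:*" not in DENY]
```

Note that `shell:*` in the allow list does not rescue `rm`: the deny check runs first. And `web_fetch` never reaches the model's schema at all, so there is nothing for a prompt to talk it into.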

Approval Fatigue

Users approve 93% of permission prompts, per Anthropic auto-mode telemetry reported in the Claude Code study[17]. At that rate a prompt works as a habit, and a habit catches nothing. The fix runs in two directions: prompt only for consequential actions, and add checks that work without user attention. The same telemetry shows trust widening with track record: auto-approval grows from roughly 20% of actions in young installations to above 40% after several hundred sessions[17].

Sandboxing Tiers

A model with shell access can generate any command. Isolation tiers bound what the worst command can do:

Tier | Boundary | Pick when
Process | None | You wrote and trust the code
OS primitives | seccomp, Landlock, Seatbelt filters | Local-first harnesses confining their own shell
Container | Namespaces, shared kernel | Low-risk tools; kernel escapes stay a threat
gVisor | User-space kernel | Shrinks host-kernel exposure without removing it; costs syscall compatibility and I/O
Firecracker[20] | MicroVM, boots under 125 ms | Hypervisor isolation per tool call
Managed sandbox | Hosted VM service | You want the isolation without the platform work

Inside any tier: least privilege, ephemeral credentials, explicit egress control.

Memory and Long-Horizon Operation

Three-Tier Memory

Production systems settled on a three-tier model that descends from MemGPT’s OS analogy[8]:

Tier | Relation to the window | Holds
Core | Always in context | Identity, the current task
Archival | Queried on demand | Embeddings, indexes
Recall | Loaded by trigger | Structured notes, fact stores

Production post-mortems repeat one lesson: memory fails at retrieval, since the fact sat in storage and never reached the window at the right moment. Architect retrieval first. And when someone says “the agent learned X,” ask which layer learned it: model weights almost never, harness configuration sometimes, contextual memory in most cases.

The Shift-Handoff Pattern

Work that spans many context windows needs a handoff protocol, since each new session starts blank. Anthropic models the pattern on human shift changes[6]. An initializer session produces durable artifacts: a granular feature list with pass flags, a progress log, an environment bootstrap script, a baseline commit, and a conventions file. Every later session follows the same steps:

  1. Read the progress log and the git history.
  2. Run the test suite before touching anything.
  3. Pick the highest-priority open item. Do that one item.
  4. Flip its pass flag, commit, update the log.

One constraint does outsized work: the agent may flip a feature’s pass bit and may never delete a feature. An append-only work list with one mutable bit per item resists the failure where a stuck agent shrinks the scope to claim completion.
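
The invariant is cheap to enforce in code. A minimal sketch, assuming the feature list is stored as a mapping from feature name to pass flag; `check_handoff` is a hypothetical helper, not part of Anthropic's published harness.

```python
def check_handoff(before, after):
    """Accept a session's edit to the feature list only if nothing was deleted.

    Both arguments map feature name -> pass flag (bool).
    """
    removed = sorted(set(before) - set(after))
    if removed:
        return False, f"features deleted: {removed}"
    if not all(isinstance(flag, bool) for flag in after.values()):
        return False, "pass flags must be booleans"
    return True, "ok"


before = {"login": False, "signup": False, "password reset": False}

# An honest session flips one bit.
honest = {"login": True, "signup": False, "password reset": False}

# A stuck session shrinks the scope so that everything left passes.
shrunk = {"login": True}
```

Run as a pre-commit check, this turns "claim completion by deleting the hard items" from a silent failure into a rejected commit.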

Orchestration: Sub-Agents

One context window cannot hold every problem. The decomposition that survived production is supervisor and workers[17],[18]: a supervisor owns the plan and the right to commit changes, while workers explore sub-tasks in their own clean windows and send back condensed summaries. Figure 5 shows the topology.

Topology: Supervisor (owns the plan + write authority) above three workers: Worker · research (read-only tool schema), Worker · code (edit + test tools), Worker · browser (web tools only). Sidechain transcripts hold full worker histories: auditable on disk, invisible to the parent at runtime.
Figure 5 · Supervisor and workers. Each worker gets a restricted tool schema and a clean window. Summaries flow up; full histories land on disk.

The rule that makes this scale is the summary-only return. A worker that burns 30K tokens exploring returns 1.5K tokens of findings. Skip the rule and every worker’s exploration lands in the supervisor’s window, which multiplies the context-rot problem the workers were meant to solve. Full histories persist in sidechain transcripts, auditable on disk and invisible to the parent at runtime[17].

Two implementation lessons from the terminal-agent rebuild[18]. Build one concrete agent class and vary it by construction parameters: allowed tools, prompt, model. Class hierarchies hit the diamond problem the first time a worker needs mixed capabilities. And construct prompts and schemas before the first call, because lazy construction races tool discovery.
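
Both the summary-only return and the single-agent-class lesson can be shown together. In this sketch, `AgentSpec`, `run_worker`, and `fake_explore` are hypothetical names, the model identifier is a placeholder, and taking the last transcript entry as the summary is a simplification.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AgentSpec:
    # One concrete class; workers differ only in construction parameters.
    name: str
    tools: tuple
    prompt: str
    model: str = "hypothetical-model-id"


def run_worker(spec, task, sidechain_dir, explore):
    """Run one worker and hand the caller a condensed summary only."""
    transcript = explore(spec, task)             # full exploration history
    path = Path(sidechain_dir) / f"{spec.name}.jsonl"
    with path.open("w", encoding="utf-8") as f:  # the full history lands on disk
        for step in transcript:
            f.write(json.dumps(step) + "\n")
    return {"worker": spec.name,
            "summary": transcript[-1]["text"],
            "transcript": str(path)}


def fake_explore(spec, task):
    # Stand-in for a real agent run: three exploration steps plus a summary.
    steps = [{"step": i, "text": f"looked at file {i}"} for i in range(3)]
    steps.append({"step": 3, "text": f"summary for '{task}': auth lives in login.py"})
    return steps


researcher = AgentSpec("research", ("read_file", "grep"), "Explore; do not edit.")

with tempfile.TemporaryDirectory() as sidechains:
    report = run_worker(researcher, "where is auth handled?", sidechains, fake_explore)
    steps_on_disk = len(Path(report["transcript"]).read_text(encoding="utf-8").splitlines())
```

The supervisor receives `report`, which carries a path to the transcript rather than its contents. A worker with mixed capabilities is just another `AgentSpec` with a different `tools` tuple.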

The runtime choice that commits you is architectural. LangGraph makes control flow an explicit graph you can test and replay. Claude Code refuses a planning graph and lets the model reason freely inside a deterministic shell[17]. Regulated workflows want the graph. Open-ended engineering benefits from the latitude. The vendor shortlist lives in Putting It into Practice, where it belongs.

Evaluation and the Benchmark Trust Crisis

Agent evaluation scores trajectories: sequences of decisions, tool calls, and intermediate states. The benchmark ecosystem grew fast, and each benchmark measures a narrow slice:

Benchmark | Measures
AgentBench[12] | Interactive reasoning across eight environments
GAIA[11] | General assistance with tools and browsing
WebArena[10] + BrowserGym[15] | Web agents on realistic sites
OSWorld[13] | Real computer use across applications
SWE-bench[9] | Resolving real GitHub issues
τ-bench[14] | Tool use under policy, with the pass^k reliability metric

τ-bench’s pass^k deserves adoption everywhere: all k attempts must succeed, the strict counterpart of the familiar pass@k, where a single success in k attempts is enough. pass@1 reports an average, so it cannot tell an agent that solves every task most of the time from one that always fails a fixed slice. pass^k separates them, and on τ-bench it falls steeply with k while pass@1 holds[14].
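
A worked example shows the separation. The estimator below is the combinatorial form as we understand it from the τ-bench paper[14]: for each task with c successes in n trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The two agents and their counts are invented for illustration.

```python
from math import comb


def pass_hat_k(successes, n, k):
    """Estimate pass^k: mean over tasks of C(c, k) / C(n, k)."""
    return sum(comb(c, k) for c in successes) / (len(successes) * comb(n, k))


N_TRIALS = 4
steady = [3, 3, 3, 3]    # solves every task in 3 of 4 trials
brittle = [4, 4, 4, 0]   # always solves three tasks, never the fourth

for k in (1, 2, 4):
    print(k, pass_hat_k(steady, N_TRIALS, k), pass_hat_k(brittle, N_TRIALS, k))
# k=1: 0.75 vs 0.75
# k=2: 0.5  vs 0.75
# k=4: 0.0  vs 0.75
```

At k=1 the two agents are indistinguishable. By k=4 the agent that is merely usually right scores zero, while the agent with a fixed blind spot holds at 0.75. Which profile is worse depends on the deployment, but only pass^k shows that the profiles differ.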

Then the correction arrived. Xue et al. re-ran published web agents under live conditions and the reported gains shrank[16]. Rule-based evaluators misjudge success in both directions, and audits keep pulling exploitable flaws out of the benchmarks everyone cites. Teams now run portfolio evaluation plus adversarial audits of the evaluators themselves. Operationally that means trace-native development: curated suites, production traces, and simulators feeding the same evaluators offline and online. The cheapest version is golden-trajectory replay: save full traces, replay identical prompts after every harness change, diff the trajectories.
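
Golden-trajectory replay needs little more than a diff over tool-call sequences. A minimal sketch, assuming each trace step is a dict with `tool` and `arg` keys; that trace format and the function names are assumptions made for this example.

```python
import difflib


def tool_calls(trace):
    # Reduce a trace to one comparable line per tool call.
    return [f'{step["tool"]}({step["arg"]})' for step in trace]


def diff_trajectories(golden, candidate):
    """Return only the tool calls that were dropped (-) or added (+)."""
    lines = difflib.ndiff(tool_calls(golden), tool_calls(candidate))
    return [line for line in lines if line[:2] in ("- ", "+ ")]


golden = [
    {"tool": "read_file", "arg": "app.py"},
    {"tool": "run_tests", "arg": ""},
    {"tool": "edit", "arg": "app.py"},
]
# After a harness change, the agent stops running tests before editing.
candidate = [
    {"tool": "read_file", "arg": "app.py"},
    {"tool": "edit", "arg": "app.py"},
]

print(diff_trajectories(golden, candidate))   # ['- run_tests()']
```

Both runs might still end in a passing final state, which is why an outcome-only check would miss this regression and a trajectory diff does not.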

Failure Modes

The most-reported killers in the production sources are no longer hallucinations. They are control failures: the agent did something it should have skipped, skipped something it owed you, or succeeded at a cost nobody approved. Each row of the catalog below is an empty quadrant in Figure 3, a guide without a sensor or a sensor without a guide. Distilled from the production sources[1],[5],[17],[18]:

Failure | You see | Fix
Prompt injection | Agent obeys a malicious page or tool output | Execution rails, provenance tags, approval for writes[23]
Tool overload | Wrong-tool calls, inflated latency | Fewer tools, sharper boundaries[5]
Context poisoning | Stale notes drive current actions | Recall-first compaction, bounded scratchpads
Retry storms | Cost spikes with zero progress | Step budgets, dead-end detection
Unsafe execution | Host mutation, data exfiltration | Isolation tiers[20]
Reward hacking | Benchmark gains without capability | Separate evaluator process, exploit suites
Coordination failure | Deadlocks, duplicated work | Role contracts, single-writer state
Silent failure | Regressions ship unnoticed | Loud failure modes, post-action audits[1]
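
The fix for retry storms, step budgets plus dead-end detection, is a computational sensor that fits in a few lines. A sketch under stated assumptions: `LoopMonitor` is a hypothetical name, the thresholds are arbitrary, and the "state fingerprint" is whatever string the harness chooses to summarize an observation.

```python
from collections import Counter


class LoopMonitor:
    """Online evaluator: stop on a repeated state or an exhausted step budget."""

    def __init__(self, max_steps=20, max_repeats=3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, state_fingerprint):
        self.steps += 1
        self.seen[state_fingerprint] += 1
        if self.seen[state_fingerprint] >= self.max_repeats:
            return "stop: repeated state"        # same observation again: dead end
        if self.steps >= self.max_steps:
            return "stop: step budget exhausted"
        return None


monitor = LoopMonitor(max_steps=5, max_repeats=3)
verdicts = [monitor.check("pytest failed: test_login") for _ in range(3)]
# verdicts == [None, None, "stop: repeated state"]
```

The harness calls `check` after every observation and ends the loop on the first non-`None` verdict, so a storm costs at most `max_repeats` identical steps.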

When you stand up the safety side from scratch, order the work like this:

  1. Remove unnecessary tools. The cheapest safety gain is a smaller action space.
  2. Type what remains. Constrained decoding rejects a malformed call before anything runs.
  3. Gate irreversible actions. Mutations, external messages, money, production.
  4. Isolate side effects. Assume any command; bound what it can touch.
  5. Trace everything. You cannot debug what you did not record.
  6. Evaluate trajectories, since per-turn checks miss loops, drift, and gaming.
  7. Red-team the evaluator. Benchmarks attract exploits like any other surface.
  8. Then consider RL. Training amplifies whatever the first seven steps missed.

Each step costs less and returns more than the next. Teams run the list backwards more often than forwards, and the wreckage funds the evaluation literature.

The Training Frontier

Research moved from scoring outcomes to scoring process: reward models over trajectory steps, verifier-guided search, reinforcement learning on long-horizon interactions. For practitioners one sequencing rule matters. Implement the utility in the harness first, as budgets, gates, and verifiers. Train against it second, once the signal is trustworthy. RL-trained tool agents learn to exploit flawed evaluators, and their traces show the exploit reasoning in plain text.

The result we find most consequential closes the loop on the harness itself. Self-Harness[19] has agents improve their own harnesses: mine execution traces for failure patterns, propose a minimal modification, accept only what passes regression tests. No weight updates.

Model | Terminal-Bench-2.0, before | After self-harnessing
MiniMax M2.5 | 40.5% | 61.9%
Qwen3.5-35B | 23.8% | 38.1%
GLM-5 | 42.9% | 57.1%

Two implications follow if the result replicates. Harnesses are model-specific artifacts, since each model fails its own way. And the audit for expired scaffolding, the workarounds that encode last year’s model limits, can run automatically instead of on a calendar[19].

GROOM: A Self-Maintaining Knowledge Harness

Self-Harness closes the maintenance loop on the harness itself: its tools, its prompts, its policies. The knowledge that harness runs on gets no such loop. It rots on a slower, quieter clock. Internal wikis, convention docs, research digests are written once, consulted forever, and revised never, each one decaying while the agents that load it degrade without complaint. We did not theorize this failure. We lived it: writing this survey rotted our own corpus out from under us. So we built the loop that was missing, and we are open-sourcing it[25].

GROOM, for Gated Refresh of Organizational Memory, is a markdown knowledge base whose consumption mechanism doubles as its maintenance trigger. The name states the design. The organizational memory is the corpus an agent grounds on: the institution’s knowledge, the part that lives outside the model’s weights and that nobody owns. The refresh is the maintenance, triggered by consultation itself: reading the corpus is what keeps it current, so it stays fresh in proportion to how often it is used and no one has to remember to tend it. The refresh is gated because letting an agent edit a live corpus is dangerous: every change must clear a git checkpoint and a deterministic validator before it lands, or it is rolled back. A groom keeps the harness in working order; the name is also the job description.

Figure 6 · GROOM message sequence. (a) A read activates the skill; the launcher checks its gates and returns within 100 ms while a detached GROOM agent reads the wiki, applies one op, and journals the run. (b) The four core maintenance operations as standing orders. The read is never blocked.

Design

The corpus is a wiki in the plainest sense: a folder of cross-linked markdown pages, one topic each, that an agent reads on demand rather than loading whole. It applies this survey’s findings to its own substrate. Each page’s frontmatter carries a one-line summary that agents read to decide whether to load the page, an updated date, and a confidence grade (established, emerging, contested) that tells a consuming agent how hard to lean on a claim.

Each maintenance run performs one of four core operations, each a focused agent task. A fifth op, iterate, rewrites the single weakest page, and the composite all chains them:

Op | Does | Invariant
lint | Repairs frontmatter, links, style | Never changes knowledge
prune | Cuts duplication, merges overlap | Net lines must go down
expand | Ingests vendor and spec changes | 3 to 6 files per run
research | Ingests recent arXiv work | Citation-gated; zero additions is a valid outcome

The registry is the filesystem: every prompt file is an op, so extending the pipeline means dropping a markdown file. The maintenance agent runs inside the discipline this paper describes: schema-scoped tools, prefix-locked shell access, a system-prompt fence around the wiki directory, an append-only journal recording model and cost per run. And the keystone — every run is wrapped in a git checkpoint behind a deterministic, token-free validator. An edit commits only if it reports success, passes structural checks (frontmatter, links, index reachability) and fact-level canaries (load-bearing facts that must survive), satisfies its postcondition, and touched nothing outside the wiki; otherwise the tree resets to the checkpoint. The critic is code, not the generator. Across nine injected fault classes this gate recovers the corpus byte-identically where an ungated maintainer corrupts it every time; structural checks alone miss the semantic losses the canaries catch.
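
The shape of that gate can be sketched without any model in the loop. This is not GROOM's code: a directory copy stands in for the git checkpoint, the validator covers only frontmatter fields and canary facts (no link or index checks, no postcondition, no out-of-tree check), and every function name here is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

GRADES = {"established", "emerging", "contested"}
REQUIRED = {"summary", "updated", "confidence"}


def frontmatter(text):
    """Parse a minimal `key: value` block between leading '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0] != "---" or "---" not in lines[1:]:
        return None
    end = lines.index("---", 1)
    pairs = (line.split(":", 1) for line in lines[1:end] if ":" in line)
    return {key.strip(): value.strip() for key, value in pairs}


def validate(wiki, canaries):
    """Deterministic, token-free checks: structure first, then canary facts."""
    for page in Path(wiki).glob("*.md"):
        meta = frontmatter(page.read_text(encoding="utf-8"))
        if meta is None or not REQUIRED <= meta.keys():
            return False
        if meta["confidence"] not in GRADES:
            return False
    for name, fact in canaries:
        page = Path(wiki) / name
        if not page.exists() or fact not in page.read_text(encoding="utf-8"):
            return False                          # a load-bearing fact was lost
    return True


def gated_run(wiki, edit, canaries):
    """Apply `edit`; keep the result only if it reports success and validates."""
    scratch = Path(tempfile.mkdtemp())
    checkpoint = scratch / "checkpoint"
    shutil.copytree(wiki, checkpoint)             # stand-in for a git checkpoint
    try:
        ok = bool(edit(Path(wiki))) and validate(wiki, canaries)
    except Exception:
        ok = False
    if not ok:                                    # roll back to the checkpoint
        shutil.rmtree(wiki)
        shutil.copytree(checkpoint, wiki)
    shutil.rmtree(scratch)
    return ok
```

An edit that keeps valid frontmatter but drops a canary fact passes the structural loop and fails the canary loop, which is the distinction the paragraph above draws between structural checks and fact-level canaries.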

Consumption as the Maintenance Trigger

GROOM installs as a skill, a lazy-loaded instruction file that costs about one line of context until a relevant question arrives. On activation, the skill fires a launcher. The launcher checks a config toggle and a debounce stamp, then exits in under 100 ms. An atomic mkdir claim serializes simultaneous triggers to exactly one run — the debounce stamp alone is a check-then-write race, a bug we found by profiling (it held only 28–59% of the time under an 8-way race) and fixed. Eligible triggers spawn a detached maintenance agent in the background. The conversation that fired the refresh never waits. The next one reads a fresher wiki. Figure 7 animates the cycle, including both skip paths.
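
The claim itself is a one-liner, because `mkdir` either creates the directory or fails, with no window between check and write. The sketch below reproduces an 8-way race with threads. The lock path and function names are hypothetical, and a real launcher would also release the claim when the run ends.

```python
import os
import tempfile
import threading


def try_claim(lock_dir):
    """Return True for exactly one caller: os.mkdir creates the directory or raises."""
    try:
        os.mkdir(lock_dir)
        return True
    except FileExistsError:
        return False


def race(n=8):
    """Fire n simultaneous triggers and count how many win the claim."""
    lock_dir = os.path.join(tempfile.mkdtemp(), "groom.lock")  # hypothetical lock path
    wins = []
    barrier = threading.Barrier(n)

    def trigger():
        barrier.wait()                           # release all threads together
        if try_claim(lock_dir):
            wins.append(threading.get_ident())

    threads = [threading.Thread(target=trigger) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(wins)
```

A debounce stamp read followed by a stamp write leaves a gap in which several triggers can all see "stale" before any of them writes. The directory claim closes that gap, and the stamp remains useful for rate limiting once a single winner is chosen.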

GROOM · stale-while-revalidate maintenance loop: Agent consults the wiki → Skill fires launcher → enabled? → stamp fresh? → Spawn detached agent → Maintenance op (lint · prune · expand · research) → Wiki updated + journaled → next consultation reads fresher wiki. Each gate has a skip path. The conversation never blocks; the launcher exits in under 100 ms. The read itself is the maintenance trigger.

Figure 7 · The GROOM cycle (interactive). Play the loop, or flip the gates to watch the launcher skip. The op selector cycles lint, prune, expand, research across iterations.

Relation to Self-Harness

The kinship with Self-Harness[19] is a shared design rule, not an equivalence. Self-Harness mines failure traces and validates against task benchmarks; GROOM triggers on consumption and validates against structural invariants. The common move is putting the maintenance loop inside the artifact it maintains, gated by checks the artifact defines. Whether structural gates suffice for knowledge correctness is open. We do show the stakes are real: injecting staleness into the corpus collapses a consuming agent’s answer accuracy on the affected facts from 100% to 0% while untouched controls hold — but isolating the benefit of consumption-triggered over scheduled maintenance remains future work.

The closest artifact is the now-popular pattern of an LLM-maintained markdown wiki[30]: GROOM shares that substrate but adds what it omits for safe autonomous operation, namely a checkpoint gate around every edit, fact-level canaries, and consumption as the trigger.

Does Grooming Generalize?

The maintenance evaluation runs on one corpus, so we tested whether the benefit of a groomed corpus holds elsewhere, and connected it to standard information retrieval. GROOM is retrieval-agnostic — it maintains clean markdown, and any retriever consults it — so we measured retrieval quality directly across three unrelated agent knowledge bases (an internal API/SDK reference, a cloud deploy and SRE runbook, and a SaaS support KB) with two pluggable retrievers, BM25 and a dense neural model. Each corpus was scored in two states: groomed (clean), and degraded with the structural entropy grooming removes — duplicate and near-duplicate pages, boilerplate.

Across the domain-by-retriever cells, grooming delivers a 45–51% relative gain in recall@1, and the effect is retriever-agnostic rather than an artifact of lexical matching. Macro-averaged across the three domains, grooming lifts recall@1 from 0.52 to 0.78 for BM25 and from 0.56 to 0.81 for the dense retriever, with matching gains in MRR and nDCG@5, in every cell. A groomed corpus is also about 40% smaller, which directly cuts the token cost of the “load the whole corpus if it fits the window” strategy. The honest framing: this measures the corpus-state difference grooming targets, clean versus noisy, not GROOM’s operations producing the cleanup live; that experiment is future work.
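
For readers who want the metrics pinned down, here is how recall@1, MRR, and the relative gain are computed. The two-query corpus is invented to show the mechanism (a duplicate page outranking the canonical one); only the macro-averaged figures passed to `relative_gain` come from the text above.

```python
def recall_at_1(rankings, relevant):
    # Fraction of queries whose top-ranked page is the relevant one.
    hits = sum(ranked[:1] == [gold] for ranked, gold in zip(rankings, relevant))
    return hits / len(relevant)


def mrr(rankings, relevant):
    # Mean reciprocal rank of the relevant page; 0 for a query that misses it.
    total = 0.0
    for ranked, gold in zip(rankings, relevant):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(relevant)


def relative_gain(before, after):
    return (after - before) / before


relevant = ["auth.md", "deploy.md"]
degraded = [["auth-copy.md", "auth.md"], ["deploy.md", "deploy-old.md"]]
groomed = [["auth.md"], ["deploy.md"]]

print(recall_at_1(degraded, relevant), recall_at_1(groomed, relevant))  # 0.5 1.0
print(mrr(degraded, relevant), mrr(groomed, relevant))                  # 0.75 1.0

# Macro-averaged recall@1 figures reported above.
print(round(relative_gain(0.52, 0.78), 2))   # 0.5  (BM25)
print(round(relative_gain(0.56, 0.81), 2))   # 0.45 (dense)
```

In the toy corpus the relevant page is still retrievable in the degraded state; it is just outranked by its own stale copy, which is the kind of structural entropy the degraded condition injects.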

Availability

GROOM ships with the harness-engineering wiki as its first instance, a behavior test suite and a reproducible evaluation harness that cost no agent calls, and cron examples for deployments where consultation is rare. The pipeline is content- and retrieval-agnostic: GROOM maintains clean, well-linked markdown, and how an agent consults it — progressive disclosure, full-context load, BM25, or a dense retriever — is a pluggable layer, not GROOM’s concern. Point corpus in the config (or GROOM_CORPUS) at any knowledge base to maintain it, or run an init scaffolder to bootstrap a fresh one that passes the validator from its first commit. One install maintains any knowledge base.

The Regulatory Landscape

A model cannot be compliant. It is a stochastic component. The artifacts an auditor asks for live in the harness: trace logs, approval records, policy decisions, dataset lineage. Teams that treat compliance as an architecture input get those artifacts for the cost of design choices made on day one. Teams that retrofit pay in months. Figure 8 puts the dates on one line.

Timeline: Aug 2024, enters into force · Feb 2025, prohibited practices apply · Aug 2025, GPAI model obligations · June 2026, this paper · Aug 2, 2026, broadly applicable (just under 8 weeks out) · 2027+, high-risk extensions
Figure 8 · EU AI Act application timeline[24]. The broad applicability date lands just under eight weeks after this paper’s publication.

The Act’s operational duties map directly onto harness subsystems: human oversight to the permission stack, incident logging to append-only traces, technical documentation to evaluator records. NIST’s AI Risk Management Framework[29] gives US-facing teams the matching vocabulary, and the OWASP agentic Top 10[23] translates both into engineering threat language.

The harness must produce, as a matter of routine operation:

  • Immutable trace logs for every model call, tool call, and approval.
  • Approval records with who, when, and on what basis.
  • Dataset lineage covering evaluator versions and their history.
  • Sandbox and credential boundary documentation.
  • Benchmark audit trails behind any capability claim.
  • Operator runbooks for incidents.
  • PII masking at ingestion. Dashboard-level masking runs after the model already read the data.
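
The last item is about ordering: masking has to run before the model call and before the trace write. A minimal sketch with one illustrative pattern; the email regex is deliberately simple and is not a complete PII detector, and `ingest` and the trace record shape are hypothetical.

```python
import re

# Illustrative pattern only; production masking needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii(text):
    return EMAIL.sub("[EMAIL]", text)


def ingest(raw, trace):
    """Mask first, then record and return; the raw text goes no further."""
    masked = mask_pii(raw)
    trace.append({"event": "ingest", "text": masked})   # the trace stores masked text
    return masked                                       # this is what the model reads


trace = []
prompt = ingest("Contact jane.doe@example.com about the refund.", trace)
print(prompt)   # Contact [EMAIL] about the refund.
```

Because the trace is append-only, anything written to it unmasked stays there. Masking at the dashboard would leave the address in both the model's context and the permanent record.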

Putting It into Practice

Adoption order matters more than framework choice. The Failure Modes list ordered the safety build; these phases order the whole program. Each is a precondition for the next:

Phase | Build | Exit test
1 · Foundation | One runtime, MCP for tools, one trace store, one sandbox tier | An agent finishes a real task with traces you can replay
2 · Hardening | Budgets, approval gates, trace-level evaluators, a red-team pass | You can answer “what did it do and why” for any run
3 · Specialization | Internal simulators mirroring your tools and permissions | Offline scores predict production behavior
4 · Optimization | PRMs, verifier-guided search, RL on narrow domains | Gains survive the exploit suite

Frameworks

When you pick the phase-one runtime, the mid-2026 shortlist:

Runtime | Distinct strength | Fit
LangGraph[26] | Typed state, durable execution, interrupts | You want control flow inspectable and replayable
OpenAI Agents SDK[27] | Handoffs, guardrails, first-party tracing | You build on the Responses API
Google ADK[28] | Multi-language, hierarchical multi-agent | Enterprise teams inside Google Cloud
Claude Agent SDK | Inherits Claude Code’s loop, permissions, hooks | Coding agents on Anthropic models

CrewAI, smolagents, and PydanticAI cover lighter use. Bedrock AgentCore and Foundry Agent Service trade low-level control for managed operation.

Seven habits compound across all four phases:

  • Build the eval harness before the agent.
  • Track tokens per task on the same dashboard as latency and success rate.
  • Restrict by schema, never by prompt.
  • Keep policy in config and hooks, out of code, so it ships without redeploys.
  • Label every scaffold with the model limitation it patches, and audit the labels.
  • Treat your benchmarks as attack surfaces.
  • Treat the knowledge your agents consult as a maintained artifact, not a static document. GROOM is our implementation of that habit.

Open Problems

What the discipline cannot do yet, in rough order of pain. The first two are where GROOM will be tested first:

  • Cross-session continuity. Shift-handoff scaffolding carries notes across sessions. Nobody has shown multi-week collaboration that accumulates skill.
  • Silent failure. Graceful degradation hides regressions. Harnesses lack an equivalent of mutation testing: a sensor that never fires means either quality or blindness, and you cannot tell without injecting faults[1].
  • Evaluator trust as an arms race. Agents trained against evaluators attack the measurement. Red-team the eval harness like any other surface[16].
  • Harness portability. If harnesses are model-specific[19], switching models means re-deriving the scaffold. Protocol layers[21],[22] soften that cost without removing it.
  • Capability preservation. Current harnesses optimize short-term throughput. Designing them to deepen the operator’s understanding instead of eroding it remains the most underrated question on this list.
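
The silent-failure item describes fault injection as the way to tell a quiet sensor from a blind one. A minimal sketch of that idea follows, assuming a hypothetical trace shape (a dict with `exit_code` and `assertions_run`) and hypothetical sensor and mutation functions. It borrows the logic of mutation testing: apply known faults to a known-good trace and list the ones the sensor misses.

```python
from typing import Callable

Trace = dict
Sensor = Callable[[Trace], bool]      # True means the sensor fired
Mutation = Callable[[Trace], Trace]   # returns a faulty copy of a trace


def exit_code_sensor(trace: Trace) -> bool:
    """Hypothetical sensor: fires when the test command failed."""
    return trace["exit_code"] != 0


def break_exit_code(trace: Trace) -> Trace:
    return {**trace, "exit_code": 1}


def drop_assertions(trace: Trace) -> Trace:
    # A fault this sensor cannot see: tests pass because they check nothing.
    return {**trace, "assertions_run": 0}


def surviving_mutations(
    sensor: Sensor, good_trace: Trace, mutations: list[Mutation]
) -> list[str]:
    """Names of injected faults that the sensor failed to flag."""
    if sensor(good_trace):
        raise ValueError("sensor fires on the known-good trace")
    return [m.__name__ for m in mutations if not sensor(m(good_trace))]


good = {"exit_code": 0, "assertions_run": 12}
survivors = surviving_mutations(
    exit_code_sensor, good, [break_exit_code, drop_assertions]
)
```

In this example `survivors` contains only `drop_assertions`, which marks a blind spot: the sensor stays quiet on that fault, and without the injection its silence would look like quality.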

Conclusion

Production AI in 2026 wins or loses in the harness. The discipline converged on a recognizable shape: a thin reactive loop surrounded by deep infrastructure for context, tools, safety, memory, and measurement, governed by the rule that the model proposes and the harness disposes. The 98% is the product, not the overhead.

The frontier is reflexive: harnesses now measure, maintain, and improve themselves, and GROOM contributes that loop at the knowledge layer. The wiki beneath this survey is one GROOM keeps current.

GROOM is open source. Point it at your own knowledge base (github.com/beconfident-ai/groom), or read the wiki it maintains. We’re building the rest of this at BeConfident Labs.

Author Contributions Statement

Gui Dávid (Head of AI, BeConfident Labs): conception, survey research and synthesis, GROOM design and implementation, visualizations, writing.

BeConfident Labs: the production agent infrastructure that motivated and stress-tested these patterns. Agentic tooling assisted the interactive figures, inside the kind of harness this paper surveys. A human verified every claim and reference against primary sources.

References

  1. Böckeler, B. (2026). Harness Engineering for Coding Agent Users. martinfowler.com. martinfowler.com/articles/harness-engineering.html
  2. OpenAI (2026). Harness engineering: leveraging Codex in an agent-first world. openai.com/index/harness-engineering/
  3. Anthropic (2024). Building Effective Agents. www.anthropic.com/engineering/building-effective-agents
  4. Anthropic (2025). Effective context engineering for AI agents. www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
  5. Anthropic (2025). Writing effective tools for agents — with agents. www.anthropic.com/engineering/writing-tools-for-agents
  6. Anthropic (2025). Effective harnesses for long-running agents. www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
  7. Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. arxiv.org/abs/2210.03629
  8. Packer, C., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. arxiv.org/abs/2310.08560
  9. Jimenez, C. E., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770. arxiv.org/abs/2310.06770
  10. Zhou, S., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854. arxiv.org/abs/2307.13854
  11. Mialon, G., et al. (2023). GAIA: a benchmark for General AI Assistants. arXiv:2311.12983. arxiv.org/abs/2311.12983
  12. Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688. arxiv.org/abs/2308.03688
  13. Xie, T., et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972. arxiv.org/abs/2404.07972
  14. Yao, S., et al. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045. arxiv.org/abs/2406.12045
  15. Le Sellier De Chezelles, T., et al. (2024). The BrowserGym Ecosystem for Web Agent Research. arXiv:2412.05467. arxiv.org/abs/2412.05467
  16. Xue, T., et al. (2025). An Illusion of Progress? Assessing the Current State of Web Agents. arXiv:2504.01382. arxiv.org/abs/2504.01382
  17. Liu, J., et al. (2026). Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems. arXiv:2604.14228. arxiv.org/abs/2604.14228
  18. Bui, N. D. Q., et al. (2026). Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned. arXiv:2603.05344. arxiv.org/abs/2603.05344
  19. Zhang, H., et al. (2026). Self-Harness: Harnesses That Improve Themselves. arXiv:2606.09498. arxiv.org/abs/2606.09498
  20. Agache, A., et al. (2020). Firecracker: Lightweight Virtualization for Serverless Applications. NSDI ’20. www.usenix.org/conference/nsdi20/presentation/agache
  21. Model Context Protocol. Specification and documentation. modelcontextprotocol.io
  22. A2A Protocol. Agent-to-agent interoperability specification. a2a-protocol.org
  23. OWASP GenAI Security Project. Top 10 for Agentic Applications. genai.owasp.org
  24. European Commission. The EU Artificial Intelligence Act — regulatory framework and application timeline. digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai
  25. BeConfident Labs (2026). GROOM: a self-maintaining knowledge harness. github.com/beconfident-ai/groom
  26. LangChain. LangGraph: agent orchestration framework. Documentation. www.langchain.com/langgraph
  27. OpenAI. Agents SDK for Python. Documentation. openai.github.io/openai-agents-python/
  28. Google. Agent Development Kit (ADK). Documentation. adk.dev
  29. NIST (2023). AI Risk Management Framework 1.0, with the Generative AI Profile (2024). www.nist.gov/itl/ai-risk-management-framework
  30. Karpathy, A. (2026). LLM Wiki: an LLM-maintained markdown knowledge base. Public gist, April 2026. gist.github.com/karpathy/442a6bf555914893e9891c11519de94f