Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

About

Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
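The "falsifiable contract" idea behind decision observability can be illustrated with a minimal sketch. The abstract does not publish a schema or revert policy, so everything here is an assumption made for illustration: the `EditContract` record, the `verify` and `should_revert` helpers, the file path, and the task IDs are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class EditContract:
    """One harness edit paired with a self-declared prediction (hypothetical schema)."""
    component: str                      # file-level representation of the edited component
    description: str                    # what the edit changes
    predicted_fixes: set[str]           # task IDs expected to flip from fail to pass
    predicted_regressions: set[str] = field(default_factory=set)


def verify(contract: EditContract, before: dict[str, bool], after: dict[str, bool]) -> dict[str, set[str]]:
    """Check the prediction against task-level outcomes from the next round."""
    fixed = {t for t, ok in after.items() if ok and not before.get(t, False)}
    broken = {t for t, ok in after.items() if not ok and before.get(t, False)}
    return {
        "confirmed_fixes": contract.predicted_fixes & fixed,
        "missed_fixes": contract.predicted_fixes - fixed,
        "unpredicted_regressions": broken - contract.predicted_regressions,
    }


def should_revert(report: dict[str, set[str]]) -> bool:
    # Assumed policy, not taken from the paper: revert when no predicted fix
    # materialized or when the edit broke tasks it did not predict breaking.
    return not report["confirmed_fixes"] or bool(report["unpredicted_regressions"])


# Hypothetical round-over-round outcomes (True means the task passed).
before = {"task-a": False, "task-b": True, "task-c": False}
after = {"task-a": True, "task-b": True, "task-c": False}
contract = EditContract("tools/shell.py", "retry on timeout", {"task-a", "task-c"})

report = verify(contract, before, after)
print(sorted(report["confirmed_fixes"]), sorted(report["missed_fixes"]), should_revert(report))
```

Because the edit lives in a single file-level component, a failed prediction could be undone by reverting that one file, which is how component and decision observability would plausibly work together.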

Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, Yu-Gang Jiang • 2026

Related benchmarks

Task	Dataset	Result
Software Engineering	SWE-bench verified (All)	Success Rate 87.5	32
Agentic Task Completion	Terminal-Bench All 2	Pass@1 77	7
Agentic Task Completion	Terminal-Bench Med. 2 (55 tasks)	Pass@1 88.2	7
Agentic Task Completion	Terminal-Bench Easy 2 (4 tasks)	Pass@1 100	7
Agentic Task Completion	Terminal-Bench Hard 2 (30 tasks)	Pass@1 53.3	7
Software Task Solving	SWE-bench Verified	Succ/Mtok (All) 1.64	4
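As a quick consistency check on the Terminal-Bench rows, the overall Pass@1 can be recomputed from the difficulty splits. This sketch assumes the three splits partition the full task set and that the overall figure is a task-count-weighted mean; neither assumption is stated on the page.

```python
# Task counts and Pass@1 (%) per difficulty split, copied from the table above.
splits = {"Easy": (4, 100.0), "Med.": (55, 88.2), "Hard": (30, 53.3)}

total_tasks = sum(count for count, _ in splits.values())
overall = sum(count * pass_at_1 for count, pass_at_1 in splits.values()) / total_tasks

print(total_tasks, round(overall, 1))  # prints: 89 77.0
```

Under those assumptions the weighted mean agrees with the reported overall Pass@1 of 77.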

