
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

About

Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
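The decision-observability idea above (each edit carries a self-declared prediction that is checked against the next round's task-level outcomes) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's implementation: the `EditContract` schema, the `verify` function, the example path and task ids, and in particular the keep/revert rule.

```python
from dataclasses import dataclass, field


@dataclass
class EditContract:
    """A harness edit paired with its self-declared prediction (hypothetical schema)."""
    component_path: str                 # file-level handle of the edited component
    description: str                    # what the edit changes
    predicted_fixes: set[str]           # task ids expected to flip from fail to pass
    predicted_regressions: set[str] = field(default_factory=set)


def verify(contract: EditContract,
           before: dict[str, bool],
           after: dict[str, bool]) -> dict:
    """Check a prediction against task-level outcomes of the following round.

    `before` and `after` map task id -> passed. The keep rule below is an
    assumed policy: keep the edit only if at least one predicted fix
    materialized and no unpredicted regression appeared.
    """
    fixed = {t for t, ok in after.items() if ok and not before.get(t, False)}
    regressed = {t for t, ok in after.items() if not ok and before.get(t, False)}
    confirmed = contract.predicted_fixes & fixed
    unexpected = regressed - contract.predicted_regressions
    return {
        "confirmed": confirmed,
        "missed": contract.predicted_fixes - fixed,
        "unexpected_regressions": unexpected,
        "keep": bool(confirmed) and not unexpected,
    }


if __name__ == "__main__":
    before = {"t1": False, "t2": True, "t3": False}
    after = {"t1": True, "t2": True, "t3": False}
    contract = EditContract(
        component_path="tools/shell.py",        # hypothetical component file
        description="add a timeout to shell calls",
        predicted_fixes={"t1", "t3"},
    )
    print(verify(contract, before, after))
```

Because the edit is tied to a single file-level component, a falsified prediction (`keep` is false) can be handled by reverting that one file, which is what makes the action space revertible in this sketch.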

Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, Yu-Gang Jiang • 2026

Related benchmarks

Task                     Dataset                            Result                  Rank
Software Engineering     SWE-bench verified (All)           Success Rate: 87.5      32
Agentic Task Completion  Terminal-Bench All 2               Pass@1: 77              7
Agentic Task Completion  Terminal-Bench Med. 2 (55 tasks)   Pass@1: 88.2            7
Agentic Task Completion  Terminal-Bench Easy 4 tasks 2      Pass@1: 100             7
Agentic Task Completion  Terminal-Bench Hard 2 (30 tasks)   Pass@1: 53.3            7
Software Task Solving    SWE-bench Verified                 Succ/Mtok (All): 1.64   4
