Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized, Verifiable Policies for Safe, Robust, and Efficient Agents
About
Autonomous LLM agents fail because long-horizon policy remains implicit in model weights and transcripts, while safety is retrofitted post hoc. We propose Traversal-as-Policy: distill sandboxed OpenHands execution logs into a single executable Gated Behavior Tree (GBT) and treat tree traversal -- rather than unconstrained generation -- as the control policy whenever a task is in coverage. Each node encodes a state-conditioned action macro mined and merge-checked from successful trajectories; macros implicated by unsafe traces attach deterministic pre-execution gates over structured tool context and bounded history, updated under experience-grounded monotonicity so previously rejected unsafe contexts cannot be re-admitted. At runtime, a lightweight traverser matches the base model's intent to child macros, executes one macro at a time under global and node-local gating, and when stalled performs risk-aware shortest-path recovery to a feasible success leaf; the visited path forms a compact spine memory that replaces transcript replay. Evaluated in a unified OpenHands sandbox on 15+ software, web, reasoning, and safety/security benchmarks, GBT improves success while driving violations toward zero and reducing cost. On SWE-bench Verified (Protocol A, 500 issues), GBT-SE raises success from 34.6% to 73.6%, reduces violations from 2.8% to 0.2%, and cuts token/character usage from 208k/820k to 126k/490k; with the same distilled tree, 8B executors more than double success on SWE-bench Verified (14.0% to 58.8%) and WebArena (9.1% to 37.3%).
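The two mechanisms named in the abstract, monotone gates and risk-aware shortest-path recovery, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Gate`, `Node`, and `recover` names, the representation of a tool context as a hashable tuple, and the cost model (one unit per step plus a per-macro `risk` value) are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Gate:
    """Hypothetical pre-execution gate with a grow-only set of rejected contexts."""
    rejected: set = field(default_factory=set)

    def reject(self, context: tuple) -> None:
        # Monotone update: contexts are only ever added, never removed,
        # so a previously rejected context cannot be re-admitted.
        self.rejected.add(context)

    def allows(self, context: tuple) -> bool:
        return context not in self.rejected


@dataclass
class Node:
    """One action macro in the tree (all fields are illustrative assumptions)."""
    name: str
    risk: float = 0.0
    is_success_leaf: bool = False
    children: list = field(default_factory=list)
    gate: Gate = field(default_factory=Gate)


def recover(start: Node, context: tuple):
    """Dijkstra search from `start` to the cheapest reachable success leaf.

    Edge cost is assumed to be 1 step plus the child's risk. Children whose
    gate rejects `context` are skipped. Returns the list of node names on the
    path (a stand-in for the "spine"), or None if no feasible leaf exists.
    """
    tie = 0  # tie-breaker so the heap never has to compare Node objects
    frontier = [(0.0, tie, start, [start.name])]
    while frontier:
        cost, _, node, path = heapq.heappop(frontier)
        if node.is_success_leaf:
            return path
        for child in node.children:
            if not child.gate.allows(context):
                continue
            tie += 1
            step_cost = cost + 1.0 + child.risk
            heapq.heappush(frontier, (step_cost, tie, child, path + [child.name]))
    return None


def build_demo_tree() -> Node:
    """Small hypothetical tree: a risky one-step leaf and a safer two-step route."""
    root = Node("triage")
    quick_patch = Node("quick_patch", risk=5.0, is_success_leaf=True)
    edit_file = Node("edit_file")
    run_tests = Node("run_tests", is_success_leaf=True)
    edit_file.children.append(run_tests)
    root.children.extend([quick_patch, edit_file])
    return root
```

With this cost model, the search prefers the longer but lower-risk route through `edit_file` and `run_tests` over the one-step `quick_patch` leaf. Once a gate has rejected a context, every later call to `recover` with that context routes around the gated macro, and the function returns `None` when no feasible success leaf remains.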
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Agent Harm Evaluation | AgentHarm public | HarmScore | 9.6 | 8 |
| Agent Safety Evaluation | Agent-SafetyBench | Agent-SafetyBench Score | 72.3 | 8 |
| Agent Security Evaluation | ASB (Agent Security Benchmark) | ASR-d (ASB) | 7 | 8 |
| Reasoning | GPQA Protocol A (test) | Accuracy | 87.3 | 5 |
| Software Engineering | SWE-bench Verified 500 issues (Protocol A) | Success Rate (SR) | 73.6 | 4 |
| Web navigation | WebArena Protocol A 812 tasks | Success Rate (SR) | 66.9 | 4 |