Learning Efficient Guardrails for Compliance
About
Autonomous web agents are increasingly deployed for long-horizon tasks, yet their ability to adhere to real-world policies remains critically underexplored compared to standard safety objectives. To address this gap, we introduce PolicyGuardBench, a benchmark of 60k policy-trajectory pairs designed to evaluate compliance through both full-trajectory and novel prefix-based violation detection tasks. Using this dataset, we train PolicyGuard, a lightweight guardrail model that achieves strong detection accuracy while maintaining high inference efficiency. Notably, our model demonstrates robust generalization capabilities, preserving high performance even on unseen domains. These contributions establish a comprehensive framework for studying policy compliance, showing that accurate and generalizable guardrails are feasible at small scales.
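To make the two task framings concrete, here is a minimal sketch of how a policy-trajectory pair might be truncated to a prefix and how predictions could be scored. It is not the benchmark's actual schema or evaluation code. The names `PolicyTrajectoryPair`, `prefix`, `accuracy`, and `f1_score` are hypothetical, and the sketch assumes a prefix inherits the label of its full trajectory, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PolicyTrajectoryPair:
    """Hypothetical record: one policy, one agent trajectory, one label."""
    policy: str
    actions: List[str]
    violation: bool  # True if the trajectory violates the policy


def prefix(pair: PolicyTrajectoryPair, n: int) -> PolicyTrajectoryPair:
    """Keep only the first n actions.

    Assumption: the prefix keeps the full trajectory's label.
    """
    return PolicyTrajectoryPair(pair.policy, pair.actions[:n], pair.violation)


def accuracy(labels: List[bool], preds: List[bool]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)


def f1_score(labels: List[bool], preds: List[bool]) -> float:
    """F1 with 'violation' (True) as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only, not drawn from PolicyGuardBench.
pair = PolicyTrajectoryPair(
    policy="Do not delete user accounts.",
    actions=["click(delete_account)", "click(confirm)", "goto(home)"],
    violation=True,
)
first_step = prefix(pair, 1)
```

Under these assumptions, a full-trajectory detector would see every action in `pair`, while a prefix-based detector with N=1 would see only `first_step.actions`.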
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Violation Detection | PolicyGuardBench | Safety F1 | 87.59 | 30 |
| Prefix-based violation detection | PolicyGuardBench | Accuracy (N=1) | 91.01 | 16 |
| Policy-trajectory compliance evaluation | PolicyGuardBench | F1 Score | 87.59 | 10 |