
Learning Efficient Guardrails for Compliance

About

Autonomous web agents are increasingly deployed for long-horizon tasks, yet their ability to adhere to real-world policies remains critically underexplored compared to standard safety objectives. To address this gap, we introduce PolicyGuardBench, a benchmark of 60k policy-trajectory pairs designed to evaluate compliance through both full-trajectory and novel prefix-based violation detection tasks. Using this dataset, we train PolicyGuard, a lightweight guardrail model that achieves strong detection accuracy while maintaining high inference efficiency. Notably, our model demonstrates robust generalization capabilities, preserving high performance even on unseen domains. These contributions establish a comprehensive framework for studying policy compliance, showing that accurate and generalizable guardrails are feasible at small scales.

Xiaofei Wen, Wenjie Jacky Mo, Yanan Xie, Peng Qi, Muhao Chen • 2025

Related benchmarks

Task                                      Dataset            Result                  Rank
Violation Detection                       PolicyGuardBench   Safety F1: 87.59        30
Prefix-based violation detection          PolicyGuardBench   Accuracy (N=1): 91.01   16
Policy-trajectory compliance evaluation   PolicyGuardBench   F1 Score: 87.59         10
