
Benchmarking Misuse Mitigation Against Covert Adversaries

About

Existing language model safety evaluations focus on overt attacks and low-stakes tasks. Realistic attackers can subvert current safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
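To make the contrast concrete: a stateless filter scores each query in isolation, while a stateful defense also accumulates evidence across a user's session, so a sequence of individually benign-looking queries can still be flagged. The sketch below is purely illustrative; the class name, scoring interface, and thresholds are hypothetical and not the paper's actual method.

```python
# Illustrative sketch of a stateful defense. The risk scores would come
# from some upstream per-query classifier (not shown); all names and
# threshold values here are hypothetical.
from collections import defaultdict

class StatefulMonitor:
    def __init__(self, per_query_threshold=0.9, cumulative_threshold=1.5):
        self.per_query_threshold = per_query_threshold
        self.cumulative_threshold = cumulative_threshold
        self.session_risk = defaultdict(float)  # per-user accumulated risk

    def check(self, user_id, risk_score):
        """Return True if the query should be refused."""
        # A stateless filter would stop here: flag only overtly risky queries.
        if risk_score >= self.per_query_threshold:
            return True
        # The stateful part: individually benign queries (each below the
        # per-query threshold) can still trip the cumulative threshold.
        self.session_risk[user_id] += risk_score
        return self.session_risk[user_id] >= self.cumulative_threshold

monitor = StatefulMonitor()
flags = [monitor.check("u1", s) for s in [0.4, 0.4, 0.4, 0.4]]
# Each query scores 0.4, well below the per-query threshold, but the
# running total (0.4, 0.8, 1.2, 1.6) crosses 1.5 on the fourth query.
```

A decomposition attack in this framing is exactly the strategy of keeping every query's score under the per-query threshold; the cumulative check is what makes the defense "stateful."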

Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J. Pappas, Eric Wong, Hamed Hassani • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Trace-level safety monitoring | ImpossibleBench | ROC-AUC 94.9 | 40 |
| Case-level safety detection | ImpossibleBench | ROC-AUC 99.4 | 40 |
| Sabotage detection | ImpossibleBench | AP 99 | 40 |
| Trace-level detection | ImpossibleBench | AP 81.5 | 40 |
| Distributed Misuse Detection | DM-Cyber 20x | Trace-level AP 22.9 | 10 |
| Distributed Misuse Detection | DM-Cyber 20x | Trace-level ROC-AUC 75.2 | 10 |
| Distributed Misuse Detection | DM-Cyber 100x | Trace-level ROC-AUC 74.4 | 9 |
| Distributed Misuse Detection | DM-Bio 100x | Trace-level AP 2.6 | 9 |
| Distributed Misuse Detection | DM-Bio 20x | Trace-level AP 14.2 | 9 |
| Distributed Misuse Detection | DM-Bio 20x | Trace-level ROC-AUC 67.6 | 9 |

Showing 10 of 12 rows.
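The Result column mixes two standard detection metrics, ROC-AUC and average precision (AP), both computed from per-trace detector scores against ground-truth labels. As a reference for how such numbers are derived, here is a minimal self-contained sketch of both metrics (the example labels and scores are made up, not benchmark data):

```python
# Minimal reference implementations of ROC-AUC and average precision.
# labels: 1 = misuse trace, 0 = benign trace; scores: detector outputs.

def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative
    (Mann-Whitney formulation, ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision@k over the ranks k where a positive appears."""
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            ap += hits / rank
    return ap / hits

labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, scores))            # 0.75
print(average_precision(labels, scores))  # (1/1 + 2/3) / 2 ≈ 0.833
```

Note that AP, unlike ROC-AUC, is sensitive to class balance, which is one reason the Trace-level AP numbers (e.g. 2.6 on DM-Bio 100x) can sit far below the corresponding ROC-AUC values.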
