Benchmarking Misuse Mitigation Against Covert Adversaries

About

Existing language model safety evaluations focus on overt attacks and low-stakes tasks. In reality, an attacker can easily subvert existing safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because the individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. This enables us to evaluate decomposition attacks, which are found to be effective misuse enablers, and to highlight stateful defenses as a promising countermeasure.
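To make the contrast between stateless and stateful defenses concrete, here is a minimal sketch. It is not the paper's implementation. It assumes each query already has a risk score in [0, 1] from some hypothetical per-query classifier, and the class name `StatefulMonitor`, the thresholds, and the window size are all illustrative choices. A stateless check looks at one query at a time, while the stateful monitor also sums recent scores per user, so a run of individually benign-seeming queries can still be flagged.

```python
from collections import defaultdict, deque


def stateless_check(risk_score, per_query_threshold=0.8):
    """Judge a single query in isolation (hypothetical baseline)."""
    return "flag_query" if risk_score >= per_query_threshold else "allow"


class StatefulMonitor:
    """Toy stateful defense: tracks recent risk scores per user.

    All thresholds are assumptions chosen for illustration only.
    """

    def __init__(self, per_query_threshold=0.8, cumulative_threshold=1.5, window=20):
        self.per_query_threshold = per_query_threshold
        self.cumulative_threshold = cumulative_threshold
        # Keep only the most recent `window` scores for each user.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, user_id, risk_score):
        """Record one query's score and return a decision string."""
        history = self._history[user_id]
        history.append(risk_score)
        if risk_score >= self.per_query_threshold:
            return "flag_query"
        # Individually low scores can still add up across queries.
        if sum(history) >= self.cumulative_threshold:
            return "flag_user"
        return "allow"


# Four moderately suspicious queries from the same (hypothetical) user.
scores = [0.4, 0.4, 0.4, 0.4]
stateless_decisions = [stateless_check(s) for s in scores]
monitor = StatefulMonitor()
stateful_decisions = [monitor.observe("user-1", s) for s in scores]
```

In this toy setup the stateless check allows every query, while the stateful monitor flags the user on the fourth query once the windowed sum passes the cumulative threshold. A real system would need a far better aggregation signal than a plain sum of scores.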

Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J. Pappas, Eric Wong, Hamed Hassani • 2025

Related benchmarks

Task	Dataset	Result
Biosecurity Misuse Evaluation	BSD Biosecurity	Misuse Rate 41.6	49
Trace-level safety monitoring	ImpossibleBench	ROC-AUC 94.9	40
Case-level safety detection	ImpossibleBench	ROC-AUC 99.4	40
Sabotage detection	ImpossibleBench	Average Precision 99	40
Trace-level detection	ImpossibleBench	AP 81.5	40
Distributed Misuse Detection	DM-Cyber 20x	Trace-level AP 22.9	10
Distributed Misuse Detection	DM-Cyber 20x Trace-level	Trace-level ROC-AUC 75.2	10
Distributed Misuse Detection	DM-Cyber 100x (Trace-level)	ROC-AUC (Trace-level) 74.4	9
Distributed Misuse Detection	DM-Bio 100x	Trace-level AP 2.6	9
Distributed Misuse Detection	DM-Bio 20x	Trace-level AP 14.2	9

Showing 10 of 14 rows
