
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

About

As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them. Previous work introduced control evaluations, an adversarial framework for testing deployment strategies of untrusted models (i.e., models which might be trying to bypass safety measures). While prior work treats a single failure as unacceptable, we perform control evaluations in a "distributed threat setting" -- a setting where no single action is catastrophic and no single action provides overwhelming evidence of misalignment. We approach this problem with a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols. Micro-protocols operate on a single task, using a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. Meanwhile, the macro-protocol maintains an adaptive credence on the untrusted model's alignment based on its past actions, using it to pick between safer and riskier micro-protocols. We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
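The adaptive credence idea in the abstract can be illustrated with a small sketch. This is not the paper's macro-protocol. It is a minimal Bayesian update under stated assumptions: the trusted monitor emits a suspicion score in roughly [0, 1] for each task, those scores are Gaussian with hypothetical parameters under the "aligned" and "misaligned" hypotheses, and the macro-protocol switches to a safer micro-protocol once the credence passes a fixed threshold. All function names, parameters, and numeric values below are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def update_credence(prior, suspicion, benign=(0.2, 0.15), malign=(0.6, 0.15)):
    """Bayes update of P(untrusted model is misaligned) after one monitor score.

    `benign` and `malign` are hypothetical (mean, std) pairs for the suspicion
    score under each hypothesis; they are assumptions, not values from the paper.
    """
    like_malign = gaussian_pdf(suspicion, *malign)
    like_benign = gaussian_pdf(suspicion, *benign)
    evidence = prior * like_malign + (1 - prior) * like_benign
    if evidence == 0.0:
        # Both likelihoods underflowed; the score carries no usable signal.
        return prior
    return prior * like_malign / evidence


def choose_micro_protocol(credence, threshold=0.5):
    """Pick the safer micro-protocol once credence reaches the threshold."""
    if credence >= threshold:
        return "use_trusted_model"      # safer, less useful
    return "use_untrusted_output"       # riskier, more useful


def run_macro_protocol(suspicion_scores, prior=0.1, threshold=0.5):
    """Process tasks in order, choosing a micro-protocol before each update."""
    credence = prior
    decisions = []
    for score in suspicion_scores:
        decisions.append(choose_micro_protocol(credence, threshold))
        credence = update_credence(credence, score)
    return decisions, credence
```

Calling `run_macro_protocol([0.6, 0.6, 0.6])` starts with the riskier micro-protocol and moves to the safer one as suspicious scores accumulate, while a run of low scores keeps the credence low. A fixed threshold is the simplest possible decision rule and is used here only for illustration.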

Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Case-level safety detection | ImpossibleBench | ROC AUC | 100 | 40 |
| Sabotage detection | ImpossibleBench | Average Precision | 100 | 40 |
| Trace-level detection | ImpossibleBench | AP | 81.7 | 40 |
| Trace-level safety monitoring | ImpossibleBench | ROC-AUC | 93.5 | 40 |
| Distributed Misuse Detection | DM-Cyber 20x Trace-level | Trace-level ROCAUC | 93.1 | 10 |
| Distributed Misuse Detection | DM-Cyber 20x | Trace-level AP | 33.2 | 10 |
| Distributed Misuse Detection | DM-Bio 20x | Trace-level AP | 24.6 | 9 |
| Distributed Misuse Detection | DM-Bio 100x | Trace-level AP | 4 | 9 |
| Distributed Misuse Detection | DM-Bio 20x (Trace-level) | Trace-level ROCAUC | 83.7 | 9 |
| Distributed Misuse Detection | DM-Cyber 100x (Trace-level) | ROCAUC (Trace-level) | 73.3 | 9 |
Showing 10 of 12 rows
