
Improving Alignment and Robustness with Circuit Breakers

About

AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that uses "circuit breakers" to interrupt models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility, even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, circuit breakers allow the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards against harmful behavior and adversarial attacks.
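
The abstract does not give the training objective, so the following is only a minimal sketch of what "directly controlling representations" could look like, under stated assumptions. It assumes access to hidden-state vectors from a frozen copy of the model and from the copy being tuned. One term pushes the tuned representations of harmful inputs away from the direction of the frozen ones, and a second term keeps representations of benign inputs close to the originals so utility is preserved. The function names, the cosine-based penalty, and the weights `alpha` and `beta` are all hypothetical and are not taken from this page.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)

def rerouting_loss(frozen, tuned):
    """Hypothetical 'circuit breaker' term for harmful inputs.

    Penalizes tuned representations that still align with the frozen
    ones; the penalty is zero once they are orthogonal or opposed.
    """
    return sum(max(cosine(f, t), 0.0) for f, t in zip(frozen, tuned)) / len(frozen)

def retain_loss(frozen, tuned):
    """Hypothetical utility term for benign inputs.

    Mean Euclidean distance between frozen and tuned representations;
    zero when the tuned model leaves benign representations unchanged.
    """
    return sum(math.dist(f, t) for f, t in zip(frozen, tuned)) / len(frozen)

def combined_loss(harmful_frozen, harmful_tuned,
                  benign_frozen, benign_tuned,
                  alpha=1.0, beta=1.0):
    """Weighted sum of both terms; alpha and beta are assumed weights."""
    return (alpha * rerouting_loss(harmful_frozen, harmful_tuned)
            + beta * retain_loss(benign_frozen, benign_tuned))
```

In this sketch the loss reaches zero when harmful-input representations have been rotated away from their original direction while benign-input representations are untouched, which is one way to read the abstract's claim of blocking harmful outputs without sacrificing utility.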

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks • 2024

Related benchmarks

Task                        Dataset            Result                           Rank
Commonsense Reasoning       HellaSwag          Accuracy 57.59                   1891
Multimodal Evaluation       MME                --                               658
Jailbreak Attack            HarmBench          Attack Success Rate (ASR) 33.75  487
Natural Language Inference  RTE                Accuracy 71.04                   448
Instruction Following       MT-Bench           MT-Bench Score 8.21              215
Reasoning                   ARC Easy           --                               187
Jailbreak Attack            MaliciousInstruct  ASR 2                            161
Safety Evaluation           AdvBench           --                               117
Safety Evaluation           HarmBench          HarmBench Score 3.75             112
Reasoning                   GSM8K              --                               106
Showing 10 of 84 rows