Representation Bending for Large Language Model Safety
About
Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, remain vulnerable because they address specific threats and often fail to generalize to unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering - simple vector arithmetic for steering a model's behavior during inference - to loss-based fine-tuning. In extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
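To make the "simple vector arithmetic" behind activation steering concrete, here is a minimal sketch of the inference-time idea that the abstract says RepBend carries over to fine-tuning. It is not RepBend's loss or the authors' code. It assumes a difference-of-means steering direction, one common way to build such a vector, and uses synthetic activations in place of real model hidden states. All function names, the dimension `DIM`, and the scale `alpha` are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(harmful_acts, benign_acts):
    """Difference of means: points from the benign cluster toward the harmful one."""
    mean_harmful = mean_vector(harmful_acts)
    mean_benign = mean_vector(benign_acts)
    return [h - b for h, b in zip(mean_harmful, mean_benign)]


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, direction, alpha):
    """Shift one hidden state by alpha along the unit steering direction."""
    u = unit(direction)
    return [h + alpha * d for h, d in zip(hidden, u)]


# Synthetic stand-ins for activations (assumption: real ones would come
# from a chosen layer of the model on harmful and benign prompts).
random.seed(0)
DIM = 8
benign_acts = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(32)]
harmful_acts = [
    [random.gauss(0, 1) + (3.0 if i == 0 else 0.0) for i in range(DIM)]
    for _ in range(32)
]

direction = steering_vector(harmful_acts, benign_acts)
u = unit(direction)

hidden = harmful_acts[0]
steered = steer(hidden, direction, alpha=-2.0)  # negative alpha: move away from "harmful"

before = dot(hidden, u)
after = dot(steered, u)
print(f"projection on harmful direction: before={before:.3f}, after={after:.3f}")
```

With a negative `alpha`, the projection of the hidden state onto the harmful direction drops by exactly `|alpha|`. Per the abstract, RepBend applies this kind of representation shift through a fine-tuning loss rather than through an edit at inference time.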
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) | 7.5 | 376 |
| Safety Evaluation | HarmBench | HarmBench Score | 0.31 | 76 |
| General Capability | MTBench | MTBench Score | 9.14 | 43 |
| Over-refusal | XSTest | XSTest Score | 84.89 | 42 |
| Over-refusal | Wildjailbreak (Benign) | Wildjailbreak Benign Refusal Rate | 89.2 | 42 |
| General Capability | MMLU | MMLU Accuracy | 78.89 | 31 |
| Safety Evaluation | WildGuard (test) | WildGuard Test Score | 7.34 | 27 |
| General Capability | 8 capability benchmarks Aggregate | Average Capability | 65.9 | 26 |
| Safety | 8 jailbreak attacks (Aggregated) | Average ASR | 3.13 | 15 |
| Misuse Detection | Misuse Categories Cybercrime (Phishing) | AUC | 0.99 | 9 |