Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

About

Activation steering is a practical post-training model alignment technique to enhance the utility of Large Language Models (LLMs). Prior to deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as compliance or instruction adherence, without the need for retraining. This process is as simple as adding a steering vector to the model's internal representations. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon termed Steering Externalities, where steering vectors derived from entirely benign datasets-such as those enforcing strict compliance or specific output formats like JSON-inadvertently erode safety guardrails. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks by bypassing the initial safety alignment. Ultimately, our results expose a critical blind spot in deployment: benign activation steering systematically erodes the "safety margin," rendering models more vulnerable to black-box attacks and proving that inference-time utility improvements must be rigorously audited for unintended safety externalities.

Chen Xiong, Zhiyuan He, Pin-Yu Chen, Ching-Yun Ko, Tsung-Yi Ho• 2026

Related benchmarks

TaskDatasetResultRank
Jailbreak AttackHarmBench
Attack Success Rate (ASR)34
376
Jailbreak Attack EvaluationHarmBench (400 random samples)
ASR11.5
18
Jailbreak attack success rateHarmBench (50 randomly sampled questions)
ASR84
8
Showing 3 of 3 rows

Other info

Follow for update