SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals

About

Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. Existing safety mechanisms, while improving model safety, often lead to overly cautious behavior and fail to fully leverage LLMs' internal cognitive processes. Inspired by humans' reflective thinking capability, we first show that LLMs can similarly perform internal assessments about safety in their internal states. Building on this insight, we propose SafeSwitch, a dynamic framework that regulates unsafe outputs by utilizing the prober-based internal state monitor that actively detects harmful intentions, and activates a safety head that leads to safer and more conservative responses only when necessary. SafeSwitch reduces harmful outputs by approximately 80% on harmful queries while maintaining strong utility, reaching a Pareto optimal among several methods. Our method is also advantageous over traditional methods in offering more informative, context-aware refusals, and achieves these benefits while only tuning less than 6% of the original parameters. SafeSwitch demonstrates large language models' capacity for self-awareness and reflection regarding safety, offering a promising approach to more nuanced and effective safety controls. Codes for this work are available at https://github.com/Hanpx20/SafeSwitch.

Peixuan Han, Cheng Qian, Xiusi Chen, Yuji Zhang, Heng Ji, Denghui Zhang• 2025

Related benchmarks

Task	Dataset	Result
Question Answering	TriviaQA	Accuracy61	117
Question Answering	TruthfulQA	Accuracy65.2	73
Question Answering	GSM8K	Accuracy66.1	36
Safety Performance	JBB	--	35
Safety Performance	WildJailbreak	Selective Refusal Score (Δs)65.4	11
Jailbreak Attack Robustness	Jailbreak Attack Evaluation Set Llama-3 8B	GCG Robustness Score68.3	6
Safety Detection	SorryBench clean condition (held-out)	Detection Rate95	3
Safety Detection	SorryBench wrapped with HarmBench templates (held-out)	Detection Rate86.9	3
Safety Detection	XSTest benign safety-adjacent	False Positive Rate23	3

Showing 9 of 9 rows

Other info

Follow for update

@wizwand_team Discord