Steering Language Model Refusal with Sparse Autoencoders

About

Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we explore an alternative: steering model activations at inference time by amplifying sparse autoencoder (SAE) features that mediate refusal. This work uncovers a fundamental tension between SAE steering-based safety improvements and general model capabilities. While feature steering successfully improves robustness against both single-turn and challenging multi-turn jailbreak attempts, we discover that this comes at a previously underexplored cost: systematic degradation of performance across multiple benchmark tasks, even on safe inputs with no apparent connection to refusal behavior. This suggests that features mediating refusal may be more deeply entangled with general language model capabilities than previously understood. Our findings reveal important open questions about the nature of safety-relevant features in language models and the feasibility of isolating them for targeted intervention. While SAE-based steering shows promise as a flexible approach to enhancing language model safety, our results highlight the critical need to understand and address the mechanisms behind these capability tradeoffs before such techniques can be practically deployed.
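To make the idea of amplifying an SAE feature at inference time concrete, here is a minimal sketch in plain Python. It assumes one common formulation of activation steering, in which a scaled copy of the feature's decoder direction is added to a hidden state; the abstract does not specify the authors' exact intervention, so this may differ from their method. The names `steer`, `feature_activation`, and `refusal_direction`, the 4-dimensional vectors, and the strength value are all hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def steer(hidden: Vector, feature_direction: Vector, strength: float) -> Vector:
    """Return hidden + strength * feature_direction, element by element.

    Assumption: steering is modeled as adding a scaled decoder direction
    to a hidden state. Real SAE steering operates on model activations
    at a chosen layer; this toy version works on a single small vector.
    """
    if len(hidden) != len(feature_direction):
        raise ValueError("hidden state and feature direction must have the same length")
    return [h + strength * d for h, d in zip(hidden, feature_direction)]


def feature_activation(hidden: Vector, encoder_row: Vector, bias: float) -> float:
    """Toy SAE encoder for one feature: ReLU(encoder_row . hidden + bias)."""
    pre_activation = sum(w * h for w, h in zip(encoder_row, hidden)) + bias
    return max(0.0, pre_activation)


# Hypothetical hidden state and a hypothetical "refusal" feature direction.
hidden = [0.5, -1.0, 0.25, 0.0]
refusal_direction = [0.0, 1.0, 0.0, 0.0]

# Amplify the feature by pushing the hidden state along its direction.
steered = steer(hidden, refusal_direction, strength=4.0)

before = feature_activation(hidden, refusal_direction, bias=0.0)
after = feature_activation(steered, refusal_direction, bias=0.0)
print(before, after)
```

In this toy setup the feature is inactive before steering and active afterwards. Because the same hidden state feeds every downstream computation, the shift applies to all inputs, including benign ones, which is one way to picture how steering could affect tasks unrelated to refusal.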

Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh • 2024

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	HarmBench	--	624
Question Answering	TriviaQA	Accuracy 62.2	117
Question Answering	TruthfulQA	Accuracy 67	73
Question Answering	GSM8K	Accuracy 76.2	36
Safety Performance	JBB	--	35
Safety Performance	WildJailbreak	Selective Refusal Score (Δs) 48.2	11
Jailbreaking	Jailbreak	LG4 ASR 6	8
Jailbreaking	Sorrybench	LG4 ASR 12.9	8
Jailbreaking	AdvBench	LG4 ASR 2.1	8
Jailbreak Attack Robustness	Jailbreak Attack Evaluation Set Llama-3 8B	GCG Robustness Score 72.5	6
