Graph-Regularized Sparse Autoencoders for LLM Safety Steering

About

Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves $\Delta_s$ by $20.1$ points on JailbreakBench and $16.8$ points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.
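To make the idea of smoothing decoder vectors over a neuron co-activation graph concrete, here is a minimal NumPy sketch of one way such an objective could look. It is not the authors' implementation. The abstract does not specify the graph construction, the loss form, or the hyperparameters, so everything here is an assumption: edge weights taken as absolute Pearson correlations between neurons, an unnormalized Laplacian, a ReLU encoder, and the hypothetical names `coactivation_laplacian`, `gsae_loss`, `l1`, and `gamma`. The two-gate runtime controller is not modeled.

```python
import numpy as np

def coactivation_laplacian(X, threshold=0.0):
    """Unnormalized Laplacian of an assumed neuron co-activation graph.

    X: (n_samples, n_neurons) activations.
    Edge weight = |Pearson correlation| between neurons (an assumption).
    """
    A = np.abs(np.corrcoef(X, rowvar=False))  # (n_neurons, n_neurons)
    np.fill_diagonal(A, 0.0)                  # no self-loops
    A[A < threshold] = 0.0                    # optionally drop weak edges
    return np.diag(A.sum(axis=1)) - A         # L = Deg - A

def gsae_loss(X, W_enc, b_enc, D, L, l1=1e-3, gamma=1e-2):
    """Reconstruction + L1 sparsity + graph smoothness on decoder rows.

    D: (n_features, n_neurons); each row is one decoder vector.
    Smoothness term: trace(D L D^T) = sum_j d_j^T L d_j, which is small
    when strongly co-activating neurons get similar decoder weights.
    """
    Z = np.maximum(X @ W_enc + b_enc, 0.0)    # ReLU codes, (n_samples, n_features)
    X_hat = Z @ D                             # reconstruction
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    sparsity = l1 * np.mean(np.sum(np.abs(Z), axis=1))
    smooth = gamma * np.trace(D @ L @ D.T)
    return recon + sparsity + smooth, (recon, sparsity, smooth)
```

Setting `gamma=0` in this sketch recovers a plain SAE objective with independent features, which is the baseline the abstract contrasts against.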

Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri • 2025

Related benchmarks

Task	Dataset	Result
Question Answering	TriviaQA	Accuracy: 70	117
Question Answering	TruthfulQA	Accuracy: 66.5	73
Question Answering	GSM8K	Accuracy: 74.2	36
Safety Performance	JBB	--	35
Safety Performance	WildJailbreak	Selective Refusal Score (Δs): 90.1	11
Jailbreak Attack Robustness	Jailbreak Attack Evaluation Set Llama-3 8B	GCG Robustness Score: 100	6
