Graph-Regularized Sparse Autoencoders for LLM Safety Steering
About
Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves $\Delta_s$ by $20.1$ points on JailbreakBench and $16.8$ points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.
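The abstract does not state the GSAE objective, so the following is only a minimal sketch of one plausible reading of "smoothing SAE decoder vectors over a neuron co-activation graph". It assumes the graph is built from thresholded co-activation frequencies and that smoothing is a Laplacian penalty $\sum_k d_k^\top L d_k$ added to a standard reconstruction-plus-L1 SAE loss. The function names, the threshold, and the weights `l1` and `lam` are hypothetical, and the two-gate runtime controller is not modeled.

```python
import numpy as np

def coactivation_laplacian(acts, threshold=0.0):
    """Build an unnormalized graph Laplacian L = Deg - W over neurons.

    acts: (n_samples, n_neurons) activations. Edge weight W[i, j] is the
    fraction of samples on which neurons i and j are both above `threshold`
    (an assumed construction, not taken from the paper).
    """
    active = (acts > threshold).astype(float)
    W = active.T @ active / acts.shape[0]
    np.fill_diagonal(W, 0.0)          # no self-loops
    deg = np.diag(W.sum(axis=1))
    return deg - W

def graph_smoothness(decoder, L):
    """Sum over latents k of d_k^T L d_k, with decoder of shape (n_neurons, n_latents)."""
    return float(np.trace(decoder.T @ L @ decoder))

def gsae_style_loss(x, encoder, decoder, L, l1=1e-3, lam=1e-2):
    """Reconstruction + L1 sparsity + graph smoothness (assumed form of the objective)."""
    z = np.maximum(x @ encoder, 0.0)  # ReLU codes, shape (n_samples, n_latents)
    x_hat = z @ decoder.T             # reconstruction in neuron space
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1 * sparsity + lam * graph_smoothness(decoder, L)

# Toy usage with random data standing in for residual-stream activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
L = coactivation_laplacian(x)
encoder = rng.normal(size=(8, 16)) * 0.1
decoder = rng.normal(size=(8, 16)) * 0.1
loss = gsae_style_loss(x, encoder, decoder, L)
```

Under this reading, the penalty is small when each decoder vector assigns similar weights to neurons that frequently fire together, which is one way to encode the "distributed structure" the abstract contrasts with the independent-feature prior of a standard SAE.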
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | TriviaQA | Accuracy | 70 | 117 |
| Question Answering | TruthfulQA | Accuracy | 66.5 | 73 |
| Question Answering | GSM8K | Accuracy | 74.2 | 36 |
| Safety Performance | JBB | -- | -- | 35 |
| Safety Performance | WildJailbreak | Selective Refusal Score (Δs) | 90.1 | 11 |
| Jailbreak Attack Robustness | Jailbreak Attack Evaluation Set Llama-3 8B | GCG Robustness Score | 100 | 6 |