Graph-Regularized Sparse Autoencoders for LLM Safety Steering

About

Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves $\Delta_s$ by $20.1$ points on JailbreakBench and $16.8$ points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.
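
The abstract does not state the training objective. One common way to smooth vectors over a graph is a graph-Laplacian penalty, and the sketch below adds such a term to a plain SAE loss under that assumption. The co-activation definition, the ReLU encoder, the function names, and the weights `l1` and `lam` are all hypothetical and not taken from the paper.

```python
import numpy as np

def coactivation_graph(acts, threshold=0.0):
    """Build a symmetric neuron co-activation graph (assumed definition).

    acts: (n_samples, n_neurons) array of activations.
    Edge weight W[i, j] is the fraction of samples in which neurons i and j
    are both above `threshold`. Self-loops are removed.
    """
    active = (acts > threshold).astype(float)
    W = active.T @ active / acts.shape[0]
    np.fill_diagonal(W, 0.0)
    return W

def laplacian(W):
    """Unnormalized graph Laplacian L = diag(degree) - W."""
    return np.diag(W.sum(axis=1)) - W

def graph_smoothness(D, L):
    """Laplacian smoothness of the decoder columns.

    D: (n_neurons, n_features), each column is one decoder direction.
    Returns trace(D^T L D), which for symmetric W equals
    0.5 * sum_ij W[i, j] * ||D[i, :] - D[j, :]||^2.
    """
    return float(np.trace(D.T @ L @ D))

def gsae_loss(x, W_enc, b_enc, D, L, l1=1e-3, lam=1e-2):
    """Reconstruction + L1 sparsity + graph smoothness (illustrative only).

    x: (n_samples, n_neurons), W_enc: (n_neurons, n_features),
    b_enc: (n_features,), D: (n_neurons, n_features).
    """
    z = np.maximum(x @ W_enc + b_enc, 0.0)           # sparse codes (ReLU)
    x_hat = z @ D.T                                   # linear decoder
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = l1 * np.mean(np.sum(np.abs(z), axis=1))
    smooth = lam * graph_smoothness(D, L)
    return recon + sparsity + smooth
```

With this penalty, a decoder direction that assigns similar weights to frequently co-active neurons costs little, while one that assigns opposite-signed weights to strongly connected neurons is penalized. That is one way to relax the assumption that latent features are independent.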

Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Question Answering | TriviaQA | Accuracy: 70 | 117 |
| Question Answering | TruthfulQA | Accuracy: 66.5 | 73 |
| Question Answering | GSM8K | Accuracy: 74.2 | 36 |
| Safety Performance | JBB | -- | 35 |
| Safety Performance | WildJailbreak | Selective Refusal Score (Δs): 90.1 | 11 |
| Jailbreak Attack Robustness | Jailbreak Attack Evaluation Set Llama-3 8B | GCG Robustness Score: 100 | 6 |