
Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

About

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAEs as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
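
The pipeline described in the abstract (an SAE trained on reconstruction, then a classifier over its sparse latents) might be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the ReLU encoder, the L1 sparsity penalty, the logistic-regression probe, and every name and hyperparameter (sae_forward, sae_loss, fit_linear_probe, predict, l1_coeff, lr, steps) are assumptions, since the abstract specifies only "standard reconstruction objectives". Random arrays stand in for hidden activations taken from a VLM layer.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative latents and reconstruct them."""
    z = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU latents (assumed encoder form)
    x_hat = z @ W_dec + b_dec               # linear decoder
    return z, x_hat


def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty (assumed objective)."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity


def fit_linear_probe(z, y, lr=0.5, steps=200):
    """Fit a logistic-regression probe on SAE latents with gradient descent.

    y holds 0 for clean inputs and 1 for adversarially perturbed inputs.
    """
    w = np.zeros(z.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))  # predicted P(adversarial)
        grad = p - y                            # gradient of log loss w.r.t. logits
        w -= lr * (z.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict(z, w, b):
    """Return 1 where the probe flags an input as adversarial, else 0."""
    return (z @ w + b > 0).astype(int)


# Shape check with random stand-ins for VLM activations (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_latent = 16, 64
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
z, x_hat = sae_forward(x, W_enc, np.zeros(d_latent), W_dec, np.zeros(d_model))
print("latent shape:", z.shape, "loss:", sae_loss(x, z, x_hat))
```

In a real setup the SAE weights would first be trained by minimizing sae_loss on activations collected from the VLM, and the probe would then be fitted on latents from labeled clean and perturbed images. The demo above only checks that the shapes line up with untrained weights.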

Hao Wang, Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh, Daisuke Kawahara • 2026

Related benchmarks

Task                                          | Dataset                         | Result                    | Rank
Adversarial Detection                         | NIPS to Medical cross-domain 17 | Precision (SSA-CWA): 98.9 | 10
Adversarial Detection                         | LLaVA to Medical cross-domain   | SSA-CWA Precision: 100    | 10
Adversarial Attack Detection                  | NIPS M-Attack in-domain 17      | Precision: 99             | 10
SSA-CWA to FOA-Attack Cross-Attack Detection  | NIPS 17                         | Precision: 100            | 6
SSA-CWA to FOA-Attack Cross-Attack Detection  | llava                           | Precision: 98             | 6
SSA-CWA to M-Attack Cross-Attack Detection    | NIPS 17                         | Precision: 100            | 6
SSA-CWA to M-Attack Cross-Attack Detection    | llava                           | Precision: 98             | 6
SSA-CWA to FOA-Attack Cross-Attack Detection  | medical                         | Precision: 97.8           | 6
SSA-CWA to M-Attack Cross-Attack Detection    | medical                         | Precision: 97.8           | 6
Adversarial Attack Detection                  | LLaVA M-Attack in-domain        | Precision: 98.8           | 5
Showing 10 of 19 rows
