BadActs: A Universal Backdoor Defense in the Activation Space

About

Backdoor attacks pose an increasingly severe security threat to Deep Neural Networks (DNNs) during their development stage. In response, backdoor sample purification has emerged as a promising defense mechanism, aiming to eliminate backdoor triggers while preserving the integrity of the clean content in the samples. However, existing approaches have focused predominantly on the word space; they are ineffective against feature-space triggers and significantly impair performance on clean data. To address this, we introduce a universal backdoor defense that purifies backdoor samples in the activation space by drawing abnormal activations towards optimized minimum clean activation distribution intervals. The advantages of our approach are twofold: (1) by operating in the activation space, our method captures information ranging from surface-level features such as words to higher-level semantic concepts such as syntax, thus counteracting diverse triggers; (2) the fine-grained, continuous nature of the activation space allows for more precise preservation of clean content while removing triggers. Furthermore, we propose a detection module based on statistical information of abnormal activations to achieve a better trade-off between clean accuracy and defense performance.

Biao Yi, Sishuo Chen, Yiming Li, Tong Li, Baolei Zhang, Zheli Liu • 2024
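
The purification and detection ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract refers to optimized minimum clean activation intervals, whereas the sketch below assumes a much simpler stand-in, a per-neuron interval of mean plus or minus k standard deviations estimated from clean activations. The function names `fit_intervals`, `purify`, and `anomaly_score`, the parameter `k`, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_intervals(clean_acts, k=3.0):
    """Per-neuron interval [mean - k*std, mean + k*std] from clean activations.

    clean_acts has shape (n_samples, n_neurons). The mean/std rule is an
    assumption made for illustration, not the paper's optimized intervals.
    """
    mu = clean_acts.mean(axis=0)
    sigma = clean_acts.std(axis=0) + 1e-8  # avoid zero-width intervals
    return mu - k * sigma, mu + k * sigma

def purify(acts, low, high):
    """Pull out-of-interval activations back to the nearest interval bound."""
    return np.clip(acts, low, high)

def anomaly_score(acts, low, high):
    """Fraction of neurons whose activation lies outside the clean interval."""
    outside = (acts < low) | (acts > high)
    return outside.mean(axis=-1)

# Synthetic stand-in for activations collected on clean validation samples.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(1000, 16))
low, high = fit_intervals(clean)

# Simulate a trigger that pushes four neurons far outside the clean range.
suspicious = clean[0].copy()
suspicious[:4] += 10.0

purified = purify(suspicious, low, high)
print("score before:", anomaly_score(suspicious, low, high))
print("score after: ", anomaly_score(purified, low, high))
```

In a sketch like this, the anomaly score could serve as a detection statistic so that only samples flagged as suspicious are purified, which is one way to read the trade-off between clean accuracy and defense performance mentioned in the abstract.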

Related benchmarks

Task	Dataset	Result
Backdoor Defense	AGNews	Attack Success Rate 57.24	105
Backdoor Defense	SST-2	CACC 89.6	65
Backdoor Attack Classification	HSOL	ASR 62.77	50
Backdoor Trigger Detection	SST-2	--	48
Backdoor Sample Detection	Yelp	AU-ROC 0.9982	16
Backdoor Sample Detection	HSOL	AU-ROC 0.9891	16
Backdoor Sample Detection	AGNews	AU-ROC 99.42	16
Backdoor Purification	SST-2	CACC 89.84	12
Backdoor Purification	Yelp	Clean Accuracy 94.6	12
