
Detecting Backdoors in Pre-trained Encoders

About

Self-supervised learning in computer vision trains on unlabeled data, such as images or (image, text) pairs, to obtain an image encoder that produces high-quality embeddings for input data. Emerging backdoor attacks on encoders expose a crucial vulnerability of self-supervised learning: downstream classifiers, even when further trained on clean data, may inherit backdoor behaviors from a trojaned encoder. Existing backdoor detection methods mainly target supervised learning settings and cannot handle pre-trained encoders, especially when input labels are unavailable. In this paper, we propose DECREE, the first backdoor detection approach for pre-trained encoders, requiring neither classifier headers nor input labels. We evaluate DECREE on over 400 encoders trojaned under 3 paradigms. We show the effectiveness of our method on image encoders pre-trained on ImageNet and on CLIP encoders pre-trained on OpenAI's 400 million (image, text) pairs. Our method consistently achieves high detection accuracy even with limited or no access to the pre-training dataset.
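DECREE's actual detection procedure (trigger inversion in embedding space) is described in the paper; the sketch below only illustrates the core intuition it builds on, under toy assumptions: a trojaned encoder maps any trigger-stamped input to nearly the same embedding, so unusually high pairwise embedding similarity under a candidate trigger is a detection signal that needs no labels and no classifier head. The linear "encoder", the trigger pixel indices, and the fixed target embedding are all hypothetical stand-ins, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E = 64, 32  # flattened input dim, embedding dim (toy sizes)

# Toy clean encoder: a random linear projection followed by L2 normalization.
W = rng.normal(size=(E, D))
def clean_encoder(x):
    z = W @ x
    return z / np.linalg.norm(z)

# Toy backdoored encoder: when the (hypothetical) trigger region is set, it
# collapses to one fixed attacker-chosen embedding, mimicking how a trojaned
# encoder maps all stamped inputs to roughly the same point.
target = rng.normal(size=E)
target /= np.linalg.norm(target)
TRIGGER_IDX = np.arange(4)  # hypothetical trigger: first 4 pixels set to 1.0
def backdoored_encoder(x):
    if np.all(x[TRIGGER_IDX] > 0.9):
        return target
    return clean_encoder(x)

def stamp(x):
    x = x.copy()
    x[TRIGGER_IDX] = 1.0
    return x

def mean_pairwise_cos(embs):
    # Embeddings are unit-norm, so dot products are cosine similarities;
    # average over all ordered pairs, excluding self-similarity.
    embs = np.stack(embs)
    n = len(embs)
    sims = embs @ embs.T
    return (sims.sum() - n) / (n * (n - 1))

inputs = [rng.normal(size=D) for _ in range(16)]
sim_clean = mean_pairwise_cos([clean_encoder(stamp(x)) for x in inputs])
sim_bd = mean_pairwise_cos([backdoored_encoder(stamp(x)) for x in inputs])
print(f"stamped-input similarity, clean encoder:      {sim_clean:.3f}")
print(f"stamped-input similarity, backdoored encoder: {sim_bd:.3f}")
```

With the backdoored encoder, all stamped inputs land on the same embedding (similarity near 1.0), while the clean encoder keeps stamped inputs well separated; a detector can threshold on this gap while also searching for the smallest trigger that induces it.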

Shiwei Feng, Guanhong Tao, Siyuan Cheng, Guangyu Shen, Xiangzhe Xu, Yingqi Liu, Kaiyuan Zhang, Shiqing Ma, Xiangyu Zhang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Backdoor Detection | GTSRB (target attack) | True Positives (TP): 111 | 7 |
| Backdoor Detection | SVHN (target attack) | True Positives (TP): 108 | 7 |
| Backdoor Detection | STL-10 (target attack) | True Positives (TP): 111 | 7 |

Other info

Code
