
BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

About

This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework, BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.
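The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that some upstream step has already turned each prompt and each generated image into a set of concept labels, and the function names (`deviations`, `stable_deviations`, `looks_backdoored`) and the `min_ratio` threshold are hypothetical. The first function stands in for the matching step (what appears in the image that the instruction never asked for), and the second for the verification step (which of those deviations recur across varied prompts).

```python
from collections import Counter


def deviations(instructed, detected):
    """Concepts found in the image that the prompt did not ask for."""
    return set(detected) - set(instructed)


def stable_deviations(samples, min_ratio=0.8):
    """Return deviations that recur in at least min_ratio of the samples.

    samples: list of (instructed_concepts, detected_concepts) pairs,
    one pair per prompt variant. min_ratio is an assumed threshold.
    """
    if not samples:
        return set()
    counts = Counter()
    for instructed, detected in samples:
        counts.update(deviations(instructed, detected))
    return {c for c, n in counts.items() if n / len(samples) >= min_ratio}


def looks_backdoored(samples, min_ratio=0.8):
    """Flag the model if any deviation is stable across prompt variants."""
    return bool(stable_deviations(samples, min_ratio))


# A suspicious model: "cat" shows up regardless of the surrounding prompt.
suspicious = [
    ({"dog", "park"}, {"cat", "park"}),
    ({"dog", "beach"}, {"cat", "beach"}),
    ({"dog", "sofa"}, {"cat", "sofa", "lamp"}),
]

# A benign model: extra content varies from prompt to prompt.
benign = [
    ({"dog", "park"}, {"dog", "park", "tree"}),
    ({"dog", "beach"}, {"dog", "beach", "wave"}),
]
```

In this sketch the one-off extras (`lamp`, `tree`, `wave`) are ignored because they do not persist across prompts, which mirrors the abstract's point that only part of the image is steadily manipulated while the rest stays diverse.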

Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xilin Zhao, Xiaochun Cao, Qingming Huang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Backdoor Detection | Stable Diffusion ObjRepAtt attacks v1.5 | Precision | 100 | 23 |
| Backdoor Detection | Stable Diffusion StyleAtt attacks v1.5 | Precision | 83.33 | 10 |
| Backdoor Detection | Stable Diffusion Overall All Attacks v1.5 | Precision | 84.79 | 6 |
| Backdoor Detection | Stable Diffusion FixIMgAtt attacks v1.5 | Precision | 66.67 | 6 |
| Backdoor Detection | Stable Diffusion PatchAtt attacks v1.5 | Precision | 85.71 | 5 |
| Backdoor Detection | Backdoor Attack Targets ObjRep, FixImg, Patch, Style | F1 Score (ObjRep) | 92.12 | 4 |
| Backdoor Detection | ObjRepAtt EvilEdit | Precision | 86.36 | 3 |
| Backdoor Detection | ObjRepAtt Rick_TPA | Precision | 96.15 | 3 |
| Backdoor Detection | FixImgAtt Villan | Precision | 92.31 | 3 |
| Backdoor Detection | PatchAtt BadT2I | Precision | 80.65 | 3 |
Showing 10 of 15 rows
