BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

About

This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework, BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.
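
The two-stage idea in the abstract (find what the image shows that the instruction did not ask for, then check whether that deviation persists across varied prompts) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the set-of-labels representation, and the 0.8 threshold are all assumptions made for illustration, and how the pattern labels would be extracted from a generated image is left out entirely.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def deviations(instructed: Set[str], observed: Set[str]) -> Set[str]:
    """Patterns present in the image that the prompt never asked for.

    Hypothetical stand-in for the matching step: both arguments are
    plain label sets, assumed to be produced by some external model.
    """
    return observed - instructed


def stable_deviations(
    pairs: Iterable[Tuple[Set[str], Set[str]]],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> Dict[str, float]:
    """Return deviations that recur in at least `threshold` of the prompts.

    Hypothetical stand-in for the verification step: a deviation that
    shows up once is treated as benign noise, while one that recurs
    across varied prompts is reported together with its frequency.
    """
    pairs = list(pairs)
    if not pairs:
        return {}
    counts: Counter = Counter()
    for instructed, observed in pairs:
        counts.update(deviations(instructed, observed))
    total = len(pairs)
    return {p: c / total for p, c in counts.items() if c / total >= threshold}


# Three varied prompts sent to a suspect model; "cat" keeps appearing
# although no prompt asked for it, while "ball" appears only once.
results = [
    ({"dog", "park"}, {"cat", "park"}),
    ({"dog", "beach"}, {"cat", "beach", "ball"}),
    ({"dog", "sofa"}, {"cat", "sofa"}),
]
flagged = stable_deviations(results)
print(flagged)
```

In this toy run only the recurring deviation survives the threshold, which mirrors the abstract's claim that a backdoor manipulates part of the image consistently while the remaining content stays diverse.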

Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xilin Zhao, Xiaochun Cao, Qingming Huang • 2026

Related benchmarks

Task	Dataset	Result
Backdoor Detection	Stable Diffusion ObjRepAtt attacks v1.5	Precision 100	23
Backdoor Detection	Stable Diffusion StyleAtt attacks v1.5	Precision 83.33	10
Backdoor Detection	Stable Diffusion Overall All Attacks v1.5	Precision 84.79	6
Backdoor Detection	Stable Diffusion FixImgAtt attacks v1.5	Precision 66.67	6
Backdoor Detection	Stable Diffusion PatchAtt attacks v1.5	Precision 85.71	5
Backdoor Detection	Backdoor Attack Targets ObjRep, FixImg, Patch, Style	F1 Score (ObjRep) 92.12	4
Backdoor Detection	ObjRepAtt EvilEdit	Precision 86.36	3
Backdoor Detection	ObjRepAtt Rick_TPA	Precision 96.15	3
Backdoor Detection	FixImgAtt Villan	Precision 92.31	3
Backdoor Detection	PatchAtt BadT2I	Precision 80.65	3

Showing 10 of 15 rows
