LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

About

As Vision-Language Models (VLMs) move into interactive, multi-turn use, safety concerns intensify for multimodal multi-turn dialogue, which is characterized by concealment of malicious intent, contextual risk accumulation, and cross-modal joint risk. These characteristics limit the effectiveness of content moderation approaches designed for single-turn or single-modality settings. To address these limitations, we first construct the Multimodal Multi-turn Dialogue Safety (MMDS) dataset, comprising 4,484 annotated dialogues and a comprehensive risk taxonomy with 8 primary dimensions and 60 subdimensions. As part of MMDS construction, we introduce Multimodal Multi-turn Red Teaming (MMRT), an automated framework for generating unsafe multimodal multi-turn dialogues. We further propose LLaVAShield, which audits the safety of both user inputs and assistant responses under specified policy dimensions in multimodal multi-turn dialogues. Extensive experiments show that LLaVAShield significantly outperforms state-of-the-art VLMs and existing content moderation tools while demonstrating strong generalization and flexible policy adaptation. Additionally, we analyze vulnerabilities of mainstream VLMs to harmful inputs and evaluate the contribution of key components, advancing understanding of safety mechanisms in multimodal multi-turn dialogues.

Guolei Huang, Qinzhi Peng, Gan Xu, Yao Huang, Yuxuan Lu, Yongjun Shen • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Safety Evaluation	UnsafeBench	F1 Score	60.87	39
Content Moderation	MMDS (test)	Accuracy	95.76	27
Harm Recognition	Violence imagery dataset	F1 Score	100	15
Multimodal Safety	JailBreakV	F1 Score	99.61	15
Multimodal Safety	MMDS-Q	F1 Score	99.13	15
Multimodal Safety	MMDS-R	F1 Score	96.53	15
Multimodal Safety	VLSBench	F1 Score	99.39	15
Multimodal Safety	MM-Safety	F1 Score	99.32	15
Multimodal Safety	Multimodal Safety Suite Avg	Macro-average F1	88.42	15
Harm Recognition	COCO Crime Scene slice	F1 Score	80.95	15

Showing 10 of 25 rows
