LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

About

As Vision-Language Models (VLMs) move into interactive, multi-turn use, safety concerns intensify for multimodal multi-turn dialogue, which is characterized by concealment of malicious intent, contextual risk accumulation, and cross-modal joint risk. These characteristics limit the effectiveness of content moderation approaches designed for single-turn or single-modality settings. To address these limitations, we first construct the Multimodal Multi-turn Dialogue Safety (MMDS) dataset, comprising 4,484 annotated dialogues and a comprehensive risk taxonomy with 8 primary dimensions and 60 subdimensions. As part of MMDS construction, we introduce Multimodal Multi-turn Red Teaming (MMRT), an automated framework for generating unsafe multimodal multi-turn dialogues. We further propose LLaVAShield, which audits the safety of both user inputs and assistant responses under specified policy dimensions in multimodal multi-turn dialogues. Extensive experiments show that LLaVAShield significantly outperforms state-of-the-art VLMs and existing content moderation tools while demonstrating strong generalization and flexible policy adaptation. Additionally, we analyze vulnerabilities of mainstream VLMs to harmful inputs and evaluate the contribution of key components, advancing understanding of safety mechanisms in multimodal multi-turn dialogues.

Guolei Huang, Qinzhi Peng, Gan Xu, Yao Huang, Yuxuan Lu, Yongjun Shen • 2025

Related benchmarks

Task                           | Dataset            | Result                 | Rank
Content Moderation             | MMDS (test)        | Accuracy: 95.76        | 27
Attack Success Rate Evaluation | MMDS MMRT (test)   | --                     | 7
Multimodal Safety Evaluation   | MM-SafetyBench     | Text-only Recall: 95.3 | 6
Multimodal Safety Evaluation   | VLGuard (test)     | Accuracy: 86.78        | 6
Over-safety measurement        | WildChat           | User Score: 15.1       | 2
Unsafe-input detection         | ActorAttack (600)  | Recall: 87.83          | 2
Unsafe-input detection         | SafeDialBench EN   | Recall: 99.07          | 2
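
Several results above are reported as Accuracy or Recall. For readers unfamiliar with these metrics, the following is a minimal sketch of their standard definitions for a binary safe/unsafe moderation task. The function names, the label encoding (1 = unsafe, 0 = safe), and the toy labels are assumptions made for illustration; they are not taken from the paper or its evaluation code, and the benchmarks may aggregate results differently (for example, per policy dimension).

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of items whose predicted label matches the annotated label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Fraction of truly positive (here: unsafe) items that were flagged positive."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must be equal in length")
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        raise ValueError("recall is undefined when there are no positive items")
    hits = sum(p == positive for p in preds_for_positives)
    return hits / len(preds_for_positives)


# Toy, made-up labels (assumption): 1 = unsafe, 0 = safe.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.2%}")  # 6 of 8 labels match
print(f"recall   = {recall(y_true, y_pred):.2%}")    # 4 of 5 unsafe items flagged
```

Recall is the natural metric for the unsafe-input detection rows, since it measures how many harmful inputs are caught, while accuracy also penalizes safe content that is wrongly flagged.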