Cross-Modal Backdoors in Multimodal Large Language Models

About

Developers increasingly construct multimodal large language models (MLLMs) by assembling pretrained components, introducing supply-chain attack surfaces. Existing security research primarily focuses on poisoning backbones such as encoders or large language models (LLMs), while the security risks of lightweight connectors remain unexplored. In this work, we propose a novel cross-modal backdoor attack that exploits this overlooked vulnerability. By poisoning only the connector using a single seed sample and several augmented variants from one modality, the adversary can subsequently activate the backdoor using inputs from other modalities. To achieve this, we first poison the connector to associate a compact latent region with a malicious target output. To activate the backdoor from other modalities, we further extract a malicious centroid from the poisoned latent representations and perform input-side optimization to steer inputs toward this latent anchor, without requiring repeated API queries or full-model access. Extensive evaluations on representative connector-based MLLM architectures, including PandaGPT and NExT-GPT, demonstrate both the effectiveness and cross-modal transferability of the proposed attack. The attack achieves up to 99.9% attack success rate (ASR) in same-modality settings, while most cross-modal settings exceed 95.0% ASR under bounded perturbations. Moreover, the attack remains highly stealthy, producing negligible leakage on clean inputs and maintaining weight-cosine similarity above 0.97 relative to benign connectors. We further show that existing defense strategies fail to effectively mitigate this threat without incurring substantial utility degradation. These findings reveal a fundamental vulnerability in multimodal alignment: a single compromised connector can establish a reusable latent-space backdoor pathway across modalities, highlighting the need for safer modular MLLM design.
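The activation step described in the abstract (extract a centroid from poisoned latents, then optimize the input toward that anchor under a bounded perturbation) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: a fixed linear map `W` stands in for the encoder-plus-connector path, the bound is taken to be an L-infinity ball of radius `eps`, the optimizer is plain projected gradient descent, and all function names are hypothetical. The last helper shows one plausible reading of the weight-cosine stealth metric, computed over flattened weights.

```python
import math

def centroid(vectors):
    """Mean of equal-length latent vectors (the 'latent anchor' in this sketch)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def matvec(W, x):
    """Apply the toy linear map W (list of rows) to the input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def latent_distance(W, x, anchor):
    """Euclidean distance between the latent W x and the anchor."""
    return math.sqrt(sum((z - a) ** 2 for z, a in zip(matvec(W, x), anchor)))

def steer_to_anchor(x0, W, anchor, eps=0.5, lr=0.1, iters=200):
    """Projected gradient descent on ||W x - anchor||^2.

    After every step the input is clipped back into the L-infinity ball
    of radius eps around the clean input x0 (assumed perturbation bound).
    """
    x = list(x0)
    for _ in range(iters):
        residual = [z - a for z, a in zip(matvec(W, x), anchor)]
        # Gradient of the squared distance with respect to x is 2 * W^T * residual.
        grad = [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(len(x))]
        x = [xj - lr * gj for xj, gj in zip(x, grad)]
        # Projection step: keep the perturbation bounded.
        x = [min(max(xj, cj - eps), cj + eps) for xj, cj in zip(x, x0)]
    return x

def weight_cosine(A, B):
    """Cosine similarity between two weight matrices, flattened to vectors."""
    a = [v for row in A for v in row]
    b = [v for row in B for v in row]
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

# Hypothetical data: three poisoned latents and a 2x3 stand-in map.
poisoned_latents = [[2.0, -1.0], [2.2, -0.8], [1.8, -1.2]]
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
anchor = centroid(poisoned_latents)
clean_input = [0.0, 0.0, 0.0]
adv_input = steer_to_anchor(clean_input, W, anchor, eps=0.5)
print("distance before:", latent_distance(W, clean_input, anchor))
print("distance after: ", latent_distance(W, adv_input, anchor))
```

In a real MLLM the map from input to latent is nonlinear, so the gradient would come from automatic differentiation through the target modality's encoder and the connector rather than from the closed form used here.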

Runhe Wang, Li Bai, Haibo Hu, Songze Li • 2026

Related benchmarks

Task	Dataset	Result
Backdoor Attack Success Rate	Cross-modal Backdoor Evaluation Set	Exact ASR 99.9%	18
Attack Success Rate	PandaGPT Image Modality	Exact ASR 99.5%	8
Attack Success Rate	PandaGPT Audio Modality	Exact ASR 99.2%	3
Attack Success Rate	PandaGPT Text Modality	Exact ASR 99.4%	3
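The page does not define the "Exact ASR" metric. A minimal sketch, assuming it means the percentage of triggered inputs whose model output exactly matches the attacker's target string after trimming surrounding whitespace, might look like the following. The function name and the matching rule are hypothetical.

```python
def exact_asr(outputs, target):
    """Percentage of outputs equal to the target string (assumed exact-match rule)."""
    if not outputs:
        raise ValueError("need at least one model output")
    wanted = target.strip()
    hits = sum(1 for out in outputs if out.strip() == wanted)
    return 100.0 * hits / len(outputs)

# Hypothetical model outputs for four triggered inputs.
sample_outputs = ["TARGET", "TARGET ", "a benign caption", "TARGET"]
print(exact_asr(sample_outputs, "TARGET"))
```

A looser substring or semantic match would give a different number, so the matching rule should be stated alongside any reported ASR.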
