SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models

About

Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS <= 0.08, SI-SNR >= 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving >97% attack success under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.
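The two quantities the abstract reports, SI-SNR and attack success rate, can be illustrated with a minimal sketch. The SI-SNR function follows the commonly used scale-invariant definition (zero-mean signals, projection of the perturbed audio onto the clean audio). The attack-success function assumes that success is counted only over samples the model answered correctly on clean audio. That counting rule, and the function names `si_snr_db` and `attack_success_rate`, are illustrative assumptions, not the authors' evaluation code.

```python
import math


def si_snr_db(reference, estimate, eps=1e-12):
    """Scale-invariant SNR in dB between clean audio and perturbed audio.

    Standard definition: remove the mean of both signals, project the
    estimate onto the reference, and compare the energy of that projection
    with the energy of the residual.
    """
    n = len(reference)
    if n == 0 or n != len(estimate):
        raise ValueError("signals must be non-empty and of equal length")

    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]

    # Projection of the estimate onto the reference direction.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Everything not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def attack_success_rate(clean_correct, adv_correct):
    """Percentage of originally correct samples that become incorrect.

    Assumption: samples the model already got wrong on clean audio are
    excluded from the denominator.
    """
    eligible = [(c, a) for c, a in zip(clean_correct, adv_correct) if c]
    if not eligible:
        return 0.0
    flipped = sum(1 for _, a in eligible if not a)
    return 100.0 * flipped / len(eligible)


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    perturbation = [1.0, 1.0, -1.0, -1.0]  # orthogonal to the clean signal
    perturbed = [c + p for c, p in zip(clean, perturbation)]
    print(si_snr_db(clean, perturbed))
    print(attack_success_rate([True, True, False, True],
                              [False, True, False, False]))
```

Under this definition, an SI-SNR of 0 dB means the perturbation carries as much energy as the clean signal component, so a threshold of SI-SNR >= 0 bounds the perturbation energy at or below that of the clean audio.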

Aafiya Hussain, Gaurav Srivastava, Alvi Ishmam, Zaber Hakim, Chris Thomas • 2026

Related benchmarks

Task                             Dataset                      Result                            Rank
Audio-Visual Question Answering  AVQA                         AVQA Clean Accuracy: 95.6         7
Audio-Visual Question Answering  MUSIC-AVQA                   Music-AVQA Clean Accuracy: 80.7   7
Audio-Visual Question Answering  AVQA (subset 2000 samples)   ASR Accuracy: 96.03               7
Audio-Visual Question Answering  Music-AVQA 2000 samples      ASR Rate: 13.8                    7
Audio-Visual Scene-Aware Dialog  AVSD (val)                   ASR (%): 59.48                    7
