
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models

About

Large Audio Language Models (LALMs) extend the capabilities of Large Language Models (LLMs) by enabling audio-based human interaction. However, recent research has shown that LALMs remain vulnerable to harmful queries because of insufficient safety alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety datasets specifically targeting LALMs are notably absent. Meanwhile, defence measures based on Supervised Fine-tuning (SFT) struggle to improve safety without causing over-rejection, which significantly compromises helpfulness. In this work, we propose an unsupervised safety fine-tuning strategy as a remedy: it reshapes the model's representation space to strengthen the safety alignment of existing LALMs while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALM safety under three input-modality conditions (audio-text, text-only, and audio-only) while increasing the over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.

Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari • 2025

Related benchmarks

Task | Dataset | Result | Rank
Harmfulness Evaluation | Figstep-audio Harmful | ASR 52.4 | 15
Harmfulness Evaluation | SORRY-Bench audio | ASR Accuracy 38.41 | 15
Harmfulness Evaluation | AdvBench-audio Harmful | ASR Score 3.27 | 15
Helpfulness evaluation | Figstep-audio Harmful-Safe | BRR 70 | 15
Helpfulness Assessment | AdvBench-audio Safe | BRR 86.44 | 3
Harmfulness Assessment | AJailBench | ASR 54 | 3