
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

About

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs often suffer from catastrophic forgetting of the LLM's original abilities, so balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency, enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
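The core idea above can be illustrated with a minimal sketch: instead of pairing audio with externally written answers, the backbone LLM is prompted with the audio clip's textual metadata and generates the training target itself, keeping targets in the LLM's own distribution. All names below (`backbone_llm`, `build_training_sample`) are illustrative assumptions, not the authors' actual pipeline or API.

```python
# Hedged sketch of self-generated cross-modal alignment: the backbone LLM
# produces its own training target from an audio clip's textual metadata.

def backbone_llm(prompt: str) -> str:
    """Stand-in for the frozen backbone LLM. A real pipeline would call
    the actual model; here we just return a placeholder answer."""
    return f"[LLM answer to: {prompt}]"

def build_training_sample(audio_path: str, metadata: str, instruction: str) -> dict:
    # 1. Render the audio's textual description plus an instruction
    #    into a text-only prompt for the backbone LLM.
    prompt = f"Audio description: {metadata}\nInstruction: {instruction}"
    # 2. The backbone LLM generates the target response itself, so
    #    fine-tuning never pushes the model far from its own outputs.
    target = backbone_llm(prompt)
    # 3. The (audio, instruction, self-generated target) triple becomes
    #    one training example; at train time the model conditions on the
    #    raw audio instead of the metadata text.
    return {"audio": audio_path, "instruction": instruction, "target": target}

sample = build_training_sample(
    "clip_0001.wav",
    "A dog barks twice, then a car passes by.",
    "What sounds occur in this clip, and in what order?",
)
print(sample["target"])
```

Because the targets come from the backbone LLM itself, this construction is task-agnostic: any audio dataset with textual metadata can be converted into instruction-following samples without per-task annotation.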

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee (2025)

Related benchmarks

Task | Dataset | Result | Rank
Audio Understanding | MMAU v05.15.25 (test) | Sound Score: 66.9 | 53
Multimodal Audio Understanding | MMAU mini v05.15.25 (test) | Sound Accuracy: 70.3 | 25
Multimodal Audio Reasoning | MMAR | Mean Score: 50.8 | 22
Audio Understanding | MMAU mini original (test) | Accuracy (Sound Domain): 58.26 | 21
Audio Reasoning | MMAU-Pro | Average Score: 40.6 | 18
Massive Multi-discipline Audio Understanding | MMAU | Speech Score: 59.16 | 17
Audio Reasoning | MMAU mini 1.0 (test) | Sound Score: 70.27 | 15
Reasoning | VoiceBench | MMSU Accuracy (Audio): 60.87 | 13
Audio Perception and Reasoning | MMAR within CAFE framework (overall) | Perception Accuracy: 23.19 | 13
Speech-to-text reasoning and semantic understanding | VoiceBench (test) | Alpaca Eval: 3.73 | 13

(Showing 10 of 17 benchmark rows.)
