DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

About

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs often suffer from catastrophic forgetting of the LLM's original abilities, so balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach aims to preserve the LLM's native language proficiency, thereby enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
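As a rough illustration of the self-generated idea described above, the following Python sketch builds training samples whose targets come from the backbone LLM itself. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not say how the LLM is prompted, so rendering audio metadata as a text description is an assumption here, and `AudioRecord`, `describe_audio`, `build_samples`, and the `backbone_llm` callable are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AudioRecord:
    audio_path: str
    metadata: Dict[str, str]  # e.g. transcript, emotion (assumed fields)


@dataclass
class TrainingSample:
    audio_path: str
    instruction: str
    target: str  # text produced by the backbone LLM itself


def describe_audio(metadata: Dict[str, str]) -> str:
    """Render metadata as a text description (hypothetical format)."""
    return " ".join(f"[{key}: {value}]" for key, value in sorted(metadata.items()))


def build_samples(
    records: List[AudioRecord],
    instructions: List[str],
    backbone_llm: Callable[[str], str],
) -> List[TrainingSample]:
    """Pair every record with every instruction; the LLM writes the target."""
    samples = []
    for record in records:
        description = describe_audio(record.metadata)
        for instruction in instructions:
            # Assumption: the LLM sees only text (description + instruction).
            prompt = f"{description}\n{instruction}"
            samples.append(
                TrainingSample(record.audio_path, instruction, backbone_llm(prompt))
            )
    return samples


def stub_llm(prompt: str) -> str:
    """Stand-in for a real backbone LLM call, used only to run the sketch."""
    return f"response to: {prompt}"
```

Because the targets are written by the same LLM that is later fine-tuned, they stay in that model's own output style, which is the intuition behind preserving its language proficiency. Any text-generation function can be passed in place of `stub_llm`.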

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Understanding | MMAU v05.15.25 (test) | Sound Score | 66.9 | 53 |
| Audio-Language Understanding | MMAU 1.0 (test) | Accuracy | 65.2 | 27 |
| Audio-Language Reasoning | MMAR 1.0 (test) | Accuracy | 46.4 | 27 |
| Audio-Language Understanding (MCQ) | MMAU-Pro 1.0 (test) | Accuracy | 43.5 | 27 |
| Multimodal Audio Understanding | MMAU mini v05.15.25 (test) | Sound Accuracy | 70.3 | 25 |
| Multimodal Audio Reasoning | MMAR | Mean Score | 50.8 | 22 |
| Audio Understanding | MMAU mini original (test) | Accuracy (Sound Domain) | 58.26 | 21 |
| Instruction Following | Speech-IFEval | IF Rate | 93.89 | 18 |
| Audio Reasoning | MMAU-Pro | Average Score | 40.6 | 18 |
| Massive Multi-discipline Audio Understanding | MMAU | Speech Score | 59.16 | 17 |
Showing 10 of 20 rows
