
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

About

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs often suffer from catastrophic forgetting of the LLM's original abilities, so balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency, enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
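The core idea above can be illustrated with a minimal sketch: instead of pairing audio with externally written answers, the backbone LLM is prompted with the audio clip's textual metadata and generates the training target itself, keeping targets in the LLM's own distribution. All names below (`backbone_llm`, `build_training_sample`) are illustrative assumptions, not the authors' actual pipeline or API.

```python
# Hedged sketch of self-generated cross-modal alignment: the backbone LLM
# produces its own training target from an audio clip's textual metadata.

def backbone_llm(prompt: str) -> str:
    """Stand-in for the frozen backbone LLM. A real pipeline would call
    the actual model; here we just return a placeholder answer."""
    return f"[LLM answer to: {prompt}]"

def build_training_sample(audio_path: str, metadata: str, instruction: str) -> dict:
    # 1. Render the audio's textual description plus an instruction
    #    into a text-only prompt for the backbone LLM.
    prompt = f"Audio description: {metadata}\nInstruction: {instruction}"
    # 2. The backbone LLM generates the target response itself, so
    #    fine-tuning never pushes the model far from its own outputs.
    target = backbone_llm(prompt)
    # 3. The (audio, instruction, self-generated target) triple becomes
    #    one training example; at train time the model conditions on the
    #    raw audio instead of the metadata text.
    return {"audio": audio_path, "instruction": instruction, "target": target}

sample = build_training_sample(
    "clip_0001.wav",
    "A dog barks twice, then a car passes by.",
    "What sounds occur in this clip, and in what order?",
)
print(sample["target"])
```

Because the targets come from the backbone LLM itself, this construction is task-agnostic: any audio dataset with textual metadata can be converted into instruction-following samples without per-task annotation.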

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee (2025)

Related benchmarks

Task | Dataset | Result | Rank
Audio Understanding | MMAU v05.15.25 (test) | Sound Score: 66.9 | 53
Multimodal Audio Understanding | MMAU mini v05.15.25 (test) | Sound Accuracy: 70.3 | 25
Multimodal Audio Reasoning | MMAR | Mean Score: 50.8 | 22
Audio Understanding | MMAU mini original (test) | Accuracy (Sound Domain): 58.26 | 21
Audio Reasoning | MMAU-Pro | Average Score: 40.6 | 18
Massive Multi-discipline Audio Understanding | MMAU | Speech Score: 59.16 | 17
Audio Reasoning | MMAU mini 1.0 (test) | Sound Score: 70.27 | 15
Reasoning | VoiceBench | MMSU Accuracy (Audio): 60.87 | 13
Audio Perception and Reasoning | MMAR within CAFE framework (overall) | Perception Accuracy: 23.19 | 13
Speech-to-text reasoning and semantic understanding | VoiceBench (test) | Alpaca Eval: 3.73 | 13

(Showing 10 of 17 benchmark rows.)
