DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
About
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets, but they often suffer from catastrophic forgetting of the LLM's original abilities, making the balance between knowledge retention and audio perception a critical challenge. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency, thereby enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies show that the self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
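To make the self-generated alignment idea concrete, here is a minimal sketch of how such a data-construction step might look. This is an illustrative assumption, not the authors' actual pipeline: the helper names (`audio_to_text_description`, `build_training_example`), the metadata fields, and the `llm` callable are all hypothetical. The key point it demonstrates is that the training target is produced by the backbone LLM itself from a textual rendering of the audio, so the target stays in-distribution for that LLM.

```python
# Hedged sketch of self-generated cross-modal target construction.
# All names and metadata fields here are illustrative assumptions.

def audio_to_text_description(sample: dict) -> str:
    """Serialize audio metadata into a textual stand-in for the audio clip."""
    parts = []
    if sample.get("transcript"):
        parts.append(f'[Transcript] "{sample["transcript"]}"')
    for key in ("speaker_gender", "emotion", "background_sound"):
        if key in sample:
            parts.append(f"[{key}] {sample[key]}")
    return " ".join(parts)

def build_training_example(sample: dict, instruction: str, llm) -> dict:
    """Pair raw audio with a target generated by the backbone LLM itself."""
    description = audio_to_text_description(sample)
    prompt = f"{description}\n\n{instruction}"
    target = llm(prompt)  # `llm` is any text-in/text-out callable (hypothetical)
    return {
        "audio": sample["audio_path"],
        "instruction": instruction,
        "target": target,
    }

# Toy stand-in for the backbone LLM, only to make the sketch runnable.
def toy_llm(prompt: str) -> str:
    return f"(model response to: {prompt[:40]}...)"

example = build_training_example(
    {"audio_path": "clip_001.wav", "transcript": "hello there", "emotion": "happy"},
    "Describe what you hear in this clip.",
    toy_llm,
)
print(example["target"])
```

In a real pipeline, the generated `target` would then be paired with the raw waveform for audio-instruction training, while the text prompt that produced it is discarded.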
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Understanding | MMAU v05.15.25 (test) | Sound Score | 66.9 | 53 |
| Multimodal Audio Understanding | MMAU mini v05.15.25 (test) | Sound Accuracy | 70.3 | 25 |
| Multimodal Audio Reasoning | MMAR | Mean Score | 50.8 | 22 |
| Audio Understanding | MMAU mini original (test) | Accuracy (Sound Domain) | 58.26 | 21 |
| Audio Reasoning | MMAU-Pro | Average Score | 40.6 | 18 |
| Massive Multi-discipline Audio Understanding | MMAU | Speech Score | 59.16 | 17 |
| Audio Reasoning | MMAU mini 1.0 (test) | Sound Score | 70.27 | 15 |
| Reasoning | VoiceBench | MMSU Accuracy (Audio) | 60.87 | 13 |
| Audio Perception and Reasoning | MMAR within CAFE framework (overall) | Perception Accuracy | 23.19 | 13 |
| Speech-to-text reasoning and semantic understanding | VoiceBench (test) | Alpaca Eval | 3.73 | 13 |