DIFFA: Large Language Diffusion Models Can Listen and Understand

About

Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at https://github.com/NKU-HLT/DIFFA.git.

Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Audio Understanding	MMAU v05.15.25 (test-mini)	Sound Score	46.25	54
Speech-to-Text Question-Answering	WebQ	Accuracy	43.4	26
Speech-to-Text Question-Answering	TriviaQA	Accuracy	36	26
Speech-to-Text Question-Answering	LlamaQ	Accuracy	58.3	26
Audio Understanding	MMAR (comprehensive evaluation)	Sound Score	37.58	25
General Audio Understanding	VoiceBench	AlpacaEval Score	3.78	19
General Audio Understanding	MMSU 1.0 (test)	Perception Semantics	52.67	16
Speech-to-Text Question-Answering	OBQA	Accuracy	35.6	16
Speech Reasoning	MMSU S→T only	Accuracy	29.6	9
Speech-to-Text Question-Answering	LlamaQ, TriviaQA, WebQ, OBQA S→T Average	Accuracy	43.3	9

Showing 10 of 15 rows
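
The S→T average listed for LlamaQ, TriviaQA, WebQ, and OBQA appears to be the unweighted mean of the four per-dataset accuracies in the table. The following is a minimal sketch of that check, assuming a simple arithmetic mean (the page does not state how the average was computed); the names `accuracies` and `mean_accuracy` are illustrative only.

```python
# Per-dataset speech-to-text QA accuracies, copied from the table above.
accuracies = {
    "WebQ": 43.4,
    "TriviaQA": 36.0,
    "LlamaQ": 58.3,
    "OBQA": 35.6,
}


def mean_accuracy(scores):
    """Return the unweighted arithmetic mean of the accuracy values."""
    return sum(scores.values()) / len(scores)


# Assumption: the reported average weights every dataset equally.
average = mean_accuracy(accuracies)
print(f"S->T average accuracy: {average:.1f}")
```

Rounded to one decimal place, the result agrees with the 43.3 reported in the table, which is consistent with equal weighting across the four datasets.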
