Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Fun-Audio-Chat Technical Report

About

Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo, at https://github.com/FunAudioLLM/Fun-Audio-Chat .

Tongyi Fun Team, Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, Jingren Zhou• 2025

Related benchmarks

TaskDatasetResultRank
Automatic Speech RecognitionLibriSpeech Other
WER3.73
123
Automatic Speech RecognitionLibriSpeech Clean
WER1.6
107
Question AnsweringMMLU-Pro
Accuracy61.12
91
Question AnsweringMMLU-Redux
Accuracy74.7
57
Automatic Speech RecognitionFleurs En
WER7.61
49
Speech-to-Speech Question-AnsweringLlama Questions
Accuracy77.76
27
Audio UnderstandingMMAU (test)--
25
Speech-to-Speech Question-AnsweringTriviaQA
Accuracy49.02
22
Speech RecognitionCommon Voice EN
WER7.79
16
Spoken Dialogue EvaluationURO-Bench English Basic Track
Repeat Rate97.18
16
Showing 10 of 56 rows

Other info

Follow for update