Fun-Audio-Chat Technical Report

About

Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo, at https://github.com/FunAudioLLM/Fun-Audio-Chat .

Tongyi Fun Team, Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, Jingren Zhou• 2025

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech Other	WER3.73	123
Automatic Speech Recognition	LibriSpeech Clean	WER1.6	107
Question Answering	MMLU-Pro	Accuracy61.12	91
Question Answering	MMLU-Redux	Accuracy74.7	57
Automatic Speech Recognition	Fleurs En	WER7.61	49
Speech-to-Speech Question-Answering	Llama Questions	Accuracy77.76	27
Audio Understanding	MMAU (test)	--	25
Speech-to-Speech Question-Answering	TriviaQA	Accuracy49.02	22
Speech Recognition	Common Voice EN	WER7.79	16
Spoken Dialogue Evaluation	URO-Bench English Basic Track	Repeat Rate97.18	16

Showing 10 of 56 rows

Other info

Follow for update

@wizwand_team Discord