DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

About

Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer", a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.

Yuxuan Lou, Ziming Wu, Yaochen Wang, Yong Liu, Yingxuan Ren, Fuming Lai, Shaobing Lian, Jie Tang, Yang You • 2026
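
The abstract describes joint generation of text reasoning and speech tokens by iterative denoising under a masked diffusion framework with modality-specific masking schedules. The toy sketch below illustrates that general idea only. The schedule shapes in `mask_ratio`, the random unmasking order in `denoise`, and the oracle `predict` callback are all assumptions made for illustration; the listing does not state the paper's actual schedules, unmasking strategy, or model.

```python
import random

MASK = "<mask>"

def mask_ratio(t, modality):
    """Hypothetical per-modality schedules (not taken from the paper).

    t is the diffusion time in [0, 1]: 0 means clean, 1 means fully masked.
    """
    if modality == "text":
        return t          # assumed: linear schedule for text tokens
    return t ** 2         # assumed: speech tokens are masked later than text

def corrupt(tokens, modalities, t, rng):
    """Forward process: mask each token independently with the
    probability given by its modality's schedule at time t."""
    return [
        MASK if rng.random() < mask_ratio(t, m) else tok
        for tok, m in zip(tokens, modalities)
    ]

def denoise(tokens, predict, steps, rng):
    """Reverse process: fill in a share of the masked positions at each
    step, so that no mask remains after the final step."""
    seq = list(tokens)
    for step in range(steps, 0, -1):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        k = max(1, len(masked) // step)   # at step == 1 this is all of them
        for i in rng.sample(masked, k):
            seq[i] = predict(seq, i)      # a trained model would go here
    return seq

# Toy sequence: a text "thought" followed by discrete speech-token ids.
target = ["think", "answer", "is", "paris", "s17", "s4", "s92", "s8"]
modalities = ["text"] * 4 + ["speech"] * 4

rng = random.Random(0)
noisy = corrupt(target, modalities, t=1.0, rng=rng)   # fully masked start
oracle = lambda seq, i: target[i]                     # stand-in for a model
restored = denoise(noisy, oracle, steps=4, rng=rng)
```

Because `predict` here simply returns the ground-truth token, `restored` equals `target`; a real system would replace the oracle with a network that predicts each masked token from the partially unmasked sequence.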

Related benchmarks

Task	Dataset	Result
Language Understanding	MMLU	Accuracy 66.2	844
Automatic Speech Recognition	LibriSpeech Other	WER 4.3	123
Question Answering	TriviaQA	Accuracy 60.3	117
Automatic Speech Recognition	VoxPopuli	WER 7.1	38
Speech-to-Text Question-Answering	WebQ	Accuracy 49.7	26
Speech-to-Text Question-Answering	LlamaQ	Accuracy 68.5	26
Speech-to-Text Question-Answering	TriviaQA	Accuracy 33.5	26
Speech-to-Speech Question-Answering	WebQ	Accuracy 61.5	25
Automatic Speech Recognition	LS Clean	WER 3	25
Speech-to-Speech Question-Answering	TriviaQA	Accuracy 45.4	22

Showing 10 of 17 rows
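
Several rows above report word error rate (WER). As a reminder of how that metric is conventionally computed, here is a minimal sketch: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Whitespace tokenization and the absence of text normalization are simplifying assumptions; the listing does not say which normalization the reported numbers use, and the example sentences are invented.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Multiplying the returned fraction by 100 gives the percentage form in which WER is usually quoted.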
