
DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

About

Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer", a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.
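
The forward (masking) side of a masked diffusion process with modality-specific schedules can be illustrated with a small sketch. This is not the authors' implementation: the linear text schedule, the quadratic speech schedule, the `<mask>` token, and the function names are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def text_mask_prob(t: float) -> float:
    """Assumed linear schedule: probability a text token is masked at time t."""
    return t


def speech_mask_prob(t: float) -> float:
    """Assumed quadratic schedule: speech tokens are masked more slowly early on."""
    return t ** 2


SCHEDULES = {"text": text_mask_prob, "speech": speech_mask_prob}


def mask_sequence(tokens, modalities, t, rng):
    """Mask each token independently with its modality's probability at time t in [0, 1]."""
    masked = []
    for token, modality in zip(tokens, modalities):
        p = SCHEDULES[modality](t)
        masked.append(MASK if rng.random() < p else token)
    return masked


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy joint sequence: text reasoning tokens followed by discrete speech tokens.
    tokens = ["think", "step", "by", "step", "s12", "s7", "s98", "s3"]
    modalities = ["text"] * 4 + ["speech"] * 4
    for t in (0.0, 0.5, 1.0):
        print(t, mask_sequence(tokens, modalities, t, rng))
```

At t = 0 nothing is masked and at t = 1 everything is masked; a denoising model would be trained to recover the original tokens at the masked positions, and generation would run the process in reverse over several steps.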

Yuxuan Lou, Ziming Wu, Yaochen Wang, Yong Liu, Yingxuan Ren, Fuming Lai, Shaobing Lian, Jie Tang, Yang You • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Understanding | MMLU | Accuracy 66.2 | 756 |
| Question Answering | TriviaQA | Accuracy 60.3 | 85 |
| Automatic Speech Recognition | LibriSpeech Other | WER 4.3 | 75 |
| Automatic Speech Recognition | VoxPopuli | WER 7.1 | 27 |
| Automatic Speech Recognition | LS Clean | WER 3 | 25 |
| Speech-to-Text Question-Answering | LlamaQ | Accuracy 68.5 | 9 |
| Speech-to-Text Question-Answering | WebQ | Accuracy 49.7 | 9 |
| Speech-to-Text Question-Answering | LlamaQ, TriviaQA, WebQ, OBQA S→T Average | Accuracy 57.6 | 9 |
| Speech-to-Text Question-Answering | OBQA | Accuracy 51.3 | 9 |
| Speech Reasoning | MMSU S→T only | Accuracy 39 | 9 |
Showing 10 of 17 rows
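
The WER figures in the table follow the standard word error rate definition: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal sketch is given below; the whitespace tokenization and the function name `word_error_rate` are assumptions, and the text normalization used for the reported numbers is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Multiplying the returned fraction by 100 gives a percentage, the form in which WER is usually reported.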
