DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
About
Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer", a paradigm in which speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show that DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.
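The abstract mentions masked diffusion with modality-specific masking schedules but does not spell out the schedules themselves. The following is a minimal sketch of what the forward (corruption) step of such a setup might look like, not the authors' implementation. The linear text schedule, the quadratic speech schedule, the `<mask>` symbol, and every function name here are assumptions made purely for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def text_schedule(t: float) -> float:
    """Assumed linear schedule: mask probability for text tokens equals t."""
    return t


def speech_schedule(t: float) -> float:
    """Assumed quadratic schedule: speech tokens are masked more slowly at small t."""
    return t ** 2


SCHEDULES = {"text": text_schedule, "speech": speech_schedule}


def mask_sequence(tokens, modalities, t, rng):
    """Independently mask each token with its modality's probability at time t in [0, 1]."""
    corrupted = []
    for token, modality in zip(tokens, modalities):
        p = SCHEDULES[modality](t)
        corrupted.append(MASK if rng.random() < p else token)
    return corrupted


# Toy mixed sequence: two reasoning-trace (text) tokens, then two speech tokens.
tokens = ["think", "step", "s12", "s7"]
modalities = ["text", "text", "speech", "speech"]
rng = random.Random(0)
print(mask_sequence(tokens, modalities, 0.5, rng))
```

At `t = 0` nothing is masked and at `t = 1` everything is masked. In between, the two modalities are corrupted at different rates, which is the general idea a modality-specific schedule expresses. A denoiser trained on such corrupted sequences would then be applied iteratively at generation time to fill masked positions.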
Related benchmarks
| Task | Dataset | Metric | Value (%) | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU | Accuracy | 66.2 | 756 |
| Question Answering | TriviaQA | Accuracy | 60.3 | 85 |
| Automatic Speech Recognition | LibriSpeech Other | WER | 4.3 | 75 |
| Automatic Speech Recognition | VoxPopuli | WER | 7.1 | 27 |
| Automatic Speech Recognition | LS Clean | WER | 3 | 25 |
| Speech-to-Text Question-Answering | LlamaQ | Accuracy | 68.5 | 9 |
| Speech-to-Text Question-Answering | WebQ | Accuracy | 49.7 | 9 |
| Speech-to-Text Question-Answering | LlamaQ, TriviaQA, WebQ, OBQA S→T Average | Accuracy | 57.6 | 9 |
| Speech-to-Text Question-Answering | OBQA | Accuracy | 51.3 | 9 |
| Speech Reasoning | MMSU S→T only | Accuracy | 39 | 9 |
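
Several results on this page are word error rates (WER). As a reminder of how that metric is commonly defined, here is a small sketch that computes WER as the word-level edit distance divided by the reference length. The scoring pipeline used for the reported numbers is not described here, so the whitespace tokenization and the absence of text normalization are assumptions, and `word_error_rate` is an illustrative name only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The function returns a fraction, so a return value of 0.043 would correspond to a WER of 4.3 when expressed in percent.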