From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training
About
Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates AR text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order AR property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Comprehensive experiments on Audio-QA, automatic speech recognition (ASR), automated audio captioning (AAC), and speech-to-speech benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each component. We will open-source our models, data, and code to facilitate future research in this direction.
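The modality-aware attention described in the abstract can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function names, the `"text"`/`"audio"` labels, and the exact rule (a position attends to every earlier position, and an audio position additionally attends to later positions inside its own contiguous audio span) are assumptions made for illustration. The sketch also assumes that neighbouring audio spans are always separated by at least one text token.

```python
from typing import List


def span_ids(modalities: List[str]) -> List[int]:
    """Give each position the index of its contiguous same-modality span."""
    ids: List[int] = []
    current = -1
    previous = None
    for modality in modalities:
        if modality != previous:
            current += 1
            previous = modality
        ids.append(current)
    return ids


def modality_aware_mask(modalities: List[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed rule (hypothetical, for illustration only):
      * every position sees itself and all earlier positions (causal);
      * an audio position also sees later positions in its own audio span.
    """
    spans = span_ids(modalities)
    n = len(modalities)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_audio_span = modalities[i] == "audio" and spans[i] == spans[j]
            mask[i][j] = causal or same_audio_span
    return mask


# Hypothetical interleaved sequence: two text tokens, a three-token audio
# span, then one more text token.
example = ["text", "text", "audio", "audio", "audio", "text"]
example_mask = modality_aware_mask(example)
```

In this example the first audio token can attend to the last token of its span (bidirectional within audio), while the text tokens before the span cannot look ahead into it. A boolean mask of this shape could then be passed to an attention layer in whatever framework is used.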
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | WenetSpeech (meeting) | WER | 27.59 | 23 |
| Automatic Speech Recognition | AISHELL-1 | WER | 5.78 | 22 |
| Automatic Speech Recognition | WenetSpeech net | WER | 19.85 | 20 |
| Automatic Speech Recognition | AISHELL-2 | WER | 6.8 | 15 |
| Automated Audio Captioning | MACS | AAC Score | 48.87 | 12 |
| Spoken Dialogue | URO-Bench Basic Track | Understanding Accuracy | 57.63 | 12 |
| Audio Question Answering | AudioSet En | Audio QA Score | 26.73 | 12 |
| Automated Audio Captioning | Clotho | AAC Score | 12.63 | 12 |
| Speech-to-Speech | URO-Bench Pro Task | Understanding Score | 32.38 | 12 |
| Audio Question Answering | Listen, Quizzing (LQ.) | Audio-QA Score | 40.07 | 12 |