
From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

About

Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates AR text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order AR property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Comprehensive experiments on Audio-QA, ASR, AAC and speech-to-speech benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each component. We will open-source our models, data and code to facilitate future research in this direction.

Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu • 2025
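
The modality-aware attention mechanism described in the abstract (causal decoding for text, bidirectional modeling within audio spans) can be illustrated with a small mask builder. This is only a sketch under stated assumptions, not the authors' implementation: the function name `modality_aware_mask`, the boolean `is_audio` input, and the rule that each contiguous run of audio tokens forms one span are all hypothetical choices made for illustration.

```python
def modality_aware_mask(is_audio):
    """Build an attention mask where mask[i][j] is True if position i may attend to j.

    is_audio: list of booleans, True for audio tokens and False for text tokens.
    Assumption (not from the paper): each contiguous run of audio tokens is one span.
    """
    n = len(is_audio)

    # Give every contiguous audio run its own span id; text tokens get None.
    span_id = []
    current = -1
    for i, audio in enumerate(is_audio):
        if audio and (i == 0 or not is_audio[i - 1]):
            current += 1
        span_id.append(current if audio else None)

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Causal rule: every token sees itself and all earlier tokens.
            causal = j <= i
            # Bidirectional rule: audio tokens also see later tokens in their own span.
            same_span = span_id[i] is not None and span_id[i] == span_id[j]
            mask[i][j] = causal or same_span
    return mask


if __name__ == "__main__":
    # Two text tokens, a three-token audio span, then one more text token.
    layout = [False, False, True, True, True, False]
    for row in modality_aware_mask(layout):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch, text rows stay strictly lower-triangular, while rows inside an audio span additionally attend to later positions of the same span, which is what would allow those positions to be denoised in parallel.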

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | WenetSpeech (meeting) | WER 27.59 | 23 |
| Automatic Speech Recognition | AISHELL-1 | WER 5.78 | 22 |
| Automatic Speech Recognition | WenetSpeech net | WER 19.85 | 20 |
| Automatic Speech Recognition | AISHELL-2 | WER 6.8 | 15 |
| Automated Audio Captioning | MACS | AAC Score 48.87 | 12 |
| Spoken Dialogue | URO-Bench Basic Track | Understanding Accuracy 57.63 | 12 |
| Audio Question Answering | AudioSet En | Audio QA Score 26.73 | 12 |
| Automated Audio Captioning | Clotho | AAC Score 12.63 | 12 |
| Speech-to-Speech | URO-Bench Pro Task | Understanding Score 32.38 | 12 |
| Audio Question Answering | Listen, Quizzing (LQ.) | Audio-QA Score 40.07 | 12 |
Showing 10 of 15 rows
