
Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

About

Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses that lack semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves an improvement of over 1.5 ROUGE-1 points over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation data, such as Switchboard. We will publicly release our models and training code.
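To illustrate the idea, the CoT formulation can be thought of as serializing each dialogue turn into the same sub-tasks the multimodal LM saw in pre-training: recognize the user's speech (ASR), generate a text reply (text LM), then synthesize that reply as speech tokens (TTS). The sketch below is a minimal, hypothetical illustration; the tag names, fields, and token format are assumptions for exposition, not the paper's actual serialization.

```python
# Hedged sketch: one possible way to assemble a chain-of-thought training
# target for a spoken dialogue turn. All special tags (<asr>, <text_response>,
# <tts>, <aud_*>) are illustrative placeholders, not the paper's format.

def build_cot_target(user_transcript: str,
                     response_text: str,
                     response_speech_tokens: list[int]) -> str:
    """Serialize a turn as ASR -> text response -> TTS, mirroring the
    multimodal LM's pre-training tasks."""
    speech = " ".join(f"<aud_{t}>" for t in response_speech_tokens)
    return (f"<asr> {user_transcript} "          # step 1: transcribe user speech
            f"<text_response> {response_text} "  # step 2: text LM generates reply
            f"<tts> {speech}")                   # step 3: reply rendered as speech tokens

target = build_cot_target("how are you", "i am fine thanks", [12, 7, 93])
```

Training on such targets keeps every conversational example structurally close to the ASR, text LM, and TTS examples the model was pre-trained on, which is the alignment the abstract credits for data efficiency.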

Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe • 2025

Related benchmarks

Task                                                       Dataset          Metric        Result  Rank
Spoken Dialogue System (SDS) Semantic Quality Evaluation   Eval2000 (test)  ROUGE-L       8.4     6
Audio Quality Evaluation                                   Eval2000         UTMOS         2.03    6
Speaking Style Consistency                                 Eval2000 (test)  Emotion Rank  2.81    5
