How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

About

Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode. These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.
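The two routing strategies compared above can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that channel fusion sums time-aligned agent and user frame embeddings, and that cross-attention routing adds a gated, causally masked attention readout over the user frames to each LLM hidden state, with identity projections. The function names, the summation, the scalar gate, and the causal window are all hypothetical choices made for illustration.

```python
import math


def channel_fusion(agent_frames, user_frames):
    """Assumed fusion op: sum time-aligned embeddings so the user stream
    becomes part of the LLM input sequence itself."""
    return [[a + u for a, u in zip(af, uf)]
            for af, uf in zip(agent_frames, user_frames)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_route(hidden, user_memory, gate):
    """Assumed adapter: each hidden state h_t attends over user frames 0..t
    (the user stream stays external memory) and receives a gated residual.
    Identity query/key/value projections are used to keep the sketch small."""
    d = len(hidden[0])
    routed = []
    for t, h in enumerate(hidden):
        visible = user_memory[: t + 1]  # causal: no future user frames
        scores = [sum(hi * ki for hi, ki in zip(h, k)) / math.sqrt(d)
                  for k in visible]
        weights = softmax(scores)
        context = [sum(w * v[i] for w, v in zip(weights, visible))
                   for i in range(d)]
        routed.append([hi + gate * ci for hi, ci in zip(h, context)])
    return routed
```

In this sketch, a gate of 0 leaves the hidden states exactly as they were, which mirrors the abstract's point that cross-attention routing keeps the LLM generation context separate from the user stream, whereas fusion always alters the input sequence.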

Hui Lu, Xueyuan Chen, Huimeng Wang, Shuhai Peng, Shiyin Kang, Xixin Wu, Zhiyong Wu • 2026

Related benchmarks

Task	Dataset	Result
Smooth Turn Taking	Full-Duplex-Bench v1.0	TOR 0.983	11
User Interruption	Full-Duplex-Bench v1.0	TOR 1	8
Spoken Question Answering	AlpacaEval	--	7
Spoken Question Answering	WebQuestions	Text Score 30.3	6
Spoken Question Answering	LlamaQ	Text Score 57.3	6
Spoken Question Answering	TriviaQA	Text Score 19.6	6
Backchannel	Full-Duplex-Bench v1.5	Respond Rate 5	4
Interruption	Full-Duplex-Bench v1.5	Respond Rate 72	4
