Beyond Words: Multimodal LLM Knows When to Speak

About

Chatbots built on large language models (LLMs) generate fluent responses but often struggle to decide when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs that leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. To this end, we introduce a curated multimodal dataset built from real-world dyadic conversational videos, with temporally aligned modalities and fine-grained reaction-type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response-type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.
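To make the "dense response-type prediction" framing concrete, the sketch below turns sparse, timestamped annotations into one label per fixed-length window of a conversation. It is only an illustration of the task formulation described in the abstract, not the authors' pipeline: the class names, the `Event` record, the one-second window, and the tie-breaking rule are all assumptions introduced here.

```python
import math
from dataclasses import dataclass
from enum import Enum


class ResponseType(Enum):
    """Hypothetical label set mirroring the three choices in the abstract."""
    SILENT = 0
    REACTION = 1        # short listener reaction
    FULL_RESPONSE = 2   # start of a full turn


@dataclass
class Event:
    """Hypothetical annotation: when the listener responded, and how."""
    start: float        # seconds from the start of the clip
    kind: ResponseType


def dense_labels(events, duration, step=1.0):
    """Assign one label to every `step`-second window of the clip.

    Windows without an annotated event default to SILENT. If several
    events start in the same window, the higher-valued class wins
    (an arbitrary tie-breaking rule chosen for this sketch).
    """
    n_windows = math.ceil(duration / step)
    labels = [ResponseType.SILENT] * n_windows
    for ev in events:
        idx = int(ev.start // step)
        if 0 <= idx < n_windows and ev.kind.value > labels[idx].value:
            labels[idx] = ev.kind
    return labels


if __name__ == "__main__":
    events = [
        Event(start=1.2, kind=ResponseType.REACTION),
        Event(start=3.7, kind=ResponseType.FULL_RESPONSE),
    ]
    for i, label in enumerate(dense_labels(events, duration=5.0)):
        print(f"window {i}: {label.name}")
```

Under this framing, a streaming agent would emit one such prediction per window as the conversation unfolds, with most windows labeled as silence.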

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin • 2025

Related benchmarks

Task	Dataset	Result
Response type prediction	Short-Clips (test)	Affirmation Score 62.21	30
Response type prediction	Full-Videos (test)	Affm. Precision 31.55	15
Dialogue Act Classification	ICSI Audio+Text	Affirmation Precision 11.08	3
Backchannel Detection	Short-Clips reaction set (test)	Binary Classification Accuracy 58.61	2
Dialogue Act Classification	Multimediate Video+Audio+Text	Affirmation Precision 15.5	2

