
Beyond Words: Multimodal LLM Knows When to Speak

About

Chatbots built on large language models (LLMs) generate fluent responses but often struggle to decide when to speak, especially when brief, timely listener reactions are called for during ongoing dialogue. We present a multimodal strategy for LLMs that leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. To support this task, we introduce a curated multimodal dataset built from real-world dyadic conversational videos, with temporally aligned modalities and fine-grained reaction-type annotations. We also design MM-When2Speak, which adds a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response-type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.
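
As a rough illustration of the dense formulation described in the abstract, the sketch below assigns one of three labels (silent, short reaction, full response) to every incoming time step of a stream. It is not MM-When2Speak: the `Frame` container, the concatenation-based `fuse` step, and the linear scorer in `predict_stream` are hypothetical stand-ins for the paper's multimodal integration module and LLM backbone, chosen only to show the shape of the task.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator, List, Sequence


class ResponseType(Enum):
    SILENT = 0
    SHORT_REACTION = 1
    FULL_RESPONSE = 2


@dataclass
class Frame:
    """Features for one time step (hypothetical stand-ins for real encoders)."""
    video: Sequence[float]
    audio: Sequence[float]
    text: Sequence[float]


def fuse(frame: Frame) -> List[float]:
    # Assumption: plain concatenation; the paper's integration module is not shown here.
    return [*frame.video, *frame.audio, *frame.text]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_stream(
    frames: Iterable[Frame],
    weights: Sequence[Sequence[float]],  # one row of weights per response type
    bias: Sequence[float],
) -> Iterator[ResponseType]:
    """Yield one label per frame, using only the frame seen so far (streaming)."""
    for frame in frames:
        x = fuse(frame)
        logits = [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)
        ]
        probs = softmax(logits)
        yield ResponseType(probs.index(max(probs)))


# Toy usage: identity weights make each modality vote for one class.
toy_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_bias = [0.0, 0.0, 0.0]
toy_frames = [
    Frame(video=[1.0], audio=[0.0], text=[0.0]),
    Frame(video=[0.0], audio=[1.0], text=[0.0]),
    Frame(video=[0.0], audio=[0.0], text=[1.0]),
]
labels = list(predict_stream(toy_frames, toy_weights, toy_bias))
```

Because `predict_stream` is a generator that looks only at the current frame, a label is available as soon as each frame arrives, which is the property the streaming constraint asks for; a real system would presumably also condition on past context.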

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin • 2025

Related benchmarks

Task                         Dataset                          Metric                          Result  Rank
Response type prediction     Short-Clips (test)               Affirmation Score               62.21   30
Response type prediction     Full-Videos (test)               Affm. Precision                 31.55   15
Dialogue Act Classification  ICSI Audio+Text                  Affirmation Precision           11.08   3
Backchannel Detection        Short-Clips reaction set (test)  Binary Classification Accuracy  58.61   2
Dialogue Act Classification  Multimediate Video+Audio+Text    Affirmation Precision           15.5    2
