Beyond Words: Multimodal LLM Knows When to Speak
About
Chatbots built on large language models (LLMs) generate fluent responses but often struggle with knowing when to speak, especially when a brief, timely listener reaction is called for during ongoing dialogue. We present a multimodal strategy for LLMs that leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. To support this task, we introduce a curated multimodal dataset built from real-world dyadic conversational videos, with temporally aligned modalities and fine-grained reaction-type annotations. We then design MM-When2Speak, which adds a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response-type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.
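To make the "dense response-type prediction" framing concrete, the following is a minimal sketch of how sparse listener-response annotations could be turned into one label per fixed-length window of a conversation, and how a per-window decision could be read off model scores. It is illustrative only and is not the MM-When2Speak implementation: the three coarse labels follow the abstract's silent / short reaction / full response distinction (the dataset itself uses finer-grained reaction types), while the names `ResponseType`, `Event`, `dense_labels`, `decide`, the 1-second window, and the tie-breaking rule are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseType(Enum):
    SILENT = 0         # keep listening
    REACTION = 1       # brief listener reaction
    FULL_RESPONSE = 2  # take the turn with a full response


@dataclass
class Event:
    time_s: float       # hypothetical annotation: when the response starts
    kind: ResponseType  # what kind of response it is


def dense_labels(events, duration_s, window_s=1.0):
    """Return one label per fixed-length window; windows default to SILENT."""
    n_windows = int(duration_s // window_s)
    labels = [ResponseType.SILENT] * n_windows
    for ev in events:
        idx = int(ev.time_s // window_s)
        if 0 <= idx < n_windows:
            # Assumed tie-break: if two events share a window,
            # keep the one with the larger enum value.
            if ev.kind.value > labels[idx].value:
                labels[idx] = ev.kind
    return labels


def decide(scores):
    """Pick the response type with the highest score for one window."""
    return max(scores, key=scores.get)
```

For instance, a 5-second clip annotated with a reaction at 1.2 s and a full response at 3.7 s would yield five window labels, with every window other than the second and fourth left as `SILENT`. In a streaming setting, `decide` would be called once per incoming window on the scores produced by whatever model is in use.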
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Response type prediction | Short-Clips (test) | Affirmation Score | 62.21 | 30 |
| Response type prediction | Full-Videos (test) | Affm. Precision | 31.55 | 15 |
| Dialogue Act Classification | ICSI Audio+Text | Affirmation Precision | 11.08 | 3 |
| Backchannel Detection | Short-Clips reaction set (test) | Binary Classification Accuracy | 58.61 | 2 |
| Dialogue Act Classification | Multimediate Video+Audio+Text | Affirmation Precision | 15.5 | 2 |