OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
About
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
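The streaming setting is easiest to picture as a loop: the model ingests frames and audio as they arrive and must decide on its own when to interject, rather than answering only when prompted. The sketch below is a toy illustration of that loop under assumed interfaces, not the M4 implementation; `StreamingOmniModel`, `should_respond`, and the fixed trigger interval are all hypothetical placeholders.

```python
# Toy illustration of the proactive streaming loop that OmniMMI evaluates.
# All names here are hypothetical stand-ins, not the M4 API.

from dataclasses import dataclass, field

@dataclass
class StreamingOmniModel:
    """Stand-in for an Omni model that watches a stream and may speak unprompted."""
    context: list = field(default_factory=list)

    def ingest(self, frame, audio_chunk):
        # Append the newest multi-modal evidence to the running context.
        self.context.append((frame, audio_chunk))

    def should_respond(self) -> bool:
        # Proactive reasoning: decide *when* to speak, not just what to say.
        # A real model would score this from context; a fixed interval is a placeholder.
        return bool(self.context) and len(self.context) % 8 == 0

    def respond(self) -> str:
        return f"[response grounded in {len(self.context)} stream steps]"


def run_stream(model: StreamingOmniModel, stream):
    """Feed (frame, audio) pairs in arrival order; the model interrupts on its own."""
    for frame, audio in stream:
        model.ingest(frame, audio)
        if model.should_respond():
            yield model.respond()


# Usage: simulate a 24-step stream of dummy frames and audio chunks.
fake_stream = ((f"frame_{i}", f"audio_{i}") for i in range(24))
for reply in run_stream(StreamingOmniModel(), fake_stream):
    print(reply)
```

The point of the loop structure is that response timing is itself a model decision, which is exactly what the proactive subtasks below measure.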
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| State Inference | OmniMMI | SI Score | 9 | 7 |
| Action Prediction | OmniMMI | AP | 3 | 7 |
| Multi-turn Dependency Reasoning | OmniMMI | Rank 1 Score | 35.67 | 7 |
| Dynamic State Grounding | OmniMMI | Rank 1 Count | 33.5 | 7 |
| Personality Trait | OmniMMI | PT Score | 68.5 | 3 |
| Personality Attribute | OmniMMI | PA Score | 25.5 | 2 |