OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
About
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
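The streaming setting is easiest to picture as a loop: the model ingests frames and audio as they arrive and must decide on its own when to interject, rather than answering only when prompted. The sketch below is a toy illustration of that loop under assumed interfaces, not the M4 implementation; `StreamingOmniModel`, `should_respond`, and the fixed trigger interval are all hypothetical placeholders.

```python
# Toy illustration of the proactive streaming loop that OmniMMI evaluates.
# All names here are hypothetical stand-ins, not the M4 API.

from dataclasses import dataclass, field

@dataclass
class StreamingOmniModel:
    """Stand-in for an Omni model that watches a stream and may speak unprompted."""
    context: list = field(default_factory=list)

    def ingest(self, frame, audio_chunk):
        # Append the newest multi-modal evidence to the running context.
        self.context.append((frame, audio_chunk))

    def should_respond(self) -> bool:
        # Proactive reasoning: decide *when* to speak, not just what to say.
        # A real model would score this from context; a fixed interval is a placeholder.
        return bool(self.context) and len(self.context) % 8 == 0

    def respond(self) -> str:
        return f"[response grounded in {len(self.context)} stream steps]"


def run_stream(model: StreamingOmniModel, stream):
    """Feed (frame, audio) pairs in arrival order; the model interrupts on its own."""
    for frame, audio in stream:
        model.ingest(frame, audio)
        if model.should_respond():
            yield model.respond()


# Usage: simulate a 24-step stream of dummy frames and audio chunks.
fake_stream = ((f"frame_{i}", f"audio_{i}") for i in range(24))
for reply in run_stream(StreamingOmniModel(), fake_stream):
    print(reply)
```

The point of the loop structure is that response timing is itself a model decision, which is exactly what the proactive subtasks below measure.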
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| State Inference | OmniMMI | SI Score | 9 | 7 |
| Action Prediction | OmniMMI | AP | 3 | 7 |
| Multi-turn Dependency Reasoning | OmniMMI | Rank 1 Score | 35.67 | 7 |
| Dynamic State Grounding | OmniMMI | Rank 1 Count | 33.5 | 7 |
| Personality Trait | OmniMMI | PT Score | 68.5 | 3 |
| Personality Attribute | OmniMMI | PA Score | 25.5 | 2 |