ViSpeak: Visual Instruction Feedback in Streaming Videos

About

Recent advances in Large Multi-modal Models (LMMs) have primarily focused on offline video understanding. Streaming video understanding, by contrast, poses great challenges to recent models because of its time-sensitive, omni-modal, and interactive characteristics. In this work, we aim to extend streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback, in which models should be aware of visual content and learn to extract instructions from it. For example, when a user waves at an agent, the agent should recognize the gesture and start a conversation with a welcome message. Following instructions given in the visual modality thus greatly enhances user-agent interaction. To facilitate research, we define seven key subtasks highly relevant to the visual modality and collect the ViSpeak-Instruct dataset for training and ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After fine-tuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with a basic visual instruction feedback ability, serving as a solid baseline for future research.

Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, Wei-Shi Zheng • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Streaming Video Understanding	StreamingBench	Overall	62.6	308
Long Video Understanding	VideoMME	Accuracy	55	97
Streaming Video Understanding	OVOBench Realtime	Average Score	66.3	38
Streaming Video Understanding	OVO-Bench 1.0 (test)	OCR	75.2	21
Egocentric Visual Question Answering	EGOPOINTVQA (test)	Reference Accuracy	65.5	19
Video Question Answering	OVO-Bench Backward Tracing	EPM	59.93	17
Streaming Video Understanding	OVOBench	Accuracy (Proactive Forwarding)	54.3	17
Real-time Streaming	OVO-Bench	RTVP	66.3	17
Video Question Answering	OVO-Bench	Overall Accuracy	61.91	17
Video Question Answering	OVO-Bench Real-Time Visual Perception	OCR	75.2	17

Showing 10 of 15 rows
