UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models

About

With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform both holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks and fail to achieve comprehensive, multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design a unified visual-language guided alignment to flexibly handle video understanding across global, pixel, and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates a textual response, a temporal localization, or a grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct UFVideo-Bench, which consists of three distinct collaborative tasks across these scales and demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model on 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.

Hewen Pan, Cong Wei, Dashuang Liang, Zepeng Huang, Pengfei Gao, Ziqi Zhou, Lulu Xue, Pengfei Yan, Xiaoming Wei, Minghui Li, Shengshan Hu • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	--	635
Temporal Grounding	Charades-STA	mIoU 44.7	120
Reasoning Video Object Segmentation	ReVOS Reasoning	J&F Score 61.8	75
Video Referring Segmentation	ReVOS Referring	J&F Score 67.6	51
Video Referring Description	VideoRefer-Bench-D Single-Frame	SC 4.53	17
Region Captioning	VideoRefer-D (test)	Average Score 3.61	16
Reasoning Video Object Segmentation	ReVOS Overall (Entire Dataset)	J&F Score 64.8	14
Video Object Referring Question Answering	VideoRefer-Bench-Q	SQ 75.8	14
Multi-grained Video Cooperative Understanding	UFVideo-Bench PixRQA	SAvg (Score Average) 3.35	4
Multi-grained Video Cooperative Understanding	UFVideo-Bench PixTRQA	tIoU 49.64	4

Showing 10 of 11 rows
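
For readers unfamiliar with the temporal metrics in the table above (mIoU on Charades-STA, tIoU on UFVideo-Bench PixTRQA), the following is a minimal sketch of the standard temporal intersection-over-union between a predicted and a ground-truth time span, and its mean over a set of samples. It is not taken from the UFVideo code. The function names `temporal_iou` and `mean_iou` are hypothetical, segments are assumed to be `(start, end)` pairs in seconds, and the table values are assumed to be these ratios expressed as percentages.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # assumed format: (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time segments (hypothetical helper)."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    # Guard against degenerate zero-length segments.
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Segment], gts: Sequence[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    predictions = [(0.0, 10.0), (2.0, 6.0)]
    ground_truths = [(5.0, 15.0), (2.0, 6.0)]
    print(f"mIoU: {100 * mean_iou(predictions, ground_truths):.1f}")
```

A benchmark's official evaluation script may differ in details such as rounding or how invalid predictions are scored, so treat this only as an illustration of the metric's definition.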
