
Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

About

Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains challenging, especially for long videos. Because of limited access to large-scale, high-quality video data and the excessive compression of visual features, current methods struggle to process long videos effectively. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. To address the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. In particular, on benchmarks specialized for long videos, Kangaroo outperforms some larger models with over 10B parameters as well as proprietary models.

Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, Jie Hu • 2024

Related benchmarks

Task                           Dataset               Result                 Rank
Video Understanding            MVBench               Accuracy 64.6          247
Video Understanding            VideoMME              --                     192
Long Video Understanding       LongVideoBench (val)  Accuracy 54.8          139
Video Question Answering       NEXT-QA               Overall Accuracy 54.8  105
Video Question Answering       VideoMME              Accuracy 56            99
Video Understanding            MVBench (test)        Accuracy 61.1          97
Video Question Answering       EgoSchema             Accuracy 62.7          88
Long-form Video Understanding  LongVideoBench        Accuracy 54.8          82
Video Question Answering       EgoSchema (test)      Accuracy 62.7          80
Long Video Understanding       MLVU                  Accuracy 61            72
Showing 10 of 35 rows
