Elysium: Exploring Object-level Perception in Videos via MLLM
About
Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application to video tasks such as object tracking remains understudied. This gap stems from two key challenges. First, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Second, processing a large number of frames within the context window of Large Language Models (LLMs) imposes a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset supporting three tasks: Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we train MLLMs and propose a token-compression model, T-Selector, to tackle the second challenge. Our resulting approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that performs object-level tasks in videos without requiring any additional plug-in or expert models. All code and datasets are available at https://github.com/Hon-Wong/Elysium.
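The abstract above does not detail how T-Selector compresses visual tokens, only that it reduces the per-frame token count so more frames fit in the LLM's context window. As an illustration only, here is a minimal NumPy sketch of one common token-compression pattern (query-based cross-attention pooling); the function name, query count, and attention scheme are assumptions for demonstration, not the authors' implementation:

```python
import numpy as np

def t_selector_sketch(frame_tokens, num_queries=4, seed=0):
    """Compress one frame's visual tokens to `num_queries` summary tokens.

    frame_tokens: (n_tokens, dim) array of visual features for a single frame.
    Returns a (num_queries, dim) array: each output token is a softmax-weighted
    average of the input tokens, driven by a (here random) learned query.
    """
    n, d = frame_tokens.shape
    rng = np.random.default_rng(seed)
    queries = rng.standard_normal((num_queries, d))   # stand-in for learned queries
    scores = queries @ frame_tokens.T / np.sqrt(d)    # (num_queries, n_tokens)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over tokens
    return weights @ frame_tokens                     # (num_queries, dim)

# 8 frames, 256 visual tokens of dim 64 each -> 8 * 4 = 32 tokens total,
# instead of 8 * 256 = 2048, before feeding the sequence to the LLM.
video = [np.random.default_rng(i).random((256, 64)) for i in range(8)]
compressed = np.concatenate([t_selector_sketch(f) for f in video])
```

The point of the sketch is the budget arithmetic: with a fixed context window, shrinking each frame from 256 tokens to a handful is what makes multi-frame, object-level reasoning feasible at all.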
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 67.5 | 481 |
| Visual Object Tracking | LaSOT (test) | AUC | 56.1 | 444 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 82.86 | 345 |
| Video Question Answering | MSVD-QA | Accuracy | 75.8 | 340 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 89.07 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 92.12 | 333 |
| Video Question Answering | ActivityNet-QA | Accuracy | 43.4 | 319 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 83.62 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 82.92 | 291 |
| Visual Object Tracking | UAV123 (test) | AUC | 56.6 | 188 |