Understanding Long Videos with Multimodal Language Models
About
Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of the underlying LLMs influence this performance. Surprisingly, we discover that LLM-based approaches can achieve good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information at all. Building on this, we explore injecting video-specific information into an LLM-based framework. We use off-the-shelf vision tools to extract three object-centric information modalities from videos, and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework achieves state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics-domain tasks further establishes its generality. Code: https://github.com/kahnchana/mvu
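The fusion step described above, expressing object-centric cues as natural language that an LLM can consume, can be sketched minimally. The function name `build_fused_prompt` and the three modality names (`objects`, `per_frame`, `locations`) are illustrative assumptions, not the framework's actual interface:

```python
def build_fused_prompt(question: str,
                       objects: list[str],
                       per_frame: dict[int, list[str]],
                       locations: dict[str, str]) -> str:
    """Sketch: fuse three object-centric modalities (hypothetical names)
    into a single natural-language prompt, using language as the fusion
    medium before handing the result to an LLM."""
    lines = [
        "Answer the question using the video evidence below.",
        f"Objects seen in the video: {', '.join(objects)}.",
        "Per-frame observations:",
    ]
    for frame, objs in sorted(per_frame.items()):
        lines.append(f"  frame {frame}: {', '.join(objs)}")
    lines.append("Object locations:")
    for obj, loc in locations.items():
        lines.append(f"  {obj}: {loc}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

prompt = build_fused_prompt(
    "What is the person doing?",
    objects=["person", "cup", "kettle"],
    per_frame={0: ["person", "kettle"], 8: ["person", "cup"]},
    locations={"cup": "on the table, right side"},
)
```

The resulting string would be passed, together with the task instructions, as the text prompt to the underlying LLM.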
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 42.2 | 275 |
| Video Question Answering | NExT-QA (test) | Accuracy | 51.2 | 204 |
| Video Question Answering | EgoSchema (Full) | Accuracy | 37.6 | 193 |
| Multiple-choice Video Question Answering | EgoSchema | Accuracy | 37.6 | 61 |
| Video Question Answering | LongVideoBench | Accuracy | 50.4 | 34 |
| Video Question Answering | EgoSchema 5031 videos (test) | Top-1 Accuracy | 61.3 | 26 |
| Video Question Answering | NExT-QA v1 (test) | Overall Accuracy | 73.3 | 24 |
| Multiple-choice Video Question Answering | EgoSchema Subset (500 questions) | Accuracy | 60.3 | 10 |
| Robot Control | MetaWorld | Door Open Success Rate | 66.7 | 6 |
| Video Question Answering | EgoSchema ES-S (public subset) | Accuracy | 55.8 | 4 |