Revisiting the "Video" in Video-Language Understanding
About
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 27.8 | 234 |
| Video Question Answering | NExT-QA (test) | Accuracy | 54.3 | 204 |
| Video Question Answering | NExT-QA (val) | Overall Acc | 54.3 | 176 |
| Video Question Answering | NEXT-QA | Overall Accuracy | 65.8 | 105 |
| Video-to-Text retrieval | DiDeMo (test) | R@1 | 26.1 | 92 |
| Video-to-Text retrieval | ActivityNet (test) | R@1 | 17.7 | 63 |
| Video Question Answering | MSRVTT-MC | Accuracy | 93.2 | 61 |
| Video Question Answering | NExT-QA Main Dataset | Accuracy | 0.543 | 48 |
| Video Question Answering | STAR (test) | Interaction Score | 50.63 | 42 |
| Video Question Answering | MSR-VTT | Accuracy | 34.3 | 42 |