
Revisiting the "Video" in Video-Language Understanding

About

What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
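The dataset-splitting idea mentioned above can be illustrated with a rough sketch. This is not the authors' procedure; it only assumes that an image-level probe has already produced one predicted answer per question, and it sets aside the questions the probe gets wrong as candidates for needing temporal understanding. The function `split_by_probe`, the field names `probe_answer` and `label`, and the toy records are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Dict[str, object]


def split_by_probe(
    examples: Sequence[Example],
    probe_is_correct: Callable[[Example], bool],
) -> Tuple[List[Example], List[Example]]:
    """Partition examples by whether an image-level probe answers them correctly.

    Returns (easy, hard): `easy` holds examples the probe solves,
    `hard` holds the ones it fails on.
    """
    easy: List[Example] = []
    hard: List[Example] = []
    for ex in examples:
        (easy if probe_is_correct(ex) else hard).append(ex)
    return easy, hard


# Toy data (hypothetical): each record stores the answer index chosen by a
# single-frame probe and the ground-truth answer index.
examples = [
    {"id": "q1", "probe_answer": 2, "label": 2},
    {"id": "q2", "probe_answer": 0, "label": 3},
    {"id": "q3", "probe_answer": 1, "label": 1},
]

easy, hard = split_by_probe(
    examples, lambda ex: ex["probe_answer"] == ex["label"]
)
hard_ids = [ex["id"] for ex in hard]
print(hard_ids)
```

A real split would likely also use probe confidence or agreement across several probe runs rather than a single correct/incorrect check; that choice is left open here.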

Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, Juan Carlos Niebles • 2022

Related benchmarks

Task                      Dataset               Result                    Rank
Text-to-Video Retrieval   MSR-VTT (test)        R@1 27.8                  234
Video Question Answering  NExT-QA (test)        Accuracy 54.3             204
Video Question Answering  NExT-QA (val)         Overall Acc 54.3          176
Video Question Answering  NEXT-QA               Overall Accuracy 65.8     105
Video-to-Text retrieval   DiDeMo (test)         R@1 26.1                  92
Video-to-Text retrieval   ActivityNet (test)    R@1 17.7                  63
Video Question Answering  MSRVTT-MC             Accuracy 93.2             61
Video Question Answering  NExT-QA Main Dataset  Accuracy 0.543            48
Video Question Answering  STAR (test)           Interaction Score 50.63   42
Video Question Answering  MSR-VTT               Accuracy 34.3             42
Showing 10 of 26 rows
