
Verbs in Action: Improving verb understanding in video-language models

About

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it.
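To make the first component more concrete, the following is a minimal sketch of a video-to-text contrastive (InfoNCE-style) loss in which each video is scored against extra hard-negative captions as well as the other captions in the batch. It is not the authors' implementation: the function name `contrastive_loss_with_hard_negatives`, the temperature value, and the toy embeddings are assumptions made for illustration, and the calibration strategy and the verb phrase alignment loss described above are not modelled.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_loss_with_hard_negatives(video_embs, text_embs, hard_neg_embs,
                                         temperature=0.07):
    """Mean video-to-text InfoNCE loss with per-video hard negatives.

    video_embs:    list of B video vectors.
    text_embs:     list of B caption vectors; text_embs[i] matches video_embs[i].
    hard_neg_embs: list of B lists of caption vectors (hypothetical verb-swapped
                   captions for video i); a list may be empty.
    temperature:   softmax temperature (0.07 is an assumed, illustrative value).
    """
    videos = [normalize(v) for v in video_embs]
    texts = [normalize(t) for t in text_embs]
    total = 0.0
    for i, video in enumerate(videos):
        # Candidates: every caption in the batch plus this video's hard negatives.
        candidates = texts + [normalize(h) for h in hard_neg_embs[i]]
        logits = [dot(video, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy with the matching caption (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(videos)


# Toy 2-D embeddings: each caption matches its video exactly, and each hard
# negative lies close to the matching caption.
videos = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[[0.9, 0.1]], [[0.1, 0.9]]]

loss_plain = contrastive_loss_with_hard_negatives(videos, captions, [[], []])
loss_hard = contrastive_loss_with_hard_negatives(videos, captions, hard_negatives)
```

In this sketch the hard negatives only enlarge the softmax denominator, so `loss_hard` is larger than `loss_plain`: the model is penalised unless it separates the matching caption from near-duplicates that differ in a small way, such as the verb.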

Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Recognition | Kinetics 400 (test) | Top-1 Accuracy | 58.8 | 245 |
| Video Classification | Kinetics 400 (val) | Top-1 Acc | 59.4 | 204 |
| Video Question Answering | NExT-QA (val) | Overall Acc | 51.5 | 176 |
| Video Question Answering | NEXT-QA | Overall Accuracy | 58.6 | 105 |
| Video Classification | Kinetics 400 (test) | Top-1 Acc | 58.8 | 97 |
| Video Question Answering | NExT-QA Main Dataset | Accuracy | 0.586 | 48 |
| Video Question Answering | NExT-QA ATPhard | Overall Accuracy | 39.3 | 27 |
| Video Question Answering | Next-QA v1 (test) | Overall Acc | 51.5 | 24 |
| Video Question Answering | DeVE-QA (test) | Accuracy (QA) | 49.5 | 21 |
| Action Classification | Kinetics 400 (test) | Accuracy | 58.8 | 13 |
Showing 10 of 16 rows
