Verbs in Action: Improving verb understanding in video-language models

About

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it.

Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid • 2023
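The first component described in the abstract, hard negatives for cross-modal contrastive learning, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes precomputed video and caption embeddings, hard-negative caption embeddings that have already been generated (for example by an LLM that changes the verb), a video-to-text direction only, and a temperature of 0.07. The function name and array shapes are hypothetical. The calibration strategy and the verb phrase alignment loss are not covered.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def contrastive_loss_with_hard_negatives(video_emb, text_emb, hard_neg_emb,
                                         temperature=0.07):
    """Video-to-text InfoNCE loss with extra per-video hard-negative captions.

    video_emb:    (B, D) video embeddings
    text_emb:     (B, D) matching caption embeddings (row i pairs with video i)
    hard_neg_emb: (B, K, D) K hard-negative captions per video (assumed given)
    temperature:  softmax temperature (assumed value, not from the source)
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    h = l2_normalize(hard_neg_emb)

    # Similarity of every video to every caption in the batch: (B, B).
    in_batch = (v @ t.T) / temperature
    # Similarity of each video to its own hard negatives: (B, K).
    hard = np.einsum("bd,bkd->bk", v, h) / temperature

    # Hard negatives are appended as extra columns in the softmax denominator.
    logits = np.concatenate([in_batch, hard], axis=1)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive caption for video i sits in column i.
    idx = np.arange(v.shape[0])
    return -log_probs[idx, idx].mean()
```

Because the hard negatives only add terms to the softmax denominator, a caption that differs from the positive in its verb alone is penalised only if the model can tell the two apart, which is what pushes the embeddings to encode verbs rather than nouns.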

Related benchmarks

Task	Dataset	Metric	Result	
Action Recognition	Kinetics 400 (test)	Top-1 Accuracy	58.8	245
Video Classification	Kinetics 400 (val)	Top-1 Acc	59.4	204
Video Question Answering	NExT-QA (val)	Overall Acc	51.5	176
Video Question Answering	NEXT-QA	Overall Accuracy	58.6	105
Video Classification	Kinetics 400 (test)	Top-1 Acc	58.8	97
Video Question Answering	NExT-QA Main Dataset	Accuracy	0.586	48
Video Question Answering	NExT-QA ATPhard	Overall Accuracy	39.3	33
Video Question Answering	Next-QA v1 (test)	Overall Acc	51.5	24
Video Question Answering	DeVE-QA (test)	Accuracy (QA)	49.5	21
Action Classification	Kinetics 400 (test)	Accuracy	58.8	13

Showing 10 of 16 rows
