ActBERT: Learning Global-Local Video-Text Representations

About

In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clues extraction from contextual information. It enforces the joint videotext representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state-of-the-arts, demonstrating its superiority in video-text representation learning.

Linchao Zhu, Yi Yang• 2020

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	MSR-VTT	Recall@116.3	406
Video Question Answering	MSRVTT-QA (test)	Accuracy85.7	376
Text-to-Video Retrieval	MSR-VTT (test)	R@116.3	265
Text-to-Video Retrieval	MSR-VTT (1k-A)	R@1056.9	211
Text-to-Video Retrieval	MSRVTT (test)	Recall@50.428	178
Text-to-Video Retrieval	MSRVTT	R@18.6	144
Text-to-Video Retrieval	YouCook2	Recall@1038	117
Video Captioning	YouCook2	METEOR13.3	108
Video Captioning	YouCook II (val)	CIDEr65	98
Text-to-Video Retrieval	MSRVTT	R@116.3	75

Showing 10 of 37 rows

Other info

Follow for update

@wizwand_team Discord