Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

About

The crux of self-supervised video representation learning is to build general features from unlabeled videos. However, most recent works have focused mainly on high-level semantics and neglected lower-level representations and their temporal relationships, which are crucial for general video understanding. To address these challenges, this paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations. Concretely, high-level features obtained from naive and prototypical contrastive learning are used to build distribution graphs, which guide low-level and mid-level feature learning. We also devise a simple temporal modeling module over multi-level features to enhance motion pattern learning. Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding. Code is available at https://github.com/shvdiwnkozbw/Video-Representation-via-Multi-level-Optimization.
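
The graph constraint described above can be illustrated with a small sketch. It is not taken from the paper or its repository: it assumes that a "distribution graph" is a row-wise softmax over pairwise cosine similarities within a batch, and that the lower-level graph is pulled toward the high-level one with a KL divergence. The function names, the temperature value, and the toy features are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_graph(features, temperature=0.1):
    """Assumed graph form: for each sample, a softmax distribution
    over its cosine similarities to every other sample in the batch."""
    graph = []
    for i, f_i in enumerate(features):
        logits = [cosine(f_i, f_j) / temperature
                  for j, f_j in enumerate(features) if j != i]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        graph.append([e / total for e in exps])
    return graph


def graph_constraint_loss(high_feats, low_feats, temperature=0.1):
    """Mean KL(high-level graph || low-level graph) over samples.
    The high-level graph acts as the target distribution."""
    target = similarity_graph(high_feats, temperature)
    pred = similarity_graph(low_feats, temperature)
    eps = 1e-12  # avoids log(0)
    loss = 0.0
    for p_row, q_row in zip(target, pred):
        loss += sum(p * math.log((p + eps) / (q + eps))
                    for p, q in zip(p_row, q_row))
    return loss / len(target)


# Toy batch of three clips: clips 0 and 1 are similar, clip 2 differs.
high = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# Low-level features with the same neighborhood structure as `high`.
low_aligned = [[2.0, 0.0, 0.0], [1.8, 0.2, 0.0], [0.0, 2.0, 0.0]]
# Low-level features where clips 0 and 2 have swapped roles.
low_shuffled = [[0.0, 2.0, 0.0], [1.8, 0.2, 0.0], [2.0, 0.0, 0.0]]

aligned_loss = graph_constraint_loss(high, low_aligned)
shuffled_loss = graph_constraint_loss(high, low_shuffled)
```

In this toy setup the loss is close to zero when the low-level features preserve the high-level neighborhood structure and clearly larger when they do not, which is the behavior a graph constraint of this kind is meant to reward. The loss depends only on pairwise similarities, so low-level and high-level features may have different dimensionality.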

Rui Qian, Yuxi Li, Huabin Liu, John See, Shuangrui Ding, Xian Liu, Dian Li, Weiyao Lin • 2021

Related benchmarks

Task	Dataset	Result
Action Recognition	UCF101 (test)	Accuracy 79.1	357
Action Recognition	UCF101 (mean of 3 splits)	Accuracy 79.1	357
Action Recognition	HMDB51 (test)	Accuracy 0.476	249
Action Classification	HMDB51 (over all three splits)	Accuracy 47.6	121
Video Retrieval	UCF101 (1)	Top-1 Acc 41.5	97
Video Retrieval	HMDB51 (test)	Recall@1 20.7	76
Video Retrieval	UCF101	Top-1 Acc 41.5	63
Video Retrieval	UCF101 (test)	--	55
Action Recognition	UCF101 1 (test)	Accuracy 79.1	50
Video Retrieval	HMDB51 (first split)	Top-1 Accuracy 20.7	49
