Progressive Video Summarization via Multimodal Self-supervised Learning

About

Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, which makes deep models prone to over-fitting. Because annotating large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework that learns semantic representations of videos, which in turn benefit the video summarization task. Specifically, the self-supervised learning explores the semantic consistency between videos and text in both coarse-grained and fine-grained fashions, and also recovers masked frames in the videos. The multimodal framework is trained on a newly collected dataset of video-text pairs. Additionally, we introduce a progressive video summarization method, in which the important content in a video is pinpointed progressively to generate better summaries. Extensive experiments demonstrate the effectiveness and superiority of our method in terms of rank correlation coefficients and F-score.
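As a rough illustration of what coarse-grained video-text consistency could look like, the sketch below computes a symmetric contrastive loss over a small batch of video and text embeddings. This is not the paper's loss: the abstract does not specify the objective, so the choice of a symmetric softmax contrastive loss, the `temperature` value, and the function names are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_diagonal_nll(sim):
    """Mean negative log-softmax of the diagonal entry of each row."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(sim)


def contrastive_loss(video_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss (hypothetical stand-in, not the paper's).

    video_emb[i] and text_emb[i] are assumed to describe the same video;
    every other pairing in the batch is treated as a negative.
    """
    sim = [[cosine(v, t) / temperature for t in text_emb] for v in video_emb]
    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (mean_diagonal_nll(sim) + mean_diagonal_nll(sim_transposed))
```

The loss is small when each video embedding is closest to its own text embedding and grows when the pairs are mismatched, which is the behaviour a consistency objective of this kind relies on.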

Li Haopeng, Ke Qiuhong, Gong Mingming, Tom Drummond • 2022

Related benchmarks

Task	Dataset	Result
Video Summarization	TVSum	F-Measure 61.8	213
Video Summarization	TVSum	Kendall's Tau 0.181	99
Video Summarization	SumMe (TVT)	Kendall's Tau 0.192	44
Video Summarization	SumMe (various)	F-score 50.7	35
Video Summarization	SumMe	Kendall's tau 0.192	35
Video Summarization	SumMe	Kendall's τ 0.192	32
Video Summarization	TVSum (5-fold cross-val)	Kendall's Tau 0.181	32
Video Summarization	TVSum	Kendall's τ 0.181	24
Video highlight detection	Mr.HiSum	mAP (rho=50%) 59.48	23
Video Summarization	TVSum (5FCV)	Kendall's Tau 0.181	19

Showing 10 of 19 rows
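The two metric families in the table can be sketched in a few lines of Python. The code below is a minimal illustration, assuming Kendall's tau is the tau-b variant computed between predicted and reference frame-importance scores, and that the F-score is the harmonic mean of precision and recall over binary frame selections. How each benchmark aggregates over annotators and splits is not stated on this page, and the function names are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length score lists."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: contributes to neither count
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


def f_score(pred_selected, ref_selected):
    """F-score between two binary frame-selection masks of equal length."""
    overlap = sum(1 for p, r in zip(pred_selected, ref_selected) if p and r)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(1 for p in pred_selected if p)
    recall = overlap / sum(1 for r in ref_selected if r)
    return 2 * precision * recall / (precision + recall)
```

Note that `f_score` returns a fraction in [0, 1], whereas the F-Measure and F-score values in the table appear to be percentages.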
