
Multimodal Pretraining for Dense Video Captioning

About

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.
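To make the idea of time-stamped step annotations concrete, here is a minimal sketch of how ViTT-style timeline tags might be represented and rendered. The (seconds, tag) schema and the example annotations are illustrative assumptions, not the released dataset format.

```python
# Hedged sketch: rendering time-stamped step annotations for an
# instructional video as a readable timeline. The (start_seconds, tag)
# pair representation is an illustrative assumption.

def format_timeline(annotations):
    """Render (start_seconds, tag) pairs as a sorted MM:SS timeline."""
    lines = []
    for start, tag in sorted(annotations):
        minutes, seconds = divmod(int(start), 60)
        lines.append(f"{minutes:02d}:{seconds:02d}  {tag}")
    return "\n".join(lines)

# Hypothetical annotations for a cooking video.
example = [(95, "add sauce"), (12, "chop onions"), (47, "heat the pan")]
print(format_timeline(example))
```

A dense video captioning model would produce both the timestamps and the short step descriptions; the formatting above is only the presentation step.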

Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, Radu Soricut • 2020

Related benchmarks

Task                            Dataset               Metric   Result   Rank
Video Captioning                YouCook2              METEOR   18.32    104
Segment-level Video Captioning  YouCook2              BLEU-4   12.04    17
Segment-level Video Captioning  ViTT-All (test)       BLEU-1   22.45    9
Segment-level Video Captioning  ViTT Cooking (test)   BLEU-1   24.92    9
Video Captioning                ViTT                  BLEU-1   22.37    2
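The table above reports BLEU-1, BLEU-4, and METEOR scores. For orientation, here is a minimal sentence-level BLEU-1 sketch (clipped unigram precision with a brevity penalty); note that the benchmark numbers are corpus-level scores computed with standard tooling, so this simplified version is illustrative only.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty.

    Simplified sketch with a single reference; standard corpus-level BLEU
    (as used for the benchmark numbers above) aggregates differently.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate unigram's count by its count in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty discourages very short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

For example, `bleu1("add the sauce", "add the tomato sauce")` has perfect unigram precision but is penalized for being shorter than the reference.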

Other info

Code
