
A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

About

Instructional videos attract high traffic on video sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., "heat the oil in the pan") improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR tokens and visual features results in higher performance compared to training on either modality alone. We find that unstated background information is better explained by visual features, whereas fine-grained distinctions (e.g., "add oil" vs. "add olive oil") are disambiguated more easily via ASR tokens.
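As a rough illustration of what "jointly modeling" the two modalities can mean, the sketch below shows a simple early-fusion scheme: per-segment ASR token embeddings and per-frame visual features are each pooled and concatenated into one joint vector for a caption decoder. This is a minimal NumPy sketch under assumed dimensions, not the authors' actual architecture; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def pool(features: np.ndarray) -> np.ndarray:
    """Mean-pool a (time, dim) feature sequence into a single (dim,) vector."""
    return features.mean(axis=0)

def joint_segment_representation(asr_embeddings: np.ndarray,
                                 frame_features: np.ndarray) -> np.ndarray:
    """Early fusion: concatenate pooled ASR and visual features.

    asr_embeddings:  (num_tokens, d_text)   e.g. embeddings of ASR tokens
    frame_features:  (num_frames, d_visual) e.g. per-frame visual features
    Returns a (d_text + d_visual,) joint vector a caption decoder could consume.
    """
    return np.concatenate([pool(asr_embeddings), pool(frame_features)])

# Hypothetical sizes: 12 ASR tokens with 128-d embeddings,
# 30 sampled frames with 512-d visual features.
asr = rng.standard_normal((12, 128))
frames = rng.standard_normal((30, 512))
joint = joint_segment_representation(asr, frames)
print(joint.shape)  # (640,)
```

Training a single decoder on such a fused representation, rather than on either pooled modality alone, is one simple way to realize the joint-modality setup the abstract contrasts with unimodal training.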

Jack Hessel, Bo Pang, Zhenhai Zhu, Radu Soricut • 2019

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Captioning | YouCook2 | METEOR | 25.9 | 104
Video Captioning | YouCook II (val) | CIDEr | 112 | 98
Video Captioning | YouCook2 (test) | CIDEr | 112 | 42
Video Level Summarization | YouCook2 | METEOR | 17.77 | 21
Segment-level Video Captioning | YouCook2 | BLEU-4 | 15.2 | 17
Segment-level Video Captioning | ViTT-All (test) | BLEU-1 | 43.34 | 9
Segment-level Video Captioning | ViTT Cooking (test) | BLEU-1 | 41.61 | 9
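The BLEU-1 results above are sentence-level unigram scores. As a reminder of what that metric measures, here is a minimal sketch of BLEU-1 (clipped unigram precision times a brevity penalty); the function name is illustrative, not from any library, and production evaluation would use a standard implementation:

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1: clipped unigram precision * brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate unigram's count by its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("add the olive oil to the pan", "add oil to the pan"))  # 5/7 ≈ 0.714
```

BLEU-4 (used for the YouCook2 segment-level row) extends the same idea to a geometric mean over 1- to 4-gram precisions, which rewards longer exact phrase matches.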
