# Streamlined Dense Video Captioning

## About
Dense video captioning is an extremely challenging task, since accurate and coherent description of events in a video requires holistic understanding of the video content as well as contextual reasoning about individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then generating captions for a subset of the proposals. As a result, the generated sentences are prone to being redundant or inconsistent, since they fail to consider temporal dependencies between events. To tackle this challenge, we propose a novel dense video captioning framework, which explicitly models temporal dependencies across events in a video and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to adaptively select a sequence of event proposals, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards, at both the event and episode levels, for better context modeling. The proposed technique achieves outstanding performance on the ActivityNet Captions dataset in most metrics.
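The two-level reward described above can be illustrated with a minimal sketch: each generated caption receives an event-level reward (scoring that caption in isolation) plus an episode-level reward (scoring the coherence of the whole caption sequence), which is shared across all events. The function names and the toy word-overlap scorer below are illustrative assumptions, not the paper's actual implementation, which uses standard captioning metrics as rewards.

```python
# Hypothetical sketch of a two-level (event + episode) reward scheme.
# The word-overlap scorer stands in for real captioning metrics.

def event_reward(caption: str, reference: str) -> float:
    """Toy per-event score: word overlap between caption and reference."""
    cap, ref = set(caption.split()), set(reference.split())
    return len(cap & ref) / max(len(ref), 1)

def episode_reward(captions: list[str], references: list[str]) -> float:
    """Toy episode-level score: mean per-event overlap over the sequence,
    rewarding the quality of the generated 'story' as a whole."""
    scores = [event_reward(c, r) for c, r in zip(captions, references)]
    return sum(scores) / max(len(scores), 1)

def two_level_rewards(captions, references, w_event=1.0, w_episode=1.0):
    """Each event's reward mixes its own event-level score with the
    episode-level score shared by every event in the sequence."""
    ep = episode_reward(captions, references)
    return [w_event * event_reward(c, r) + w_episode * ep
            for c, r in zip(captions, references)]

caps = ["a man rides a bike", "he falls off the bike"]
refs = ["a man rides a bicycle", "the man falls off"]
rewards = two_level_rewards(caps, refs)
```

The shared episode-level term gives every event in the sequence a stake in the coherence of the full description, which is the mechanism the framework uses to discourage redundant or inconsistent captions.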
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Dense Video Captioning | ActivityNet Captions (val) | METEOR | 13.07 | 54 |
| Dense Video Captioning | ActivityNet Captions | METEOR | 8.82 | 43 |
| Video Captioning | ActivityNet Captions (val) | METEOR | 13.07 | 22 |
| Dense Video Captioning | ActivityNet Captions extended results (test) | METEOR | 13.07 | 19 |
| Event Proposal Generation | ActivityNet Captions (val) | Recall (avg) | 55.58 | 13 |
| Caption Localization | ActivityNet Captions (val) | Recall (avg) | 55.58 | 11 |