# Streaming Dense Video Captioning

## About
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model with two novel components. First, we introduce a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos because the memory is of a fixed size. Second, we develop a streaming decoding algorithm that enables the model to make predictions before the entire video has been processed. Our model achieves this streaming ability and significantly improves the state of the art on three dense video captioning benchmarks: ActivityNet, YouCook2, and ViTT. Our code is released at https://github.com/google-research/scenic.
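The clustering-based memory described above can be illustrated with a small sketch. The snippet below is a minimal illustration under stated assumptions, not the released implementation: it assumes the memory is a fixed set of token vectors that is refreshed by K-means clustering over the existing memory plus each new chunk of frame tokens. K-means, the token dimensions, and the names `update_memory` and `kmeans_cluster_memory` are illustrative choices; the actual model is in the scenic repository linked above.

```python
import numpy as np


def kmeans_cluster_memory(tokens, num_memory_tokens, num_iters=10, seed=0):
    """Compresses a set of token vectors into a fixed number of cluster centers.

    K-means is one possible way to realize a clustering-based memory; the exact
    clustering procedure is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    # Initialize the cluster centers from a random subset of the tokens.
    init = rng.choice(len(tokens), size=num_memory_tokens, replace=False)
    centers = tokens[init].copy()
    for _ in range(num_iters):
        # Assign each token to its nearest center.
        dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its assigned tokens.
        for k in range(num_memory_tokens):
            members = tokens[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers


def update_memory(memory, new_frame_tokens, num_memory_tokens):
    """Merges new frame tokens into the fixed-size memory by re-clustering."""
    if memory is None:
        pooled = new_frame_tokens
    else:
        pooled = np.concatenate([memory, new_frame_tokens], axis=0)
    if len(pooled) <= num_memory_tokens:
        return pooled  # Not enough tokens yet to require compression.
    return kmeans_cluster_memory(pooled, num_memory_tokens)


# Toy streaming loop: frame tokens arrive chunk by chunk, but the memory never
# grows beyond a fixed size, so arbitrarily long videos can be processed.
num_memory_tokens, token_dim = 64, 256
memory = None
for chunk in range(10):  # 10 chunks of 16 frame tokens each (placeholder features).
    frame_tokens = np.random.randn(16, token_dim).astype(np.float32)
    memory = update_memory(memory, frame_tokens, num_memory_tokens)
    # A caption decoder could be run on `memory` at this point, producing
    # outputs before the rest of the video has been seen.
print(memory.shape)  # (64, 256) regardless of video length.
```

Because the memory is capped at `num_memory_tokens` vectors, the per-chunk cost stays constant, which is what allows handling arbitrarily long videos and, combined with intermediate decoding as in the streaming decoding algorithm, emitting captions before the full video has been processed.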
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Captioning | ActivityNet Captions (val) | METEOR: 10 | 22 |
| Video Level Summarization | YouCook2 | METEOR: 7.1 | 21 |
| Event Localization | YouCook2 (val) | -- | 13 |
| Event Captioning | YouCook2 1.0 (val) | METEOR: 7.1 | 12 |
| Event Localization | ViTT (test) | -- | 4 |
| Event Captioning | ViTT (test) | CIDEr: 25.2 | 3 |