Live Video Captioning

About

Dense video captioning involves detecting and describing events within video sequences. Traditional methods operate in an offline setting, assuming the entire video is available for analysis. In contrast, this work introduces a groundbreaking paradigm: Live Video Captioning (LVC), where captions must be generated for video streams in an online manner. This shift brings unique challenges, including processing partial observations of events and the need for temporal anticipation of actions. We formally define the novel problem of LVC and propose innovative evaluation metrics specifically designed for this online scenario, demonstrating their advantages over traditional metrics. To address the novel complexities of LVC, we present a new model that combines deformable transformers with temporal filtering, enabling effective captioning over video streams. Extensive experiments on the ActivityNet Captions dataset validate the proposed approach, showcasing its superior performance in the LVC setting compared to state-of-the-art offline methods. To foster further research, we provide the results of our model and an evaluation toolkit with the new metrics integrated at: https://github.com/gramuah/lvc.

Eduardo Blanco-Fernández, Carlos Gutiérrez-Álvarez, Nadia Nasri, Saturnino Maldonado-Bascón, Roberto J. López-Sastre • 2024

Related benchmarks

Task	Dataset	Result	Rank
Dense Video Captioning	ActivityNet Captions (val)	METEOR 1.56	54
Caption Localization	ActivityNet Captions (val)	Recall (avg) 19.32	11

