MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

About

Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks because it demands both visual relevance and discourse-level coherence across the sentences of the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history to improve prediction of the next sentence with respect to coreference and repetition, thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets, ActivityNet Captions and YouCookII, show that MART generates more coherent and less repetitive paragraph captions than baseline methods while maintaining relevance to the input video events. The code is open source and available at: https://github.com/jayleicn/recurrent-transformer

Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal • 2020
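
The abstract describes a memory state that is carried from one video segment to the next and summarizes what has been seen and said so far. The sketch below illustrates only that general idea, a gated recurrent update of a single memory vector; it is not the authors' implementation. The function names (`mean_pool`, `update_memory`, `run_paragraph`), the scalar weights `w_mem` and `w_in`, and the use of mean pooling are all assumptions made for illustration. In the real model, the per-segment inputs would be transformer hidden states and the weights would be learned matrices.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(rows: List[Vector]) -> Vector:
    """Average equal-length vectors into one summary vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def update_memory(memory: Vector, segment_summary: Vector,
                  w_mem: float = 0.5, w_in: float = 0.5) -> Vector:
    """Gated update: blend the old memory with a candidate from the new segment.

    The scalar weights are hypothetical stand-ins for learned parameters.
    """
    new_memory = []
    for m, s in zip(memory, segment_summary):
        pre = w_mem * m + w_in * s
        candidate = math.tanh(pre)   # proposed new content
        gate = sigmoid(pre)          # how much of the candidate to accept
        new_memory.append((1.0 - gate) * m + gate * candidate)
    return new_memory


def run_paragraph(segments: List[List[Vector]], dim: int) -> List[Vector]:
    """Carry one memory state across all segments; return the state after each."""
    memory = [0.0] * dim
    states = []
    for hidden_states in segments:
        memory = update_memory(memory, mean_pool(hidden_states))
        states.append(memory)
    return states


# Toy input: two segments, each a list of 2-dimensional "hidden states".
states = run_paragraph([[[1.0, 0.0], [1.0, 2.0]], [[0.0, 0.0]]], dim=2)
```

Because the second segment's state is computed from the first segment's memory rather than from scratch, information from earlier segments can influence later predictions, which is the property the abstract credits for reducing repetition and improving coreference.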

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Video Captioning	YouCook2	METEOR	15.9	108
Video Captioning	YouCook II (val)	CIDEr	35.74	98
Video Paragraph Captioning	ActivityNet Captions ae (val)	METEOR	15.7	43
Video Paragraph Captioning	ActivityNet Captions ae (test)	BLEU@4	10.54	24
Segment-level Video Captioning	YouCook2	BLEU-4	8	17
Visual Abductive Reasoning	VAR (test)	BLEU@4	2.86	14
Video Captioning	ActivityNet Captions	CIDEr	22.2	10
Video Paragraph Captioning	ActivityNet Captions	BLEU@4	10.33	9

