Object Relational Graph with Teacher-Recommended Learning for Video Captioning

About

Taking full advantage of the information from both vision and language is critical for the video captioning task. Existing models lack adequate visual representation because they neglect the interactions between objects, and they lack sufficient training for content-related words because of the long-tailed problem. In this paper, we propose a complete video captioning system that includes both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich the visual representation. Meanwhile, we design a teacher-recommended learning (TRL) method that uses a successful external language model (ELM) to integrate abundant linguistic knowledge into the caption model. The ELM generates semantically similar word proposals, which extend the ground-truth words used for training and thereby address the long-tailed problem. Experimental evaluations on three benchmarks (MSVD, MSR-VTT, and VATEX) show that the proposed ORG-TRL system achieves state-of-the-art performance. Extensive ablation studies and visualizations illustrate the effectiveness of our system.

Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zhengjun Zha • 2020
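
The teacher-recommended learning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the training signal for one word position combines a hard cross-entropy term on the ground-truth word with a soft term that pulls the caption model toward the teacher's top-k word proposals. The function name `trl_style_loss`, the parameters `top_k` and `hard_weight`, and the toy four-word vocabulary are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def trl_style_loss(student_logits, teacher_probs, target_index,
                   top_k=2, hard_weight=0.5):
    """Illustrative loss for one word position (assumed form, not the paper's code).

    student_logits: caption-model scores over the vocabulary
    teacher_probs:  external language model probabilities over the same vocabulary
    target_index:   index of the ground-truth word
    """
    student = softmax(student_logits)

    # Hard term: ordinary cross-entropy on the single ground-truth word.
    hard = -math.log(student[target_index])

    # Soft term: keep only the teacher's top-k proposals, renormalise them,
    # and penalise the student for diverging from that distribution.
    top = sorted(range(len(teacher_probs)),
                 key=lambda i: teacher_probs[i], reverse=True)[:top_k]
    mass = sum(teacher_probs[i] for i in top)
    soft = 0.0
    for i in top:
        q = teacher_probs[i] / mass
        if q > 0:
            soft += q * math.log(q / student[i])

    return hard_weight * hard + (1 - hard_weight) * soft


# Toy vocabulary: ["man", "person", "dog", "car"]; the ground truth is "man",
# and the hypothetical teacher also recommends the similar word "person".
loss = trl_style_loss([2.0, 1.0, 0.1, -1.0], [0.5, 0.4, 0.05, 0.05], target_index=0)
```

In this sketch, setting `hard_weight=1.0` reduces the loss to plain cross-entropy, while lower values let the teacher's proposals contribute, so rarer but semantically similar words receive a training signal.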

Related benchmarks

Task	Dataset	Result
Video Captioning	MSVD	CIDEr 95.2	157
Video Captioning	MSR-VTT (test)	CIDEr 95.2	142
Video Captioning	MSVD (test)	CIDEr 95.2	111
Video Captioning	MSRVTT	CIDEr 50.9	107
Video Captioning	VATEX	CIDEr 49.7	76
Video Captioning	MSRVTT	CIDEr 50.9	68
Video Captioning	VATEX (test)	CIDEr 49.7	66
Video Captioning	MSRVTT (test)	CIDEr 50.9	61
Video Captioning	MSRVTT (full)	CIDEr 50.9	20
Video Captioning	VATEX online evaluation (test)	CIDEr 49.7	15

