
Learning Language-Visual Embedding for Movie Understanding with Natural-Language

About

Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural-language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval; 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural-language video annotation based on human activities. In addition to the original Audio Description (AD) captions provided as part of LSMDC16, we collected and will make available a) manually generated re-phrasings of those captions obtained using Amazon MTurk and b) automatically generated human-activity elements in "Predicate + Object" (PO) phrases based on "Knowlywood", an activity knowledge mining model. Our best model achieves Recall@10 of 19.2% on the annotation task and 18.9% on the video retrieval task for a subset of 1,000 samples. For the multiple-choice test, our best model achieves an accuracy of 58.11% over the whole LSMDC16 public test set.
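No reference code accompanies this page, so the following sketch is illustrative only: a minimal two-branch joint embedding of the general kind the abstract describes, which projects pre-extracted sentence and video features into a shared space, scores pairs by cosine similarity, and trains with a hinge-style ranking objective. The module name `JointEmbedding`, the layer sizes, and the loss formulation are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Minimal two-branch embedding (illustrative, not the paper's model):
    project pre-extracted sentence and video features into a shared space
    and compare them by cosine similarity."""

    def __init__(self, text_dim=300, video_dim=2048, embed_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def forward(self, text_feats, video_feats):
        # L2-normalize so the dot product below equals cosine similarity.
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return t @ v.T  # (num_texts, num_videos) similarity matrix

def ranking_loss(sim, margin=0.2):
    """Pairwise hinge loss: a matching (text, video) pair -- assumed to sit
    on the diagonal of `sim` -- should outscore mismatches by `margin`."""
    pos = sim.diag().unsqueeze(1)             # scores of the true pairs
    cost = (margin + sim - pos).clamp(min=0)  # margin violations per query
    cost.fill_diagonal_(0)                    # don't penalize the true pair
    return cost.mean()
```

With a model like this, both evaluation tasks reduce to operations on the similarity matrix: annotation/retrieval ranks rows or columns, and the multiple-choice test picks the caption with the highest similarity to the clip.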

Atousa Torabi, Niket Tandon, Leonid Sigal • 2016

Related benchmarks

| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA (test) | – | – | 371 |
| Text-to-Video Retrieval | MSR-VTT | R@1 | 4.2 | 313 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 4.2 | 234 |
| Text-to-Video Retrieval | LSMDC (test) | R@1 | 4.20 | 225 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 | 19.9 | 211 |
| Text-to-Image Retrieval | MSCOCO (1K test) | R@1 | 37.2 | 104 |
| Video Question Answering | MSR-VTT | Accuracy | 60.2 | 42 |
| Image Annotation | COCO 1000 (test) | R@1 | 44.6 | 18 |
| Movie Retrieval | LSMDC 17 (public test) | R@1 | 4.3 | 16 |
| Movie Retrieval | LSMDC 2016 (test) | R@1 | 3 | 13 |
Showing 10 of 19 rows
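Every R@1 / R@10 figure above is an instance of Recall@K: the fraction of queries for which the ground-truth item appears among the top K candidates when ranked by similarity. A minimal NumPy sketch, assuming the i-th text matches the i-th video (function name and data are hypothetical):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth item (assumed to sit on the
    diagonal of `sim`) appears among the top-k scored candidates."""
    order = np.argsort(-sim, axis=1)  # candidates per query, best first
    hits = (order[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# Example: 1,000 queries, matching the paper's 1,000-sample ranking subset.
sim = np.random.rand(1000, 1000)
print(f"R@10 = {recall_at_k(sim, 10):.1%}")  # ~1% for random scores
```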

Other info

Code
