Localizing Moments in Video with Natural Language

About

We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms several baseline methods and believe that our initial results together with the release of DiDeMo will inspire further research on localizing video moments with natural language.
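The abstract describes the model only at a high level. As a rough, hypothetical illustration of the idea (not the authors' released implementation), the sketch below ranks candidate moments by embedding a text query and a moment descriptor that fuses local segment features, global video context, and temporal endpoint features into a shared space. All dimensions, the random projection matrices, and the mean-pooled sentence embedding are assumptions made for the sketch; in a trained model the projections would be learned jointly from paired data.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_LANG, D_EMB = 512, 300, 128          # illustrative feature dimensions

# Stand-in projection matrices; a trained model would learn these jointly.
W_vis  = rng.standard_normal((2 * D_VIS + 2, D_EMB)) * 0.01
W_lang = rng.standard_normal((D_LANG, D_EMB)) * 0.01

def embed_query(word_vectors):
    """Mean-pool word vectors, then project into the shared embedding space."""
    return np.mean(word_vectors, axis=0) @ W_lang

def embed_moment(segment_feats, start, end):
    """Fuse local moment features, global video context, and temporal endpoints."""
    local_feat  = segment_feats[start:end + 1].mean(axis=0)    # what happens inside the moment
    global_feat = segment_feats.mean(axis=0)                   # context from the whole video
    temporal    = np.array([start, end]) / len(segment_feats)  # normalized endpoints ("when")
    return np.concatenate([local_feat, global_feat, temporal]) @ W_vis

def rank_moments(word_vectors, segment_feats, candidates):
    """Return candidate (start, end) moments sorted by distance to the query."""
    q = embed_query(word_vectors)
    dists = [np.sum((embed_moment(segment_feats, s, e) - q) ** 2) for s, e in candidates]
    return [c for _, c in sorted(zip(dists, candidates))]

# Toy usage: a 30-segment video, 6 candidate moments, an 8-word query.
video = rng.standard_normal((30, D_VIS))
query_words = rng.standard_normal((8, D_LANG))
cands = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24), (25, 29)]
print(rank_moments(query_words, video, cands)[0])   # best-scoring moment
```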

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell • 2017

Related benchmarks

Task | Dataset | Metric | Result | Rank
Moment Retrieval | QVHighlights (test) | R@1 (IoU=0.5) | 11.41 | 170
Video Moment Retrieval | TACOS (test) | Recall@1 (0.5 Threshold) | 5.58 | 70
Temporal Grounding | Charades-STA (test) | Recall@1 (IoU=0.5) | 17.46 | 68
Video Grounding | QVHighlights (test) | mAP (IoU=0.5) | 24.94 | 64
Temporal Grounding | ActivityNet Captions | Recall@1 (IoU=0.5) | 21.36 | 45
Video Grounding | TACOS | Recall@1 (IoU=0.5) | 5.58 | 45
Video Grounding | ActivityNet Captions | R@1 (IoU=0.5) | 21.36 | 43
Moment Retrieval | QVHighlights v1 (test) | R1@0.5 | 11.41 | 19
Video Grounding | TACOS | IoU@0.5 | 5.58 | 19
Single-sentence video grounding | ActivityNet Captions | IoU@0.5 | 21.36 | 17
Showing 10 of 20 rows
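Most entries above report Recall@1 at a temporal IoU threshold of 0.5: the fraction of test queries whose single top-ranked predicted moment overlaps the ground-truth moment with an intersection-over-union of at least 0.5 (the mAP row instead averages precision over ranked predictions). A minimal sketch of that R@1 computation, with hypothetical helper names and toy intervals:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.5):
    """R@1 (IoU=0.5): fraction of queries whose top-ranked moment reaches the threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy example: top-1 predicted moments vs. ground truth for 3 queries.
preds = [(2.0, 7.0), (10.0, 15.0), (0.0, 4.0)]
gts   = [(3.0, 8.0), (20.0, 25.0), (0.0, 5.0)]
print(recall_at_1(preds, gts))   # 2 of 3 predictions reach IoU >= 0.5 -> 0.667
```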
