
To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression

About

Given an untrimmed video and a sentence description, temporal sentence localization aims to automatically determine the start and end points of the video segment described by the sentence. The problem is challenging because it requires understanding both the video and the sentence. Existing research predominantly employs a costly "scan and localize" framework, neglecting the global video context and the specific details within sentences, both of which are critical for this task. In this paper, we propose a novel Attention Based Location Regression (ABLR) approach that solves temporal sentence localization from a global perspective. Specifically, to preserve context information, ABLR first encodes both the video and the sentence via Bidirectional LSTM networks. Then, a multi-modal co-attention mechanism is introduced to generate not only video attention, which reflects the global video structure, but also sentence attention, which highlights the details crucial for temporal localization. Finally, a novel attention based location regression network is designed to predict the temporal coordinates of the sentence query from the preceding attentions. ABLR is jointly trained in an end-to-end manner. Comprehensive experiments on the ActivityNet Captions and TACoS datasets demonstrate both the effectiveness and the efficiency of the proposed ABLR approach.
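To make the pipeline concrete, below is a minimal NumPy sketch of the idea: compute a clip-word similarity matrix, derive video attention and sentence attention from it, pool each modality by its attention, and regress normalized start/end coordinates from the fused summary. This is an illustrative approximation only; the paper's actual model uses Bi-LSTM encoders and learned co-attention and regression parameters, whereas here `softmax`-pooled dot-product similarity and a random weight matrix `W` stand in for the trained components.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ablr_sketch(video_feats, sent_feats, rng):
    """Toy attention-based location regression.

    video_feats: (T, d) clip features for T video clips
    sent_feats:  (L, d) word features for an L-word sentence
    Returns normalized (start, end) in [0, 1] and the video attention.
    """
    T, d = video_feats.shape
    # Co-attention: similarity between every clip and every word.
    S = video_feats @ sent_feats.T            # (T, L)
    video_attn = softmax(S.max(axis=1))       # (T,) attention over clips
    sent_attn = softmax(S.max(axis=0))        # (L,) attention over words
    # Attention-weighted summaries of each modality.
    v = video_attn @ video_feats              # (d,)
    s = sent_attn @ sent_feats                # (d,)
    # Regression head (random weights stand in for learned ones).
    W = rng.standard_normal((2, 2 * d)) * 0.1
    coords = 1.0 / (1.0 + np.exp(-(W @ np.concatenate([v, s]))))
    start, end = sorted(coords)               # normalized temporal coordinates
    return start, end, video_attn

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))   # 8 clips, 16-dim features
sentence = rng.standard_normal((5, 16))  # 5 words, 16-dim features
start, end, attn = ablr_sketch(video, sentence, rng)
```

The key property illustrated is that localization is a single regression over globally attended features, rather than a "scan and localize" sweep over candidate windows, which is where ABLR's efficiency advantage comes from.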

Yitian Yuan, Tao Mei, Wenwu Zhu • 2018

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Moment Retrieval | TACOS (test) | Recall@1 (0.5 Threshold) | 9.4 | 70 |
| Video Grounding | ActivityNet Captions | R@1 (IoU=0.5) | 36.79 | 43 |
| Video Grounding | TACOS | IoU@0.5 | 9.4 | 19 |
| Single-sentence video grounding | ActivityNet Captions | IoU@0.5 | 36.79 | 17 |
| Natural Language Video Localization | ActivityNet Caption (test) | IoU@0.5 | 36.79 | 16 |
| Single-sentence video grounding | TACOS | IoU@0.5 Threshold | 9.4 | 16 |
| Video Grounding | ActivityNet Caption | IoU@0.5 | 36.79 | 14 |
| Natural Language Video Localization | TACOS (test) | IoU@0.5 | 9.4 | 10 |
| Video Grounding | TACOS (test) | Recall@1 (IoU=0.5) | 9.4 | 8 |
| Video Grounding | ActivityNet-Captions (val 2) | R@1 (IoU=0.5) | 36.79 | 4 |
