
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

About

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in video understanding. Existing studies either slide a window over the entire video or exhaustively rank all possible clip-sentence pairs in a pre-segmented video, and thus inevitably suffer from an exhaustively enumerated set of candidates. To alleviate this problem, we formulate the task as sequential decision making, learning an agent that progressively adjusts the temporal grounding boundaries according to its policy. Specifically, we propose a reinforcement learning-based framework improved by multi-task learning, which shows steady performance gains when additional supervised boundary information is considered during training. Our framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset and the Charades-STA dataset while observing only 10 or fewer clips per video.
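The sequential decision process described above can be sketched as an agent that repeatedly adjusts a temporal window and is rewarded by the change in IoU with the ground-truth segment. The sketch below is illustrative only: the action names (`shift_left`, `expand`, etc.), the initial window, the step size, and the random policy are assumptions for demonstration, not the paper's exact action space or learned policy.

```python
import random

def temporal_iou(a, b):
    """IoU between two [start, end] intervals (in seconds)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

# Hypothetical action set: shift or rescale the current window, or stop.
ACTIONS = ["shift_left", "shift_right", "expand", "shrink", "stop"]

def apply_action(window, action, step, video_len):
    """Apply one boundary-adjustment action, clipped to the video extent."""
    start, end = window
    if action == "shift_left":
        start, end = start - step, end - step
    elif action == "shift_right":
        start, end = start + step, end + step
    elif action == "expand":
        start, end = start - step / 2, end + step / 2
    elif action == "shrink":
        start, end = start + step / 2, end - step / 2
    start = max(0.0, start)
    end = min(video_len, end)
    if end - start < step:          # keep the window non-degenerate
        end = min(video_len, start + step)
    return (start, end)

def grounding_episode(policy, ground_truth, video_len, max_steps=10):
    """Run one episode: the agent observes at most `max_steps` clips and
    adjusts the window; per-step reward is the improvement in IoU."""
    window = (0.0, video_len / 2)   # arbitrary initial window (assumption)
    prev_iou = temporal_iou(window, ground_truth)
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(window)
        if action == "stop":
            break
        window = apply_action(window, action,
                              step=video_len / 10, video_len=video_len)
        iou = temporal_iou(window, ground_truth)
        total_reward += iou - prev_iou
        prev_iou = iou
    return window, total_reward

# Toy random policy, just to exercise the loop; the paper learns this policy.
random_policy = lambda w: random.choice(ACTIONS)
```

A trained policy would replace `random_policy`, conditioning on clip and sentence features; the `max_steps=10` budget mirrors the "10 or fewer clips per video" constraint from the abstract.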

Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, Shilei Wen • 2019

Related benchmarks

Task                                 Dataset                        Metric         Result  Rank
Moment Retrieval                     Charades-STA (test)            R@0.5          36.7    172
Video Grounding                      Charades-STA                   R@1 IoU=0.5    0.367   113
Natural Language Video Localization  Charades-STA (test)            R@1 (IoU=0.5)  36.7    61
Natural Language Video Localization  ActivityNet Caption (test)     IoU@0.5        36.9    16
Video Grounding                      ActivityNet Caption            IoU@0.5        36.9    14
Natural Language Video Localization  TACoS (test)                   IoU@0.5        15.95   10
Video Temporal Grounding             ActivityNet Captions (val)     Recall@0.5     36.9    10
Video Grounding                      ActivityNet Captions (val 1)   R@1 (IoU=0.5)  36.9    5
