Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos
About
The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies either slide a window over the entire video or exhaustively rank all possible clip-sentence pairs in a pre-segmented video, and thus inevitably evaluate a large number of candidates. To alleviate this problem, we formulate the task as sequential decision making, learning an agent that progressively regulates the temporal grounding boundaries according to its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, which shows steady performance gains from the additional supervised boundary information used during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset and the Charades-STA dataset while observing only 10 or fewer clips per video.
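The decision process described above can be sketched as a simple episode loop. This is a minimal illustration, not the authors' implementation: the discrete action set, the step size `delta`, and the IoU-improvement reward are assumptions chosen to mirror the "regulate boundaries progressively, observe at most 10 clips" idea.

```python
import random

def temporal_iou(a, b):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

# Hypothetical discrete action space: shift or rescale the window, or stop.
ACTIONS = ("shift_left", "shift_right", "expand", "contract", "stop")

def apply_action(window, action, delta=5.0, lo=0.0, hi=60.0):
    """Apply one boundary-adjustment action to the window (s, e)."""
    s, e = window
    if action == "shift_left":
        s, e = s - delta, e - delta
    elif action == "shift_right":
        s, e = s + delta, e + delta
    elif action == "expand":
        s, e = s - delta, e + delta
    elif action == "contract":
        s, e = s + delta, e - delta
    s = max(lo, min(s, hi - 1.0))       # clamp to the video extent
    e = max(s + 1.0, min(e, hi))        # keep the window non-empty
    return (s, e)

def ground(policy, gt, init=(0.0, 60.0), max_steps=10):
    """One grounding episode: the agent observes at most `max_steps` clips;
    the reward is the IoU improvement against the ground-truth interval."""
    window, trajectory = init, []
    for _ in range(max_steps):
        action = policy(window)
        if action == "stop":
            break
        new_window = apply_action(window, action)
        reward = temporal_iou(new_window, gt) - temporal_iou(window, gt)
        trajectory.append((window, action, reward))
        window = new_window
    return window, trajectory

random.seed(0)
final, traj = ground(lambda w: random.choice(ACTIONS), gt=(10.0, 25.0))
print(len(traj), round(temporal_iou(final, (10.0, 25.0)), 3))
```

In the paper the policy is a learned network conditioned on the sentence and the current clip; the random policy here only exercises the episode mechanics.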
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Moment Retrieval | Charades-STA (test) | R@0.5 | 36.7 | 172 |
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 0.367 | 113 |
| Natural Language Video Localization | Charades-STA (test) | R@1 (IoU=0.5) | 36.7 | 61 |
| Natural Language Video Localization | ActivityNet Caption (test) | IoU@0.5 | 36.9 | 16 |
| Video Grounding | ActivityNet Caption | IoU@0.5 | 36.9 | 14 |
| Natural Language Video Localization | TACOS (test) | IoU@0.5 | 15.95 | 10 |
| Video Temporal Grounding | ActivityNet Captions (val) | Recall@0.5 | 36.9 | 10 |
| Video Grounding | ActivityNet Captions (val 1) | R@1 (IoU=0.5) | 36.9 | 5 |