Weakly Supervised Dense Event Captioning in Videos
About
Dense event captioning aims to detect and describe all events of interest contained in a video. Despite rapid progress in this area, existing methods tackle the task by relying on dense temporal annotations, which are extremely costly to obtain. This paper formulates a new problem: weakly supervised dense event captioning, which does not require temporal segment annotations for model training. Our solution is based on the one-to-one correspondence assumption: each caption describes exactly one temporal segment, and each temporal segment has exactly one caption. This assumption holds in current benchmark datasets and in most real-world cases. We decompose the problem into a pair of dual problems, event captioning and sentence localization, and present a cycle system to train our model. Extensive experimental results demonstrate the ability of our model on both dense event captioning and sentence localization in videos.
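The cycle idea above can be illustrated with a toy sketch (not the authors' implementation; all model shapes and the linear localizer/captioner are illustrative assumptions): a caption is fed to a sentence localizer that produces a soft temporal segment, the captioner regenerates a caption feature from that segment, and a reconstruction loss between the two captions supervises both modules without any ground-truth segment annotations.

```python
import numpy as np

# Toy sketch of the caption -> segment -> caption training cycle.
# Linear models and random features are illustrative assumptions only.

rng = np.random.default_rng(0)

def localize(caption_feat, video_feats, W_loc):
    # Sentence localization: score each frame against the caption and
    # normalize into a soft temporal mask (attention over frames).
    scores = video_feats @ (W_loc @ caption_feat)
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

def caption_from_segment(weights, video_feats, W_cap):
    # Event captioning: pool the frames under the soft mask, then map
    # the pooled segment feature back to the caption feature space.
    segment_feat = weights @ video_feats
    return W_cap @ segment_feat

def cycle_loss(caption_feat, video_feats, W_loc, W_cap):
    # Reconstruction loss closing the cycle; no segment labels needed.
    weights = localize(caption_feat, video_feats, W_loc)
    recon = caption_from_segment(weights, video_feats, W_cap)
    return float(np.sum((recon - caption_feat) ** 2))

T, d_v, d_c = 20, 8, 6                      # frames, video dim, caption dim
video = rng.normal(size=(T, d_v))           # per-frame video features
caption = rng.normal(size=d_c)              # caption feature
W_loc = rng.normal(size=(d_v, d_c)) * 0.1   # localizer parameters
W_cap = rng.normal(size=(d_c, d_v)) * 0.1   # captioner parameters

print(f"cycle reconstruction loss: {cycle_loss(caption, video, W_loc, W_cap):.4f}")
```

Minimizing this loss jointly over both modules is what lets the weak supervision (video-caption pairs alone) train the localizer as a by-product.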
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Dense Video Captioning | ActivityNet (val) | CIDEr | 18.77 | 16 |
| Video Paragraph Grounding | ActivityNet-Captions (test) | R@0.3 | 0.4198 | 12 |
| Video Paragraph Grounding | Charades-CD-OOD (test) | R@0.3 | 35.86 | 11 |
| Temporal Grounding | ActivityNet-CG (test-Trivial) | R@1 (IoU=0.5) | 11.03 | 7 |
| Temporal Grounding | ActivityNet-CG (Novel-Composition) | R@1 (IoU=0.5) | 2.89 | 7 |
| Temporal Grounding | ActivityNet-CG (Novel-Word) | R@1 (IoU=0.5) | 3.09 | 7 |
| Temporal Grounding | Charades-CG Trivial (test) | IoU@0.5 | 15.33 | 7 |
| Temporal Grounding | Charades-CG (Novel-Composition) | IoU@0.5 | 0.0361 | 7 |
| Temporal Grounding | Charades-CG (Novel-Word) | IoU@0.5 | 2.79 | 7 |