COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
About
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchy information and to model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer, which learns the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
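The cross-modal cycle-consistency idea can be illustrated with a minimal NumPy sketch (a simplification, not the repository's actual implementation): embed clips and sentences in a shared space, walk from a clip to its soft nearest-neighbour sentence, cycle back to the clips, and penalize how far the cycle lands from the starting index. The function name `cycle_consistency_loss` and the squared-index penalty are illustrative assumptions here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cycle_consistency_loss(clips, sents):
    """Soft nearest-neighbour cycle clip -> sentence -> clip (sketch).

    clips: (n, d) clip embeddings; sents: (n, d) sentence embeddings,
    assumed aligned one-to-one. Returns the mean squared distance between
    each clip's index and the index its cycle lands back on.
    """
    # Soft nearest-neighbour sentence for every clip.
    sim_cs = softmax(clips @ sents.T, axis=1)   # (n, n) clip->sentence weights
    nn_sent = sim_cs @ sents                    # (n, d) soft sentence embedding
    # Cycle back from that soft sentence to the clips.
    sim_sc = softmax(nn_sent @ clips.T, axis=1) # (n, n) sentence->clip weights
    idx = np.arange(clips.shape[0])
    soft_idx = sim_sc @ idx                     # expected landing index per clip
    return float(np.mean((soft_idx - idx) ** 2))
```

With well-aligned embeddings (e.g. sharply separated clip/sentence pairs), the cycle returns to its start and the loss approaches zero; mismatched embeddings land elsewhere and are penalized, which is what pushes the two modalities into a joint space during training.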
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | YouCook2 | Recall@10 | 52.3 | 117 |
| Video Captioning | YouCook2 | METEOR | 19.34 | 104 |
| Video Captioning | YouCook II (val) | CIDEr | 57.2 | 98 |
| Text-to-Video Retrieval | YouCook2 (val) | R@1 | 77.2 | 66 |
| Text-to-Video Retrieval | YouCook2 (test) | Recall@10 | 52.3 | 54 |
| Video Captioning | YouCook2 (test) | CIDEr | 57 | 42 |
| Video Paragraph Captioning | ActivityNet Captions ae (test) | BLEU@4 | 10.85 | 24 |
| Video Level Summarization | YouCook2 | METEOR | 19.85 | 21 |
| Video Captioning | ActivityNet-Captions ae MART (test) | BLEU@3 | 17.43 | 9 |
| Video Paragraph Captioning | ActivityNet Captions | BLEU@4 | 10.85 | 9 |