
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

About

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
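To give an intuition for the cross-modal cycle-consistency idea mentioned above, here is a minimal NumPy sketch: a clip embedding retrieves its soft nearest-neighbor sentence, which then retrieves back a soft clip index; the loss penalizes how far the cycle lands from the clip's own index. This is an illustrative simplification under our own assumptions, not the authors' implementation, and the function name `cycle_consistency_loss` is hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def cycle_consistency_loss(clips, sents):
    """Illustrative sketch (not the paper's code): clips is an (n, d)
    array of clip embeddings, sents an (m, d) array of sentence
    embeddings. Each clip soft-retrieves a sentence, which
    soft-retrieves a clip index back; the loss is the mean squared
    deviation of that index from the starting clip's index."""
    n = clips.shape[0]
    loss = 0.0
    for i in range(n):
        # Clip -> soft nearest-neighbor sentence (attention over sentences).
        w = softmax(sents @ clips[i])
        s_bar = w @ sents
        # Sentence -> soft clip index (attention over clips).
        v = softmax(clips @ s_bar)
        idx = v @ np.arange(n)
        loss += (idx - i) ** 2
    return loss / n
```

When clip and sentence embeddings are well aligned, each cycle returns to its starting index and the loss approaches zero; misaligned embeddings land elsewhere and are penalized.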

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox • (1) University of Freiburg, (2) University of Maryland, Baltimore County • 2020

Related benchmarks

| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | YouCook2 | Recall@10 | 52.3 | 117 |
| Video Captioning | YouCook2 | METEOR | 19.34 | 104 |
| Video Captioning | YouCook2 (val) | CIDEr | 57.2 | 98 |
| Text-to-Video Retrieval | YouCook2 (val) | R@1 | 77.2 | 66 |
| Text-to-Video Retrieval | YouCook2 (test) | Recall@10 | 52.3 | 54 |
| Video Captioning | YouCook2 (test) | CIDEr | 57 | 42 |
| Video Paragraph Captioning | ActivityNet Captions ae (test) | BLEU@4 | 10.85 | 24 |
| Video Level Summarization | YouCook2 | METEOR | 19.85 | 21 |
| Video Captioning | ActivityNet Captions ae MART (test) | BLEU@3 | 17.43 | 9 |
| Video Paragraph Captioning | ActivityNet Captions | BLEU@4 | 10.85 | 9 |

Showing 10 of 14 rows.

Other info

Code

https://github.com/gingsi/coot-videotext