HierVL: Learning Hierarchical Video-Language Embeddings
About
Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
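To make the two-level objective above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a symmetric InfoNCE loss at both levels, mean pooling of clip embeddings to form the video-level embedding, and a simple weighted sum of the two losses; the abstract specifies none of these choices. The names `info_nce`, `mean_pool`, `hierarchical_loss`, and `video_weight` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, targets, temperature=0.07):
    """Symmetric InfoNCE: anchors[i] pairs with targets[i]; all other rows are negatives."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    n = len(a)
    sim = [[dot(a[i], t[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean of -log softmax(row)[i], computed with the log-sum-exp trick.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(transposed))


def mean_pool(vectors):
    """Assumed aggregator: average the clip embeddings of one video."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hierarchical_loss(clip_feats, narration_feats, summary_feats, video_weight=1.0):
    """clip_feats[v][c] is the embedding of clip c in video v;
    narration_feats has the same shape; summary_feats[v] is one vector per video."""
    # Clip level: each clip is aligned with its own timestamped narration.
    flat_clips = [c for video in clip_feats for c in video]
    flat_narrations = [n for video in narration_feats for n in video]
    clip_loss = info_nce(flat_clips, flat_narrations)

    # Video level: pooled clip embeddings are aligned with the summary text.
    video_embs = [mean_pool(video) for video in clip_feats]
    video_loss = info_nce(video_embs, summary_feats)

    return clip_loss + video_weight * video_loss, clip_loss, video_loss


# Toy data: 2 videos, 2 clips each, 4-dimensional embeddings.
clips = [[[1, 0, 0, 0], [0, 1, 0, 0]], [[0, 0, 1, 0], [0, 0, 0, 1]]]
narrations = [[[1, 0, 0, 0], [0, 1, 0, 0]], [[0, 0, 1, 0], [0, 0, 0, 1]]]
summaries = [[1, 1, 0, 0], [0, 0, 1, 1]]

aligned = hierarchical_loss(clips, narrations, summaries)
swapped = hierarchical_loss(clips, narrations, summaries[::-1])
print("aligned total loss:", aligned[0])
print("swapped-summary total loss:", swapped[0])
```

In this toy setup, swapping the two summaries leaves the clip-level term unchanged and raises only the video-level term, which illustrates how the summary text supplies a training signal that the per-clip narrations alone do not.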
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long-term Action Anticipation | Ego4D v1 (test) | ED@Z=20 Verb | 0.723 | 31 |
| Action Recognition | Charades-Ego first-person (test) | mAP | 0.338 | 21 |
| Multi-Instance Retrieval | Epic Kitchens 100 | mAP (Avg) | 18.9 | 19 |
| Long Term Anticipation | Ego4D LTA v1 (test) | ED@Z=20 Verb | 0.724 | 18 |
| Multi-Instance Retrieval | EPIC-KITCHENS 100 (test) | -- | -- | 15 |
| Long-Term Anticipation (LTA) | Ego4D (test) | Verb Anticipation Accuracy | 72.4 | 9 |
| Action Recognition | Charades-Ego | mAP | 26 | 8 |
| Long Term Anticipation | Ego4D (test) | Verb ED | 0.7223 | 6 |
| Video Classification | HowTo100M | Accuracy | 64.6 | 4 |