
Video ReCap: Recursive Captioning of Hour-Long Videos

About

Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos. Furthermore, we introduce the Ego4D-HCap dataset by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are available at: https://sites.google.com/view/vidrecap

Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius • 2024
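
The three hierarchy levels named in the abstract (clip captions, segment descriptions, video summary) can be pictured with a small control-flow sketch. This is not the authors' implementation: the function `recursive_captions`, the `clips_per_segment` parameter, and the toy captioners are hypothetical, and the sketch assumes each higher level is produced only from the text generated one level below, whereas a real model would also consume video features.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-ins for learned captioners (assumptions, not a real API).
ClipCaptioner = Callable[[str], str]               # one clip -> one caption
TextCaptioner = Callable[[Sequence[str]], str]     # lower-level captions -> one caption


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def recursive_captions(
    clips: Sequence[str],
    clip_captioner: ClipCaptioner,
    segment_captioner: TextCaptioner,
    summary_captioner: TextCaptioner,
    clips_per_segment: int,
) -> Dict[str, object]:
    """Build captions bottom-up: clips, then segments, then one summary."""
    # Level 1: one caption per short clip.
    clip_caps = [clip_captioner(clip) for clip in clips]
    # Level 2: each segment description is built from its clips' captions.
    segment_caps = [
        segment_captioner(group) for group in chunk(clip_caps, clips_per_segment)
    ]
    # Level 3: a single summary built from all segment descriptions.
    summary = summary_captioner(segment_caps)
    return {"clip": clip_caps, "segment": segment_caps, "summary": summary}


if __name__ == "__main__":
    # Toy captioners that only join strings, to make the data flow visible.
    result = recursive_captions(
        clips=["c0", "c1", "c2", "c3", "c4"],
        clip_captioner=lambda clip: f"action in {clip}",
        segment_captioner=lambda caps: " + ".join(caps),
        summary_captioner=lambda caps: " | ".join(caps),
        clips_per_segment=2,
    )
    for level, value in result.items():
        print(level, value)
```

Because each level only reads the output of the level below, the top-level summarizer handles a handful of segment descriptions rather than every frame, which is one plausible reading of why a recursive design can scale to hour-long inputs.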

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | EgoSchema (Full) | Accuracy | 50.23 | 193 |
| Video Question Answering | VideoMME | Accuracy | 33.3 | 99 |
| Video Question Answering | EgoSchema | Accuracy | 34.4 | 88 |
| Action Captioning | XRF IMU v2 (test) | BLEU@4 | 49.1 | 16 |
| Action Captioning | UWash (test) | B@4 | 0.676 | 16 |
| Action Captioning | XRF Wi-Fi v2 (test) | BLEU@4 | 0.215 | 15 |
| Action Captioning | WiFiTAD (test) | B@4 | 25.7 | 15 |
| Video Question Answering | HourVideo | Accuracy | 29.5 | 11 |
| Video Segment Description | Ego4D HCap v1 (test) | CIDEr | 46.88 | 8 |
| Video Summary | Ego4D HCap v1 (test) | CIDEr | 31.06 | 8 |
Showing 10 of 14 rows
