
Towards Long-Form Video Understanding

About

Our world offers a never-ending stream of visual stimuli, yet today's vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art on the AVA dataset.
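The paper's architecture is not reproduced here, but the core idea of an object-centric transformer can be sketched: treat each detected object instance, pooled from short clips across the whole video, as a token, and let self-attention contextualize every object against all others over the long time span. The sketch below is a minimal illustration in numpy, not the authors' implementation; the dimensions, random projections, and the `object_attention` helper are hypothetical stand-ins for learned components and a short-term backbone.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def object_attention(object_feats, d_k=None):
    """Single-head self-attention over object tokens from a long video.

    object_feats: (num_objects, dim) array -- one feature vector per
    detected object instance, gathered across the entire video.
    Returns contextualized features of shape (num_objects, d_k).
    """
    n, d = object_feats.shape
    d_k = d_k or d
    rng = np.random.default_rng(0)
    # Hypothetical learned projections (random here for illustration).
    W_q, W_k, W_v = (rng.standard_normal((d, d_k)) / np.sqrt(d)
                     for _ in range(3))
    Q, K, V = object_feats @ W_q, object_feats @ W_k, object_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n) attention weights
    return attn @ V                         # each object attends to all others

# Example: 50 object tokens pooled from clips spanning a full-length video.
feats = np.random.default_rng(1).standard_normal((50, 64))
out = object_attention(feats)
print(out.shape)  # (50, 64)
```

The point of the design is that attention is quadratic in the number of object tokens rather than the number of frames, which is what makes reasoning over minutes- or hours-long video tractable.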

Chao-Yuan Wu, Philipp Krähenbühl • 2021

Related benchmarks

Task | Dataset | Result | Rank
Action Detection | AVA v2.2 (val) | -- | 99
Long-form Video Understanding | LVU | Relation Attribute Accuracy: 54.8 | 44
Action Detection | AVA v2.2 | -- | 42
Action Localization | AVA 2.2 | mAP (center): 31 | 25
Long-form Video Understanding | LVU (test) | Relation Top-1 Acc: 54.76 | 16
Action Recognition | AVA 2.2 | mAP: 31 | 16
Long-form Video Understanding | LVU 1.0 (test) | Director Accuracy: 51.2 | 14
Action Recognition | AVA v2.1 (val) | -- | 14
Long Video Understanding (Classification & Regression) | LVU 53 (test) | Place Accuracy: 56.9 | 10
Long-form Video Classification | LVU | Relation Accuracy: 53.1 | 10

(Showing 10 of 15 rows.)

Other info

Code
