Towards Long-Form Video Understanding
About
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art on the AVA dataset.
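The abstract describes an object-centric transformer that attends over object-level features pooled across a long video. As a rough illustration only (the layer sizes, token construction, and pooling below are assumptions, not the paper's actual architecture), a single self-attention step over object tokens might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over object tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 64          # per-object feature dimension (illustrative choice)
n_objects = 30  # object detections pooled across many short clips
tokens = rng.standard_normal((n_objects, d))  # stand-in for detector features
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

attended = self_attention(tokens, w_q, w_k, w_v)  # objects contextualize each other
video_repr = attended.mean(axis=0)                # pool objects -> one video vector
```

The key idea this sketch captures is that attention lets each object token aggregate context from objects seen minutes apart, rather than only within a few-second clip.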
Chao-Yuan Wu, Philipp Krähenbühl • 2021
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Detection | AVA v2.2 (val) | -- | 99 |
| Long-form Video Understanding | LVU | Relation Attribute Accuracy: 54.8 | 44 |
| Action Detection | AVA v2.2 | -- | 42 |
| Action Localization | AVA v2.2 | mAP (center): 31 | 25 |
| Long-form Video Understanding | LVU (test) | Relation Top-1 Acc: 54.76 | 16 |
| Action Recognition | AVA v2.2 | mAP: 31 | 16 |
| Long-form Video Understanding | LVU 1.0 (test) | Director Accuracy: 51.2 | 14 |
| Action Recognition | AVA v2.1 (val) | -- | 14 |
| Long Video Understanding (Classification & Regression) | LVU 53 (test) | Place Accuracy: 56.9 | 10 |
| Long-form Video Classification | LVU | Relation Accuracy: 53.1 | 10 |
Showing 10 of 15 rows