Towards Long-Form Video Understanding
About
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art on the AVA dataset.
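The abstract describes an object-centric transformer that attends over object-level features pooled across a long video. As a rough illustration only (the layer sizes, token construction, and pooling below are assumptions, not the paper's actual architecture), a single self-attention step over object tokens might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over object tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 64          # per-object feature dimension (illustrative choice)
n_objects = 30  # object detections pooled across many short clips
tokens = rng.standard_normal((n_objects, d))  # stand-in for detector features
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

attended = self_attention(tokens, w_q, w_k, w_v)  # objects contextualize each other
video_repr = attended.mean(axis=0)                # pool objects -> one video vector
```

The key idea this sketch captures is that attention lets each object token aggregate context from objects seen minutes apart, rather than only within a few-second clip.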
Chao-Yuan Wu, Philipp Krähenbühl • 2021
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Detection | AVA v2.2 (val) | -- | 99 |
| Long-form Video Understanding | LVU | Relation Attribute Accuracy: 54.8 | 44 |
| Action Detection | AVA v2.2 | -- | 42 |
| Action Localization | AVA v2.2 | mAP (center): 31 | 25 |
| Long-form Video Understanding | LVU (test) | Relation Top-1 Acc: 54.76 | 16 |
| Action Recognition | AVA v2.2 | mAP: 31 | 16 |
| Long-form Video Understanding | LVU 1.0 (test) | Director Accuracy: 51.2 | 14 |
| Action Recognition | AVA v2.1 (val) | -- | 14 |
| Long Video Understanding (Classification & Regression) | LVU 53 (test) | Place Accuracy: 56.9 | 10 |
| Long-form Video Classification | LVU | Relation Accuracy: 53.1 | 10 |
Showing 10 of 15 rows