Timeception for Complex Action Recognition

About

This paper focuses on the temporal aspect of recognizing human activities in videos, an important visual cue that has long been undervalued. We revisit the conventional definition of activity and restrict it to Complex Action: a set of one-actions with a weak temporal pattern that serves a specific purpose. Related works use spatiotemporal 3D convolutions with a fixed kernel size, which is too rigid to capture the variety in temporal extents of complex actions and too short for long-range temporal modeling. In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reason about minute-long temporal patterns, a factor of 8 longer than the best related works. As a result, Timeception achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions, and MultiTHUMOS. Further, we demonstrate that Timeception learns long-range temporal dependencies and tolerates the varying temporal extents of complex actions.

Noureldien Hussein, Efstratios Gavves, Arnold W.M. Smeulders • 2018
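
The abstract contrasts fixed-size 3D kernels with multi-scale temporal convolutions of reduced complexity. The following is a minimal sketch of that general idea, not the authors' implementation. The kernel sizes (3, 5, 7), the uniform averaging weights, the zero padding, and the choice of per-channel (depthwise) temporal-only kernels as the way to cut parameters are all assumptions made for illustration, and every function name here is hypothetical.

```python
# Illustrative sketch only; NOT the Timeception implementation.
# Assumed for illustration: kernel sizes (3, 5, 7), uniform weights,
# zero padding, and depthwise temporal-only kernels.

def temporal_conv(series, kernel):
    """'Same'-length 1D convolution of one channel's time series.

    The kernels used below are symmetric, so correlation and
    convolution give identical results.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(series) + [0.0] * pad
    return [
        sum(padded[t + i] * kernel[i] for i in range(k))
        for t in range(len(series))
    ]


def multi_scale_temporal(features, kernel_sizes=(3, 5, 7)):
    """Apply several temporal kernel sizes to every channel independently.

    features: list of T timesteps, each a list of C channel values.
    Returns T timesteps, each with C * len(kernel_sizes) values,
    ordered branch by branch (all channels for the first kernel size,
    then all channels for the second, and so on).
    """
    num_steps = len(features)
    num_channels = len(features[0])
    # Re-arrange to one time series per channel.
    channels = [
        [features[t][c] for t in range(num_steps)]
        for c in range(num_channels)
    ]
    branches = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k  # placeholder weights; learned in practice
        for series in channels:
            branches.append(temporal_conv(series, kernel))
    # Back to time-major layout, concatenating all branches per timestep.
    return [[branch[t] for branch in branches] for t in range(num_steps)]


def full_3d_params(t, s, c_in, c_out):
    """Weight count of a dense t x s x s 3D convolution (bias omitted)."""
    return t * s * s * c_in * c_out


def depthwise_temporal_params(t, channels):
    """Weight count of a per-channel, temporal-only kernel (bias omitted)."""
    return t * channels
```

For example, `multi_scale_temporal` on a sequence of 8 timesteps with 2 channels returns 8 timesteps with 6 values each, one per (kernel size, channel) pair. The two counting helpers make the complexity argument concrete under the stated assumptions: a dense 3x3x3 kernel over 256 input and 256 output channels has 1,769,472 weights, while a depthwise temporal kernel of length 3 over 256 channels has 768.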

Related benchmarks

Task	Dataset	Result
Action Recognition	Charades	mAP 0.411	64
Action Recognition	Charades (test)	mAP 0.411	53
Action Recognition	Charades v1 (test)	--	52
Action Recognition	Breakfast	Top-1 Accuracy 71.3	28
Single-label activity classification	Breakfast	Accuracy 71.3	21
Video Action Recognition	Breakfast	Top-1 Accuracy 71.3	18
Action Recognition	Charades v1 (val)	mAP 41.1	15
Human Activity Recognition	Breakfast	Accuracy 71.3	14
Long-form Video Classification	Breakfast	Top-1 Accuracy 71.3	14
Action Recognition	Breakfast (1357:335)	Accuracy 86.9	13

Showing 10 of 15 rows
