
Real-time Online Video Detection with Temporal Smoothing Transformers

About

Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both the long-term dynamics and the short-term changes of a video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address it, we reformulate the cross-attention in a video transformer through the lens of kernels and apply one of two temporal smoothing kernels: a box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame and requires only a constant-time update per frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer that takes in arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs $6\times$ faster than equivalent sliding-window-based transformers with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS'14 and EPIC-Kitchens-100, two standard online action detection and action anticipation datasets. A real-time version of TeSTra outperforms all but one prior approach on the THUMOS'14 dataset.
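The constant-time update described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single query that stays fixed across frames, an exponential dot-product attention score, and an exponentially decaying temporal weight standing in for the Laplace kernel. The names `StreamingAttention`, `full_attention`, and `decay` are hypothetical.

```python
import math


def _score(query, key):
    # Unnormalized attention weight exp(q.k / sqrt(d)); an assumed kernel choice.
    # No max-subtraction is done here, so very large scores could overflow.
    d = len(query)
    return math.exp(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))


class StreamingAttention:
    """Attention for one fixed query with exponential temporal decay.

    Only two running sums are cached, so each new frame costs the same
    amount of work regardless of how many frames came before.
    """

    def __init__(self, query, value_dim, decay):
        self.query = list(query)
        self.gamma = math.exp(-decay)  # per-frame decay factor (assumed form)
        self.num = [0.0] * value_dim   # decayed sum of weight * value
        self.den = 0.0                 # decayed sum of weights

    def update(self, key, value):
        w = _score(self.query, key)
        self.num = [self.gamma * n + w * v for n, v in zip(self.num, value)]
        self.den = self.gamma * self.den + w
        return [n / self.den for n in self.num]


def full_attention(query, keys, values, decay):
    # Reference that recomputes everything from scratch: cost grows with history.
    t = len(keys) - 1
    weights = [_score(query, k) * math.exp(-decay * (t - n))
               for n, k in enumerate(keys)]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]
```

Calling `update` once per frame yields the same output as `full_attention` over the whole history, up to floating-point error, while storing only `num` and `den`. A box kernel would instead need the oldest frame's contribution subtracted as it leaves the window, which this sketch does not cover.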

Yue Zhao, Philipp Krähenbühl • 2022

Related benchmarks

Task | Dataset | Result | Rank
Online Action Detection | THUMOS14 (test) | mAP 71.2 | 86
Action Anticipation | EPIC-KITCHENS 100 (test) | Overall Action Top-5 Recall 17.6 | 59
Online Action Detection | THUMOS 14 | Mean F-AP 71.2 | 37
Action Anticipation | EPIC-Kitchens-100 Unseen | Verb Recall@5 29.6 | 15
Action Anticipation | THUMOS 2014 | mAP (Avg) 56.8 | 14
Action Anticipation | THUMOS-14 (test) | -- | 14
Online Action Detection | CrossTask | mAP 33.4 | 12
Action Anticipation | THUMOS 2014 (test) | mAP 56.8 | 11
Action Anticipation | EPIC-Kitchens-100 Tail | Verb Recall@5 23.2 | 9
Action Anticipation | EpicKitchens-100 | Top-5 Acc (Verb) 30.8 | 8
Showing 10 of 13 rows
