Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

Video-based Human-Object Interaction Detection from Tubelet Tokens

About

We present a novel vision Transformer, named TUTOR, which is able to learn tubelet tokens, served as highly-abstracted spatiotemporal representations, for video-based human-object interaction (V-HOI) detection. The tubelet tokens structurize videos by agglomerating and linking semantically-related patch tokens along spatial and temporal domains, which enjoy two benefits: 1) Compactness: each tubelet token is learned by a selective attention mechanism to reduce redundant spatial dependencies from others; 2) Expressiveness: each tubelet token is enabled to align with a semantic instance, i.e., an object or a human, across frames, thanks to agglomeration and linking. The effectiveness and efficiency of TUTOR are verified by extensive experiments. Results shows our method outperforms existing works by large margins, with a relative mAP gain of $16.14\%$ on VidHOI and a 2 points gain on CAD-120 as well as a $4 \times$ speedup.

Danyang Tu, Wei Sun, Xiongkuo Min, Guangtao Zhai, Wei Shen• 2022

Related benchmarks

TaskDatasetResultRank
HOI DetectionVidHOI (val)
mAP Full26.92
23
Video Human-Object Interaction DetectionVidHOI (test)
Full Interaction AP26.92
10
Sub-activity detectionCAD-120
Sub-activity Accuracy (%)94.7
6
Showing 3 of 3 rows

Other info

Follow for update