Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization

About

Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution, and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. On the other hand, single-stage methods target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload, compromising performance for speed. These methods build on adding a DETR head with learnable queries that, after cross- and self-attention, can be sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur high complexity. In this paper, we observe that a straight bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can do both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S when trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our Bipartite-Matching Vision Transformer model, BMViT, achieves +3 mAP on AVA2.2 w.r.t. the two-stage MViTv2-S counterpart. Code is available at https://github.com/IoannaNti/BMViT
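To make the central idea concrete, the sketch below matches a handful of output tokens to ground-truth person boxes one-to-one. It is only an illustration under stated assumptions, not the authors' implementation: the cost (L1 box distance minus a per-token person score), the function names `l1` and `match_tokens`, and the toy numbers are all hypothetical, and the exhaustive search stands in for the Hungarian algorithm that a real training pipeline would normally use.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_tokens(pred_boxes, pred_scores, gt_boxes):
    """Assign one distinct token to each ground-truth box at minimum total cost.

    Assumed cost (illustrative only): L1 box distance minus the token's
    person score, so close and confident tokens are preferred.
    Returns (assignment, total_cost), where assignment[j] is the index of
    the token matched to ground-truth box j.
    """
    n_tok, n_gt = len(pred_boxes), len(gt_boxes)
    assert n_gt <= n_tok, "need at least one token per ground-truth box"

    # cost[i][j]: cost of matching token i to ground-truth box j
    cost = [
        [l1(pred_boxes[i], gt_boxes[j]) - pred_scores[i] for j in range(n_gt)]
        for i in range(n_tok)
    ]

    # Brute-force search over all one-to-one assignments. This is exponential
    # and only suitable for tiny examples; it finds the same optimum that the
    # Hungarian algorithm would.
    best_assign, best_cost = None, float("inf")
    for assign in permutations(range(n_tok), n_gt):
        total = sum(cost[i][j] for j, i in enumerate(assign))
        if total < best_cost:
            best_assign, best_cost = assign, total
    return best_assign, best_cost


# Toy example: four output tokens, two annotated people (normalized coordinates).
pred_boxes = [
    (0.10, 0.10, 0.40, 0.80),  # token 0
    (0.50, 0.20, 0.90, 0.90),  # token 1
    (0.00, 0.00, 1.00, 1.00),  # token 2
    (0.52, 0.18, 0.88, 0.92),  # token 3
]
pred_scores = [0.9, 0.6, 0.1, 0.8]
gt_boxes = [
    (0.50, 0.20, 0.90, 0.90),
    (0.10, 0.10, 0.40, 0.80),
]

assignment, total_cost = match_tokens(pred_boxes, pred_scores, gt_boxes)
matched = set(assignment)
unmatched = [i for i in range(len(pred_boxes)) if i not in matched]
print(assignment, unmatched)
```

In this toy case tokens 3 and 0 are matched to the two ground-truth boxes, while tokens 1 and 2 are left unmatched. In a DETR-style loss, matched tokens would receive box and action supervision and unmatched tokens would be pushed towards a "no person" prediction; how BMViT weights these terms is not specified in this section.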

Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos • 2023

Related benchmarks

Task	Dataset	Result
Video Action Detection	UCF101-24 1.0 (test)	F-mAP@0.5: 90.7	17
Video Action Detection	JHMDB21 1.0 (test)	f-mAP@0.5: 88.4	17
Action Detection	UCF-101-24 (test)	F1 Score (IoU=0.5): 90.7	15
Spatio-temporal action detection	AVA v2.2 (val)	mAP: 31.4	10
Action Detection	JHMDB closed-set	F@0.5: 88.4	7
