Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

About

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer • 2023
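
The abstract names MAE (masked autoencoding) as the pretext task. As a rough illustration of its core step, the sketch below randomly hides a fraction of patch tokens so that an encoder would only see the rest. It is a generic, minimal sketch and not the Hiera implementation: the function name `random_masking`, the 14x14 patch grid, and the 0.75 mask ratio are assumptions chosen for illustration.

```python
import random


def random_masking(num_tokens, mask_ratio=0.75, seed=0):
    """Pick which patch tokens stay visible, MAE-style (illustrative only).

    Returns (keep_idx, mask): keep_idx lists the visible token indices in
    ascending order, and mask[i] is True when token i is hidden.
    """
    rng = random.Random(seed)
    num_keep = round(num_tokens * (1.0 - mask_ratio))
    # Sample the visible tokens uniformly at random, without replacement.
    keep_idx = sorted(rng.sample(range(num_tokens), num_keep))
    kept = set(keep_idx)
    mask = [i not in kept for i in range(num_tokens)]
    return keep_idx, mask


# Assumed example: a 224x224 image cut into 16x16 patches gives 14 * 14 tokens.
keep_idx, mask = random_masking(14 * 14, mask_ratio=0.75)
print(len(keep_idx), sum(mask))
```

In an MAE setup only the tokens in `keep_idx` would be passed to the encoder, and the reconstruction loss would be computed on the hidden ones.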

Related benchmarks

Task	Dataset	Metric	Value	
Image Classification	ImageNet-1K	Top-1 Acc	86.9	1239
Object Detection	COCO (val)	--	--	637
Instance Segmentation	COCO (val)	APmk	48.6	485
Action Recognition	Something-Something v2	Top-1 Accuracy	76.5	363
Image Classification	iNaturalist 2018	Top-1 Accuracy	87.3	291
Video Classification	Kinetics 400 (val)	--	--	204
Video Action Classification	Something-Something v2	Top-1 Acc	75.1	145
Image Classification	iNaturalist 2019	Top-1 Acc	88.5	122
Action Detection	AVA v2.2 (val)	mAP	43.3	99
Video Classification	Kinetics-600 (val)	Accuracy	88.8	84

Showing 10 of 25 rows
