EVA-02: A Visual Representation for Neon Genesis

About

We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community at https://github.com/baaivision/EVA/tree/master/EVA-02.

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao • 2023
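
The abstract describes pre-training by masked image modeling, where the model reconstructs language-aligned features from a CLIP vision encoder. A minimal sketch of such an objective follows. It assumes the loss is the negative cosine similarity between predicted and teacher features, averaged over masked patches only. The function names, the toy dimensions, and the 0.4 mask ratio are hypothetical and chosen for illustration; none of them are taken from the EVA-02 code release.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean list marking which patches are masked (hypothetical helper)."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_feature_loss(student_pred, teacher_feat, mask):
    """Mean negative cosine similarity, computed over masked patches only.

    student_pred: per-patch features predicted by the model being pre-trained
    teacher_feat: per-patch target features (here standing in for CLIP features)
    mask:         per-patch booleans, True where the input patch was masked
    """
    sims = [
        cosine_similarity(s, t)
        for s, t, m in zip(student_pred, teacher_feat, mask)
        if m
    ]
    if not sims:
        raise ValueError("mask selects no patches")
    return -sum(sims) / len(sims)


if __name__ == "__main__":
    rng = random.Random(0)
    num_patches, dim = 16, 8  # toy sizes, assumed for illustration

    # Random vectors stand in for the teacher's per-patch features.
    teacher = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
    mask = random_mask(num_patches, 0.4, rng)

    # A perfect reconstruction reaches the minimum loss of about -1.
    print("perfect prediction:", masked_feature_loss(teacher, teacher, mask))

    # A noisy reconstruction scores worse (closer to 0).
    noisy = [[x + rng.gauss(0, 1) for x in row] for row in teacher]
    print("noisy prediction:  ", masked_feature_loss(noisy, teacher, mask))
```

Restricting the loss to masked positions means the model is scored only on features it could not read directly from the visible input, which is the usual motivation for masked modeling objectives.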

Related benchmarks

Task	Dataset	Metric	Result
Semantic Segmentation	ADE20K (val)	mIoU	60.1	3069
Object Detection	COCO 2017 (val)	AP	64.1	2843
Image Classification	ImageNet-1K 1.0 (val)	Top-1 Accuracy	85.8	2238
Object Detection	COCO (test-dev)	mAP	64.5	1239
Person Re-Identification	Market 1501	mAP	88.14	1136
Object Detection	COCO (val)	--	--	637
Instance Segmentation	COCO (val)	--	--	485
Object Detection	LVIS (val)	mAP	65.2	170
Semantic Segmentation	GTA5 to {Cityscapes, Mapillary, BDD} (test)	mIoU (Cityscapes)	62.1	94
Person Re-Identification	LTCC General	mAP	45.9	82

Showing 10 of 55 rows
