Training-free Transformer Architecture Search

About

Recently, the Vision Transformer (ViT) has achieved remarkable success in several computer vision tasks. This progress is closely tied to architecture design, so it is worthwhile to propose Transformer Architecture Search (TAS) to search for better ViTs automatically. However, current TAS methods are time-consuming, and existing zero-cost proxies designed for CNNs do not generalize well to the ViT search space according to our experimental observations. In this paper, for the first time, we investigate how to conduct TAS in a training-free manner and devise an effective training-free TAS (TF-TAS) scheme. First, we observe that the properties of multi-head self-attention (MSA) and multi-layer perceptron (MLP) in ViTs are quite different and that the synaptic diversity of MSA notably affects performance. Second, based on this observation, we devise a modular strategy in TF-TAS that evaluates and ranks ViT architectures from two theoretical perspectives, synaptic diversity and synaptic saliency, together termed the DSS-indicator. With the DSS-indicator, evaluation results are strongly correlated with the test accuracies of ViT models. Experimental results demonstrate that TF-TAS achieves competitive performance against state-of-the-art manually or automatically designed ViT architectures and greatly improves search efficiency in the ViT search space: from about $24$ GPU days to less than $0.5$ GPU days. Moreover, the proposed DSS-indicator outperforms existing cutting-edge zero-cost approaches (e.g., TE-score and NASWOT).

Qinqin Zhou, Kekai Sheng, Xiawu Zheng, Ke Li, Xing Sun, Yonghong Tian, Jie Chen, Rongrong Ji • 2022
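
The abstract's claim that indicator scores are "strongly correlated" with test accuracies is typically checked with a rank correlation such as Kendall's tau. The following is a minimal sketch of that check, not the authors' code: the function `kendall_tau` is a plain tau-a implementation written for illustration, and the scores and accuracies are made-up numbers, not results from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # Positive product: both sequences order the pair the same way.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for five candidate architectures (illustration only).
proxy_scores = [0.8, 2.1, 1.4, 3.0, 2.6]          # training-free indicator
test_accuracies = [71.2, 74.9, 75.3, 78.8, 77.5]  # accuracy after training

tau = kendall_tau(proxy_scores, test_accuracies)

# Rank candidates by proxy score, best first, as a search would.
ranking = sorted(range(len(proxy_scores)),
                 key=lambda i: proxy_scores[i], reverse=True)
```

With these toy values, nine of the ten pairs are ordered consistently and one is not, so `tau` is 0.8. A training-free search only needs the `ranking` step; the correlation is computed afterwards to judge how trustworthy the indicator is.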

Related benchmarks

Task	Dataset	Metric	Result	
Object Detection	COCO 2017 (val)	AP	39.7	2843
Image Classification	ImageNet-1k (val)	Top-1 Accuracy	80.5	920
Image Classification	ImageNet 1k (test)	Top-1 Accuracy	75.3	880
Image Classification	ImageNet	Top-1 Accuracy	83.5	431
Image Classification	ImageNet	Top-1 Accuracy	82.2	343
Zero-shot performance prediction	GLUE FlexiBERT search space (500 models) (aggregate)	Spearman Correlation	0.177	11
Image Classification	CIFAR-100	Top-1 Accuracy	91.2	6
Ranking Correlation	Autoformer search space	Kendall's Tau	14.5	5
