A Learned Performance Model for Tensor Processing Units
About
Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks -- tile-size selection and operator fusion -- and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
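As a rough illustration of the learned-model idea, the sketch below fits a latency regressor to pooled per-op features of synthetic tensor programs. This is a minimal sketch, not the paper's architecture (which uses a graph neural network over XLA HLO graphs); all features, shapes, and data here are hypothetical stand-ins.

```python
# Minimal sketch, assuming each tensor program is summarized by pooled
# per-op feature vectors and the target is (log-)runtime. The paper's actual
# model is a graph neural network over XLA HLO graphs; this example only
# illustrates the "graph features -> learned latency estimate" pipeline.
import numpy as np

rng = np.random.default_rng(0)

def pool_graph_features(op_features):
    """Mean-pool per-op feature vectors into one program-level vector."""
    return np.mean(op_features, axis=0)

# Synthetic corpus: 200 "programs", each with 5-50 ops and 8 per-op features
# (hypothetical stand-ins for opcodes, tensor shapes, tile sizes, ...).
programs = [rng.normal(size=(rng.integers(5, 50), 8)) for _ in range(200)]
X = np.stack([pool_graph_features(p) for p in programs])
true_w = rng.normal(size=8)
y = X @ true_w + 0.1 * rng.normal(size=len(X))  # stand-in for log(runtime)

# Least-squares fit of pooled features to the runtime target.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_runtime(op_features):
    """Predict the (log-)runtime of a new tensor program."""
    return pool_graph_features(op_features) @ w

print("prediction for a new program:", predict_runtime(rng.normal(size=(12, 8))))
```

A real compiler cost model would replace the mean pooling and linear fit with a graph network trained on measured kernel runtimes, but the corpus-to-model training loop is the same.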
Related benchmarks
| Task | Dataset | Average MAPE (%) | Rank |
|---|---|---|---|
| Latency Prediction | NNLQ Out-of-domain AlexNet | 10.55 | 8 |
| Latency Prediction | NNLQ Out-of-domain EfficientNet | 16.74 | 8 |
| Latency Prediction | NNLQ Out-of-domain GoogleNet | 8.1 | 8 |
| Latency Prediction | NNLQ Out-of-domain MobileNetV3 | 9.97 | 8 |
| Latency Prediction | NNLQ Out-of-domain MnasNet | 11.61 | 8 |
| Latency Prediction | NNLQ Out-of-domain MobileNetV2 | 12.68 | 8 |
| Latency Prediction | NNLQ Out-of-domain Average | 21.2 | 8 |
| Latency Prediction | NNLQ Out-of-domain SqueezeNet | 24.6 | 8 |
| Latency Prediction | NNLQ Out-of-domain VGG | 38.73 | 8 |
| Latency Prediction | NNLQ Out-of-domain NasBench201 | 58.94 | 8 |
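
The results above are reported as mean absolute percentage error (MAPE). The sketch below shows how that metric is computed; the latency values are hypothetical, not taken from the benchmark.

```python
# Minimal sketch of the MAPE (mean absolute percentage error) metric used
# in the table above; the latencies below are hypothetical examples.
import numpy as np

def mape(measured, predicted):
    """Mean absolute percentage error, in percent."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((predicted - measured) / measured))

measured_ms  = [4.2, 7.9, 1.3, 12.5]   # hypothetical measured latencies
predicted_ms = [4.6, 7.1, 1.5, 11.8]   # hypothetical model predictions
print(f"MAPE = {mape(measured_ms, predicted_ms):.2f}%")
```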