
A Learned Performance Model for Tensor Processing Units

About

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks -- tile-size selection and operator fusion -- and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
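One way a learned model helps when TPU access is limited, as the abstract describes, is by letting an autotuner rank candidate configurations by predicted latency and reserve real hardware for only the most promising ones. A minimal sketch of that loop, where `predicted_latency` is a hypothetical stand-in for the learned model (a real model would embed the tensor computation graph and the candidate configuration):

```python
def predicted_latency(config):
    # Hypothetical stand-in for a learned performance model.
    # Here we pretend the model has learned that tile sizes near
    # (128, 64) run fastest on the target hardware.
    tile_m, tile_n = config
    return abs(tile_m - 128) + abs(tile_n - 64) + 1.0

def autotune(candidates, budget=3):
    """Rank candidate tile-size configs by predicted latency instead of
    measuring each one on a TPU, then return the top few for (expensive)
    hardware measurement."""
    return sorted(candidates, key=predicted_latency)[:budget]

# Search over a small grid of tile sizes.
candidates = [(m, n) for m in (32, 64, 128, 256) for n in (16, 32, 64, 128)]
best = autotune(candidates)
print(best[0])  # → (128, 64)
```

The point of the sketch is the interface, not the model: the autotuner only ever calls the predictor, so swapping an analytical cost model for a learned one requires no change to the search loop.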

Samuel J. Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou, Charith Mendis, Sudip Roy, Amit Sabne, Mike Burrows • 2020

Related benchmarks

Task               | Dataset                          | Result (MAPE %) | Rank
Latency Prediction | NNLQ Out-of-domain AlexNet       | 10.55           | 8
Latency Prediction | NNLQ Out-of-domain EfficientNet  | 16.74           | 8
Latency Prediction | NNLQ Out-of-domain GoogleNet     | 8.1             | 8
Latency Prediction | NNLQ Out-of-domain MobileNetV3   | 9.97            | 8
Latency Prediction | NNLQ Out-of-domain MnasNet       | 11.61           | 8
Latency Prediction | NNLQ Out-of-domain MobileNetV2   | 12.68           | 8
Latency Prediction | NNLQ Out-of-domain Average       | 21.2            | 8
Latency Prediction | NNLQ Out-of-domain SqueezeNet    | 24.6            | 8
Latency Prediction | NNLQ Out-of-domain VGG           | 38.73           | 8
Latency Prediction | NNLQ Out-of-domain NasBench201   | 58.94           | 8

Showing 10 of 24 rows.
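The results above are reported as mean absolute percentage error (MAPE): the average, over all measured programs, of the absolute prediction error relative to the true latency, expressed as a percentage. A short illustration (the latency values are made up for the example):

```python
def mape(predicted, actual):
    """Mean absolute percentage error between predicted and measured latencies."""
    assert len(predicted) == len(actual) and len(actual) > 0
    return 100.0 * sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)

# Both predictions are 10% above the measured latency, so MAPE is 10%.
print(mape(predicted=[1.1, 2.2], actual=[1.0, 2.0]))  # → 10.0 (up to float rounding)
```

Lower is better, so e.g. the 8.1 on GoogleNet indicates much closer predictions than the 58.94 on NasBench201.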
