
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

About

Neural network scaling has been critical for improving model quality in many real-world machine learning applications with vast amounts of training data and compute. While scaling up reliably yields better model quality, it raises challenges along the way, such as computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to existing model code. GShard enabled us to scale a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days, achieving far superior quality for translation from 100 languages to English compared to the prior art.
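The core of the Sparsely-Gated Mixture-of-Experts layer mentioned above is conditional computation: a gating network routes each token to only its top-2 experts, so compute per token stays roughly constant as the number of experts (and parameters) grows. The sketch below illustrates that routing idea in plain NumPy; it is a simplified illustration, not GShard's implementation, and omits the paper's expert-capacity limits, auxiliary load-balancing loss, random dispatch, and cross-device sharding.

```python
import numpy as np

def top2_gate(logits):
    """Pick the two highest-scoring experts per token; softmax over the pair."""
    top2 = np.argsort(logits, axis=-1)[:, -2:]           # (tokens, 2) expert ids
    picked = np.take_along_axis(logits, top2, axis=-1)   # (tokens, 2) raw scores
    weights = np.exp(picked) / np.exp(picked).sum(-1, keepdims=True)
    return top2, weights

def moe_layer(x, gate_w, expert_w):
    """Route each token through its top-2 experts and mix the outputs.

    x:        (tokens, d_model) token activations
    gate_w:   (d_model, n_experts) gating projection
    expert_w: (n_experts, d_model, d_model) one weight matrix per expert
    """
    experts, weights = top2_gate(x @ gate_w)             # (tokens, 2) each
    out = np.zeros_like(x)
    for slot in range(2):                                # only 2 experts run per token
        for e in range(expert_w.shape[0]):
            mask = experts[:, slot] == e
            out[mask] += weights[mask, slot:slot + 1] * (x[mask] @ expert_w[e])
    return out

rng = np.random.default_rng(0)
tokens, d_model, n_experts = 8, 16, 4
x = rng.normal(size=(tokens, d_model))
y = moe_layer(x,
              rng.normal(size=(d_model, n_experts)),
              rng.normal(size=(n_experts, d_model, d_model)))
print(y.shape)  # (8, 16) -- output shape matches input; only 2 of 4 experts ran per token
```

In GShard the expert weight tensor is what gets sharded: each device holds a subset of the experts, and tokens are dispatched across devices by the gate, which is what lets the parameter count grow with the number of accelerators.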

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen • 2020

Related benchmarks

Task                                Dataset        Metric           Result   Rank
Commonsense Reasoning               HellaSwag      Accuracy         72       1460
Mathematical Reasoning              GSM8K          Accuracy         6.4      983
Code Generation                     HumanEval      Pass@1           17.7     850
Multi-task Language Understanding   MMLU           Accuracy         26.3     842
Commonsense Reasoning               WinoGrande     Accuracy         67.6     776
Language Understanding              MMLU           Accuracy         36.7     756
Question Answering                  ARC Challenge  Accuracy         45.8     749
Commonsense Reasoning               PIQA           Accuracy         77.6     647
Question Answering                  ARC Easy       Normalized Acc   64       385
Reading Comprehension               RACE high      Accuracy         43.5     295

Showing 10 of 48 rows
