[2006.16668] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
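
The abstract names GShard's two pieces: lightweight per-tensor sharding annotations and an XLA compiler extension that turns the annotated single-device program into a parallel one. As a rough illustration of the annotation idea only, here is a minimal sketch using JAX's jax.sharding API as an assumed stand-in (GShard's own annotations live in TensorFlow/Lingvo and are not shown here); the mesh axis name, tensor sizes, and one-expert-per-device layout are made up for the example.

```python
# Minimal sketch of annotation-driven sharding, using JAX's jax.sharding
# API as a stand-in for GShard-style lightweight annotations (assumption:
# this is NOT GShard's actual TensorFlow API). The compiler (XLA) derives
# the parallel execution from the shardings attached to the tensors.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

num_experts = len(jax.devices())          # one expert per device, for simplicity
d_model, d_ff, tokens_per_expert = 64, 256, 8

# 1-D device mesh whose single axis we call "expert" (illustrative name).
mesh = Mesh(np.array(jax.devices()), axis_names=("expert",))

# Annotate expert weights and token buffers: split the leading (expert)
# dimension across devices, replicate the rest. The model code is unchanged.
w = jax.device_put(jnp.ones((num_experts, d_model, d_ff)),
                   NamedSharding(mesh, P("expert", None, None)))
x = jax.device_put(jnp.ones((num_experts, tokens_per_expert, d_model)),
                   NamedSharding(mesh, P("expert", None, None)))

@jax.jit
def expert_ffn(x, w):
    # Each device applies its own expert's weights to its own tokens;
    # the sharded execution is inferred from the input shardings.
    return jax.nn.relu(jnp.einsum("etd,edf->etf", x, w))

y = expert_ffn(x, w)
print(y.shape, y.sharding)
```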

7 mentions: @lepikhin, @deliprao, @arankomatsuzaki, @AlexRoseJo, @XiXiDu, @ref_sys
Date: 2020/07/01 02:21

Referring Tweets

@lepikhin t.co/V3XPuaBKcG We scaled the Transformer model with Sparsely-Gated Mixture-of-Experts using GShard, and trained a 600B multilingual translation model in about 4 days (for 100 languages) achieving 13.5 BLEU gain compared to the baseline. t.co/oOHRK7iiHm
@deliprao “In weeks decades happen” 13+ BLEU point improvements in this new work. I have a feeling Google is quietly sitting on a GPT-100 implementation and doesn’t bother telling anyone. t.co/5trXwvKOz2 t.co/qrmqQzVRux
@XiXiDu 600 billion parameters: t.co/0M4zrVVq8z "We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art."
@arankomatsuzaki GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding GShard enables to scale up multilingual NMT Transformer with Sparsely-Gated MoE beyond 600 billion parameters using automatic sharding. t.co/HrM9liZYAR t.co/KiDoJogk3x
@AlexRoseJo +6 BLEU and reducing TPU years from 200 to just 22! 🙈 I need one of those google brain ssh keys. Sparsely gated mixture of experts slices up the FFN layers into many small networks that are more easily parallelizable and allows very wide networks. t.co/axgOMGEeXh t.co/Rx5nGgCR9M
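
The last tweet above gives the intuition for the Sparsely-Gated Mixture-of-Experts layer: the feed-forward sublayer is split into many small expert networks, and a learned gate routes each token to only a couple of them, so compute per token stays roughly constant while the total parameter count grows. Below is a minimal, hedged sketch of top-2 gating in JAX; it dispatches densely for clarity, omits the capacity limits and auxiliary load-balancing loss used in the paper, and all shapes and names are illustrative.

```python
# Hedged sketch of top-2 sparsely-gated Mixture-of-Experts routing:
# a learned gate picks 2 experts per token, so only a small slice of the
# total FFN capacity contributes to each token's output. The dense
# dispatch below is for illustration only, not the paper's implementation.
import jax
import jax.numpy as jnp

def moe_ffn(x, gate_w, expert_w1, expert_w2, k=2):
    """x: [tokens, d_model]; gate_w: [d_model, n_experts];
    expert_w1: [n_experts, d_model, d_ff]; expert_w2: [n_experts, d_ff, d_model]."""
    n_experts = gate_w.shape[-1]
    logits = x @ gate_w                               # [tokens, n_experts]
    probs = jax.nn.softmax(logits, axis=-1)
    top_p, top_idx = jax.lax.top_k(probs, k)          # top-2 experts per token
    # Run every expert densely, then keep only the top-k contributions
    # (a real implementation dispatches tokens to avoid this waste).
    h = jax.nn.relu(jnp.einsum("td,edf->etf", x, expert_w1))
    expert_out = jnp.einsum("etf,efd->etd", h, expert_w2)    # [n_experts, tokens, d_model]
    mask = jax.nn.one_hot(top_idx, n_experts)         # [tokens, k, n_experts]
    weights = (top_p[..., None] * mask).sum(1).T      # [n_experts, tokens]
    return jnp.einsum("et,etd->td", weights, expert_out)

# Tiny usage example with random weights.
key = jax.random.PRNGKey(0)
tokens, d_model, d_ff, n_experts = 4, 8, 16, 4
ks = jax.random.split(key, 4)
x = jax.random.normal(ks[0], (tokens, d_model))
gate_w = jax.random.normal(ks[1], (d_model, n_experts))
w1 = jax.random.normal(ks[2], (n_experts, d_model, d_ff))
w2 = jax.random.normal(ks[3], (n_experts, d_ff, d_model))
print(moe_ffn(x, gate_w, w1, w2).shape)   # (4, 8)
```

Note that the dense dispatch above runs every expert on every token; the point of the real sharded implementation is to avoid exactly that by routing each token only to the devices holding its chosen experts.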

Related Entries

Read more [1904.00962] Reducing BERT Pre-Training Time from 3 Days to 76 Minutes
0 users, 12 mentions 2019/04/08 11:15
Read more Baidu Research
0 users, 8 mentions 2019/07/31 15:47
Read more [1908.02265] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
0 users, 4 mentions 2019/08/07 05:17
Read more [2001.09977] Towards a Human-like Open-Domain Chatbot
0 users, 9 mentions 2020/01/28 05:21
Read more [2005.12126] Learning to Simulate Dynamic Environments with GameGAN
0 users, 10 mentions 2020/05/27 12:52