MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism - NVIDIA ADLR

We train an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs (8 × 64 = 512), making it the largest transformer-based language model ever trained, at 24x the size of BERT and 5.6x the size of GPT-2.

13 mentions: @chipro, @ctnzr, @ivan_bezdomny, @kentenglish, @gwern, @TheRealRPuri, @pareshkharya, @DataScientistFr
Date: 2019/08/13 00:00
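
The 512-GPU count is exactly the product of the two parallelism degrees (8 model-parallel × 64 data-parallel). The sketch below is a hypothetical illustration using torch.distributed process groups, not code from the Megatron-LM repository; it shows one common way such a hybrid layout can be expressed, with consecutive ranks sharing a model replica.

```python
# Hypothetical sketch of an 8-way model-parallel x 64-way data-parallel
# layout over 512 ranks, using torch.distributed process groups.
# (Not the actual Megatron-LM group-initialization code.)
import torch.distributed as dist

WORLD_SIZE = 512
MODEL_PARALLEL_SIZE = 8                                   # ranks jointly holding one replica
DATA_PARALLEL_SIZE = WORLD_SIZE // MODEL_PARALLEL_SIZE    # 64 independent replicas

def build_groups(rank):
    # Every process must create the same groups in the same order.
    # Model-parallel groups: consecutive ranks [0..7], [8..15], ...
    model_groups = [dist.new_group(list(range(i, i + MODEL_PARALLEL_SIZE)))
                    for i in range(0, WORLD_SIZE, MODEL_PARALLEL_SIZE)]
    # Data-parallel groups: ranks holding the same model shard,
    # i.e. [0, 8, 16, ...], [1, 9, 17, ...], ...
    data_groups = [dist.new_group(list(range(i, WORLD_SIZE, MODEL_PARALLEL_SIZE)))
                   for i in range(MODEL_PARALLEL_SIZE)]
    return (model_groups[rank // MODEL_PARALLEL_SIZE],
            data_groups[rank % MODEL_PARALLEL_SIZE])
```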

Referring Tweets

@chipro "8.3 billion parameters on 512 GPUs with 8-way model parallelism, we achieved up to 15.1 PetaFLOPS sustained performance over the entire application and reached 76% scaling efficiency compared to the single GPU case." Whoa https://t.co/p2ItGUyKD1 https://t.co/RlE1WSLxRm
@ctnzr Here’s how we trained an 8.3B parameter GPT-2. We alternate row- and column-partitioning in the Transformer in order to remove synchronization and use hybrid model/data parallelism. 15 PFlops sustained on 512 GPUs. Details and code: https://t.co/7eXA6r15yX https://t.co/sEk4q0hU7T
@ivan_bezdomny Our team, training the biggest neural net to date (Transformer language model with 8.3B trainable parameters). Now let's find out what we can do with this capacity. https://t.co/xRvi6HBgKW
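
@ctnzr's tweet above describes the core trick: split the first MLP weight matrix column-wise and the second row-wise, so the intervening GeLU is applied locally with no communication and the forward pass needs only a single all-reduce. Below is a minimal NumPy sketch of that idea, using a 2-way split on one machine for readability; it is an illustration of the partitioning scheme, not the actual Megatron-LM code.

```python
# Minimal sketch of Megatron-style row/column partitioning of a
# transformer MLP block, simulated with NumPy (2-way split).
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (elementwise, so it commutes with a column split)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

d, h = 4, 16                     # hidden size, MLP inner size
X = np.random.randn(2, d)        # a batch of activations
A = np.random.randn(d, h)        # first MLP weight
B = np.random.randn(h, d)        # second MLP weight

# Column-partition A across two "GPUs": each holds half the columns,
# so GeLU(X @ A_i) is computed locally with no synchronization.
A1, A2 = np.split(A, 2, axis=1)
# Row-partition B to match the split of the intermediate activations.
B1, B2 = np.split(B, 2, axis=0)

Y1 = gelu(X @ A1) @ B1           # GPU 0's partial output
Y2 = gelu(X @ A2) @ B2           # GPU 1's partial output
Y = Y1 + Y2                      # the single all-reduce in the forward pass

assert np.allclose(Y, gelu(X @ A) @ B)   # matches the unpartitioned MLP
```

In the full model, the same alternation is applied to self-attention (heads column-partitioned, the output projection row-partitioned), which is what "alternate row- and column-partitioning" refers to.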

Bookmark Comments

id:Ryobot "largest transformer based language model ever trained at 24x the size of BERT and 5.6x the size of GPT-2"

Related Entries

Modeling Children's Language Acquisition and NN Language Models
Generalized Language Models
GitHub - facebookresearch/XLM: PyTorch original implementation of Cross-lingual Language Model Pretraining
[1902.10547] An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models