MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism - NVIDIA ADLR

We train an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer-based language model ever trained, at 24x the size of BERT and 5.6x the size of GPT-2.
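
The model-parallel scheme summarized above splits the two GEMMs in each transformer MLP block across GPUs so that only one all-reduce is needed per forward pass. Below is a minimal PyTorch sketch of that idea, not the actual Megatron-LM code; the class name ParallelMLP and the mp_world_size/mp_group arguments are illustrative assumptions.

# Minimal sketch (not the actual Megatron-LM code) of the column-/row-parallel
# split for one transformer MLP block. Assumes torch.distributed is already
# initialized and the ranks in `mp_group` form one model-parallel group.
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ParallelMLP(nn.Module):
    def __init__(self, hidden_size, mp_world_size, mp_group=None):
        super().__init__()
        assert (4 * hidden_size) % mp_world_size == 0
        shard = 4 * hidden_size // mp_world_size
        # First GEMM split column-wise: each rank holds a [hidden, 4h/p] slice,
        # so the GeLU can be applied locally with no communication.
        self.fc1 = nn.Linear(hidden_size, shard)
        # Second GEMM split row-wise: each rank holds a [4h/p, hidden] slice and
        # produces a partial sum (bias omitted so it is not added p times).
        self.fc2 = nn.Linear(shard, hidden_size, bias=False)
        self.mp_group = mp_group

    def forward(self, x):
        # x is replicated on every rank of the model-parallel group.
        y = F.gelu(self.fc1(x))   # local column-parallel GEMM + GeLU
        out = self.fc2(y)         # local row-parallel GEMM -> partial result
        # One all-reduce combines the partial results; this is the only
        # synchronization in the block. A real implementation wraps it in a
        # custom autograd function so gradients are reduced correctly too.
        dist.all_reduce(out, group=self.mp_group)
        return out

Alternating the column-wise and row-wise splits this way is what removes the extra synchronization point between the two GEMMs that a naive per-layer split would require.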

23 mentions: @chipro, @ctnzr, @ivan_bezdomny, @kentenglish, @TheRealRPuri, @gwern, @jakubzavrel, @pareshkharya

Referring Tweets

@chipro
@chipro "8.3 billion parameters on 512 GPUs with 8-way model parallelism, we achieved up to 15.1 PetaFLOPS sustained performance over the entire application and reached 76% scaling efficiency compared to the single GPU case." Whoa t.co/p2ItGUyKD1 t.co/RlE1WSLxRm
@ctnzr
@ctnzr Here’s how we trained an 8.3B parameter GPT-2. We alternate row- and column- partitioning in the Transformer in order to remove synchronization and use hybrid model/data parallelism. 15 PFlops sustained on 512 GPUs. Details and code: t.co/7eXA6r15yX t.co/sEk4q0hU7T
@ivan_bezdomny
@ivan_bezdomny Our team, training the biggest neural net to date (Transformer language model with 8.3B trainable parameters). Now let's find out what we can do with this capacity. t.co/xRvi6HBgKW
@kentenglish
@kentenglish These #transformer models are getting more intense all the time. Anyone got a few hundred NVIDIA V100s that I can borrow? t.co/kIiPg350Dc
@TheRealRPuri
@TheRealRPuri We Just released a cool #PyTorch #NaturalLanguageProcessing project we've been working on: training an 8.3B GPT2 model with model parallelism. Check it out... Details: t.co/RPK91K4Dda Training Code: t.co/byti9HDZCE
@gwern
@gwern @WahrMethode @MelMitchell1 And Nvidia's new 8.3b-parameter model does overfit quickly in terms of test loss: t.co/vm3USs0a7Y So GPT-2-1.5b can't be too far away from that level of memorization either.
@jakubzavrel
@jakubzavrel The MegatronLM: training even larger Transformer models for NLP with massive GPU parallelism t.co/tvgBDRJzJh by @nvidia
@pareshkharya
@pareshkharya NVIDIA research team trained the largest ever language model based on transformers with a novel model parallel approach using Pytorch. The code that implements this approach is published on GitHub. t.co/lKvRyUhysV
@LumenNaturae
@LumenNaturae "8.3 billion parameters on 512 GPUs with 8-way model parallelism, we achieved up to 15.1 PetaFLOPS sustained performance over the entire application and reached 76% scaling efficiency compared to the single GPU case." t.co/fRBR2VfZxo

Bookmark Comments

id:Ryobot

Related Entries

GitHub - deepmind/spriteworld: Spriteworld: a flexible, configurable python-based reinforcement lear...
1 user, 8 mentions 2019/08/19 17:16
The project | SpeechBrain
1 user, 15 mentions 2019/09/10 15:48
Is the future of Neural Networks Sparse? An Introduction (1/N)
1 user, 6 mentions 2020/02/04 17:09
Knowledge Graphs For eXplainable AI - Towards Data Science
0 users, 23 mentions 2020/03/02 16:09
Understanding RL Vision
0 users, 12 mentions 2020/11/26 00:20

About ML-News

Information about machine learning technology moves quickly and spans many subfields, which makes it hard to keep up. Even if you build a machine learning list on Twitter, it usually carries many unrelated topics, so gathering information efficiently is difficult.

ML-News is a news site specialized in machine learning that uses social media as its information source. It lets you efficiently collect the latest information on machine learning papers, blogs, libraries, competitions, presentation materials, study groups, and more.

It also covers fields that apply machine learning, such as natural language processing, image recognition, and information retrieval, as well as topics like data infrastructure and MLOps that machine learning work requires.
We are looking for GitHub sponsors to support stable operation of the site.

Announcements

  • 2021/12/31: Redesigned the site
  • 2021/04/08: Added a new category for Japanese-language Kaggle content