@nabla_theta @theshawwn Just wanted to make things clear for anyone reading this: the dataset is NOT completed (and probably won't be for a while).
In other words, if you think some dataset should be added, by all means, open an issue on our repo (t.co/EPJWpAaCGi) and we'll consider adding it.
0 RT, 4 Fav · 2020/09/08 01:10
@gwern @Jakewk The scaling is sublinear, so IIRC, 1t is predicted to require ~2.5x GPT-3's <0.6TB, or <1.5TB. Not that much.
(For comparison, EleutherAI's GPT-3 replication attempt is already up to 0.3TB data compiled, 0.2TB in flight, & rest will be from Common Crawl: t.co/gSw4mmSkmg )
1 RT, 10 Fav · 2020/09/11 20:34
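A quick sanity check on that estimate, as a sketch rather than anything stated in the thread: assuming the Kaplan et al. (2020) compute-optimal scaling laws, where optimal model size grows as N ∝ C^0.73 and data as D ∝ C^0.27, data requirements scale sublinearly in parameters as D ∝ N^(0.27/0.73) ≈ N^0.37. The 0.6TB figure for GPT-3's training text and the exponent are assumptions here, not numbers confirmed by the tweets.

```python
# Back-of-envelope check of the tweet's estimate, assuming Kaplan et al.
# (2020) compute-optimal scaling: N ~ C^0.73 and D ~ C^0.27, which together
# imply D ~ N^(0.27/0.73) ~= N^0.37 (sublinear in parameter count).
# The 0.6 TB GPT-3 data figure and the exponent are assumptions.

N_GPT3 = 175e9      # GPT-3 parameter count
N_TARGET = 1e12     # hypothetical 1t-parameter model
D_GPT3_TB = 0.6     # assumed size of GPT-3's training text, in TB

exponent = 0.27 / 0.73                    # ~= 0.37
scale = (N_TARGET / N_GPT3) ** exponent   # ~= 1.9x more data
print(f"data multiplier: {scale:.2f}x")
print(f"predicted data:  {scale * D_GPT3_TB:.2f} TB")  # ~= 1.1 TB, under 1.5 TB
```

Under these assumptions the multiplier comes out around 1.9x, or roughly 1.1TB, comfortably inside the "<1.5TB" bound quoted above.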
@tmasada “The Pile is (going to be) the world's largest open source language modeling data set. We are currently developing Version 1, with a goal of 1 TiB of English text.”
GitHub - EleutherAI/The-Pile t.co/WOFD077Hgr
0 RT, 0 Fav · 2020/09/12 12:36
[2001.05070] Understanding Generalization in Deep Learning via Tensor Methods