GitHub - EleutherAI/The-Pile

3 mentions: @gwern, @nabla_theta, @tmasada
Date: 2020/09/12 12:53

Referring Tweets

@nabla_theta @theshawwn Just wanted to make things clear for anyone reading this: the dataset is NOT completed (and probably won't be for a while). In other words, if you think some dataset should be added, by all means, open an issue on our repo and we'll consider adding it.
@gwern @Jakewk The scaling is sublinear, so IIRC, 1t is predicted to require ~5x GPT-3's <0.6tb, or <1.5tb. Not that much. (For comparison, EleutherAI's GPT-3 replication attempt is already up to 0.3TB data compiled, 0.2TB in flight, & rest will be from Common Crawl: )
@tmasada “The Pile is (going to be) the world's largest open source language modeling data set. We are currently developing Version 1, with a goal of 1 TiB of English text.” GitHub - EleutherAI/The-Pile

Related Entries

[2001.05070] Understanding Generalization in Deep Learning via Tensor Methods
0 users, 3 mentions 2020/01/20 00:51
[2005.00770] Exploring and Predicting Transferability across NLP Tasks
0 users, 3 mentions 2020/06/28 06:52
[2007.01179] Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models
0 users, 3 mentions 2020/07/04 14:22
[2007.14435] Towards Ecologically Valid Research on Language User Interfaces
0 users, 3 mentions 2020/07/31 05:21
[2009.00666] Robust, Accurate Stochastic Optimization for Variational Inference
0 users, 3 mentions 2020/09/07 12:59