Date: 2020/09/12 12:53

@nabla_theta @theshawwn Just wanted to make things clear for anyone reading this: the dataset is NOT completed (and probably won't be for a while). In other words, if you think some dataset should be added, by all means, open an issue on our repo and we'll consider adding it.
@gwern @Jakewk The scaling is sublinear, so IIRC, 1t is predicted to require ~5x GPT-3's <0.6TB, or <1.5TB. Not that much. (For comparison, EleutherAI's GPT-3 replication attempt is already up to 0.3TB data compiled, 0.2TB in flight, & rest will be from Common Crawl.)
@tmasada “The Pile is (going to be) the world's largest open source language modeling data set. We are currently developing Version 1, with a goal of 1 TiB of English text.” GitHub - EleutherAI/The-Pile

