[1911.00359] CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

Pre-trained text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora, as long as their quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high-quality corpora like Wikipedia.
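The deduplication step described in the abstract can be illustrated with a minimal sketch. This is not the CCNet implementation itself, only an assumed approach: hash each whitespace-normalized paragraph and keep the first occurrence, so repeated web boilerplate (cookie banners, navigation text) collapses to a single copy. The function names and the toy input are hypothetical; the real pipeline additionally runs fastText language identification and a Wikipedia-trained language-model filter, which are omitted here.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before hashing,
    # so near-identical boilerplate paragraphs collide.
    return " ".join(text.lower().split())

def deduplicate(paragraphs):
    """Keep only the first occurrence of each normalized paragraph."""
    seen, kept = set(), []
    for p in paragraphs:
        h = hashlib.sha1(normalize(p).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(p)
    return kept

docs = [
    "Cookies help us deliver our services.",
    "A genuinely unique sentence about language models.",
    "Cookies  help us  deliver our services.",  # duplicate up to whitespace
]
print(deduplicate(docs))  # the boilerplate paragraph survives only once
```

Hashing normalized paragraphs rather than whole documents lets the pipeline drop repeated boilerplate while keeping the unique body text of each page.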

2 mentions: @alex_conneau
Keywords: dataset
Date: 2019/11/07 02:20

Referring Tweets

@alex_conneau ⚙️Release: CCNet is our new tool for extracting high-quality and large-scale monolingual corpora from CommonCrawl in more than a hundred languages. Paper: t.co/zwKGU3tnIv Tool: t.co/L7b5frErqk By G. Wenzek, M-A Lachaux, @EXGRV, @armandjoulin t.co/2FroKfrpan
