[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which is jointly learned. We set a new state of the art on both the 100-hour subset of Librispeech and on TIMIT phoneme recognition. When lowering the amount of labeled data to one hour, our model outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 5.7/10.1 WER on the clean/noisy test sets of Librispeech. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. Fine-tuning on all of Librispeech achieves 1.9/3.5 WER using a simple baseline model architecture.
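The abstract compresses the training objective into one sentence, so a sketch may help: for each masked time step, the context network's output must identify the true quantized latent among distractors drawn from other masked steps of the same utterance, via an InfoNCE loss with cosine similarity and a temperature. The sketch below is a minimal, self-contained illustration of that idea, not the fairseq implementation; the function name and tensor shapes are assumptions, while the 100 distractors and temperature 0.1 match the settings reported in the paper.

```python
# Minimal sketch of a wav2vec 2.0-style contrastive objective (InfoNCE over
# quantized latents at masked time steps). Illustrative only: assumes the
# context vectors and quantized targets have already been computed, and
# omits the paper's diversity loss and masking logic.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, num_distractors=100, temperature=0.1):
    """context: (T, D) context-network outputs at the masked time steps.
    targets: (T, D) quantized latent targets for the same steps."""
    T, D = context.shape
    # For each step t, sample distractors uniformly from the *other* masked
    # steps of the same utterance: draw indices in [0, T-2], then shift any
    # index >= t up by one so that t itself is never sampled.
    distractor_idx = torch.randint(0, T - 1, (T, num_distractors))
    distractor_idx += (distractor_idx >= torch.arange(T).unsqueeze(1)).long()
    candidates = torch.cat([targets.unsqueeze(1),             # positive first: (T, 1, D)
                            targets[distractor_idx]], dim=1)  # (T, 1+K, D)
    # Cosine similarity between each context vector and its candidates,
    # scaled by the temperature, gives the InfoNCE logits.
    logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    # The positive sits at index 0 of every row.
    labels = torch.zeros(T, dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for real model outputs.
c = torch.randn(50, 256)  # masked-step context vectors
q = torch.randn(50, 256)  # corresponding quantized latents
print(contrastive_loss(c, q).item())
```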

6 mentions: @ylecun, @n0mad_0, @meapistol, @alex_conneau
Date: 2020/06/23 14:21

Referring Tweets

@ylecun Self-Supervised Learning making strides in speech recognition. Wav2Vec 2.0 from FAIR uses a kind of contrastive SSL for pre-training. This is the first time an SSL system reaches the very best results on a number of different ASR tasks. t.co/pqu1E320f4 1/N
@meapistol Facebook is doing great stuff now in video and speech recognition. Self-supervised. The video stuff is multimodal. t.co/pngmdZ7Y8y t.co/EN14Ard9vB Yannic Kilcher will be busy.
@alex_conneau We build on wav2vec 2.0 (t.co/tGOqGLWlYc), a self-supervised model which is trained by solving a contrastive task over masked latent speech representations. In this paper, we jointly learn a quantization of the latents shared across languages and call our approach XLSR.
@n0mad_0 Insanely good results on self-supervised learning of speech representations: 53k hours of unlabeled data + 10min of labelled data = 5.7/10.1 WER noisy/clean test of Librispeech. Baevski et al, wav2vec 2.0 t.co/BF0mR2X3w1

Related Entries

[2007.01179] Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models
0 users, 3 mentions 2020/07/04 14:22
[2011.07743] Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases
0 users, 3 mentions 2020/11/22 15:51
[2004.04795] Exemplar VAE: Linking Generative Models, Nearest Neighbor Retrieval, and Data Augmentation
0 users, 3 mentions 2020/12/10 15:51
[2012.12556] A Survey on Visual Transformer
0 users, 2 mentions 2020/12/25 03:51
[2012.15856] Studying Strategically: Learning to Mask for Closed-book QA
0 users, 3 mentions 2021/01/04 03:51