[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. We set a new state of the art on both the 100 hour subset of Librispeech as well as on TIMIT phoneme recognition. When lowering the amount of labeled data to one hour, our model outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 5.7/10.1 WER on the noisy/clean test sets of Librispeech. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. Fine-tuning on all of Librispeech achieves 1.9/3.5 WER using a simple base…
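The abstract describes a contrastive objective: given a context representation at a masked position, the model must identify the true quantized latent among distractors. A minimal sketch of such an InfoNCE-style loss in plain Python follows; the vector sizes, `temperature` value, and function names are illustrative assumptions, not the paper's actual implementation (which uses cosine similarity over learned quantized codebook entries):

```python
import math

def cosine(a, b):
    # cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def contrastive_loss(context, true_quantized, distractors, temperature=0.1):
    """InfoNCE-style loss: the context vector at a masked timestep should
    be more similar to the true quantized latent than to the distractors."""
    sims = [cosine(context, true_quantized)]
    sims += [cosine(context, d) for d in distractors]
    logits = [s / temperature for s in sims]
    # numerically stable softmax cross-entropy with the true latent as target
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

The loss is low when the context vector aligns with the true quantized latent and high when a distractor is closer, which is what drives the representations learned during pre-training.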

6 mentions: @ylecun, @n0mad_0, @meapistol, @alex_conneau
Date: 2020/06/23 14:21

Referring Tweets

@ylecun Self-Supervised Learning making strides in speech recognition. Wav2Vec 2.0 from FAIR uses a kind of contrastive SSL for pre-training. This is the first time an SSL system reaches the very best results on a number of different ASR tasks. t.co/pqu1E320f4 1/N
@meapistol Facebook is doing great stuff now in video and speech recognition. Self-supervised. The video stuff is multimodal. t.co/pngmdZ7Y8y t.co/EN14Ard9vB Yannick Kilcher will be busy.
@alex_conneau We build on wav2vec 2.0 (t.co/tGOqGLWlYc), a self-supervised model which is trained by solving a contrastive task over masked latent speech representations. In this paper, we jointly learn a quantization of the latents shared across languages and call our approach XLSR.
@n0mad_0 Insanely good results on self-supervised learning of speech representations: 53k hours of unlabeled data + 10min of labelled data = 5.7/10.1 WER noisy/clean test of Librispeech. Baevski et al, wav2vec 2.0 t.co/BF0mR2X3w1

Related Entries

Read more [2003.00381] Statistical power for cluster analysis
0 users, 6 mentions 2020/03/03 23:20
Read more [2004.10240] Neural forecasting: Introduction and literature overview
0 users, 6 mentions 2020/04/23 21:51
Read more A 2020 Vision of Linear Algebra | MIT OpenCourseWare
0 users, 12 mentions 2020/05/09 12:52
Read more Lessons from the PULSE Model and Discussion
0 users, 14 mentions 2020/06/24 19:53
Read more Knowledge Graphs in Natural Language Processing @ ACL 2020 | by Michael Galkin | Jul, 2020 | Towards...
1 users, 7 mentions 2020/07/11 21:49