[2005.03271] RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing work focuses on building ASR models where training and test data are drawn from the same domain. This results in poor generalization to mismatched domains: for example, end-to-end models trained on short segments perform poorly when evaluated on longer utterances. In this work, we analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models in order to identify model components that negatively affect generalization performance. We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference. On a long-form YouTube test set, when the non-streaming RNN-T model is trained with shorter segments of data, the proposed combination improves word error rate (WER) from 22.3% to 1
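
The abstract names overlapping inference as a remedy for long utterances, but this excerpt does not describe the procedure. As a rough illustration of the general idea only (decoding a long recording in shorter windows that share some audio with their neighbors), here is a minimal sketch. The function name `overlapping_segments` and its parameters are hypothetical, and the step that merges the per-window hypotheses in the overlapped regions is not shown; none of this should be read as the authors' algorithm.

```python
from typing import List, Tuple


def overlapping_segments(
    num_samples: int, segment_len: int, overlap: int
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges covering [0, num_samples).

    Consecutive ranges share `overlap` samples. This is an illustrative
    helper (an assumption), not taken from the paper.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")

    hop = segment_len - overlap  # how far each window advances
    segments: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + segment_len, num_samples)
        segments.append((start, end))
        if end == num_samples:
            break  # the last window already reaches the end of the audio
        start += hop
    return segments


if __name__ == "__main__":
    # 10 samples, windows of 4 samples, 2 samples shared between neighbors.
    print(overlapping_segments(10, 4, 2))
```

Each window could then be decoded independently by a model trained on short segments, which is the setting the abstract describes; how the overlapping hypotheses are reconciled is left open here.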

Keywords: rnn
Date: 2020/05/19 23:21
