[2009.04534v1] Pay Attention when Required

Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explored trade-offs and the ordering of the blocks to improve upon the current Transformer architecture, and proposed the PAR Transformer. It needs 35% lower compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, while retaining the perplexity on the WikiText-103 language modelling benchmark. We further validated our results on the text8 and enwik8 datasets, as well as on the BERT model.
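To make the block-ordering idea concrete, here is a minimal sketch, assuming PyTorch, of a layer stack built from a pattern string in which most self-attention blocks are replaced by feed-forward blocks. The class and names (PARBlockStack, the "s"/"f" pattern) are illustrative assumptions, not the authors' implementation or the exact PAR architecture from the paper.

```python
# Minimal sketch (not the paper's code): build a Transformer-style stack
# from a pattern string, where 's' = self-attention block and 'f' = feed-forward block.
import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Pre-norm residual feed-forward block
        return x + self.ff(self.norm(x))


class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # Pre-norm residual self-attention block
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class PARBlockStack(nn.Module):
    """Stack of blocks laid out according to a pattern string (illustrative)."""

    def __init__(self, pattern: str, d_model: int = 512, d_ff: int = 2048, n_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            SelfAttentionBlock(d_model, n_heads) if c == "s"
            else FeedForwardBlock(d_model, d_ff)
            for c in pattern
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


# A baseline that interleaves attention and feed-forward blocks ("sf" repeated),
# versus a PAR-like variant that keeps only a fraction of the attention blocks.
# The specific pattern below is an illustrative assumption, not the paper's searched layout.
baseline = PARBlockStack("sf" * 12)
par_like = PARBlockStack("s" * 4 + "f" * 20)
x = torch.randn(2, 16, 512)  # (batch, seq_len, d_model)
print(par_like(x).shape)
```

Since feed-forward blocks are cheaper than self-attention blocks, replacing most attention blocks in this way is what yields the reported compute savings; the paper searches over such orderings rather than fixing one by hand.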

Keywords: attention
Date: 2020/09/13 15:51
