[1911.02150] Fast Transformer Decoding: One Write-Head is All You Need

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.
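
As a rough illustration of the idea (a minimal NumPy sketch, not the paper's TensorFlow pseudocode; all array names and dimensions below are illustrative assumptions), queries keep one projection per head while keys and values use a single shared projection:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, m, P_q, P_k, P_v, P_o):
    """Multi-query attention sketch (illustrative shapes, not the paper's code).

    x   : [n, d]     query-side inputs (n positions, model width d)
    m   : [m_len, d] memory inputs that supply keys and values
    P_q : [h, d, k]  one query projection per head (h heads)
    P_k : [d, k]     a single key projection shared by all heads
    P_v : [d, v]     a single value projection shared by all heads
    P_o : [h, v, d]  per-head output projections
    """
    q = np.einsum("nd,hdk->hnk", x, P_q)          # per-head queries
    K = np.einsum("md,dk->mk", m, P_k)            # shared keys   (no head axis)
    V = np.einsum("md,dv->mv", m, P_v)            # shared values (no head axis)
    logits = np.einsum("hnk,mk->hnm", q, K) / np.sqrt(q.shape[-1])
    weights = softmax(logits, axis=-1)
    o = np.einsum("hnm,mv->hnv", weights, V)
    return np.einsum("hnv,hvd->nd", o, P_o)       # sum head outputs back to width d

# Toy usage with made-up sizes.
d, k, v, h, n, m_len = 16, 4, 4, 4, 5, 7
rng = np.random.default_rng(0)
x, m = rng.normal(size=(n, d)), rng.normal(size=(m_len, d))
out = multi_query_attention(
    x, m,
    rng.normal(size=(h, d, k)), rng.normal(size=(d, k)),
    rng.normal(size=(d, v)), rng.normal(size=(h, v, d)),
)
print(out.shape)  # (5, 16)
```

Standard multi-head attention would give P_k and P_v a head axis as well; dropping that axis is the entire change.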

6 mentions: @Miles_Brundage, @hillbig, @hillbig, @AkiraTOSEI, @AkiraTOSEI
Keywords: transformer
Date: 2019/11/07 05:20

Referring Tweets

@hillbig Transformers generally use multiple heads to increase expressive power, producing one key and one value per head at each position; these tensors are large, so decoding becomes memory-bandwidth bound and incremental inference is slow. If the keys and values are instead shared across heads and only the queries remain per-head, accuracy drops slightly but decoding becomes much faster. t.co/8LmfYlTvNz
@hillbig Multi-head attention is slow for incremental inference because of the high memory-bandwidth cost for loading large keys and values at each position. Multi-query attention is fast with minor degradation by sharing keys and values between different heads. t.co/8LmfYlTvNz
@AkiraTOSEI t.co/BVrwPsI7oe They propose Multi-query Attention, a variant of the Transformer's self-attention in which all heads share the same keys and values. Since loading the keys and values dominates decoding time, sharing them enables a speedup of more than 10x without sacrificing accuracy. t.co/TCV3k9Todk
@Miles_Brundage Naturally, this is a single-authored paper: "Fast Transformer Decoding: One Write-Head is All You Need," Noam Shazeer: t.co/dYwkV3LGFx
@AkiraTOSEI t.co/BVrwPsI7oe they propose Multi-query Attention, which is the same as self-attention except that all heads share keys and values. Since reloading those tensors takes most of the time, sharing them speeds up decoding, achieving a speed increase of 10 times or more without sacrificing accuracy. t.co/ELIVVsVKD0
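
The 10x figure cited in the tweets above comes from incremental (one-token-at-a-time) decoding, where the keys and values of every earlier position must be re-read from memory at each step. The sketch below is a hypothetical illustration of one such step with a shared key/value cache; the function name decode_step, the shapes, and the sizes are assumptions, not the paper's benchmark setup. After t steps a multi-head cache holds t*h*(k+v) numbers, while the multi-query cache holds only t*(k+v), roughly an h-fold reduction in the memory traffic that dominates decoding time.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_step(x_t, K_cache, V_cache, P_q, P_k, P_v, P_o):
    """One multi-query decoding step (hypothetical sketch).

    x_t     : [d]    representation of the newest position
    K_cache : [t, k] shared keys of all earlier positions
    V_cache : [t, v] shared values of all earlier positions
    """
    K_cache = np.vstack([K_cache, x_t @ P_k])        # append one shared key row
    V_cache = np.vstack([V_cache, x_t @ P_v])        # append one shared value row
    q = np.einsum("d,hdk->hk", x_t, P_q)             # queries stay per-head
    w = softmax(np.einsum("hk,tk->ht", q, K_cache) / np.sqrt(q.shape[-1]))
    o = np.einsum("ht,tv->hv", w, V_cache)
    y = np.einsum("hv,hvd->d", o, P_o)
    return y, K_cache, V_cache

# Toy usage: decode three positions, watching the shared cache grow.
d, k, v, h = 16, 4, 4, 4
rng = np.random.default_rng(0)
P_q, P_k = rng.normal(size=(h, d, k)), rng.normal(size=(d, k))
P_v, P_o = rng.normal(size=(d, v)), rng.normal(size=(h, v, d))
K_cache, V_cache = np.zeros((0, k)), np.zeros((0, v))
for _ in range(3):
    y, K_cache, V_cache = decode_step(rng.normal(size=d), K_cache, V_cache,
                                      P_q, P_k, P_v, P_o)
print(K_cache.shape, V_cache.shape)  # (3, 4) (3, 4) -- no head axis in the cache
```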
