[1911.02150] Fast Transformer Decoding: One Write-Head is All You Need

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.
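The paper presents its method as TensorFlow einsum pseudocode; the sketch below is an independent minimal NumPy illustration of one incremental decoding step, with assumed shapes and names (d_model, n_heads, d_head, seq_len are illustrative, not the paper's configuration). It contrasts the per-head key/value caches of multi-head attention with the single shared cache of multi-query attention.

```python
# Minimal NumPy sketch (illustrative, not the paper's code) of one
# incremental decoding step: multi-head vs. multi-query attention.
import numpy as np

d_model, n_heads, d_head, seq_len = 512, 8, 64, 128   # assumed sizes
rng = np.random.default_rng(0)

x = rng.standard_normal(d_model)                       # current token's hidden state

# Multi-head attention: per-head K/V caches of shape [heads, seq, d_head].
K_mh = rng.standard_normal((n_heads, seq_len, d_head))
V_mh = rng.standard_normal((n_heads, seq_len, d_head))

# Multi-query attention: a single K/V cache shared by all heads.
K_mq = rng.standard_normal((seq_len, d_head))
V_mq = rng.standard_normal((seq_len, d_head))

# Per-head query projections are kept in both variants.
W_q = rng.standard_normal((n_heads, d_model, d_head))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_step(x, K, V, W_q):
    # q: [heads, d_head]; each head attends over its own K/V cache.
    q = np.einsum("d,hdk->hk", x, W_q)
    logits = np.einsum("hk,hsk->hs", q, K) / np.sqrt(d_head)
    return np.einsum("hs,hsk->hk", softmax(logits), V)

def multi_query_step(x, K, V, W_q):
    # Same per-head queries, but one shared K/V cache for every head.
    q = np.einsum("d,hdk->hk", x, W_q)
    logits = np.einsum("hk,sk->hs", q, K) / np.sqrt(d_head)
    return np.einsum("hs,sk->hk", softmax(logits), V)

out_mh = multi_head_step(x, K_mh, V_mh, W_q)
out_mq = multi_query_step(x, K_mq, V_mq, W_q)
print(out_mh.shape, out_mq.shape)                      # (8, 64) (8, 64)

# The K/V tensors that must be reloaded at every decoding step shrink
# by a factor of n_heads (here 8x), which is the memory-bandwidth saving.
print((K_mh.size + V_mh.size) / (K_mq.size + V_mq.size))   # 8.0
```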

6 mentions: @Miles_Brundage, @hillbig, @hillbig, @AkiraTOSEI, @AkiraTOSEI
Keywords: transformer
Date: 2019/11/07 05:20

Referring Tweets

@hillbig Transformers generally use multiple heads to increase expressive power, outputting one key and value per head at each position, but these tensors are large, so decoding becomes memory-bandwidth bound and incremental inference is slow. If keys and values are instead shared across heads and only the queries remain per-head, accuracy drops slightly but decoding can be made much faster. t.co/8LmfYlTvNz
@hillbig Multi-head attention is slow for incremental inference because of the high memory-bandwidth cost for loading large keys and values at each position. Multi-query attention is fast with minor degradation by sharing keys and values between different heads. t.co/8LmfYlTvNz
@AkiraTOSEI t.co/BVrwPsI7oe Proposes Multi-query Attention, which shares keys and values across heads in the Transformer's self-attention. Because loading keys and values takes a large amount of time, sharing them enables a speedup. Achieves a more than 10x speedup without sacrificing accuracy. t.co/TCV3k9Todk
@Miles_Brundage Naturally, this is a single-authored paper: "Fast Transformer Decoding: One Write-Head is All You Need," Noam Shazeer: t.co/dYwkV3LGFx
@AkiraTOSEI t.co/BVrwPsI7oe They propose Multi-query Attention, which is the same as self-attention except that all heads share the keys and values. Since reloading those tensors is expensive, sharing them speeds up decoding. Achieves a speed increase of 10 times or more without sacrificing accuracy. t.co/ELIVVsVKD0
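To make the memory-bandwidth point in the tweets above concrete, here is a back-of-the-envelope calculation with hypothetical numbers (batch size, sequence length, head count, and fp16 storage are assumptions, not figures from the paper): the key/value cache that incremental decoding must re-read at every step shrinks by a factor of the head count when keys and values are shared.

```python
# Illustrative KV-cache size arithmetic (hypothetical numbers, not taken
# from the paper): bytes that must be streamed per decoding step.
batch, seq_len, n_heads, d_head, bytes_per = 32, 1024, 16, 64, 2   # fp16

# Factor of 2 accounts for storing both keys and values.
kv_multi_head  = 2 * batch * seq_len * n_heads * d_head * bytes_per
kv_multi_query = 2 * batch * seq_len * d_head * bytes_per

print(f"multi-head  KV cache: {kv_multi_head / 2**20:.0f} MiB")    # 128 MiB
print(f"multi-query KV cache: {kv_multi_query / 2**20:.0f} MiB")   # 8 MiB
```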

Related Entries

GitHub - soskek/bert-chainer: Chainer implementation of "BERT: Pre-training of Deep Bidirectional Tr...
7 users, 0 mentions 2018/12/02 18:01
[DL Hacks]BERT: Pre-training of Deep Bidirectional Transformers for L…
4 users, 5 mentions 2018/12/07 04:31
The Annotated Transformer
0 users, 0 mentions 2018/08/27 01:24
[DL輪読会]BERT: Pre-training of Deep Bidirectional Transformers for Lang…
0 users, 0 mentions 2018/10/20 12:15
GitHub - huggingface/pytorch-pretrained-BERT: The Big-&-Extending-Repository-of-Transformers: PyTorc...
1 users, 7 mentions 2019/03/04 21:47