Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to $O\left(n^{1.5}d\right)$ from $O\left(n^2d\right)$ for sequence length $n$ and hidden dimension $d$.
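The routing idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`routing_attention`, `update_centroids`, `attention_cost`), the `decay` parameter, and the inner-product assignment rule are assumptions made for illustration, and causal masking and cluster balancing are left out. Each query attends only to keys routed to the same centroid, and the centroids follow a moving-average (online k-means style) update. Under the assumption of balanced clusters, choosing about $\sqrt{n}$ clusters makes the assignment cost and the within-cluster attention cost both scale as $n^{1.5}d$.

```python
import numpy as np

def routing_attention(q, k, v, centroids):
    """Clustered sparse attention sketch (hypothetical helper, not the paper's code).

    q, k, v: (n, d) float arrays; centroids: (c, d) float array.
    A query attends only to keys assigned to the same centroid.
    """
    n, d = q.shape
    # Route queries and keys to the centroid with the largest inner product.
    q_cluster = np.argmax(q @ centroids.T, axis=1)
    k_cluster = np.argmax(k @ centroids.T, axis=1)
    out = np.zeros_like(v)
    for c in range(centroids.shape[0]):
        qi = np.flatnonzero(q_cluster == c)
        ki = np.flatnonzero(k_cluster == c)
        if qi.size == 0 or ki.size == 0:
            continue  # queries with no routed keys keep a zero output
        # Ordinary softmax attention, restricted to this cluster.
        scores = q[qi] @ k[ki].T / np.sqrt(d)
        scores -= scores.max(axis=1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[qi] = weights @ v[ki]
    return out

def update_centroids(centroids, x, decay=0.999):
    """Moving-average centroid update; `decay` is an assumed hyperparameter."""
    assign = np.argmax(x @ centroids.T, axis=1)
    new = centroids.copy()
    for c in range(centroids.shape[0]):
        members = x[assign == c]
        if members.size:
            new[c] = decay * centroids[c] + (1 - decay) * members.mean(axis=0)
    return new

def attention_cost(n, d, num_clusters):
    """Rough multiply count, assuming perfectly balanced clusters."""
    assign = n * num_clusters * d          # compare every token with every centroid
    attend = n * (n / num_clusters) * d    # each query sees about n / num_clusters keys
    return assign + attend
```

With `n = 1024`, `d = 64`, and `num_clusters = 32` (that is, $\sqrt{n}$), `attention_cost` gives about $2n^{1.5}d$ multiplies, compared with $n^2 d$ for dense attention. With a single centroid the sketch reduces to ordinary dense attention, which is a convenient sanity check.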

Date: 2020/11/18 03:51

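@maguroIsland Efficient Content-Based Sparse Attention with Routing Transformers (TACL2020) t.co/f1Wqb4Pcl5 Proposes the Routing Transformer: clustering the inputs when computing attention turns the computation into a maximum inner product search (MIPS) problem, which keeps the computational cost low while still capturing strong attention signals. t.co/mVDFvaBnSD
@tsuchm About this paper t.co/xrx1XAQE1A: in principle, I think it is plausible that training could become faster than with conventional methods (and I think that alone would be a worthwhile contribution), but does performance really improve over conventional methods? I wonder whether it has fallen into the trap of an insufficient hyperparameter search for the conventional baselines.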

