[2106.01548] When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features and inductive biases with general-purpose neural architectures. Existing works empower the models with massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian analysis reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with simple Inception-style preprocessing).
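
Below is a minimal sketch of one sharpness-aware minimization (SAM) update step as described by Foret et al. 2021: take an ascent step toward the locally sharpest direction, then update the weights using the gradient measured at that perturbed point. The PyTorch framing and the names `model`, `loss_fn`, `base_optimizer`, and `rho` are illustrative assumptions, not the paper's official implementation.

```python
# Minimal sketch of one SAM update step (assumed PyTorch setup, not the
# paper's official code). model, loss_fn, base_optimizer, rho are placeholders.
import torch

def sam_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05):
    # 1) First forward/backward pass: gradient at the current weights w.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Perturb the weights to w + eps, with eps = rho * grad / ||grad||
    # (ascent step toward the locally sharpest direction).
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            eps = rho * p.grad / (grad_norm + 1e-12)
            p.add_(eps)
            perturbations.append((p, eps))
    model.zero_grad()

    # 2) Second forward/backward pass: gradient at the perturbed weights w + eps.
    loss_fn(model(inputs), targets).backward()

    # Restore the original weights, then apply the base optimizer update
    # using the gradient computed at w + eps.
    with torch.no_grad():
        for p, eps in perturbations:
            p.sub_(eps)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

The two forward/backward passes roughly double the per-step cost, which is the main overhead of SAM compared with standard training.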

9 mentions: @jwuphysics, @AkiraTOSEI, @RisingSayak, @AkiraTOSEI, @bneyshabur

Referring Tweets

@jwuphysics
Vision transformers (ViT) and MLP-Mixer models tend to produce sharp loss landscapes. Thus, they stand to benefit most from the recently proposed sharpness-aware minimization (SAM) optimizer! (see Foret et al. 2021, ICLR for more on SAM) t.co/f9S6taOx4K t.co/0QcmYVWp9Y
@AkiraTOSEI
t.co/AtLMFa0x2R MLP-Mixer and ViT tend to fall into poor local optima, so they can be improved by using the SAM optimizer, which smooths the loss function. Accuracy improves substantially and, when trained on ImageNet, reaches roughly the same level as ResNets. t.co/Q47fWq3cpp
@RisingSayak
Turns out when ViTs and MLP-Mixers are _aware_ of the loss-curvature instead of uniform loss values they don't need pre-training on larger datasets. Significant gains 😟 Combine ViTs with Sharpness-aware Minimization = voila! t.co/IXNJh680Ey
@AkiraTOSEI
t.co/AtLMF9IWbj The MLP-Mixer and ViT are prone to local optima, which can be improved by using the SAM optimizer to smooth the loss function. Accuracy is greatly improved and reaches the same level as ResNet when trained on ImageNet. t.co/RJ1djYWqwN
@bneyshabur
This allows ViTs to outperform ResNets without any pretraining or strong data augmentations! Figures, numbers and conclusions are from this awesome recent paper by @XiangningChen, Cho-Jui Hsieh and @BoqingGo: t.co/nyn0yXlKCX 3/4

Related Entries

[2011.07357] Solving Physics Puzzles by Reasoning about Paths
0 users, 2 mentions 2020/11/27 11:21
[2101.07235] Reducing bias and increasing utility by federated generative modeling of medical images...
0 users, 2 mentions 2021/02/01 11:21
[2103.00430v2] Training Generative Adversarial Networks in One Stage
0 users, 2 mentions 2021/03/11 11:22
[2105.09938] Measuring Coding Challenge Competence With APPS
1 users, 2 mentions 2021/06/15 12:17
Faiss Tutorial: Getting Started With Faiss | Pinecone
2 users, 2 mentions 2021/07/31 04:38

About ML-News

Information on machine learning moves quickly and spans many fields, so keeping up is hard. Even if you build a machine-learning list on Twitter, it usually carries mostly unrelated topics, which makes gathering information efficiently difficult.

ML-News is a news site specialized in machine learning that uses social media as its source. It lets you efficiently collect the latest information on machine learning, including papers, blog posts, libraries, competitions, presentation materials, and study groups.

It also covers fields that apply machine learning, such as natural language processing, image recognition, and information retrieval, as well as the data infrastructure and MLOps topics that machine learning requires.
We are looking for GitHub sponsors to support stable operation of the site.

Announcements

  • 2021/12/31: Redesigned the site
  • 2021/04/08: Added a Japanese Kaggle category