In five courses, you are going learn the foundations of Deep Learning, understand how to build neural networks, and learn how to lead successful machine learning projects. You will learn about Convolutional networks, RNNs, LSTM, Adam, Dropout, BatchNorm, Xavier/He initialization, and more. You will work on case studies from healthcare, autonomous driving, sign language reading, music generation, and natural language processing. You will master not only the theory, but also see how it is applied in industry. You will practice all these ideas in Python and in TensorFlow, which we will teach.
In this paper, the authors propose a taxonomy of efficient Transformer models, characterizing them by the technical innovation and primary use case.
Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of “X-former” models have been proposed – Reformer, Linformer, Performer, Longformer, to name a few – which improve upon the original Transformer architecture, many of which make improvements around computational and memory efficiency. With the aim of helping the avid researcher navigate this flurry, this paper characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.
In this paper, the authors propose a taxonomy of efficient Transformer models, characterizing them by the technical innovation and primary use case. Specifically, they review Transformer models that have applications in both language and vision domains, attempting to consolidate the literature across the spectrum. They also provide a detailed walk-through of many of these models and draw connections between them.
In the section 2, authors reviewed the background of the well-established Transformer architecture. Transformers are multi-layered architectures formed by stacking Transformer blocks on top of one another.
I really like the 2.4 section, when the authors summarised the the differences in the mode of usage of the Transformer block. Transformers can primarily be used in three ways, namely:
Encoder-only (e.g., for classification)
Decoder-only (e.g., for language modelling, GPT2/3)
Encoder-decoder (e.g., for machine translation)
In section 3, they provide a high-level overview of efficient Transformer models and present a characterization of the different models in the taxonomy with respect to core techniques and primary use case. This is the core part of this paper covering 17 different papers’ technical details.
Summary of Efficient Transformer Models presented in chronological order of their first public disclosure.
In the last section, authors address the state of research pertaining to this class of efficient models on model evaluation, design trends, and more discussion on orthogonal efficiency effort, such as Weight Sharing, Quantization / Mixed precision, Knowledge Distillation, Neural Architecture Search (NAS) and Task Adapters.
In sum, this is a really good paper summarised all the important work around the Transformer model. It is also a good reference for researcher and engineering to be inspired and try these techniques for different models in their own projects.