Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained)
Yannic Kilcher

Published on May 31, 2020

Do we really need dot-product attention? The attention mechanism is a central part of modern Transformers, largely because of its dot-product formulation. This paper removes the quadratic query-key interaction terms and instead synthesizes the attention weights directly, yielding a new model, the Synthesizer. As it turns out, you can do pretty well that way! A rough sketch of the difference follows below.
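To make the contrast concrete, here is a minimal PyTorch sketch of standard dot-product attention next to the paper's Dense Synthesizer, where attention logits come from a feed-forward function of each token alone. Class and layer names are my own, and the synthesizer needs a fixed maximum sequence length; this is an illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductAttention(nn.Module):
    """Vanilla single-head attention: quadratic query-key interactions."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Every token's query meets every token's key.
        scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ v

class DenseSynthesizer(nn.Module):
    """Attention logits are predicted from each token by itself: no dot products."""
    def __init__(self, d_model, max_len):
        super().__init__()
        # Two-layer FFN maps each token to a row of attention logits.
        self.f = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                               nn.Linear(d_model, max_len))
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model), seq == max_len
        scores = self.f(x)  # (batch, seq, max_len): row i = logits for token i
        return F.softmax(scores, dim=-1) @ self.v(x)
```

Note that the synthesizer's attention pattern for token i depends only on token i itself, which is why the video spends time comparing it to a (dynamic) feed-forward layer.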

OUTLINE:
0:00 - Intro & High-Level Overview
1:00 - Abstract
2:30 - Attention Mechanism as Information Routing
5:45 - Dot Product Attention
8:05 - Dense Synthetic Attention
15:00 - Random Synthetic Attention
17:15 - Comparison to Feed-Forward Layers
22:00 - Factorization & Mixtures
23:10 - Number of Parameters
25:35 - Machine Translation & Language Modeling Experiments
36:15 - Summarization & Dialogue Generation Experiments
37:15 - GLUE & SuperGLUE Experiments
42:00 - Weight Sizes & Number of Heads Ablations
47:05 - Conclusion

Paper: https://arxiv.org/abs/2005.00743
My Video on Transformers (Attention Is All You Need)
My Video on BERT

Abstract:
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. Our experimental results show that Synthesizer is competitive against vanilla Transformer models across a range of tasks, including MT (EnDe, EnFr), language modeling (LM1B), abstractive summarization (CNN/Dailymail), dialogue generation (PersonaChat) and Multi-task language understanding (GLUE, SuperGLUE).
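The "random alignment matrices" in point (1) correspond to the paper's Random Synthesizer, which drops the input dependence entirely: the attention matrix is just a learned (or even fixed) parameter shared across all examples. A hedged sketch, with illustrative names and a fixed sequence length:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomSynthesizer(nn.Module):
    """Alignment matrix is a plain parameter, independent of the input."""
    def __init__(self, d_model, max_len, trainable=True):
        super().__init__()
        # Set trainable=False for the fixed-random ablation from the paper.
        self.r = nn.Parameter(torch.randn(max_len, max_len),
                              requires_grad=trainable)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model), seq == max_len
        attn = F.softmax(self.r, dim=-1)  # identical mixing weights for every input
        return attn @ self.v(x)           # broadcasts over the batch dimension
```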

Authors: Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

Links:
YouTube: /yannickilcher
Twitter: /ykilcher
BitChute: https://www.bitchute.com/channel/yann...
Minds: https://www.minds.com/ykilcher
