
Transformer (machine learning model)

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP)[1] and computer vision (CV).[2]

Like recurrent neural networks (RNNs), transformers are designed to process sequential input data, such as natural language, with applications towards tasks such as translation and text summarization. However, unlike RNNs, transformers process the entire input all at once. The attention mechanism provides context for any position in the input sequence. For example, if the input data is a natural language sentence, the transformer does not have to process one word at a time. This allows for more parallelization than RNNs and therefore reduces training times.[1]

Transformers were introduced in 2017 by a team at Google Brain[1] and are increasingly the model of choice for NLP problems,[3] replacing RNN models such as long short-term memory (LSTM). The additional training parallelization allows training on larger datasets. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks.[4][5]

Background

Before transformers, most state-of-the-art NLP systems relied on gated RNNs, such as LSTMs and gated recurrent units (GRUs), with added attention mechanisms. Transformers also make use of attention mechanisms but, unlike RNNs, do not have a recurrent structure. This means that provided with enough training data, attention mechanisms alone can match the performance of RNNs with attention.[1]

Sequential processing

Gated RNNs process tokens sequentially, maintaining a state vector that contains a representation of the data seen prior to the current token. To process the $n$th token, the model combines the state representing the sentence up to token $n-1$ with the information of the new token to create a new state, representing the sentence up to token $n$. Theoretically, the information from one token can propagate arbitrarily far down the sequence, if at every point the state continues to encode contextual information about the token. In practice this mechanism is flawed: the vanishing gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens. The dependency of token computations on the results of previous token computations also makes it hard to parallelize computation on modern deep-learning hardware. This can make the training of RNNs inefficient.
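To make the sequential bottleneck concrete, here is a minimal NumPy sketch (not taken from any particular paper; `rnn_step` is a generic, ungated recurrent update standing in for an LSTM or GRU cell) showing that each state update must wait for the previous one:

```python
import numpy as np

def rnn_step(state, token_embedding, W_h, W_x):
    # Generic recurrent update, used only to illustrate the data dependency:
    # computing the new state requires the previous state.
    return np.tanh(state @ W_h + token_embedding @ W_x)

d = 8                                      # state / embedding size (illustrative)
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
tokens = rng.normal(size=(5, d))           # embeddings of a 5-token sequence

state = np.zeros(d)
for x in tokens:                           # must run strictly in order:
    state = rnn_step(state, x, W_h, W_x)   # step n depends on step n-1
```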

Self-attention

These problems were addressed by attention mechanisms. Attention mechanisms let a model draw from the state at any preceding point along the sequence. The attention layer can access all previous states and weigh them according to a learned measure of relevance, providing relevant information about far-away tokens.

A clear example of the value of attention is in language translation, where context is essential to assign the meaning of a word in a sentence. In an English-to-French translation system, the first word of the French output most probably depends heavily on the first few words of the English input. However, in a classic LSTM model, in order to produce the first word of the French output, the model is given only the state vector after processing the last English word. Theoretically, this vector can encode information about the whole English sentence, giving the model all the necessary knowledge. In practice, this information is often poorly preserved by the LSTM. An attention mechanism can be added to address this problem: the decoder is given access to the state vectors of every English input word, not just the last, and can learn attention weights that dictate how much to attend to each English input state vector.

When added to RNNs, attention mechanisms increase performance. The development of the Transformer architecture revealed that attention mechanisms were powerful in themselves and that sequential recurrent processing of data was not necessary to achieve the quality gains of RNNs with attention. Transformers use an attention mechanism without an RNN, processing all tokens simultaneously and calculating attention weights between them in successive layers. Since the attention mechanism only uses information about other tokens from lower layers, it can be computed for all tokens in parallel, which leads to improved training speed.

Architecture

 
[Figure: Transformer model architecture]

Input

The input text is parsed into tokens by a byte pair encoding tokenizer, and each token is converted via a word embedding into a vector. Then, positional information of the token is added to the word embedding.
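As a rough sketch of this input pipeline (the toy vocabulary and random embedding table below are illustrative stand-ins, not a real byte pair encoding tokenizer or trained embeddings):

```python
import numpy as np

# Toy stand-in for a byte pair encoding tokenizer: a fixed token-to-id table.
vocab = {"the": 0, "cat": 1, "sat": 2}
token_ids = [vocab[t] for t in "the cat sat".split()]

d_model, N = 8, 10000                      # embedding width, positional-encoding base
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def positional_encoding(seq_len, d):
    # Sinusoidal encoding as in the original paper: interleaved sin/cos pairs.
    t = np.arange(seq_len)[:, None]        # positions 0 .. seq_len-1
    k = np.arange(d // 2)[None, :]
    theta = t / N ** (2 * k / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2], pe[:, 1::2] = np.sin(theta), np.cos(theta)
    return pe

x = embedding_table[token_ids]                          # (seq_len, d_model) word embeddings
x = x + positional_encoding(len(token_ids), d_model)    # add positional information
```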

Encoder–decoder architecture

Like earlier seq2seq models, the original Transformer model used an encoder–decoder architecture. The encoder consists of encoding layers that process the input iteratively one layer after another, while the decoder consists of decoding layers that do the same thing to the encoder's output.

The function of each encoder layer is to generate encodings that contain information about which parts of the inputs are relevant to each other. It passes its encodings to the next encoder layer as inputs. Each decoder layer does the opposite, taking all the encodings and using their incorporated contextual information to generate an output sequence.[6] To achieve this, each encoder and decoder layer makes use of an attention mechanism.

For each part of the input, attention weighs the relevance of every other part and draws from them to produce the output.[7] Each decoder layer has an additional attention mechanism that draws information from the outputs of previous decoders, before the decoder layer draws information from the encodings.

Both the encoder and decoder layers have a feed-forward neural network for additional processing of the outputs and contain residual connections and layer normalization steps.[7]

Scaled dot-product attention

The transformer building blocks are scaled dot-product attention units. When a sentence is passed into a transformer model, attention weights are calculated between all tokens simultaneously. The attention unit produces embeddings for every token in context that contain information about the token itself along with a weighted combination of other relevant tokens, each weighted by its attention weight.

For each attention unit, the transformer model learns three weight matrices: the query weights $W_Q$, the key weights $W_K$, and the value weights $W_V$. For each token $i$, the input word embedding $x_i$ is multiplied with each of the three weight matrices to produce a query vector $q_i = x_i W_Q$, a key vector $k_i = x_i W_K$, and a value vector $v_i = x_i W_V$. Attention weights are calculated using the query and key vectors: the attention weight $a_{ij}$ from token $i$ to token $j$ is the dot product between $q_i$ and $k_j$. The attention weights are divided by the square root of the dimension of the key vectors, $\sqrt{d_k}$, which stabilizes gradients during training, and passed through a softmax which normalizes the weights. The fact that $W_Q$ and $W_K$ are different matrices allows attention to be non-symmetric: if token $i$ attends to token $j$ (i.e. $q_i \cdot k_j$ is large), this does not necessarily mean that token $j$ will attend to token $i$ (i.e. $q_j \cdot k_i$ could be small). The output of the attention unit for token $i$ is the weighted sum of the value vectors of all tokens, weighted by $a_{ij}$, the attention from token $i$ to each token.

The attention calculation for all tokens can be expressed as one large matrix computation using the softmax function, which is useful for training because matrix operations are heavily optimized on modern hardware. The matrices $Q$, $K$ and $V$ are defined as the matrices whose $i$th rows are the vectors $q_i$, $k_i$, and $v_i$ respectively.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)V$$

where softmax is taken over the horizontal axis.
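The following NumPy sketch implements this formula directly (a minimal illustration, not an optimized or batched implementation; the random input `x` and weight matrices are placeholders for learned parameters):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices whose i-th rows are q_i, k_i, v_i.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V                                 # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                           # 5 tokens, model width 16
W_Q, W_K, W_V = (rng.normal(size=(16, 16)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_Q, x @ W_K, x @ W_V)   # (5, 16)
```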

Multi-head attention

One set of $\left( W_Q, W_K, W_V \right)$ matrices is called an attention head, and each layer in a transformer model has multiple attention heads. While each attention head attends to the tokens that are relevant to each token, with multiple attention heads the model can do this for different definitions of "relevance". In addition, the influence field representing relevance can become progressively dilated in successive layers. Many transformer attention heads encode relevance relations that are meaningful to humans. For example, some attention heads can attend mostly to the next word, while others mainly attend from verbs to their direct objects.[8] The computations for each attention head can be performed in parallel, which allows for fast processing. The outputs for the attention layer are concatenated to pass into the feed-forward neural network layers.

Concretely, let the multiple attention heads be indexed by $i$; then we have

$$\text{MultiheadedAttention}(Q, K, V) = \text{Concat}_{i}\left(\text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)\right) W^O$$

where the matrices $W_i^Q$, $W_i^K$, $W_i^V$ are "projection matrices" owned by individual attention head $i$, and $W^O$ is a final projection matrix owned by the whole multi-headed attention layer.
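Continuing the sketch above (reusing `scaled_dot_product_attention`; the head count, widths, and random matrices are illustrative assumptions), multi-head attention runs the heads independently, concatenates their outputs, and applies the final projection $W^O$:

```python
import numpy as np

def multi_head_attention(Q, K, V, heads, W_O):
    # heads: list of (W_iQ, W_iK, W_iV) projection-matrix triples, one per head.
    outputs = [scaled_dot_product_attention(Q @ W_iQ, K @ W_iK, V @ W_iV)
               for (W_iQ, W_iK, W_iV) in heads]
    return np.concatenate(outputs, axis=-1) @ W_O      # concat heads, final projection

d_model, n_heads = 16, 4
d_head = d_model // n_heads
rng = np.random.default_rng(0)
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))

x = rng.normal(size=(5, d_model))
out = multi_head_attention(x, x, x, heads, W_O)        # (5, d_model) self-attention
```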

Masked attention

It may be necessary to cut out attention links between some word-pairs. For example, the decoder for token position $t$ should not have access to token position $t+1$. This may be accomplished before the softmax stage by adding a mask matrix $M$ that is $-\infty$ at entries where the attention link must be cut, and $0$ at other places.
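A sketch of this masking for the common causal (decoder) case, following the NumPy examples above (the mask construction assumes position $t$ may attend to itself and to earlier positions only):

```python
import numpy as np

def causal_mask(seq_len):
    # M[i, j] = -inf where position i must not see position j (here: j > i), else 0.
    M = np.zeros((seq_len, seq_len))
    M[np.triu_indices(seq_len, k=1)] = -np.inf
    return M

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])  # add mask before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # masked entries get weight 0
    return weights @ V
```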

Encoder

Each encoder consists of two major components: a self-attention mechanism and a feed-forward neural network. The self-attention mechanism accepts input encodings from the previous encoder and weights their relevance to each other to generate output encodings. The feed-forward neural network further processes each output encoding individually. These output encodings are then passed to the next encoder as its input, as well as to the decoders.
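A minimal sketch of one encoder layer built from the pieces above (reusing `multi_head_attention`; the learned layer-norm gain/bias and dropout of the original architecture are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified layer normalization without learned gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, heads, W_O, W1, b1, W2, b2):
    # Self-attention sub-layer with residual connection and layer normalization.
    x = layer_norm(x + multi_head_attention(x, x, x, heads, W_O))
    # Position-wise feed-forward sub-layer (ReLU), also with residual + layer norm.
    ff = np.maximum(0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ff)
```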

The first encoder takes positional information and embeddings of the input sequence as its input, rather than encodings. The positional information is necessary for the transformer to make use of the order of the sequence, because no other part of the transformer makes use of this.[1]

The encoder is bidirectional. Attention can be placed on tokens before and after the current token. Tokens are used instead of words to account for polysemy.

 
[Figure: A diagram of a sinusoidal positional encoding with parameters $N = 10000$, $d = 100$]

Positional encoding

A positional encoding is a fixed-size vector representation that encapsulates the relative positions of tokens within a target sequence: it provides the transformer model with information about where the words are in the input sequence.

The positional encoding is defined as a function of type $f: \mathbb{R} \to \mathbb{R}^d$, where $d$ is a positive even integer. The full positional encoding, as defined in the original paper, is given by the equation:

$$\left( f(t)_{2k},\; f(t)_{2k+1} \right) = \left( \sin\theta,\; \cos\theta \right) \quad \forall k \in \{0, 1, \ldots, d/2 - 1\}$$

where $\theta = \dfrac{t}{r^k}$ and $r = N^{2/d}$.


Here, $N$ is a free parameter that should be significantly larger than the biggest $k$ that would be input into the positional encoding function. In the original paper,[1] the authors chose $N = 10000$.

The function is in a simpler form when written as a complex function of type $f: \mathbb{R} \to \mathbb{C}^{d/2}$:

$$f(t) = \left( e^{it/r^k} \right)_{k = 0, 1, \ldots, \frac{d}{2} - 1}$$

where $r = N^{2/d}$.

The main reason the authors chose this as the positional encoding function is that it allows one to perform shifts as linear transformations:

$$f(t + \Delta t) = \mathrm{diag}\big(f(\Delta t)\big)\, f(t)$$

where $\Delta t \in \mathbb{R}$ is the distance one wishes to shift. This allows the transformer to take any encoded position, and find the encoding of the position n-steps-ahead or n-steps-behind, by a matrix multiplication.

By taking a linear sum, any convolution can also be implemented as linear transformations:

$$\sum_j c_j\, f(t + \Delta t_j) = \left( \sum_j c_j\, \mathrm{diag}\big(f(\Delta t_j)\big) \right) f(t)$$

for any constants $c_j$. This allows the transformer to take any encoded position and find a linear sum of the encoded locations of its neighbors. This sum of encoded positions, when fed into the attention mechanism, would create attention weights on its neighbors, much like what happens in a convolutional neural network language model. In the authors' words, "we hypothesized it would allow the model to easily learn to attend by relative position".

In typical implementations, all operations are done over the real numbers, not the complex numbers, but since complex multiplication can be implemented as real 2-by-2 matrix multiplication, this is a mere notational difference.
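The shift property can be checked numerically with the complex form of the encoding (a small sketch; the choices of $d$, $N$, and the positions are arbitrary):

```python
import numpy as np

d, N = 8, 10000
r = N ** (2 / d)

def f_complex(t):
    # Complex form of the sinusoidal encoding: f(t)_k = exp(i * t / r**k).
    k = np.arange(d // 2)
    return np.exp(1j * t / r ** k)

t, dt = 3.0, 5.0
# Shifting by dt equals multiplying elementwise by f(dt), i.e. by diag(f(dt)).
assert np.allclose(f_complex(t + dt), f_complex(dt) * f_complex(t))
```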

Other positional encoding schemes exist.[9]

Decoder

Each decoder consists of three major components: a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted which instead draws relevant information from the encodings generated by the encoders. This mechanism can also be called the encoder-decoder attention.[1][7]

Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. The transformer must not use the current or future output to predict an output, so the output sequence must be partially masked to prevent this reverse information flow.[1] This allows for autoregressive text generation. For all attention heads, attention can't be placed on following tokens. The last decoder is followed by a final linear transformation and softmax layer, to produce the output probabilities over the vocabulary.
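Putting the decoder pieces together, the sketch below (reusing `layer_norm`, `masked_attention`, and `scaled_dot_product_attention` from the examples above, with single-head attention for brevity) shows the three sub-layers and the final projection to vocabulary probabilities:

```python
import numpy as np

def decoder_layer(y, enc_out, W_self, W_cross, W1, b1, W2, b2):
    # Masked self-attention over the (shifted) output sequence.
    Wq, Wk, Wv = W_self
    y = layer_norm(y + masked_attention(y @ Wq, y @ Wk, y @ Wv))
    # Encoder-decoder attention: queries come from the decoder,
    # keys and values come from the encoder output.
    Wq, Wk, Wv = W_cross
    y = layer_norm(y + scaled_dot_product_attention(y @ Wq, enc_out @ Wk, enc_out @ Wv))
    # Position-wise feed-forward sub-layer with residual + layer norm.
    ff = np.maximum(0, y @ W1 + b1) @ W2 + b2
    return layer_norm(y + ff)

def output_probabilities(y, W_vocab):
    # Final linear transformation followed by a softmax over the vocabulary.
    logits = y @ W_vocab
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```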

GPT has a decoder-only architecture.

Alternatives

Training transformer-based architectures can be expensive, especially for long inputs.[10] Alternative architectures include the Reformer (which reduces the computational load from $O(N^2)$ to $O(N \ln N)$[11]), or models like ETC/BigBird (which can reduce it to $O(N)$),[12] where $N$ is the length of the sequence. The Reformer achieves this using locality-sensitive hashing and reversible layers.[13][14]

Ordinary transformers require a memory size that is quadratic in the size of the context window. Attention Free Transformers[15] reduce this to a linear dependence while still retaining the advantages of a transformer by linking the key to the value.

A benchmark for comparing transformer architectures was introduced in late 2020 by the name of Long Range Arena.[16]

Training

Transformers typically undergo self-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a larger dataset than fine-tuning, due to the limited availability of labeled training data. Tasks for pretraining and fine-tuning commonly include:

  • language modeling[4]
  • next-sentence prediction[4]
  • question answering[5]
  • reading comprehension
  • sentiment analysis[17]
  • paraphrasing[17]

Applications

The transformer has had great success in natural language processing (NLP), for example the tasks of machine translation and time series prediction.[18] Many pretrained models such as GPT-2, GPT-3, BERT, XLNet, RoBERTa and ChatGPT demonstrate the ability of transformers to perform a wide variety of such NLP-related tasks, and have the potential to find real-world applications.[4][5][19] These may include:

  • machine translation
  • document summarization
  • document generation
  • named entity recognition (NER)[20]
  • biological sequence analysis[21][22][23]
  • video understanding[24]

In 2020, it was shown that the transformer architecture, more specifically GPT-2, could be tuned to play chess.[25] Transformers have been applied to image processing with results competitive with convolutional neural networks.[26][27]

Due to the impressive results and the wide adoption in computer vision and language modeling, transformers started being adopted in new domains, such as medical imaging[28][29][30][31] and speech recognition.[32][33][34][35]

Implementations

The transformer model has been implemented in standard deep learning frameworks such as TensorFlow and PyTorch.
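For example, PyTorch ships a built-in `torch.nn.Transformer` module; the hedged sketch below uses the base-model hyperparameters from the original paper, with random tensors standing in for real embedded sequences (the default layout is `(sequence length, batch, d_model)`):

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)   # (source length, batch size, d_model)
tgt = torch.rand(20, 32, 512)   # (target length, batch size, d_model)
out = model(src, tgt)           # (20, 32, 512)
```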

Transformers is a library produced by Hugging Face that supplies transformer-based architectures and pretrained models.[3]
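A typical entry point is the library's `pipeline` helper, which downloads a pretrained model on first use (the exact model and score below depend on the library's current defaults):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default pretrained model
print(classifier("Transformers process the entire input all at once."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```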

See also

  • Perceiver – Machine learning algorithm for non-textual data
  • GPT-3 – 2020 text-generating language model
  • ChatGPT – Artificial intelligence chatbot developed by OpenAI
  • Wu Dao – Chinese multimodal artificial intelligence program
  • Vision transformer – Machine learning algorithm for vision processing
  • BLOOM (language model) – Open-access multilingual language model

References

  1. ^ a b c d e f g h Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017-06-12). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
  2. ^ He, Cheng (31 December 2021). "Transformer in CV". Transformer in CV. Towards Data Science.
  3. ^ a b Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). "Transformers: State-of-the-Art Natural Language Processing". Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
  4. ^ a b c d "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing". Google AI Blog. Retrieved 2019-08-25.
  5. ^ a b c "Better Language Models and Their Implications". OpenAI. 2019-02-14. Retrieved 2019-08-25.
  6. ^ "Sequence Modeling with Neural Networks (Part 2): Attention Models". Indico. 2016-04-18. Retrieved 2019-10-15.
  7. ^ a b c Alammar, Jay. "The Illustrated Transformer". jalammar.github.io. Retrieved 2019-10-15.
  8. ^ Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (August 2019). "What Does BERT Look at? An Analysis of BERT's Attention". Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence, Italy: Association for Computational Linguistics: 276–286. doi:10.18653/v1/W19-4828.
  9. ^ Dufter, Philipp; Schmitt, Martin; Schütze, Hinrich (2022-06-06). "Position Information in Transformers: An Overview". Computational Linguistics. 48 (3): 733–763. doi:10.1162/coli_a_00445. ISSN 0891-2017. S2CID 231986066.
  10. ^ Kitaev, Nikita; Kaiser, Łukasz; Levskaya, Anselm (2020). "Reformer: The Efficient Transformer". arXiv:2001.04451 [cs.LG].
  11. ^ Kitaev, Nikita; Kaiser, Łukasz; Levskaya, Anselm (2020-02-18). "Reformer: The Efficient Transformer". arXiv:2001.04451 [cs.LG].
  12. ^ "Constructing Transformers For Longer Sequences with Sparse Attention Methods". Google AI Blog. Retrieved 2021-05-28.
  13. ^ "Tasks with Long Sequences – Chatbot". Coursera.
  14. ^ "Reformer: The Efficient Transformer". Google AI Blog. Retrieved 2020-10-22.
  15. ^ Zhai, Shuangfei; Talbott, Walter; Srivastava, Nitish; Huang, Chen; Goh, Hanlin; Zhang, Ruixiang; Susskind, Josh (2021-09-21). "An Attention Free Transformer". arXiv:2105.14103 [cs.LG].
  16. ^ Tay, Yi; Dehghani, Mostafa; Abnar, Samira; Shen, Yikang; Bahri, Dara; Pham, Philip; Rao, Jinfeng; Yang, Liu; Ruder, Sebastian; Metzler, Donald (2020-11-08). "Long Range Arena: A Benchmark for Efficient Transformers". arXiv:2011.04006 [cs.LG].
  17. ^ a b Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics: 353–355. arXiv:1804.07461. doi:10.18653/v1/w18-5446. S2CID 5034059.
  18. ^ Allard, Maxime (2019-07-01). "What is a Transformer?". Medium. Retrieved 2019-10-21.
  19. ^ Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2019-06-19). XLNet: Generalized Autoregressive Pretraining for Language Understanding. OCLC 1106350082.
  20. ^ Monsters, Data (2017-09-26). "10 Applications of Artificial Neural Networks in Natural Language Processing". Medium. Retrieved 2019-10-21.
  21. ^ Rives, Alexander; Goyal, Siddharth; Meier, Joshua; Guo, Demi; Ott, Myle; Zitnick, C. Lawrence; Ma, Jerry; Fergus, Rob (2019). "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences". bioRxiv 10.1101/622803.
  22. ^ Nambiar, Ananthan; Heflin, Maeve; Liu, Simon; Maslov, Sergei; Hopkins, Mark; Ritz, Anna (2020). "Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks". doi:10.1145/3388440.3412467. S2CID 226283020.
  23. ^ Rao, Roshan; Bhattacharya, Nicholas; Thomas, Neil; Duan, Yan; Chen, Xi; Canny, John; Abbeel, Pieter; Song, Yun S. (2019). "Evaluating Protein Transfer Learning with TAPE". bioRxiv 10.1101/676825.
  24. ^ Bertasias; Wang; Torresani (2021). "Is Space-Time Attention All You Need for Video Understanding?". arXiv:2102.05095 [cs.CV].
  25. ^ Noever, David; Ciolino, Matt; Kalin, Josh (2020-08-21). "The Chess Transformer: Mastering Play using Generative Language Models". arXiv:2008.04057 [cs.AI].
  26. ^ Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob; Houlsby, Neil (2020). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  27. ^ Touvron, Hugo; Cord, Matthieu; Douze, Matthijs; Massa, Francisco; Sablayrolles, Alexandre; Jégou, Hervé (2020). "Training data-efficient image transformers & distillation through attention". arXiv:2012.12877 [cs.CV].
  28. ^ Chen, Jieneng; Lu, Yongyi; Yu, Qihang; Luo, Xiangde; Adeli, Ehsan; Wang, Yan; Lu, Le; Yuille, Alan L.; Zhou, Yuyin (2021-02-08). "TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation". arXiv:2102.04306 [cs.CV].
  29. ^ "UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation". MICCAI 2021 - Accepted Papers and Reviews. 2021-09-01. Retrieved 2023-01-21.
  30. ^ Ristea, Nicolae-Catalin; Miron, Andreea-Iuliana; Savencu, Olivian; Georgescu, Mariana-Iuliana; Verga, Nicolae; Khan, Fahad Shahbaz; Ionescu, Radu Tudor (2021-10-21). "CyTran: Cycle-Consistent Transformers for Non-Contrast to Contrast CT Translation". arXiv:2110.06400 [eess.IV].
  31. ^ Hatamizadeh, Ali; Tang, Yucheng; Nath, Vishwesh; Yang, Dong; Myronenko, Andriy; Landman, Bennett; Roth, Holger; Xu, Daguang (2021-10-09). "UNETR: Transformers for 3D Medical Image Segmentation". arXiv:2103.10504 [eess.IV].
  32. ^ Gong, Yuan; Chung, Yu-An; Glass, James (2021-07-08). "AST: Audio Spectrogram Transformer". arXiv:2104.01778 [cs.SD].
  33. ^ Leong, Chi-Hang; Huang, Yu-Han; Chien, Jen-Tzung. "Online Compressive Transformer for End-to-End Speech Recognition". www.isca-speech.org. Retrieved 2023-01-21.
  34. ^ Lohrenz, Timo; Li, Zhengyang; Fingscheidt, Tim (2021-07-14). "Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition". arXiv:2104.00120 [eess.AS].
  35. ^ Ristea, Nicolae-Catalin; Ionescu, Radu Tudor; Khan, Fahad Shahbaz (2022-06-20). "SepTr: Separable Transformer for Audio Spectrogram Processing". arXiv:2203.09581 [cs.CV].

Further reading

  • Hubert Ramsauer et al. (2020), "Hopfield Networks is All You Need", preprint submitted for ICLR 2021. arXiv:2008.02217; see also authors' blog
– Discussion of the effect of a transformer layer as equivalent to a Hopfield update, bringing the input closer to one of the fixed points (representable patterns) of a continuous-valued Hopfield network
  • Alexander Rush, The Annotated transformer, Harvard NLP group, 3 April 2018
