Sequential Data & Transformer Networks
ECS 170 W22, 10 March 2022
● Fully connected layers: General purpose, often connected to the output layer
● Convolutional layers: Take advantage of the structure of the data to recognize patterns
● Recurrent neural networks: Take advantage of sequential data
● Sequential data: The next datum depends on the previous ones
● Example: The weather on one day depends on the previous days, e.g. daily temperatures 85, 87, 86, 81, 79
Recurrent Neural Networks (RNN) (1982)
● Each unit takes in a data point (xi) and the activation from the previous step (vi) to produce an output (oi) and a hidden activation (hi)
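As a concrete (if simplified) picture, one recurrent step can be written out directly. The sketch below is in NumPy; the weight names (W_xh, W_hh, W_ho) and the layer sizes are illustrative assumptions rather than anything specified in the lecture.

import numpy as np

def rnn_step(x_i, h_prev, W_xh, W_hh, W_ho, b_h, b_o):
    # Combine the current input with the previous activation, then emit
    # this step's output and the hidden activation passed to the next step.
    h_i = np.tanh(W_xh @ x_i + W_hh @ h_prev + b_h)   # hidden activation hi
    o_i = W_ho @ h_i + b_o                            # output oi
    return o_i, h_i

# Tiny example: 3-dimensional inputs, a 4-dimensional hidden state, 2 outputs
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))
W_hh = rng.normal(size=(4, 4))
W_ho = rng.normal(size=(2, 4))
b_h, b_o = np.zeros(4), np.zeros(2)

h = np.zeros(4)                       # initial hidden activation
for x in rng.normal(size=(5, 3)):     # a length-5 input sequence
    o, h = rnn_step(x, h, W_xh, W_hh, W_ho, b_h, b_o)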
Recurrent Neural Networks (RNN)
● Result can be a new sequence (sequence to sequence), a single prediction (sequence to vector), or a single value expanded into a sequence (vector to sequence); see the sketch after this slide
● Sequence to sequence:
○ Translation
Quien no puede recordar el pasado está condenado a repetirlo
Those who cannot remember the past are condemned to repeat it
● Sequence to vector:
○ Sentiment analysis
○ Movie reviews, online abuse
● Vector to sequence:
○ Captioning
○ Encoder-decoder
● Drawbacks:
○ All context is embedded into the single v vector
○ Can only look back so far (usually a sequence of about 2, depends on …)
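A minimal sketch of the three usage patterns with PyTorch's nn.RNN; the layer sizes, the extra Linear head, and the zero "dummy" decoder inputs are illustrative choices, not the lecture's recipe.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 5, 8)          # one sequence of length 5, 8 features per step

outputs, h_n = rnn(x)             # outputs: (1, 5, 16), h_n: (1, 1, 16)

# Sequence to sequence: use the per-step outputs (e.g. feed them to a decoder)
seq_to_seq = outputs

# Sequence to vector: keep only the final hidden state (e.g. a sentiment score)
seq_to_vec = nn.Linear(16, 1)(h_n[-1])

# Vector to sequence: one simple scheme is to use a single vector as the
# initial hidden state and feed placeholder inputs at each step (e.g. captioning)
h0 = torch.randn(1, 1, 16)
dummy_inputs = torch.zeros(1, 5, 8)
vec_to_seq, _ = rnn(dummy_inputs, h0)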
Long Short-Term Memory (LSTM) (1997)
● Goal: Allow context to stretch further back
● Achieved through a “gating” mechanism (see the sketch below):
○ Input gate: Update memory with new information
○ Forget gate: Forget older memory
○ Output gate: Decide what to tell the next unit
● Input: context Ci-1, hidden activation hi-1, input xi
● Output: output Oi, context Ci, hidden activation hi
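A minimal NumPy sketch of one LSTM step with the three gates spelled out. Packing all four weight matrices into a single W applied to the concatenation [h_prev, x] is one common convention, assumed here for brevity; sizes are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # c is the context/memory Ci, h is the hidden activation hi from the slide.
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.size
    i = sigmoid(z[0*H:1*H])        # input gate: how much new information to write
    f = sigmoid(z[1*H:2*H])        # forget gate: how much old memory to keep
    o = sigmoid(z[2*H:3*H])        # output gate: what to tell the next unit
    g = np.tanh(z[3*H:4*H])        # candidate memory content
    c = f * c_prev + i * g         # updated context Ci
    h = o * np.tanh(c)             # hidden activation hi
    return h, c

H, X = 4, 3
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4 * H, H + X)), np.zeros(4 * H)
h = c = np.zeros(H)
for x in rng.normal(size=(6, X)):  # a length-6 input sequence
    h, c = lstm_step(x, h, c, W, b)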
Long Short-Term Memory (LSTM)
● Benefits:
○ Mitigates vanishing/exploding gradients somewhat
○ Has an effective window of roughly 5 steps
● Drawbacks:
○ Many more parameters (see the comparison below)
○ The window might still not be big enough
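The parameter cost is easy to check: an LSTM keeps four sets of input and recurrent weights (input, forget, and output gates plus the candidate memory), so at the same hidden size it has roughly four times the parameters of a vanilla RNN. A quick PyTorch comparison with arbitrary layer sizes:

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

rnn = nn.RNN(input_size=128, hidden_size=256)
lstm = nn.LSTM(input_size=128, hidden_size=256)

print(n_params(rnn), n_params(lstm))  # the LSTM has roughly 4x as many parameters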
Bidirectional LSTM (BiLSTM)
● Read the input both forward and backward; the output combines both directions
● Η σύζυγος του αδερφού μου είναι γιατρός
● My brother’s wife is a doctor
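In PyTorch, bidirectional reading is a single flag. In the sketch below (arbitrary sizes), each position's output is the concatenation of the forward and backward hidden states.

import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)
x = torch.randn(1, 7, 8)             # one sequence of 7 tokens (as embeddings)
outputs, (h_n, c_n) = bilstm(x)
print(outputs.shape)                 # (1, 7, 32): forward half + backward half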
NMT by Jointly Learning to Align and Translate (2015)
● BiLSTM encoder
● Each output word gets its own context vector
○ Attention is placed on each input word depending on the position of the next output word
● Output depends on the context and the previous word (LSTM)
● What is needed:
○ BiLSTM encoder
○ Attention network to model attention (see the sketch below)
○ LSTM decoder
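The attention network can be sketched as additive (Bahdanau-style) attention: score every encoder state against the decoder's previous state, softmax the scores into weights, and take the weighted sum as that output word's context vector. The NumPy sketch below uses the customary W_a, U_a, v_a names; the sizes are arbitrary and this is not the authors' implementation.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, encoder_states, W_a, U_a, v_a):
    # Score each encoder (BiLSTM) state against the decoder's previous state,
    # normalize the scores, and return the weighted context vector.
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)
                       for h_j in encoder_states])
    alpha = softmax(scores)               # attention weights over the input words
    return alpha @ encoder_states, alpha  # context vector for this output word

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 32))            # 6 input words, BiLSTM states of size 32
s_prev = rng.normal(size=24)              # decoder LSTM state
W_a = rng.normal(size=(20, 24))
U_a = rng.normal(size=(20, 32))
v_a = rng.normal(size=20)
context, weights = attention_context(s_prev, enc, W_a, U_a, v_a)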
NMT by Jointly Learning to Align and Translate – Results
● Window size is increased to about 30-50
Attention is All You Need (2017)
● Encoder:
○ Encode the sentence
○ Run attention via an attention unit
■ Value, key, query model (see the sketch below)
○ Feed-forward network to get the encoding
● Decoder:
○ Input the sentence as generated, word by word
○ Attention is mapped from the output
○ The output is compared to the context vector
○ Sent to a feed-forward network
○ The next word is generated
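The value, key, query model is scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. A single-head NumPy sketch, with batching, masking, and the multi-head projections omitted:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax over the keys
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # (n_queries, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))   # queries (e.g. from the decoder / previous layer)
K = rng.normal(size=(7, 64))   # keys    (from the encoded sentence)
V = rng.normal(size=(7, 64))   # values  (from the encoded sentence)
out = scaled_dot_product_attention(Q, K, V)   # (5, 64)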
Attention is All You Need (2017) – Results
● Effectively infinite window: self-attention can attend to any position in the input
● The Chess Transformer: Mastering Play Using Generative Language Models, Noever et al., 2020
● Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving, Schlag et al., 2020
● Play with a transformer (see the example below)
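The slide does not say which demo it linked, but one simple way to play with a transformer is text generation with a pretrained GPT-2 through the Hugging Face transformers library (assumes the package is installed; the model is downloaded on first use).

from transformers import pipeline

# Hypothetical quick demo, not the lecture's linked tool
generator = pipeline("text-generation", model="gpt2")
prompt = "Those who cannot remember the past"
print(generator(prompt, max_length=30)[0]["generated_text"])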