Sequential Data & Transformer Networks
ECS 170W22 10 March 2022

● Fully connected layers: General purpose, often connected to the output layer
● Convolutional layers: Take advantage of the structure of the data to recognize patterns
● Recurrent neural networks: Take advantage of sequential data

● Sequential data: The next datum depends on the previous ones
● Example: the weather on one day depends on the weather of the previous days
● Example sequence of daily temperatures: 85, 87, 86, 81, 79

Recurrent Neural Networks (RNN) (1982)
● Each neuron takes in a data point (x_i) and the activation from the previous neuron (v_i) to produce an output (o_i) and a hidden activation (h_i)
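The step above can be written out directly. Below is a minimal sketch of an Elman-style recurrent step in NumPy; the weight names, sizes, and tanh nonlinearity are illustrative assumptions rather than the lecture's exact formulation.

```python
import numpy as np

# Minimal Elman-style RNN step (illustrative names and sizes).
# At each step i, the cell combines the input x_i with the previous hidden activation.

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
W_hy = rng.normal(size=(1, hidden_size))            # hidden-to-output weights
b_h = np.zeros(hidden_size)

def rnn_step(x_i, h_prev):
    """One recurrent step: returns the output o_i and the new hidden activation h_i."""
    h_i = np.tanh(W_xh @ x_i + W_hh @ h_prev + b_h)
    o_i = W_hy @ h_i
    return o_i, h_i

# Run over a short sequence, carrying the hidden activation forward.
h = np.zeros(hidden_size)
for x_i in rng.normal(size=(5, input_size)):
    o, h = rnn_step(x_i, h)
```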

Recurrent Neural Networks (RNN)
● The result can be a new sequence (sequence to sequence), a single prediction (sequence to vector), or a sequence generated from a single value (vector to sequence); a code sketch of the three patterns follows the examples below

Recurrent Neural Networks (RNN)
● Sequence to sequence:
○ Translation
○ “Quien no puede recordar el pasado está condenado a repetirlo” → “Those who cannot remember the past are condemned to repeat it”
● Sequence to vector:
○ Sentiment analysis
○ Movie reviews, online abuse
● Vector to sequence:
○ Captioning
○ Encoder-decoder
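To make the three usage patterns concrete, here is a small sketch using PyTorch's nn.RNN; the layer sizes and the way each pattern is read off are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the three usage patterns with one recurrent layer.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 5, 8)          # one sequence of 5 time steps, 8 features each

outputs, h_last = rnn(x)          # outputs: (1, 5, 16), h_last: (1, 1, 16)

# Sequence to sequence: keep an output at every time step (e.g., translation-style decoding).
seq_to_seq = outputs

# Sequence to vector: keep only the final hidden state (e.g., sentiment of a whole review).
seq_to_vec = h_last.squeeze(0)

# Vector to sequence: feed a single vector as the initial hidden state and unroll
# from placeholder inputs (e.g., captioning from an image embedding).
h0 = torch.randn(1, 1, 16)
start_inputs = torch.zeros(1, 5, 8)
vec_to_seq, _ = rnn(start_inputs, h0)
```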

Recurrent Neural Networks (RNN)
● Drawbacks:
○ All context is embedded into the v vector
○ Can only look back so far (usually a sequence of about 2, depending on the data)

Long Short-Term Memory (LSTM) (1997)
● Goal: Allow context to stretch further back
● Achieved through a “gating” mechanism (sketched in code below):
○ Input gate: Update the memory with new information
○ Forget gate: Forget older memory
○ Output gate: Decide what to tell the next unit
● Input: context C_{i-1}, hidden activation h_{i-1}, input x_i
● Output: output O_i, context C_i, hidden activation h_i
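A minimal sketch of the standard LSTM cell equations, matching the gate description above; the weight names are illustrative and biases are omitted for brevity.

```python
import numpy as np

# Sketch of a standard LSTM cell.
# Inputs per step: previous context C_{i-1}, previous hidden activation h_{i-1}, input x_i.
# Outputs per step: new context C_i and hidden activation h_i.

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate plus one for the candidate memory, acting on [h_{i-1}; x_i].
W_i, W_f, W_o, W_c = (rng.normal(size=(hidden_size, hidden_size + input_size)) for _ in range(4))

def lstm_step(C_prev, h_prev, x_i):
    z = np.concatenate([h_prev, x_i])
    i_gate = sigmoid(W_i @ z)      # input gate: how much new information to write
    f_gate = sigmoid(W_f @ z)      # forget gate: how much old memory to keep
    o_gate = sigmoid(W_o @ z)      # output gate: what to tell the next unit
    C_tilde = np.tanh(W_c @ z)     # candidate memory content
    C_i = f_gate * C_prev + i_gate * C_tilde
    h_i = o_gate * np.tanh(C_i)
    return C_i, h_i

C, h = np.zeros(hidden_size), np.zeros(hidden_size)
for x_i in rng.normal(size=(5, input_size)):
    C, h = lstm_step(C, h, x_i)
```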

Long Short-Term Memory
● Benefits:
○ Somewhat overcomes vanishing/exploding gradients
○ Has a context window of about 5
● Drawbacks:
○ Many more parameters
○ The window might still not be big enough

Bidirectional LSTMs (BiLSTM)
● Read the input both forward and backward; the output combines both (sketched in code below)
● Example: “Η σύζυγος του αδερφού μου είναι γιατρός” → “My brother’s wife is a doctor”
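A small sketch of a bidirectional LSTM encoder using PyTorch's nn.LSTM with bidirectional=True; the hyperparameters are made up for illustration.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: the sequence is read forward and backward,
# and the two directions are concatenated in the output.
bilstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(1, 7, 8)            # one sentence of 7 tokens, 8-dim embeddings
outputs, (h_n, c_n) = bilstm(x)

print(outputs.shape)                # (1, 7, 32): forward and backward states concatenated
print(h_n.shape)                    # (2, 1, 16): final hidden state per direction
```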

NMT By Jointly Learning to Align and Translate (2015)
● BiLSTM encoder
● Each output word gets its own context vector
○ Attention is placed on each input word depending on the position of the next output word (see the sketch below)
● Output depends on the context and the previous word (LSTM)
● What is needed:
○ BiLSTM encoder
○ Attention network to model attention
○ LSTM decoder
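A rough sketch of additive (Bahdanau-style) attention, which builds a separate context vector for each output word; the dimensions and weight names are illustrative assumptions.

```python
import numpy as np

# For each decoder step: score every encoder state against the current decoder state,
# softmax the scores, and take a weighted sum of encoder states as the context vector.

rng = np.random.default_rng(2)
enc_dim, dec_dim, attn_dim, src_len = 32, 16, 20, 7

encoder_states = rng.normal(size=(src_len, enc_dim))  # BiLSTM encoder outputs, one per source word
decoder_state = rng.normal(size=dec_dim)              # decoder state before emitting the next word

W_enc = rng.normal(size=(attn_dim, enc_dim))
W_dec = rng.normal(size=(attn_dim, dec_dim))
v = rng.normal(size=attn_dim)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Score e_j = v^T tanh(W_enc h_j + W_dec s): one score per source position.
scores = np.array([v @ np.tanh(W_enc @ h_j + W_dec @ decoder_state) for h_j in encoder_states])
weights = softmax(scores)                             # attention placed on each input word
context = weights @ encoder_states                    # context vector for the next output word
```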

NMT By Jointly Learning to Align and Translate – Results
● The window size is increased to about 30-50

Attention is All You Need (2017)
● Encoder:
○ Encode the sentence
○ Run attention via an attention unit
■ Value, key, query model (sketched in code below)
○ Feed-forward network to get the encoding
● Decoder:
○ Input the sentence as generated word-by-word
○ Attention is mapped from the output
○ Output is compared to the context vector
○ Sent to a feed-forward network
○ Next word generated
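A minimal sketch of the value/key/query (scaled dot-product) attention at the heart of the Transformer; the sizes and weight names are illustrative.

```python
import numpy as np

# Each token produces a query, key, and value; queries are compared to keys,
# and the resulting weights mix the values.

rng = np.random.default_rng(3)
seq_len, d_model, d_k = 6, 32, 16

X = rng.normal(size=(seq_len, d_model))      # encoded sentence: one vector per token

W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

attn_weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq_len, seq_len) attention map
attended = attn_weights @ V                       # attention output, one vector per token
# In the full model this output then feeds a position-wise feed-forward network.
```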

Attention is All You Need (2017) – Results
● Effectively infinite window
● The Chess Transformer: Mastering Play Using Generative Language Models, Noever et al., 2020
● Enhancing the Transformer With Explicit Relational Encoding for Math Problem Solving, Schlag et al., 2020
● Play with a transformer
