Machine Learning 10-601/301
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
March 15, 2021
This section:
• Convolutional neural nets
• Recurrent neural nets
• LSTMs
• Sequence to sequence models
Reading:
• optional: Mitchell: Chapter 4
• Note: the Mitchell book is now downloadable
Convolutional Neural Nets
A Convolutional Neural Net for Handwritten Digit recognition: LeNet5*
[LeCun et al., 1998]
* In the 1998 LeNet5 paper the output layer was a Gaussian RBF layer, though today we would use Softmax to obtain probabilities as outputs
Convolution Layer
[Figure, from Goodfellow et al.: the input I is convolved with the kernel K (the learned parameters) to produce the result S.]
Convolution: yields invariance to input translation
[Figure: worked example. Input I is the 3×3 array [[0,1,2],[3,4,5],[6,7,8]]; kernel K is the 2×2 array [[1,2],[1,0]] of trained parameters. Sliding K over I gives the result S = [[5,9],[17,21]], the output activations.]
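To make the arithmetic in the figure concrete, here is a minimal sketch (not the course's released code) that reproduces it with a plain "valid" cross-correlation, which is what deep-learning layers call convolution:

```python
# Reproduces the convolution example above: 3x3 input, 2x2 kernel of trained
# parameters, 2x2 map of output activations.
import numpy as np

def conv2d_valid(I, K):
    """'Valid' cross-correlation: slide the same kernel K over every position of I."""
    H, W = I.shape
    kh, kw = K.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i+kh, j:j+kw] * K)   # same K at every position
    return S

I = np.arange(9).reshape(3, 3)        # input I: 0..8
K = np.array([[1, 2],
              [1, 0]])                # kernel K: trained parameters
print(conv2d_valid(I, K))             # [[ 5.  9.] [17. 21.]]
```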
Convolution as parameter sharing
[Same figure as above: every entry of the result S is computed with the same kernel K, so the four kernel weights (trained parameters) are shared across all output positions (output activations).]
How do we calculate the gradient components (derivatives of the loss with respect to the trained kernel parameters) for training example d?
[Same convolution figure as above: each kernel weight contributes to every output activation in S.]
Maxpool Layer
out = max(a, b, e, f)
What is the derivative of out with respect to its inputs? (e.g., if a=2, b=3, e=2, f=4)
[from Goodfellow et al.]
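A minimal sketch of the answer, assuming ties are broken arbitrarily: the gradient is 1 for the input that attains the max and 0 for the others.

```python
# Max-pool over the 2x2 window (a, b, e, f): out = max(a, b, e, f).
# d(out)/d(input) is 1 for the winning input and 0 elsewhere.
import numpy as np

x = np.array([2.0, 3.0, 2.0, 4.0])   # a=2, b=3, e=2, f=4
out = x.max()                        # forward pass: out = 4.0
grad = (x == out).astype(float)      # backward pass: [0., 0., 0., 1.]
print(out, grad)
```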
Subsampling Layer (in LeNet)
out = sigmoid(w0 + w1(a + b + e + f))
What is the derivative of out with respect to its inputs?
[from Goodfellow et al.]
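A sketch of the chain-rule answer, with hypothetical values for the trained parameters w0 and w1: every input in the 2×2 window receives the same gradient out·(1−out)·w1, unlike max-pool.

```python
# Subsampling unit: out = sigmoid(w0 + w1*(a+b+e+f)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a, b, e, f = 2.0, 3.0, 2.0, 4.0
w0, w1 = 0.1, 0.05                              # hypothetical trained parameters
out = sigmoid(w0 + w1 * (a + b + e + f))
d_out_d_each_input = out * (1 - out) * w1       # same for a, b, e, and f
d_out_d_w1 = out * (1 - out) * (a + b + e + f)
d_out_d_w0 = out * (1 - out)
```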
A Convolutional Neural Net for Handwritten Digit recognition: LeNet5*
LeNet5 details
• [LeCun et al., 1998]
• C1 is a convolution layer using 6 distinct 5×5 kernels, stride 1, creating 6 distinct channels of 28×28 feature maps, each based on one kernel. Total trainable parameters:
156
• S2 is a subsampling layer, creating 6 channels, one each from the corresponding channel of C1. Values are based on a 2×2 input kernel, stride 2 (so no overlap) and the value output to the S2 map is out = sigmoid(w0+w1(x1+x2+x3+x4)), where xi’s are the four inputs to the 2×2 kernel. Total trainable parameters:
• C3 is a convolutional layer, using 16 kernels to produce 16 feature maps. Each kernel is connected to several 5×5 neighborhoods at identical locations in a subset of the 6 channels of S2 as shown below. Total trainable parameters: 1,516
• S4 subsamples C3, just like S2 samples C1
Poll Question 2: How many total trainable parameters are in layer S2?
Answer: 12 (each of the 6 channels has one weight w1 and one bias w0)
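The arithmetic behind these counts, as a quick check (the C3 breakdown assumes the connection table from LeCun et al., 1998):

```python
# Parameter-count arithmetic for LeNet5 (a check, not official course code).
k = 5 * 5                          # size of one 5x5 kernel
C1 = 6 * (k + 1)                   # 6 kernels, each 25 weights + 1 bias -> 156
S2 = 6 * (1 + 1)                   # each channel: one w1 and one w0     -> 12
# C3 breakdown (assumed from the LeNet-5 connection table):
# 6 maps see 3 S2 channels, 6 see 4, 3 see 4, and 1 sees all 6.
C3 = 6 * (3*k + 1) + 6 * (4*k + 1) + 3 * (4*k + 1) + 1 * (6*k + 1)   # -> 1516
print(C1, S2, C3)
```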
LeNet5 (1998) vs. a more typical 2021 convolutional net:
[Figure: architecture diagrams; the recoverable labels are max-pool layers, sigmoid/linear/ReLU units, a fully connected layer, and a Softmax output layer.]
Softmax Layer: predict a probability distribution over discrete-valued labels
• Logistic regression: when Y has two possible values
• Softmax: when Y has R values {y1 … yR}, learn R sets of weights to predict R output probabilities
• Note the neural network now has R outputs instead of just 1
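A minimal sketch of the softmax computation on the R outputs (the max is subtracted only for numerical stability):

```python
# Softmax: turn R real-valued network outputs into R probabilities.
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # hypothetical R = 3 network outputs
print(softmax(logits))                # non-negative and sums to 1
```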
A Convolutional Neural Net for Handwritten Digit recognition: LeNet
• Shrinking size of feature maps
• Multiple channels
• LeNet-5 Demos:
http://yann.lecun.com/exdb/lenet/index.html
• Vary scale
• Vary stroke width
• Squeeze
• Noisy-2, Noisy-4
[from Goodfellow et al.]
Convolutional networks for time series → invariance across time
[from Margarita Granat]
[Abdel-Hamid, et al., Convolutional Neural Networks for Speech Recognition, IEEE, 2014]
Convolutional Neural Nets
• Convolution across space, time
• Parameter sharing
• Translation invariance
• Scaling
• Multiple channels of “feature maps”
• Architecture with multiple types of layers
• Popular for perception problems
Recurrent Neural Nets for Sequential Data
Sequences
● Words, Letters
● Speech
● Images, Videos
● Programs
● Sequential Decision Making (RL)
Recurrent Networks
• Key idea: a recurrent network uses (part of) its state at time t as input for t+1
• The hidden state at the previous time step is combined with the current input through a nonlinearity to produce the new hidden state
• Another example of parameter sharing, like CNNs: the same weights are reused at every time step
[Goodfellow et al., 2016]
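A minimal sketch of that recurrence, assuming the common tanh update h_t = tanh(W h_{t-1} + U x_t + b); the sizes and random weights below are purely illustrative.

```python
# Simple (Elman-style) RNN forward pass: the same W, U, b are reused at every step.
import numpy as np

def rnn_forward(xs, W, U, b, h0):
    h, hs = h0, []
    for x in xs:                           # one step per input in the sequence
        h = np.tanh(W @ h + U @ x + b)     # new state from previous state + input
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
d_h, d_x, T = 4, 3, 5                      # hypothetical sizes
W = rng.normal(size=(d_h, d_h))
U = rng.normal(size=(d_h, d_x))
b = np.zeros(d_h)
hs = rnn_forward([rng.normal(size=d_x) for _ in range(T)], W, U, b, np.zeros(d_h))
```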
Training Recurrent Networks
Key principle for training:
1. Treat the network as if unfolded in time, resulting in a directed acyclic graph
2. Note the shared parameters in the unfolded net → sum their gradients
[Goodfellow et al., 2016]
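In symbols (a restatement of point 2, with J the training loss over a length-T sequence): because the same W appears at every unfolded step, its gradient is the sum of the per-step contributions,

\[
\frac{\partial J}{\partial W} \;=\; \sum_{t=1}^{T} \left.\frac{\partial J}{\partial W}\right|_{\text{copy of } W \text{ used at step } t}.
\]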
Example: RNN to predict next character in string
• Train on entire works of Shakespeare
• 5,448,482 characters, 84 unique
• Python code online with today’s slides
Example: RNN to predict next character in string
• xt : input character, encode 1-hot, 84 dimensions
• ht : hidden layer, 100 dimensions
• ot : predicted next character, softmax, 84 dimensions
84 unique characters in this dataset
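The course's Python code is posted with the slides; as a stand-in here, this is only a sketch of one forward step with the dimensions above (84-dimensional 1-hot input, 100 hidden units, softmax over 84 characters):

```python
# Not the released course code -- just a sketch matching the sizes above:
# x_t is 1-hot (84), h_t has 100 units, o_t is a softmax over 84 characters.
import numpy as np

V, H = 84, 100
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (H, V))      # input-to-hidden weights
Whh = rng.normal(0, 0.01, (H, H))      # hidden-to-hidden weights (shared across time)
Why = rng.normal(0, 0.01, (V, H))      # hidden-to-output weights
bh, by = np.zeros(H), np.zeros(V)

def step(x_onehot, h_prev):
    h = np.tanh(Wxh @ x_onehot + Whh @ h_prev + bh)
    logits = Why @ h + by
    p = np.exp(logits - logits.max())
    return h, p / p.sum()              # p = predicted distribution over next char
```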
Example: RNN to predict next character in string
[Figure: training loss curve, plus strings generated by the model after 0, 2,000, and 200,000 training iterations.]
Example: Language Models to Predict next word
Slide Credit: Piotr Mirowski
Chain Rule
Slide Credit: Piotr Mirowski
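The chain rule here presumably refers to factoring a word sequence's probability into next-word predictions (assumed from the language-modeling context):

\[
P(w_1, \ldots, w_T) \;=\; \prod_{t=1}^{T} P\!\left(w_t \mid w_1, \ldots, w_{t-1}\right),
\]

which is what the RNN's softmax output estimates at each step.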
• Forward Pass
Slide Credit: Piotr Mirowski
• Backward Pass
* problem: vanishing and/or exploding gradients
Slide Credit: Piotr Mirowski
• Learned hidden representations of context are useful for:
  • part-of-speech labeling
  • sentiment analysis
  • information extraction
• Predict a label for each word, instead of predicting the next word
Slide Credit: Piotr Mirowski
Example: Opinion Mining
Label opinion segments by labeling each word: o = outside, b = beginning of segment, i = inside segment
  Trump [has come a long way] from ...
  Label: o  b  i  i  i  i  o
(h summarizes the earlier words)
[Irsoy & Cardie, 2014]
Deep Bidirectional Recurrent Network
[Irsoy & Cardie, 2014]
Two additional ideas:
• Multiple layers to compute y from x
• A left-to-right RNN, plus right-to-left RNN
Example:
• Y label values {begin, inside, outside} for each word, to label contiguous text segments indicating opinions. [Irsoy & Cardie, 2014]
Deep Bidirectional Recurrent Network: Opinion Mining
  Mr. Stoiber [has come a long way] from his refusal to …
  Correct Y: o  o  b  i  i  i  i  o  o  o  o   (o = outside, b = begin, i = inside)
[Irsoy & Cardie, 2014]
Long Short Term Memory (LSTMs)
[Figure: an LSTM network unrolled over inputs x1, x2, x3, producing hidden states h1, h2, h3; the gates inside each cell are applied by element-wise multiplication.]
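A sketch of a standard LSTM cell step (assumed here, since the slide diagrams themselves are not shown): sigmoid gates are combined with candidate updates by element-wise multiplication to control the memory cell.

```python
# A standard LSTM cell step: gates combined by element-wise multiplies.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """W stacks the 4 gate weight matrices: shape (4H, H + X); b has shape (4H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2*H])          # forget gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate cell update
    c = f * c_prev + i * g         # element-wise multiplies gate the memory cell
    h = o * np.tanh(c)             # element-wise multiply gates the output
    return h, c

# tiny usage with hypothetical sizes
rng = np.random.default_rng(0)
H, X = 3, 2
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H),
                 rng.normal(size=(4 * H, H + X)), np.zeros(4 * H))
```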
Bi-directional Recurrent Neural Networks
• Key idea: processing of word at position t can depend on following words too, not just preceding words
[Goodfellow et al., 2016]
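A small sketch of that idea: run one RNN left-to-right and a second one right-to-left, then concatenate their hidden states at each position (the step functions here are hypothetical stand-ins for any recurrent cell).

```python
# Bidirectional wrapper: forward and backward passes, states concatenated per position.
import numpy as np

def birnn(xs, fwd_step, bwd_step, h0_f, h0_b):
    hf, hb, fwd, bwd = h0_f, h0_b, [], []
    for x in xs:                         # left-to-right pass
        hf = fwd_step(x, hf)
        fwd.append(hf)
    for x in reversed(xs):               # right-to-left pass
        hb = bwd_step(x, hb)
        bwd.append(hb)
    bwd.reverse()                        # re-align with positions 1..T
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# tiny usage with a toy step function
step = lambda x, h: np.tanh(0.5 * h + x)
states = birnn([np.ones(2) * t for t in range(4)], step, step, np.zeros(2), np.zeros(2))
```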
Deep Bidirectional LSTM Network
[“Hybrid Speech Recognition with Deep Bidirectional LSTM,” Graves et al., 2013]
Gated Recurrent Units (GRUs)
[Figure: GRU cell; like the LSTM, its gates are applied by element-wise multiplication.]
• Fewer parameters than the LSTM
• Found equally effective in some experiments involving speech recognition and music analysis; see [Chung et al., 2014]
Optional material – won’t be on exam
Sequence to Sequence Learning
[Figure: an encoder maps the input sequence to a learned representation; a decoder maps that representation to the output sequence.]
• RNN Encoder-Decoders for Machine Translation (Sutskever et al., 2014; Cho et al., 2014; Kalchbrenner et al., 2013; Srivastava et al., 2015)
Seq2Seq
[Figure: the input sequence A B C D is encoded into a fixed vector v; the decoder then emits the target sequence X Y Z Q, with A B C D __ X Y Z fed as inputs (a delimiter followed by the previously generated target symbols).]
Sequence to Sequence Models
• Natural language processing is concerned with tasks involving language data
Andrej Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks
Programming Frameworks for Deep Nets
• Pytorch (Facebook)
• TensorFlow (Google)
• TFLearn (runs on top of TensorFlow, but simpler to use)
• Theano (University of Montreal)
• CNTK (Microsoft)
• Keras (can run on top of Theano, CNTK, TensorFlow)
Many support the use of Graphics Processing Units (GPUs), a major factor in the dissemination of deep network technology.
TensorFlow example:
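The code from this slide is not reproduced here; as an illustrative stand-in, this is a minimal tf.keras sketch of a small LeNet-style digit classifier (all layer sizes and the commented training call are assumptions, not the slide's example):

```python
# Minimal, illustrative tf.keras convolutional classifier for 28x28 digit images.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(16, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # probabilities over 10 digits
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)   # x_train: (N, 28, 28, 1) images
```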
Modern Deep Networks: 2021 vs 1987
• vastly more online data
• GPUs, TPUs
• Heterogeneous units
– ReLU, sigmoid, tanh, linear
• including memory units – LSTM, GRU, …
• wild new architectures
– 100 layers deep, bidirectional LSTMs, Convolutional nets widespread …
• new ideas for gradient descent
– dropout, batch normalization, weight initialization, …
• unification with probabilistic models – train to output probabilities
• frameworks like TensorFlow
What you should know:
• Representation learning
– Hidden layers re-represent inputs in form to predict outputs
– Autoencoders
– Sometimes reused widely (e.g., word2vec word embeddings)
• Convolutional neural networks
– Convolution provides translation invariance
– Network stages with reducing spatial resolution, multiple channels, …
• Recurrent neural networks
– Learn to represent history in time series
– Backpropagation as unfolding in time
– LSTM memory units
• Neural architectures
– Shared parameters across multiple computations
– Layers with different structures/functions
– Probabilistic classification → output Softmax layer