Machine Learning 10-601/301
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
March 10, 2021
This section:
• Representation learning
• Convolutional neural nets
• Recurrent neural nets
Reading:
• Goodfellow: Chapter 6
• optional: Mitchell: Chapter 4
Review (covered previously):
• Sigmoid unit
• ReLU units
• Feed-forward computation
• Backpropagation
• Sigmoid and tanh activation functions
What you should know: Artificial Neural Networks
• Highly non-linear regression/classification
• Vector/tensor-valued inputs and outputs
• Potentially billions of parameters to estimate
• Hidden layers learn intermediate representations
• Directed acyclic graph, trained by gradient descent
• Chain rule over this DAG allows computing all derivatives (aka backpropagation; see the sketch after this list)
• Can use any differentiable loss function
– we used neg. log likelihood in order to learn outputs P(Y|X)
• Gradient descent, local minima problems
• Overfitting and how to deal with it
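As a concrete companion to these bullets, here is a minimal numpy sketch (not from the lecture) of a one-hidden-layer sigmoid network trained by gradient descent with backpropagation, using the negative log likelihood loss mentioned above; the toy data and hyperparameters are invented for illustration.

```python
# Minimal sketch (not from the lecture): one-hidden-layer sigmoid network
# trained by gradient descent with backprop, using negative log likelihood.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # toy inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float)      # toy labels (XOR-like task)

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(2000):
    # forward pass through the DAG
    h = sigmoid(X @ W1 + b1)                   # hidden layer
    p = sigmoid(h @ W2 + b2).ravel()           # output P(Y=1|X)
    pc = np.clip(p, 1e-12, 1 - 1e-12)
    loss = -np.mean(y * np.log(pc) + (1 - y) * np.log(1 - pc))

    # backward pass: chain rule, layer by layer
    dz2 = (p - y).reshape(-1, 1) / len(y)      # dLoss/d(output pre-activation)
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh  = dz2 @ W2.T
    dz1 = dh * h * (1 - h)                     # sigmoid derivative
    dW1 = X.T @ dz1; db1 = dz1.sum(0)

    # gradient descent step
    lr = 1.0
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", loss)
```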
Learning hidden representations
Network with sigmoid units only:
[Figure: hidden-unit activations and weights evolving over gradient descent steps; the network learns its own intermediate encoding of the inputs (output labels: left, straight, right, up)]
Word embeddings
Learning Distributed Representations for Words
• also called “word embeddings”
• word2vec is one commonly used embedding
• based on “skip gram” model
Key idea: given a word sequence w_1 w_2 … w_T, train the network to predict the surrounding words: for each word w_t, predict w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}.
e.g., “the dog jumped over the fence in order to get to..” “the cat jumped off the window ledge in order to …”
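A minimal sketch (not from the lecture) of how skip-gram training pairs are generated from a sentence, with a window of two words on each side:

```python
# Minimal sketch (not from the lecture): generate (center, context) training
# pairs for the skip-gram model with a window of 2 words on each side.
def skipgram_pairs(words, window=2):
    pairs = []
    for t, center in enumerate(words):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(words):
                pairs.append((center, words[t + j]))
    return pairs

sentence = "the dog jumped over the fence".split()
print(skipgram_pairs(sentence)[:4])
# [('the', 'dog'), ('the', 'jumped'), ('dog', 'the'), ('dog', 'jumped')]
```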
Word2Vec Word Embeddings
basic skip-gram model: train to maximize
\[ \frac{1}{T} \sum_{t=1}^{T} \;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t) \]
where c is the context window size and
\[ p(w_O \mid w_I) = \frac{\exp\!\big({v'_{w_O}}^{\top} v_{w_I}\big)}{\sum_{w=1}^{W} \exp\!\big({v'_{w}}^{\top} v_{w_I}\big)} \]
with input and output vector representations v_w and v'_w, and vocabulary size W
Modifications to training… + hierarchical softmax
+ negative sampling
+ subsample frequent w’s
Learned dense vector v_w for each word w (~300-dimensional)
[Mikolov et al., 2013]
"One-hot" word encoding: all zeros except one position (~50k-dimensional)
100 Dimensional Skip-gram embeddings, projected to two dimensions by PCA
Skip-gram Word Embeddings
Analogy: w_1 is to w_2 as w_3 is to w_?
Algorithm: w_? = w_2 - w_1 + w_3
[Mikolov et al., 2013]
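A minimal sketch (not from the lecture) of the analogy computation: form w_2 - w_1 + w_3 and return the nearest word by cosine similarity. The embedding dictionary here is hypothetical, standing in for pretrained word2vec vectors.

```python
# Minimal sketch (not from the lecture): solve "w1 is to w2 as w3 is to ?"
# by nearest-neighbor search around w2 - w1 + w3.
import numpy as np

def analogy(embeddings, w1, w2, w3):
    target = embeddings[w2] - embeddings[w1] + embeddings[w3]
    best, best_sim = None, -np.inf
    for w, v in embeddings.items():
        if w in (w1, w2, w3):
            continue                       # exclude the query words themselves
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Usage (with hypothetical pretrained vectors):
# vecs = {"man": ..., "woman": ..., "king": ..., "queen": ...}
# analogy(vecs, "man", "woman", "king")   # ideally returns "queen"
```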
Convolutional Neural Nets
Computer Vision
ImageNet Visual Recognition Challenge
[Figure: classification error of winning entries by year, compared with human-level performance]
A Convolutional Neural Net for Handwritten Digit recognition: LeNet5
[Figure: convolution and max-pool stages (two of each), followed by a fully connected layer of sigmoid or ReLU units and a softmax output]
Convolution Layer
[Figure: input I convolved with kernel K (the learned parameters) produces result S; from Goodfellow et al.]
Convolution layer
p = padding, s = stride
Convolution example
[Figure: 3×3 input activations (values 0–8) * 2×2 trained weights (values 1, 2, 1, 0) = 2×2 output activations]
Convolution as parameter sharing
[Figure: the same 2×2 trained weights are applied at every position of the 3×3 input, so all output activations share one set of parameters]
Convolution with padding
[Figure: the 3×3 input is zero-padded to 5×5 and convolved with the same 2×2 trained weights, producing a larger output; a numpy sketch of convolution with padding and stride follows]
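Pulling the convolution, parameter-sharing, and padding examples together, here is a minimal numpy sketch (not from the lecture); the kernel layout [[1, 2], [1, 0]] is an assumption about how the four weight values in the figures are arranged.

```python
# Minimal sketch (not from the lecture): 2-D convolution (cross-correlation)
# of a single-channel input with one kernel, with zero padding p and stride s.
import numpy as np

def conv2d(inp, kernel, p=0, s=1):
    k = kernel.shape[0]
    if p > 0:
        inp = np.pad(inp, p)                     # zero padding on all sides
    n = inp.shape[0]
    out_size = (n - k) // s + 1                  # = floor((n_orig + 2p - k)/s) + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = inp[i*s:i*s+k, j*s:j*s+k]
            out[i, j] = np.sum(patch * kernel)   # same weights at every position
    return out

inp = np.arange(9).reshape(3, 3)                 # the 3x3 input from the example
kernel = np.array([[1, 2], [1, 0]])              # 2x2 trained weights (layout assumed)
print(conv2d(inp, kernel))                       # valid convolution: 2x2 output
print(conv2d(inp, kernel, p=1))                  # with padding p=1:   4x4 output
```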
Multichannel Convolution
[from Zhang et al., "Dive into Deep Learning"]
Maxpool Layer
out = max(a,b,e,f)
[from Goodfellow et al.]
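A matching max-pool sketch (not from the lecture), pooling non-overlapping 2×2 windows as in the figure above:

```python
# Minimal sketch (not from the lecture): 2x2 max pooling with stride 2.
import numpy as np

def maxpool2x2(inp):
    n = inp.shape[0] // 2 * 2                    # drop an odd last row/column
    x = inp[:n, :n].reshape(n // 2, 2, n // 2, 2)
    return x.max(axis=(1, 3))                    # out = max over each 2x2 block

a = np.arange(16).reshape(4, 4)
print(maxpool2x2(a))                             # [[ 5  7] [13 15]]
```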
A Convolutional Neural Net for Handwritten Digit recognition: LeNet5
[Figure: convolution and max-pool stages (two of each), followed by a fully connected layer of sigmoid or ReLU units and a softmax output]
Softmax Layer: Predict Probability Distribution over discrete-valued variables
• Logistic Regression: when Y has two possible values
• Softmax: when Y has R values {y1 … yR}, then learn R sets of weights to predict R output probabilities
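Written out (standard softmax form, consistent with the logistic regression special case R = 2), with one weight vector w_k per class:

```latex
P(Y = y_k \mid x) \;=\; \frac{\exp(w_k \cdot x)}{\sum_{j=1}^{R} \exp(w_j \cdot x)},
\qquad k = 1, \dots, R
```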
A Convolutional Neural Net for Handwritten Digit Recognition: LeNet
[Figure: convolution and max-pool stages followed by a fully connected layer of sigmoid or ReLU units and a softmax output]
• Shrinking size of feature maps
• Multiple channels
• LeNet-5 demos: http://yann.lecun.com/exdb/lenet/index.html
– Vary scale
– Vary stroke width
– Squeeze
– Noisy-2, Noisy-4
[from Goodfellow et al.]
Recurrent Neural Nets for Sequential Data
Sequences
● Words, Letters
● Speech
● Images, Videos
● Programs
● Sequential Decision Making (RL)
Recurrent Networks
• Key idea: a recurrent network uses (part of) its state at time t as input at time t+1
[Figure: the hidden state from the previous time step and the current input are combined through a nonlinearity to produce the new hidden state]
• Another example of parameter sharing, like CNNs
[Goodfellow et al., 2016]
Training Recurrent Networks
Key principle for training:
1. Treat as if unfolded in time, resulting in directed acyclic graph
2. Note shared parameters in the unfolded net → sum their gradients (see the sketch below)
* problem: vanishing and/or exploding gradients
[Goodfellow et al., 2016]
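A minimal numpy sketch (not from the lecture) of this unfold-and-sum principle: a vanilla RNN is run forward over a toy sequence, then gradients are backpropagated through time, with the shared weight matrices accumulating the sum of the per-step gradients; the final comment marks where vanishing/exploding gradients arise.

```python
# Minimal sketch (not from the lecture): a vanilla RNN unfolded in time,
# with backpropagation through time (BPTT). The shared weights receive the
# SUM of the gradients from every time step, as stated above.
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4                      # sequence length, input dim, hidden dim
xs = rng.normal(size=(T, d_in))             # toy input sequence
W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h  = np.zeros(d_h)

# forward pass: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h)
hs = [np.zeros(d_h)]
for t in range(T):
    hs.append(np.tanh(xs[t] @ W_xh + hs[-1] @ W_hh + b_h))

# toy loss: sum of squares of the final hidden state
loss = 0.5 * np.sum(hs[-1] ** 2)

# backward pass (BPTT): gradients of shared parameters are summed over t
dW_xh = np.zeros_like(W_xh); dW_hh = np.zeros_like(W_hh); db_h = np.zeros_like(b_h)
dh = hs[-1].copy()                          # dLoss/dh_T
for t in reversed(range(T)):
    dz = dh * (1 - hs[t + 1] ** 2)          # through the tanh nonlinearity
    dW_xh += np.outer(xs[t], dz)            # += : summing shared-parameter grads
    dW_hh += np.outer(hs[t], dz)
    db_h  += dz
    dh = dz @ W_hh.T                        # pass gradient back to h_{t-1}
    # (repeated multiplication by W_hh is where gradients vanish or explode)

print(loss, np.linalg.norm(dW_hh))
```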
Language model: Two Key Ingredients
• Neural Embeddings: Hinton, G., Salakhutdinov, R. "Reducing the Dimensionality of Data with Neural Networks." Science (2006)
• Recurrent Language Models: Mikolov, T., et al. "Recurrent neural network based language model." Interspeech (2010)
Language Models
Slide Credit: Piotr Mirowski
What do we Optimize? The Chain Rule (written out below)
• Forward Pass
• Backward Pass
Slide Credit: Piotr Mirowski
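The chain-rule factorization itself did not survive extraction; the standard form for a recurrent language model is:

```latex
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}),
\qquad \text{and training maximizes} \quad
\sum_{t=1}^{T} \log P(w_t \mid w_1, \dots, w_{t-1}).
```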
LSTMs
[Figure: LSTM network unrolled over time; inputs x1, x2, x3 produce hidden states h1, h2, h3]
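The cell internals did not survive extraction; for reference, a standard formulation of the LSTM update (variants differ in details such as peephole connections) is:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```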
Sequence to Sequence Learning
[Figure: an encoder maps the input sequence to a learned representation; a decoder generates the output sequence from that representation]
• RNN Encoder-Decoders for Machine Translation (Sutskever et al. 2014; Cho et al. 2014; Kalchbrenner et al. 2013; Srivastava et al. 2015)
Seq2Seq
[Figure: the network reads the input sequence A B C D, then a delimiter and the previous target tokens X Y Z, and is trained to predict the target sequence X Y Z Q; v denotes the learned representation]
Sequence to Sequence Models
• Natural language processing is concerned with tasks involving sequences of language data, e.g., mapping an input sentence to an output sentence
Andrej Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks
What you should know:
• Representation learning
– Hidden layers re-represent inputs in a form that supports the output predictions
– Autoencoders
– Task-specific encoding (e.g., depends on Y in f: X → Y)
– Sometimes reused widely (e.g., word2vec)
• Convolutional neural networks
– Convolution provides translation invariance
– Network stages with decreasing spatial resolution and multiple channels
• Recurrent neural networks
– Learn to represent history in time series
– Backpropagation by unfolding the network in time
• Neural architectures
– Shared parameters across multiple computations
– Layers with different structures/functions
– Probabilistic classification → output softmax layer