Machine Learning 10-601/301
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
March 8, 2021
This section:
• Artificial neural networks • Backpropagation
• Representation learning
Reading:
• Goodfellow: Chapter 6
• optional: Mitchell: Chapter 4
We like logistic regression, but
• how would it perform when trying to learn
P(image contains Hillary Clinton | pixel values X1, X2 … X10,000)
We like logistic regression, but
what Xi image features to use? edges? color blotches? generic face? subwindows? lighting invariant properties? position independent? SIFT features?
We like logistic regression, but
what Xi image features to use? edges? color blotches? generic face? subwindows? lighting invariant properties? position independent?
Deep nets : learn the features automatically !
Computer Vision
Imagenet Visual Recognition Challenge
human
Vision
Accuracy human
Machine Translation
Speech to Text
human
Strategic reasoning
Robots
Artificial Neural Networks learn f: XàY
• f might be complex, non-linear, real or discrete-valued
• X (vector of) continuous and/or discrete vars
• Y (vector of) continuous and/or discrete vars
– X = image, Y = new image if you drive forward 1 meter
– X = microphone signal, Y = which word was spoken
• we can also train the network to learn P(Y|X)
– Logistic regression is just a special case of a neural network
• Represent f by network of computational units which may contain billions of trained parameters
ALVINN [Pomerleau 1993]
Caption Generation
[from Russ Salakhutdinov]
Caption Generation
[from Russ Salakhutdinov]
Sigmoid units are exactly the form we use for logistic regression!
Many kinds of units
• Rectified linear unit : linear with thresholded output
• Sigmoid unit:
out = max(net,0)
= max(. , 0)
out =
Units often constructed so that output is of the form f(g(x)).
• Rectified linear unit : linear with thresholded output
• g(x)=net=
• f(g(x)) = max( , 0)
out = max(net,0)
= max(. , 0)
Units often constructed so that output is of the form f(g(x)).
• Sigmoid unit : linear with thresholded output
• g(x)=net= • f(g(x)) =
Units often constructed so that output is of the form f(g(x)).
• Sigmoid unit : linear with thresholded output
• g(x)=net=
• f(g(x)) =
• Use chain rule to compute gradients!
Many types of parameterized units
• Sigmoid units
• ReLU
• Leaky ReLU (fixed non-zero slope for input<0)
• Parametric ReLU (trainable slope)
• Max Pool
• Inner Product
• GRU’s
• LSTM’s
• Matrix multiply
• .... no end in sight
Any unit h(X; W) whose output is differentiable w.r.t. X and W
Training Deep Nets
1. ChooselossfunctionJ(θ)tooptimize
– sum of squared errors for y continuous: Σ (y – h(x; θ))2
– maximize conditional log likelihood: Σ log P(y|x; θ)
– MAP estimate: Σ log P(y|x; θ) P(θ)
– 0/1 loss. Sum of classification errors: Σ δ(y = h(x; θ)
–
2. Designnetworkarchitecture
– Network of layers (ReLU’s, sigmoid, convolutions, ...)
– Widths of layers
– Fully or partly interconnected – ...
3. Trainingalgorithm
– Derive gradient components
– Choose gradient descent method, including stopping condition
– (optional: Experiment with alternative network architectures)
...
Example
Example: Learn probabilistic XOR
• Given boolean Y, X1, X2 learn P(Y|X1,X2), where
X1
X2
P(Y=1|X1,X2)
P(Y=0|X1,X2)
0
0
0.1
0.9
1
0
0.9
0.1
0
1
0.9
0.1
1
1
0.1
0.9
• Can we learn this with logistic regression?
Superscript denotes layer
Superscript denotes layer
Same as logistic regression
Sigmoid unit
2 ReLU units
Feed Forward
Derive the gradient using chain rule
simplify notation by considering just one training example
Derive the gradient we need
simplify notation by considering just one training example
Ytrue=1
already know this
Ytrue=1
already know this
=
Ytrue=1
already know this
= -0.47
Ytrue=1
Back propagation
Next training example.. next stochastic gradient step...
Given boolean Y, X1, X2 learn P(Y|X1,X2), where
Training:
- stochastic gradient descent
- 20,000 iterations of gradient descent
- minibatch size 4
- no momentum, regularization, ...
Input: [0 1] [1 1] [1 0] [0 0]
20,000 training iterations
Learned representation for X2
How does the hidden layer representation X2 evolve over time during gradient descent training?
Learned representation for X2
Input: [0 1] 1.8
[1 1] [1 0] [0 0]
Small to large dots indicate
early to late learned representation of X2
0
1.8
Final decision surface in terms of X2
Input X1: [0 1] [1 1] [1 0] [0 0] 1.8
0
1.8
for sigmoid netwk, minimizing Σd (td-od)2
xd = input
oi = observed ith
unit output ti = target output
wij =wtfromitoj δk= error term
backpropagated to unit k; that is:
dJ(θ) / dgk(Xk-1)
Incremental (Stochastic) Gradient Descent
Many modifications to gradient descent
• Stochastic vs. Batch gradient descent (and mini-batches)
• Momentum
• Weight decay (MAP estimate with zero-mean prior)
• Gradient clipping
• Batch normalization
• Dropout
• Adagrad
• Adam
• .... no end in sight
See ML Department course on Optimization Methods
Gradient Descent and Backpropagation
• Updates every network parameter simultaneously, each iteration
• Because we repeatedly use the chain rule to calculate gradients, easily generalized to arbitrary directed acyclic graphs
• Finds local minimum in J(θ), not necessarily global min
• Minimizes J(θ) over training examples, not necessarily future...
• Training can require hours, days, weeks, GPUs
• Applying network after training is relatively very fast
Overfitting in Neural Nets
Dealing with Overfitting
Deep net training involves a hyperparameter n=number of gradient descent iterations
How do we choose n to optimize future accuracy?
• Separate available data into training and validation set
• Use training to perform gradient descent
• test validation error frequently, save network at each step
• nßnumber of iterations that optimizes validation set error
àgives unbiased estimate of optimal n
(but still an optimistically biased estimate of true error)
Initializing neural net weights
Usually initialize weights to random values close to zero Why? Consider sigmoid unit...
net=g(x)
out = f(g(x))
Small w’s à small net=g(x) à output f(g(x)) nearly linear function of inputs x As we take more gradient steps, |w| growsàincreasingly non-linear
f
Begin with small weightsàBegin with ~linear function
Second derivative of sig(t) = 0 here
sig(t) approximately linear here
sig(t) very non-linear here
Gradient descent stepsà
Gradient descent stepsà
Gradient descent stepsà
What you should know: Artificial Neural Networks
• Highly non-linear regression/classification
• Vector-valued inputs and outputs
• Potentially billions of parameters to estimate
• Hidden layers learn intermediate representations
• Directed acyclic graph, trained by gradient descent
• Chain rule over this DAG allows computing all derivatives
• Can use any differentiable loss function
– we used neg. log likelihood in order to learn outputs P(Y|X)
• Gradient descent, local minima problems
• Overfitting and how to deal with it