程序代写代做代考 algorithm chain Machine Learning 10-601/301

Machine Learning 10-601/301
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
March 8, 2021
This section:
• Artificial neural networks • Backpropagation
• Representation learning
Reading:
• Goodfellow: Chapter 6
• optional: Mitchell: Chapter 4

We like logistic regression, but
• how would it perform when trying to learn
P(image contains Hillary Clinton | pixel values X1, X2 … X10,000)

We like logistic regression, but
what Xi image features to use? edges? color blotches? generic face? subwindows? lighting invariant properties? position independent? SIFT features?

We like logistic regression, but
what Xi image features to use? edges? color blotches? generic face? subwindows? lighting invariant properties? position independent?
Deep nets : learn the features automatically !

Computer Vision
Imagenet Visual Recognition Challenge
human

Vision
Accuracy human
Machine Translation
Speech to Text
human
Strategic reasoning
Robots

Artificial Neural Networks learn f: XàY
• f might be complex, non-linear, real or discrete-valued
• X (vector of) continuous and/or discrete vars
• Y (vector of) continuous and/or discrete vars
– X = image, Y = new image if you drive forward 1 meter
– X = microphone signal, Y = which word was spoken
• we can also train the network to learn P(Y|X)
– Logistic regression is just a special case of a neural network
• Represent f by network of computational units which may contain billions of trained parameters

ALVINN [Pomerleau 1993]

Caption Generation
[from Russ Salakhutdinov]

Caption Generation
[from Russ Salakhutdinov]

Sigmoid units are exactly the form we use for logistic regression!

Many kinds of units
• Rectified linear unit : linear with thresholded output
• Sigmoid unit:
out = max(net,0)
= max(. , 0)
out =

Units often constructed so that output is of the form f(g(x)).
• Rectified linear unit : linear with thresholded output
• g(x)=net=
• f(g(x)) = max( , 0)
out = max(net,0)
= max(. , 0)

Units often constructed so that output is of the form f(g(x)).
• Sigmoid unit : linear with thresholded output
• g(x)=net= • f(g(x)) =

Units often constructed so that output is of the form f(g(x)).
• Sigmoid unit : linear with thresholded output
• g(x)=net=
• f(g(x)) =
• Use chain rule to compute gradients!

Many types of parameterized units
• Sigmoid units
• ReLU
• Leaky ReLU (fixed non-zero slope for input<0) • Parametric ReLU (trainable slope) • Max Pool • Inner Product • GRU’s • LSTM’s • Matrix multiply • .... no end in sight Any unit h(X; W) whose output is differentiable w.r.t. X and W Training Deep Nets 1. ChooselossfunctionJ(θ)tooptimize – sum of squared errors for y continuous: Σ (y – h(x; θ))2 – maximize conditional log likelihood: Σ log P(y|x; θ) – MAP estimate: Σ log P(y|x; θ) P(θ) – 0/1 loss. Sum of classification errors: Σ δ(y = h(x; θ) – 2. Designnetworkarchitecture – Network of layers (ReLU’s, sigmoid, convolutions, ...) – Widths of layers – Fully or partly interconnected – ... 3. Trainingalgorithm – Derive gradient components – Choose gradient descent method, including stopping condition – (optional: Experiment with alternative network architectures) ... Example Example: Learn probabilistic XOR • Given boolean Y, X1, X2 learn P(Y|X1,X2), where X1 X2 P(Y=1|X1,X2) P(Y=0|X1,X2) 0 0 0.1 0.9 1 0 0.9 0.1 0 1 0.9 0.1 1 1 0.1 0.9 • Can we learn this with logistic regression? Superscript denotes layer Superscript denotes layer Same as logistic regression Sigmoid unit 2 ReLU units Feed Forward Derive the gradient using chain rule simplify notation by considering just one training example Derive the gradient we need simplify notation by considering just one training example Ytrue=1 already know this Ytrue=1 already know this = Ytrue=1 already know this = -0.47 Ytrue=1 Back propagation Next training example.. next stochastic gradient step... Given boolean Y, X1, X2 learn P(Y|X1,X2), where Training: - stochastic gradient descent - 20,000 iterations of gradient descent - minibatch size 4 - no momentum, regularization, ... Input: [0 1] [1 1] [1 0] [0 0] 20,000 training iterations Learned representation for X2 How does the hidden layer representation X2 evolve over time during gradient descent training? Learned representation for X2 Input: [0 1] 1.8 [1 1] [1 0] [0 0] Small to large dots indicate early to late learned representation of X2 0 1.8 Final decision surface in terms of X2 Input X1: [0 1] [1 1] [1 0] [0 0] 1.8 0 1.8 for sigmoid netwk, minimizing Σd (td-od)2 xd = input oi = observed ith unit output ti = target output wij =wtfromitoj δk= error term backpropagated to unit k; that is: dJ(θ) / dgk(Xk-1) Incremental (Stochastic) Gradient Descent Many modifications to gradient descent • Stochastic vs. Batch gradient descent (and mini-batches) • Momentum • Weight decay (MAP estimate with zero-mean prior) • Gradient clipping • Batch normalization • Dropout • Adagrad • Adam • .... no end in sight See ML Department course on Optimization Methods Gradient Descent and Backpropagation • Updates every network parameter simultaneously, each iteration • Because we repeatedly use the chain rule to calculate gradients, easily generalized to arbitrary directed acyclic graphs • Finds local minimum in J(θ), not necessarily global min • Minimizes J(θ) over training examples, not necessarily future... • Training can require hours, days, weeks, GPUs • Applying network after training is relatively very fast Overfitting in Neural Nets Dealing with Overfitting Deep net training involves a hyperparameter n=number of gradient descent iterations How do we choose n to optimize future accuracy? • Separate available data into training and validation set • Use training to perform gradient descent • test validation error frequently, save network at each step • nßnumber of iterations that optimizes validation set error àgives unbiased estimate of optimal n (but still an optimistically biased estimate of true error) Initializing neural net weights Usually initialize weights to random values close to zero Why? Consider sigmoid unit... net=g(x) out = f(g(x)) Small w’s à small net=g(x) à output f(g(x)) nearly linear function of inputs x As we take more gradient steps, |w| growsàincreasingly non-linear f Begin with small weightsàBegin with ~linear function Second derivative of sig(t) = 0 here sig(t) approximately linear here sig(t) very non-linear here Gradient descent stepsà Gradient descent stepsà Gradient descent stepsà What you should know: Artificial Neural Networks • Highly non-linear regression/classification • Vector-valued inputs and outputs • Potentially billions of parameters to estimate • Hidden layers learn intermediate representations • Directed acyclic graph, trained by gradient descent • Chain rule over this DAG allows computing all derivatives • Can use any differentiable loss function – we used neg. log likelihood in order to learn outputs P(Y|X) • Gradient descent, local minima problems • Overfitting and how to deal with it