
CSC 311: Introduction to Machine Learning
Lecture 4 – Multiclass Classification & Neural Networks
Based on slides by Amir-massoud Farahmand & Emad A.M. Andrews

Classification: predicting a discrete-valued target
▶ Binary classification: predicting a binary-valued target
▶ Multiclass classification: predicting a discrete (> 2)-valued target
Examples of multiclass classification:
▶ predict the value of a handwritten digit
▶ classify e-mails as spam, travel, work, personal

Multiclass Classification
Classification tasks with more than two categories:
(Figure: example images for multiclass classification, including handwritten digits ("It is very hard to say what makes a 2") and some examples from an earlier version of the net.)

Multiclass Classification
Targets form a discrete set {1,…,K}.
It’s often more convenient to represent them as one-hot vectors, or
a one-of-K encoding:
t = (0, …, 0, 1, 0, …, 0) ∈ R^K    (entry k is 1)
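As a concrete illustration, a minimal NumPy sketch of the one-of-K encoding (the function name and the 0-based class indexing are assumptions for this example, not course code):

    import numpy as np

    def to_one_hot(targets, K):
        """Convert integer class labels of shape (N,) to one-hot vectors of shape (N, K)."""
        T = np.zeros((len(targets), K))
        T[np.arange(len(targets)), targets] = 1.0
        return T

    # e.g. to_one_hot(np.array([2, 0]), K=4) -> [[0, 0, 1, 0], [1, 0, 0, 0]]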

Multiclass Classification
Now there are D input dimensions and K output dimensions, so we need K × D weights, which we arrange as a weight matrix W.
Also, we have a K-dimensional vector b of biases. Linear predictions:
z_k = Σ_j w_kj x_j + b_k    for k = 1, 2, …, K
Vectorized:
z = Wx + b

Multiclass Classification
Predictions are like probabilities: we want 1 ≥ y_k ≥ 0 and Σ_k y_k = 1.
A natural activation function to use is the softmax function, a multivariable generalization of the logistic function:
y_k = softmax(z_1, …, z_K)_k = e^{z_k} / Σ_{k′} e^{z_{k′}}
The inputs z_k are called the logits. Properties:
▶ Outputs are positive and sum to 1 (so they can be interpreted as probabilities)
▶ If one of the z_k is much larger than the others, softmax(z)_k ≈ 1 (behaves like argmax).
▶ Exercise: how does the case of K = 2 relate to the logistic function?
Note: sometimes σ(z) is used to denote the softmax function; in this class, it will denote the logistic function applied elementwise.
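A minimal NumPy sketch of the softmax, assuming the logits sit along the last axis; subtracting max(z) before exponentiating is a standard stability trick and does not change the output:

    import numpy as np

    def softmax(z):
        """Softmax over the last axis of the logits z."""
        z = z - np.max(z, axis=-1, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / np.sum(e, axis=-1, keepdims=True)

    # softmax(np.array([1.0, 2.0, 3.0])) -> approx [0.09, 0.24, 0.67], sums to 1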

Multiclass Classification
If a model outputs a vector of class probabilities, we can use cross-entropy as the loss function:
L_CE(y, t) = −Σ_k t_k log y_k
           = −t^⊤ (log y),
where the log is applied elementwise.
Just like with logistic regression, we typically combine the softmax
and cross-entropy into a softmax-cross-entropy function.
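A sketch of that combined softmax-cross-entropy, computed directly from the logits so that log y is never formed explicitly (the function name is illustrative, not a library API):

    import numpy as np

    def softmax_cross_entropy(z, t):
        """z: logits of shape (K,), t: one-hot target of shape (K,). Returns -t^T log softmax(z)."""
        z = z - np.max(z)                         # stability shift
        log_y = z - np.log(np.sum(np.exp(z)))     # log softmax
        return -np.dot(t, log_y)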

Multiclass Classification
Softmax regression:
z = Wx + b
y = softmax(z)
L_CE = −t^⊤ (log y)
Gradient descent updates can be derived for each row of W:
∂L_CE/∂w_k = ∂L_CE/∂z_k · ∂z_k/∂w_k = (y_k − t_k) · x
w_k ← w_k − α · (1/N) Σ_{i=1}^N (y_k^{(i)} − t_k^{(i)}) x^{(i)}
Similar to linear/logistic regression (no coincidence). (Verify the update!)
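A minimal sketch of the resulting batch update, assuming X is an N × D input matrix and T the N × K one-hot targets (names and shapes are choices for this example):

    import numpy as np

    def gradient_step(W, b, X, T, alpha):
        """One gradient descent update for softmax regression.
        W: (K, D) weights, b: (K,) biases, X: (N, D) inputs, T: (N, K) one-hot targets."""
        Z = X @ W.T + b                                       # logits, shape (N, K)
        Z = Z - Z.max(axis=1, keepdims=True)                  # stability shift
        Y = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # row-wise softmax
        E = Y - T                                             # errors y - t
        W_new = W - alpha * (E.T @ X) / len(X)                # gradient averaged over the batch
        b_new = b - alpha * E.mean(axis=0)
        return W_new, b_new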

Limits of Linear Classification
Visually, it’s obvious that XOR is not linearly separable. But how to show this?

Limits of Linear Classification
Showing that XOR is not linearly separable (proof by contradiction)
If two points lie in a half-space, the line segment connecting them also lies in the same half-space.
Suppose there were some feasible weights (hypothesis). If the positive examples are in the positive half-space, then the green line segment must be as well.
Similarly, the red line segment must lie within the negative half-space.
But the intersection can’t lie in both half-spaces. Contradiction!

Limits of Linear Classification
Sometimes we can overcome this limitation using feature maps, just like for linear regression. E.g., for XOR:
 x1  ψ(x)=x2 
x1 x2 ψ1(x) ψ2(x) ψ3(x) t 000000 010101 101001 111110
This is linearly separable. (Try it!)
Not a general solution: it can be hard to pick good basis functions. Instead, we'll use neural nets to learn nonlinear hypotheses.
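As a quick check of the claim, one choice of weights (illustrative, not unique) separates XOR in this feature space:

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    t = np.array([0, 1, 1, 0])                                    # XOR targets

    psi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])  # feature map psi(x)
    w, b = np.array([1.0, 1.0, -2.0]), -0.5                       # one separating hyperplane
    pred = (psi @ w + b > 0).astype(int)
    print(pred)                                                   # [0 1 1 0], matches t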

Neural Networks

Inspiration: The Brain
Our brain has ∼ 10^{11} neurons, each of which communicates with (is connected to) ∼ 10^4 other neurons.
Figure: The basic computational unit of the brain: Neuron [Pic credit: http://cs231n.github.io/neural-networks-1/]

Inspiration: The Brain
Neurons receive input signals and accumulate voltage. After some threshold they will fire spiking responses.
[Pic credit: www.moleculardevices.com]

Inspiration: The Brain
Neurons form billions of connections in the brain.

Inspiration: The Brain
For neural nets, we use a much simpler model neuron, or unit:
(Diagram: a unit with inputs x_i, weights w_i, bias b, a nonlinearity g, and output y.)
y = g(b + Σ_i x_i w_i)
Compare with logistic regression: y = σ(w⊤x + b)
By throwing together lots of these incredibly simplistic neuron-like processing units, we can do some powerful computations!

We can connect lots of units together into a directed acyclic graph.
Typically, units are grouped together into layers.
This gives a
feed-forward neural network. That’s in contrast to recurrent neural networks, which can have cycles.

Each hidden layer i connects N_{i−1} input units to N_i output units.
In the simplest case, all input units are connected to all output units. We call this a fully connected layer. We'll consider other layer types later.
Note: the inputs and outputs for a layer are distinct from the inputs and outputs to the network.
If we need to compute M outputs from N inputs, we can do so in parallel using matrix multiplication. This means we'll be using an M × N matrix.
The output units are a function of the input units:
y = f(x) = φ(Wx + b)
A multilayer network consisting of fully connected layers is called a multilayer perceptron. Despite the name, it has nothing to do with perceptrons!

Some activation functions:
Rectified Linear Unit (ReLU):
y = max(0, z)
Soft ReLU (softplus):
y = log(1 + e^z)

Some activation functions:
Hard Threshold:
y = 1 if z > 0, y = 0 if z ≤ 0
Hyperbolic Tangent (tanh):
y = (e^z − e^{−z}) / (e^z + e^{−z})
Logistic:
y = 1 / (1 + e^{−z})
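For reference, a compact NumPy sketch of these activations (definitions only, nothing course-specific; meant to be applied to arrays):

    import numpy as np

    relu      = lambda z: np.maximum(0, z)        # Rectified Linear Unit
    soft_relu = lambda z: np.log1p(np.exp(z))     # Soft ReLU (softplus), log(1 + e^z)
    hard      = lambda z: (z > 0).astype(float)   # hard threshold
    logistic  = lambda z: 1 / (1 + np.exp(-z))
    tanh      = np.tanh                           # hyperbolic tangent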

Each layer computes a function, so the network computes a composition of functions:
h^{(1)} = f^{(1)}(x) = φ(W^{(1)}x + b^{(1)})
h^{(2)} = f^{(2)}(h^{(1)}) = φ(W^{(2)}h^{(1)} + b^{(2)})
⋮
y = f^{(L)}(h^{(L−1)})
Or more compactly:
y = f^{(L)} ∘ ⋯ ∘ f^{(1)}(x).
Neural nets provide modularity: we can implement each layer’s computations as a black box.
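A hedged sketch of that modularity in code: each layer is a black box h ↦ φ(Wh + b), and the network is just their composition (layer sizes, the logistic nonlinearity, and the random initialization are arbitrary choices for illustration):

    import numpy as np

    sigma = lambda z: 1 / (1 + np.exp(-z))

    def layer(W, b, phi):
        """Return a fully connected layer h -> phi(W h + b)."""
        return lambda h: phi(W @ h + b)

    rng = np.random.default_rng(0)
    sizes = [784, 100, 100, 10]                   # e.g. a 2-hidden-layer network
    layers = [layer(0.01 * rng.standard_normal((m, n)), np.zeros(m), sigma)
              for n, m in zip(sizes[:-1], sizes[1:])]

    def network(x):
        for f in layers:                          # y = f^(L) o ... o f^(1) (x)
            x = f(x)
        return x

In practice the last layer's nonlinearity would be chosen to match the task (see the next slide).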

Feature Learning
Last layer:
If task is regression: choose
y = f^{(L)}(h^{(L−1)}) = (w^{(L)})^⊤ h^{(L−1)} + b^{(L)}
If task is binary classification: choose
y = f^{(L)}(h^{(L−1)}) = σ((w^{(L)})^⊤ h^{(L−1)} + b^{(L)})
Neural nets can be viewed as a way of learning features:

Feature Learning
Suppose we’re trying to classify images of handwritten digits. Each image is represented as a vector of 28 × 28 = 784 pixel values.
Each first-layer hidden unit computes φ(w_i^⊤ x). It acts as a feature detector.
We can visualize w by reshaping it into an image. Here’s an example that responds to a diagonal stroke.
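A minimal sketch of that visualization (a random vector stands in for the learned weights; matplotlib is assumed):

    import numpy as np
    import matplotlib.pyplot as plt

    w = np.random.randn(784)                      # stand-in for one hidden unit's learned weights
    plt.imshow(w.reshape(28, 28), cmap="gray")    # view the 784 weights as a 28 x 28 image
    plt.title("First-layer weights as an image")
    plt.show()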

Feature Learning
Here are some of the features learned by the first hidden layer of a handwritten digit classifier:

Expressive Power
We’ve seen that there are some functions that linear classifiers can’t represent. Are deep networks any better?
Suppose a layer's activation function was the identity, so the layer just computes an affine transformation of the input
▶ We call this a linear layer
Any sequence of linear layers can be equivalently represented with a single linear layer:
y = W^{(3)}W^{(2)}W^{(1)} x = W′ x
▶ Deep linear networks are no more expressive than linear regression.
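A quick numerical check of this collapse (random weight matrices, biases omitted for brevity):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((5, 4))
    W2 = rng.standard_normal((6, 5))
    W3 = rng.standard_normal((3, 6))
    x = rng.standard_normal(4)

    deep   = W3 @ (W2 @ (W1 @ x))      # three linear layers in sequence
    single = (W3 @ W2 @ W1) @ x        # one equivalent linear layer W'
    print(np.allclose(deep, single))   # True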

Expressive Power
Multilayer feed-forward neural nets with nonlinear activation functions are universal function approximators: they can approximate any function arbitrarily well.
This has been shown for various activation functions (thresholds, logistic, ReLU, etc.)
▶ Even though ReLU is “almost” linear, it's nonlinear enough.

Designing a network to classify XOR:
Assume hard threshold activation function

h_1 computes I[x_1 + x_2 − 0.5 > 0]
▶ i.e. x_1 OR x_2
h_2 computes I[x_1 + x_2 − 1.5 > 0]
▶ i.e. x_1 AND x_2
y computes I[h_1 − h_2 − 0.5 > 0] ≡ I[h_1 + (1 − h_2) − 1.5 > 0]
▶ i.e. h_1 AND (NOT h_2) = x_1 XOR x_2
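A sketch of this circuit with hard-threshold units, checking all four inputs (weights taken straight from the construction above):

    step = lambda z: 1 if z > 0 else 0           # hard threshold I[z > 0]

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        h1 = step(x1 + x2 - 0.5)                 # x1 OR x2
        h2 = step(x1 + x2 - 1.5)                 # x1 AND x2
        y  = step(h1 - h2 - 0.5)                 # h1 AND (NOT h2) = x1 XOR x2
        print((x1, x2), "->", y)                 # (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0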

Expressive Power
Universality for binary inputs and targets:
Hard threshold hidden units, linear output
Strategy: 2^D hidden units, each of which responds to one particular input configuration
Only requires one hidden layer, though it needs to be extremely wide.

Expressive Power
What about the logistic activation function?
You can approximate a hard threshold by scaling up the weights and biases:
(Plots: y = σ(x) and y = σ(5x).)
This is good: logistic units are differentiable, so we can train them
with gradient descent.

Expressive Power
Limits of universality
▶ You may need to represent an exponentially large network.
▶ How can you find the appropriate weights to represent a given function?
▶ If you can learn any function, you'll just overfit.
▶ We desire a compact representation.

Training Neural Networks with Backpropagation

Recap: Gradient Descent
Recall: gradient descent moves opposite the gradient, i.e. in the direction of steepest descent.
Weight space for a multilayer neural net: one coordinate for each weight or bias of the network, in all the layers
Conceptually, not any different from what we’ve seen so far — just higher dimensional and harder to visualize!
We want to define a loss L and compute the gradient of the cost, dJ/dw, which is the vector of partial derivatives.
▶ This is the average of dL/dw over all the training examples, so in this lecture we focus on computing dL/dw.

Univariate Chain Rule
We’ve already been using the univariate Chain Rule. Recall: if f(x) and x(t) are univariate functions, then
(d/dt) f(x(t)) = (df/dx)(dx/dt).

Univariate Chain Rule
Recall: Univariate logistic least squares model
z = wx + b
y = σ(z)
L = 1/2 (y − t)^2
Let's compute the loss derivatives ∂L/∂w, ∂L/∂b.

Univariate Chain Rule
How you would have done it in calculus class
L = 1/2 (σ(wx + b) − t)^2

∂L/∂w = ∂/∂w [1/2 (σ(wx + b) − t)^2]
      = 1/2 ∂/∂w (σ(wx + b) − t)^2
      = (σ(wx + b) − t) ∂/∂w (σ(wx + b) − t)
      = (σ(wx + b) − t) σ′(wx + b) ∂/∂w (wx + b)
      = (σ(wx + b) − t) σ′(wx + b) x

∂L/∂b = ∂/∂b [1/2 (σ(wx + b) − t)^2]
      = 1/2 ∂/∂b (σ(wx + b) − t)^2
      = (σ(wx + b) − t) ∂/∂b (σ(wx + b) − t)
      = (σ(wx + b) − t) σ′(wx + b) ∂/∂b (wx + b)
      = (σ(wx + b) − t) σ′(wx + b)

What are the disadvantages of this approach?

Univariate Chain Rule
A more structured way to do it
Computing the loss:
z = wx + b
y = σ(z)
L = 1/2 (y − t)^2

Computing the derivatives:
dL/dy = y − t
dL/dz = (dL/dy)(dy/dz) = (dL/dy) σ′(z)
∂L/∂w = (dL/dz)(∂z/∂w) = (dL/dz) x
∂L/∂b = (dL/dz)(∂z/∂b) = dL/dz

Remember, the goal isn't to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives.
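The structured version translates line by line into a program. A minimal sketch (the function name is illustrative; it uses σ′(z) = σ(z)(1 − σ(z))):

    import numpy as np

    sigma = lambda z: 1 / (1 + np.exp(-z))

    def loss_and_grads(w, b, x, t):
        # forward pass
        z = w * x + b
        y = sigma(z)
        L = 0.5 * (y - t) ** 2
        # backward pass: apply the chain rule step by step, reusing intermediate values
        dL_dy = y - t
        dL_dz = dL_dy * sigma(z) * (1 - sigma(z))   # dL/dy * sigma'(z)
        dL_dw = dL_dz * x
        dL_db = dL_dz
        return L, dL_dw, dL_db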

Univariate Chain Rule
We can diagram out the computations using a computation graph.
The nodes represent all the inputs and computed quantities, and the edges represent which nodes are computed directly as a function of which other nodes.
Computing the loss:
z = wx + b
y = σ(z)
L = 1/2 (y − t)^2

Univariate Chain Rule
A slightly more convenient notation:
Use ȳ to denote the derivative dL/dy, sometimes called the error signal.
This emphasizes that the error signals are just values our program is computing (rather than a mathematical operation).
Computing the loss:
z = wx + b
y = σ(z)
L = 1/2 (y − t)^2

Computing the derivatives:
ȳ = y − t
z̄ = ȳ σ′(z)
w̄ = z̄ x
b̄ = z̄
CSC311-Lec4

Multivariate Chain Rule
Problem: what if the computation graph has fan-out > 1? This requires the multivariate Chain Rule!
L2-regularized regression:
z = wx + b
y = σ(z)
L = 1/2 (y − t)^2
R = 1/2 w^2
L_reg = L + λR

Softmax regression:
z_ℓ = Σ_j w_ℓj x_j + b_ℓ
y_k = e^{z_k} / Σ_ℓ e^{z_ℓ}
L = −Σ_k t_k log y_k

Multivariate Chain Rule
Suppose we have a function f(x,y) and functions x(t) and y(t). (All the variables here are scalar-valued.) Then
d/dt f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)

Example:
f(x, y) = y + e^{xy}
x(t) = cos t
y(t) = t^2

Plug in to the Chain Rule:
df/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)
      = (y e^{xy}) · (−sin t) + (1 + x e^{xy}) · 2t
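A quick numerical check of this example against a central finite difference (the test point and step size are arbitrary):

    import numpy as np

    f = lambda x, y: y + np.exp(x * y)
    g = lambda t: f(np.cos(t), t ** 2)                    # f(x(t), y(t))

    def dg_dt(t):
        x, y = np.cos(t), t ** 2
        return (y * np.exp(x * y)) * (-np.sin(t)) + (1 + x * np.exp(x * y)) * (2 * t)

    t0, eps = 1.3, 1e-6
    numeric = (g(t0 + eps) - g(t0 - eps)) / (2 * eps)     # central difference
    print(np.isclose(dg_dt(t0), numeric))                 # True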

Multivariable Chain Rule
In the context of backpropagation:
In our notation:
t̄ = x̄ (dx/dt) + ȳ (dy/dt)

Backpropagation
Full backpropagation algorithm:
Let v_1, …, v_N be a topological ordering of the computation graph (i.e. parents come before children).
v_N denotes the variable we're trying to compute derivatives of (e.g. loss).
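A minimal self-contained sketch of the algorithm over such an ordering (the Node class and its fields are illustrative, not a real autodiff library):

    class Node:
        """One node v_i of the computation graph (illustrative sketch)."""
        def __init__(self, parents=(), compute=None, partials=()):
            self.parents = list(parents)    # parent Nodes
            self.compute = compute          # maps parent values to this node's value
            self.partials = list(partials)  # partials[i](parent values) = d(self)/d(parents[i])
            self.value = None
            self.bar = 0.0                  # error signal dL/d(this node)

    def backprop(nodes):
        """nodes: a topological ordering v_1, ..., v_N; v_N is the loss."""
        for v in nodes:                     # forward pass
            if v.compute is not None:       # input nodes have their value set directly
                v.value = v.compute(*[p.value for p in v.parents])
            v.bar = 0.0
        nodes[-1].bar = 1.0                 # the loss's own error signal is 1
        for v in reversed(nodes):           # backward pass
            vals = [p.value for p in v.parents]
            for p, dp in zip(v.parents, v.partials):
                p.bar += v.bar * dp(*vals)  # multivariate chain rule: sum over children
        return [v.bar for v in nodes]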

Backpropagation
Example: univariate logistic least squares regression

Forward pass:
z = wx + b
y = σ(z)
L = 1/2 (y − t)^2
R = 1/2 w^2
L_reg = L + λR

Backward pass:
L̄_reg = 1
R̄ = L̄_reg (dL_reg/dR) = L̄_reg λ
L̄ = L̄_reg (dL_reg/dL) = L̄_reg
ȳ = L̄ (dL/dy) = L̄ (y − t)
z̄ = ȳ (dy/dz) = ȳ σ′(z)
w̄ = z̄ (∂z/∂w) + R̄ (dR/dw) = z̄ x + R̄ w
b̄ = z̄ (∂z/∂b) = z̄

Perceptron (multiple outputs):

Forward pass:
z_i = Σ_j w^{(1)}_{ij} x_j + b^{(1)}_i
h_i = σ(z_i)
y_k = Σ_i w^{(2)}_{ki} h_i + b^{(2)}_k
L = 1/2 Σ_k (y_k − t_k)^2

Backward pass:
ȳ_k = L̄ (y_k − t_k)
w̄^{(2)}_{ki} = ȳ_k h_i
b̄^{(2)}_k = ȳ_k
h̄_i = Σ_k ȳ_k w^{(2)}_{ki}
z̄_i = h̄_i σ′(z_i)
w̄^{(1)}_{ij} = z̄_i x_j
b̄^{(1)}_i = z̄_i

Backpropagation
In vectorized form:
Forward pass:
z = W^{(1)}x + b^{(1)}
h = σ(z)
y = W^{(2)}h + b^{(2)}
L = 1/2 ‖y − t‖^2

Backward pass:
ȳ = L̄ (y − t)
W̄^{(2)} = ȳ h^⊤
b̄^{(2)} = ȳ
h̄ = W^{(2)⊤} ȳ
z̄ = h̄ ∘ σ′(z)
W̄^{(1)} = z̄ x^⊤
b̄^{(1)} = z̄
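A NumPy sketch of these vectorized passes, matching the equations above line for line (the function name and shapes are illustrative):

    import numpy as np

    sigma = lambda z: 1 / (1 + np.exp(-z))

    def forward_backward(W1, b1, W2, b2, x, t):
        # forward pass
        z = W1 @ x + b1
        h = sigma(z)
        y = W2 @ h + b2
        L = 0.5 * np.sum((y - t) ** 2)
        # backward pass (bars are error signals dL/d.)
        y_bar  = y - t                              # L_bar = 1
        W2_bar = np.outer(y_bar, h)
        b2_bar = y_bar
        h_bar  = W2.T @ y_bar
        z_bar  = h_bar * sigma(z) * (1 - sigma(z))  # sigma'(z), elementwise
        W1_bar = np.outer(z_bar, x)
        b1_bar = z_bar
        return L, (W1_bar, b1_bar, W2_bar, b2_bar)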

Computational Cost
Computational cost of forward pass: one add-multiply operation per weight
z_i = Σ_j w^{(1)}_{ij} x_j + b^{(1)}_i
Computational cost of backward pass: two add-multiply operations per weight
w̄^{(2)}_{ki} = ȳ_k h_i
h̄_i = Σ_k ȳ_k w^{(2)}_{ki}
Rule of thumb: the backward pass is about as expensive as two forward passes.
For a multilayer perceptron, this means the cost is linear in the number of layers, quadratic in the number of units per layer.
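A back-of-the-envelope count for a hypothetical MLP, just to make the rule of thumb concrete (layer sizes chosen arbitrarily):

    sizes = [784, 500, 500, 10]                                   # hypothetical layer widths
    weights = sum(m * n for n, m in zip(sizes[:-1], sizes[1:]))   # 647,000 weights
    forward_ops  = weights                                        # ~1 add-multiply per weight
    backward_ops = 2 * weights                                    # ~2 add-multiplies per weight
    print(weights, forward_ops, backward_ops)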

Backprop is used to train the overwhelming majority of neural nets today.
▶ Even optimization algorithms much fancier than gradient descent (e.g. second-order methods) use backprop to compute the gradients.
Despite its practical success, backprop is believed to be neurally implausible.