CSC 311: Introduction to Machine Learning
Lecture 4 – Multiclass Classification & Neural Networks
Based on slides by Amir-massoud Farahmand & Emad A.M. Andrews
Classification: predicting a discrete-valued target
Binary classification: predicting a binary-valued target
Multiclass classification: predicting a discrete-valued target with more than two classes
Examples of multi-class classification
predict the value of a handwritten digit
classify e-mails as spam, travel, work, personal
Multiclass Classification
Classification tasks with more than two categories:
[Figure: examples of handwritten digits — it is very hard to say what makes a 2. Some examples from an earlier version of the net.]
Multiclass Classification
Targets form a discrete set {1,…,K}.
It’s often more convenient to represent them as one-hot vectors, or
a one-of-K encoding:
t = (0, …, 0, 1, 0, …, 0) ∈ R^K
where entry k is 1
Multiclass Classification
Now there are D input dimensions and K output dimensions, so we need K × D weights, which we arrange as a weight matrix W.
Also, we have a K-dimensional vector b of biases. Linear predictions:
z_k = Σ_j w_{kj} x_j + b_k   for k = 1, 2, …, K
Vectorized:
z = Wx + b
Multiclass Classification
Predictions are like probabilities: want 1 ≥ y_k ≥ 0 and Σ_k y_k = 1
A natural activation function to use is the softmax function, a multivariable generalization of the logistic function:
y_k = softmax(z_1, …, z_K)_k = e^{z_k} / Σ_{k′} e^{z_{k′}}
The inputs zk are called the logits. Properties:
Outputs are positive and sum to 1 (so they can be interpreted as probabilities)
If one of the zk is much larger than the others, softmax(z)k ≈ 1 (behaves like argmax).
Exercise: how does the case of K = 2 relate to the logistic function?
Note: sometimes σ(z) is used to denote the softmax function; in this class, it will denote the logistic function applied elementwise.
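As a concrete illustration, here is a minimal NumPy sketch of the softmax function (the shift by max(z) is a standard numerical-stability trick; the function and variable names are our own, not from the slides):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis of the logits z."""
    z = z - np.max(z, axis=-1, keepdims=True)   # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

z = np.array([1.0, 2.0, 5.0])
y = softmax(z)
print(y, y.sum())                               # outputs are positive and sum to 1
print(softmax(np.array([1.0, 2.0, 50.0])))      # one large logit -> approximately one-hot (argmax-like)
```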
Multiclass Classification
If a model outputs a vector of class probabilities, we can use cross-entropy as the loss function:
L_CE(y, t) = −Σ_{k=1}^K t_k log y_k
           = −t^⊤ (log y),
where the log is applied elementwise.
Just like with logistic regression, we typically combine the softmax
and cross-entropy into a softmax-cross-entropy function.
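A minimal sketch of both options, assuming a one-hot target vector; the combined softmax-cross-entropy works directly from the logits, which behaves better numerically (helper names are illustrative):

```python
import numpy as np

def cross_entropy(y, t):
    """L_CE(y, t) = -t^T log(y) for class probabilities y and one-hot target t."""
    return -np.dot(t, np.log(y))

def softmax_cross_entropy(z, t):
    """Combine softmax and cross-entropy, computed from the logits z."""
    z = z - np.max(z)                        # numerical stability
    log_y = z - np.log(np.sum(np.exp(z)))    # log of the softmax
    return -np.dot(t, log_y)

z = np.array([2.0, 0.5, -1.0])
t = np.array([1.0, 0.0, 0.0])                # one-hot target for class 0
print(softmax_cross_entropy(z, t))
```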
Multiclass Classification
Softmax regression:
z = Wx + b
y = softmax(z)
L_CE = −t^⊤ (log y)
Gradient descent updates can be derived for each row of W:
∂L_CE/∂w_k = (∂L_CE/∂z_k) · (∂z_k/∂w_k) = (y_k − t_k) · x
w_k ← w_k − α (1/N) Σ_i (y_k^(i) − t_k^(i)) x^(i)
Similar to linear/logistic regression (no coincidence). (Verify the update!)
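Putting the pieces together, here is a minimal sketch of batch gradient descent for softmax regression; it applies the update above to all rows of W at once (the function name, learning rate, and iteration count are our own choices):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_regression_gd(X, T, lr=0.1, n_iters=1000):
    """X: N x D inputs, T: N x K one-hot targets. Returns W (K x D) and b (K,)."""
    N, D = X.shape
    K = T.shape[1]
    W = np.zeros((K, D))
    b = np.zeros(K)
    for _ in range(n_iters):
        Y = softmax(X @ W.T + b)      # N x K class probabilities
        dZ = (Y - T) / N              # averaged error signal (y - t)
        W -= lr * (dZ.T @ X)          # K x D gradient, one row per class
        b -= lr * dZ.sum(axis=0)
    return W, b
```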
Limits of Linear Classification
Visually, it’s obvious that XOR is not linearly separable. But how to show this?
Limits of Linear Classification
Showing that XOR is not linearly separable (proof by contradiction)
If two points lie in a half-space, the line segment connecting them also lies in the same half-space.
Suppose there were some feasible weights (hypothesis). If the positive examples are in the positive half-space, then the green line segment must be as well.
Similarly, the red line segment must lie within the negative half-space.
But the intersection can’t lie in both half-spaces. Contradiction!
Limits of Linear Classification
Sometimes we can overcome this limitation using feature maps, just like for linear regression. E.g., for XOR:
ψ(x) = (x1, x2, x1 x2)

x1  x2  ψ1(x)  ψ2(x)  ψ3(x)  t
0   0   0      0      0      0
0   1   0      1      0      1
1   0   1      0      0      1
1   1   1      1      1      0
This is linearly separable. (Try it!)
Not a general solution: it can be hard to pick good basis functions. Instead, we’ll use neural nets to learn nonlinear hypotheses.
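To see the separability concretely, here is a quick check (with our own choice of weights, not from the slides) that a linear classifier in the feature space ψ(x) = (x1, x2, x1·x2) labels XOR correctly:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])                      # XOR targets

# Feature map psi(x) = (x1, x2, x1*x2)
Psi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# One choice of linear weights that separates the classes in feature space
w = np.array([1.0, 1.0, -2.0])
b = -0.5
pred = (Psi @ w + b > 0).astype(int)
print(pred)   # [0 1 1 0] -- matches t, so psi(x) makes XOR linearly separable
```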
Neural Networks
Inspiration: The Brain
Our brain has ∼ 10^11 neurons, each of which communicates (is connected) to ∼ 10^4 other neurons
Figure: The basic computational unit of the brain: Neuron [Pic credit: http://cs231n.github.io/neural-networks-1/]
Inspiration: The Brain
Neurons receive input signals and accumulate voltage. Once the voltage passes a threshold, they fire spiking responses.
[Pic credit: www.moleculardevices.com]
Inspiration: The Brain
Neurons form billions of connections in the brain.
Inspiration: The Brain
For neural nets, we use a much simpler model neuron, or unit:
[Diagram: a model neuron with inputs x_i, weights w_i, a bias b, a nonlinearity g, and output y]
y = g(b + Σ_i x_i w_i)
Compare with logistic regression: y = σ(w⊤x + b)
By throwing together lots of these incredibly simplistic neuron-like processing units, we can do some powerful computations!
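For concreteness, a single unit is just a weighted sum followed by a nonlinearity; with g = σ it computes exactly the logistic regression model above (a small sketch with made-up numbers):

```python
import numpy as np

def unit(x, w, b, g):
    """One model neuron: y = g(b + sum_i x_i * w_i)."""
    return g(b + np.dot(w, x))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(0.0, z)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
print(unit(x, w, b=0.2, g=sigmoid))   # same computation as logistic regression
print(unit(x, w, b=0.2, g=relu))      # same unit, different nonlinearity
```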
We can connect lots of units together into a directed acyclic graph.
Typically, units are grouped together into layers.
This gives a feed-forward neural network. That’s in contrast to recurrent neural networks, which can have cycles.
Each hidden layer i connects N_{i−1} input units to N_i output units.
In the simplest case, all input units are connected to all output units. We call this a fully connected layer. We’ll consider other layer types later.
Note: the inputs and outputs for a layer are distinct from the inputs and outputs to the network.
If we need to compute M outputs from N inputs, we can do so in parallel using matrix multiplication. This means we’ll be using an M × N weight matrix W.
The output units are a function of the input units:
y = f(x) = φ(Wx + b)
A multilayer network consisting of fully connected layers is called a multilayer perceptron. Despite the name, it has nothing to do with perceptrons!
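A minimal sketch of one fully connected layer as described above, with an M × N weight matrix (the names are illustrative):

```python
import numpy as np

def dense_layer(x, W, b, phi):
    """Fully connected layer: y = phi(W x + b), mapping N inputs to M outputs."""
    return phi(W @ x + b)

rng = np.random.default_rng(0)
N, M = 4, 3
W = rng.normal(size=(M, N))      # M x N weight matrix
b = np.zeros(M)                  # M biases
x = rng.normal(size=N)
h = dense_layer(x, W, b, phi=np.tanh)
print(h.shape)                   # (3,)
```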
Some activation functions:
Rectified Linear Unit (ReLU):
y = max(0, z)
Soft ReLU (softplus):
y = log(1 + e^z)
Some activation functions:
Hard Threshold:
y = 1 if z > 0, 0 if z ≤ 0
Hyperbolic Tangent (tanh):
y = (e^z − e^{−z}) / (e^z + e^{−z})
Logistic:
y = 1 / (1 + e^{−z})
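These activations are one-liners in NumPy (a sketch; the hard threshold is written as a comparison):

```python
import numpy as np

relu     = lambda z: np.maximum(0.0, z)         # Rectified Linear Unit
softplus = lambda z: np.log1p(np.exp(z))        # Soft ReLU
hard     = lambda z: (z > 0).astype(float)      # Hard threshold
tanh     = np.tanh                              # Hyperbolic tangent
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))   # Logistic (sigmoid)

z = np.linspace(-3, 3, 7)
for name, f in [("ReLU", relu), ("softplus", softplus), ("hard", hard),
                ("tanh", tanh), ("logistic", logistic)]:
    print(name, np.round(f(z), 3))
```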
Each layer computes a function, so the network computes a composition of functions:
h^(1) = f^(1)(x) = φ(W^(1)x + b^(1))
h^(2) = f^(2)(h^(1)) = φ(W^(2)h^(1) + b^(2))
⋮
y = f^(L)(h^(L−1))
Or more compactly:
y = f^(L) ∘ ⋯ ∘ f^(1)(x).
Neural nets provide modularity: we can implement each layer’s computations as a black box.
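The modular view translates directly into code: each layer is a function, and the network is their composition (a sketch with arbitrary layer sizes and activations of our own choosing):

```python
import numpy as np

def layer(W, b, phi):
    """Return the function f(h) = phi(W h + b)."""
    return lambda h: phi(W @ h + b)

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 1]                           # input dim, two hidden layers, output
phis = [np.tanh, np.tanh, lambda z: z]         # identity activation on the last layer
layers = [layer(rng.normal(size=(m, n)), np.zeros(m), phi)
          for n, m, phi in zip(sizes[:-1], sizes[1:], phis)]

x = rng.normal(size=sizes[0])
y = x
for f in layers:          # y = f^(L) o ... o f^(1) (x)
    y = f(y)
print(y)
```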
Feature Learning
Last layer:
If task is regression: choose
y = f^(L)(h^(L−1)) = (w^(L))^⊤ h^(L−1) + b^(L)
If task is binary classification: choose
y = f^(L)(h^(L−1)) = σ((w^(L))^⊤ h^(L−1) + b^(L))
Neural nets can be viewed as a way of learning features:
Feature Learning
Suppose we’re trying to classify images of handwritten digits. Each image is represented as a vector of 28 × 28 = 784 pixel values.
Each first-layer hidden unit computes φ(w_i^⊤ x). It acts as a feature detector.
We can visualize w by reshaping it into an image. Here’s an example that responds to a diagonal stroke.
Feature Learning
Here are some of the features learned by the first hidden layer of a handwritten digit classifier:
Expressive Power
We’ve seen that there are some functions that linear classifiers can’t represent. Are deep networks any better?
Suppose a layer’s activation function was the identity, so the layer just computes an affine transformation of the input
We call this a linear layer
Any sequence of linear layers can be equivalently represented with
a single linear layer.
y = W^(3)W^(2)W^(1)x = W′x
Deep linear networks are no more expressive than linear regression.
Expressive Power
Multilayer feed-forward neural nets with nonlinear activation functions are universal function approximators: they can approximate any function arbitrarily well.
This has been shown for various activation functions (thresholds, logistic, ReLU, etc.)
Even though ReLU is “almost” linear, it’s nonlinear enough.
Designing a network to classify XOR:
Assume hard threshold activation function
h1 computes I[x1 + x2 − 0.5 > 0], i.e. x1 OR x2
h2 computes I[x1 + x2 − 1.5 > 0], i.e. x1 AND x2
y computes I[h1 − h2 − 0.5 > 0] ≡ I[h1 + (1 − h2) − 1.5 > 0], i.e. h1 AND (NOT h2) = x1 XOR x2
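We can verify this construction directly (a small sketch; I is the indicator / hard-threshold function):

```python
def I(z):
    """Indicator: 1 if z > 0, else 0 (hard threshold)."""
    return 1.0 if z > 0 else 0.0

def xor_net(x1, x2):
    h1 = I(x1 + x2 - 0.5)    # x1 OR x2
    h2 = I(x1 + x2 - 1.5)    # x1 AND x2
    y = I(h1 - h2 - 0.5)     # h1 AND (NOT h2) = x1 XOR x2
    return y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(xor_net(x1, x2)))
# prints: 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```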
Expressive Power
Universality for binary inputs and targets:
Hard threshold hidden units, linear output
Strategy: 2^D hidden units, each of which responds to one particular input configuration
Only requires one hidden layer, though it needs to be extremely wide.
Expressive Power
What about the logistic activation function?
You can approximate a hard threshold by scaling up the weights and biases:
[Figure: plots of y = σ(x) and y = σ(5x)]
This is good: logistic units are differentiable, so we can train them
with gradient descent.
Expressive Power
Limits of universality
You may need to represent an exponentially large network.
How can you find the appropriate weights to represent a given function?
If you can learn any function, you’ll just overfit.
We desire a compact representation.
Training Neural Networks with Backpropagation
Recap: Gradient Descent
Recall: gradient descent moves opposite the gradient (the direction of steepest descent)
Weight space for a multilayer neural net: one coordinate for each weight or bias of the network, in all the layers
Conceptually, not any different from what we’ve seen so far — just higher dimensional and harder to visualize!
We want to define a loss L and compute the gradient of the cost, dJ/dw, which is the vector of partial derivatives.
This is the average of dL/dw over all the training examples, so in this lecture we focus on computing dL/dw.
Univariate Chain Rule
We’ve already been using the univariate Chain Rule. Recall: if f(x) and x(t) are univariate functions, then
(d/dt) f(x(t)) = (df/dx)(dx/dt).
Univariate Chain Rule
Recall: Univariate logistic least squares model
z = wx + b
y = σ(z)
L = ½ (y − t)²
Let’s compute the loss derivatives ∂L/∂w and ∂L/∂b.
Univariate Chain Rule
How you would have done it in calculus class:

L = ½ (σ(wx + b) − t)²

∂L/∂w = ∂/∂w [½ (σ(wx + b) − t)²]
      = ½ ∂/∂w (σ(wx + b) − t)²
      = (σ(wx + b) − t) ∂/∂w (σ(wx + b) − t)
      = (σ(wx + b) − t) σ′(wx + b) ∂/∂w (wx + b)
      = (σ(wx + b) − t) σ′(wx + b) x

∂L/∂b = ∂/∂b [½ (σ(wx + b) − t)²]
      = ½ ∂/∂b (σ(wx + b) − t)²
      = (σ(wx + b) − t) ∂/∂b (σ(wx + b) − t)
      = (σ(wx + b) − t) σ′(wx + b) ∂/∂b (wx + b)
      = (σ(wx + b) − t) σ′(wx + b)

What are the disadvantages of this approach?
Univariate Chain Rule
A more structured way to do it
Computing the loss:
z = wx + b
y = σ(z)
L = ½ (y − t)²

Computing the derivatives:
dL/dy = y − t
dL/dz = (dL/dy)(dy/dz) = (dL/dy) σ′(z)
∂L/∂w = (dL/dz)(dz/dw) = (dL/dz) x
∂L/∂b = (dL/dz)(dz/db) = dL/dz

Remember, the goal isn’t to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives.
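This structured recipe is exactly what a program computes; a minimal sketch for one training example, with variable names mirroring the equations above:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, b, x, t):
    # Forward pass
    z = w * x + b
    y = sigma(z)
    L = 0.5 * (y - t) ** 2
    # Backward pass: apply the chain rule step by step
    dL_dy = y - t
    dL_dz = dL_dy * sigma(z) * (1 - sigma(z))   # sigma'(z) = sigma(z)(1 - sigma(z))
    dL_dw = dL_dz * x
    dL_db = dL_dz
    return L, dL_dw, dL_db

print(loss_and_grads(w=0.5, b=-0.3, x=2.0, t=1.0))
```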
Univariate Chain Rule
We can diagram out the computations using a computation graph.
The nodes represent all the inputs and computed quantities, and the edges represent which nodes are computed directly as a function of which other nodes.
Computing the loss:
z = wx + b
y = σ(z)
L = ½ (y − t)²
Univariate Chain Rule
A slightly more convenient notation:
Use ȳ to denote the derivative dL/dy, sometimes called the error signal.
This emphasizes that the error signals are just values our program is computing (rather than a mathematical operation).
Computing the loss:
z = wx + b
y = σ(z)
L = ½ (y − t)²

Computing the derivatives:
z̄ = ȳ σ′(z)
w̄ = z̄ x
Multivariate Chain Rule
Problem: what if the computation graph has fan-out > 1? This requires the multivariate Chain Rule!
L2-Regularized regression:
z = wx + b
y = σ(z)
L = ½ (y − t)²
R = ½ w²
L_reg = L + λR

Softmax regression:
z_ℓ = Σ_j w_{ℓj} x_j + b_ℓ
y_k = e^{z_k} / Σ_ℓ e^{z_ℓ}
L = −Σ_k t_k log y_k
Multivariate Chain Rule
Suppose we have a function f(x,y) and functions x(t) and y(t). (All the variables here are scalar-valued.) Then
(d/dt) f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)

Example:
f(x, y) = y + e^{xy}
x(t) = cos t
y(t) = t²

Plug in to the Chain Rule:
df/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)
      = (y e^{xy}) · (−sin t) + (1 + x e^{xy}) · 2t
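A quick numeric sanity check of this example, comparing the chain-rule expression against a finite-difference approximation (function names and the test point are our own):

```python
import numpy as np

def f(x, y):
    return y + np.exp(x * y)

def g(t):                        # g(t) = f(x(t), y(t))
    return f(np.cos(t), t ** 2)

def dg_dt(t):                    # chain rule: (y e^{xy})(-sin t) + (1 + x e^{xy})(2t)
    x, y = np.cos(t), t ** 2
    return (y * np.exp(x * y)) * (-np.sin(t)) + (1 + x * np.exp(x * y)) * (2 * t)

t0, eps = 1.3, 1e-6
numeric = (g(t0 + eps) - g(t0 - eps)) / (2 * eps)
print(dg_dt(t0), numeric)        # the two values should agree closely
```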
Multivariable Chain Rule
In the context of backpropagation:
In our notation:
t̄ = x̄ (dx/dt) + ȳ (dy/dt)
Backpropagation
Full backpropagation algorithm:
Let v_1, …, v_N be a topological ordering of the computation graph (i.e. parents come before children).
v_N denotes the variable we’re trying to compute derivatives of (e.g. the loss).
Backpropagation
Example: univariate logistic least squares regression

Forward pass:
z = wx + b
y = σ(z)
L = ½ (y − t)²
R = ½ w²
L_reg = L + λR

Backward pass:
L̄_reg = 1
R̄ = L̄_reg (dL_reg/dR) = L̄_reg λ
L̄ = L̄_reg (dL_reg/dL) = L̄_reg
ȳ = L̄ (dL/dy) = L̄ (y − t)
z̄ = ȳ (dy/dz) = ȳ σ′(z)
w̄ = z̄ (∂z/∂w) + R̄ (dR/dw) = z̄ x + R̄ w
b̄ = z̄ (∂z/∂b) = z̄
Multilayer perceptron (multiple outputs):

Forward pass:
z_i = Σ_j w^(1)_{ij} x_j + b^(1)_i
h_i = σ(z_i)
y_k = Σ_i w^(2)_{ki} h_i + b^(2)_k
L = ½ Σ_k (y_k − t_k)²

Backward pass:
ȳ_k = L̄ (y_k − t_k)
w̄^(2)_{ki} = ȳ_k h_i
b̄^(2)_k = ȳ_k
h̄_i = Σ_k ȳ_k w^(2)_{ki}
z̄_i = h̄_i σ′(z_i)
w̄^(1)_{ij} = z̄_i x_j
b̄^(1)_i = z̄_i
Backpropagation
In vectorized form:

Forward pass:
z = W^(1)x + b^(1)
h = σ(z)
y = W^(2)h + b^(2)
L = ½ ‖y − t‖²

Backward pass:
ȳ = L̄ (y − t)
W̄^(2) = ȳ h^⊤
b̄^(2) = ȳ
h̄ = W^(2)⊤ ȳ
z̄ = h̄ ∘ σ′(z)
W̄^(1) = z̄ x^⊤
b̄^(1) = z̄
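These vectorized equations translate almost line by line into code; a sketch for a single training example (the names W1, W2, etc. are illustrative, and the bars in the comments denote dL/d(variable)):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, t, W1, b1, W2, b2):
    # Forward pass
    z = W1 @ x + b1
    h = sigma(z)
    y = W2 @ h + b2
    L = 0.5 * np.sum((y - t) ** 2)
    # Backward pass
    y_bar = y - t                                # y-bar
    W2_bar = np.outer(y_bar, h)                  # W2-bar = y-bar h^T
    b2_bar = y_bar
    h_bar = W2.T @ y_bar                         # h-bar = W2^T y-bar
    z_bar = h_bar * sigma(z) * (1 - sigma(z))    # z-bar = h-bar o sigma'(z)
    W1_bar = np.outer(z_bar, x)                  # W1-bar = z-bar x^T
    b1_bar = z_bar
    return L, (W1_bar, b1_bar, W2_bar, b2_bar)

rng = np.random.default_rng(0)
x, t = rng.normal(size=4), rng.normal(size=2)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
print(forward_backward(x, t, W1, b1, W2, b2)[0])
```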
Computational Cost
Computational cost of forward pass: one add-multiply operation per weight
z_i = Σ_j w^(1)_{ij} x_j + b^(1)_i
Computational cost of backward pass: two add-multiply operations per weight
w̄^(2)_{ki} = ȳ_k h_i
h̄_i = Σ_k ȳ_k w^(2)_{ki}
Rule of thumb: the backward pass is about as expensive as two forward passes.
For a multilayer perceptron, this means the cost is linear in the number of layers, quadratic in the number of units per layer.
Backprop is used to train the overwhelming majority of neural nets today.
Even optimization algorithms much fancier than gradient descent (e.g. second-order methods) use backprop to compute the gradients.
Despite its practical success, backprop is believed to be neurally implausible.