Introduction to Machine Learning
CS/SE 4X03
Ned Nedialkov
McMaster University
March 16, 2021
Outline
Example
Activation function
A simple network
Training
Steepest descent
Stochastic gradient descent
This is a summary of Sections 1-4 from
C. F. Higham, D. J. Higham, Deep Learning: An Introduction for Applied Mathematicians
Figures 1 and 3 are cropped from this article
Example
• Points in R^2 classified into two categories, A and B
• This is labeled data
• Given a new point, how do we use the labeled data to classify it?
Activation function
• A neuron fires or is inactive
• Activation can be modeled by the sigmoid function
  σ(x) = 1/(1 + e^{−x})
• σ(0) = 0.5, σ(x) ≈ 1 when x is large, σ(x) ≈ 0 when x is small
[Figure: plot of σ(x) for x ∈ [−10, 10]]
• Steepness can be changed by scaling the argument
• Location can be changed by shifting the argument
• Useful property σ′(x) = σ(x)(1 − σ(x))
[Figure: plot of σ(3(x − 5)) for x ∈ [−10, 10]]
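Not part of the original slides: a minimal NumPy sketch of σ, its derivative, and the scaled/shifted variant σ(3(x − 5)) from the plot above.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative, using the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))            # 0.5
print(sigmoid_prime(0.0))      # 0.25
print(sigmoid(3 * (7.0 - 5)))  # scaled by 3 and shifted to x = 5; close to 1 at x = 7
```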
A simple network
• Each neuron
◦ outputs a real number
◦ sends it to every neuron in the next layer
• A neuron in the next layer
◦ forms a linear combination of its inputs and adds a bias
◦ applies the activation function
Consider layers 2 and 3
Layer 2: neurons 1 and 2 output real numbers a1 and a2, respectively, and send them to neurons 1, 2, 3 in layer 3

Layer 3:
• neuron 1 combines a1 and a2, adds bias b1 to form w11 a1 + w12 a2 + b1, and outputs
  σ(w11 a1 + w12 a2 + b1)
• neuron 2 outputs
  σ(w21 a1 + w22 a2 + b2)
• neuron 3 outputs
  σ(w31 a1 + w32 a2 + b3)
Denote

  W = [w11 w12; w21 w22; w31 w32],   a = [a1; a2],   b = [b1; b2; b3],   z = W a + b

W is a matrix with the weights, and b is a bias vector. For a vector z, apply σ componentwise: (σ(z))_i = σ(z_i).

The output of layer 3 is

  σ(z) = σ(W a + b)
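As an illustration only (the weight and bias values below are made up), a sketch of one layer's output σ(Wa + b) in NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer(W, b, a):
    """One layer: form W a + b and apply sigma componentwise."""
    return sigmoid(W @ a + b)

# Hypothetical layer-3 parameters: 3 neurons, each taking 2 inputs
W = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])
b = np.array([0.0, -0.1, 0.1])
a = np.array([0.7, 0.2])      # outputs a1, a2 of layer 2
print(layer(W, b, a))         # the three outputs of layer 3
```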
• Denote the input by x, the W and b at layer i by W[i] and b[i], and the output of layer i by a[i]
• Output of layer 2 is
  a[2] = σ(W[2] x + b[2]) ∈ R^2,   where W[2] ∈ R^{2×2}, b[2] ∈ R^2
• Output of layer 3 is
  a[3] = σ(W[3] a[2] + b[3]) ∈ R^3,   where W[3] ∈ R^{3×2}, b[3] ∈ R^3
• Output of layer 4 is
  a[4] = σ(W[4] a[3] + b[4]) ∈ R^2,   where W[4] ∈ R^{2×3}, b[4] ∈ R^2
• Write the above as
  F(x) = σ(W[4] σ(W[3] σ(W[2] x + b[2]) + b[3]) + b[4])
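A sketch of the composed map F(x) (my own illustration; the parameter values are random placeholders with the dimensions above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def F(x, params):
    """F(x) = sigma(W[4] sigma(W[3] sigma(W[2] x + b[2]) + b[3]) + b[4])."""
    a = x
    for W, b in params:        # params = [(W[2], b[2]), (W[3], b[3]), (W[4], b[4])]
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
params = [(rng.standard_normal((2, 2)), rng.standard_normal(2)),   # layer 2
          (rng.standard_normal((3, 2)), rng.standard_normal(3)),   # layer 3
          (rng.standard_normal((2, 3)), rng.standard_normal(2))]   # layer 4
print(F(np.array([0.5, 0.5]), params))   # a point in R^2 mapped to R^2
```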
• Layer i:
◦ W[i] is of size (# outputs) × (# inputs)
◦ b[i] is of size (# outputs)
Number of parameters is 23:
  layer i   inputs   outputs   W[i]   b[i]
     2        2         2      2×2     2
     3        2         3      3×2     3
     4        3         2      2×3     2
             total number of entries:  16     7
• F(x) is a function from R^2 to R^2 with 23 parameters
• Training is about finding these parameters
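A quick sanity check of the parameter count (my own sketch):

```python
# (inputs, outputs) per layer: layer 2 is 2->2, layer 3 is 2->3, layer 4 is 3->2
shapes = [(2, 2), (2, 3), (3, 2)]
weights = sum(n_in * n_out for n_in, n_out in shapes)   # 4 + 6 + 6 = 16
biases  = sum(n_out for _, n_out in shapes)             # 2 + 3 + 2 = 7
print(weights + biases)                                 # 23
```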
Training
Residual
• Denote the input points by x^{i}
• Let
  y(x^{i}) = [1; 0] if x^{i} ∈ A,   y(x^{i}) = [0; 1] if x^{i} ∈ B
• Suppose we have computed W[2], W[3], W[4], b[2], b[3], b[4] and evaluate F(x^{i})
• Residual
  y(x^{i}) − F(x^{i}),
  with size measured in the 2-norm, ∥y(x^{i}) − F(x^{i})∥_2
Cost function
• Cost function
  Cost(W[2], W[3], W[4], b[2], b[3], b[4]) = (1/10) Σ_{i=1}^{10} (1/2) ∥y(x^{i}) − F(x^{i})∥_2^2
• Training: find the parameters that minimize the cost function
• Nonlinear least squares problem
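A sketch of this cost in NumPy, reusing the hypothetical F from the earlier sketch; xs and ys stand for the 10 labeled points and their target vectors [1, 0] or [0, 1]:

```python
import numpy as np

def cost(params, xs, ys):
    """(1/10) * sum over points of (1/2) * ||y(x_i) - F(x_i)||_2^2."""
    total = 0.0
    for x, y in zip(xs, ys):
        r = y - F(x, params)          # residual for point i
        total += 0.5 * np.dot(r, r)   # (1/2) * squared 2-norm
    return total / len(xs)
```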
Classifying
• Suppose we have computed values for the parameters
• Given x ∈ R^2, compute y = F(x)
• If y1 > y2, classify x as A (y is closer to [1, 0]^T)
• If y1 < y2, classify x as B (y is closer to [0, 1]^T)
• Tie breaking is needed when y1 = y2
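A possible classification helper, again assuming the hypothetical F sketched earlier:

```python
def classify(x, params):
    """Return 'A' if y1 > y2, 'B' if y1 < y2; a tie-breaking rule would be needed when equal."""
    y = F(x, params)
    return 'A' if y[0] > y[1] else 'B'
```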
Steepest descent
• Collect the parameters in a vector p ∈ R^s; here s = 23
• The cost function is Cost(p)
• Find ∆p such that
Cost(p + ∆p) < Cost(p)
• For small ∆p,
  Cost(p + ∆p) ≈ Cost(p) + Σ_{r=1}^{s} (∂Cost(p)/∂p_r) ∆p_r
              = Cost(p) + ∇Cost(p)^T ∆p
  where
  ∇Cost(p) = [∂Cost(p)/∂p_1, ∂Cost(p)/∂p_2, ···, ∂Cost(p)/∂p_s]^T
Example
• To illustrate the above, suppose
  Cost(p) = p1^2 + p2^2 + 2 p1 + 3
• The gradient is
  ∇Cost(p) = [2 p1 + 2, 2 p2]^T
• Then
  Cost(p + ∆p) ≈ Cost(p) + ∇Cost(p)^T ∆p = Cost(p) + (2 p1 + 2) ∆p1 + 2 p2 ∆p2
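A small numerical check of this linearization (my own example; the point p and perturbation ∆p are arbitrary):

```python
import numpy as np

def cost_ex(p):
    return p[0]**2 + p[1]**2 + 2*p[0] + 3

p  = np.array([0.0, 1.0])                 # arbitrary point
dp = np.array([0.01, -0.02])              # small perturbation
grad = np.array([2*p[0] + 2, 2*p[1]])     # gradient [2 p1 + 2, 2 p2]^T

print(cost_ex(p + dp))                    # 3.9805  (exact value)
print(cost_ex(p) + grad @ dp)             # 3.98    (first-order approximation)
```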
Steepest descent (cont.)
• Cost(p) ≥ 0
• From
  Cost(p + ∆p) ≈ Cost(p) + ∇Cost(p)^T ∆p,
  we want to make ∇Cost(p)^T ∆p as negative as possible
• Given ∇Cost(p), how should we choose ∆p?
• For u, v ∈ R^s,
  u^T v = ∥u∥ · ∥v∥ cos θ,
  which, for fixed norms, is most negative when v points in the direction of −u
• Choose ∆p in the direction of −∇Cost(p)
  That is, move along the direction of steepest descent
Steepest descent (η is the learning rate):
  choose initial p
  repeat
    ∆p = p_new − p = −η∇Cost(p), i.e., p_new = p − η∇Cost(p)
    p ← p − η∇Cost(p)
  until a stopping criterion is met or the max # of iterations is reached
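One way this loop might look in NumPy (a sketch; the gradient-norm stopping test and parameter values are my own choices), applied to the example cost from the previous slide:

```python
import numpy as np

def steepest_descent(grad, p0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Repeatedly step in the direction of -grad(p) with step size eta."""
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        g = grad(p)
        if np.linalg.norm(g) < tol:    # stopping criterion: gradient nearly zero
            break
        p = p - eta * g                # p <- p - eta * grad Cost(p)
    return p

# Minimize Cost(p) = p1^2 + p2^2 + 2 p1 + 3; the minimizer is p = (-1, 0)
print(steepest_descent(lambda p: np.array([2*p[0] + 2, 2*p[1]]), [0.0, 1.0]))
```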
• In general, with N input points,
  Cost(p) = (1/N) Σ_{i=1}^{N} (1/2) ∥y(x^{i}) − F(x^{i})∥_2^2 = (1/N) Σ_{i=1}^{N} Ci(p)
  ∇Cost(p) = (1/N) Σ_{i=1}^{N} ∇Ci(p)
• N can be large
• The number of parameters can be very large
• Evaluating ∇Cost(p) can be very expensive
Stochastic gradient descent
• Idea: replace (1/N) Σ_{i=1}^{N} ∇Ci(p) by ∇Ci(p) for a randomly chosen i
• Iterate until a stopping criterion is met or the max # of iterations is reached:
◦ pick a random integer i from {1, 2, ..., N}
◦ p ← p − η∇Ci(p)
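A sketch of this loop; grad_Ci is a hypothetical function that would return ∇Ci(p) for training point i (indices here are 0-based):

```python
import numpy as np

def sgd(grad_Ci, N, p0, eta=0.1, max_iter=10_000):
    """Stochastic gradient descent: each step uses the gradient of one randomly chosen C_i."""
    rng = np.random.default_rng()
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        i = rng.integers(N)            # pick a random training point
        p = p - eta * grad_Ci(i, p)    # p <- p - eta * grad C_i(p)
    return p
```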