
Introduction to Machine Learning
CS/SE 4X03
Ned Nedialkov, McMaster University, March 16, 2021

Outline
Example
Activation function
A simple network
Training
Steepest descent
Stochastic gradient descent

This is a summary of Sections 1-4 from
C. F. Higham, D. J. Higham, Deep Learning: An Introduction for Applied Mathematicians
Figures 1 and 3 are cropped from this article

Example
• Points in R^2 classified into two categories, A and B
• This is labeled data
• Given a new point, how do we use the labeled data to classify it?

Activation function
• A neuron fires or is inactive
• Activation can be modeled by the sigmoid function
σ(x) = 1/(1 + e^{−x})
• σ(0) = 0.5, σ(x) ≈ 1 when x is large, σ(x) ≈ 0 when x is small
[Figure: plot of σ(x) for x from −10 to 10]

• Steepness can be changed by scaling the argument
• Location can be changed by shifting the argument; for example, σ(3(x − 5)) is steeper than σ(x) and centered at x = 5
• Useful property: σ′(x) = σ(x)(1 − σ(x))
[Figure: plot of σ(3(x − 5)) for x from −10 to 10]
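A minimal numerical sketch of the above (assuming Python with NumPy, which these notes do not prescribe): evaluate σ, the scaled and shifted variant σ(3(x − 5)) from the figure, and check σ′(x) = σ(x)(1 − σ(x)) against a finite difference.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 5)          # a few sample points
print(sigmoid(0.0))                  # 0.5
print(sigmoid(3 * (x - 5)))          # steeper sigmoid, centered at x = 5

# finite-difference check of sigma'(x) = sigma(x) * (1 - sigma(x))
h = 1e-6
fd = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.max(np.abs(fd - sigmoid(x) * (1 - sigmoid(x)))))   # close to zero
```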

A simple network
• Each neuron
◦ outputs a real number
◦ sends it to every neuron in the next layer
• Each neuron in the next layer
◦ forms a linear combination of its inputs and adds a bias
◦ applies the activation function

Consider layers 2 and 3
Layer 2: neurons 1 and 2 output real numbers a1 and a2, respectively, and send them to neurons 1, 2, 3 in layer 3
Layer 3:
• neuron 1 combines a1 and a2 and adds bias b1, forming w11 a1 + w12 a2 + b1, and outputs σ(w11 a1 + w12 a2 + b1)
• neuron 2 outputs σ(w21 a1 + w22 a2 + b2)
• neuron 3 outputs σ(w31 a1 + w32 a2 + b3)

Denote
  W = [w11 w12; w21 w22; w31 w32],   a = [a1; a2],   b = [b1; b2; b3],   z = W a + b
W is a matrix of weights, b is a bias vector
For a vector z, apply σ componentwise: (σ(z))_i = σ(z_i)
The output of layer 3 is σ(z) = σ(W a + b)
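As an illustration, a small sketch of the layer-3 computation σ(Wa + b) in NumPy (the weights, biases, and inputs below are made up for the example, not values from the notes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W = np.array([[ 0.1, -0.3],    # w11 w12
              [ 0.5,  0.2],    # w21 w22
              [-0.4,  0.7]])   # w31 w32   (3x2 weight matrix)
a = np.array([0.6, 0.9])       # outputs a1, a2 of layer 2
b = np.array([0.0, 0.1, -0.2]) # biases b1, b2, b3

z = W @ a + b                  # one linear combination + bias per neuron
print(sigmoid(z))              # componentwise sigma: output of layer 3, in R^3
```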

• Denote the input by x, the W and b at layer i by W[i] and b[i], and the output of layer i by a[i]
• Output of layer 2 is a[2] = σ(W[2] x + b[2]) ∈ R^2, where W[2] ∈ R^{2×2}, b[2] ∈ R^2
• Output of layer 3 is a[3] = σ(W[3] a[2] + b[3]) ∈ R^3, where W[3] ∈ R^{3×2}, b[3] ∈ R^3
• Output of layer 4 is a[4] = σ(W[4] a[3] + b[4]) ∈ R^2, where W[4] ∈ R^{2×3}, b[4] ∈ R^2
• Write the above as
  F(x) = σ(W[4] σ(W[3] σ(W[2] x + b[2]) + b[3]) + b[4])
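Composing the three layers gives F(x). A sketch with the shapes from the notes but randomly generated weights (any actual values would come from training):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W2, b2 = rng.normal(size=(2, 2)), rng.normal(size=2)   # layer 2: R^2 -> R^2
W3, b3 = rng.normal(size=(3, 2)), rng.normal(size=3)   # layer 3: R^2 -> R^3
W4, b4 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 4: R^3 -> R^2

def F(x):
    a2 = sigmoid(W2 @ x + b2)
    a3 = sigmoid(W3 @ a2 + b3)
    a4 = sigmoid(W4 @ a3 + b4)
    return a4                   # in R^2

print(F(np.array([0.5, 0.7])))
```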

• Layer i:
◦ W[i] is of size (# outputs) × (# inputs)
◦ b[i] is of size (# outputs)
Number of parameters is 23:
layer i   inputs   outputs   W[i]   b[i]
   2        2         2      2×2     2
   3        2         3      3×2     3
   4        3         2      2×3     2
totals: 16 weights, 7 biases
• F(x) is a function from R^2 to R^2 with 23 parameters
• Training is about finding these parameters
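The count of 23 can be checked mechanically; a short sketch assuming layer widths 2, 2, 3, 2 as above:

```python
sizes = [2, 2, 3, 2]   # number of neurons in layers 1..4
weights = sum(n_out * n_in for n_in, n_out in zip(sizes, sizes[1:]))   # 4 + 6 + 6 = 16
biases = sum(sizes[1:])                                                # 2 + 3 + 2 = 7
print(weights, biases, weights + biases)                               # 16 7 23
```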

Training
Residual
• Denote the input points by x{i}
• Let
  y(x{i}) = [1, 0]^T  if x{i} ∈ A
  y(x{i}) = [0, 1]^T  if x{i} ∈ B
• Suppose we have computed W[2], W[3], W[4], b[2], b[3], b[4] and evaluate F(x{i})
• Residual: ||y(x{i}) − F(x{i})||_2

Cost function
• Cost function:
  Cost(W[2], W[3], W[4], b[2], b[3], b[4]) = (1/10) Σ_{i=1}^{10} (1/2) ||y(x{i}) − F(x{i})||_2^2
• Training: find the parameters that minimize the cost function
• Nonlinear least squares problem
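A sketch of this cost in NumPy, assuming hypothetical lists xs and ys that hold the points x{i} and their labels y(x{i}), and F as in the earlier sketch:

```python
import numpy as np

def cost(xs, ys, F):
    # average over the labeled points of (1/2) * ||y(x{i}) - F(x{i})||_2^2
    return sum(0.5 * np.sum((y - F(x)) ** 2) for x, y in zip(xs, ys)) / len(xs)
```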

Classifying
• Suppose we have computed values for the parameters
• Given x ∈ R^2, compute y = F(x)
• If y1 > y2, classify x as A (y is closer to [1, 0]^T)
• If y1 < y2, classify x as B (y is closer to [0, 1]^T)
• Tie breaking is needed when y1 = y2

Steepest descent
• Consider the parameters in a vector p ∈ R^s; here s = 23
• The cost function is Cost(p)
• Find ∆p such that Cost(p + ∆p) < Cost(p)
• For small ∆p,
  Cost(p + ∆p) ≈ Cost(p) + Σ_{r=1}^{s} (∂Cost(p)/∂pr) ∆pr = Cost(p) + ∇Cost(p)^T ∆p
  where ∇Cost(p) = [∂Cost(p)/∂p1, ∂Cost(p)/∂p2, ..., ∂Cost(p)/∂ps]^T

Example
• To illustrate the above, suppose Cost(p) = p1^2 + p2^2 + 2 p1 + 3
• The gradient is ∇Cost(p) = [2 p1 + 2, 2 p2]^T, so
  Cost(p + ∆p) ≈ Cost(p) + ∇Cost(p)^T ∆p = Cost(p) + (2 p1 + 2) ∆p1 + 2 p2 ∆p2

Steepest descent (cont.)
• Cost(p) ≥ 0
• From Cost(p + ∆p) ≈ Cost(p) + ∇Cost(p)^T ∆p, we want to make ∇Cost(p)^T ∆p as negative as possible
• Given ∇Cost(p), how do we choose ∆p?
• For u, v ∈ R^s, u^T v = ||u|| ||v|| cos θ, which, for fixed ||v||, is most negative when v points in the direction of −u
• Choose ∆p in the direction of −∇Cost(p), that is, move along the direction of steepest descent

Steepest descent iteration (η is the learning rate):
  choose an initial p
  repeat
    ∆p = p_new − p = −η ∇Cost(p), i.e. p_new = p − η ∇Cost(p)
    p ← p − η ∇Cost(p)
  until a stopping criterion is met or the max # of iterations is reached

• In general, with N input points,
  Cost(p) = (1/N) Σ_{i=1}^{N} (1/2) ||y(x{i}) − F(x{i})||_2^2 = (1/N) Σ_{i=1}^{N} Ci(p),
  where Ci(p) = (1/2) ||y(x{i}) − F(x{i})||_2^2, and
  ∇Cost(p) = (1/N) Σ_{i=1}^{N} ∇Ci(p)
• N can be large
• The number of parameters can be very large
• Evaluating ∇Cost(p) can be very expensive

Stochastic gradient descent
• Idea: replace (1/N) Σ_{i=1}^{N} ∇Ci(p) by ∇Ci(p) for a randomly chosen i
• Iterate until a stopping criterion is met or the max # of iterations is reached:
  ◦ pick a random integer i from {1, 2, ..., N}
  ◦ p ← p − η ∇Ci(p)
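A sketch of both loops in NumPy (not the notes' implementation). To keep it self-contained, the gradients are approximated by central finite differences; in practice ∇Cost(p) and ∇Ci(p) would be computed analytically, for example by back propagation.

```python
import numpy as np

def num_grad(f, p, h=1e-6):
    # central finite-difference approximation of the gradient of f at p
    g = np.zeros_like(p)
    for r in range(p.size):
        e = np.zeros_like(p)
        e[r] = h
        g[r] = (f(p + e) - f(p - e)) / (2 * h)
    return g

def steepest_descent(cost, p, eta=0.1, max_iter=1000, tol=1e-8):
    for _ in range(max_iter):
        g = num_grad(cost, p)
        if np.linalg.norm(g) < tol:          # stopping criterion
            break
        p = p - eta * g                      # step along -grad Cost(p)
    return p

def sgd(Ci, N, p, eta=0.1, max_iter=10000):
    rng = np.random.default_rng(0)
    for _ in range(max_iter):
        i = rng.integers(N)                  # pick a random i from {0, ..., N-1}
        p = p - eta * num_grad(lambda q: Ci(q, i), p)
    return p

# e.g. the quadratic example Cost(p) = p1^2 + p2^2 + 2*p1 + 3 has its minimum at p = (-1, 0)
print(steepest_descent(lambda p: p[0]**2 + p[1]**2 + 2*p[0] + 3, np.array([5.0, -3.0])))
```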