
CMPUT 366 F20: More on ANN
James Wright & Vadim Bulitko
December 3, 2020

Lecture Outline
More on RNNs
GBC §10

RNN: Overview
A recurrent neural network processes a sequence x^{(1)}, …, x^{(τ)} by applying the same cell at every time step, carrying a hidden state h^{(t)} forward from one step to the next.

RNN: Details
Forward pass (the standard GBC §10.2 formulation):
a^{(t)} = b + W h^{(t−1)} + U x^{(t)}
h^{(t)} = tanh(a^{(t)})
o^{(t)} = c + V h^{(t)}
ŷ^{(t)} = softmax(o^{(t)})
Negative log-likelihood loss:
L = −∑_t log p(y^{(t)} | x^{(1)}, …, x^{(t)})
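A minimal NumPy sketch of one forward step and the per-sequence loss, assuming the GBC §10.2 parameterization above (tanh hidden units, softmax outputs); the function and parameter names are illustrative, not from the slides:

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One forward step of a vanilla RNN (GBC §10.2 notation)."""
    a_t = b + W @ h_prev + U @ x_t      # pre-activation
    h_t = np.tanh(a_t)                  # new hidden state
    o_t = c + V @ h_t                   # output logits
    e = np.exp(o_t - o_t.max())         # numerically stable softmax
    return h_t, e / e.sum()

def nll_loss(y_hats, targets):
    """Negative log-likelihood, summed over time steps."""
    return -sum(np.log(y_hat[t]) for y_hat, t in zip(y_hats, targets))
```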

Problem with RNNs Above
Gradients propagated back through many time steps tend to vanish (or, less often, explode), so plain RNNs have difficulty learning long-term dependencies. This motivates gated architectures such as the LSTM.

LSTM: Overview
The long short-term memory (LSTM) cell adds an internal state s^{(t)} and gates that control how much of that state is forgotten, written, and exposed at each time step.

LSTM: Under the Hood
The i-th LSTM cell's state s_i^{(t)} is updated as:
s_i^{(t)} = f_i^{(t)} s_i^{(t−1)} + g_i^{(t)} σ(b_i + ∑_j U_{i,j} x_j^{(t)} + ∑_j W_{i,j} h_j^{(t−1)})
x^{(t)} is the input vector and h^{(t)} is the hidden-layer vector containing the outputs of all LSTM cells; b, U, W are the biases, input weights, and recurrent weights.
f_i^{(t)} is the forget gate, with parameters b^f, U^f, W^f:
f_i^{(t)} = σ(b_i^f + ∑_j U_{i,j}^f x_j^{(t)} + ∑_j W_{i,j}^f h_j^{(t−1)})
g_i^{(t)} is the external input gate, with parameters b^g, U^g, W^g:
g_i^{(t)} = σ(b_i^g + ∑_j U_{i,j}^g x_j^{(t)} + ∑_j W_{i,j}^g h_j^{(t−1)})

LSTM: Under the Hood
The i-th LSTM cell's output h_i^{(t)} is computed as:
h_i^{(t)} = tanh(s_i^{(t)}) q_i^{(t)}
where q_i^{(t)} is the output gate, with its own parameters b^o, U^o, W^o:
q_i^{(t)} = σ(b_i^o + ∑_j U_{i,j}^o x_j^{(t)} + ∑_j W_{i,j}^o h_j^{(t−1)})
Optionally, s^{(t−1)} can also be fed into the three gates as an extra input.
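Putting the four equations together, a hedged NumPy sketch of a single LSTM step (the dict layout and names are illustrative; following GBC's formulation, the cell-input nonlinearity is a sigmoid, though tanh is also common):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM step. p maps each part to its (b, U, W) triple:
    "cell" for the state update; "f", "g", "o" for the gates."""
    b,  U,  W  = p["cell"]
    bf, Uf, Wf = p["f"]
    bg, Ug, Wg = p["g"]
    bo, Uo, Wo = p["o"]
    f_t = sigmoid(bf + Uf @ x_t + Wf @ h_prev)   # forget gate
    g_t = sigmoid(bg + Ug @ x_t + Wg @ h_prev)   # external input gate
    q_t = sigmoid(bo + Uo @ x_t + Wo @ h_prev)   # output gate
    # cell state: keep part of the old state, write gated new input
    s_t = f_t * s_prev + g_t * sigmoid(b + U @ x_t + W @ h_prev)
    h_t = np.tanh(s_t) * q_t                     # cell output
    return h_t, s_t
```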

Training Neural Networks
CMPUT 366: Intelligent Systems
GBC §6.5

Lecture Outline
1. Recap
2. Gradient Descent for Neural Networks
3. Automatic Differentiation
4. Back-Propagation

Recap: Nonlinear Features
y = f(x; w) = g(wᵀx) = g(∑_{i=1}^n w_i x_i)
Generalized Linear Model: learn a linear model on richer inputs
1. Define a feature mapping φ(x) that returns functions of the original inputs
2. Learn a linear model of the features instead of the inputs:
y = f(x; w) = g(wᵀφ(x)) = g(∑_{i=1}^n w_i [φ(x)]_i)
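A small illustrative sketch; the feature map below (inputs, squares, and a pairwise product) is one arbitrary choice of φ, not one prescribed by the slides:

```python
import numpy as np

def phi(x):
    """One possible feature map: the inputs, their squares, and a product."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

def glm_predict(x, w, g=np.tanh):
    """Linear model applied to the features rather than the raw inputs."""
    return g(w @ phi(x))
```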

Recap: Feedforward Neural Network
[network diagram: inputs x1, x2 → hidden units h1, h2 → output y1]
A neural network is many units composed together
Feedforward neural network:
• Units arranged into layers
• Each layer takes outputs of previous layer as its inputs
h_i(x; w^{(i)}, b^{(i)}) = g(b^{(i)} + ∑_{j=1}^n w_j^{(i)} x_j)
y(x; w, b) = g(b^{(y)} + ∑_{i=1}^n w_i^{(y)} h_i(x; w^{(i)}, b^{(i)}))
           = g(b^{(y)} + ∑_{i=1}^n w_i^{(y)} g(b^{(i)} + ∑_{j=1}^n w_j^{(i)} x_j))
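As a concrete sketch of this layer-by-layer computation in NumPy (vectorized, so each weight matrix holds all of a layer's units):

```python
import numpy as np

def feedforward(x, W1, b1, W2, b2, g=np.tanh):
    """Two-layer net from the slide: h = g(b1 + W1 x), y = g(b2 + W2 h).
    W1 has shape (hidden, inputs); W2 has shape (outputs, hidden)."""
    h = g(b1 + W1 @ x)      # hidden layer
    return g(b2 + W2 @ h)   # output layer
```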

Recap: Chain Rule of Calculus
dz/dx = (dz/dy)(dy/dx)
i.e., h(x) = f(g(x)) ⟹ h′(x) = f′(g(x)) g′(x)
If we know formulas for the derivatives of the components of a function, then we can build up the derivative of their composition mechanically
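For example, with h(x) = sin(x²) we have f(u) = sin u and g(x) = x², so h′(x) = cos(x²) · 2x.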

Neural Network Parameters
[network diagram: inputs x1, x2 → hidden units h1, h2 → output y]
y = f(x; θ)
A neural network is just a supervised model: a function that takes inputs x and computes an output y based on parameters θ
Question: What is θ in a feedforward neural network?


Training Neural Networks
Specify a loss L and a set of training examples:
E = ((x^{(1)}, y^{(1)}), …, (x^{(n)}, y^{(n)}))
Training by gradient descent:
1. Compute loss on training data: L(W, b) = ∑_i l(f(x^{(i)}; W, b), y^{(i)}), where f(x^{(i)}; W, b) is the prediction and y^{(i)} is the target
2. Compute gradient of loss: ∇L(W, b)
3. Update parameters to make loss smaller:
[W_new; b_new] = [W_old; b_old] − η ∇L(W_old, b_old)
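A minimal sketch of this loop; grad_fn is a hypothetical stand-in for whatever computes ∇L (e.g., back-propagation, introduced below):

```python
import numpy as np

def train(params, grad_fn, data, eta=0.01, steps=1000):
    """Gradient descent: repeatedly step opposite the gradient of the loss.
    grad_fn(params, data) is assumed to return ∇L(params) as an array."""
    for _ in range(steps):
        params = params - eta * grad_fn(params, data)  # step 3: update
    return params
```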

Three Representations
A function f(x, y) can be represented in multiple ways:
1. As a formula:
f(x, y) = (xy + sin x + 4)(3y² + 6)
2. As a computational graph:
[graph: x and y feed ×, sin, and sqr nodes; with constants 4, 3, 6 these combine through + and × into f(x, y)]
3. As a finite numerical algorithm:
s1 = x
s2 = y
s3 = s1 × s2
s4 = sin(s1)
s5 = s3 + s4
s6 = s5 + 4
s7 = sqr(s2)
s8 = 3 × s7
s9 = s8 + 6
s10 = s6 × s9
example from [Bücker et al., 2006]
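Representation 3 is directly executable; the slide's algorithm, transcribed into Python:

```python
import math

def f(x, y):
    """f(x, y) = (x*y + sin(x) + 4) * (3*y**2 + 6), step by step."""
    s1 = x
    s2 = y
    s3 = s1 * s2
    s4 = math.sin(s1)
    s5 = s3 + s4
    s6 = s5 + 4
    s7 = s2 ** 2      # sqr
    s8 = 3 * s7
    s9 = s8 + 6
    s10 = s6 * s9
    return s10
```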

Symbolic Differentiation
z = f(y), y = f(x), x = f(w)
z = f(f(f(w)))
∂z/∂w = (∂z/∂y)(∂y/∂x)(∂x/∂w) = f′(f(f(w))) f′(f(w)) f′(w)
We can differentiate a nested formula by recursively applying the chain rule to derive a new formula for the gradient
• Problem: This can result in a lot of repeated subexpressions
• Question: What happens if the nested function is defined piecewise?
f(w) = { w  if w > 0
       { 0  otherwise
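For instance, with f(u) = sin u the derived formula is ∂z/∂w = cos(sin(sin w)) · cos(sin w) · cos w: the subexpression sin w appears in two factors, and sin(sin w) in another, so a naive symbolic derivative recomputes them repeatedly.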

Automatic Differentiation: Forward Mode
The forward mode converts a finite numerical algorithm for computing a function into a finite numerical algorithm for computing the function's derivative
For each step, a new step is constructed representing the derivative of the corresponding step in the original program:
s1 = x        ⟹  s1′ = 1
s2 = y        ⟹  s2′ = 0
s3 = s1 + s2  ⟹  s3′ = s1′ + s2′
s4 = s1 × s2  ⟹  s4′ = s1 × s2′ + s1′ × s2
⋮
• To compute the partial derivative ∂sn/∂s1, set s1′ = 1 and s2′ = 0 and run the augmented algorithm
• This takes roughly twice as long to run as the original algorithm (why?)

Forward Mode Example
Let's compute ∂f/∂y at x = 2, y = 8 using forward mode:
s1 = x = 2               s1′ = 0
s2 = y = 8               s2′ = 1
s3 = s1 × s2 = 16        s3′ = s1 × s2′ + s1′ × s2 = 2
s4 = sin(s1) ≈ 0.034     s4′ = cos(s1) × s1′ = 0
s5 = s3 + s4 = 16.034    s5′ = s3′ + s4′ = 2
s6 = s5 + 4 = 20.034     s6′ = s5′ = 2
s7 = sqr(s2) = 64        s7′ = 2 × s2 × s2′ = 16
s8 = 3 × s7 = 192        s8′ = 3 × s7′ = 48
s9 = s8 + 6 = 198        s9′ = s8′ = 48
s10 = s6 × s9 = 3966.732 s10′ = s6 × s9′ + s6′ × s9 = 1357.632
So ∂f/∂y at x = 2, y = 8 is s10′ = 1357.632
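The same sweep as a runnable sketch: each step carries its derivative along. (The slide's value sin(2) ≈ 0.034 indicates sin is evaluated in degrees, so this sketch does the same.)

```python
import math

def f_and_dfdy(x, y):
    """Forward-mode sweep: each s_k is paired with its derivative ds_k
    with respect to y (so ds1 = 0 and ds2 = 1)."""
    s1,  ds1  = x,       0.0
    s2,  ds2  = y,       1.0
    s3,  ds3  = s1 * s2, s1 * ds2 + ds1 * s2
    s4,  ds4  = math.sin(math.radians(s1)), math.cos(math.radians(s1)) * ds1
    # ds1 = 0 here, so the degrees-vs-radians factor in ds4 is moot
    s5,  ds5  = s3 + s4, ds3 + ds4
    s6,  ds6  = s5 + 4,  ds5
    s7,  ds7  = s2 ** 2, 2 * s2 * ds2
    s8,  ds8  = 3 * s7,  3 * ds7
    s9,  ds9  = s8 + 6,  ds8
    s10, ds10 = s6 * s9, s6 * ds9 + ds6 * s9
    return s10, ds10

print(f_and_dfdy(2, 8))  # ≈ (3966.9, 1357.7); the slide rounds sin(2°) to 0.034
```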

Forward Mode Performance
To compute the full gradient of a function of m variables requires computing m partial derivatives
• In forward mode, this requires m forward passes
• For our toy example, that means running the forward pass twice
• Neural networks can easily have thousands of parameters
• We don't want to run the network thousands of times for each gradient update!



Automatic Differentiation: Backward Mode
Forward mode sweeps through the graph, computing s_i′ = ∂s_i/∂s_1 for each s_i
• The numerator varies, and the denominator is fixed
Backward mode does the opposite: for each s_i, it computes the local gradient s̄_i = ∂s_n/∂s_i
• The numerator is fixed, and the denominator varies
At the end, we have computed x̄_i = ∂s_n/∂x_i for each input x_i

Backward Mode Example
Let's compute ∂f/∂x and ∂f/∂y at x = 2, y = 8 using backward mode.
Forward pass (as before):
s1 = x = 2, s2 = y = 8, s3 = s1 × s2 = 16, s4 = sin(s1) ≈ 0.034, s5 = s3 + s4 = 16.034,
s6 = s5 + 4 = 20.034, s7 = sqr(s2) = 64, s8 = 3 × s7 = 192, s9 = s8 + 6 = 198, s10 = s6 × s9 = 3966.732
Backward pass:
s̄10 = ∂s10/∂s10 = 1
s̄9 = ∂s10/∂s9 = s̄10 × s6 = 20.034
s̄8 = ∂s10/∂s8 = s̄9 × ∂s9/∂s8 = s̄9 × 1 = 20.034
s̄7 = ∂s10/∂s7 = s̄8 × ∂s8/∂s7 = s̄8 × 3 = 60.102
s̄6 = ∂s10/∂s6 = s̄10 × s9 = 198 (s6 feeds directly into s10)
⋮
Continuing down to the inputs yields s̄1 = ∂f/∂x and s̄2 = ∂f/∂y = 1357.632, matching the forward-mode result.
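A matching sketch in code: one forward sweep, then one backward sweep accumulating s̄_k = ∂s10/∂s_k (again with sin in degrees to match the slide's numbers):

```python
import math

def grad_f(x, y):
    """Backward mode: forward pass first, then local gradients from s10 down."""
    # forward pass
    s1 = x
    s2 = y
    s3 = s1 * s2
    s4 = math.sin(math.radians(s1))
    s5 = s3 + s4
    s6 = s5 + 4
    s7 = s2 ** 2
    s8 = 3 * s7
    s9 = s8 + 6
    s10 = s6 * s9
    # backward pass: sb_k = d s10 / d s_k
    sb10 = 1.0
    sb9 = sb10 * s6       # s10 = s6 * s9
    sb8 = sb9             # s9 = s8 + 6
    sb7 = sb8 * 3         # s8 = 3 * s7
    sb6 = sb10 * s9       # s10 = s6 * s9
    sb5 = sb6             # s6 = s5 + 4
    sb4 = sb5             # s5 = s3 + s4
    sb3 = sb5
    sb2 = sb3 * s1 + sb7 * 2 * s2   # paths through s3 and s7
    # path through s4 needs the degree-to-radian factor for sin(deg)
    sb1 = sb3 * s2 + sb4 * math.cos(math.radians(s1)) * math.pi / 180
    return sb1, sb2  # (df/dx, df/dy)

print(grad_f(2, 8))  # df/dy ≈ 1357.7, matching the forward-mode sweep
```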

Back-Propagation
Back-propagation is simply automatic differentiation in backward mode, used to compute the gradient ∇_{W,b} L of the loss function with respect to its parameters W, b:
L(W, b) = ∑_i l(f(x^{(i)}; W, b), y^{(i)})
1. At each layer, compute the local gradients of the layer's computations
2. These local gradients are used as inputs to the next layer's local gradient computations
3. At the end, we have a partial derivative for each of the parameters, which we can use to take a gradient step
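A hedged sketch for the two-layer network from the recap; the tanh activations and squared-error loss are illustrative assumptions, not the slides' prescription:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    """Loss and gradients for a two-layer tanh network, squared-error loss."""
    # forward pass, caching intermediate values
    h = np.tanh(b1 + W1 @ x)
    y = np.tanh(b2 + W2 @ h)
    loss = 0.5 * np.sum((y - t) ** 2)
    # backward pass: local gradients flow from the output toward the inputs
    delta2 = (y - t) * (1 - y ** 2)            # dL/dz2  (tanh' = 1 - tanh^2)
    dW2, db2 = np.outer(delta2, h), delta2
    delta1 = (W2.T @ delta2) * (1 - h ** 2)    # dL/dz1
    dW1, db1 = np.outer(delta1, x), delta1
    return loss, (dW1, db1, dW2, db2)
```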

Summary
• The loss function of a deep feedforward network is simply a very nested function of the parameters of the model
• Automatic differentiation can compute these gradients more efficiently than symbolic differentiation or finite-difference numeric computation
  • Symbolic differentiation is interleaved with numeric computation
  • In forward mode, m sweeps are required for a function of m parameters
  • In backward mode, only a single sweep is required
• Back-propagation is simply automatic differentiation in backward mode, applied to neural networks