Machine Learning
Lecture 7: Deep Learning I
Prof. Dr. Stephan Günnemann
Data Analytics and Machine Learning Group, Technische Universität München
Winter term 2020/2021
Introduction
Another look at Logistic Regression
We had before:
$y \mid x \sim \mathrm{Bernoulli}(\sigma(w^T x))$, where $w^T x := w_0 + w_1 x_1 + \ldots + w_D x_D$
We can represent this graphically
(the first node $1 = x_0$ is for the bias term):
• each node is a (scalar) input
• multiply each input by the weight on its edge: $x_j w_j$
• compute the weighted sum of the incoming edges: $a_0 = \sum_{j=0}^{D} x_j w_j$
• apply the (activation) function: $p(y = 1 \mid x) = \sigma(a_0) = f(x, w)$
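To make the single-neuron picture concrete, here is a minimal NumPy sketch (not from the lecture; the weight and input values are made up for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.5, -1.2, 2.0])    # w_0 (bias), w_1, w_2 -- hypothetical values
x = np.array([1.0, 0.3, -0.7])    # x_0 = 1 prepended for the bias term

a0 = x @ w                        # weighted sum over the incoming edges
p_y1 = sigmoid(a0)                # p(y = 1 | x) = sigma(a_0)
print(p_y1)
```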
The XOR dataset
The XOR dataset is not linearly separable → Logistic Regression will fail since it learns a linear decision boundary
[Figure: scatter plot of the XOR dataset]
How to handle non-linearity? Basis functions
We have input vectors x and associated targets y. We want to describe the underlying functional relation.
We can use the following simple model:
$f(x, w) = \sigma\!\left(w_0 + \sum_{j} w_j \phi_j(x)\right) = \sigma\!\left(w^T \phi(x)\right) \qquad (1)$
$\phi_j$: basis functions; many choices, can be nonlinear
$w_0$: bias; equivalent to defining $\phi_0 \equiv 1$, or to adding a constant 1 to every sample
Remember: we are still linear in $w$!
Example: Handling XOR dataset by custom basis functions
Apply a (nonlinear) transformation φ that maps samples to a space where they are linearly separable. For example:
[Figure: the XOR dataset in the original input space (left) and in the transformed $\mathbb{R}^2$ space (right), where the two classes become linearly separable]
Here we defined a custom basis function $\phi: \mathbb{R}^3 \to \mathbb{R}^2$:
$\phi(x) = \phi(1, x_1, x_2) = \left(\sigma(5 + x_1 + x_2),\; \sigma(5 - x_1 - x_2)\right)$
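As a quick check, the following sketch applies this $\phi$ to four illustrative XOR-style points (the exact coordinates and labels are assumptions, chosen to mimic the plotted dataset) and shows that the two classes end up on opposite sides of a line in the transformed space:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def phi(x1, x2):
    return np.array([sigmoid(5 + x1 + x2), sigmoid(5 - x1 - x2)])

X = np.array([[-4, -4], [4, 4], [-4, 4], [4, -4]], dtype=float)  # hypothetical XOR-style points
y = np.array([1, 1, 0, 0])                                       # one class per diagonal

for (x1, x2), label in zip(X, y):
    print(label, phi(x1, x2))   # class 1 maps near (0, 1) or (1, 0), class 0 near (1, 1)
```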
Example: Handling XOR dataset by custom basis functions
Overall function that is modeled:
$f(x, w) = \sigma(w^T \phi(x)) = \sigma_1\!\left(\begin{pmatrix} w_0 & w_1 \end{pmatrix} \cdot \sigma_0\!\left(\begin{pmatrix} 5 & 1 & 1 \\ 5 & -1 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ x_1 \\ x_2 \end{pmatrix}\right)\right)$
How to find the parameters w?
Just train the model by minimizing a
corresponding loss function.
Here: binary cross-entropy since binary classification problem.
$w^* = \arg\min_w \; -\sum_{n=1}^{N} \left( y_n \log f(x_n, w) + (1 - y_n) \log\left(1 - f(x_n, w)\right) \right)$
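A minimal, self-contained sketch of this training step, assuming the $\phi$ from above, four illustrative XOR-style points, and plain full-batch gradient descent with a hand-picked learning rate:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def phi(x1, x2):
    return np.array([sigmoid(5 + x1 + x2), sigmoid(5 - x1 - x2)])

X = np.array([[-4, -4], [4, 4], [-4, 4], [4, -4]], dtype=float)
y = np.array([1.0, 1.0, 0.0, 0.0])

Phi = np.stack([phi(x1, x2) for x1, x2 in X])   # N x 2 transformed features
w, b, lr = np.zeros(2), 0.0, 1.0

for _ in range(2000):
    p = sigmoid(Phi @ w + b)                    # f(x_n, w) for all n
    grad = p - y                                # gradient of the BCE w.r.t. the logits
    w -= lr * Phi.T @ grad / len(y)
    b -= lr * grad.mean()

print(np.round(sigmoid(Phi @ w + b), 2))        # should approach y = [1, 1, 0, 0]
```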
How to find the basis functions?
Different datasets require different transformations to become (almost) linearly separable.
Idea: learn the basis functions and the weights of the logistic regression jointly from the data (end-to-end learning)
• Previously: only learning $w_{100}$ and $w_{110}$
• Now: learn all $w_{ijk}$, where $i$ = layer, $j$ = input node, $k$ = output node

$f(x, W) = \sigma_1\!\left(\begin{pmatrix} w_{100} & w_{110} \end{pmatrix} \cdot \sigma_0\!\left(\begin{pmatrix} w_{000} & w_{010} & w_{020} \\ w_{001} & w_{011} & w_{021} \end{pmatrix} \begin{pmatrix} 1 \\ x_1 \\ x_2 \end{pmatrix}\right)\right)$

$W^* = \arg\min_W \; -\sum_{n=1}^{N} \left( y_n \log f(x_n, W) + (1 - y_n) \log\left(1 - f(x_n, W)\right) \right)$

We get a simple Feed-Forward Neural Network (FFNN) with 1 hidden layer.
Note: $\sigma_0$ and $\sigma_1$ can be arbitrary activation functions.
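A small sketch of this one-hidden-layer forward pass in NumPy; the random weight values stand in for the $w_{ijk}$ that training would determine, and both activations are taken to be sigmoids here:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W0 = rng.normal(size=(2, 3))      # hidden layer: rows = hidden units, columns = [1, x1, x2]
W1 = rng.normal(size=(1, 2))      # output layer: weights w_100, w_110

def f(x1, x2, W0, W1, sigma0=sigmoid, sigma1=sigmoid):
    x = np.array([1.0, x1, x2])   # prepend the constant 1 for the bias weights
    hidden = sigma0(W0 @ x)
    return sigma1(W1 @ hidden)

print(f(0.5, -1.0, W0, W1))
```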
Making the model more complicated
Each basis function can itself be a more complicated function of the feature vector x: a function of other basis functions rather than of x directly.
By adding more hidden layers we get a "deep" neural network:
$f(x, W) = \sigma_2\!\left(W_2\, \sigma_1\!\left(W_1\, \sigma_0(W_0 x)\right)\right)$
where $W = \{W_0, W_1, W_2\}$ are the weights to be learned. The above architecture is called a Multi-Layer Perceptron (MLP), i.e. a fully-connected (feed-forward) neural network.
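A generic forward pass for such an MLP might look as follows (layer sizes, weights, and the choice of ReLU activations are illustrative; biases are omitted, as in the formula above):

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def mlp_forward(x, weights, activations):
    h = x
    for W, sigma in zip(weights, activations):   # one affine map + activation per layer
        h = sigma(W @ h)
    return h

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(1, 8))]
activations = [relu, relu, lambda a: a]          # last layer: identity, e.g. for regression

x = rng.normal(size=4)
print(mlp_forward(x, weights, activations))
```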
Activation functions
Most activation functions are applied element-wise when given a multi-dimensional input.
Softmax activation is a notable exception!
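A small numerical illustration of the difference: ReLU and sigmoid act on each entry independently, while softmax couples all entries through its normalizer (values are arbitrary):

```python
import numpy as np

a = np.array([2.0, -1.0, 0.5])

relu_a = np.maximum(a, 0.0)                           # element-wise
sigmoid_a = 1.0 / (1.0 + np.exp(-a))                  # element-wise

shifted = a - a.max()                                 # standard stability trick
softmax_a = np.exp(shifted) / np.exp(shifted).sum()   # NOT element-wise: entries sum to 1

print(relu_a, sigmoid_a, softmax_a, softmax_a.sum())
```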
Why use nonlinear activation functions?
Multiple linear layers:
$f(x, W) = W_L \left(W_{L-1} \left(\cdots (W_0 x) \cdots\right)\right) = \left(W_L W_{L-1} \cdots W_1 W_0\right) x$
results in a linear transformation!
Multiple nonlinear layers:
$f(x, W) = W_L\, \sigma_{L-1}\!\left(W_{L-1}\, \sigma_{L-2}\!\left(\cdots \sigma_0(W_0 x) \cdots\right)\right)$
For nonlinear activation functions we cannot (in general) simplify this expression.
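The collapse of purely linear layers is easy to verify numerically; the following sketch (with arbitrary shapes and random weights) checks it, and shows that inserting a ReLU breaks the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W0, W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(4, 5)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

linear_stack = W2 @ (W1 @ (W0 @ x))
single_matrix = (W2 @ W1 @ W0) @ x
print(np.allclose(linear_stack, single_matrix))      # True: same linear transformation

relu = lambda a: np.maximum(a, 0.0)
nonlinear_stack = W2 @ relu(W1 @ relu(W0 @ x))
print(np.allclose(nonlinear_stack, single_matrix))   # generally False
```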
Neural networks are universal approximators
Universal approximation theorem
An MLP with a linear output layer and one hidden layer can approximate any continuous function defined over a closed and bounded subset of $\mathbb{R}^D$, under mild assumptions on the activation function ('squashing' activation functions, e.g. sigmoid) and given that the number of hidden units is large enough.
[Cybenko 1989; Funahashi 1989; Hornik et al 1989, 1991; Hartman et al 1990].
Also in the discrete case: a neural network can approximate any function from a discrete space to another.
Good news: Regardless of the function we want to learn, there exists an MLP that can approximate that function arbitrarily well.
Bad news: We have no idea how to find this MLP that provides the best approximation.
Demo: http://playground.tensorflow.org
Multiple hidden layers
According to the universal approximation theorem, a feed-forward network with 1 hidden layer can represent any function.
Why do we add more layers? Theoretical reason:
– For some families of functions, using only a few layers would require a very large number of hidden units (and therefore parameters). We can obtain the same representational power by adding more layers with fewer hidden units and fewer parameters overall.
Practical reason:
– Deeper networks (with some additional tricks) often train faster and generalize better.
Multiple hidden layers
Functions that can be compactly represented with k layers may require exponentially many hidden units when using only k − 1 layers.
Deep network can learn a hierarchy of representations.
Different high-level features share lower-level features.
from: Understanding and Improving Deep Learning Algorithms, ML Google Distinguished Lecture, 2010
Beyond binary classification
Loss function
Neural networks can be used for various prediction tasks.
Different tasks require changing
1. the activation function in the final layer
2. the loss function
For supervised learning, common choices are:

Prediction target | p(y|x)      | Final layer | Loss function
Binary            | Bernoulli   | Sigmoid     | Binary cross entropy
Discrete          | Categorical | Softmax     | Cross entropy
Continuous        | Gaussian    | Identity    | Squared error
Example 1: Binary classification
• Data: $\{x_n, y_n\}_{n=1}^{N}$, where $y_n \in \{0, 1\}$
• Activation in the final layer: sigmoid
$f(x, W) = \frac{1}{1 + \exp(-a)}$
• Conditional distribution
$p(y \mid x) = \mathrm{Bernoulli}(y \mid f(x, W))$
• Loss function: binary cross entropy
$E(W) = -\sum_{n=1}^{N} \log p(y_n \mid x_n) = -\sum_{n=1}^{N} \left( y_n \log f(x_n, W) + (1 - y_n) \log\left(1 - f(x_n, W)\right) \right)$
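A direct transcription of this loss into NumPy (the labels and network outputs are illustrative; the small epsilon guarding against log(0) is an implementation detail, not part of the formula):

```python
import numpy as np

def binary_cross_entropy(y, f, eps=1e-12):
    f = np.clip(f, eps, 1.0 - eps)           # guard against log(0)
    return -np.sum(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))

y = np.array([1.0, 0.0, 1.0])                # illustrative labels
f = np.array([0.9, 0.2, 0.6])                # illustrative sigmoid outputs f(x_n, W)
print(binary_cross_entropy(y, f))
```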
Example 2: Multi-class classification
• Data: $\{x_n, y_n\}_{n=1}^{N}$, where $y_n \in \{0, 1\}^K$ (one-hot notation)
• Activation in the final layer: softmax
$f_k(x, W) = \frac{\exp(a_k)}{\sum_{j=1}^{K} \exp(a_j)}$
• Conditional distribution
$p(y \mid x) = \mathrm{Categorical}(y \mid f(x, W))$
• Loss function: categorical cross entropy
$E(W) = -\sum_{n=1}^{N} \log p(y_n \mid x_n) = -\sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \log f_k(x_n, W)$
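The same in code for the multi-class case, with softmax written with the usual max-subtraction for numerical stability (the logits and one-hot targets are illustrative):

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(Y, F, eps=1e-12):
    return -np.sum(Y * np.log(np.clip(F, eps, 1.0)))

A = np.array([[2.0, 0.5, -1.0],              # illustrative logits for N = 2 samples, K = 3 classes
              [0.1, 0.1, 3.0]])
Y = np.array([[1, 0, 0],                     # one-hot targets
              [0, 0, 1]], dtype=float)

print(categorical_cross_entropy(Y, softmax(A)))
```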
Example 3: Single-output regression
• Data: $\{x_n, y_n\}_{n=1}^{N}$, where $y_n \in \mathbb{R}$
• Activation in the final layer: identity (no activation)
$f(x, W) = a$
• Conditional distribution: Gaussian
$p(y \mid x) = \mathcal{N}(y \mid f(x, W), 1)$
• Loss function: squared error (a.k.a. Gaussian cross-entropy)
$E(W) = -\sum_{n=1}^{N} \log p(y_n \mid x_n) = \frac{1}{2} \sum_{n=1}^{N} \left(y_n - f(x_n, W)\right)^2 + \text{const.}$
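A quick numerical sanity check that the negative log-likelihood of a unit-variance Gaussian is the squared error up to the factor 1/2 and an additive constant (targets and predictions are illustrative):

```python
import numpy as np

y = np.array([1.2, -0.3, 2.5])               # illustrative targets
f = np.array([1.0, 0.0, 2.0])                # illustrative predictions f(x_n, W)

nll = -np.sum(-0.5 * (y - f) ** 2 - 0.5 * np.log(2 * np.pi))   # -sum_n log N(y_n | f_n, 1)
sq = 0.5 * np.sum((y - f) ** 2) + 0.5 * len(y) * np.log(2 * np.pi)
print(np.isclose(nll, sq))                   # True: same value
```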
Unsupervised deep learning
So far we have only considered neural networks for supervised learning. Unsupervised deep learning is also a very active field of research, with models such as
• Autoencoder (covered later in this course)
• Variational autoencoder (covered in our MLGS lecture)
• Generative adversarial networks (GAN; covered in our MLGS lecture)
• Unsupervised representation learning (e.g. word or node embeddings; covered in our MLGS lecture)
Choosing the loss
We are free to choose any loss that provides a useful gradient for training.
This choice includes both the loss function itself and which quantities it compares. Many model types mainly differ in their choice of loss.
Example loss functions:
• Cross entropy: For supervised classification
• Mean squared error (MSE)
• Mean absolute error (MAE): Useful if we have outliers, but gradient is independent of the distance from the optimum (advanced optimizers handle this surprisingly well, though).
• Huber loss, LogCosh: MSE at close distances, MAE far away. Combine benefits of MSE and MAE.
• Wasserstein (earth mover’s) distance, KL-divergence: For continuous distributions
Parameter learning
Minimizing the loss function
In practice E(W) is often non-convex =⇒ optimization is tricky
– a local minimum is not necessarily a global minimum.
– potentially there exist several local minima, many of which can be equivalent (see next tutorial session)
– often it is not possible to find a global minimum, nor is it necessarily useful. We may find a few local minima and pick the one with the best performance on a validation set.
Default approach:
find a local minimum by using gradient descent
$W^{(\text{new})} = W^{(\text{old})} - \tau \nabla_W E\!\left(W^{(\text{old})}\right)$
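The update rule in code, on a toy quadratic loss with a known gradient so that the behaviour is easy to verify (the loss, its minimizer, and the step size tau are all illustrative choices):

```python
import numpy as np

w_star = np.array([2.0, -1.0])               # hypothetical minimizer of the toy loss
E = lambda w: np.sum((w - w_star) ** 2)      # E(w) = ||w - w*||^2
grad_E = lambda w: 2.0 * (w - w_star)        # its gradient

w = np.zeros(2)                              # initial weights
tau = 0.1                                    # step size
for _ in range(100):
    w = w - tau * grad_E(w)                  # W_new = W_old - tau * grad E(W_old)

print(w, E(w))                               # w approaches w_star, E approaches 0
```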
How can we compute the gradient?
1) By hand: manually working out ∇W E and coding it is tricky and cumbersome (furthermore: see point 3).
2) Numeric: Can be done as
$\frac{\partial E_n}{\partial w_{ij}} = \frac{E_n(w_{ij} + \varepsilon) - E_n(w_{ij})}{\varepsilon} + O(\varepsilon)$
– Each evaluation of the above expression requires roughly $O(|W|)$ operations, where $|W|$ is the dimensionality of the weight space.
– The evaluation has to be done for each parameter independently. Therefore computing $\nabla_W E$ requires $O(|W|^2)$ operations!
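For reference, a sketch of this forward-difference scheme on a toy loss; each loss evaluation already costs on the order of $|W|$ operations, and one extra evaluation is needed per parameter, which gives the $O(|W|^2)$ total cost criticized above:

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    grad = np.zeros_like(w)
    E0 = E(w)                                # baseline loss value
    for i in range(w.size):                  # one extra loss evaluation per parameter
        w_pert = w.copy()
        w_pert[i] += eps
        grad[i] = (E(w_pert) - E0) / eps
    return grad

E = lambda w: np.sum(w ** 2)                 # toy loss with known gradient 2w
w = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(E, w), 2 * w)       # approximately equal
```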
How can we compute the gradient?
3) Symbolic differentiation: Automates essentially how you would compute the gradient function by hand.
But: “Writing down” the general expression for the gradient of every parameter is very expensive
• potentially exponentially many different cases (e.g. when having multiple layers with ReLUs)
• many terms reappear in the gradient computation for different parameters (since the function f is hierarchically constructed); these terms could be re-used to make computation faster; however symbolic differentiation does not exploit this insight
4) Automatic differentiation: e.g. backpropagation for neural networks
• Evaluate $\nabla_W E(W)$ at the current point $W$
• Evaluation in $O(|W|)$ (every "neuron" is visited only twice)
Backpropagation: Toy example
$f(x) = \frac{2}{\sin(\exp(-x))} = d(c(b(a(x))))$
Computational graph:
$a(x) = -x, \quad b(a) = \exp(a), \quad c(b) = \sin(b), \quad d(c) = \frac{2}{c}$
Backpropagation: Toy example
$f(x) = \frac{2}{\sin(\exp(-x))}$
$a(x) = -x, \quad b(a) = \exp(a), \quad c(b) = \sin(b)$
Chain rule:
$\frac{\partial f}{\partial x} = \frac{\partial d}{\partial c} \frac{\partial c}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial x}$
$\frac{\partial f}{\partial x}$ is the global derivative
$\frac{\partial d}{\partial c}, \frac{\partial c}{\partial b}, \frac{\partial b}{\partial a}, \frac{\partial a}{\partial x}$ are the local derivatives
Backpropagation in a nutshell
• Write your function as a composition of modules (what the modules are is up to you)
• Work out the local derivative of each module symbolically
• Do a forward pass for a given input x
i.e. compute the function f(x) and remember the intermediate values
• Compute the local derivatives for x
• Obtain the global derivative by multiplying the local derivatives
Work out the local derivatives
Compute the local derivative of each module symbolically
$b(a) = \exp(a) \;\Rightarrow\; \frac{\partial b}{\partial a} = \exp(a)$
$c(b) = \sin(b) \;\Rightarrow\; \frac{\partial c}{\partial b} = \cos(b)$
$d(c) = \frac{2}{c} \;\Rightarrow\; \frac{\partial d}{\partial c} = -\frac{2}{c^2}$
Forward pass
We want to compute the derivative $\frac{\partial f}{\partial x}$ at $x = -4.499$.
In the forward pass, we compute f by following the computational graph and cache (memorize) the intermediate values of a, b, c, d
$a = -x = 4.499, \quad b = \exp(a) = 90, \quad c = \sin(b) = 1, \quad d = \frac{2}{c} = 2$
Backward pass
In the backward pass, we use the cached values of $a, b, c, d$ to compute the local derivatives $\frac{\partial d}{\partial c}, \frac{\partial c}{\partial b}, \frac{\partial b}{\partial a}, \frac{\partial a}{\partial x}$:
$\frac{\partial a}{\partial x} = -1$
$\frac{\partial b}{\partial a} = \exp(a) = \exp(4.499) = 90$
$\frac{\partial c}{\partial b} = \cos(b) = \cos(90) = 0$
$\frac{\partial d}{\partial c} = -\frac{2}{c^2} = -\frac{2}{1^2} = -2$
We obtain the global derivative by multiplying the local derivatives:
$\frac{\partial f}{\partial x} = \frac{\partial d}{\partial c} \frac{\partial c}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial x} = (-2) \cdot 0 \cdot 90 \cdot (-1) = 0$
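The same four-module chain in code, but evaluated in radians at a generic test point rather than at $x = -4.499$ (the slide's clean numbers implicitly treat sin and cos in degrees); a finite difference confirms the chain-rule result:

```python
import numpy as np

def f_forward(x):
    a = -x
    b = np.exp(a)
    c = np.sin(b)
    d = 2.0 / c
    return d, (a, b, c)                # cache the intermediate values for the backward pass

def f_backward(cache):
    a, b, c = cache
    da_dx = -1.0
    db_da = np.exp(a)
    dc_db = np.cos(b)
    dd_dc = -2.0 / c ** 2
    return dd_dc * dc_db * db_da * da_dx   # product of the local derivatives (chain rule)

x = 0.3                                # arbitrary test point
d, cache = f_forward(x)
grad = f_backward(cache)

eps = 1e-6                             # finite-difference check of the result
approx = (f_forward(x + eps)[0] - d) / eps
print(grad, approx)                    # approximately equal
```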
Multiple inputs
What if a function has multiple inputs?
We compute the derivative along each of the paths
$\frac{\partial c}{\partial x} = \frac{\partial c}{\partial a} \frac{\partial a}{\partial x}, \qquad \frac{\partial c}{\partial y} = \frac{\partial c}{\partial b} \frac{\partial b}{\partial y}$
Multiple paths
What if a computational graph contains multiple paths from x to c?
We need to sum the derivatives along each of the paths
$\frac{\partial c}{\partial x} = \frac{\partial c}{\partial a} \frac{\partial a}{\partial x} + \frac{\partial c}{\partial b} \frac{\partial b}{\partial x}$
Multivariate chain rule
The generalization to an arbitrary number of paths is given by the multivariate chain rule
We need to sum the derivatives along all the paths
$\frac{\partial c}{\partial x} = \sum_{i=1}^{m} \frac{\partial c}{\partial a_i} \frac{\partial a_i}{\partial x}$
Jacobian and the gradient
Consider a function f : Rn → Rm and let a = f(x). The Jacobian is an m × n matrix of partial derivatives
$\frac{\partial a}{\partial x} = \begin{pmatrix} \frac{\partial a_1}{\partial x_1} & \cdots & \frac{\partial a_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial a_m}{\partial x_1} & \cdots & \frac{\partial a_m}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{m \times n}$
Let $g: \mathbb{R}^m \to \mathbb{R}$ and let $c = g(a)$.
The gradient $\nabla_a c \in \mathbb{R}^m$ is the transpose of the Jacobian $\frac{\partial c}{\partial a} \in \mathbb{R}^{1 \times m}$:
$\nabla_a c = \left(\frac{\partial c}{\partial a}\right)^{T} = \begin{pmatrix} \frac{\partial c}{\partial a_1} & \cdots & \frac{\partial c}{\partial a_m} \end{pmatrix}^{T}$
Note that the gradient $\nabla_a c$ has the same shape as $a$.
We are using the so-called numerator notation. Some other resources use a different (denominator) notation — be careful when using other books or slides.
Chain rule in matrix form
We can compactly represent the chain rule using Jacobian matrices
$\frac{\partial c}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial c}{\partial a_i} \frac{\partial a_i}{\partial x_j}$
We can write this in matrix form
$\frac{\partial c}{\partial x} = \frac{\partial c}{\partial a} \frac{\partial a}{\partial x}$
or equivalently
$\nabla_x c = \left(\frac{\partial a}{\partial x}\right)^{T} \nabla_a c, \qquad \text{where } \left(\frac{\partial a}{\partial x}\right)_{ij} = \frac{\partial a_i}{\partial x_j}$
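A small numerical check of this matrix form of the chain rule, using a toy composition with easily written Jacobians (the functions and values are illustrative):

```python
import numpy as np

# a = W x  (so the Jacobian da/dx is just W), and c = sum(a^2)  (so dc/da = 2 a^T)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
x = rng.normal(size=2)

a = W @ x
dc_da = (2.0 * a)[None, :]            # 1 x m Jacobian of c w.r.t. a
da_dx = W                             # m x n Jacobian of a w.r.t. x
dc_dx = dc_da @ da_dx                 # chain rule in matrix form, shape 1 x n

# finite-difference check
eps = 1e-6
c = np.sum(a ** 2)
approx = np.array([(np.sum((W @ (x + eps * e)) ** 2) - c) / eps for e in np.eye(2)])
print(dc_dx.ravel(), approx)          # approximately equal
```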
Computational graph of a neural network
Consider a simple regression problem: $x \in \mathbb{R}^D$, $y \in \mathbb{R}$, squared-error loss. Generate predictions with a feed-forward NN:
$a = W x + b$
$h = \sigma(a)$
$\hat{y} = V h + c$
$E = (\hat{y} - y)^2$
In order to optimize $W$ with gradient descent, we need to compute $\frac{\partial E}{\partial W}$.
Does the chain rule even make sense in this scenario?
$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h} \frac{\partial h}{\partial a} \frac{\partial a}{\partial W}$
What does it mean to take the derivative of a vector $a$ w.r.t. a matrix $W$?
Matrix calculus
What is the derivative of a vector $a \in \mathbb{R}^H$ w.r.t. a matrix $W \in \mathbb{R}^{H \times D}$?

Input is a ... | Output is a scalar | Output is a vector | Output is a matrix
scalar         | scalar             | vector             | matrix
vector         | vector             | matrix             | 3-way tensor
matrix         | matrix             | 3-way tensor       | 4-way tensor

That means, $\frac{\partial a}{\partial W}$ is a 3-way tensor of shape $(H \times H \times D)$, with entries $\left(\frac{\partial a}{\partial W}\right)_{ijk} = \frac{\partial a_i}{\partial W_{jk}}$.
Accumulating the gradient
The local derivatives (i.e. Jacobians) such as $\frac{\partial a}{\partial W}$ are large multidimensional tensors.
Materializing (computing) them is expensive and unnecessary. In reality, we only care about the global derivative!
$\cdots \to x \to y \to \cdots \to E$
Assume that we know $\frac{\partial E}{\partial y}$ and want to compute $\frac{\partial E}{\partial x}$.
Most of the time, the vector-Jacobian product
$\frac{\partial E}{\partial x} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial x}$
can be computed efficiently without materializing the Jacobian $\frac{\partial y}{\partial x}$!
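A concrete illustration with an element-wise sigmoid: its Jacobian is diagonal, so the required product reduces to an element-wise multiplication and the full n x n Jacobian never has to be built (sizes and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
dE_dy = rng.normal(size=5)                    # incoming global derivative (row vector)

y = 1.0 / (1.0 + np.exp(-x))                  # sigma(x), element-wise
local = y * (1.0 - y)                         # diagonal of the Jacobian dy/dx

dE_dx_cheap = dE_dy * local                   # O(n) vector-Jacobian product

J = np.diag(local)                            # materializing the Jacobian: O(n^2) memory
dE_dx_full = dE_dy @ J
print(np.allclose(dE_dx_cheap, dE_dx_full))   # True
```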
Accumulating the gradient: Implementation
Backpropagation is typically implemented using the following abstraction.
$\cdots \to x \to y \to \cdots \to E$
Each module in the computational graph defines two functions:
forward$(x)$: given input $x$, compute the output $y$
backward$\left(\frac{\partial E}{\partial y}\right)$: given the incoming global derivative $\frac{\partial E}{\partial y}$, compute the product $\frac{\partial E}{\partial x} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial x}$
Then, $\frac{\partial E}{\partial x}$ is passed as input to the parent node in the computational graph, etc.
Example: Affine layer
Back to our problem of finding
$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h} \frac{\partial h}{\partial a} \frac{\partial a}{\partial W}$
An affine layer is defined as
$a = W x + b$, where $x \in \mathbb{R}^D$, $W \in \mathbb{R}^{H \times D}$, $b \in \mathbb{R}^H$, $a \in \mathbb{R}^H$
forward$(W, x, b)$: compute $W x + b$
backward$\left(\frac{\partial E}{\partial a}\right)$: compute $\frac{\partial E}{\partial a} \frac{\partial a}{\partial W}$, $\;\frac{\partial E}{\partial a} \frac{\partial a}{\partial x}$, $\;\frac{\partial E}{\partial a} \frac{\partial a}{\partial b}$
How can we implement this efficiently?
Step 1: work out the scalar derivative
First, let's find $\frac{\partial E}{\partial W_{kl}}$ for some $k, l$:
$\frac{\partial E}{\partial W_{kl}} = \sum_i \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial W_{kl}}$
$\quad = \sum_i \frac{\partial E}{\partial a_i} \frac{\partial (W x + b)_i}{\partial W_{kl}}$
$\quad = \sum_i \frac{\partial E}{\partial a_i} \frac{\partial (W_{i:} x)}{\partial W_{kl}}$
$\quad = \sum_i \frac{\partial E}{\partial a_i} \frac{\partial \left(\sum_j W_{ij} x_j\right)}{\partial W_{kl}}$
$\quad = \sum_{ij} \frac{\partial E}{\partial a_i} \frac{\partial (W_{ij} x_j)}{\partial W_{kl}}$
$\quad = \frac{\partial E}{\partial a_k} x_l$
Step 2: backward pass in matrix form
For each element $W_{kl}$ we have $\frac{\partial E}{\partial W_{kl}} = \frac{\partial E}{\partial a_k} x_l$.
We can compute $\frac{\partial E}{\partial W_{kl}}$ for all elements $k, l$ in matrix form:
$\frac{\partial E}{\partial W} = \begin{pmatrix} \frac{\partial E}{\partial W_{11}} & \cdots & \frac{\partial E}{\partial W_{1D}} \\ \vdots & \ddots & \vdots \\ \frac{\partial E}{\partial W_{H1}} & \cdots & \frac{\partial E}{\partial W_{HD}} \end{pmatrix} = \begin{pmatrix} \frac{\partial E}{\partial a_1} \\ \vdots \\ \frac{\partial E}{\partial a_H} \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_D \end{pmatrix}$
Or more compactly:
$\frac{\partial E}{\partial W} = \left(\frac{\partial E}{\partial a}\right)^{T} x^{T}$
Matrix operations are much more efficient than for-loops thanks to vectorization (especially on GPUs).
Example: Affine layer
An affine layer is defined as $a = W x + b$.
forward$(W, x, b)$: compute $W x + b$
backward$\left(\frac{\partial E}{\partial a}\right)$: compute
$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial a} \frac{\partial a}{\partial W} = \left(\frac{\partial E}{\partial a}\right)^{T} x^{T}$
$\frac{\partial E}{\partial x} = \frac{\partial E}{\partial a} \frac{\partial a}{\partial x} = \frac{\partial E}{\partial a}\, W$
$\frac{\partial E}{\partial b} = \frac{\partial E}{\partial a} \frac{\partial a}{\partial b} = \frac{\partial E}{\partial a}$
The derivatives for $x$ and $b$ are obtained similarly to the one for $W$.
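Putting the pieces together, here is a sketch of the forward/backward abstraction for the affine layer using exactly these matrix formulas (numerator notation, with the incoming $\frac{\partial E}{\partial a}$ kept as a 1 x H row vector); the toy loss, shapes, and the finite-difference check are illustrative additions:

```python
import numpy as np

class Affine:
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        self.x = x                              # cache the input for the backward pass
        return self.W @ x + self.b

    def backward(self, dE_da):                  # dE_da: incoming global derivative, shape (1, H)
        dE_dW = dE_da.T @ self.x[None, :]       # (dE/da)^T x^T  -> shape (H, D)
        dE_dx = dE_da @ self.W                  # (dE/da) W      -> shape (1, D)
        dE_db = dE_da                           # (dE/da)        -> shape (1, H)
        return dE_dW, dE_dx, dE_db

rng = np.random.default_rng(0)
D, H = 4, 3
layer = Affine(rng.normal(size=(H, D)), rng.normal(size=H))
x = rng.normal(size=D)
y = rng.normal(size=H)                          # target for the toy loss E = ||a - y||^2

a = layer.forward(x)
E = np.sum((a - y) ** 2)
dE_da = (2.0 * (a - y))[None, :]                # gradient of the toy loss, kept as a row vector
dE_dW, dE_dx, dE_db = layer.backward(dE_da)

# finite-difference check of a single entry of dE/dW
eps = 1e-6
W_pert = layer.W.copy()
W_pert[1, 2] += eps
E_pert = np.sum((W_pert @ x + layer.b - y) ** 2)
print(dE_dW[1, 2], (E_pert - E) / eps)          # approximately equal
```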
Backpropagation: Summary
• Define the computation as a composition of modules
• The forward function computes the output of the module
• The backward function accumulates the total gradient
• We can implement backward without materializing the Jacobian
• Backpropagation walks backward through the computational graph, accumulating the product of gradients
Reading material
• Goodfellow, Deep Learning: Chapter 6
Acknowledgements
• Some slides based on an earlier version by Patrick van der Smagt
• Backpropagation slides are based on the materials from mlvu.github.io