Machine Learning and Data Mining in Business
Lecture 10: Deep Feedforward Networks
Discipline of Business Analytics
Lecture 10: Deep Feedforward Networks
Learning objectives
• Representation learning.
• Deep feedforward networks.
Lecture 10: Deep Feedforward Networks
1. Representation learning
2. Deep feedforward networks
Representation learning
Representation learning
• Deep learning is based on an approach to machine learning known as representation learning.
• Let’s start by reviewing the challenges that motivate this approach.
Linear models
Linear models such as a linear regression
f(x; α, β) = α + x⊤β are useful for prediction in many applications.
Note: we will need vector notation to discuss neural networks.
Linear models
• Some important advantages of linear models are interpretability and the fact we can efficiently and reliably fit them with convex optimisation.
• However, linear models do not generalise well if the relationship between the output and inputs cannot be reasonably approximated by a linear function.
Linear basis expansions
To extend linear models to fit nonlinear functions, we can specify a linear model for a transformed input
f(x; α, β) = α + φ(x)⊤β,
where φ : R^p → R^M is a mapping from x to an extended set of features that provide a useful representation of the inputs.
For example, we implement a polynomial regression with a scalar input by choosing φ(x) = (x, x², x³, …).
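To make this concrete, here is a minimal numpy sketch (not from the lecture) that builds the basis φ(x) = (x, x², x³) and fits the coefficients by least squares; the toy data are illustrative only.

```python
import numpy as np

# Toy data: a nonlinear relationship that a plain linear model would miss.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=100)
y = np.sin(x) + 0.1 * rng.standard_normal(100)

# Basis expansion phi(x) = (x, x^2, x^3) applied to each observation.
Phi = np.column_stack([x, x**2, x**3])

# Fit f(x) = alpha + phi(x)'beta by ordinary least squares.
X = np.column_stack([np.ones_like(x), Phi])        # prepend intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta = coef[0], coef[1:]
y_hat = alpha + Phi @ beta                          # fitted values
```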
Linear basis expansions
f(x; α, β) = α + φ(x)⊤β
How do we choose φ?
• One option is to design suitable features for each problem, as
we did in generalised additive modelling.
• The limitation is that it can be hard to design useful representations for complex problems. For example, which features should interact?
Linear basis expansions
f(x; α, β) = α + φ(x)⊤β
• Another strategy is to use a very flexible representation φ(x).
• Kernel machines such as support vector machines (SVMs) and Gaussian processes (not covered in this unit) implicitly use an infinite number of basis functions.
• However, a representation that is too generic is subject to the curse of dimensionality.
Representation learning
The strategy of representation learning methods is to learn φ. We consider the model
f(x; α, β, θ) = α + φ(x; θ)⊤β
where the parameter vector θ allows us to learn features from a broad range of possible representations, together with the remaining parameters α and β.
Representation learning
f(x; α, β, θ) = α + φ(x; θ)⊤β
This approach is very flexible, but sacrifices the convexity of the problem.
Example: projection pursuit regression
In a projection pursuit regression, we consider the model
f(x) = α + ∑_{m=1}^{M} g_m(b_m + x⊤w_m),
where we learn b_m, w_m and g_m jointly from data. We use univariate smoothers such as penalised splines to fit each g_m.
Example: projection pursuit regression
Projection pursuit regression:
f(x) = α + ∑_{m=1}^{M} g_m(b_m + x⊤w_m)
• We can view the projection pursuit regression as a GAM based on learned features.
• At the same time, a projection pursuit regression can learn a much richer class of functions than a standard GAM.
• Unfortunately, this model is difficult to train in practice.
Single layer perceptron
The single layer perceptron (SLP) is the simplest neural network model. We can write the model (for regression) as
h_m = g(b_m + ∑_{j=1}^{p} w_{mj} x_j),  m = 1, …, M,
f(x) = α + ∑_{m=1}^{M} β_m h_m,
where the first equation describes the hidden layer, the second describes the output layer, h_1, …, h_M are known as hidden units, and g is a nonlinear activation function.
Example: single layer perceptron
Figure from ISL.
Single layer perceptron
h_m = g(b_m + ∑_{j=1}^{p} w_{mj} x_j),  m = 1, …, M,
f(x) = α + ∑_{m=1}^{M} β_m h_m.
• In the SLP, each feature h_m is a nonlinear transformation of a linear transformation of the inputs. The activation g is typically a simple pre-specified function.
• The output layer is linear in the derived features h_1, …, h_M.
• It turns out that this model is flexible enough to approximate any reasonable function, at least in principle!
When discussing neural networks, we will write a model such as the SLP
h_m = g(b_m + ∑_{j=1}^{p} w_{mj} x_j),  m = 1, …, M,
f(x) = α + ∑_{m=1}^{M} β_m h_m,
in matrix notation as
h = g(b + Wx),
f(x) = α + h⊤β.
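A minimal numpy sketch of this forward pass, h = g(b + Wx) followed by f(x) = α + h⊤β. The ReLU activation and the random parameter values are illustrative assumptions; in practice the parameters are learned from data.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def slp_forward(x, b, W, alpha, beta, g=relu):
    """Single layer perceptron: h = g(b + Wx), then f(x) = alpha + h'beta."""
    h = g(b + W @ x)          # hidden layer: M derived features
    return alpha + h @ beta   # linear output layer

# Toy example with p = 4 inputs and M = 3 hidden units; in practice the
# parameters below would be learned rather than drawn at random.
rng = np.random.default_rng(1)
p, M = 4, 3
x = rng.standard_normal(p)
W = rng.standard_normal((M, p))
b = rng.standard_normal(M)
beta = rng.standard_normal(M)
alpha = 0.5
print(slp_forward(x, b, W, alpha, beta))
```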
Representation learning
Image credit: Deep Learning by Goodfellow, I., Bengio, Y. and A. Courville.
Deep feedforward networks
Deep learning
Deep learning is a class of representation learning methods that, at the most basic level, learn complex representations by building them out of simpler representations.
Deep learning
Image credit: M. , PNAS January 22, 2019 116 (4) 1074-1077.
Deep learning
Image credit: Deep Learning by Goodfellow, I., Bengio, Y. and A. Courville.
Deep feedforward networks
Deep feedforward networks, also called feedforward neural networks or multilayer perceptrons (MLPs), specify a composition of many functions, such as
f(x) = f(3)(f(2)(f(1)(x))).
In this model, we refer to f(1) as the first layer, f(2) as the second layer, and so on. The final layer is the output layer. The number of layers is the depth of the architecture.
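As an illustration, a short PyTorch sketch of such a composition with two hidden layers and a linear output; the input dimension and layer widths are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

# f(x) = f3(f2(f1(x))): two hidden layers followed by a linear output layer.
# The input dimension (4) and layer widths (16, 8) are arbitrary for the example.
model = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),   # f1: first hidden layer
    nn.Linear(16, 8), nn.ReLU(),   # f2: second hidden layer
    nn.Linear(8, 1),               # f3: linear output layer (regression)
)

x = torch.randn(32, 4)             # a batch of 32 inputs with 4 features each
print(model(x).shape)              # torch.Size([32, 1])
```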
Deep feedforward networks
Image credit: https://www.ibm.com/cloud/learn/neural-networks
Deep feedforward networks
• Because the training data do not tell us the desired output for the intermediate layers, these are called hidden layers.
• The output of each hidden layer is a vector. Each element of the vector is called a hidden unit or neuron.
• The number of units is the width of the model.
Deep feedforward networks
When designing a deep forward network, we need to specify:
• The network architecture, including the number of layers, the number of units in each layer, and how the layers should be connected to each other.
• How to compute the hidden units.
• The output unit.
Deep feedforward networks
Figure from ISL.
Hidden layers
Let h(l−1) denote the output from the previous layer (equal to x for the first layer), which becomes the input for the current layer. The hidden units have the form
h(l) = g(l)(b(l) + W(l) h(l−1)),
where g(l) is the activation function for the layer.
We refer to the elements of W(l) as the weights and the elements of b(l) as the biases of the layer.
Hidden layers
That is, the first layer in a deep feedforward network is
h(1) = g(1)(b(1) + W(1) x),
the second layer is
h(2) = g(2)(b(2) + W(2) h(1)),
and so on, with the depth of the network and the width of each layer to be decided.
Hidden layers
Let M(l) denote the number of units in layer l. We can write the hidden units element-wise as
h_j(l) = g(l)(b_j(l) + ∑_{m=1}^{M(l−1)} w_jm(l) h_m(l−1)),
for j = 1, …, M(l), where w_jm(l) denotes the element in row j and column m of W(l).
We refer to this type of layer as a dense or fully connected layer, since each unit depends on all of the layer's inputs.
Illustration: feedforward network with two hidden layers
Image credit: Hundred Page Machine Learning Book
Hidden units
We use the following notation for the pre-activation units
z(l) = b(l) + W(l) h(l−1),  h(l) = g(l)(z(l)).
Sometimes, we will drop the superscript indicating the layer for simplicity.
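A minimal numpy sketch of this layer-by-layer computation, z(l) = b(l) + W(l) h(l−1) and h(l) = g(l)(z(l)); the layer sizes, random parameters, and choice of ReLU activations are illustrative only.

```python
import numpy as np

def forward(x, params, activations):
    """Layer-by-layer pass: z(l) = b(l) + W(l) h(l-1), h(l) = g(l)(z(l))."""
    h = x
    for (W, b), g in zip(params, activations):
        z = b + W @ h      # pre-activation of layer l
        h = g(z)           # output of layer l
    return h

relu = lambda z: np.maximum(0.0, z)
identity = lambda z: z

# Two hidden layers and a linear output; sizes and parameters are toy values.
rng = np.random.default_rng(2)
sizes = [4, 8, 8, 1]   # input dimension, two hidden widths, output dimension
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
print(forward(rng.standard_normal(4), params, [relu, relu, identity]))
```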
Rectified linear units
The standard choice of activation function for modern neural networks is the rectified linear unit (ReLU) activation,
g(z) = 0 if z < 0 and g(z) = z if z ≥ 0; that is, g(z) = max(0, z).
Rectified linear units
Logistic sigmoid activation
Before the introduction of ReLUs, most neural networks were based on the logistic sigmoid activation function,
g(z) = σ(z) = 1 / (1 + exp(−z)).
Activation functions
Rectified linear units
The ReLU has important advantages over the sigmoid activation:
• We can compute it more efficiently.
• We can store it more efficiently.
• It makes the learning algorithm less prone to the vanishing gradient problem.
Activation functions
Image credit: Murphy (2021)
Vanishing gradient problem
• The sigmoid activation has the problem that it saturates for values of z away from zero, which means that it becomes very flat for low or high values of z.
• Since the units are not sensitive to input in these regions, gradient-based learning becomes extremely slow to update the parameters.
• This problem becomes worse in deep networks, since we end up multiplying many such small derivatives when training (one for each layer).
Rectified linear units
• ReLUs still have the drawback that gradient-based learning gets no information from training examples with zero activation.
• If the activation becomes zero for every training instance, then the unit gets stuck at zero. This issue is known as the dying ReLUs problem.
Leaky ReLU and PReLU
We can mitigate these issues by generalising the rectified linear unit as
g(z, α) = max(0, z) + α min(0, z), for a slope parameter α.
A leaky ReLU fixes α to a small value, while a parametric ReLU (PReLU) treats the slope as a parameter to be learned.
ReLU, Leaky ReLU, and ELU
Exponential linear unit
The exponential linear unit (ELU) activation is
g(z) = α(exp(z) − 1) if z < 0 and g(z) = z if z ≥ 0,
where α is a hyperparameter.
Exponential linear unit
• Like the leaky ReLU and PReLU, the ELU has a nonzero gradient for z < 0, which avoids the problem of dead units.
• Further, if α = 1 the function is smooth everywhere, which helps to speed up gradient-based learning.
• On the other hand, the ELU is slower to compute.
Scaled exponential linear unit
The scaled exponential linear unit (SELU) is
g(z) = λα(exp(z) − 1) if z < 0 and g(z) = λz if z ≥ 0,
where λ and α are pre-defined scalars.
Scaled exponential linear unit
• We can show that, by setting α and λ to carefully chosen values, this activation function ensures that the output of each layer is standardised (provided the input is also standardised).
• This can help with model fitting.
Swish and GELU
The swish activation is
swish(z; β) = zσ(βz).
The GELU (Gaussian error linear unit) is GELU(z) = zΦ(z),
where Φ is the CDF of the standard normal distribution.
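For reference, a numpy sketch of the activation functions discussed above; the SELU constants shown are the commonly quoted values of λ and α, and the default slopes are illustrative choices.

```python
import numpy as np
from scipy.stats import norm   # Gaussian CDF, used for the exact GELU

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def elu(z, alpha=1.0):
    return np.where(z < 0, alpha * (np.exp(z) - 1.0), z)

def selu(z, lam=1.0507, alpha=1.6733):
    # lam and alpha are the commonly quoted self-normalising constants
    return lam * np.where(z < 0, alpha * (np.exp(z) - 1.0), z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z, beta=1.0):
    return z * sigmoid(beta * z)

def gelu(z):
    return z * norm.cdf(z)

z = np.linspace(-3.0, 3.0, 7)
for f in (relu, leaky_relu, elu, selu, swish, gelu):
    print(f"{f.__name__:<11}", np.round(f(z), 3))
```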
Activations
Image credit: Murphy (2021)
Output layer
The output layer has the form
o = g(L)(b(L) + W(L) h(L−1)),
where L is the number of layers, g(L) is the activation function for the output layer and h(L−1) is the output from the last hidden layer.
For a scalar output, this simplifies to
o = g(L)(b(L) + w(L)⊤h(L−1)).
Output layer: regression
The output layer depends on the task. In regression, the output layer is a linear layer
o = b(L) + w(L)⊤h(L−1),
where h(L−1) are the features constructed by the network.
An equivalent way to write it is
o = b(L) + W(L) h(L−1),
viewing the output weights as a 1 × M(L−1) matrix W(L) = w(L)⊤.
Output unit: binary classification
In binary classification, we use the sigmoid function as the activation function
o = σ(b(L) + w(L)⊤h(L−1)).
Output unit: multiclass classification
In multiclass classification, we first compute
z = b(L) + W(L) h(L−1),
where the dimension of the output z is the number of classes C.
We then use the softmax function
S(z) = ( exp(z_1) / ∑_{c=1}^{C} exp(z_c), …, exp(z_C) / ∑_{c=1}^{C} exp(z_c) )
to exponentiate and normalise z to obtain the conditional class probabilities p(y = c|x).
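A minimal numpy sketch of the softmax computation; subtracting max(z) before exponentiating is a standard trick that avoids numerical overflow and does not change the result.

```python
import numpy as np

def softmax(z):
    """Map a vector of scores z to probabilities that sum to one."""
    z = z - np.max(z)          # shift-invariance of softmax; prevents overflow
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])  # toy output-layer scores for C = 3 classes
p = softmax(z)
print(p, p.sum())               # class probabilities and their total (1.0)
```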
Multilayer perceptron
Figure from ISL.
Universal approximation
• The universal approximation theorem says that a feedforward network with a linear output layer, at least one hidden layer, and a sufficient number of units can approximate any continuous function with arbitrary accuracy.
• However, this result does not mean that neural networks can learn any function in practice. For example, the required number of hidden units may be infeasibly large.
The importance of depth
• Using deeper models can substantially reduce the total number of units required to represent a function. It’s been shown that deep networks have an exponential advantage over single layer networks for certain classes of functions.
• Empirically, deep models lead to better generalisation for a wide variety of tasks.
Training deep models
• Training deep neural networks is a complex task because of the non-convexity of the objective function and the typically large number of parameters.
• The optimisation is based on variants of stochastic gradient descent together with the method of backpropagation to compute the gradient.
• In the case of an MLP, backpropagation corresponds to the repeated application of the chain rule for differentiation.
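To illustrate, a minimal PyTorch training sketch in which backpropagation (loss.backward()) and a stochastic gradient descent update (optimiser.step()) are applied to mini-batches; the toy data, architecture, and hyperparameters are illustrative assumptions, not a recommended setup.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(256, 4)
y = torch.sin(X[:, :1]) + 0.1 * torch.randn(256, 1)   # toy regression target
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
optimiser = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for epoch in range(200):
    for xb, yb in loader:
        optimiser.zero_grad()
        loss = loss_fn(model(xb), yb)   # mean squared error on the mini-batch
        loss.backward()                 # backpropagation: gradient via the chain rule
        optimiser.step()                # stochastic gradient descent update
```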
Deep learning
Other important types of neural networks are:
• Convolutional networks, which lead to significant advances in computer vision.
• Recurrent neural networks, which can process sequential data.
• Transformers, initially developed for NLP applications but increasingly used in other domains as well.