QBUS 6840 Lecture 10 & 11 Predictive Analytics with Neural Networks and Deep Learning I & II
The University of Sydney Business School
Introduction and Neural Networks Architecture
Neural Networks for cross-sectional data
Deep Structure in Neural Networks
Teaching slides; and
a comprehensive book, Deep Learning by Goodfellow, Bengio and Courville, freely available at https://www.deeplearningbook.org
Learning objectives
Understand the importance of data representation in data
analysis, and that neural network modeling and deep learning
are efficient data representation tools
Understand some basic concepts of neural network (NN) and
deep learning (DL)
Know the methods used to train/estimate a neural network
model, and the difficulties in training
Know how to use a neural network for prediction with
cross-sectional data and time series data
Know how to use NN & DL in a business predictive analytics
context. Most texts on DL use computer science terminology;
this lecture looks at DL from a statistician's perspective.
Introduction
In regression modelling, it is sometimes advisable to add
interaction terms Xi × Xj or quadratic terms Xi^2 to the model.
These terms are examples of non-linear effects: when
appropriate non-linear effect terms are added into the
regression/classification model, the prediction accuracy often improves.
How do we select non-linear effect terms? When should they be added?
Sometimes this can be done manually, but it requires
domain knowledge and trial and error: not efficient and not
always possible!
A Simple Example
Let's look at the Direct Marketing dataset
(DirectMarketing.csv, provided on Canvas)
There are 11 covariates in total. The response is AmountSpent
Let's use the first 900 observations as training data and the
remaining 100 as test data (in practice, the data should be shuffled first)
A Simple Example
The MSE of the prediction on the test data D_test is defined as
MSE = (1 / N_test) Σ_{i ∈ D_test} (ŷi − yi)^2
To ease comparison, let's use the square root, √MSE, to get back to the original scale ($).
First, try the full linear regression model:
Lecture10_Example01.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# import data
DM = pd.read_csv('DirectMarketing.csv')
DM = DM.dropna()  # drop all NaNs
n = 900; ntest = 1000 - n
lm = smf.ols('AmountSpent ~ Children + Catalogs + Salary + Gender_b + Married_b + Location_b \
             + Ownhome_b + Age_y + Age_m + Hist_m + Hist_h', DM.head(n)).fit()
predictions = lm.predict(DM.tail(ntest))
DM = DM.values; DM = DM.astype(float); ytest = DM[n:1001, 11]
MSE_lm = np.mean((predictions - ytest)**2)
print('Root of MSE on the test data for linear regression:', np.sqrt(MSE_lm))
Root of MSE on the test data for linear regression: 1530.9765347841285
A better linear regression model:
Lecture10_Example02.py
DM = pd.read_csv('DirectMarketing.csv')
lm = smf.ols('AmountSpent ~ Children + Catalogs*Salary*Salary + Location_b \
             + Hist_m', DM.head(n)).fit()
lm.summary()
predictions = lm.predict(DM.tail(ntest))
DM = DM.values  # pd.DataFrame.as_matrix() is removed in recent pandas; .values is equivalent
DM = DM.astype(float)
ytest = DM[n:1001, 11]
MSE_lm = np.mean((predictions - ytest)**2)
print('Root of MSE on the test data for linear regression:', np.sqrt(MSE_lm))
Root of MSE on the test data for linear regression: 1526.2875712736322
Now use a neural network model:
Lecture10_Example03.py
import time
import numpy as np
import tensorflow as tf
# X_train, y_train, X_test, y_test are assumed to have been prepared already
# (one possible preparation is sketched after this example)
# now build the neural net model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(11, input_dim=11, activation='relu'))
# the first hidden layer has 11 units, the input has 11 covariates
# model.add(tf.keras.layers.Dense(11, activation='relu'))  # add another hidden layer with 11 units
model.add(tf.keras.layers.Dense(1, activation='linear'))  # the output layer has 1 unit
                                                          # with the linear activation
# compiling the model for use
start = time.time()
model.compile(loss='MSE', optimizer='adam')
print("Compilation Time:", time.time() - start)
# fit the model
model.fit(X_train, y_train, epochs=100, batch_size=10, verbose=2, validation_split=0.05)
# evaluate the model
MSE_nn = model.evaluate(X_test, y_test)
print('\nRoot of MSE on the test data for neural net:', np.sqrt(MSE_nn))
Root of MSE on the test data for neural net: 0.24108619398735537
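The Keras example above assumes that arrays X_train, y_train, X_test and y_test already exist. The lecture file does not show that preprocessing step; a plausible sketch (an assumption, not the lecture's exact code) standardises both the covariates and the response, which would also explain why the reported root MSE (about 0.24) is on the standardised scale rather than in dollars:

from sklearn.preprocessing import StandardScaler

# Assumed preparation: DM is the cleaned numeric array from the earlier
# examples, with the 11 covariates in the first 11 columns and AmountSpent
# in column 11, and n = 900 training observations.
X, y = DM[:, :11], DM[:, 11]
X_train, X_test = X[:n], X[n:]
y_train, y_test = y[:n], y[n:]

scaler_X = StandardScaler().fit(X_train)
scaler_y = StandardScaler().fit(y_train.reshape(-1, 1))
X_train, X_test = scaler_X.transform(X_train), scaler_X.transform(X_test)
y_train, y_test = scaler_y.transform(y_train.reshape(-1, 1)), scaler_y.transform(y_test.reshape(-1, 1))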
A Simple Example
So for this dataset, which model is better in terms of
prediction accuracy?
Neural Networks
Neural networks and deep neural networks (called deep
learning) have become an exciting research and application
area in the last few years
Deep learning is widely known for its high prediction accuracy
It has been successfully applied to many large-scale industry
problems, e.g. image recognition and language processing
Its secret is Data Representation Learning
Representation Learning
We want to predict a response Y, based on raw/original
covariates X = (X1, …, Xp)^T, using linear regression modelling
Usually, before doing regression modelling, some appropriate
transformation of the covariates Xi is needed: Z1 = φ1(X),
…, Zd = φd(X).
The Zi are called predictors or features.
Then we model
E(Y | X) = β0 + β1Z1 + · · · + βdZd = β0 + β^T Z
where Z = (Z1, …, Zd)^T is a representation of
X = (X1, …, Xp)^T and β = (β1, …, βd)^T
A better representation (in terms of predicting Y ) leads to a
better prediction accuracy
Selection of the transformations φi (X ) is an art!
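As a small, purely hypothetical illustration (the variable names and transformations below are invented for this example, not taken from the lecture), a hand-crafted representation might be built like this; the appeal of a neural network is that it learns such a mapping φ automatically:

import numpy as np

def manual_representation(X):
    # X is an N x p design matrix of raw covariates; the transformations
    # below are illustrative choices only (the "art" the slide refers to).
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    Z = np.column_stack([
        x1,                     # keep an original covariate
        x1 * x2,                # interaction term
        x3 ** 2,                # quadratic term
        np.log1p(np.abs(x2)),   # a non-linear transform
    ])
    return Z                    # features Z = phi(X) to feed into a linear model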
Representation Learning
Neural network modeling is a representation learning method.
It provides an efficient way to design a representation
Z = φ(X ) that is effective for predicting the response Y .
What are neural networks?
They are a set of very flexible non-linear methods for
regression/classification and other tasks.
A neural network, also called an artificial neural network (ANN),
is a computational model inspired by the network of
neurons in the human brain
What are neural networks?
A neural network is an interconnected assembly of simple
processing units or neurons, which communicate by sending
signals to each other over weighted connections
A neural network is made of layers of similar neurons: an
input layer, (one or many) hidden layers, and an output layer.
The input layer receives data from outside the network. The
output layer sends data out of the network. Hidden layers
receive/process/send data within the network.
A neural network is said to be deep, if it has many hidden
layers. Deep neural network modelling is collectively referred
to as deep learning.
What are neural networks?
In a nutshell, a neural net is a multivariate function: output η
is a function of the inputs X = (X1, …,Xp)
η = f (X ) = f (X1, …,Xp)
More precisely, this function is a layered composite function:
Z^(1) = f1(Z^(0)), where Z^(0) = X
Z^(2) = f2(Z^(1))
...
Z^(L) = fL(Z^(L−1))
η = f_{L+1}(Z^(L))
What are neural networks?
A neural network provides a mechanism for functional
approximation
Suppose that ftrue(X ) is a true, yet unknown, function that we
want to estimate. E.g.,
ftrue(X ) = E(Y |X )
the conditional mean of a response Y given X
A neural net with the output η = f (X ) provides an
approximation of ftrue(X ), i.e. we use f (X ) to approximate
ftrue(X ).
Variants of neural networks
The network structure considered so far is often called a
feed-forward neural network; such networks are most suitable for
cross-sectional data, but can be used for time series data too.
Later you will study recurrent neural networks, which are most
suitable for time series data.
Elements of a neural network (a generic layer connection)
Elements of a neural network
A (feedforward) neural net includes
a set of processing units (also called neurons, nodes)
weights wik, which are the connection strengths from unit i to unit k
a bias w0k, which is the extra bias at unit k (may be 0)
a propagation rule that determines the total input Sk of unit
k, from the units that send information to unit k
the output Zk for each unit k , which is a function of the input
an activation function hk that determines the output Zk based
on the input Sk , Zk = hk(Sk)
Elements of a neural network
It’s useful to distinguish three types of units:
input units (often denoted by X ): receive data from outside
the network
hidden units (often denoted by Z ): receive data from and
send data to units within the network.
output units: send data out of the network. The type of the
output depends on the task (regression, binary classification or
multinomial regression). In many cases, there is only one
scalar output unit.
Given the signal from a set of inputs X , a NN produces an output.
Elements of a neural network
The total input sent to unit k is
Sk = Σ_i wik Zi + w0k
which is a weighted sum of the outputs from all units i that
are connected to unit k, plus a bias/intercept term w0k
Then, the output of unit k is
Zk = hk(Sk) = hk( Σ_i wik Zi + w0k )
Usually, we use the same activation function hk = h for all units
Elements of a neural network
Popular activation functions:
Sigmoid activation function: h(S) = σ(S) = 1 / (1 + e^(−S))
Tanh activation function: h(S) = tanh(S) = (e^S − e^(−S)) / (e^S + e^(−S))
Rectified linear activation function (ReLU):
h(S) = ReLU(S) = max(0, S)
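A minimal numpy sketch of these activations and of a single unit's output Zk = h(Σ_i wik Zi + w0k); the numbers at the end are made up purely for illustration:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def tanh(s):
    return np.tanh(s)

def relu(s):
    return np.maximum(0.0, s)

def unit_output(z_in, w, w0, h=relu):
    # Z_k = h(S_k), with total input S_k = sum_i w_ik * Z_i + w_0k
    s = np.dot(w, z_in) + w0
    return h(s)

z_in = np.array([1.0, -2.0, 0.5])   # outputs Z_i from the sending units
w = np.array([0.2, 0.1, -0.4])      # weights w_ik into unit k
print(unit_output(z_in, w, w0=0.3, h=sigmoid))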
Linear Regression Model
Consider three features (regressors/predictors) x1, x2 and x3,
for one dependent/response variable y
The linear regression model is
y = w1x1 + w2x2 + w3x3 + b
Let's add an intermediate variable z
y := z := w1x1 + w2x2 + w3x3 + b
Draw this model in the following diagram
[Diagram: inputs x1 (Input #1), x2 (Input #2), x3 (Input #3) feed into a single node z, which produces the output y]
A Simple Neural Network (No hidden layer)
[Diagram: inputs x1, x2, x3 connect directly to a single unit Z1, which produces the output]
Information flows through a linear mapping:
S^(1)_1 = Σ_i w^(1)_{i1} Xi + w^(1)_{01}
At the output, this value may then be modified using a
nonlinear function such as a sigmoid,
Z^(1)_1 = 1 / (1 + exp(−S^(1)_1)) = 1 / (1 + exp(−w^(1)_{11} X1 − w^(1)_{21} X2 − w^(1)_{31} X3 − w^(1)_{01}))
Without applying the so-called activation function, the
network is equivalent to a linear regression.
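A tiny numerical sketch of this single-unit network (the weights are made up for illustration); without the sigmoid, the output is just the linear total input, i.e. a linear regression:

import numpy as np

X = np.array([1.0, 2.0, 0.5])      # inputs X1, X2, X3
w = np.array([0.4, -0.2, 0.1])     # weights w_11, w_21, w_31 (illustrative)
w0 = 0.05                          # bias w_01

S = np.dot(w, X) + w0              # linear total input S_1^(1)
Z = 1.0 / (1.0 + np.exp(-S))       # sigmoid output Z_1^(1)
print(S, Z)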
Slightly Complex Neural Networks
Consider the following neural network:
[Diagram: four inputs x1, x2, x3, x4 fully connected to a hidden layer of five units z1, …, z5, which feed into a single output]
where we have applied the activation z^(1)_i = h(s^(1)_i) for i = 1, 2, 3, 4, 5.
Models defined by Neural Networks
The previous network actually defines the following
composite non-linear function (on the output node, we take
no activation),
η = β0 + β1 h(s^(1)_1) + β2 h(s^(1)_2) + β3 h(s^(1)_3) + β4 h(s^(1)_4) + β5 h(s^(1)_5)
where h is the chosen activation function
Training means fitting this mathematical model with training
data, i.e., finding the best weights
w^(1)_{11}, …, w^(1)_{45}, w^(1)_{01}, …, w^(1)_{05}, β1, …, β5, β0.
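A minimal numpy sketch of this composite function for the 4-input, 5-hidden-unit network (the weights here are random placeholders, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(4, 5))   # weights w_ij^(1): 4 inputs -> 5 hidden units
w0 = rng.normal(size=5)        # hidden biases w_01^(1), ..., w_05^(1)
beta = rng.normal(size=5)      # output weights beta_1, ..., beta_5
beta0 = rng.normal()           # output bias beta_0

def relu(s):
    return np.maximum(0.0, s)

def eta(x, h=relu):
    s = x @ W1 + w0            # total inputs s_1^(1), ..., s_5^(1)
    z = h(s)                   # hidden-unit outputs
    return beta0 + z @ beta    # linear output node (no activation)

print(eta(np.array([1.0, 0.5, -1.0, 2.0])))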
Models defined by Neural Networks
Graphical representation of a neural net with L = 3 hidden layers.
The input layer represents the raw covariates X . The last hidden
layer (hidden layer 3) represents the predictors Z .
Models defined by Neural Networks
Denote the final output of the neural net as
η = β0 + β1Z1 + · · · + βdZd = β0 + β^T Z
with β = (β1, …, βd)^T
Note that η is a function of X and depends on W, β0 and β;
writing θ = (W, β, β0), we have
η = F(X, θ)
where W is the set of weights that connect the covariates X to the
predictors Z, and (β, β0) is the set of weights that connect Z to the output.
We will use F(X, θ) to approximate ftrue(X).
Forward propagation algorithm (optional)
Slides with “optional” are highly technical. You are encouraged to
go through them, but these will not be tested in the exams.
Forward propagation algorithm for computing the output
Consider a neural net with the structure (p, ℓ^(1), …, ℓ^(L), 1):
The input layer has p covariates X1, X2, …, Xp.
There are L hidden layers: the first hidden layer has ℓ^(1) units, the second
hidden layer has ℓ^(2) units, etc.
The last layer is a single output η
Forward propagation algorithm (optional)
Let w^(j)_{0v} be the bias at unit v in layer j, and w^(j)_{uv} the
weight from unit u in the previous layer j − 1 to unit v in layer
j. Layer j = 0 is the input layer, with ℓ^(0) = p.
The total input to unit v of layer j is
S^(j)_v = Σ_u w^(j)_{uv} Z^(j−1)_u + w^(j)_{0v}
Its output is Z^(j)_v = h(S^(j)_v). Other notations are given on the next slide.
Forward propagation algorithm (optional)
w^(j)_{·v}: the set of weights that send signals to unit v of layer j
S^(j): the vector of total inputs to layer j, j = 1, …, L
Z^(j): the vector of outputs from layer j, with Z^(0) = X
Forward propagation algorithm (optional)
The matrix of all weights from layer j − 1 to layer j, and the biases
on layer j:
W^(j) = [ w^(j)_{u,v} ], u = 1, …, ℓ^(j−1), v = 1, …, ℓ^(j)  (an ℓ^(j−1) × ℓ^(j) matrix)
W_0^(j) = ( w^(j)_{0,1}, …, w^(j)_{0,ℓ^(j)} )^T
The final output of the network is
η = β0 + β1 Z^(L)_1 + · · · + βd Z^(L)_d
Collect the biases and weights of layer j together as W^(j) := [W_0^(j), W^(j)].
Forward propagation algorithm (optional)
Pseudo-code algorithm for computing the output.
Input: covariates X1, …, Xp and weights w = (W^(1), …, W^(L)), β0, β
Output: η
1. Z^(0) := (X1, …, Xp)^T
2. For j = 1, …, L:
   S^(j) = (W^(j))^T Z^(j−1) + W_0^(j)
   Z^(j) = h(S^(j))
3. η = β0 + β^T Z^(L)
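A minimal Python sketch of this forward pass (the layer sizes and weights below are invented for illustration):

import numpy as np

def forward(x, Ws, bs, beta, beta0, h=np.tanh):
    # Ws[j-1], bs[j-1]: weight matrix W^(j) and bias vector W_0^(j) of hidden layer j
    z = x                          # Z^(0) = X
    for W, b in zip(Ws, bs):
        s = W.T @ z + b            # S^(j)
        z = h(s)                   # Z^(j) = h(S^(j))
    return beta0 + beta @ z        # eta = beta_0 + beta^T Z^(L)

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]   # p = 3, l(1) = 4, l(2) = 2
bs = [rng.normal(size=4), rng.normal(size=2)]
beta, beta0 = rng.normal(size=2), 0.1

print(forward(np.array([1.0, -0.5, 2.0]), Ws, bs, beta, beta0))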
Example: a network with three inputs x1, x2, x3 and two hidden units z1, z2.
[Diagram: weights w11 = 0.1, w21 = 0.3, w31 = 0.5 with bias 1 into z1; w12 = 0.2, w22 = 0.4, w32 = 0.6 with bias 2 into z2; output weights β1 = 0.7, β2 = 0.8, β0 = 0]
The model (function) defined by this neural network, using the sigmoid activation function σ, is
η = β0 + β1z1 + β2z2 = 0 + 0.7σ(s1) + 0.8σ(s2)
  = 0.7σ(0.1 x1 + 0.3 x2 + 0.5 x3 + 1)
  + 0.8σ(0.2 x1 + 0.4 x2 + 0.6 x3 + 2)
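A quick numerical check of this worked example (the input values below are arbitrary):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def eta_example(x1, x2, x3):
    # weights, biases and betas taken from the worked example above
    s1 = 0.1 * x1 + 0.3 * x2 + 0.5 * x3 + 1.0
    s2 = 0.2 * x1 + 0.4 * x2 + 0.6 * x3 + 2.0
    return 0.0 + 0.7 * sigmoid(s1) + 0.8 * sigmoid(s2)

print(eta_example(1.0, 2.0, -1.0))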
Neural net for regression
Given a neural network, we now know how to compute its
output η from an input vector X .
How is this output used for forecasting?
Other problems in neural network modelling
How to select the number of hidden layers?
How to select the number of units in each hidden layer?
How to perform variable selection?
We rely on machine learning packages to train neural networks
Neural net for forecasting
Suppose that the response Y is numerical.
The model is
Y = η + ε = F(X, θ) + ε = β0 + β1Z1 + β2Z2 + · · · + βdZd + ε
where Zi ’s are the last hidden layer outputs and ε is an error
term with mean 0 and variance σ2. Often, we assume
ε ∼ N(0, σ2)
The least squares method can now be used to estimate the
model parameters.
Note on Python: in Keras, the activation function of the output
unit for regression is the identity function, named 'linear'.
Least squares method for training
For the given training dataset D = {(yi, Xi = (xi1, …, xip)^T)},
i = 1, …, N.
The neural net regression model can be written as
yi = F(Xi, θ) + εi, i = 1, 2, …, N.
Define the loss function to be the sum of squared errors
Loss(θ) = Σ_{i=1}^{N} ℓi(θ), where ℓi(θ) = (yi − F(Xi, θ))^2
The least squares (LS) method minimises Loss(θ) to estimate θ.
Difficulties in training a neural net
We need to solve an optimization problem
find θ that minimises Loss(θ)
Difficulties:
There are a huge number of parameters
The surface of the loss function is often highly multimodal
We often need big data, so training is computationally expensive
In most cases, neural net models are trained by the Stochastic
Gradient Descent (SGD) method.
Gradient descent definition
Gradient descent is a first-order iterative optimization
algorithm for finding the minimum of a function.
To find a local minimum of a function using gradient descent,
we take steps proportional to the negative of the gradient (or
approximate gradient) of the function at the current point.
Source: wikipedia.
Gradient descent method for optimization
Suppose that we want to minimise a function Loss(θ), the
gradient descent method for optimization includes the
following steps:
1 Start from an initial value θ^(0)
2 Update
θ^(t+1) = θ^(t) − αt ∇θ Loss(θ^(t)),  t = 1, 2, …
3 Stop updating when some convergence condition is met, e.g., the
loss update is less than a pre-specified threshold.
Here ∇θ Loss(θ^(t)) (we can also write ∂Loss(θ^(t))/∂θ or
dLoss(θ^(t))/dθ) denotes the gradient vector of Loss(θ^(t))
αt > 0 is called the learning rate or step size; if
αt → 0 as t → +∞ (at an appropriate rate),
θ^(t) is guaranteed to converge to a local minimum of Loss(θ).
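A minimal sketch of this update loop on a toy one-dimensional loss (the function, starting point and fixed learning rate are chosen purely for illustration):

import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iter=1000):
    # theta^(t+1) = theta^(t) - alpha * grad(theta^(t))
    theta = theta0
    for t in range(max_iter):
        step = alpha * grad(theta)
        theta = theta - step
        if np.max(np.abs(step)) < tol:   # simple convergence condition
            break
    return theta

# Toy loss: Loss(theta) = (theta - 3)^2, so grad(theta) = 2 * (theta - 3)
theta_hat = gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=0.0)
print(theta_hat)   # converges close to 3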
Motivating example
Suppose below is the loss function (MSE) plot of a simple linear
regression model without an intercept term:
β0 is not included for simplicity.
Motivating example
If β0 is included, what does the loss function plot look like?
Motivating example
Figure: The picture shows an example surface plot of the loss function
Gradient descent
Based on the plot in the previous slide:
At the current value β, we move down the slope of the loss surface, i.e. in the direction of the negative gradient.
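A minimal sketch of gradient descent for the motivating example, a simple linear regression y ≈ βx without an intercept, where Loss(β) = Σi (yi − βxi)^2 and ∇Loss(β) = −2 Σi xi(yi − βxi); the data below are simulated purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(scale=0.5, size=100)   # simulated data, true beta = 2.5

def grad(beta):
    return -2.0 * np.sum(x * (y - beta * x))    # gradient of the squared-error loss

beta, alpha = 0.0, 0.001                        # initial value and learning rate
for t in range(200):
    beta = beta - alpha * grad(beta)            # move down the slope

print(beta)   # close to the least-squares estimate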