QBUS 6840 Lecture 10 & 11 Predictive Analytics with Neural Networks and Deep Learning I & II
The University of Sydney Business School
Introduction and Neural Networks Architecture
Neural Networks for cross-sectional data
Deep Structure in Neural Networks
Teaching slides; and
a comprehensive book, Deep Learning by Goodfellow, Bengio and Courville, freely available at https://www.deeplearningbook.org
Learning objectives
Understand the importance of data representation in data
analysis, and that neural network modeling and deep learning
are efficient data representation tools
Understand some basic concepts of neural network (NN) and
deep learning (DL)
Know the methods used to train/estimate a neural network
model, and the difficulties in training
Know how to use a neural network for prediction with
cross-sectional data and time series data
Know how to use NN & DL in a business predictive analytics
context. Most texts on DL use computer science terminology;
this lecture looks at DL from a statistician's perspective.
Introduction
In regression modelling, it is sometimes advisable to add
interaction terms Xi × Xj or quadratic terms Xi^2 to the model.
These terms are examples of non-linear effects: when
appropriate non-linear effect terms are added into the
regression/classification model, the prediction accuracy often improves.
How do we select non-linear effect terms? When should they be added?
Sometimes this can be done manually, but it requires
domain knowledge and trial and error: not efficient and not
always possible!
A Simple Example
Let's look at the Direct Marketing dataset
(DirectMarketing.csv, provided on Canvas)
There are 11 covariates in total. The response is AmountSpent
Let's use the first 900 observations as training data and the
remaining 100 as test data (in practice, the data should be shuffled first)
A Simple Example
The MSE of the prediction on the test data D_test is defined as
MSE = (1 / N_test) Σ_{i ∈ D_test} (ŷi − yi)^2
To ease comparison, let's use the square root, √MSE, to get back to the original scale ($).
First, try the full linear regression model:
Lecture10_Example01.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# import data
DM = pd.read_csv('DirectMarketing.csv')
DM = DM.dropna()  # drop all NaNs
n = 900; ntest = 1000 - n
lm = smf.ols('AmountSpent ~ Children + Catalogs + Salary + Gender_b + Married_b + Location_b \
             + Ownhome_b + Age_y + Age_m + Hist_m + Hist_h', DM.head(n)).fit()
predictions = lm.predict(DM.tail(ntest))
DM = DM.values; DM = DM.astype(float); ytest = DM[n:1001, 11]
MSE_lm = np.mean((predictions - ytest)**2)
print('Root of MSE on the test data for linear regression:', np.sqrt(MSE_lm))
Root of MSE on the test data for linear regression: 1530.9765347841285
A better linear regression model:
Lecture10_Example02.py
DM = pd.read_csv('DirectMarketing.csv')
lm = smf.ols('AmountSpent ~ Children + Catalogs*Salary*Salary + Location_b \
             + Hist_m', DM.head(n)).fit()
lm.summary()
predictions = lm.predict(DM.tail(ntest))
DM = DM.values  # pd.DataFrame.as_matrix() is removed in recent pandas; .values is equivalent
DM = DM.astype(float)
ytest = DM[n:1001, 11]
MSE_lm = np.mean((predictions - ytest)**2)
print('Root of MSE on the test data for linear regression:', np.sqrt(MSE_lm))
Root of MSE on the test data for linear regression: 1526.2875712736322
Now use a neural network model:
Lecture10_Example03.py
import time
import numpy as np
import tensorflow as tf
# X_train, y_train, X_test, y_test are assumed to have been prepared already
# (one possible preparation is sketched after this example)
# now build the neural net model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(11, input_dim=11, activation='relu'))
# the first hidden layer has 11 units, the input has 11 covariates
# model.add(tf.keras.layers.Dense(11, activation='relu'))  # add another hidden layer with 11 units
model.add(tf.keras.layers.Dense(1, activation='linear'))  # the output layer has 1 unit
                                                          # with the linear activation
# compiling the model for use
start = time.time()
model.compile(loss='MSE', optimizer='adam')
print("Compilation Time:", time.time() - start)
# fit the model
model.fit(X_train, y_train, epochs=100, batch_size=10, verbose=2, validation_split=0.05)
# evaluate the model
MSE_nn = model.evaluate(X_test, y_test)
print('\nRoot of MSE on the test data for neural net:', np.sqrt(MSE_nn))
Root of MSE on the test data for neural net: 0.24108619398735537
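The Keras example above assumes that arrays X_train, y_train, X_test and y_test already exist. The lecture file does not show that preprocessing step; a plausible sketch (an assumption, not the lecture's exact code) standardises both the covariates and the response, which would also explain why the reported root MSE (about 0.24) is on the standardised scale rather than in dollars:

from sklearn.preprocessing import StandardScaler

# Assumed preparation: DM is the cleaned numeric array from the earlier
# examples, with the 11 covariates in the first 11 columns and AmountSpent
# in column 11, and n = 900 training observations.
X, y = DM[:, :11], DM[:, 11]
X_train, X_test = X[:n], X[n:]
y_train, y_test = y[:n], y[n:]

scaler_X = StandardScaler().fit(X_train)
scaler_y = StandardScaler().fit(y_train.reshape(-1, 1))
X_train, X_test = scaler_X.transform(X_train), scaler_X.transform(X_test)
y_train, y_test = scaler_y.transform(y_train.reshape(-1, 1)), scaler_y.transform(y_test.reshape(-1, 1))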
A Simple Example
So for this dataset, which model is better in terms of
prediction accuracy?
Neural Networks
Neural networks and deep neural networks (called deep
learning) have become an exciting research and application
area in the last few years
Deep learning is widely known for its high prediction accuracy
It has been successfully applied to many large-scale industry
problems, e.g. image recognition and language processing
Its secret is Data Representation Learning
Representation Learning
We want to predict a response Y, based on raw/original
covariates X = (X1, …, Xp)^T, using linear regression modelling
Usually, before doing regression modelling, some appropriate
transformation of the covariates Xi is needed: Z1 = φ1(X),
…, Zd = φd(X).
The Zi are called predictors or features.
Then we model
E(Y | X) = β0 + β1Z1 + · · · + βdZd = β0 + β^T Z
where Z = (Z1, …, Zd)^T is a representation of
X = (X1, …, Xp)^T and β = (β1, …, βd)^T
A better representation (in terms of predicting Y ) leads to a
better prediction accuracy
Selection of the transformations φi (X ) is an art!
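As a small, purely hypothetical illustration (the variable names and transformations below are invented for this example, not taken from the lecture), a hand-crafted representation might be built like this; the appeal of a neural network is that it learns such a mapping φ automatically:

import numpy as np

def manual_representation(X):
    # X is an N x p design matrix of raw covariates; the transformations
    # below are illustrative choices only (the "art" the slide refers to).
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    Z = np.column_stack([
        x1,                     # keep an original covariate
        x1 * x2,                # interaction term
        x3 ** 2,                # quadratic term
        np.log1p(np.abs(x2)),   # a non-linear transform
    ])
    return Z                    # features Z = phi(X) to feed into a linear model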
Representation Learning
Neural network modeling is a representation learning method.
It provides an efficient way to design a representation
Z = φ(X ) that is effective for predicting the response Y .
What are neural networks?
They are a set of very flexible non-linear methods for
regression/classification and other tasks.
A neural network, also called an artificial neural network (ANN),
is a computational model inspired by the network of
neurons in the human brain
What are neural networks?
A neural network is an interconnected assembly of simple
processing units or neurons, which communicate by sending
signals to each other over weighted connections
A neural network is made of layers of similar neurons: an
input layer, (one or many) hidden layers, and an output layer.
The input layer receives data from outside the network. The
output layer sends data out of the network. Hidden layers
receive/process/send data within the network.
A neural network is said to be deep, if it has many hidden
layers. Deep neural network modelling is collectively referred
to as deep learning.
What are neural networks?
In a nutshell, a neural net is a multivariate function: output η
is a function of the inputs X = (X1, …,Xp)
η = f (X ) = f (X1, …,Xp)
More precisely, this function is a layered composite function:
Z^(1) = f1(Z^(0)), where Z^(0) = X
Z^(2) = f2(Z^(1))
...
Z^(L) = fL(Z^(L−1))
η = f_{L+1}(Z^(L))
What are neural networks?
A neural network provides a mechanism for functional
approximation
Suppose that ftrue(X ) is a true, yet unknown, function that we
want to estimate. E.g.,
ftrue(X ) = E(Y |X )
the conditional mean of a response Y given X
A neural net with the output η = f (X ) provides an
approximation of ftrue(X ), i.e. we use f (X ) to approximate
ftrue(X ).
Variants of neural networks
The network structure considered so far is often called a
feed-forward neural network; such networks are most suitable for
cross-sectional data, but can be used for time series data too.
Later you will study recurrent neural networks, which are most
suitable for time series data.
Elements of a neural network (a generic layer connection)
Elements of a neural network
A (feedforward) neural net includes
a set of processing units (also called neurons, nodes)
weights wik, which are the connection strengths from unit i to unit k
a bias w0k, which is the extra bias at unit k (may be 0)
a propagation rule that determines the total input Sk of unit
k, from the units that send information to unit k
the output Zk for each unit k , which is a function of the input
an activation function hk that determines the output Zk based
on the input Sk , Zk = hk(Sk)
Elements of a neural network
It’s useful to distinguish three types of units:
input units (often denoted by X ): receive data from outside
the network
hidden units (often denoted by Z ): receive data from and
send data to units within the network.
output units: send data out of the network. The type of the
output depends on the task (regression, binary classification or
multinomial regression). In many cases, there is only one
scalar output unit.
Given the signal from a set of inputs X , a NN produces an output.
Elements of a neural network
The total input sent to unit k is
Sk = Σ_i wik Zi + w0k
which is a weighted sum of the outputs from all units i that
are connected to unit k, plus a bias/intercept term w0k
Then, the output of unit k is
Zk = hk(Sk) = hk( Σ_i wik Zi + w0k )
Usually, we use the same activation function hk = h for all units
Elements of a neural network
Popular activation functions:
Sigmoid activation function: h(S) = σ(S) = 1 / (1 + e^(−S))
Tanh activation function: h(S) = tanh(S) = (e^S − e^(−S)) / (e^S + e^(−S))
Rectified linear activation function (ReLU):
h(S) = ReLU(S) = max(0, S)
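A minimal numpy sketch of these activations and of a single unit's output Zk = h(Σ_i wik Zi + w0k); the numbers at the end are made up purely for illustration:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def tanh(s):
    return np.tanh(s)

def relu(s):
    return np.maximum(0.0, s)

def unit_output(z_in, w, w0, h=relu):
    # Z_k = h(S_k), with total input S_k = sum_i w_ik * Z_i + w_0k
    s = np.dot(w, z_in) + w0
    return h(s)

z_in = np.array([1.0, -2.0, 0.5])   # outputs Z_i from the sending units
w = np.array([0.2, 0.1, -0.4])      # weights w_ik into unit k
print(unit_output(z_in, w, w0=0.3, h=sigmoid))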
Linear Regression Model
Consider three features (regressors/predictors) x1, x2 and x3,
for one dependent/response variable y
The linear regression model is
y = w1x1 + w2x2 + w3x3 + b
Let's add an intermediate variable z
y := z := w1x1 + w2x2 + w3x3 + b
Draw this model in the following diagram
[Diagram: inputs x1 (Input #1), x2 (Input #2), x3 (Input #3) feed into a single node z, which produces the output y]
A Simple Neural Network (No hidden layer)
[Diagram: inputs x1, x2, x3 connect directly to a single unit Z1, which produces the output]
Information flows through a linear mapping:
S^(1)_1 = Σ_i w^(1)_{i1} Xi + w^(1)_{01}
At the output, this value may then be modified using a
nonlinear function such as a sigmoid,
Z^(1)_1 = 1 / (1 + exp(−S^(1)_1)) = 1 / (1 + exp(−w^(1)_{11} X1 − w^(1)_{21} X2 − w^(1)_{31} X3 − w^(1)_{01}))
Without applying the so-called activation function, the
network is equivalent to a linear regression.
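A tiny numerical sketch of this single-unit network (the weights are made up for illustration); without the sigmoid, the output is just the linear total input, i.e. a linear regression:

import numpy as np

X = np.array([1.0, 2.0, 0.5])      # inputs X1, X2, X3
w = np.array([0.4, -0.2, 0.1])     # weights w_11, w_21, w_31 (illustrative)
w0 = 0.05                          # bias w_01

S = np.dot(w, X) + w0              # linear total input S_1^(1)
Z = 1.0 / (1.0 + np.exp(-S))       # sigmoid output Z_1^(1)
print(S, Z)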
Slightly Complex Neural Networks
Consider the following neural network:
[Diagram: four inputs x1, x2, x3, x4 fully connected to a hidden layer of five units z1, …, z5, which feed into a single output]
where we have applied the activation z^(1)_i = h(s^(1)_i) for i = 1, 2, 3, 4, 5.
Models defined by Neural Networks
The previous network actually defines the following
composite non-linear function (on the output node, we take
no activation),
η = β0 + β1 h(s^(1)_1) + β2 h(s^(1)_2) + β3 h(s^(1)_3) + β4 h(s^(1)_4) + β5 h(s^(1)_5)
where h is the chosen activation function
Training means fitting this mathematical model with training
data, i.e., finding the best weights
w^(1)_{11}, …, w^(1)_{45}, w^(1)_{01}, …, w^(1)_{05}, β1, …, β5, β0.
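A minimal numpy sketch of this composite function for the 4-input, 5-hidden-unit network (the weights here are random placeholders, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(4, 5))   # weights w_ij^(1): 4 inputs -> 5 hidden units
w0 = rng.normal(size=5)        # hidden biases w_01^(1), ..., w_05^(1)
beta = rng.normal(size=5)      # output weights beta_1, ..., beta_5
beta0 = rng.normal()           # output bias beta_0

def relu(s):
    return np.maximum(0.0, s)

def eta(x, h=relu):
    s = x @ W1 + w0            # total inputs s_1^(1), ..., s_5^(1)
    z = h(s)                   # hidden-unit outputs
    return beta0 + z @ beta    # linear output node (no activation)

print(eta(np.array([1.0, 0.5, -1.0, 2.0])))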
Models defined by Neural Networks
Graphical representation of a neural net with L = 3 hidden layers.
The input layer represents the raw covariates X . The last hidden
layer (hidden layer 3) represents the predictors Z .
Models defined by Neural Networks
Denote the final output of the neural net as
η = β0 + β1Z1 + · · · + βdZd = β0 + β^T Z
with β = (β1, …, βd)^T
Note that η is a function of X and depends on W, β0 and β;
writing θ = (W, β, β0), we have
η = F(X, θ)
where W is the set of weights that connect the covariates X to the
predictors Z, and (β, β0) is the set of weights that connect Z to the output.
We will use F(X, θ) to approximate ftrue(X).
Forward propagation algorithm (optional)
Slides with “optional” are highly technical. You are encouraged to
go through them, but these will not be tested in the exams.
Forward propagation algorithm for computing the output
Consider a neural net with the structure (p, ℓ^(1), …, ℓ^(L), 1):
The input layer has p covariates X1, X2, …, Xp.
There are L hidden layers: the first hidden layer has ℓ^(1) units, the second
hidden layer has ℓ^(2) units, etc.
The last layer is a single output η
Forward propagation algorithm (optional)
Let w^(j)_{0v} be the bias at unit v in layer j, and w^(j)_{uv} the
weight from unit u in the previous layer j − 1 to unit v in layer
j. Layer j = 0 is the input layer, with ℓ^(0) = p.
The total input to unit v of layer j is
S^(j)_v = Σ_u w^(j)_{uv} Z^(j−1)_u + w^(j)_{0v}
Its output is Z^(j)_v = h(S^(j)_v). Other notations are given on the next slide.
Forward propagation algorithm (optional)
w^(j)_{·v}: the set of weights that send signals to unit v of layer j
S^(j): the vector of total inputs to layer j, j = 1, …, L
Z^(j): the vector of outputs from layer j, with Z^(0) = X
Forward propagation algorithm (optional)
The matrix of all weights from layer j − 1 to layer j, and the biases
on layer j:
W^(j) = [ w^(j)_{u,v} ], u = 1, …, ℓ^(j−1), v = 1, …, ℓ^(j)  (an ℓ^(j−1) × ℓ^(j) matrix)
W_0^(j) = ( w^(j)_{0,1}, …, w^(j)_{0,ℓ^(j)} )^T
The final output of the network is
η = β0 + β1 Z^(L)_1 + · · · + βd Z^(L)_d
Collect the biases and weights of layer j together as W^(j) := [W_0^(j), W^(j)].
Forward propagation algorithm (optional)
Pseudo-code algorithm for computing the output.
Input: covariates X1, …, Xp and weights w = (W^(1), …, W^(L)), β0, β
Output: η
1. Z^(0) := (X1, …, Xp)^T
2. For j = 1, …, L:
   S^(j) = (W^(j))^T Z^(j−1) + W_0^(j)
   Z^(j) = h(S^(j))
3. η = β0 + β^T Z^(L)
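A minimal Python sketch of this forward pass (the layer sizes and weights below are invented for illustration):

import numpy as np

def forward(x, Ws, bs, beta, beta0, h=np.tanh):
    # Ws[j-1], bs[j-1]: weight matrix W^(j) and bias vector W_0^(j) of hidden layer j
    z = x                          # Z^(0) = X
    for W, b in zip(Ws, bs):
        s = W.T @ z + b            # S^(j)
        z = h(s)                   # Z^(j) = h(S^(j))
    return beta0 + beta @ z        # eta = beta_0 + beta^T Z^(L)

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]   # p = 3, l(1) = 4, l(2) = 2
bs = [rng.normal(size=4), rng.normal(size=2)]
beta, beta0 = rng.normal(size=2), 0.1

print(forward(np.array([1.0, -0.5, 2.0]), Ws, bs, beta, beta0))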
Example: a network with three inputs x1, x2, x3 and two hidden units z1, z2.
[Diagram: weights w11 = 0.1, w21 = 0.3, w31 = 0.5 with bias 1 into z1; w12 = 0.2, w22 = 0.4, w32 = 0.6 with bias 2 into z2; output weights β1 = 0.7, β2 = 0.8, β0 = 0]
The model (function) defined by this neural network, using the sigmoid activation function σ, is
η = β0 + β1z1 + β2z2 = 0 + 0.7σ(s1) + 0.8σ(s2)
  = 0.7σ(0.1 x1 + 0.3 x2 + 0.5 x3 + 1)
  + 0.8σ(0.2 x1 + 0.4 x2 + 0.6 x3 + 2)
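A quick numerical check of this worked example (the input values below are arbitrary):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def eta_example(x1, x2, x3):
    # weights, biases and betas taken from the worked example above
    s1 = 0.1 * x1 + 0.3 * x2 + 0.5 * x3 + 1.0
    s2 = 0.2 * x1 + 0.4 * x2 + 0.6 * x3 + 2.0
    return 0.0 + 0.7 * sigmoid(s1) + 0.8 * sigmoid(s2)

print(eta_example(1.0, 2.0, -1.0))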
Neural net for regression
Given a neural network, we now know how to compute its
output η from an input vector X .
How is this output used for forecasting?
Other problems in neural network modelling
How to select the number of hidden layers?
How to select the number of units in each hidden layer?
How to perform variable selection?
We rely on machine learning packages to train neural networks
Neural net for forecasting
Suppose that the response Y is numerical.
The model is
Y = η + ε = F(X, θ) + ε = β0 + β1Z1 + β2Z2 + · · · + βdZd + ε
where Zi ’s are the last hidden layer outputs and ε is an error
term with mean 0 and variance σ2. Often, we assume
ε ∼ N(0, σ2)
The least squares method can now be used to estimate the
model parameters.
Note on Python: in Keras, the activation function of the output
unit for regression is the identity function, named 'linear'.
Least squares method for training
For the given training dataset D = {(yi, Xi = (xi1, …, xip)^T)},
i = 1, …, N.
The neural net regression model can be written as
yi = F(Xi, θ) + εi, i = 1, 2, …, N.
Define the loss function to be the sum of squared errors
Loss(θ) = Σ_{i=1}^{N} ℓi(θ), where ℓi(θ) = (yi − F(Xi, θ))^2
The least squares (LS) method minimises Loss(θ) to estimate θ.
Difficulties in training a neural net
We need to solve an optimization problem
find θ that minimises Loss(θ)
Difficulties:
There are a huge number of parameters
The surface of the loss function is often highly multimodal
We often need big data, so training is computationally expensive
In most cases, neural net models are trained by the Stochastic
Gradient Descent (SGD) method.
Gradient descent definition
Gradient descent is a first-order iterative optimization
algorithm for finding the minimum of a function.
To find a local minimum of a function using gradient descent,
we take steps proportional to the negative of the gradient (or
approximate gradient) of the function at the current point.
Source: wikipedia.
Gradient descent method for optimization
Suppose that we want to minimise a function Loss(θ), the
gradient descent method for optimization includes the
following steps:
1 Start from an initial value θ^(0)
2 Update
θ^(t+1) = θ^(t) − αt ∇θ Loss(θ^(t)),  t = 1, 2, …
3 Stop updating when some convergence condition is met, e.g., the
loss update is less than a pre-specified threshold.
Here ∇θ Loss(θ^(t)) (we can also write ∂Loss(θ^(t))/∂θ or
dLoss(θ^(t))/dθ) denotes the gradient vector of Loss(θ^(t))
αt > 0 is called the learning rate or step size; if
αt → 0 as t → +∞ (at an appropriate rate),
θ^(t) is guaranteed to converge to a local minimum of Loss(θ).
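A minimal sketch of this update loop on a toy one-dimensional loss (the function, starting point and fixed learning rate are chosen purely for illustration):

import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iter=1000):
    # theta^(t+1) = theta^(t) - alpha * grad(theta^(t))
    theta = theta0
    for t in range(max_iter):
        step = alpha * grad(theta)
        theta = theta - step
        if np.max(np.abs(step)) < tol:   # simple convergence condition
            break
    return theta

# Toy loss: Loss(theta) = (theta - 3)^2, so grad(theta) = 2 * (theta - 3)
theta_hat = gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=0.0)
print(theta_hat)   # converges close to 3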
Motivating example
Suppose below is the loss function (MSE) plot of a simple linear
regression model without an intercept term:
β0 is not included for simplicity.
Motivating example
If β0 is included, what does the loss function plot look like?
Motivating example
Figure: The picture shows an example surface plot of the loss function
Gradient descent
Based on the plot in the previous slide:
At the current value β, we move down the slope of the loss surface, i.e. in the direction of the negative gradient.
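A minimal sketch of gradient descent for the motivating example, a simple linear regression y ≈ βx without an intercept, where Loss(β) = Σi (yi − βxi)^2 and ∇Loss(β) = −2 Σi xi(yi − βxi); the data below are simulated purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(scale=0.5, size=100)   # simulated data, true beta = 2.5

def grad(beta):
    return -2.0 * np.sum(x * (y - beta * x))    # gradient of the squared-error loss

beta, alpha = 0.0, 0.001                        # initial value and learning rate
for t in range(200):
    beta = beta - alpha * grad(beta)            # move down the slope

print(beta)   # close to the least-squares estimate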