ECE 657A: Classification – Lecture 8: Neural Networks and Deep Learning
Mark Crowley
March 1, 2017
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 1 / 85
Class Admin Announcements
Today’s Class
Announcements
Linear and Logistic Regression
Multilayer Perceptrons
Deep Learning
Decision Trees, Random Forests
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 2 / 85
Class Admin Announcements
Remainder of the Course
March 1: Neural Networks
March 8: Decision Trees*, Ensemble Methods
March 15: Clustering, Association Rule Mining,
CNNs/RNN/LSTM/DeepRL
March 19: Assignment 2 Due
March 15?, 22, 29: Project Presentations
March 29: Project Report Due
April 7: Final Exam at 12:30-3:00 in RCH 302
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 3 / 85
Class Admin Announcements
Usual Materials
[Duda, Pattern Classification, 2001]
R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification (2nd
ed.), John Wiley and Sons, 2001.
[Murphy, 2012]
Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT
Press, 2012.
And the following course slides
http://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture22.pdf
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 4 / 85
Class Admin Announcements
Other Resources
[Goodfellow, 2016]
Goodfellow, Bengio and Courville. ”Deep Learning”, MIT Press, 2016.
http://www.deeplearningbook.org/
The website has a free copy of the book as PDFs.
Good quick outline of most topics we’ve covered plus lots of detail on
latest Deep Learning/ANN methods.
Deep Learning Course
Videos of the on-campus course by Prof. Ali Ghodsi – all videos and slides
online: https://uwaterloo.ca/data-science/deep-learning
NYU Shanghai Course:
https://sites.google.com/a/nyu.edu/ml-spring2016/home
Many links to deep learning sites:
https://github.com/ChristosChristofidis/awesome-deep-learning
Matrix Cookbook (pdf)
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 5 / 85
http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
Classification Linear Discriminant Functions
Beyond Linear Classifiers
Last week we talked about SVMs, a powerful classifier algorithm.
SVM learns a discriminant function g(x) such that g(x) ≥ τ means x
is in class 1, and g(x) < τ means x is in class 0.
this is linear in its structure, with a possibly nonlinear kernel at the
core that remaps the data from its original space to one where a linear
divider will work.
Naive Bayes is a probabilistic model, but we still choose the more likely
class or apply a threshold τ to decide whether a class applies.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 6 / 85
Classification Linear and Logistic Regression
Linear Regression vs. Logistic Regression
A simpler class of Generalized Linear Models.
Linear regression learns a function to predict a continuous output
variable from continuous or discrete input variables:
Y = b_0 + ∑_i b_i X_i + ε
Logistic regression predicts the probability of an outcome: the
appropriate class for an input vector, or the odds of one outcome
being more likely than another.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 7 / 85
Classification Linear and Logistic Regression
Logistic Regression
Define probability of label being 1 by fitting a linear weight vector to the
logistic (a.k.a sigmoid) function.
P(Y = 1 | X) = 1 / (1 + e^{−(w_0 + ∑_i w_i x_i)}) = 1 / (1 + exp(−(w_0 + ∑_i w_i x_i)))
The advantage of this is that you can map continuous feature information
from (−∞, ∞) into [0, 1] and treat it like a probability.
bias: w_0 is called the bias; it shifts the sigmoid curve to the left or
right, biasing what the expected outcome is. The other weights w_i
adjust the steepness of the curve in each dimension.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 8 / 85
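To make the formula concrete, here is a minimal sketch in Python (not course-provided code; the weight values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps (-inf, inf) into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y_given_x(x, w, w0):
    """P(Y=1|X=x) = sigmoid(w0 + sum_i w_i * x_i)."""
    return sigmoid(w0 + np.dot(w, x))

# Hypothetical weights: w0 shifts the curve, w controls steepness per dimension.
w0, w = -1.0, np.array([2.0, -0.5])
x = np.array([0.8, 1.5])
print(p_y_given_x(x, w, w0))  # a value in [0, 1], treated as a probability
```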
Classification Linear and Logistic Regression
Logistic Regression as a Graphical Model
o(x) = σ(w^T x) = σ(w_0 + ∑_i w_i x_i) = 1 / (1 + exp(−(w_0 + ∑_i w_i x_i)))
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 9 / 85
Classification Linear and Logistic Regression
Logistic Regression Used as a Classifier
Logistic Regression can be used as a simple linear classifier in the same
way we did with Naive Bayes.
Compare probabilities of each class P(Y = 0|X ) and P(Y = 1|X )
Treat the halfway point on the sigmoid as the decision boundary.
If P(Y = 1 | X) > 0.5, classify X as class 1.
Equivalently, the decision boundary is the set of points where
w_0 + ∑_i w_i x_i = 0
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 10 / 85
Classification Linear and Logistic Regression
Properties of Logistic Regression
Very popular simple classification method
Relatively easy to fit to data using gradient descent; there are many
methods for doing this.
The learned parameters can be interpreted as computing the log odds:
LO = log [ p(Y = 1|X) / p(Y = 0|X) ] = w^T x
If x_0 : number of cigarettes per day,
x_1 : minutes of exercise,
y = 1 : getting lung cancer,
then if the learned parameters are ŵ = (1.3, −1.1), it means each
cigarette raises the odds of cancer by a factor of e^{1.3}.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 11 / 85
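A tiny numeric check of this log-odds interpretation, using the slide's hypothetical weights (illustration only):

```python
import numpy as np

# Hypothetical learned weights from the slide: w_hat = (1.3, -1.1)
w_cigarettes, w_exercise = 1.3, -1.1

# Log odds are linear in the features, so adding one cigarette per day
# multiplies the odds of y = 1 (lung cancer) by exp(1.3).
print(np.exp(w_cigarettes))  # ~3.67: odds go up per cigarette
print(np.exp(w_exercise))    # ~0.33: odds go down per minute of exercise
```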
Classification Linear and Logistic Regression
Training Logistic Regression Model via Gradient Descent
Maximum Likelihood Estimation cannot be performed in closed form here.
The negative log-likelihood of the logistic model is given by NLL and
its gradient by g:
NLL(w) = ∑_{i=1}^{N} log( 1 + exp(−(w_0 + ∑_j w_j x_{ij})) )
g = ∂NLL/∂w = ∑_i ( σ(w^T x_i) − y_i ) x_i
Then we can update the parameters iteratively:
θ_{k+1} = θ_k − ν_k g_k
where ν_k is the learning rate or step size.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 12 / 85
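A minimal sketch of this iterative update on synthetic data (assumes labels y ∈ {0, 1} and a fixed step size; not course-provided code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 2
X = rng.normal(size=(N, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # synthetic 0/1 labels
Xb = np.hstack([np.ones((N, 1)), X])             # prepend x_0 = 1 for the bias w_0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d + 1)
nu = 0.1                                         # learning rate / step size
for k in range(500):
    g = Xb.T @ (sigmoid(Xb @ w) - y)             # gradient of the NLL
    w = w - nu * g / N                           # theta_{k+1} = theta_k - nu_k * g_k
print(w)
```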
Classification Linear and Logistic Regression
Failing on the XOR Problem
How is XOR a hard problem? It is if you’re too linear. XOR is the simplest
nonlinear classification problem.
Two binary features x1 and x2 and four patterns to be assigned the
correct labels True or False
(Goodfellow 2016) Solving XOR
[Figure 6.1, two panels: the original x space (axes x1, x2) and the learned h space (axes h1, h2).]
Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model’s output must increase as x2 increases. When x1 = 1, the model’s output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]^T and x = [0, 1]^T to a single point in feature space, h = [1, 0]^T. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 13 / 85
Artificial Neural Networks History and Overview
A Short History
Early work in NNs goes back to the 1940s, with a simple model of the
neuron by McCulloch and Pitts as a summing and thresholding device.
Rosenblatt in 1958 introduced the perceptron as a two-layer network
(one input layer and one output node, with a bias in addition to the
input features).
Minsky, 1969: they’re ’just’ linear.
AI goes logical.
1980s: Neural Network resurgence: Backpropagation
1990s: SVMs! Kernels can do anything! (no, they can’t)
2003–present: Deep Learning (Convolutional Nets, Dropout, RBMs,
Deep Belief Networks)
Google’s YouTube cat experiment, speech recognition, self-driving cars, a
computer defeats a regional Go champion, …
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 14 / 85
Artificial Neural Networks History and Overview
Neural Networks
A neural network connects many logistic/sigmoid units together into
a network
Central difference from Perceptron: a layer of hidden units
Also called: Multilayer Perceptrons, Artificial Neural Networks
(ANNs)
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 15 / 85
Artificial Neural Networks History and Overview
Properties of Neural Networks
Important Properties:
Given a large enough layer of hidden units (or multiple layers) an
ANN can represent any function.
One challenge is regularization: how many hidden units or layers to
include in the network? Number of inputs and outputs are provided
but internal structure can be arbitrary.
too few units, network will have too few parameters, may not be able
to learn complex functions
too many units, network will be overparameterized and won’t be forced
to learn a generalizable model
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 16 / 85
Artificial Neural Networks History and Overview
Properties of Neural Networks
The hidden layer will have a number of units (a design parameter).
The hidden layers allow extracting more complex features and
facilitate solving complex problems.
Connection between the units (neurons) of all the layers can be
forward, backward or both.
Units can be fully or partially connected. (Fully connected at first)
Different arrangements make different types or models of the NN.
Each unit will have a thresholding (or activation) function and
connections between the units that have weights
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 17 / 85
Neural Networks learn f : X → Y
• f can be a non-linear function
• X : (vector of) continuous and/or discrete variables
• Y : (vector of) continuous and/or discrete variables
• Neural networks represent f by a network of logistic/sigmoid
units; we will focus on feedforward networks:
Input layer, X
Output layer, Y
Hidden layer, H
Sigmoid unit as the activation function (also linear or threshold);
it must be differentiable.
Artificial Neural Networks Feedforward Artificial Neural Networks
Three-layer Neural Network
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 21 / 85
Artificial Neural Networks Feedforward Artificial Neural Networks
Input Layer
vector data: each input unit is responsible for one feature/dimension of
the data
Hidden Layer
Each hidden unit computes a weighted sum of all the input units and
passes it through a nonlinear function.
output: net activation
Output Layer
Each output unit computes a weighted sum of all the hidden units
and passes it through a (possibly nonlinear) threshold function.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 22 / 85
Artificial Neural Networks Feedforward Artificial Neural Networks
Net Activation
net_j = ∑_{i=1}^{d} x_i w_{ji} + w_{j0} = ∑_{i=0}^{d} x_i w_{ji}   (with x_0 = 1 absorbing the bias)
where:
i indexes the input units
j indexes the hidden units
w_{ji} denotes the weight from input unit i to hidden unit j (the
“synaptic weights”)
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 23 / 85
Artificial Neural Networks Feedforward Artificial Neural Networks
Hidden Layer: Adding Nonlinearity
Each hidden unit emits an output that is a nonlinear activation
function of its net activation:
y_j = f(net_j)
This is essential to a neural network’s power: if f is linear, the whole
network collapses to linear regression.
The output is thus thresholded through this nonlinearity.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 24 / 85
Artificial Neural Networks Feedforward Artificial Neural Networks
Activation Functions
tanh function also popular
more recently, Rectified Linear Units (ReLU) have become standard:
max(0, net_j) (the same as half-wave rectification)
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 25 / 85
(Goodfellow 2016) Rectified Linear Activation
[Plot of g(z) = max{0, z} versus z.]
Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine’s memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.
Artificial Neural Networks Feedforward Artificial Neural Networks
Output Layer
The output layer combines the outputs of the previous hidden layer:
net_k = ∑_{j=1}^{n_H} y_j w_{kj} + w_{k0}
where:
k indexes the output units
n_H denotes the number of hidden units in the previous layer
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 27 / 85
Artificial Neural Networks Feedforward Artificial Neural Networks
Output Layer
The output layer adds another level of thresholding, usually nonlinear as
well.
zk = f (netk)
f (netk) could be any thresholding function such as the previous activation
functions.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 28 / 85
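A minimal forward-pass sketch of the three-layer computation just described (random weights for illustration; sigmoid is assumed for both nonlinearities):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d, n_hidden, n_out = 3, 4, 2
W_hidden = rng.normal(size=(n_hidden, d + 1))   # w_ji, column 0 holds the bias w_j0
W_out = rng.normal(size=(n_out, n_hidden + 1))  # w_kj, column 0 holds the bias w_k0

def forward(x):
    x_aug = np.concatenate(([1.0], x))          # x_0 = 1
    net_j = W_hidden @ x_aug                    # net activation of each hidden unit
    y = sigmoid(net_j)                          # y_j = f(net_j)
    y_aug = np.concatenate(([1.0], y))
    net_k = W_out @ y_aug                       # net activation of each output unit
    z = sigmoid(net_k)                          # z_k = f(net_k)
    return z

print(forward(np.array([0.5, -1.0, 2.0])))
```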
Neural Networks as Universal Approximators
Neural Nets Can Model XOR Problem
Use a simple sign activation
function for hidden and output units:
f(net) = +1 if net ≥ 0, and f(net) = −1 if net < 0
First hidden unit:
y_1 has net activation net_1 = x_1 + x_2 + 0.5
y_1 = 1 if net_1 ≥ 0, y_1 = −1 otherwise
Second hidden unit:
y_2 has net activation net_2 = x_1 + x_2 − 1.5
y_2 = 1 if net_2 ≥ 0, y_2 = −1 otherwise
Single output unit z_1:
emits z_1 = 1 iff y_1 = 1 and y_2 = −1.
y_1 models an OR gate
y_2 models an AND gate
z_1 models y_1 AND NOT y_2, which is exactly XOR
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 29 / 85
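A quick check of this construction with ±1 inputs (a sketch; the output-unit weights chosen below are one valid choice, not the only one):

```python
def sign(net):
    # f(net) = 1 if net >= 0, -1 otherwise
    return 1 if net >= 0 else -1

def xor_net(x1, x2):
    y1 = sign(x1 + x2 + 0.5)   # OR gate
    y2 = sign(x1 + x2 - 1.5)   # AND gate
    z1 = sign(y1 - y2 - 1.5)   # fires only when y1 = 1 and y2 = -1, i.e. y1 AND NOT y2
    return z1

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "->", xor_net(x1, x2))
# Output is +1 exactly when the inputs differ: XOR.
```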
Neural Networks as Universal Approximators
Full Neural Network Discriminant Function
The class of decision functions that can be described by a three-layer
neural network is:
g_k(x) = z_k = f( ∑_{j=1}^{n_H} w_{kj} f( ∑_{i=1}^{d} w_{ji} x_i + w_{j0} ) + w_{k0} )
The Universal Approximation Theorem [Hornik, 1989] shows that
this covers any decision function mapping continuous inputs
to a finite set of classes, to any desired level of accuracy.
But it does not say how many hidden units would be needed.
In general, for a single hidden layer the number could be exponential (in
the degrees of freedom).
But, using more layers can reduce the number of hidden units needed
and make it more generalizable.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 30 / 85
Training Neural Networks – Backpropagation
(Based on slides from Tom Mitchell’s ML Course, CMU, 2010.)
E is the error function (e.g., MSE or a cross-entropy loss), computed using all training data D.
For Neural Networks, E[w] is no longer convex in w.
Error gradient for a sigmoid unit (the MLE view), computed using forward propagation, with notation:
y_k = target output (label) of output unit k
o_k = unit output (obtained by forward propagation)
w_ij = weight from unit i to unit j
Note: if i is an input variable, o_i = x_i
The objective/error is no longer convex in the weights.
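To make the gradient computation concrete, here is a compact backpropagation sketch for one hidden layer of sigmoid units trained with a squared-error loss (my own illustration under those assumptions, not Mitchell's original code; biases are handled by a constant-1 input):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    return np.hstack([np.ones((A.shape[0], 1)), A])   # prepend a constant-1 input

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # a nonlinear target

W1 = rng.normal(scale=0.5, size=(3, 8))   # (bias + 2 inputs) -> 8 hidden units
W2 = rng.normal(scale=0.5, size=(9, 1))   # (bias + 8 hidden)  -> 1 output unit
eta = 1.0

for epoch in range(5000):
    # forward propagation
    H = sigmoid(add_bias(X) @ W1)                       # hidden outputs o_h
    O = sigmoid(add_bias(H) @ W2)                       # output o_k
    # backpropagated error terms for E = 1/2 * sum_k (y_k - o_k)^2
    delta_out = (O - Y) * O * (1 - O)                   # dE/dnet_k
    delta_hid = (delta_out @ W2[1:].T) * H * (1 - H)    # dE/dnet_h (skip the bias row)
    # gradient-descent weight updates
    W2 -= eta * add_bias(H).T @ delta_out / len(X)
    W1 -= eta * add_bias(X).T @ delta_hid / len(X)

print("training accuracy:", np.mean((O > 0.5) == Y))
```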
Regularization
Regularization to Reduce Overfitting
L2 regularization - augment the error function by adding (1/2) λ w^2 for
every weight in the neural network.
Need to choose λ carefully.
Interpretation - heavily penalizing ”peaky” weight vectors and
preferring diffuse weight vectors.
Encourages the network to use all its inputs instead of relying on the
strongest signal.
Also called weight decay, since the weights would shrink over time
during gradient descent in the absence of other input.
L1 regularization - add λ|w| for each weight to the error function
(encourages weights to go to zero)
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 37 / 85
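A sketch of how the penalties change the objective (λ and the example weights are arbitrary; in practice λ is chosen by validation):

```python
import numpy as np

def l2_penalty(weights, lam):
    # add (1/2) * lambda * w^2 for every weight ("weight decay")
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def l1_penalty(weights, lam):
    # add lambda * |w| for every weight (pushes weights toward exactly zero)
    return lam * sum(np.sum(np.abs(W)) for W in weights)

def regularized_error(data_error, weights, lam, kind="l2"):
    penalty = l2_penalty(weights, lam) if kind == "l2" else l1_penalty(weights, lam)
    return data_error + penalty

# In a gradient step the L2 term contributes lam * W, so the update becomes
#   W <- W - eta * (dE_data/dW + lam * W)
# which shrinks ("decays") every weight a little on each iteration.
W_example = [np.array([[2.0, -0.3], [0.1, 1.5]])]
print(regularized_error(1.7, W_example, lam=0.01))
```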
Regularization
Regularization to Reduce Overfitting
Figure: From http://www.kdnuggets.com/2015/04/preventing-overfitting-neural-networks.html/2
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 38 / 85
Regularization
Dropout
Dropout (Hinton 2014) - randomly ignore certain units during
training, don’t update them via gradient descent, leads to hidden
units that specialize.
With probability p don’t include a weight in the gradient updates.
Reduces overfitting by encouraging robustness of weights in the
network.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 39 / 85
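A sketch of the dropout idea in a forward pass (p is the drop probability; the "inverted dropout" rescaling used here is an assumption about implementation detail, chosen so no rescaling is needed at test time):

```python
import numpy as np

def dropout_forward(h, p, rng, training=True):
    """Randomly zero each hidden activation with probability p during training."""
    if not training or p == 0.0:
        return h
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    # Inverted dropout: rescale so the expected activation is unchanged.
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10)
print(dropout_forward(h, p=0.5, rng=rng))                   # roughly half the units zeroed
print(dropout_forward(h, p=0.5, rng=rng, training=False))   # unchanged at test time
```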
Improving Performance
Heuristics for Improving Backpropagation
There are a number of useful but not necessarily theoretically justified tricks
(i.e., heuristics) for training Neural Networks that help in practice.
Limit the number of hidden nodes: just enough complexity to work, not so
much that the network overfits.
Train multiple networks with different sizes and search for the best
design.
Validation set: train on the training set until the error on a validation set
starts to rise, then evaluate on a held-out evaluation set.
Try different activation functions: sigmoid or ReLU.
Randomly choose subsets of the data for training.
Dropout (Hinton 2014) - randomly ignore certain units during
training and don’t update them via gradient descent; this leads to hidden
units that specialize.
Modify learning rate over time (cooling schedule)
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 40 / 85
(Goodfellow 2016) Exponential Advantage of Depth
Montufar et al. (2014) showed that functions representable with a deep rectifier net can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Figure 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions which can capture all kinds of regular (e.g., repeating) patterns.
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, shown formally by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers). Figure reproduced with permission from Montufar et al. (2014).
More precisely, the main theorem in Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer is
O( (n choose d)^{d(l−1)} n^d ),   (6.42)
i.e., exponential in the depth l. In the case of maxout networks with k filters per unit, the number of linear regions is
O( k^{(l−1)+d} ).   (6.43)
Using hidden ReLU units, each hidden layer increases representational power: an exponential advantage from additional layers.
(Goodfellow 2016) Better Generalization with Greater Depth
[Plot: test accuracy (percent) versus network depth, 3 to 11 layers.]
Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See figure 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.
Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for reducing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem-dependent. For example, convolutional networks, described in chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems. In this chapter, it is difficult to give much more specific advice concerning the architecture of a generic neural network. Subsequent chapters develop the particular architectural strategies that have been found to work well for different application domains.
Accuracy increases with the number of layers when detecting numbers in photos of addresses.
(Goodfellow 2016) Large, Shallow Models Overfit More
[Plot: test accuracy (percent) versus number of parameters (×10^8); curves: 3 layers convolutional, 3 layers fully connected, 11 layers convolutional.]
Figure 6.7: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).
Improving Performance
Other Types of ANNs
Radial Basis Function Networks
Restricted Boltzmann Machines (RBM)
Deep Neural Networks
Convolutional Neural Networks (CNN) - neighbours share weights,
learns larger patterns, stacked in multiple levels (good for image
classification)
Recurrent Neural Networks (RNN) - allow links from outputs back to
inputs over time; good for time-series learning
LSTM
Deep Reinforcement Learning
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 44 / 85
(Goodfellow 2016) Network Diagrams
[Figure 6.2: the XOR network drawn two ways – as individual units x1, x2, h1, h2, y, and as layer vectors x, h, y with parameter labels W and w.]
Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is very explicit and unambiguous but for networks larger than this example it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer’s activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.
…model, we used a vector of weights and a scalar bias parameter to describe an affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x^T W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a) defined by the activation function g(z) = max{0, z} depicted in figure 6.3.
We can now specify our complete network as
f(x; W, c, w, b) = w^T max{0, W^T x + c} + b.   (6.3)
We can now specify a solution to the XOR problem. Let
W = [[1, 1], [1, 1]],   (6.4)
c = [0, −1]^T,   (6.5)
(Goodfellow 2016) Mixture Density Outputs
[Figure 6.4: scatter plot of samples (x, y) drawn from the learned conditional distribution.]
Figure 6.4: Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All of these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.
…to describe y becomes complex enough to be beyond the scope of this chapter. Chapter 10 describes how to use recurrent neural networks to define such models over sequences, and part III describes advanced techniques for modeling arbitrary probability distributions.
6.3 Hidden Units. So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward neural networks: how to choose the type of hidden unit to use in the hidden layers of the model. The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice).
This is the output of a model called a Mixture Density Network, which can perform multimodal regression where each x may have multiple output values y. The standard way is to fit a Gaussian mixture model; this can be specified as a neural network. [Goodfellow, p. 185]
Decision Trees Definition
Non-metric Classification
Metric Classification: k-nearest neighbours, SVM, logistic regression,
neural networks
What if the data features are not numeric (discrete or continuous)
values with a well-defined order?
How do we perform classification of data with features like color ∈ {’red’,
’blue’, ’orange’} or DNA ∈ {A, G, C, T}?
How do we discriminate data with such labels? How do we learn a flexible
model that can predict such labels without simply memorizing a mapping?
How do we make it generalizable?
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 47 / 85
Decision Trees Definition
Simple Decision Tree
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 48 / 85
Decision Trees Definition
Decision Tree
Any finite pattern can be represented by a finite series of yes/no questions.
Decision Tree:
A directed graph G = (N, E)
Each node n ∈ N represents a choice on a particular variable/feature
F amongst B subsets of the values of F .
Edges (links, branches) connect a node to child (descendent) nodes
which model a split on another feature (or the same feature with
different split subsets)
Leaf nodes are nodes with no children which have a category label
assigned to them.
A decision tree categorizes a data point according to the label of the
leaf node that data point reaches by following the rules defined in the
internal nodes.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 49 / 85
Decision Trees Definition
Decision Tree Properties
A decision tree recursively partitions the input space, defining a local
model in each resulting region of the input space represented by a leaf
node.
If all the dimensions are numerical, we can envision what the decision
tree is doing as dividing up the space into rectangles.
Decision trees allow us to combine numerical and non-numerical data
dimensions into a single learning process.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 50 / 85
Decision Trees Training Decision Trees – CART
CART
A generalization of decision trees is the Classification and Regression Tree
(CART) framework. Questions we need to answer:
1 Should properties be binary only?
2 Which feature will be tested at each node?
3 When should a node become a leaf?
4 Can a tree be “too large”, how could we make it smaller?
5 If a leaf is impure, what label do we assign?
6 How do we deal with missing data?
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 51 / 85
Decision Trees Training Decision Trees – CART
Splitting Nodes
The question at each node splits the data along that dimension
Number of split choices is the branching factor of the tree.
We can represent any multiple branch split with a series of binary
splits.
Using a higher branching factor raises the risk of overfitting.
Finding the optimal partitioning of the data is NP-complete so we
usually use a greedy approach to find an approximate solution.
Need to specify:
feature to split on
value in feature range to split on
whether the node should be split or left as a leaf
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 52 / 85
Decision Trees Training Decision Trees – CART
Basic fitTree Algorithm
Algorithm 1: fitTree(node, D, depth)
node.prediction = mean(y_i : i ∈ D), or the label distribution
j, t, D_L, D_R = split(D)
if not worthSplitting(depth, cost(D), D_L, D_R) then
    return node
else
    node.testfeature = j
    node.testvalue = t
    node.left = fitTree(node, D_L, depth+1)
    node.right = fitTree(node, D_R, depth+1)
    return node
From (Murphy, 2012).
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 53 / 85
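A rough Python rendering of the fitTree pseudocode above (my own sketch: split() does the greedy search described on the following slides using Gini impurity, and worthSplitting is replaced by simple stopping rules):

```python
import numpy as np

def cost(y):
    """Gini impurity of a set of labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 0.5 * (1.0 - np.sum(p ** 2))

def split(X, y):
    """Greedy search over features j and thresholds t for the best binary split."""
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue
            gain = cost(y) - (left.mean() * cost(y[left]) + (~left).mean() * cost(y[~left]))
            if gain > best[2]:
                best = (j, t, gain)
    return best

def fit_tree(X, y, depth=0, max_depth=3, min_points=5):
    node = {"prediction": np.bincount(y).argmax()}      # majority label at this node
    j, t, gain = split(X, y)
    worth_splitting = (j is not None and gain > 1e-12
                       and depth < max_depth and len(y) >= min_points)
    if not worth_splitting:
        return node
    left = X[:, j] <= t
    node.update(testfeature=j, testvalue=t,
                left=fit_tree(X[left], y[left], depth + 1, max_depth, min_points),
                right=fit_tree(X[~left], y[~left], depth + 1, max_depth, min_points))
    return node

def predict(node, x):
    while "testfeature" in node:
        node = node["left"] if x[node["testfeature"]] <= node["testvalue"] else node["right"]
    return node["prediction"]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
tree = fit_tree(X, y)
print(np.mean(np.array([predict(tree, x) for x in X]) == y))
```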
Decision Trees Node Impurity
Cost as Node Impurity
Basic principle, we want the simplest model at each node, i.e. Occam’s
razor.
A leaf node is pure if all the datapoints associated with it have the
same label/category. (We’ll say the impurity cost(D) = 0 in this case.)
Entropy Impurity – most popular
Gini Impurity – useful for training trees
Misclassification Impurity – intuitive but generally worse
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 54 / 85
Decision Trees Node Impurity
Entropy Impurity
cost(D) = − ∑_c P(ω_c) log_2 P(ω_c)
where:
P(ω_c) is the fraction (or probability) of points in that node having
label ω_c:
P(ω_c) = (1 / |D|) ∑_{i ∈ D} I(y_i = c)
If all points have the same label, cost(D) = 0; the maximum of cost(D)
occurs when the points are uniformly likely to have any of the labels.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 55 / 85
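A direct transcription of these two formulas (a sketch):

```python
import numpy as np

def entropy_impurity(labels):
    """cost(D) = - sum_c P(w_c) log2 P(w_c), with P(w_c) estimated from label counts."""
    labels = np.asarray(labels)
    p = np.array([np.mean(labels == c) for c in np.unique(labels)])
    return -np.sum(p * np.log2(p))

print(entropy_impurity(["a", "a", "a", "a"]))   # pure node: impurity 0
print(entropy_impurity(["a", "a", "b", "b"]))   # two equally likely classes: impurity 1.0 (the maximum)
```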
Decision Trees Node Impurity
Gini Impurity
Also called Gini Index
cost(D) = ∑_{i ≠ j} P(ω_i) P(ω_j) = (1/2) [ 1 − ∑_j P^2(ω_j) ]
This is the expected error rate at node D if the label is selected randomly
from the distribution present in that node.
More strongly peaked at uniform probability than entropy impurity
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 56 / 85
Decision Trees Node Impurity
Misclassification Impurity
cost(D) = 1 − max_j P(ω_j)
Measures the minimum probability that a datapoint would be misclassified
at this node.
Very peaked at uniform probability of all labels.
But it isn’t smooth; its derivative is discontinuous, which breaks some
gradient-based search methods.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 57 / 85
Decision Trees Node Impurity
Behaviour of Impurity Models
For the binary two-class case.
Entropy and Gini impurity measures are very similar.
Both are more sensitive to changes in the class probabilities than the
misclassification rate.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 58 / 85
Decision Trees Node Impurity
Choosing a Split Value
Given a partial tree, we are deciding on node D, having chosen the
feature to split on.
Question: What value should you split on?
∆cost(D) = cost(D) − ( (|D_L| / |D|) cost(D_L) + (|D_R| / |D|) cost(D_R) )
where:
D_L and D_R are the left/right descendent nodes obtained by splitting on
feature T at value s
|D| means the number of datapoints in the set
The best split is the one that maximizes ∆cost(D)
For entropy impurity this amounts to maximizing the information gain of
the split.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 59 / 85
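A sketch of choosing the split value on a single feature by maximizing ∆cost, using entropy impurity so that ∆cost is the information gain (illustrative data):

```python
import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def delta_cost(y, y_left, y_right):
    """cost(D) - ( |D_L|/|D| * cost(D_L) + |D_R|/|D| * cost(D_R) )"""
    n = len(y)
    return entropy(y) - (len(y_left) / n * entropy(y_left) + len(y_right) / n * entropy(y_right))

def best_split_value(feature, y):
    best_t, best_gain = None, -np.inf
    for t in np.unique(feature)[:-1]:          # candidate thresholds
        left = feature <= t
        gain = delta_cost(y, y[left], y[~left])
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split_value(feature, y))            # splits at 3.0 with gain 1.0
```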
Decision Trees Stopping and Pruning
When to Stop Splitting?
When should we stop splitting?
Simplest approach: keep splitting until each leaf has only datapoints
in the same class or has only a single datapoint. → Tends to overfit
for classification/regression.
Minimize Cross-Validated Error: measure the decrease in error a
new node would cause against a validation set, or on many random
cross-validation sets. Then stop splitting when the gain becomes
insufficient.
Threshold: Stop splitting if the proposed split provides < ε reduction in
impurity, or if the leaf has fewer than 5 datapoints.
Minimize Global Criterion: ...
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 60 / 85
Decision Trees Stopping and Pruning
Minimize A Global Criterion
Define a global criterion. Keep splitting until it reaches a minimum value:
α · size + ∑_{d ∈ leafnodes(N)} cost(d)
where:
size - number of nodes, or edges
α - some positive constant
Properties:
This acts as a regularizer that discourages larger trees.
If entropy impurity is being used, this is equivalent to minimizing the
minimum description length, the number of bits used to represent the model.
The sum of leaf impurities is a measure of uncertainty in the training
data given the current model.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 61 / 85
Decision Trees Stopping and Pruning
Pruning the Tree
Horizon Effect: what if you stop splitting at a node but later features
would have led to better classification? You can’t see past the horizon
unless you explore it.
Stop too early and the model has low predictive power.
Stop too late and it overfits.
So why not fully grow the tree first and then decide what to remove?
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 62 / 85
Decision Trees Stopping and Pruning
Basic Pruning Algorithm
1 Grow a full tree: until leaf node impurity hits a minimum or there is a
very small number of datapoints in each leaf.
2 Find a pair of sibling leaf nodes i and j which have not yet been examined.
3 Calculate how much the impurity of parent node k would go up if i and j
were eliminated, then remove the branches that would cause the smallest
increase in error.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 63 / 85
Decision Trees Stopping and Pruning
Benefits of Pruning
PROs
All the data can be used for training, unlike the threshold and
cross-validation approaches.
Avoids the horizon effect.
CONs
More expensive than computing when to stop early.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 64 / 85
Decision Trees Stopping and Pruning
Another Way to Think About CART
CART can be seen as an adaptive basis-function (kernel) model. The basis
functions define regions as hyper-rectangles, and the weights specify the
regression response or the matching label proportions for each region.
f(x) = E[y|x] = ∑_{m=1}^{M} w_m I(x ∈ R_m) = ∑_{m=1}^{M} w_m φ(x; v_m)
where:
R_m is the m-th region partitioned by the tree
v_m encodes the choice of feature and the threshold value to split on
I is an indicator function that is 1 iff datapoint x falls within that
hyper-rectangular region
From (Murphy, 2012)
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 65 / 85
Decision Trees Stopping and Pruning
Decision Tree Algorithm: ID3
Intended for use with nominal, unordered data.
On each iteration, it iterates through every unused feature and calculates
the entropy for that attribute. It then selects the one with the smallest
entropy, and the dataset is split on that feature.
Creates a branch for each value of the current feature (not binary!).
The algorithm continues to recurse on each subset, considering only
features never selected before.
Stopping: every element in the subset belongs to the same class, or no
more features are left; if a leaf is not pure then the most common class
label is used.
ID3 is only preferred if computational costs are a big issue.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 66 / 85
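A sketch of the ID3 attribute-selection step for nominal features (hypothetical data; picks the attribute whose multi-way split gives the lowest weighted entropy, i.e. the highest information gain):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_entropy_after_split(values, labels):
    """Average entropy of the subsets created by branching on every value of a nominal feature."""
    total = 0.0
    for v in np.unique(values):
        subset = labels[values == v]
        total += len(subset) / len(labels) * entropy(subset)
    return total

# Hypothetical nominal dataset: pick the attribute with the smallest weighted entropy.
color = np.array(["red", "red", "blue", "blue", "orange", "orange"])
size  = np.array(["small", "big", "small", "big", "small", "big"])
label = np.array(["yes", "yes", "no", "no", "yes", "no"])

for name, feature in [("color", color), ("size", size)]:
    print(name, weighted_entropy_after_split(feature, label))
# ID3 would split on the attribute with the lowest value (and branch once per feature value).
```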
Decision Trees Stopping and Pruning
Decision Tree Algorithm: C4.5
Improvement on ID3; the most popular in use (see the WEKA data mining tool).
Performs pruning on the tree after it is grown.
Missing Data: what if some datapoints don’t have a value for feature f?
C4.5 adds a ‘?’ value and doesn’t use that feature for entropy calculations;
the algorithm otherwise stays the same.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 67 / 85
Decision Trees Stopping and Pruning
Pros and Cons of Decision Trees
PRO:
Simple to implement; lots of flexibility in impurity measures and training algorithms.
The resulting model is easy to interpret as logical rules.
Can be seen as an automated kernel learning method, similar to Neural Networks.
Universal expressive power (also like neural networks).
Handle nominal, discrete and continuous data.
They perform automated variable selection.
CON:
Very easy to overfit the data and create a complete mapping.
Tend to be unstable: small changes in input lead to very different trees.
Fairly slow to train (not as bad as Neural Networks though).
Don’t perform as well as more modern methods (but...)
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 68 / 85
Ensemble Methods Definition
Ensemble Methods for Improving Classification
Basic Idea: Rather than training just one model to predict the data with
low error, try learning multiple models on subsets of the data, and combine
them to get overall lower error than is possible from a single model.
General Ensemble Methods:
Bagging
Boosting
Adaboost
Also called Adaptive Reweighting and Combining (Arcing)
Decision Tree Ensembles:
Random Forests
Extremely Randomized Trees, Isolation Forests
Hoeffding Trees, Mondrian Forests, Streaming Random Forests
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 69 / 85
Ensemble Methods Bagging
Bootstrap Revisited
A bootstrap data set is created by randomly selecting n points from the
training set X of size n with replacement. There will usually be some
duplicates.
This is repeated B times with independent draws each time.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 70 / 85
Ensemble Methods Bagging
Bagging
Bagging (bootstrap aggregation):
Draw multiple subsamples of the data of size n′ < n.
Learn a component classifier using each bagged sample.
The final classification decision is based on a vote from each component classifier.
Usually the same classifier is used for each component (neural network, SVM,
decision tree), but they could be different (more on multi-classifier systems later).
Advantage: reduces instability (i.e., the sensitivity of classifier performance
to small changes in the data).
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 71 / 85
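A sketch of bagging with decision-tree component classifiers (assumes scikit-learn is available; the base learner and B are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

B = 25                      # number of bootstrap samples / component classifiers
components = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))      # sample n points with replacement
    clf = DecisionTreeClassifier().fit(X[idx], y[idx])
    components.append(clf)

# Final decision: majority vote over the component classifiers.
votes = np.stack([clf.predict(X) for clf in components])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the bagged ensemble:", np.mean(y_hat == y))
```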
Ensemble Methods Boosting
Boosting
A meta-algorithm that allows us to create an ensemble of classifiers with
arbitrarily high accuracy.
Train multiple classifiers as in bagging, but each bootstrap sample is
chosen to maximize the information gain conditioned on the previously
trained classifiers.
The classifier can be a weak learner: only guaranteed to be better than chance.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 72 / 85
Ensemble Methods Boosting
Boosting Example
Divide D into three sets D_{1:3} of size < n.
Run your favourite classification algorithm on D_1 to create C_1.
Select subset D_2 by randomly selecting 50% of points that C_1 gets correct
and 50% that C_1 gets incorrect. Train C_2 on D_2.
Select D_3 to be all points in D that C_1 or C_2 get wrong. Train C_3 on D_3.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 73 / 85
Ensemble Methods Boosting
Adaboost
The most popular variation of the boosting meta-algorithm.
We assume the sign of the weak learner h_k(x) gives the class, and the
magnitude of the value gives the confidence in the classification.
It essentially classifies the datapoints using a weighted majority vote of
different classifiers, each trained to perform well where all the previous
classifiers are doing badly.
Compute a joint discriminant function g(x) = ∑_{k=1}^{k_max} α_k h_k(x)
Now, classify the datapoints using sign(g(x)).
*Further reading:* a great chapter by Schapire about possible explanations of
why AdaBoost is so effective:
http://rob.schapire.net/papers/explaining-adaboost.pdf
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 74 / 85
Ensemble Methods Boosting
AdaBoost
Algorithm 2: AdaBoost(X, Y)
Input: Datapoints (x_i, y_i) where x_i ∈ X and the label y_i ∈ {−1, +1} for i = 1, ..., n
Assign initial weights w_1(i) = 1/n
foreach k = 1, ..., K do
    Train a weak-learner classifier C_k using data sampled according to w_k(i)
    ε_k = Pr_{i∼w_k}[ C_k(x_i) ≠ y_i ]   /* error rate of C_k */
    α_k = (1/2) ln( (1 − ε_k) / ε_k )
    foreach i = 1, ..., n do
        w_{k+1}(i) = w_k(i) exp(−α_k y_i C_k(x_i)) / Z_k
return the classification function C(x) = sign( ∑_{k=1}^{K} α_k C_k(x) )
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 75 / 85
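A sketch of the AdaBoost loop above with decision stumps as the weak learners C_k (assumes scikit-learn; labels converted to {−1, +1}):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=1)
y = 2 * y01 - 1                                   # labels in {-1, +1}

n, K = len(X), 30
w = np.full(n, 1.0 / n)                           # w_1(i) = 1/n
alphas, stumps = [], []
for k in range(K):
    stump = DecisionTreeClassifier(max_depth=1)   # weak learner C_k
    stump.fit(X, y, sample_weight=w)              # train on data weighted by w_k
    pred = stump.predict(X)
    eps = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)   # weighted error rate of C_k
    alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_k
    w = w * np.exp(-alpha * y * pred)
    w /= w.sum()                                  # normalize by Z_k
    alphas.append(alpha)
    stumps.append(stump)

# C(x) = sign( sum_k alpha_k C_k(x) )
g = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(g) == y))
```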
Random Forests
Random Forest
A very good way to reduce the variance of decision trees and make them
more stable to different inputs.
Bagging: Train M trees on randomly chosen subsets of the data (with replacement):
f(x) = ∑_{m=1}^{M} (1/M) f_m(x)
Bagging leads to highly correlated trees, so it helps but could be better.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 76 / 85
Random Forests
Random Forests
From (Breiman, 2001). Builds on the bagging algorithm.
Will learn T trees. For each tree, obtain a bootstrap sample of the
datapoints and a sample of K features to use for training that tree.
Use the standard CART tree learning algorithm for each tree, but at each
split step, find the optimal split by searching the K features and their values.
The next tree will use a different sample of features and datapoints.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 77 / 85
Random Forests
Random Forests
Decorrelates the individual trees (base learners) by also randomly choosing
the feature set each can choose from.
Result: Each tree focusses on a subset of the features and learns them better.
Outperforms many other methods in terms of accuracy.
Boosting can take this even further.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 78 / 85
Random Forests
PROs and CONs of Random Forests
PROs:
Robust to the initial data distribution and sampling process.
Easily tunable: more trees improves accuracy and reduces variance.
Relatively fast to train for such a flexible model.
CONs:
Harder to interpret than decision trees; one can try to average rules across trees.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 79 / 85
Random Forests
Extremely Randomized Trees
From (Geurts, Ernst and Wehenkel, 2006).
Randomly choose a subset of features.
Randomly choose a cut point in each of those features (not by searching).
Choose the best performing of those features as the split (maximize a
score, which could be an impurity measure).
Totally Randomized Trees: just choose one at random, with no maximization
(faster, more robust, but bad for classification).
Stop when the minimum number of datapoints for a leaf, K, is reached.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 80 / 85
Random Forests
Extremely Randomized Trees
Two main differences with other tree-based ensemble methods:
Splits nodes by choosing cut-points fully at random.
Uses the whole learning sample (rather than a bootstrap replica) to grow the trees.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 81 / 85
Random Forests
Other Tree Ensemble Methods
Isolation Forests (Liu, 2008): anomaly detection (two-class classification), unsupervised.
Hoeffding Trees (Domingos, 2000): streaming data for single trees for
regression, with confidence bounds. Nodes only split when enough data
arrives for the confidence level.
Mondrian Forests (Lakshminarayanan, 2014): streaming data for forests,
with confidence bounds.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 82 / 85
Classification Performance Comparisons
Performance Comparisons
From (Murphy, 2012, p.585)
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 83 / 85
Classification Performance Comparisons
Classification Performance Comparisons
From http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 84 / 85
Classification Performance Comparisons
Summary
Logistic Regression as a simple classifier
Stochastic Gradient Descent
Neural Networks / Multilayer Perceptrons
Decision Trees
Ensemble Methods
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 85 / 85