
ECE 657A: Classification – Lecture 8: Neural Networks and Deep Learning

ECE 657A: Classification
Lecture 8: Neural Networks and Deep Learning

Mark Crowley

March 1, 2017

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 1 / 85

Class Admin Announcements

Today’s Class

Announcements

Linear and Logistic Regression

Multilayer Perceptrons

Deep Learning

Decision Trees, Random Forests

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 2 / 85

Class Admin Announcements

Remainder of the Course

March 1: Neural Networks

March 8: Decision Trees*, Ensemble Methods

March 15: Clustering, Association Rule Mining,
CNNs/RNN/LSTM/DeepRL

March 19: Assignment 2 Due

March 15?, 22, 29: Project Presentations

March 29: Project Report Due

April 7: Final Exam at 12:30-3:00 in RCH 302

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 3 / 85

Class Admin Announcements

Usual Materials

[Duda, Pattern Classification, 2001]
R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification (2nd
ed.), John Wiley and Sons, 2001.

[Murphy, 2012]
Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT
Press, 2012.

And the following course slides
http://www.cs.cmu.edu/ aarti/Class/10701/slides/Lecture22.pdf

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 4 / 85

Class Admin Announcements

Other Resources

[Goodfellow, 2016]
Goodfellow, Bengio and Courville. ”Deep Learning”, MIT Press, 2016.

http://www.deeplearningbook.org/

The website has a free copy of the book as PDFs.

Good quick outline of most topics we’ve covered plus lots of detail on
latest Deep Learning/ANN methods.

Deep Learning Course

Videos of the on-campus course by Prof. Ali Ghodsi – all videos and
slides online: https://uwaterloo.ca/data-science/deep-learning

NYU Shanghai Course:
https://sites.google.com/a/nyu.edu/ml-spring2016/home

Many links to deep learning sites:
https://github.com/ChristosChristofidis/awesome-deep-learning

Matrix Cookbook (pdf)

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 5 / 85

http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf

Classification Linear Discriminant Functions

Beyond Linear Classifiers

Last week we talked about SVMs, a powerful classifier algorithm.

SVM learns a discriminant function g(x) such that g(x) ≥ τ means x is in class 1, and g(x) < τ means x is in class 0.

This is linear in its structure, with a possibly nonlinear kernel at the core that remaps the data from its original space to one where a linear divider will work.

Naive Bayes is a probabilistic model, but we then choose the more likely class or apply a threshold τ to determine if a class is valid.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 6 / 85

Classification Linear and Logistic Regression

Linear Regression vs. Logistic Regression

A simpler type of Generalized Linear Model.

Linear regression learns a function to predict a continuous output variable from continuous or discrete input variables:

Y = b_0 + ∑_i b_i X_i + ε

Logistic regression predicts the probability of an outcome: the appropriate class for an input vector, or the odds of one outcome being more likely than another.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 7 / 85

Classification Linear and Logistic Regression

Logistic Regression

Define the probability of the label being 1 by fitting a linear weight vector to the logistic (a.k.a. sigmoid) function:

P(Y = 1|X) = 1 / (1 + exp(−(w_0 + ∑_i w_i x_i)))

The advantage of this is that you can turn continuous feature information in [−∞, ∞] into [0, 1] and treat it like a probability.

bias: w_0 is called the bias; it shifts the sigmoid curve to the left or right, biasing what the expected outcome is. The other weights w_i adjust the steepness of the curve in each dimension.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 8 / 85

Classification Linear and Logistic Regression

Logistic Regression as a Graphical Model

o(x) = σ(wᵀx) = σ(w_0 + ∑_i w_i x_i) = 1 / (1 + exp(−(w_0 + ∑_i w_i x_i)))

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 9 / 85

Classification Linear and Logistic Regression

Logistic Regression Used as a Classifier

Logistic Regression can be used as a simple linear classifier in the same way we did with Naive Bayes.

Compare the probabilities of each class, P(Y = 0|X) and P(Y = 1|X).

Treat the halfway point on the sigmoid as the decision boundary: if P(Y = 1|X) > 0.5, classify X in class 1, which happens exactly when

w_0 + ∑_i w_i x_i ≥ 0

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 10 / 85
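As a minimal sketch of the decision rule above (my own illustration, not code from the slides; the weights and datapoints are made up), the following NumPy snippet computes P(Y = 1|X) with the sigmoid and classifies at the 0.5 threshold:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, w0):
    """P(Y = 1 | X) for each row of X under a logistic regression model."""
    return sigmoid(X @ w + w0)

def predict_class(X, w, w0, threshold=0.5):
    """Class 1 when P(Y = 1 | X) > threshold,
    i.e. when w0 + sum_i w_i x_i > 0 for threshold = 0.5."""
    return (predict_proba(X, w, w0) > threshold).astype(int)

# Hypothetical weights and datapoints, for illustration only.
w = np.array([1.3, -1.1])
w0 = -0.5
X = np.array([[2.0, 0.0],
              [0.0, 3.0]])
print(predict_proba(X, w, w0))   # probabilities in (0, 1)
print(predict_class(X, w, w0))   # e.g. [1 0]
```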

Classification Linear and Logistic Regression

Properties of Logistic Regression

Very popular simple classification method

Relatively easy to fit to data using gradient descent, many methods
for doing this

Learned parameters can be interpreted as computing the log odds

LO = log [ p(Y = 1|X) / p(Y = 0|X) ] = wᵀx

If x_0 : number of cigarettes per day

x_1 : minutes of exercise

y = 1 : getting lung cancer

Then if the learned parameters are ŵ = (1.3, −1.1), each cigarette raises the odds of cancer by a factor of e^{1.3}

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 11 / 85
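For concreteness, a tiny sketch of that arithmetic using the slide's hypothetical ŵ:

```python
import numpy as np

w = np.array([1.3, -1.1])   # weights for (cigarettes/day, minutes of exercise)

# One extra cigarette per day multiplies the odds of y = 1 by e^1.3;
# one extra minute of exercise multiplies them by e^-1.1.
odds_multipliers = np.exp(w)
print(odds_multipliers)     # approximately [3.67, 0.33]
```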

Classification Linear and Logistic Regression

Training Logistic Regression Model via Gradient Descent

There is no closed-form solution for the Maximum Likelihood Estimate.

The negative log-likelihood of the logistic regression model is given by NLL and its gradient by g:

NLL(w) = ∑_{i=1}^{N} log( 1 + exp(−ỹ_i (w_0 + ∑_j w_j x_{ij})) )     (labels ỹ_i ∈ {−1, +1})

g = ∂NLL/∂w = ∑_i (σ(wᵀx_i) − y_i) x_i     (equivalent form, with labels y_i ∈ {0, 1})

Then we can update the parameters iteratively

w_{k+1} = w_k − ν_k g_k

where ν_k is the learning rate or step size.
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 12 / 85
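A minimal sketch of this gradient-descent fit (my own illustration, not the course's code; the synthetic data, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the logistic regression NLL.
    X: (N, d) features with a leading column of 1s for the bias w0.
    y: (N,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = X.T @ (sigmoid(X @ w) - y)   # gradient of the NLL
        w -= lr * g / len(y)             # step size nu_k = lr / N
    return w

# Toy, linearly separable data (hypothetical).
X = np.array([[1, 0.5], [1, 1.5], [1, 3.0], [1, 4.0]], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)
w = fit_logistic(X, y)
print(w, sigmoid(X @ w))   # learned weights and fitted probabilities
```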

Classification Linear and Logistic Regression

Failing on the XOR Problem

How is XOR a hard problem? It is if you’re too linear. XOR is the simplest
nonlinear classification problem.

Two binary features x1 and x2 and four patterns to be assigned the
correct labels True or False

(Goodfellow 2016)

Solving XOR

[Figure 6.1 plots: the original x space (left) and the learned h space (right).]

Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers
printed on the plot indicate the value that the learned function must output at each point.
(Left) A linear model applied directly to the original input cannot implement the XOR
function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1,
the model's output must decrease as x2 increases. A linear model must apply a fixed
coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change
the coefficient on x2 and cannot solve this problem. (Right) In the transformed space
represented by the features extracted by a neural network, a linear model can now solve
the problem. In our example solution, the two points that must have output 1 have been
collapsed into a single point in feature space. In other words, the nonlinear features have
mapped both x = [1, 0]ᵀ and x = [0, 1]ᵀ to a single point in feature space, h = [1, 0]ᵀ.
The linear model can now describe the function as increasing in h1 and decreasing in h2.
In this example, the motivation for learning the feature space is only to make the model
capacity greater so that it can fit the training set. In more realistic applications, learned
representations can also help the model to generalize.

Figure 6.1
Mark Crowley ECE 657A : Lecture 8 March 1, 2017 13 / 85

Artificial Neural Networks History and Overview

A Short History

Early work on neural networks goes back to the 1940s, with McCulloch and Pitts' simple model of the neuron as a summing and thresholding device.

Rosenblatt in 1958 introduced the perceptron as a two-layer network (one input layer and one output node, with a bias in addition to the input features).

Minsky and Papert, 1969: perceptrons are 'just' linear

AI goes logical

1980s: Neural Network resurgence: Backpropagation

1990s: SVMs! Kernels can do anything! (no, they can't)

2003–present: Deep Learning (Convolutional Nets, Dropout, RBMs, Deep Belief Networks)

Google's YouTube cat experiment, speech recognition, self-driving cars, a computer defeats a regional Go champion, …

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 14 / 85

Artificial Neural Networks History and Overview

Neural Networks

A neural network connects many logistic/sigmoid units together into
a network

Central difference from Perceptron: a layer of hidden units

Also called: Multilayer Perceptrons, Artificial Neural Networks
(ANNs)

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 15 / 85

Artificial Neural Networks History and Overview

Properties of Neural Networks

Important Properties:

Given a large enough layer of hidden units (or multiple layers) an
ANN can represent any function.

One challenge is regularization: how many hidden units or layers to
include in the network? Number of inputs and outputs are provided
but internal structure can be arbitrary.

too few units: the network will have too few parameters and may not be able to learn complex functions
too many units: the network will be overparameterized and won't be forced to learn a generalizable model

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 16 / 85

Artificial Neural Networks History and Overview

Properties of Neural Networks

The hidden layer will have a number of units (a design parameter).

The hidden layers allow extracting more complex features and
facilitate solving complex problems.

Connections between the units (neurons) of the layers can be
forward, backward, or both.

Units can be fully or partially connected. (Fully connected at first)

Different arrangements make different types or models of the NN.

Each unit will have a thresholding (or activation) function and
connections between the units that have weights

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 17 / 85

Neural Networks to learn f: X → Y

• f can be a non-linear function
• X: (vector of) continuous and/or discrete variables

• Y: (vector of) continuous and/or discrete variables

• Neural networks: represent f by a network of logistic/sigmoid
units; we will focus on feedforward networks:

Input layer, X

Output layer, Y

Hidden layer, H

Sigmoid unit: a differentiable activation function (linear and threshold units are also possible).

Artificial Neural Networks Feedforward Artificial Neural Networks

Three-layer Neural Network

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 21 / 85

Artificial Neural Networks Feedforward Artificial Neural Networks

Input Layer

vector data; each input unit is responsible for one feature/dimension of
the data

Hidden Layer

Each hidden unit computes a weighted sum of all the input units and
passes it through a nonlinear function.

output: net activation

Output Layer

Each output unit computes a weighted sum of all the hidden units
and passes it through a (possibly nonlinear) threshold function.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 22 / 85

Artificial Neural Networks Feedforward Artificial Neural Networks

Net Activation

net_j = ∑_{i=1}^{d} x_i w_{ji} + w_{j0} = ∑_{i=0}^{d} x_i w_{ji}

where:

i indexes the input units

j indexes the hidden units

w_{ji} denotes the weight from input i to hidden unit j (“synaptic weights”)

the i = 0 term folds the bias into the sum by defining x_0 = 1

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 23 / 85

Artificial Neural Networks Feedforward Artificial Neural Networks

Hidden Layer: Adding Nonlinearity

Each hidden unit emits an output that is a nonlinear activation
function of its net activation:

y_j = f(net_j)

This is essential to a neural network's power: if f is linear then the whole
network collapses to linear regression.

The output is thus thresholded through this nonlinearity.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 24 / 85

Artificial Neural Networks Feedforward Artificial Neural Networks

Activation Functions

The tanh function is also popular.

Recently, Rectified Linear Units (ReLU) have become standard:
f(net_j) = max(0, net_j) (equivalent to half-wave rectification)

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 25 / 85
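A quick sketch (mine, not from the slides) of the activation functions mentioned here:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def tanh(net):
    return np.tanh(net)

def relu(net):
    """Rectified linear unit: max(0, net), i.e. half-wave rectification."""
    return np.maximum(0.0, net)

net = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu):
    print(f.__name__, np.round(f(net), 3))
```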

(Goodfellow 2016)

Rectified Linear Activation

[Figure plot: the rectified linear activation function, g(z) = max{0, z}.]

Figure 6.3: The rectified linear activation function. This activation function is the default
activation function recommended for use with most feedforward neural networks. Applying
this function to the output of a linear transformation yields a nonlinear transformation.
However, the function remains very close to linear, in the sense that it is a piecewise linear
function with two linear pieces. Because rectified linear units are nearly linear, they
preserve many of the properties that make linear models easy to optimize with gradient-
based methods. They also preserve many of the properties that make linear models
generalize well. A common principle throughout computer science is that we can build
complicated systems from minimal components. Much as a Turing machine’s memory
needs only to be able to store 0 or 1 states, we can build a universal function approximator
from rectified linear functions.


Figure 6.3

Artificial Neural Networks Feedforward Artificial Neural Networks

Output Layer

The output layer combines all the previous hidden layers.

net_k = ∑_{j=1}^{n_H} y_j w_{kj} + w_{k0}

where:

k indexes the output units

nH denotes the number of hidden units in the previous layer

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 27 / 85

Artificial Neural Networks Feedforward Artificial Neural Networks

Output Layer

The output layer adds another level of thresholding, usually nonlinear as
well.

zk = f (netk)

f (netk) could be any thresholding function such as the previous activation
functions.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 28 / 85
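Putting the input, hidden, and output layer computations together, here is a sketch of the forward pass for a fully connected three-layer network (my own illustration; the weight shapes, the ReLU hidden units, and the sigmoid output activation are assumptions, not specifics from the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, W_out, b_out):
    """Forward pass of a 3-layer (input, hidden, output) feedforward net.
    x:        (d,)      input features
    W_hidden: (n_H, d)  hidden weights w_ji
    b_hidden: (n_H,)    hidden biases  w_j0
    W_out:    (c, n_H)  output weights w_kj
    b_out:    (c,)      output biases  w_k0"""
    net_j = W_hidden @ x + b_hidden   # net activation of each hidden unit
    y = relu(net_j)                   # hidden nonlinearity f(net_j)
    net_k = W_out @ y + b_out         # net activation of each output unit
    z = sigmoid(net_k)                # output nonlinearity f(net_k)
    return z

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
z = forward(rng.normal(size=3),
            rng.normal(size=(4, 3)), np.zeros(4),
            rng.normal(size=(2, 4)), np.zeros(2))
print(z)
```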

Neural Networks as Universal Approximators

Neural Nets Can Model XOR Problem

Use a simple sign activation
function for hidden and output units:

f(net) = +1 if net ≥ 0, and f(net) = −1 if net < 0

First hidden unit: y_1 uses the boundary x_1 + x_2 + 0.5 = 0; y_1 = 1 if net_1 ≥ 0, y_1 = −1 otherwise

Second hidden unit: y_2 uses the boundary x_1 + x_2 − 1.5 = 0; y_2 = 1 if net_2 ≥ 0, y_2 = −1 otherwise

Single output unit z_1: emits z_1 = 1 iff y_1 = 1 and y_2 = −1.

y_1 models an OR gate, y_2 models an AND gate, and z_1 models y_1 AND NOT y_2

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 29 / 85

Neural Networks as Universal Approximators

Full Neural Network Discriminant Function

The class of decision functions that can be described by a three-layer neural network is:

g_k(x) = z_k = f( ∑_{j=1}^{n_H} w_{kj} f( ∑_{i=1}^{d} w_{ji} x_i + w_{j0} ) + w_{k0} )

The Universal Approximation Theorem [Hornik, 1989] shows that this covers any possible decision function mapping continuous inputs to a finite set of classes, to any desired level of accuracy. But it does not say how many hidden units would be needed.

In general, for a single layer it could be exponential (in the degrees of freedom). But using more layers can reduce the number of hidden units needed and make the model more generalizable.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 30 / 85

Training Neural Networks – Backpropagation

[Slides from Tom Mitchell's ML course, CMU, 2010; the backpropagation derivation figures did not survive extraction.]

E – the error function, e.g. MSE or cross-entropy loss. For neural networks, E[w] is no longer convex in w.

Error gradient for a sigmoid unit: computed using all training data D and forward propagation, where

y_k = target output (label) of output unit k

o_k = unit output (obtained by forward propagation)

w_ij = weight from unit i to unit j (note: if i is an input variable, o_i = x_i)

The objective/error is no longer convex in the weights.

Regularization Regularization to Reduce Overfitting

L2 regularization: augment the error function by adding ½λw² for each weight w in the neural network.

Need to choose λ carefully.

Interpretation: heavily penalizes "peaky" weight vectors and prefers diffuse weight vectors. Encourages the network to use all of its inputs instead of relying on the strongest signal.

Also called weight decay, since without input the weights would shrink over time during gradient descent.

L1 regularization: add λ|w| for each weight to the error function (encourages all weights toward zero).

(A code sketch of weight decay and dropout follows the heuristics slide below.)

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 37 / 85

Regularization Regularization to Reduce Overfitting

Figure: From http://www.kdnuggets.com/2015/04/preventing-overfitting-neural-networks.html/2

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 38 / 85

Regularization Dropout

Dropout (Hinton 2014): randomly ignore certain units during training and don't update them via gradient descent; this leads to hidden units that specialize.

With probability p, don't include a weight in the gradient updates.

Reduces overfitting by encouraging robustness of the weights in the network.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 39 / 85

Improving Performance Heuristics for Improving Backpropagation

There are a number of useful but not necessarily theoretically justified tricks (i.e. heuristics) for training neural networks that are useful in practice:

Limit the hidden nodes: just enough complexity to work, not so much that it overfits.

Train multiple networks with different sizes and search for the best design.

Validation set: train on the training set until the error on the validation set starts to rise, then evaluate on the evaluation set.

Try different activation functions: sigmoid or ReLU.

Randomly choose subsets of the data for training.

Dropout (Hinton 2014): randomly ignore certain units during training and don't update them via gradient descent, leading to hidden units that specialize.

Modify the learning rate over time (cooling schedule).

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 40 / 85
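As promised above, a minimal sketch (my own, not from the slides) of how L2 weight decay and dropout modify training; the layer sizes, rates, and the inverted-dropout scaling are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step_with_weight_decay(W, grad_W, lr=0.01, lam=1e-4):
    """L2 regularization adds 0.5*lam*W**2 to the error, so its gradient
    contributes lam*W: the weights 'decay' a little on every step."""
    return W - lr * (grad_W + lam * W)

def dropout_forward(h, p_drop=0.5, training=True):
    """Randomly zero hidden activations during training (inverted dropout:
    survivors are scaled by 1/(1-p_drop) so the expected activation is unchanged)."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

# Illustration with made-up values.
W = rng.normal(size=(4, 3))
grad_W = rng.normal(size=(4, 3))
W = sgd_step_with_weight_decay(W, grad_W)
h = rng.normal(size=4)
print(dropout_forward(h))                  # some entries zeroed, rest rescaled
print(dropout_forward(h, training=False))  # unchanged at test time
```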
(Goodfellow 2016)

Exponential Advantage of Depth

Montufar et al. (2014) showed that functions representable with a deep rectifier net can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Figure 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions which can capture all kinds of regular (e.g., repeating) patterns.

Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, shown formally by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers). Figure reproduced with permission from Montufar et al. (2014).

More precisely, the main theorem in Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer is

O( (n/d)^{d(l−1)} n^d ),   (6.42)

i.e., exponential in the depth l. In the case of maxout networks with k filters per unit, the number of linear regions is

O( k^{(l−1)+d} ).   (6.43)

Figure 6.5

Using ReLU hidden units, each hidden layer increases power: an exponential advantage from additional layers.

(Goodfellow 2016)

Better Generalization with Greater Depth

[Figure 6.6 plot: test accuracy (percent) versus number of layers, 3 to 11.]

Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See figure 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.

Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit.
Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for reducing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem-dependent. For example, convolutional networks, described in chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems. In this chapter, it is difficult to give much more specific advice concerning the architecture of a generic neural network. Subsequent chapters develop the particular architectural strategies that have been found to work well for different application domains.

Figure 6.6: Increasing accuracy from the number of layers in detection of numbers in photos of addresses.

(Goodfellow 2016)

Large, Shallow Models Overfit More

[Figure 6.7 plot: test accuracy (percent) versus number of parameters (×10^8), for 3-layer convolutional, 3-layer fully connected, and 11-layer convolutional networks.]

Figure 6.7: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).

Figure 6.7

Improving Performance Other Types of ANNs

Radial Basis Function Networks

Restricted Boltzmann Machines (RBM)

Deep Neural Networks

Convolutional Neural Networks (CNN): neighbours share weights, learn larger patterns, stacked in multiple levels (good for image classification)

Recurrent Neural Networks (RNN): allow links from outputs back to inputs over time, good for time-series learning

LSTM

Deep Reinforcement Learning

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 44 / 85

(Goodfellow 2016)

Network Diagrams

[Figure 6.2 diagrams: left, a unit-level graph with nodes x1, x2, h1, h2, y; right, a layer-level graph x → h → y with parameters W and w.]

Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is very explicit and unambiguous but for networks larger than this example it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer's activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.

… we used a vector of weights and a scalar bias parameter to describe an affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(xᵀW_{:,i} + c_i). In modern neural networks,
the default recommendation is to use the rectified linear unit or ReLU (Jarrett
et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a) defined by the activation
function g(z) = max{0, z} depicted in figure 6.3.

We can now specify our complete network as

f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b.   (6.3)

We can now specify a solution to the XOR problem. Let

W = [ 1  1
      1  1 ],   (6.4)

c = [  0
      −1 ],   (6.5)

Figure 6.2
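A quick numerical check of this solution (my own sketch; the output weights w = [1, −2]ᵀ and bias b = 0 complete the solution given in Goodfellow et al. (2016) and are not shown in the slide excerpt above):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f(x, W, c, w, b):
    """The network f(x; W, c, w, b) = w^T max{0, W^T x + c} + b."""
    return w @ relu(W.T @ x + c) + b

W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])   # from the book's full solution (assumption noted above)
b = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, f(np.array(x, dtype=float), W, c, w, b))
# Prints 0, 1, 1, 0 -- the XOR function.
```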

(Goodfellow 2016)

Mixture Density Outputs

Figure 6.4: Samples drawn from a neural network with a mixture density output layer.
The input x is sampled from a uniform distribution and the output y is sampled from
pmodel(y | x). The neural network is able to learn nonlinear mappings from the input to
the parameters of the output distribution. These parameters include the probabilities
governing which of three mixture components will generate the output as well as the
parameters for each mixture component. Each mixture component is Gaussian with
predicted mean and variance. All of these aspects of the output distribution are able to
vary with respect to the input x, and to do so in nonlinear ways.

to describe y becomes complex enough to be beyond the scope of this chapter.
Chapter 10 describes how to use recurrent neural networks to define such models
over sequences, and part III describes advanced techniques for modeling arbitrary
probability distributions.

6.3 Hidden Units

So far we have focused our discussion on design choices for neural networks that
are common to most parametric machine learning models trained with gradient-
based optimization. Now we turn to an issue that is unique to feedforward neural
networks: how to choose the type of hidden unit to use in the hidden layers of the
model.

The design of hidden units is an extremely active area of research and does not
yet have many definitive guiding theoretical principles.

Rectified linear units are an excellent default choice of hidden unit. Many other
types of hidden units are available. It can be difficult to determine when to use
which kind (though rectified linear units are usually an acceptable choice). We


Figure 6.4

This is the output of a model called a Mixture Density Network, which can
perform multimodal regression, where each x may have multiple outputs y.
The standard approach is to fit a Gaussian mixture model;
this can be specified as a neural network. [Goodfellow, p. 185]

Decision Trees Definition

Non-metric Classification

Metric Classification: k-nearest neighbours, SVM, logistic regression,
neural networks

What if the data features are not numeric (discrete or continuous)
values with a well defined order?

How do we perform classification when the features are, e.g., color ∈ {'red',
'blue', 'orange'} or DNA ∈ {A, G, C, T}?
How do we discriminate data with such labels? How do we learn a flexible
model that can predict such labels without just memorizing a mapping?
How do we make it generalizable?

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 47 / 85

Decision Trees Definition

Simple Decision Tree

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 48 / 85

Decision Trees Definition

Decision Tree

Any finite pattern can be represented by a finite series of yes/no questions.
Decision Tree:

Directed graph G = ⟨N, E⟩

Each node n ∈ N represents a choice on a particular variable/feature
F amongst B subsets of the values of F .

Edges (links, branches) connect a node to child (descendent) nodes
which model a split on another feature (or the same feature with
different split subsets)

Leaf nodes are nodes with no children which have a category label
assigned to them.

A decision tree categorizes a data point according to the label of the
leaf node that data point reaches by following the rules defined in the
internal nodes.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 49 / 85

Decision Trees Definition

Decision Tree Properties

A decision tree recursively partitions the input space, defining a local
model in each resulting region of the input space represented by a leaf
node.

If all the dimensions are numerical, we can envision what the decision
tree is doing as dividing the space up into rectangles.

Decision trees allow us to combine numerical and non-numerical data
dimensions in a single learning process.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 50 / 85

Decision Trees Training Decision Trees – CART

CART

A generalization of decision trees is the Classification and Regression Tree
(CART) framework. Questions we need to answer:

1 Should properties be binary only?

2 Which feature will be tested at each node?

3 When should a node become a leaf?

4 Can a tree be “too large”, how could we make it smaller?

5 If a leaf is impure, what label do we assign?

6 How do we deal with missing data?

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 51 / 85

Decision Trees Training Decision Trees – CART

Splitting Nodes

The question at each node splits the data along that dimension

Number of split choices is the branching factor of the tree.

We can represent any multiple branch split with a series of binary
splits.

Using a higher branching factor raises the risk of overfitting.

Finding the optimal partitioning of the data is NP-complete so we
usually use a greedy approach to find an approximate solution.

Need to specify:

feature to split on
value in feature range to split on
whether the node should be split or left as a leaf

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 52 / 85

Decision Trees Training Decision Trees – CART

Basic fitTree Algorithm

Algorithm 1: fitTree(node, D, depth)

1  node.prediction = mean(y_i : i ∈ D), or the label distribution
2  j, t, D_L, D_R = split(D)
3  if not worthSplitting(depth, cost(D), D_L, D_R) then
4      return node
5  else
6      node.testfeature = j
7      node.testvalue = t
8      node.left = fitTree(node, D_L, depth+1)
9      node.right = fitTree(node, D_R, depth+1)
10     return node

From (Murphy, 2012).

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 53 / 85
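A runnable sketch of the same recursion (my own illustration, not Murphy's code): it greedily searches every feature and threshold, scores splits with Gini impurity, and stops at a maximum depth or a pure node; all names and defaults here are choices of this sketch.

```python
import numpy as np
from collections import Counter

def gini(y):
    """Gini impurity of an integer label vector."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search over features and thresholds; returns (j, t, cost_drop)."""
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            drop = gini(y) - (left.mean() * gini(y[left]) +
                              (~left).mean() * gini(y[~left]))
            if drop > best[2]:
                best = (j, t, drop)
    return best

def fit_tree(X, y, depth=0, max_depth=3):
    node = {"prediction": Counter(y).most_common(1)[0][0]}
    j, t, drop = best_split(X, y)
    if depth >= max_depth or j is None or drop <= 1e-12:
        return node                     # leaf: not worth splitting
    left = X[:, j] <= t
    node.update(testfeature=j, testvalue=t,
                left=fit_tree(X[left], y[left], depth + 1, max_depth),
                right=fit_tree(X[~left], y[~left], depth + 1, max_depth))
    return node

# Tiny made-up dataset: class is 1 when the first feature exceeds 2.
X = np.array([[1, 0], [2, 1], [3, 0], [4, 1]], dtype=float)
y = np.array([0, 0, 1, 1])
print(fit_tree(X, y))
```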

Decision Trees Node Impurity

Cost as Node Impurity

Basic principle: we want the simplest model at each node, i.e. Occam's razor.

A leaf node is pure if all the datapoints associated with it have the
same label/category. (We'll say the impurity cost(D) = 0 in this case.)

Entropy Impurity – most popular

Gini Impurity – useful for training trees

Misclassification Impurity – intuitive but generally worse

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 54 / 85

Decision Trees Node Impurity

Entropy Impurity

cost(D) = − ∑_c P(ω_c) log₂ P(ω_c)

where:

P(ω_c) is the fraction (or probability) of points in that node having label ω_c:

P(ω_c) = (1/|D|) ∑_{i∈D} I(y_i = c)

If all points have the same label then cost(D) = 0, and cost(D) is maximized
when the points are uniformly likely to have any label.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 55 / 85

Decision Trees Node Impurity

Gini Impurity

Also called Gini Index

cost(D) = ∑_{i≠j} P(ω_i) P(ω_j) = (1/2) [ 1 − ∑_j P²(ω_j) ]


This is the expected error rate at node D if the label is selected randomly
from the distribution present in that node.
More strongly peaked at uniform probability than entropy impurity

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 56 / 85

Decision Trees Node Impurity

Misclassification Impurity

cost(D) = 1 − max_j P(ω_j)

Measures the minimum probability that a datapoint from this node would be misclassified.
Very peaked at uniform probability of all labels.
But it isn't smooth: its derivative is discontinuous, which breaks some gradient search methods.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 57 / 85
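The three impurity measures are easy to compare directly; here is a small sketch (mine, not from the slides) that computes each from a node's labels:

```python
import numpy as np

def class_probs(labels):
    """P(omega_c): fraction of the node's points having each label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy_impurity(labels):
    p = class_probs(labels)
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    # Using the 1/2 * (1 - sum p^2) form from the slide above.
    p = class_probs(labels)
    return 0.5 * (1.0 - np.sum(p ** 2))

def misclassification_impurity(labels):
    return 1.0 - class_probs(labels).max()

pure = ["a"] * 6                        # pure node: all impurities are 0
mixed = ["a", "a", "a", "b", "b", "b"]  # uniform node: all impurities maximal
for cost in (entropy_impurity, gini_impurity, misclassification_impurity):
    print(cost.__name__, cost(pure), cost(mixed))
```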

Decision Trees Node Impurity

Behaviour of Impurity Models

For the binary two-class case, the entropy and Gini impurity measures
are very similar.
Both are more sensitive to changes in class probability than the
misclassification rate.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 58 / 85

Decision Trees Node Impurity

Choosing a Split Value

Given a partial tree and deciding on node D, having chosen the feature to split on:

Question: What value should you split on?

Δcost(D) = cost(D) − ( |D_L|/|D| · cost(D_L) + |D_R|/|D| · cost(D_R) )

where:

D_L and D_R are the left/right descendant nodes obtained by splitting on the chosen feature at value s

|D| denotes the number of datapoints in the set
The best split is the one that maximizes ∆cost(D)
For entropy impurity this amounts to maximizing the information gain of
the split.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 59 / 85
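A sketch of choosing the split value by maximizing Δcost (my own illustration; it uses entropy impurity, so Δcost is exactly the information gain):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def delta_cost(y, y_left, y_right):
    """cost(D) - (|DL|/|D| cost(DL) + |DR|/|D| cost(DR))"""
    n = len(y)
    return entropy(y) - (len(y_left) / n * entropy(y_left) +
                         len(y_right) / n * entropy(y_right))

def best_threshold(x, y):
    """Best split value s for a single numeric feature x."""
    best_s, best_gain = None, -np.inf
    for s in np.unique(x)[:-1]:          # candidate thresholds
        gain = delta_cost(y, y[x <= s], y[x > s])
        if gain > best_gain:
            best_s, best_gain = s, gain
    return best_s, best_gain

# Made-up feature/label pair: the label flips around x = 2.
x = np.array([1.0, 1.5, 2.0, 3.0, 3.5])
y = np.array([0, 0, 0, 1, 1])
print(best_threshold(x, y))   # -> (2.0, ~0.97 bits of information gain)
```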

Decision Trees Stopping and Pruning

When to Stop Splitting?

When should we stop splitting?

Simplest approach: keep splitting until each leaf has only datapoints
in the same class or has only a single datapoint. → Tends to overfit
for classification/regression.

Minimize Cross-Validated Error: measure the decrease in error a
new node would cause against a validation set, or on many random
cross-validation sets. Then stop splitting when the gain becomes
insufficient.

Threshold: stop splitting if the proposed split provides less than ε reduction in impurity, or if the leaf has fewer than 5 datapoints.

Minimize Global Criterion: ...

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 60 / 85

Decision Trees Stopping and Pruning

Minimize A Global Criterion

Define a global criterion and keep splitting until it reaches a minimum value:

α · size + ∑_{d ∈ leafnodes(N)} cost(d)

where:

size – the number of nodes, or edges

α – some positive constant

Properties:

This acts as a regularizer that discourages larger trees.

If entropy impurity is being used, this is equivalent to minimizing the minimum description length: the number of bits used to represent the model.

The sum of leaf impurities is a measure of uncertainty in the training data given the current model.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 61 / 85

Decision Trees Stopping and Pruning

Pruning the Tree

Horizon Effect: what if you stop splitting at a node but later features would have led to better classification? You can't see past the horizon unless you explore it.

Stop too early and the model has low predictive power. Stop too late and it overfits.

So why not fully grow the tree first and then decide what to remove?

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 62 / 85

Decision Trees Stopping and Pruning

Basic Pruning Algorithm

1 Grow a full tree: until leaf node impurity hits a minimum or only a very small number of datapoints remain in each leaf.

2 Find a pair of sibling leaf nodes i and j which have not yet been examined.

3 Calculate how much the impurity of the parent node k would go up if i and j were eliminated, then remove the branches that would cause the smallest increase in error.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 63 / 85

Decision Trees Stopping and Pruning

Benefits of Pruning

PROs

All the data can be used for training, unlike the threshold cross-validation approaches

Avoids the horizon effect

CONs

More expensive than computing when to stop early

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 64 / 85

Decision Trees Stopping and Pruning

Another way to think about CART

CART can be seen as an adaptive basis-function (kernel) model. The basis function defines regions as hyper-rectangles, and the weights specify the regression response, or the matching label proportions, for each region.

f(x) = E[y|x] = ∑_{m=1}^{M} w_m I(x ∈ R_m) = ∑_{m=1}^{M} w_m φ(x; v_m)

where:

R_m is the m-th region partitioned by the tree

v_m encodes the choice of feature and the threshold value to split on

I is an indicator function that is 1 iff datapoint x falls within that hyper-rectangular region

From (Murphy, 2012)

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 65 / 85

Decision Trees Stopping and Pruning

Decision Tree Algorithm: ID3

Intended for use with nominal, unordered data.

On each iteration, it iterates through every unused feature and calculates the entropy for that attribute. It then selects the one with the smallest entropy and the dataset is split on that feature.

Creates a branch for each value of the current feature (not binary!)

The algorithm continues to recurse on each subset, considering only features never selected before.

Stopping: every element in the subset belongs to the same class, or no more features are left; if a leaf is not pure then the most common class label is used.

ID3 is only preferred if computational costs are a big issue.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 66 / 85

Decision Trees Stopping and Pruning

Decision Tree Algorithm: C4.5

An improvement on ID3, and the most popular in use (see the WEKA data mining tool).
It performs pruning on the tree after it is grown.

Missing Data: what if some datapoints don't have a value for feature f? C4.5 adds a '?' and doesn't use that feature for entropy calculations; the algorithm otherwise stays the same.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 67 / 85

Decision Trees Stopping and Pruning

Pros and Cons of Decision Trees

PRO:

Simple to implement, with lots of flexibility in impurity measures and training algorithms

The resulting model is easy to interpret as logical rules

Can be seen as an automated kernel learning method, similar to Neural Networks

Universal expressive power (also like neural networks)

Handle nominal, discrete and continuous data

They perform automated variable selection

CON:

Very easy to overfit the data and create a complete mapping

Tend to be unstable: small changes in the input lead to very different trees

Fairly slow to train (though not as bad as neural networks)

Don't perform as well as more modern methods (but...)

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 68 / 85

Ensemble Methods Definition

Ensemble Methods for Improving Classification

Basic Idea: Rather than training just one model to predict the data with low error, try learning multiple models on subsets of the data and combine them to get an overall lower error than is possible from a single model.

General Ensemble Methods:

Bagging

Boosting

Adaboost

Also called Adaptive Reweighting and Combining (Arcing)

Decision Tree Ensembles:

Random Forests

Extremely Randomized Trees, Isolation Forests

Hoeffding Trees, Mondrian Forests, Streaming Random Forests

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 69 / 85

Ensemble Methods Bagging

Bootstrap Revisited

A bootstrap data set is created by randomly selecting n points from the training set X of size n, with replacement. There will usually be some duplicates.

This is repeated B times with independent draws each time.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 70 / 85

Ensemble Methods Bagging

Bagging

Bagging (bootstrap aggregation):

Draw multiple subsamples of the data of size n′ < n

Learn a component classifier using each bagged sample

The final classification decision is based on a vote from each component classifier

Usually the same classifier is used for each component (neural network, SVM, decision tree), but they could be different (more on multi-classifier systems later)

Advantage: reduces instability (i.e. the sensitivity of classifier performance to small changes in the data)

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 71 / 85

Ensemble Methods Boosting

Boosting

A meta-algorithm that allows us to create an ensemble of classifiers with arbitrarily high accuracy.

Train multiple classifiers as in bagging, but each bootstrap sample is chosen to maximize the information gain conditioned on the previously trained classifiers.

The classifier could be a weak learner: only guaranteed to be better than chance.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 72 / 85

Ensemble Methods Boosting

Boosting Example

Divide D into three sets D1:3 of size < n

Run your favourite classification algorithm on D1 to create C1

Select subset D2 by randomly selecting 50% of points that C1 gets correct and 50% that C1 gets incorrect.
Train C2 on D2

Select D3 to be all points in D that C1 or C2 get wrong

Train C3 on D3

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 73 / 85

Ensemble Methods Boosting

Adaboost

The most popular variation of the boosting meta-algorithm.

We assume the sign of the weak learner h_k(x) gives the class, and the magnitude of the value gives the confidence in the classification.

It essentially classifies the datapoints using a weighted majority vote of different classifiers, each trained to perform well where all the previous classifiers are doing badly.

Compute a joint discriminant function g(x) = ∑_{k=1}^{kmax} α_k h_k(x)

Now, classify the datapoints using sign(g(x))

*Further reading:* a great chapter by Schapire about possible explanations of why AdaBoost is so effective: http://rob.schapire.net/papers/explaining-adaboost.pdf

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 74 / 85

Ensemble Methods Boosting

AdaBoost

Algorithm 2: AdaBoost Algorithm(X, Y)

Input: Datapoints (x_i, y_i) where x_i ∈ X and the label y_i ∈ {−1, +1}, for i = 1, . . . , n

1  Assign initial weights w_1(i) = 1/n
2  foreach k = 1, . . . , K do
3      Train a weak-learner classifier C_k using data sampled according to w_k(i)
4      ε_k = Pr_{i∼w_k}[ C_k(x_i) ≠ y_i ]   /* error rate of C_k */
5      α_k = (1/2) ln( (1 − ε_k) / ε_k )
6      foreach i = 1, . . . , n do
7          w_{k+1}(i) = w_k(i) exp(−α_k y_i C_k(x_i)) / Z_k
8  return the classification function C(x) = sign( ∑_{k=1}^{K} α_k C_k(x) )

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 75 / 85
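A compact sketch of this algorithm using decision stumps as the weak learners (my own illustration; the stump search, the toy data, and the small number of rounds are simplifying assumptions):

```python
import numpy as np

def train_stump(X, y, w):
    """Weighted decision stump: best (feature, threshold, polarity)."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(X[:, j] <= t, polarity, -polarity)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, (j, t, polarity))
    return best[1]

def stump_predict(X, stump):
    j, t, polarity = stump
    return np.where(X[:, j] <= t, polarity, -polarity)

def adaboost(X, y, K=10):
    """y must be in {-1, +1}. Returns the list of (alpha_k, stump_k)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                # w_1(i) = 1/n
    ensemble = []
    for _ in range(K):
        stump = train_stump(X, y, w)
        pred = stump_predict(X, stump)
        eps = w[pred != y].sum()           # weighted error rate eps_k
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        w = w * np.exp(-alpha * y * pred)  # reweight the datapoints
        w /= w.sum()                       # normalize by Z_k
        ensemble.append((alpha, stump))
    return ensemble

def predict(X, ensemble):
    g = sum(a * stump_predict(X, s) for a, s in ensemble)
    return np.sign(g)

# Toy 1-D data that a single stump cannot fit, for illustration.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
print(predict(X, adaboost(X, y, K=3)))   # -> [ 1.  1. -1. -1.  1.  1.]
```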
Random Forests

Random Forest

A very good way to reduce the variance of decision trees and make them more stable to different inputs.

Bagging: train M trees on randomly chosen subsets of the data (with replacement) and average them:

f(x) = ∑_{m=1}^{M} (1/M) f_m(x)

Bagging alone leads to highly correlated trees, so it helps but could be better.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 76 / 85

Random Forests

Random Forests

From (Breiman, 2001). Builds on the bagging algorithm.

Will learn T trees. For each tree, obtain a bootstrap sample of the datapoints and a sample of K features to use for training that tree.

Use the standard CART tree learning algorithm for each tree, but at each split step find the optimal split by searching only those K features and their values.

The next tree will use a different sample of features and datapoints.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 77 / 85

Random Forests

Random Forests

Decorrelates the individual trees (base learners) by also randomly choosing the feature set each can choose from.

Result: each tree focusses on a subset of the features and learns them better.

Outperforms many other methods in terms of accuracy. Boosting can take this even further.

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 78 / 85

Random Forests

PROs and CONs of Random Forests

PROs:

Robust to the initial data distribution and the sampling process

Easily tunable: more trees improves accuracy and reduces variance

Relatively fast to train for such a flexible model

CONs:

Harder to interpret than decision trees; one can try to average rules across trees

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 79 / 85

Random Forests Extremely Randomized Trees

Extremely Randomized Trees

From (Geurts, Ernst and Wehenkel, 2006)

Randomly choose a subset of features

Randomly choose a cut point in each of those features (not by searching)

Choose the best performing of those features as the split (maximize a score, which could be an impurity measure)

Totally Randomized Trees: just choose one at random, with no maximization (faster and more robust, but bad for classification)

Stop when the minimum number of datapoints for a leaf, K, is reached

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 80 / 85

Random Forests Extremely Randomized Trees

Two main differences with other tree-based ensemble methods:

Splits nodes by choosing cut-points fully at random

Uses the whole learning sample (rather than a bootstrap replica) to grow the trees

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 81 / 85

Random Forests Other Tree Ensemble Methods

Isolation Forests (Liu, 2008): anomaly detection (two-class classification), unsupervised

Hoeffding Trees (Domingos, 2000): streaming data for single trees for regression, with confidence bounds; nodes only split when enough data arrives for the confidence level

Mondrian Forests (Lakshminarayanan, 2014): streaming data for forests, with confidence bounds

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 82 / 85

Classification Performance Comparisons

Performance Comparisons

From (Murphy, 2012, p.585)

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 83 / 85

Classification Performance Comparisons

Classification Performance Comparisons

From http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 84 / 85

Classification Performance Comparisons

Summary

Logistic Regression as a simple classifier

Stochastic Gradient Descent

Neural Networks / Multilayer Perceptrons

Decision Trees

Ensemble Methods

Mark Crowley ECE 657A : Lecture 8 March 1, 2017 85 / 85

Class Admin: Announcements

Classification: Linear Discriminant Functions; Linear and Logistic Regression

Artificial Neural Networks: History and Overview; Feedforward Artificial Neural Networks; Neural Networks as Universal Approximators; Training Neural Networks – Backpropagation; Improving Performance; The Threat of Overfitting; Regularization; Improving Performance

Decision Trees: Definition; Training Decision Trees – CART; Node Impurity; Stopping and Pruning

Ensemble Methods: Definition; Bagging; Boosting

Random Forests

Classification Performance Comparisons