
ECE 657A: Classification
Lecture 5: Classification, Training, Validation and Estimation

Mark Crowley

January 24, 2016


Class Admin: Announcements

Today’s Class

Announcements

Supervised Learning / Classification

Project Pitch Session

Break

Naive Bayes and MAP estimation

Training, validation, error estimation


Announcements

Today (Feb 1): Project Pitch Session.

Assignment 1: due Feb 10 at 11:59pm (do in groups)

Next week (Feb 8):

I’ll be travelling Feb 3-9
11:30am TA help session
Guest lecture on Data Analytics tools for R.

Oxt week (Feb 15): TA help session for the assignment or project
(11:30-12:50) and guest lecture on Data Analytics tools for R by one
of Prof. Fischmeister’s students (1-2:2).


http://www.urbandictionary.com/define.php?term=oxt

Class Admin: Pitch Session

Pitch Session : Structure

Your group will be matched with two other groups – see table.

Your group will have 5-10 minutes to pitch your idea to two other
groups.

You can use any materials you want: I’ll have flip chart paper, you
can use the chalkboard or whiteboard, you could use a laptop, or you
could just talk.

Then 5-10 minutes of discussion and feedback.

Then repeat with the next group.


Pitch Session : Grading

You will not be graded on your pitch quality or your topic.

You will be graded on your feedback for others. (1.5% out of 4% peer
review)

Peer review forms:

Explain the project idea: you need to provide a concise, fairly accurate
explanation of their idea (shows you listened)
Do you think the project is too easy, too hard or just right?
Give some substantial, constructive suggestion for their project.
Explain how your suggestion would improve the project’s ability to
teach the class about something new.
Submit form online : https://goo.gl/forms/YpMlLC2qAsC5YXJW2


Outline

Class Admin
Announcements
Pitch Session

Supervised vs. Unsupervised Learning

Classification

Training, Validation and Error Estimation
Dividing the Data
Confusion Matrix
ROC and AUC Curves
Other Performance Measures

Hypothesis Testing


Supervised vs. Unsupervised Learning

Descriptive versus Inferential Analysis

We have data (samples). This data is a sample of a population (more
than just the measured or observed sample).

Descriptive Analysis is the type of analysis and measures that seek
to describe and summarize the data, the available samples. We cannot
in general use it to draw conclusions about unobserved data.

Inferential Analysis (predictive) is the type of analysis that can
describe measures over the population of data, that is, both observed
and unobserved data.


Another way to look at it: Machine Learning [Murphy,
2012]

Supervised Learning Also called: predictive, inferential
Given inputs and outputs
Model based: model or distribution is known

Parameter Estimation: distribution known, fitting
parameters
Linear Regression, least squares estimation

output is continuous: regression, prediction
output is categorical (label) : classification

Unsupervised Learning Also called: descriptive, knowledge discovery
Given inputs only
output is categorical : clustering

Reinforcement Learning Input and output given only when action
(prediction, choice, activity) provided.
Given feedback about utility of choice made.
No given model or labels ahead of time.
Learning is interactive.

Clustering vs. Classification

Clustering

Unsupervised

Uses unlabeled data

Organize patterns w.r.t. an
optimization criterion

Notion of similarity

Hard to evaluate

Examples: K-means, Fuzzy C-means, Hierarchical

Classification

Supervised

Uses labeled data

Requires training phase

Domain sensitive

Easy to evaluate

Examples: Bayesian, KNN, SVM, Decision Trees


Classification

Classification Definition

Definition:

Classification is a learning method that uses training samples (with
known class labels) to learn how to assign samples to their proper
classes. The task of the classifier is to use the feature vector
provided by the feature extractor to assign the object (datapoint) to a
category.

This learning can then be used to label test samples (with unknown
labels).

Classification can be facilitated if the features (or a subset of them)
of samples in the same class share characteristics that discriminate
them from other classes.

It can be seen as the discrete form of prediction: mapping the
input values to a discrete set of output values rather than to a
continuous number.


Classification

How do we design a classifier that can deal with the variability in
feature values?

Designing or selecting a classifier for the data or application is not an
obvious task.


Definition

Given a dataset X = {x1, x2, . . . , xn} where each datapoint xi ∈ X
contains F features denoted xi,f,

and given a set of classes C = {C1, . . . , CK},
the classification problem is to define a mapping δ : X → C
where each xi is assigned to one class.

A class, Cj , contains precisely those points mapped to it, no more
and no less.

Note: the classes are predefined, nonoverlapping and partition the
entire set of datapoints.

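To make the mapping δ concrete, here is a minimal sketch (not from the slides) of one of the simplest classifiers, a nearest-mean classifier; the function names and the toy data are invented for illustration.

```python
# Minimal sketch of one concrete mapping delta: X -> C, here a nearest-mean
# classifier. All names and the toy data below are invented for illustration.
import numpy as np

def fit_class_means(X, y):
    """Compute the mean feature vector of each class from labelled training data."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def delta(x, class_means):
    """Assign datapoint x to the class whose mean is closest (Euclidean distance)."""
    return min(class_means, key=lambda c: np.linalg.norm(x - class_means[c]))

# Toy data: two classes, two features.
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array([0, 0, 1, 1])
means = fit_class_means(X_train, y_train)
print(delta(np.array([0.15, 0.15]), means))  # -> class 0
print(delta(np.array([0.95, 0.90]), means))  # -> class 1
```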

Considerations for Designing a Classification Approach

1. Data: labelled, one class or multiclass, class overlap, missing values

2. Training and Error Estimation procedures

3. Classification methods

4. Performance measures


Classifier Methods

Similarity Based Approach
Template matching
Nearest Mean classifier
Neural Network classifiers
Deep Learning

Other
Decision Trees
Random Forests

Probabilistic Approach
Bayesian Classification
Logistic regression

Decision Boundary Approach
Geometrical
Neural Networks
SVM


Training, Validation and Error Estimation: Dividing the Data

Training and Error Estimation

Resubstitution

Holdout Method

Leave-one-out

K-fold Cross Validation

Bootstrap


Error Estimation and Training Methods

Resubstitution Method: Uses all the available data for training and then
the same data is used for testing

Optimistically biased estimate, especially when the ratio
of sample size to dimensionality is small

Holdout Method: Uses half the data for training and the other half is used
for testing; training and test sets are independent

Pessimistically biased estimate
Different partitionings may give different results

Leave-one-out Method: A classifier is designed using (n − 1) samples and
evaluated on the remaining one sample; this is repeated n
times with different training sets of size (n − 1)

Estimate is unbiased but it has a large variance
Large computational requirement because n different
classifiers have to be designed

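As a rough illustration of the holdout and leave-one-out procedures, here is a minimal sketch using scikit-learn; the synthetic dataset and the choice of a k-NN classifier are assumptions made only for this example.

```python
# Hedged sketch of holdout and leave-one-out error estimation (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Holdout: train on half the data, test on the other half.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("holdout error:", 1 - clf.score(X_te, y_te))

# Leave-one-out: n classifiers, each trained on n-1 samples, tested on the one left out.
errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(model.predict(X[test_idx])[0] != y[test_idx][0])
print("leave-one-out error:", np.mean(errors))
```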

Error Estimation Methods

K-fold cross validation (Rotation): divide the available samples into K
disjoint subsets; train on K − 1 of the subsets and test on the
remaining one, rotating through all K folds. Can repeat the
process with different partitionings.

Estimate has lower bias than the holdout method and is
cheaper to implement than leave-one-out method.

Bootstrap Method: Generate many bootstrap sets of size n by sampling
with replacement.

Bootstrap estimates can have lower variance than the
leave-one-out method.
Computationally more demanding.
Useful in small sample size situations.

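The sketch below shows one way K-fold cross validation and a bootstrap ("out-of-bag") error estimate could be computed with scikit-learn; the dataset, the number of folds, the number of bootstrap sets and the Gaussian Naive Bayes classifier are all assumptions for illustration.

```python
# Hedged sketch of K-fold cross validation and a bootstrap error estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import resample

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# K-fold: train on K-1 folds, test on the held-out fold, rotate through all folds.
scores = cross_val_score(GaussianNB(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold error estimate:", 1 - scores.mean())

# Bootstrap: sample n points with replacement, test on the points left out.
n, errs = len(X), []
for b in range(200):
    idx = resample(np.arange(n), replace=True, n_samples=n, random_state=b)
    oob = np.setdiff1d(np.arange(n), idx)          # "out-of-bag" test points
    model = GaussianNB().fit(X[idx], y[idx])
    errs.append(1 - model.score(X[oob], y[oob]))
print("bootstrap error estimate:", np.mean(errs))
```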

Validation

Sometimes we need to tune certain parameters of the classifier such
as k in the K Nearest Neighbours classifier or parameters that control
the architecture of a neural network.

This can be done using a validation set, leaving the test set to
evaluate the final, tuned classifier

So, the data is divided into 3 subsets: Training, Validation and
Testing

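For example, tuning k in a k-NN classifier with a separate validation set might look like the following sketch; the 60/20/20 split and the candidate values of k are assumptions chosen only for illustration.

```python
# Hedged sketch: train/validation/test split for tuning k in k-NN.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 60% training, 20% validation, 20% test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Choose the k that does best on the validation set ...
best_k = max([1, 3, 5, 7, 9],
             key=lambda k: KNeighborsClassifier(n_neighbors=k)
                           .fit(X_tr, y_tr).score(X_val, y_val))

# ... and only then evaluate that single choice on the untouched test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print("chosen k:", best_k, " test accuracy:", final.score(X_te, y_te))
```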

Training, Validation and Error Estimation: Confusion Matrix

Classification Confidence

For a binary classification problem (i.e. C = {C0, C1}) our decision rule
could be

δ(x) = I(f(x) > τ)

where:

f(x) is a measure of confidence that x ∈ C1. This could be a
probability, a distance or another function.

I(·) is the indicator function: it equals 1 when its argument is true
and 0 otherwise.

τ (“tau”) is a threshold parameter that is used to make a decision
on which class to assign to each x.

For different values of τ we will get a different number of xi ∈ C1.

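As a small illustration of this decision rule, the sketch below thresholds the class-1 probabilities of a scoring classifier at a few values of τ; the synthetic data and the use of logistic regression as f(x) are assumptions for this example only.

```python
# Hedged sketch of delta(x) = I(f(x) > tau) for a few thresholds tau.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]  # f(x) in [0, 1]

for tau in (0.3, 0.5, 0.7):
    y_hat = (scores > tau).astype(int)        # indicator I(f(x) > tau)
    print(f"tau = {tau}: {y_hat.sum()} points assigned to C1")
```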

Performance Measures: A Worked Example

Example: 100 students, 30 tall (P = 30) and 70 not tall (N = 70).

The classifier labels 65 students as tall: 20 true positives (TP) and
45 false positives (FP). It labels 35 students as not tall: 25 true
negatives (TN) and 10 false negatives (FN).

The same counts in information-retrieval terms:

                    Relevant      Not Relevant
Retrieved           TP = 20       FP = 45        (65 retrieved)
Not Retrieved       FN = 10       TN = 25        (35 not retrieved)
                    P = 30        N = 70         (P + N = 100)


Using a Confusion Matrix

For a particular value of τ we can build a confusion matrix:

                     True Value
                     1                0
Estimated      1     TP               FP
               0     FN               TN

Sum                  N+ = TP + FN     N− = FP + TN

TP: True Positive

FP: False Positive (False alarm)

TN: True Negative

FN: False Negative (Missed detection)


N = TP + FP + FN + TN

N̂+ = TP + FP

N̂− = FN + TN


Notes:

For more than one class you could build one of these tables for each
class, counting whether each point is in that class or not.
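A minimal sketch of filling in this table for one threshold, using scikit-learn's confusion_matrix; the labels and scores below are invented, and note that scikit-learn orders the 2x2 result as [[TN, FP], [FN, TP]] rather than as laid out above.

```python
# Hedged sketch: confusion matrix counts and marginal sums for one threshold tau.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])
y_pred = (scores > 0.5).astype(int)                 # tau = 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("N+  =", tp + fn, "  N-  =", fp + tn)         # true positive / negative counts
print("N^+ =", tp + fp, "  N^- =", fn + tn)         # estimated positive / negative counts
```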


The False Positive vs False Negative Tradeoff

From this table we can also compute various success and error rates:

                     True Value
                     1                  0
Estimated      1     TPR = TP/N+        FPR = FP/N−
               0     FNR = FN/N+        TNR = TN/N−

TPR: True Positive Rate (Sensitivity, Recall, Hit rate)

FPR: False Positive Rate (False alarm rate, Type I Error)

TNR: True Negative Rate (Specificity)

FNR: False Negative Rate (Miss rate, Type II Error)
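Plugging in the counts from the tall/not-tall worked example earlier gives the following rates; this is only a small illustrative calculation.

```python
# Hedged sketch: rates derived from the worked example's confusion matrix counts.
TP, FP, FN, TN = 20, 45, 10, 25
P, N = TP + FN, FP + TN        # N+ = 30, N- = 70

TPR = TP / P    # sensitivity / recall / hit rate       -> 0.667
FNR = FN / P    # miss rate (Type II error)             -> 0.333
FPR = FP / N    # false alarm rate (Type I error)       -> 0.643
TNR = TN / N    # specificity                           -> 0.357
print(TPR, FNR, FPR, TNR)
```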

Remember, this depends on τ , so how do we find the best value of τ?


Performance Measures

Using the worked example above:

Error rate = (FN + FP) / (P + N) = (10 + 45)/100 = 55%  (misclassification rate)

Accuracy = (TP + TN) / (P + N) = (20 + 25)/100 = 45%  (overall recognition rate)

Sensitivity (Recall, tp rate) = TP / P = TP / (TP + FN) = 20/30 ≈ 66.6%
How many of the true positives are classified positive (completeness).

The per-class rates are
tp rate = TP/P,  fn rate = FN/P,  fp rate = FP/N,  tn rate = TN/N.

fp rate = FP / N = 45/70 ≈ 64%  (false alarm rate)

Specificity = TN / N = 25/70 ≈ 36% = 1 − fp rate

Precision (positive predictive value) = TP / (TP + FP) = 20/65 ≈ 31%
How many of the classified positives are true positives (exactness).

Performance Measures: F-measure

F-measure = 2 · Precision · Recall / (Precision + Recall)
          = 2 / (1/Precision + 1/Recall)

The F-measure is the harmonic mean of Precision and Recall; closer to 1 is better.

Performance Measures: Averaging Over Classes

For multiclass problems with C classes there are two ways to calculate an
overall average F-measure:

(a) Micro-averaged F-measure: calculate a global F-measure over all
classes by pooling the counts,

Precision_micro = Σi TPi / Σi (TPi + FPi)
Recall_micro    = Σi TPi / Σi (TPi + FNi)
F_micro-averaged = 2 · Precision_micro · Recall_micro / (Precision_micro + Recall_micro)

where the sums run over the classes i = 1, . . . , C.

(b) Macro-averaged F-measure: compute the F-measure locally over each
class and then average over all classes,

Precision_i = TPi / (TPi + FPi),  Recall_i = TPi / (TPi + FNi)
F_i = 2 · Precision_i · Recall_i / (Precision_i + Recall_i)
F_macro-averaged = (1/C) Σi F_i

where C is the number of classes.
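A quick sketch of the difference between the two averages using scikit-learn's f1_score; the toy multiclass labels below are invented for illustration.

```python
# Hedged sketch: micro- vs macro-averaged F-measure for a 3-class toy problem.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print("per-class F:      ", f1_score(y_true, y_pred, average=None))     # one F per class
print("micro-averaged F: ", f1_score(y_true, y_pred, average="micro"))  # pooled counts
print("macro-averaged F: ", f1_score(y_true, y_pred, average="macro"))  # mean of per-class F
```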

Training, Validation and Error Estimation: ROC and AUC Curves

Receiver Operating Characteristic (ROC) Curve

ROC was originally conceived during WWII to assess the performance of
radar systems.

If we apply our decision rule δ(x) for a range of τ then we can draw a
curve of any of the success/error rates.

If we plot TPR vs FPR we get the ROC Curve

Say we have two classifiers or distributions. One for C0 and another
for C1



ROC Curve (Receiver Operating Characteristic)

T. Fawcett, An Introduction to ROC Analysis, Pattern Recognition
Letters, 27, 2006, 861-874.

[Figure: ROC space plotting tp rate (TP/P) against fp rate (FP/N) for
example classifiers A to E. Points toward the left are conservative,
points toward the upper right are liberal classifiers (high tp but also
high fp), and points below the tp = fp diagonal are underperforming,
worse than random.]

• A discrete classifier (outputs only a class label) produces a single point
• A is more conservative than B
• C is random guessing (tp = fp), anywhere along the whole diagonal line

ROC Curve (Receiver Operating Characteristic)

Can create a curve from a scoring classifier by sweeping a threshold
from extreme conservative to extreme liberal.

[Figure: the ROC curve traced from the scores in the table below, with
annotated operating points at roughly 70% and 60% accuracy.]

Instance True Class Score
1 p 0.9
2 n 0.7
3 p 0.55
4 p 0.54
5 n 0.52
6 p 0.51
7 p 0.38
8 n 0.35
9 n 0.33
10 n 0.1

Example: 10 instances; for a threshold τ, classify an instance as p if
its score is at least τ, otherwise n.

At τ = 0.9 only instance 1 is classified positive: tp rate = 1/5, fp rate = 0/5 = 0.
At τ = 0.7 instances 1 and 2 are classified positive: tp rate = 1/5, fp rate = 1/5.

Repeating this for every threshold produces the points of the ROC curve.
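The same sweep can be reproduced from the score table above with scikit-learn's roc_curve, which returns the (fp rate, tp rate) points over thresholds, and auc, which gives the area under the resulting curve; encoding p as 1 and n as 0 is the only assumption here.

```python
# Hedged sketch: ROC points and AUC from the 10-instance score table above.
from sklearn.metrics import roc_curve, auc

y_true = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0]                 # p -> 1, n -> 0
scores = [0.9, 0.7, 0.55, 0.54, 0.52, 0.51, 0.38, 0.35, 0.33, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("thresholds:", thresholds)
print("fp rate:   ", fpr)
print("tp rate:   ", tpr)
print("AUC:", auc(fpr, tpr))
# e.g. at threshold 0.9: (fp rate, tp rate) = (0.0, 0.2); at 0.7: (0.2, 0.2), matching above.
```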

False Acceptance Rate vs False Rejection Rate

[Figure: trade-off curve between the False Rejection Rate (FRR, driven
by false negatives) and the False Acceptance Rate (FAR, driven by false
positives); tuning for high accuracy might mean rejecting more.]


Another Confusion Matrix

We can also create a confusion matrix using our estimated count of
positives and negatives:

                     True Value
                     1                            0
Estimated      1     TP/N̂+ = PPV = precision      FP/N̂+ = FDP
               0     FN/N̂−                        TN/N̂− = NPV

Precision measures what fraction of our detections are actually positive.
Recall measures what fraction of all the positives we actually
detected.



Precision-Recall Curve

If we plot precision P vs recall R as we vary τ then we get a
precision-recall curve which can often show us different performance
behaviour from the ROC curve.

The P-R curve does not use the TN count, so it is useful when there
is a very small number of positive cases or when the number of
negatives could scale based on some parameter.

The curve bends the other way compared to the ROC curve: the best
operating points are now toward the top-right corner (high precision
and high recall).
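A sketch of how the two curves could be drawn side by side for a classifier on data with few positives; the imbalanced synthetic dataset and the logistic regression scorer are assumptions for illustration.

```python
# Hedged sketch: ROC curve and precision-recall curve for the same classifier.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)
prec, rec, _ = precision_recall_curve(y_te, scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(fpr, tpr); ax1.set_xlabel("fpr"); ax1.set_ylabel("tpr"); ax1.set_title("ROC")
ax2.plot(rec, prec); ax2.set_xlabel("recall"); ax2.set_ylabel("precision"); ax2.set_title("P-R")
plt.show()
```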



ROC vs PR Curves

Line A is better than line B in both curves.

[Figure: left, the ROC curve (tpr vs fpr); right, the precision-recall
curve (precision vs recall). In both plots, classifier A's curve lies
above classifier B's.]



Summary

Naive Bayes Classification

Cross Validation

Evaluation Metrics

Confusion Matrices and ROC curves

