Evaluating Hypotheses
COSC2673 | COSC2793 | Semester 1 2021 | (Computational) Machine Learning
Revision: Supervised Learning
In supervised learning, the output is known: 𝑦 = 𝑓(𝐱)
Experience: Examples of input-output pairs
Task: Learns a model that maps input to desired output
Predict the output for new “unseen” inputs.
Performance: Error measure of how closely the hypothesis predicts the target output
The most common type of learning task
Two main types of supervised learning: Classification
Regression
Revision: Regression
Experience: $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, $\mathbf{x} \in \mathbb{R}^m$, $y \in \mathbb{R}$
A form of supervised learning
Hypothesis space:
• Linear: $h_\theta(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_m x_m = \boldsymbol{\theta}^\top \mathbf{x}$
• Polynomial: $h_\theta(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \cdots + \theta_i x_m + \theta_j x_m^2 + \cdots$
Loss function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^2$
Optimization: Gradient descent
Repeat { $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ for all $j$ }
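As a refresher, a minimal NumPy sketch of batch gradient descent for the linear hypothesis and loss above; the toy data, learning rate and iteration count are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for h_theta(x) = theta^T x with J = 1/(2n) * sum (h - y)^2."""
    n, m = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend x0 = 1 so theta_0 acts as the intercept
    theta = np.zeros(m + 1)
    for _ in range(iterations):
        errors = Xb @ theta - y            # h_theta(x^(i)) - y^(i) for every example
        grad = (Xb.T @ errors) / n         # partial derivatives dJ/dtheta_j
        theta -= alpha * grad              # simultaneous update of all theta_j
    return theta

# Illustrative data where y is roughly 3 + 2*x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([3.1, 5.0, 6.9, 9.2])
print(gradient_descent(X, y))
```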
Revision: Logistic Regression
Experience: $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, $\mathbf{x} \in \mathbb{R}^m$, $y \in \{0, 1\}$
Another form of supervised learning
Hypothesis space:
• Linear: $h_\theta(\mathbf{x}) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_m x_m) = g\left(\boldsymbol{\theta}^\top \mathbf{x}\right)$, where $g(z) = \frac{1}{1 + e^{-z}}$
Loss function: $J(\theta) = -\frac{1}{2n} \sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(\mathbf{X}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(\mathbf{X}^{(i)}\right)\right) \right]$
Optimization: Gradient descent
Repeat { $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ for all $j$ }
Measuring Performance
Model Development
[Diagram: a car's Km, M. Year and Fuel C. are fed into f(x) / h*(x), which outputs the Price, e.g. $9,750.]

Price ($)  | Km      | M. Year | Fuel C.
10,000     | 102,000 | 2005    | 7.8
23,500     | 25,000  | 2010    | 5.2
12,250     | 256,000 | 2008    | 9.9
40,100     | 12,000  | 2018    | 11.2
5,000      | 23,000  | 2000    | 12.7
19,200     | 55,000  | 2015    | 12.4
12,500     | 121,000 | 2012    | 21.0
Assumption:
Nature of the relationship (e.g. Linear, polynomial)
$h_\theta(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$
Training:
Find the "best" hypothesis from the hypothesis space H, e.g.:
$h^*_\theta(\mathbf{x}) = 0.1 + 2.3 x_1 + 1.1 x_2 + 0.05 x_3$
How good is the hypothesis (model)?
• Did we use suitable attributes (features) for the problem?
• Are the assumptions made valid?
  • E.g. Is the relationship between the attributes and the target variable linear?
• Will the derived hypothesis (model) generalize well to new data?
  • Has it overfitted to the training data?
  • Is our assumption too limiting (underfit – bias)?
How good is the hypothesis (model)?
We need an evaluation framework to measure performance of a hypothesis or model:
ØA set of data to measure the performance on.
ØA measure of “goodness” of a hypothesis – performance metric.
Problem of measuring performance
Ø Ideally, the performance should be a measure of the hypothesis’ prediction capability:
• That is, the performance on predicting unseen examples
• That is, the ability to predict the unknown
Ø Of course, we can’t know this
• So we measure performance against known experience
• That is, examples from the collected data set
Review: ML Assumption
All algorithms for Machine Learning make a significant assumption:
The experience is a reasonable representation (or reasonable sample) of the true but unknown target function
“Past performance is not a reliable indicator of future performance.” – The use of past performance in promotional material, ASIC.
Independent Test Data
Ø To measure performance we require independent test data:
• Data that “simulates” unseen data.
• It mimics the process of using a hypothesis “for real”
• Data which has not been used for training (or testing!)
Ø We will explore two mechanisms to generate a test set:
• Hold-out validation
• Cross-validation
Hold-Out Validation
Constructing Training/Testing Data
Data sets for Training and Testing are constructed by sub-dividing the experience.
Typically an 80% – 20% split is used:
Training Data | Testing Data
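A minimal sketch of such a hold-out split; scikit-learn is used purely as an illustration (the slides do not prescribe a library), and X, y stand for the feature matrix and targets:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the examples for testing; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```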
However, this doesn’t entirely help us train and test effectively
ML Process

[Diagram. Stage 1: the Machine Learning algorithm/program uses the Training Data, its tuneable parameters and the loss to find the optimal hypothesis from the hypothesis space. Stage 2: the optimal hypothesis is tested on Test Data ("simulated" unseen data), predicting the Price.]

Are we missing something?
ML Process

[Diagram. Stage 1.1: for a given 𝝀 value, the Machine Learning algorithm/program uses the Training Data, its tuneable parameters and the loss to find the optimal hypothesis from the hypothesis space. Stage 1.2: find the best hypothesis amongst the hypotheses obtained for different 𝝀 values. Stage 2: test the best hypothesis on Test Data ("simulated" unseen data), predicting the Price.]

What data should we use in Stage 1.2?
What about hyper-parameters? How should we set them?
Hyper-parameter Tuning
What data should we use? Training/Testing?
Ø Training
Ø Will end up with a trivial answer: We can always choose a very high capacity model
to fit the data and then set lambda to zero. This will give the best value for
$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$
Ø Testing
Ø Will overfit to the test data and select the best hypothesis that does well on the test data.
Ø Now our test data is no longer independent.
Independent Test Data
To measure performance we require independent test data: data that "simulates" unseen data.
It mimics the process of using a hypothesis "for real": data which has not been used for training (or testing!)
Once data has been used for testing, it is no longer independent
Care must be taken to make limited use of the testing data
Too much evaluation essentially means testing data is being used to “train”
Think of test data as the exam.
You should never look at the exam questions while studying for the exam. If you do, you might end up only learning the things that are on the exam.
The exam is not there to measure how well you can memorise the answers; it is about how well you know the concepts or skills.
ML Process

[Diagram. Stage 1.1: for a given 𝝀 value, the Machine Learning algorithm/program uses the Training Data, its tuneable parameters and the loss to find the optimal hypothesis. Stage 1.2: use the Validation Data to find the best hypothesis amongst the hypotheses obtained for different 𝝀 values. Stage 2: test the best hypothesis on Test Data ("simulated" unseen data), predicting the Price.]
Constructing Training/Testing Data
Typically, a data set is sub-divided into three sets:
Training data – for training a hypothesis
Validation data – for "testing" and tuning parameters of the ML algorithm
Testing data – for evaluating and comparing final hypotheses

Typically a 60% – 20% – 20% split:
Training Data | Validation Data | Testing Data
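One way to realise the 60/20/20 split is two successive hold-out splits; a sketch using scikit-learn only as an illustration, with X, y as the feature matrix and targets:

```python
from sklearn.model_selection import train_test_split

# First reserve 20% as the final test set ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ... then split the remaining 80% into training and validation
# (0.25 of 80% = 20% of the original data, giving 60/20/20 overall)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
```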
Important Considerations
Ø No overlap between the partitions (Train, Validation and Test)
Ø E.g. Assume that you want to build a model that predicts if a patient has diabetes using two attributes: BMI and blood glucose level.
Ø Given the data in the table, can we do random splitting?
Patient ID | BMI | Glucose | Diabetes
P1 Visit 1 | .   | .       | .
P1 Visit 2 | .   | .       | .
P2 Visit 1 | .   | .       | .
P3 Visit 1 | .   | .       | .
…          | .   | .       | .
Random splitting might not always work. Be careful.
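One way to guarantee that no patient appears on both sides of the split is a group-aware splitter. A sketch using scikit-learn's GroupShuffleSplit, shown only as an illustration; the patient_ids array is hypothetical and X, y are assumed to be NumPy arrays:

```python
from sklearn.model_selection import GroupShuffleSplit

# All rows sharing a patient ID end up in the same partition
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```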
Important Considerations
Ø Reasonable representation (sampling) of unknown function
Ø E.g. Assume that you want to build a model that predicts if a patient has diabetes using two attributes: BMI and blood glucose level.
Ø We are planning to deploy this model to all hospitals in Victoria
Ø Given the data from only 5 hospitals in Victoria, is random splitting the best option?
Hospital | BMI | Glucose | Diabetes
1        | .   | .       | .
1        | .   | .       | .
2        | .   | .       | .
3        | .   | .       | .
…        | .   | .       | .
Random splitting might not always work. Be careful.
Cross Validation
Constructing Training/Testing Data
Ø In theory, this is sufficient, if the splits are “reasonably well distributed”
• In practice, however, we know ML algorithms are biased
• So any hypothesis will be biased towards a particular split
• Any evaluation will also be biased on the particular split
Training Data
Validation Data
Trade-off between training and test split size:
Ø We need large training and test sets to get a good model and have confidence in the performance.
• We also have limited data – cannot make both training/test sets large.
K-fold Cross Validation
• Divide the data into 𝑘 partitions
• One partition is used as the test set; the other 𝑘 − 1 form the training set
• Evaluate, then repeat with a different partition as the test set
• Average the results
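A sketch of 5-fold cross validation; scikit-learn is used only as an illustration, the linear model and MSE metric are placeholders, and X, y are assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on k-1 folds
    preds = model.predict(X[test_idx])                           # evaluate on the held-out fold
    fold_errors.append(mean_squared_error(y[test_idx], preds))

print("MSE per fold:", fold_errors)
print("Average MSE :", np.mean(fold_errors))
```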
K-fold Cross Validation
Ø Note!
• K-fold cross validation gives a measure of the average expected error of our ML algorithm
• BUT, you still need to choose ONE hypothesis (model)
Ø How to choose a model?
• A simple method is to choose a model which gives an error similar to the average error
• However, this model may still be overly biased
• OR, use techniques such as regularisation to "reduce the model complexity" to an error similar to the average
Evaluation Measures
Evaluation Metrics
Ø Why not use the loss function?
Ø The loss function is selected to make the optimization process easy; its value may not be very intuitive to us.
$J(\theta) = -\frac{1}{2n} \sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(\mathbf{X}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(\mathbf{X}^{(i)}\right)\right) \right]$
Ø Evaluation metrics are selected so that they are intuitive.
Regression Evaluation Measures
Regression Evaluation Metrics
Ø Evaluation Measures:
• Mean Absolute Error (MAE)
• Mean Squared Error (MSE)
  o Root Mean Squared Error (RMSE)
• R² (R-squared), the coefficient of determination
MAE, MSE and RMSE
Mean absolute error:
$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| h\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right|$
Mean squared error:
$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( h\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^2$
Root mean squared error:
$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( h\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^2}$
MSE amplifies the effect of outliers (or errors in a single instance) compared to MAE. Conversely, MAE dilutes the impact of single instances as the data set grows.
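A sketch of computing the three measures for a held-out set; y_test holds the true targets and y_pred the model's predictions (the names and the scikit-learn calls are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae  = mean_absolute_error(y_test, y_pred)   # (1/n) * sum |h(x) - y|
mse  = mean_squared_error(y_test, y_pred)    # (1/n) * sum (h(x) - y)^2
rmse = np.sqrt(mse)                          # back in the original units of y
```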
R-Squared, Coefficient of Determination
$R^2$ measures the amount of variance in the data that is explained by the (given) model.
It is the fraction by which the variance in (or change of) the output (dependent variable) is predicted by the variance in (or change of) the attributes (independent variables). But don't confuse this with the variance of a model/algorithm.
$V_\mu = \sum_{i=1}^{n} \left( \mu - y^{(i)} \right)^2$, where $\mu$ is the mean of the observed outputs
$V_h = \sum_{i=1}^{n} \left( h\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^2$
$R^2 = \frac{V_\mu - V_h}{V_\mu} = 1 - \frac{V_h}{V_\mu}$
[Figure: scatter of Price vs Km comparing the mean predictor $\mu$ with the hypothesis $h(\mathbf{x})$.]
R-Squared, Coefficient of Determination
(The same definitions, illustrated with a different feature: a scatter of Price vs Color.)
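A sketch of computing R² both from the definition above and via a library call; y_test, y_pred and the scikit-learn usage are illustrative:

```python
import numpy as np
from sklearn.metrics import r2_score

v_mu = np.sum((np.mean(y_test) - y_test) ** 2)   # V_mu: squared error of the mean predictor
v_h  = np.sum((y_pred - y_test) ** 2)            # V_h: squared error of the hypothesis h
r2_manual = 1 - v_h / v_mu

r2 = r2_score(y_test, y_pred)                    # should match r2_manual
```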
Classification Evaluation Measures
Classification Evaluation
Ø Given a classification problem, we want to evaluate how well our classifier performs in comparison to actual classes
Ø With classification, we can discuss types of errors
• Confusion Matrix
• Accuracy, Precision, Recall, F1-Score
• ROC curve
Example: Screening for COVID-19 at the Airport

[Figure: a screening test applied at the airport; output 1 → Quarantine, output 0 → No Action.]

Screening done at the airport:
Ø False positive (Type I): Non-COVID patients detected as COVID by the test.
Ø False negative (Type II): True COVID patients not detected by the system.
Example: Screening for COVID-19 at the Airport

[Figure: two screening models, Model A and Model B, each mapping the screening test output to 1 → Quarantine or 0 → No Action.]
Example: Screening for COVID-19 at the Airport

[Figure: the outputs of Model A and Model B annotated with the four outcome types: True Positive, False Positive, False Negative and True Negative.]
Example: Screening for COVID-19 (Hospital Admission)

[Figure: the same two models, Model A and Model B, considered in a hospital-admission setting (1 → Quarantine, 0 → No Action).]
Classification Errors (Type 1 vs Type 2)
Consider binary class problems
Assume there are some classification errors
Predict one class, but actually the test data had the other class
Not all classification errors are the same
From the COVID-19 example:
Person “does not have COVID-19” (negative) but the test result is positive
• Type 1 error, False Positive
Person “has COVID-19” (positive) but the test result is negative
• Type 2 error, False Negative
Types of Classification Outputs
True Positive (TP): class 1 predicted as class 1
False Positive (FP): class 0 predicted as class 1 (type 1 error)
True Negative (TN): class 0 predicted as class 0
False Negative (FN): class 1 predicted as class 0 (type 2 error)

Total number of instances: m = TP + FP + TN + FN

              Predicted
              T (1)   F (0)
Actual T (1)  TP      FN
       F (0)  FP      TN
Confusion Matrix
The confusion matrix summarises the four types of classification error
y | h(X)-A | h(X)-B
1 | 0 | 1
0 | 0 | 0
0 | 1 | 1
0 | 0 | 0
1 | 1 | 1
0 | 0 | 0
0 | 0 | 0
1 | 1 | 1
0 | 0 | 0
0 | 0 | 0
0 | 0 | 1
0 | 0 | 0
0 | 0 | 0
0 | 0 | 0

              Predicted
              T (1)   F (0)
Actual T (1)
       F (0)
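A minimal sketch of counting the four outcome types for a binary classifier; the label vectors below are hypothetical, not the table above:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

y_true = [1, 0, 0, 1, 1, 0, 0, 1]       # hypothetical actual classes
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]       # hypothetical predictions
print(confusion_counts(y_true, y_pred))  # (3, 1, 3, 1)
```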
Common Measures: Accuracy
Accuracy measures the rate of correct classifications:
$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$
Common Measures: Precision
Precision measures:
Of all the instances identified (predicted) as positive, what fraction are actually positive?
Is the classifier good at being correct when it identifies a positive instance?
$\text{Precision} = \frac{TP}{TP + FP}$
Common Measures: Recall
Recall measures:
Of all the actual positive instances, how many does the classifier get correct?
Also called the true positive rate, or sensitivity.
$\text{Recall} = \frac{TP}{TP + FN}$
Less Common Measures: Specificity
Specificity measures the “inverse” of recall
Also known as the true negative rate
Of all the actual negative instances, how many does the classifier get correct?
$\text{Specificity} = \frac{TN}{TN + FP}$
Less Common Measures: Fall-out
The false positive rate is the complement of specificity: of all the actual negative instances, how many are incorrectly classified as positive?
Also known as the fall-out.
$\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}$
Common Measures: F1 Score
The F1 score is the harmonic mean of precision and recall. An F1 score of 1 means perfect precision and recall.
It provides a balance between precision and recall.
It is a more reliable measure than accuracy if the confusion matrix is skewed (the classes are imbalanced).
$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
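A sketch of the common measures via scikit-learn (shown only as an illustration); y_true and y_pred are assumed to be binary label vectors:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

acc  = accuracy_score(y_true, y_pred)    # (TP + TN) / (TP + FP + FN + TN)
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec  = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1   = f1_score(y_true, y_pred)          # 2 * prec * rec / (prec + rec)
```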
How about Multi-class?
Essentially build a larger confusion matrix
E.g., if 5 classes, then the confusion matrix is 5 by 5
Compute TP, FP, etc. for each class based on a one-vs-all approach
So we have class dependent precision and recall (and F1-score)
But can take an average to output an average precision, recall and F1-score across the classes (micro and macro averages)
For more details about computing the confusion matrix for multi-class classification, see this nice video: https://www.youtube.com/watch?v=FAr2GmWNbT0
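A sketch of the multi-class case; the same scikit-learn functions accept an average argument for macro or micro averaging (y_true and y_pred are assumed multi-class label vectors, and the calls are illustrative):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

cm = confusion_matrix(y_true, y_pred)                           # k x k matrix for k classes
prec_macro = precision_score(y_true, y_pred, average="macro")   # mean of per-class precision
prec_micro = precision_score(y_true, y_pred, average="micro")   # pooled counts across classes
rec_macro  = recall_score(y_true, y_pred, average="macro")
f1_macro   = f1_score(y_true, y_pred, average="macro")
```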
Class Balance
Assume that you have a binary classification problem (cancer vs healthy) where the data set has the following distribution:
Ø Class 0 (Healthy): 1000 examples
Ø Class 1 (Cancer): 25 examples
Ø What will be the accuracy if you have a trivial model that always says healthy? (1000/1025 ≈ 97.6%, even though the model never detects a single cancer case.)
Numeric Prediction as Classification
ROC Curve
Plots the recall (true positive rate) against the false positive rate, one point per decision threshold.
A useful tool for visualising the performance of a classifier.
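A sketch of plotting a ROC curve; y_score is assumed to be the predicted probability of the positive class (e.g. from a model's predict_proba), and the library calls are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, y_score)                # area under the ROC curve

plt.plot(fpr, tpr, label=f"classifier (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], "--", label="random guess")
plt.xlabel("False positive rate (fall-out)")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()
```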
Hyper-parameter Tuning
Regularization parameter
Problem of Overfitting and Underfitting
[Figure: three fits of Power consumption vs Temperature, from a straight line to a high-order polynomial.]
$\theta_0 + \theta_1 x_1$   |   $\theta_0 + \theta_1 x_1 + \theta_2 x_1^2$   |   $\theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 + \theta_4 x_1^4 + \cdots$
Overfitting & Underfitting
Overfitting can occur if we have too many features / a highly complex model.
The learned hypothesis/model may fit the training data too well, but may not generalise to new examples (e.g. predicting energy usage on new examples).
Underfitting can occur if we have too few features / too simplistic a model.
The learned hypothesis is not able to fit the data well and is unable to make accurate predictions.
How to Identify over/under fitting
Assume data split into train and validation sets
$P_{\text{train}} = \sum_{(X, Y) \in \text{training}} L\left(h(X), Y\right)$
$P_{\text{valid}} = \sum_{(X, Y) \in \text{validation}} L\left(h(X), Y\right)$
[Figure: training and validation performance plotted against model complexity.]
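One concrete way to produce such curves is to vary a complexity knob (here, polynomial degree stands in for model complexity) and record training and validation errors; a sketch assuming the earlier X_train/X_val style splits, with scikit-learn used only as an illustration:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

train_err, val_err = [], []
for degree in range(1, 10):                      # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err.append(mean_squared_error(y_train, model.predict(X_train)))
    val_err.append(mean_squared_error(y_val, model.predict(X_val)))

# Underfitting: both errors are high.
# Overfitting: training error keeps falling while validation error starts rising.
```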
How to Identify over/under fitting
Tuning the regularization parameter
• Change 𝜆 and observe the training and validation errors.
• Pick the 𝜆∗ value that has the smallest gap and the best validation performance.
• The best hypothesis will be the optimal hypothesis at 𝜆∗.
Never use test data to do this. Always use validation data or cross-validation.
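A sketch of this sweep using ridge regression, whose alpha parameter plays the role of 𝜆; the split variable names and the scikit-learn usage are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_train, y_train)                       # train at this lambda
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))   # score on validation data

best_lambda = lambdas[int(np.argmin(val_errors))]
# Only after lambda* is fixed is the test set used, once, for the final performance estimate.
```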