
APS1070
Foundations of Data Analytics and Machine Learning
Fall 2021
Week 4:
• Clustering
• Probability Theory
• Summary Statistics
• Gaussian Distribution
• Performance Metrics
Prof.

Slide Attribution
These slides contain materials from various sources. Special thanks to the following authors:
• Sinisa Colic


• Zadeh

Quick review!

Algorithm
➢For a single nearest neighbour
4

How do we measure distance?
Euclidean distance: d(x, y) = √( Σⱼ (xⱼ − yⱼ)² ), where the sum runs over j = 1, …, d (the dimension of x or y)
5
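A minimal NumPy sketch of the Euclidean distance above (not from the slides; the example vectors are made up):

import numpy as np

def euclidean_distance(x, y):
    # Square the per-dimension differences, sum over the d dimensions, take the root
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two made-up 3-dimensional points
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0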

Decision Boundary
➢Can generate arbitrary test points on the plane and apply kNN
➢The boundary between regions of input space assigned to different categories.
6

Tradeoffs in choosing k?
➢Small k
➢Good at capturing fine-grained patterns
➢May overfit, i.e. be sensitive to random noise
Excellent for training data, not that good for new data; too complex
➢Large k
➢Makes stable predictions by averaging over lots of examples
➢May underfit, i.e. fail to capture important regularities
Not that good for training data, not good for new data; too simple
7

Constructing a Decision Tree
➢Decision trees make predictions by recursively splitting on different attributes according to a tree structure
(Figure: example decision tree splitting on width (cm).)
8

Part 1 K-Means Clustering (Unsupervised Learning)

Clustering
➢Clustering algorithms group samples/instances based on similarities in features
➢Input: set of samples/instances described by features
➢Output: assigned cluster (group) for each sample/instance
➢Clustering algorithms are unsupervised techniques and do not have access to the sample/instance labels
10

Clustering Strategies
➢k-Means Clustering
➢Assigns each point to the nearest cluster center
➢Agglomerative clustering
➢Assumes each point is a cluster and iteratively merges the closest clusters

K-Means
➢Most well-known clustering method
➢Distance-based unsupervised learning algorithm, NOT to be confused with k-NN.
➢ Algorithm:
1. Assign each sample/instance to its closest mean
2. Update the means based on the assignment
3. Repeat until convergence
➢ Requires:
—Selection of the number of clusters ‘k’ (hyperparameter)
—Centre of each cluster is randomly initialized at the start
12
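A minimal NumPy sketch of the three steps above (a toy implementation for illustration, not the course's reference code; names such as kmeans and n_iters are my own):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Toy k-means: X is an (N, d) array; returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Step 1: assign each sample to its closest mean
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: update the means based on the assignment
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centre if a cluster is empty
                new_centres[j] = members.mean(axis=0)
        # Step 3: repeat until convergence (centres stop moving)
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels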

K-Means Example
• K is number of clusters
• STEP 1: Guess center locations
• STEP 2: Map out what data point is closest to what center
• STEP 3: Center moves to the centroid of all points it “owns”
• REPEAT!
13

(Slides 14–32 repeat the same steps while the animation iterates the assignments and centre updates toward convergence.)

Doesn’t always work… can get stuck!
33

What are we trying to do?
Minimize the sum of squared distances of the points to their assigned cluster centres: J = Σₙ Σₖ rₙₖ ‖xₙ − μₖ‖², where rₙₖ = 1 if point n is assigned to cluster k and 0 otherwise.
Note: N points, K clusters. Source: Ethan Fetaya, Emad Andrews
34

Slide credit: Ethan Fetaya, Emad Andrews
35

k-Means Convergence
➢Whenever an assignment is changed, the sum of squared distances, J, of the data points from their assigned cluster centres is reduced
➢Whenever a cluster centre is moved, J is reduced
➢Test for convergence: if the assignments do not change in the assignment step, we have converged (to at least a local minimum).
credit: Ethan Fetaya, Emad Andrews
36

Local Minima
➢There is nothing to prevent k-means from getting stuck at a local minimum
Options for avoiding local minima:
➢try many random starting points
➢split a big cluster into two
➢merge two nearby clusters
credit: Ethan Fetaya, Emad Andrews
37

How to choose number of clusters (k)?
➢As k increases, the sum of squared distances goes towards zero
➢If the plot looks like an arm, then the elbow of the arm is the optimal k
➢e.g., the elbow is at k=3 indicating the optimal number of clusters is 3
(Elbow plot: sum of squared distances on the y-axis vs. “k” number of clusters on the x-axis.)
38
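A hedged sketch of producing the elbow plot with scikit-learn, where X is assumed to be an already-prepared (N, d) feature array; KMeans.inertia_ is the sum of squared distances to the closest centre:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X: an (N, d) feature array, assumed to be loaded and scaled already
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances for this k

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("sum of squared distances")
plt.show()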

Shape of K-Means Clusters
➢K-means splits the space according to the closest mean:
Source: scikit-learn
39
➢Note that the clusters form convex regions.

Convex Region
➢A region is convex if any line between two points in the region remains in the region.
(Figure: three example regions, labelled convex, nonconvex, and convex.)
40

K-Means with Non-Convex Clusters
K-Means (K = 2)
K-means is unable to separate non-convex clusters
41

Agglomerative Clustering
➢A type of Hierarchical Clustering
➢ Algorithm:
1. Starts with each point in its own cluster
2. Each step merges the two “closest” clusters
Source: MachineLearningStories
42
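A brief scikit-learn sketch of the algorithm above; the feature array X and the choice of two final clusters are assumptions for illustration:

from sklearn.cluster import AgglomerativeClustering

# X: an (N, d) feature array, assumed to be available
# Each point starts in its own cluster; the closest clusters are merged until 2 remain
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X)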

Example: Agglomerative Clustering
Dendrogram
(Figure: dendrogram over points a–g; the distance threshold at which the tree is cut determines the number of clusters.)

Comparison of Clustering Methods
➢ There are many clustering algorithms to choose from
➢ Performance will depend on your data
Source: scikit-learn
44

Applications in Computer Vision
➢Replace samples with the mean of their cluster (vector quantization)
➢Visual example:
➢Inputs: color pixels represented as RGB values
➢Outputs: cluster (average RGB) value obtained using k-means
➢Q: How can this be applied for compression?
➢Q: How can this be applied for image segmentation?
Source: PRML Bishop 2006 45

K-Means Code Example (Google Colab)
46
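The Colab notebook itself is not reproduced in these notes; below is a minimal stand-in showing the typical scikit-learn calls, with made-up blob data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data: 300 points drawn from three 2-D blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # learned cluster centres
print(km.labels_[:10])       # cluster assignments of the first 10 samples
print(km.predict(X[:5]))     # assign (new) points to their nearest centre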

Part 2 Probability Theory
Readings:
• Chapter 6, Sections 6.1–6.5, MML Textbook

Overview
➢Compared several machine learning algorithms
➢Supervised Learning
➢k-Nearest Neighbours
➢Decision Trees
➢Unsupervised Learning
➢k-Means Clustering
➢Q: How did we measure performance?
48

Measuring Uncertainty
➢In machine learning it is important to understand how confident we can be about the decisions (or predictions) being made by our models/algorithms.
➢Q: A model/algorithm is tested on two new samples and classifies both correctly. What is the accuracy, and how confident are you in that result?
49

Agenda for parts 2 and 3 of Lecture 4
➢Probability Theory
➢Examples
➢Summary Statistics
➢Gaussians
➢Mixture of Gaussians
➢Anomaly Detection
➢Performance Metrics
➢Precision and Recall
➢Confusion Matrix
➢ROC and AUC
Theme:
Measuring Uncertainty
50

Example: Random Variables (discrete)
(Figure: a scatter of 35 points over X ∈ {1, …, 7} and Y ∈ {1, 2}.)
P(X=3 | Y=2) = 5/20 (conditional distribution)
P(X=5) = (4+5)/35 = 9/35 (marginal distribution)
P(X=2, Y=1) = 3/35 (joint distribution)
51

Example: Marginal Distribution of X
(Figure: the joint distribution P(X, Y) over X ∈ {1, …, 7}, Y ∈ {1, 2} is summed over Y to give the marginal distribution P(X); the bar heights shown include 8/35, 6/35, 3/35, and 1/35.)
52

Example: Marginal Distribution of Y
(Figure: summing the joint distribution P(X, Y) over X gives the marginal distribution P(Y), with values 20/35 and 15/35.)
53

Example: Conditional Distribution X|Y
(Figure: restricting the joint distribution P(X, Y) to the row Y = 1 and renormalizing by the 15 points with Y = 1 gives the conditional distribution P(X | Y=1); the values shown include 2/15, 3/15, 4/15, 1/15, and 0/15.)
54

Example: From 2 to 3 Random Variables
(Figure: the example extended with a third random variable Z ∈ {1, 2}.)
P(X=3, Y=2) = 4/35 (marginal distribution, summed over Z)
P(X=5, Y=2, Z=1) = 3/35 (joint distribution)
55

Example: Continuous Distributions
➢Probability at a single point is meaningless
➢Probabilities need to integrate (add up) to 1
➢Consider areas under the density
(Figure annotations: p(x=2.75 | y=2.20) conditional distribution; p(y=2.20) and p(x=5.2) marginal distributions; p(x=3.75, y=1.40) joint distribution.)
56

Probability with Real Data
pd.crosstab(df['Pclass'], df['Sex'])

Counts
Class   Sex=0   Sex=1   All
1         80     136    216
2         97      87    184
3        372     119    491
All      549     342    891

pd.crosstab(df['Pclass'], df['Sex']) / 891

Probabilities
Class   Sex=0   Sex=1   All
1       0.090   0.153   0.242
2       0.109   0.098   0.207
3       0.418   0.134   0.551
All     0.616   0.384   1.000

57

Probability with Real Data
Counts (Class by Sex, as on the previous slide)
Class   Sex=0   Sex=1   All
1         80     136    216   (figure annotation: 70 survived, 10 dead)
2         97      87    184
3        372     119    491
All      549     342    891

(The probabilities table from the previous slide appears alongside.)

Q: What is the probability of a random passenger from the dataset belonging to the 3rd class?
Q: Given that there were 70 first-class male passengers who survived and 10 first-class male passengers who did not, what is the probability of survival for first-class male passengers?
Q: What is the probability of there being a first-class male passenger that survived?
58
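A small sketch of the last two questions using only the counts quoted on this slide (70 first-class male passengers survived, 10 did not, 891 passengers in total); plain arithmetic, no dataset needed:

survived, died, total_passengers = 70, 10, 891

# P(survived | first-class male) = survivors / all first-class males
p_survived_given_fc_male = survived / (survived + died)   # 0.875

# P(first-class male AND survived) = survivors / all passengers
p_fc_male_and_survived = survived / total_passengers      # about 0.079

print(p_survived_given_fc_male, p_fc_male_and_survived)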

Example: Permutations and Combinations
Permutation
➢arrangement of items in which order matters
Combination
➢selection of items in which order does not matter
n – number of items in a set
r – number of items selected from the set

Example: Permutations (n=5, r=3)
Q: How many ways can we fill 3 positions in a company using a pool of 5 applicants?
Q: How many ways can we fill 3 positions in a start-up using a pool of 5 applicants if each person can potentially be given more than one position?
60

Example: Combinations (n=5, r=3)
Q: How many ways can we select 3 people from a pool of 5 applicants to give them a tour of the company?
Q: How many ways can we distribute three identical coins among 5 individuals?
61
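Worked answers for the four questions above, using the standard counting formulas (the stars-and-bars step for the identical coins is my assumption about the intended solution):

% Standard counting formulas
P(n,r) = \frac{n!}{(n-r)!}, \qquad C(n,r) = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}

% Worked answers for n = 5, r = 3
P(5,3) = \frac{5!}{2!} = 60              % ordered positions, no repetition
5^{3} = 125                              % ordered positions, repetition allowed
C(5,3) = \binom{5}{3} = 10               % unordered selection for the tour
\binom{3+5-1}{3} = \binom{7}{3} = 35     % identical coins among 5 people (stars and bars)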

Let us Summarize…
62

Why Probability?
➢Probability theory is a mathematical framework for quantifying our uncertainty about the world.
➢It allows us (and our software) to reason effectively in situations where being certain is impossible.
➢Probability theory is at the foundation of many machine learning algorithms.
63

Perspectives on Probability
➢Objectivist perspective: randomness is fundamental to the universe.
➢They would say that the probability of a fair coin coming up heads is 0.5, because that’s the nature of fair coins.
➢Subjectivist perspective: probabilities represent our degree of belief that an event will occur.
➢If we knew the initial position of the coin and how the force was applied, then we could determine with certainty if it would come up heads or tails.
➢Under this perspective, probability is a measure of our ignorance (like not knowing how the force is applied to the coin).
64

Frequentists
➢Frequentist position: estimates come from experiments, and experiments only.
➢e.g., if we want to estimate how likely a six-sided die is to roll a 4, we should roll the die many times and observe how frequently 4 appears.
➢This method works well when we have a large amount of data, but with fewer examples we can’t be confident in our estimates.
➢If we haven’t seen a 4 after five rolls, does that mean a 4 is impossible?
➢The other issue is that we can’t inject any of our prior knowledge about dice into our estimates. If we knew the die was fair, not seeing a 4 in the first five rolls is completely understandable.
➢Bayesian perspective allows us to combine our prior beliefs with our observations
65

Bayesian vs Frequentist
➢For example: Imagine that a coin we believe to be fair is flipped three times and results in three heads.
➢Frequentist: the calculation would suggest the coin is loaded (although with low confidence)
➢Bayesian: our prior knowledge that the coin is fair allows us to maintain some degree of belief that tails is still possible.
➢The actual mechanics of how we combine our prior belief relies on something called Bayes’ rule, which will be covered later.
66

Mathematical Framework
➢Probability theory is a mathematical framework.
➢As with any mathematical framework there is some vocabulary and important rules needed to fully leverage the theory as a tool for machine learning.
67

Probability Spaces
➢Probability is all about the possibility of various outcomes. The set of all possible outcomes is called the sample space.
➢e.g., sample space for coin flip is {heads, tails}.
➢e.g., the sample space for the temperature of water is all
values between the freezing and boiling point.
➢Only one outcome in the sample space is possible at a time, and the sample space must contain all possible values.
68

Random Variables
➢Variables which randomly take on values (discrete or continuous) from a sample space.
➢The probability of any event has to be between 0 (impossible) and 1 (certain), and the sum of the probabilities of all events should be 1:
0 ≤ p(x) ≤ 1
Σₓ p(x) = 1
69

Discrete Probabilities
➢Discrete random variables are described with a probability mass function (PMF).
➢PMF maps each value in the variable’s sample space to a probability.
➢e.g., the PMF P(X) for a loaded die, and how it compares with that of a regular die
(Figures: PMF of a loaded die and PMF of a regular die.)
70

Probability Distributions
➢Common discrete distribution is the Bernoulli.
—A Bernoulli distribution specifies the probability for a random variable which can take on one of two values.
—e.g., heads or tails
—We can specify the entire distribution with a single parameter p, the probability of the positive outcome.
—e.g., for a fair coin we have p = 0.5,
—e.g., given the probability of rain is p = 0.2, we can infer the probability of no rain is 0.8.
➢Other common discrete distributions are the binomial (e.g., handles multiple tosses of a coin), the multinomial (e.g., rolling a die), and the Poisson (events occurring in a fixed interval of time).
71

Types of Probabilities
➢Joint Probability
➢a joint distribution over two random variables x, y specifies the probability of any setting of the random variables.
➢Marginal Probability
➢called the marginal probability distribution, since we’ve “marginalized” away the random variable y (uses the sum rule).
➢Conditional Probability
➢the probability of an event given that another event has already been observed.
72

Bayes’ Theorem
➢Product Rule: P(x, y) = P(x|y) ⋅ P(y)
➢We can write the product rule for two variables in two equivalent ways:
P(x, y) = P(y|x) ⋅ P(x)
➢Setting both expressions equal and dividing by P(y) gives Bayes’ rule:
P(x|y) = P(y|x) ⋅ P(x) / P(y)
Note: Bayes’ rule is crucially important to much of statistics and machine learning; it is the driving force behind Bayesian statistics (the Bayesian perspective).
This simple rule allows us to update our beliefs about quantities as we gather more observations from data.
73

Bayes’ Theorem Example (made-up numbers)
If a random person has a fever, what is the chance that it is COVID?
We know:
1. P(fever | COVID) = 25%: 25% of infected people have fever
2. P (COVID) = 2% : Fraction of world population having COVID
3. P (fever) = 1% : 1 in every 100 persons has fever
74
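Plugging the slide's made-up numbers into Bayes' rule (my arithmetic, not shown on the slide):

P(\text{COVID} \mid \text{fever})
  = \frac{P(\text{fever} \mid \text{COVID}) \, P(\text{COVID})}{P(\text{fever})}
  = \frac{0.25 \times 0.02}{0.01} = 0.5

So, with these made-up numbers, a random person with a fever has a 50% chance of having COVID.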

Bayes’ Theorem Example (made-up numbers)
If a random person feels tired, what is the chance that it is COVID?
We know:
1. P(tired | COVID) = 25%: 25% of infected people feel tired
2. P (COVID) = 2% : Fraction of world population having COVID
3. P (tired) = 25% : 1 in every 4 persons feels tired
Find the answer and discuss over Piazza, if you notice anything interesting…
75

Independence
➢Two variables x and y are said to be independent if P(x, y) = P(x) ⋅ P(y)
Q: Can you think of an example where this would happen?
76

Conditional Independence
➢Two variables x and y are called conditionally independent given another variable z if
P(x, y|z) = P(x|z) ⋅ P(y|z)
➢Q: Can you think of an example where this would happen?
77

Continuous Probabilities
➢Continuous random variables are described by probability density functions (PDF) which can be a bit more difficult to understand.
➢PDFs map an infinite sample space to relative likelihood values.
➢To understand this, let’s look at an example with one of the most famous continuous distributions, the Gaussian (aka Normal) distribution.
78

Gaussian Distribution
➢The Gaussian distribution is parameterized by two values: the mean μ (mu) and variance σ2 (sigma squared).
➢The mean specifies the center of the distribution, and the variance specifies the width of the distribution.
➢You may have also heard about the standard deviation σ, which is just the square root of the variance.
(Figure: PDF of a Gaussian; the mean locates the centre and the variance sets the width. Axes: x vs. density.)
79
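The density sketched in the figure, written out (the standard univariate Gaussian PDF):

f(x \mid \mu, \sigma^{2})
  = \frac{1}{\sqrt{2\pi\sigma^{2}}}
    \exp\!\left( -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right)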

Relative Likelihood
➢The value of the PDF is not the actual probability of x.
➢Remember, the total probability for every possible value needs to sum to 1.
➢Q: How can we sum over infinite number of values?
➢A: Need to calculate the area under the PDF to obtain the probability
Since we are interested in the area, it is often more useful to work with a continuous random variable’s cumulative distribution function (CDF).
(Figure: PDF of a Gaussian with the area between 0 and 1 shaded.)
P(0 ≤ x ≤ 1) = ∫₀¹ f(x) dx
80

Relative Likelihood Con’t
➢We just determined that the area corresponds to the probability, so F(x) gives us P(X≤x).
➢Use the CDF to determine the probability of any given range [a,b] using P(a≤X≤b) = F(b)-F(a).
(Figures: the CDF, cumulative distribution function, and the PDF, probability density function.)
➢Note that asking for P(X=x) is equivalent to asking P(x≤X≤x) = F(x)-F(x) = 0
81
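A quick SciPy sketch of P(a ≤ X ≤ b) = F(b) − F(a); the standard normal (μ = 0, σ = 1) and the interval [0, 1] are assumptions for illustration:

from scipy.stats import norm

a, b = 0.0, 1.0
prob = norm.cdf(b) - norm.cdf(a)       # P(a <= X <= b) = F(b) - F(a), about 0.3413
print(prob)

# For a continuous variable, P(X = x) = F(x) - F(x) = 0
print(norm.cdf(1.0) - norm.cdf(1.0))   # 0.0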

Functions of Random Variables
➢Often useful to create functions which take random variables as input.
➢e.g., consider the game “guess a number between 1 and 10”: it costs $2 to play, a correct guess pays $10, and an incorrect guess pays $0. Let x be a random variable indicating whether you guessed correctly. The net payoff is:
h(x) = {$8 if x = 1, and -$2 if x = 0}
➢You may be interested in knowing in advance what the expected outcome will be.
82

Expectation
➢The expected value, or expectation, of a function h(x) on a random variable x ~ P(x) is the average value of h(x) weighted by P(x). For a discrete x, we write this as 𝔼[h(x)] = Σₓ h(x) ⋅ P(x).
➢The expectation acts as a weighted average over h(x), where the weights are the probabilities of each x.
e.g., 𝔼[h(x)] = P(winning) ⋅ h(winning) + P(losing) ⋅ h(losing) = (1/10) ⋅ $8 + (9/10) ⋅ (-$2) = $0.80 + (-$1.80)
= -$1
If x had been continuous, we would replace the summation with an integral
On average, we’ll lose $1 every time we play!
83

Expectation
➢A nice property of expectations is that they’re linear.
➢Let’s assume h and g are functions of x, and α and β are constants. Then we have: 𝔼[α h(x) + β g(x)] = α 𝔼[h(x)] + β 𝔼[g(x)]
84

Variance
➢We saw variance with respect to a Gaussian distribution when we were talking about continuous random variables. In general, variance is a measure of how much random values vary from their mean.
➢Similarly, for functions of random variables, the variance is a measure of the variability of the function’s output from its expected value: Var[h(x)] = 𝔼[(h(x) − 𝔼[h(x)])²]
85

Describing a Gaussian Distribution
86

Multivariate Data
➢Multiple measurements (e.g., different sensors)
➢d features/attributes (e.g., number of sensors)
➢N instances/observations/examples
columns => features for a single instance
rows => instances
87

Describing Multivariate Datasets
➢What do all these graphs have in common?

Covariance
➢Variances along each axis remain constant, but properties of the dataset change
➢Variances alone are insufficient to characterize the relationship/correlation of two random variables -> we also need the covariance (cross-covariance)!
89

Covariance Matrix
➢Expectation (mean): centre of the dataset
➢Covariance: “variance” of a d-dimensional random variable is given by a covariance matrix.
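Written out for a d-dimensional random variable x (standard definitions consistent with the bullets above):

\boldsymbol{\mu} = \mathbb{E}[\mathbf{x}], \qquad
\boldsymbol{\Sigma} = \mathbb{E}\left[ (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{\top} \right] \in \mathbb{R}^{d \times d}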

Multivariate Gaussian Distribution
f(x) = 1 / ( (2π)^(d/2) ⋅ |Σ|^(1/2) ) ⋅ exp( -(1/2) (x - μ)ᵀ Σ⁻¹ (x - μ) )
where μ is the sample mean, Σ is the multivariate covariance matrix, and |Σ| is its determinant.

Covariance
➢When the absolute value of covariance is high, the two variables tend to vary far from their means at the same time.
➢When the sign of the covariance is positive, the two variables tend to take on higher values together.
➢When the sign of the covariance is negative, one variable tends to take on higher values while the other takes on lower values (or vice versa).
Positive Covariance
Negative Covariance
92

Bivariate Normal
93

(Slides 94–97: further bivariate normal figures.)

Confidence
(Figures: Gaussian confidence regions for d = 1 (univariate) and d = 2 (bivariate).)
98

Example: Calculate the Covariance Matrix
Joint Probability P(X1, X2)

           X2=0    X2=1   P(X1)
X1=-1      0.24    0.06    0.3
X1=0       0.16    0.14    0.3
X1=1       0.40    0.00    0.4
P(X2)      0.8     0.2     1

Compute the mean:
E[X1] = (-1)(0.3) + (0)(0.3) + (1)(0.4) = 0.1
E[X2] = (0)(0.8) + (1)(0.2) = 0.2

Example: Calculate the Covariance Matrix
Joint Probability P(X1, X2) (table as on the previous slide)

Compute the variances and covariance:
Var(X1) = (-1-0.1)²(0.3) + (0-0.1)²(0.3) + (1-0.1)²(0.4) = 0.69
Var(X2) = (0-0.2)²(0.8) + (1-0.2)²(0.2) = 0.16
Cov(X1, X2) = (-1-0.1)(0-0.2)(0.24) + (-1-0.1)(1-0.2)(0.06) + (0-0.1)(0-0.2)(0.16)
            + (0-0.1)(1-0.2)(0.14) + (1-0.1)(0-0.2)(0.40) + (1-0.1)(1-0.2)(0) = -0.08

Example: Calculate the Covariance Matrix
Joint Probability P(X1, X2) (table as on the previous slides)

Putting it all together:
mean μ = [0.1, 0.2]
covariance matrix Σ = [[0.69, -0.08], [-0.08, 0.16]]
Example: Calculate the Covariance Matrix
Joint Probability P(X1, X2)

           X2=0    X2=1   P(X1)
X1=-1      0.24    0.06    0.3
X1=0       0.16    0.14    0.3
X1=1       0.40    0.00    0.4
P(X2)      0.8     0.2     1
➢Normalizing the covariance matrix will give you the population correlation coefficient, a measure of how correlated two variables are.
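A short NumPy check of the worked example above (joint table, means, covariance matrix, and the correlation coefficient); the variable names are mine:

import numpy as np

# Joint probability table P(X1, X2): rows X1 in {-1, 0, 1}, columns X2 in {0, 1}
P = np.array([[0.24, 0.06],
              [0.16, 0.14],
              [0.40, 0.00]])
x1 = np.array([-1.0, 0.0, 1.0])
x2 = np.array([0.0, 1.0])

mu1 = (P.sum(axis=1) * x1).sum()                    # 0.1
mu2 = (P.sum(axis=0) * x2).sum()                    # 0.2
var1 = (P.sum(axis=1) * (x1 - mu1) ** 2).sum()      # 0.69
var2 = (P.sum(axis=0) * (x2 - mu2) ** 2).sum()      # 0.16
cov = (P * np.outer(x1 - mu1, x2 - mu2)).sum()      # -0.08

Sigma = np.array([[var1, cov], [cov, var2]])
rho = cov / np.sqrt(var1 * var2)                    # correlation coefficient, about -0.24
print(Sigma, rho)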

Example: Visualization
Bivariate Gaussian Example (See Sample Code)
(Figure: a scatter of the multivariate sample over X1 and X2, annotated with the multivariate mean and the covariance matrix from the previous slides.)
Determinant calculation (bivariate example): (0.69 * 0.16) – (-0.08 * -0.08) = 0.104
Determinants will be covered in week 7.

Gaussian Mixture Models (GMM)
104

Gaussian Mixture Models (GMM)
Predicted Loss (Error)
Source: Scikit-learn
➢Can be extended to multivariate data (e.g., bivariate GMM)
105

Anomaly Detection (Semi-Supervised)

What is the goal?
(Figure: a GMM fit to transactions, plotted over “Time of day” (4 AM, 9 AM, 9 PM) and “Transaction $” ($1 to $1M).)
107

Outlier
108

Google Colab (Code Example)
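The Colab notebook is not reproduced here; this is a hedged sketch of GMM-based anomaly detection with scikit-learn, where the transaction arrays, the number of components, and the threshold percentile are all assumptions:

import numpy as np
from sklearn.mixture import GaussianMixture

# X_train: (N, d) array of normal (valid) transactions, assumed available
# X_new:   (M, d) array of transactions to score, assumed available
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X_train)

scores = gmm.score_samples(X_new)                          # per-sample log-likelihood
threshold = np.percentile(gmm.score_samples(X_train), 1)   # flag the least likely ~1%
is_outlier = scores < threshold                            # low likelihood => potential fraud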

Part 3 Performance Metrics

Why Performance?
➢Q: Why do we care about performance?
➢Identifying how well our models are doing is not trivial. It is easy to fall into the trap of believing (or wanting to believe) that our models are working, based on a weak assessment.
111

How to measure the performance of a model?
➢Assume a case where we want to detect outliers and we know:
➢The dataset has 100 points
➢98 are non-outliers
➢2 are outliers
➢If we predict all the points as non-outliers:
➢98 true predictions / 100 = 98% accuracy, for a model that is not working
➢Q: How can we improve our performance measurements?
112

Fraud Detection System
Model
Transaction Features
Positive: Fraud
Negative: Valid
113

Fraud Detection System
(Positive = Fraud)
➢If transaction is Valid:
➢Prediction: Valid (True Negative)
➢Prediction: Fraud (False Positive)
➢If transaction is Fraud:
➢Prediction: Fraud (True Positive)
➢Prediction: Valid (False Negative)
114

Precision and Recall
➢If transaction is Valid:
➢Prediction: Valid (True Negative) OK!
➢Prediction: Fraud (False Positive) Not that bad! (the MISTAKE)
➢If transaction is Fraud:
➢Prediction: Fraud (True Positive) GOOD!
➢Prediction: Valid (False Negative) Super BAD! (the MISS)
(Figure: confusion matrix with predicted rows P:Valid / P:Fraud and actual columns Valid / Fraud.)

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
115

Confusion Matrix
➢Table used to describe prediction performance on a set of test data

                                Actual Value (as confirmed by experiments)
                                positives              negatives
Predicted Value    positives    TP (True Positive)     FP (False Positive)
(predicted by      negatives    FN (False Negative)    TN (True Negative)
the test)
116

F1 score
F1 = 2 ⋅ (precision ⋅ recall) / (precision + recall)
➢A balanced measure of accuracy giving equal importance to recall and precision
➢The highest possible value of F1 is 1, and lowest possible value is 0
117
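A small scikit-learn sketch computing the confusion matrix, precision, recall, and F1; the label vectors are made up (1 = fraud/positive, 0 = valid/negative):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # made-up ground truth
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # made-up predictions

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # 2 * precision * recall / (precision + recall)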

Confusion Matrix (by heart)
(Highlighted on the matrix: Recall, False positive rate, Precision.)
https://en.wikipedia.org/wiki/Confusion_matrix
118

Confusion Matrix
https://en.wikipedia.org/wiki/Confusion_matrix
119

Precision and Recall – tug of war
➢ Application – spam detection
➢ Improving precision usually reduces recall, and vice versa
You’re saying everything on this side IS NOT spam (N) You’re saying everything on this side IS spam (P)
Model output
What is the precision (emails flagged as spam)? tp/(tp+fp) = 8/(8+2) = 0.8
What is the recall (actual spam correctly classified)? tp/(tp+fn) = 8/(8+3) = 0.73 Source: ML Crash Course
120

Precision and Recall – tug of war
➢ Application – spam detection
➢ Improving precision usually reduces recall, and vice versa
You’re saying everything on this side IS NOT spam (N) You’re saying everything on this side IS spam (P)
Model output
What is the precision (emails flagged as spam)? tp/(tp+fp) = 7/(7+1) = 0.88 What is the recall (actual spam correctly classified)? tp/(tp+fn) = 7/(7+4) = 0.64
Source: ML Crash Course
121
how do we choose a threshold?

ROC (Receiver Operating Characteristic) Curve
True positive rate (y-axis): you want this to be high
False positive rate (x-axis): you want this to be low
Source: Wikipedia 122

ROC Curve
Closest point to the top left = best accuracy
True positive rate (y-axis): you want this to be high
ROC: Receiver Operating Characteristic (a plot for binary classifiers)
Q: What would be a perfect classifier?
(The diagonal line on the plot corresponds to a random classifier.)
False positive rate (x-axis): you want this to be low
123

ROC Curve
(Figures, slides 124–128: sweeping the decision threshold across the negative and positive score distributions traces out points on the ROC curve.)
• TPR = 1, FPR = 0.95
• TPR = 1, FPR = 0.8
• TPR = 1, FPR = 0.05
• TPR = 1, FPR = 0
• TPR = 0.7, FPR = 0
124-128

AUC (Area Under the Curve)
129
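A hedged sketch of computing the ROC curve and its AUC with scikit-learn; y_true (binary labels) and y_score (classifier scores for the positive class) are assumed to come from a trained model:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # sweep over decision thresholds
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()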

Data Selection
➢How we select our data is another aspect that is often overlooked.
➢It can have serious effects on prediction performance.
➢Q: What are some issues with arbitrary data selection?
130

Splitting the Dataset
➢Training, Validation and Testing
Training: 60-80%   Validation: 10-20%   Testing: 10-20%
➢More training data => better model
➢More testing data => more confidence in assessment
➢Cross-Validation is an attempt to have both…
131
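One common way to realize the split above with scikit-learn, using two calls to train_test_split; the 60/20/20 proportions and the arrays X, y are assumptions:

from sklearn.model_selection import train_test_split

# First hold out 20% for testing, then 25% of the remainder (20% overall) for validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)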

Independent and Identically Distributed
➢It is important that the generative process is the same for all data points and that it has no memory of previously generated samples.
➢Sampling of data needs to be independent
132

Limited Data
➢There are often situations where it is difficult to obtain sufficient data to train a machine learning algorithm.
➢For example: A common practice with medical data is to increase the sample size by obtaining multiple samples from the same patient.
➢Q: What are some concerns resulting from this?
133

Generalization
➢We have seen previously that correctly predicting on new data is what we’re interested in.
➢Golden Rule: No model selection decisions should come from the test data!
134

Next Time
➢Week 4 Q/A Support Session on 30 Sep
➢Project Questions
➢Project 1 is due on 1 Oct. at 21:00
➢Reading assignment 4 (due 4 Oct at 21:00):
➢Read pages 33-41 (page numbers from the pdf file), Section 2.3 in Chapter 2 of “Mathematics for Machine Learning” by M. P. Deisenroth et al., 2020 (link)
➢Week 5 Lecture – Data Processing
➢Linear Algebra
➢Analytical Geometry
➢Data Augmentation
135