Foundations of Data Analytics and Machine Learning
Summer 2022
• Probability Theory
• Summary Statistics
• Gaussian Distribution
• Performance Metrics
10 min quick review!
ØFor a single nearest neighbour: how do we measure distance?
ØE.g., Euclidean distance, summed over each dimension of x and y
Decision Boundary
ØCan generate arbitrary test points on the plane and apply kNN
ØThe boundary between regions of input space assigned to different categories.
Tradeoffs in choosing k?
ØSmall k:
ØGood at capturing fine-grained patterns
ØMay overfit, i.e. be sensitive to random noise
Excellent on training data, not as good on new data: too complex
ØLarge k:
ØMakes stable predictions by averaging over many examples
ØMay underfit, i.e. fail to capture important regularities
Not as good on training data or on new data: too simple
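The small-k/large-k tradeoff can be seen in a minimal sketch (toy 2-D points and labels made up for illustration, not from the slides): a single noisy label flips the k=1 prediction, while a larger k averages it away.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Classify `query` by majority vote among the k nearest training points."""
    # train: list of ((x, y), label) pairs
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy dataset: class "a" clusters near the origin, class "b" near (5, 5),
# plus one noisy "b" label inside the "a" region.
train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"), ((1, 1), "b"),  # noisy label
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]

print(knn_predict(train, (1, 1), 1))  # k=1 fits the noise -> "b"
print(knn_predict(train, (1, 1), 5))  # k=5 averages it away -> "a"
```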
Constructing a Decision Tree
ØDecision trees make predictions by recursively splitting on different attributes according to a tree structure
Slide credit: Ethan Fetaya, Emad Andrews
Agglomerative Clustering
ØA type of Hierarchical Clustering
Ø Algorithm:
1. Starts with each point in its own cluster
2. Each step merges the two “closest” clusters
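The two steps above can be sketched in a few lines of plain Python (single-linkage "closest" distance and 1-D toy points assumed for illustration):

```python
def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering on 1-D points."""
    clusters = [[p] for p in points]              # step 1: one cluster per point
    while len(clusters) > n_clusters:
        # step 2: find and merge the two "closest" clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 3))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

In a full hierarchical clustering you would keep merging until one cluster remains and record the merge order as a dendrogram; stopping at `n_clusters` is a common shortcut.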
Source: MachineLearningStories
Comparison of Clustering Methods
Ø There are many clustering algorithms to choose from
Ø Performance will depend on your data
Source: scikit-learn
Project 1 ….
ØDeadline: Sat June 4 @ 11 PM
ØLast Support Session for Project 1: Thursday at noon
Probability Theory
• Chapter 6.1-5 MML Textbook
Example: Random Variables
P(X=3|Y=2) = 5/20 – Conditional Distribution
P(X=5) – Marginal Distribution
P(X=2,Y=1) – Joint Distribution
Example: Marginal Distribution of X
P(X,Y) – Joint Distribution
P(X) – Marginal Distribution
Example: Marginal Distribution of Y
P(X,Y) – Joint Distribution
P(Y) – Marginal Distribution (values 20/35 and 15/35)
Example: Conditional Distribution X|Y
P(X,Y) – Joint Distribution
P(X|Y=1) – Conditional Distribution
Example: From 2 to 3 Random Variables
P(X=3,Y=2) – Marginal Distribution
P(X=5,Y=2,Z=1) – Joint Distribution
Example: Continuous Distributions
p(x=2.75|y=2.20) – Conditional Distribution
p(x), p(y) – Marginal Distributions
p(x=3.75,y=1.40) – Joint Distribution
ØProbability at a point is meaningless
ØConsider areas under the density instead
Probability with Real Data
pd.crosstab(df['Pclass'], df['Sex'])
Probabilities
pd.crosstab(df['Pclass'], df['Sex']) / 891
Probability with Real Data
Probabilities
Q: What is the probability of a first-class male passenger surviving?
Q: What is the probability of there being a first-class male passenger that survived?
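The two questions differ in what we normalize by: the conditional probability divides by the size of the group, the joint probability divides by all passengers. A minimal sketch with made-up counts (the numbers below are assumptions for illustration, not the real Titanic table):

```python
# Hypothetical counts (NOT the real Titanic numbers), out of 891 passengers total
n_total = 891
n_first_male = 120            # first-class male passengers (assumed)
n_first_male_survived = 45    # of those, survivors (assumed)

# Joint: P(first-class male AND survived) -- normalize by everyone on board
p_joint = n_first_male_survived / n_total

# Conditional: P(survived | first-class male) -- normalize by the group only
p_cond = n_first_male_survived / n_first_male

print(round(p_joint, 3))
print(round(p_cond, 3))
```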
Example: Permutations and Combinations
Permutation
Øarrangement of items in which order matters
Combination
Øselection of items in which order does not matter
n – number of items in a set
r – number of items selected from the set
Example: Permutations and Combinations
Can you show this with an example?
Example: Permutations
Q: (Lottery Game) A player chooses 6 numbers from 1 to 55 (a number may repeat). If all numbers match the 6 winning numbers, in the correct order, the player wins. What is the probability that the winning numbers are 4, 15, 30, 55, 10, 1?
Q: Same problem as above, but now each number can be chosen only once. How does the probability change?
Example: Combinations
Q: (Lottery Game) A player chooses 6 numbers from 1 to 55. If all numbers match the 6 winning numbers, regardless of order, the player wins. What is the probability that the winning numbers are 4, 15, 30, 55, 10, 1?
Q: Same problem as above, but now each number can be chosen only once. How does the probability change?
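For the no-repetition variants, the counts can be checked directly with the standard-library `math.perm` and `math.comb` (this sketch assumes each of the 55 numbers is drawn at most once):

```python
import math

n, r = 55, 6

# Ordered draw without repetition: P(55, 6) = 55 * 54 * ... * 50 arrangements
n_perm = math.perm(n, r)
# Unordered draw without repetition: C(55, 6) = P(55, 6) / 6! selections
n_comb = math.comb(n, r)

# With every ticket equally likely, one specific outcome has probability:
print(1 / n_perm)   # order matters
print(1 / n_comb)   # order ignored: 6! = 720 times more likely to win
```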
Let us Summarize…
Probability Spaces
ØProbability is all about the possibility of various outcomes. The set of all possible outcomes is called the sample space.
Øe.g., sample space for coin flip is {heads, tails}.
Øe.g., the sample space for the temperature of water is all values between the freezing and boiling points.
ØOnly one outcome in the sample space is possible at a time, and the sample space must contain all possible values.
Random Variables
ØVariables which randomly take on values (discrete or continuous) from a sample space.
0 ≤ p(x) ≤ 1 and Σ_x p(x) = 1
ØProbability of any event must be between 0 (impossible) and 1 (certain), and the probabilities of all outcomes must sum to 1.
Discrete Probabilities
ØDiscrete random variables are described with a probability mass function (PMF).
ØPMF maps each value in the variable’s sample space to a probability.
Øe.g., the PMF for a loaded die, and how it compares with that of a regular die
PMF Loaded Die
PMF Regular Die
Types of Probabilities
ØJoint Probability
Øa joint distribution over two random variables x, y specifies the probability of any setting of the random variables.
ØMarginal Probability
Øcalled the marginal probability distribution, since we’ve “marginalized” away the random variable y (uses the sum rule).
ØConditional Probability
Øthe probability of an event given that another event has already been observed.
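All three types of probability can be read off one joint table; a minimal sketch (the joint values below are made up for illustration):

```python
# Hypothetical joint distribution P(X, Y) over X in {1, 2} and Y in {1, 2}
joint = {(1, 1): 0.1, (1, 2): 0.2, (2, 1): 0.3, (2, 2): 0.4}

# Marginal via the sum rule: P(X=x) = sum_y P(X=x, Y=y)
p_x = {x: sum(p for (xi, y), p in joint.items() if xi == x) for x in (1, 2)}

# Conditional: P(X=x | Y=2) = P(X=x, Y=2) / P(Y=2)
p_y2 = sum(p for (x, y), p in joint.items() if y == 2)
p_x_given_y2 = {x: joint[(x, 2)] / p_y2 for x in (1, 2)}

print({x: round(p, 3) for x, p in p_x.items()})
print({x: round(p, 3) for x, p in p_x_given_y2.items()})
```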
Bayes’ Theorem
ØProduct Rule:
P(x, y) = P(x|y) ⋅ P(y).
ØWe can write the product rule for two variables in two equivalent ways:
P(x, y) = P(y|x) ⋅ P(x)
ØSetting the two expressions equal and dividing by P(y), we get Bayes' rule:
P(x|y) = P(y|x) ⋅ P(x) / P(y)
Note: Bayes' rule is crucially important to much of statistics and machine learning, and is the driving force behind Bayesian statistics (the Bayesian perspective).
This simple rule allows us to update our beliefs about quantities as we gather more observations from data.
Bayes’ Theorem
If you feel sick, what is the chance that it is COVID?
1. P(Sick | COVID) = 30%: if you have COVID, there is a 30% chance you will feel sick
2. P(COVID) = 2%: 2% of the world population has COVID
3. P(Sick) = 25%: 1 in every 4 people feels some kind of sickness
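Plugging the three numbers into Bayes' rule gives the answer directly:

```python
# Bayes' rule with the numbers from the slide
p_sick_given_covid = 0.30   # P(Sick | COVID)
p_covid = 0.02              # P(COVID)
p_sick = 0.25               # P(Sick)

# P(COVID | Sick) = P(Sick | COVID) * P(COVID) / P(Sick)
p_covid_given_sick = p_sick_given_covid * p_covid / p_sick
print(round(p_covid_given_sick, 4))   # 0.024 -> only a 2.4% chance
```

Feeling sick is weak evidence here because sickness is common (25%) while COVID is rare (2%).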
Independence
ØTwo variables x and y are said to be independent if P(x, y) = P(x) ⋅ P(y)
Q: Can you think of an example where this would happen?
Functions of Random Variables
ØOften useful to create functions which take random variables as input.
Øe.g., consider the game "guess a number between 1 and 10": it costs $2 to play, a correct guess pays $10, and an incorrect guess pays $0 (so the net outcomes are +$8 and -$2). Let x be a random variable indicating whether you guessed correctly. We can write a function:
h(x) = {$8 if x = 1, and -$2 if x = 0}
ØYou may be interested in knowing in advance what the expected outcome will be.
Expectation
ØThe expected value, or expectation, of a function h(x) on a random variable x ~ P(x) is the average value of h(x) weighted by P(x). For a discrete x, we write this as:
𝔼[h(x)] = Σ_x h(x) ⋅ P(x)
ØThe expectation acts as a weighted average over h(x), where the weights are the probabilities of each x.
e.g., 𝔼[h(x)] = P(winning) ⋅ h(winning) + P(losing) ⋅ h(losing) = (1/10) ⋅ $8 + (9/10) ⋅ (-$2) = $0.80 + (-$1.80) = -$1.00
If x had been continuous, we would replace the summation with an integral
On average, we’ll lose $1 every time we play!
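The expectation above can be checked in two lines:

```python
# Guessing game: win $8 with probability 1/10, lose $2 with probability 9/10
p = {1: 0.1, 0: 0.9}      # x = 1 means a correct guess
h = {1: 8.0, 0: -2.0}     # net payoff after the $2 entry fee

ev = sum(h[x] * p[x] for x in p)   # E[h(x)] = sum_x h(x) P(x)
print(ev)                          # -1.0: expect to lose $1 per game
```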
Expectation
ØA nice property of expectations is that they’re linear.
ØLet's assume h and g are functions of x, and α and β are constants. Then we have:
𝔼[αh(x) + βg(x)] = α𝔼[h(x)] + β𝔼[g(x)]
ØWe saw variance with respect to a Gaussian distribution when we were talking about continuous random variables. In general, variance is a measure of how much a random variable varies from its mean.
Variance of a discrete random variable X:
Var(X) = 𝔼[(X − 𝔼[X])²] = Σ_x (x − 𝔼[X])² ⋅ P(x)
ØSimilarly, for functions of random variables, the variance is a measure of the variability of the function’s output from its expected value.
Example: Calculate the Variance
X1 takes values -1, 0, 1 with P(X1) = 0.3, 0.3, 0.4
𝔼[X1] = (-1)(0.3) + (0)(0.3) + (1)(0.4) = 0.1
Var(X1) = (-1-0.1)²(0.3) + (0-0.1)²(0.3) + (1-0.1)²(0.4) = 0.69
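The same mean and variance calculation, as a sketch in code:

```python
# X1 takes values -1, 0, 1 with probabilities 0.3, 0.3, 0.4
pmf = {-1: 0.3, 0: 0.3, 1: 0.4}

mean = sum(x * p for x, p in pmf.items())               # E[X1]
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X1 - E[X1])^2]

print(round(mean, 3))  # 0.1
print(round(var, 3))   # 0.69
```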
Continuous Probabilities
ØContinuous random variables are described by probability density functions (PDF). Think of it as a histogram for continuous data.
ØPDFs map an infinite sample space to relative likelihood values.
ØTo understand this, let’s look at an example with one of the most famous continuous distributions, the Gaussian (aka Normal) distribution.
PDF of Gaussian
ØThe value of the PDF is not the actual probability of x.
ØRemember, the total probability for every possible value needs to sum to 1.
ØQ: How can we sum over an infinite number of values?
ØA: Need to calculate the area under the PDF to obtain the probability
Since we are interested in the area, it is often more useful to work with a continuous random variable's cumulative distribution function (CDF).
PDF of Gaussian
p(0 < x ≤ 1) = CDF(1) − CDF(0): the area under the PDF between 0 and 1
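A sketch of this area computation for the standard normal (mean 0, variance 1), using the standard-library `math.erf` to express the CDF:

```python
import math

def std_normal_cdf(x):
    """CDF of the standard normal: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Probability is the area under the PDF, i.e. a difference of CDF values
p = std_normal_cdf(1.0) - std_normal_cdf(0.0)
print(round(p, 4))   # 0.3413 -- the familiar ~34% within one sigma above the mean
```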
columns => features, rows => instances
Describing Multivariate Datasets
ØWhat do all these graphs have in common?
Covariance
ØVariances along each axis remain constant, but properties of the dataset change
ØVariances alone are insufficient to characterize the relationship/correlation of two random variables -> we need the covariance!
Covariance Matrix
ØExpectation (mean): center of the dataset
ØCovariance: the "variance" of a d-dimensional random variable is given by a covariance matrix: Σ = 𝔼[(x − μ)(x − μ)ᵀ]
Covariance
ØWhen the absolute value of covariance is high, the two variables tend to vary far from their means at the same time.
ØWhen the sign of the covariance is positive, the two variables tend to take higher values together.
ØWhen the sign of the covariance is negative, one variable tends to take higher values while the other takes lower values (and vice versa).
Positive Covariance
Negative Covariance
Example: Calculate the Covariance Matrix
Joint Probability P(X1,X2)
compute the mean
𝔼[X1] = (-1)(0.3) + (0)(0.3) + (1)(0.4) = 0.1
𝔼[X2] = (0)(0.8) + (1)(0.2) = 0.2
Example: Calculate the Covariance Matrix
Joint Probability P(X1,X2)
compute the covariance
Var(X1) = (-1-0.1)²(0.3) + (0-0.1)²(0.3) + (1-0.1)²(0.4) = 0.69
Var(X2) = (0-0.2)²(0.8) + (1-0.2)²(0.2) = 0.16
Cov(X1,X2) = (-1-0.1)(0-0.2)(0.24) + (-1-0.1)(1-0.2)(0.06) + (0-0.1)(0-0.2)(0.16)
+ (0-0.1)(1-0.2)(0.14) + (1-0.1)(0-0.2)(0.40) + (1-0.1)(1-0.2)(0) = -0.08
Example: Calculate the Covariance Matrix
Joint Probability P(X1,X2)
putting it all together
Example: Calculate the Covariance Matrix
Joint Probability P(X1,X2)
ØNormalizing the covariance matrix will give you the population correlation coefficient, a measure of how correlated two variables are.
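The whole worked example can be reproduced from the joint table in a short sketch (joint probabilities as given on the slides):

```python
# Joint distribution P(X1, X2) from the slides
joint = {(-1, 0): 0.24, (-1, 1): 0.06,
         ( 0, 0): 0.16, ( 0, 1): 0.14,
         ( 1, 0): 0.40, ( 1, 1): 0.00}

mean1 = sum(x1 * p for (x1, x2), p in joint.items())   # E[X1] = 0.1
mean2 = sum(x2 * p for (x1, x2), p in joint.items())   # E[X2] = 0.2

var1 = sum((x1 - mean1) ** 2 * p for (x1, x2), p in joint.items())
var2 = sum((x2 - mean2) ** 2 * p for (x1, x2), p in joint.items())
cov = sum((x1 - mean1) * (x2 - mean2) * p for (x1, x2), p in joint.items())

# Symmetric 2x2 covariance matrix: variances on the diagonal, covariance off it
cov_matrix = [[var1, cov], [cov, var2]]
print([[round(v, 2) for v in row] for row in cov_matrix])  # [[0.69, -0.08], [-0.08, 0.16]]
```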
Multivariate Gaussian Distribution
p(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-½ (x − μ)ᵀ Σ⁻¹ (x − μ))
|Σ| – determinant of the covariance matrix, Σ – covariance matrix, μ – sample mean
Bivariate Normal
Example: Visualization
Multivariate Mean
Bivariate Gaussian Example (See Sample Code)
Covariance Matrix
Determinant of Covariance Matrix
Multivariate Sample
Determinant Calculation (bivariate example)
(0.69 * 0.16) − (−0.08 * −0.08) = 0.1104 − 0.0064 = 0.104
Determinant will be covered in week 7
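The bivariate determinant above is just ad − bc for a 2×2 matrix, so it can be verified directly:

```python
# 2x2 determinant: ad - bc for [[a, b], [c, d]]
a, b = 0.69, -0.08
c, d = -0.08, 0.16

det = a * d - b * c
print(round(det, 4))   # 0.104
```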
Gaussian Mixture Models (GMM)
Source: Scikit-learn
ØCan be extended to multivariate data (e.g., bivariate GMM)
Google Colab (Code Example)
Anomaly Detection (Semi-Supervised)
What is the goal?
(Figure: transactions plotted by time of day vs. transaction $)
Performance Metrics
Why Performance?
ØQ: Why do we care about performance?
ØIdentifying how well our models are doing is not trivial. It is easy to fall into the trap of believing (or wanting to believe) that our models are working, based on a weak assessment.
How to measure the performance of a model?
ØAssume a case where we want to detect outliers and we know:
ØThe dataset has 100 points
Ø98 are non-outliers
Ø2 are outliers
ØThis is called an imbalanced dataset!
ØIf we detect all the points as non-outliers:
Ø98 True predictions/100 = 98% accuracy for a model that is not working
ØQ: How can we improve our performance measurements?
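The accuracy trap above is easy to demonstrate: a sketch of a "model" that predicts non-outlier for everything scores 98% accuracy while catching nothing.

```python
# 100 points: 98 normal (0) and 2 outliers (1); the model predicts "normal" always
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.98 -- looks great...
print(recall)    # 0.0  -- ...but every outlier is missed
```

This is why metrics like precision and recall, introduced next, matter on imbalanced data.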
Fraud Detection System
Transaction Features
Positive: Fraud
Negative: Valid
Fraud Detection System
(Positive = Fraud, Sick, …)
ØIf transaction is Valid:
ØPrediction : Valid (Negative): True Negative
ØPrediction : Fraud (Positive): False Positive
ØIf transaction is Fraud:
ØPrediction : Fraud (Positive): True Positive
ØPrediction : Valid (Negative): False Negative
Precision and Recall
ØIf transaction is Valid:
ØPrediction : Valid (True Negative) OK!
ØPrediction : Fraud (False Positive) Not that bad!
ØIf transaction is Fraud:
ØPrediction : Fraud (True Positive) GOOD!
ØPrediction : Valid (False Negative) Super BAD!
(Diagram: detected frauds vs. target frauds – flagging valid transactions is an FP mistake; missed frauds are FNs)
Confusion Matrix
ØTable used to describe prediction performance on a set of test data

                              Actual Value (as confirmed by experiments)
                              positives         negatives
Predicted Value   positives   True Positive     False Positive
(predicted by     negatives   False Negative    True Negative
the test)
F1 = 2 ⋅ (precision ⋅ recall) / (precision + recall)
ØA balanced measure of accuracy giving equal importance to recall and precision
ØThe highest possible value of F1 is 1, and the lowest possible value is 0
Precision and Recall – tug of war
Ø Application – spam detection
Ø Improving precision usually reduces recall, vice-versa
Model output
Source: ML Crash Course
Precision and Recall – tug of war
Ø Application – spam detection
Ø Improving precision usually reduces recall, vice-versa
You're saying everything on this side IS NOT spam (N)
You're saying everything on this side IS spam (P)
Model output 0.73
What is the precision (emails flagged as spam)? tp/(tp+fp) = 8/(8+2) = 0.8
What is the recall (actual spam correctly classified)? tp/(tp+fn) = 8/(8+3) = 0.73
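Reading the counts off this threshold (8 true positives, 2 false positives, 3 false negatives), precision, recall, and the F1 score from the earlier slide follow directly:

```python
# Counts from the spam example at this threshold
tp, fp, fn = 8, 2, 3

precision = tp / (tp + fp)   # fraction of flagged emails that really are spam
recall = tp / (tp + fn)      # fraction of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.8 0.73 0.76
```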
Source: ML Crash Course
Precision and Recall – tug of war
Ø Application – spam detection
Ø Improving precision usually reduces recall, vice-versa
You’re saying everything on this side IS NOT spam (N)
Model output
You're saying everything on this side IS spam (P)
What is the precision (emails flagged as spam)? tp/(tp+fp) = 7/(7+1) = 0.88
What is the recall (actual spam correctly classified)? tp/(tp+fn) = 7/(7+4) = 0.64
Source: ML Crash Course
Precision and Recall – tug of war
Ø Application – spam detection
Ø Improving precision usually reduces recall, vice-versa
You're saying everything on this side IS NOT spam (N)
You're saying everything on this side IS spam (P)
Problem: The performance of the model is a function of the threshold!
What is the precision (emails flagged as spam)? tp/(tp+fp) = 7/(7+1) = 0.88
What is the recall (actual spam correctly classified)? tp/(tp+fn) = 7/(7+4) = 0.64
Source: ML Crash Course
Model output
ROC (Receiver Operating Characteristic) Curve
TPR (y-axis): you want this to be high. FPR (x-axis): you want this to be low.
Source: Wikipedia
Closest point to the top left = best accuracy
Receiver Operating Characteristic (plot for binary classifiers)
Q: What would be a perfect classifier?
Random classifier
Ø"Everyone is fraud": TPR=1, but FPR near 1 (e.g., FPR=0.95)
ØRaising the threshold lowers FPR: TPR=1, FPR=0.8 → TPR=1, FPR=0.05
ØPerfect classifier: TPR=1, FPR=0
ØVery conservative threshold: TPR=0.7, FPR=0 (misses some positives)
AUC (Area Under the Curve)
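AUC can be interpreted as the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counting half), which gives a threshold-free sketch of the metric (scores and labels below are made up for illustration):

```python
# AUC as the probability a random positive outscores a random negative
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]   # 1 = positive, 0 = negative

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# Count positive-negative pairs ranked correctly; ties contribute 0.5
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc = sum(pairs) / (len(pos) * len(neg))
print(auc)
```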
ØWeek 4 Q/A Support Session
ØProject Questions
ØWeek 5 Lecture – Data Processing
ØLinear Algebra
ØAnalytical Geometry
ØData Augmentation