
CS373 Data Mining and
Machine Learning

Lecture 9
Jean Honorio

Purdue University

Goal of machine learning?


Goal of machine learning
• Use algorithms that will perform well on unseen data

• How to measure performance?

• How to use unseen data?

• Variability?

• By-product: a way to set hyper-parameters
- e.g., C for SVMs, λ for kernel ridge regression, etc.

Measures of Performance: Regression
• Assume that for a point x, we predict g(x)

• Mean square error:
  MSE(g) = (1/n) Σ_{i=1}^n (g(x_i) − y_i)²

• Root mean square error:
  RMSE(g) = √(MSE(g))

• Mean absolute error:
  MAE(g) = (1/n) Σ_{i=1}^n |g(x_i) − y_i|
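As an illustration, a minimal NumPy sketch of these three measures (the function and variable names are my own, not from the lecture):

    import numpy as np

    def regression_measures(y_pred, y_true):
        # y_pred holds the predictions g(x_i), y_true the targets y_i
        err = y_pred - y_true
        mse = np.mean(err ** 2)          # mean square error
        rmse = np.sqrt(mse)              # root mean square error
        mae = np.mean(np.abs(err))       # mean absolute error
        return mse, rmse, mae

    # Example: four predictions vs. their true targets
    print(regression_measures(np.array([2.5, 0.0, 2.1, 7.8]),
                              np.array([3.0, -0.5, 2.0, 8.0])))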

Measures of Performance: Classification
• True Positive (TP)
• True Negative (TN)
• False Positive (FP)
• False Negative (FN)

Confusion matrix (rows: Predicted Label, columns: True Label):

                 True +1   True −1
  Predicted +1     TP        FP
  Predicted −1     FN        TN

• Accuracy = (TP + TN) / (TP + FP + FN + TN)
• Error = (FP + FN) / (TP + FP + FN + TN)
• Recall / Sensitivity = TP / (TP + FN)
• Precision = TP / (TP + FP)
• Specificity = TN / (TN + FP)

• Use jointly: (Precision, Recall) or (Sensitivity, Specificity)
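A small sketch that computes these counts and measures from vectors of ±1 labels (names are my own, not from the lecture):

    import numpy as np

    def classification_measures(y_pred, y_true):
        # Counts for labels in {+1, -1}
        tp = np.sum((y_pred == +1) & (y_true == +1))
        tn = np.sum((y_pred == -1) & (y_true == -1))
        fp = np.sum((y_pred == +1) & (y_true == -1))
        fn = np.sum((y_pred == -1) & (y_true == +1))
        n = tp + tn + fp + fn
        return {"accuracy":    (tp + tn) / n,
                "error":       (fp + fn) / n,
                "recall":      tp / (tp + fn),   # a.k.a. sensitivity
                "precision":   tp / (tp + fp),
                "specificity": tn / (tn + fp)}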

Precision and Recall
• Idea comes from information retrieval

[Figure: precision and recall illustration, from Wikipedia]

Sensitivity and Specificity
• Idea comes from signal detection theory
• Assume Gaussian distributions

• By sliding the offset θ_0 we get different (TP, FP, TN, FN)
  and thus, different sensitivity and specificity

Pattern Classification, Duda et al., 2nd Ed., Chapter 2

p(x | y = +1) = N(µ_1, σ²)
p(x | y = −1) = N(µ_2, σ²)

Classifier:
g(x) = +1 if x < θ_0, −1 if x ≥ θ_0

[Figure: the two Gaussian class-conditional densities p(x | y = +1) and
p(x | y = −1), with the decision threshold θ_0 between them]

Receiver Operating Characteristic (ROC)
• By varying the hyperparameter of a classifier (e.g., C for SVM with a linear
  kernel, β for SVM with RBF kernels) we can get different:
  - Sensitivity
  - Specificity

• Summarized with an Area Under the Curve (AUC)
  - Random: 0.5
  - Perfect classifier: 1

[Figure: ROC curves for SVM with a linear kernel, SVM with RBF kernels, and
logistic regression; The Elements of Statistical Learning, Hastie et al.]

Other Loss Functions
• Let +1 mean "diseased patient" and −1 mean "healthy patient"

• 0-1 loss (every mistake costs the same):

                 True +1   True −1
  Predicted +1      0         1
  Predicted −1      1         0

  (1/n) Σ_{i=1}^n 1[g(x_i) ≠ y_i]

• Asymmetric cost (missing a diseased patient is worse):

                 True +1   True −1
  Predicted +1      0         1
  Predicted −1     10         0

  (1/n) Σ_{i=1}^n Cost(g(x_i), y_i)

2) Using "Unseen" Data
• Overfitting:
  - Bigger model class, better fit on training data (linear to quadratic to cubic)
  - Find hyper-parameters that better fit training data
  - Usually poor performance on unseen data

• To prevent overfitting, how can we "see" unseen data?
  - Simulate it!

Pattern Classification, Duda et al., 2nd Ed., Chapter 9

Training, Validation, Testing
• Three data sets:
  - Training set: try different algorithms, different hyper-parameters
  - Validation set: choose among the trained candidates
  - Test set: report measures

k-Fold Cross Validation
• Split training data D into k disjoint sets S_1, …, S_k
  - Either randomly, or in a fixed fashion
  - If D has n samples, then each fold has approximately n / k samples
  - Popular choices: k = 5, k = 10, k = n (leave-one-out)

• For i = 1…k:
  - train with sets S_1, …, S_{i−1}, S_{i+1}, …, S_k
  - test on set S_i
  - let M_i be the test measure (e.g., accuracy, MSE)

• Mean and variance are:
  µ̂ = (1/k) Σ_{i=1}^k M_i
  σ̂² = (1/k) Σ_{i=1}^k (M_i − µ̂)²

0.632 Bootstrapping
• Let B > 0, and let n be the number of training samples in D

• For i = 1…B:
  - pick n samples from D with replacement, call it S_i
    (S_i might contain the same sample more than once)
  - train with set S_i
  - test on the remaining samples (D − S_i)
  - let M_i be the test measure (e.g., accuracy, MSE)

• Mean and variance are:
  µ̂ = (1/B) Σ_{i=1}^B M_i
  σ̂² = (1/B) Σ_{i=1}^B (M_i − µ̂)²
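A matching sketch of the bootstrap loop, again with hypothetical train and evaluate placeholders:

    import numpy as np

    def bootstrap_estimate(X, y, B, train, evaluate, seed=0):
        rng = np.random.RandomState(seed)
        n = len(y)
        measures = []
        for _ in range(B):
            boot = rng.randint(0, n, size=n)         # n draws with replacement -> S_i
            oob = np.setdiff1d(np.arange(n), boot)   # remaining samples D - S_i
            model = train(X[boot], y[boot])
            measures.append(evaluate(model, X[oob], y[oob]))
        M = np.array(measures)
        return M.mean(), M.var()                     # µ̂ and σ̂²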

0.632 Bootstrapping
• Why 0.632 ?

• Recall that:
  - We pick n items with replacement out of n items
  - We choose uniformly at random

• The probability of:
  - not picking one particular item in 1 draw is 1 − 1/n
  - not picking one particular item in n draws is (1 − 1/n)^n
  - picking one particular item in n draws is 1 − (1 − 1/n)^n

• Finally: lim_{n→∞} 1 − (1 − 1/n)^n = 1 − 1/e ≈ 0.632
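A quick numeric check of this limit (my own illustration, not from the slides):

    # 1 - (1 - 1/n)^n approaches 1 - 1/e ≈ 0.632 as n grows
    for n in (10, 100, 1000, 100000):
        print(n, 1 - (1 - 1/n) ** n)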

3) Variability
• How to compare two algorithms?

- Not only means, also variances!

• Statistical hypothesis testing

Statistical Hypothesis Testing
• How to compare two algorithms?
  - Not only means, also variances!

• Let µ̂_1, σ̂_1², µ̂_2, σ̂_2² be the means and variances of algorithms 1 and 2.

• When to reject the null hypothesis µ_1 = µ_2 in favor of µ_1 > µ_2?

• Let:
  x = (µ̂_1 − µ̂_2) √n / √(σ̂_1² + σ̂_2²)
  ν = (σ̂_1² + σ̂_2²)² (n − 1) / (σ̂_1⁴ + σ̂_2⁴)
  where ν is the degrees of freedom of the
  Student's t-distribution

Statistical Hypothesis Testing
• Student's t-distribution:

[Figure: probability density function (PDF) and cumulative distribution
function (CDF) of the Student's t-distribution, from Wikipedia]

• For significance level α and degrees of freedom ν:
  - Find the value x_{1−α,ν} for which CDF = 1 − α
  - Python: from scipy.stats import t
            t.ppf(1 - alpha, v)

• If x > x_{1−α,ν}, reject the null hypothesis µ_1 = µ_2 in favor of µ_1 > µ_2
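Putting the two slides together: a sketch of the full test, assuming M1 and M2 are NumPy arrays holding the per-fold measures of algorithms 1 and 2, both of length n (variable names are my own):

    import numpy as np
    from scipy.stats import t

    def reject_null(M1, M2, alpha=0.05):
        # Test H0: mu1 = mu2 against H1: mu1 > mu2 at significance level alpha
        n = len(M1)
        mu1, var1 = M1.mean(), M1.var()
        mu2, var2 = M2.mean(), M2.var()
        x = (mu1 - mu2) * np.sqrt(n) / np.sqrt(var1 + var2)          # test statistic
        v = (var1 + var2) ** 2 * (n - 1) / (var1 ** 2 + var2 ** 2)   # degrees of freedom
        return x > t.ppf(1 - alpha, v)                               # x > x_{1-alpha,nu} ?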

What is a Sample?
• In this lecture we assume that each sample is a different
  "unit of interest" for the experimenter

• Never sample the same "unit of interest" several times
  - In a medical application, we might be interested in knowing
    the accuracy (and variance) with respect to patients.
  - Taking two visits of the same patient as two different samples
    would be incorrect.

• Collect more data, if necessary
• Never duplicate (copy & paste) data.