Lecture 3: Classification and Validation
Instructor:
Outline of this lecture
} Logistic Regression
} Prediction Function, Cost Function, and Optimization
} Evaluation (Measurements)
} Case Study: University Admission System
} Other Linear Classifiers
} Model Selection and Validation
Recap: Linear Regression
} How is the prediction function defined?
} Linear Regression for a single feature
} How is the cost function defined?
} Residual sum of squares
} How is the optimization conducted?
} Gradient Descent Method
Binary Classification Problems
Housing Price → real value
Spam Email detection → spam or normal
Fish classification → sea bass or salmon
Regression for Fish Classification
Prediction Function: f_θ(x) = θ^T x
} If f_θ(x) > 0.5, then it is Salmon
} Otherwise, it is Sea Bass
Classification vs Regression
} Classification output: 1 or 0
} Regression h_θ(x) output: continuous values
} Adapting Regression for Classification Problems: 0 ≤ h_θ(x) ≤ 1
Logistic Regression
} Three Components
} Prediction Function
} Cost Function
} Optimization (Learning)
Logistic Regression
} How to formulate a binary classification problem as a regression problem?
h_θ(x) = g(θ^T x), with 0 ≤ h_θ(x) ≤ 1
g: sigmoid function or logistic function
g(z) = 1 / (1 + e^{-z})
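A minimal NumPy sketch of the prediction function h_θ(x) = g(θ^T x) (the parameter and feature values below are purely illustrative):

import numpy as np

def sigmoid(z):
    # maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # logistic-regression hypothesis g(theta^T x)
    return sigmoid(np.dot(theta, x))

theta = np.array([-1.0, 2.0])   # example parameters
x = np.array([1.0, 0.8])        # example feature vector
print(h(theta, x))              # about 0.65, read as P(y = 1 | x)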
Logistic Regression
} Sigmoid Function is not the only choice
} Other choice: the hyperbolic tangent (tanh), which maps to (−1, 1)
(Figure: curve of the activation function over x from −5 to 5.)
Understanding: h!(𝑥)
} h_θ(x) is the estimated probability that the output is 1 on input x
} E.g., if h_θ(x) = 0.7, then the fish is sea bass with a probability of 0.7
Understanding: h_θ(x)
h_θ(x) = g(θ^T x)
g(z) = 1 / (1 + e^{-z})
} If θ^T x ≥ 0, then g(θ^T x) ≥ 0.5, and it is likely that y = 1
} If θ^T x ≤ 0, then g(θ^T x) ≤ 0.5, and it is likely that y = 0
Understanding: h_θ(x)
} Consider h_θ(x) as the probability of x being positive (y = 1)
} Then 1 − h_θ(x) is the probability of being negative, i.e.:
1 − h_θ(x) = 1 − 1 / (1 + e^{-θ^T x}) = e^{-θ^T x} / (1 + e^{-θ^T x})
} The quantity h_θ(x) / (1 − h_θ(x)) is called the 'odds'
} E.g., 1 in 5 students, an odds of 1/4, will get an A
} E.g., 9 out of 10 students, an odds of 9, will graduate
} Log-Odds or Logit is defined as
log[ h_θ(x) / (1 − h_θ(x)) ] = θ^T x
which is linear in x
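A quick numeric illustration of odds and the logit (the probability used here is hypothetical):

import numpy as np

p = 0.2                 # e.g., 1 in 5 students gets an A
odds = p / (1 - p)      # 0.25, i.e., odds of 1/4
logit = np.log(odds)    # log-odds; for logistic regression this equals theta^T x
print(odds, logit)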
Learning of logistic regression
} Training samples (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m)), with y^(i) ∈ {0, 1}
} Prediction function
} Cost function: how to choose parameters θ? For every sample i:
} If y^(i) = 1 and h_θ(x^(i)) = 1: no cost!
} If y^(i) = 1 and h_θ(x^(i)) = 0.7: small cost!
} If y^(i) = 1 and h_θ(x^(i)) = 0.3: bigger cost!
} If y^(i) = 1 and h_θ(x^(i)) = 0: infinite cost!
} If y^(i) = 0 and h_θ(x^(i)) = 0: no cost!
} If y^(i) = 0 and h_θ(x^(i)) = 0.3: small cost!
} If y^(i) = 0 and h_θ(x^(i)) = 0.7: bigger cost!
} If y^(i) = 0 and h_θ(x^(i)) = 1: infinite cost!
Magic function I
(Plot: the cost curve −log(h_θ(x^(i))) over h_θ(x^(i)) from 0 to 1.)
} If y^(i) = 1 and h_θ(x^(i)) = 1: no cost!
} If y^(i) = 1 and h_θ(x^(i)) = 0.7: small cost!
} If y^(i) = 1 and h_θ(x^(i)) = 0.3: bigger cost!
} If y^(i) = 1 and h_θ(x^(i)) = 0: infinite cost!
Magic function II
(Plot: the cost curve −log(1 − h_θ(x^(i))) over h_θ(x^(i)) from 0 to 1.)
} If y^(i) = 0 and h_θ(x^(i)) = 0: no cost!
} If y^(i) = 0 and h_θ(x^(i)) = 0.3: small cost!
} If y^(i) = 0 and h_θ(x^(i)) = 0.7: bigger cost!
} If y^(i) = 0 and h_θ(x^(i)) = 1: infinite cost!
Cost Function for Logistic Regression
} Training samples (x^(1), y^(1)), …, (x^(m), y^(m)), with y^(i) ∈ {0, 1}
} Prediction function
h_θ(x) = 1 / (1 + e^{-θ^T x})
} Cost function:
Cost(h_θ(x^(i)), y^(i)) = −log[h_θ(x^(i))] if y^(i) = 1; −log[1 − h_θ(x^(i))] if y^(i) = 0
Understanding Cost function
Cost(h_θ(x^(i)), y^(i)) = −log[h_θ(x^(i))] if y^(i) = 1; −log[1 − h_θ(x^(i))] if y^(i) = 0
} If y^(i) = 1 and h_θ(x^(i)) = 1, then cost = 0
} As h_θ(x^(i)) → 0, cost → ∞
(Plot: −log(h_θ(x^(i))) over h_θ(x^(i)) from 0 to 1.)
Understanding Cost function
Cost(h_θ(x^(i)), y^(i)) = −log[h_θ(x^(i))] if y^(i) = 1; −log[1 − h_θ(x^(i))] if y^(i) = 0
} If y^(i) = 0 and h_θ(x^(i)) = 0, then cost = 0
} As h_θ(x^(i)) → 1, cost → ∞
(Plot: −log(1 − h_θ(x^(i))) over h_θ(x^(i)) from 0 to 1.)
Compact Cost Function
Cost(h_θ(x^(i)), y^(i)) = −log[h_θ(x^(i))] if y^(i) = 1; −log[1 − h_θ(x^(i))] if y^(i) = 0

F(θ) = − Σ_i [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
What’s the difference between Least Square Loss and the following log loss (cross-entropy)?
F(θ) = −(1/m) Σ_i [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
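One practical difference: the log loss penalizes confident wrong predictions much more heavily than the squared loss, and it gives a convex objective for logistic regression. A minimal NumPy comparison on made-up predictions:

import numpy as np

y = np.array([1, 0, 1, 1])             # ground-truth labels
p = np.array([0.9, 0.2, 0.6, 0.05])    # predicted probabilities h_theta(x^(i))

log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
squared_loss = np.mean((y - p) ** 2)

print(log_loss)      # about 0.96: dominated by the confidently wrong last prediction
print(squared_loss)  # about 0.28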
How to Optimize the Cost Function?
argmin_θ F(θ)
Gradient Descent
} Sigmoid function y(z) has derivative: dy/dz = y(z)(1 − y(z))
} Repeat {
θ_j := θ_j − α ∂F(θ)/∂θ_j
}
} Derivative of Cost function
F(θ) = −(1/m) Σ_i [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
∂F(θ)/∂θ_j = (1/m) Σ_i [ h_θ(x^(i)) − y^(i) ] x_j^(i)
Identical format as the gradient of linear regression
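A vectorized NumPy sketch of this update rule (a minimal illustration, not the course implementation used in the case study later; the learning rate and iteration count are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iterations=1000):
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1}
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)        # h_theta(x^(i)) for all samples at once
        grad = X.T @ (h - y) / m      # (1/m) * sum_i (h - y) * x_j^(i)
        theta -= alpha * grad         # simultaneous update of all theta_j
    return theta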
} How is the prediction function defined?
} How is the cost function defined?
} How is the optimization conducted?
Multi-class problems
} Logistic Regression is designed for binary classification
} How to deal with multi-class classification?
} Email → spam, work, friends, family
} Weather → sunny, cloudy, rain, snow
} Fish → sea bass, salmon, …
Multi-class problems: one-vs-all
} For each class i, train a logistic regression classifier h_θ^(i)(x), where the positive class is class i and the negative class is all remaining classes
} For a new input x, classify it using:
argmax_i h_θ^(i)(x)
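A small sketch of the one-vs-all decision rule, assuming thetas is a list of K already-learned parameter vectors, one per binary classifier (the variable names are hypothetical):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(thetas, x):
    # thetas[i] parameterizes the "class i vs. the rest" classifier
    scores = [sigmoid(np.dot(theta, x)) for theta in thetas]
    return int(np.argmax(scores))   # class with the largest h_theta_i(x)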
Evaluation for classification
} One might simply report accuracy over the predictions for all testing samples, e.g., the percentage of correctly classified samples
} Comprehensive metric: Confusion Matrix
Binary classification
} For every sample, denote y as the ground-truth label and ŷ as the predicted label
} We use a confusion matrix to analyze the results of a model
                 Prediction: 1      Prediction: 0
Groundtruth: 1   True Positive      False Negative
Groundtruth: 0   False Positive     True Negative
                 Prediction: 1        Prediction: 0
Groundtruth: 1   A: True Positive     B: False Negative
Groundtruth: 0   C: False Positive    D: True Negative
} Accuracy:
(A + D) / (A + B + C + D)
                 Prediction: 1        Prediction: 0
Groundtruth: 1   A: True Positive     B: False Negative
Groundtruth: 0   C: False Positive    D: True Negative
} Precision for one class (e.g., 1 or 0):
A / (A + C) for class 1, or D / (B + D) for class 0
                 Prediction: 1        Prediction: 0
Groundtruth: 1   A: True Positive     B: False Negative
Groundtruth: 0   C: False Positive    D: True Negative
} Recall for one class (1 or 0):
A / (A + B) for class 1, or D / (C + D) for class 0
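A minimal sketch computing these metrics directly from the four counts (the counts A, B, C, D below are hypothetical):

A, B, C, D = 40, 10, 5, 45   # TP, FN, FP, TN (hypothetical counts)

accuracy = (A + D) / (A + B + C + D)
precision_1 = A / (A + C)   # precision for class 1
recall_1 = A / (A + B)      # recall for class 1
precision_0 = D / (B + D)   # precision for class 0
recall_0 = D / (C + D)      # recall for class 0

print(accuracy, precision_1, recall_1, precision_0, recall_0)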
Quiz: binary classification
(Table: Prediction vs. Groundtruth counts for the animal-photo classifier, shown on the slide.)
The classification results of animal photos are summarized in the above table. What are the accuracy, precision, and recall?
Quiz: multi-class classification
(Table: multi-class Prediction vs. Groundtruth counts for the animal-photo classifier, shown on the slide.)
The classification results of animal photos are summarized in the above table. What are the accuracy, precision, and recall?
Outline of this lecture
} Logistic Regression
} Prediction Function, Cost Function, and Optimization
} Evaluation (Measurements)
} Case Study: University Admission System
} Other Linear Classifiers
} Model Selection and Validation
Case Study: University Admission
Python implementation
} File structure
} data.csv: data file
} main_logit.py: main entry point
} util.py: helper functions
} Features: Two scores
} Labels: admitted, not admitted
(Preview of data.csv: each row lists two exam scores and an admission label.)
1. Data processing
2. Data splitting
3. Method 1: Training and testing (sklearn)
4. Method 2: Training and testing (our own code)
5. Comparing two methods
Step 1: data processing
} Scale all data to be between -1 and 1
import numpy as np
import pandas as pd
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))
df = pd.read_csv("data.csv", header=0)
df.columns = ["grade1", "grade2", "label"]
x = df["label"].map(lambda x: float(x.rstrip(';')))
X = df[["grade1", "grade2"]]
X = np.array(X)
X = min_max_scaler.fit_transform(X)
Y = df["label"].map(lambda x: float(x.rstrip(';')))
Y = np.array(Y)
Step 1: data processing
} Scale all data to be between -1 and 1
Step 2: data splitting
} Split the dataset into training and testing subsets

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
Step 3: using sklearn (method 1)
} Using sklearn class for training and testing logistic regression models
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()    # class instance
clf.fit(X_train, Y_train)     # train
print('score Scikit learn: ', clf.score(X_test, Y_test))
Step 4: using own method (method 2)
} Solver: Gradient descent method, implemented in a separate function Gradient_Descent()
theta = [0, 0]           # initial model parameters
alpha = 0.1              # learning rate
max_iteration = 1000     # maximal iterations
m = len(Y)               # number of samples

for x in range(max_iteration):
    new_theta = Gradient_Descent(X, Y, theta, m, alpha)
    theta = new_theta
    if x % 200 == 0:
        # calculate the cost function with the present theta
        Cost_Function(X, Y, theta, m)
Step 4: using own method (method 2)
} Functions in util.py
} Sigmoid(s)
} Prediction (theta, x)
} Cost_Function(X, Y, theta, m)
} Cost_Function_Derivative (X, Y, theta, j, m, alpha)
} Gradient_Descent(X, Y, theta, m, alpha)
Step 4: using own method (method 2)
} Implementation of sigmoid function
import math

def Sigmoid(x):
    g = float(1.0 / float(1.0 + math.exp(-1.0 * x)))
    return g
Step 4: using own method (method 2)
} Implementation of prediction function
def Prediction(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i] * theta[i]
    return Sigmoid(z)
Step 4: using own method (method 2)
} Implementation of cost function
def Cost_Function(X, Y, theta, m):
    sumOfErrors = 0
    for i in range(m):
        xi = X[i]
        est_yi = Prediction(theta, xi)
        if Y[i] == 1:
            error = Y[i] * math.log(est_yi)
        elif Y[i] == 0:
            error = (1 - Y[i]) * math.log(1 - est_yi)
        sumOfErrors += error
    const = -1.0 / m   # float division (integer division would floor to -1)
    J = const * sumOfErrors
    return J
Step 4: using own method (method 2)
} Derivative of cost function
def Cost_Function_Derivative(X, Y, theta, j, m, alpha):
    sumErrors = 0
    for i in range(m):
        xi = X[i]
        xij = xi[j]
        hi = Prediction(theta, X[i])
        error = (hi - Y[i]) * xij
        sumErrors += error
    m = len(Y)
    constant = float(alpha) / float(m)
    J = constant * sumErrors
    return J
Step 4: using own method (method 2)
} Gradient Descent Function
def Gradient_Descent(X, Y, theta, m, alpha):
    new_theta = []
    constant = alpha / m
    for j in range(len(theta)):
        deltaF = Cost_Function_Derivative(X, Y, theta, j, m, alpha)
        new_theta_value = theta[j] - deltaF
        new_theta.append(new_theta_value)
    return new_theta
Step 5: comparing two methods
} Comparing the accuracy of the two methods on the test set
winner = ""
score = 0
# accuracy for sklearn
scikit_score = clf.score(X_test, Y_test)
length = len(X_test)
for i in range(length):
    prediction = round(Prediction(theta, X_test[i]))
    answer = Y_test[i]
    if prediction == answer:
        score += 1
my_score = float(score) / float(length)
if my_score > scikit_score:
    print('You won!')
else:
    print('Scikit won.. :(')
Outline of this lecture
} Logistic Regression
} Prediction Function, Cost Function, and Optimization
} Evaluation (Measurements)
} Case Study: University Admission System
} Other Linear Classifiers
} Model Selection and Validation
Discriminant Analysis
} A popular way to employ Bayes' Theorem to solve classification problems, especially multi-class classification problems.
} When normal (Gaussian) distributions are used to model each class, we get Linear Discriminant Analysis (LDA) or Quadratic Discriminant Analysis (QDA)
} We will briefly introduce what LDA is
Discriminant Analysis
} To classify a sample x into one of K classes (K ≥ 2), we aim to find the class k that maximizes the following conditional probability:
p_k(x) = Pr(y = k | x)
} With Bayes' Theorem, we have
Pr(y = k | x) = Pr(x | y = k) Pr(y = k) / Pr(x)
} Let π_k denote the prior probability for class k, and
f_k(x) = Pr(x | y = k) denote the density for x in class k; then we have
Pr(y = k | x) = π_k f_k(x) / Σ_l π_l f_l(x)
Discriminant Analysis
p_k(x) = Pr(y = k | x) = π_k f_k(x) / Σ_l π_l f_l(x)
} Testing: for a given sample x, we label it as class k*, where
k* = argmax_k p_k(x)
} The key question is: what form does the density function f_k(x) take?
} Let p denote the number of features used for representing x
} If p = 1, i.e., there is only one feature for x, we have
f_k(x) = (1 / (√(2π) σ_k)) exp( −(x − μ_k)² / (2σ_k²) )
where μ_k and σ_k² are the mean and variance of the samples in class k
Discriminant Analysis (p=1)
p_k(x) = Pr(y = k | x) = π_k f_k(x) / Σ_l π_l f_l(x)
} Each class is modeled with a Gaussian distribution
} Assuming all K classes share the same variance σ², we have
p_k(x) = [ (1/(√(2π)σ)) exp(−(x − μ_k)² / (2σ²)) π_k ] / [ Σ_l (1/(√(2π)σ)) exp(−(x − μ_l)² / (2σ²)) π_l ]
} Taking the log of the above equation and applying argmax, the class label k* can be found by maximizing the following function
δ_k(x) = x μ_k / σ² − μ_k² / (2σ²) + log π_k
} The above function is called the discriminant function
Discriminant Analysis (p>1)
p_k(x) = Pr(y = k | x) = π_k f_k(x) / Σ_l π_l f_l(x)
} Each class is modeled with a multivariate Gaussian distribution
} Assuming all K classes share the same covariance matrix Σ (p by p), and each class has mean vector μ_k (p by 1), we have
f_k(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp( −(1/2)(x − μ_k)^T Σ^{-1} (x − μ_k) )
} The discriminant function is
δ_k(x) = x^T Σ^{-1} μ_k − (1/2) μ_k^T Σ^{-1} μ_k + log π_k
Why Discriminant Analysis?
} When classes are well-separated, logistic regression models are not stable, but LDA/QDA are
} When the number of samples is relatively small and the features are approximately normal in each class, LDA is much more stable than logistic regression
} LDA can be naturally applied over multi-class classification problems
https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
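A minimal usage sketch of the scikit-learn class linked above, on synthetic two-class Gaussian data (the data and class means are made up for illustration):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
# two classes drawn from Gaussians with different means and a shared covariance
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [3, 3]])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict([[2.5, 2.5]]))        # predicted class label
print(lda.predict_proba([[2.5, 2.5]]))  # posterior probabilities Pr(y = k | x)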
Outline of this lecture
} Logistic Regression
} Prediction Function, Cost Function, and Optimization
} Evaluation (Measurements)
} Case Study: University Admission System
} Other Linear Classifiers
} Model Selection and Validation
Why Model Selection
} Practical Questions in Machine Learning Systems
} How to set the learning rate for gradient descent algorithms?
} How to select the best algorithm from multiple algorithms which are all applicable for the same problem?
} How do we set the best algorithm’s parameters?
} Model Selection
} A practical solution
Model Selection: Validation
} Given a set of sample-label pairs, we often hold out a subset for testing and evaluation purposes
} Training Samples
} Testing Samples
} Validation: to select models or model parameters, we further divide the training samples into two subsets
} One for training
} The other for validation!
Dataset: training & testing
(Diagram: the dataset of Features and Predictions is split into Training and Testing portions; part of the Training portion is further held out for validation.)
Regression: Validation
} Regression Problems
} L1 error
} L2 error
} Calculate the above metrics for the following three subsets
} Training
} Validation
} Testing
How many samples do we need to effectively validate a hypothesis predictor?
Theorem for Model Validation
Let h be some predictor and assume that the loss function is in [0,1]. Then, for every δ ∈ (0,1), with probability of at least 1 − δ over the choice of a validation set V of size m_v, we have

|L_V(h) − L_𝒟(h)| ≤ √( log(2/δ) / (2 m_v) )

L_V(h): validation risk of h
L_𝒟(h): true risk of h
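A quick numeric reading of the bound (the validation-set sizes below are hypothetical): the gap between validation risk and true risk shrinks like 1/√m_v.

import math

def validation_gap(m_v, delta):
    # right-hand side of the bound above: sqrt(log(2/delta) / (2 * m_v))
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m_v))

for m_v in (100, 1000, 10000):
    print(m_v, validation_gap(m_v, delta=0.05))   # roughly 0.136, 0.043, 0.014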
Validation: Classification
} Given a prediction function, one needs a strategy for making discrete predictions over testing samples
} For binary classification, e.g., the label of a sample x^(i) is predicted as
ŷ^(i) = 1 if f_θ(x^(i)) > 0.5, and ŷ^(i) = 0 otherwise
} The threshold could vary case by case
} Consider a binary classifier f_θ(x), where x represents the symptoms of a patient and its return value is a binary label, representing 'having cancer' (positive) or 'not having cancer' (negative). What's the RISK of using a relatively small threshold?
Validation: Classification
} The predicted labels should be compared to the corresponding true labels to calculate measures of success
} Two Cases:
} With a fixed threshold, how does the machine learning model perform?
} Without a fixed threshold, how does the machine learning model perform?
Measure 1: Confusion Matrix
} With a fixed threshold, every testing sample receives one and only one label
Measure 2: ROC Curve
} Without a fixed threshold, we still need to evaluate how the system performs under all possible thresholds (with different decision risks)
Measure 2: ROC Curve
} ROC: Receiver operating characteristic for Binary classification
} Each point of the curve represents a threshold
} For each threshold, we calculate two rates:
False Positive Rate = False Positives / Number of Negatives = False Positives / (False Positives + True Negatives)
True Positive Rate = True Positives / Number of Positives = True Positives / (True Positives + False Negatives)
Measure 2: ROC Curve
} To generate a ROC curve, one might change the threshold from Min to Max
} With Min, a majority of testing samples will be classified to be positive
} With Max, a majority of testing samples will be classified to be negative
Sklearn.metrics.roc_curve
sklearn.metrics.roc_curve(y_true, y_score, *, pos_label=None, sample_weight=None)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
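A minimal usage sketch of roc_curve (the labels and scores below are hypothetical):

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])               # ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])     # predicted probabilities for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
print(fpr)         # false positive rate at each threshold
print(tpr)         # true positive rate at each threshold
print(thresholds)  # the thresholds themselves, from high to low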
Source code available
} Lecture03_roc_curve.ipynb
K-Fold Cross Validation
} Motivation: to use each and every training sample for both training and validation purposes
} K-Fold Cross Validation: partition the training set into K subsets
} For each subset (or fold), train the model using the other subsets and validate the model on that subset.
} The average of these K-fold validation errors is used as the measure of success
} If K is equal to the number of training samples, this is called leave-one-out validation.
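A minimal sketch of K-fold cross validation with scikit-learn (assuming X and Y hold the scaled features and labels from the case study; 5 folds and the default accuracy scoring are illustrative choices):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression()
scores = cross_val_score(clf, X, Y, cv=5)   # one validation score per fold
print(scores)
print(np.mean(scores))   # averaged validation score used as the measure of success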
Outline of this lecture
} Logistic Regression
} Prediction Function, Cost Function, and Optimization
} Evaluation (Measurements)
} Case Study: University Admission System
} Other Linear Classifiers
} Model Selection and Validation
The derivatives of a logistic neuron
y = 1 / (1 + e^{−z}) = (1 + e^{−z})^{−1}

dy/dz = −1 · (−e^{−z}) / (1 + e^{−z})²
      = [ 1 / (1 + e^{−z}) ] · [ e^{−z} / (1 + e^{−z}) ]
      = y (1 − y)

because

e^{−z} / (1 + e^{−z}) = (1 + e^{−z} − 1) / (1 + e^{−z}) = 1 − 1 / (1 + e^{−z}) = 1 − y
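A quick numerical check of dy/dz = y(1 − y) with a finite difference (the test point z = 0.7 is arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite-difference slope
analytic = sigmoid(z) * (1 - sigmoid(z))                      # y(1 - y)
print(numeric, analytic)   # the two values agree to several decimal places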