
COMP3115/4115 Exploratory Data Analysis and Visualization
Lecture 7: Data Classification
§ What Is Data Classification?
§ Data Classification Pipeline and Case Studies
§ Classification Algorithms: Perceptron Algorithm
§ Classification Performance Evaluation


An Example of Classification Problem
§ Learn to recognize apple or banana
(Figure: labelled example images of apples and bananas → Feature Engineering (e.g., a shape feature x2) → Model Building → the trained model predicts a new image as an apple.)

A Typical Classification Pipeline
Collecting Labelled Data → Feature Engineering → Building Classification Model → Model Evaluation → Model Deployment

Application 1: Covid-19 Mortality Prediction
(Data Preprocessing)
https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-021-01742-0#Tab3

Application 1: Covid-19 Mortality Prediction
(Feature Selection)

Application 1: Covid-19 Mortality Prediction
(Feature Statistics)

Application 1: Covid-19 Mortality Prediction
(Model Selection)

Application 1: Covid-19 Mortality Prediction
§ Optimal use of hospital resources for
– treating the patients with more critical conditions,
– assisting in providing higher-quality care, and
– reducing medical errors due to fatigue and long working hours in the ICU

Application 2: Spam Email Filtering
(Data Exploration)
https://towardsdatascience.com/email-spam-detection-1-2-b0e06a5c0472

Application 2: Spam Email Filtering (Pre-processing)
§ Convert all letters to lower/upper case
§ Remove numbers
§ Remove punctuation
§ Remove white spaces
§ Remove hyperlinks
§ Remove stop words such as "a", "about", "above", "down", "doing", and so on
§ Word stemming
§ Word lemmatization
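As an illustration only (not from the original slides), a minimal Python sketch of these pre-processing steps might look as follows; the `STOP_WORDS` set and the `preprocess` helper are hypothetical names, and stemming/lemmatization are only indicated in a comment (they would typically use a library such as NLTK).

```python
import re
import string

# A tiny illustrative stop-word set; in practice you would use a fuller list
# (e.g. from NLTK or scikit-learn).
STOP_WORDS = {"a", "about", "above", "down", "doing", "the", "is", "and", "to"}

def preprocess(text):
    """Apply the pre-processing steps listed above to one email body."""
    text = text.lower()                                   # lower-case all letters
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove hyperlinks
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()                                 # splitting also removes extra white space
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    # Word stemming / lemmatization would normally follow here,
    # e.g. with nltk.stem.PorterStemmer or WordNetLemmatizer.
    return " ".join(tokens)

print(preprocess("Click http://spam.example.com NOW to win 1000 dollars!!!"))
```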

Application 2: Spam Email Filtering
(Pre-processing)

Application 2: Spam Email Filtering (Feature Extraction)
§ Term Frequency
§ Word Embedding (to be discussed in the deep learning lecture)
… then feed the feature vectors to the classifier
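As a small illustration of term-frequency features (the slide itself does not show code), one could use scikit-learn's `CountVectorizer`; the toy emails below are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "win a free prize now",        # spam
    "meeting agenda for monday",   # ham
    "free prize claim now",        # spam
]

# Term-frequency features: each column counts how often a word occurs in an email.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)     # n_emails x vocabulary_size sparse matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
# These feature vectors (together with spam/ham labels) are then fed to the classifier.
```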

Overview of Classification Models
§ Simple Models
– Perceptron
– K Nearest Neighbors
§ Complex Models
– Decision Tree
– Support Vector Machine
– (Deep) Neural Networks
– Ensemble methods: Random Forest, Gradient Boosting Tree

Notation (Recap)
sepal_length  sepal_width  petal_length  petal_width  label
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
7.0           3.2          4.7           1.4          versicolor
6.3           3.3          6.0           2.5          virginica
…             …            …             …            …
§ X: the n*d feature matrix
– n: the number of data samples
– d: the dimensionality of each data sample
§ y: n*1 label vector
§ X(i,:): the i-th row in matrix X – i.e., the i-th data sample
§ xi: the i-th sample in column vector representation
– xi = X(i,:)T
§ x(i,j): the j-th feature of the i-th sample
§ yi: the label of the i-th sample
Capital bold letter for matrix, small bold letter for vector, small (italic) letter for scalar. Vectors are column vectors

Notation (Recap)
sepal_length  sepal_width  petal_length  petal_width  label
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
7.0           3.2          4.7           1.4          versicolor
6.3           3.3          6.0           2.5          virginica
…             …            …             …            …

X(2,:) = [4.9, 3, 1.4, 0.2] is a row vector.
x2 = X(2,:)^T is a column vector; the superscript T means transpose.
Both of them represent the second sample of X: xi = X(i,:)^T
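A small NumPy sketch of this notation (an illustration only; note that NumPy uses 0-based indexing while the slides use 1-based indices, and the variable names are mine):

```python
import numpy as np

# The n x d feature matrix X (n = 4 samples, d = 4 features) and the n x 1 label vector y.
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [7.0, 3.2, 4.7, 1.4],
              [6.3, 3.3, 6.0, 2.5]])
y = np.array(["setosa", "setosa", "versicolor", "virginica"])

row = X[1, :]             # X(2,:) in the slide's 1-based notation -> row vector [4.9, 3, 1.4, 0.2]
x2 = row.reshape(-1, 1)   # the same sample as a column vector, i.e. x_i = X(i,:)^T
feat = X[1, 2]            # x(2,3): the 3rd feature of the 2nd sample (= 1.4)
label = y[1]              # y_2: the label of the 2nd sample
print(row, feat, label)
```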

Linear Classification Task – An Illustrated Example
How about non-linear?

Perceptron Algorithm

Introduction to Perceptron Algorithm
§ One of the earliest algorithms for linear classification (Rosenblatt, 1958)
§ Try to find a hyperplane separating the labeled data
§ Guaranteed to find a separating hyperplane if the data is linearly separable

Linear Decision Boundary for Classification: Example
§ What is the formula for this linear boundary?
– x2 = -2*x1 + 2  ⇒  2*x1 + x2 - 2 = 0
– General form: f(x) = Σ_{j=1}^{d} w_j x_j + b = w^T x + b = 0
§ What label would we predict for a new data point x?
– x = [1, 2]: 2*1 + 2 - 2 = 2 > 0, so predict it as positive
– x = [-1, -1]: 2*(-1) + (-1) - 2 = -5 < 0, so predict it as negative

Introduction to Perceptron Algorithm
§ Problem Definition
– Given a training dataset {(x_i, y_i)}_{i=1}^{n}, where x_i is a d-dimensional input feature vector and y_i ∈ {-1, 1} is the corresponding class label.
– Objective: train a linear classifier that can separate positive and negative samples.
– Training: learn a linear classification function f(x) = Σ_{j=1}^{d} w_j x_j + b from the training data
  – y = 1 if f(x) = Σ_{j=1}^{d} w_j x_j + b > 0
  – y = -1 if f(x) = Σ_{j=1}^{d} w_j x_j + b ≤ 0
  – w_j and b are the model parameters that we need to learn from the training data
– Prediction: for any new input x, predict its class label as y = sign(f(x))
§ Simplify the notation
– By introducing an artificial feature x_0 = 1, so that x → [x_0, x_1, …, x_d], f(x) can be rewritten in vector form as
  f(x) = Σ_{j=1}^{d} w_j x_j + b = Σ_{j=1}^{d} w_j x_j + b*x_0 = Σ_{j=0}^{d} w_j x_j = w^T x

Introduction to Perceptron Algorithm
Perceptron algorithm:
  Initialize w = 0
  Repeat (cycle through the training data):
    If the current data point is correctly classified (y_i w^T x_i > 0): do nothing
    If the current data point is wrongly classified (y_i w^T x_i ≤ 0): update the model  w ← w + y_i x_i

y_i w^T x_i ≤ 0 means the training data point x_i is misclassified:
• true label y_i = 1, but the prediction sign(w^T x_i) = -1, or
• true label y_i = -1, but the prediction sign(w^T x_i) = 1

Perceptron: An Illustrated Example
Suppose we have a small training set with only 4 labelled data samples. Each data sample has two features (plus the artificial feature x0 = 1).

An Illustrated Example
• Initialize w = [0, 0, 0]
• Cyclically go through the data
The 1st pass:
([1,1,-1], 1):   y_i w^T x_i = 1*(0*1 + 0*1 + 0*(-1)) = 0 ≤ 0  →  w ← w + y_i x_i = [0,0,0] + 1*[1,1,-1] = [1,1,-1]
([1,1,1], 1):    y_i w^T x_i = 1*(1*1 + 1*1 + (-1)*1) = 1 > 0  →  do not update w
([1,-1,1], 1):   y_i w^T x_i = 1*(1*1 + 1*(-1) + (-1)*1) = -1 ≤ 0  →  w ← w + y_i x_i = [1,1,-1] + 1*[1,-1,1] = [2,0,0]
([1,-1,-1], -1): y_i w^T x_i = -1*(2*1 + 0*(-1) + 0*(-1)) = -2 ≤ 0  →  w ← w + y_i x_i = [2,0,0] + (-1)*[1,-1,-1] = [1,1,1]
The 2nd pass:
([1,1,-1], 1):   y_i w^T x_i = 1*(1*1 + 1*1 + 1*(-1)) = 1 > 0  →  do not update w
([1,1,1], 1):    y_i w^T x_i = 1*(1*1 + 1*1 + 1*1) = 3 > 0  →  do not update w
([1,-1,1], 1):   y_i w^T x_i = 1*(1*1 + 1*(-1) + 1*1) = 1 > 0  →  do not update w
([1,-1,-1], -1): y_i w^T x_i = -1*(1*1 + 1*(-1) + 1*(-1)) = 1 > 0  →  do not update w
No mistakes are made in the 2nd pass, so the algorithm stops with w = [1, 1, 1].

How to Plot the Decision Boundary?
§ The w learnt by the perceptron algorithm is [1, 1, 1], so the corresponding decision boundary is w0*x0 + w1*x1 + w2*x2 = 0. (Note that w is perpendicular to the line.)
§ We know that w0 = 1 and that x0 is an artificial feature which always equals 1. We also know w1 = 1 and w2 = 1. Therefore, the decision boundary is:
1 + x1 + x2 = 0  →  x1 + x2 = -1
§ The line (i.e., the decision boundary) can be plotted by connecting two data points lying on the line.
– Suppose x1 = 0, then x2 = -1
– Suppose x2 = 0, then x1 = -1
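One possible way to plot this boundary in Python (an illustrative sketch, not from the slides) is to connect points on the line x1 + x2 = -1 and draw w (its non-bias components) as an arrow perpendicular to it:

```python
import matplotlib.pyplot as plt
import numpy as np

w = np.array([1, 1, 1])   # learned weights [w0, w1, w2] from the example above

# Decision boundary: w0*1 + w1*x1 + w2*x2 = 0  ->  x1 + x2 = -1.
x1 = np.linspace(-3, 2, 100)
x2 = -(w[0] + w[1] * x1) / w[2]

plt.plot(x1, x2, label="x1 + x2 = -1")
# w (ignoring the bias component) is perpendicular to the boundary.
plt.arrow(0, 0, w[1], w[2], head_width=0.1, color="red")
plt.axhline(0, color="gray", linewidth=0.5)
plt.axvline(0, color="gray", linewidth=0.5)
plt.legend()
plt.show()
```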

Implementation of Perceptron in Python

Implementation of Perceptron in Python
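The code on these slides appears as screenshots that did not survive the export. The following is a minimal NumPy sketch of the perceptron described above; the function names `perceptron_train` and `perceptron_predict` are my own, not necessarily those used in the lecture.

```python
import numpy as np

def perceptron_train(X, y, max_passes=100):
    """Perceptron training. X: n x d feature matrix, y: labels in {-1, +1}.
    Returns the weight vector w with the bias folded in as w[0]."""
    n, d = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])   # add the artificial feature x0 = 1
    w = np.zeros(d + 1)
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            if yi * np.dot(w, xi) <= 0:    # misclassified (or on the boundary)
                w = w + yi * xi            # perceptron update
                mistakes += 1
        if mistakes == 0:                  # converged: all training points correct
            break
    return w

def perceptron_predict(w, X):
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.sign(Xa @ w)

# The 4-point example from the illustration (features without the artificial x0):
X = np.array([[1, -1], [1, 1], [-1, 1], [-1, -1]])
y = np.array([1, 1, 1, -1])
w = perceptron_train(X, y)
print(w, perceptron_predict(w, X))   # should reproduce w = [1, 1, 1] with this data ordering
```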

Why Does the Perceptron Work?
Without loss of generality, assume ||w|| = 1. Visually, w^T x = ||w|| ||x|| cosθ is the scalar value you get if you "project x onto w".
If the value is positive, predict it as positive.
If the value is negative, predict it as negative.
The line perpendicular to w divides the vectors classified as positive (1) from the vectors classified as negative (-1).

Why Does the Perceptron Work?

Why Does the Perceptron Work? (Visually)
§ Consider a positive data point that was wrongly classified as negative
w ← w + y_i x_i
(Figure: the weight vector w before the update and w + y_i x_i after the update, relative to x_i)
After updating w to w + y_i x_i, this positive data point will be correctly classified as positive.

Why Does the Perceptron Work?
§ Consider a negative data point that was wrongly classified as positive (the algebraic argument mirrors the positive case)

Why Does the Perceptron Work? (Visually)
§ Consider a negative data point that was wrongly classified as positive
w ← w + y_i x_i
(Figure: the weight vector w before and after the update)
After updating w to w + y_i x_i, this negative data point will be correctly classified as negative.

Convergence of Perceptron (Optional)
§ Theorem (Block & Novikoff): If the training data is linearly separable with margin γ by a unit-norm hyperplane w* (||w*|| = 1) with b = 0, then the perceptron converges after at most R²/γ² mistakes during training (assuming ||x|| < R for all x).

Concept of Margins
Margin of a sample: γ_i of a sample x_i is its distance from the hyperplane.
Margin of a set: the minimum margin over all samples, γ = min_{1≤i≤n} γ_i = min_{1≤i≤n} (w^T x_i) / ||w||

If the Data Is Not Linearly Separable
§ The perceptron algorithm will not converge if the training data is not linearly separable.

Perceptron: Some Additional Notes
§ Given a linearly separable training set {(x_i, y_i)}_{i=1}^{n}, the perceptron algorithm finds a classification model with "zero training error".
§ We can formulate the goal of learning the separating hyperplane as an optimization problem (minimizing a loss function related to the training error).

Learning as Optimization
§ The loss function of the perceptron:
  l(w) = Σ_{i=1}^{n} max{0, -y_i w^T x_i}
– Loss = 0 on samples where the perceptron prediction is correct, i.e., y_i w^T x_i > 0
– Loss > 0 on samples where the perceptron prediction is wrong, i.e., y_i w^T x_i ≤ 0
(Figure: the perceptron loss as a function of y_i w^T x_i)

Learning as Optimization
§ Therefore, the classification model w is learned by minimizing the perceptron loss function:
  min_w l(w) = Σ_{i=1}^{n} max{0, -y_i w^T x_i}
§ Gradient Descent:  w = w - η ∂l(w)/∂w
– The gradient is computed based on the entire training data
§ Stochastic Gradient Descent
– Works like gradient descent, but now the ('stochastic') gradient is computed based on only one data sample:
  w = w - η ∂max{0, -y_i w^T x_i}/∂w,  where the (sub)gradient is 0 if y_i w^T x_i > 0, and -y_i x_i if y_i w^T x_i ≤ 0
– η is the learning rate (step size)
– By setting η = 1, this recovers the perceptron updating rule:
  If y_i w^T x_i ≤ 0:  w ← w + y_i x_i

Learning as Optimization
§ Optimization is one of the core components of machine learning algorithms.
§ Most machine learning algorithms learn their parameters by minimizing a loss function based on the training data.
§ Linear Regression:
  min_w l(w) = Σ_{i=1}^{n} (y_i - (w^T x_i + b))²
§ Support Vector Machine:
  min ||w||²/2 + C Σ_{i=1}^{n} ξ_i
  subject to  y_i (w^T x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0,  i = 1, …, n

Optimization using Gradient Descent
§ Gradient Descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. It is a general algorithm that can be applied to many machine learning models.
§ Intuition of Gradient Descent: moving in the direction of steepest descent
(Figure: a curve l(w) with the tangent line at the current point)
• The gradient of the function at this point is the slope of the tangent line (i.e., the straight line that "just touches" the curve at this point)
• Here the slope is positive (the function is increasing)
• Therefore, we need to move in the opposite (negative-slope) direction
The gradient determines the direction of steepest increase of l(w). We move in the opposite direction of the gradient to minimize l(w).

Gradient Descent Algorithm
Gradient Descent
Initialize w
Repeat T times:
  w = w - η ∂l(w)/∂w
η: step size (e.g., 0.1)
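As a tiny illustration (not from the slides), gradient descent on a made-up one-dimensional loss l(w) = (w - 3)² looks like this:

```python
# Minimal gradient-descent sketch on a toy loss l(w) = (w - 3)^2,
# whose gradient is dl/dw = 2 * (w - 3). The loss and values are chosen only for illustration.
eta = 0.1          # step size
w = 0.0            # initialize w
for t in range(100):       # repeat T times
    grad = 2 * (w - 3)     # gradient of l at the current w
    w = w - eta * grad     # move in the direction of steepest descent
print(w)                   # converges towards the minimizer w = 3
```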

Stochastic Gradient Descent
§ Works like gradient descent, now the (‘stochastic’) gradient is computed based on only one data sample
Stochastic Gradient Descent
Initialize w
Repeat T times:
  For a data sample (x_i, y_i):
    w = w - η ∂l(w, x_i, y_i)/∂w
η: step size (e.g., 0.1); the stochastic gradient is computed based on one data sample

Perceptron: Additional Notes
The perceptron was criticized for its inability to handle non-linearly separable problems (e.g., the XOR function).

Perceptron: Nonlinearly Separable Problems
§ The perceptron cannot solve nonlinearly separable problems
§ Nonlinearly separable problems can be solved by
– Making the data linearly separable using a kernel (Support Vector Machine)
– Using a multi-layer perceptron (Neural Networks)

Classification Performance Evaluation

Classification Evaluation
§ Why Evaluation?
§ Methods for Estimating a Classifier’s Performance
§ Evaluation Metrics
– Accuracy
– Confusion Matrix
– Precision, Recall, Specificity, F-Score

Why Evaluation?
§ Multiple methods are available to build a classification model
– Perceptron
– Support Vector Machine (SVM)
– Deep Neural Networks, …
§ For each method, there can also be different model structures and parameter settings.
§ To choose the best model, we need a meaningful performance evaluation metric for classification-related projects.

Methods for Estimating a Classifier’s Performance

Training Sets and Test Sets
§ How can we get an unbiased estimate of the accuracy of a learnt model?
– When learning a model, you should pretend you do not have the test data yet.
– Accuracy estimates will be biased if you use the test data during training.
(Diagram: Labeled dataset → split into a Training dataset and a Test dataset; train a classification model on the Training dataset to obtain the Learned Model; evaluate it on the Test dataset to obtain the Estimated accuracy.)

Validation (Tuning) Sets
§ We want to estimate the accuracy during the training stage for tuning the hyper-parameters of a model.
(Diagram: Labeled dataset → Training dataset + Test dataset; the Training dataset is further split into a smaller Training dataset and a Validation dataset; train candidate classification models on the Training dataset, use the Validation dataset to select the best model, and report the Estimated accuracy of the selected Learned Model on the Test dataset.)
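A brief sketch of such a split using scikit-learn (the dataset and split ratios are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set first, then carve a validation set out of the remaining data.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Train candidate models on (X_train, y_train), use (X_val, y_val) to select the
# best model / hyper-parameters, and report the accuracy on (X_test, y_test) only once.
print(len(X_train), len(X_val), len(X_test))   # e.g. 90 / 30 / 30 samples
```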

Validation Set to Avoid Overfitting

Limitations of Using A Single Training/Test Partition
§ We may not have enough data to make sufficiently large training and test sets
– a larger test set gives us a more reliable estimate of accuracy (i.e., a lower-variance estimate)
– but… a larger training set can better represent the underlying data for learning
§ A single training set doesn’t tell us how sensitive accuracy is to a particular training set.

Random Resampling
(Diagram: in each of the k iterations, the Labeled dataset is randomly split into a Training dataset and a Test dataset.)
§ In each iteration, randomly sample a fraction of the data as training data and use the rest as test data.

Cross Validation
§ K-fold cross validation
– Create K equal-size partitions of the dataset
– Each partition has N/K samples
– Train using K-1 partitions, and test on the remaining partition
– Repeat the process K times, each time with a different test partition
Example with K = 5 (partitions S1, …, S5):
Training Dataset      Test Dataset
S2, S3, S4, S5        S1
S1, S3, S4, S5        S2
S1, S2, S4, S5        S3
S1, S2, S3, S5        S4
S1, S2, S3, S4        S5

Cross Validation Example
§ 5-fold Cross Validation
– Suppose we have 100 labeled samples; we use 5-fold cross validation to estimate the accuracy
Training Dataset      Test Dataset
S2, S3, S4, S5        S1
S1, S3, S4, S5        S2
S1, S2, S4, S5        S3
S1, S2, S3, S5        S4
S1, S2, S3, S4        S5
Pooling the correctly classified test samples across the five folds (73 out of 100 in this example):
Accuracy = 73/100 = 73%
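For illustration (not from the slides), k-fold cross validation can be run with scikit-learn; the estimator and dataset below are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score, KFold

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: each sample is used exactly once as test data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Perceptron(), X, y, cv=cv)   # accuracy on each of the 5 folds
print(scores, scores.mean())

# Leave-one-out is the special case with n_splits = number of samples,
# e.g. sklearn.model_selection.LeaveOneOut().
```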

Leave-one-out (LOO) Cross Validation
§ A special case of K-fold cross validation when K = N (number of training samples)
– Each partition is now a data sample
– Training using N-1 data samples, test on one data sample
– Repeat for N times
– Can be expensive for large N. Typically used when N is small.

Evaluation Metrics
§ Accuracy
§ Confusion Matrix
§ Precision, Recall, Specificity, F-Score
§ ROC Curve and Precision-Recall Curve (next lecture)

Evaluation Metrics in Practice

Example: Rapid Antigen Test

Accuracy
§ Accuracy is widely used.
– Fraction of correctly classified samples over the whole set of samples
– Number of Correctly Classified Samples / Total Number of Samples
(In the example: 4 of 5 test samples are classified correctly) Accuracy = 4/5 = 0.80
§ However, accuracy may be misleading sometimes. Why?

Accuracy Could Be Misleading
§ Accuracy with imbalanced classes
§ Suppose you have a test set with 2 classes containing 1000 samples
– 10 of them are positive class
– 990 of them are negative class
§ You build a classifier, and the accuracy on this test set is 96%
§ Is it good?

Accuracy Could Be Misleading
§ An imbalanced dataset with 2 classes containing 1000 samples – 10 of them are positive class
– 990 of them are negative class
§ Suppose we have a “dummy” classifier that does not analyse the features at all, and just blindly predict the most frequent class (so always negative for this case).
§ What is the accuracy of this "dummy" classifier on this imbalanced data?
– 990/1000 = 99% > 96%
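A small sketch reproducing this "dummy" baseline with scikit-learn's `DummyClassifier` (the data is synthetic, built only to match the 10/990 class counts above):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced test set: 10 positive (1) and 990 negative (-1) samples.
y_test = np.array([1] * 10 + [-1] * 990)
X_test = np.zeros((1000, 5))              # the features are irrelevant to the dummy model

dummy = DummyClassifier(strategy="most_frequent")   # always predicts the majority class
dummy.fit(X_test, y_test)
y_pred = dummy.predict(X_test)

print(accuracy_score(y_test, y_pred))     # 0.99 -- high accuracy, yet the model is useless
```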

Confusion Matrix
§ Binary Classification Example
• Label 1 = positive class (class of interest)
• Label -1 = negative class (everything else)

                      Predicted Positive   Predicted Negative
True Label Positive         TP                   FN
True Label Negative         FP                   TN

• TP = True Positive: positive samples correctly classified as belonging to the positive class
• FN = False Negative: positive samples misclassified as belonging to the negative class
• FP = False Positive: negative samples misclassified as belonging to the positive class
• TN = True Negative: negative samples correctly classified as belonging to the negative class
The confusion matrix provides more information than accuracy. Many evaluation metrics can be derived from it.

Confusion Matrix
§ Binary Classification Example (rows = true label, columns = predicted label)
What are the possible reasons leading to FN and FP?
The confusion matrix provides more information than accuracy. Many evaluation metrics can be derived from it.

Visualization of Different Error Types
(Figure: confusion matrix with the four cells TP, FN, FP, TN; rows = true label, columns = predicted label)

Accuracy and Classification Error Based on Confusion Matrix
§ Accuracy = (TP + TN) / (TP + TN + FP + FN)
§ Classification Error = 1 - Accuracy

Recall (True Positive Rate / Sensitivity / Prob. of Detection)
§ What fraction of all positive samples does the classifier correctly identify as positive?
Recall (TPR) = TP / (TP + FN) = 0.60 in this example
False Negative Rate (FNR) = 1 - TPR = 0.40

True Negative Rate (Specificity / Selectivity)
§ What fraction of all negative samples does the classifier correctly identify as negative?
Specificity (TNR) = TN / (TN + FP) = 0.98 in this example
False Positive Rate (FPR) = 1 - TNR = 0.02

Precision
§ What fraction of positive predictions is correct?
Precision = TP / (TP + FP) = 0.79 in this example

A Graphical Illustration of Precision and Recall

The Precision-Recall Tradeoff

High Precision, Lower Recall

Low Precision, High Recall

F-Score (Harmonic Mean of Precision and Recall)
§ F-Score (also called F1-Score): combines precision & recall into a single measure
  F = 2 * precision * recall / (precision + recall)
§ F_β-Score: generalizes the F-score, combining precision & recall into a single measure
  F_β = (1 + β²) * precision * recall / (β² * precision + recall)
β allows adjustment of the metric to control the emphasis on recall vs. precision
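As an illustrative sketch (the labels below are made up), these metrics can be computed with scikit-learn:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, fbeta_score

# Hypothetical true and predicted labels (1 = positive, -1 = negative).
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, -1, 1, -1, 1]

print(confusion_matrix(y_true, y_pred, labels=[1, -1]))  # rows = true label, columns = predicted
print("precision:", precision_score(y_true, y_pred))     # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))            # harmonic mean of precision and recall
print("F2:       ", fbeta_score(y_true, y_pred, beta=2)) # beta > 1 puts more emphasis on recall
```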

Tradeoff between Precision and Recall
§ Recall-oriented machine learning tasks
– Search and information extraction in legal discovery
– Tumor detection
– Often paired with a human expert to filter out false positives
§ Precision-oriented machine learning tasks
– Search engine ranking, query suggestion
– Many customer-facing tasks (users remember failures!)

Summary
§ Data Classification Pipeline and Case Studies
– Carry out data exploration to get an overall picture
– Data classification pipeline is more than just applying algorithms
§ Classification Algorithms: Perceptron Algorithm
– Learning a hyperplane in the feature space
– Learning as optimization
§ Classification Performance Evaluation
– Accuracy, Recall/Sensitivity, Specificity, Precision, F-Score
