
COMP9517: Computer Vision
Pattern Recognition Part 2

Separability
• Separable classes
• if a discrimination hypersurface exists that partitions the feature space such that only objects from one class lie in each region, then the recognition task has separable classes
• Linearly separable
• if the discrimination hypersurfaces are hyperplanes, the classes are linearly separable

Linear Classifier
• Suppose we have a training set of N observations: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, with $\mathbf{x}_i \in \mathbb{R}^d$
• A binary classification problem can be modeled by learning a function $f$ from the data such that: $f(\mathbf{x}_i) = y_i$, with $y_i \in \{-1, +1\}$
• So in this approach, classification amounts to estimating $f$ from the training data

Linear Classifier
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/linear_classification.html
• A linear classifier has the form: $f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \ldots + w_d x_d + b = \mathbf{w}^T\mathbf{x} + b$
• This is equivalent to a line in 2D, a plane in 3D, and a hyperplane in nD
• We use the training data to learn $\mathbf{w}$ and $b$
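As a minimal sketch (not from the original slides; the weights and bias below are invented for illustration), this is how a learned $\mathbf{w}$ and $b$ would be applied to classify a feature vector:

```python
import numpy as np

# Hypothetical weights and bias, e.g. learned from training data (2D example)
w = np.array([0.8, -0.5])
b = 0.1

def linear_classify(x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = np.dot(w, x) + b          # f(x) = w^T x + b
    return 1 if score >= 0 else -1

print(linear_classify(np.array([1.0, 0.5])))   # +1 (score = 0.65)
print(linear_classify(np.array([-1.0, 2.0])))  # -1 (score = -1.7)
```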

Linear Classifier
• Which line is the best line?
• For the purpose of generalization, a line with a large margin is preferred
• A maximum margin solution is the most stable model under perturbation of the input

Support Vector Machines
• We are looking for a separating hyperplane which satisfies: $\mathbf{w}^T\mathbf{x} + b = 0$
• We know that this is equivalent to: $c\,(\mathbf{w}^T\mathbf{x} + b) = 0$ for any $c > 0$
• This means that we have freedom to choose the scale of $\mathbf{w}$
• If we choose the scale such that, for the closest points $\mathbf{x}_-$ and $\mathbf{x}_+$ on either side of the hyperplane, $\mathbf{w}^T\mathbf{x}_- + b = -1$ and $\mathbf{w}^T\mathbf{x}_+ + b = +1$
• Then the margin is $\frac{2}{\|\mathbf{w}\|}$ (the distance between the two hyperplanes, obtained by projecting $\mathbf{x}_+ - \mathbf{x}_-$ onto the unit normal $\mathbf{w}/\|\mathbf{w}\|$), which we want to maximize

Support Vector Machines
• Support Vector Machines (SVM) can be formulated as an optimization problem: maximize $\frac{2}{\|\mathbf{w}\|}$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$ for all $i$
• and this is equivalent to: minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$ for all $i$
• This is a quadratic optimization problem subject to linear constraints, which has a unique minimum and can be solved with the method of Lagrange multipliers
• This is called “hard margin SVM”, which does not allow any misclassification; if the classes are not fully separable, we have to use “soft margin SVM”, which allows some degree of misclassification

Soft Margin Support Vector Machines
• In hard margin SVM, we assume classes are linearly separable, but what if the separability assumption doesn’t hold?
• Introduce “slack” variables $\xi_i$ to allow misclassification of instances

Soft Margin Support Vector Machines
When classes were linearly separable, we had:
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$
But if some data violate this constraint, we relax it with a slack value per instance:
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$
So, the total violation:
$\text{total violation} = \sum_{i=1}^{N} \xi_i$
This is a measure of violation of the margin, and now we optimize:
$\arg\min_{\mathbf{w}} \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{N} \xi_i$ subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$

Soft Margin Support Vector Machines
• “Soft margin” SVMs handle noisy data
• Parameter $C$ bounds the influence of any one training instance on the decision boundary
• If $C$ goes to infinity, you go back to the hard margin SVM (which doesn’t tolerate errors)
• Still a quadratic optimization problem
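As a hedged illustration (not part of the original slides), scikit-learn’s `SVC` with a linear kernel solves this soft margin formulation, and its `C` parameter is the slack penalty above; a very large `C` approximates the hard margin case. The toy data is invented:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D training data (invented): two classes of points
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8], [3, 5]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])

# Small C -> wide margin, more slack allowed; large C -> approaches hard margin
for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C)
    clf.fit(X, y)
    print(f"C={C}: w={clf.coef_[0]}, b={clf.intercept_[0]:.3f}, "
          f"#support vectors={len(clf.support_vectors_)}")
```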

Nonlinear Support Vector Machines
• To generate non-linear decision boundaries, we can map the features into a new feature space where the classes are linearly separable, and then apply the SVM there
Image source: https://medium.com/analytics-vidhya/how-to-classify-non-linear-data-to-linear-data-bb2df1a6b781
• In SVM, feature mapping into a high-dimensional space can be done using the kernel trick, which helps with the complexity of the optimization problem
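A sketch of the kernel trick in practice, assuming scikit-learn’s `SVC` and invented ring-shaped data: switching the `kernel` parameter replaces any explicit feature mapping, letting the SVM separate a class that no hyperplane in the original 2D space could:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Invented non-linearly-separable data: class 1 inside a circle, class 0 outside
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# A linear kernel cannot separate this; the RBF kernel implicitly maps the
# features into a higher-dimensional space where the classes become separable
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```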

Support Vector Machines
• Pros:
• Very effective in high-dimensional spaces
• Effective when the number of features is larger than the training set size
• Among the best algorithms when classes are separable
• SVM algorithms can work very well when data is sparse (i.e., many feature values are zero)
• Cons:
• For larger datasets, training takes a long time
• Does not perform well for overlapping classes
• Hyperparameters need to be tuned for sufficient generalization

Multiclass Classification
• If there are more than two classes in our observations, we have to build a multiclass classifier
• Some methods may be directly used for multiclass classification:
• K-nearest neighbours
• Decision trees
• Bayesian techniques
• For those that cannot be directly applied to multiclass problems, we can transform the problem into binary classification by building multiple binary classifiers (see the sketch below)
• There are two techniques:
• One vs rest: builds one classifier for each class vs the rest, and assigns a test sample to the class that has the highest confidence score
• One vs one: builds one classifier for every pair of classes, and assigns a test sample to the class that has the highest number of predictions
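A sketch of both techniques, assuming scikit-learn’s `OneVsRestClassifier` and `OneVsOneClassifier` wrappers, with the iris dataset as a convenient 3-class stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One vs rest: one binary classifier per class (that class vs the other two)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One vs one: one binary classifier per pair of classes: 3*(3-1)/2 = 3
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print("OvR binary classifiers:", len(ovr.estimators_))  # 3
print("OvO binary classifiers:", len(ovo.estimators_))  # 3
```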

Evaluation of Error
• Error rate
• the error rate of a classification system measures how well the system solves the problem it was designed for
• Reject class
• generic class for objects that cannot be placed in any of the known classes
• Performance
• Performance is determined by both the errors and the rejections made
• Classifying all inputs into reject class means system makes no errors, but is useless!
• Classification error
• The classifier makes a classification error whenever it classifies an input object as class $C_i$ when the true class is $C_j$, with $i \ne j$ and $C_i \ne C_r$ (the reject class)

Evaluation of Error
• Empirical error rate
• the empirical error rate is the number of errors made on independent test data divided by the number of classifications attempted
• Empirical reject rate
• is the number of rejects on independent test data divided by the number of classifications attempted
• Independent test data
• are sample objects with true class (labels) known, including objects from the reject class, and that were not used in designing the feature extraction and classification algorithms
• Samples used for training and testing should be representative

False Alarms and False Dismissals
• For two-class problems, the errors have a special meaning and are not symmetric
• For example, in medical diagnosis, when a person has the disease versus does not have the disease:
• If the person does NOT have the disease, but the system incorrectly says she does, then the error is a false alarm/false positive
• On the other hand, if the person DOES have the disease, but the system incorrectly says she does NOT, then the error is a false dismissal or false negative
• Consequences and costs of the two errors are very different

False Alarms and False Dismissals
• There are bad consequences to both, but false negative is generally more catastrophic
• So, we generally try to bias the system to minimize false negatives, possibly at the cost of increasing the false positives
• The Receiver Operating Characteristic (ROC) curve relates the false alarm rate to the correct detection rate
• In order to increase correct detections, we may have to pay the cost of a higher number of false alarms

Receiver Operating Characteristic (ROC) Curve
• plots the correct detection rate versus the false alarm rate
• generally, false alarms go up with attempts to detect higher percentages of known objects
• the Area Under the ROC Curve (AUC) summarizes the curve in a single number
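As an illustrative sketch (labels and classifier scores invented), scikit-learn’s `roc_curve` sweeps the decision threshold to produce the (false alarm rate, correct detection rate) pairs, and `roc_auc_score` gives the AUC:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Invented ground truth and classifier scores (higher score = more positive)
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# Each threshold gives one (false alarm rate, correct detection rate) point
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("false alarm rates   :", fpr)
print("correct detect rates:", tpr)
print("AUC:", roc_auc_score(y_true, y_score))
```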

Confusion Matrix
• Matrix whose entry (i, j) records the number of times that an object truly of class i was classified as class j
• used to report results of classification experiments
• diagonal entries indicate the successes
• high off-diagonal numbers indicate confusion between classes

Confusion Matrix
• Table of Confusion
• For binary classification
                        Prediction Outcome
                        P                       N
Actual    P'            True Positive (TP)      False Negative (FN)
Value     N'            False Positive (FP)     True Negative (TN)

• Accuracy:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Precision versus Recall
• Precision/correctness
• is the number of relevant objects classified correctly divided by the total number of objects classified as relevant:
$\text{Precision} = \frac{TP}{TP + FP}$
• Recall/sensitivity/completeness
• is the number of relevant objects classified correctly divided by the total number of relevant objects:
$\text{Recall} = \frac{TP}{TP + FN}$
https://en.wikipedia.org/wiki/Precision_and_recall
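A sketch tying the last two slides together (true labels and predictions invented), using scikit-learn’s metrics to recover TP/FP/FN/TN and the derived scores:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Invented true labels and predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the table of confusion entries in
# the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
```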

More Terminology and Metrics

Regression
• Suppose that we have a training set of N observations: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, with $\mathbf{x}_i \in \mathbb{R}^d$
• Similar to classification, the regression problem is to estimate $f$ from the training data such that: $f(\mathbf{x}_i) = y_i$
• But here the output variable $y$ has a continuous value

Linear Regression
• In linear regression, we assume there is a linear relationship between the output and features:
$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_d x_d = \mathbf{x}^T \mathbf{W}$, with $\mathbf{x} = [1, x_1, x_2, \ldots, x_d]^T$ and $\mathbf{W} = [w_0, w_1, \ldots, w_d]^T$
• How to find the best line?
• The most popular estimation model is “least squares”

Least Squares Regression
• The idea is to minimize the residual sum of squares:
$RSS(\mathbf{W}) = \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2$
• How to find the best fit?
$\widehat{\mathbf{W}} = \arg\min_{\mathbf{W}} RSS(\mathbf{W})$
• It turns out that $RSS(\mathbf{W})$ is a quadratic function, so we can differentiate it with respect to $\mathbf{W}$

Least Squares Regression
• In matrix form: $RSS(\mathbf{W}) = (\mathbf{y} - X\mathbf{W})^T(\mathbf{y} - X\mathbf{W})$
• Differentiating with respect to $\mathbf{W}$ and setting the result to zero gives: $X^T(\mathbf{y} - X\mathbf{W}) = 0$
• If we assume that X has full rank, then $X^TX$ is positive definite, which means we have a convex function with a unique minimum, so:
$\widehat{\mathbf{W}} = (X^TX)^{-1}X^T\mathbf{y}$

Linear Regression: Example
• Assume that we have the length and width of some fishes and we want to estimate their weights from this information (features)
• Start with one feature ($x_1$), which is easier to visualize
Length (x1)   Width (x2)   Weight (y)
100           40           5
102           35           4.5
92            33           4
83            29           3.9
87            36           3.5
95            30           3.6
87            37           3.4
104           38           4.8
101           34           4.6
97            39           4.3
• Fit the model $\hat{y} = w_0 + w_1 x_1$

Linear Regression: Example
• Applying the closed-form solution $\widehat{\mathbf{W}} = (X^TX)^{-1}X^T\mathbf{y}$ to the single-feature data gives:
$RSS(\widehat{\mathbf{W}}) = \sum_{i=1}^{N}(y_i - f(\mathbf{x}_i))^2 = (\mathbf{y} - X\widehat{\mathbf{W}})^T(\mathbf{y} - X\widehat{\mathbf{W}}) = 0.9438$
• For two features ($x_1$, $x_2$) we repeat the same procedure; the only difference is in the design matrix $X$, which now contains a column for each feature
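A sketch of this example in NumPy, using the fish table from the previous slide and the closed-form solution derived above; the printed single-feature RSS can be compared against the slide’s value of 0.9438:

```python
import numpy as np

# Fish data from the table: length (x1), width (x2), weight (y)
data = np.array([[100, 40, 5.0], [102, 35, 4.5], [92, 33, 4.0],
                 [83, 29, 3.9], [87, 36, 3.5], [95, 30, 3.6],
                 [87, 37, 3.4], [104, 38, 4.8], [101, 34, 4.6],
                 [97, 39, 4.3]])
y = data[:, 2]

def fit_and_rss(features):
    # Design matrix with a leading column of ones for the intercept w0
    X = np.column_stack([np.ones(len(features)), features])
    W = np.linalg.solve(X.T @ X, X.T @ y)  # W = (X^T X)^{-1} X^T y
    residuals = y - X @ W
    return W, residuals @ residuals        # RSS = (y - XW)^T (y - XW)

W1, rss1 = fit_and_rss(data[:, :1])  # one feature: length only
W2, rss2 = fit_and_rss(data[:, :2])  # two features: length and width
print("One feature :", W1, "RSS =", rss1)
print("Two features:", W2, "RSS =", rss2)
```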

Regression Evaluation Metrics
• Root Mean Square Error (RMSE)
• It represents the standard deviation of the predicted values from the observed values:
$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$
• Mean Absolute Error (MAE)
• It represents the average of the absolute differences between the predicted values and the observed values:
$MAE = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$
RMSE penalizes big differences between predicted values and observed values more heavily. Smaller values of RMSE and MAE are more desirable.

Regression Evaluation Metrics
• R-Squared ($R^2$)
• Explains how well the selected feature(s) explain the output variable:
$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$
• One problem with R-squared is that adding any extra feature increases its value, even if the feature does not actually improve the model
• Adjusted R-Squared (Adjusted $R^2$)
• Explains how well the selected feature(s) explain the output variable, adjusted for the number of features:
$R^2_{adj} = 1 - \frac{(1 - R^2)(N - 1)}{N - d - 1}$
where $N$ is the number of samples and $d$ is the number of features. Larger values of R-squared and adjusted R-squared are more desirable.
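A minimal sketch computing all four metrics from the two slides above (the observed and predicted arrays are invented, and `d` is a hypothetical feature count):

```python
import numpy as np

# Invented observed and predicted values
y      = np.array([5.0, 4.5, 4.0, 3.9, 3.5])
y_pred = np.array([4.8, 4.6, 4.1, 3.7, 3.6])

N, d = len(y), 2  # d = number of features used by the (hypothetical) model

rmse   = np.sqrt(np.mean((y - y_pred) ** 2))
mae    = np.mean(np.abs(y - y_pred))
r2     = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)
r2_adj = 1 - (1 - r2) * (N - 1) / (N - d - 1)

print(f"RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f} adjusted R2={r2_adj:.4f}")
```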

Normalization
• Goal: to change the scale of numeric values to a common scale
• Commonly applied techniques:
• Z-score: re-scales the data (features) to zero mean and unit variance; it works well for data that is normally distributed:
$x' = \frac{x - \mu}{\sigma}$
• Min-max normalization: re-scales the range of the data to [0, 1], so the minimum value is mapped to 0 and the maximum value to 1:
$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
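Both techniques in a few lines of NumPy (the feature values are invented):

```python
import numpy as np

x = np.array([100.0, 102.0, 92.0, 83.0, 87.0, 95.0])

# Z-score: zero mean, unit variance
z = (x - x.mean()) / x.std()

# Min-max: map the range of the data onto [0, 1]
mm = (x - x.min()) / (x.max() - x.min())

print("z-score:", np.round(z, 3))
print("min-max:", np.round(mm, 3))
```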

Cross Validation
• We want to make sure that when we are done with training our model, it will work well on out-of-sample data. To evaluate the performance of the model in terms of under-fitting, over-fitting, and generalizability, we need to test it on unseen data for which we know the ground truth/labels. Cross validation (CV) is a technique to test the effectiveness of our model.
• Train-test split: in this approach, we randomly split the available data into training and test sets (usually 80:20). We train the model on the training set and then evaluate it on the test set.
• K-fold cross validation: we split the data into K subsets or folds, and at each iteration we keep one fold out for validation and use the rest for training. We repeat this procedure K times, until all K folds have been used once as the test set. The performance of the model is the average of the performance on the K test sets.
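A sketch of both strategies, assuming scikit-learn’s model selection utilities, with the iris dataset and a linear SVM as stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC(kernel="linear")

# Train-test split (80:20)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("hold-out accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross validation: each fold is used once as the test set,
# and the reported performance is the average over the folds
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("fold accuracies:", scores, "mean:", scores.mean())
```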

References and Acknowledgements
• Shapiro and Stockman, Chapter 4
• Duda, Hart and Stork, Chapters 1, 2.1
• Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Chapters 2 and 12
• More references
– Sergios Theodoridis, Konstantinos Koutroumbas, Pattern Recognition, 2009
– Ian H. Witten, Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2005
• Some diagrams are extracted from the above resources