Machine Learning 1, Course Outline
• Covered topics:
  – Bayes Decision Theory, Parameter Estimation
  – Component/Discriminant Analyses (PCA, Fisher)
  – Model Selection, Learning Theory
  – SVMs, Regression, Neural Nets, Boosting, Decision Trees
  – Clustering, Latent Variable Models
• Weekly homeworks
• Elective course (only for Machine Learning 1-X)
Machine Learning 1, Important Dates
Registering homework groups
• Requests need to be sent before Friday 6 November 2020 at 14:00.
First homework submission deadline
• Monday 9 November 2020 at 14:00.
Exam registration
• Opens in January 2021. To be able to register, you need to first pass the prerequisites (homeworks / elective) and then request an Übungsschein/Genehmigung.
Exams
• March 2021, two sessions (you can choose the session you prefer); exact exam dates still to be determined.
Textbook Reference
Duda et al., Pattern Classification, 2nd Edition (2000)
• The first four lectures will be based on this textbook.
• This week: Sections 2.1–2.6 and 2.9.
Example of a Machine Learning Classifier
Image source: Duda et al. (2000)
Information about the input (the fish on the conveyor belt) is first acquired through sensors (e.g. a scale, a camera).
The data is preprocessed (e.g. cropping, rescaling, normalization).
Then a finite number of features is extracted.
Finally, a classifier can be learned on these features.
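As a rough sketch of such a pipeline in code (illustrative only: the preprocessing steps, features and classifier weights below are invented, not the lecture's):

```python
import numpy as np

def preprocess(image):
    """Crop, rescale and normalize the raw sensor data."""
    image = image[10:-10, 10:-10]                      # crop the borders
    return (image - image.mean()) / (image.std() + 1e-8)

def extract_features(image):
    """Reduce the input to a small, fixed number of features,
    e.g. an apparent length and the average lightness."""
    length = (image > 0.5).sum(axis=0).max()           # crude length estimate
    lightness = image.mean()
    return np.array([length, lightness])

def classify(features, w, b):
    """A linear classifier learned on the extracted features."""
    return "salmon" if features @ w + b > 0 else "sea bass"

x = extract_features(preprocess(np.random.rand(100, 100)))
print(classify(x, w=np.array([0.02, -1.0]), b=0.5))
```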
Single-Feature Example (Length)
Two-Feature Example (Length + Lightness)
Bayesian Decision Theory
Goals:
• Merging empirical observations with prior knowledge about the task (e.g. prior class probabilities).
• Incorporating distributional assumptions (e.g. Gaussian-generated data, independence of features).
• Building optimal classifiers (in terms of classification accuracy or in terms of costs).
Bayesian Decision Theory
Notation:
• ω1, ω2, ...: the different classes (e.g. salmon, sea bass, ...).
• x ∈ R^d: vector of observations (e.g. x1 is the length and x2 is the lightness).

We now add probabilities:

• P(ωj): probability of being of class ωj.
• p(x): data density function (e.g. histogram).
• p(x|ωj): class-conditioned data density function (e.g. histogram).
• P(ωj|x): probability of being of a certain class after observing x.
Bayes Theorem
$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j) \cdot P(\omega_j)}{p(x)}$$
Image source: Duda et al. (2000)
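As a quick numerical illustration (with made-up values), suppose $P(\omega_1) = 0.7$, $P(\omega_2) = 0.3$, $p(x \mid \omega_1) = 0.2$ and $p(x \mid \omega_2) = 0.6$ for some observation x. Then
$$p(x) = 0.2 \cdot 0.7 + 0.6 \cdot 0.3 = 0.32, \qquad P(\omega_1 \mid x) = \tfrac{0.14}{0.32} \approx 0.44, \qquad P(\omega_2 \mid x) \approx 0.56,$$
i.e. the observation shifts the decision towards ω2 even though ω1 is a priori more likely.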
Optimally Accurate Classifier
$$\text{Decide } \begin{cases} \omega_1 & \text{if } P(\omega_1 \mid x) > P(\omega_2 \mid x) \\ \omega_2 & \text{else.} \end{cases}$$
Alternate formulations:
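Two standard equivalent forms (obtained by expanding the posteriors with Bayes' theorem and cancelling $p(x) > 0$): decide $\omega_1$ if
$$p(x \mid \omega_1)\, P(\omega_1) > p(x \mid \omega_2)\, P(\omega_2)$$
or, when all quantities are positive, if
$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)}.$$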
Multivariate Normal Distributions
$$p(x \mid \omega_j) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma_j)}} \exp\!\left(-\frac{1}{2}(x - \mu_j)^\top \Sigma_j^{-1}(x - \mu_j)\right)$$
• μj is the mean (center of the data distribution).
• Σj is the covariance matrix (elongation of the data distribution).
Image source: Duda et al. (2000)
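A minimal numpy sketch of this density (the class parameters are invented for illustration); it can be checked against scipy's multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density p(x | omega_j) as in the formula above."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Hypothetical class parameters (illustration only)
mu, Sigma = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, -0.2])

print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```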
Optimal Classifier for Gaussians (Σ1 = Σ2)
Note: $\log p(x \mid \omega_j) = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log\det(\Sigma) - \frac{1}{2}(x - \mu_j)^\top \Sigma^{-1}(x - \mu_j)$
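A sketch of why the boundary is linear in this case: the decision boundary is where $\log p(x \mid \omega_1) + \log P(\omega_1) = \log p(x \mid \omega_2) + \log P(\omega_2)$. With $\Sigma_1 = \Sigma_2 = \Sigma$, the terms $-\frac{d}{2}\log 2\pi$, $-\frac{1}{2}\log\det(\Sigma)$ and the quadratic term $-\frac{1}{2}x^\top\Sigma^{-1}x$ are identical on both sides and cancel, leaving
$$(\mu_1 - \mu_2)^\top \Sigma^{-1} x = \frac{1}{2}\left(\mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2\right) + \log\frac{P(\omega_2)}{P(\omega_1)},$$
which is linear in x.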
Optimal Classifier for Gaussians (Σ1 = Σ2)
Image source: Duda et al. (2000)
• Decision boundary is linear and oriented by the mean and covariance.
• Offset is controlled by the class prior probabilities.
Optimal Classifier for Gaussians (Σ1 ≠ Σ2)
• When the covariances Σ1 and Σ2 are not the same, the decision boundary is a quadric instead of linear. Quadrics include circles, ellipses, parabolas, hyperbolas, and degenerate forms.
Image source: Duda et al. (2000)
Binary Data Distributions
Assume that our data is now binary, x ∈ {0, 1}^d, with each dimension generated independently according to some parameters p_i and q_i:
$$P(x_i = 1 \mid \omega_1) = p_i \qquad P(x_i = 0 \mid \omega_1) = 1 - p_i$$
$$P(x_i = 1 \mid \omega_2) = q_i \qquad P(x_i = 0 \mid \omega_2) = 1 - q_i$$
The probability of the whole multivariate observation can be written as:
$$P(x \mid \omega_1) = \prod_{i=1}^{d} \big(p_i x_i + (1 - p_i)(1 - x_i)\big)$$
Question: How to express the decision boundary $P(\omega_1 \mid x) > P(\omega_2 \mid x)$?
Optimal Classifier for Binary Data (1st try)
Recall: $P(x \mid \omega_1) = \prod_{i=1}^{d} \big(p_i x_i + (1 - p_i)(1 - x_i)\big)$
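A note on why this first attempt is awkward: taking the log of this sum form gives
$$\log P(x \mid \omega_1) = \sum_{i=1}^{d} \log\big(p_i x_i + (1 - p_i)(1 - x_i)\big),$$
which does not simplify in an obvious way. The equivalent product form on the next slides, valid because each $x_i \in \{0, 1\}$, makes the log-likelihood linear in x.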
Optimal Classifier for Binary Data
Observation:
For binary data, the data likelihood
$$P(x \mid \omega_1) = \prod_{i=1}^{d} \big(p_i x_i + (1 - p_i)(1 - x_i)\big)$$
can be equivalently expressed as
$$P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i} \cdot (1 - p_i)^{1 - x_i}.$$
Optimal Classifier for Binary Data (2nd try)
Recall: $P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i} \cdot (1 - p_i)^{1 - x_i}$
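A sketch of where this form leads (with the analogous parameters $q_i$ for class ω2): taking logs in the decision rule $P(x \mid \omega_1) P(\omega_1) > P(x \mid \omega_2) P(\omega_2)$ gives
$$\sum_{i=1}^{d} x_i \log\frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_{i=1}^{d} \log\frac{1 - p_i}{1 - q_i} + \log\frac{P(\omega_1)}{P(\omega_2)} > 0,$$
i.e. a linear decision function $w^\top x + b > 0$ with $w_i = \log\frac{p_i (1 - q_i)}{q_i (1 - p_i)}$, provided the probabilities are strictly between 0 and 1.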
From Binary to Multi-Class
The two-class problem, ω = (ω1, ω2),
$$\text{decide } \begin{cases} \omega_1 & \text{if } P(\omega_1 \mid x) > P(\omega_2 \mid x) \\ \omega_2 & \text{else,} \end{cases}$$
can be generalized to multiple classes.
Let ω = (ω1, ω2, ..., ωC) be our C possible classes. The optimal decision now takes the form:
$$\text{decide } \omega_i \text{ if } \forall j \neq i: P(\omega_i \mid x) > P(\omega_j \mid x)$$
... or equivalently ...
$$\text{decide } \omega_i \text{ where } i = \arg\max_j P(\omega_j \mid x)$$
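A minimal sketch of this decision rule in code, assuming (hypothetically) Gaussian class-conditional densities and known priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

def decide(x, priors, means, covs):
    """Return i = argmax_j P(omega_j | x).

    Since p(x) is identical for all classes, it is enough to compare
    the (log) joint terms p(x | omega_j) * P(omega_j).
    """
    scores = [
        multivariate_normal(mean=m, cov=S).logpdf(x) + np.log(P)
        for P, m, S in zip(priors, means, covs)
    ]
    return int(np.argmax(scores))

# Hypothetical three-class example
priors = [0.5, 0.3, 0.2]
means = [np.zeros(2), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
covs = [np.eye(2)] * 3
print(decide(np.array([1.5, 0.2]), priors, means, covs))
```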
Example: Multi-Class Gaussian
Decision function:
$$\text{“decide } \omega_i\text{”} \quad \text{where} \quad i = \arg\max_j P(\omega_j \mid x)$$
Observation: The decision surface of a multi-class classifier can be quite complex and nonlinear.

Image source: Duda et al. (2000)
From Classification to Cost Minimization
Example: Suppose you would like to purchase a second-hand car. After observing the car, you assess that it has a defect with probability
$$P(\text{defect} \mid x) = 0.1 \qquad P(\text{no defect} \mid x) = 0.9$$
Concretely, the decision you need to take is not to classify whether the car has a defect or not, but whether to buy the car or not. For this, we need to evaluate the cost of each scenario, e.g.
$$\text{cost}(\text{buy} \mid \text{defect}) = 100.0 \qquad \text{cost}(\text{buy} \mid \text{no defect}) = -20.0$$
$$\text{cost}(\text{not buy} \mid \text{defect}) = 0.0 \qquad \text{cost}(\text{not buy} \mid \text{no defect}) = 0.0$$
and take the action with the lowest expected cost.
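Completing the example with the numbers above, the expected costs of the two actions are
$$R(\text{buy} \mid x) = 0.1 \cdot 100.0 + 0.9 \cdot (-20.0) = -8.0 \qquad R(\text{not buy} \mid x) = 0.1 \cdot 0.0 + 0.9 \cdot 0.0 = 0.0,$$
so the expected-cost-minimizing decision is to buy the car, even though a defect cannot be ruled out.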
From Classification to Cost Minimization
Classification Accuracy Maximization:
$$\text{“classify as } \omega_i\text{”} \quad \text{where} \quad i = \arg\max_j P(\omega_j \mid x)$$
Expected Cost Minimization: Let (αi)i be the set of actions. The expected cost of taking a certain action is given by
$$R(\alpha_j \mid x) = \sum_{k=1}^{C} \lambda(\alpha_j \mid \omega_k)\, P(\omega_k \mid x)$$
where $\lambda(\alpha_j \mid \omega_k)$ is the cost of action αj given the class ωk. We then decide to
$$\text{“take action } \alpha_i\text{”} \quad \text{where} \quad i = \arg\min_j R(\alpha_j \mid x)$$
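A minimal sketch of this rule in code, using the car example from the previous slide as the (placeholder) posteriors and cost matrix:

```python
import numpy as np

def best_action(posteriors, costs):
    """Pick the action minimizing the expected cost R(alpha_j | x).

    posteriors: shape (C,),  P(omega_k | x)
    costs:      shape (A, C), lambda(alpha_j | omega_k)
    """
    risks = costs @ posteriors            # R(alpha_j | x) for every action j
    return int(np.argmin(risks)), risks

# Actions = [buy, not buy], classes = [defect, no defect]
posteriors = np.array([0.1, 0.9])
costs = np.array([[100.0, -20.0],
                  [0.0, 0.0]])
print(best_action(posteriors, costs))     # -> (0, array([-8., 0.]))
```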
Classification Accuracy as a Special Case
Recall: $R(\alpha_j \mid x) = \sum_{k=1}^{C} \lambda(\alpha_j \mid \omega_k)\, P(\omega_k \mid x)$
Show that the problem of maximum accuracy classification is a special instance of expected cost minimization with a particular set of actions $(\alpha_j)_j$ and a particular cost function $\lambda(\alpha_j \mid \omega_k)$.
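One standard choice (a sketch, not necessarily the exact derivation intended here): let action αj mean “classify as ωj” and use the 0/1 cost $\lambda(\alpha_j \mid \omega_k) = 1 - \delta_{jk}$ (cost 0 for a correct classification, 1 for any error). Then
$$R(\alpha_j \mid x) = \sum_{k \neq j} P(\omega_k \mid x) = 1 - P(\omega_j \mid x),$$
so minimizing the expected cost over j is the same as maximizing $P(\omega_j \mid x)$.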
Analyzing the Two-Class Case
Recall: $R(\alpha_j \mid x) = \sum_{k=1}^{C} \lambda(\alpha_j \mid \omega_k)\, P(\omega_k \mid x)$

Show that in the two-action, two-class case, changing the cost function means shifting the decision boundary.

(Note: when the differences of λ's are negative, we need to stop the derivation before applying the log.)
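A sketch of the standard derivation: write $\lambda_{jk} = \lambda(\alpha_j \mid \omega_k)$. Deciding α1 whenever $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$ is equivalent to
$$(\lambda_{21} - \lambda_{11})\, p(x \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(x \mid \omega_2)\, P(\omega_2),$$
and, when both differences are positive, to the likelihood-ratio test
$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})\, P(\omega_2)}{(\lambda_{21} - \lambda_{11})\, P(\omega_1)}.$$
Changing the λ's therefore only moves the threshold, i.e. shifts the decision boundary; taking the log of both sides is only valid when the differences are positive, as noted above.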
Measuring Classification Error
So far, we have studied what the decision boundary should be in order to predict optimally.
However, in certain cases, it is also important to determine the expected error of the classifier (e.g. to determine whether the classifier is good enough for practical use).
The expected error is the probability that the data is of a different class than the one predicted:
$$P(\text{Err} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$$
For a maximally accurate classifier, this reduces to $P(\text{Err} \mid x) = \min\{P(\omega_1 \mid x), P(\omega_2 \mid x)\}$.
Measuring Classification Error
The expected error of this maximally accurate classifier is computed as the integral of its error probability over the distribution p(x).
$$P(\text{Err}) = \int P(\text{Err} \mid x)\, p(x)\, dx = \int \min\{P(\omega_1 \mid x), P(\omega_2 \mid x)\}\, p(x)\, dx$$
This is also known as the Bayes error rate.
Generally, this integral cannot be solved analytically, because of the min function. Error must instead be bounded or evaluated empirically.
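A minimal Monte Carlo sketch of the empirical route, assuming (for illustration only) two known univariate Gaussian class-conditionals with equal priors:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical setup: p(x|w1) = N(0,1), p(x|w2) = N(2,1), P(w1) = P(w2) = 0.5
priors = np.array([0.5, 0.5])
dists = [norm(0.0, 1.0), norm(2.0, 1.0)]

# Sample x from the mixture p(x), then average min_j P(omega_j | x)
n = 200_000
labels = rng.choice(2, size=n, p=priors)
x = np.where(labels == 0,
             dists[0].rvs(n, random_state=rng),
             dists[1].rvs(n, random_state=rng))

joint = np.stack([priors[j] * dists[j].pdf(x) for j in range(2)])  # p(x|wj) P(wj)
posterior = joint / joint.sum(axis=0)
print(posterior.min(axis=0).mean())   # ~0.159 here (the exact value is Phi(-1))
```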
Bounding the Error of a Classifier
Very basic bound
Observe for binary classification that $P(\omega_1 \mid x) = 1 - P(\omega_2 \mid x)$. The error can be bounded as:
$$P(\text{Err}) = \int \min\{P(\omega_1 \mid x), P(\omega_2 \mid x)\}\, p(x)\, dx \le \int 0.5\, p(x)\, dx = 0.5$$
The optimal classifier predicts the correct class at least 50% of the time.
Slightly Better Bound
Recall: $P(\omega_j \mid x) = \frac{p(x \mid \omega_j) \cdot P(\omega_j)}{p(x)}$
The error can be bounded as:
$$P(\text{Err}) = \int \min\{P(\omega_1 \mid x), P(\omega_2 \mid x)\}\, p(x)\, dx$$
$$= \int \min\{p(x \mid \omega_1)\, P(\omega_1),\, p(x \mid \omega_2)\, P(\omega_2)\}\, dx$$
$$\le \int \max\{p(x \mid \omega_1), p(x \mid \omega_2)\}\, \min\{P(\omega_1), P(\omega_2)\}\, dx$$
$$\le 2 \min\{P(\omega_1), P(\omega_2)\}$$
(The last step uses $\int \max\{p(x \mid \omega_1), p(x \mid \omega_2)\}\, dx \le \int p(x \mid \omega_1)\, dx + \int p(x \mid \omega_2)\, dx = 2$.)

Additional insight: The optimal classifier improves its accuracy when one class prior probability strongly dominates the other.
Summary
• Bayesian decision theory allows us to build optimal classifiers based on prior knowledge about the classes and the data distributions.
• For certain simple data distributions, optimal classifiers are linear.
• The framework is general enough to extend to multi-class scenarios and user-defined cost functions.
• The error of the optimal classifier is typically harder to compute than the actual decision function, but it is still possible to compute bounds on the error.