IN2064 Machine Learning Lecture 5
Lecture 5: Linear Classification
Data Analytics and Machine Learning Technical University of Munich
Winter term 2020/2021
Notation:
• s — scalar (lowercase, not bold)
• s — vector (lowercase, bold)
• S — matrix (uppercase, bold)
• ŷ — predicted class label
• y — actual class label
• I(a) — indicator function; I(a) = 1 if a is true, else 0
There is no special symbol for vectors or matrices augmented by the bias term w_0. Assume it is always included, as was done with linear regression.
Introduction to linear classification
Classification vs. regression
Regression
Output y is continuous (i.e. y ∈ R).
For example, predict the price of a house given its area.
Classification
Output y belongs to one of C predetermined classes (i.e. y ∈ {1, …, C}). For example, determine whether the picture shows a cat or a dog.
Classification problem

Given
• observations¹ X = {x_1, x_2, …, x_N}, x_i ∈ R^D
• class labels y = {y_1, y_2, …, y_N}, y_i ∈ C
• set of possible classes C = {1, …, C}

Find
• a function f : R^D → C that maps observations x_i to class labels y_i:
  y_i = f(x_i) for i ∈ {1, …, N}

¹ Like before, we represent the samples as a data matrix X ∈ R^{N×D}.
Zero-one loss
How do we measure the quality of a prediction ŷ := f(X)?²
The zero-one loss counts the number of misclassified samples:

l_01(y, ŷ) = Σ_{i=1}^N I(ŷ_i ≠ y_i)

How do we choose a good f(·)?

² For brevity, we denote the prediction f(x_i) as ŷ_i, i.e. ŷ is the vector of predictions for the entire X.
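As a quick illustration, the zero-one loss could be computed with NumPy as follows (a minimal sketch; the function name is ours, not from the slides):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Count the misclassified samples: l_01(y, y_hat) = sum_i I(y_hat_i != y_i)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return int(np.sum(y_pred != y_true))

# Example: 2 out of 5 predictions are wrong.
print(zero_one_loss([0, 1, 1, 0, 1], [0, 1, 0, 1, 1]))  # -> 2
```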
Hyperplane as a decision boundary
For a 2 class problem (C = {0, 1}) we can try to separate points from the two classes by a hyperplane.
A hyperplane can be defined by a normal vector w and an offset w_0. The corresponding decision rule is

w^T x + w_0   > 0   if x is on the normal's side
              = 0   if x is on the plane
              < 0   else

Hyperplanes are computationally very convenient: easy to evaluate.

A data set D = {(x_i, y_i)} is linearly separable if there exists a hyperplane for which all x_i with y_i = 0 are on one side and all x_i with y_i = 1 are on the other side.

[Figure: geometry of the decision boundary y(x) = w^T x + w_0 = 0 in two dimensions; the signed distance of a point x from the plane is r = y(x)/‖w‖. Cf. Bishop, Figure 4.1.]
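As a small illustration (a sketch with made-up toy data and weights, not from the slides), we can check on which side of a hyperplane each point lies and hence whether a given (w, w_0) separates a data set:

```python
import numpy as np

def hyperplane_side(X, w, w0):
    """Return +1 where w^T x + w0 > 0 (normal's side), 0 on the plane, -1 otherwise."""
    return np.sign(X @ w + w0)

def is_separated_by(X, y, w, w0):
    """Check whether the hyperplane (w, w0) puts all y=1 points on the positive
    side and all y=0 points on the negative side."""
    s = X @ w + w0
    return bool(np.all(s[y == 1] > 0) and np.all(s[y == 0] < 0))

# Toy data in R^2 and a candidate hyperplane.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
print(is_separated_by(X, y, w=np.array([1.0, 1.0]), w0=0.0))  # True
```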
Perceptron
The perceptron algorithm is one of the oldest methods for binary classification.
Decision rule
ŷ = f(w^T x + w_0)

where f is the step function defined as

f(t) = 1 if t > 0, and f(t) = 0 otherwise.
Learning rule for the perceptron
Initialize parameters to any value, e.g., a zero vector: w, w_0 ← 0.

Repeat: for each misclassified sample x_i in the training set, update

w ← w + x_i  and  w_0 ← w_0 + 1   if y_i = 1,
w ← w − x_i  and  w_0 ← w_0 − 1   if y_i = 0,

until all samples are classified correctly.

If a hyperplane discriminating between the two classes exists, this method converges to such a (w, w_0) in a finite number of steps.³
3However, there is no way to determine the number of required iterations in advance.
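A minimal sketch of this learning rule in NumPy (assuming labels in {0, 1}; the max_epochs safeguard is our addition, since the loop only terminates on its own if the data is linearly separable):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=1000):
    """Perceptron learning rule for labels y in {0, 1}.

    Repeatedly sweeps over the data and updates (w, w0) on every
    misclassified sample until no mistakes remain (or max_epochs is hit).
    """
    N, D = X.shape
    w = np.zeros(D)
    w0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            pred = 1 if (w @ xi + w0) > 0 else 0
            if pred != yi:
                sign = 1 if yi == 1 else -1   # +x_i for class 1, -x_i for class 0
                w += sign * xi
                w0 += sign
                mistakes += 1
        if mistakes == 0:                      # all samples classified correctly
            break
    return w, w0
```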
Does this scale up to multiple classes?
One-versus-rest classifier
Each hyperplane H_i makes a decision: class C_i ↔ not class C_i.
Does this scale up to multiple classes?
One-versus-one classifier
Hyperplane H_ij makes a decision for each pair of classes: class C_i ↔ class C_j.
Use majority vote to classify.
Does this scale up to multiple classes?
Multiclass discriminant
Define C linear functions of the form

f_c(x) = w_c^T x + w_{0c}

with the decision rule

ŷ = arg max_c f_c(x)

That is, assign x to the class c which produces the highest f_c(x), dividing the domain into convex decision regions R_i.
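A sketch of this decision rule, with one weight vector w_c per row of W (the array shapes and names are illustrative):

```python
import numpy as np

def predict_multiclass(X, W, w0):
    """X: (N, D) data, W: (C, D) one row per class, w0: (C,) offsets.
    Returns the index of the class with the highest score f_c(x) = w_c^T x + w_0c."""
    scores = X @ W.T + w0          # shape (N, C)
    return np.argmax(scores, axis=1)
```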
What if the classes are not linearly separable?
Basis functions
Like in the Linear Regression lecture last week, we can apply a nonlinear transformation φ : R^D → R^M. We need to choose a φ that maps samples to a space where they are linearly separable.
Here, φ(x) = (θ, r) = (angle(x), ∥x∥2).
Here we talk about fixed basis functions. A deeper discussion of this method will take place in Kernels. Adaptive basis functions will be covered in Deep Learning.
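For the polar-coordinate example above, the feature map could look like the following sketch (the function name phi_polar is ours; whether this makes the classes linearly separable of course depends on the data):

```python
import numpy as np

def phi_polar(X):
    """Map 2D points x to polar features (theta, r) = (angle(x), ||x||_2)."""
    theta = np.arctan2(X[:, 1], X[:, 0])   # angle of each point
    r = np.linalg.norm(X, axis=1)          # Euclidean norm (radius)
    return np.column_stack([theta, r])
```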
Limitations of hard-decision based classifiers
• No measure of uncertainty
• Can't handle noisy data
• Poor generalization
• Difficult to optimize
What are the alternatives?
Probabilistic models for classification
Solution: model the distribution of the class label y given the data x:

p(y = c | x) = p(x | y = c) · p(y = c) / p(x)
Two types of models:

Generative
• Model the joint distribution p(x, y = c) = p(x | y = c) · p(y = c)

Discriminative
• Directly model the posterior p(y = c | x)

Given p(y | x) we can make the prediction ŷ based on our problem. A popular choice is the mode: ŷ = arg max_{c ∈ C} p(y = c | x).
Probabilistic generative models for linear classification
Generative model
The idea is to obtain the class posterior using Bayes' theorem:

p(y = c | x) ∝ p(x | y = c) · p(y = c)    (1)

where p(x | y = c) is the class conditional and p(y = c) is the class prior.
The model consists of
• class prior – a priori probability of a point belonging to a class c
• class conditional – probability of generating a point x, given that it belongs to class c
Applying a generative model
Applying a generative model typically works as follows:
• Choose a parametric model for the class conditional p(x | y = c, ψ)
and the class prior p(y = c | θ).
• Estimate the parameters of our model {ψ, θ} from the data D (e.g., using maximum likelihood – obtain estimates {ψˆ, θˆ}). This step is called learning.
Once fitted, we can perform inference – classify a new x using Bayes' rule:

p(y = c | x, ψ̂, θ̂) ∝ p(x | y = c, ψ̂) · p(y = c | θ̂)    (2)
Additionally, we can generate new data – hence the name (see the sketch after this list):
• sample a class label y_new ∼ p(y | θ̂)
• sample a feature vector x_new ∼ p(x | y = y_new, ψ̂)
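A sketch of this generation step under the model choices used later in this lecture (categorical prior, Gaussian class conditionals with a shared covariance); the parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (already estimated) parameters: 2 classes in R^2.
theta = np.array([0.4, 0.6])                        # class prior p(y)
mus = np.array([[0.0, 0.0], [3.0, 3.0]])            # class means mu_c
Sigma = np.eye(2)                                   # shared covariance

# Generate new data: first sample a label, then a feature vector.
y_new = rng.choice(len(theta), p=theta)             # y_new ~ Categorical(theta)
x_new = rng.multivariate_normal(mus[y_new], Sigma)  # x_new ~ N(mu_{y_new}, Sigma)
```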
How do we choose the class prior p(y = c)? The label y can take one of C discrete values.
=⇒ Use categorical distribution!
y ∼ Categorical(θ)
The parameter θ ∈ R^C specifies the probability of each class:

p(y = c) = θ_c    or equivalently    p(y) = ∏_{c=1}^C θ_c^{I(y=c)}

and is subject to the constraints 0 ≤ θ_c ≤ 1 and Σ_{c=1}^C θ_c = 1.

The maximum likelihood estimate for θ given the data D = {(x_i, y_i)} is

θ_c^MLE = (1/N) Σ_{i=1}^N I(y_i = c)
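In code, this MLE is just the vector of relative class frequencies (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def estimate_class_prior(y, num_classes):
    """theta_c = (1/N) * sum_i I(y_i = c): the relative frequency of each class."""
    counts = np.bincount(y, minlength=num_classes)
    return counts / len(y)

print(estimate_class_prior(np.array([0, 0, 1, 2, 1, 1]), 3))  # [0.333 0.5 0.167]
```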
How do we choose the class conditionals p(x | y = c)?
The feature vector x ∈ RD is continuous.
=⇒ Use a multivariate normal for each class!
p(x | y = c) = N(x | μ_c, Σ)    (3)
             = 1 / ((2π)^{D/2} |Σ|^{1/2}) · exp( −½ (x − μ_c)^T Σ^{−1} (x − μ_c) )    (4)
We use the same Σ for each class, as estimating all Σc’s behaves badly
numerically, unless we have lots of data.
The MLE estimates for {μ1, …, μC , Σ} will be derived in the tutorial (or see Bishop 4.2.2).
Posterior distribution
Now that we have chosen p(x | y) and p(y), and have estimated their parameters from the training data, how do we perform classification?
Let’s assume for simplicity that we have two classes C = {0, 1}.
p(y = 1 | x) = p(x | y = 1) p(y = 1) / [ p(x | y = 1) p(y = 1) + p(x | y = 0) p(y = 0) ]    (5)
             = 1 / (1 + exp(−a)) =: σ(a)    (6)

where we defined

a = log [ p(x | y = 1) p(y = 1) / ( p(x | y = 0) p(y = 0) ) ]    (7)

and σ is the sigmoid function.
To avoid clutter, we implicitly condition the distributions on their respective parameters (θ, μc, Σ)
Linear discriminant analysis (LDA)
Let’s look at how this function looks for Gaussian class-conditionals with the same covariance Σ
a = log [ p(x | y = 1) p(y = 1) / ( p(x | y = 0) p(y = 0) ) ]    (8)
  = −½ (x − μ_1)^T Σ^{−1} (x − μ_1) + log p(y = 1)    (9)
    + ½ (x − μ_0)^T Σ^{−1} (x − μ_0) − log p(y = 0)    (10)
  = w^T x + w_0    (11)

where we define

w = Σ^{−1} (μ_1 − μ_0)    (12)
w_0 = −½ μ_1^T Σ^{−1} μ_1 + ½ μ_0^T Σ^{−1} μ_0 + log [ p(y = 1) / p(y = 0) ]    (13)

(The quadratic terms x^T Σ^{−1} x cancel because both classes share the same covariance Σ, which is why a is linear in x.)
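A sketch of binary LDA assembled from these formulas: estimate the priors, the class means and a pooled (shared) covariance, then form (w, w_0) via (12) and (13). The pooled-covariance estimator is a standard choice rather than something spelled out on this slide:

```python
import numpy as np

def fit_lda_binary(X, y):
    """Fit Gaussian class conditionals with a shared covariance and return (w, w0)."""
    pi1 = np.mean(y == 1)
    pi0 = 1.0 - pi1
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Pooled (shared) covariance estimate: center each sample by its class mean.
    X_centered = np.where((y == 1)[:, None], X - mu1, X - mu0)
    Sigma = X_centered.T @ X_centered / len(X)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)                                   # eq. (12)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu0 @ Sigma_inv @ mu0
          + np.log(pi1 / pi0))                                    # eq. (13)
    return w, w0

def posterior_y1(X, w, w0):
    """p(y = 1 | x) = sigmoid(w^T x + w0)."""
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))
```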
LDA for C = 2 classes
This means that the posterior distribution is a sigmoid of a linear function of x:

p(y = 1 | x) = 1 / (1 + exp(−(w^T x + w_0)))    (14)
             = σ(w^T x + w_0)    (15)

or equivalently

y | x ∼ Bernoulli(σ(w^T x + w_0))    (16)
This is how this function looks for D = 2:

[Figure: class conditionals p(x | y = 1) (red) and p(x | y = 0) (blue), and the resulting posterior p(y = 1 | x).]
LDA for C > 2 classes
Using Bayes’ theorem, the posterior for the C > 2 case is
p(y = c | x) = p(x | y = c) p(y = c) / Σ_{c'=1}^C p(x | y = c') p(y = c')    (17)

Working out the math, we get

p(y = c | x) = exp(w_c^T x + w_{c0}) / Σ_{c'=1}^C exp(w_{c'}^T x + w_{c'0})    (18)

where

w_c = Σ^{−1} μ_c    (19)
w_{c0} = −½ μ_c^T Σ^{−1} μ_c + log p(y = c)    (20)
Softmax function
On the previous slide we made use of the softmax function. Softmax σ is a generalization of sigmoid to multiple dimensions
σ : R^K → △^{K−1}    (21)

where

△^{K−1} = { x ∈ R^K | Σ_{k=1}^K x_k = 1 and x_k ≥ 0, k = 1, …, K }    (22)

is the standard probability simplex.

Softmax is defined as

σ(x)_i = exp(x_i) / Σ_{k=1}^K exp(x_k)    (23)
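A sketch of the softmax in NumPy; the max-subtraction is a standard numerical-stability trick not shown on the slide, and it does not change the result because softmax is invariant to adding a constant to all inputs:

```python
import numpy as np

def softmax(x):
    """sigma(x)_i = exp(x_i) / sum_k exp(x_k), computed in a numerically stable way."""
    x = np.asarray(x, dtype=float)
    z = x - np.max(x)            # shift so the largest exponent is 0
    e = np.exp(z)
    return e / np.sum(e)

print(softmax([1.0, 2.0, 3.0]))  # sums to 1, largest weight on the last entry
```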
Class conditionals: Variant II (Naive Bayes)
Naive Bayes: Assume that the d features of a sample
x = (x1, x2, . . . , xd) are conditionally independent given the class, i.e.
p(x_1, x_2, …, x_d | y = c) = ∏_{i=1}^d p(x_i | y = c)

In the case of continuous data where the likelihood is assumed to be a normal distribution, this corresponds to diagonal covariance matrices, i.e.

p(x | y = c) = N(x | μ_c, Σ_c) with Σ_c diagonal.

Important: We use a different covariance matrix Σ_c for each class c! LDA: all class conditionals share the same covariance matrix Σ.
Naive Bayes
Let’s assume for simplicity that we have two classes C = {0, 1}. Like in LDA we need to compute
a = log [ p(x | y = 1) p(y = 1) / ( p(x | y = 0) p(y = 0) ) ]
  = ½ x^T [Σ_0^{−1} − Σ_1^{−1}] x + x^T [Σ_1^{−1} μ_1 − Σ_0^{−1} μ_0]
    − ½ μ_1^T Σ_1^{−1} μ_1 + ½ μ_0^T Σ_0^{−1} μ_0 + log(π_1/π_0) + ½ log(|Σ_0|/|Σ_1|)
  = x^T W_2 x + w_1^T x + w_0

where we define (with π_c = p(y = c))

W_2 = ½ [Σ_0^{−1} − Σ_1^{−1}]
w_1 = Σ_1^{−1} μ_1 − Σ_0^{−1} μ_0
w_0 = −½ μ_1^T Σ_1^{−1} μ_1 + ½ μ_0^T Σ_0^{−1} μ_0 + log(π_1/π_0) + ½ log(|Σ_0|/|Σ_1|)
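A sketch that evaluates a = x^T W_2 x + w_1^T x + w_0 from given per-class Gaussian parameters (for Naive Bayes, Σ_0 and Σ_1 would be diagonal, though the formula holds for arbitrary covariances); the function name is ours:

```python
import numpy as np

def quadratic_discriminant(x, mu0, mu1, Sigma0, Sigma1, pi0, pi1):
    """Log posterior odds a = log[p(x|y=1)p(y=1) / (p(x|y=0)p(y=0))] for
    Gaussian class conditionals with class-specific covariances."""
    S0_inv = np.linalg.inv(Sigma0)
    S1_inv = np.linalg.inv(Sigma1)
    W2 = 0.5 * (S0_inv - S1_inv)
    w1 = S1_inv @ mu1 - S0_inv @ mu0
    w0 = (-0.5 * mu1 @ S1_inv @ mu1
          + 0.5 * mu0 @ S0_inv @ mu0
          + np.log(pi1 / pi0)
          + 0.5 * np.log(np.linalg.det(Sigma0) / np.linalg.det(Sigma1)))
    return x @ W2 @ x + w1 @ x + w0   # predict class 1 if a > 0

# For Naive Bayes, pass diagonal covariances, e.g. np.diag(variances_c).
```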
Naive Bayes: Decision Boundary
[Figure: Naive Bayes decision boundary on 2D data.]

Naive Bayes results in a quadratic decision boundary.
Recap: LDA leads to a linear decision boundary.
Naive Bayes: Advantage
Class conditional for Naive Bayes:
p(x_1, x_2, …, x_d | y = c) = ∏_{i=1}^d p(x_i | y = c)

Advantage: The independence of the features makes it easy to handle different/mixed data types.

Simply choose a suitable (univariate) distribution for each feature: x_1 – Gaussian, x_2 – Categorical, …
Probabilistic discriminative models for linear classification
Probabilistic discriminative model
An alternative approach to generative modeling is to model the posterior distribution p(y | x) directly. Such models are called discriminative.
We saw in the previous section that a generative approach with Gaussian class-conditionals with a shared covariance matrix Σ (LDA) leads to the posterior distribution
p(y = 1 | x) = σ(x^T w + w_0),    (24)
p(y = 0 | x) = 1 − σ(x^T w + w_0)    (25)
where w, w0 depend on the parameters of class-conditionals μ0, μ1, Σ. Why not just let w and w0 be free parameters and choose them directly?
Logistic regression
We model the posterior distribution as
y | x ∼ Bernoulli(σ(w^T x + w_0))    (26)

where

σ(a) = 1 / (1 + exp(−a))    (27)
and w, w0 are the free model parameters.
This model is called logistic regression.
Absorbing the bias term
Like in the previous lecture, we again absorb the bias term by overloading the notation and defining
w^T x := w_0 + w_1 x_1 + … + w_D x_D    (28)

which is equivalent to defining x_0 = 1.
Likelihood of logistic regression
Learning logistic regression comes down to finding a "good" setting of parameters w that "explains" the training set D = {(x_i, y_i)}_{i=1}^N.
Assuming that all samples (xi,yi) are drawn i.i.d., we can write the likelihood as
p(y | w, X) = ∏_{i=1}^N p(y_i | x_i, w)    (29)
            = ∏_{i=1}^N p(y = 1 | x_i, w)^{y_i} · (1 − p(y = 1 | x_i, w))^{1−y_i}    (30)
            = ∏_{i=1}^N σ(w^T x_i)^{y_i} (1 − σ(w^T x_i))^{1−y_i}    (31)

In (30), the first factor equals 1 if y_i = 0 and the second factor equals 1 if y_i = 1.
Negative log-likelihood
Similarly to the linear regression case, we can define an error function, the negative log-likelihood:
E(w) = −log p(y | w, X)    (32)
     = −Σ_{i=1}^N [ y_i log σ(w^T x_i) + (1 − y_i) log(1 − σ(w^T x_i)) ]    (33)
This loss function is called binary cross entropy.
Finding the maximum likelihood estimate for w is equivalent to solving
w* = arg min_w E(w)    (34)
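A sketch of E(w) and its gradient, assuming the bias has been absorbed into w as in (28). The gradient formula ∇E(w) = Σ_i (σ(w^T x_i) − y_i) x_i is standard but is not derived on these slides, so treat it as an aside:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_and_gradient(w, X, y):
    """Binary cross-entropy E(w) and its gradient for logistic regression.

    X is assumed to already contain a leading column of ones (absorbed bias)."""
    p = sigmoid(X @ w)                    # p(y = 1 | x_i, w)
    eps = 1e-12                           # avoid log(0)
    E = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y)                  # standard gradient of E(w)
    return E, grad
```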
Solving the minimization problem
There is no closed-form solution for logistic regression. This means we cannot express the optimal w* directly using standard mathematical operations, such as multiplication, matrix inversion, etc.
However, there is still hope! We can use optimization to numerically solve our problem. We will cover this in the next lecture.
For now, just assume that we can find w∗.
Logistic regression + weights regularization
As we already know well, maximum likelihood estimation may often lead to overfitting. Just like in the case of linear regression, we can control this by penalizing large weights.
E(w) = −log p(y | w, X) + λ‖w‖_q    (35)

Like before, for q = 2 this corresponds to MAP estimation with a Gaussian prior on w.
Again, there is no closed form solution available.
Multiclass logistic regression
For binary classification we used the sigmoid function to "squeeze" the unnormalized probability w^T x into the range (0, 1).
The same can be done for multiple classes using the softmax function.
p(y = c | x) = exp(w_c^T x) / Σ_{c'} exp(w_{c'}^T x)

How does this relate to multiclass LDA?
Loss for multiclass logistic regression
The negative log-likelihood for multiclass LR can be written as
E(w) = −log p(Y | w, X)    (36)
     = −Σ_{i=1}^N Σ_{c=1}^C y_ic log p(y_i = c | x_i, w)    (37)
     = −Σ_{i=1}^N Σ_{c=1}^C y_ic log [ exp(w_c^T x_i) / Σ_{c'} exp(w_{c'}^T x_i) ]    (38)

and is called cross entropy.

Here we use one-hot encoding: the vector of categorical variables y ∈ C^N is encoded as a binary matrix Y ∈ {0, 1}^{N×C}, where

y_ic = 1 if sample i belongs to class c, and y_ic = 0 else    (39)
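A sketch of this cross-entropy loss for one-hot labels Y; the log-sum-exp shift is a standard stability trick not discussed on the slide but it computes the same quantity:

```python
import numpy as np

def cross_entropy(W, X, Y):
    """E(W) = -sum_i sum_c Y_ic * log softmax(W x_i)_c.

    X: (N, D) inputs (bias absorbed), W: (C, D) weights, Y: (N, C) one-hot labels."""
    logits = X @ W.T                                   # (N, C) scores w_c^T x_i
    m = logits.max(axis=1, keepdims=True)              # shift for numerical stability
    log_norm = m + np.log(np.sum(np.exp(logits - m), axis=1, keepdims=True))
    log_probs = logits - log_norm                      # log p(y_i = c | x_i, W)
    return -np.sum(Y * log_probs)
```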
Generative vs. discriminative models
• In general, discriminative models achieve better performance when it comes to pure classification tasks.
• While generative models work reasonably well when their assumptions hold, they are quite fragile when these assumptions are violated.
• Generative modeling for high-dimensional / strongly correlated data like images or graphs is still an open research challenge.
• Nevertheless, generative models provide the added benefits of better handling missing data, detecting outliers, generating new data and being more appropriate in the semi-supervised setting.
Reading material
Main reading
• “Pattern Recognition and Machine Learning” by Bishop [ch. 4.1.1, 4.1.2, 4.1.7, 4.2, 4.3.0–4.3.4]
Slides are based on an older version by G. Jensen and C. Osendorfer. Some figures are from Bishop’s “Pattern Recognition and Machine Learning”.