COMP9417 – Machine Learning Tutorial: Classification
Weekly Problem Set: Please submit questions 2c, 3a and 3b on Moodle by 12pm Tuesday 8th March, 2022. Please only submit these requested questions and no others.
Question 1 (Bayes Rule)
Assume that the probability of a certain disease is 0.01. The probability of testing positive given that a person is infected with the disease is 0.95, and the probability of testing positive given that the person is not infected with the disease is 0.05.
(a) Calculate the probability of testing positive.
Let T denote the event that the test returns a positive result, let D denote the event that an individual has the disease, and let D̄ denote the complement of D. We wish to find P(T). It follows by the law of total probability that
P(T) = P(T|D)P(D) + P(T|D̄)P(D̄) = 0.95 × 0.01 + 0.05 × (1 − 0.01) = 0.059.
In other words, the probability that a randomly chosen individual tests positive is 5.9%.
(b) Calculate the probability of being infected with the disease, given that the test is positive.
We want to compute the probability P(D|T). From Bayes' rule, we have

P(D|T) = P(D ∩ T) / P(T) = P(T|D)P(D) / P(T) = (0.95 × 0.01) / 0.059 ≈ 0.16,
so the probability that someone has the disease given that the test is positive is only about 16%. This result may seem surprising, but it is due to the fact that the disease is rare, occurring in only 1% of the population. In other words, our prior information about the disease keeps the probability of having the disease quite low, even after a positive test.
(c) Now assume that you test the individual a second time, and the test comes back positive (so two tests, two positives). Assume that, conditional on having the disease, the outcomes of the two tests are independent. What is the probability that the individual has the disease? (Note that conditional independence in this case means that P(TT|D) = P(T|D)P(T|D), and not P(TT) = P(T)P(T).) You may also assume that the test outcomes are conditionally independent given not having the disease.
We now want to compute the probability P (D|T T ). First note that
P(TT) = P(TT|D)P(D) + P(TT|D̄)P(D̄) = P(T|D)^2 P(D) + P(T|D̄)^2 P(D̄)
= (0.95)^2 × 0.01 + (0.05)^2 × (1 − 0.01) = 0.0115.
P(D|TT) = P(TT|D)P(D) / P(TT) = ((0.95)^2 × 0.01) / 0.0115 ≈ 0.785,

so the probability of having the disease given two positive tests is about 78.5%.
Now we see that our data (the two positives) is starting to contribute more and more evidence, and so our prior belief about the disease contributes less to the probability.
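As a quick sanity check on the arithmetic above, here is a minimal Python sketch (my own code, not part of the original solution) that computes the posterior after one and after two conditionally independent positive tests; the variable names are my own.

```python
# Sanity check for Question 1: posterior probability of disease after
# one and two (conditionally independent) positive test results.

p_d = 0.01          # prior P(D)
p_pos_d = 0.95      # P(T | D)
p_pos_not_d = 0.05  # P(T | not D)

def posterior(num_positives):
    """P(D | that many positive tests), assuming the tests are
    conditionally independent given the disease status."""
    like_d = p_pos_d ** num_positives          # P(T...T | D)
    like_not_d = p_pos_not_d ** num_positives  # P(T...T | not D)
    evidence = like_d * p_d + like_not_d * (1 - p_d)  # P(T...T)
    return like_d * p_d / evidence

print(posterior(1))  # ~0.161, matching part (b)
print(posterior(2))  # ~0.785, matching part (c)
```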
Question 2 (Lecture Review)
In this question, we will review some important ideas from the lecture.
(a) What is probabilistic classification? How does it differ from non-probabilistic classification methods?
In a classification problem, we are given a set of classes, C = {c1, …, cK}, and a set of examples D = {(xi, ci) : i = 1, …, n}, where xi = (xi1, xi2, …, xip)^T is a vector of features for the i-th subject. The goal of classification is, given a new datum, x⋆, to predict the ‘best’ class, c⋆. The definition of ‘best’ varies across methods; for example, under kNN (which is not a probabilistic classification method), ‘best’ means the majority class among the k nearest neighbours of x⋆. Under the probabilistic approach, ‘best’ is taken to mean ‘most probable’. In order to talk about probabilities, we need to talk about probability distributions. In particular, we are most interested in the conditional probability distribution of a class given the data:
p(ck |x⋆ ).
We would like to compute this probability for each of the K classes, and return our prediction as

c⋆ = arg max_k p(ck|x⋆).
There are a couple of approaches we could take to learning this conditional distribution. One approach, called the discriminative approach, is to learn the distribution p(ck|x) directly – this is what Logistic regression does for example. An alternative approach that we will focus on here would be the generative approach. To understand this second approach, recall Bayes rule:
p(ck|x) = p(x|ck) p(ck) / p(x),
which tells us that instead of focusing on p(ck|x), we could instead model p(x|ck) and p(ck), for each k – this is called a generative approach. p(x|ck) is termed the class conditional distribution, and p(ck) is termed the class prior distribution. Note that we do not need to model p(x) since it can be computed from knowledge of the class conditional and prior distributions, i.e.
p(x) = ∑_{k=1}^{K} p(x|ck) p(ck).
Now, what we need to do under a generative approach is to choose the form of the class conditional distribution (e.g. continuous, multinomial, multivariate Bernoulli, etc.) and class prior distribution (e.g. discrete uniform, categorical, etc.). The prior is usually taken to be a discrete distribution since the number of classes is finite, whilst the class conditional distribution can be either continuous or discrete. One example of this approach is Gaussian (linear) discriminant analysis.
(b) What is the Naive Bayes assumption and why do we need it?
In the previous section we saw that in order to do generative classification, we must learn a class conditional distribution for each of the K classes in our data. Now, if our observations live in a high-dimensional space (the number of features p is very large), then learning these distributions can be very difficult. For example, if p = 1000, and we assume that x|ck ∼ N(μk, Σk), then we would have to learn a 1000-dimensional Gaussian for each class. Learning high dimensional distributions can be very tricky, and we often need a large amount of data (relative to p) to be able to do it well. Naive Bayes gives us a way out of this problem. Before
describing it, let’s go back and rewrite the quantity we are trying to estimate:
p(x|ck)p(ck) = p(x,ck)
= p(x1,x2,…,xp,ck)
= p(x1|x2, …, xp, ck) p(x2, x3, …, xp, ck)
= p(x1|x2, …, xp, ck) p(x2|x3, …, xp, ck) p(x3, x4, …, xp, ck)
= p(x1|x2, …, xp, ck) p(x2|x3, …, xp, ck) p(x3|x4, …, xp, ck) × ⋯ × p(xp|ck) p(ck).
These computations are valid for any probability distribution, and we have so far made no assumptions. Naive Bayes introduces the assumption that the features are conditionally independent given the class; this is written as

xi ⊥ xj | ck   for all j ≠ i.
Now, let’s interpret this assumption intuitively. Let’s assume that we are trying to classify whether a patient has or doesn’t have diabetes, and I tell you that a particular patient, x, in our dataset belongs to the class diabetes. Let’s also assume that x1 describes the blood pressure of the patient, and x2 describes the weight of the patient. Now, the Naive Bayes assumption tells us that if we know the patient is diabetic, then also knowing that the patient has high blood pressure tells us nothing about their weight. Realistically, we expect that for patients with diabetes, there exists a positive correlation between blood pressure and weight. Under Naive Bayes, however, we ignore this correlation for the sake of mathematical tractability. Mathematically, the Naive Bayes assumption means that p(xi|xj, ck) = p(xi|ck) for all j ≠ i. Using this, we can rewrite:
p(x|ck) p(ck) = p(x, ck)
= p(x1|x2, …, xp, ck) p(x2|x3, …, xp, ck) p(x3|x4, …, xp, ck) × ⋯ × p(xp|ck) p(ck)
= p(x1|ck) p(x2|ck) p(x3|ck) × ⋯ × p(xp|ck) p(ck)
= p(ck) ∏_{i=1}^{p} p(xi|ck).
This means that instead of estimating one p-dimensional distribution per class, we now just have to estimate p different 1-dimensional distributions for each of the K classes – since estimating 1-dimensional distributions is much more straightforward, this is a much easier task!
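To make the dimensionality argument concrete, the following small Python sketch (my own illustration, not from the original tutorial) counts the parameters needed per class to specify a joint distribution over p binary features, with and without the Naive Bayes assumption.

```python
# Parameters per class for a distribution over p binary features:
# the full joint needs 2^p - 1 probabilities, while Naive Bayes
# only needs one Bernoulli parameter per feature.

def full_joint_params(p):
    return 2 ** p - 1

def naive_bayes_params(p):
    return p

for p in [3, 10, 20]:
    print(p, full_joint_params(p), naive_bayes_params(p))
# e.g. for p = 20: 1,048,575 parameters vs just 20 under Naive Bayes.
```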
(c) Consider the problem from lectures of classifying emails as spam or ham, with training data summarised below: Each row represents an email, and each email is a combination of words taken from the set {a, b, c, d, e}. We treat the words d, e as stop words – these are words that are not useful for classification purposes; for example, the word ‘the’ is too common to be useful for classifying documents as spam or ham. We therefore define our vocabulary as V = {a, b, c}. Note that in this case we have two classes, so K = 2, and we will assume a uniform prior, that is:
p(c+) = p(c−) = 1/2,
e1: b d e b b …
e2: b c e b b …
e3: a d a d e …
e4: b a d b e …
e5: a b a b a …
e6: a c a c a …
e7: e a e d a …
e8: d e d e d
(remaining words from these emails: d d e a e e d a b b a e c a e e a)
where c+ = spam, c− = ham. Review the multivariate Bernoulli Naive Bayes set-up and classify the test example: assume we get a new email that we want to classify: e⋆ = abbdebb
Under the multivariate Bernoulli NB set-up, we encode the first email (e1) as x1 = (x1a = 0, x1b = 1, x1c = 0), so that each feature is binary and represents whether a word is present or not in a given email. Carrying this out for all emails in our dataset gives us:
     xia  xib  xic
e1    0    1    0
e2    0    1    1
e3    1    0    0
e4    1    1    0
e5    1    1    0
e6    1    0    1
e7    1    0    0
e8    0    0    0
So, under this model, we are choosing to ignore the frequency of words in the email and just consider whether a word appears or not in an email. This is equivalent to modelling the class conditional distribution as
p(x|ck) = ∏_{j∈V} p(xj|ck),
where xj |ck ∼ Bernoulli(pkj ). In other words:
p(xj|ck) = (pkj)^{xj} (1 − pkj)^{1−xj}.
Note that pkj is the probability that a document of class k contains the word j. So, under this approach, we must estimate the following parameters:
p+a, p+b, p+c, p−a, p−b, p−c.
We will not go into detail here, but under the hood we are using maximum likelihood estimation to estimate these probabilities. It turns out that maximum likelihood gives us an intuitive estimate:
pkj = (number of documents of class k that contain the word j) / (number of documents in class k),
so, based on our data, this gives us:
p+a = 2/4,  p+b = 3/4,  p+c = 1/4,  p−a = 3/4,  p−b = 1/4,  p−c = 1/4.
Now, for e⋆ = abbdebb, we have the encoding x⋆ = (1, 1, 0), and we classify according to c⋆ = arg max_k p(ck) p(x⋆|ck).
First, for k = +, we have:

p(c+) p(x⋆|c+) = p(c+) ∏_{j∈V} p(xj = x⋆j|c+)    (NB)
= p(c+) × p(xa = x⋆a|c+) × p(xb = x⋆b|c+) × p(xc = x⋆c|c+)
= p(c+) × p(xa = 1|c+) × p(xb = 1|c+) × p(xc = 0|c+)
= p(c+) × p+a × p+b × (1 − p+c)
= 1/2 × 2/4 × 3/4 × (1 − 1/4)
= 9/64,

where (NB) is to signify that we are making the Naive Bayes assumption. A similar computation gives us:

p(c−) p(x⋆|c−) = p(c−) × p−a × p−b × (1 − p−c)
= 1/2 × 3/4 × 1/4 × (1 − 1/4)
= 9/128.

Therefore,

c⋆ = arg max_{k∈{+,−}} p(ck) p(x⋆|ck) = arg max {9/64, 9/128} = c+,

so e⋆ is classified as spam.
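The calculation above can be replicated in a few lines of Python. The following is a hedged sketch (my own code, not part of the submitted solution) of the multivariate Bernoulli Naive Bayes computation, assuming, as the parameter estimates above suggest, that e1–e4 are the spam emails and e5–e8 the ham emails.

```python
import numpy as np

# Binary (presence/absence) encoding of the emails over V = {a, b, c}.
# Rows e1-e4 taken as spam (+), rows e5-e8 as ham (-).
spam = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0]])
ham  = np.array([[1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 0, 0]])

# Unsmoothed maximum likelihood estimates: fraction of documents in each
# class containing each word.
p_plus = spam.mean(axis=0)   # [2/4, 3/4, 1/4]
p_minus = ham.mean(axis=0)   # [3/4, 1/4, 1/4]

def bernoulli_score(x, p_word, prior=0.5):
    """prior * prod_j p_j^{x_j} * (1 - p_j)^{1 - x_j}."""
    return prior * np.prod(p_word ** x * (1 - p_word) ** (1 - x))

x_star = np.array([1, 1, 0])                    # encoding of e* = a b b d e b b
score_plus = bernoulli_score(x_star, p_plus)    # 9/64  ~ 0.1406
score_minus = bernoulli_score(x_star, p_minus)  # 9/128 ~ 0.0703
print(score_plus, score_minus)
print('spam' if score_plus >= score_minus else 'ham')  # -> spam
```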
(d) Next, review smoothing for the multivariate Bernoulli case. Why do we need smoothing? What happens to our previous classification under the smoothed multivariate Bernoulli model?
In the previous section, we saw how to make predictions under the Naive Bayes Multivariate Bernoulli model. Note that for each prediction, we have to multiply various probabilities with each other. These probabilities are of the form pkj or (1 − pkj). Since we estimate these probabilities based on occurrences in the data, this can be problematic if we have not seen particular observations before. For example, assume that we never observe the word a in any spam emails, so that:
     xia  xib  xic
e1    0    1    0
e2    0    1    1
e3    0    0    0
e4    0    1    0
e5    1    1    0
e6    1    0    1
e7    1    0    0
e8    0    0    0
Now, our estimate of p+a becomes 0. Therefore, any new data point, x⋆, which has x⋆a = 1, regardless of the values of the other features, gets p(x⋆|+) = 0. Further note that if p+a = 1, which means every spam email has the word a, we get similar behaviour, since any new email that does not have the word a gets probability of spam equal to zero, as (1 − p+a) = (1 − 1) = 0. This is not ideal behaviour, since we allow one feature to override the entire model, and in the case of document classification, the model is completely unable to deal with new words not seen before in our vocabulary. Therefore, a modification is often made to Naive Bayes in which we smooth our probability estimates. This amounts to altering our estimating equation slightly to
pkj = (number of documents of class k that contain the word j + 1) / (number of documents in class k + number of possible values of X).
Note that in the Bernoulli model, the number of possible values of X is 2, since we can only have X = 0 or X = 1.
This modification ensures that p(x|ck) is never equal to zero, even if the number of documents of class k that contain the word j is actually zero. So, our smoothed estimates now would be:
p+a = (2+1)/(4+2) = 3/6,   p+b = (3+1)/(4+2) = 4/6,   p+c = (1+1)/(4+2) = 2/6,
p−a = (3+1)/(4+2) = 4/6,   p−b = (1+1)/(4+2) = 2/6,   p−c = (1+1)/(4+2) = 2/6.
Prediction is then done identically as before, except with these new estimates.
Note that smoothing can be thought of in a different way. We want our probability estimates to never be zero, which is equivalent to having observed every possible combination of words and classes. In other words, we need to observe documents in each class that contain each word in our vocabulary, as well as documents in each class that do not contain each word in our vocabulary. We can therefore introduce ‘pseudo-documents’ that ensure we have seen every combination of outcomes at least once. This amounts to adding ep1, ep2, ep3, ep4 as follows
      xia  xib  xic
e1     0    1    0
e2     0    1    1
e3     0    0    0
e4     0    1    0
ep1    1    1    1
ep2    0    0    0
e5     1    1    0
e6     1    0    1
e7     1    0    0
e8     0    0    0
ep3    1    1    1
ep4    0    0    0
If this was our dataset, then none of our probability estimates based on the original estimation procedure would be zero.
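Continuing the sketch from part (c), one possible way to compute the Laplace-smoothed Bernoulli estimates (again my own illustrative code, using the document counts from the original training data) is:

```python
import numpy as np

# Document counts from the original (unmodified) training data:
# among the 4 spam emails, a appears in 2, b in 3, c in 1;
# among the 4 ham emails,  a appears in 3, b in 1, c in 1.
doc_counts_plus = np.array([2, 3, 1])
doc_counts_minus = np.array([3, 1, 1])
n_docs_per_class = 4
n_values = 2  # a Bernoulli feature can take the values 0 or 1

p_plus_smooth = (doc_counts_plus + 1) / (n_docs_per_class + n_values)
p_minus_smooth = (doc_counts_minus + 1) / (n_docs_per_class + n_values)

print(p_plus_smooth)   # [3/6, 4/6, 2/6]
print(p_minus_smooth)  # [4/6, 2/6, 2/6]
```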
(e) Redo the previous analysis for the Multinomial Naive Bayes model without smoothing. Use the following test email: e⋆ = abbdebbcc
Consider a k-sided die, with the probability of each side being θi, so that ∑_{i=1}^{k} θi = 1. Now, if we throw the die n times (the n throws are independent) and record the frequency of each face of the die as a vector X = (X1, X2, …, Xk), where Xi = number of times face i came up, then we say that X follows a multinomial distribution with parameters θ1, …, θk and n. We summarise this as:
X ∼ Multinomial(n, θ1, …, θk),    ∑_{i=1}^{k} θi = 1,  θi > 0 for all i.

The probability mass function of the multinomial is:

P(X = (x1, x2, …, xk)) = ( n! / (x1! × x2! × ⋯ × xk!) ) × θ1^{x1} × ⋯ × θk^{xk} = ( n! / ∏_{i=1}^{k} xi! ) ∏_{i=1}^{k} θi^{xi}.
Now, under this model, we choose to encode emails as count vectors, where the features now represent the number of occurrences of a word, rather than whether or not it was present.
     xia  xib  xic
e1    0    3    0
e2    0    3    3
e3    3    0    0
e4    2    3    0
e5    4    3    0
e6    4    0    3
e7    3    0    0
e8    0    0    0
We now model the class conditional distribution as a multinomial, which implicitly uses the Naive Bayes assumption since the multinomial distribution assumes independence across trials. We write θjk = p(xj|ck) to denote the probability that a randomly chosen word in a document of class k is the word j. Note the subtle difference to the Bernoulli model. Previously, the probabilities were to do with whether a particular word appeared at all in a document of a given class. The probabilities here are to do with how likely a randomly chosen word in the class is to be a particular word. Now, we need to estimate the following parameters for our model:
θa+, θb+, θc+, θa−, θb−, θc−.
We once more omit the details of how to derive the estimates of these parameters, and as one
might expect, the estimates are:
θjk = (number of times word j appears in class k) / (number of words that appear in class k),

so, based on our data, this gives us:
θa+ = 5/17,  θb+ = 9/17,  θc+ = 3/17,
θa− = 11/17,  θb− = 3/17,  θc− = 3/17.
The class conditional distribution is therefore

p(x|ck) = ( n! / ∏_{j∈V} xj! ) ∏_{j∈V} (θjk)^{xj},   where n = ∑_{j∈V} xj.
Now, given x⋆ = (1, 4, 2), we can compute

p(c+) p(x⋆|c+) = p(c+) × 7!/(1! × 4! × 2!) × (θa+)^1 × (θb+)^4 × (θc+)^2
= 1/2 × 7!/(1! × 4! × 2!) × (5/17)^1 × (9/17)^4 × (3/17)^2,

and similarly for p(c−) p(x⋆|c−) = p(c−) × 7!/(1! × 4! × 2!) × (θa−)^1 × (θb−)^4 × (θc−)^2.
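As with the Bernoulli case, the multinomial computation is easy to check numerically. The sketch below is my own code (with the same assumption that e1–e4 are the spam emails); it estimates the unsmoothed multinomial parameters and scores e⋆ = abbdebbcc.

```python
import numpy as np
from math import factorial

# Word-count encoding of the emails over V = {a, b, c}
# (rows e1-e4 taken as spam, e5-e8 as ham).
spam_counts = np.array([[0, 3, 0], [0, 3, 3], [3, 0, 0], [2, 3, 0]])
ham_counts  = np.array([[4, 3, 0], [4, 0, 3], [3, 0, 0], [0, 0, 0]])

# Unsmoothed estimates: word frequencies within each class.
theta_plus = spam_counts.sum(axis=0) / spam_counts.sum()   # [5/17, 9/17, 3/17]
theta_minus = ham_counts.sum(axis=0) / ham_counts.sum()    # [11/17, 3/17, 3/17]

def multinomial_score(x, theta, prior=0.5):
    """prior * n!/(x_a! x_b! x_c!) * prod_j theta_j^{x_j}."""
    n = int(x.sum())
    coef = factorial(n) / np.prod([factorial(int(v)) for v in x])
    return prior * coef * np.prod(theta ** x)

x_star = np.array([1, 4, 2])   # e* = a b b d e b b c c -> (a, b, c) counts
print(multinomial_score(x_star, theta_plus))   # ~0.0378
print(multinomial_score(x_star, theta_minus))  # ~0.0010, so e* is classified as spam
```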
(f) Repeat the analysis for the smoothed Multinomial Naive Bayes model.
In our predictions here, just as in the multivariate Bernoulli case, we must multiply various probabilities. Therefore, this model suffers the same potential problems when probabilities take the value zero. The difference here is that this only occurs when the frequency of a word in a particular class is zero. To get around this, we apply smoothing once more, which involves adding 1 to each of the numerators of our estimates. Note that for each class k, this means adding 1 to the numerator |V| times, and so we add |V| to the denominator to account for the |V| extra (pseudo-)word occurrences that we are artificially adding. Our estimates therefore become
θjk = (number of times word j appears in class k + 1) / (number of words that appear in class k + |V|).
Now, our estimates for the spam problem become
θa+ = (5+1)/(17+3) = 6/20,   θb+ = (9+1)/(17+3) = 10/20,   θc+ = (3+1)/(17+3) = 4/20,
θa− = (11+1)/(17+3) = 12/20,   θb− = (3+1)/(17+3) = 4/20,   θc− = (3+1)/(17+3) = 4/20.
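The smoothed estimates can be obtained in the same way; a minimal sketch (my own code, reusing the count totals from part (e)):

```python
import numpy as np

# Total word counts per class from the training data, in the order (a, b, c).
word_counts_plus = np.array([5, 9, 3])    # 17 words in spam
word_counts_minus = np.array([11, 3, 3])  # 17 words in ham
vocab_size = 3                            # |V| = |{a, b, c}|

theta_plus_smooth = (word_counts_plus + 1) / (word_counts_plus.sum() + vocab_size)
theta_minus_smooth = (word_counts_minus + 1) / (word_counts_minus.sum() + vocab_size)

print(theta_plus_smooth)   # [6/20, 10/20, 4/20]
print(theta_minus_smooth)  # [12/20, 4/20, 4/20]
```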
Question 3. Binary Logistic Regression, two perspectives
Recall from previous weeks that we can view least squares regression as a purely optimisation based problem (minimising MSE), or as a statistical problem (using MLE). We now discuss two perspectives of the Binary Logistic Regression problem. In this problem, we are given a dataset D = {(xi, yi)}_{i=1}^{n} where the xi’s represent the feature vectors, just as in linear regression, but the yi’s are now binary. The goal is to model our output as a probability that a particular data point belongs to one of two classes. We will denote this predicted probability by

P(y = 1|x) = p(x),

and we model it as

pˆ(x) = σ(wˆT x),    σ(z) = 1 / (1 + e^{−z}),
where wˆ is our estimated weight vector. We can then construct a classifier by assigning the class that has the largest probability, i.e.:
yˆ = arg max_{k=0,1} P(yˆ = k|x) = 1 if σ(wˆT x) ≥ 0.5, and 0 otherwise,
note: do not confuse the function σ(z) with the parameter σ which typically denotes the standard
deviation.
(a) What is the role of the logistic sigmoid function σ() in the logistic regression formulation? Why are
we not able to simply use linear regression here? (a plot of σ(z) may be helpful here).
The problem with just modelling pˆ(x) = wˆT x is that wˆT x is an unbounded quantity, which means that wˆT x can be any number on the real line. This doesn’t make sense since we wish to
model pˆ as a probability, and probabilities belong to the interval [0, 1]. The role of the logistic sigmoid, also referred to as a squashing function, is to squash the raw score wˆT x into the desired interval.
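Since the question suggests a plot of σ(z), here is a small sketch (my own code, assuming matplotlib is available) that plots the sigmoid and shows how it maps unbounded scores into (0, 1):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-8, 8, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, linestyle='--')   # decision threshold used above
plt.xlabel('z = w^T x')
plt.ylabel('sigma(z)')
plt.title('Logistic sigmoid')
plt.show()

# Unbounded raw scores become valid probabilities:
print(sigmoid(np.array([-100.0, -1.0, 0.0, 1.0, 100.0])))
```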
(b) We first consider the statistical view of logistic regression. Recall in the statistical view of linear regression, we assumed that y|x ∼ N(xT β∗, σ2). Here, we are working with binary valued random variables and so we assume that
y|x ∼ Bernoulli(p∗), p∗ = σ(xT w∗)
where p∗ = σ(xT w∗) is the true unknown probability of a response belonging to class 1, and we assume this is controlled by some true weight vector w∗. Write down the log-likelihood of the data D (as a function of w), and further, write down the MLE objective (but do not try to solve it).
The assumption implies that

P(y|x) = p^y (1 − p)^{1−y},

so we can write down the log-likelihood as

ln L(w) = ln P(y1, …, yn | x1, …, xn)
= ln ∏_{i=1}^{n} P(yi|xi)
= ∑_{i=1}^{n} ln P(yi|xi)
= ∑_{i=1}^{n} ln( pi^{yi} (1 − pi)^{1−yi} )
= ∑_{i=1}^{n} [ yi ln pi + (1 − yi) ln(1 − pi) ].

At this point, we recall that

pi = σ(w^T xi) = 1 / (1 + e^{−w^T xi}),

so that

ln L(w) = ∑_{i=1}^{n} [ yi ln σ(w^T xi) + (1 − yi) ln(1 − σ(w^T xi)) ]
= ∑_{i=1}^{n} [ yi ln( σ(w^T xi) / (1 − σ(w^T xi)) ) + ln(1 − σ(w^T xi)) ].

The MLE optimisation is then

wˆMLE = arg max_{w ∈ R^p} ln L(w).
Note that we cannot solve this problem by hand as in the linear regression case (you should try this for yourself). This is an example in which we rely solely on numerical methods to find a solution.
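Since the MLE has no closed form, in practice a numerical optimiser is used. The sketch below (my own illustration on synthetic data, not part of the tutorial answer) maximises the log-likelihood by simple gradient ascent, using the standard gradient ∇ ln L(w) = ∑_i (yi − σ(w^T xi)) xi.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only).
n, p = 200, 2
X = rng.normal(size=(n, p))
w_true = np.array([2.0, -1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    p_hat = sigmoid(X @ w)
    return np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Gradient ascent on ln L(w); the gradient is X^T (y - sigma(Xw)).
w = np.zeros(p)
lr = 0.01
for _ in range(5000):
    w += lr * X.T @ (y - sigmoid(X @ w))

print(w)                        # roughly recovers the direction of w_true
print(log_likelihood(w, X, y))
```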
(c) An alternative approach to the logistic regression problem is to view it purely from the optimisa- tion perspective. This requires us to pick a loss function and solve for the corresponding minimizer. Write down the MSE objective for logistic regression and discuss whether you think this loss is ap- propriate.
The MSE objective is simply
arg min_w (1/n) ∥y − σ(Xw)∥^2,

where y = [y1, …, yn]^T and σ(Xw) is the element-wise application of σ to the vector Xw, or we could equivalently write:

arg min_w (1/n) ∑_{i=1}^{n} (yi − σ(w^T xi))^2.
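To make the objective concrete, here is a short self-contained sketch (my own code, not part of the original answer) that evaluates both the MSE objective above and the negative log-likelihood from part (b) for a given weight vector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_objective(w, X, y):
    """(1/n) * ||y - sigma(Xw)||^2, the loss written above."""
    return np.mean((y - sigmoid(X @ w)) ** 2)

def negative_log_likelihood(w, X, y):
    """The MLE objective from part (b), negated so that both are minimised."""
    p_hat = sigmoid(X @ w)
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Tiny illustrative dataset.
X = np.array([[0.5], [1.5], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([1.0])

print(mse_objective(w, X, y))
print(negative_log_likelihood(w, X, y))
```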