School of Computing and Information Systems The University of Melbourne
COMP90049 Introduction to Machine Learning (Semester 1, 2022) Week 4: Sample Solutions
1. What is optimisation? What is a “loss function”?
In the context of Machine Learning, optimisation means finding the optimal parameters of the model that give us the most accurate results (predictions).
To find the best possible results, optimisation usually involves minimising something undesirable (such as the error) or maximising something desirable (such as the number of correct answers). Again, in the context of Machine Learning, most optimisation problems are described in terms of cost (i.e., error). We want to minimise undesirable outcomes (errors). To do so, we define a function that best describes our undesirable outcomes for each model. This function is called a cost function or a loss function.
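As a minimal sketch of these ideas, the code below defines a mean squared error loss for a simple linear model y = w·x and minimises it with gradient descent. The toy data, learning rate, and variable names are made up for illustration and are not part of the original question.

```python
import numpy as np

# Hypothetical toy data: inputs x and targets y (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])  # roughly y = 2x

def mse_loss(w):
    """Mean squared error (the loss function) of the model y_hat = w * x."""
    y_hat = w * x
    return np.mean((y - y_hat) ** 2)

# Optimisation: minimise the loss with simple gradient descent.
w = 0.0
learning_rate = 0.01
for _ in range(500):
    grad = np.mean(-2 * x * (y - w * x))  # d(MSE)/dw
    w -= learning_rate * grad

print(f"optimal w = {w:.3f}, loss = {mse_loss(w):.4f}")
```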
2. Given the following dataset, build a Naïve Bayes model for the given training instances.
A Naïve Bayes model is a probabilistic classification model. All we need to build a Naive Bayes model is to calculate the right probabilities (prior and conditional).
For this dataset, our class (i.e., the label, or the variable we are trying to predict) is PLAY. So, we need the probability of each label (the prior probabilities):
P(Play = Y) = 1/2    P(Play = N) = 1/2
We also need to identify all the conditional probabilities between the class labels (PLAY) and all the other attribute values, such as s, o, r (for Outlook) or h, m, c (for Temp), and so on:
P(Outl = s | N) = 2/3    P(Outl = o | N) = 0      P(Outl = r | N) = 1/3
P(Outl = s | Y) = 0      P(Outl = o | Y) = 1/3    P(Outl = r | Y) = 2/3
P(Temp = h | N) = 2/3    P(Temp = m | N) = 0      P(Temp = c | N) = 1/3
P(Temp = h | Y) = 1/3    P(Temp = m | Y) = 1/3    P(Temp = c | Y) = 1/3
P(Humi = n | N) = 2/3    P(Humi = h | N) = 1/3
P(Humi = n | Y) = 1/3    P(Humi = h | Y) = 2/3
P(Wind = T | N) = 2/3    P(Wind = F | N) = 1/3
P(Wind = T | Y) = 0      P(Wind = F | Y) = 1
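As a sketch of how these probabilities could be computed by counting (the function and variable names are mine, and the training data argument is a placeholder, since the full training table is not reproduced here), assuming categorical attributes stored as dictionaries:

```python
from collections import Counter, defaultdict

def train_naive_bayes(instances, labels):
    """Estimate prior and conditional probabilities by counting.

    instances: list of dicts mapping attribute name -> value
    labels:    list of class labels, one per instance
    """
    n = len(labels)
    priors = {c: count / n for c, count in Counter(labels).items()}

    # cond_counts[class][attribute][value] = frequency in the training data
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for inst, c in zip(instances, labels):
        for attr, value in inst.items():
            cond_counts[c][attr][value] += 1

    # Convert counts into conditional probabilities P(attr = value | class).
    cond_probs = {
        c: {attr: {v: cnt / sum(vals.values()) for v, cnt in vals.items()}
            for attr, vals in attrs.items()}
        for c, attrs in cond_counts.items()
    }
    return priors, cond_probs
```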
3. Using the Naïve Bayes model that you developed in question 2, classify the given test instances.
(i). No smoothing.
For instance G, we have the following:
𝑃(𝑁)× 𝑃(𝑂𝑢𝑡𝑙=𝑜|𝑁)𝑃(𝑇𝑒𝑚𝑝=𝑚|𝑁)𝑃(𝐻𝑢𝑚𝑖=𝑛|𝑁)𝑃(𝑊𝑖𝑛𝑑=𝑇|𝑁)
= 1/2 × 0 × 0 × 2/3 × 2/3 = 0
𝑃(𝑌)× 𝑃(𝑂𝑢𝑡𝑙=𝑜|𝑌)𝑃(𝑇𝑒𝑚𝑝=𝑚|𝑌)𝑃(𝐻𝑢𝑚𝑖=𝑛|𝑌)𝑃(𝑊𝑖𝑛𝑑=𝑇|𝑌)
= 1/2 × 1/3 × 1/3 × 1/3 × 0 = 0
To find the label we need to compare the results for the two tested labels (Y and N) and find the one that has a higher likelihood.
ŷ = argmax_{y ∈ {Y, N}} P(y | T = G)
However, based on these calculations we find that both values are 0! So, our model is unable to predict any label for test instance G.
The fact is that as long as there is a single 0 among the probabilities, none of the other probabilities in the product really matter.
For H, we first observe that the attribute values for Outl and Humi are missing (?). In Naive Bayes, this just means that we calculate the product without those attributes:
P(N) × P(Outl = ? | N) P(Temp = h | N) P(Humi = ? | N) P(Wind = F | N)
≈ P(N) × P(Temp = h | N) × P(Wind = F | N)
= 1/2 × 2/3 × 1/3 = 1/9
P(Y) × P(Outl = ? | Y) P(Temp = h | Y) P(Humi = ? | Y) P(Wind = F | Y)
≈ P(Y) × P(Temp = h | Y) × P(Wind = F | Y)
= 1/2 × 1/3 × 1 = 1/6
Therefore, the result of our argmax function for the test instance H is Y:
argmax_{y ∈ {Y, N}} P(y | T = H) = Y
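A minimal sketch of this prediction step, reusing the hypothetical priors/cond_probs structure from the training sketch above; missing attribute values ('?' or None) are simply skipped, and unseen values contribute a probability of 0:

```python
def predict_no_smoothing(instance, priors, cond_probs):
    """Score each class as prior * product of conditional probabilities,
    skipping attributes whose value is missing ('?' or None)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in instance.items():
            if value in (None, "?"):
                continue  # missing attribute: leave it out of the product
            score *= cond_probs[c].get(attr, {}).get(value, 0.0)
        scores[c] = score
    # argmax over classes (all-zero scores remain unresolved, as for instance G)
    return max(scores, key=scores.get), scores
```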
(ii). Using the “epsilon” smoothing method.
For test instance G, using the ‘epsilon’ smoothing method, we can simply replace the 0 values with a small positive constant (like 10⁻⁶), which we call ε. So we’ll have:
N: = 1/2 × ε × ε × 2/3 × 2/3 = (2/9)ε²
Y: = 1/2 × 1/3 × 1/3 × 1/3 × ε = ε/54
By smoothing, we can sensibly compare the values. Because ε is, by convention, very small (it should be substantially less than 1/6, where 6 is the number of training instances), Y has the greater score (higher likelihood). So, Y is the output of our argmax function and G is classified as Y.
A quick note on the ‘epsilons’:
This isn’t a serious smoothing method, but does allow us to sensibly deal with two common cases:
– Where two classes have the same number of 0s in the product, we essentially ignore the 0s.
– Where one class has fewer 0s, that class is preferred.
For H, we don’t have any zero probability, so the calculations are similar to when we had no smoothing:
P(N) × P(Temp = h | N) × P(Wind = F | N)
= 1/2 × 2/3 × 1/3 = 1/9 ≈ 0.11
P(Y) × P(Temp = h | Y) × P(Wind = F | Y)
= 1/2 × 1/3 × 1 = 1/6 ≈ 0.17
Therefore, the result of our argmax function for the test instance H is Y:
argmax_{y ∈ {Y, N}} P(y | T = H) = Y
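A sketch of this ‘epsilon’ variant, in the same style as the prediction sketch above: any zero conditional probability is replaced by a small constant ε before it enters the product.

```python
def predict_epsilon(instance, priors, cond_probs, eps=1e-6):
    """Like predict_no_smoothing, but zero probabilities are replaced by eps."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in instance.items():
            if value in (None, "?"):
                continue  # missing attribute: leave it out of the product
            p = cond_probs[c].get(attr, {}).get(value, 0.0)
            score *= p if p > 0 else eps  # epsilon smoothing
        scores[c] = score
    return max(scores, key=scores.get), scores
```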
(iii). Using “Laplace” smoothing (𝛼 = 1)
This is similar, but rather than simply changing the probabilities that we have estimated to be equal
to 0, we are going to modify the way in which we estimate a conditional probability:
P(xᵢ | y) = (count(xᵢ, y) + α) / (count(y) + α·k)

where k is the number of distinct values the attribute can take.
In this method we add α, which is 1 here, to all possible events (seen and unseen) for each attribute. So, all unseen events (which currently have a probability of 0) will receive a count of 1, and the count for all seen events will be increased by 1, so that the ordering (monotonicity) of the counts is maintained.
For example, consider the attribute Outl, which has 3 different values (s, o, and r). Before, we estimated P(Outl = o | Y) = 1/3; now, we add 1 to the numerator (i.e., to the count of o) and 3 to the denominator (1 for o + 1 for r + 1 for s). So P(Outl = o | Y) now has the estimate (1 + 1)/(3 + 3) = 2/6 = 1/3.
In another example, Wind = T is not present (unseen) among the Play = Y training instances, so P(Wind = T | Y) = 0/3. Using Laplace smoothing (α = 1), we add 1 to the count of Wind = T (given Play = Y) and 1 to the count of Wind = F (given Play = Y), so we now have P(Wind = T | Y) = (0 + 1)/(3 + 2) = 1/5.
Typically, we would apply this smoothing process when building the model, and then substitute in the Laplace-smoothed values when making the predictions. For brevity, though, I’ll make the smoothing corrections in the prediction step.
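A minimal sketch of the Laplace-smoothed estimate described above, working directly from raw counts; the argument names (count_xy, count_y, k) are mine, not from the original.

```python
def laplace_estimate(count_xy, count_y, k, alpha=1):
    """P(x | y) with add-alpha (Laplace) smoothing over k attribute values."""
    return (count_xy + alpha) / (count_y + alpha * k)

# Examples matching the worked calculations above:
print(laplace_estimate(1, 3, 3))  # P(Outl = o | Y) = 2/6 = 0.333...
print(laplace_estimate(0, 3, 2))  # P(Wind = T | Y) = 1/5 = 0.2
```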
For G, this will look like:
𝑃(𝑁)× 𝑃(𝑂𝑢𝑡𝑙=𝑜|𝑁)𝑃(𝑇𝑒𝑚𝑝=𝑚|𝑁)𝑃(𝐻𝑢𝑚𝑖=𝑛|𝑁)𝑃(𝑊𝑖𝑛𝑑=𝑇|𝑁)
= 1/2 × (0+1)/(3+3) × (0+1)/(3+3) × (2+1)/(3+2) × (2+1)/(3+2)
= 1/2 × 1/6 × 1/6 × 3/5 × 3/5 = 0.005
𝑃(𝑌)× 𝑃(𝑂𝑢𝑡𝑙=𝑜|𝑌)𝑃(𝑇𝑒𝑚𝑝=𝑚|𝑌)𝑃(𝐻𝑢𝑚𝑖=𝑛|𝑌)𝑃(𝑊𝑖𝑛𝑑=𝑇|𝑌)
= 1/2 × (1+1)/(3+3) × (1+1)/(3+3) × (1+1)/(3+2) × (0+1)/(3+2)
= 1/2 × 2/6 × 2/6 × 2/5 × 1/5 ≈ 0.0044
Unlike with the epsilon procedure, N has the greater score — even though there are two attribute values that have never occurred with N. So here G is classified as N.
For H, this will look like:
N: = 1/2 × (2+1)/(3+3) × (1+1)/(3+2) = 1/2 × 3/6 × 2/5 = 0.1
Y: = 1/2 × (1+1)/(3+3) × (3+1)/(3+2) = 1/2 × 2/6 × 4/5 ≈ 0.13
Here, Y has a higher score — which is the same as with the other method, which doesn’t do any smoothing here — but this time it is only slightly higher.
4. For the following set of classification problems, design a Naive Bayes classification model. Answer the following questions for each problem: (i) what are the instances, what are the features (and values)? (ii) explain which distributions you would choose to model the observations, and (iii) explain the significance of the Naive Bayes assumption.
A. You want to classify a set of images of animals into ‘cats’, ‘dogs’, and ‘others’.
(i). Here the images are the instances, and the features are the pixels of the image. Each pixel can have values such as pixel intensity, colour code, or shade. The important thing to note here is that these values (in the context of image processing) are continuous.
(ii). Since our features are continuous, the Gaussian (or normal) distribution is most appropriate (assuming that our feature values are (roughly) Gaussian distributed). The Gaussian distribution has a bell-shaped curve and useful properties that make the calculations fairly easy.
(iii). The Naïve Bayes assumption tells us that given each class (‘cat’, ‘dog’, ‘others’), we treat all features as independent. But the reality is that this assumption is not true at all. In fact clearly the {intensity, colour, …} of neighbouring pixels depend on one another. However, we can still use Naïve Bayes for developing a model and predicting the labels.
B. You want to classify whether each customer will purchase a product, given all the products (s)he has bought previously.
(i). In this problem, each customer is an instance. The features can be the products (or types of products) in the catalogue. The values of these features can be binary indicators, 0/1 (did or did not purchase the product), or perhaps counts of how many times the customer bought a specific (type of) product.
(ii). In this setting, the features are discrete. If we assume count-based features, we define a Multinomial distribution over K dimensions (K = the number of products), where the values are that particular customer’s purchase counts for each product; we can use essentially the same approach with binary indicators (leading to the Binomial distribution). The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent Boolean experiments (with probability p of success in each experiment); see the sketch after this answer.
(iii). Here the NB assumptions tell us that given the label (‘purchase’, ‘not purchase’) all previous purchases are treated as independent features. But clearly, this is not the case (e.g., if a customer purchased Game of Thrones seasons 1-5, it should influence the probability of the customer also having purchased Game of Thrones season 6).
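To make the distribution choice in (ii) concrete, here is a small sketch (with made-up counts and probabilities, not taken from the question) of the binomial and multinomial likelihood terms such a model would use for one class:

```python
import math
from math import comb, factorial

def binomial_pmf(k, n, p):
    """Probability of k purchases (successes) in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def multinomial_pmf(counts, probs):
    """Probability of the observed per-product purchase counts, given
    per-product probabilities estimated for one class."""
    n = sum(counts)
    coef = factorial(n) // math.prod(factorial(c) for c in counts)
    return coef * math.prod(p**c for c, p in zip(counts, probs))

# Made-up example: 3 product types, one customer's counts, P(product | class).
print(binomial_pmf(2, 5, 0.3))                      # 2 purchases in 5 trials
print(multinomial_pmf([2, 0, 1], [0.5, 0.3, 0.2]))  # count-based features
```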
5. [OPTIONAL] Given the following dataset,
     X1 (Headache)   X2 (Sore)   X3 (Temp)   Y (Diagnosis)
1    0.8             0.4         39.5        Flu
2    0               0.8         37.8        Cold
3    0.4             0.4         37.8        Flu
4    0.4             0           37.8        Cold
-----------------------------------------------------------
5    0.8             0.8         37.8        ? (Flu)
Build a Naïve Bayes model for the given training instances (1-4, above the line).
A Naïve Bayes model is a probabilistic classification model. All we need to build a Naive Bayes model is to calculate the right probabilities (prior and conditional).
For this dataset, our class (i.e., the label, or the variable we are trying to predict) is Diagnosis. So, we need the probability of each label (the prior probabilities):
P(Diagnosis = Flu) = 0.5 P(Diagnosis = Cold) = 0.5
We also need to identify all the conditional probabilities between the class labels (Diagnosis) and all the attributes (Headache, Sore, Temperature). All attributes in this data set are numeric, so we will represent the likelihoods using the Gaussian distribution, which has two parameters: mean and standard deviation.
Let’s estimate the parameters for each likelihood
P(headache | flu):   μ_headache,flu = (0.8 + 0.4)/2 = 0.6,   σ_headache,flu = √(((0.8 - 0.6)² + (0.4 - 0.6)²)/2) = 0.2
P(headache | cold):  μ_headache,cold = (0 + 0.4)/2 = 0.2,    σ_headache,cold = √(((0 - 0.2)² + (0.4 - 0.2)²)/2) = 0.2
P(sore | flu):       μ_sore,flu = (0.4 + 0.4)/2 = 0.4,       σ_sore,flu = √(((0.4 - 0.4)² + (0.4 - 0.4)²)/2) = 0
P(sore | cold):      μ_sore,cold = (0.8 + 0)/2 = 0.4,        σ_sore,cold = √(((0.8 - 0.4)² + (0 - 0.4)²)/2) = 0.4
P(temp | flu):       μ_temp,flu = (39.5 + 37.8)/2 = 38.65 ≈ 38.7,   σ_temp,flu = √(((39.5 - 38.65)² + (37.8 - 38.65)²)/2) = 0.85
P(temp | cold):      μ_temp,cold = (37.8 + 37.8)/2 = 37.8,   σ_temp,cold = √(((37.8 - 37.8)² + (37.8 - 37.8)²)/2) = 0
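A sketch of this parameter estimation, using the four training instances from the table and the population standard deviation (dividing by n, as in the worked calculations above):

```python
import math

# Training instances 1-4: (headache, sore, temp, diagnosis)
train = [
    (0.8, 0.4, 39.5, "flu"),
    (0.0, 0.8, 37.8, "cold"),
    (0.4, 0.4, 37.8, "flu"),
    (0.4, 0.0, 37.8, "cold"),
]

def gaussian_params(values):
    """Mean and population standard deviation (divide by n, as above)."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma

features = ["headache", "sore", "temp"]
for label in ("flu", "cold"):
    rows = [r for r in train if r[3] == label]
    for i, name in enumerate(features):
        mu, sigma = gaussian_params([r[i] for r in rows])
        print(f"{name} | {label}: mu = {mu:.2f}, sigma = {sigma:.2f}")
```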
Estimate the probability of the test instance (5, below the line)
The probability of a class given the observed features is the prior probability of the class (Binomial) times the probability of each feature given the class (Gaussian). Recall that the probability of an observation x under a Gaussian distribution with mean μ and standard deviation σ is defined as:

P(x | μ, σ) = 1/(σ√(2π)) · exp(-(x - μ)² / (2σ²))
Note: the Gaussian distribution with zero variance is not defined. For this exercise, we will ignore features with zero variance under a class; the omitted factors are marked below.
Test instance 5: X1 (Headache) = 0.8, X2 (Sore) = 0.8, X3 (Temp) = 37.8, Y (Diagnosis) = ? (Flu)
P(flu) × P(headache = 0.8 | flu; μ_headache,flu, σ_headache,flu) × P(sore = 0.8 | flu; μ_sore,flu, σ_sore,flu) [omitted: σ_sore,flu = 0] × P(temp = 37.8 | flu; μ_temp,flu, σ_temp,flu)
= 1/2 × 1.21 × 0.28 ≈ 0.17
P(cold) × P(headache = 0.8 | cold; μ_headache,cold, σ_headache,cold) × P(sore = 0.8 | cold; μ_sore,cold, σ_sore,cold) × P(temp = 37.8 | cold; μ_temp,cold, σ_temp,cold) [omitted: σ_temp,cold = 0]
= 1/2 × 0.02 × 0.6 = 0.006
We find that P(flu|xtest) > P(cold|xtest), and hence predict the label “flu”.
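A sketch of this scoring step, plugging in the parameters estimated above and skipping zero-variance features as described; the final numbers differ very slightly from the hand calculation because of rounding.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of x under a Gaussian with mean mu and std dev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Estimated parameters: params[class][feature] = (mu, sigma)
params = {
    "flu":  {"headache": (0.6, 0.2), "sore": (0.4, 0.0), "temp": (38.65, 0.85)},
    "cold": {"headache": (0.2, 0.2), "sore": (0.4, 0.4), "temp": (37.8, 0.0)},
}
priors = {"flu": 0.5, "cold": 0.5}
test = {"headache": 0.8, "sore": 0.8, "temp": 37.8}

scores = {}
for label in priors:
    score = priors[label]
    for feat, x in test.items():
        mu, sigma = params[label][feat]
        if sigma == 0:
            continue  # ignore zero-variance features, as in the worked solution
        score *= gaussian_pdf(x, mu, sigma)
    scores[label] = score

print(scores)                        # approximately {'flu': 0.17, 'cold': 0.0067}
print(max(scores, key=scores.get))   # 'flu'
```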