
School of Computing and Information Systems The University of Melbourne
COMP90049 Introduction to Machine Learning (Semester 1, 2022) Sample solutions: Week 8
1. What is the difference between “model bias” and “model variance”?
Model Bias:


– Model bias is the propensity of a classifier to systematically produce the same errors; if it doesn’t produce errors, it is unbiased; if it produces different kinds of errors on different instances, it is also unbiased. (An example of the latter: the instance is truly of class A, but sometimes the system calls it B and sometimes the system calls it C.)
– The notion of bias is slightly more natural in a regression context, where we can sensibly measure the difference between the prediction and the true value. In a classification context, the prediction and the true label can only be “same” or “different”.
– Consequently, a typical interpretation of bias in a classification context is whether the classifier labels the test data in such a way that the distribution of predicted classes systematically doesn’t match the distribution of actual classes. For example, “bias towards the majority class”, when the model predicts too many instances as the majority class.
Model Variance:

– Model variance is the tendency of a classifier to produce different classifications if it was trained on different training sets (randomly sampled from the same population). It is a measure of the inconsistency of the classifier between different training sets.
(i). Why is a high bias, low variance classifier undesirable?
In short, because it’s consistently wrong. Using the other interpretation: the distribution of labels predicted by the classifier is consistently different to the distribution of the true labels; this means that it must be making mistakes.
(ii). Why is a low bias, high variance classifier (usually) undesirable?
This is less obvious: it’s low bias, so it must be making a good number of correct decisions. The fact that it’s high variance means that not all of the predictions can possibly be correct (or it would be low-variance!) — and the correct predictions will change, perhaps drastically, as we change the training data.
One obvious problem here is that it’s difficult to be certain about the performance of the classifier at all: we might estimate its error rate to be low on one set of data, and high on another set of data.
The real issue becomes more obvious when we consider the alternative formulation: the low bias means that the distribution of predictions matches the distribution of true labels; however, the high variance means that which instances are getting assigned to which label must be changing every time.
This suggests the real problem — namely, that what we have is the second kind of unbiased classifier: one that makes different kinds of errors on different training sets, but always makes errors; and not the first kind: one that is usually correct.
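As noted above, bias is most natural in a regression setting. The following is a minimal simulation sketch (our own illustration, not part of the original solutions) that estimates bias and variance at a single query point by refitting two polynomial models on many training sets drawn from the same population:

```python
import numpy as np

rng = np.random.default_rng(0)
x_test, true_y = 0.5, np.sin(np.pi * 0.5)   # a fixed query point and its true value

def sample_train(n=30):
    """Draw a fresh training set from the same population: y = sin(pi*x) + noise."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(np.pi * x) + rng.normal(0, 0.3, n)

for degree, label in [(0, "high bias, low variance"), (9, "low bias, high variance")]:
    # Refit the same model class on many independently sampled training sets.
    preds = []
    for _ in range(500):
        x, y = sample_train()
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)
    print(f"degree {degree} ({label}): "
          f"bias = {preds.mean() - true_y:+.3f}, variance = {preds.var():.3f}")
```

The degree-0 model (a constant) gives nearly the same, systematically wrong prediction on every training set, while the degree-9 model is right on average but its prediction swings with each resampled training set.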
2. Describe how a validation set and cross-validation can help reduce overfitting.

Machine learning models usually have one or more (hyper)parameters that control model complexity, i.e. the ability of the model to fit noise in the training set. In a practical application, we need to determine the values of such parameters, and the principal objective in doing so is usually to achieve the best predictive performance on new data. Furthermore, as well as finding the appropriate values for complexity parameters within a given model, we may wish to consider a range of different types of model in order to find the best one for our particular application.
We know that the performance on training data is not a good indicator of predictive performance on unseen data because of overfitting. If data is plentiful, then one approach is simply to use some of the available data to train a range of models, or a given model with a range of values for its complexity parameters, and then to compare them on independent data, sometimes called a validation set, and select the one having the best predictive performance. If the model design is iterated many times using a limited size data set, then some overfitting to the validation data can occur and so it may be necessary to keep aside a third test set on which the performance of the selected model is finally evaluated.
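As a concrete illustration of this protocol (our own sketch with scikit-learn and a toy dataset, not part of the original solutions), model selection with a held-out validation set and a final test set might look like:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=0)   # toy stand-in data

# Split once into train / validation / test (here 60% / 20% / 20%).
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Select the complexity parameter (tree depth) on the validation set only.
best_depth = max(
    [1, 2, 4, 8, 16],
    key=lambda d: accuracy_score(
        y_val,
        DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train).predict(X_val),
    ),
)

# Evaluate the selected model once, on the untouched test set.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("chosen depth:", best_depth,
      "| test accuracy:", accuracy_score(y_test, final.predict(X_test)))
```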
In many applications, however, the supply of data for training and testing will be limited, and in order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance. One solution to this dilemma is to use cross-validation: in k-fold cross-validation, the data is partitioned into k folds, and each fold takes a turn as the validation set while the model is trained on the remaining k − 1 folds; the k scores are then averaged, so every instance is used for both training and validation.
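A minimal cross-validation sketch (again our own, reusing the toy data from above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 5-fold cross-validation: each instance is used for validation exactly once,
# and all of the data is (eventually) used for training, so no data is "wasted".
for depth in [1, 2, 4, 8, 16]:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0),
                             X, y, cv=5)
    print(f"max_depth={depth}: mean accuracy {scores.mean():.3f} "
          f"(std {scores.std():.3f} across folds)")
```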
3. Why does ensembling reduce model variance?
We know from statistics that averaging reduces variance. If $Z_1, \dots, Z_N$ are i.i.d. random variables:

$$\mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} Z_i\right) = \frac{1}{N}\,\mathrm{Var}(Z_i)$$
So, the idea is that if several models are averaged, the model variance decreases without having an effect on bias. The problem is that there is only one training set, so how do we get multiple models? The answer to this problem is ensembling. Ensembling creates multiple models either by creating multiple training sets from the one training set using the bootstrap (e.g., bagging, random forests), or by training multiple learning algorithms (e.g., stacking). The predictions of the individual models are then combined (averaged) to reduce the final model variance, as sketched below.
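A minimal bagging sketch (our own illustration with scikit-learn decision trees, not part of the original solutions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)

# Bagging: build each model on a bootstrap sample (drawn with replacement)
# of the single training set we actually have.
models = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap indices
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Combine the individual predictions by averaging (a majority vote for 0/1 labels).
votes = np.mean([m.predict(X) for m in models], axis=0)
ensemble_pred = (votes > 0.5).astype(int)
print("training accuracy of the ensemble:", (ensemble_pred == y).mean())
```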
4. Consider the following training set:

x₁  x₂  y
0   0   0
0   1   1
1   1   1
Consider the initial weights 𝜃 = {𝜃₀, 𝜃₁, 𝜃₂} = {0.2, −0.4, 0.1} and, as the activation function of the perceptron, the step function

$$f(\Sigma) = \begin{cases} 1 & \text{if } \Sigma > 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{where } \Sigma = \theta_0 + \theta_1 x_1 + \theta_2 x_2.$$

a) Can the perceptron learn a perfect solution for this data set?

A perceptron can only learn perfect solutions for linearly separable problems, so you should ask yourself whether the data set is linearly separable. Indeed, it is: if you draw the instances in a coordinate system, you will find that you can separate the class-0 example (0,0) from the class-1 examples ((0,1), (1,1)) with a straight line.

b) Draw the perceptron graph and calculate the accuracy of the perceptron on the training data before training.

The perceptron has inputs x₁ and x₂ plus a bias input fixed at 1; these connect to the output unit through the weights 𝜃₁, 𝜃₂ and 𝜃₀ respectively, and the weighted sum Σ is passed through the step function f.

To calculate the accuracy of the system we first need to calculate the output (prediction) of our perceptron and then compare it to the actual labels of the class:

x₁  x₂  Σ = 𝜃₀ + 𝜃₁x₁ + 𝜃₂x₂              ŷ = f(Σ)      y
0   0   0.2 − 0.4 × 0 + 0.1 × 0 = 0.2     f(0.2) = 1    0
0   1   0.2 − 0.4 × 0 + 0.1 × 1 = 0.3     f(0.3) = 1    1
1   1   0.2 − 0.4 × 1 + 0.1 × 1 = −0.1    f(−0.1) = 0   1

As you can see, only one of the predictions (for instance (0,1)) matches the actual label, and therefore the accuracy of our perceptron is 1/3 at this stage.
c) Using the perceptron learning rule and a learning rate of 𝜂 = 0.2, train the perceptron for one epoch. What are the weights after the training?
Remember that the perceptron weight update rule in iteration t, for training instance i, is as follows:

$$\theta^{(t)} \leftarrow \theta^{(t-1)} + \eta\,\big(y_i - \hat{y}_i^{(t)}\big)\,x_i$$
For epoch 1 we will have:

Instance (0,0) with y = 0: Σ = 0.2 − 0.4 × 0 + 0.1 × 0 = 0.2, so ŷ = f(0.2) = 1. Incorrect prediction → update the weights:

𝜃₀ ← 0.2 + 0.2 × (0 − 1) × 1 = 0
𝜃₁ ← −0.4 + 0.2 × (0 − 1) × 0 = −0.4
𝜃₂ ← 0.1 + 0.2 × (0 − 1) × 0 = 0.1

Instance (0,1) with y = 1: Σ = 0 − 0.4 × 0 + 0.1 × 1 = 0.1, so ŷ = f(0.1) = 1. Correct prediction → no update.

Instance (1,1) with y = 1: Σ = 0 − 0.4 × 1 + 0.1 × 1 = −0.3, so ŷ = f(−0.3) = 0. Incorrect prediction → update the weights:

𝜃₀ ← 0 + 0.2 × (1 − 0) × 1 = 0.2
𝜃₁ ← −0.4 + 0.2 × (1 − 0) × 1 = −0.2
𝜃₂ ← 0.1 + 0.2 × (1 − 0) × 1 = 0.3

After one epoch, the weights are 𝜃 = {0.2, −0.2, 0.3}.

d) What is the accuracy of the perceptron on the training data after training for one epoch? Did the accuracy improve?

With the new weights we get
• for instance (0,0) with y = 0: 0.2 − 0.2 × 0 + 0.3 × 0 = 0.2; f(0.2) = 1; incorrect
• for instance (0,1) with y = 1: 0.2 − 0.2 × 0 + 0.3 × 1 = 0.5; f(0.5) = 1; correct
• for instance (1,1) with y = 1: 0.2 − 0.2 × 1 + 0.3 × 1 = 0.3; f(0.3) = 1; correct

The accuracy of our perceptron is now 2/3, so the accuracy of the system has improved.

5. [OPTIONAL] Why is a perceptron (which uses a sigmoid activation function) equivalent to logistic regression?

A perceptron has a weight associated with each input (attribute); the output is obtained by (1) summing up the weighted input features and (2) applying the activation function to the summed value. The standard activation function for the perceptron is the step function (as shown in the lectures); however, it can be replaced with a different (appropriate) function. For example, we could use the sigmoid activation function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

and apply it to the linear combination of inputs (𝜃₀ + 𝜃₁x₁ + 𝜃₂x₂ + ⋯), which gives $f(\theta^{\top} x) = \sigma(\theta^{\top} x)$. This is now similar to the logistic regression model.

Note also that the perceptron and logistic regression have different objective functions. In logistic regression, we use the cross-entropy loss (negative log-likelihood) to optimize the weights (𝜃). The objective of the perceptron is simply based on counting errors. The perceptron and logistic regression will only be completely equivalent if we change (1) the objective function to the cross-entropy loss and (2) the activation function to the sigmoid.

The perceptron and logistic regression also have different learning mechanisms (but this difference doesn't impact the equivalence of the models). For logistic regression, weights are typically updated after all the training instances have been processed (after one full iteration). However, we could also apply mini-batch (or stochastic) gradient descent, updating the weights after a subset of instances has been observed (and the subset can be as small as one instance). The perceptron by definition updates its weights after processing each instance, so the weights (𝜃) are updated several times in each iteration over the training data.
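As a sanity check on the hand calculations in question 4 (our own sketch, not part of the original solutions), the one-epoch update can be reproduced in a few lines of Python; swapping the step function for a sigmoid, as discussed in question 5, would give the logistic-regression-style model:

```python
# Perceptron with a step activation, reproducing the worked example in question 4.
X = [(0, 0), (0, 1), (1, 1)]   # training instances (x1, x2)
y = [0, 1, 1]                  # labels
theta = [0.2, -0.4, 0.1]       # [theta0 (bias), theta1, theta2]
eta = 0.2                      # learning rate

def predict(theta, x):
    s = theta[0] + theta[1] * x[0] + theta[2] * x[1]
    return 1 if s > 0 else 0   # step activation; a sigmoid here gives the Q5 variant

# One epoch: the perceptron updates its weights after every misclassified instance.
for xi, yi in zip(X, y):
    y_hat = predict(theta, xi)
    if y_hat != yi:
        theta[0] += eta * (yi - y_hat) * 1       # bias input is fixed at 1
        theta[1] += eta * (yi - y_hat) * xi[0]
        theta[2] += eta * (yi - y_hat) * xi[1]

print("weights after one epoch:", [round(t, 2) for t in theta])   # [0.2, -0.2, 0.3]
acc = sum(predict(theta, xi) == yi for xi, yi in zip(X, y)) / len(X)
print("accuracy after one epoch:", acc)                           # 2/3
```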