School of Computing and Information Systems The University of Melbourne
COMP90049 Introduction to Machine Learning (Semester 1, 2022) Week 7: Sample solutions
1. What is gradient descent? Why is it important?
Gradient descent is an instance of the family of iterative optimization algorithms: it finds the parameters corresponding to optimal points of a target function step by step, by starting from some initial parameter value and incrementally modifying this value in the 'most promising' way, i.e., in the way that leads to the largest improvement of the target function. This step-by-step procedure is important for optimization problems with no closed-form solution. It has numerous applications; most intuitively, determining the regression weights that minimize an error function over some training data set.
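To make the procedure concrete, here is a minimal sketch in Python; the target function f(x) = (x - 3)^2, the starting point, and the learning rate are illustrative assumptions of ours, not part of the subject material:

    # Minimal gradient descent sketch: minimise f(x) = (x - 3)^2,
    # whose gradient is f'(x) = 2(x - 3). All values are illustrative.
    def f(x):
        return (x - 3) ** 2

    def grad_f(x):
        return 2 * (x - 3)

    x = 0.0        # initial parameter value
    eta = 0.1      # learning rate (step size)
    for step in range(100):
        x = x - eta * grad_f(x)   # move against the gradient
    print(x, f(x))                # x converges towards the minimiser x = 3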
2. What is Logistic Regression? What is “logistic”? What are we “regressing”?
We build a Binary Logistic Regression model, where the target (y) is (close to) 1 for instances of the positive class, and (close to) 0 for instances of the negative class. (This can be suitably re-interpreted for multi-class problems.)
The goal of binary logistic regression is to train a classifier that can make a binary decision about the class of an input observation. Consider a test instance x, which we will represent by a vector of features [x1, x2, ..., xn]. The output y can be 1 or 0 (e.g., winning / not winning).
The model calculates the probability P(y=1|x) (from which we can trivially derive P(y=0|x) = 1 - P(y=1|x)). A logistic regression classifier additionally defines a 'decision boundary', which is typically set at 0.5. If the model predicts the probability P(y=1|x) > 0.5, we classify x as class 1; otherwise, x is classified as class 0.
In statistics, regression describes how an independent variable is numerically related to the dependent variable. In Logistic Regression we first compute the regression output z,

$z = \vec{\theta} \cdot \vec{x} = \theta_0 + \theta_1 x_1 + \dots + \theta_F x_F$

where $\theta$ holds the weights of the features. Each weight represents the numeric correlation between a feature (attribute) and the label (class): the features (attributes) of our dataset are assumed to be our independent variables, and the label (class) is our dependent variable.
Since the logistic (or sigmoid) function has an easy-to-calculate derivative (which makes it easy to find the optimal parameters) and has a range of (0, 1), we apply the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$ to the regression output z, and call the resulting model Logistic Regression.
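As a small sketch (the function names are our own), the logistic function and its convenient derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ can be written as:

    import math

    def sigmoid(z):
        # logistic function: maps any real z into (0, 1)
        return 1.0 / (1.0 + math.exp(-z))

    def sigmoid_derivative(z):
        # standard identity: sigma'(z) = sigma(z) * (1 - sigma(z))
        s = sigmoid(z)
        return s * (1.0 - s)

    print(sigmoid(0.0))    # 0.5, the usual decision boundary
    print(sigmoid(-1.6))   # roughly 0.17 (this value reappears in question 3)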
3. In the following dataset, each instance represents a news article. The values of the features are counts of selected words in each article. Develop a logistic regression classifier to predict the class of the article (fruit vs. computer), with $\hat{y} = 1$ (fruit) and $\hat{y} = 0$ (computer).
TRAINING INSTANCES

ID   apple   ibm   lemon   sun   y   CLASS
A    1       0     1       5     1   FRUIT
B    1       0     1       2     1   FRUIT
C    2       0     0       1     1   FRUIT
D    2       2     0       0     0   COMPUTER
E    1       2     1       7     0   COMPUTER

TEST INSTANCES

ID   apple   ibm   lemon   sun   y   CLASS
T    1       2     1       5     0   COMPUTER
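For readers who want to reproduce the numbers below, one possible encoding of this dataset in Python (the variable names are our own, not part of the question):

    # Feature order: [apple, ibm, lemon, sun]; labels: 1 = FRUIT, 0 = COMPUTER.
    X_train = {
        "A": [1, 0, 1, 5],
        "B": [1, 0, 1, 2],
        "C": [2, 0, 0, 1],
        "D": [2, 2, 0, 0],
        "E": [1, 2, 1, 7],
    }
    y_train = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 0}
    x_test = [1, 2, 1, 5]   # test instance T (ground-truth class: COMPUTER, y = 0)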
For the moment, we assume that we already have an estimate of the model parameters, i.e., the weights of the 4 features and the bias $\theta_0$:

$\vec{\theta} = [\theta_0, \theta_1, \theta_2, \theta_3, \theta_4] = [0.2, 0.3, -2.2, 3.3, -0.2]$
(i). Explain the intuition behind the model parameters, and their meaning in relation to the features.
In this dataset we want to identify whether a piece of writing is about computers or fruit (e.g., 'the new apple iPhone is very expensive' vs. 'an apple a day keeps the doctor away'). To do so, we are using 4 terms (apple, ibm, lemon, sun) and the count of their occurrences in a piece of writing. So, for example, we know that Doc A includes apple once and sun five times.
The decision to include exactly these four terms (and no others) as attributes, and to define their values as word occurrence counts, is the decision of the modeler, and the outcome of a process called Feature Engineering.
Based on the definition, we know that in Logistic Regression we model $P(y = 1 \mid x_1, x_2, \dots, x_F)$ directly, subject to the parameters $\theta$:

$P(y = 1 \mid x_1, x_2, \dots, x_F) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \dots + \theta_F x_F)}} = \sigma(\theta_0 + \theta_1 x_1 + \dots + \theta_F x_F)$
Here, we have a binary output label y = {0 (computer), 1 (fruit)}, and four features: apple, ibm, lemon, sun.
Weights $\theta_1, \theta_2, \theta_3, \theta_4$ denote the respective importance of features 1 to 4 for predicting the class fruit (1). For example, $\theta_2$ (-2.2) indicates how important feature 2 (ibm) is for predicting $\hat{y} = 1$ (fruit). $\theta_0$ represents the bias of this model.
Here, the negative sign indicates that the occurrence of the word ibm is negatively correlated with the class FRUIT: if we observe ibm in a document, it makes the label FRUIT less likely. The positive value of $\theta_3$ indicates a positive correlation of the feature lemon with FRUIT: if the word lemon is mentioned in a document, it increases the probability of its label being FRUIT.
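To see these correlations numerically, here is a small hypothetical check, using the weights given above (the helper function and its name are our own):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    theta = [0.2, 0.3, -2.2, 3.3, -0.2]   # [bias, apple, ibm, lemon, sun]

    def p_fruit(apple, ibm, lemon, sun):
        z = (theta[0] + theta[1] * apple + theta[2] * ibm
             + theta[3] * lemon + theta[4] * sun)
        return sigmoid(z)

    # One extra 'ibm' (negative weight) lowers P(FRUIT);
    # one extra 'lemon' (positive weight) raises it.
    print(p_fruit(1, 0, 1, 2))   # ~0.97 (this is training instance B)
    print(p_fruit(1, 1, 1, 2))   # ~0.77, after adding one 'ibm'
    print(p_fruit(1, 0, 2, 2))   # ~0.999, after adding one 'lemon'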
(ii). Predict the test label.
Using the logistic regression formula above, we find

$P(\text{fruit} \mid T) = P(\hat{y} = 1 \mid T) = \sigma(\theta_0 + \theta_1 t_1 + \dots + \theta_4 t_4)$

$= \sigma(0.2 + 0.3 \times 1 + (-2.2) \times 2 + 3.3 \times 1 + (-0.2) \times 5)$

$= \sigma(-1.6) = \frac{1}{1 + e^{-(-1.6)}} = 0.17$

and consequently

$P(\text{computer} \mid T) = 1 - P(\text{fruit} \mid T) = 1 - 0.17 = 0.83$
Recall that we turn the logistic regression model into a classifier by predicting label y = 1 whenever $P(y = 1 \mid x; \theta) > 0.5$, and predicting y = 0 otherwise.
Since for the test instance T the probability P(y = 1) (FRUIT) is smaller than 0.5, we predict label y = 0 (COMPUTER) for T. Comparing with the 'ground truth' in our dataset, we can see that our prediction is correct.
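The same prediction as a short Python sketch (weights and feature counts are copied from the question; the helper names are our own):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    theta = [0.2, 0.3, -2.2, 3.3, -0.2]   # [bias, apple, ibm, lemon, sun]
    x_T = [1, 2, 1, 5]                    # test instance T

    z = theta[0] + sum(w * f for w, f in zip(theta[1:], x_T))
    p_fruit = sigmoid(z)
    print(z, p_fruit)                                  # -1.6, ~0.17
    print("FRUIT" if p_fruit > 0.5 else "COMPUTER")    # COMPUTER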
(iii). Recall the conditional likelihood objective
$-\log L(\theta) = -\sum_{i=1}^{N} \left[\, y_i \log \sigma(x_i; \theta) + (1 - y_i) \log\big(1 - \sigma(x_i; \theta)\big) \right]$
Design a test to make sure that the loss of our model is lower when it predicts the correct label for test instance T than when it predicts the wrong label.
To test our error function, we are going to compute the negative log-likelihood of the test instance (1) assuming the true label is y = 1 (fruit), i.e., our classifier made a mistake; and (2) assuming the true label is y = 0 (computer), i.e., our classifier predicted correctly.
Assuming y = 1 and we predicted $\hat{y} = 0$:

$-\log L(\theta) = -\left[\, 1 \cdot \log \sigma(0.2 + 0.3 \times 1 + (-2.2) \times 2 + 3.3 \times 1 + (-0.2) \times 5) + (1 - 1)\log\big(1 - \sigma(0.2 + 0.3 \times 1 + (-2.2) \times 2 + 3.3 \times 1 + (-0.2) \times 5)\big) \right]$

$= -\left[\, 1 \cdot \log \sigma(-1.6) + (1 - 1)\log\big(1 - \sigma(-1.6)\big) \right]$

$= -\left[\, \log \sigma(-1.6) + 0 \right] = -\log \frac{1}{1 + e^{-(-1.6)}} = -\log(0.17) = 1.77$
Assuming y = 0 and we predicted $\hat{y} = 0$:

$-\log L(\theta) = -\left[\, 0 \cdot \log \sigma(-1.6) + (1 - 0)\log\big(1 - \sigma(-1.6)\big) \right]$

$= -\log\big(1 - \sigma(-1.6)\big) = -\log(1 - 0.17) = -\log(0.83) = 0.19$
Reassuringly, the negative log-likelihood (a.k.a. loss) under the correct label y = 0 is lower than the loss under the incorrect label y = 1.
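The same check as a Python sketch, reusing $\sigma(-1.6) \approx 0.17$ from part (ii); natural logarithms, as in the numbers above, and the names are our own:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    p = sigmoid(-1.6)   # the model's P(y = 1 | T), ~0.17

    def nll(y, p):
        # negative log-likelihood of one instance with true label y
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))

    print(nll(1, p))   # ~1.78 (the worked solution's 1.77, up to rounding): wrong label
    print(nll(0, p))   # ~0.18 (the worked solution's 0.19, up to rounding): correct label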
4. For the model created in question 3, compute a single gradient descent update for parameter $\theta_1$, using the training instances given above. Recall that for each feature j, we compute its weight update as
$\theta_j \leftarrow \theta_j - \eta \sum_i \big(\sigma(x_i; \theta) - y_i\big)\, x_{ij}$

summing over all training instances i. We will compute the update for $\theta_1$, assuming the current parameters as specified above and a learning rate $\eta = 0.1$.
$\theta_1$ is the model parameter (respective importance) of feature 1 (apple) in our model. Using the gradient descent method over several iterations, we want to find the best parameters of the model, i.e., the parameters that minimize the loss (error) of our model.
Step 1: we need to compute $\sigma(x_i; \theta)$ for all our training instances.
$\sigma(x_A; \theta) = \sigma(0.2 + 0.3 \times 1 + (-2.2) \times 0 + 3.3 \times 1 + (-0.2) \times 5) = 0.94$
$\sigma(x_B; \theta) = \sigma(0.2 + 0.3 \times 1 + (-2.2) \times 0 + 3.3 \times 1 + (-0.2) \times 2) = 0.97$
$\sigma(x_C; \theta) = \sigma(0.2 + 0.3 \times 2 + (-2.2) \times 0 + 3.3 \times 0 + (-0.2) \times 1) = 0.65$
$\sigma(x_D; \theta) = \sigma(0.2 + 0.3 \times 2 + (-2.2) \times 2 + 3.3 \times 0 + (-0.2) \times 0) = 0.03$
$\sigma(x_E; \theta) = \sigma(0.2 + 0.3 \times 1 + (-2.2) \times 2 + 3.3 \times 1 + (-0.2) \times 7) = 0.12$
Step 2: now we can compute the parameter update for $\theta_1$. We are using the following formula:

$\theta_1 \leftarrow \theta_1 - \eta \sum_{i \in \{A,B,C,D,E\}} \big(\sigma(x_i; \theta) - y_i\big)\, x_{i1}$

$\theta_1 = 0.3 - 0.1 \sum_{i \in \{A,B,C,D,E\}} \big(\sigma(x_i; \theta) - y_i\big)\, x_{i1}$

$= 0.3 - 0.1\,\big[(\sigma(x_A;\theta) - y_A)\,x_{A1} + (\sigma(x_B;\theta) - y_B)\,x_{B1} + (\sigma(x_C;\theta) - y_C)\,x_{C1} + (\sigma(x_D;\theta) - y_D)\,x_{D1} + (\sigma(x_E;\theta) - y_E)\,x_{E1}\big]$

$= 0.3 - 0.1\,\big[(0.94 - 1) \times 1 + (0.97 - 1) \times 1 + (0.65 - 1) \times 2 + (0.03 - 0) \times 2 + (0.12 - 0) \times 1\big]$

$= 0.3 - 0.1\,\big[(-0.06) + (-0.03) + (-0.70) + 0.06 + 0.12\big] = 0.3 - 0.1 \times (-0.61) = 0.3 + 0.061 = 0.361$
The new value for $\theta_1$ is 0.361. Given the feedback from the current model's mistakes (over our training data), the importance of the feature apple has slightly increased. We can do the same for the other parameters $\theta_2$, $\theta_3$ and $\theta_4$ (and the bias $\theta_0$), and continue the iterations until we find the 'optimum' parameters.
NOTE: In the full Gradient Descent algorithm, we first compute the sigmoid values for all training instances (step 1 above), and then update all parameters at once (step 2 above).
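Putting both steps together as a sketch (one batch update over all five training instances, updating every parameter at once as the note describes; variable names are our own):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Rows A-E; feature order: [apple, ibm, lemon, sun].
    X = [[1, 0, 1, 5], [1, 0, 1, 2], [2, 0, 0, 1], [2, 2, 0, 0], [1, 2, 1, 7]]
    y = [1, 1, 1, 0, 0]
    theta = [0.2, 0.3, -2.2, 3.3, -0.2]   # [bias, apple, ibm, lemon, sun]
    eta = 0.1

    # Step 1: sigma(x_i; theta) for every training instance.
    preds = [sigmoid(theta[0] + sum(w * f for w, f in zip(theta[1:], row)))
             for row in X]

    # Step 2: update all parameters at once (the bias uses a constant input 1).
    new_theta = list(theta)
    for j in range(len(theta)):
        grad_j = sum((preds[i] - y[i]) * (1 if j == 0 else X[i][j - 1])
                     for i in range(len(X)))
        new_theta[j] = theta[j] - eta * grad_j

    print(new_theta[1])   # ~0.36 (0.361 in the worked solution, up to rounding)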