
UNIVERSITY OF WARWICK September Examinations 2021/22 Applications of Data Science
Time Allowed: 2 Hours
Answer ALL FOUR questions in Section A (10 marks each) and TWO questions from
THREE in Section B (30 marks each). Approved pocket calculators are allowed.


Read carefully the instructions on the answer book provided and make sure that the particulars required are entered on each answer book. If you answer more questions than are required and do not indicate which answers should be ignored, we will mark the requisite number of answers in the order in which they appear in the answer book(s): answers beyond that number will not be considered.
Section A: Answer FOUR questions
1. This question considers the various Model Selection approaches that you have learnt.
(a) Describe the Best Subset Selection approach. (5 marks)
Solution:
i Let M0 denote the null model containing no predictors except for a constant (i.e. predicting the mean).
ii For k = 1, …, p:
a) Fit all (p choose k) models that contain exactly k predictors.
b) Pick the best among these k-predictor models, calling it Mk. The best model is the one that has the smallest RSS or the largest R².
iii Select the single best model from the set M0, …, Mp. Here best is determined using cross-validated prediction error (e.g. MSE on a held-out validation set), or some measure of goodness of fit that adjusts for the fact that R² monotonically increases (and training RSS decreases) as k gets larger. (A short sketch of the procedure follows below.)
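A minimal Python sketch of this procedure, assuming X and y are NumPy arrays and each candidate model is fit by ordinary least squares; the helper name best_subset is illustrative, and the final cross-validated choice among M0, …, Mp is omitted:

import numpy as np
from itertools import combinations

def best_subset(X, y):
    """For each model size k, keep the predictor subset with the smallest RSS."""
    n, p = X.shape
    best = {0: ((), np.sum((y - y.mean()) ** 2))}   # M0: intercept-only model
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            Z = np.column_stack([np.ones(n), X[:, list(subset)]])
            coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = np.sum((y - Z @ coef) ** 2)
            if k not in best or rss < best[k][1]:
                best[k] = (subset, rss)
    return best

best_subset(X, y)[3] would then return the chosen three-predictor model M3 together with its RSS.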

(b) Suppose you perform best subset selection with six covariates X1, …, X6. Best subset selection identified the following sequence of best models Mk (listed below). After performing cross validation, you choose M∗3, since it is the model with the lowest cross-validated test MSE. Given the sequence Mk identified by best subset selection, would forward stepwise selection yield the same result, i.e. pick M∗3 as the best model? What about backward stepwise selection? Please explain. (5 marks)
M∗3 = {X1, X2, X3}
M4 = {X1, X2, X3, X5}
M5 = {X2, X3, X4, X5, X6}
M6 = {X1, X2, X3, X4, X5, X6}
Solution: The fact that M∗3 is the best model as determined by best subset selection means it is truly the best model and is to be preferred over the other models. Forward stepwise selection will yield the same result, since the models M1 ⊂ M2 ⊂ M∗3 are nested and M∗3 therefore lies on the search path of forward stepwise selection. However, backward stepwise selection is guaranteed to yield a different "best" model. Both best subset selection and backward stepwise selection would choose M5 ⊂ M6, throwing out X1 as a regressor. However, the best subset selected model M∗3 contains X1 again, so it no longer lies on the search path of backward stepwise selection, which has already thrown out X1 while M∗3 includes it again.
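For comparison, a rough sketch of the forward stepwise search path under the same assumptions (NumPy arrays X, y; the helper name forward_stepwise is illustrative):

import numpy as np

def forward_stepwise(X, y):
    # Greedy forward selection: at each step add the predictor that most reduces RSS.
    n, p = X.shape
    chosen, path = [], []
    for _ in range(p):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in chosen:
                continue
            cols = chosen + [j]
            Z = np.column_stack([np.ones(n), X[:, cols]])
            coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = np.sum((y - Z @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen = chosen + [best_j]
        path.append(tuple(chosen))
    return path   # nested models M1 ⊂ M2 ⊂ ... ⊂ Mp

Backward stepwise selection would instead start from the full model and drop one predictor at a time, which is why a model that re-introduces an already dropped regressor such as X1 can never appear on its search path.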
2. Classification
(a) What decision rule is the MAP decision rule? What does this rule maximize? (5 marks)
Solution: The MAP decision rule assigns whichever label has the highest posterior probability, i.e.
Ŷ = argmax_{y∈C} P̂(Y = y|X)
This is optimal because it minimizes the overall error rate. To see why, consider a binary prediction setting: if we observe the information X and assign the label that is more likely, the conditional error rate is
P(error|x) = min{P(+|x), P(−|x)}

This is the rule that minimizes the error jointly over all x,
P(error) = ∫ p(error, x) dx = ∫ P(error|x) p(x) dx,
since by choosing P(error|x) = min{P(+|x), P(−|x)} we choose the smallest feasible value of the integrand at every x.
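A small sketch of the MAP rule in Python, assuming the posterior probabilities P̂(Y = y|x) have already been estimated and are passed in as a dictionary (the function name and data structure are illustrative):

def map_decision(posteriors):
    # Return the label with the highest estimated posterior probability.
    return max(posteriors, key=posteriors.get)

# Binary example with P(+|x) = 0.3 and P(-|x) = 0.7:
print(map_decision({"+": 0.3, "-": 0.7}))   # predicts "-"; the error rate at this x is min(0.3, 0.7) = 0.3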
(b) Using the output provided below that is obtained from estimating the following logistic regression
g(p(x)) = logit p(x) = ln( p(x) / (1 − p(x)) ) = β0 + β1 x1² + β2 x1 x2 + β3 x2²,
Can you write down the equation for the best decision boundary that minimizes the overall prediction error? What shape is the decision boundary? [note that the estimated coefficients have been rounded to the nearest integer] (5 marks)
Solution: The MAP decision boundary for logistic regression is where the log odds are exactly equal to zero (i.e. p(x) = 0.5). The boundary appears non-linear, but is actually not, given that
−4 + x1² − 4x1x2 + 4x2² = 0.
This can be solved for x2(x1), as we can factorize it as
−4 + (x1 − 2x2)² = 0,
and hence the boundary consists of the two lines x2 = (1/2)x1 − 1 and x2 = (1/2)x1 + 1, with Ŷ = 1 predicted where x2 < (1/2)x1 − 1 or x2 > (1/2)x1 + 1. This is an artefact of the coefficients taking these precise (rounded) values, which is something that students should notice. The decision boundary is two parallel lines.
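The factorisation can be checked symbolically, for example with SymPy (an editorial illustration, not part of the model answer):

import sympy as sp

x1, x2 = sp.symbols("x1 x2")
log_odds = -4 + x1**2 - 4 * x1 * x2 + 4 * x2**2
print(sp.factor(log_odds))      # (x1 - 2*x2 - 2)*(x1 - 2*x2 + 2)
print(sp.solve(log_odds, x2))   # [x1/2 - 1, x1/2 + 1], i.e. two parallel lines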

3. (a) Describe the binary recursive splitting algorithm used to build regression trees. (4 marks)
Solution: Define two half planes as R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}.
Binary Recursive Splitting:
(a) Initialize with an empty tree T.
(b) For each j = 1, …, k, find the s that minimizes
RSSj = Σ_{i: xi ∈ R1(j,s)} (yi − ŷR1)² + Σ_{i: xi ∈ R2(j,s)} (yi − ŷR2)²
(c) Choose the j and corresponding optimal s̃ at which the value of RSSj is lowest and split the data at this point into the two half planes R1, R2.
(d) For each new half plane, repeat steps (b) and (c) so long as the stopping criterion is not met.
Students may use an illustration in a bivariate setting.
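A minimal Python sketch of the split search in steps (b)–(c), assuming NumPy arrays and a helper name best_split that is not part of the original solution:

import numpy as np

def best_split(X, y):
    # Greedy search over all (j, s) pairs for the split with the lowest RSS.
    best = (None, None, np.inf)
    n, k = X.shape
    for j in range(k):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s                 # R1(j, s)
            right = ~left                      # R2(j, s)
            if left.sum() == 0 or right.sum() == 0:
                continue
            rss = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best   # (feature index j, threshold s, resulting RSS)

A full tree builder would call this search recursively within each resulting half plane until a stopping criterion (for example a minimum node size) is reached.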
(b) Suppose you have built a tree of the following shape. You are also provided with the information that there is exactly one observation in each terminal node. The fitted values in the terminal nodes are provided as the values Ti. Define Di,1 = 1(Xi,1 < c1), Di,2 = 1(Xi,2 < c2), Di,3 = 1(Xi,3 < c3) and Di,4 = 1(Xi,4 < c4). What would be the estimated coefficients β̂0, β̂1, β̂2, β̂3, β̂4 if you estimate the specification below on the above data?
yi = β0 + β1 Di,1 + β2 Di,2 + β3 Di,3 + β4 Di,4
Solution: Students should realize that the above specification exactly replicates the tree and has to produce the same fitted values as the tree, as it is identical; the only difference is that there is a common group mean β0. Since there is exactly one observation in each terminal node, there is no issue with weighting that may distort some means. To work out what the βj are, they have to plug in:
For the terminal node T1 with group mean 50: D1 = 1, D2 = 1, D3 = 0, D4 = 1, so the model becomes yi = β0 + β1 + β2 + β4 = 50.
For the terminal node T2 with group mean 100: D1 = 1, D2 = 1, D3 = 0, D4 = 0, so the model becomes yi = β0 + β1 + β2 = 100.
For the terminal node T3 with group mean 9: D1 = 1, D2 = 0, D3 = 0, D4 = 0, so the model becomes yi = β0 + β1 = 9.
For the terminal node T4 with group mean 100: D1 = 0, D2 = 0, D3 = 1, D4 = 0, so the model becomes yi = β0 + β3 = 100.
For the terminal node T5 with group mean 200: D1 = 0, D2 = 0, D3 = 0, D4 = 0, so the model becomes yi = β0 = 200.
So this implies β̂0 = 200, β̂1 = −191, β̂2 = 91, β̂3 = −100, β̂4 = −50.
4. (a) You want to apply Ridge to the following linear model yi = α + β1 xi,1 + β2 xi,2, where yi represents an individual i's income in 1000s of dollars, xi,1 represents individual i's spending on healthcare in 1000s of dollars, and xi,2 represents his or her spending on education in 1000s of dollars. Can you write down the Ridge regression optimization problem? Do you need to worry about standardization? Why does Ridge regression not perform variable selection? (4 marks)
Solution: Standard setup of the Ridge regression optimization problem:
argmin_{α, β1, β2} Σ_i (yi − [α + β1 xi,1 + β2 xi,2])² + λ(β1² + β2²)
In this case we do NOT need to worry about standardization, as all xi's are measured in the same units. In this special case, Ridge regression is not biased due to different coefficient scales. To make the case that Ridge regression does not perform variable selection, students may highlight that this is a smooth optimization problem with a convex objective function and convex constraint. The solution is derived through first order conditions, meaning that generically no coefficient will be estimated to be exactly equal to zero.
(b) You have been told that the underlying features xi,1 and xi,2 have been rescaled to be measured not in 1000s of dollars but in dollars – i.e. x̃i,1 = 1000 xi,1 and x̃i,2 = 1000 xi,2. Using the same training and validation sets along with the same values for λ, will the resulting optimal model from (b) have a different test MSE compared to the one obtained in (a)? (6 marks)
Solution: The solution will be IDENTICAL in terms of test MSE model performance. The rescaling of the x's leaves intact the fact that all x̃i's are measured in the same units; the yi has not been rescaled, though. As a result the coefficient sizes will rescale by the common factor to reflect this, but since the scaling is the same for both features there is no issue, and the ensuing optimal models will yield numerically identical performance.
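A brief sketch of how such a Ridge fit could be run in Python; scikit-learn's Ridge and the toy data below are illustrative assumptions, not part of the exam:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Toy data: columns are healthcare and education spending (both in 1000s of dollars).
X = rng.uniform(0, 10, size=(200, 2))
y = 20 + 3 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 2, size=200)   # income in 1000s
X_train, X_val, y_train, y_val = X[:150], X[150:], y[:150], y[150:]

# Ridge penalises beta1^2 + beta2^2 but never sets a coefficient exactly to zero.
for lam in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    print(lam, model.coef_.round(3), round(mse, 2))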
Section B: Answer TWO questions
5. Logistic regression with shrinkage or regularization. You are presented with the following cloud of points for a binary classification problem.
(Figure: the 2-dimensional labeled training set, where '+' corresponds to class y = 1 and 'O' corresponds to class y = 0.)
(a) You are given the choice of a logistic regression, KNN and SVM to build a classifier on the above training data. What would you expect to be the training error in each? Which method would you prefer? Why? (4 marks)
(b) You decide to run logistic regression, i.e. modelling the log odds of an individual observation i as:
log( P(yi = 1|xi) / (1 − P(yi = 1|xi)) ) = β0 + β1 xi1 + β2 xi2
What is the probability of observing an individual observation i? Can you write down the Likelihood function L(β|X) that logistic regression maximizes? What is your assumption as you write down the likelihood? [Hint: you can use shorthand to write the probability of observing Y = 1|X for an individual observation i as p(xi) = P(yi = 1|xi)] (6 marks)
(c) Please derive the First Order Condition with respect to β1. (4 marks)
(d) Suppose now you are asked to estimate logistic regression on the above training data, but you add a Lasso shrinkage penalty, where you maximize the likelihood but add a penalty term:
Σ_{i=1}^{n} log P(yi | xi, β0, β1, β2) − λ(|β1| + |β2|)
As we increase the regularization parameter λ, what do you think will happen to the estimated β̂1 and β̂2? Please explain your answer. (8 marks)
(e) Suppose λ is very large; what do you expect to be the value of β0? (6 marks)
(a) Solution: The training error for each method will be exactly zero; for kNN, however, this is only the case if k is sufficiently small (less than 3). The data is perfectly separated and hence there is no objectively "better" method. Yet, given the specific training data, it may make sense to opt for a more parametric method like SVM or logistic regression, as these actually provide a functional form and are much less of a black box.
(b) Solution: We need to assume that the joint distribution of the data factorizes, i.e. that the individual observations are class-conditionally independent from one another. The likelihood of observing an individual observation i is given as
p(xi)^yi (1 − p(xi))^(1−yi)
The likelihood of observing n observations is the product
L(β|X) = Π_i p(xi)^yi (1 − p(xi))^(1−yi)
(c) Solution: Taking logs of the likelihood yields
log L(β|X) = Σ_i [ yi log(p(xi)) + (1 − yi) log(1 − p(xi)) ]
There are three coefficients to estimate, so the FOC is a vector. However, students can just differentiate with respect to β0 and β1. They may use the trick that for the logistic function h′(z) = h(z)(1 − h(z)). Students may remember, or can rearrange the log-odds to arrive at, an expression for P(Y = 1|X) as
p(xi) = exp(β0 + β1 xi1 + β2 xi2) / (1 + exp(β0 + β1 xi1 + β2 xi2))
The FOC with respect to β1 is:
∂ log L(β|X)/∂β1 = Σ_i [ (yi / p(xi)) p′(xi) xi1 − ((1 − yi)/(1 − p(xi))) p′(xi) xi1 ] = 0
Rewrite as:
Σ_i [ yi / p(xi) − (1 − yi)/(1 − p(xi)) ] p′(xi) xi1 = 0
Using the fact that h′(z) = h(z)(1 − h(z)), this simplifies to
Σ_i [ yi − p(xi) ] xi1 = 0
(d) Solution: The data can be classified with zero training error, and therefore also with high log-probability, by looking at the value of x2 alone, i.e. making β1 = 0 comes at no cost. So as λ increases, we would expect that β1 is shrunk to zero first. As λ increases further, β2 will also become zero: we pay a higher and higher cost for setting β2 to a non-zero value, and eventually this cost overwhelms the gain in the log-probability of the labels that we can achieve with a non-zero β2.
(e) Solution: For very large λ, β1 = 0 and β2 = 0. That is, we ignore the information contained in the x1 and x2 features. In this case, the best we can do is to assign classes based on our estimated priors. We can estimate the priors from the data: there is the same number of O and +, suggesting that our best estimates for the priors are P̂(Y = 1) = 0.5 and P̂(Y = 0) = 0.5. The optimal choice is thus the value of β0 that assigns this same probability of 0.5 to either label:
log( P(yi = 1|xi) / (1 − P(yi = 1|xi)) ) = β0 + β1 xi1 + β2 xi2 = 0
Since for large λ, β1 = 0 and β2 = 0,
log( P(yi = 1|xi) / (1 − P(yi = 1|xi)) ) = β0 = 0
This holds only if β0 = 0, so we expect that β0 = 0.
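A small illustration of this shrinkage behaviour, assuming scikit-learn's L1-penalised logistic regression and a synthetic two-feature dataset in which only x2 is informative (all names and values here are illustrative, not from the exam):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
y = (x2 > 0).astype(int)        # the class is determined by x2 alone
x1 = rng.normal(size=n)         # x1 carries no information about y
X = np.column_stack([x1, x2])

# C is the inverse of the penalty strength lambda: small C means heavy shrinkage.
for C in [10.0, 1.0, 0.1, 0.01]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    print(C, clf.coef_.round(3), clf.intercept_.round(3))
# beta1 is driven to zero first; for very small C both coefficients reach zero.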
6. You are working for HMRC (Her Majesty's Revenue and Customs) and are asked to use the data from past tax returns to build a classifier that helps the government better use their resources to investigate tax returns. Specifically, you want to ensure that you only investigate those tax returns that, historically, were found to have involved cases of tax evasion. You are given the following training data.
(a) What kind of classification approach would you think is most appropriate given the data? (4 marks)
Solution: Any classification approach would do, given that the features are mostly categorical. These are easy points; there is nothing particularly special about the data.
(b) Suppose you wanted to use Naive Bayes to build a classifier. Can you explain how the Naive Bayes classifier works? What is the simplification assumption that it employs? Feel free to illustrate or use formal notation. (7 marks)
Solution: The general classification problem requires that we assign the label Y to a data point such that the probability of observing that particular label y, conditional on the data X, is maximal. Formally, this Maximum A Posteriori decision rule is written as:
Ŷ = argmax_{y∈C} P̂(Y = y|X)
The starting point for a Naive Bayes classifier is Bayes' rule, which allows us to write down the conditional probability as:
Ŷ = argmax_{y∈C} P̂(X|Y) P̂(Y) / P̂(X)
Given that for each given data point (yi, Xi) the denominator is identical, the optimization problem is equivalent to:
Ŷ = argmax_{y∈C} P̂(X|Y) P̂(Y)
Typically, our data matrix X has many columns p, so we need to think about estimating the joint distribution conditional on class Y:
P̂(X|Y) = P̂(x1, …, xp|Y)
The Naive Bayes assumption states that x1, …, xp are independent from one another conditional on the class Y. This allows us to write
P̂(X|Y) = P̂(x1, …, xp|Y) = Π_i P̂(xi|Y),
with the Naive Bayes classifier becoming
Ŷ = argmax_{y∈C} Π_i P̂(xi|Y) P̂(Y)
Students may say a bit about why this makes sense: its computational feasibility, as opposed to estimating the unfactorized joint distribution.
(c) Suppose that the "taxable income" data column is turned into a discrete variable with three brackets: L (low) for income less than 100k, M for income between 100k and 200k, and H (high) for income above 200k. With this refinement, what is the total number of possible realizations that (yi, xi1, …, xi3) could take? (4 marks)
Solution: Simple counting exercise: Refund = 2, Marital status = 3, Income = 3, Evade = 2, so the total set of possible realizations is 2 × 3 × 3 × 2 = 36.
(d) Based on the training data, provide a Naive Bayes prediction for yj (evade: no or yes) for a new test observation defined as the vector (xj1, …, xj3) = (Yes, Single, H). Please indicate and explain your steps, especially how you estimate the different inputs. (8 marks)
Solution: The Naive Bayes classifier is
Ŷ = argmax_{y∈C} Π_i P̂(xi|Y) P̂(Y)
Here we need to compute:
P̂(yes) P̂(xj,1 = Yes|yes) P̂(xj,2 = Single|yes) P̂(xj,3 = H|yes) = 3/10 × 1/3 × 1/3 × 2/3 = 1/10 × 2/9 = 2/90
P̂(no) P̂(xj,1 = Yes|no) P̂(xj,2 = Single|no) P̂(xj,3 = H|no) = 7/10 × 3/7 × 2/7 × 1/7 = 3/10 × 2/49 = 6/490
Since 2/90 > 6/490, the score for "yes" is higher, so we assign the label that the person has evaded.
(e) Suppose you wanted to retain the fact that the income variable xi,3 is actually a numeric variable (and not a discrete one). What kind of assumptions on the generative process for xi,3 could you make to estimate P(xi,3|yes) and P(xi,3|no)? (7 marks)
Solution: If the features are numeric, what we can do is assume they are generated by a normal distribution and estimate the mean and variance parameters (μ_yes, σ_yes) and (μ_no, σ_no) from the subsets of the data where evade = yes and evade = no, respectively. Given that we know the PDF of a normal distribution, we can use it to obtain the density values P(xi,3|yes) and P(xi,3|no).
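A compact sketch of the categorical Naive Bayes computation in part (d); since the training table is not reproduced in the extracted text, the probability estimates are taken directly from the worked solution above:

def nb_score(prior, conditionals):
    # Unnormalised Naive Bayes posterior score: P(Y) * product of P(x_i | Y).
    score = prior
    for p in conditionals:
        score *= p
    return score

score_yes = nb_score(3 / 10, [1 / 3, 1 / 3, 2 / 3])
score_no = nb_score(7 / 10, [3 / 7, 2 / 7, 1 / 7])
print(round(score_yes, 4), round(score_no, 4))                     # 0.0222 vs 0.0122
print("predict evade =", "yes" if score_yes > score_no else "no")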
7. Food delivery optimization
(a) You are working for an IT startup that specializes in food delivery. Customer satisfaction requires food orders to be delivered on time to the clients. You have access to a dataset of historical orders, indexed by i. For each order, you have defined an indicator as:
yi = 1 if the order is delivered early or on time; yi = 0 if the order is delayed.
You have access to a database of 100,000 orders. You use a random set of 99,500 orders to train a model and hold out 500 orders for validation. You also know a lot of information about each individual order – these features are stored in the variables xi,1, …, xi,p. In what type of framework are you – classification or regression? What are some of the methods that you may want to use? (4 marks)
Solution: This is a classification setup as the dependent variable yi is categorical (2 marks). Give full marks if students think of the setting as one where we want to predict delay times – i.e. the number of minutes an order is delayed relative to an ex-ante delivery time estimate. The context is one of classification and all methods we discussed are likely suitable – kNN, SVM, logistic regression, classification trees, etc. Students may point out that in this setting we do not care about producing an interpretable model and so may err towards choosing a modelling technique that may produce black-box but accurate models.
(b) You decide to run logistic regression, i.e. modelling the log odds of an individual order i being delayed as:
log( P(yi = 1|xi) / (1 − P(yi = 1|xi)) ) = β0 + Σ_{j=1}^{p} βj xij
What is the probability of observing an individual observation i? Can you write down the Likelihood function L(β|X) that logistic regression maximizes? What is your assumption as you write down the likelihood? [Hint: you can simply use shorthand to write the probability of observing yi = 1 for an individual order i as p(xi) = P(yi = 1|xi)] (6 marks)
Solution: We need to assume that the joint distribution of the data factorizes, i.e. that the individual observations are independent from one another. The likelihood of observing an individual observation i is given as
p(xi)^yi (1 − p(xi))^(1−yi)
The likelihood of observing n observations is the product
L(β|X) = Π_i p(xi)^yi (1 − p(xi))^(1−yi)
Students may directly write down the log likelihood.
(c) Please derive the First Order Condition with respect to β1.
Solution: Taking logs of the likelihood yields
log L(β|X) = Σ_i [ yi log(p(xi)) + (1 − yi) log(1 − p(xi)) ]
There are several coefficients to estimate, so the FOC is a vector. However, students can just differentiate with respect to β0 and β1. They may use the trick that for the logistic function h′(z) = h(z)(1 − h(z)). Students may remember, or can rearrange the log-odds to arrive at, an expression for P(Y = 1|X) as
p(xi) = exp(β0 + Σ_{j=1}^{p} βj xij) / (1 + exp(β0 + Σ_{j=1}^{p} βj xij))
The FOC with respect to β1 is:
∂ log L(β|X)/∂β1 = Σ_i [ (yi / p(xi)) p′(xi) xi1 − ((1 − yi)/(1 − p(xi))) p′(xi) xi1 ] = 0
Rewrite as:
Σ_i [ yi / p(xi) − (1 − yi)/(1 − p(xi)) ] p′(xi) xi1 = 0
Using the fact that h′(z) = h(z)(1 − h(z)), this simplifies to
Σ_i [ yi − p(xi) ] xi1 = 0
(d) You are estimating your logistic regression and evaluate t
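As a hedged numerical check of the first order condition derived in (c), the sketch below compares Σ_i [yi − p(xi)] xi1 with a finite-difference derivative of the log likelihood; the synthetic data and the evaluation point are illustrative assumptions, not part of the exam:

import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 2
X = rng.normal(size=(n, p))
true_beta = np.array([0.5, -1.0, 2.0])                 # [beta0, beta1, beta2]
y = rng.binomial(1, 1 / (1 + np.exp(-(true_beta[0] + X @ true_beta[1:]))))

def log_lik(beta):
    p_x = 1 / (1 + np.exp(-(beta[0] + X @ beta[1:])))
    return np.sum(y * np.log(p_x) + (1 - y) * np.log(1 - p_x))

beta = np.array([0.1, 0.3, -0.2])                      # arbitrary evaluation point
p_x = 1 / (1 + np.exp(-(beta[0] + X @ beta[1:])))
analytic = np.sum((y - p_x) * X[:, 0])                 # FOC expression for beta1
eps = 1e-6
numeric = (log_lik(beta + [0, eps, 0]) - log_lik(beta - [0, eps, 0])) / (2 * eps)
print(analytic, numeric)                               # the two values should agree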