Data 100, Midterm 2 Fall 2019
Name:
Email: @berkeley.edu
Student ID:
Exam Room:
All work on this exam is my own (please sign):
Instructions:
• This midterm exam consists of 100 points and must be completed in the 80 minute time period ending at 9:30, unless you have accommodations supported by a DSP letter.
• Note that some questions have circular bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must fully shade in the box/circle. Check marks will likely be mis-graded.
• You may use two cheat sheets each with two sides.
• Please show your work for computation questions as we may award partial credit.
Data 100 Midterm 2, Page 2 of 25 November 13, 2019
Reference Table
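exp(x): $e^x$
log(x): $\log_e(x)$ or $\ln(x)$
Linear regression model: $\hat{y} = f_{\vec{\hat{\beta}}}(\vec{x}) = \vec{x}^T \vec{\hat{\beta}}$
Logistic (or sigmoid) function: $\sigma(t) = \frac{1}{1 + \exp(-t)}$
Logistic regression model: $\hat{y} = f_{\vec{\hat{\beta}}}(\vec{x}) = P(Y = 1 \mid \vec{x}) = \sigma(\vec{x}^T \vec{\hat{\beta}})$
Squared error loss: $L(y, \hat{y}) = (y - \hat{y})^2$
Absolute error loss: $L(y, \hat{y}) = |y - \hat{y}|$
Cross-entropy loss: $L(y, \hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$
Model Bias: $E[f_{\vec{\hat{\beta}}}(\vec{x})] - g(x)$
Model Variance: $E[(f_{\vec{\hat{\beta}}}(\vec{x}) - E[f_{\vec{\hat{\beta}}}(\vec{x})])^2]$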
0 Howdy
[0 pts] In LASSO regression, LASSO is an acronym. What does it stand for?
Solution: Least Absolute Shrinkage and Selection Operator
1 PCA
A children's zoo collects data about how much time 1000 visitors spend at each of 8 selected exhibits and stores them in a dataframe df_zoo. These exhibits include 6 animals and 2 activities (train and playground). An example row of df_zoo is given below.
(a) [3 Pts] Suppose we center and scale df_zoo (as we learned about in class) to form the design matrix X. X has 1000 rows and 8 columns exactly corresponding to the dataframe described above, except that it has been centered and scaled. Suppose we then use SVD to decompose X into U, Σ, and V^T. Suppose that we want to compute the principal component matrix P, where the 1st column of P is the 1st principal component, the 2nd column of P is the 2nd principal component, etc. Which of the following expressions are equal to P? Select all that apply.
U Σ V^T X UX UΣ XU XΣ XV
Solution: Recall that the SVD gives $X = U \Sigma V^T$ and $P = U \Sigma = XV$.
(b) [2 Pts] How many rows and columns are in P?
# rows = # columns =
Solution: 1000 rows, 8 columns. 1000 data points, 8 principal components per data point.
(c) i. [3 Pts] What is the total variance V of our centered and scaled design matrix X? If there is not enough information provided in the problem statement, write "not enough information."
answer =
Solution: V = 8
ii. [3 Pts] Suppose our first 6 singular values are 56, 53, 21, 20, 20, 19. What fraction of the variance is captured by the first two principal components? Do not carry out any arithmetic operations; just give us a numerical expression that could be evaluated into the correct answer. Regardless of your answer to the previous problem, you may assume that you know V, and may give your answer for this problem in terms of V. If there is not enough information, write "not enough information."
answer =
Solution: $\frac{56^2 + 53^2}{1000 V}$
The variance described by the first 2 PCs is $\frac{56^2 + 53^2}{1000}$ and the total variance is V.
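The identities used in parts (a) through (c) can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using randomly generated numbers as a stand-in for df_zoo (the real data are not given here). The function name `pca_summary` is hypothetical, and the sketch assumes each column is scaled by its population standard deviation (dividing by n), which is what makes the total variance come out to 8.

```python
import numpy as np

def pca_summary(X_raw):
    """Center and scale X_raw, then return the scaled matrix, the principal
    component matrix P, V^T, and the variance captured by each component."""
    # Assumption: scale with the population standard deviation (ddof=0).
    X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = U * s                        # same as U @ np.diag(s)
    variance_per_pc = s ** 2 / X.shape[0]
    return X, P, Vt, variance_per_pc

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(1000, 8))   # synthetic stand-in for df_zoo
X, P, Vt, variance_per_pc = pca_summary(X_raw)
total_variance = variance_per_pc.sum()

# Part (c) ii with the singular values from the exam and V = 8.
fraction_first_two = (56 ** 2 + 53 ** 2) / (1000 * 8)
```

Here `P` has one row per visitor and one column per principal component, and `X @ Vt.T` gives the same matrix as `U * s`.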
(d)
[6 Pts] Below is a 2D scatterplot of the first two principal components. We see that there appear to be 3 types of visitors, grouped on the top, bottom-left, and bottom-right.
Below are plots of the first and second rows of V T .
Use these plots to describe the characteristics of each of the 3 groups in the scatterplot above. Your explanations should only be a sentence or two.
Top group description:
Bottom-left group description:
Bottom-right group description:
Solution: The group with positive pc2 values and pc1 values around 0 (top group) seems to represent visitors who spend a lot of time at the activities (train and playground). The group with negative pc1 values and negative pc2 values (bottom-left group) seems to represent visitors who spend a lot of time at the mammal exhibits (cheetah, tiger, and lion). The group with positive pc1 values and negative pc2 values (bottom-right group) seems to represent visitors who spend a lot of time at the reptile exhibits (turtle, iguana, and alligator).
2 Linear Regression
Suppose we have a data set of 100 points whose first few rows are shown below, and that we'd like to predict $\vec{y}$ from $\vec{v}$ and $\vec{w}$. Suppose we create a design matrix X whose first column is $\vec{v}$, second column is $\vec{w}$, and third column $\vec{u}$ is a new feature $u_i = |v_i|$. The resulting model is $\hat{y}_i = \beta_1 v_i + \beta_2 w_i + \beta_3 |v_i|$. The top row is row 1, e.g. $y_1 = 4$.
y    v     w
4    -30   1
6    -40   2
5    20    3
(a) [3 Pts] For the data above, suppose we arbitrarily pick $\vec{\beta} = [0.1, 12, 0.2]^T$. What is $\hat{y}_1$? $\hat{y}_1$ =
Solution: $\hat{y}_1 = 0.1 \cdot (-30) + 12 \cdot 1 + 0.2 \cdot |-30| = -3 + 12 + 6 = 15$
(b) [2 Pts] For the data above, let $\vec{e}$ be the residual vector if $\vec{\beta} = [0.1, 12, 0.2]^T$. What is $|e_1|$? $|e_1|$ =
Solution: $|e_1| = |y_1 - \hat{y}_1| = |4 - 15| = 11$
(c) [3 Pts] For the data above, suppose that $\vec{e} \cdot \vec{e} = 9$. What is the MSE? MSE =
Solution: Note, $\vec{e} \cdot \vec{e} = ||\vec{e}||_2^2 = \sum_{i=1}^n e_i^2$. Then, since MSE $= \frac{1}{n} \sum_{i=1}^n e_i^2$, the MSE is $\frac{1}{100} \cdot 9 = \frac{9}{100}$.
(d) [3 Pts] Let $\vec{\hat{\beta}}$ be the exact parameter vector that minimizes the empirical L2 risk, where we write this risk as $R(\vec{\beta}, X, \vec{y})$. Also, let $\vec{e}$ be the residuals for the optimal parameter vector $\vec{\hat{\beta}}$. Which of the following quantities are guaranteed to be zero?
☐ $\sum_i e_i$
☐ The MSE
☐ $\nabla_{\vec{\beta}}(R(\vec{\hat{\beta}}, X, \vec{y}))$
☐ $\vec{e} \cdot \hat{y}$
☐ $\vec{e} \cdot \vec{\hat{\beta}}$
☐ None of these
Solution:
i. The first option is not correct. The residuals are only guaranteed to sum to zero when we have an intercept term, which our model does not.
ii. This is also not correct. It would require all our residuals to equal 0, which is usually not the case.
iii. This is correct. At the point at which our empirical risk R is minimized, the gradient of R with respect to $\vec{\beta}$ is guaranteed to be 0. Note: if we use a numerical technique, like gradient descent, the value of our gradient for our estimated value of $\vec{\beta}$ may not be exactly 0, but in this question we're dealing with an exact value for $\vec{\hat{\beta}}$.
iv. This is also correct. We know that $\vec{e}$ is orthogonal to the span of X. Since $\hat{y} = X\vec{\hat{\beta}}$, we know that $\hat{y} \in \text{span}\{X\}$, and thus $\hat{y}$ must also be orthogonal to the residuals.
v. This option is not correct, since there's no direct relationship between $\vec{e}$ and $\vec{\hat{\beta}}$ that doesn't involve $\hat{y}$ or X.
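The guarantees discussed in this solution can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic data set of 100 points (the full exam data set is not given, so the values of `v`, `w`, and `y` are made up for illustration). It fits the model without an intercept using the normal equations, then evaluates the gradient of the empirical risk and the dot product of the residuals with the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
v = rng.normal(size=n)
w = rng.normal(size=n)
y = 2.0 + v + 0.5 * w + rng.normal(size=n)   # synthetic response, for illustration

# Design matrix with columns v, w, |v| and no intercept column.
X = np.column_stack([v, w, np.abs(v)])

# Exact minimizer of the empirical L2 risk via the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

gradient = -2.0 / n * (X.T @ e)    # gradient of the MSE evaluated at beta_hat
e_dot_y_hat = e @ y_hat            # residuals are orthogonal to the predictions
sum_of_residuals = e.sum()         # generally not zero without an intercept
mse = np.mean(e ** 2)              # generally not zero either

# Part (a): the prediction for the first row shown in the exam table.
y1_hat = np.array([-30, 1, abs(-30)]) @ np.array([0.1, 12, 0.2])
```

Up to floating-point error, `gradient` and `e_dot_y_hat` are zero, while `sum_of_residuals` and `mse` carry no such guarantee for this model.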
(e) [1 1/2 Pts] For the data above, the matrix X has full rank (i.e. no columns are linear combinations of any others). Suppose we compute $Z = (X^T X)^{-1} X^T \vec{y}$. What is Z? Select one and fill in its blank.
⃝ It is a vector of length ____.
⃝ It is a matrix with ____ rows and ____ columns.
⃝ It does not exist because $|v_i|$ is not differentiable.
Solution: We know that $\vec{\hat{\beta}} = (X^T X)^{-1} X^T \vec{y}$. Thus, this quantity must be a vector, with length equal to the number of features, which in this case is 3.
(f) [5 Pts] Let $\vec{\beta}_{ridge}$ be the $\vec{\beta}$ that minimizes the sum of the MSE plus an L2 regularization term for a positive λ. Let $\vec{e}$ be the residuals for the parameter vector $\vec{\beta}_{ridge}$. Which of the following are true? Recall that $||\vec{\beta}||_2^2$ is the sum of the squares of the components of $\vec{\beta}$ and R is the empirical L2 risk defined in (d).
☐ $\sum_i e_i = 0$
☐ $\nabla_{\vec{\beta}}(R(\vec{\beta}_{ridge}, X, \vec{y})) = 0$
☐ $R(\vec{\hat{\beta}}, X, \vec{y}) \le R(\vec{\beta}_{ridge}, X, \vec{y})$
☐ $||\vec{\beta}_{ridge}||_2^2 \le ||\vec{\hat{\beta}}||_2^2$
☐ None of these
Solution:
i. This is not correct. We still don't have an intercept column, so the residuals are not guaranteed to sum to zero. Even if we did, the ridge penalty applies to every component of $\vec{\beta}$, and the ridge solution fits the training set no better than $\vec{\hat{\beta}}$ (its sum of squared residuals is at least as large), so the guarantee would still not hold.
ii. This is also not correct. $\vec{\hat{\beta}}$ is the unique value of $\vec{\beta}$ that minimizes R. $\vec{\beta}_{ridge}$ doesn't minimize R (it instead minimizes a regularized risk), and so the gradient of R is not equal to 0 when evaluated at $\vec{\beta}_{ridge}$.
iii. This is true. Regularizing our model makes our predictions worse on our training set (in hopes that the model generalizes better to unseen data). As a result, the empirical risk of our regularized model is greater than (or equal to) the empirical risk of our unregularized model.
iv. This is also true. The objective function for ridge regression includes a penalty on the L2 norm of $\vec{\beta}$, in order to decrease the norm of $\vec{\beta}$. This option follows from that principle.
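Statements iii and iv can be illustrated with a short numerical comparison. The following is a minimal sketch, assuming NumPy, synthetic data, and the closed-form ridge expression $(X^T X + n\lambda I)^{-1} X^T \vec{y}$ that appears later in this exam. The function names `fit_ridge` and `empirical_risk`, and the choice λ = 0.5, are assumptions made for illustration.

```python
import numpy as np

def empirical_risk(beta, X, y):
    """Empirical L2 risk (MSE) of the linear model with parameters beta."""
    return np.mean((y - X @ beta) ** 2)

def fit_ridge(X, y, lam):
    """Minimize MSE + lam * ||beta||^2 in closed form (lam = 0 gives least squares)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                  # synthetic design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

beta_hat = fit_ridge(X, y, lam=0.0)            # unregularized solution
beta_ridge = fit_ridge(X, y, lam=0.5)          # regularized solution

risk_hat = empirical_risk(beta_hat, X, y)
risk_ridge = empirical_risk(beta_ridge, X, y)
norm_hat = np.sum(beta_hat ** 2)
norm_ridge = np.sum(beta_ridge ** 2)
```

On the training data, `risk_hat` is no larger than `risk_ridge`, and `norm_ridge` is no larger than `norm_hat`.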
3 Bias-Variance Tradeoff
We obtain n data points (n is some large fixed integer) which have been generated from the true model $Y = f(x) + \varepsilon$, where $\varepsilon$ is random noise ($E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$).
We fit linear models of varying complexity to our data, and plotted the bias, variance, and irreducible error below.
(a) [1 1/2 Pts] Sketch the MSE on the above graph. Where does its minimum occur? Draw a star on your MSE plot where the minimum occurs.
Solution:
MSE = (Model Bias)2 + Model Variance + Irreducible error
Irreducible error is also known as σ2, i.e. the variance of the noise term ε.
An approximately U-shaped curve should be drawn where each point on the curve is the sum of the three curves/lines. The minimum occurs at the minimum of this drawn curve.
(b) [1 Pt] Suppose we control the complexity of the linear models using a Ridge penalty term $\lambda \sum_i \beta_i^2$. Which of the following is true?
⃝ The left side of the graph represents small λ.
⃝ The right side of the graph represents small λ.
Solution: A smaller λ value means higher model complexity. Remember that a zero λ value means a model with no regularization.
(c) [3 Pts] Which of the following can impact our model variance? Select all that apply.
☐ The regularization coefficient λ.
☐ The choice of features to include in our design matrix.
☐ The learning rate α in gradient descent.
☐ The size of the training set.
Solution:
i. A higher λ value means more regularization, which reduces model variance.
ii. Including a large number of uninformative features may lead to overfitting, which in turn will increase the model variance.
iii. The learning rate α in gradient descent won't impact the model's bias-variance tradeoff; gradient descent is simply a numerical method that is used to fit the model to data.
iv. Generally, a larger training set will reduce model variance.
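The effect of λ on variance can be made concrete for ridge regression. The following is a minimal sketch, assuming NumPy, a fixed synthetic design matrix, and noise variance 1. It uses the total variance of the fitted coefficients as a simple proxy for model variance; that proxy and the function name `ridge_coefficient_variance` are assumptions for illustration, not something defined in this exam.

```python
import numpy as np

def ridge_coefficient_variance(X, lam, sigma2=1.0):
    """Total variance of the ridge coefficients for a fixed design matrix X.

    With beta_ridge = A X^T y and A = (X^T X + n*lam*I)^{-1}, the covariance
    of beta_ridge is sigma2 * A X^T X A when the noise has variance sigma2.
    """
    n, d = X.shape
    A = np.linalg.inv(X.T @ X + n * lam * np.eye(d))
    covariance = sigma2 * A @ X.T @ X @ A
    return np.trace(covariance)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))          # synthetic design matrix
lambdas = [0.0, 0.1, 1.0, 10.0]
variances = [ridge_coefficient_variance(X, lam) for lam in lambdas]
```

Under these assumptions the entries of `variances` shrink as λ grows, which matches the claim that more regularization reduces model variance.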
4 Cross Validation
Suppose we have a training dataset of 90 points, and a test set of 30 points, and want to know which λ value is best for a ridge regression model. Our candidate hyperparameters are λ = 0.1, λ = 1, and λ = 10.
(a) [2 1/2 Pts] A DS100 student suggests performing 10-fold cross validation to find the optimal λ. Is the choice of 10-fold CV reasonable?
⃝ Yes.
⃝ No, since we have 3 candidate hyperparameters we should use 3-fold cross validation.
⃝ No, since we have 30 test points, we should use 30-fold cross validation.
⃝ No, CV should never be used for selecting hyperparameters.
Solution:
i. With a (relatively small) dataset of 90 points, 10-fold CV is reasonable. We will be computing (10 folds) * (3 choices of λ) = 30 validation errors, each of which is obtained by training a ridge regression model on some portion of the 90 training data points and testing on the remainder of the 90 points we didn’t use for training. This answer must also be the solution because the other statements are not correct/logical.
ii. In general, there is no rule saying that we have to use the same number of folds as there are choices of hyperparameters. The number of folds is completely separate from the number of hyperparameters.
iii. The test data is not considered at all for CV, so there is no relationship between any property of the test data and the number of folds in CV.
iv. CV is the only method taught in this class for selecting hyperparameters, so this statement is incorrect.
(b) Suppose we select the best choice of λ from the three choices available using 3-fold cross validation. As mentioned in class, we can compute the optimal parameters for a ridge regression model with the expression $\vec{\beta} = (X^T X + n\lambda I)^{-1} X^T \vec{y}$. Assume that we use this closed-form expression to fit the parameters for our model.
i. [2 Pts] During the entire process of selecting our best λ, how many total times will we evaluate the expression $(X^T X + n\lambda I)^{-1} X^T \vec{y}$?
⃝1 ⃝2 ⃝3 ⃝6 ⃝9 ⃝30 ⃝60 ⃝90 ⃝270
ii. [2 Pts] How many rows will be in X each time this expression is evaluated?
⃝1 ⃝2 ⃝3 ⃝6 ⃝9 ⃝30 ⃝60 ⃝90 ⃝120 ⃝ It will vary each time. ⃝ Not enough information.
Solution:
i. Note that computing $(X^T X + n\lambda I)^{-1} X^T \vec{y}$ is training (or "fitting") a model with data matrix X and a regularization parameter λ. In CV, we train a model on each fold for each value of λ. Thus, we have to train (3 folds) * (3 choices of λ) = 9 models.
ii. Since we are doing 3-fold CV, we split our training data into 3 parts of equal size. For each fold, 2 of these parts will be used for training the model and 1 part will be used for validation. Since our training data has 90 points, each part will have 30 points. Since 2 parts are used to train each model, the X matrix will have 60 points and therefore 60 rows.
(c) As in the previous part, suppose we want to select the best λ from the three choices above using 3-fold cross validation. To evaluate the MSE for a given $\vec{\beta}$, we use the sum of squares: $||\vec{y} - X\vec{\beta}||^2$. Reminder that this expression is just another way of writing $\sum_i (y_i - \vec{x}_i^T \vec{\beta})^2$.
i. [2 Pts] During the entire process of selecting our best λ, how many times will this expression get evaluated?
⃝1 ⃝2 ⃝3 ⃝6 ⃝9 ⃝30 ⃝60 ⃝90
ii. [2 Pts] How many rows will be in X each time this expression is evaluated?
⃝1 ⃝2 ⃝3 ⃝6 ⃝9 ⃝30 ⃝60 ⃝90 ⃝120 ⃝ It will vary each time. ⃝ Not enough information.
Solution:
i. Note that computing the MSE of a model on some data is evaluating the model’s error on that data. In CV, we are interested in knowing each model’s error on each fold. Remember that we have a different model for each of 3 choices of λ and that we have 3 folds. Thus, we will be computing the MSE (3 choices of λ) * (3 folds) = 9 times.
ii. Since we are doing 3-fold CV, we split our training data into 3 parts of equal size. For each fold, 2 of these parts will be used for training the model and 1 part will be used for validation. Validation is the process of computing the error of a model on a particular fold. Since our training data has 90 points, each part will have 30 points. Since 1 part is used for validation, the X matrix will have 30 points and therefore 30 rows.
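The counting arguments in parts (b) and (c) can be traced in code. The following is a minimal sketch, assuming NumPy and a synthetic training set of 90 points (the exam gives only the sizes, so the data here are made up). The helper names `fit_ridge` and `sum_of_squares` are hypothetical. The loop records how many rows X has each time the fitting expression and the error expression are evaluated.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge fit: (X^T X + n*lam*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def sum_of_squares(beta, X, y):
    """||y - X beta||^2 on the given rows."""
    return np.sum((y - X @ beta) ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=(90, 2))                     # synthetic training set
y = X @ np.array([1.0, -1.0]) + rng.normal(size=90)

lambdas = [0.1, 1, 10]
k = 3
folds = np.array_split(rng.permutation(90), k)   # 3 folds of 30 row indices

fit_rows, validation_rows, cv_error = [], [], {}
for lam in lambdas:
    fold_errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = fit_ridge(X[train_idx], y[train_idx], lam)
        fit_rows.append(len(train_idx))          # rows of X when fitting
        error = sum_of_squares(beta, X[val_idx], y[val_idx]) / len(val_idx)
        validation_rows.append(len(val_idx))     # rows of X when evaluating
        fold_errors.append(error)
    cv_error[lam] = np.mean(fold_errors)

best_lambda = min(cv_error, key=cv_error.get)
```

Both lists end up with 9 entries: every fit sees 60 rows and every validation error is computed on 30 rows. The test set of 30 points is never touched during this process.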
5 Gradient Descent
(a) [3 Pts] The learning rate can potentially affect which of the following? Select all that apply. Assume nothing about the function being minimized other than that its gradient exists. You may assume the learning rate is positive.
☐ The speed at which we converge to a minimum.
☐ Whether gradient descent converges.
☐ The direction in which the step is taken.
☐ Whether gradient descent converges to a local minimum or a global minimum.
(b) [3 Pts] Suppose we run gradient descent with a fixed learning rate of α = 0.1 to minimize the 2D function $f(x, y) = 5 + x^2 + y^2 + 5xy$.
The gradient of this function is
$\nabla_{x,y} f(x, y) = [2x + 5y, \; 2y + 5x]^T$
If our starting guess is $x^{(0)} = 1$, $y^{(0)} = 2$, what will be our next guess $x^{(1)}$, $y^{(1)}$?
$x^{(1)}$ = $y^{(1)}$ =
Solution: The gradient is [2*1 + 5*2, 2*2 + 5*1] = [12, 9], so the next guess is [1, 2] - 0.1 * [12, 9] = [-0.2, 1.1].
(c) [2 Pts] Suppose we are performing gradient descent to minimize the empirical risk of a linear regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2$ on a dataset with 100 observations. Let D be the number of components in the gradient, e.g. D = 2 for the equation in part b. What is D for the gradient used to optimize this linear regression model?
⃝2 ⃝3 ⃝4 ⃝8 ⃝100 ⃝200 ⃝300 ⃝400 ⃝800
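The single update in part (b) can be reproduced directly. The following is a minimal sketch, assuming NumPy; the function names `gradient` and `gradient_descent_step` are chosen here for illustration.

```python
import numpy as np

def gradient(point):
    """Gradient of f(x, y) = 5 + x^2 + y^2 + 5xy at the given point."""
    x, y = point
    return np.array([2 * x + 5 * y, 2 * y + 5 * x])

def gradient_descent_step(point, alpha):
    """One gradient descent update with learning rate alpha."""
    return point - alpha * gradient(point)

start = np.array([1.0, 2.0])
next_guess = gradient_descent_step(start, alpha=0.1)
```

Up to floating-point rounding, `next_guess` matches the values in the solution to part (b).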
6 One Hot Encoding and Feature Engineering
A Canadian study of workers in the 1980s collected the following information:
• wage (hourly in dollars)
• edu (years)
• job_type (1 for blue collar, 2 for white collar, and 3 for managerial)
A data scientist fitted a model with wage as the response, and the other two variables as features (job_type was one-hot encoded). The resulting fitted model was $\hat{y} = \vec{x} \cdot \vec{\hat{\beta}}$, where $\vec{\hat{\beta}} = [-8, 3, 6, -3]^T$, i.e.
$\hat{y} = -8 + 3 x_{edu} + 6 x_m - 3 x_b$,
where y is the hourly wage, $x_{edu}$ is years of education, and the other two variables are the dummies for managerial and blue collar workers, respectively.
(a) [2 Pts] For a blue collar worker with 10 years of education, what is the predicted value of wage (the predicted hourly wage) according to our model?
wage =
Solution: For this worker, we have that $x_{edu} = 10$, $x_m = 0$, and $x_b = 1$. When we plug these values in to the fitted model, we get:
$\hat{y} = -8 + 3 \times 10 + 6 \times 0 - 3 \times 1 = 19$
(b) [2 Pts] For a white collar worker with 10 years of education, what is the predicted value of wage according to our model?
wage =
Solution: For this worker, we have that $x_{edu} = 10$, $x_m = 0$, and $x_b = 0$. When we plug these values in to the fitted model, we get:
$\hat{y} = -8 + 3 \times 10 + 6 \times 0 - 3 \times 0 = 22$
(c) [6 Pts] Sketch the fitted model on the graph below. Hint: What you did in parts (a) and (b) is useful here. When grading we will only look at y-values for x = 10 and x = 20, so don't worry about exact values other than these. Don't worry about exact shape.
Solution: The fitted model yields three parallel lines. The slope of the lines is 3. The intercept depends on job type. The intercept is −8 for the white collar workers, −8 − 3 = −11 for blue collar workers, and −8 + 6 = −2 for managerial workers. We can use the points determined in parts (a) and (b) to draw the lines. Specifically, for a white collar worker we found that the line goes through the point (10, 22). It must also go through the point (20, 52). Similarly, the blue collar line goes through the points (10, 19) and (20, 49), and the managerial worker line goes through the points (10, 28) and (20, 58). The figure below shows these three lines.
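The points used to draw the three lines can be reproduced with a small function. This is a minimal sketch of the fitted model from this section; the function name `predict_wage` is hypothetical, and the job type coding (1 blue collar, 2 white collar, 3 managerial) follows the description above.

```python
def predict_wage(edu, job_type):
    """Predicted hourly wage from the fitted model -8 + 3*edu + 6*x_m - 3*x_b."""
    x_m = 1 if job_type == 3 else 0   # dummy for managerial workers
    x_b = 1 if job_type == 1 else 0   # dummy for blue collar workers
    return -8 + 3 * edu + 6 * x_m - 3 * x_b

# Predictions at 10 and 20 years of education for each job type.
line_points = {
    job_type: (predict_wage(10, job_type), predict_wage(20, job_type))
    for job_type in (1, 2, 3)
}
```

Each line rises by 30 dollars between 10 and 20 years of education, which corresponds to a slope of 3 per year.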
(d) [5 Pts] The first four rows of the original data frame appear below.
wage   edu   job.type
15     10    1
28     14    2
20     12    1
35     16    3
Create the design matrix X used to fit the model on the previous page by filling in the table below. Put the variable name in the first row and fill the remaining 4 rows with the corresponding data. You may not need all columns. Use the top row to name your columns.
Solution: The model that was fitted has a constant (bias) term and dummy variables for two of the three categories (managerial and blue collar workers). The dummy variable x_m is 1 for observations where job.type is 3 (i.e. managerial).
bias   x_edu   x_m   x_b
1      10      0     1
1      14      0     0
1      12      0     1
1      16      1     0
(e)
[6 Pts] Suppose we believe that the slope of the relationship between education level and wage is different for each of our 3 job types, e.g. perhaps white collar workers have salaries that are 2x their years of education, but blue collar workers only 1.5x. Create a design matrix below that will yield a model with different slopes and y-intercepts for each job type. Use the top row to name your columns. You may not need all columns.
Warning: This is a very challenging problem. Move on if you’re stuck.
Solution: To allow the slopes to be different for the different job types, we augment the design matrix from the previous problem to include variables that allow education to have a different slope. We can do this by adding two additional features that contain the education for subgroups of the data as shown below.
bias   x_edu   x_m   x_b   x_edu,m   x_edu,b
1      10      0     1     0         10
1      14      0     0     0         0
1      12      0     1     0         12
1      16      1     0     16        0
Now, our model looks like
$\hat{y} = \beta_0 + \beta_1 \cdot x_{edu} + \beta_2 \cdot x_m + \beta_3 \cdot x_b + \beta_4 \cdot x_{edu,m} + \beta_5 \cdot x_{edu,b}$
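Both design matrices can be built programmatically from the four rows shown in part (d). The following is a minimal sketch, assuming NumPy; the variable names `X_parallel` and `X_interaction` are chosen here for illustration and do not come from the exam.

```python
import numpy as np

# The four rows given in part (d).
edu = np.array([10, 14, 12, 16])
job_type = np.array([1, 2, 1, 3])     # 1 = blue collar, 2 = white collar, 3 = managerial

bias = np.ones_like(edu)
x_m = (job_type == 3).astype(int)     # managerial dummy
x_b = (job_type == 1).astype(int)     # blue collar dummy

# Part (d): common slope, different intercepts.
X_parallel = np.column_stack([bias, edu, x_m, x_b])

# Part (e): the interaction columns edu * dummy let the slope differ by job type.
X_interaction = np.column_stack([bias, edu, x_m, x_b, edu * x_m, edu * x_b])
```

In this encoding white collar is the reference group: its slope is the coefficient on `edu`, and the coefficients on the two interaction columns are the slope differences for managerial and blue collar workers.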
Another approach to encapsulating these three separate models (one for each job type) into one model is to create three pairs of education levels and biases, one pair for each of the job types.
x_edu_b and bias_b will only have nonzero values if the original datapoint was of job_type 1. Otherwise, both values in these columns will be 0.
x_edu_b   bias_b   x_edu_w   bias_w   x_edu_m   bias_m
10        1        0         0        0         0
0         0        14        1        0         0
12        1        0         0        0         0
0         0        0         0        16        1
Now, our model looks like
$\hat{y} = \beta_1 \cdot x_{edu_b} + \beta_2 \cdot bias_b + \beta_3 \cdot x_{edu_w} + \beta_4 \cdot bias_w + \beta_5 \cdot x_{edu_m} + \beta_6 \cdot bias_m$
For a given observation, if the original job_type value was 1 (i.e. the person was a blue collar worker), then all features other than x_edu_b and bias_b are set to 0, so we have $\hat{y} = \beta_1 \cdot x_{edu_b} + \beta_2 \cdot bias_b$.
The same principle applies to the other two job types as well.
7 Logistic Regression
Suppose we want to build a classifier to predict whether a person survived the sinking of the Titanic. The first 5 rows of our dataset are given below.
(a) For a given classifier, suppose the first 10 predictions of our classifier and 10 true observations are as follows:
prediction   1 1 1 1 1 0 1 1 1 1
true label   0 1 1 1 0 0 0 1 1 1
i. [1 Pt] What is the accuracy of our classifier on these 10 predictions?
Solution: 7 of our predictions were correct, out of 10 total. Thus, our accuracy is $\frac{7}{10}$.
ii. [1 1/2 Pts] What is the precision on these 10 predictions?
Solution: The number of true positives, TP, is 6. The number of false positives, FP, is 3. Then, the precision is $\frac{TP}{TP + FP} = \frac{6}{9} = \frac{2}{3}$.
iii. [1 1/2 Pts] What is the recall on these 10 predictions?
Solution: From the solution to the previous part, we know that TP = 6. The number of false negatives, FN, here is 0 (we only predicted 0 once, and in that case the true value was actually 0). Thus, the recall is $\frac{TP}{TP + FN} = \frac{6}{6 + 0} = 1$.
(b) [4 1/2 Pts] In general (not just for the Titanic model), if we increase the threshold for a classification model, which of the following can happen to our precision, recall, and accuracy? We have not included the option "X can stay the same", because this is trivially true (e.g. if we increase the threshold by some tiny number, it will have no effect).
☐ Precision can increase.
☐ Precision can decrease.
☐ Recall can increase.
☐ Recall can decrease.
☐ Accuracy can increase.
☐ Accuracy can decrease.
Solution: As we increase our classification threshold, the number of false positives can only decrease or stay the same, while the number of false negatives (i.e. undetected points) can only increase or stay the same. As a result, our precision typically increases (more of the points we say are positive will actually be positive), while our recall can decrease but never increase (there will be more points that are actually positive that we don't detect). However, in some cases precision can also decrease: this happens when increasing the threshold lowers the number of true positives but leaves the number of false positives unchanged. As seen in lecture, accuracy may increase or decrease. There typically exists an optimal threshold that maximizes accuracy, and if we increase or decrease our threshold from that point, accuracy decreases.
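The counts behind part (a) can be tallied in a few lines. The following is a minimal sketch in plain Python using the ten predictions and true labels from the table in part (a); the variable names are chosen here for illustration.

```python
predictions = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
true_labels = [0, 1, 1, 1, 0, 0, 0, 1, 1, 1]

pairs = list(zip(predictions, true_labels))
tp = sum(1 for p, t in pairs if p == 1 and t == 1)   # true positives
fp = sum(1 for p, t in pairs if p == 1 and t == 0)   # false positives
fn = sum(1 for p, t in pairs if p == 0 and t == 1)   # false negatives
tn = sum(1 for p, t in pairs if p == 0 and t == 0)   # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Raising the threshold can only move predictions from 1 to 0, so `tp` and `fp` can only shrink while `fn` and `tn` can only grow, which is the mechanism described in the solution to part (b).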
For convenience, we repeat the figure from the previous page below.
(c) Suppose after training our model we get $\vec{\beta} = [-1.2, -0.005, 2.5]^T$, where −1.2 is an intercept term, −0.005 is the parameter corresponding to passenger's age, and 2.5 is the parameter corresponding to sex.
i. [3 Pts] Consider Sīlānah Iskandar Nāsīf Abī Dāghir Yazbak, a 20 year old female. What chance did she have to survive the sinking of the Titanic according to our model? Give your answer as a probability in terms of σ. If there is not enough information, write "not enough information".
P(Y = 1 | age = 20, female = 1) =
Solution: To be explicit, our observation vector here is $\vec{x} = [1, 20, 1]^T$. Then, $\vec{x}^T \vec{\beta} = 1(-1.2) + 20(-0.005) + 1(2.5) = 1.2$.
Then, $P(Y = 1 \mid \vec{x}) = \sigma(\vec{x}^T \vec{\beta}) = \sigma(1.2)$.
ii. [3 Pts] Sīlānah Iskandar Nāsīf Abī Dāghir Yazbak actually survived. What is the cross-entropy loss for our prediction in part i? If there is not enough information, write "not enough information."
cross entropy loss =
Solution: Here, $y = 1$ and $\hat{y} = \sigma(1.2)$. Then,
cross entropy loss $= -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}) = -\log(\sigma(1.2))$
iii. [6 Pts] Let m be the odds of a given male passenger's survival according to our model, i.e. if the passenger had an 80% chance of survival, m would be 4, since their odds of survival are 0.8/0.2 = 4. It turns out we can compute f, the odds of survival for a female of the same age, even if we don't know the age of the two people. What is this relationship? Hint: How are the odds related to $t = \vec{x}^T \vec{\beta}$ for a given observation?
Warning: This is a very challenging problem. Move on if you're stuck.
f =
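Solution: We start by finding a simple relationship between the odds and $\sigma(t) = \sigma(\vec{x}^T \vec{\beta})$.
In logistic regression $p = \sigma(\vec{x}^T \vec{\beta})$.
The odds are defined as $\text{odds} = \frac{p}{1 - p}$. Substituting in, we have:
$\text{odds} = \frac{\frac{1}{1 + e^{-t}}}{1 - \frac{1}{1 + e^{-t}}} = \frac{\frac{1}{1 + e^{-t}}}{\frac{1 + e^{-t} - 1}{1 + e^{-t}}} = \frac{1}{e^{-t}} = e^t$
Thus, we have that $\text{odds} = e^{\vec{x}^T \vec{\beta}}$.
From here, the problem is relatively straightforward, as we have that
$f/m = \frac{e^{-1.2 - 0.005 \cdot \text{age} + 2.5}}{e^{-1.2 - 0.005 \cdot \text{age}}} = e^{2.5}$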
Solution: Recall, the assumption upon which we derived the logistic model was that the log-odds of our probability was linear. That is,
$\log \frac{P(Y = 1 \mid \vec{x})}{P(Y = 0 \mid \vec{x})} = \vec{x}^T \vec{\beta}$
Exponentiating both sides and substituting in the model and provided weights:
$\frac{P(Y = 1 \mid \vec{x})}{P(Y = 0 \mid \vec{x})} = e^{\vec{x}^T \vec{\beta}} = e^{-1.2 - 0.005 \cdot (\text{age}) + 2.5 \cdot (\text{sex})}$
We're told to consider the odds for a fixed age. So,
$m = e^{-1.2 - 0.005 \cdot (\text{age}) + 2.5 \cdot 0} = e^{-1.2 - 0.005 \cdot (\text{age})}$
$f = e^{-1.2 - 0.005 \cdot (\text{age}) + 2.5 \cdot 1} = e^{-1.2 - 0.005 \cdot (\text{age})} \cdot e^{2.5}$
Thus, we can say that $f = m \cdot e^{2.5}$.
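The three results in part (c) can be checked numerically. The following is a minimal sketch in plain Python using the parameters given in part (c); the function names and the particular ages tried are assumptions made for illustration.

```python
import math

BETA = (-1.2, -0.005, 2.5)   # intercept, age, sex (female = 1), from part (c)

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def survival_probability(age, female):
    """P(Y = 1 | age, female) under the fitted logistic regression model."""
    t = BETA[0] + BETA[1] * age + BETA[2] * female
    return sigmoid(t)

def odds(p):
    return p / (1 - p)

# Parts i and ii: a 20 year old female who survived (y = 1).
p_survive = survival_probability(20, 1)
cross_entropy_loss = -math.log(p_survive)

# Part iii: the ratio of female to male odds at several ages.
odds_ratios = [
    odds(survival_probability(age, 1)) / odds(survival_probability(age, 0))
    for age in (5, 20, 60)
]
```

Every entry of `odds_ratios` equals `math.exp(2.5)` up to floating-point error, which illustrates why the passengers' shared age is not needed to relate f and m.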