COMS 4771-2 Fall-B 2020 HW 2 (due November 18, 2020)
Instructions

Submitting the write-up
• Submit your write-up on Gradescope as a neatly typeset (not scanned nor hand-written) PDF document by 10:00 PM of the due date (New York City local time).
• You may produce the PDF using LaTeX, Word, Markdown, etc. (whatever you like), as long as the output is readable!
• On Gradescope, be sure to select the pages containing your answer for each problem. (If you don’t select pages containing your answer to a problem, you’ll receive a zero for that problem.)
Submitting as a group
• If you are working in a group (of up to three members), make sure all group members’ names and UNIs appear prominently on the first page of the write-up.
• Only one group member should submit the write-up on Gradescope. That student must make sure to add all of the group members to the submission.
Submitting the source code
• Please submit all requested source code on Gradescope in a single ZIP file.
• Use the concatenation of your group members’ UNIs followed by .zip as the filename for the
ZIP file (e.g., abc1234def5678.zip).
• Use the format hwXproblemY for filenames (before the extension), where X is the homework
number and Y is the problem number (e.g., hw1problem3).
• If you are submitting a Jupyter notebook (.ipynb), you must also include a separate Python
source file (.py) with the same basename that contains only the Python code.
• Do not include the data files that we have provided; only include source code.
Please consult the Gradescope Help Center if you need help with submitting.
Problem 1 (10 points)
In this problem, you will reason about optimal predictions for mean squared error.
Suppose Y1, …, Yn, Y are iid random variables; the distribution of Y is unknown to you. You observe Y1, …, Yn as "training data" and must make a (real-valued) prediction of Y.

(a) Assume Y has a probability density function given by
    p_θ(y) := (1/θ²) y e^(−y/θ)  if y > 0,
    p_θ(y) := 0                  if y ≤ 0,
for some θ > 0. Suppose that θ is known to you. What is the "optimal prediction" ŷ⋆ of Y that has the smallest mean squared error E[(ŷ⋆ − Y)²]? And what is this smallest mean squared error? Your answers should be given in terms of θ.
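If you'd like to sanity-check your answer numerically, note that the density above is the Gamma density with shape 2 and scale θ, so draws are easy to simulate. A minimal sketch (not part of the required answer; θ = 2 is an arbitrary test value):

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 2.0                                  # arbitrary test value of the parameter
    samples = rng.gamma(shape=2.0, scale=theta, size=200_000)  # draws from p_theta

    # Approximate E[(c - Y)^2] over a grid of candidate predictions c.
    grid = np.linspace(0.0, 10.0 * theta, 401)
    mse = np.array([np.mean((c - samples) ** 2) for c in grid])
    print("empirical minimizer ~", grid[np.argmin(mse)])
    print("empirical minimum MSE ~", mse.min())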
(b) (Continuing from Part (a).) In reality, θ is unknown to you. Suppose you observe (Y1, . . . , Yn) = (y1, . . . , yn) for some positive real numbers y1, . . . , yn > 0. Derive the following:
• the MLE θ̂(y1, …, yn) of θ given this data;
• the prediction ŷ(y1, …, yn) of Y based on the plug-in principle (using θ̂(y1, …, yn)).
Show the steps of your derivation. The MLE and prediction should be given as simple formulas involving y1, . . . , yn.
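A numerical check on your closed-form MLE (again, not a substitute for the derivation): maximize the log-likelihood directly. A sketch assuming SciPy is available, with made-up data:

    import numpy as np
    from scipy.optimize import minimize_scalar

    y = np.array([1.3, 0.7, 2.9, 4.1, 0.2])     # hypothetical positive observations

    def neg_log_likelihood(theta):
        # -log p_theta(y_i) = 2 log(theta) + y_i / theta - log(y_i), summed over i
        return np.sum(2.0 * np.log(theta) + y / theta - np.log(y))

    res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1e3), method="bounded")
    print("numerical MLE:", res.x)               # compare with your formula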
(c) Now, instead assume Y ∼ Bernoulli(θ) for some θ ∈ [0, 1]. Suppose that θ is known to you. What is the prediction ŷ⋆ of Y that has the smallest mean squared error E[(ŷ⋆ − Y)²]? And what is this smallest mean squared error? Your answers should be given in terms of θ. (Note: ŷ⋆ is allowed to be any real number!)
(d) (Continuing from Part (c).) Define the following loss function l : R × R → R by

    l(ŷ, y) := 2(ŷ − y)²  if ŷ ≥ y,
    l(ŷ, y) := (ŷ − y)²   if ŷ < y.

This loss function is a different way to measure how "bad" a prediction is. With this loss function, a prediction that is too high is more costly than one that is too low. What is the prediction ŷ⋆ of Y that has the smallest expected loss E[l(ŷ⋆, Y)]? And what is this smallest expected loss? Your answers should be given in terms of θ.
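Since Y is Bernoulli, the expected loss of any candidate prediction c is just θ · l(c, 1) + (1 − θ) · l(c, 0), so a grid search makes an easy numerical check of your answer. A sketch, with θ = 0.4 as an arbitrary test value:

    import numpy as np

    theta = 0.4                                  # arbitrary test value in [0, 1]

    def loss(yhat, y):
        return 2.0 * (yhat - y) ** 2 if yhat >= y else (yhat - y) ** 2

    grid = np.linspace(0.0, 1.0, 100_001)        # candidate predictions
    expected = np.array([theta * loss(c, 1.0) + (1.0 - theta) * loss(c, 0.0) for c in grid])
    print("minimizer ~", grid[np.argmin(expected)])
    print("minimum expected loss ~", expected.min())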
Problem 2 (15 points)
In this problem, you’ll practice analyzing a simple data set with linear regression.
Obtain the Jupyter notebook Linear_regression_on_Dartmouth_data.ipynb from Courseworks, and run the code there (e.g., using Google Colaboratory) to fit linear regression models to the Dartmouth College GPA data described in lecture.
You'll now apply a similar linear regression analysis to a data set concerning prostate cancer:
• https://www.cs.columbia.edu/~djhsu/coms4771-f20/data/prostate-train.csv
Regard this data set as “training data” in which the goal is to predict the variable lpsa (the logarithm of the prostate specific antigen level) using the remaining variables (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45) as features.1
(a) For each of the eight features, find the best fit affine function of that variable to the label lpsa. Report the “slope” and “intercept” in each case.
(b) Now find the best fit affine function of all eight features (together as a vector in R8) to the label lpsa. Report the coefficients in the weight vector and the “intercept” term.
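One possible workflow for Parts (a) and (b), sketched with pandas and scikit-learn (any equivalent tooling is fine, as long as you cite it; see the note at the end of this problem):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    train = pd.read_csv("https://www.cs.columbia.edu/~djhsu/coms4771-f20/data/prostate-train.csv")
    features = ["lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45"]
    y = train["lpsa"]

    # Part (a): one single-feature affine fit per feature.
    for f in features:
        model = LinearRegression().fit(train[[f]], y)
        print(f, "slope:", model.coef_[0], "intercept:", model.intercept_)

    # Part (b): one affine fit using all eight features together.
    full = LinearRegression().fit(train[features], y)
    print("weights:", dict(zip(features, full.coef_)))
    print("intercept:", full.intercept_)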
You should find that some of the variables have a negative coefficient in the weight vector from Part (b) even though their corresponding affine functions from Part (a) have positive slopes. This might seem like a paradox: for such a feature, Part (a) might lead you to think that increasing the feature's value should, on average, increase the value of lpsa; whereas Part (b) might lead you to think that increasing the feature's value should, on average, decrease the value of lpsa.
Of course there is no paradox. Here is a simple example to show how this can happen. Suppose X1 ∼ N(0, 1) and X2 ∼ N(0, 1), and E[X1X2] = 2/3. Furthermore, suppose Y = 3X1 − X2.
(c) What is the linear function of X1 that has smallest mean squared error for predicting Y ? What is the linear function of X2 that has smallest mean squared error for predicting Y ? And finally, what is the linear function of (X1,X2) that has smallest mean squared error for predicting Y ?
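A general fact that may help with Part (c) (a one-line derivation, stated here for convenience): if X and Y have mean zero, the coefficient a minimizing E[(aX − Y)²] satisfies

    d/da E[(aX − Y)²] = 2a E[X²] − 2 E[XY] = 0,   so   a = E[XY] / E[X²].

In particular, when E[X²] = 1 (as for X1 and X2 here), the best single-variable coefficient is just the cross-moment E[XY].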
You should find that even though each of X1 and X2 is positively correlated with Y (analogous to the situation in Part (a)), the best linear predictor of Y that considers both X1 and X2 has a positive coefficient for one variable and a negative coefficient for the other variable (analogous to the situation in Part (b)).
This example shows that one must be careful when interpreting the sign of a regression coefficient.

There are 30 additional labeled examples available:
• https://www.cs.columbia.edu/~djhsu/coms4771-f20/data/prostate-test.csv
Regard these data as "test data".
(d) What are the mean squared errors of your affine function from Part (b) on the training data? And on the test data?
1 A description of the variables appears on page 4 of Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, available here: https://web.stanford.edu/~hastie/ElemStatLearn/
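For Part (d), continuing the earlier sketch (this assumes the full model, features list, and train data frame defined there):

    import pandas as pd
    from sklearn.metrics import mean_squared_error

    test = pd.read_csv("https://www.cs.columbia.edu/~djhsu/coms4771-f20/data/prostate-test.csv")
    print("train MSE:", mean_squared_error(train["lpsa"], full.predict(train[features])))
    print("test MSE:", mean_squared_error(test["lpsa"], full.predict(test[features])))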
(e) Apply ridge regression (along with some form of cross-validation to choose the hyperparameter λ) to find an affine function of all eight features. Briefly describe your approach (including how you applied cross-validation), and report the mean squared error on each of the training data and the test data.
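One way to implement Part (e) is scikit-learn's RidgeCV, which selects the regularization strength by cross-validation over a supplied grid (scikit-learn calls the hyperparameter alpha rather than λ). A sketch, continuing from the earlier snippets; standardizing the features first is a common choice, since ridge penalizes all coefficients on the same scale:

    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # 5-fold cross-validation over a log-spaced grid of regularization strengths.
    model = make_pipeline(
        StandardScaler(),
        RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5),
    )
    model.fit(train[features], train["lpsa"])
    print("chosen alpha:", model[-1].alpha_)
    print("train MSE:", mean_squared_error(train["lpsa"], model.predict(train[features])))
    print("test MSE:", mean_squared_error(test["lpsa"], model.predict(test[features])))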
You are welcome to use any software packages you like for this problem, provided that you cite these software packages in your write-up.
Include the source code in your code submission on Gradescope.
Problem 3 (15 points)
In this problem, you will use linear regression with a subset of ProPublica’s COMPAS data set, and evaluate the disparate behavior of a derived “screening tool” on different subpopulations represented in the data.
COMPAS data set
We have prepared subsets of the COMPAS data set that was analyzed by ProPublica for their Machine Bias article:
• https://www.cs.columbia.edu/~djhsu/coms4771-f20/data/compas-train.csv • https://www.cs.columbia.edu/~djhsu/coms4771-f20/data/compas-test.csv
These data represent criminal defendants from Broward County, Florida. For each defendant, the data contains information available to a “screening tool” that is intended to assess the defendant’s risk of recidivism (i.e., risk of committing a crime if they were to be released on parole). The data also contains, for each defendant, whether the defendant was arrested again within two years of the assessment.2
The data has been randomly split into two parts: one for “training” and another for “testing”. There is one row per defendant. The “label” is provided in the first column (named two_year_recid). The remaining eight columns correspond to features that are to be used to predict the label. Each feature is either binary-valued (sex, race, c_charge_degree) or integer-valued (age, juv_fel_count, juv_misd_count, juv_other_count, priors_count). (There are several other features in the original data set that we have omitted for this assignment. We have taken a subset of the original data that includes only two possible values for the sex attribute and two possible values for the race attribute.)
Linear regression and classification
Compute the affine function η̂ : R^d → R of smallest mean squared error on the training data. (Ordinary least squares does the job.) What is the mean squared error of η̂ on the test data?
Since the label is binary-valued, we can regard η̂(x) as an estimate of the conditional probability that the label is one given the feature vector x (i.e., an estimate of Pr(Y = 1 | X = x)). Therefore, it is reasonable to derive a binary classifier f̂ : R^d → {0, 1} from η̂, defined by
    f̂(x) = 1{η̂(x) > 1/2}.
What are the error rate and false positive rate of f̂ on the test data?
(A “false positive” is a prediction that a given defendant will be arrested within two years of
screening, among defendants who do not actually get arrested within two years of screening.)
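A sketch of this pipeline, again with pandas and scikit-learn (this assumes the binary features are already numerically encoded in the CSV files, as the description above suggests; otherwise encode them first):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    base = "https://www.cs.columbia.edu/~djhsu/coms4771-f20/data/"
    train = pd.read_csv(base + "compas-train.csv")
    test = pd.read_csv(base + "compas-test.csv")
    features = [c for c in train.columns if c != "two_year_recid"]

    eta = LinearRegression().fit(train[features], train["two_year_recid"])
    print("test MSE:", mean_squared_error(test["two_year_recid"], eta.predict(test[features])))

    pred = (eta.predict(test[features]) > 0.5).astype(int)     # the classifier f-hat
    y = test["two_year_recid"].to_numpy()
    print("error rate:", (pred != y).mean())
    print("false positive rate:", (pred[y == 0] == 1).mean())  # among actual negatives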
Performance on subpopulations
There are many natural subpopulations represented in the data. For example, we may define two subpopulations based on the value of the sex feature, and we may also define two subpopulations based on the value of the race feature. ProPublica's article was largely concerned with the disparate behavior of a "screening tool" called "COMPAS" on two subpopulations based on the race feature.

2 Note that being arrested is not the same as having committed a crime. See the recent article by Lum and Shah, 2019 for further discussion in a related context.
Choose one of the features sex and race; call this feature A. For each a ∈ {0, 1}, what are the error rate and false positive rate of f̂, as evaluated on the subset of test data with A = a?
You should find that, like the COMPAS screening tool studied by ProPublica, the function f̂ also has disparate behavior on the two subpopulations: the false positive rate is much higher for one subpopulation than it is for the other.
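Per-subpopulation evaluation is a filtered version of the same computation; a sketch, continuing from the snippet above (it assumes A is a 0/1-valued column):

    A = "race"   # or "sex": whichever feature you chose as A
    for a in (0, 1):
        sub = test[test[A] == a]
        sub_pred = (eta.predict(sub[features]) > 0.5).astype(int)
        sub_y = sub["two_year_recid"].to_numpy()
        print(f"A={a}: error rate {(sub_pred != sub_y).mean():.3f}, "
              f"FPR {(sub_pred[sub_y == 0] == 1).mean():.3f}")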
Inherent limitations?
Is the disparate behavior on the two subpopulations inevitable? For each a ∈ {0, 1}, repeat the training and testing above using only training and test data with A = a. That is, compute the affine function η̂_a : R^d → R of smallest empirical mean squared error on the training data with A = a; form the corresponding classifier f̂_a : R^d → {0, 1} via f̂_a(x) = 1{η̂_a(x) > 1/2}; then compute the error rate and false positive rate of f̂_a on the test data with A = a.
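A sketch of the per-subpopulation retraining, under the same assumptions as the previous snippets:

    for a in (0, 1):
        tr = train[train[A] == a]
        te = test[test[A] == a]
        # Note: the column A is constant within each subset; it is collinear with the
        # intercept, which scikit-learn's least-squares solver tolerates.
        eta_a = LinearRegression().fit(tr[features], tr["two_year_recid"])
        pred_a = (eta_a.predict(te[features]) > 0.5).astype(int)
        y_a = te["two_year_recid"].to_numpy()
        print(f"A={a}: error rate {(pred_a != y_a).mean():.3f}, "
              f"FPR {(pred_a[y_a == 0] == 1).mean():.3f}")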
Do you still observe disparate behavior on the two subpopulations? What are the implications of this finding? (Think about the relative utility of the features for predicting the label.)
What to submit in your write-up
(a) Please indicate which feature you have selected to be A (either sex or race).
(b) Mean squared error of η̂ on test data, error rate of f̂ on test data, and false positive rate of f̂ on test data.
(c) The performance measures (error rate and false positive rate) for f̂ on the test data with A = 0, and the same performance measures for f̂ on the test data with A = 1.

(d) The performance measures (error rate and false positive rate) for f̂_0 on the test data with A = 0, and the same performance measures for f̂_1 on the test data with A = 1.

(e) (Optional.) Discuss the implications of your observations concerning the disparate behavior of f̂_0 and f̂_1 (or lack thereof) on their respective subpopulations.
You are welcome to use any software packages you like for this problem, provided that you cite these software packages in your write-up.
Include the source code in your code submission on Gradescope.
Problem 4 (10 points)
In this problem, you’ll practice performing Bayesian inference.
Consider the statistical model for the outcomes of 10 coin tosses given by Y1, . . . , Y10 iid ∼ Bernoulli(θ).
Here, θ ∈ [0, 1] is the model parameter.
Suppose you observe the following data (the outcomes of the 10 coin tosses):
(y1,y2,y3,y4,y5,y6,y7,y8,y9,y10) = (0,1,0,0,0,1,1,1,0,0).
(a) Suppose your prior distribution on θ was the uniform distribution on the interval [0, 1]. Determine an explicit formula for the posterior distribution. Please evaluate any definite integral that arises. (Hint: the formula is a polynomial in θ.) What is the mode of the posterior distribution? And what are, approximately, the values of the 25th, 50th, and 75th percentiles of the posterior distribution?3
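Once you have the posterior density in hand, the hinted percentiles can be approximated numerically on a grid (the exact polynomial formula is still required). A minimal sketch:

    import numpy as np

    heads, tails = 4, 6                           # counts of 1s and 0s in the data above
    theta = np.linspace(0.0, 1.0, 100_001)
    unnorm = theta**heads * (1.0 - theta)**tails  # likelihood times the uniform prior

    # Trapezoid-rule cumulative integral, then normalize to get the posterior CDF.
    cum = np.concatenate([[0.0], np.cumsum((unnorm[1:] + unnorm[:-1]) / 2.0)]) * (theta[1] - theta[0])
    cdf = cum / cum[-1]
    print("mode ~", theta[np.argmax(unnorm)])
    for r in (25, 50, 75):
        print(f"{r}th percentile ~ {theta[np.searchsorted(cdf, r / 100)]:.4f}")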
(b) Suppose instead your prior distribution on θ was given by the following probability mass function:
Pr(θ = 1/6) = 1/6, Pr(θ = 1/3) = 1/6, Pr(θ = 1/2) = 1/3, Pr(θ = 4/6) = 1/6, Pr(θ = 5/6) = 1/6.
Determine the posterior probability distribution. (Hint: it is given by a probability mass function that is non-zero only on five different points.)
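Bayes' rule with this five-point prior is a short exact computation; a sketch using exact rational arithmetic:

    from fractions import Fraction as F

    prior = {F(1, 6): F(1, 6), F(1, 3): F(1, 6), F(1, 2): F(1, 3),
             F(4, 6): F(1, 6), F(5, 6): F(1, 6)}
    heads, tails = 4, 6                          # counts of 1s and 0s in the data

    # posterior(theta) is proportional to prior(theta) * theta^heads * (1 - theta)^tails
    joint = {t: p * t**heads * (1 - t)**tails for t, p in prior.items()}
    evidence = sum(joint.values())
    for t, w in joint.items():
        print(f"Pr(theta = {t} | data) = {w / evidence} ~ {float(w / evidence):.4f}")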
3 Recall that the r-th percentile of a probability distribution on R is the value t such that the cumulative distribution function evaluated at t is r/100.