COMS 4771-2 Fall-B 2020 HW 1 (due November 4, 2020)
Instructions

Submitting the write-up
• Submit your write-up on Gradescope as a neatly typeset (not scanned nor hand-written) PDF document by 10:00 PM of the due date (New York City local time).
• You may produce the PDF using LATEX, Word, Markdown, etc.—whatever you like—as long as the output is readable!
• On Gradescope, be sure to select the pages containing your answer for each problem. (If you don’t select pages containing your answer to a problem, you’ll receive a zero for that problem.)
Submitting as a group
• If you are working in a group (of up to three members), make sure all group members’ names and UNIs appear prominently on the first page of the write-up.
• Only one group member should submit the write-up on Gradescope. That student must make sure to add all of the group members to the submission.
Submitting the source code
• Please submit all requested source code on Gradescope in a single ZIP file.
• Use the concatenation of your group members’ UNIs followed by .zip as the filename for the
ZIP file (e.g., abc1234def5678.zip).
• Use the format hwXproblemY for filenames (before the extension), where X is the homework
number and Y is the problem number (e.g., hw1problem3).
• If you are submitting a Jupyter notebook (.ipynb), you must also include a separate Python
source file (.py) with the same basename that contains only the Python code.
• Do not include the data files that we have provided; only include source code.
Please consult the Gradescope Help Center if you need help with submitting.
Problem 1 (20 points)
In this problem, you will implement and evaluate nearest neighbor classifiers.
MNIST data set
Download the MNIST data set of handwritten digit images ocr.mat from Courseworks, and load it into Python:
from scipy.io import loadmat
ocr = loadmat('ocr.mat')
The 60000 unlabeled training data (i.e., feature vectors) are contained in a matrix called data (one data point per row), and the corresponding labels are in a vector called labels. The 10000 test feature vectors and labels are in, respectively, testdata and testlabels.
You can view an image (say, the first one) with the following commands:
import matplotlib.pyplot as plt
from matplotlib import cm
plt.imshow(ocr['data'][0].reshape((28,28)), cmap=cm.gray_r)
plt.show()
Nearest neighbor classifier
Let the training examples be denoted by (x_1, y_1), . . . , (x_n, y_n) (for n = 60000), where each x_i ∈ R^d and each y_i ∈ {0, . . . , 9}.
The nearest neighbor classifier f̂ : R^d → {0, . . . , 9} is a function defined in terms of the 60000 training examples as follows. On input x ∈ R^d, let i(x) := arg min_{i∈{1,…,n}} ∥x − x_i∥_2, breaking ties arbitrarily (i.e., i(x) is any i ∈ {1, . . . , n} for which ∥x − x_{i(x)}∥_2 ≤ min_{i∈{1,…,n}} ∥x − x_i∥_2); and return y_{i(x)}.1
Use the Python 3 code template from the Courseworks and implement the nearest neighbor classifier function nn(X,Y,test), which takes as input a matrix of training feature vectors X and a vector of corresponding labels Y, as well as a matrix of test feature vectors test, and outputs a vector of the predicted labels preds for all test points, as provided by the nearest neighbor classifier fˆ based on the examples in X and Y.
Don’t use (or look at the source code for) any library functions for computing all pairwise distances between entire sets of vectors, or for computing nearest neighbor queries; that would, of course, defeat the purpose of this assignment. However, it is fine to use functions for computing inner products between vectors and norms of vectors, as well as other basic matrix and vector operations. It is also fine to use functions for computing order statistics, like those that find the minimum or maximum entry in an array. If in doubt about what is okay, just ask.
In order to make your code efficient and not take a long time to run, you should use matrix/vector operations (rather than, say, a bunch of explicit for-loops). Think of the n training feature vectors
1The expression “argmina∈A f(a)” denotes the set of minimizers of the function f among all elements in the set A. On occasion, as we do here, we’ll be a little sloppy and write expressions of the form “aˆ := argmina∈A f(a)”. This will always means that aˆ is some minimizer of f within A, rather than the set of minimizers. For this to really make sense, though, it must be made clear (either explicitly or implicitly) what we mean in the case that there are multiple minimizers—i.e., what the tie-breaking rule is—or if there are no minimizers!
as being stacked as the rows of a matrix X ∈ R^{n×d}, and the m test feature vectors as the rows of another matrix T ∈ R^{m×d}. If the i-th row of T is t_i^T and the j-th row of X is x_j^T, then the squared Euclidean distance between t_i and x_j is

∥t_i − x_j∥^2 = (t_i − x_j)^T (t_i − x_j) = t_i^T t_i − 2 t_i^T x_j + x_j^T x_j.
Can you leverage this identity to speed-up computation of the nearest neighbor classifier?2
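As one illustration of how such an identity can be exploited, here is a minimal NumPy sketch of computing all pairwise squared distances with broadcasting rather than explicit loops. The function name `pairwise_sq_dists` is illustrative, not part of the assignment's code template.

```python
import numpy as np

def pairwise_sq_dists(T, X):
    """Squared Euclidean distances between rows of T (m x d) and rows of X (n x d).

    Uses the identity ||t - x||^2 = t.t - 2 t.x + x.x so that the bulk of the
    work is a single matrix multiplication.
    """
    tt = np.sum(T * T, axis=1)[:, None]   # shape (m, 1): t_i^T t_i
    xx = np.sum(X * X, axis=1)[None, :]   # shape (1, n): x_j^T x_j
    tx = T @ X.T                          # shape (m, n): t_i^T x_j
    return tt - 2 * tx + xx               # broadcasts to shape (m, n)
```

The resulting (m, n) matrix can then be reduced along each row (e.g., with `np.argmin`) to find each test point's nearest training example.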
Evaluating the nearest neighbor classifier
Instead of using all of data and labels as the training data for your nearest neighbor program, the code template implements the following evaluation. For each value n ∈ {1000, 2000, 4000, 8000}:
1. Draw n points from data, together with their corresponding labels, uniformly at random. Use sel = random.sample(range(60000), n) (after import random), ocr['data'][sel].astype('float'), and ocr['labels'][sel] to select the examples.3
2. Use these n points as the training data and testdata as the test points; compute the fraction of test examples on which the nearest neighbor classifier predicts the label incorrectly (i.e., the test error rate).
A plot of test error rate (on the y-axis) as a function of n (on the x-axis) is called a learning curve.
Carry out the (random) process described above ten times, independently. Produce a learning curve plot using the average of these test error rates (that is, averaging over ten repetitions). Add error bars to your plot that extend to one standard deviation above and below the means. Ensure the plot axes are properly labeled.
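For the plotting step, a minimal sketch of a learning curve with one-standard-deviation error bars is below; the array `err_rates` holds placeholder values standing in for your actual measured test error rates (rows indexed by sample size, columns by repetition).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

ns = [1000, 2000, 4000, 8000]
# Placeholder data: err_rates[i, j] = test error rate of repetition j at sample size ns[i].
err_rates = np.random.default_rng(0).uniform(0.02, 0.10, size=(len(ns), 10))

means = err_rates.mean(axis=1)  # average over the ten repetitions
stds = err_rates.std(axis=1)    # one standard deviation per sample size

plt.errorbar(ns, means, yerr=stds, capsize=3)
plt.xlabel("Number of training points n")
plt.ylabel("Test error rate")
plt.title("Learning curve")
plt.savefig("learning_curve.png")
```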
What to submit in your write-up
(a) A brief explanation of how your nearest neighbor implementation works.
(b) Learning curve plot. (Please include the plot directly in the PDF write-up.)
Include the source code in your code submission on Gradescope.
2In order to take advantage of matrix/vector operations, your Python installation needs to be using optimized linear algebra libraries, such as ATLAS, MKL, or the Accelerate/vecLib framework on OS X. If you installed Python using Anaconda, then you should already have MKL.
3The data are stored as unsigned integers, so you need to convert them to floating point or else you may get integer overflows.
Problem 2 (10 points)
In this problem, you will practice deriving formulae for maximum likelihood estimators.
Consider a statistical model for iid random variables X_1, . . . , X_n, parameterized by θ ∈ Θ. The distribution P_θ of X_1 is specified in each part below.
In each of these cases, derive a simple formula for the MLE for θ (given data (X_1, . . . , X_n) = (x_1, . . . , x_n)), and briefly justify each step of the derivation.
(a) X_1 ∼ P_θ has probability density function f_θ satisfying
f_θ(x) = (1/Z_θ) x^2 e^{−x/θ} for all x > 0,

and f_θ(x) = 0 for all x ≤ 0. Here, Z_θ = ∫_0^∞ x^2 e^{−x/θ} dx is the “normalization constant” that ensures the probability density function integrates to one. The parameter space is the positive reals Θ = {θ ∈ R : θ > 0}.
(b) X_1 ∼ P_θ has probability density function f_θ satisfying

f_θ(x) = 1/Z_θ for all −θ ≤ x ≤ θ,

and f_θ(x) = 0 for all x < −θ and all x > θ. Here, Z_θ = ∫_{−θ}^{θ} 1 dx is the “normalization constant” that ensures the probability density function integrates to one. The parameter space is the positive reals Θ = {θ ∈ R : θ > 0}.
Now consider the following statistical model for random variables X_1, X_2, which are not independent. The distribution of X_1 is Pois(θ) (i.e., Poisson with rate θ), and for any non-negative integer x_1, the conditional distribution of X_2 | X_1 = x_1 is Pois(x_1 θ). The parameter space is the positive reals Θ = {θ ∈ R : θ > 0}.
(c) Write a formula for the joint probability mass function of (X_1, X_2) with parameter θ.
(d) Derive a simple formula for the MLE for θ (given data (X_1, X_2) = (x_1, x_2)), and briefly justify each step of the derivation.
Problem 3 (20 points)
In this problem, you will consider a different model for collecting data that ensures some kind of privacy guarantee.
Suppose you would like to estimate the proportion of people in a population with a positive COVID19 test result, i.e., the positivity rate. (For the purpose of this problem, assume everyone has been tested.) You randomly sample n individuals from the population.4 You could just ask each person whether or not they have tested positive. However, they may be uncomfortable sharing that information with you outright.
The randomized response method was developed to address this privacy concern. In the randomized response method, surveyed individuals are not asked to directly tell you whether or not they have tested positive. Instead, you ask each surveyed individual to do the following:
• Toss a fair coin three times (without letting you see the outcomes).
• If at least one toss comes up heads:
  – Respond truthfully (i.e., say 1 if the individual has tested positive; 0 if not).
• Else:
  – Give the opposite response (i.e., say 0 if the individual has tested positive; 1 if not).
Because you do not observe the outcomes of the coin tosses, you never definitively learn whether the surveyed individual has tested positive or not. This is a very strong privacy guarantee for the surveyed individuals. Moreover, the collected information can still be used to obtain a good estimate of the proportion of individuals in the population with a positive test.
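The protocol above is straightforward to simulate; here is a minimal sketch of one surveyed individual's response (the function name is illustrative, and this is not the estimator asked for in part (d)).

```python
import random

def randomized_response(truth):
    """One individual's answer under the randomized response protocol.

    truth: 1 if the individual tested positive, 0 otherwise.
    The individual tosses a fair coin three times and answers truthfully
    unless all three tosses come up tails (probability 1/8), in which case
    they give the opposite answer.
    """
    tosses = [random.random() < 0.5 for _ in range(3)]  # True = heads
    if any(tosses):       # at least one head: respond truthfully
        return truth
    else:                 # all tails: give the opposite response
        return 1 - truth
```

Note that even after seeing the response, the surveyor cannot tell whether it was the truthful or the flipped answer, which is the source of the privacy guarantee.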
Consider a statistical model for n responses collected in this way, where the responses are regarded as iid {0, 1}-valued random variables Y1, . . . , Yn, and all coin tosses are independent. Let θ ∈ [0, 1] denote the parameter of the model that equals the proportion of individuals in the population with a positive test.
(a) What is the probability that Y1 = 1? Give your answer in terms of the parameter θ.
(b) What is the log-likelihood of θ given data y1, . . . , yn ∈ {0, 1}? Write your answer in terms of
θ and y1,…,yn.
(c) Suppose n = 100, and the number of y_i that are equal to 1 is 40 (i.e., Σ_{i=1}^{n} y_i = 40). Plot the log-likelihood as a function of θ ∈ [0, 1]. What appears to be the θ with highest likelihood? (Please include the plot directly in the PDF write-up.)
(d) Write a program in Python that simulates the (random) process described above for estimating the positivity rate. In your simulation, make it so that the true proportion of individuals with a positive test is 10% (yikes). For each value of n ∈ {100, 200, 400, 800, 1600}, carry out the above process ten times, independently, with n being the sample size. Report (in a table), for each such n, the mean and standard deviation of the ten resulting estimates of θ.
Include the source code in your code submission on Gradescope.
4When sampling from a finite population, we typically use sampling without replacement. But to simplify the problem, assume that sampling with replacement is used.
Problem 4
Please fill out this survey (individually, not as a group) about your interests in machine learning:
https://forms.gle/rScuPz7yBp4NvE8U6