COMS 4771 HW4 (Fall 2021)
Due: Dec 12, 2021 at 11:59pm
This homework is to be done alone. No late homeworks are allowed. To receive credit, a type-
setted copy of the homework pdf must be uploaded to Gradescope by the due date. You must show
your work to receive full credit. Discussing possible solutions for homework questions is encour-
aged on the course discussion board and with your peers, but you must write their own individual
solutions and not share your written work/code. You must cite all resources (including online mate-
rial, books, articles, help taken from specific individuals, etc.) you used to complete your work.
1 Bayesian interpretation of ridge regression
Consider the following data generating process for linear regression problem in Rd. Nature first
selects d weight coefficients w1, . . . , wd as wi ∼ N(0, τ2) i.i.d. Given n examples x1, . . . , xn ∈ Rd,
nature generates the output variable yi as
yi =
d∑
j=1
wjxi,j + ϵi,
where ϵi ∼ N(0, σ2) i.i.d.
Show that finding the coefficients w1, . . . , wd that maximizes P [w1, . . . , wd|(x1, y1) . . . , (xn, yn)]
is equivalent to minimizing the ridge optimization criterion.
2 From distances to embeddings
Your friend from overseas is visiting you and asks you the geographical locations of popular US
cities on a map. Not having access to a US map, you realize that you cannot provide your friend
accurate information. You recall that you have access to the relative distances between nine popular
US cities, given by the following distance matrix D:
1
Distances (D) BOS NYC DC MIA CHI SEA SF LA DEN
BOS 0 206 429 1504 963 2976 3095 2979 1949
NYC 206 0 233 1308 802 2815 2934 2786 1771
DC 429 233 0 1075 671 2684 2799 2631 1616
MIA 1504 1308 1075 0 1329 3273 3053 2687 2037
CHI 963 802 671 1329 0 2013 2142 2054 996
SEA 2976 2815 2684 3273 2013 0 808 1131 1307
SF 3095 2934 2799 3053 2142 808 0 379 1235
LA 2979 2786 2631 2687 2054 1131 379 0 1059
DEN 1949 1771 1616 2037 996 1307 1235 1059 0
Being a machine learning student, you believe that it may be possible to infer the locations of
these cities from the distance data. To find an embedding of these nine cities on a two dimensional
map, you decide to solve it as an optimization problem as follows.
You associate a two-dimensional variable xi as the unknown latitude and the longitude value for
each of the nine cities (that is, x1 is the lat/lon value for BOS, x2 is the lat/lon value for NYC, etc.).
You write down the an (unconstrained) optimization problem
minimizex1,…,x9
∑
i,j
(
∥xi − xj∥ −Dij
)2
,
where
∑
i,j(∥xi − xj∥ −Dij)
2 denotes the embedding discrepancy function.
(i) What is the derivative of the discrepancy function with respect to a location xi?
(ii) Write a program in your preferred language to find an optimal setting of locations x1, . . . , x9.
You must submit your code to receive full credit.
(iii) Plot the result of the optimization showing the estimated locations of the nine cities. (here is
a sample code to plot the city locations in Matlab)
>> cities={’BOS’,’NYC’,’DC’,’MIA’,’CHI’,’SEA’,’SF’,’LA’,’DEN’};
>> locs = [x1;x2;x3;x4;x5;x6;x7;x8;x9];
>> figure; text(locs(:,1), locs(:,2), cities);
What can you say about your result of the estimated locations compared to the actual geo-
graphical locations of these cities?
3 An alternate learning paradigm
In class you have seen that when building classifiers, one wants to minimize the expected classifica-
tion error over a distribution D. That is, we want to find the classifier f that minimizes:
E(x,y)∼D
[
1[f(x) ̸= y]
]
. (1)
Since this quantity is not estimable in practice (since we don’t know D and only have access to finite
samples drawn from it), it is usually approximated via its empirical equivalent:
1
|S|
∑
(x,y)∈S
1[f(x) ̸= y]. (2)
2
This latter quantity is the training error if S is the training set, and is the testing error if S is the
testing set.
However for certain applications, obtaining a positively and negatively labelled samples S is not
possible. Consider, for example, the problem of modelling user preferences based on news-feed that
gets shown. Very simply, if a user interacts with a particular news item (such as they clicked and
read it) shows that they are interested in the contents of the article, thus providing a positive label.
But if a user does not interact with a particular news item, it is not clear whether the user dislikes the
contents of the article, or simply didn’t get around to viewing it. In such a scenario obtaining a good
quality negatively labelled data sample is not possible. We thus need a slightly different learning
paradigm where the training samples obtained are only either labelled as positive examples, or they
are simply unlabeled examples. We can model this as follows:
• D is an unknown distribution over RD ×{0, 1} × {0, 1}. (x, y, s) ∼ D is a sample, where x
is the input feature vector, y is the true label, and s (the “selection” variable) is whether x was
interacted with (ie, selected) or not. Note that only x and s are observed.
• Pr[s = 1 | x, y = 0] = 0, that is, a negatively labelled x is never selected.
• Given y, s and x are conditionally independent. That is, which x gets selected (given that,
say, x positively labelled) is chosen independently.
The goal of this problem is to find an empirical estimator of (1) similar to (2) but using the
unlabeled and positive data only.
1. Prove that Pr[y = 1 | x] = Pr[s=1|x]
Pr[s=1|y=1] .
2. Using (i) prove that Pr[y = 1 | x, s = 0] = 1−Pr[s=1|y=1]
Pr[s=1|y=1]
Pr[s=1|x]
1−Pr[s=1|x] .
For the rest of the problem, assume that both quantities on the RHS can be estimated from
(x, s) data only. This is trivially true for Pr[s = 1 | x] (since it does not depend on y). And
while estimating Pr[s = 1 | y = 1] with only(x, s) data is nontrivial, it can be done under
suitable conditions.
3. Letting p denote the PDF of D show that:
E(x,y)∼D
[
1[f(x) ̸= y]
]
=
∫
x
p(x, s = 1)1[f(x) ̸= 1]
+ p(x, s = 0)(Pr[y = 1 | s = 0, x]1[f(x) ̸= 1]
+ Pr[y = 0 | s = 0, x]1[f(x) ̸= 0])dx
4. Using parts (ii) and (iii) suggest an empirical estimator of (1) similar to (2) but that uses only
(x, s) data.
Hint: Try viewing unlabeled points as part positive and part negative. That is, replace unla-
beled points by two “partial” points. One that is positive with weight w(x) and one negative
with weight 1− w(x).
3
4 Exploring k-means in detail
Recall that in k-means clustering we attempt to find k cluster centers cj ∈ Rd, j ∈ {1, . . . , k} such
that the total (squared) distance between each datapoint and the nearest cluster center is minimized.
In other words, we attempt to find c1, …, ck that minimizes
n∑
i=1
min
j∈{1,…,k}
∥xi − cj∥2, (3)
where n is the total number of datapoints. To do so, we iterate between assigning xi to the nearest
cluster center and updating each cluster center cj to the average of all points assigned to the jth
cluster (aka Lloyd’s method).
(a) [it is unclear how to find the best k, i.e. estimate the correct number of clusters!] Instead
of holding the number of clusters k fixed, one can think of minimizing (3) over both k and c.
Show that this is a bad idea. Specifically, what is the minimum possible value of (3)? what
values of k and c result in this value?
(b) [efficient optimal solution when either k = 1 or d = 1] Optimizing the k-means objective
(3) for the general case when k ≥ 2 and d ≥ 2 is NP-hard. But one can find an efficient
optimial solution when either k = 1 or d = 1.
(i) What is the optimal objective value and the setting of the cluster centers in the case of
k = 1?
(ii) For the case d = 1 (and k ≥ 2), show that Lloyd’s algorithm is not optimal. That is,
there is a suboptimal setting of cluster assignment for some dataset (with d = 1) for
which Lloyd’s algorithm will not be able to improve the solution.
(iii) (note: this part is extra credit, only attempt it once everything else is done)
Propose an efficient algorithm (that is, an algorithm that runs in polynomial in n and k
steps) for optimally solving k-means for d = 1 case.
(Hint: Use the fact that when d = 1, the dataset can be ordered as xi1 ≤ xi2 ≤ · · · ≤ xin
and each cluster in the optimal assignment will be a nonoverlapping interval)
(c) [kernelizing k-means] k-means with Euclidean distance metric assumes that each pair of
clusters is linearly separable. This may not be the case. A classic example is where we have
two clusters corresponding to data points on two concentric circles in the R2.
(i) Implement Lloyd’s method for k-means algorithm and show the resulting cluster assign-
ment for the dataset depicted above. Give two more examples of datasets in R2, for
which optimal k-means setting results in an undesirable clustering. Show the resulting
cluster assignment for the two additional example datasets.
Let ϕ denote an explicit non-linear feature space mapping in some inner product space. We
will show how one can derive an implicit kernel-based version of the Lloyd’s method for
k-means, where we only operate on data as ϕ(xi) · ϕ(xj) = K(xi, xj).
(ii) Let zij be an indicator that is equal to 1 if the ϕ(xi) is currently assigned to the jth
cluster and 0 otherwise (1 ≤ i ≤ n and 1 ≤ j ≤ k). Show that the jth cluster center cj
can be written as
∑n
i=1 αijϕ(xi), where αij only depends on zij variables.
4
(iii) Given any two data points ϕ(xi) and ϕ(xj), show that the square distance ∥ϕ(xi)− ϕ(xj)∥2
can be computed using only (linear combinations of) inner products.
(iv) Given the results of parts (ii) and (iii), show how to compute the square distance ∥ϕ(xi)− cj∥2
using only (linear combinations of) inner products between the data points x1, . . . , xn.
(v) From results from parts (ii), (iii) and (iv), propose the algorithm for kernelized version
of Lloyd’s method of finding k-means.
(vi) Implement your proposed kernelized k-means method and run it on the three example
datasets of part (i). Compare the resulting cluster for various choices of kernels (e.g.
linear kernel, quadratic kernel, rbf kernel).
(submit your datasets and kernelized code to receive full credit)
5
Bayesian interpretation of ridge regression
From distances to embeddings
An alternate learning paradigm
Exploring k-means in detail