Midterm Exam for CS 565
March 23, 2021
This is a take-home exam. You should solve the problems on your own and you should cite any sources you use. As always, if your answer to a question is “I don’t know”, you get 20% of the grade.
If you are asked to provide an algorithm with a particular running time (e.g., linear) but your solution is an algorithm that does not have the required running time, then you get zero credit.
There are four problems. To get full marks you need to answer all of them.
Problem 1 (20 points)
Recall the metric properties: A distance function d : X × X → R is said to be a metric on X if:
1. (non-negativity) d(x, y) ≥ 0 for all x, y ∈ X.
2. (definiteness) d(x, y) = 0 if and only if x = y.
3. (symmetry) d(x, y) = d(y, x) for all x, y ∈ X.
4. (triangle inequality) d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ X.
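As a quick numerical illustration, the sketch below checks these four properties for the Euclidean distance on a few sample points; the helper name euclid and the test points are illustrative choices only.

```python
import itertools
import math

def euclid(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

points = [(0.0, 0.0), (1.0, 2.0), (3.0, -1.0)]
for x, y, z in itertools.product(points, repeat=3):
    assert euclid(x, y) >= 0                                    # non-negativity
    assert (euclid(x, y) == 0) == (x == y)                      # definiteness
    assert math.isclose(euclid(x, y), euclid(y, x))             # symmetry
    assert euclid(x, y) <= euclid(x, z) + euclid(z, y) + 1e-12  # triangle inequality
```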
Assume a distance function d() that is a metric. Prove that for any three points x, y, z the following holds (double triangle inequality):

d(x, y)^2 ≤ 2 · d(x, z)^2 + 2 · d(z, y)^2.
Problem 2 (30 points)
Consider as input a set of graphs G = {G1,…,Gn} such that every graph Gi = (V, Ei) is undirected and fully connected (i.e., all edges between all pairs of nodes exist). Additionally, all graphs are defined over the same set of nodes. Now assume that every edge e(x, y) ∈ Ei is associated with a real-valued weight wi(x, y). That is, every graph Gi has its own set of weights for its edges.
We define the distance between two such graphs Gi = (V, Ei) and Gj = (V, Ej) as follows:

d(Gi, Gj) = Σ_{(x,y) ∈ V×V} (wi(x, y) − wj(x, y))^2.
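For concreteness, this distance is straightforward to evaluate when each graph Gi is stored as a |V| × |V| matrix of its edge weights; the NumPy representation below is an assumption made only for illustration.

```python
import numpy as np

def graph_distance(W_i, W_j):
    """d(Gi, Gj): sum of squared differences of the edge weights,
    with each complete graph represented as a |V| x |V| weight matrix."""
    return float(np.sum((W_i - W_j) ** 2))
```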
a. (15 points) Design a polynomial-time algorithm that finds the representative of the set of graphs G.

b. (15 points) Prove that your algorithm finds the optimal representative and compute the running time of your algorithm.
Problem 3 (30 points)
Consider a set of ordered 1-dimensional points X and the problem of clustering them into k clusters using the k-means objective. If the points are sorted as X = x1, x2, …, xn and they are clustered into k clusters C1, C2, …, Ck, then the k-means objective is
E((x1,…,xn), k) = Σ_{i=1}^{k} Σ_{xj ∈ Ci} (xj − μi)^2,
where μi is the representative (i.e., the mean) of cluster Ci.
We have discussed in class that the optimal solution for partitioning points x1,…,xn into k clusters is given by the following dynamic-programming recursion:
E((x1,…,xn), k) = min_{j=1,…,n−1} ( E((x1,…,xj), k−1) + Σ_{j′=j+1}^{n} (xj′ − μ_{(j+1)n})^2 ),

where μ_{(j+1)n} is the mean of the points xj+1,…,xn.
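For reference, a direct (unoptimized) translation of this recursion might look as follows; the memoized formulation and helper names are illustrative choices and assume the points are given in sorted order.

```python
import numpy as np
from functools import lru_cache

def kmeans_1d_dp(x, k):
    """Optimal 1D k-means cost via the recursion above.

    Assumes x is sorted, so an optimal clustering uses contiguous runs of points.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)

    def cluster_cost(lo, hi):
        # cost of placing x[lo:hi] (0-indexed, hi exclusive) into a single cluster
        seg = x[lo:hi]
        return float(np.sum((seg - seg.mean()) ** 2))

    @lru_cache(maxsize=None)
    def E(m, kk):
        # optimal cost of clustering the first m points into kk clusters
        if kk == 1:
            return cluster_cost(0, m)
        # the last cluster is x[j:m] for some split point j; minimize over j
        return min(E(j, kk - 1) + cluster_cost(j, m) for j in range(kk - 1, m))

    return E(n, k)
```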
Now you are asked to solve the same problem, with the difference that at most l of the original n points can be ignored, i.e., they need not participate in any of the clusters. Let’s call this problem the (k, l)-means problem.
a. (15 points) Give the recursion that would solve the (k, l)-means problem in polynomial time.

b. (15 points) Compute the running time of the recursion.
Problem 4 (20 points)
Consider the task of finding the right number of dimensions to use when projecting a multi-dimensional dataset X, which has dimensionality d. Assume for this task that you have access to the SVD algorithm such that, given as input X and the number of dimensions k, it outputs the value of ||X − Xk||, where Xk is the k-dimensional projection of the dataset. Design an algorithm that decides what the right value of k is.
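For concreteness, the black box assumed here (given X and k, return ||X − Xk||) could be realized with a standard SVD routine as sketched below; the use of NumPy and of the Frobenius norm are illustrative assumptions.

```python
import numpy as np

def svd_residual(X, k):
    """Return ||X - X_k||, where X_k is the best rank-k approximation of X
    obtained by truncating the SVD (Frobenius norm assumed)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return float(np.linalg.norm(X - X_k))
```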