
COMS 4771 HW4 Due: Sat Apr 21, 2018 at 11:59pm

You are allowed to work in groups of (at max) three students. These group members don’t necessarily have to be the same as in previous homeworks. Only one submission per group is required by the due date. The name and UNI of all group members must be clearly specified on the homework. You must cite all the references you used to do this homework. You must show your work to receive full credit.

1 [PAC learning and VC dimension]

(a) Consider a domain with n binary features and binary class labels. Let Fd be the hypothesis class that contains all decision trees over these features that have depth at most d. Give the tightest possible VC-dimension estimate of Fd.

(b) [a sharper learning rate for finding perfect predictors!] Consider a finite-size hypothesis class F that contains the perfect predictor, that is, ∃ f∗ ∈ F, s.t. ∀ (x, y) ∼ D(X × Y), y = f∗(x). We will show that for such a hypothesis class F, we can derive a sharper sample complexity rate than what was shown in class (i.e., Occam’s Razor bound).

(i) For a fixed ǫ > 0, we say that a classifier f is ǫ-bad if P(x,y)∼D[f(x) = y] < 1 − ǫ.

What is the chance that the output of a given ǫ-bad f matches the true label for all m samples drawn i.i.d. from D? That is, for a given ǫ-bad f, calculate

P_{(xi,yi)_{i=1}^m ∼ D}[ ∀i, f(xi) = yi | f is ǫ-bad ].

(ii) What is the chance that the event in part (i) occurs for some f ∈ F? That is, calculate

P_{(xi,yi)_{i=1}^m ∼ D}[ ∃ ǫ-bad f ∈ F such that ∀i, f(xi) = yi ].

(iii) Observe that the event in part (ii) is a “bad” event: an ERM-algorithm can pick an ǫ-bad classifier if that classifier happens to perfectly classify the m i.i.d. samples. We want to limit the probability of this bad event to at most δ > 0. So how many i.i.d. samples suffice to ensure that, with probability at least 1 − δ, an ERM-algorithm does not return an ǫ-bad classifier? (Two standard facts that may be useful are recalled below.)
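Two standard facts that may be useful here, recalled purely for reference (they are not a substitute for the derivations asked for above): for any ǫ ∈ [0, 1], any integer m ≥ 1, and any finite collection of events {A_f}_{f∈F},

(1 − ǫ)^m ≤ e^{−ǫm}    and    P[ ∪_{f∈F} A_f ] ≤ Σ_{f∈F} P[A_f].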


(c) Given a collection of models F, suppose you were able to develop an algorithm A : (xi, yi)_{i=1}^n ↦ f_n^A (that is, given n labeled training samples, A returns a model f_n^A ∈ F) that has the following property: for all ǫ > 0, with probability 0.55 (over the draw of n = O(1/ǫ²) samples),

err(f_n^A) − inf_{f∈F} err(f) ≤ ǫ,

where err(f) := P(x,y)[f(x) ≠ y].

Show that one can construct an algorithm B : (xi, yi)_{i=1}^{n′} ↦ f_{n′}^B with the property: for all ǫ > 0 and all δ > 0, with probability at least 1 − δ over a draw of n′ samples,

err(f_{n′}^B) − inf_{f∈F} err(f) ≤ ǫ.

Moreover, show that n′ = poly(1/ǫ, 1/δ) samples are enough for the algorithm B to return such a model f_{n′}^B. Hence, the model class F is efficiently PAC-learnable.

(Hint: Algorithm B can make multiple calls to the algorithm A.)

2 [Exploring k-means in detail] Recall that in k-means clustering we attempt to find k cluster centers cj ∈ R^d, j ∈ {1, . . . , k}, such that the total (squared) distance between each datapoint and its nearest cluster center is minimized. In other words, we attempt to find c1, …, ck that minimize

Σ_{i=1}^{n} min_{j∈{1,…,k}} ∥xi − cj∥²,        (1)

where n is the total number of datapoints. To do so, we iterate between assigning xi to the nearest cluster center and updating each cluster center cj to the average of all points assigned to the jth cluster (aka Lloyd’s method).
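For reference, here is a minimal sketch of Lloyd’s method in Python/NumPy. It is an illustrative outline rather than a required implementation; the random initialization, the fixed iteration budget, and the convergence test are assumptions, not part of the problem statement.

import numpy as np

def lloyds_kmeans(X, k, n_iters=100, seed=0):
    # X: (n, d) array of datapoints; k: number of clusters.
    # Returns the cluster centers and the assignment of each point.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iters):
        # Assignment step: send each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # no change: converged
            break
        centers = new_centers
    return centers, assign

# Example usage, where X is an (n, d) NumPy array of datapoints:
# centers, assign = lloyds_kmeans(X, k=2)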

(a) [it is unclear how to find the best k, i.e. estimate the correct number of clusters!] Instead of holding the number of clusters k fixed, one can think of minimizing (1) over both k and the centers c. Show that this is a bad idea. Specifically, what is the minimum possible value of (1)? What values of k and c result in this value?

(b) [efficient optimal solution when either k = 1 or d = 1] Optimizing the k-means objective (1) for the general case when k ≥ 2 and d ≥ 2 is NP-hard. But one can find an optimal solution efficiently when either k = 1 or d = 1.

(i) What is the optimal objective value and the setting of the cluster centers in the case of k = 1?

(ii) For the case d = 1 (and k ≥ 2), show that Lloyd’s algorithm is not optimal. That is, there is a suboptimal setting of the cluster assignment for some dataset (with d = 1) for which Lloyd’s algorithm will not be able to improve the solution.
(iii) (note: this part is extra credit, only attempt it once everything else is done)
(iii) (note: this part is extra credit, only attempt it once everything else is done)

Propose an efficient algorithm (that is, an algorithm that runs in a number of steps polynomial in n and k) for optimally solving k-means in the d = 1 case.
(Hint: Use the fact that when d = 1, the dataset can be ordered as x_{i1} ≤ x_{i2} ≤ ··· ≤ x_{in}, and that each cluster in the optimal assignment will be a non-overlapping interval.)


(c) [kernelizing k-means] k-means with the Euclidean distance metric assumes that each pair of clusters is linearly separable. This may not be the case. A classic example is where we have two clusters corresponding to data points on two concentric circles in R².
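For concreteness, here is a small Python/NumPy sketch that generates such a two-concentric-circles dataset; the particular radii, sample sizes, and noise level are illustrative assumptions, not values given by the problem.

import numpy as np

def concentric_circles(n_per_circle=200, radii=(1.0, 3.0), noise=0.05, seed=0):
    # Two clusters in R^2: points scattered around circles of the given radii.
    rng = np.random.default_rng(seed)
    rings = []
    for r in radii:
        angles = rng.uniform(0.0, 2.0 * np.pi, size=n_per_circle)
        ring = np.column_stack([r * np.cos(angles), r * np.sin(angles)])
        rings.append(ring + noise * rng.standard_normal(ring.shape))
    return np.vstack(rings)

X = concentric_circles()   # (400, 2) array; one ring per intended cluster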

(i) Implement Lloyd’s method for the k-means algorithm and show the resulting cluster assignment for the concentric-circles dataset described above. Give two more examples of datasets in R² for which the optimal k-means solution results in an undesirable clustering. Show the resulting cluster assignments for the two additional example datasets.

Let φ denote an explicit non-linear feature space mapping into some inner product space. We will show how one can derive an implicit kernel-based version of Lloyd’s method for k-means, where we only operate on the data through inner products φ(xi) · φ(xj) = K(xi, xj). (A small illustrative sketch of the resulting distance computation is given after part (vi) below.)

(ii) Let zij be an indicator variable that is equal to 1 if φ(xi) is currently assigned to the jth cluster and 0 otherwise (1 ≤ i ≤ n and 1 ≤ j ≤ k). Show that the jth cluster center cj can be written as Σ_{i=1}^{n} αij φ(xi), where αij only depends on the zij variables.
(iii) Given any two data points φ(xi) and φ(xj), show that the squared distance ∥φ(xi) − φ(xj)∥² can be computed using only (linear combinations of) inner products.
(iv) Given the results of parts (ii) and (iii), show how to compute the squared distance ∥φ(xi) − cj∥² using only (linear combinations of) inner products between the data points x1, …, xn.
(v) Using the results from parts (ii), (iii) and (iv), propose an algorithm for a kernelized version of Lloyd’s method for finding the k-means clustering.
(vi) Implement your proposed kernelized k-means method and run it on the three example datasets of part (i). Compare the resulting clusterings for various choices of kernels (e.g., linear kernel, quadratic kernel, RBF kernel).
(submit your datasets and kernelized code on Courseworks to receive full credit)
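To make the kernelized assignment step concrete, here is a minimal Python/NumPy sketch that recomputes cluster assignments using only the Gram matrix K, with uniform weights αij = zij / nj. It is an illustration of the idea in parts (ii)–(iv), not the required submission; the function name and the empty-cluster handling are assumptions.

import numpy as np

def kernel_kmeans_assign(K, assign, k):
    # K: (n, n) Gram matrix with K[i, l] = K(x_i, x_l).
    # assign: current cluster labels in {0, ..., k-1}; returns new labels.
    # Uses ||phi(x_i) - c_j||^2 = K[i, i]
    #   - (2 / n_j) * sum_{l in C_j} K[i, l]
    #   + (1 / n_j^2) * sum_{l, m in C_j} K[l, m],
    # so only kernel evaluations are needed.
    n = K.shape[0]
    dist2 = np.zeros((n, k))
    for j in range(k):
        members = np.flatnonzero(assign == j)
        nj = max(len(members), 1)                 # guard against empty clusters
        dist2[:, j] = (np.diag(K)
                       - 2.0 * K[:, members].sum(axis=1) / nj
                       + K[np.ix_(members, members)].sum() / nj ** 2)
    return dist2.argmin(axis=1)

# Example: starting from a random assignment and a precomputed Gram matrix K,
# iterate kernel_kmeans_assign until the labels stop changing.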

3 [different learning rates for different strategies] Here we will explore how the rate at which an algorithm learns a concept can change based on how labeled examples are obtained. For that, we look at three settings: (i) an active learning setting, where the algorithm has the luxury of specifying a data point and querying its label; (ii) a passive learning setting, where labeled examples are drawn at random; and (iii) an adversarial setting, where training examples are given by an adversary that tries to make your life hard.

Consider a binary classification problem where each data point consists of d binary features. Let F be the hypothesis class of conjunctions of subsets of the d features and their negations. So for example one hypothesis could be f1(x) = x1 ∧ x2 ∧ ¬xd (where ∧ denotes the logical “and” and ¬ denotes the logical “not”). Another hypothesis could be f2(x) = ¬x3 ∧ x5. A conjunction in F cannot contain both xi and ¬xi. We assume a consistent learning scenario where there exists a hypothesis f ∗ ∈ F that is consistent with the true label for all data points.
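For concreteness, here is a small Python sketch of one way a hypothesis in F could be represented and evaluated on a binary feature vector. The representation as two index sets is an illustrative choice, not something the problem specifies.

import numpy as np

def make_conjunction(positive_idx, negated_idx):
    # positive_idx: indices i such that the literal x_i appears in the conjunction.
    # negated_idx:  indices i such that the literal (not x_i) appears.
    def f(x):
        x = np.asarray(x, dtype=bool)
        return bool(x[positive_idx].all() and (~x[negated_idx]).all())
    return f

# Example with d = 5 features: the hypothesis x1 ∧ x2 ∧ ¬x5
# (1-based, as in the problem) uses 0-based indices 0, 1 and 4 here.
f1 = make_conjunction(positive_idx=[0, 1], negated_idx=[4])
print(f1([1, 1, 0, 1, 0]))   # True: x1 = x2 = 1 and x5 = 0
print(f1([1, 1, 0, 1, 1]))   # False: x5 = 1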

(i) In the active learning setting, the learning algorithm can query the label of an unlabeled example. Assume that you can query any possible example. Show that, starting with a single positive example, you can exactly learn the true hypothesis f∗ using d queries.
(ii) In the passive learning setting, the examples are drawn i.i.d. from an underlying fixed distribution D. How many examples are sufficient to guarantee a generalization error less than ǫ with probability ≥ 1 − δ?


(iii) Show that if the training data is not representative of the underlying distribution, a consistent hypothesis can perform poorly. Specifically, assume that the true hypothesis f∗ is a conjunction of k out of the d features for some k > 0, and that all possible data points are equally likely. Show that there exists a training set of 2^(d−k) unique examples and a hypothesis f that is consistent with this training set but achieves a classification error ≥ 50% when tested on all possible data points.

4 [From distances to embeddings] Your friend from overseas is visiting you and asks you for the geographical locations of popular US cities on a map. Not having access to a US map, you realize that you cannot provide your friend with accurate information. You recall that you have access to the relative distances between nine popular US cities, given by the following distance matrix D:

Distances (D)   BOS   NYC    DC   MIA   CHI   SEA    SF    LA   DEN
BOS               0   206   429  1504   963  2976  3095  2979  1949
NYC             206     0   233  1308   802  2815  2934  2786  1771
DC              429   233     0  1075   671  2684  2799  2631  1616
MIA            1504  1308  1075     0  1329  3273  3053  2687  2037
CHI             963   802   671  1329     0  2013  2142  2054   996
SEA            2976  2815  2684  3273  2013     0   808  1131  1307
SF             3095  2934  2799  3053  2142   808     0   379  1235
LA             2979  2786  2631  2687  2054  1131   379     0  1059
DEN            1949  1771  1616  2037   996  1307  1235  1059     0

Being a machine learning student, you believe that it may be possible to infer the locations of these cities from the distance data. To find an embedding of these nine cities on a two-dimensional map, you decide to solve it as an optimization problem as follows.

You associate a two-dimensional variable xi as the unknown latitude and longitude value for each of the nine cities (that is, x1 is the lat/lon value for BOS, x2 is the lat/lon value for NYC, etc.). You write down the (unconstrained) optimization problem

minimize_{x1,…,x9}  Σ_{i,j} ( ∥xi − xj∥ − Dij )²,

where Σ_{i,j} ( ∥xi − xj∥ − Dij )² denotes the embedding discrepancy function.
(i) What is the derivative of the discrepancy function with respect to a location xi?

(ii) Write a program in your preferred language to find an optimal setting of the locations x1, . . . , x9. (A minimal illustrative optimization sketch is given after part (iii) below.) You must submit your code to Courseworks to receive full credit.

(iii) Plot the result of the optimization, showing the estimated locations of the nine cities. (Here is sample code to plot the city locations in Matlab.)

>> cities = {'BOS','NYC','DC','MIA','CHI','SEA','SF','LA','DEN'};
>> locs = [x1; x2; x3; x4; x5; x6; x7; x8; x9];
>> figure; text(locs(:,1), locs(:,2), cities);

What can you say about the estimated locations compared to the actual geographical locations of these cities?
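For part (ii), here is a minimal illustrative sketch in Python/NumPy of one way to minimize the discrepancy numerically; the random initialization and the use of SciPy's generic BFGS solver are assumptions, not requirements, and the result is only a local minimizer of this non-convex objective. The distance matrix is transcribed from the table above.

import numpy as np
from scipy.optimize import minimize

# Distance matrix D, cities ordered BOS, NYC, DC, MIA, CHI, SEA, SF, LA, DEN.
D = np.array([
    [   0,  206,  429, 1504,  963, 2976, 3095, 2979, 1949],
    [ 206,    0,  233, 1308,  802, 2815, 2934, 2786, 1771],
    [ 429,  233,    0, 1075,  671, 2684, 2799, 2631, 1616],
    [1504, 1308, 1075,    0, 1329, 3273, 3053, 2687, 2037],
    [ 963,  802,  671, 1329,    0, 2013, 2142, 2054,  996],
    [2976, 2815, 2684, 3273, 2013,    0,  808, 1131, 1307],
    [3095, 2934, 2799, 3053, 2142,  808,    0,  379, 1235],
    [2979, 2786, 2631, 2687, 2054, 1131,  379,    0, 1059],
    [1949, 1771, 1616, 2037,  996, 1307, 1235, 1059,    0],
], dtype=float)

def discrepancy(flat_x):
    # Embedding discrepancy: sum over i, j of (||x_i - x_j|| - D_ij)^2.
    X = flat_x.reshape(9, 2)
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    return ((dists - D) ** 2).sum()

rng = np.random.default_rng(0)
x0 = 1000.0 * rng.standard_normal(18)          # random starting locations
result = minimize(discrepancy, x0, method="BFGS")
locs = result.x.reshape(9, 2)                  # estimated 2-D city locations
print(locs)

Note that the recovered embedding is only determined up to translation, rotation, and reflection, which is worth keeping in mind when comparing it with the actual geographical locations in part (iii).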
