COMS 4771 SP21 HW4 Due: Fri Apr 09, 2021 at 11:59pm
This homework is to be done alone. No late homeworks are allowed. To receive credit, a typeset copy of the homework pdf must be uploaded to Gradescope by the due date. You must show your work to receive full credit. Discussing possible solutions for homework questions is encouraged on piazza and with your peers, but you must write your own individual solutions and not share your written work/code. You must cite all resources (including online material, books, articles, help taken from specific individuals, etc.) you used to complete your work.
1 A question on active learning
Suppose we are trying to learn the correct “threshold” for data in [0, 1]. Formally, let D be a distribution over [0, 1] × {0, 1}, for α ∈ [0, 1] define fα : [0, 1] → {0, 1} by fα(x) = 1[x ≥ α], and let F = {fα | α ∈ [0, 1]} be the function class we are interested in learning. Furthermore, assume that there is a “true” target concept which always labels the data correctly (i.e. assume ∃fα ∈ F : err(fα) = 0).
(a) Give a learning algorithm which, given a sample S = {(x1, y1), …, (xm, ym)} drawn iid from D, runs in time O(m) and outputs a classifier fα ∈ F with the following property: for any ε, δ > 0, if m ≥ O((1/ε²) ln(1/δ)) then with probability at least 1 − δ over the draw of S, err(fα) ≤ ε. Make sure you prove that it has this property.
It is interesting to note that, due to the assumption that there exists a true classifier in F which perfectly labels the points, it can actually be shown that only O((1/ε) ln(1/δ)) points are required to PAC learn F (you do not need to prove this). However, this is essentially tight (it can be shown that there exist distributions on which that many points are needed). This is disappointing, since the problem is simple enough that one could hope for only a logarithmic dependence on 1/ε (i.e. exponentially better than what we have right now). To achieve something like this we can look towards active learning.
Active learning is a slightly different learning framework in which the learner is allowed to have some amount of say into which examples it gets (as opposed to the PAC learning framework seen in class where the learner is simply given a set of examples). Active learning is interesting in part because it better reflects the process of learning that we are familiar with. Indeed, in real life we as learners can ask questions and demand explanations. We are never expected to learn by simply being handed a set of examples. Furthermore, in many settings active learning can yield exponential improvements over passive PAC learning. Indeed:
(b) Suppose that D is uniform over [0, 1] and that the learner, rather than being given a set of points drawn iid from D, is able to query for the true label of any point x ∈ [0, 1] (and gets the answer in O(1) time). Give a learning algorithm which, given ε > 0, makes at most O(ln(1/ε)) queries, runs in time O(ln(1/ε)), and returns some fα ∈ F such that err(fα) ≤ ε with probability 1. Make sure you prove that it has these properties.
Unfortunately there are two issues with an “active learning” solution of this type. First and foremost, D is usually unknown (above we assume it is uniform over [0, 1]) and it is not clear how to extend the above solution to work in such cases. Second, it is not always reasonable to assume that the algorithm can just ask for the label of ANY point in the input space. For example, in image classification most of the input space looks like nonsensical images to human beings. A nice way to deal with both issues is to assume that the algorithm gets points drawn iid from D as before but that they are unlabelled. The algorithm can then actively and sequentially query for the labels of these points, and do so only if the label is useful (i.e. if it brings more information). Formally, consider the following relatively general “active learning algorithm”:
Algorithm 1 Active learning algorithm
Input: A set of samples S = {x1, …, xm} drawn iid from D, the function class F, and access to a query oracle O which on input xi ∈ S outputs yi. It must be true that ∃f ∈ F such that err(f) = 0.
Let V = F
for xi ∈ S do
    if ∃f1, f2 ∈ V such that f1(xi) ≠ f2(xi) then
        Get yi by making the query O(xi)
        Set V to be {f ∈ V : f(xi) = yi}
    end if
end for
return Any f ∈ V.
The issue with this algorithm is that it is often difficult to check the condition of the if statement and update V efficiently (i.e. without using too much running time). Also it is not clear in which order one should iterate through S. But for some simple problems there are solutions to these issues, in which case this active learning algorithm can give a much better label complexity (number of labeled points it needs) than the standard PAC alternatives.
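For the threshold class, both issues are mild: the version space V is always an interval of candidate thresholds, so the if-condition is easy to check. As an illustrative sketch (not a solution to part (c)), here is a minimal Python rendering of Algorithm 1 that keeps V as a finite grid of candidate thresholds; the grid resolution, pool, and oracle are hypothetical choices made for this example:

```python
import random

def active_learn_thresholds(xs, oracle):
    """Algorithm 1 specialized to thresholds f_a(x) = 1[x >= a],
    with the version space V kept as a finite grid of candidates."""
    V = [j / 1000 for j in range(1001)]  # discretized version space
    n_queries = 0
    for x in xs:
        # Two hypotheses in V disagree on x iff some threshold in V
        # is <= x while another is > x.
        if min(V) <= x < max(V):
            y = oracle(x)                # query the label oracle O
            n_queries += 1
            V = [a for a in V if (1 if x >= a else 0) == y]
    return V[0], n_queries               # any surviving hypothesis

# Hypothetical usage: true threshold 0.3, uniform unlabeled pool.
random.seed(0)
pool = [random.random() for _ in range(200)]
a_hat, q = active_learn_thresholds(pool, lambda x: 1 if x >= 0.3 else 0)
```

On a random pool the set of informative points shrinks quickly, so far fewer than m labels are requested; guaranteeing the O(log m) query bound of part (c) on every draw takes a more careful iteration order.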
(c) Going back to our problem from (a) and (b), rewrite Algorithm 1 so that:
– Your algorithm runs in time O(m log(m)) (you can assume each call to O takes O(1) time).
– Your algorithm makes O(log m) queries (calls to O).
– If m ≥ O((1/ε²) ln(1/δ)) then with probability at least 1 − δ over the draw of S your algorithm returns fα such that err(fα) ≤ ε.
As always make sure to prove that your algorithm has each of these properties.
Note that the label complexity of your new algorithm (i.e. how many examples it needs to have labelled) is exponentially smaller than that of the algorithm from part (a).
2 BER / application of large deviation theory
Suppose we have a distribution D over X × {0, 1}, a sample S = {(x1, y1), …, (xm, ym)} drawn iid from D, and a classifier f : X → {0, 1}. We have usually looked at the 0-1 error of f, which is given by P(x,y)∼D[f(x) ≠ y] and which can be estimated via (1/m) Σ_{i=1}^m 1[f(xi) ≠ yi].
Sometimes, it is more reasonable to look at the Balanced Error Rate (BER) of f which is defined as:
BER(f) = (1/2) (P(x,y)∼D[f(x) ≠ y | y = 1] + P(x,y)∼D[f(x) ≠ y | y = 0])
This is useful when one needs a classifier that has good accuracy on both classes (points where y = 1 and points where y = 0) despite having a distribution with very different priors (i.e. when one class is much more prevalent than the other).
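The definition above has a natural empirical analogue: average the two per-class empirical error rates. A minimal sketch (the function name and the handling of a class absent from the sample are our own choices):

```python
def empirical_ber(preds, labels):
    """Plug-in empirical analogue of BER: the average of the two
    per-class empirical error rates (preds, labels are 0/1 lists)."""
    err1 = [p != y for p, y in zip(preds, labels) if y == 1]
    err0 = [p != y for p, y in zip(preds, labels) if y == 0]
    # Convention chosen here: a class absent from the sample
    # contributes zero error (the estimate is undefined otherwise).
    r1 = sum(err1) / len(err1) if err1 else 0.0
    r0 = sum(err0) / len(err0) if err0 else 0.0
    return 0.5 * (r1 + r0)
```

For example, a classifier that is right on half of each class gets empirical BER 0.5 regardless of how imbalanced the sample is.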
Propose an estimator BER(f, S) of BER(f) with the following property: for any ε ∈ (0, 1), if

    m ≥ 1000 / (ε² min(P[y = 1], P[y = 0]))

then with probability at least 0.99 (over the draw of S),

    |BER(f, S) − BER(f)| ≤ ε.

Then show that your estimator has this property.
Hint: Use the Chernoff-Hoeffding bound seen in class.

3 Studying k-means
Recall that in k-means clustering we attempt to find k cluster centers cj ∈ Rd, j ∈ {1, …, k}, such that the total (squared) distance between each datapoint and the nearest cluster center is minimized. In other words, we attempt to find c1, …, ck that minimizes

    Σ_{i=1}^n min_{j∈{1,…,k}} ∥xi − cj∥²,    (1)

where n is the total number of datapoints. To do so, we iterate between assigning xi to the nearest cluster center and updating each cluster center cj to the average of all points assigned to the jth cluster (aka Lloyd’s method).
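As a concrete reference point for the iteration just described, here is a minimal pure-Python sketch of Lloyd’s method (the initialization by sampling k datapoints and the empty-cluster handling are our own choices):

```python
import random

def lloyd_kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's method: alternate nearest-center assignment
    and center averaging until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # init: k random datapoints
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: index of the nearest center for each point.
        assign = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(p, centers[j])))
                  for p in points]
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [p for p, c in zip(points, assign) if c == j]
            if members:
                new_centers.append(tuple(sum(v) / len(members)
                                         for v in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:       # converged
            break
        centers = new_centers
    return centers, assign

# Hypothetical usage: two well-separated blobs in R^2.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, assign = lloyd_kmeans(pts, k=2)
```

Note that this only converges to a local optimum of (1); parts (a) and (b) below probe exactly the ways this can go wrong.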
(a) [it is unclear how to find the best k, i.e. estimate the correct number of clusters!] Instead of holding the number of clusters k fixed, one can think of minimizing (1) over both k and the centers c. Show that this is a bad idea. Specifically, what is the minimum possible value of (1)? What values of k and c result in this value?
(b) [suboptimality of Lloyd’s method] For the case d = 1 (and k ≥ 2), show that Lloyd’s algorithm is not optimal. That is, there is a suboptimal setting of cluster assignment for some dataset (with d = 1) for which Lloyd’s algorithm will not be able to improve the solution.
(c) [improving k-means quality] k-means with the Euclidean distance metric assumes that each pair of clusters is linearly separable (see part (ii) below). This may not be the desired result. A classic example is where we have two clusters corresponding to data points on two concentric circles in R2.
(i) Implement Lloyd’s method for the k-means algorithm and show the resulting cluster assignment for the dataset depicted above. Give two more examples of datasets in R2 for which the optimal k-means setting results in an undesirable clustering. Show the resulting cluster assignment for the two additional example datasets.
(ii) Show that for k = 2, for any (distinct) placement of centers c1 and c2 in Rd, the cluster boundary induced by minimizing the k-means objective (i.e. Eq. 1) is necessarily linear.
One way to get a more flexible clustering is to run k-means in a transformed space. The transformation and clustering is done as follows:
• Let Gr denote the r-nearest neighbor graph induced on the given dataset (say the dataset has n datapoints); that is, the datapoints are the vertices (notation: vi is the vertex associated with datapoint xi) and there is an edge between vertices vi and vj if datapoint xj is one of the r closest neighbors of datapoint xi.
• Let W denote the n × n edge matrix, where
wij = 1[∃ edge between vi and vj in Gr].
• Define the n × n diagonal matrix D as dii := Σ_j wij, and finally define L = D − W.
• Compute the bottom k eigenvectors/eigenvalues of L (that is, the eigenvectors corresponding to the k smallest eigenvalues). Let V be the n × k matrix of the bottom eigenvectors. One can view this matrix V as a k-dimensional representation of the n datapoints.
• Run k-means on the datamatrix V and return the clustering induced.
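The pipeline in the bullets above can be sketched with NumPy as follows; note that this sketch symmetrizes the r-NN adjacency (an edge if either endpoint selects the other), which is one common reading of the edge definition:

```python
import numpy as np

def spectral_embed(X, r, k):
    """Build the r-NN graph on the rows of X, form L = D - W, and
    return the n x k matrix V of bottom-k eigenvectors of L."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:r + 1]  # r nearest neighbors, excluding self
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)   # symmetrize: edge if either endpoint picks the other
    D = np.diag(W.sum(axis=1))
    L = D - W                # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return vecs[:, :k]              # bottom-k eigenvectors as columns

# Hypothetical example: two well-separated groups of 3 points each.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
V = spectral_embed(X, r=2, k=2)
# Rows of V coincide within a connected component of the graph, so
# ordinary k-means on the rows of V recovers the two groups.
```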
We’ll try to gain a better understanding of this transformation V (which is basically the lower
order spectra of L).
(iii) Show that for any vector f ∈ Rn,

    fᵀLf = (1/2) Σ_{i,j} wij (fi − fj)².
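A quick numerical sanity check of this identity (illustrative only, on a random symmetric 0/1 matrix W with zero diagonal):

```python
import numpy as np

# Random symmetric 0/1 weight matrix with zero diagonal.
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(5, 5))
W = np.triu(A, 1) + np.triu(A, 1).T

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # Laplacian
f = rng.standard_normal(5)

lhs = f @ L @ f              # quadratic form f^T L f
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                for i in range(5) for j in range(5))
```

The two quantities agree up to floating-point error, matching the identity.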
(iv) Show that L is a symmetric positive semi-definite matrix.
(v) Let the graph Gr have k connected components C1, …, Ck. Show that the n × 1 indicator vectors 1C1, …, 1Ck are (unnormalized) eigenvectors of L with eigenvalue 0. (The ith component of the indicator vector 1Cj takes value one iff the vertex vi is in the connected component Cj.)
Part (v) gives us some indication of why the transformation V (low order spectra of L) is a reasonable representation. Basically: (i) vertices belonging to the same connected component/cluster (i.e. datapoints connected with a “path”, even if they are located far away or form odd shapes) will have the same value in the representation V = [1C1, …, 1Ck], and (ii) vertices belonging to different connected components/clusters will have distinct representations, thus making it easier for a k-means type algorithm to recover the clusterings.
(vi) For each of the datasets in part (i) (there are three datasets in total), run this flexible version of k-means in the transformed space. Show the resulting clustering assignment on all the datasets. Does it improve the clustering quality? How does the choice of r (in Gr) affect the result?
(You must submit your code for parts (i) and (vi) to receive full credit.)