Machine Learning Theory (CSC 482A/581A)
Problem Set 2
Instructions:
- You must write up your solutions individually.
- You may have high-level discussions with up to 2 other students registered in the course. If you discuss problems with others, include at the top of your submission: their names, V#’s, and the problems discussed.
- Your solutions should be clear and concise (and legible!). You are strongly encouraged to type up your solutions in LaTeX. For any problems where you only have a partial solution, be clear about any parts of your solution for which you have low confidence.
- If submitted through conneX, the due date is Friday, February 23rd, 7pm. This is a hard deadline. If submitted in writing, submit at the beginning of class (12:30pm) Friday, February 23rd. You are thus incentivized to submit online (using LaTeX) as you get more time!
Questions:
1. Let X = R^2 and take F to be the set of all convex polygons; the classifier corresponding to a convex polygon labels as positive all points inside the polygon (including the boundary) and labels all other points as negative. Prove that VCdim(F) = ∞.
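   As a reminder of the definition in play (standard from class, not restated in the problem): VCdim(F) = ∞ means that for every n there is some set of n points in X that F shatters, i.e.,

       ∀ n  ∃ x_1, …, x_n ∈ X  such that  ∀ y ∈ {0,1}^n  ∃ f ∈ F  with  f(x_i) = y_i for all i ∈ [n].

   So a proof must exhibit, for each n, a configuration of n points (together with realizing polygons) achieving all 2^n labelings.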
2. Let F be the class of linear separators in d dimensions, so that F = {f_{w,b} : w ∈ R^d, b ∈ R} with f_{w,b}(x) = 1[⟨w, x⟩ + b ≥ 0].
   (a) Prove that VCdim(F) ≥ d + 1.
   (b) Radon’s Theorem states that any set of d + 2 points in R^d can be partitioned into two sets A and B such that the convex hulls of A and B intersect. Using Radon’s Theorem, prove that VCdim(F) ≤ d + 1 (and hence, combined with part (a), we may conclude that VCdim(F) = d + 1).
   (c) Next, prove Radon’s Theorem. Any valid proof is allowed. Here is the start of one potential proof. Recall from linear algebra that any d + 1 points x_1, …, x_{d+1} ∈ R^d must be linearly dependent, i.e., there exists a vector λ ∈ R^{d+1} not equal to the zero vector such that

       ∑_{j=1}^{d+1} λ_j x_j = 0.

   The hint is to first prove that any set of d + 2 points x_1, …, x_{d+2} ∈ R^d must be affinely dependent, meaning that there exists a vector λ ∈ R^{d+2} not equal to the zero vector such that

       ∑_{j=1}^{d+2} λ_j x_j = 0   and   ∑_{j=1}^{d+2} λ_j = 0.
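   As a small sanity check of the statement (illustrative only, not part of the problem): for d = 1, Radon’s Theorem concerns any d + 2 = 3 points x_1 ≤ x_2 ≤ x_3 ∈ R, and the partition A = {x_1, x_3}, B = {x_2} works, since

       conv(A) = [x_1, x_3] ⊇ {x_2} = conv(B).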
3. Suppose that P is a probability distribution over R^d, and let the training sample X_1, …, X_n, X_{n+1} be i.i.d. samples with distribution P. We say that (X_1, …, X_n) is s-sparse if

       ∥X_j∥_0 ≤ s   for all j ∈ [n],

   where, for any vector x, the ℓ_0 “norm” ∥x∥_0 is defined as the number of non-zero components in x. Let ŝ be the minimum value of s ∈ {0, 1, …, d} such that (X_1, …, X_n) is s-sparse.
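   For concreteness (an illustrative example, not part of the original statement): with d = 3 and n = 2, if X_1 = (0, 3.2, 0) and X_2 = (1.1, 0, −2), then ∥X_1∥_0 = 1 and ∥X_2∥_0 = 2, so

       ŝ = max_{j ∈ [n]} ∥X_j∥_0 = 2.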
   (a) Derive an upper bound (which holds with high probability over X_1, …, X_n) on the probability that X_{n+1} is not ŝ-sparse. Specifically, your bound should be of the form: with probability at least 1 − δ,

       Pr(∥X_{n+1}∥_0 > ŝ) = O( log(1/δ) / n ).

   The bound can also depend on the dimension, but the rate with respect to n cannot be worse than O(1/n) (so O((log n)/n) is not allowed).
   (b) If your upper bound from part (a) depended on the dimension d, it degrades severely as d → ∞. Derive an upper bound that is dimension-free. Unlike part (a), the rate with respect to n now can be O((log n)/n).
4. Suppose that P is a probability distribution over the unit Euclidean ball in R^d, and let X_1, …, X_n be i.i.d. samples with distribution P.
   Using tools from class, prove that the average distance (considering all pairs) between the n points is tightly concentrated around its expectation. That is, show that

       (1/\binom{n}{2}) ∑_{1≤i<j≤n} ∥X_i − X_j∥_2

   is tightly concentrated around E_{X,Y∼P} ∥X − Y∥_2. Specifically, you should show that with probability at least 1 − δ,

       | (1/\binom{n}{2}) ∑_{1≤i<j≤n} ∥X_i − X_j∥_2 − E_{X,Y∼P} ∥X − Y∥_2 | = O( √( log(1/δ) / n ) ).
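   One observation, added here for clarity rather than taken from the original statement: since P is supported on the unit Euclidean ball, each pairwise distance is bounded,

       ∥X_i − X_j∥_2 ≤ ∥X_i∥_2 + ∥X_j∥_2 ≤ 2,

   and each sample X_i appears in exactly n − 1 of the \binom{n}{2} pairs.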