Machine Learning Theory (CSC 482A/581A)
Problem Set 2
Instructions:
- You must write up your solutions individually.
- You may have high-level discussions with up to 2 other students registered in the course. If you discuss problems with others, include at the top of your submission: their names, V#’s, and the problems discussed.
- Your solutions should be clear and concise (and legible!). You are strongly encouraged to type up your solutions in LaTeX. For any problems where you only have a partial solution, be clear about any parts of your solution for which you have low confidence.
- If submitted through conneX, the due date is Friday, February 23rd, 7pm. This is a hard deadline. If submitted in writing, submit at the beginning of class (12:30pm) Friday, February 23rd. You are thus incentivized to submit online (using LaTeX) as you get more time!
Questions:
1. Let X = R^2 and take F to be the set of all convex polygons; the classifier corresponding to a convex polygon labels as positive all points inside the polygon (including the boundary) and labels all other points as negative. Prove that VCdim(F) = ∞.
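   As a reminder of the definition in play (standard from class, not restated in the problem): VCdim(F) = ∞ means that for every n there is some set of n points in X that F shatters, i.e.,

       ∀ n  ∃ x_1, …, x_n ∈ X  such that  ∀ y ∈ {0,1}^n  ∃ f ∈ F  with  f(x_i) = y_i for all i ∈ [n].

   So a proof must exhibit, for each n, a configuration of n points (together with realizing polygons) achieving all 2^n labelings.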
2. Let F be the class of linear separators in d dimensions, so that F = {f_{w,b} : w ∈ R^d, b ∈ R} with f_{w,b}(x) = 1[⟨w, x⟩ + b ≥ 0].
   (a) Prove that VCdim(F) ≥ d + 1.
   (b) Radon’s Theorem states that any set of d + 2 points in R^d can be partitioned into two sets A and B such that the convex hulls of A and B intersect. Using Radon’s Theorem, prove that VCdim(F) ≤ d + 1 (and hence, combined with part (a), we may conclude that VCdim(F) = d + 1).
   (c) Next, prove Radon’s Theorem. Any valid proof is allowed. Here is the start of one potential proof. Recall from linear algebra that any d + 1 points x_1, …, x_{d+1} ∈ R^d must be linearly dependent, i.e., there exists a vector λ ∈ R^{d+1} not equal to the zero vector such that

       ∑_{j=1}^{d+1} λ_j x_j = 0.

   The hint is to first prove that any set of d + 2 points x_1, …, x_{d+2} ∈ R^d must be affinely dependent, meaning that there exists a vector λ ∈ R^{d+2} not equal to the zero vector such that

       ∑_{j=1}^{d+2} λ_j x_j = 0   and   ∑_{j=1}^{d+2} λ_j = 0.
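   As a small sanity check of the statement (illustrative only, not part of the problem): for d = 1, Radon’s Theorem concerns any d + 2 = 3 points x_1 ≤ x_2 ≤ x_3 ∈ R, and the partition A = {x_1, x_3}, B = {x_2} works, since

       conv(A) = [x_1, x_3] ⊇ {x_2} = conv(B).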
3. Suppose that P is a probability distribution over R^d, and let the training sample X_1, …, X_n, X_{n+1} be i.i.d. samples with distribution P. We say that (X_1, …, X_n) is s-sparse if

       ∥X_j∥_0 ≤ s   for all j ∈ [n],

   where, for any vector x, the ℓ_0 “norm” ∥x∥_0 is defined as the number of non-zero components in x. Let ŝ be the minimum value of s ∈ {0, 1, …, d} such that (X_1, …, X_n) is s-sparse.
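   For concreteness (an illustrative example, not part of the original statement): with d = 3 and n = 2, if X_1 = (0, 3.2, 0) and X_2 = (1.1, 0, −2), then ∥X_1∥_0 = 1 and ∥X_2∥_0 = 2, so

       ŝ = max_{j ∈ [n]} ∥X_j∥_0 = 2.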
   (a) Derive an upper bound (which holds with high probability over X_1, …, X_n) on the probability that X_{n+1} is not ŝ-sparse. Specifically, your bound should be of the form: with probability at least 1 − δ,

       Pr(∥X_{n+1}∥_0 > ŝ) = O( log(1/δ) / n ).

   The bound can also depend on the dimension, but the rate with respect to n cannot be worse than O(1/n) (so O((log n)/n) is not allowed).
   (b) If your upper bound from part (a) depended on the dimension d, it degrades severely as d → ∞. Derive an upper bound that is dimension-free. Unlike part (a), the rate with respect to n now can be O((log n)/n).
4. Suppose that P is a probability distribution over the unit Euclidean ball in R^d, and let X_1, …, X_n be i.i.d. samples with distribution P.
   Using tools from class, prove that the average distance (considering all pairs) between the n points is tightly concentrated around its expectation. That is, show that

       (1/\binom{n}{2}) ∑_{1≤i<j≤n} ∥X_i − X_j∥_2

   is tightly concentrated around E_{X,Y∼P} ∥X − Y∥_2. Specifically, you should show that with probability at least 1 − δ,

       | (1/\binom{n}{2}) ∑_{1≤i<j≤n} ∥X_i − X_j∥_2 − E_{X,Y∼P} ∥X − Y∥_2 | = O( √( log(1/δ) / n ) ).
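   One observation, added here for clarity rather than taken from the original statement: since P is supported on the unit Euclidean ball, each pairwise distance is bounded,

       ∥X_i − X_j∥_2 ≤ ∥X_i∥_2 + ∥X_j∥_2 ≤ 2,

   and each sample X_i appears in exactly n − 1 of the \binom{n}{2} pairs.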