
Machine Learning Theory (CSC 482A/581A)

Problem Set 2

Instructions:

  • You must write up your solutions individually.
  • You may have high-level discussions with up to 2 other students registered in the course. If you discuss problems with others, include at the top of your submission: their names, V#’s, and the problems discussed.
  • Your solutions should be clear and concise (and legible!). You are strongly encouraged to type up your solutions in LaTeX. For any problems where you only have a partial solution, be clear about any parts of your solution for which you have low confidence.
  • If submitted through conneX, the due date is Friday, February 23rd, 7pm. This is a hard deadline. If submitted in writing, submit at the beginning of class (12:30pm) Friday, February 23rd. You are thus incentivized to submit online (using LaTeX) as you get more time!

    Questions:

  1. Let X = R^2 and take F to be the set of all convex polygons; the classifier corresponding to a convex polygon labels as positive all points inside the polygon (including the boundary) and labels all other points as negative. Prove that VCdim(F) = ∞.
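     As a quick illustration of the classifier defined in this problem (not part of the problem statement), here is a minimal Python sketch. It assumes the polygon's vertices are given in counterclockwise order and labels a point positive iff it lies weakly to the left of every directed edge; the function names and the example square are illustrative choices only.

       def polygon_classifier(vertices):
           # vertices: list of (x, y) corners of a convex polygon, counterclockwise.
           def classify(p):
               m = len(vertices)
               for i in range(m):
                   ax, ay = vertices[i]
                   bx, by = vertices[(i + 1) % m]
                   # z-component of the cross product (b - a) x (p - a); negative
                   # means p lies strictly to the right of edge a -> b, hence outside.
                   if (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax) < 0:
                       return 0   # labeled negative: outside the polygon
               return 1           # labeled positive: inside or on the boundary
           return classify

       # Example (illustrative): the unit square.
       f = polygon_classifier([(0, 0), (1, 0), (1, 1), (0, 1)])
       assert f((0.5, 0.5)) == 1 and f((2.0, 0.0)) == 0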

  2. Let F be the class of linear separators in d dimensions, so that F = {f_{w,b} : w ∈ R^d, b ∈ R}, with f_{w,b}(x) = 1[⟨w, x⟩ + b ≥ 0].

    (a) Prove that VCdim(F) ≥ d + 1.
    (b) Radon’s Theorem states that any set of d + 2 points in R^d can be partitioned into two sets A and B such that the convex hulls of A and B intersect. Using Radon’s Theorem, prove that VCdim(F) ≤ d + 1 (and hence, combined with part (a), we may conclude that VCdim(F) = d + 1).
    (c) Next, prove Radon’s Theorem. Any valid proof is allowed. Here is the start of one potential proof. Recall from linear algebra that any d + 1 points x_1, …, x_{d+1} ∈ R^d must be linearly dependent, i.e., there exists a vector λ ∈ R^{d+1} not equal to the zero vector such that

        ∑_{j=1}^{d+1} λ_j x_j = 0.

    The hint is to first prove that any set of d + 2 points x_1, …, x_{d+2} ∈ R^d must be affinely dependent, meaning that there exists a vector λ ∈ R^{d+2} not equal to the zero vector such that

        ∑_{j=1}^{d+2} λ_j x_j = 0   and   ∑_{j=1}^{d+2} λ_j = 0.
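     As a numerical sanity check of the hint in part (c) (not part of the problem statement): stacking each x_j with an appended 1 as the columns of a (d+1) × (d+2) matrix gives more columns than rows, so a nontrivial null vector λ exists, and it satisfies both ∑_j λ_j x_j = 0 and ∑_j λ_j = 0. The Python sketch below verifies this; the random points and the use of numpy's SVD to extract a null vector are illustrative choices. Splitting the indices by the sign of λ_j then yields the two sets in Radon’s Theorem.

       import numpy as np

       rng = np.random.default_rng(0)
       d = 3
       X = rng.standard_normal((d, d + 2))        # d + 2 random points in R^d, as columns

       A = np.vstack([X, np.ones((1, d + 2))])    # append an all-ones row: shape (d+1, d+2)
       lam = np.linalg.svd(A)[2][-1]              # a nontrivial null vector of A

       print(np.allclose(A @ lam, 0))             # so sum_j lam_j x_j = 0 and sum_j lam_j = 0
       # Splitting indices by the sign of lam_j gives the two sets of Radon's Theorem:
       pos, neg = np.where(lam > 0)[0], np.where(lam <= 0)[0]
       print(pos, neg)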


3. Suppose that P is a probability distribution over R^d, and let the training sample X_1, …, X_n, X_{n+1} be i.i.d. samples with distribution P. We say that (X_1, …, X_n) is s-sparse if

       ∥X_j∥_0 ≤ s   for all j ∈ [n],

   where, for any vector x, the ℓ_0 “norm” ∥x∥_0 is defined as the number of non-zero components in x.

   Let ŝ be the minimum value of s ∈ {0, 1, …, d} such that (X_1, …, X_n) is s-sparse.

(a) Derive an upper bound (which holds with high probability over X_1, …, X_n) on the probability that X_{n+1} is not ŝ-sparse. Specifically, your bound should be of the form: with probability at least 1 − δ,

       Pr(∥X_{n+1}∥_0 > ŝ) = O( log(1/δ) / n ).

    The bound can also depend on the dimension d, but the rate with respect to n cannot be worse than O(1/n) (so O(log n / n) is not allowed).

(b) If your upper bound from part (a) depended on the dimension d, it degrades severely as d → ∞. Derive an upper bound that is dimension-free. Unlike part (a), the rate with respect to n can now be O(log n / n).
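To make the definitions in this problem concrete (not part of the problem statement), here is a small Python sketch of the ℓ_0 “norm” and of the statistic ŝ. The particular sparse sampling distribution P used here is an assumed example, not given in the problem.

    import numpy as np

    def l0_norm(x):
        # Number of non-zero components of x.
        return int(np.count_nonzero(x))

    rng = np.random.default_rng(1)
    n, d = 100, 20
    # Example distribution P: each coordinate is non-zero with probability 0.1.
    X = rng.standard_normal((n, d)) * (rng.random((n, d)) < 0.1)

    # s-hat is the smallest s such that every one of X_1, ..., X_n is s-sparse,
    # i.e. the largest l0 norm observed in the sample.
    s_hat = max(l0_norm(x) for x in X)

    x_new = rng.standard_normal(d) * (rng.random(d) < 0.1)
    print(s_hat, l0_norm(x_new) <= s_hat)   # is the fresh point s-hat-sparse?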

4. Suppose that P is a probability distribution over the unit Euclidean ball in R^d, and let X_1, …, X_n be i.i.d. samples with distribution P.

Using tools from class, prove that the average distance (considering all pairs) between n points is tightly concentrated around its expectation. That is, show that

       (1/\binom{n}{2}) ∑_{1≤i<j≤n} ∥X_i − X_j∥_2

   is tightly concentrated around E_{X,Y∼P} ∥X − Y∥_2. Specifically, you should show that with probability at least 1 − δ,

       | (1/\binom{n}{2}) ∑_{1≤i<j≤n} ∥X_i − X_j∥_2 − E_{X,Y∼P} ∥X − Y∥_2 | = O( √( log(1/δ) / n ) ).
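For intuition about the quantity being studied (not part of the problem statement), the Python sketch below computes the average pairwise distance for a sample and compares it with a Monte Carlo stand-in for E_{X,Y∼P} ∥X − Y∥_2. Taking P uniform on the unit ball, and the helper names sample_unit_ball and avg_pairwise_dist, are assumed illustrative choices.

    import numpy as np

    def sample_unit_ball(rng, n, d):
        # n i.i.d. points uniform on the unit Euclidean ball in R^d (illustrative P).
        g = rng.standard_normal((n, d))
        g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform direction on the sphere
        r = rng.random(n) ** (1.0 / d)                  # radius with the correct law
        return g * r[:, None]

    def avg_pairwise_dist(X):
        # (1 / binom(n, 2)) * sum over i < j of ||X_i - X_j||_2
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        i, j = np.triu_indices(len(X), k=1)
        return D[i, j].mean()

    rng = np.random.default_rng(2)
    d, n = 5, 200
    print(avg_pairwise_dist(sample_unit_ball(rng, n, d)))

    # Monte Carlo stand-in for E ||X - Y||_2 using many independent pairs.
    A, B = sample_unit_ball(rng, 100_000, d), sample_unit_ball(rng, 100_000, d)
    print(np.linalg.norm(A - B, axis=1).mean())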