UNIVERSITY OF TORONTO SCARBOROUGH FALL 2020 STAD68 MIDTERM
Allowed aids. You may refer to the book when solving these problems. You should not use the internet, any social media platform, or discuss the problems with other students.
Point system. Every Question is made up of Parts. (Note that some Questions only have one Part.) All Parts are worth the same number of points. In particular, a Question with 3 Parts is worth 3 times more points than a Question with 1 Part. Some Parts will be harder than others. Be strategic. Your answers should specify solutions for each part using, e.g., 3.4 to refer to Question 3, Part 4.
Question 1. Warm-up
Part 1.
You run a quality control company that inspects food. For one client, a potato chip company, you are asked to help them classify each potato chip as either 1) perfect, 2) undercooked, or 3) overcooked, based on an image. A loss of 1 is incurred if the detector chooses the incorrect answer. Otherwise, loss is zero. Choose a label space Y and loss function l for this situation. You may assume that H is a subset of the space of functions from images X to Y.
Y = {P, U, O}. Any three point set will do.
If we take the loss to act on a prediction and a label, l(ŷ, y), then
l(ŷ, y) = 0 if and only if ŷ = y, and 1 otherwise.
If instead the loss acts on a hypothesis and an example, l(h, (x, y)), then l(h, (x, y)) = 0 if and only if h(x) = y, and 1 otherwise.
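For concreteness, here is a minimal Python sketch of this 0-1 loss on Y = {P, U, O} (the function name is our own, not part of the question):

def zero_one_loss(y_hat, y):
    # 0-1 loss: zero exactly when the predicted label matches the true label
    return 0 if y_hat == y else 1

print(zero_one_loss("P", "P"))  # 0: perfect chip correctly identified
print(zero_one_loss("U", "O"))  # 1: every kind of mistake costs the same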
Part 2.
Consider two hypothesis spaces. Assuming approximation error is equivalent, and knowing nothing else, would you prefer to learn using a hypothesis class whose sample complexity scales proportionally to 1/ε + log(1/δ) or 1/ε² + log(1/δ)? Why?
Since I know nothing else, I may want to protect myself in the worst case, which means I should care about sample complexity, as this determines the number of samples I need in the worst case to achieve ε error with probability at least 1 − δ.
The sample complexity based on 1/ε is much better (it grows much more slowly than 1/ε²). In particular, to obtain ε = 0.01 error, I need 100 times more samples in the 1/ε² case.
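A quick back-of-the-envelope check, ignoring constants and the shared log(1/δ) term:

eps = 0.01
print(1 / eps)       # about 1e2 samples at the 1/eps rate
print(1 / eps ** 2)  # about 1e4 samples at the 1/eps^2 rate, i.e., 100 times more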
Part 3.
Consider two hypothesis spaces H,H′, with VCdim(H) < VCdim(H′), and let A(S) and A′(S) be empirical risk minimization algorithms for H and H′, respectively. Assuming approximation error is equivalent, should you always prefer A or A′ or vice versa? Explain.
The VC dimension controls the worst case performance of arbitrary ERMs. However, in this particular case, where the approximation error is equivalent (meaning that there is some fixed distribution we’re dealing with), the estimation error of A′ may be better than that of A. E.g., A′ could choose among the ERMs in a particular way that reduces variance. We know that, in the worst case, the estimation error for H is no greater than that for H′.
Part 4.
You are an ML engineer responsible for designing a voice recognition system for the next iPhone. You need to determine how many voice samples to collect from each user in order to train the system and guarantee that the system “works well”. Define a notion of “works well” that you can guarantee to deliver for each user with a fixed number of samples determined in advance. What potentially changes if you allow yourself to collect more samples for some users, based on the results of training?
Consider a classifier that learns to recognize the user from voice. This is binary classification, and so everything we’ve learned is relevant.
A natural notion of “works well” is that the recognition performance of the learned recognizer, as measured by risk, is within ε of the best achievable recognition performance. That is, small excess risk.
If the hypothesis class we choose has finite VC dimension, then we can precompute the number of i.i.d. samples necessary to achieve any desired excess risk with high probability.
If we can collect more samples as needed, then we don’t necessarily need the guarantees afforded by PAC learning. We could instead collect more samples, independent of the learned hypothesis, and use them to get an unbiased estimate of its risk.
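As an illustration (the tolerance and confidence numbers are hypothetical), a Hoeffding bound gives the number of fresh per-user samples needed to estimate the risk of the already-trained recognizer to a given precision:

import math

def holdout_size(eps, delta):
    # smallest n with 2 exp(-2 n eps^2) <= delta, so |L_T(h) - L_D(h)| <= eps
    # holds with probability at least 1 - delta on a fresh holdout T of size n
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(holdout_size(0.05, 0.01))  # about 1060 fresh samples for +/-0.05 at 99% confidence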
Question 2. Risk, risk, risk
Consider binary classification on a space X and fix a hypothesis space H. Assume we are given m i.i.d. samples S.
Let h∗ satisfy LD(h∗) = minh∈H LD(h). When asked to make a comparison, you can express your answer in terms of relations such as “greater than”, “less than”, “no greater than”, etc.
Part 1.
Assume LD(h∗) > 0. How does LS(ERMH(S)) compare to LS(h∗)? What if LD(h∗) = 0?
LS(ERMH(S)) is less than or equal to LS(h∗). (For a large data set, it will likely be strictly less than.) However, if LD(h∗) = 0, both will be equal.
Part 2.
How does LD(ERMH(S)) compare to LD(h∗)?
LD(ERMH(S)) is no less than LD(h∗). (PAC learning is about their difference going to zero.)
Part 3.
What happens to the excess risk LD(ERMH(S)) − LD(h∗) as m → ∞? Explain.
If the VC dimension of H is finite, the excess risk goes to zero as m → ∞.
If the VC dimension is infinite, the excess risk may or may not go to zero; it depends on the distribution.
Part 4.
Let S, S′ be two independent i.i.d. samples from D, both consisting of m samples. Then consider a learning algorithm
A(S, S′) which performs the following optimization:
arg max_{h∈H} LS′(h) subject to the constraint h ∈ ERMH(S).
Will this algorithm find a hypothesis with small excess risk as m → ∞? Why?
So, again, if the VC dimension is infinite, all bets are off: the excess risk may or may not go to zero.
So let’s assume finite VC dimension. In this case the algorithm works: the excess risk goes to zero. Why? Because the returned hypothesis is an ERM for a data set S whose size is diverging, and the PAC learning properties of ERM apply to any ERM for S. The fact that we’re choosing the “worst” ERM (with respect to S′) is irrelevant because of uniform convergence.
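As an illustration (the choice of hypothesis class and distribution below is ours, not part of the question), here is a small simulation sketch with 1-d threshold classifiers h_t(x) = 1[x ≥ t], uniform inputs on [0, 1], and a realizable target threshold: even the ERM on S that looks worst on S′ has true excess risk shrinking as m grows.

import numpy as np

rng = np.random.default_rng(0)
t_true = 0.3  # target concept: y = 1[x >= 0.3], data uniform on [0, 1]

for m in [10, 100, 1000, 10000]:
    x = rng.uniform(size=m); y = (x >= t_true).astype(int)
    lo = x[y == 0].max(initial=0.0)  # largest negatively labeled point
    hi = x[y == 1].min(initial=1.0)  # smallest positively labeled point
    # every threshold t in (lo, hi] has zero empirical risk on S, i.e., is an ERM
    x2 = rng.uniform(size=m); y2 = (x2 >= t_true).astype(int)
    grid = np.linspace(lo + 1e-9, hi, 200)  # approximate the ERM set by a grid
    worst = max(grid, key=lambda t: np.mean((x2 >= t) != y2))  # worst ERM w.r.t. S'
    print(m, abs(worst - t_true))  # true excess risk of the worst ERM; shrinks with m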
Part 5.
Let S,S′ be two independent i.i.d. samples from D, consisting of m and n′ samples, respectively. Then consider a
learning algorithm A(S, S′) which performs the following optimization: arg min_{h∈H} ( n′ LS′(h) − m LS(h) ).
Assume VCdim(H) is finite. Give conditions on m, n′ so that the excess risk of the learned hypothesis goes to zero as m,n′ →∞.
The optimization is equivalent to
arg min_{h∈H} ( LS′(h) − (m/n′) LS(h) ).
Thus, if m/n′ → 0 as m, n′ → ∞, then the second term contributes a vanishing amount relative to the first term. (To make this formal, you’d also exploit the fact that risk is bounded in [0, 1].) The first term being minimized is just empirical risk minimization.
Part 6.
An ε-ERM algorithm is guaranteed to find a hypothesis hˆ such that, for all h ∈ H,
LS(hˆ) ≤ LS(h) + ε.
Assume VCdim(H) is finite and assume A(S) is an f(m)-ERM, where m is the size of the sample S and f is some
fixed function such that f(n) → 0 as n → ∞. Prove a bound on the excess error of A(S) in terms of f.
With probability at least 1 − δ, for some universal constant C,
LD(A(S)) ≤ LS(A(S)) + C √( (VCdim(H) + log(1/δ)) / m )
≤ LS(h∗) + f(m) + C √( (VCdim(H) + log(1/δ)) / m )
≤ LD(h∗) + f(m) + 2C √( (VCdim(H) + log(1/δ)) / m ).
The first and third inequalities are uniform convergence (finite VC dimension); the second holds because A(S) is an f(m)-ERM, so LS(A(S)) ≤ LS(h∗) + f(m). Subtracting LD(h∗) from both sides gives the bound on the excess error in terms of f.
Part 7.
Rather than running one algorithm, you run k algorithms A1(S), . . . , Ak(S) and choose the hypothesis among these k hypotheses that achieves the best empirical risk. Let A(S) denote this hypothesis chosen based on the output of these k algorithms. Here S is a size m i.i.d. sample.
For each algorithm i = 1, . . . , k, you have a guarantee
∀D, n, δ: P_{S∼D^n}[ LD(Ai(S)) ≤ LS(Ai(S)) + Φ(S, n, δ) ] ≥ 1 − δ.
What guarantee can you derive on the difference LD(A(S)) − LS(A(S))?
By a union bound,
∀D, n, δ: P_{S∼D^n}[ ∀i, LD(Ai(S)) ≤ LS(Ai(S)) + Φ(S, n, δ) ] ≥ 1 − kδ.
Let |S| = n. Since LD(A(S)) − LS(A(S)) is LD(AJ(S)) − LS(AJ(S)) for some (random) index J, applying the union bound with δ/k in place of δ gives
LD(A(S)) − LS(A(S)) ≤ Φ(S, n, δ/k)
with probability at least 1 − δ.
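As an illustration (the Hoeffding-style choice of Φ below is ours; the question leaves Φ abstract), here is a quick sketch showing that replacing δ with δ/k only costs a factor of roughly √(log k):

import math

def phi(n, delta):
    # illustrative Hoeffding-style choice of Phi; the question leaves Phi abstract
    return math.sqrt(math.log(1 / delta) / (2 * n))

n, delta = 1000, 0.05
for k in [1, 10, 100, 1000]:
    print(k, round(phi(n, delta / k), 4))  # the penalty grows only like sqrt(log k)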
Question 3. VC dimension
Consider two hypothesis spaces H,H′, assumed to be binary predictors with inputs in a space X. Assume H ⊂ H′.
Part 1.
Assume X is infinite. How do VCdim(H) and VCdim(H′) compare? Is one necessarily bigger or smaller?
VCdim(H) ≤ VCdim(H′), i.e., the former is no larger than the latter. They could be equal or not. We cannot say definitively.
Part 2.
If X is finite, can VCdim(H) be finite? Yes.
Part 3.
If X is finite, can VCdim(H) be infinite?
No. VCdim(H) ≤ log2 |H|, and if X is finite, then so is H, since H is a subset of the 2^|X| binary functions on X.
Part 4.
If X is infinite, can VCdim(H) be finite? Yes, e.g., rectangles.
Part 5.
You find a set of points C ⊆ X such that every possible binary labeling of C is achievable by hypotheses in H. What
do you now know about VCdim(H)? VCdim(H) ≥ |C|.
Part 6.
You find a set of points C ⊆ X such that one of the possible binary labelings of C is not achievable by any of the
hypotheses in H. What do you now know about VCdim(H)? Nothing. (Some other set of size |C| could still be shattered, so this gives no bound in either direction.)
Part 7.
Is it true or false that, if VCdim(H) is infinite, it is not possible to find hˆ with small excess risk LD(hˆ)−minh∈H LD(h)?
Why?
False. VC dimension controls only the worst case excess risk. That is, for every algorithm, we can design a distribution that causes it to fail to learn. However, an algorithm could be tuned to a specific data distribution and work well for that distribution.
Part 8.
Assume H is the set of hypotheses on X = R² where, for every pair of rectangles R, R′ ⊆ R², there is h ∈ H such that
h(x) = 1 if and only if x ∈ R or x ∈ R′. Argue that the VC dimension of H is greater than 4.
This class contains the set of rectangles in two dimensions, and so it is clear that the VC dimension is at least 4. To argue that it is greater than 4, we can find 5 points that it shatters. To do this, consider 4 points arranged in a diamond (which can be shattered by a single rectangle) and then put another point far away. For every labeling of the diamond, we can achieve that labeling with one rectangle. We then use the second rectangle to label the far point as necessary.
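As a sanity check (not required for the argument), here is a brute-force Python sketch verifying that five such points, with coordinates of our own choosing, are shattered by unions of two axis-aligned rectangles. It relies on the fact that if some union of two rectangles realizes a labeling, then the bounding boxes of some split of the positive points realize it as well.

import itertools

pts = [(0, 1), (0, -1), (1, 0), (-1, 0), (10, 10)]  # diamond plus a far-away point

def realizable(positives):
    # Can a union of two axis-aligned rectangles contain exactly `positives` among pts?
    positives = list(positives)
    for r in range(len(positives) + 1):
        for group in itertools.combinations(positives, r):
            parts = [list(group), [p for p in positives if p not in group]]
            boxes = []
            for part in parts:
                if part:
                    xs, ys = zip(*part)
                    boxes.append((min(xs), max(xs), min(ys), max(ys)))
            covered_negative = any(
                b[0] <= p[0] <= b[1] and b[2] <= p[1] <= b[3]
                for p in pts if p not in positives for b in boxes)
            if not covered_negative:
                return True
    return False

print(all(realizable(s) for r in range(6) for s in itertools.combinations(pts, r)))  # True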
Question 4. Coins
There are m people in a room. You ask each of them to toss an ordinary coin k times and keep track of the count of
heads. For each person p, let θp be the fraction of times their coin came up heads, i.e., θp is their count divided by k.
Without telling anyone, and before anyone told you their count, you picked a person p0 in the room. After the counts were revealed, you also noted down a person p1 who had the largest count.
Part 1.
What is the expectation E[θp0 ]? 1/2.
Part 2.
How does θp1 compare to θp0, qualitatively? How does E[θp1] compare to E[θp0], qualitatively?
If m = 1, all quantities are equal. Assume m > 1.
Then θp1 ≥ θp0 (with probability one), and E[θp1 ] > E[θp0 ].
Part 3.
What happens to the difference θp1 − θp0 as m → ∞, and k fixed?
The difference goes to 1 − θp0 because θp1 → 1 as m → ∞ for fixed k.
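A quick simulation sketch (the choice k = 10 is arbitrary and just for illustration):

import numpy as np

rng = np.random.default_rng(0)
k = 10
for m in [10, 1000, 100000]:
    thetas = rng.binomial(k, 0.5, size=m) / k  # each person's fraction of heads
    print(m, thetas.max())  # the best fraction creeps up to 1 as m grows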
Part 4.
What happens to the difference θp1 − θp0 as k → ∞, and m fixed?
The difference goes to 0 because θp0, θp1 → 1/2 as k → ∞ for m fixed.
Part 5.
What happens to the difference θp1 − θp0 as both k, m → ∞? You may need to break down the possibilities.
θp0 → 1/2 as k → ∞ regardless of m. So the answer really depends on the behavior of θp1. By a union bound argument applied to Hoeffding’s Lemma/Inequality,
P[θp1 − 1/2 > ε] ≤ 2m exp{−2kε²}.
(The constant 2 is actually not required, since this is a one-sided tail bound.) Therefore, if 2m exp{−2kε²} → 0 as m, k → ∞, for all ε > 0, then θp1 → 1/2 (equivalently, θp1 − θp0 → 0). This holds if log m ∈ o(k), where o(k) means that the function grows slower than ck for all constants c > 0. One can also express this as m ∈ 2^{o(k)}.
This is enough for full credit.
In the other direction, suppose log m ∈ Ω(k), i.e., log m grows at least as fast as ck for some c > 0. Let’s actually assume that log m ∈ ω(k), that is, log m grows faster than ck for every c > 0. In this case, one can argue that k plays the role of the number of data points while log m plays the role of the VC dimension. Since log m / (ck) diverges as m, k → ∞ for all c > 0, in this analogy we would have no learning. For log m ≈ ck, we would get some intermediate result. As such, I believe that it is possible to make θp1 − 1/2 → d for any d ∈ [0, 1/2]; certainly the endpoint d = 1/2 (i.e., θp1 → 1) is possible. This paragraph is a bit informal: one cannot reason about this on the basis of Hoeffding’s inequality alone. One could use Slud’s Inequality, but I wouldn’t expect this for full credit.
Part 6.
Assume one person in the room has a trick coin with both sides head. How big does k have to be to reliably find the
trick coin? (Your answer should make it clear what “reliable” means.)
No matter how many flips k, the trick coin will always produce k heads. In order to find the coin, we need to make sure that no other coin comes up heads k times. Let Bi be the event that the ith person with a fair coin flips k heads. Clearly P(Bi) = 2^{−k}. Then we can find the trick coin with perfect reliability unless the event ∪i Bi happens. By the union bound,
P(∪i Bi) ≤ Σi P(Bi) = Σi 2^{−k} = m·2^{−k}.
Let’s say we want to detect the trick coin with probability at least 1 − δ. In that case, it would suffice if m·2^{−k} ≤ δ, or equivalently
k ≥ log2(m/δ).
This is an instance of uniform convergence.
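A quick numeric check with illustrative numbers (m = 1000 people, δ = 0.01):

import math

m, delta = 1000, 0.01
k = math.ceil(math.log2(m / delta))
print(k, m * 2 ** (-k) <= delta)  # 17 flips suffice to catch the trick coin with 99% reliability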