
EECS 70 Discrete Mathematics and Probability Theory Fall 2020
Concentration Inequalities and the Laws of Large Numbers
Note 17
Suppose we have a biased coin, but we don’t know what the bias is. To estimate the bias, we toss the coin n times and count how many Heads we observe. Then our estimate of the bias is given by p̂ = (1/n)Sn, where Sn is the number of Heads in the n tosses. Is this a good estimate? Let p denote the true bias of the coin, which is unknown to us. Since E[Sn] = np, we see that the estimator p̂ has the correct expected value: E[p̂] = (1/n)E[Sn] = p. This means when n is sufficiently large, we can expect p̂ to be very close to p; this is a manifestation of the Law of Large Numbers, which we shall see at the end of this note.
How large should n be to guarantee that our estimate pˆ is within an error ε of the true bias p, i.e., | pˆ − p| ≤ ε ?
The answer is that we can never guarantee with absolute certainty that |p̂ − p| ≤ ε. This is because Sn is a random variable that can take any integer value between 0 and n, and thus p̂ = (1/n)Sn is also a random variable that can take values between 0 and 1. So regardless of the value of the true bias p ∈ (0, 1), it is possible that in our experiment of n coin tosses we observe n Heads (this happens with probability p^n), in which case Sn = n and p̂ = 1. Similarly, it is also possible that all n coin tosses come up Tails (this happens with probability (1 − p)^n), in which case Sn = 0 and p̂ = 0.
So instead of requiring that |pˆ − p| ≤ ε with absolute certainty, we relax our requirement and only demand that | pˆ − p| ≤ ε with confidence 1 − δ , namely, P[ | pˆ − p| ≤ ε ] ≥ 1 − δ . This means there is a small probability δ that we make an error of more than ε , but with high probability (at least 1 − δ ) our estimate is very close to p. Now we can state our result: to guarantee that P[ | pˆ − p| ≤ ε ] ≥ 1 − δ , it suffices to toss the coin n times where
n ≥ 1/(4ε²δ).
To prove such a result, we use a mathematical tool called Chebyshev’s Inequality, which provides a quantitative, probabilistic bound on how far away a random variable can be from its expected value.
1 Markov’s Inequality
Before discussing Chebyshev’s inequality, we first prove the following simpler bound, which applies only to nonnegative random variables (i.e., r.v.’s which take only values ≥ 0).
Theorem 17.1 (Markov’s Inequality). For a nonnegative random variable X (i.e., X(ω) ≥ 0 for all ω ∈ Ω) with finite mean,
P[X ≥ c] ≤ E[X]/c,
for any positive constant c.

Proof. Let A denote the range of X and consider any constant c ∈ A. Then,
E[X] = ∑_{a∈A} a · P[X = a]
     ≥ ∑_{a∈A: a≥c} a · P[X = a]
     ≥ ∑_{a∈A: a≥c} c · P[X = a]
     = c · ∑_{a∈A: a≥c} P[X = a]
     = c · P[X ≥ c],
where the first line is just the definition of expectation, while the second line follows from the fact that X is a nonnegative random variable (i.e., every a ∈ A satisfies a ≥ 0). Rearranging the last inequality gives us the desired result.
A slicker proof can be provided using the indicator function I{·}, defined as
I{statement} = 1 if the statement is true, and 0 if the statement is false.
Alternative proof of Theorem 17.1. Since X is a nonnegative random variable and c > 0, we have, for all ω ∈ Ω,
X(ω) ≥ c · I{X(ω) ≥ c},    (1)
since the right hand side is 0 if X(ω) < c and is c if X(ω) ≥ c. Multiplying both sides by P[ω] and summing over ω ∈ Ω gives
E[X] ≥ c · E[I{X ≥ c}] = c · P[X ≥ c],
where the first inequality follows from (1) and the fact that P[ω] ≥ 0 for all ω ∈ Ω, while the last equality follows from the fact that I{X ≥ c} is an indicator random variable.
If we have a random variable Y that is not necessarily nonnegative, then the same line of argument adopted above can be applied to prove the following result for the absolute value |Y| of Y:
Theorem 17.2 (Generalized Markov’s Inequality). Let Y be an arbitrary random variable with finite mean. Then, for any positive constants c and r,
P[|Y| ≥ c] ≤ E[|Y|^r]/c^r.
Proof. For c > 0 and r > 0, we have
|Y|^r ≥ |Y|^r · I{|Y| ≥ c} ≥ c^r · I{|Y| ≥ c}.
(Note that the last inequality would not hold if r were negative.) Taking expectations of both sides gives
E[|Y|^r] ≥ c^r · E[I{|Y| ≥ c}] = c^r · P[|Y| ≥ c],
and dividing by c^r leads to the desired result.
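As a quick numerical sanity check (a sketch of ours, not part of the original note), the following Python snippet estimates both sides of the Generalized Markov bound by simulation. The choice of Y as a standard normal and the constants c and r are arbitrary illustrative assumptions.

```python
import random

# Monte Carlo check of Generalized Markov: P[|Y| >= c] <= E[|Y|^r] / c^r.
# Y ~ Normal(0, 1); the constants c and r below are arbitrary illustrative choices.
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]
c = 3.0

tail = sum(abs(y) >= c for y in samples) / len(samples)  # empirical P[|Y| >= c]
print(f"empirical P[|Y| >= {c}] ~ {tail:.5f}")

for r in (1, 2, 4):
    moment = sum(abs(y) ** r for y in samples) / len(samples)  # estimate of E[|Y|^r]
    print(f"r = {r}: bound E[|Y|^r] / c^r ~ {moment / c**r:.5f}")
```

For this particular Y, larger r happens to give a tighter bound; the inequality itself is valid for every positive r.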
There is an intuitive (leveraging your physical intuition) way to understand Markov’s inequality through an analogy of balancing a seesaw, illustrated in Figure 1. Imagine that the probability distribution of a nonnegative random variable X is resting on a fulcrum at μ = E[X]. We are trying to find an upper bound on the percentage of the distribution which lies beyond kμ, i.e., P[X ≥ kμ]. In other words, we seek to add as much mass m2 as possible on the seesaw at kμ while minimizing the effect it has on the seesaw’s balance. This mass will represent the upper bound we are searching for. To minimize the mass’s effect, we must imagine that the mass of the distribution which lies beyond kμ is concentrated at exactly kμ. However, to keep things stable and maximize the mass at kμ, we must add another mass m1 as far to the left of the fulcrum as we can, so that m2 is as large as it can be. The farthest we can go to the left is 0, since X is assumed to be nonnegative. Moreover, the two masses m1 and m2 must add up to 1, since they represent the area under the distribution curve. Since the lever arms are in the ratio k − 1 to 1, a unit mass at kμ balances k − 1 units of mass at 0. So the masses should be (k − 1)/k at 0 and 1/k at kμ, which is exactly Markov’s bound with c = kμ.
Figure 1: Markov’s inequality interpreted as balancing a seesaw.
Example: Coin Tosses
Consider tossing a fair coin n times and let X denote the number of heads observed. What is the probability of observing more than (3/4)n heads? Let’s apply Markov’s inequality to obtain an upper bound on P[X ≥ (3/4)n]. Since X ∼ Binomial(n, 1/2), we have E[X] = n/2, so
P[X ≥ (3/4)n] ≤ E[X]/((3/4)n) = 2/3,
which does not depend on n. Is this a good upper bound? The exact answer for P[X ≥ (3/4)n] is given by
P[X ≥ (3/4)n] = (1/2^n) ∑_{k=⌈3n/4⌉}^{n} (n choose k),
which is ≈ 5.5 × 10⁻² for n = 10 and ≈ 2.8 × 10⁻⁷ for n = 100. As these numerical values show, P[X ≥ (3/4)n] is a decreasing function of n, and so Markov’s inequality does not provide a very good upper bound for this example. In the next section, we will see how to obtain an improved bound.
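The stated values can be checked directly. Here is a short Python sketch (ours, not from the note) that computes the exact tail probability above and compares it with the constant Markov bound of 2/3:

```python
from math import ceil, comb

def exact_tail(n: int) -> float:
    """Exact P[X >= 3n/4] for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(ceil(3 * n / 4), n + 1)) / 2**n

for n in (10, 100):
    print(f"n = {n:3d}: exact tail = {exact_tail(n):.1e}, Markov bound = {2/3:.3f}")
# n = 10 gives roughly 5.5e-02 and n = 100 roughly 2.8e-07, matching the text.
```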
2 Chebyshev’s Inequality
We have seen that, intuitively, the variance (or, more correctly, the standard deviation) is a measure of “spread,” or deviation from the mean. We can now make this intuition quantitatively precise:
Theorem 17.3 (Chebyshev’s Inequality). For a random variable X with finite expectation E[X] = μ,
P[|X − μ| ≥ c] ≤ Var(X)/c²,    (2)
for any positive constant c.
Proof. Define Y = (X − μ)² and note that E[Y] = E[(X − μ)²] = Var(X). Also, notice that the event that we are interested in, |X − μ| ≥ c, is exactly the same as the event Y = (X − μ)² ≥ c². Therefore, P[|X − μ| ≥ c] = P[Y ≥ c²]. Moreover, Y is obviously nonnegative, so we can apply Markov’s inequality in Theorem 17.1 to get
P[|X − μ| ≥ c] = P[Y ≥ c²] ≤ E[Y]/c² = Var(X)/c².
This completes the proof.
Here is a simpler proof using Generalized Markov’s inequality from Theorem 17.2:
Alternative proof of Theorem 17.3. Define Y = X − μ. Then, since |Y|² = |X − μ|² = (X − μ)², we have E[|Y|²] = E[(X − μ)²] = Var(X). Hence, (2) follows from applying Theorem 17.2 to Y = X − μ with r = 2.
Let’s pause to consider what Chebyshev’s inequality says. It tells us that the probability of any given deviation, c, from the mean, either above it or below it (note the absolute value sign), is at most Var(X)/c². As expected, this deviation probability will be small if the variance is small. An immediate corollary of Chebyshev’s inequality is the following:
Corollary 17.1. For any random variable X with finite expectation E[X] = μ and finite standard deviation σ = √Var(X),
P[|X − μ| ≥ kσ] ≤ 1/k²,
for any constant k > 0.
Proof. Plug c = kσ into Chebyshev’s inequality.
So, for example, we see that the probability of deviating from the mean by more than (say) two standard
deviations on either side is at most 1/4. In this sense, the standard deviation is a good working definition of
the “width” or “spread” of a distribution.
In some special cases, it is possible to get tighter bounds on the probability of deviations from the mean. However, for general random variables that only have means and variances, Chebyshev’s inequality is es- sentially the only tool. Its power derives from the fact that it can be applied to any random variable with a mean and a variance.
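As a small illustration (our own sketch, not part of the note), the following Python snippet estimates the two-standard-deviation tail probability for an Exponential(1) random variable by simulation and compares it with the 1/4 guaranteed by Corollary 17.1; the choice of distribution is an arbitrary assumption.

```python
import random

# Empirical check of Corollary 17.1 with k = 2 for X ~ Exponential(1),
# which has mean mu = 1 and standard deviation sigma = 1.
random.seed(0)
mu, sigma, k = 1.0, 1.0, 2.0
samples = [random.expovariate(1.0) for _ in range(200_000)]

tail = sum(abs(x - mu) >= k * sigma for x in samples) / len(samples)
print(f"empirical P[|X - mu| >= 2 sigma] ~ {tail:.4f}")   # close to e^(-3), about 0.05
print(f"Chebyshev bound: 1/k^2 = {1 / k**2}")             # 0.25
```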
Example: Coin Tosses Revisited
Let’s revisit the coin toss example considered above and apply Chebyshev’s inequality to obtain an upper bound on P[X ≥ (3/4)n]. Recalling E[X] = n/2, we obtain
P[X ≥ (3/4)n] = P[X − n/2 ≥ n/4] ≤ P[|X − n/2| ≥ n/4] ≤ Var(X)/(n/4)².
Since X ∼ Binomial(n, 1/2), we have Var(X) = n · (1/2)(1 − 1/2) = n/4, so we obtain
P[X ≥ (3/4)n] ≤ Var(X)/(n/4)² = 4/n,
which is much better than the constant bound of 2/3 given by Markov’s inequality.
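A short Python sketch (ours, not from the note) makes the comparison concrete: Markov’s bound stays fixed at 2/3 while Chebyshev’s bound 4/n shrinks as n grows.

```python
# The two tail bounds on P[X >= 3n/4] for a fair coin: Markov's bound is the
# constant 2/3, while Chebyshev's bound 4/n improves as n grows. (The exact
# probabilities, computed in the earlier sketch, are smaller still.)
for n in (10, 100, 1000, 10_000):
    print(f"n = {n:6d}: Markov bound = {2/3:.3f}, Chebyshev bound = {4/n:.4f}")
```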
3 Applications
In this section, we discuss applications of Chebyshev’s inequality to estimation problems.
3.1 Estimating the Bias of a Coin
Let us go back to our motivating example of estimating the bias of a coin. Recall that we have a coin of unknown bias p, and our estimate of p is p̂ = (1/n)Sn, where Sn is the number of Heads in n coin tosses.
As usual, we will find it helpful to write Sn = X1 +···+Xn, where Xi = 1 if the i-th coin toss comes up Heads
and Xi = 0 otherwise, and the random variables X1, . . . , Xn are independent and identically distributed. Then E[Xi] = P[Xi = 1] = p, so by linearity of expectation,
E[p̂] = E[(1/n)Sn] = (1/n)∑_{i=1}^{n} E[Xi] = p.
What about the variance of p̂? Note that since the Xi’s are independent, the variance of Sn = ∑_{i=1}^{n} Xi is equal to the sum of the variances:
Var(p̂) = Var((1/n)Sn) = (1/n²)Var(Sn) = (1/n²)∑_{i=1}^{n} Var(Xi) = σ²/n,
where we have written σ² for the variance of each of the Xi.
So we see that the variance of p̂ decreases as 1/n. This fact ensures that, as we take larger and larger
sample sizes n, the probability that we deviate much from the expectation p gets smaller and smaller.
Let’s now use Chebyshev’s inequality to figure out how large n has to be to ensure a specified accuracy in our estimate of the true bias p. As we discussed in the beginning of this note, a natural way to measure this is for us to specify two parameters, ε and δ , both in the range (0, 1). The parameter ε controls the error we are prepared to tolerate in our estimate, and δ controls the confidence we want to have in our estimate.
Applying Chebyshev’s inequality, we have
P[|p̂ − p| ≥ ε] ≤ Var(p̂)/ε² = σ²/(nε²).
To make the right hand side less than the desired value δ, we need to set
n ≥ σ²/(ε²δ).    (3)
Now recall that σ² = Var(Xi) is the variance of a single sample Xi. So, since Xi is an indicator random variable with P[Xi = 1] = p, we have σ² = p(1 − p), and inequality (3) becomes n ≥ p(1 − p)/(ε²δ). Since p(1 − p) is maximized1 when p = 1/2, we can conclude that it is sufficient to choose n such that:
n ≥ (max_p p(1 − p))/(ε²δ) = 1/(4ε²δ),    (4)
as we claimed earlier.
For example, plugging in ε = 0.1 and δ = 0.05, we see that a sample size of n = 500 is sufficient to get an estimate p̂ that is accurate to within an error of 0.1 with probability at least 95%.
As a concrete example, consider the problem of estimating the proportion p of Democrats in the US popu- lation, by taking a small random sample. We can model this as the problem of estimating the bias of a coin above, where each coin toss corresponds to a person that we select randomly from the entire population. And the coin tosses are independent.2 Our calculation above shows that to get an estimate pˆ that is accurate to within an error of 0.1 with probability at least 95%, it suffices to sample n = 500 people. In particular, notice that the size of the sample is independent of the total size of the population! This is how polls can accurately estimate quantities of interest for a population of several hundred million while sampling only a very small number of people.
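The following Python sketch (ours, not part of the note) chooses n from the bound above and then checks the advertised confidence by simulation; the true bias p = 0.3 and the number of repeated trials are arbitrary assumptions made only to run the experiment.

```python
import math
import random

# Pick n from n >= 1/(4 eps^2 delta), then empirically check the guarantee
# P[|p_hat - p| <= eps] >= 1 - delta. The estimator never uses p; we fix
# p = 0.3 here (an arbitrary choice) purely to drive the simulation.
eps, delta = 0.1, 0.05
n = math.ceil(1 / (4 * eps**2 * delta))   # = 500, as computed above
p = 0.3
random.seed(0)

trials = 2_000
good = 0
for _ in range(trials):
    heads = sum(random.random() < p for _ in range(n))   # S_n for one experiment
    if abs(heads / n - p) <= eps:
        good += 1

print(f"n = {n}, fraction of trials with |p_hat - p| <= {eps}: {good / trials:.3f}")
# Typically essentially 1, well above 1 - delta = 0.95, since Chebyshev is conservative.
```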
3.2 Estimating a General Expectation
What if we wanted to estimate something a little more complex than the bias of a coin? For example, suppose we want to estimate the average wealth of people in the US. We can model this as the problem of estimating the expected value of an unknown probability distribution. Then we can use exactly the same scheme as before, except that now we sample the random variables X1, X2, . . . , Xn independently from our unknown distribution. Clearly E[Xi] = μ, the expected value that we are trying to estimate. Our estimate of μ will be μ̂ = (1/n)∑_{i=1}^{n} Xi, for a suitably chosen sample size n.
Following the same calculation as before, we have E[μ̂] = μ and Var(μ̂) = σ²/n, where σ² = Var(Xi) is the variance of the Xi. (Recall that the only facts we used about the Xi were that they were independent and had the same distribution — actually the same expectation and variance would be enough: why?) This time, however, since we do not have any a priori bound on the mean μ, it makes more sense to let ε be the relative error, i.e., we wish to find an estimate μ̂ that is within an additive error of εμ:
P[|μ̂ − μ| ≥ εμ] ≤ δ.
Using equation (3), but substituting εμ in place of ε, it is enough for the sample size n to satisfy
n ≥ (σ²/μ²) × 1/(ε²δ).    (5)
Here ε and 1 − δ are the desired relative error and confidence level respectively. Now of course we don’t know the other two quantities, μ and σ², appearing in equation (5). In practice, we would use a lower bound on μ and an upper bound on σ² (just as we used an upper bound on p(1 − p) in the coin tossing problem). Plugging these bounds into equation (5) will ensure that our sample size is large enough.
1Use calculus to prove this.
2We are assuming here that the sampling is done “with replacement”; i.e., we select each person in the sample from the entire population, including those we have already picked. So there is a small chance that we will pick the same person twice.
For example, in the average wealth problem we could probably safely take μ to be at least (say) $20,000 (probably more). However, the existence of very wealthy people such as Bill Gates means that we would need to take a very high value for the variance σ2. Indeed, if there is at least one individual with wealth $50 billion in a population of size 325 million, then assuming a relatively small value of μ means that the
variance must be at least about (50 × 10⁹)²/(325 × 10⁶) ≈ 7.7 × 10¹². There is really no way around this problem with
simple uniform sampling: the uneven distribution of wealth means that the variance is inherently very large, and we will need a huge number of samples before we are likely to find anybody who is immensely wealthy. But if we don’t include such people in our sample, then our estimate will be way too low.
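To get a feel for the numbers, here is a small Python sketch (ours, not from the note) that plugs the bounds discussed above into equation (5); the accuracy and confidence targets ε = 0.1 and δ = 0.05 are our own assumptions, chosen to match the coin-tossing example.

```python
# Sample size from equation (5): n >= (sigma^2 / mu^2) * 1 / (eps^2 * delta).
# mu >= $20,000 and the variance figure 7.7e12 come from the discussion above;
# eps = 0.1 and delta = 0.05 are our own assumed targets.
mu_lower = 20_000        # lower bound on mean wealth (dollars)
var_lower = 7.7e12       # any valid upper bound on sigma^2 must be at least this
eps, delta = 0.1, 0.05

n = (var_lower / mu_lower**2) * (1 / (eps**2 * delta))
print(f"required sample size: n >= {n:,.0f}")   # roughly 38,500,000 people
```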
4 The Law of Large Numbers
The estimation method we used in the previous sections is based on a principle that we accept as part of everyday life: namely, the Law of Large Numbers (LLN). This asserts that, if we observe some random variable many times, and take the average of the observations, then this average will converge to a single value, which is of course the expectation of the random variable. In other words, averaging tends to smooth out any large fluctuations, and the more averaging we do the better the smoothing.
Theorem 17.4 (Law of Large Numbers). Let X1, X2, . . . , be a sequence of i.i.d. (independent and identically distributed) random variables with common finite expectation E[Xi] = μ for all i. Then, their partial sums Sn = X1 + X2 + ··· + Xn satisfy
P[|(1/n)Sn − μ| < ε] → 1 as n → ∞,
for every ε > 0, however small.
Proof. Let Var(Xi) = σ² be the common variance of the r.v.’s; we assume that σ² is finite.3 With this (relatively mild) assumption, the LLN is an immediate consequence of Theorem 17.3. Since X1, X2, . . . are i.i.d. random variables with E[Xi] = μ and Var(Xi) = σ², we have E[(1/n)Sn] = μ and Var((1/n)Sn) = σ²/n, so by Chebyshev’s inequality we have
P[|(1/n)Sn − μ| ≥ ε] ≤ Var((1/n)Sn)/ε² = σ²/(nε²) → 0 as n → ∞.
Hence, P[|(1/n)Sn − μ| < ε] = 1 − P[|(1/n)Sn − μ| ≥ ε] → 1 as n → ∞.
Notice that the LLN says that the probability of any deviation ε from the mean, however small, tends to zero as the number of observations n in our average tends to infinity. Thus, by taking n large enough, we can make the probability of any given deviation as small as we like.
3If σ² is not finite, the LLN still holds but the proof requires more advanced mathematical tools.
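To see the LLN in action, here is a small simulation sketch (ours, not part of the note): for increasing n we estimate P[|(1/n)Sn − μ| ≥ ε] by repeating the averaging experiment many times. The Exponential(1) distribution, ε = 0.1, and the number of trials are arbitrary assumptions.

```python
import random

# Empirical illustration of the (weak) LLN: the probability that the sample
# average deviates from mu by at least eps shrinks as n grows.
# X_i ~ Exponential(1), so mu = 1; eps, the n values, and the trial count are our choices.
random.seed(0)
mu, eps, trials = 1.0, 0.1, 2_000

for n in (10, 100, 1000):
    deviations = 0
    for _ in range(trials):
        avg = sum(random.expovariate(1.0) for _ in range(n)) / n
        if abs(avg - mu) >= eps:
            deviations += 1
    print(f"n = {n:5d}: estimated P[|S_n/n - mu| >= {eps}] ~ {deviations / trials:.3f}")
```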