
Binary predictions
Coin tosses
Daniel Hsu (COMS 4771)
A coin is tossed, and your goal is to predict the outcome (which is either “heads” or “tails”). How should you predict?
If you know the initial conditions of the coin toss, as well as a lot of physics, then (in a non-quantum universe) you can put this knowledge to use to determine exactly how the coin will land. But suppose you don’t have all of this knowledge.
If the coin is “fair”, then intuitively it doesn’t matter how we predict. But if the coin is “biased”, then predicting one way may be better than the other.
We’ll use a statistical model of the problem to motivate a prediction strategy, as well as to evaluate the quality of various strategies. In this model, the outcome of the coin toss is random: it is “heads” with some probability, say, p, and “tails” with the remaining probability 1 − p. We’ll encode “heads” by 1 and “tails” by 0, so the outcome is a random variable Y. The number p is a parameter of the model; the set of values it can take on, namely the interval [0, 1], is the parameter space. This particular model is the family of Bernoulli distributions {Bern(p) : p ∈ [0, 1]}; we say that the distribution of Y is Bern(p) by writing
Y ∼ Bern(p).
If you know the parameter p, then how should you predict? Here is one strategy:
• If p > 1/2, then predict “heads”.
• If p < 1/2, then predict “tails”.
• If p = 1/2, then it doesn’t matter; for concreteness, predict “tails”.

Using this strategy, what is the probability that you predict incorrectly? A simple calculation shows that it is min{p, 1 − p}. Can any strategy have a smaller probability of predicting incorrectly? No. For example, if p > 1/2 and you predict “tails”, then your probability of predicting incorrectly is p, which is more than 1 − p.
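To illustrate, here is a short simulation. It is not from the original notes: it assumes NumPy is available and uses an arbitrary choice of p. It checks that the strategy above errs with probability about min{p, 1 − p}.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.7                     # hypothetical bias of the coin (assumed known here)
    n_trials = 100_000          # number of simulated tosses

    # Toss the coin many times: each outcome is 1 ("heads") with probability p.
    outcomes = rng.binomial(1, p, size=n_trials)

    # The fixed prediction when p is known: 1 if p > 1/2, else 0.
    prediction = 1 if p > 0.5 else 0

    # Empirical error rate versus the theoretical value min{p, 1 - p}.
    print("empirical error rate:", np.mean(outcomes != prediction))
    print("min{p, 1 - p}       :", min(p, 1 - p))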
Of course, this all assumes you know the parameter p exactly. In the next section, we’ll discuss what can be done when you don’t know p.
The plug-in principle and the iid model
In many prediction problems where a statistical model may be used, such as in the problem described above, the parameters of the model are generally not exactly known. In the case above, the parameter p is needed to determine the optimal prediction. Without knowledge of p, we need another way to derive a prediction.
If the outcomes of n previous tosses of the given coin are available, and the goal is to make a prediction as in the preceding problems for an (n + 1)-th toss, then we may again use a statistical model to derive a good prediction. In this model, the outcomes Y1, . . . , Yn, Y of the n + 1 tosses are independent and identically
distributed (iid); the first n outcomes Y1, . . . , Yn are observed, and the (n + 1)-th Y is the outcome to predict. We write this as

Y1, . . . , Yn, Y ∼iid P,

where P is the (unknown) distribution of Y. We think of (Y1, . . . , Yn) as data that can be used to derive a prediction of Y. The optimal prediction for the outcome Y is given as some formula depending on unknown aspects of the distribution of Y. The plug-in principle prescribes the following steps to form a prediction in this iid model:
1. Estimate the unknowns based on the observed outcomes.
2. Plug these estimates into the formula for the optimal prediction.
The iid model is ubiquitous in machine learning, as it provides a very simple connection between the observed data and the target of prediction.
Using maximum likelihood estimation
When the statistical model for our problem is a parametric model, there is a well-worn method for deriving estimators based on the maximum likelihood principle. Recall that a parametric model is a family of probability distributions {Pθ : θ ∈ Θ} for a random variable Z, where the family is indexed by a set Θ. The set Θ is called the parameter space, and θ ∈ Θ is the parameter of the distribution Pθ. In many cases, the random variable Z may actually be a vector of several random variables; in such cases, we say Z is a random vector. Similarly, in many cases, each θ ∈ Θ is actually a vector of multiple parameters, so we call each such θ a parameter vector. The likelihood L(θ) of a parameter vector θ ∈ Θ given an observation Z = z is the probability of Z = z under the distribution Pθ.
θ̂ := arg max_{θ ∈ Θ} L(θ),

i.e., the parameter vector with the highest likelihood.¹ Note that by the strict monotonicity of the logarithm function, the MLE is also the maximizer of the log-likelihood function ln L.
We return to the problem of predicting the outcome of a coin toss, where the outcome must be either 1 (“heads”) or 0 (“tails”). There, we have

Y1, . . . , Yn, Y ∼iid Bern(p)
for some unknown parameter p ∈ [0,1]. The MLE for p given (Y1,…,Yn) = (y1,…,yn) is the maximizer of
the log-likelihood

ln L(p) = ln Π_{i=1}^n p^{yi} (1 − p)^{1−yi}
        = Σ_{i=1}^n [ yi ln p + (1 − yi) ln(1 − p) ].
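Before turning to calculus, the maximizer can be sanity-checked numerically. The sketch below is not from the original notes; it assumes NumPy and uses made-up data, evaluating this log-likelihood on a grid of values of p.

    import numpy as np

    y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])   # hypothetical observed tosses
    n = len(y)

    # Evaluate ln L(p) = sum_i [ yi ln p + (1 - yi) ln(1 - p) ] on a grid of p values,
    # avoiding the endpoints p = 0 and p = 1 where the logarithms are undefined.
    grid = np.linspace(0.001, 0.999, 999)
    log_lik = np.array([np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) for p in grid])

    print("grid maximizer:", grid[np.argmax(log_lik)])
    print("sample mean   :", y.mean())   # agrees with the closed form derived below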
We analytically determine the maximizer using calculus. First, we find the critical points of ln L (i.e., zeros of the derivative of ln L). The derivative of ln L is

(1/p) Σ_{i=1}^n yi − (1/(1 − p)) Σ_{i=1}^n (1 − yi),

and it is equal to zero exactly when

p = (1/n) Σ_{i=1}^n yi.
¹ In general, there may be multiple distinct θ with the same highest likelihood, i.e., the MLE is not unique. And in other cases, there may not be any single θ with highest likelihood, i.e., the MLE does not exist.
Next, we determine whether the critical point is a maximizer, a minimizer, or a saddle point. The second derivative of ln L is

−(1/p²) Σ_{i=1}^n yi − (1/(1 − p)²) Σ_{i=1}^n (1 − yi),

which is always non-positive (for 0 < p < 1).² Hence the critical point is a maximizer. Thus, the MLE for p given (Y1, . . . , Yn) = (y1, . . . , yn) is

p̂ = p̂(y1, . . . , yn) := (1/n) Σ_{i=1}^n yi.

This estimator is used via the plug-in principle to derive the prediction strategy

ŷ(y1, . . . , yn) := 1{p̂(y1, . . . , yn) > 1/2}.
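A minimal sketch of this estimator and prediction rule, not part of the original notes (it assumes NumPy and uses made-up data):

    import numpy as np

    def predict_next_toss(y):
        """Plug-in prediction of the next toss from observed tosses y (a 0/1 array)."""
        p_hat = np.mean(y)           # MLE of p: the fraction of observed heads
        return int(p_hat > 0.5)      # predict "heads" (1) exactly when p_hat > 1/2

    y_observed = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1])   # hypothetical data
    print(predict_next_toss(y_observed))   # prints 1, since 7/10 > 1/2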
Probability of an error

In the iid model, the probability that Ŷ := ŷ(Y1, . . . , Yn) does not correctly predict Y can be expressed as

P(Ŷ ≠ Y) = P(ŷ(Y1, . . . , Yn) ≠ Y)
         = P(Y1 + ··· + Yn > n/2) · P(Y = 0) + P(Y1 + ··· + Yn ≤ n/2) · P(Y = 1).
Suppose Y ∼ Bern(p) for some p > 1/2. Using a tail bound for sums of iid Bernoulli random variables (given in the next section), this probability can be bounded as

P(Ŷ ≠ Y) = (1 − p) + (2p − 1) · P(Y1 + ··· + Yn ≤ n/2)
         ≤ (1 − p) + (2p − 1) · e^{−n·RE(1/2,p)},

where

RE(a, b) := a ln(a/b) + (1 − a) ln((1 − a)/(1 − b))

is the relative entropy between Bern(a) and Bern(b). If instead Y ∼ Bern(p) for some p ≤ 1/2, the probability of a prediction error is

P(Ŷ ≠ Y) = p + (1 − 2p) · P(Y1 + ··· + Yn > n/2)
         ≤ p + (1 − 2p) · e^{−n·RE(1/2,p)}.

Hence, in either case,

P(Ŷ ≠ Y) ≤ min{p, 1 − p} + |2p − 1| · e^{−n·RE(1/2,p)}.
Recall that the optimal prediction predicts incorrectly with probability min{p, 1 − p}. The relative entropy is always non-negative, and RE(a, b) = 0 if and only if a = b; in particular, RE(1/2,p) > 0 whenever p ≠ 1/2. Therefore, the bound in the display above exceeds the optimal error probability by a quantity that goes to zero exponentially fast with n. (When p = 1/2, the extra term is exactly zero, since |2p − 1| = 0.)
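The following sketch, not part of the original notes, estimates P(Ŷ ≠ Y) by simulation and compares it with the bound min{p, 1 − p} + |2p − 1| · e^{−n·RE(1/2,p)}; it assumes NumPy and uses arbitrary values of p and n.

    import numpy as np

    def rel_entropy(a, b):
        """Relative entropy RE(a, b) between Bern(a) and Bern(b), for 0 < a, b < 1."""
        return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

    rng = np.random.default_rng(0)
    p, n, trials = 0.6, 25, 200_000     # hypothetical bias, sample size, repetitions

    # Simulate: observe n tosses, predict with the plug-in rule, compare with toss n+1.
    heads_counts = rng.binomial(n, p, size=trials)       # sum of the n observed tosses
    predictions = (heads_counts > n / 2).astype(int)     # the rule 1{p_hat > 1/2}
    next_tosses = rng.binomial(1, p, size=trials)        # the outcomes to predict
    error_rate = np.mean(predictions != next_tosses)

    bound = min(p, 1 - p) + abs(2 * p - 1) * np.exp(-n * rel_entropy(0.5, p))
    print("simulated error rate:", error_rate)   # at least min{p, 1 - p} = 0.4
    print("upper bound         :", bound)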
Probability tail bounds
Theorem (Tail bounds for sums of iid Bernoulli random variables). Let X1, . . . , Xn ∼iid Bern(p), and let S := X1 + · · · + Xn. For any 0 ≤ l ≤ p ≤ u ≤ 1,
P(S ≤ n · l) ≤ e^{−n·RE(l,p)},    P(S ≥ n · u) ≤ e^{−n·RE(u,p)}.
² The cases p = 0 and p = 1 need to be treated differently.
Proof. We just show the first inequality; the second one is proved similarly. We can also assume that 0 < l < p < 1, since the other cases are trivial.

Let fp denote the probability mass function (pmf) of (X1, . . . , Xn). Let E ⊆ {0, 1}^n be the set of outcomes x = (x1, . . . , xn) with Σ_{i=1}^n xi ≤ n · l. Then P(S ≤ n · l) = Σ_{x∈E} fp(x). Our goal is to bound this latter sum by e^{−n·RE(l,p)}.

Let fl denote the pmf of (X1′, . . . , Xn′), where X1′, . . . , Xn′ ∼iid Bern(l). The proof proceeds by comparing fp to fl at every x ∈ E. Indeed, fix any x ∈ E, and let kx be the number of i’s such that xi = 1. Since x ∈ E, we have kx ≤ n · l. Observe that because p/l > 1, we have (p/l)^{kx} ≤ (p/l)^{n·l}; similarly, because (1 − p)/(1 − l) < 1, we have ((1 − p)/(1 − l))^{n−kx} ≤ ((1 − p)/(1 − l))^{n·(1−l)}. Therefore

fp(x)/fl(x) = p^{kx}(1 − p)^{n−kx} / (l^{kx}(1 − l)^{n−kx}) = (p/l)^{kx} ((1 − p)/(1 − l))^{n−kx} ≤ (p/l)^{n·l} ((1 − p)/(1 − l))^{n·(1−l)}.

Because the above inequality holds for every x ∈ E and Σ_{x∈E} fl(x) ≤ 1,

Σ_{x∈E} fp(x) ≤ (p/l)^{n·l} ((1 − p)/(1 − l))^{n·(1−l)} Σ_{x∈E} fl(x) ≤ (p/l)^{n·l} ((1 − p)/(1 − l))^{n·(1−l)} = e^{−n·RE(l,p)}.

The distribution of the random variable S from the tail bound above is called the binomial distribution with n trials and success probability p, written S ∼ Bin(n, p). The expected value of S is n · p, and the tail bound says that the chance that S deviates from this number by more than a constant fraction of n is exponentially small in n. Although it is extremely unlikely for S ∼ Bin(n, p) to deviate from its mean by magnitudes proportional to n, one can expect deviations of a smaller size. The variance of S gives a bound on the expected magnitude of the deviation:

E[|S − E(S)|] ≤ √(E[(S − E(S))²]) = √(var(S)) = √(n · p(1 − p)).

The inequality in the first step follows from Jensen’s inequality.
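A quick numerical check of the tail bound and of the deviation bound above; it is not part of the original notes, assumes NumPy and SciPy, and uses arbitrary values of n, p, and l.

    import numpy as np
    from scipy.stats import binom

    def rel_entropy(a, b):
        """Relative entropy RE(a, b) between Bern(a) and Bern(b), for 0 < a, b < 1."""
        return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

    n, p, l = 100, 0.6, 0.5     # hypothetical trial count, bias, and threshold l <= p

    # Lower-tail probability P(S <= n*l) for S ~ Bin(n, p) versus the bound e^{-n*RE(l,p)}.
    exact_tail = binom.cdf(np.floor(n * l), n, p)
    bound = np.exp(-n * rel_entropy(l, p))
    print("P(S <= n*l)    :", exact_tail)
    print("e^{-n*RE(l,p)} :", bound)

    # Typical deviation from the mean: E|S - E(S)| is at most sqrt(n*p*(1-p)).
    rng = np.random.default_rng(0)
    samples = rng.binomial(n, p, size=200_000)
    print("mean |S - E(S)| :", np.mean(np.abs(samples - n * p)))
    print("sqrt(n*p*(1-p)) :", np.sqrt(n * p * (1 - p)))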