Coin tosses
Daniel Hsu (COMS 4771)
Binary predictions
A coin is tossed, and your goal is to predict the outcome (which is either “heads” or “tails”). How should you
predict?
If you know the initial conditions of the coin toss, as well as a lot of physics, then (in a non-quantum universe)
you can put this knowledge to use to determine exactly how the coin will land. But suppose you don’t have
all of this knowledge.
If the coin is “fair”, then intuitively it doesn’t matter how we predict. But if the coin is “biased”, then
predicting one way may be better than the other.
We’ll use a statistical model of the problem to motivate a prediction strategy, as well as to evaluate the
quality of various strategies. In this model, the outcome of the coin toss is random; it is “heads” with some
probability, say, p; and it is “tails” with the remaining probability 1 − p. We’ll encode “heads” by 1, and
“tails” by 0, so the outcome is a random variable Y . The number p is a parameter of the model; the possible
values it can take on, namely the interval [0, 1], is the parameter space. This particular model is the family
of Bernoulli distributions {Bern(p) : p ∈ [0, 1]}; we say that the distribution of Y is Bern(p) by writing
Y ∼ Bern(p).
If you know the parameter p, then how should you predict? Here is one strategy:
• If p > 1/2, then predict “heads”.
• If p < 1/2, then predict “tails”.
• If p = 1/2, then it doesn’t matter. But, for concreteness, predict “tails”.
Using this strategy, what is the probability that you predict incorrectly? A simple calculation shows that it
is min{p, 1 − p}: if p > 1/2, you predict “heads” and err exactly when the toss comes up “tails”, which happens
with probability 1 − p; if p ≤ 1/2, you predict “tails” and err with probability p.
Can any strategy have a smaller probability of predicting incorrectly? No. For example, if p > 1/2 and you
predict “tails”, then your probability of predicting incorrectly is at least p, which is more than 1− p.
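This can also be checked empirically. The following is a minimal simulation sketch (assuming NumPy is available; the particular value p = 0.7 is an arbitrary choice) that estimates the error probability of the strategy above and compares it to min{p, 1 − p}.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.7                                       # true heads probability (arbitrary choice)
    trials = 100_000

    outcomes = rng.binomial(1, p, size=trials)    # 1 encodes "heads", 0 encodes "tails"
    prediction = 1 if p > 1/2 else 0              # predict the more likely outcome
    error_rate = np.mean(outcomes != prediction)

    print(error_rate)                             # close to min(p, 1 - p) = 0.3
    print(min(p, 1 - p))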
Of course, this all assumes you know the parameter p exactly. In the next section, we’ll discuss what can be
done when you don’t know p.
The plug-in principle and the iid model
In many prediction problems where a statistical model may be used, such as in the problem described above,
the parameters of the model are generally not exactly known. In the case above, the parameter p is needed
to determine the optimal prediction. Without knowledge of p, we need another way to derive a prediction.
If the outcomes of n previous tosses of the given coin are available, and the goal is to make a prediction, as in
the preceding problem, for the (n + 1)-th toss, then we may again use a statistical model to derive a good
prediction. In this model, the outcomes Y1, . . . , Yn, Y of the n + 1 tosses are independent and identically
distributed (iid); the first n outcomes Y1, . . . , Yn are observed, and the (n + 1)-th, Y, is the outcome to predict.
We write this as
Y1, . . . , Yn, Y ∼iid P,
where P is the (unknown) distribution of Y. We think of Y1, . . . , Yn as data that can be used to derive a
prediction of Y . The optimal prediction for the outcome Y is given as some formula depending on unknown
aspects of the distribution of Y . The plug-in principle prescribes the following steps to form a prediction in
this iid model:
1. Estimate the unknowns based on the observed outcomes.
2. Plug these estimates into the formula for the optimal prediction.
The iid model is ubiquitous in machine learning, as it provides a very simple connection between the observed
data and the target of prediction.
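As an illustrative sketch of these two steps for the coin problem, one might estimate p by the fraction of heads among the observed tosses (this particular choice of estimator is justified in the next section) and plug the estimate into the formula 1{p > 1/2}; the function name below is made up for illustration.

    import numpy as np

    def plug_in_predict(y_observed):
        """Predict the (n+1)-th toss from n observed tosses via the plug-in principle."""
        p_hat = np.mean(y_observed)       # step 1: estimate the unknown parameter p
        return 1 if p_hat > 1/2 else 0    # step 2: plug the estimate into 1{p > 1/2}

    # example usage with a made-up sequence of observed outcomes
    print(plug_in_predict([1, 0, 1, 1, 0, 1]))    # prints 1 ("heads")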
Using maximum likelihood estimation
When the statistical model for our problem is a parametric model, there is a well-weathered method for
deriving estimators based on the maximum likelihood principle. Recall that a parametric model is a family of
probability distributions {Pθ : θ ∈ Θ} for a random variable Z, where the family is indexed by a set Θ. The
set Θ is called the parameter space, and θ ∈ Θ is the parameter of the distribution Pθ. In many cases, the
random variable Z may actually be a vector of several random variables; in such cases, we say Z is a random
vector. Similarly, in many cases, each θ ∈ Θ is actually a vector of multiple parameters, so we call each such
θ a parameter vector. The likelihood L(θ) of a parameter vector θ ∈ Θ given an observation Z = z is the
probability of Z = z under the distribution Pθ. The maximum likelihood estimator (MLE) is
θ̂ := arg max_{θ ∈ Θ} L(θ),

i.e., the parameter vector with the highest likelihood. (In general, there may be multiple distinct θ with the
same highest likelihood, i.e., the MLE is not unique; and in other cases, there may not be any single θ with
highest likelihood, i.e., the MLE does not exist.) Note that by the strict monotonicity of the logarithm
function, the MLE is also the maximizer of the log-likelihood function lnL.
We return to the problem of predicting the outcome of a coin toss, where the outcome must be either 1
(“heads”) or 0 (“tails”). There, we have
Y1, . . . , Yn, Y ∼iid Bern(p)
for some unknown parameter p ∈ [0, 1]. The MLE for p given (Y1, . . . , Yn) = (y1, . . . , yn) is the maximizer of
the log-likelihood
lnL(p) = ln ∏_{i=1}^n p^{y_i} (1 − p)^{1−y_i} = ∑_{i=1}^n [ y_i ln p + (1 − y_i) ln(1 − p) ].
We analytically determine the maximizer using calculus. First, we find the critical points of lnL (i.e., zeros
of the derivative of lnL). The derivative of lnL is
(1/p) ∑_{i=1}^n y_i − (1/(1 − p)) ∑_{i=1}^n (1 − y_i),
and it is equal to zero exactly when
p = (1/n) ∑_{i=1}^n y_i.
Next, we determine whether the critical point is a maximizer, a minimizer, or a saddle point. The second
derivative of lnL is

− (1/p^2) ∑_{i=1}^n y_i − (1/(1 − p)^2) ∑_{i=1}^n (1 − y_i),

which is always non-positive for 0 < p < 1 (the cases p = 0 and p = 1 need to be treated differently). Hence
the critical point is a maximizer. Thus, the MLE for p given (Y1, . . . , Yn) = (y1, . . . , yn) is

p̂ = p̂(y1, . . . , yn) := (1/n) ∑_{i=1}^n y_i.

This estimator is used via the plug-in principle to derive the prediction strategy

ŷ(y1, . . . , yn) := 1{p̂(y1, . . . , yn) > 1/2}.
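As a sanity check on the calculus argument above, the following sketch (using NumPy and an arbitrary made-up sequence of outcomes) compares the maximizer of lnL over a fine grid of values of p with the sample mean.

    import numpy as np

    y = np.array([1, 1, 0, 1, 0, 1, 1, 0])     # made-up observed outcomes
    grid = np.linspace(0.001, 0.999, 9999)     # candidate values for p (avoiding 0 and 1)

    # Bernoulli log-likelihood ln L(p) evaluated at every p in the grid
    log_lik = np.sum(y) * np.log(grid) + np.sum(1 - y) * np.log(1 - grid)

    print(grid[np.argmax(log_lik)])    # approximately equal to ...
    print(np.mean(y))                  # ... the sample mean 0.625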
Probability of an error
In the iid model, the probability that Ŷ := ŷ(Y1, . . . , Yn) does not correctly predict Y can be expressed as
P(Ŷ ≠ Y) = P(ŷ(Y1, . . . , Yn) ≠ Y)
         = P(Y1 + · · · + Yn > n/2) · P(Y = 0) + P(Y1 + · · · + Yn ≤ n/2) · P(Y = 1).
Suppose Y ∼ Bern(p) for some p > 1/2. Using a tail bound for sums of iid Bernoulli random variables (given
in the next section), this probability can be bounded as
P(Ŷ ≠ Y) = (1 − p) + (2p − 1) · P(Y1 + · · · + Yn ≤ n/2)
         ≤ (1 − p) + (2p − 1) · e^{−n·RE(1/2,p)},
where
RE(a, b) := a ln(a/b) + (1 − a) ln((1 − a)/(1 − b))
is the relative entropy between Bern(a) and Bern(b). If instead Y ∼ Bern(p) for some p ≤ 1/2, the probability
of a prediction error is
P(Ŷ ≠ Y) = p + (1 − 2p) · P(Y1 + · · · + Yn > n/2)
         ≤ p + (1 − 2p) · e^{−n·RE(1/2,p)}.
Hence, in either case,
P(Ŷ ≠ Y) ≤ min{p, 1 − p} + |2p − 1| · e^{−n·RE(1/2,p)}.
Recall that the optimal prediction predicts incorrectly with probability min{p, 1 − p}. The relative entropy
is always non-negative, and RE(a, b) = 0 if and only if a = b. Therefore, whenever p ≠ 1/2, the bound above
exceeds the optimal error probability only by a quantity that goes to zero exponentially fast with n (and
when p = 1/2, the extra term vanishes, since |2p − 1| = 0).
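To get a feel for how quickly the excess over min{p, 1 − p} shrinks, the following sketch (assuming SciPy is available; the choices p = 0.6 and the values of n are arbitrary) computes the exact error probability of the plug-in prediction alongside the bound above.

    import numpy as np
    from scipy.stats import binom

    def relative_entropy(a, b):
        """RE(a, b) between Bern(a) and Bern(b)."""
        return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

    p = 0.6
    for n in [10, 50, 100, 500]:
        # exact error probability: the prediction is "heads" when the count exceeds n/2
        # (so an error occurs when Y = 0), and "tails" otherwise (an error when Y = 1)
        exact = (1 - p) * binom.sf(n / 2, n, p) + p * binom.cdf(n / 2, n, p)
        bound = min(p, 1 - p) + abs(2 * p - 1) * np.exp(-n * relative_entropy(1 / 2, p))
        print(n, exact, bound)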
Probability tail bounds
Theorem (Tail bounds for sums of iid Bernoulli random variables). Let X1, . . . , Xn ∼iid Bern(p), and let
S := X1 + · · · + Xn. For any 0 ≤ ℓ ≤ p ≤ u ≤ 1,

P(S ≤ n · ℓ) ≤ e^{−n·RE(ℓ,p)},
P(S ≥ n · u) ≤ e^{−n·RE(u,p)}.
Proof. We just show the first inequality; the second one is proved similarly. We can also assume that
0 < ℓ < p < 1, since the other cases are trivial.
Let f_p denote the probability mass function (pmf) for (X1, . . . , Xn). Let E ⊆ {0, 1}^n be the set of outcomes
x = (x1, . . . , xn) where ∑_{i=1}^n x_i ≤ n · ℓ. Then P(S ≤ n · ℓ) = ∑_{x∈E} f_p(x). Our goal is to bound this latter
sum by e^{−n·RE(ℓ,p)}.
Let f_ℓ denote the pmf for (X′1, . . . , X′n) where X′1, . . . , X′n ∼iid Bern(ℓ). The proof proceeds by comparing
f_p to f_ℓ at every x ∈ E. Indeed, fix any x ∈ E, and let k_x be the number of i’s such that x_i = 1. Since
x ∈ E, we have k_x ≤ n · ℓ. Observe that because p/ℓ > 1, we have (p/ℓ)^{k_x} ≤ (p/ℓ)^{n·ℓ}; similarly, because
(1 − p)/(1 − ℓ) < 1, we have ((1 − p)/(1 − ℓ))^{n−k_x} ≤ ((1 − p)/(1 − ℓ))^{n·(1−ℓ)}. Therefore

f_p(x) / f_ℓ(x) = [ p^{k_x} (1 − p)^{n−k_x} ] / [ ℓ^{k_x} (1 − ℓ)^{n−k_x} ]
               = (p/ℓ)^{k_x} ((1 − p)/(1 − ℓ))^{n−k_x}
               ≤ (p/ℓ)^{n·ℓ} ((1 − p)/(1 − ℓ))^{n·(1−ℓ)}.
Because the above inequality holds for every x ∈ E and ∑_{x∈E} f_ℓ(x) ≤ 1,

∑_{x∈E} f_p(x) ≤ ∑_{x∈E} f_ℓ(x) · (p/ℓ)^{n·ℓ} ((1 − p)/(1 − ℓ))^{n·(1−ℓ)}
             ≤ (p/ℓ)^{n·ℓ} ((1 − p)/(1 − ℓ))^{n·(1−ℓ)}
             = e^{−n·RE(ℓ,p)}.
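The bound can be compared with exact binomial tail probabilities; here is a minimal numerical sketch (assuming SciPy is available; the particular n, p, and ℓ are arbitrary).

    import numpy as np
    from scipy.stats import binom

    def relative_entropy(a, b):
        """RE(a, b) between Bern(a) and Bern(b)."""
        return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

    n, p, ell = 200, 0.5, 0.4
    exact = binom.cdf(n * ell, n, p)                  # P(S <= n * ell), exactly
    bound = np.exp(-n * relative_entropy(ell, p))     # e^{-n RE(ell, p)}
    print(exact, bound)                               # exact is below the bound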
The distribution of the random variable S from the tail bound above is called the binomial distribution with
n trials and success probability p, written S ∼ Bin(n, p). The expected value of S is n · p, and the tail bound
says that the chance that S deviates from this number by more than a constant fraction of n is exponentially
small in n.
Although it is extremely unlikely for S ∼ Bin(n, p) to deviate from its mean by magnitudes proportional to
n, one can expect deviations of a smaller size. The variance of S gives a bound on the expected magnitude of
the deviation:
E[|S − E(S)|] ≤ √(E[(S − E(S))^2]) = √(var(S)) = √(n · p(1 − p)).
The inequality in the first step follows from Jensen’s inequality.
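A quick simulation (assuming NumPy; the particular n and p are arbitrary) illustrates that typical deviations of S from its mean are indeed on the order of √(n · p(1 − p)).

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, reps = 1000, 0.3, 100_000

    S = rng.binomial(n, p, size=reps)              # many independent draws of S ~ Bin(n, p)
    mean_abs_dev = np.mean(np.abs(S - n * p))      # Monte Carlo estimate of E|S - E(S)|

    print(mean_abs_dev)                            # smaller than ...
    print(np.sqrt(n * p * (1 - p)))                # ... sqrt(n p (1 - p)) ≈ 14.49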