Week 5: Probability

Goals this week
We are taking a turn this week towards probability and inference. You will likely find this week and the next (and the problem set) more difficult than before. But learning about probability is essential for understanding how the majority of data science and machine learning techniques really work. Even if you don’t go on to fit probabilistic models directly on your own real-life data, I guarantee that the techniques you do use will be rooted in probability calculus (from clustering to regression to neural networks to dimensionality reduction). By learning the underlying fundamentals, you will (1) better remember a method’s function and purpose, and, critically, (2) understand a method’s underlying assumptions and whether those assumptions are met in your data analysis.
The probability of boys and girls
In the 1770s, the French mathematician Pierre-Simon Laplace started working on big data. He became interested in the curious biological observation that the ratio of boys to girls at birth isn’t exactly 50:50. Village records showed an apparent bias towards more boy births, but the numbers were small and vulnerable to statistical fluctuation. Laplace set out to use some new ideas about probability theory to put the question of sex ratio at birth to a mathematically rigorous test, and he needed big data sets to do it.
Laplace turned to the extensive census records in London and Paris. The Paris census for the years 1745-1770 recorded the birth of 251527 boys and 241945 girls, a 51:49 (1.04) ratio. The London census for 1664-1757 recorded the birth of 737629 boys and 698958 girls, again a 51:49 ratio.
Laplace’s work is one of the origins of probability theory. Today, his laborious manual calculations are easy to reproduce in a few lines of Python, and his problem makes a compact example for us to illustrate some key ideas of probabilistic inference.
The binomial distribution
Let’s call the probability of having a boy $p$. The probability of having a girl is $1-p$. The probability of having $b$ boys in $N$ total births is given by a binomial distribution:

$$P(b \mid p, N) = \binom{N}{b} p^b (1-p)^{N-b}$$

A couple of things to explain, in case you haven’t seen them before:

$P(b \mid p, N)$ is “the probability of $b$, given $p$ and $N$”: a conditional probability. The vertical line means “given”. That is: if I told you $p$ and $N$, what’s the probability of observing data $b$?

$\binom{N}{b}$ is conventional shorthand for the binomial coefficient: $\frac{N!}{b!\,(N-b)!}$.

Suppose $p = 0.5$ and $N = 493472$ in the Paris data (251527 boys + 241945 girls). The probability of getting 251527 boys is:

$$P(b \mid p, N) = \frac{493472!}{251527!\; 241945!}\; 0.5^{251527}\; 0.5^{241945}$$
Your calculator isn’t likely to be able to deal with that, but Python can. For example, you can use the pmf (probability mass function) of scipy.stats.binom :

import scipy.stats as stats

b = 251527
N = 493472
p = 0.5
Prob = stats.binom.pmf(b, N, p)
print(Prob)

which gives $4.5 \cdot 10^{-44}$.

Probabilities sum to one, so $\sum_{b=0}^{N} P(b \mid p, N) = 1$. You can verify this in Python easily, because there are only $N+1$ possible values for $b$, from $0$ to $N$.

Maximum likelihood estimate of p

Laplace’s goal isn’t to calculate the probability of the observed data; it’s to infer what $p$ is, given the observed census data. One way to approach this is to ask: what is the best $p$ that explains the data – what is the value of $p$ that maximizes $P(b \mid p, N)$?

It’s easily shown (by taking the derivative of $P(b \mid p, N)$ with respect to $p$) that this optimal $p$ is just the observed frequency of boys:

$$\hat{p} = \frac{251527}{493472} = 0.51$$

The hat on $\hat{p}$ denotes an estimated parameter that’s been fitted to data.

Even with this best $p$, the data remain improbable: $P(b \mid \hat{p}, N) = 0.001$. It’s improbable that we would have observed exactly 251527 boys, simply because there are many other outcomes that could have happened. The probability of the data is not the probability of $p$; $P(b \mid p, N)$ is not $P(p \mid b, N)$. When I deal you a five card poker hand, it’s laughably unlikely that I would have dealt you exactly those five cards if I were dealing fairly, but that doesn’t mean you should reach for your revolver.

It seems like there ought to be some relationship, though. Our optimal $\hat{p} = 0.51$ does seem like a much better explanation of the observed data than $p = 0.5$ is. $P(b \mid p, N)$ is called the likelihood of $p$, signifying our intuition that it seems like it should be a relative measure of how well a given $p$ explains our observed data $b$. We call $\hat{p}$ the maximum likelihood estimate of $p$. The London and Paris data have different maximum likelihood values of $\hat{p}$: 0.5135 versus 0.5097. Besides asking whether the birth sex ratio is 50:50, we might even ask: is the ratio the same in London as it is in Paris?
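Both the sum-to-one property and the maximum likelihood estimate are easy to check numerically. A minimal sketch (the grid resolution and variable names are my own choices):

```python
import numpy as np
import scipy.stats as stats

b = 251527   # boys in the Paris census
N = 493472   # total births

# The pmf over all N+1 possible outcomes b = 0..N sums to one.
all_b = np.arange(N + 1)
total = stats.binom.pmf(all_b, N, 0.5).sum()

# Scan a grid of p values: the likelihood P(b | p, N) peaks at
# the observed frequency b/N, the maximum likelihood estimate.
p_grid = np.linspace(0.001, 0.999, 999)
p_hat = p_grid[np.argmax(stats.binom.pmf(b, N, p_grid))]
```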
The probability of p

Laplace made an intuitive leap. He reasoned that one $p$ value is more probable than another by the ratio of these probabilities:

$$\frac{P(p_1 \mid b, N)}{P(p_2 \mid b, N)} = \frac{P(b \mid p_1, N)}{P(b \mid p_2, N)}$$

if we had no other reason to favor one value for $p$ over the other (we’ll see later that Laplace implicitly assumed a uniform “prior” for $p$). So $\hat{p} = 0.51$ is $\sim 10^{40}$-fold more probable than $p = 0.5$, given the Paris data. This relationship also implies that we can obtain a probability distribution for $p$ by normalizing by the sum over all possible $p$ in the likelihood, which requires an integral, not just a simple sum, since $p$ is continuous:

$$P(p \mid b, N) = \frac{P(b \mid p, N)}{\int_0^1 P(b \mid p, N)\, dp}$$

This is a probability density, because $p$ is continuous. Strictly speaking, the probability of any specific value of $p$ is zero, because there are an infinite number of values of $p$, and $\int_0^1 P(p \mid b, N)\, dp$ has to be 1.

It’s possible (and indeed, it frequently happens) that a probability density function like $P(p \mid b, N)$ can be greater than 1.0 over a small range of $p$, so don’t be confused if you see that; it’s the integral that counts. For example, you can see that there are large values of $P(p \mid b, N)$ in the figure above, where I’ve plotted Laplace’s probability densities for the Paris and London data.

In the figure, it seems clear that $p = 0.5$ is not supported by either the Paris or London data. We also see that the uncertainty around the Paris $\hat{p}$ does not overlap with the uncertainty around the London $\hat{p}$. It appears that the birth sex ratio in Paris and London is different.

We’re just eyeballing, though, when we say that it seems that the two distributions for $p$ don’t overlap 0.5, nor do they overlap each other. Can we be more quantitative?
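We can reproduce a density curve like these by evaluating the likelihood on a fine grid of $p$ and normalizing it to integrate to one. A sketch for the Paris counts (the grid resolution is an arbitrary choice of mine):

```python
import numpy as np
import scipy.stats as stats

b, N = 251527, 493472          # Paris: boys, total births

# Likelihood P(b | p, N) on a fine grid of p values.
p = np.linspace(0.0, 1.0, 100001)
dp = p[1] - p[0]
like = stats.binom.pmf(b, N, p)

# Normalize so the density P(p | b, N) integrates to 1 over 0..1.
density = like / (like.sum() * dp)

# The peak sits near b/N, and its height is far greater than 1.0:
# only the integral of a density is constrained, not its values.
peak_p = p[np.argmax(density)]
```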
The cumulative probability of p

Because the probability at any given continuous value of $p$ is actually zero, it’s hard to frame a question like “is $p = 0.5$?”. Instead, Laplace framed a question with a probability he could calculate: what is the probability that $p \leq 0.5$? If that probability is tiny, then we have strong evidence that $p > 0.5$.

A cumulative probability distribution $F(x)$ is the probability that a variable takes on a value less than or equal to $x$. For a continuous real-valued variable with a probability density function $P(x)$, $F(x) = \int_{-\infty}^{x} P(x)\, dx$. For a continuous probability constrained to the range $0..1$, $F(p) = \int_0^p P(p)\, dp$ and $F(1) = \int_0^1 P(p \mid b, N)\, dp = 1$.

So Laplace framed his question as:

$$P(p \leq 0.5 \mid b, N) = \frac{\int_0^{0.5} P(b \mid p, N)\, dp}{\int_0^1 P(b \mid p, N)\, dp}$$

Then Laplace spent a bazillion pages working out those integrals by hand, obtaining an estimated log probability of -42.0615089 (i.e., a probability of $8.7 \cdot 10^{-43}$): decisive evidence that the probability of having a boy must be greater than 0.5.

These days we can replace Laplace’s virtuosic calculations and approximations with one call to Python:

import scipy.special as special

b = 251527
N = 493472
p = 0.5
answer = special.betainc(b+1, N-b+1, p)
print(answer)

which gives $1.1 \cdot 10^{-42}$. Laplace got it pretty close!
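As a sanity check on that one call (not something Laplace needed), we can approximate the same ratio of integrals with a brute-force grid sum. A sketch; the grid size is my choice, and the agreement is only approximate:

```python
import numpy as np
import scipy.special as special
import scipy.stats as stats

b, N = 251527, 493472

# One-call answer: scipy's regularized incomplete Beta integral.
exact = special.betainc(b + 1, N - b + 1, 0.5)

# Brute force: evaluate the likelihood on a grid, and take the
# fraction of its total area lying at p <= 0.5.
p = np.linspace(0.0, 1.0, 200001)
like = stats.binom.pmf(b, N, p)
approx = like[p <= 0.5].sum() / like.sum()
```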
Beta integrals

I probably won’t have time to actually cover this on the board in lecture, but I want to quickly explain what this scipy.special.betainc function is. It’s something called a Beta integral.

The complete Beta integral is:

$$B(a, b) = \int_0^1 p^{a-1} (1-p)^{b-1}\, dp = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}$$

A gamma function is a generalization of the factorial from integers to real numbers. For integer $a$,

$$a! = \Gamma(a+1)$$

and conversely

$$\Gamma(a) = (a-1)!$$

The incomplete Beta integral is:

$$B(x; a, b) = \int_0^x p^{a-1} (1-p)^{b-1}\, dp$$

and has no clean analytical expression, but statistics packages typically give you a computational method for calculating it – hence the SciPy scipy.special.betainc function.

If you wrote out Laplace’s problem in terms of binomial probability distributions for $P(b \mid p, N)$, you’d see the binomial coefficient cancel out (it’s a constant with respect to $p$), leaving:

$$P(p \leq 0.5 \mid b, N) = \frac{\int_0^{0.5} P(b \mid p, N)\, dp}{\int_0^1 P(b \mid p, N)\, dp} = \frac{B(0.5;\, b+1,\, N-b+1)}{B(b+1,\, N-b+1)}$$

Alas, reading documentation in Python is usually essential, and it turns out that the scipy.special.betainc function doesn’t just calculate the incomplete Beta integral; it sneakily calculates a regularized incomplete Beta integral, $B(x; a, b)\,/\,B(a, b)$, by default, which is why all we needed to do was call:

answer = special.betainc(b+1, N-b+1, p)

Laplace treated the unknown parameter $p$ like it was something he could infer, expressing an uncertain probability distribution over it. He obtained that distribution by inverse probability: by using $P(b \mid p, N)$, the probability of the data if the parameter were known and given, to calculate $P(p \mid b, N)$, the probability of the unknown parameter given the data.
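The Gamma-factorial identity and the regularization of betainc are both easy to confirm directly with scipy.special; a quick sketch on small integers:

```python
from math import factorial
import scipy.special as special

a, c = 6, 4

# Gamma generalizes the factorial: a! = Gamma(a+1), so Gamma(7) = 6! = 720.
g = special.gamma(a + 1)

# Complete Beta integral via the Gamma identity:
# B(a,c) = Gamma(a) Gamma(c) / Gamma(a+c)
B_direct = special.beta(a, c)
B_gamma = special.gamma(a) * special.gamma(c) / special.gamma(a + c)

# betainc is regularized (divided by the complete Beta integral),
# so it reaches exactly 1 at x = 1.
full = special.betainc(a, c, 1.0)
```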
Laplace’s reasoning was clear, and proved to be influential. Soon he realized that the Reverend Thomas Bayes, in England, had derived a very similar approach to inverse probability just a few years earlier. We’ll learn about Bayes’ 1763 paper in a bit. But for now, let’s leave Laplace and Bayes, and lay out some basic terminology of probabilities and probabilistic inference.
A minicourse in probability calculus

1. Random variables

A random variable is something that can take on a value. The value might be discrete (like “boy” or “girl”, or a roll of a die 1..6) or it might be real-valued (like a real number drawn from a Gaussian distribution). We’ll denote random variables or events with capital letters, like $X$. We’ll denote values or outcomes with small letters, like $x$.

When we say $P(X)$ (the probability of random variable $X$), we are envisioning a set of values $P(X = x)$: the probability that we could get each possible outcome $x$.

Probabilities sum to one. If $X$ has discrete outcomes $x$, $\sum_x P(X = x) = 1$. If $X$ has continuous outcomes $x$, $\int_{-\infty}^{\infty} P(X = x)\, dx = 1$.

For example, suppose we have a fair die, and a loaded die. With the fair die, the probability of each outcome 1..6 is $\frac{1}{6}$. With the loaded die, let’s suppose that the probability of rolling a six is 0.5, and the probability of rolling anything else (1..5) is 0.1. We have a bag with fair dice and loaded dice in it. We pick a die out of the bag randomly and roll it. What’s the probability of rolling 1, 2, 3, 4, 5 or 6? We have two random variables in this example: let’s call $D$ the outcome of whether we chose a fair or a loaded die, and $R$ the outcome of our roll. $D$ takes on values $f$ or $l$ (fair or loaded); $R$ takes on values 1..6.
2. Conditional probability

$P(X \mid Y)$ is a conditional probability distribution: the probability that $X$ takes on some value, given a value of $Y$.

To put numbers to a discrete conditional probability distribution $P(X \mid Y)$, envision a table with a row for each value of $Y$, and a column for each value of $X$. Each row sums to one: $\sum_X P(X \mid Y) = 1$.

In our example, I told you $P(R \mid D)$: the probability of rolling the possible outcomes 1..6, when you know whether the die is fair or loaded.

roll:        1     2     3     4     5     6
D = fair:    1/6   1/6   1/6   1/6   1/6   1/6
D = loaded:  1/10  1/10  1/10  1/10  1/10  1/2

3. Joint probability

$P(X, Y)$ is a joint probability: the probability that $X$ takes on some value and $Y$ takes on some value.

Again, envision a table with a row (or column) for each value of $Y$, and a column (or row) for each value of $X$ – but now the whole table sums to one, $\sum_X \sum_Y P(X, Y) = 1$.

For instance, we might want to know the probability that we chose a loaded die and we rolled a six. You don’t know the joint distribution yet in our example, because I haven’t given you enough information.
4. Relationship between conditional and joint probability

The joint probability that $X$ and $Y$ both happen is the probability that $Y$ happens, then the probability that $X$ happens given $Y$:

$$P(X, Y) = P(X \mid Y)\, P(Y)$$

Also, conversely, because we’re not talking about causality (with a direction), only about statistical dependency:

$$P(X, Y) = P(Y \mid X)\, P(X)$$

So for our example, let’s suppose that the probability of choosing a fair die from the bag is 0.9, and the probability of choosing a loaded one is 0.1. That’s $P(D)$. Now we can calculate the joint probability distribution $P(R, D) = P(R \mid D)\, P(D)$:

roll:        1      2      3      4      5      6
D = fair:    9/60   9/60   9/60   9/60   9/60   9/60
D = loaded:  1/100  1/100  1/100  1/100  1/100  1/20

If you have additional random variables in play, they stay where they are on the left or right side of the conditioning bar. For example, if the joint probability of $X, Y$ was conditional on $Z$:

$$P(X, Y \mid Z) = P(Y \mid X, Z)\, P(X \mid Z) = P(X \mid Y, Z)\, P(Y \mid Z)$$

Or, if I start from the joint probability $P(X, Y, Z)$:

$$P(X, Y, Z) = P(Y, Z \mid X)\, P(X) = P(X, Z \mid Y)\, P(Y)$$
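These tables are easy to hold as numpy arrays, building the joint from the conditional and the prior exactly as the product rule says. A sketch of this example’s numbers (array names are mine):

```python
import numpy as np

# P(R | D): rows are D = fair, D = loaded; columns are rolls 1..6.
P_R_given_D = np.array([[1/6] * 6,
                        [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

# P(D): 0.9 chance of drawing a fair die, 0.1 of drawing a loaded one.
P_D = np.array([0.9, 0.1])

# Joint P(R, D) = P(R | D) P(D): scale each row by its die's prior.
P_RD = P_R_given_D * P_D[:, None]
```

Each row of the conditional table sums to one, while the joint table sums to one as a whole; the loaded-and-six entry is 0.1 × 0.5 = 1/20.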
5. Marginalization

If we have a joint distribution like $P(X, Y)$, we can “get rid” of one of the variables $X$ or $Y$ by summing it out:

$$P(X) = \sum_Y P(X, Y)$$

It’s called marginalization because: imagine a 2-D table with rows for $Y$’s values and columns for $X$’s values. Each entry in the table is $P(x, y)$ for two specific values $x, y$. If you sum across the columns to the right margin, your row sums give you $P(Y)$. If you sum down the rows to the bottom margin of the table, your column sums give you $P(X)$. When we obtain a distribution $P(X)$ by marginalizing $P(X, Y)$, we say $P(X)$ is the marginal distribution of $X$.

In our example, we can marginalize our joint probability matrix:

roll:        1      2      3      4      5      6
D = fair:    9/60   9/60   9/60   9/60   9/60   9/60
D = loaded:  1/100  1/100  1/100  1/100  1/100  1/20
P(R):        0.16   0.16   0.16   0.16   0.16   0.20

Now we have the marginal distribution $P(R) = \sum_D P(R \mid D)\, P(D)$. This is the probability that you’re going to observe a specific roll of 1..6, if you don’t know what kind of die you pulled out of the bag. You’ve marginalized over your uncertainty of an unknown variable $D$. Because sometimes you pull a loaded die out of the bag, the probability that you’re going to roll a six is 0.2, slightly higher than $\frac{1}{6}$.
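Marginalizing the joint table is just a column sum, and we can double-check $P(\mathrm{six}) = 0.2$ by simulation. A sketch (the sample size and seed are arbitrary choices of mine):

```python
import numpy as np

P_R_given_D = np.array([[1/6] * 6,
                        [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
P_D = np.array([0.9, 0.1])
P_RD = P_R_given_D * P_D[:, None]

# Marginal P(R): sum the joint table over D (down the columns).
P_R = P_RD.sum(axis=0)

# Monte Carlo check: draw a die type, then ask if the roll is a six.
rng = np.random.default_rng(0)
n = 100000
die = rng.choice(2, size=n, p=P_D)       # 0 = fair, 1 = loaded
p_six = np.where(die == 1, 0.5, 1/6)     # each drawn die's P(six)
freq_six = (rng.random(n) < p_six).mean()
```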
If I have additional random variables in play, again they stay where they are. Thus:

$$P(X \mid Z) = \sum_Y P(X, Y \mid Z)$$

and

$$P(X, Z) = \sum_Y P(X, Y, Z)$$
6. Independence

Two random variables $X$ and $Y$ are independent if:

$$P(X, Y) = P(X)\, P(Y)$$

which necessarily also means:

$$P(X \mid Y) = P(X) \quad \text{and} \quad P(Y \mid X) = P(Y)$$

In our example, the outcome of a die roll $R$ is not independent of the die type $D$, of course. However, if I chose a die $D$ and rolled it $N$ times, we can assume the individual rolls $X_1..X_N$ are independent, and the joint probability of those rolls could be factored into a product of their individual probabilities:

$$P(X_1..X_N \mid D) = \prod_i P(X_i \mid D)$$

In probability modeling, we will often use independence assumptions to break a big joint probability distribution down into a smaller set of terms, to reduce the number of parameters in our models. The most careful way to invoke an independence assumption is in two steps: first write the joint probability out as a product of conditional probabilities, then specify which conditioning variables are going to be dropped. For example, we can write:

$$P(X, Y, Z) = P(X \mid Y, Z)\, P(Y \mid Z)\, P(Z)$$

Then state, “and I assume $Y$ is independent of $Z$, so:”

$$P(X, Y, Z) \simeq P(X \mid Y, Z)\, P(Y)\, P(Z)$$

It’s possible to have a situation where $X$ is dependent on $Y$ in $P(X \mid Y)$, but when a variable $Z$ is introduced, $P(X \mid Y, Z) = P(X \mid Z)$. In this case we say that $X$ is conditionally independent of $Y$ given $Z$. For example, $Z$ could cause $X$, and $Z$ could cause $Y$; $Y$’s apparent effect on $X$ is entirely through $Z$. This starts to get at ideas from Bayesian networks, a class of methods that give us tools for manipulating conditional dependencies and doing inference in complicated networks.
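The factorization of independent rolls is easy to see with numbers: the probability of a specific sequence of rolls from a known die is just the product of the per-roll probabilities. A sketch using the loaded die from our example (the particular sequence is arbitrary):

```python
import numpy as np

# P(R | D = loaded), for rolls 1..6.
p_loaded = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

# P(rolls 6, 6, 3, 6 | loaded): independence lets the joint
# probability factor into a product of individual probabilities.
rolls = [6, 6, 3, 6]
prob_seq = np.prod([p_loaded[r - 1] for r in rolls])
```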
7. Bayes’ theorem

We’re allowed to apply the above rules repeatedly, algebraically, to manipulate probabilities. Suppose we know $P(X \mid Y)$ but we want to know $P(Y \mid X)$. From the definition of conditional probability we can obtain:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$

and from the definition of marginalization we know:

$$P(X) = \sum_Y P(X, Y) = \sum_Y P(X \mid Y)\, P(Y)$$

Congratulations, you’ve just derived and proven Bayes’ theorem:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{\sum_Y P(X \mid Y)\, P(Y)}$$

If you assume that $P(Y)$ is a constant (a uniform prior) it cancels out, and you recognize Laplace’s inverse probability calculation.

Less trivially than just talking about $X$ and $Y$ as random variables, we can use Bayes’ theorem when talking more generally about observed data $D$ and a hypothesis $H$:

$$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{\sum_H P(D \mid H)\, P(H)}$$

The probability of our hypothesis, given the observed data, is proportional to the probability of the data given the hypothesis, times the probability of the hypothesis before you saw any data. The denominator, the normalization factor, is $P(D) = \sum_H P(D \mid H)\, P(H)$: the probability of the data summed over all possible hypotheses.

$P(D \mid H)$, the probability of the data, is usually the easiest bit. This is often called the likelihood. (It’s the probability of the data $D$; it’s the likelihood of the model $H$.)

$P(H)$ is the prior.

$P(D)$ is sometimes called the evidence: the marginal probability of the data, summed over all the possible hypotheses that could’ve generated it.

$P(H \mid D)$ is called the posterior probability of $H$.

So Bayes’ theorem gives us a principled way to calculate the posterior probability of a hypothesis $H$, given data $D$ that we’ve observed.
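To see Bayes’ theorem in action on our dice example: if you draw a die from the bag and roll a six, what’s the posterior probability that the die is loaded? A sketch using the numbers above:

```python
# Prior P(H) and likelihoods P(D | H) from the dice example.
P_loaded, P_fair = 0.1, 0.9
P_six_given_loaded, P_six_given_fair = 0.5, 1/6

# Evidence P(D): the marginal probability of rolling a six.
P_six = P_six_given_loaded * P_loaded + P_six_given_fair * P_fair

# Posterior P(H | D) by Bayes' theorem.
P_loaded_given_six = P_six_given_loaded * P_loaded / P_six
```

A single six raises the probability that the die is loaded from the prior 0.1 to a posterior of 0.25.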
But: where do we get $P(H)$ from, if it’s supposed to be a probability of something before any data have arrived? We may have to make a subjective assumption about it, like saying we assume a uniform prior: assume that all hypotheses are equiprobable before the data arrive.

How do we enumerate all possible hypotheses $H$? Sometimes we’ll be in a hypothesis test situation of explicitly comparing one hypothesis against another, but in general, there’s always more we could come up with.

And what does it mean to talk about the probability of a hypothesis?
Further reading
D.J.C. MacKay, Is the Pope the Pope? (https://www.nature.com/articles/382490a0.pdf), Nature, 1996.
S.R. Eddy, What is Bayesian statistics? (http://www.nature.com/nbt/journal/v22/n9/full/nbt0904-1177.html), Nature Biotechnology, 2004.

D.J.C. MacKay, Bayesian Interpolation (https://pdfs.semanticscholar.org/8e68/c54f39e87daf3a8bdc0ee005aece3c652d11.pdf), Neural Computation, 1992.