IN2064 Machine Learning Lecture 3
Lecture 3: Probabilistic Inference
Prof. Dr. Günnemann
Data Analytics and Machine Learning Technical University of Munich
Winter term 2020/2021
We flip the same coin 10 times and observe the sequence
H T H H T H H H T H
What is the probability that the next coin flip is T?
Possible answers: ∼ 0, ∼ 0.3, ∼ 0.38, ∼ 0.5, ∼ 0.76, ∼ 1
30%? This seems reasonable, but why?
Every flip is random. So every sequence of flips is random, i.e., it has some probability to be observed.
For the i-th coin flip we write
pi(Fi = T) = θi
To denote that the probability distribution depends on θi, we write
pi(Fi = T | θi) = Ber(Fi = T | θi) = θi,   i.e. Fi ∼ Ber(θi)
Note the i in the index! We are trying to reason about θ11.
Maximum likelihood estimation
All the randomness of a sequence of flips is governed (modeled) by the parameters θ1, . . . , θ10:
p(H T H H T H H H T H | θ1, θ2, …, θ10)
What do we know about θ1, . . . , θ10? Can we infer something about θ11?
At first sight, there is no connection.
Find θi's such that p(H T H H T H H H T H | θ1, θ2, …, θ10) is as high as possible. This is a very important principle:
Maximize the likelihood of our observation. (Maximum Likelihood)
How can we compute p(H T H H T H H H T H | θ1, θ2, …, θ10)? We need to model this.
First assumption: The coin flips do not affect each other—independence.
p(H T H H T H H H T H | θ1, θ2, …, θ10)
= p1(F1 = H | θ1) · p2(F2 = T | θ2) · … · p10(F10 = H | θ10)
= ∏_{i=1}^{10} pi(Fi = fi | θi)
Notice the i in pi,θi! This indicates: The coin flip at time 1 is different from the one at time 2, …
But the coin does not change.
Second assumption: The flips are qualitatively the same—identical distribution.
∏_{i=1}^{10} pi(Fi = fi | θi) = ∏_{i=1}^{10} p(Fi = fi | θ)
In total: The 10 flips are independent and identically distributed (i.i.d.).
Remember θ11? With the i.i.d. assumption we can link it to θ1, . . . , θ10. Now we can write down the probability of our sequence with respect to θ:
∏_{i=1}^{10} p(Fi = fi | θ) = (1−θ) · θ · (1−θ) · (1−θ) · θ · (1−θ) · (1−θ) · (1−θ) · θ · (1−θ)
= θ^3 (1 − θ)^7
Under our model assumptions (i.i.d.):
p(H T H H T H H H T H | θ) = θ^3 (1 − θ)^7
where the sequence H T H H T H H H T H is our observed data D.
This can be interpreted as a function f(θ) := p(D | θ). We want to find
the maxima (maximum likelihood) of this function.
θMLE = argmax_{θ ∈ [0,1]} f(θ)
[Figure: the likelihood f(θ) = θ^3 (1 − θ)^7 plotted over θ ∈ [0, 1]; the maximum (critical point) is marked.]
Very important: the likelihood function is not a probability distribution over θ since ∫ p(D | θ) dθ ≠ 1 in general.
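As a small illustration (not part of the original slides), the likelihood can be evaluated numerically; the following Python sketch scans a grid over θ and picks the maximizer. Variable names are our own choice.

```python
import numpy as np

# Observed data D: the sequence H T H H T H H H T H
flips = list("HTHHTHHHTH")
n_T = flips.count("T")   # 3 tails
n_H = flips.count("H")   # 7 heads

def likelihood(theta):
    # f(theta) = p(D | theta) = theta^|T| * (1 - theta)^|H|
    return theta**n_T * (1.0 - theta)**n_H

# Evaluate f on a dense grid and pick the maximizer numerically.
grid = np.linspace(0.0, 1.0, 10001)
print(grid[np.argmax(likelihood(grid))])   # ~0.3
```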
How do we maximize the likelihood function?
High-school math! Take the derivative df/dθ, set it to 0, and solve for θ.
Verify these critical points by checking the second derivative.
This is possible, but even for our simple f(θ) the math is rather ugly. Can we simplify the problem?
Luckily, monotonic functions preserve critical points.
argmax_{θ ∈ [0,1]} f(θ) = argmax_{θ ∈ [0,1]} log f(θ)
[Figure: f(θ) and log f(θ) have their critical point (maximum) at the same location.]
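The "differentiate the log-likelihood and set it to zero" recipe can also be carried out symbolically. A minimal sketch, assuming SymPy is available:

```python
import sympy as sp

theta = sp.symbols("theta", positive=True)
n_T, n_H = 3, 7   # tails and heads in H T H H T H H H T H

# log f(theta) = |T| * log(theta) + |H| * log(1 - theta)
log_likelihood = n_T * sp.log(theta) + n_H * sp.log(1 - theta)

# Take the derivative, set it to zero, and solve for theta.
critical_points = sp.solve(sp.diff(log_likelihood, theta), theta)
print(critical_points)   # [3/10], i.e. theta_MLE = |T| / (|T| + |H|)
```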
Maximum Likelihood Estimation (MLE) for any coin sequence?
θMLE = |T| / (|T| + |H|)
where |T|, |H| denote the number of T and H, respectively.
(derivation in the exercise sheet)
Remember we wanted to find the probability the next coin flip is T
F11 ∼ Ber(θMLE)
p(F11 = T | θMLE) = Ber(F11 = T | θMLE) = θMLE = |T| / (|T| + |H|)
This justifies 30% as a reasonable answer to our initial question. Problem solved?!
Just for fun, a totally different sequence (same coin!): H H
For this sequence the MLE is θMLE = 0 / (0 + 2) = 0, i.e. the model says the next flip can never be T.
But even a fair coin (θ = 0.5) has a 25% chance of showing this result!
The MLE solution seems counter-intuitive. Why?
We have prior beliefs:
”Coins usually don’t land heads all the time”
How can we
1. represent such beliefs mathematically?
2. incorporate them into our model?
Bayesian inference
How can we represent our beliefs about θ mathematically? (Subjective) Bayesian interpretation of probability:
Distribution p(θ) reflects our subjective beliefs about θ.
A prior distribution p(θ) represents our beliefs before we observe any data.
How do we choose p(θ)? The only constraints are
1. It must not depend on the data
2. p(θ) ≥ 0 for all θ
3. ∫ p(θ) dθ = 1
Properties 2 and 3 have to hold on the support (i.e., feasible values) of θ. In our setting, only values θ ∈ [0, 1] make sense.
This leaves room for (possibly subjective) model choices!
Some possible choices for the prior on θ:
[Figure: four example prior densities p(θ) over θ ∈ [0, 1].]
The Bayes formula tells us how we update our beliefs about θ after observing the data D
p(θ | D) = p(D | θ) · p(θ) / p(D)
Here, p(θ | D) is the posterior distribution. It encodes our beliefs in the value of θ after observing data.
The posterior depends on the following terms:
• p(D | θ) is the likelihood.
• p(θ) is the prior that encodes our beliefs before observing the data.
• p(D) is the evidence. It acts as a normalizing constant that ensures that the posterior distribution integrates to 1.
posterior ∝ likelihood · prior
Usually, we define our model by specifying the likelihood and the prior. We can obtain the evidence using the sum rule of probability
p(D) = ∫ p(D, θ) dθ = ∫ p(D | θ) p(θ) dθ
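For the coin example the evidence is a one-dimensional integral, so it can be checked numerically. A sketch, assuming SciPy and, for concreteness, a Beta(2, 2) prior (one of many valid choices; the Beta family is introduced a few slides later):

```python
from scipy import integrate, stats

n_T, n_H = 3, 7          # tails / heads in the observed sequence
a, b = 2.0, 2.0          # hypothetical prior hyperparameters (an assumption)

def likelihood(theta):
    return theta**n_T * (1.0 - theta)**n_H

def prior(theta):
    # Beta(a, b) density as one concrete prior choice on [0, 1]
    return stats.beta(a, b).pdf(theta)

# Evidence p(D) = integral over theta of p(D | theta) * p(theta)
evidence, _ = integrate.quad(lambda t: likelihood(t) * prior(t), 0.0, 1.0)

def posterior(theta):
    # Bayes formula: posterior = likelihood * prior / evidence
    return likelihood(theta) * prior(theta) / evidence

check, _ = integrate.quad(posterior, 0.0, 1.0)
print(evidence, check)   # check is ~1.0: the posterior integrates to 1
```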
The Bayes formula tells us how to update our beliefs given the data
[Figure: three panels over θ ∈ [0, 1] showing the prior p(θ), the likelihood p(D | θ), and their product p(D | θ) p(θ), the unnormalized posterior.]
Observing more data increases our confidence
[Figure: posterior p(θ | D) and scaled likelihood for growing data sets: |H| = 2, |T| = 4; |H| = 4, |T| = 8; |H| = 8, |T| = 16; |H| = 16, |T| = 32. The posterior concentrates as more flips are observed.]
Note that the likelihood is scaled in the plots for better visibility.
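This concentration effect is easy to reproduce numerically. A sketch that normalizes likelihood · prior on a grid, assuming a uniform prior for illustration; the posterior's standard deviation shrinks as the data set grows:

```python
import numpy as np

theta = np.linspace(1e-6, 1 - 1e-6, 2001)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)              # uniform prior on [0, 1] (an assumption)

for n_H, n_T in [(2, 4), (4, 8), (8, 16), (16, 32)]:
    unnorm = theta**n_T * (1 - theta)**n_H * prior   # likelihood * prior
    post = unnorm / (unnorm.sum() * dtheta)          # normalize numerically
    mean = (theta * post).sum() * dtheta
    std = np.sqrt(((theta - mean) ** 2 * post).sum() * dtheta)
    print(n_H + n_T, round(mean, 3), round(std, 3))  # std shrinks as data grows
```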
Question: What happens if p(θ) = 0 for some particular θ?
posterior ∝ likelihood · prior
p(θ | D) ∝ p(D | θ) · p(θ)
Posterior will always be zero for that particular θ regardless of the likelihood/data.
Maximum a posteriori estimation
Back to our coin problem: How do we estimate θ from the data? In MLE, we were asking the wrong question
θMLE = argmax_θ p(D | θ)
MLE ignores our prior beliefs and performs poorly if little data is available.
Actually, we should care about the posterior distribution p(θ | D).
What if we instead maximize the posterior probability?
θMAP = argmax_θ p(θ | D)
This approach is called maximum a posteriori (MAP) estimation.
Maximum a posteriori estimation
θMAP = argmax_θ p(θ | D)
     = argmax_θ p(D | θ) p(θ) / p(D)
     = argmax_θ p(D | θ) p(θ)
We can ignore the factor 1 / p(D) since it is a (positive) constant independent of θ.
We already know the likelihood p(D | θ) from before, how do we choose the prior p(θ)?
Often, we choose the prior to make subsequent calculations easier.
We choose the Beta distribution for reasons that will become clear later.
Beta(θ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) · θ^(a−1) (1 − θ)^(b−1),   θ ∈ [0, 1]
• a > 0, b > 0 are the distribution parameters
• Γ is the gamma function; for n ∈ ℕ, Γ(n) = (n − 1)!
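As a quick sanity check (not from the slides), the definition can be compared against SciPy's built-in Beta distribution; the parameter values below are illustrative:

```python
import math
from scipy import stats

a, b = 8, 4        # example parameter values, as in the Beta(8, 4) panel
theta = 0.3

# The pdf written out with the gamma function ...
coeff = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
pdf_by_hand = coeff * theta**(a - 1) * (1 - theta)**(b - 1)

# ... and SciPy's built-in Beta distribution; the two values agree.
pdf_scipy = stats.beta(a, b).pdf(theta)
print(pdf_by_hand, pdf_scipy)
```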
The Beta PDF for different choices of a and b:
[Figure: four panels with Beta densities over θ ∈ [0, 1] for different parameter settings, e.g. Beta(8, 4).]
Let’s put everything together
p(θ | D) = p(D | θ) · p(θ) / p(D)
         ∝ p(D | θ) · p(θ)   because p(D) is constant w.r.t. θ.
p(D | θ) = θ^|T| (1 − θ)^|H|,   p(θ) ≡ p(θ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) · θ^(a−1) (1 − θ)^(b−1).
So we get:
p(θ | D) ∝ θ^|T| (1 − θ)^|H| · Γ(a + b) / (Γ(a) Γ(b)) · θ^(a−1) (1 − θ)^(b−1)
         ∝ θ^(|T|+a−1) (1 − θ)^(|H|+b−1).
We are looking for
θMAP = argmax_θ p(θ | D)
      = argmax_θ θ^(|T|+a−1) (1 − θ)^(|H|+b−1)
As before, the problem becomes much easier if we consider the logarithm
θMAP = argmax_θ log p(θ | D)
      = argmax_θ (|T| + a − 1) log θ + (|H| + b − 1) log(1 − θ)
With some algebra we obtain
θMAP = (|T| + a − 1) / (|H| + |T| + a + b − 2)
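A small numerical check of this closed form, assuming illustrative values for the data and the prior hyperparameters:

```python
import numpy as np

n_T, n_H = 3, 7
a, b = 2.0, 2.0   # illustrative Beta prior hyperparameters (an assumption)

def log_posterior_unnorm(theta):
    # (|T| + a - 1) * log(theta) + (|H| + b - 1) * log(1 - theta)
    return (n_T + a - 1) * np.log(theta) + (n_H + b - 1) * np.log(1 - theta)

# Closed-form MAP estimate from above ...
theta_map = (n_T + a - 1) / (n_H + n_T + a + b - 2)

# ... and a numerical sanity check by grid search over (0, 1).
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
theta_map_numeric = grid[np.argmax(log_posterior_unnorm(grid))]
print(theta_map, theta_map_numeric)   # both ~0.333 for these numbers
```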
Estimating the posterior distribution
What we have so far
θMAP = argmax_θ p(θ | D)
The most probable value of θ under the posterior distribution.
Is this the best we can do?
• How certain are we in our estimate?
• What is the probability that θ lies in some interval?
For this, we need to consider the entire posterior distribution p(θ | D), not just its mode θMAP.
We know the posterior up to a normalizing constant (slide 24)
p(θ | D) ∝ θ^(|T|+a−1) (1 − θ)^(|H|+b−1).
Finding the true posterior p(θ | D) boils down to finding the normalization constant, such that the distribution integrates to 1.
Option 1: Brute-force calculation
• Computing ∫₀¹ θ^(|T|+a−1) (1 − θ)^(|H|+b−1) dθ.
• This is tedious, difficult and boring. Any alternatives?
Option 2: Pattern matching
• The unnormalized posterior θ^(|T|+a−1) (1 − θ)^(|H|+b−1) looks similar to the PDF of the Beta distribution
Beta(θ | α, β) = Γ(α + β) / (Γ(α) Γ(β)) · θ^(α−1) (1 − θ)^(β−1)
• Can we use this fact?
The unnormalized posterior:  p(θ | D) ∝ θ^(|T|+a−1) (1 − θ)^(|H|+b−1)
The Beta distribution:  Beta(θ | α, β) = Γ(α + β) / (Γ(α) Γ(β)) · θ^(α−1) (1 − θ)^(β−1)
Thus, we can conclude that the appropriate normalizing constant is
Γ(|T| + a + |H| + b) / (Γ(|T| + a) Γ(|H| + b))
and the posterior is a Beta distribution
p(θ | D) = Beta(θ | a + |T |, b + |H|)
Always remember this trick when you try to solve integrals that involve known pdfs (up to a constant factor)!
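The pattern-matching trick can be verified numerically: the integral of the unnormalized posterior should equal the reciprocal of that constant. A sketch with illustrative values for |T|, |H|, a, b:

```python
import math
from scipy import integrate

n_T, n_H = 3, 7
a, b = 2.0, 2.0   # illustrative prior hyperparameters (an assumption)

# Integrate the unnormalized posterior over [0, 1] ...
integral, _ = integrate.quad(
    lambda t: t**(n_T + a - 1) * (1 - t)**(n_H + b - 1), 0.0, 1.0
)

# ... and compare with 1 / normalizing constant obtained by pattern matching.
const = math.gamma(n_T + a + n_H + b) / (math.gamma(n_T + a) * math.gamma(n_H + b))
print(integral, 1.0 / const)   # the two values agree
```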
We started with the following prior distribution
p(θ) = Beta(θ | a, b)
and obtained the following posterior
p(θ | D) = Beta(θ | a + |T|, b + |H|)
Was this just a lucky coincidence?
No, this is an instance of a more general principle: the Beta distribution is a conjugate prior for the Bernoulli likelihood.
If a prior is conjugate for the given likelihood, then the posterior will be of the same family as the prior.
In our case, we can interpret the parameters a, b of the prior as the number of tails and heads that we saw in the past.
What are the advantages of the fully Bayesian approach? We have an entire distribution, not just a point estimate.
[Figure: posterior density p(θ | D) over θ ∈ [0, 1].]
We can answer questions such as (a numerical sketch follows below):
• What is the expected value of θ under p(θ | D)?
• What is the variance of p(θ | D)?
• Find a credible interval [θ1, θ2], such that Pr(θ ∈ [θ1, θ2] | D) = 95% (not to be confused with frequentist confidence intervals).
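These posterior summaries are one-liners once the posterior is recognized as a Beta distribution. A sketch using SciPy, with illustrative values for the data and the prior hyperparameters:

```python
from scipy import stats

n_T, n_H = 3, 7
a, b = 2.0, 2.0   # illustrative prior hyperparameters (an assumption)

posterior = stats.beta(a + n_T, b + n_H)   # p(theta | D) = Beta(a + |T|, b + |H|)

print(posterior.mean())             # expected value of theta under p(theta | D)
print(posterior.var())              # variance of p(theta | D)
print(posterior.interval(0.95))     # central 95% credible interval [theta1, theta2]
```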
We learned about three approaches for parameter estimation:
Maximum likelihood estimation (MLE)
• Goal: Optimization problem max_θ log p(D | θ)
• Result: Point estimate θMLE
• Coin example: θMLE = |T| / (|T| + |H|)
Maximum a posteriori (MAP) estimation
• Goal: Optimization problem max_θ log p(θ | D)
• Result: Point estimate θMAP
• Coin example: θMAP = (|T| + a − 1) / (|T| + |H| + a + b − 2)
Estimating the posterior distribution
• Goal: Find the normalizing constant p(D)
• Result: Full distribution p(θ | D)
• Coin example: p(θ | D) = Beta(θ | a + |T|, b + |H|)
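A compact recap of the three estimates for the coin example, with the prior hyperparameters a, b as illustrative defaults (an assumption, not fixed by the lecture):

```python
from scipy import stats

def coin_inference(n_T, n_H, a=2.0, b=2.0):
    # a, b are illustrative Beta prior hyperparameters (an assumption).
    theta_mle = n_T / (n_T + n_H)                         # MLE: ignores the prior
    theta_map = (n_T + a - 1) / (n_T + n_H + a + b - 2)   # MAP: mode of the posterior
    posterior = stats.beta(a + n_T, b + n_H)              # full posterior distribution
    return theta_mle, theta_map, posterior

mle, map_estimate, post = coin_inference(n_T=3, n_H=7)
print(mle, map_estimate, post.mean())
```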
The three approaches are closely connected. The posterior distribution is
p(θ | D) = Beta(θ | a + |T|, b + |H|).
Recall that the mode of Beta(α, β) is (α − 1) / (α + β − 2), for α, β > 1.
We see that the MAP solution is the mode of the posterior distribution
θMAP = (|T| + a − 1) / (|H| + |T| + a + b − 2)
If we choose a uniform prior (i.e. a = b = 1) we obtain the MLE solution
θMAP = (|T| + 1 − 1) / (|H| + |T| + 1 + 1 − 2) = |T| / (|H| + |T|) = θMLE
All these nice formulas are a consequence of choosing a conjugate prior. Had we chosen a non-conjugate prior, p(θ | D) and θMAP might not have a closed form.
How many flips?
We had p(θ | D) = Beta(θ | a + |T |, b + |H|)
Visualize the posterior (for given prior, e.g. a = b = 1):
[Figure: posterior densities for |T| = 1, |H| = 4; |T| = 10, |H| = 40; and |T| = 100, |H| = 400.]
With more data the posterior becomes more peaky – we are more certain about our estimate of θ.
Alternative view: a frequentist perspective
For MLE we had θMLE = |T| / (|T| + |H|).
Clearly, we get the same result for |T| = 1,|H| = 4 and
|T| = 10,|H| = 40. Which one is better? Why?
How many flips? Hoeffding’s Inequality for a sampling complexity bound:
p(|θMLE − θ| ≥ ε) ≤ 2e^(−2Nε²) ≤ δ, where N = |T| + |H|
For example, I want to know θ, within ε = 0.1 error, with probability at least 1−δ = 0.99
N ≥ ln(2/δ) / (2ε²)  →  N ≈ 265
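Plugging in the numbers from the example (a quick check, not from the slides):

```python
import math

eps, delta = 0.1, 0.01   # accuracy and failure probability from the example

# Hoeffding: p(|theta_MLE - theta| >= eps) <= 2 * exp(-2 * N * eps**2) <= delta
# Solving the right-hand inequality for N gives the bound below.
N = math.log(2 / delta) / (2 * eps**2)
print(math.ceil(N))      # 265 flips suffice for this (eps, delta) pair
```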
Predicting the next flip
Remember that we want to predict the next coin flip… For MLE:
1. Estimate θMLE = |T| / (|H| + |T|) from the data.
2. The probability that the next flip is T is then p(F11 = T | θMLE) = θMLE.