COMP2610 / COMP6261 – Information Theory Lecture 5: Bernoulli, Binomial, Maximum Likelihood and MAP
Robert Williamson
Research School of Computer Science
6 August 2018

Examples of application of Bayes' rule
- Formalizing problems in the language of probability
- Eating hamburgers, detecting terrorists, …
Frequentist vs Bayesian probabilities

The Bayesian Inference Framework
Bayesian Inference
Bayesian inference provides us with a mathematical framework explaining how to change our (prior) beliefs in the light of new evidence.
posterior: p(Z|X) = p(X|Z) p(Z) / p(X)
                  = p(X|Z) p(Z) / Σ_{Z'} p(X|Z') p(Z')
(likelihood: p(X|Z); prior: p(Z); evidence: p(X))
Prior: Belief that someone is sick
Likelihood: Probability of testing positive given someone is sick
Posterior: Probability of being sick given someone tests positive
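This calculation can be checked directly in code. A minimal sketch in Python; the prior and test accuracies below are illustrative assumptions, not figures from the lecture:

```python
# Minimal sketch of Bayes' rule for the sick/test example.
# All numbers are illustrative assumptions.
prior_sick = 0.01          # p(Z = sick)
p_pos_given_sick = 0.95    # likelihood p(X = positive | Z = sick)
p_pos_given_healthy = 0.05 # p(X = positive | Z = healthy)

# Evidence: p(X = positive) = sum over Z' of p(X | Z') p(Z')
p_pos = p_pos_given_sick * prior_sick + p_pos_given_healthy * (1 - prior_sick)

# Posterior: p(Z = sick | X = positive)
posterior_sick = p_pos_given_sick * prior_sick / p_pos
print(posterior_sick)  # ~0.16: a positive test is far from conclusive
```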

The Bernoulli and binomial distributions (we will make much use of these henceforth in studying binary channels)
Estimating probabilities from data
Bayesian inference for parameter estimation

1 The Bernoulli Distribution
2 The Binomial Distribution
3 Parameter Estimation
4 Bayesian Parameter Estimation
5 Wrapping up

The Bernoulli Distribution: Introduction
Consider a binary variable X ∈ {0, 1}. It could represent many things:
- Whether a coin lands heads or tails
- The presence/absence of a word in a document
- A transmitted bit in a message
- The success of a medical trial
Often, these outcomes (0 or 1) are not equally likely
What is a general way to model such an X?

The Bernoulli Distribution: Definition
The variable X takes on the outcomes
X = 1 with probability θ
X = 0 with probability 1 − θ
Here, 0 ≤ θ ≤ 1 is a parameter representing the probability of success
For higher values of θ, it is more likely to see 1 than 0 (e.g. a biased coin)

The Bernoulli Distribution: Definition
By definition,
p(X = 1|θ) = θ
p(X = 0|θ) = 1 − θ
More succinctly,
p(X = x|θ) = θ^x (1 − θ)^(1−x)
This is known as a Bernoulli distribution over binary outcomes:
p(X = x|θ) = Bern(x|θ) = θ^x (1 − θ)^(1−x)
Note the use of the conditioning symbol for θ; we will revisit this later

The Bernoulli Distribution: Mean and Variance
The expected value (or mean) is given by:
E[X|θ] = Σ_x x · p(x|θ)
       = 1 · p(X = 1|θ) + 0 · p(X = 0|θ)
       = θ.
The variance (or squared standard deviation) is given by:
V[X|θ] = E[(X − E[X])^2] = E[(X − θ)^2]
       = (0 − θ)^2 · p(X = 0|θ) + (1 − θ)^2 · p(X = 1|θ)
       = θ(1 − θ).
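These two identities are easy to verify empirically. A minimal sketch assuming NumPy is available, drawing samples with an arbitrary θ = 0.3:

```python
import numpy as np

theta = 0.3
rng = np.random.default_rng(0)
x = (rng.random(100_000) < theta).astype(float)  # Bernoulli(theta) samples

print(x.mean(), theta)                # empirical mean vs E[X] = theta
print(x.var(), theta * (1 - theta))   # empirical variance vs V[X] = theta(1 - theta)
```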

Example: Binary Symmetric Channel
Suppose a sender transmits messages s that are sequences of bits
The receiver sees the bit sequence (message) r
Due to noise in the channel, each bit of the message is flipped with probability f

Example: Binary Symmetric Channel
We can think of r as the outcome of a random variable, with conditional distribution given by:
p(r = 0|s = 0) = 1 − f    p(r = 1|s = 0) = f
p(r = 0|s = 1) = f        p(r = 1|s = 1) = 1 − f
If E denotes whether an error occurred, clearly
p(E = e) = Bern(e|f), e ∈ {0,1}.
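The channel itself is easy to simulate: independently flip each transmitted bit with probability f. A small Python sketch (the value f = 0.1 and the message length are arbitrary choices for illustration):

```python
import numpy as np

def bsc(s, f, rng):
    """Pass a bit array s through a binary symmetric channel with flip probability f."""
    flips = rng.random(len(s)) < f          # E = 1 with probability f
    return np.bitwise_xor(s, flips.astype(int))

rng = np.random.default_rng(1)
s = rng.integers(0, 2, size=10_000)         # random message bits
r = bsc(s, f=0.1, rng=rng)
print((r != s).mean())                      # empirical error rate, close to f = 0.1
```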

1 The Bernoulli Distribution
2 The Binomial Distribution
3 Parameter Estimation
4 Bayesian Parameter Estimation
5 Wrapping up

The Binomial Distribution: Introduction
Suppose we perform N independent Bernoulli trials
- e.g. we toss a coin N times
- e.g. we transmit a sequence of N bits across a noisy channel
Each trial has probability θ of success
What is the distribution of the number of times (m) that X = 1?
- e.g. the number of heads m we obtained
- e.g. the number of errors in the transmitted sequence

The Binomial Distribution: Definition
Let Y = Σ_{i=1}^{N} X_i, where X_i ∼ Bern(θ).
Then Y has a binomial distribution with parameters N, θ:
p(Y = m) = Bin(m|N, θ) = (N choose m) θ^m (1 − θ)^(N−m)
for m ∈ {0, 1, …, N}. Here
(N choose m) = N! / ((N − m)! m!)
is the # of ways we can obtain m heads out of N coin flips

The Binomial Distribution: Mean and Variance
It is easy to show that:
E[Y] = Σ_{m=0}^{N} m · Bin(m|N, θ) = Nθ
V[Y] = Σ_{m=0}^{N} (m − E[Y])^2 · Bin(m|N, θ) = Nθ(1 − θ)
This follows from linearity of the mean and, since the trials are independent, additivity of the variance:
E[Y] = E[Σ_{i=1}^{N} X_i] = Σ_{i=1}^{N} E[X_i] = Nθ
V[Y] = V[Σ_{i=1}^{N} X_i] = Σ_{i=1}^{N} V[X_i] = Nθ(1 − θ)

The Binomial Distribution: Example
Ashton is an excellent off spinner. The probability of him getting a wicket during a cricket match is 1/4. (That is, on each attempt, there is a 1/4 chance he will get a wicket.)
His coach commands him to make 10 attempts at wickets in a particular game.
1 What is the probability that he will get exactly three wickets? Bin(3|10, 0.25)
2 What is the expected number of wickets he will get? E[Y], where Y ∼ Bin(·|10, 0.25).
3 What is the probability that he will get at least one wicket?
Σ_{m=1}^{10} Bin(m|N = 10, θ = 0.25) = 1 − Bin(m = 0|N = 10, θ = 0.25)
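If SciPy is available, the three quantities can be checked numerically from the binomial pmf; this is only a numeric confirmation of the expressions above:

```python
from scipy.stats import binom

N, theta = 10, 0.25

# 1. P(exactly three wickets) = Bin(3 | 10, 0.25)
print(binom.pmf(3, N, theta))        # ~0.250

# 2. Expected number of wickets: E[Y] = N * theta
print(binom.mean(N, theta))          # 2.5

# 3. P(at least one wicket) = 1 - Bin(0 | 10, 0.25)
print(1 - binom.pmf(0, N, theta))    # ~0.944
```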

The Binomial Distribution: Example: Distribution of the Number of Wickets
Histogram of the binomial distribution with N = 10 and θ = 0.25. From Bishop (PRML, 2006)

1 The Bernoulli Distribution
2 The Binomial Distribution
3 Parameter Estimation
4 Bayesian Parameter Estimation
5 Wrapping up

The Bernoulli Distribution: Parameter Estimation
Consider the set of observations D = {x_1, …, x_N} with x_i ∈ {0, 1}:
- The outcomes of a sequence of coin flips
- Whether or not there are errors in a transmitted bit string
Each observation is the outcome of a random variable X, with distribution
p(X = x) = Bern(x|θ) = θ^x (1 − θ)^(1−x)
for some parameter θ

The Bernoulli Distribution: Parameter Estimation
We know that
X ∼ Bern(x|θ) = θ^x (1 − θ)^(1−x)
But often, we don't know what the value of θ is:
- The probability of a coin toss resulting in heads
- The probability of the word defence appearing in a document about sports
What would be a reasonable estimate for θ from D?

The Bernoulli Distribution: Parameter Estimation: Maximum Likelihood
Say that we observe
D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0}
Intuitively, which seems more plausible: θ = 1/2? θ = 1/5?

The Bernoulli Distribution: Parameter Estimation: Maximum Likelihood
Say that we observe
D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0}
If it were true that θ = 1/2, then the probability of this sequence would be
p(D|θ) = ∏_{i=1}^{10} p(x_i|θ) = (1/2)^10 ≈ 0.001.

The Bernoulli Distribution: Parameter Estimation: Maximum Likelihood
Say that we observe
D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0}
If it were true that θ = 1/5, then the probability of this sequence would be
p(D|θ) = ∏_{i=1}^{10} p(x_i|θ) = (1/5)^2 · (4/5)^8 ≈ 0.007.

The Bernoulli Distribution: Parameter Estimation: Maximum Likelihood
We can write down how likely D is under the Bernoulli model. Assuming independent observations:
p(D|θ) = ∏_{i=1}^{N} p(x_i|θ) = ∏_{i=1}^{N} θ^{x_i} (1 − θ)^{1−x_i}
We call L(θ) = p(D|θ) the likelihood function
Maximum likelihood principle: we want to maximise this function with respect to θ
The parameter for which the observed sequence has the highest probability

The Bernoulli Distribution: Parameter Estimation: Maximum Likelihood
Maximising p(D|θ) is equivalent to maximising L(θ) = log p(D|θ):
L(θ) = log p(D|θ) = Σ_{i=1}^{N} log p(x_i|θ) = Σ_{i=1}^{N} [x_i log θ + (1 − x_i) log(1 − θ)]
Setting dL/dθ = 0 we obtain:
θ_ML = (1/N) Σ_{i=1}^{N} x_i
The proportion of times x = 1 in the dataset D!
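A quick numeric check of this result: evaluating the log-likelihood on a grid of θ values for the sequence D = {0, 0, 0, 1, 0, 0, 1, 0, 0, 0} confirms that it peaks at the sample proportion m/N = 0.2.

```python
import numpy as np

D = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
m, N = D.sum(), len(D)

# Log-likelihood L(theta) = m log(theta) + (N - m) log(1 - theta) on a grid
thetas = np.linspace(0.001, 0.999, 999)
log_lik = m * np.log(thetas) + (N - m) * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])  # ~0.2: grid maximiser
print(m / N)                       # closed-form ML estimate: 0.2
```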

The Bernoulli Distribution: Parameter Estimation — Issues with Maximum Likelihood
Consider the following scenarios:
After N = 3 coin flips we obtained 3 'tails'
- What is the estimate of the probability of a coin flip resulting in 'heads'?
In a small set of documents about sports, the word defence never appeared.
- What are the consequences when predicting whether a document is about sports (using Bayes' rule)?
These issues are usually referred to as overfitting
We need to "smooth" out our parameter estimates
Alternatively, we can do Bayesian inference by considering priors over the parameters

1 The Bernoulli Distribution
2 The Binomial Distribution
3 Parameter Estimation
4 Bayesian Parameter Estimation
5 Wrapping up

The Bernoulli Distribution: Parameter Estimation: Bayesian Inference
p(θ|X) = p(X|θ) p(θ) / p(X)
(likelihood: p(X|θ); prior: p(θ); posterior: p(θ|X))
If we treat θ as a random variable, we may have some prior belief p(θ) about its value
e.g. we believe θ is probably close to 0.5
Our prior on θ quantifies what we believe θ is likely to be, before looking at the data
Our posterior on θ quantifies what we believe θ is likely to be, after looking at the data

The Bernoulli Distribution: Parameter Estimation: Bayesian Inference
The likelihood of X given θ is
Bern(x|θ) = θ^x (1 − θ)^(1−x)
For the prior, it is mathematically convenient to express it as a Beta distribution:
Beta(θ|a, b) = (1 / Z(a, b)) θ^(a−1) (1 − θ)^(b−1),
where Z(a, b) is a suitable normaliser
We can tune a, b to reflect our belief in the range of likely values of θ

Beta Prior Examples
Plots of the Beta(θ|a, b) prior density over θ ∈ [0, 1] for several settings of (a, b), including a = 0.1, b = 0.1.
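Curves like these can be reproduced by evaluating the Beta density directly. A sketch using SciPy; apart from (0.1, 0.1), the (a, b) settings are illustrative guesses rather than the exact panels of the figure:

```python
import numpy as np
from scipy.stats import beta

thetas = np.linspace(0.01, 0.99, 99)
for a, b in [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]:
    density = beta.pdf(thetas, a, b)   # Beta(theta | a, b) evaluated on a grid
    print(a, b, density.max())         # rough summary of the prior's shape
```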

Beta Prior and Binomial Likelihood: Beta Posterior Distribution
Recall that for D = {x_1, …, x_N}, the likelihood under a Bernoulli model is:
p(D|θ) = θ^m (1 − θ)^l,
where m = #(x = 1) and l = N − m = #(x = 0).
For the prior p(θ|a, b) = Beta(θ|a, b) we can obtain the posterior:
p(θ|D, a, b) = p(D|θ) p(θ|a, b) / p(D|a, b)
             = p(D|θ) p(θ|a, b) / ∫_0^1 p(D|θ') p(θ'|a, b) dθ'
             = Beta(θ|m + a, l + b).
Can use this as our new prior if we see more data!
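In code, the conjugate update is just bookkeeping on the counts of ones and zeros. A minimal sketch; the prior Beta(2, 2) is an illustrative choice encoding a mild belief that θ is near 0.5:

```python
import numpy as np

def beta_posterior(data, a, b):
    """Posterior Beta parameters after observing binary data under a Beta(a, b) prior."""
    m = int(np.sum(data))        # number of ones
    l = len(data) - m            # number of zeros
    return a + m, b + l          # Beta(theta | m + a, l + b)

D = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
a_post, b_post = beta_posterior(D, a=2, b=2)   # illustrative prior Beta(2, 2)
print(a_post, b_post)                          # posterior Beta(4, 10)
```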

Beta Prior and Binomial Likelihood: Beta Posterior Distribution
Now suppose we choose θ_MAP to maximise p(θ|D) (MAP = Maximum A Posteriori)
One can show that
θ_MAP = (m + a − 1) / (N + a + b − 2)
cf. the estimate that did not use any prior, θ_ML = m/N
The prior parameters a and b can be seen as adding some "fake" trials!
What values of a and b ensure θ_MAP = θ_ML? a = b = 1. Make sense?
(Note that the choice of the Beta distribution was not accidental here — it is the "conjugate prior" for the binomial distribution.)
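A short comparison of the two point estimates under these formulas, using the running example of m = 2 ones in N = 10 observations (the prior values a = b = 2 are illustrative):

```python
def theta_ml(m, N):
    # Maximum likelihood estimate: sample proportion
    return m / N

def theta_map(m, N, a, b):
    # MAP estimate under a Beta(a, b) prior
    return (m + a - 1) / (N + a + b - 2)

m, N = 2, 10                        # two ones in ten observations, as in the running example
print(theta_ml(m, N))               # 0.2
print(theta_map(m, N, a=2, b=2))    # 0.25: pulled towards the prior mean 0.5
print(theta_map(m, N, a=1, b=1))    # 0.2: a uniform prior recovers the ML estimate
```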

1 The Bernoulli Distribution
2 The Binomial Distribution
3 Parameter Estimation
4 Bayesian Parameter Estimation
5 Wrapping up

Distributions involving binary random variables
- Bernoulli distribution
- Binomial distribution
Bayesian inference: full posterior on the parameters
- Beta prior and binomial likelihood → Beta posterior
Reading: MacKay §23.1 and §23.5; Bishop §2.1 and §2.2
