COMP9418: Advanced Topics in Statistical Machine Learning
Gaussian Models
Instructor: University of New South Wales
Introduction
§ This lecture discusses Graphical Models with continuous variables
§ We will focus on Gaussian distributions and formalise a Gaussian Bayesian network
§ Our findings can be adapted to other models such as Markov networks
§ We will see that our existing knowledge about probabilities applies to continuous variables
§ Independence, conditional independence
§ Bayes conditioning, product rule, chain rule, case analysis, Bayes rule, etc.
§ We will develop a representation for Gaussian Factors
§ Including operations such as join, marginalisation and reduction (observation of evidence)
§ We will use these operations to illustrate how inference works
Introduction
§ Let’s now see how we can incorporate continuous variables in our models
§ Some variables are best modelled in the continuous space, such as temperature, humidity, position and velocity.
§ We cannot use tables anymore, unless we discretise the variables
§ Discretisation is a common approach
§ We can approximate a variable's distribution by its histogram
§ But it is hardly the answer for all models
§ Imagine the problem of robot navigation
§ A large environment and a resolution of 15 x 15 cm would lead to millions of values
§ Such large CPTs would make inference too expensive
§ Besides, we lose the notion of distance between values
Introduction
§ First, everything we know about probabilities holds for continuous distributions
§ Bayes conditioning and product rule
§ Chain rule
§ Case analysis and marginalisation
§ Bayes rule
§ However, our operations over tables will not work for continuous variables
§ We need to represent the distribution using a probability density function (PDF)
§ A common PDF for continuous variables is the Gaussian distribution
P(A|B) = P(A, B) / P(B)
P(A, B) = P(A|B) P(B)
P(A₁, …, Aₙ) = ∏ᵢ P(Aᵢ | Aᵢ₋₁, …, A₁)
P(A) = ∫_B P(A, B) dB
P(A|B) = P(B|A) P(A) / P(B)
Gaussian Bayesian Networks
§ Gaussian Bayesian networks
§ All variables are continuous and modelled by Gaussian
§ Gaussians are often good approximation for many real- world distributions
§ Modelling decisions
§ Root nodes use univariate distributions
§ We need to represent the CPD P(X|𝑼), where 𝑼 are the parents of X
§ A common solution is a linear Gaussian model
[Figure: example network Outside Temp (O) → Energy (E) ← Inside Temp (I), with I → Sensor (S); E ~ 𝒩(β₀ + β₁I + β₂O; σ_E²) and S ~ 𝒩(I; σ_S²)]
Linear Gaussian Model
§ Let X be a continuous variable with continuous parents U₁, …, U_k
§ X has a linear Gaussian model with parameters β₀, …, β_k and σ² iff
P(X | u₁, …, u_k) = 𝒩(β₀ + β₁u₁ + ⋯ + β_k u_k; σ²)
§ Similarly, in vector notation
P(X | 𝒖) = 𝒩(β₀ + 𝜷ᵀ𝒖; σ²)
§ Yet, we can understand X as a linear function of U₁, …, U_k with Gaussian noise ε of mean 0 and variance σ²
X = β₀ + β₁u₁ + ⋯ + β_k u_k + ε
§ The linear model assumes the variance does not depend on the parents 𝑼
§ We can easily extend it to have the mean and variance of X depend on the parents
§ However, the linear Gaussian model is a useful approximation in many practical problems
§ It also provides an alternative representation for multivariate Gaussian distributions
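To make this concrete, here is a minimal sketch of sampling from a linear Gaussian CPD; the coefficient values and parent values are made-up illustrations, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example coefficients for X given parents U1, U2 (hypothetical values)
beta0, beta = -1.0, np.array([0.5, 2.0])
sigma2 = 0.25

def sample_linear_gaussian(u, size=1):
    """Draw samples of X ~ N(beta0 + beta^T u, sigma2) for a fixed parent value u."""
    mean = beta0 + beta @ u
    return rng.normal(mean, np.sqrt(sigma2), size=size)

# Equivalently, X = beta0 + beta^T u + eps with eps ~ N(0, sigma2)
u = np.array([1.0, -0.5])
print(sample_linear_gaussian(u, size=5))
```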
Gaussian Distribution: 1 Dimension
§ The univariate Gaussian distribution has two parameters
§ Mean μ and
§ Variance σ² or standard deviation σ
§ Learning is estimating the parameters μ and σ from data
§ μ = 𝔼[X]
§ σ = √𝔼[(X − μ)²]
§ Once we have learned the parameters, we can sample from the distribution
§ X ~ 𝒩(μ; σ²)
p(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
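A minimal sketch of this learn-then-sample workflow, using synthetic (assumed) data:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(23.0, 5.0, size=1000)   # synthetic temperature readings (assumed data)

# Learning: estimate the parameters from data
mu = data.mean()                          # mu = E[X]
sigma = data.std()                        # sigma = sqrt(E[(X - mu)^2])

# Sampling: once the parameters are learned, draw from N(mu, sigma^2)
samples = rng.normal(mu, sigma, size=5)
print(mu, sigma, samples)
```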
Gaussian Distribution: 2 Dimensions
§ Let's suppose we have two independent variables X₁ and X₂, so p(x₁, x₂) = p(x₁) p(x₂)
p(x₁, x₂) = (1 / (σ₁√(2π))) e^(−(x₁ − μ₁)² / (2σ₁²)) ⋅ (1 / (σ₂√(2π))) e^(−(x₂ − μ₂)² / (2σ₂²))
          = (1 / (2πσ₁σ₂)) e^(−½ [(x₁ − μ₁)²/σ₁² + (x₂ − μ₂)²/σ₂²])
          = (1 / (2πσ₁σ₂)) e^(−½ (𝒙 − 𝝁)ᵀ Σ⁻¹ (𝒙 − 𝝁))
with Σ = [[σ₁², 0], [0, σ₂²]]
Gaussian Distribution: 𝑛 Dimensions
§ The multivariate Gaussian distribution is characterised by
§ an 𝑛-dimensional mean vector 𝝁 and
§ a symmetric 𝑛 × 𝑛 covariance matrix Σ
§ Quadratic number of parameters
§ Covariance matrix for 2 dimensions
§ cov(X₁, X₂) = 𝔼[(x₁ − μ₁)(x₂ − μ₂)]
§ [X₁, X₂]ᵀ ~ 𝒩([μ₁, μ₂]ᵀ, [[σ₁², cov(X₁, X₂)], [cov(X₂, X₁), σ₂²]])
§ The covariance matrix is symmetric since cov(X₁, X₂) = cov(X₂, X₁)
p(𝒙) = (1 / ((2π)^(n/2) |Σ|^(1/2))) e^(−½ (𝒙 − 𝝁)ᵀ Σ⁻¹ (𝒙 − 𝝁))
Gaussian Distribution: Example
§ Consider a joint distribution p(X₁, X₂, X₃) over three variables
§ Σ = [[4, 2, −2], [2, 5, −5], [−2, −5, 8]]
§ We can observe that
§ X₁ is positively correlated with X₂
§ X₁ is negatively correlated with X₃
§ X₂ is negatively correlated with X₃
Gaussian Distribution: Independencies
§ We can identify independence assumptions directly from the Gaussian distribution parameters
§ Xᵢ and Xⱼ are independent if and only if Σᵢ,ⱼ = 0
§ Let J = Σ⁻¹ be the information matrix
§ Xᵢ ⊥ Xⱼ | 𝑿 − {Xᵢ, Xⱼ} iff Jᵢ,ⱼ = 0
§ Example
§ Σ = [[4, 2, −2], [2, 5, −5], [−2, −5, 8]]
§ J = Σ⁻¹ = [[.3125, −.125, 0], [−.125, .5833, .3333], [0, .3333, .3333]]
§ X₁ ⊥ X₃ | X₂
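The example can be checked numerically; the sketch below inverts the lecture's Σ and verifies that J₁,₃ = 0:

```python
import numpy as np

Sigma = np.array([[ 4.,  2., -2.],
                  [ 2.,  5., -5.],
                  [-2., -5.,  8.]])

J = np.linalg.inv(Sigma)          # information (precision) matrix
print(np.round(J, 4))

# J[0, 2] is (numerically) zero, so X1 is independent of X3 given X2,
# while Sigma[0, 2] = -2 != 0, so X1 and X3 are not marginally independent.
print(np.isclose(J[0, 2], 0.0))
```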
Gaussian Bayesian Networks
§ In a Gaussian Bayesian network
§ All variables are continuous
§ All CPDs are linear Gaussian models
§ Gaussian Bayesian networks are simple to understand
§ If we compare to multivariate Gaussian distributions
§ Yet, we can transform one representation into another
[Figure: example network Outside Temp (O) → Energy (E) ← Inside Temp (I), with I → Sensor (S); E ~ 𝒩(β₀ + β₁I + β₂O; σ_E²) and S ~ 𝒩(I; σ_S²)]
GBN and Multivariate Gaussian 1
§ A linear Gaussian network defines a joint multivariate Gaussian distribution
§ Y is a linear Gaussian with parents X₁, …, X_k
§ P(Y | 𝒙) = 𝒩(β₀ + 𝜷ᵀ𝒙; σ²)
§ X₁, …, X_k are jointly Gaussian with 𝒩(𝝁; Σ)
§ Then, the distribution of Y is normal, p(Y) = 𝒩(μ_Y; σ_Y²), where
§ μ_Y = β₀ + 𝜷ᵀ𝝁
§ σ_Y² = σ² + 𝜷ᵀΣ𝜷
§ The joint distribution over {𝑿, Y} is normal with
§ Cov(Xᵢ; Y) = Σⱼ βⱼ Σᵢ,ⱼ
[Figure: example chain X₁ → X₂ → X₃ with X₂ ~ 𝒩(0.5X₁ − 3.5; 4) and X₃ ~ 𝒩(−X₂ + 1; 3)]
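A small numerical sketch of these formulas; the parent mean, covariance and CPD coefficients are assumed example values, not from the lecture:

```python
import numpy as np

# Parent joint distribution X ~ N(mu, Sigma) (assumed example numbers)
mu = np.array([1.0, 2.0])
Sigma = np.array([[4.0, 1.0],
                  [1.0, 3.0]])

# Linear Gaussian CPD: Y | x ~ N(beta0 + beta^T x, sigma2)
beta0, beta, sigma2 = -3.5, np.array([0.5, 0.0]), 4.0

mu_Y = beta0 + beta @ mu                  # mean of Y
var_Y = sigma2 + beta @ Sigma @ beta      # variance of Y
cov_XY = Sigma @ beta                     # Cov(X_i, Y) = sum_j beta_j * Sigma[i, j]

# Joint covariance over (X1, X2, Y)
joint_Sigma = np.block([[Sigma, cov_XY[:, None]],
                        [cov_XY[None, :], np.array([[var_Y]])]])
print(mu_Y, var_Y)
print(joint_Sigma)
```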
GBN and Multivariate Gaussian 2
§ A joint multivariate Gaussian distribution defines a linear Gaussian network
§ Given a set of variables {𝑿, Y} in the form of a joint normal distribution, we can write
§ p(Y | 𝑿) = 𝒩(β₀ + 𝜷ᵀ𝑿; σ²), where
§ β₀ = μ_Y − Σ_{Y𝑿} Σ_{𝑿𝑿}⁻¹ 𝝁_𝑿
§ 𝜷 = Σ_{𝑿𝑿}⁻¹ Σ_{𝑿Y}
§ σ² = Σ_{YY} − Σ_{Y𝑿} Σ_{𝑿𝑿}⁻¹ Σ_{𝑿Y}
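A sketch of this reverse direction, recovering (β₀, 𝜷, σ²) from a joint Gaussian; the covariance is the lecture's example Σ, while the mean vector is an assumed illustration:

```python
import numpy as np

# Joint normal over (X1, X2, Y): covariance from the lecture's example, mean assumed
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[ 4.,  2., -2.],
                  [ 2.,  5., -5.],
                  [-2., -5.,  8.]])

X, Y = [0, 1], 2                                      # indices of X and Y
S_XX = Sigma[np.ix_(X, X)]
S_XY = Sigma[np.ix_(X, [Y])]
S_YX = S_XY.T

beta = np.linalg.solve(S_XX, S_XY).ravel()            # beta = Sigma_XX^{-1} Sigma_XY
beta0 = mu[Y] - S_YX @ np.linalg.solve(S_XX, mu[X])   # beta0 = mu_Y - Sigma_YX Sigma_XX^{-1} mu_X
sigma2 = Sigma[Y, Y] - S_YX @ np.linalg.solve(S_XX, S_XY)

print(beta0.item(), beta, sigma2.item())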
Gaussian Bayesian Networks
§ Given our knowledge about inference, we need the following operations
§ Multiply factors
§ Marginalise out variables (using integration)
§ Multiplication
§ We do not have a universal representation, as we have for discrete distributions
§ Multiplication of factors of different families is difficult
§ Multiplication of factors in the same family may lead to results in a different family
§ Marginalisation
§ Not all functions are integrable
§ If they are, not all have a closed-form integral
Canonical Form
§ We need to adopt a representation that allows us to perform inference operations in closed form
§ A simple option is the canonical form
§ Factor product, reduction and marginalisation in closed form
§ We can define a data structure that stores factors in the canonical form
§ The canonical form can represent multidimensional Gaussian distributions and linear Gaussian CPDs
§ Adapt inference algorithms, such as VE, to operate over this new factor
Canonical Form
§ The canonical form 𝒞(𝑿; K, 𝒉, g) is defined as
𝒞(𝑿; K, 𝒉, g) = exp(−½ 𝑿ᵀK𝑿 + 𝒉ᵀ𝑿 + g)
§ Thus, 𝒩(𝝁; Σ) = 𝒞(K, 𝒉, g), where
§ K = Σ⁻¹
§ 𝒉 = Σ⁻¹𝝁
§ g = −½ 𝝁ᵀΣ⁻¹𝝁 − log((2π)^(n/2) |Σ|^(1/2))
since
p(𝒙) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(−½ (𝒙 − 𝝁)ᵀΣ⁻¹(𝒙 − 𝝁))
     = exp(−½ 𝒙ᵀΣ⁻¹𝒙 + 𝝁ᵀΣ⁻¹𝒙 − ½ 𝝁ᵀΣ⁻¹𝝁 − log((2π)^(n/2) |Σ|^(1/2)))
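A minimal sketch of the moment-to-canonical conversion, checked against scipy's density; the test numbers are assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def to_canonical(mu, Sigma):
    """Convert N(mu, Sigma) to canonical parameters (K, h, g)."""
    K = np.linalg.inv(Sigma)
    h = K @ mu
    n = len(mu)
    g = -0.5 * mu @ K @ mu - 0.5 * np.log((2 * np.pi) ** n * np.linalg.det(Sigma))
    return K, h, g

def canonical_density(x, K, h, g):
    """Evaluate C(x; K, h, g) = exp(-1/2 x^T K x + h^T x + g)."""
    return np.exp(-0.5 * x @ K @ x + h @ x + g)

# Sanity check: canonical form reproduces the multivariate normal density
mu = np.array([1.0, -2.0]); Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
K, h, g = to_canonical(mu, Sigma)
x = np.array([0.5, 0.0])
print(canonical_density(x, K, h, g), multivariate_normal(mu, Sigma).pdf(x))
```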
Canonical Form: Join
§ The product of two canonical form factors over scope 𝑿 is
𝒞(K₁, 𝒉₁, g₁) ⋅ 𝒞(K₂, 𝒉₂, g₂) = 𝒞(K₁ + K₂, 𝒉₁ + 𝒉₂, g₁ + g₂)
§ If the factors have different scopes, we extend the scopes to make them match
§ The extension of scope is simply adding zero entries to K and 𝒉
§ Let us compute φ₁(A, B) ⋅ φ₂(B, C)
§ φ₁(A, B) = 𝒞(A, B; [[1, −1], [−1, 1]], [1, −1], −3)
§ φ₂(B, C) = 𝒞(B, C; [[3, −2], [−2, 4]], [5, −1], 1)
§ φ₁(A, B) ⋅ φ₂(B, C) = 𝒞(A, B, C; [[1, −1, 0], [−1, 4, −2], [0, −2, 4]], [1, 4, −1], −2)
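A sketch of the join operation with zero-padding scope extension, reproducing the slide's example; the function names are illustrative, not part of the lecture:

```python
import numpy as np

def extend(K, h, scope, full_scope):
    """Embed (K, h) defined on `scope` into the larger `full_scope` by zero-padding."""
    idx = [full_scope.index(v) for v in scope]
    K_full = np.zeros((len(full_scope), len(full_scope)))
    h_full = np.zeros(len(full_scope))
    K_full[np.ix_(idx, idx)] = K
    h_full[idx] = h
    return K_full, h_full

def multiply(scope1, K1, h1, g1, scope2, K2, h2, g2):
    """Product of two canonical factors: parameters simply add on the joint scope."""
    scope = scope1 + [v for v in scope2 if v not in scope1]
    K1e, h1e = extend(K1, h1, scope1, scope)
    K2e, h2e = extend(K2, h2, scope2, scope)
    return scope, K1e + K2e, h1e + h2e, g1 + g2

# Example from the slide: phi1(A, B) * phi2(B, C)
K1, h1, g1 = np.array([[1., -1.], [-1., 1.]]), np.array([1., -1.]), -3.
K2, h2, g2 = np.array([[3., -2.], [-2., 4.]]), np.array([5., -1.]), 1.
scope, K, h, g = multiply(['A', 'B'], K1, h1, g1, ['B', 'C'], K2, h2, g2)
print(scope, K, h, g, sep="\n")
```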
Canonical Form: Marginalisation
§ The marginalisation of 𝒀 from a canonical form 𝒞(𝑿, 𝒀; K, 𝒉, g) over scope {𝑿, 𝒀} is ∫𝒞(𝑿, 𝒀; K, 𝒉, g) d𝒀 = 𝒞(𝑿; K′, 𝒉′, g′), where
§ K = [[K_𝑿𝑿, K_𝑿𝒀], [K_𝒀𝑿, K_𝒀𝒀]]
§ K′ = K_𝑿𝑿 − K_𝑿𝒀 K_𝒀𝒀⁻¹ K_𝒀𝑿
§ 𝒉′ = 𝒉_𝑿 − K_𝑿𝒀 K_𝒀𝒀⁻¹ 𝒉_𝒀
§ g′ = g + ½ (log|2π K_𝒀𝒀⁻¹| + 𝒉_𝒀ᵀ K_𝒀𝒀⁻¹ 𝒉_𝒀)
§ Let us compute ∫𝒞(A, B, C; K, 𝒉, g) dC for
§ 𝒞(A, B, C; [[1, −1, 0], [−1, 4, −2], [0, −2, 4]], [1, 4, −1], −2)
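A sketch of the marginalisation formulas applied to the factor above; the index-based bookkeeping is an implementation choice, not part of the lecture:

```python
import numpy as np

def marginalise(K, h, g, keep, out):
    """Integrate the variables indexed by `out` out of C(K, h, g), keeping `keep`."""
    Kxx, Kxy = K[np.ix_(keep, keep)], K[np.ix_(keep, out)]
    Kyx, Kyy = K[np.ix_(out, keep)], K[np.ix_(out, out)]
    hx, hy = h[keep], h[out]
    Kyy_inv = np.linalg.inv(Kyy)
    K_new = Kxx - Kxy @ Kyy_inv @ Kyx
    h_new = hx - Kxy @ Kyy_inv @ hy
    g_new = g + 0.5 * (np.log(np.linalg.det(2 * np.pi * Kyy_inv)) + hy @ Kyy_inv @ hy)
    return K_new, h_new, g_new

# Marginalise C out of the product factor over (A, B, C) from the join slide
K = np.array([[1., -1., 0.], [-1., 4., -2.], [0., -2., 4.]])
h = np.array([1., 4., -1.])
g = -2.
print(marginalise(K, h, g, keep=[0, 1], out=[2]))
```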
Canonical Form: Reduction
§ The reduction of a canonical form 𝒞(𝑿, 𝒀; K, 𝒉, g) by setting evidence 𝒀 = 𝒚 is 𝒞(𝑿; K′, 𝒉′, g′), where
§ K = [[K_𝑿𝑿, K_𝑿𝒀], [K_𝒀𝑿, K_𝒀𝒀]]
§ K′ = K_𝑿𝑿
§ 𝒉′ = 𝒉_𝑿 − K_𝑿𝒀 𝒚
§ g′ = g + 𝒉_𝒀ᵀ𝒚 − ½ 𝒚ᵀK_𝒀𝒀𝒚
§ Let us set C = 2 for 𝒞(A, B, C; K, 𝒉, g)
§ 𝒞(A, B, C; [[1, −1, 0], [−1, 4, −2], [0, −2, 4]], [1, 4, −1], −2)
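A sketch of the reduction operation, setting C = 2 in the same factor; again the indexing scheme is just an implementation choice:

```python
import numpy as np

def reduce_evidence(K, h, g, keep, out, y):
    """Condition C(K, h, g) on evidence Y = y, where Y are the variables indexed by `out`."""
    Kxx, Kxy, Kyy = K[np.ix_(keep, keep)], K[np.ix_(keep, out)], K[np.ix_(out, out)]
    hx, hy = h[keep], h[out]
    K_new = Kxx
    h_new = hx - Kxy @ y
    g_new = g + hy @ y - 0.5 * y @ Kyy @ y
    return K_new, h_new, g_new

# Set C = 2 in the factor over (A, B, C) from the join example
K = np.array([[1., -1., 0.], [-1., 4., -2.], [0., -2., 4.]])
h = np.array([1., 4., -1.])
g = -2.
print(reduce_evidence(K, h, g, keep=[0, 1], out=[2], y=np.array([2.])))
```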
Canonical Form: Linear Model
§ The linear Gaussian model Y ~ 𝒩(β₀ + 𝜷ᵀ𝑿; σ²) corresponds to the canonical form 𝒞(Y, 𝑿; K_{Y|𝑿}, 𝒉_{Y|𝑿}, g_{Y|𝑿}) with
§ K_{Y|𝑿} = (1/σ²) [[1, −𝜷ᵀ], [−𝜷, 𝜷𝜷ᵀ]]
§ 𝒉_{Y|𝑿} = (β₀/σ²) [1, −𝜷]ᵀ
§ g_{Y|𝑿} = −β₀²/(2σ²) − ½ log(2πσ²)
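A sketch that builds the canonical parameters of a linear Gaussian CPD; the β and σ² values in the example call are assumptions for illustration:

```python
import numpy as np

def linear_gaussian_to_canonical(beta0, beta, sigma2):
    """Canonical form of Y | X ~ N(beta0 + beta^T X, sigma2) over scope (Y, X1, ..., Xk)."""
    beta = np.atleast_1d(beta)
    # K = (1/sigma2) [[1, -beta^T], [-beta, beta beta^T]]
    top = np.concatenate(([1.0], -beta))
    K = np.outer(top, top) / sigma2          # outer product yields exactly that block structure
    h = (beta0 / sigma2) * np.concatenate(([1.0], -beta))
    g = -beta0 ** 2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2)
    return K, h, g

# Example: E ~ N(beta0 + beta1*I + beta2*O, sigma2) with assumed numbers
K, h, g = linear_gaussian_to_canonical(beta0=1.0, beta=[0.8, 0.2], sigma2=0.5)
print(K, h, g, sep="\n")
```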
Variable Elimination and Gaussian Models
Input: Bayesian network 𝑁, query variables 𝑸, variable ordering 𝜋, evidence 𝒆
Output: joint marginal 𝑃(𝑸, 𝒆)
1: 𝑺 ← {f^𝒆 : f is a CPD of network N}
2: for i = 1 to length of order π do
3:     f ← ∏ⱼ fⱼ where fⱼ belongs to 𝑺 and mentions variable π(i)
4:     f_i ← ∑_{π(i)} f (an integral for continuous variables)
5:     replace all factors fⱼ in 𝑺 by factor f_i
6: return ∏_{f ∈ 𝑺} f
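Below is a self-contained sketch combining the canonical-form operations into the elimination loop above; the class and method names are illustrative, and evidence reduction (line 1) is assumed to have been applied already.

```python
import numpy as np

class CanonicalFactor:
    """Gaussian factor in canonical form C(scope; K, h, g)."""
    def __init__(self, scope, K, h, g):
        self.scope = list(scope)
        self.K = np.array(K, dtype=float)
        self.h = np.array(h, dtype=float)
        self.g = float(g)

    def multiply(self, other):
        # Join: extend both factors to the union scope and add their parameters
        scope = self.scope + [v for v in other.scope if v not in self.scope]
        K = np.zeros((len(scope), len(scope)))
        h = np.zeros(len(scope))
        for f in (self, other):
            idx = [scope.index(v) for v in f.scope]
            K[np.ix_(idx, idx)] += f.K
            h[idx] += f.h
        return CanonicalFactor(scope, K, h, self.g + other.g)

    def marginalise(self, var):
        # Integrate `var` out using the block formulas from the marginalisation slide
        out = [self.scope.index(var)]
        keep = [i for i in range(len(self.scope)) if i not in out]
        Kxx, Kxy = self.K[np.ix_(keep, keep)], self.K[np.ix_(keep, out)]
        Kyx, Kyy = self.K[np.ix_(out, keep)], self.K[np.ix_(out, out)]
        hx, hy = self.h[keep], self.h[out]
        Kyy_inv = np.linalg.inv(Kyy)
        g = self.g + 0.5 * (np.log(np.linalg.det(2 * np.pi * Kyy_inv)) + hy @ Kyy_inv @ hy)
        return CanonicalFactor([self.scope[i] for i in keep],
                               Kxx - Kxy @ Kyy_inv @ Kyx,
                               hx - Kxy @ Kyy_inv @ hy, g)

def variable_elimination(factors, order):
    """Lines 2-6 of the algorithm; evidence (line 1) is assumed already absorbed."""
    for var in order:
        mentioning = [f for f in factors if var in f.scope]
        factors = [f for f in factors if var not in f.scope]
        product = mentioning[0]
        for f in mentioning[1:]:
            product = product.multiply(f)          # canonical-form join
        factors.append(product.marginalise(var))   # integrate the variable out
    result = factors[0]
    for f in factors[1:]:
        result = result.multiply(f)
    return result

# Example: eliminate B from phi1(A, B) * phi2(B, C) (the factors of the join slide)
phi1 = CanonicalFactor(['A', 'B'], [[1, -1], [-1, 1]], [1, -1], -3)
phi2 = CanonicalFactor(['B', 'C'], [[3, -2], [-2, 4]], [5, -1], 1)
marginal = variable_elimination([phi1, phi2], order=['B'])
print(marginal.scope, marginal.K, marginal.h, marginal.g, sep="\n")
```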
Kalman Filter
§ A Kalman filter is a Hidden Markov Model with continuous variables
§ Root nodes are modelled with Gaussian distributions
§ Internal nodes are linear Gaussian models
§ Thus, a Kalman filter is a Gaussian Bayesian network
§ Let’s start with a Markov chain
§ Instead of tracking discrete states such as sun and rain
§ We will track a continuous variable such as temperature
[Figure: Markov chain over temperatures T₁ → T₂ → T₃ → T₄ → T₅ with example values 26°C, 24°C, 22°C, 23°C, 25°C]
§ A Kalman filter is a Hidden Markov Model with continuous variables
§ Root nodes are modelled with Gaussian distributions
§ Internal nodes are linear Gaussian models
[Figure: Markov chain T₁ → … → T₅ with μ_{T₁} = 23°C, σ_{T₁} = 5°C and μ_{T_{t+1}} = T_t, σ_{T_{t+1}} = 1°C]
§ A Kalman filter is a Hidden Markov Model with continuous variables
§ Root nodes are modelled with Gaussian distributions
§ Internal nodes are linear Gaussian models
§ Prediction: p(T_{t+1}) = ∫ p(T_{t+1} | T_t) p(T_t) dT_t
[Figure: without observations the predicted means stay at 23°C while the standard deviation grows at each step (…, 5.2°C, 5.3°C, 5.4°C); model: μ_{T₁} = 23°C, σ_{T₁} = 5°C, μ_{T_{t+1}} = T_t, σ_{T_{t+1}} = 1°C]
§ A Kalman filter is a Hidden Markov Model with continuous variables
§ Root nodes are modelled with Gaussian distributions
§ Internal nodes are linear Gaussian models
§ Prediction: p(T_{t+1}) = ∫ p(T_{t+1} | T_t) p(T_t) dT_t
§ Update: p(T_{t+1} | s_{t+1}) ∝ p(s_{t+1} | T_{t+1}) p(T_{t+1})
§ We must renormalise the results
[Figure: HMM with hidden temperatures T₁, …, T₅ and sensor readings S₁, …, S₅; model: μ_{T₁} = 23°C, σ_{T₁} = 5°C, μ_{S_t} = T_t, σ_{S_t} = .5°C]
§ Let’s simulate this algorithm with the following evidence:
§𝑒 = [20,20.5,22,21.5,21,23,22,20.5,21,22]
§ T₁ ~ 𝒩(23; 5² °C)
§ T_{t+1} ~ 𝒩(T_t; 1² °C)
§ S_{t+1} ~ 𝒩(T_{t+1}; .5² °C)
[Figure: two-slice model T_t → T_{t+1} with observation S_{t+1}]
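A sketch of the resulting forward (filtering) recursion on this evidence, assuming one sensor reading per time step and that S₁ is observed directly under the prior:

```python
import numpy as np

# Model from the slide: T1 ~ N(23, 5^2), T_{t+1} | T_t ~ N(T_t, 1^2), S_t | T_t ~ N(T_t, 0.5^2)
prior_mu, prior_var = 23.0, 5.0 ** 2
trans_var, obs_var = 1.0 ** 2, 0.5 ** 2
evidence = [20, 20.5, 22, 21.5, 21, 23, 22, 20.5, 21, 22]

mu, var = prior_mu, prior_var
for t, s in enumerate(evidence, start=1):
    if t > 1:                       # prediction: p(T_t) = ∫ p(T_t | T_{t-1}) p(T_{t-1}) dT_{t-1}
        var = var + trans_var       # mean is unchanged because mu_{T_{t+1}} = T_t
    # update: p(T_t | s_t) ∝ p(s_t | T_t) p(T_t), then renormalise
    gain = var / (var + obs_var)    # Kalman gain
    mu = mu + gain * (s - mu)
    var = (1 - gain) * var
    print(f"t={t}: filtered mean={mu:.2f}°C, std={np.sqrt(var):.2f}°C")
```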
§ The Kalman filter is the core algorithm of GPS systems
§ Example of data fusion algorithm
§ We can have additional observations such as the
phone accelerometer
§ Kalman filters are often applied to guidance and navigation systems
§ Initially applied to trajectory estimation for the Apollo program
§ Currently used in missiles and spacecraft navigation systems, including the International Space Station
[Figure: GPS position estimates at times 1, 2 and 3]
Conclusion
§ This lecture discussed Graphical Models with continuous variables
§ We used the normal distribution for root variables and the linear Gaussian model for CPDs
§ The canonical representation allows efficient implementation of operations
§ Generally, the use of different continuous distributions is very challenging
§ There are several possible extensions
§ Hybrid networks mix discrete and continuous variables
§ Nonlinear models such as the Extended and Unscented Kalman filters can provide better models when the linear Gaussian model is not appropriate
§ The material of this lecture is spread over multiple chapters of Koller & Friedman
§ Continuous variables 5.5, pgs. 185-190
§ Gaussian Networks 7.1 & 7.2, pgs. 247-253
§ Variable Elimination in Gaussian Networks, 14.1 & 14.2, pgs. 605-614