
COMP9418: Advanced Topics in Statistical Machine Learning
Bayesian Networks as Classifiers
Instructor: University of New South Wales


Introduction
§ This lecture discusses the use of Bayesian networks as classifiers
§ Variables can be divided into attributes and class
§ We want to predict the class based on the observed attribute values
§ Classification will help us to discuss several aspects of Bayesian networks
§ Such as independence, learning and inference
§ Some of these topics will be further discussed in forthcoming lectures
§ We will review a simple Bayesian network for classification
§ The Naïve Bayes classifier and some extensions

Bayesian Networks as Classifiers
§ Suppose we have a Bayesian network for breast cancer diagnosis
§ Our aim is to predict whether a patient has breast cancer given a series of mammography results
§ The network variables can be divided into two sets
§ Class (query variable)
§ Attributes (evidence)
§ Some relevant aspects
§ Type of query
§ Independence assumptions
§ Learning structure and parameters
[Figure: breast cancer diagnosis network with nodes Age, Location, Breast Density, Microcalcification, Breast Cancer, Skin Retract, Nipple Discharge, Fibrous Tissue, Architectural Distortion, Spiculation]

Bayesian Networks as Classifiers
§ Given a set of attribute values for a patient, our objective is to correctly identify the class value
§ In this example, the class values are “no”, “insitu”, “invasive”
§ Therefore, we have an MPE query with 𝑸 = {𝐵} and an evidence set with the values of the remaining attributes
§ Bayesian networks naturally handle missing data
§ If there is missing evidence in queries, we need to compute a MAP query
§ MAP queries are more costly than MPE since they involve eliminating unobserved variables
[Figure: breast cancer diagnosis network with nodes Age, Location, Breast Density, Microcalcification, Breast Cancer, Skin Retract, Nipple Discharge, Fibrous Tissue, Architectural Distortion, Size, Spiculation]

Classification of Complete Data
§ A relevant question is whether all variables contribute to the classification given complete data
§ Given the network independence assumptions
§ We can use the concept of Markov blanket
§ The Markov blanket of 𝑋 consists of its parents, children, and spouses
[Figure: breast cancer diagnosis network]

Classification of Complete Data
§ A relevant question is whether all variables contribute to the classification given complete data
§ Given the network independence assumptions
§ We can use the concept of Markov blanket
§ The Markov blanket of 𝑋 consists of its parents, children, and spouses
§ Not every variable contributes to the classification
§ Since some are d-separated from the class given complete evidence
§ Let us gain more intuition by looking at a simpler network (see the Markov blanket sketch below)
[Figure: breast cancer diagnosis network]
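A minimal sketch of how a Markov blanket can be computed from a DAG's parent lists may help make the definition concrete; the parent lists below encode the five-node example network (A–E) introduced on the next slides, and the function name and dictionary layout are illustrative choices.

# Markov blanket = parents ∪ children ∪ spouses (co-parents of the node's children).
parents = {
    'A': [], 'B': ['A'], 'C': ['A'],
    'D': ['B', 'C'], 'E': ['C'],
}

def markov_blanket(node, parents):
    children = [v for v, ps in parents.items() if node in ps]
    spouses = {p for child in children for p in parents[child] if p != node}
    return set(parents[node]) | set(children) | spouses

print(markov_blanket('B', parents))   # {'A', 'C', 'D'} (in some order)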

Classification of Complete Data
§ Classification of complete data is a very simple case of MPE inference
§ To illustrate, let us use this simpler example with 𝑸 = {𝐵}
§ We select the Markov blanket of 𝐵
§ Suppose we want to classify the instance 𝒆: 𝐴 = true, 𝐶 = false, 𝐷 = true, 𝐸 = false
§ We can use the chain rule of Bayesian networks to compute the classifications
§ 𝑃(𝐵, 𝒆) = 𝑃(𝑎) 𝑃(𝐵|𝑎) 𝑃(𝑑|𝐵, c̄) 𝑃(c̄|𝑎) 𝑃(ē|c̄)
[Figure: network Winter? (𝐴) → Sprinkler? (𝐵) and Rain? (𝐶); 𝐵, 𝐶 → Wet Grass? (𝐷); 𝐶 → Slippery Road? (𝐸), with CPTs:
Θ_A: P(a) = .6, P(ā) = .4
Θ_B|A: P(b|a) = .2, P(b̄|a) = .8, P(b|ā) = .75, P(b̄|ā) = .25
Θ_C|A: P(c|a) = .8, P(c̄|a) = .2, P(c|ā) = .1, P(c̄|ā) = .9
Θ_D|B,C: P(d|b, c) = .95, P(d|b, c̄) = .9, P(d|b̄, c) = .8, P(d|b̄, c̄) = .01
Θ_E|C: P(e|c) = .7, P(ē|c) = .3, P(e|c̄) = 0, P(ē|c̄) = 1]

Classification of Complete Data
§ Since variable 𝐵 has only two possible values
§ 𝑃(𝑏, 𝒆) = 𝑃(𝑎) 𝑃(𝑏|𝑎) 𝑃(𝑑|𝑏, c̄) 𝑃(c̄|𝑎) 𝑃(ē|c̄) = (.6)(.2)(.9)(.2)(1) = .0216
[Figure: the same network and CPTs as on the previous slide]

Classification of Complete Data
§ Since variable 𝐵 has only two possible values
§ 𝑃(𝑏, 𝒆) = 𝑃(𝑎) 𝑃(𝑏|𝑎) 𝑃(𝑑|𝑏, c̄) 𝑃(c̄|𝑎) 𝑃(ē|c̄) = .0216
§ 𝑃(b̄, 𝒆) = 𝑃(𝑎) 𝑃(b̄|𝑎) 𝑃(𝑑|b̄, c̄) 𝑃(c̄|𝑎) 𝑃(ē|c̄) = .00096
§ Although all variables in the Markov blanket of 𝐵 influence 𝑃(𝐵, 𝒆)
§ Only the CPTs that include 𝐵 determine the final classification
§ We can visualise this using the network pruning technique
[Figure: the same network and CPTs as on the previous slide]

Classification of Complete Data
§ Consider the query 𝑃(𝐵, 𝒆)
§ 𝒆: 𝐴 = true, 𝐶 = false, 𝐷 = true, 𝐸 = false
§ The pruned network is shown on the right
§ If we remove the disconnected nodes, we end up with the compacted CPTs in nodes 𝐵 and 𝐷
[Figure: pruned network in which only the edge 𝐵 → 𝐷 remains, with compacted CPTs Θ_B|A: P(b|a) = .2, P(b̄|a) = .8 and Θ_D|B,C: P(d|b, c̄) = .9, P(d|b̄, c̄) = .01]

Classification of Complete Data
§ Consider the query 𝑃(𝐵, 𝒆)
§ 𝒆: 𝐴 = true, 𝐶 = false, 𝐷 = true, 𝐸 = false
§ The pruned network is shown on the right
§ If we remove the disconnected nodes, we end up with the compacted CPTs in nodes 𝐵 and 𝐷
§ For classification, we need to compute argmax_B 𝑃(𝐵|𝒆)
§ argmax_B 𝑃(𝐵|𝒆) = argmax_B 𝑃(𝐵, 𝒆)/𝑃(𝒆) = argmax_B 𝑃(𝐵, 𝒆)
§ 𝑃(𝑏, 𝒆) ∝ 𝑃(𝑏|𝑎) 𝑃(𝑑|𝑏, c̄) = (.2)(.9) = .18
§ 𝑃(b̄, 𝒆) ∝ 𝑃(b̄|𝑎) 𝑃(𝑑|b̄, c̄) = (.8)(.01) = .008
§ 𝑃(𝑏|𝒆) = .18 / (.18 + .008) ≈ .9574 (reproduced in the sketch below)
[Figure: pruned network over Sprinkler? (𝐵) and Wet Grass? (𝐷)]
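The numbers above can be reproduced with a short, self-contained sketch; the CPT dictionaries simply transcribe the example network, while the dictionary layout and function names are illustrative choices.

# Chain-rule computation of P(B, e) for e: a, ~c, d, ~e, followed by normalisation.
P_A    = {'a': .6, '~a': .4}
P_B_A  = {('b', 'a'): .2, ('~b', 'a'): .8, ('b', '~a'): .75, ('~b', '~a'): .25}
P_C_A  = {('c', 'a'): .8, ('~c', 'a'): .2, ('c', '~a'): .1, ('~c', '~a'): .9}
P_D_BC = {('d', 'b', 'c'): .95, ('d', 'b', '~c'): .9,
          ('d', '~b', 'c'): .8, ('d', '~b', '~c'): .01}
P_E_C  = {('e', 'c'): .7, ('e', '~c'): 0.0}

def joint(b):
    # P(b, a, ~c, d, ~e) via the chain rule of the (unpruned) network
    return (P_A['a'] * P_B_A[(b, 'a')] * P_C_A[('~c', 'a')]
            * P_D_BC[('d', b, '~c')] * (1 - P_E_C[('e', '~c')]))

pb, pnb = joint('b'), joint('~b')     # 0.0216 and 0.00096
print(pb, pnb, pb / (pb + pnb))       # posterior P(b | e) ≈ 0.9574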

Classification of Complete Data
§ Let us look back to the breast cancer network
§ Only a relatively small subset of the CPTs is used for classification with complete data
§ Other methods that use all attributes may make better use of the information
§ Remember our discussion is restricted to inference with complete cases
§ If some evidence is missing, then more nodes may take part in the inference
§ Our inference procedure is the same as VE_PR, but we can develop a specialised algorithm
§ VE_PR handles both complete cases and missing data
[Figure: breast cancer diagnosis network]

Naïve Bayes (NBC)
§ A different approach for classification is to use a fixed structure
§ In a Naïve Bayes classifier, each attribute has the class variable as its only parent
§ As the structure is fixed, the only task involved in learning is to estimate the parameters
§ NBC assumes the attributes are independent given the class
§ This is rarely the case, but NBCs are surprisingly accurate
§ In classification, we are often only interested in the class of maximal probability and not in the exact probability distribution
§ They are popular models in some areas such as text classification
[Figure: Naïve Bayes structure with class 𝐶 as the parent of attributes 𝐴₁, 𝐴₂, …, 𝐴ₙ₋₁, 𝐴ₙ]

Spam Filter
§ The task is to receive an email as input and output spam/ham
§ Possible attributes
§ Words, e.g., “medicine”, “million”, “dollars”
§ Patterns, such as $?\d+ for currency
§ Non-text: sender in contact list, list of spam servers
§ We need a “corpus”, i.e., a collection of emails
§ The documents must be labelled (a manual task)
§ We want to successfully label unseen emails
Dear colleagues,
It is with much excitement that I am writing to let you know that nominations are now open for the NSW International Student Awards 2020.
Now, contact my secretary in . Ask him to send you the total of $850,000.00 which I kept for your compensation for all the past efforts and attempts to assist me in this matter…
My name is Mrs. , I’m from Hiroshima City in Japan. Please get back to me urgently for full details, Let discuss about charity project in your location (e.g. Less privileged people, the Orphanage home)

Naïve Bayes Model
§ Using the chain rule for Bayesian networks, we get
§ 𝑃(𝐶, 𝐴₁, …, 𝐴ₙ) = 𝑃(𝐶) 𝑃(𝐴₁|𝐶) … 𝑃(𝐴ₙ|𝐶) = 𝑃(𝐶) ∏ᵢ 𝑃(𝐴ᵢ|𝐶)
§ A full joint distribution needs |𝐶||𝐴|ⁿ parameters, while the Naïve Bayes model needs |𝐶| + 𝑛|𝐶||𝐴| parameters
§ The number of parameters is linear in 𝑛 (see the counting sketch below)
§ If we use the rule of thumb of 10 instances per parameter
§ 𝑛 = 10 and binary variables would require 20,480 instances versus around 400
§ 𝑛 = 50 and binary variables would require around 2×10¹⁶ instances versus around 2,000
[Figure: Naïve Bayes structure with class 𝐶 and attributes 𝐴₁, …, 𝐴ₙ]
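A small sketch of the parameter-count argument, assuming binary attributes, a binary class, and one table entry per CPT cell (sum-to-one constraints are ignored, which is why the totals differ slightly from the rounded figures on the slide).

def joint_params(n, class_card=2, attr_card=2):
    return class_card * attr_card ** n              # full joint table over C, A1..An

def nbc_params(n, class_card=2, attr_card=2):
    return class_card + n * class_card * attr_card  # P(C) plus one small CPT per attribute

# "10 instances per parameter" rule of thumb from the slide.
for n in (10, 50):
    print(n, 10 * joint_params(n), 10 * nbc_params(n))
# n=10: 20480 vs 420;  n=50: about 2.3e16 vs 2020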

Naïve Bayes for Text
§ NBCs for text often use the bag-of-words model
§ Attribute 𝑊ᵢ is the word at position 𝑖 in the document
§ However, we assume each 𝑊ᵢ is identically distributed, independently of 𝑖
§ This model accounts for the same word occurring multiple times
§ “Bag of words” because the model is insensitive to word order
§ Example distributions (from the Berkeley AI Course):
§ 𝑃(𝐶): ham 0.66, spam 0.33
§ 𝑃(𝑊|spam): the 0.0156, to 0.0153, and 0.0115, of 0.0095, you 0.0093, a 0.0086, with 0.0080, from 0.0075, …
§ 𝑃(𝑊|ham): the 0.0210, to 0.0133, of 0.0119, 2002 0.0110, with 0.0108, from 0.0107, and 0.0105, a 0.0100, …
[Figure: Naïve Bayes structure with class 𝐶 and word attributes 𝑊₁, 𝑊₂, …, 𝑊ₙ]

Spam Example
Word P(w|spam) P(w|ham) Tot Spam Tot Ham (the two “Tot” columns are running totals of log probabilities)
(prior) 0.33333 0.66666 -1.1 -0.4
Gary 0.00002 0.00021 -11.8 -8.9
would 0.00069 0.00084 -19.1 -16.0
you 0.00881 0.00304 -23.8 -21.8
like 0.00086 0.00083 -30.9 -28.9
to 0.01517 0.01339 -35.1 -33.2
lose 0.00008 0.00002 -44.5 -44.0
weight 0.00016 0.00002 -53.3 -55.0
while 0.00027 0.00027 -61.5 -63.2
you 0.00881 0.00304 -66.2 -69.0
sleep 0.00006 0.00001 -76.0 -80.5
Example from Berkeley AI Course
P(spam | w₁, …, wₙ) ≈ 98.9% (the log-space computation is sketched below)
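A minimal sketch of this log-space accumulation, assuming natural logarithms and the word probabilities copied from the table above; the variable names and the final normalisation step are illustrative, and small differences from the table come from its rounding.

import math

p_spam = {'gary': .00002, 'would': .00069, 'you': .00881, 'like': .00086,
          'to': .01517, 'lose': .00008, 'weight': .00016, 'while': .00027,
          'sleep': .00006}
p_ham  = {'gary': .00021, 'would': .00084, 'you': .00304, 'like': .00083,
          'to': .01339, 'lose': .00002, 'weight': .00002, 'while': .00027,
          'sleep': .00001}

words = 'gary would you like to lose weight while you sleep'.split()
log_spam = math.log(1 / 3) + sum(math.log(p_spam[w]) for w in words)   # prior 0.33333
log_ham  = math.log(2 / 3) + sum(math.log(p_ham[w])  for w in words)   # prior 0.66666

# Normalise in log space to recover the posterior P(spam | w1, ..., wn).
posterior = 1 / (1 + math.exp(log_ham - log_spam))
print(log_spam, log_ham, posterior)   # roughly -76, -80, 0.99 (the slide reports 98.9%)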

Parameter Estimation
§ The parameters of a Bayesian network can be estimated
§ Through elicitation, which is the process of asking a human expert
§ Empirically using data (the Machine Learning approach)
§ We can define an empirical distribution 𝑃_𝒟
§ According to this distribution, the empirical probability of an instantiation is simply its frequency of occurrence
§ We can estimate parameters based on the empirical distribution
§ 𝑃_𝒟(𝑤₁|𝑠) = 𝑃_𝒟(𝑤₁, 𝑠) / 𝑃_𝒟(𝑠) = (2/16) / (12/16) = 1/6 (computed programmatically in the sketch below)
[Table: 16 training instances over Spam (𝑆), Friday (𝑊₁) and Medicine (𝑊₂), and the resulting empirical distribution 𝑃_𝒟: s w₁ w₂ 2/16, s w₁ w̄₂ 0/16, s w̄₁ w₂ 9/16, s w̄₁ w̄₂ 1/16, s̄ w₁ w₂ 0/16, s̄ w₁ w̄₂ 1/16, s̄ w̄₁ w₂ 2/16, s̄ w̄₁ w̄₂ 1/16]
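A sketch of the empirical-distribution estimate; the rows below encode the 16 training instances, assuming the column order (S, W1, W2) read off the table, and the helper function is an illustrative choice.

rows = ([(True, False, True)] * 9 + [(True, True, True)] * 2 +
        [(False, False, True)] * 2 + [(False, True, False)] +
        [(True, False, False)] + [(False, False, False)])
N = len(rows)   # 16

def p_emp(event):
    """Empirical probability P_D(event) as a frequency of occurrence."""
    return sum(1 for r in rows if event(r)) / N

p_s    = p_emp(lambda r: r[0])               # P_D(s)      = 12/16
p_w1_s = p_emp(lambda r: r[0] and r[1])      # P_D(w1, s)  =  2/16
print(p_w1_s / p_s)                          # P_D(w1 | s) =  1/6 ≈ 0.167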

Overfitting
§ Our objective is to classify unseen instances
§ We say a model “generalises” to unseen data
§ A common procedure is to split the data into training and test sets
§ However, frequency parameters tend to overfit the training data
§ Some words may only appear in one of the classes in the training set, such as “medicine” for spam and “indeed” for ham
§ Several words may not occur in the training set, but they may appear in the test set
§ In general, we should avoid assigning zero probability to any event, unless we are completely sure it cannot occur
§ 𝑃(𝐶, 𝑊₁, …, 𝑊ₙ) = 𝑃(𝐶) ∏ᵢ 𝑃(𝑊ᵢ|𝐶)

Additive Smoothing
§ Also known as Laplacian smoothing
§ Developed by Laplace when he tried to estimate the chance that the sun will rise tomorrow
§ 𝑃̂(𝑋 = 𝑥) = (𝑐(𝑋 = 𝑥) + 𝛼) / (𝑁 + 𝛼|𝑋|)
§ where 𝑐(𝑋 = 𝑥) is the number of occurrences of 𝑋 = 𝑥
§ 𝛼 ≥ 0 is a “pseudo count” parameter
§ 𝑁 is the number of instances and
§ |𝑋| is the number of values for 𝑋
§ For example, with 𝛼 = 1 (weak smoothing)
§ 𝑃̂(𝑠, 𝑤₁, 𝑤₂) = (2 + 1) / (16 + 8) = 3/24 ≈ .125
§ 𝑃̂(𝑠, 𝑤₁, w̄₂) = (0 + 1) / (16 + 8) = 1/24 ≈ .042
§ With 𝛼 = 1000 (strong smoothing)
§ 𝑃̂(𝑠, 𝑤₁, 𝑤₂) = (2 + 1000) / (16 + 8000) = 1002/8016 ≈ .125
§ 𝑃̂(𝑠, 𝑤₁, w̄₂) = (0 + 1000) / (16 + 8000) = 1000/8016 ≈ .1248
[Table: same 16-instance dataset and empirical distribution as on the previous slide]
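A sketch of the additive-smoothing estimate; N, alpha and the counts come straight from the example above, while the function name is an illustrative choice.

def smoothed(count, N, alpha, num_values):
    """Additive (Laplace) smoothing: (c(x) + alpha) / (N + alpha * |X|)."""
    return (count + alpha) / (N + alpha * num_values)

N, K = 16, 8   # 16 instances; 8 joint instantiations of the binary variables (S, W1, W2)
for alpha in (1, 1000):
    print(alpha,
          smoothed(2, N, alpha, K),    # (s, w1, w2):  seen twice in the data
          smoothed(0, N, alpha, K))    # (s, w1, ~w2): never seen
# alpha=1    -> 0.125  and 0.0417
# alpha=1000 -> 0.1250 and 0.1248 (strong smoothing pulls both towards 1/8)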

Naïve Bayes Extensions
§ Several NBC extensions have been proposed
§ Most of them try to address the assumption of conditional independence of attributes given the class
§ A well-known extension is the Tree-augmented Bayes classifier (TAN)
§ It allows a more elaborate dependency structure among variables
§ Such structure is not predefined, as in the case of NBCs
§ Tree means that each attribute variable has at most one attribute variable as parent
§ The central idea is to use conditional mutual information (MI) to link attributes
§ Conditional MI can be seen as a measure of dependency between attributes (given the class)

Tree-augmented Bayes Classifier
§ 𝑀𝐼_𝒟(𝐴ᵢ, 𝐴ⱼ | 𝐶) ≝ Σ_{𝑎ᵢ, 𝑎ⱼ, 𝑐} 𝑃_𝒟(𝑎ᵢ, 𝑎ⱼ, 𝑐) log [ 𝑃_𝒟(𝑎ᵢ, 𝑎ⱼ | 𝑐) / (𝑃_𝒟(𝑎ᵢ | 𝑐) 𝑃_𝒟(𝑎ⱼ | 𝑐)) ]
[Table: ten training instances over attributes 𝐴₁, 𝐴₂, 𝐴₃, 𝐴₄ and class 𝐶]
[Figure: complete graph over 𝐴₁, 𝐴₂, 𝐴₃, 𝐴₄ with pairwise conditional MI edge weights (.08, .16, .16, .21, .16, .21)]
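A sketch of how the conditional mutual information above can be computed from an empirical joint distribution; the toy dataset in the last lines is made up (the slide's ten instances are not fully legible), and the function name and column convention are illustrative.

import math
from collections import defaultdict

def cond_mutual_information(data, i, j, c=-1):
    """MI_D(A_i, A_j | C) from a list of complete instantiations (class in column c)."""
    N = len(data)
    p_ijc, p_ic, p_jc, p_c = (defaultdict(float) for _ in range(4))
    for row in data:
        ai, aj, cv = row[i], row[j], row[c]
        p_ijc[(ai, aj, cv)] += 1 / N
        p_ic[(ai, cv)] += 1 / N
        p_jc[(aj, cv)] += 1 / N
        p_c[cv] += 1 / N
    mi = 0.0
    for (ai, aj, cv), p in p_ijc.items():
        # P(ai, aj | c) / (P(ai | c) P(aj | c)), conditionals obtained by dividing by P(c)
        ratio = (p / p_c[cv]) / ((p_ic[(ai, cv)] / p_c[cv]) * (p_jc[(aj, cv)] / p_c[cv]))
        mi += p * math.log2(ratio)
    return mi

# Hypothetical four-row dataset with columns (A1, A2, C): given the class, A1 and A2
# are perfectly correlated, so the conditional MI is 1 bit.
toy = [(1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 0, 1)]
print(cond_mutual_information(toy, 0, 1))   # 1.0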

Tree-augmented Bayes Classifier
§ 𝑀𝐼_𝒟(𝐴ᵢ, 𝐴ⱼ | 𝐶) ≝ Σ_{𝑎ᵢ, 𝑎ⱼ, 𝑐} 𝑃_𝒟(𝑎ᵢ, 𝑎ⱼ, 𝑐) log [ 𝑃_𝒟(𝑎ᵢ, 𝑎ⱼ | 𝑐) / (𝑃_𝒟(𝑎ᵢ | 𝑐) 𝑃_𝒟(𝑎ⱼ | 𝑐)) ]
§ The resulting TAN factorises as
§ 𝑃(𝐶, 𝐴₁, 𝐴₂, 𝐴₃, 𝐴₄) = 𝑃(𝐶) 𝑃(𝐴₁|𝐶) 𝑃(𝐴₂|𝐴₁, 𝐶) 𝑃(𝐴₃|𝐴₄, 𝐶) 𝑃(𝐴₄|𝐴₂, 𝐶)
[Table: the same ten training instances over 𝐴₁, 𝐴₂, 𝐴₃, 𝐴₄ and 𝐶]

Tree-augmented Bayes Classifier
Input: Dataset 𝐷 with attributes 𝐴₁, …, 𝐴ₙ and class 𝐶
Output: TAN classifier
for 𝑖 = 1 to 𝑛 do
    for 𝑗 = 1 to 𝑛 do
        𝑚[𝑖, 𝑗] ← 𝑀𝐼_𝒟(𝐴ᵢ, 𝐴ⱼ | 𝐶)
𝐺 ← complete undirected graph over {𝐴₁, …, 𝐴ₙ} with edge weights 𝑚
𝐺_T ← maximal spanning tree of 𝐺
𝐺_TD ← 𝐺_T directed by choosing any variable as root and setting edge directions outward from the root
𝐺_TD ← 𝐺_TD with node 𝐶 added and edges directed from 𝐶 to each attribute node
Learn parameters for 𝐺_TD
return 𝐺_TD
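A compact sketch of this structure-learning loop, with a hand-rolled Prim's maximal spanning tree; the mi argument is assumed to behave like the conditional MI sketch shown earlier, parameter learning is left out, and all names are illustrative.

def learn_tan_structure(n_attrs, mi):
    """Return parent lists for a TAN over attributes 0..n_attrs-1 plus class 'C'."""
    weights = {(i, j): mi(i, j) for i in range(n_attrs) for j in range(i + 1, n_attrs)}

    # Prim's algorithm for a maximal spanning tree over the attribute nodes.
    in_tree, tree_edges = {0}, []
    while len(in_tree) < n_attrs:
        best = max((e for e in weights if (e[0] in in_tree) ^ (e[1] in in_tree)),
                   key=lambda e: weights[e])
        tree_edges.append(best)
        in_tree |= set(best)

    # Direct tree edges away from an arbitrary root (attribute 0),
    # then add C as a parent of every attribute.
    parents = {i: ['C'] for i in range(n_attrs)}
    unassigned, frontier = set(range(1, n_attrs)), [0]
    while frontier:
        u = frontier.pop()
        for (i, j) in tree_edges:
            if u in (i, j):
                v = j if i == u else i
                if v in unassigned:
                    parents[v].append(u)
                    unassigned.discard(v)
                    frontier.append(v)
    return parents

# Hypothetical usage, reusing the earlier sketch on a dataset with the class in the last column:
# parents = learn_tan_structure(4, lambda i, j: cond_mutual_information(data, i, j))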

Other Naïve Bayes Extensions
§ Other extensions to the NBC are
§ Bayesian Network Augmented Naïve Bayes (BAN)
§ General Bayesian Network (GBN)
§ Both GBN and BAN are very similar to TAN
§ The main difference is that they induce DAG structures from data, instead of trees
§ BANs create the network structure using only the attributes; the class is included afterwards, similarly to TAN
§ GBNs create the structure with all variables, including the class; they then find the Markov blanket of the class and delete all nodes outside the blanket
§ We will study the algorithms to induce DAG structures from data later on in the course

Comparison of Classification Accuracy
§ These are the results of an empirical comparison of Bayesian network classifiers on eight UCI datasets
[Table: classification accuracy (mean ± standard error, %) of the Bayesian classifiers discussed, including a GBN with selected attributes, on eight UCI datasets]
Cheng, J., & Greiner, R. Comparing Bayesian network classifiers. UAI, 1999.

Conclusion
§ Bayesian networks are models for probabilistic reasoning
§ Classification is a task that matches MAP/MPE queries
§ BNs provide an attractive approach
§ They can naturally learn (more about this later) and classify in the presence of missing data
§ For complete data, the Markov blanket provides an approach to select the most relevant features
§ NBC is a Bayesian classification algorithm with a fixed structure
§ NBCs are very popular in certain areas such as text mining
§ Some NBC extensions induce more complex structures, usually leading to better accuracy rates
