COMP 424 – Artificial Intelligence: Learning with Missing Values
Instructors: Jackie CK Cheung and Readings: R&N Ch 20.3 (3rd or 4th ed)
Final Project
• If you still don’t have a project partner, please see the megathread on Ed and let TA Koustuv Sinha know ASAP!
Today’s overview
• Topic: Parameter learning with missing values
• Expectation maximization (EM)
• K-means clustering
Learning in Bayesian networks
• Given data in the form of instances:
Tampering  Fire  Smoke  Alarm  Leaving  Report
No         No    No     No     No       No
No         Yes   Yes    Yes    Yes      No
...        ...   ...    ...    ...      ...
• Goal: Find parameters of the Bayes net.
• We discussed how to do this using maximum likelihood.
Learning in Bayesian networks
• Plot twist: Suppose some values are missing!
Tampering  Fire  Smoke  Alarm  Leaving  Report
?          No    No     No     No       No
No         Yes   Yes    Yes    ?        No
...        ...   ...    ...    ...      ...
• Can we still use MLE?
• How do we deal with the missing data?
Why do we get incomplete data?
• Some variables may not be assigned values in some instances.
• e.g., not all patients undergo all medical tests.
• Some variables may not be observed in any of the data items.
• e.g., viewer preferences for a show may depend on their metabolic cycle (what time they are awake) – which is not usually measured
• These are called latent (or hidden) variables
Why model latent variables?
• You can imagine designing a Bayesian network that ignores latent variables
• e.g., heart disease domain
• Factors: Smoking, diet, exercise
• Symptoms: s1, s2, s3
• Latent variable: whether a person has heart disease
Latent variables can make the model much more compact
[R&N Fig. 20.10: the heart-disease network with the latent HeartDisease variable (shaded = unobserved) vs. the same domain modelled without it.]
If each variable has 3 possible values: 78 parameters with the latent variable, vs. 708 parameters without it!
Learning with missing data
• Our problem is to estimate the parameters of the network
• If we had the parameters, we could compute P(HeartDisease | OtherVariables); then we would have complete data for supervised learning.
• A chicken and egg conundrum
• Let’s pretend we know the parameters of the model
• Then we can do inference to figure out the value of HeartDisease
• Then we can re-estimate the parameters!
• This seems like magic! But it works (kind of)!
Expectation Maximization (EM)
• General purpose method for learning from incomplete data (not only Bayes nets), whenever an underlying distribution is assumed.
• Main idea: Alternate between two steps
1. (E-step): For all the instances of missing data, we will “fantasize” how the data should look based on the current parameter setting.
• This means we compute expected sufficient statistics.
2. (M-step): Then find the maximum-likelihood parameter setting based on these statistics.
Outline of EM
• Initialization:
• Start with some initial parameter setting (e.g., P(T), P(F), P(A|F,T), etc.)
• These can be estimated from all complete data instances.
1. Expectation (E-step): Complete the data by assigning “values” to
the missing items based on current parameter setting.
2. Maximization (M-step): Compute the maximum likelihood parameter setting based on the completed data. This is what we did last class.
• Convergence:
• Nothing changes in the E-step or M-step between two consecutive iterations.
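To show how these pieces fit together, here is a rough Python skeleton of the EM loop. It is only a sketch, not from the readings: the e_step, m_step, and log_likelihood arguments are hypothetical placeholders for the model-specific computations.

```python
def em(data, theta_init, e_step, m_step, log_likelihood, tol=1e-6, max_iters=100):
    """Generic EM loop: alternate completing the data (E) and refitting parameters (M)."""
    theta = theta_init
    prev_ll = float("-inf")
    for _ in range(max_iters):
        expected_stats = e_step(data, theta)   # E-step: expected sufficient statistics
        theta = m_step(expected_stats)         # M-step: maximum-likelihood re-estimate
        ll = log_likelihood(data, theta)
        if ll - prev_ll < tol:                 # stop when the likelihood stops improving
            break
        prev_ll = ll
    return theta
```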
Consider a simple network X → Y, and suppose we want to learn its parameters from samples
Suppose that x1 is missing and y1=1. What can we do?
EM in our example
• To start, guess the parameters of the network using the known data (instances 2..m):
  θ_X = N_{X=1}(2:m) / (m−1)
  θ_{Y|X=0} = N_{Y=1,X=0}(2:m) / N_{X=0}(2:m)
  θ_{Y|X=1} = N_{Y=1,X=1}(2:m) / N_{X=1}(2:m)
• E-step: Using the initial θ, compute P(x1=0 | y1) and P(x1=1 | y1).
  (Note that this step requires exact inference – so not cheap!)
  Complete the dataset with the most likely value of x1; call the new dataset D*.
• M-step: Compute the new parameter vector θ, which maximizes the likelihood given the completed data, L(θ | D*) = P(D* | θ):
  θ_X = N_{X=1} / m
  θ_{Y|X=0} = N_{Y=1,X=0} / N_{X=0}
  θ_{Y|X=1} = N_{Y=1,X=1} / N_{X=1}
• Repeat E-step and M-step until the parameter vector converges.
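To make this concrete, here is a minimal hard-EM sketch for the X → Y network, assuming binary X and Y and data given as (x, y) pairs with None marking a missing x. The helper names are illustrative, not from the slides.

```python
# Hard EM for the two-variable network X -> Y, with X and Y binary.
# data is a list of (x, y) pairs; a missing x is encoded as None.

def fit_from_complete(pairs):
    """Maximum-likelihood estimates of P(X=1), P(Y=1|X=0), P(Y=1|X=1) from complete pairs."""
    n = len(pairs)
    n_x1 = sum(1 for x, _ in pairs if x == 1)
    n_x0 = n - n_x1
    n_y1_x1 = sum(1 for x, y in pairs if x == 1 and y == 1)
    n_y1_x0 = sum(1 for x, y in pairs if x == 0 and y == 1)
    return n_x1 / n, n_y1_x0 / max(n_x0, 1), n_y1_x1 / max(n_x1, 1)

def hard_em(data, iters=50):
    # Initialize from the complete instances only.
    theta_x, theta_y0, theta_y1 = fit_from_complete([(x, y) for x, y in data if x is not None])
    for _ in range(iters):
        # E-step: fill each missing x with its most probable value given y (exact inference).
        completed = []
        for x, y in data:
            if x is None:
                p1 = theta_x * (theta_y1 if y == 1 else 1 - theta_y1)        # P(x=1) P(y | x=1)
                p0 = (1 - theta_x) * (theta_y0 if y == 1 else 1 - theta_y0)  # P(x=0) P(y | x=0)
                x = 1 if p1 >= p0 else 0
            completed.append((x, y))
        # M-step: maximum-likelihood re-estimation on the completed dataset D*.
        theta_x, theta_y0, theta_y1 = fit_from_complete(completed)
    return theta_x, theta_y0, theta_y1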
Two versions of the algorithm
• Hard EM: for each missing data point, assign the value that is most likely.
(This is the version we just saw.)
• Soft EM: for each missing data point, put a weight on each value, equal to its probability, and use the weights as counts.
(This is the most common version.)
These weights are then used as real (fractional) counts, to provide a maximum-likelihood estimate for θ.
Soft EM in our example
• To start, guess the parameters of the network using the known data (instances 2..m):
  θ_X = N_{X=1}(2:m) / (m−1)
  θ_{Y|X=0} = N_{Y=1,X=0}(2:m) / N_{X=0}(2:m)
  θ_{Y|X=1} = N_{Y=1,X=1}(2:m) / N_{X=1}(2:m)
• E-step: Using the initial θ, compute:
  w0 = P(x1=0 | y1),   w1 = P(x1=1 | y1)
  Now hypothesize two completed datasets: D0 (with x1 filled in as 0) and D1 (with x1 filled in as 1).
• M-step: Compute the new parameter vector θ, which maximizes the expected likelihood given the completed data, L(θ | D*) = w0 P(D0 | θ) + w1 P(D1 | θ):
  θ_X = (N_{X=1}(2:m) + w1) / m
  θ_{Y|X=0} = (N_{Y=1,X=0}(2:m) + w0) / (N_{X=0}(2:m) + w0)
  θ_{Y|X=1} = (N_{Y=1,X=1}(2:m) + w1) / (N_{X=1}(2:m) + w1)
• Repeat the E and M steps!
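A corresponding soft-EM sketch for the same network, again assuming binary variables and None for a missing x; each missing item contributes fractional counts w0 and w1 instead of a single hard assignment. This is only an illustrative sketch.

```python
def soft_em(data, iters=50):
    # Initialize from the complete instances only.
    obs = [(x, y) for x, y in data if x is not None]
    n_x1 = sum(1 for x, _ in obs if x == 1)
    n_x0 = len(obs) - n_x1
    theta_x = n_x1 / len(obs)
    theta_y0 = sum(1 for x, y in obs if x == 0 and y == 1) / max(n_x0, 1)
    theta_y1 = sum(1 for x, y in obs if x == 1 and y == 1) / max(n_x1, 1)

    for _ in range(iters):
        # E-step: each missing x gets weights w1 = P(x=1 | y) and w0 = 1 - w1.
        c_n, c_x1, c_x0, c_y1x1, c_y1x0 = 0.0, 0.0, 0.0, 0.0, 0.0
        for x, y in data:
            if x is None:
                p1 = theta_x * (theta_y1 if y == 1 else 1 - theta_y1)
                p0 = (1 - theta_x) * (theta_y0 if y == 1 else 1 - theta_y0)
                w1 = p1 / (p1 + p0)
            else:
                w1 = float(x == 1)
            w0 = 1.0 - w1
            c_n += 1.0
            c_x1 += w1
            c_x0 += w0
            if y == 1:
                c_y1x1 += w1
                c_y1x0 += w0
        # M-step: maximum expected-likelihood estimates from the weighted counts.
        theta_x = c_x1 / c_n
        theta_y0 = c_y1x0 / c_x0
        theta_y1 = c_y1x1 / c_x1
    return theta_x, theta_y0, theta_y1
```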
Comparison of hard EM and soft EM
• Soft EM does not commit to a specific value for the missing item.
• Instead, it considers all possible values, with some probability.
• This is a pleasing property, given the uncertainty in the value.
• Complexity:
• Hard EM requires computing most probable values.
• Soft EM requires computing conditional probabilities for completing the missing values.
• Same complexity: both require full probabilistic inference – which can be expensive!
Properties of EM
• Likelihood function is guaranteed to improve (or stay the same) with each iteration.
• Algorithm can be stopped when no more improvement is achieved between iterations.
• EM is guaranteed to converge to a local optimum of the likelihood function.
• Starting from several different initial parameter settings (random restarts) may be necessary to find a good local optimum.
• Or find an informed way to initialize the parameters (e.g., with a small labelled dataset).
• EM is a widely used algorithm in practice!
A harder example
[Figure slides omitted: the same E and M steps worked on a harder example network.]
And repeat…
A seemingly harder problem
• What if one of your variables is always missing? Data = [table omitted]
• Can you estimate a maximum-likelihood parameter for this case?
• We call this learning from unlabeled data, or unsupervised learning
An example
• A fruit merchant approaches you with a set of apples to classify according to their variety.
• Tells you there are five varieties of apples in the dataset.
• Tells you the weight and colour of each apple in the dataset.
• Can you label each apple with the correct variety?
• What would you need to know / assume?
Plotting the data
What if you observe that the data looks like this?
(One axis is weight, the other is colour.)
Reminder: Gaussian distribution
• a.k.a., normal distribution
• Probability density function:
  f(x | μ, σ²) = N(μ, σ²) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )
  [Image omitted: Gaussian density curves. Source: Wikipedia]
• Outcomes are continuous values (e.g., height, weight, size, …)
• Parameterized by mean and variance
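For reference, here is a small Python sketch of this density, parameterized by the mean mu and the variance sigma2:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of the normal distribution N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```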
K-means clustering
• K-means clustering: cluster instances into K distinct classes.
• Need to know K in advance.
• Need to assume a parametric distribution for each class.
• Back to our apples…
• You know there are 5 varieties.
• Assume each variety generates apples according to a (variety-specific) 2-D Gaussian distribution, N(μi, σi²).
• If you know μi, σi² for each class, it's easy to classify the apples!
• If you know the class of each apple, it's easy to estimate μi, σi²!
What if we know neither?
K-means algorithm
This data could easily be modeled by Gaussians.
1. Ask user how many clusters.
2. Randomly guess k centers: { μ1, …, μk } (assume σ² is known).
3. Assign each data point to the closest center.
4. Each center finds the centroid of the points it owns.
5. Repeat steps 3–4.
[Figures omitted: the steps illustrated on an example 2-D dataset.]
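Putting the numbered steps together, here is a minimal K-means sketch in Python, using plain tuples and squared Euclidean distance; as in the slides' simplification, the cluster variances are not modelled. For the apple example, points would be (weight, colour) pairs and k = 5.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means: points is a list of equal-length tuples, k the number of clusters."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]       # step 2: random initial centers
    assignment = None
    for _ in range(max_iters):
        # Step 3: assign each point to its closest center (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:                      # stop when assignments stop changing
            break
        assignment = new_assignment
        # Step 4: move each center to the centroid of the points it now owns.
        for j in range(k):
            owned = [p for p, a in zip(points, assignment) if a == j]
            if owned:
                centers[j] = [sum(coord) / len(owned) for coord in zip(*owned)]
    return centers, assignment
```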
K-means algorithm: iterations
[Figure sequence omitted: the algorithm starts from the initial centers, continues through iterations 2–9, and terminates.]
Starting and Stopping K-means
• How to decide on value of K?
• User must specify – not always easy to do!
• Could try different values of K and inspect the output
• There are more sophisticated methods which attempt to learn a good value of K, given assumptions about cluster coherence and shape
• When to stop?
• Usually stop when the labels of the datapoints / the cluster centres stop changing
K-means and EM
• K-means as shown above is just the hard EM algorithm, where the underlying model that generated the data is a mixture of Gaussian distributions.
• i.e., each cluster (each type of apples) is generated by a different Gaussian distribution with its own mean and variance.
• There exists an analogous algorithm for the standard, soft EM algorithm for a mixture of Gaussians.
• In E-step, must assign a responsibility score for each data point being generated by each Gaussian distribution in the mixture.
• M-step update must then use these responsibility scores to weight the samples when computing the centroids.
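Here is a minimal sketch of one soft-EM iteration for a one-dimensional mixture of Gaussians, assuming equal mixing weights for simplicity (the slides do not spell out the mixing weights); the responsibilities play exactly the role described above.

```python
import math

def em_step_gmm(xs, mus, sigma2s):
    """One soft-EM iteration for a 1-D Gaussian mixture, assuming equal mixing weights."""
    k = len(mus)

    def pdf(x, mu, s2):
        return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

    # E-step: responsibility resp[i][j] = probability that point i was generated by Gaussian j.
    resp = []
    for x in xs:
        scores = [pdf(x, mus[j], sigma2s[j]) for j in range(k)]
        total = sum(scores)
        resp.append([s / total for s in scores])

    # M-step: re-estimate each mean and variance with responsibility-weighted averages.
    new_mus, new_sigma2s = [], []
    for j in range(k):
        weight = sum(r[j] for r in resp)
        mu = sum(r[j] * x for r, x in zip(resp, xs)) / weight
        s2 = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, xs)) / weight
        new_mus.append(mu)
        new_sigma2s.append(max(s2, 1e-9))   # guard against a collapsing variance
    return new_mus, new_sigma2s
```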
Properties of K-means
• Time complexity? O(nkd) per iteration, where k = # centers, n = # datapoints, d = dimensionality of the data.
• Optimality?
• Converges to a local optimum.
• Can use random re-starts to get better local optimum.
• Alternatively, can choose your initial centers carefully (see the sketch below):
• Place μ1 on top of a randomly chosen datapoint.
• Place μ2 on top of the datapoint that is furthest from μ1.
• Place μ3 on top of the datapoint that is furthest from both μ1 and μ2, and so on.
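A small, purely illustrative sketch of this farthest-point initialization heuristic, for points represented as tuples:

```python
import random

def farthest_point_init(points, k, seed=0):
    """Pick k initial centers: one random point, then repeatedly the point farthest
    (in squared Euclidean distance) from all centers chosen so far."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        def dist_to_nearest_center(p):
            return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        centers.append(max(points, key=dist_to_nearest_center))
    return centers
```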
What you should know
• Learning maximum-likelihood parameters with missing data.
• EM algorithm in general case.
• Properties of EM (complexity, convergence).
• Clustering of data using the K-means algorithm
• Basic procedure.
• Properties of K-means (complexity, convergence).
• Relation between EM and K-means.