CSE 150A. Assignment 3
Summer 2020, Session 1
Out: Thu Jul 16
Due: Thu Jul 23
3.1 Maximum likelihood estimation
[Figure: belief network over nodes A, B, C, D]

(a) Complete data

Consider a complete data set of i.i.d. examples $\{a_t, b_t, c_t, d_t\}_{t=1}^{T}$ drawn from the joint distribution of the above belief network. Compute the maximum likelihood estimates of the conditional probability tables (CPTs) shown below for this data set. Express your answers in terms of equality-testing functions, such as:
$$I(a, a_t) = \begin{cases} 1 & \text{if } a = a_t, \\ 0 & \text{if } a \neq a_t. \end{cases}$$
For example, in terms of this function, the maximum likelihood estimate for the CPT at node A is given by
$$P(A = a) = \frac{1}{T} \sum_{t=1}^{T} I(a, a_t).$$
Complete the numerators and denominators in the expressions below.
P(B=b|A=a) =
P(C=c|A=a,B=b) =
P(D=d|A=a,C=c) =
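To make the role of the indicator function concrete, here is a minimal sketch (in Python, with hypothetical data) of how the given estimate of P(A = a) reduces to counting; it is a reading aid only, not part of the required answer, and the CPTs above are left for you to derive.

```python
import numpy as np

# Hypothetical complete data: a_samples[t] holds the observed value a_t.
a_samples = np.array([0, 1, 1, 0, 1])           # illustrative data, T = 5
T = len(a_samples)

# P(A = a) = (1/T) * sum_t I(a, a_t): the empirical frequency of each value of A.
values, counts = np.unique(a_samples, return_counts=True)
p_A = dict(zip(values.tolist(), (counts / T).tolist()))
print(p_A)                                      # {0: 0.4, 1: 0.6}
```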
(b) Posterior probability
Consider the belief network shown above, with observed nodes B and D and hidden nodes A and C. Compute the posterior probability P(a, c | b, d) in terms of the CPTs of the belief network—that is, in terms of P(a), P(b|a), P(c|a, b), and P(d|a, c).
(c) Posterior probability
Compute the posterior probabilities P(a | b, d) and P(c | b, d) in terms of your answer from part (b). In other words, in this problem, you may assume that P(a, c | b, d) is given, since you showed how to compute that in part (b).
(d) Log-likelihood
Consider a partially complete data set of i.i.d. examples $\{b_t, d_t\}_{t=1}^{T}$ drawn from the joint distribution of the above belief network. The log-likelihood of the data set is given by:
$$\mathcal{L} = \sum_t \log P(B = b_t, D = d_t).$$
Compute this log-likelihood in terms of the CPTs of the belief network. You may re-use work from earlier parts of the problem.
(e) EM algorithm
The posterior probabilities from parts (b) and (c) can be used by an EM algorithm to estimate CPTs that maximize the log-likelihood from part (d). Complete the numerator and denominator in the expressions below for the EM update rules. Simplify your answers as much as possible, expressing them in terms of the posterior probabilities P(a, c | b_t, d_t), P(a | b_t, d_t), and P(c | b_t, d_t), as well as the functions I(b, b_t) and I(d, d_t).
P(A=a) ←
P(B=b|A=a) ←
P(C=c|A=a,B=b) ←
P(D=d|A=a,C=c) ←
3.2 EM algorithm for noisy-OR
In this problem, we will see how the EM algorithm can be used to learn the parameters of a noisy-OR model. Part (a) is related to the derivation of the EM update rules for the noisy-OR model. Parts (b) through (d) will ask you to write code to learn the parameters of a noisy-OR model based on a particular data set.
(a) Equivalence of models
The following diagrams depict two variations of the noisy-OR model:
[Figure, Model A: inputs X1, X2, ..., Xn are parents of a single output node Y. Model B: each input Xi is the parent of a hidden node Zi, and Z1, Z2, ..., Zn are the parents of Y.]
For this part of the problem, we will use the notation P_A(···) and P_B(···) to distinguish between probabilities computed using Model A and Model B, respectively.
Model A is a standard noisy-OR model in which the CPT for Y is
$$P_A(Y = 1 \mid \vec{X} = \vec{x}) = 1 - \prod_{i=1}^{n} (1 - p_i)^{x_i}.$$
Model B is a variation of the noisy-OR model in which we’ve inserted a layer of hidden variables
$Z_1, \ldots, Z_n$, with the following CPTs:
$$P_B(Z_i = 1 \mid X_i = x_i) = \begin{cases} 0, & \text{if } x_i = 0 \\ p_i, & \text{if } x_i = 1 \end{cases}$$
$$P_B(Y = 1 \mid \vec{Z} = \vec{z}) = \begin{cases} 1, & \text{if } Z_i = 1 \text{ for any } i \\ 0, & \text{if } Z_i = 0 \text{ for all } i \end{cases}$$
In contrast to Model A, the Y node in Model B is a deterministic OR, since it will be 1 if and only if any of the Zi’s are 1. We can view the Zi nodes as being a sort of “noisy copy” of the corresponding Xi nodes: if Xi = 0, then Zi is guaranteed to be 0 as well, but if Xi = 1, then Zi has probability pi of being 1.
Note also that both models are defined in terms of parameters $p_i$ for $i \in \{1, \ldots, n\}$.
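For readers who find code easier to scan than products, the following minimal sketch (Python/NumPy, illustrative names only) evaluates the Model A CPT given above; it is not required for the derivation in this part.

```python
import numpy as np

def noisy_or_prob(x, p):
    """P(Y = 1 | X = x) = 1 - prod_i (1 - p_i)^(x_i) for a binary input vector x."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return 1.0 - np.prod((1.0 - p) ** x)

# Inputs that are "off" (x_i = 0) contribute a factor of 1 and are effectively ignored.
print(noisy_or_prob([1, 0, 1], [0.3, 0.9, 0.5]))   # 1 - (0.7 * 1 * 0.5) = 0.65
```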
To prove that the Y nodes in both models are equivalent, it is enough to show that
$$P_A(Y = 0 \mid \vec{X} = \vec{x}) = P_B(Y = 0 \mid \vec{X} = \vec{x}).$$
The following equations are an incomplete proof of this fact, which differs slightly from the one seen in lecture. To complete this proof, provide a brief justification for each of the steps labeled (i) through (iv).
$$\begin{aligned}
P_B(Y = 0 \mid \vec{X} = \vec{x}) &= \sum_{\vec{z}} P_B(Y = 0, \vec{Z} = \vec{z} \mid \vec{X} = \vec{x}) && \text{(i)} \\
&= \sum_{\vec{z}} P_B(Y = 0 \mid \vec{Z} = \vec{z}) \prod_{i=1}^{n} P_B(Z_i = z_i \mid X_i = x_i) && \text{(ii)} \\
&= \prod_{i=1}^{n} P_B(Z_i = 0 \mid X_i = x_i) && \text{(iii)} \\
&= \prod_{i=1}^{n} (1 - p_i)^{x_i} && \text{(iv)}
\end{aligned}$$
(Because of this result, we can simply use the notation P(···) for probabilities in the remaining parts of this problem, since we have shown that the two models P_A(···) and P_B(···) agree.)
(b) EM Implementation: Per-iteration statistics
Suppose we are given a data set of T samples $\{(\vec{x}_t, y_t)\}_{t=1}^{T}$ from Model A.
If we try to apply standard maximum-likelihood techniques to estimate the values of the parameters pi based on Model A, we will find that there is no closed-form solution.
However, since part (a) showed that Model A and Model B are equivalent, we can instead view the data as an incomplete data set corresponding to Model B. This allows us to estimate the values of pi using the EM algorithm.
Noting that $p_i$ corresponds to $P(Z_i = 1 \mid X_i = 1)$, which is one of the CPT entries for Model B, we can write the EM update rule as:
$$p_i \;\leftarrow\; \frac{\text{count}(Z_i = 1, X_i = 1)}{\text{count}(X_i = 1)} \;=\; \frac{\sum_t P(Z_i = 1, X_i = 1 \mid \vec{X} = \vec{x}_t, Y = y_t)}{\sum_t P(X_i = 1 \mid \vec{X} = \vec{x}_t, Y = y_t)}$$
Then, as shown in class, the EM update rule for noisy-OR can be simplified to the following:
$$p_i \;\leftarrow\; \frac{1}{T_i} \sum_{t=1}^{T} \frac{y_t\, x_{it}\, p_i}{1 - \prod_{j=1}^{n} (1 - p_j)^{x_{jt}}},$$
where $T_i = \sum_t x_{it}$ is the number of observations where $X_i = 1$.
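The simplified update maps directly onto a few lines of array code. Below is one possible sketch (Python/NumPy, hypothetical names; it assumes X is a T×n binary matrix of the x_it and y a length-T binary vector), offered as an illustration rather than as the reference solution.

```python
import numpy as np

def em_update(X, y, p):
    """One EM update of the noisy-OR parameters, following the simplified rule above.

    X: (T, n) binary array of inputs x_it; y: (T,) binary array of labels y_t;
    p: (n,) current parameters. Returns the new parameters as a fresh array, so
    stale and updated values are never mixed within a single update.
    """
    prob_y1 = 1.0 - np.prod((1.0 - p) ** X, axis=1)   # P(Y=1 | x_t) for each example
    ratio = y / np.maximum(prob_y1, 1e-12)            # y_t / P(Y=1 | x_t); epsilon guards all-zero rows
    T_i = X.sum(axis=0)                               # T_i = number of examples with x_it = 1
    return p * (ratio @ X) / T_i                      # (1/T_i) * sum_t y_t x_it p_i / P(Y=1 | x_t)
```

Returning a fresh array implements the synchronous update that the debugging hints at the end of this problem warn about.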
Next, you will use the EM algorithm for estimating the noisy-OR parameters $p_i$. Download the data files hw3_noisyOr_x.txt and hw3_noisyOr_y.txt for this homework assignment. The data set has T = 267 examples over n = 23 inputs.[1]
For a data set $\{(\vec{x}_t, y_t)\}_{t=1}^{T}$, the normalized log (conditional) likelihood is given by:
$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \log P(Y = y_t \mid \vec{X} = \vec{x}_t).$$
In your code, you should initialize all $p_i = 0.05$ and perform 512 iterations of the EM algorithm using the update rule shown above. An easy-to-miss reason for getting different results on this assignment is failing to use the initialization specified here! At each iteration, compute the log conditional likelihood shown above. (If you have implemented the EM algorithm correctly, this log conditional likelihood will always increase from one iteration to the next.)
Also compute the number of mistakes made by the model at each iteration; a mistake occurs either
• when $y_t = 0$ and $P(Y = 1 \mid \vec{X} = \vec{x}_t) \geq 0.5$ (indicating a false positive), or
• when $y_t = 1$ and $P(Y = 1 \mid \vec{X} = \vec{x}_t) < 0.5$ (indicating a false negative).
In other words, use the current parameter values of the iteration to try to predict Y given the values for X, and count how many of those predictions are wrong. The number of mistakes should generally decrease as the model is trained, though it is not guaranteed to do so at each iteration.
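One possible way to compute these per-iteration statistics is sketched below, using the same hypothetical X, y, and p arrays as the update sketch above; again, treat it as an illustration of the definitions rather than the required implementation.

```python
import numpy as np

def iteration_stats(X, y, p):
    """Normalized log conditional likelihood and mistake count at the current parameters p."""
    prob_y1 = 1.0 - np.prod((1.0 - p) ** X, axis=1)       # P(Y=1 | x_t) for each example
    likelihood = np.where(y == 1, prob_y1, 1.0 - prob_y1)  # P(Y=y_t | x_t)
    L = np.mean(np.log(np.maximum(likelihood, 1e-12)))    # (1/T) sum_t log P(Y=y_t | x_t)
    predictions = (prob_y1 >= 0.5).astype(int)            # predict 1 exactly when P(Y=1 | x_t) >= 0.5
    mistakes = int(np.sum(predictions != y))              # false positives + false negatives
    return L, mistakes
```

A typical driver would then alternate p = em_update(X, y, p) with a call to iteration_stats, recording the values at iterations 0, 1, 2, 4, 8, and so on.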
For this part, turn in a completed version of the following table:
iteration    number of mistakes M    log conditional likelihood L
0            175                     -0.9581
1            56
2                                    -0.4082
4
8
16
32
64
128          36
256
512                                  -0.3100
You should use the already completed entries of this table to check your work. As always you may program in the language of your choice.
(c) EM Implementation: Estimated values for pi
Produce a table or bar plot of your final estimated values for pi.
(d) EM Implementation: Source code
Include your source code for this problem as part of your Gradescope submission.
[1] For those interested, more information about this data set is available here:
http://archive.ics.uci.edu/ml/machine-learning-databases/spect/SPECT.names
Debugging hints: If you are having trouble getting your results to match starting at iteration 1, try printing out the values of p1 and p23 after the first few iterations. After iteration 1, you should find that p1 ≈ 0.11655 and p23 ≈ 0.15310. If you find that your value of p1 matches but p23 does not, this is likely a sign that you are using the updated values for pi too early. To solve this, you should save the new values for pi in temporary variables until p1, p2, . . . , p23 have all been computed; then, overwrite the old values of pi all at once.
3.3 EM algorithm for binary matrix completion
In this problem you will use the EM algorithm to build a simple movie recommendation system. Download the files hw3_movieTitles_su20.txt and hw3_movieRatings_su20.txt. The first file provides the titles of 60 movies. The second file contains a 150×60 matrix of zeros, ones, and missing elements denoted by question marks. The (i, j)th element in this matrix contains the ith user's rating of the jth movie (corresponding to the jth line of the first file), according to the following key:
1    recommended,
0    not recommended,
?    not seen.
Note: In past versions of this course, this data came from a survey given to you and your peers. In the interest of time for the Summer session, I have generated synthetic data to only loosely match the distribution seen from past students, so as to give them some privacy.
(a) Sanity check
Compute the mean popularity rating of each movie, given by the simple ratio
$$\frac{\text{number of users who recommended the movie}}{\text{number of users who saw the movie}},$$
and sort the movies by this ratio. Print out the movie titles from most popular (Doctor Strange) to least popular (The Last Airbender).
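A minimal sketch of this sanity check is shown below (Python/NumPy). It assumes the ratings file is whitespace-separated with '?' marking unseen entries; adjust the parsing to the actual file format, and treat the variable names as illustrative.

```python
import numpy as np

titles = [line.strip() for line in open("hw3_movieTitles_su20.txt")]
raw = np.loadtxt("hw3_movieRatings_su20.txt", dtype=str)     # 150 x 60 entries of '0', '1', '?'
ratings = np.where(raw == "?", "nan", raw).astype(float)     # NaN marks "not seen"

seen = ~np.isnan(ratings)                                    # which (user, movie) pairs were rated
popularity = np.nansum(ratings, axis=0) / seen.sum(axis=0)   # recommended / seen, per movie
for j in np.argsort(-popularity):                            # most popular first
    print(f"{popularity[j]:.3f}  {titles[j]}")
```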
(b) Likelihood
Now you will learn a naive Bayes model of these movie ratings, represented by the belief network shown below, with hidden variable $Y \in \{1, 2, \ldots, k\}$ and partially observed binary variables $R_1, R_2, \ldots, R_{60}$ (corresponding to movie ratings).
[Figure: naive Bayes network in which Y is the parent of R1, R2, R3, ..., R60]
This model assumes that there are k different types of movie-goers, and that the ith type of movie-goer—who represents a fraction P(Y = i) of the overall population—likes the jth movie with conditional probability P(R_j = 1 | Y = i).
This model is an interesting extension of the hidden variable models studied in lecture: for this problem, each sampled set of user reviews contains its own mix of observed and unobserved variables. Some users have watched many movies, some users have watched only a few. A hidden variable for one user might be observed for another user. This is what we mean by "partially observed". Note that the type of movie-goer (variable Y) is always hidden for all users.
Let $\Omega_t$ denote the set of movies seen (and hence rated) by the tth user. Show that the likelihood of the tth user's ratings is given by
$$P\Big(\big\{R_j = r_j^{(t)}\big\}_{j \in \Omega_t}\Big) \;=\; \sum_{i=1}^{k} P(Y = i) \prod_{j \in \Omega_t} P\big(R_j = r_j^{(t)} \,\big|\, Y = i\big).$$
(c) E-step

The E-step of this model is to compute, for each user, the posterior probability that they correspond to a particular type of movie-goer. Show that
$$P\Big(Y = i \,\Big|\, \big\{R_j = r_j^{(t)}\big\}_{j \in \Omega_t}\Big) \;=\; \frac{P(Y = i) \prod_{j \in \Omega_t} P\big(R_j = r_j^{(t)} \,\big|\, Y = i\big)}{\sum_{i'=1}^{k} P(Y = i') \prod_{j \in \Omega_t} P\big(R_j = r_j^{(t)} \,\big|\, Y = i'\big)}.$$
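In array form, this E-step might look like the sketch below (Python/NumPy, illustrative names): ratings is the 150×60 matrix with NaN for unseen entries, prob_type holds P(Y = i) as a length-k vector, and prob_rating holds P(R_j = 1 | Y = i) as a k×60 array. Working in log space keeps the products over many movies numerically stable; this is one possible implementation, not the required one.

```python
import numpy as np

def e_step(ratings, prob_type, prob_rating):
    """rho[t, i] = P(Y = i | the movies user t has rated), computed in log space."""
    seen = ~np.isnan(ratings)                              # (T, 60) mask of rated movies
    r = np.nan_to_num(ratings)                             # unseen entries become 0; masked below
    T, k = ratings.shape[0], prob_type.shape[0]
    log_rho = np.tile(np.log(prob_type), (T, 1))           # start from log P(Y = i)
    for i in range(k):
        # log P(R_j = r_j^(t) | Y = i) for every entry, summed over rated movies only
        log_like = r * np.log(prob_rating[i]) + (1 - r) * np.log(1.0 - prob_rating[i])
        log_rho[:, i] += np.where(seen, log_like, 0.0).sum(axis=1)
    log_rho -= log_rho.max(axis=1, keepdims=True)          # stabilize before exponentiating
    rho = np.exp(log_rho)
    return rho / rho.sum(axis=1, keepdims=True)            # normalize over the k types
```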
(d) M-step

The M-step of the model is to re-estimate the probabilities P(Y = i) and P(R_j = 1 | Y = i) that define the CPTs of the belief network. As shorthand, let
$$\rho_{it} \;=\; P\Big(Y = i \,\Big|\, \big\{R_j = r_j^{(t)}\big\}_{j \in \Omega_t}\Big)$$
denote the probabilities computed in the E-step of the algorithm. Also, let T = 150 denote the number of users. Show that the EM updates are given by
$$P(Y = i) \;\leftarrow\; \frac{1}{T} \sum_{t=1}^{T} \rho_{it},$$
$$P(R_j = 1 \mid Y = i) \;\leftarrow\; \frac{\sum_{t:\, j \in \Omega_t} \rho_{it}\, I\big(r_j^{(t)}, 1\big) \;+\; \sum_{t:\, j \notin \Omega_t} \rho_{it}\, P(R_j = 1 \mid Y = i)}{\sum_{t=1}^{T} \rho_{it}}.$$
Here, $\sum_{t:\, j \in \Omega_t}$ denotes a sum over all users t that rated movie j. Similarly, $\sum_{t:\, j \notin \Omega_t}$ denotes a sum over all users t that did not rate movie j.
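The two updates can be written compactly with the same arrays. The sketch below (Python/NumPy, same illustrative names as the E-step sketch, with rho of shape T×k) handles the rated and unrated sums separately, exactly as in the formula; again it is only one possible implementation.

```python
import numpy as np

def m_step(ratings, rho, prob_rating):
    """Re-estimate P(Y=i) and P(R_j=1 | Y=i) from the E-step posteriors rho."""
    T = ratings.shape[0]
    seen = (~np.isnan(ratings)).astype(float)              # 1 if user t rated movie j, else 0
    liked = (np.nan_to_num(ratings) == 1).astype(float)    # 1 if user t rated and recommended movie j

    new_prob_type = rho.sum(axis=0) / T                    # P(Y=i) <- (1/T) sum_t rho_it

    # Numerator: sum over raters of rho_it * I(r_j^(t), 1), plus sum over non-raters of
    # rho_it * P(R_j=1|Y=i), where the current estimate fills in the unobserved ratings.
    numer = rho.T @ liked + (rho.T @ (1.0 - seen)) * prob_rating
    denom = rho.sum(axis=0)[:, None]                       # sum_t rho_it, one value per type i
    new_prob_rating = numer / denom
    return new_prob_type, new_prob_rating
```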
(e) Implementation
Download the files hw3_probType_init_su20.txt and hw3_probRatingGivenType_init_su20.txt, and use them to initialize the probabilities P(Y = i) and P(R_j = 1 | Y = i) for a model with k = 4 types of movie-goers. Run 128 iterations of the EM algorithm, computing the (normalized) log-likelihood
$$\mathcal{L} \;=\; \frac{1}{T} \sum_{t=1}^{T} \log P\Big(\big\{R_j = r_j^{(t)}\big\}_{j \in \Omega_t}\Big)$$
at each iteration. Your log-likelihood should increase (i.e., become less negative) at each iteration.

Fill in the following table, using the already provided entries to check your work:
iteration    log-likelihood L
0            -33.5120
1            -16.7113
2
4
8
16           -12.8396
32
64
128
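Putting the pieces together, a driver for part (e) might look like the hedged sketch below; it reuses the ratings array and the e_step/m_step sketches from above, and it assumes the two initialization files load (e.g., via np.loadtxt) as a length-4 vector and a 4×60 matrix respectively, so check the actual file layout before relying on it.

```python
import numpy as np

def log_likelihood(ratings, prob_type, prob_rating):
    """(1/T) sum_t log P(observed ratings of user t), via a log-sum-exp over types."""
    seen = ~np.isnan(ratings)
    r = np.nan_to_num(ratings)
    T, k = ratings.shape[0], prob_type.shape[0]
    log_joint = np.empty((T, k))                           # log [ P(Y=i) * prod_j P(R_j=r_j | Y=i) ]
    for i in range(k):
        log_like = r * np.log(prob_rating[i]) + (1 - r) * np.log(1.0 - prob_rating[i])
        log_joint[:, i] = np.log(prob_type[i]) + np.where(seen, log_like, 0.0).sum(axis=1)
    m = log_joint.max(axis=1, keepdims=True)
    return np.mean(m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1)))

prob_type = np.loadtxt("hw3_probType_init_su20.txt")               # assumed shape (4,)
prob_rating = np.loadtxt("hw3_probRatingGivenType_init_su20.txt")  # assumed shape (4, 60)
for it in range(129):
    if it in (0, 1, 2, 4, 8, 16, 32, 64, 128):
        print(it, round(log_likelihood(ratings, prob_type, prob_rating), 4))
    if it < 128:                                           # exactly 128 EM updates in total
        rho = e_step(ratings, prob_type, prob_rating)
        prob_type, prob_rating = m_step(ratings, rho, prob_rating)
```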
Note: Choosing k = 4 is somewhat arbitrary, but it is based on assuming that there are a small number of fundamentally different movie-goer types. Moreover, the initial values for P (Rj = 1|Y = i) in hw3 probRatingGivenType init su20.txt have been chosen randomly. We provide these to you so that this HW unfolds in a certain manner and you’ll get the same results we got. If you are interested in exploring further after completing this assignment, feel free to try out other initializations and other values of k.
(f) User categorization
Using the formula from part (c) along with the parameters you estimated in part (e), evaluate the
probability distribution over movie-goer types (i.e., over Y ) given the following ratings:
"Tron"                                                recommended
"Frozen"                                              recommended
"Star Trek Beyond"                                    NOT recommended
"Jumanji: Welcome to the Jungle"                      NOT recommended
"It"                                                  NOT recommended
"Batman v Superman: Dawn of Justice"                  NOT recommended
"The Dark Knight"                                     NOT recommended
"The Lord of the Rings: The Fellowship of the Ring"   recommended
"Black Panther"                                       recommended
"The Avengers"                                        recommended
"Avengers: Infinity War"                              recommended
all others                                            not seen
Include either a table or a plot of P(Y = i | those ratings) for i ∈ {1, ..., k}.
What value of i ∈ {1, . . . , k} maximizes the value of P(Y = i | those ratings)? (In other words, which of the k types of movie-goers does this user most closely match?)

(g) User movie recommendations
Next, compute this user’s expected ratings on the movies they haven’t yet seen. You can do this by applying the following formula for each movie l that they haven’t seen:
$$P(R_l = 1 \mid \text{their ratings}) \;=\; \sum_{i=1}^{k} P(Y = i \mid \text{their ratings})\, P(R_l = 1 \mid Y = i)$$
What are the top ten recommended movies? (Print out the first ten movies in the list of these (unseen) movies sorted by their expected ratings.) Does this list seem different from the list in part (a)? Hopefully it does (although our data set is obviously far smaller and more incomplete than the data sets at actual movie streaming companies, not to mention the fact that this particular data set was randomly generated).
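As a final sketch (continuing the illustrative arrays from earlier parts, with posterior holding the length-k vector P(Y = i | their ratings) from part (f) and user_ratings holding this user's length-60 rating vector with NaN for unseen movies), the expected ratings and the resulting ranking could be computed as follows:

```python
import numpy as np

# Expected rating for every movie: sum_i P(Y=i | their ratings) * P(R_l=1 | Y=i).
expected = posterior @ prob_rating
unseen = np.isnan(user_ratings)                            # only rank movies this user has not seen
ranked = np.argsort(-np.where(unseen, expected, -np.inf))  # seen movies sort to the very end
for l in ranked[:10]:
    print(f"{expected[l]:.3f}  {titles[l]}")
```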
Note: If you are interested in exploring further after completing this assignment, feel free to try out placing your own ratings in and re-doing parts (f) and (g).
(h) Source code
Include your source code for this problem as part of your Gradescope submission. As usual, you may program in the language of your choice.