Last Modified: May 17, 2021
CS 179: Introduction to Graphical Models: Spring 2021
Homework 5
Due Date: Wednesday, May 26th
The submission for this homework should be a single PDF file containing all of the relevant code, figures, and any text explaining your results. When coding your answers, try to write functions to encapsulate and reuse code, instead of copying and pasting the same code multiple times. This will not only reduce your programming efforts, but also make it easier for us to understand and give credit for your work. You may also use any functions in the pyGMs toolbox to assist you. Show and explain the reasoning behind your work!

Background & Data
In this homework, we will explore learning a graphical model’s parameters and structure from training data, and comparing two different models’ fits.
Download the data set associated with the homework (“179-hw5-riskdata.csv”, linked from the homework assignment). These data are surveyed information about risk factors for certain health outcomes (heart disease, diabetes, etc.). The data are from the 2011 Behavioral Risk Factor Surveillance System (BRFSS) survey, run annually by the Centers for Disease Control (CDC), and were distilled by Prof. Tom Fletcher (U. Utah) into the provided spreadsheet. The variables and their meanings are as follows:
(a) income – Annual personal income level.
    1 (<$10,000)    2 ($10,000-$15,000)    3 ($15,000-$20,000)    4 ($20,000-$25,000)
    5 ($25,000-$35,000)    6 ($35,000-$50,000)    7 ($50,000-$75,000)    8 (>$75,000)
(b) smoke – Smoked 100 or more cigarettes in lifetime. 1 (yes) 2 (no)
(c) cholesterol – Has high cholesterol. 1 (yes) 2 (no)
(d) bmi – Body mass index (category). 1 (underweight) 2 (normal) 3 (overweight) 4 (obese)
(e) exercise – Exercised in past 30 days. 1 (yes) 2 (no)
(f) attack – Had a heart attack. 1 (yes) 2 (no)
(g) bp – Has high blood pressure. 1 (yes) 2 (only when pregnant) 3 (no) 4 (pre-hypertensive)
(h) angina – Had heart disease (angina). 1 (yes) 2 (no)
(i) stroke – Had a stroke. 1 (yes) 2 (no)
(j) diabetes – Had diabetes. 1 (yes) 2 (only during pregnancy) 3 (no) 4 (pre-diabetic)
Maximum Likelihood for Bayesian Networks
(a) Load the data into Python:
    import numpy as np
    data = np.genfromtxt('RiskFactorData.csv', delimiter=",", names=True)
    data

convert it into zero-based, integer indices:

    data_int = np.array([list(xj) for xj in data], dtype=int) - 1
and split it into training and validation data sets:
    nTrain = int(.75*len(data_int))
    train = data_int[:nTrain]
    valid = data_int[nTrain:]

◦ (Just show your code for this part.)

(b) Estimate the probabilities required for the following Bayesian network, using maximum likelihood (i.e., the empirical probability estimates of the parameters) on the training data. Note: the order of the variables in the file (and in the list above) is not a topological order for this Bayesian network, so be careful when defining your factors and variables. You may want to define the variable IDs to be a topological order of the graph and then re-order the data to match, or just be careful when defining and estimating your conditional probabilities.
[Figure: hand-designed Bayesian network over the ten variables income, exercise, smoke, bmi, bp, chol, diab, stroke, attack, and angina.]
◦ Show the tables you learned for p(Income) and p(Exercise|Income).
◦ What is the total number of probabilities you need to estimate for this network? (Only count “free” probabilities, i.e., do not count any probability parameters that are deterministically known given the others due to the fact that the probabilities are normalized to sum to one.)
◦ In contrast, what is the total number of probabilities in the full joint distribution?
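For example, here is a minimal sketch of the kind of counting involved, assuming train is the zero-based integer training array from part (a), with columns in the order listed above (so income is column 0 and exercise is column 4); the helper names are illustrative only, not part of the assignment:

    import numpy as np

    def ml_marginal(X, i, card):
        # Empirical (maximum likelihood) estimate of p(X_i) from data array X.
        counts = np.bincount(X[:, i], minlength=card).astype(float)
        return counts / counts.sum()

    def ml_conditional(X, i, j, card_i, card_j):
        # Empirical estimate of p(X_i | X_j); rows indexed by x_j, columns by x_i.
        counts = np.zeros((card_j, card_i))
        for xj, xi in zip(X[:, j], X[:, i]):
            counts[xj, xi] += 1
        # Rows whose parent value never appears give nan here -- the zero-count
        # issue revisited in part (e).
        return counts / counts.sum(axis=1, keepdims=True)

    p_income = ml_marginal(train, 0, 8)                          # p(Income)
    p_exercise_given_income = ml_conditional(train, 4, 0, 2, 8)  # p(Exercise | Income)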
(c) Compute the average log-likelihood on the training data,
\[
\frac{1}{m} \sum_{j=1}^{m} \log p\!\left(x^{(j)}\right) \;=\; \frac{1}{m} \sum_{j=1}^{m} \sum_{i} \log p\!\left(x_i^{(j)} \,\middle|\, x_{\mathrm{pa}_i}^{(j)}\right)
\]
of the model you have estimated on the m observations in the data set.
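As a rough sketch of one way to organize this computation: assume each conditional probability table is stored as a numpy array whose leading axes index the parent values and whose last axis indexes the child value, together with the data-column indices of the child and its parents (this storage convention and these names are assumptions, not requirements):

    import numpy as np

    def avg_log_likelihood(X, cpts):
        # X: (m, n) array of zero-based data.
        # cpts: list of (child_col, parent_cols, table) triples, where
        #   table[parent values..., child value] = p(child | parents).
        total = 0.0
        for child, parents, table in cpts:
            # Look up the CPT entry for every sample's observed parent/child configuration.
            idx = tuple(X[:, p] for p in parents) + (X[:, child],)
            total += np.sum(np.log(table[idx]))
        return total / X.shape[0]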
(d) Compute the average log-likelihood on the validation data. You should find this is negative infinity, meaning that some validation data have zero probability under our maximum likelihood model!
(e) One way to fix such estimates is to use a prior (or equivalently, a regularization term) in the parameter estimation. A Dirichlet prior Dir(α) is conjugate to the discrete distribution, and so can be interpreted as adding “pseudocounts” αi to each outcome i. Change your probability estimates to include this prior, using αi = 1. (Equivalently, in each estimated probability, imagine that you saw every possible outcome once, plus the number you actually saw in the training data.) After re-estimating your probabilities, print out the log-likelihood of your training data and validation data on the new model.
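For instance, the conditional-probability estimate from the earlier sketch could be modified along these lines (again just a sketch; alpha = 1.0 adds one pseudocount to every parent/child configuration):

    import numpy as np

    def smoothed_conditional(X, i, j, card_i, card_j, alpha=1.0):
        # Estimate p(X_i | X_j) with a Dirichlet prior: every outcome starts
        # with alpha pseudocounts instead of zero.
        counts = np.full((card_j, card_i), alpha, dtype=float)
        for xj, xi in zip(X[:, j], X[:, i]):
            counts[xj, xi] += 1
        return counts / counts.sum(axis=1, keepdims=True)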
Learning Graph Structure
Now, we will estimate the maximum likelihood tree-structured model using the Chow-Liu algorithm described in class.
(a) Estimate the single-variable and pairwise marginal probability distributions $\hat{p}(X_i)$, $\hat{p}(X_i, X_j)$ for each pair of variables in the model. (Again, regularize your probability estimates by adding α = 1 to each outcome.) Then, for each pair of variables, compute the empirical mutual information:
\[
\hat{I}(X_i, X_j) \;=\; \sum_{x_i, x_j} \hat{p}(x_i, x_j) \,\log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\,\hat{p}(x_j)} \,.
\]
Display your results.
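One possible way to organize this computation is sketched below, assuming train from the previous section with columns in the order listed in the Background & Data section; the cardinality list and helper names are assumptions made for illustration:

    import numpy as np

    card = [8, 2, 2, 4, 2, 2, 4, 2, 2, 4]   # income, smoke, chol, bmi, exercise, attack, bp, angina, stroke, diabetes
    n = len(card)

    def pairwise_marginal(X, i, j, alpha=1.0):
        # Smoothed empirical estimate of p(X_i, X_j), with alpha pseudocounts per outcome pair.
        counts = np.full((card[i], card[j]), alpha, dtype=float)
        for xi, xj in zip(X[:, i], X[:, j]):
            counts[xi, xj] += 1
        return counts / counts.sum()

    MI = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            pij = pairwise_marginal(train, i, j)
            pi, pj = pij.sum(axis=1), pij.sum(axis=0)   # single-variable marginals of the smoothed pairwise table
            MI[i, j] = MI[j, i] = np.sum(pij * np.log(pij / np.outer(pi, pj)))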
(b) Now, estimate the maximum likelihood tree model structure by finding a maximum-weight spanning tree of the graph, where the weight for edge (i, j) is equal to the empirical mutual information $\hat{I}(X_i, X_j)$. Sketch or otherwise display the resulting tree. Note that the identified structure may be quite different from the previous, hand-designed structure (which is not a tree). How many parameters (probabilities) do you need to estimate for this network? (Count only “free” parameters, e.g., not including any that are constrained to be equal to some function of other, already counted parameters.)
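One way to find a maximum-weight spanning tree is to negate the edge weights and hand the problem to an off-the-shelf minimum spanning tree routine, as in the sketch below (assuming scipy is available and MI is the mutual-information matrix from part (a); implementing Kruskal's or Prim's algorithm directly works just as well):

    from scipy.sparse.csgraph import minimum_spanning_tree

    # Maximizing total mutual information is the same as minimizing the total negated weights.
    # (With the alpha = 1 smoothing, every pairwise MI is strictly positive, so no edge
    #  is mistaken for a "missing" zero entry.)
    mst = minimum_spanning_tree(-MI)
    tree_edges = list(zip(*mst.nonzero()))   # (i, j) index pairs of the edges in the selected tree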
For displaying the tree, you may use, e.g., pyGMs.draw.drawMarkovGraph, in which case it may be helpful to use the positions of the variables in the homework drawing:
    pos = {Income: (2,3), Exercise: (1,2), Smoke: (3,2), BMI: (0,1), BP: (2,1),
           Chol: (3,1), Diab: (0,0), Stroke: (1,0), Attack: (2,0), Angina: (3,0)}

where e.g. “Income” is either the pyGMs variable corresponding to income, or its integer ID. (Please display the text labels in the graph, or otherwise make clear which node corresponds to which variable, to simplify checking your solution.)
(c) Compute the average log-likelihood of the resulting tree-structured model on both the training and validation data. (You should find, in this particular case, that the best tree actually fits better than the hand-designed model, although its structure is probably less “causally meaningful”.)
(d) Since the two models have significantly different complexities (different numbers of parameters), we should penalize for complexity when comparing their training scores. (Penalized likelihood can be used as a substitute for validation data likelihood when the latter is not available.)
Adjust both training likelihoods by including a BIC penalty, by computing
\[
\frac{1}{m}\,\ell \;-\; \frac{R}{2m}\,\log(m)
\]
where ℓ is the training log-likelihood, R is the number of (free) parameters in each model, and m is the size of the training set. Which model is preferred under this measure of quality?
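A small sketch of this comparison, assuming the average training log-likelihoods and free-parameter counts computed above (the variable names are placeholders):

    import numpy as np

    def bic_score(avg_train_ll, R, m):
        # Average training log-likelihood minus the BIC complexity penalty (R / 2m) * log(m).
        return avg_train_ll - R * np.log(m) / (2.0 * m)

    # e.g., compare the hand-designed network with the Chow-Liu tree:
    #   bic_score(avg_ll_bayesnet, R_bayesnet, nTrain)  vs.  bic_score(avg_ll_tree, R_tree, nTrain)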