Last Modified: May 7, 2020
CS 179: Introduction to Graphical Models, Spring 2020
Homework 4
Due Date: Tuesday, May 19th
The submission for this homework should be a single PDF file containing all of the relevant code, figures, and any text explaining your results. When coding your answers, try to write functions to encapsulate and reuse code, instead of copying and pasting the same code multiple times. This will not only reduce your programming efforts, but also make it easier for us to understand and give credit for your work. Show and explain the reasoning behind your work!
A Jupyter notebook with some helpful code structure has been provided; you may want to use it as a starting point.
Background & Data
In this homework, we will explore learning a graphical model from training data. Download the data set associated with the homework (.csv format, linked from the homework). These data are surveyed information about risk factors for certain health outcomes (heart disease, diabetes, etc.). The data are from the 2011 Behavioral Risk Factor Surveillance System (BRFSS) survey, run annually by the Centers for Disease Control (CDC), and were distilled by Prof. Tom Fletcher (U. Utah) into the spreadsheet RiskFactorData.csv. The variables and their meanings are as follows:
(a) income – Annual personal income level.
    1 (<$10,000)  2 ($10,000-$15,000)  3 ($15,000-$20,000)  4 ($20,000-$25,000)
    5 ($25,000-$35,000)  6 ($35,000-$50,000)  7 ($50,000-$75,000)  8 (>$75,000)
(b) smoke – Smoked 100 or more cigarettes in lifetime. 1 (yes) 2 (no)
(c) cholesterol – Has high cholesterol. 1 (yes) 2 (no)
(d) bmi – Body mass index (category). 1 (underweight) 2 (normal) 3 (overweight) 4 (obese)
(e) exercise – Exercised in past 30 days. 1 (yes) 2 (no)
(f) attack – Had a heart attack. 1 (yes) 2 (no)
(g) bp – Has high blood pressure. 1 (yes) 2 (only when pregnant) 3 (no) 4 (pre-hypertensive)
(h) angina – Had heart disease (angina). 1 (yes) 2 (no)
(i) stroke – Had a stroke. 1 (yes) 2 (no)
(j) diabetes – Had diabetes. 1 (yes) 2 (only during pregnancy) 3 (no) 4 (pre-diabetic)
Maximum Likelihood for Bayesian Networks
(a) Load the data into Python:
    import numpy as np
    data = np.genfromtxt('RiskFactorData.csv', delimiter=',', names=True)
    data
convert it into zero-based, integer indices:
    data_int = np.array([list(xj) for xj in data], dtype=int) - 1
and split it into training and validation data sets:
    nTrain = int(.75 * len(data_int))
    train = data_int[:nTrain]
    valid = data_int[nTrain:]

(b) Estimate the probabilities required for the following Bayesian network, using maximum likelihood (i.e., the empirical probability estimates of the parameters) on the training data. Show the tables you learned for p(Income) and p(Exercise|Income).
[Figure: the hand-designed Bayesian network over the nodes income, exercise, smoke, chol, bmi, bp, angina, diab, stroke, and attack.]
What is the total number of probabilities you need to estimate for this network? (Only count "free" probabilities, i.e., do not count any probability parameters that are deterministically known given the others due to the fact that the probabilities are normalized to sum to one.) In contrast, what is the total number of probabilities in the full joint distribution?
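A minimal sketch of one way to organize the counting is below; the helper name cond_prob_table is illustrative, not required, and it assumes the CSV header names match the variable list above (so the column order can be read from data.dtype.names):

    def cond_prob_table(data, child, parents, sizes, alpha=0.0):
        """Estimate p(child | parents) from zero-based integer data.
        alpha = 0 gives maximum likelihood; alpha > 0 adds pseudocounts (part (e)).
        Note: with alpha = 0, a parent configuration never seen in the data
        produces 0/0 = nan entries in its row."""
        shape = [sizes[p] for p in parents] + [sizes[child]]
        counts = np.full(shape, alpha, dtype=float)
        for row in data:
            counts[tuple(row[p] for p in parents) + (row[child],)] += 1
        return counts / counts.sum(axis=-1, keepdims=True)  # normalize over child states

    cols = {name: k for k, name in enumerate(data.dtype.names)}  # column order from the header
    sizes = data_int.max(axis=0) + 1                             # number of states per variable

    p_income = cond_prob_table(train, cols['income'], [], sizes)
    p_exercise = cond_prob_table(train, cols['exercise'], [cols['income']], sizes)
    print(p_income)
    print(p_exercise)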
(c) Compute the average log-likelihood on the training data,

$$\frac{1}{m} \sum_{j=1}^{m} \log p\big(x^{(j)}\big) \;=\; \frac{1}{m} \sum_{j=1}^{m} \sum_{i} \log p\big(x_i^{(j)} \,\big|\, x_{\mathrm{pa}(i)}^{(j)}\big)$$

of the model you have estimated on the m observations in the data set.
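One possible way to evaluate this, reusing the hypothetical cond_prob_table helper above and representing the network as a list of (child, parents, table) triples:

    def avg_log_likelihood(data, model):
        """Average log-likelihood per data point of a Bayesian network;
        model is a list of (child, parents, table) triples."""
        total = 0.0
        for row in data:
            for child, parents, table in model:
                with np.errstate(divide='ignore'):  # log(0) = -inf, expected in part (d)
                    total += np.log(table[tuple(row[p] for p in parents) + (row[child],)])
        return total / len(data)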
(d) Compute the average log-likelihood on the validation data. You should find this is negative infinity, meaning that some validation data have zero probability under our maximum likelihood model!
(e) One way to fix such estimates is to use a prior (or equivalently, a regularization term) in the parameter estimation. A Dirichlet prior Dir(α) is conjugate to the discrete distribution, and so can be interpreted as adding “pseudocounts” αi to each outcome i. Change your probability estimates to include this prior, using αi = 1. (Equivalently, in each estimated probability, imagine that you saw every possible outcome once, plus the number you actually saw in the training data.) After re-estimating your probabilities, print out the log-likelihood of your training data and validation data on the new model.
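Continuing the sketch above, re-estimation only requires changing alpha. Here `structure` is a hypothetical list of (child, parent-list) pairs that you would fill in by reading the edges off the network figure:

    # e.g. structure = [(cols['income'], []), (cols['exercise'], [cols['income']]), ...]
    model_reg = [(c, pa, cond_prob_table(train, c, pa, sizes, alpha=1.0))
                 for (c, pa) in structure]
    print(avg_log_likelihood(train, model_reg))
    print(avg_log_likelihood(valid, model_reg))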
Learning Graph Structure
Now, we will estimate the maximum likelihood tree structured model using the Chow-Liu algorithm described in class.
(a) Estimate the single-variable and pairwise marginal probability distributions $\hat{p}(X_i)$, $\hat{p}(X_i, X_j)$ for each pair of variables in the model. (Again, regularize your probability estimates by adding α = 1 to each outcome.)
Then, for each pair of variables, compute the empirical mutual information:
$$\hat{I}(X_i, X_j) \;=\; \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\,\hat{p}(x_j)}.$$
Display your results.
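A minimal sketch, counting each smoothed pairwise joint and taking its marginals (the helper name, and the choice to derive the marginals from the smoothed joint rather than smoothing them separately, are assumptions):

    def mutual_information(data, i, j, sizes, alpha=1.0):
        """Empirical mutual information between columns i and j, with pseudocounts."""
        counts = np.full((sizes[i], sizes[j]), alpha, dtype=float)
        for row in data:
            counts[row[i], row[j]] += 1
        pij = counts / counts.sum()             # smoothed pairwise joint
        pi = pij.sum(axis=1, keepdims=True)     # marginal p(x_i)
        pj = pij.sum(axis=0, keepdims=True)     # marginal p(x_j)
        return np.sum(pij * np.log(pij / (pi * pj)))

    n = train.shape[1]
    MI = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            MI[i, j] = MI[j, i] = mutual_information(train, i, j, sizes)
    print(np.round(MI, 4))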
(b) Now, estimate the maximum likelihood tree model structure, by finding a maximum-weight spanning tree of the graph, where the weight for edge (i, j) is equal to the empirical mutual information $\hat{I}(X_i, X_j)$. Sketch or otherwise display the resulting tree. Note that the identified structure may be quite different from the previous, hand-designed structure (which is not a tree). How many parameters (probabilities) do you need to estimate for this network? (Count only "free" parameters, e.g., not including any that are constrained to be equal to some function of other, already counted parameters.)
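If SciPy is available, one way to get the maximum-weight tree is to negate the weights and use scipy.sparse.csgraph.minimum_spanning_tree (the MI matrix from the previous sketch is assumed):

    from scipy.sparse.csgraph import minimum_spanning_tree

    # scipy finds a *minimum* spanning tree, so negate the (positive) MI weights;
    # zero entries are treated as missing edges, which is fine since smoothed MI > 0
    mst = minimum_spanning_tree(-MI)
    edges = list(zip(*mst.nonzero()))
    print(edges)   # the n-1 undirected edges (i, j) of the tree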
(c) Compute the average log-likelihood of the resulting tree-structured model on the data, on both training and validation data. Are they higher or lower than the log-likelihood of the hand-designed model?
(d) Since the two models have significantly different complexities (different numbers of parameters), we should penalize for complexity when comparing their training scores. (Penalized likelihood can be used as a substitute for validation data likelihood when the latter is not available.)
Adjust both training likelihoods by including a BIC penalty, by computing

$$\frac{1}{m} \sum_{j=1}^{m} \log p\big(x^{(j)}\big) \;-\; \frac{R}{2m} \log(m)$$
where R is the number of (free) parameters in each model, and m is the size of the training set. Which model is preferred under this measure of quality?
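A small helper for the comparison, reusing the earlier hypothetical avg_log_likelihood; here model_hand, model_tree and the parameter counts R_hand, R_tree stand in for your answers to the earlier parts:

    def bic_score(avg_ll, R, m):
        """Penalized average log-likelihood: avg_ll - (R / (2m)) * log(m)."""
        return avg_ll - R / (2.0 * m) * np.log(m)

    m = len(train)
    print(bic_score(avg_log_likelihood(train, model_hand), R_hand, m))
    print(bic_score(avg_log_likelihood(train, model_tree), R_tree, m))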