T3-2020 Exam Solution¶
COMP9418 – Advanced Topics in Statistical Machine Learning
Before proceeding, please read and acknowledge the following (double-click on this cell and put an X
between the brackets [X]
):
- [ ] I acknowledge that I will complete all of the work I submit for this exam without assistance from anyone else.
Instructions¶
- This exam will last for 24 hours, starting 7/12/2020 at 00:00:01 AEST and ending 7/12/2020 at 23:59:59 AEST.
- Questions will be answered from 9 am to 9 pm AEST.
- You must provide all answers in this Jupyter notebook.
- You must use the cells provided to answer the questions. Use markdown cells for textual answers and code cells for programming answers.
- Submit this exam by give (command line or WebCMS) before the deadline. If WebCMS submission is slow, or if you are submitting in the last hour, please submit using the give command on the CSE servers (via VLAB or ssh).
The appropriate command is give cs9418 exam *.ipynb. We will not accept late submissions.
- The exam will have three parts: multiple choice questions (20%); questions that require a textual answer (50%); and programming questions in Python (30%).
- This exam is an open book exam. You are permitted to access papers and books as well as the course materials, including slides and solved tutorials.
- You are not permitted to communicate (email, phone, message, talk, etc.) with anyone during the exam, except COMP9418 staff via email or forum.
- Do not communicate your exam answers after you finish your exam. Some students may have extended time to complete the exam.
- Do not place your exam work in any location accessible to any other person, such as Dropbox and Github.
- Ensure that no other person in your household can access your work.
- Do not disclose your zpass to any other person. If you have revealed your zpass, you should change it immediately.
- We will refer deliberate violations of exam conditions to Student Integrity as serious misconduct.
- This exam has nine questions. The total number of marks is 100.
- Type your student number and name on the next cell.
Part 1 [20 marks]¶
Part 1 is composed of four multiple-choice questions of five marks each. To answer each question, double-click the cell with the alternatives and write an X
between the [ ]
of the chosen option.
This is an example before inserting X
- [ ] Alternative one
- [ ] Alternative two
This is an example after inserting X
- [X] Alternative one
- [ ] Alternative two
For all four questions, choose only one among the alternatives.
Question 1 [5 marks]¶
Log-probability is the term we use to denote the value of $\log p$ for a probability $p$. The main use of log-probabilities is to avoid underflows that may occur when we multiply a large number of probabilities together. In this case, the multiplication $\prod_i p_i$ becomes a summation since $\sum_i \log p_i = \log \prod_i p_i$. We also note that maximisation is a valid operation on log-probabilities since $\log \max(p_1,…,p_n) = \max(\log p_1, …, \log p_n)$. The use of log-probabilities is a standard technique in the implementation of several algorithms such as Viterbi and MPE-VE. Still, it is not usually used in others such as Variable Elimination (VE) and MAP-VE. Why?
- [ ] The Viterbi and MPE-VE are algorithms that involve the multiplication of a large number of probabilities and, therefore, justify the use of log-probabilities.
- [X] The Viterbi and MPE-VE algorithms require multiplication and maximisation of probabilities. However, VE and MAP-VE may also involve sums of probabilities, which is an operation without a counterpart in the log-probability representation.
- [ ] The Viterbi and MPE-VE algorithms are algorithms in which we are interested in the assignment with maximum probability, instead of the value of the probabilities. Therefore, in these algorithms, we can use log-probabilities since we do not need to report the probabilities.
- [ ] The Viterbi is a particular case of MPE-VE. In turn, MPE-VE is a specific case of MAP-VE. All these algorithms are specialisations of the VE algorithm. Therefore, these are essentially the same algorithm. It makes no difference if we use log-probabilities or probabilities in all of them.
- [ ] None of the above alternatives is correct.
Explanation¶
The marginalisation operation involves summing probabilities. If we use the log-probability representation, we cannot simply sum the log-probabilities since $\sum_i p_i \neq \exp(\sum_i \log p_i)$. Therefore, the marginalisation operation would require conversion back to probabilities.
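A minimal illustration of this point (a sketch, not part of the original solution; it assumes NumPy and SciPy are available):
import numpy as np
from scipy.special import logsumexp

p = np.full(1000, 0.01)              # a thousand small probabilities
print(np.prod(p))                    # 0.0 -> the product underflows
print(np.sum(np.log(p)))             # about -4605.2 -> the log of the product, no underflow

# maximisation commutes with log, so Viterbi/MPE-VE can work entirely in log space
print(np.log(max(0.2, 0.5)), max(np.log(0.2), np.log(0.5)))

# summation (marginalisation) has no direct counterpart: we need a log-sum-exp
print(np.log(0.2 + 0.5), logsumexp(np.log([0.2, 0.5])))   # equal, but exponentiation happens internally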
Question 2 [5 marks]¶
During the course, we covered a multitude of inference algorithms, from the most straightforward Variable Elimination (VE) to more sophisticated ones such as Iterative Joingraph Propagation (IJGP) and Gibbs sampling. Select the correct alternative regarding the inference algorithms.
- [X] In the course, we covered two exact inference algorithms: variable elimination and the jointree algorithm. The efficiency of these algorithms is tightly coupled. If the best VE order has width $w$, then it is guaranteed the best jointree will also have width $w$. Similarly, if the width of the best jointree is $w$, then the best VE order has also width $w$.
- [ ] VE is a simple algorithm, but its complexity is exponential for both best and worst cases. Therefore, this algorithm does not scale to large networks. With an extensive network with hundreds of nodes, it is guaranteed VE will not provide answers in feasible time and we will need to rely on more efficient algorithms, such as approximate inference with sampling.
- [ ] We can use the Chebyshev and Hoeffding bounds to compute the number of samples necessary for inference with sampling algorithms. Those bounds are accurate independently of the sampling algorithm: Forward, Rejection, Gibbs sampling, as well as Likelihood Weighting.
- [ ] Joingraphs are similar to Jointrees but with relaxed constraints. The inference algorithms for both structures are also very similar. While the inference algorithm for Jointrees converges in a single iteration, the IJGP is guaranteed to converge in one or more iterations.
- [ ] None of the above alternatives is correct.
Explanation¶
We can convert an elimination order of width $w$ into a jointree with the same width. Also, we can convert a jointree of width $w$ into an elimination order of the same width. Given that the treewidth of a graph is the width of the best elimination order $o^*$, we can convert $o^*$ into a jointree that also has a width equal to the treewidth.
Question 3 [5 marks]¶
In Lecture 16, we studied approaches to learn network structures from data. Choose the incorrect alternative:
- [ ] Learning tree structures is more straightforward than learning graph structures because trees have a fixed number of edges.
- [ ] A limitation of the tree structure learning methods is that they can infer the presence of an edge but do not infer its direction.
- [X] Learning DAG structures is always better than learning tree structures. The addition of new edges to tree structures will transform the tree into a DAG and will never decrease the log-likelihood of the network.
- [ ] Model complexity is a technique to avoid overfitting in DAG structure learning. Model complexity is necessary because the addition of an edge never decreases the log-likelihood of the resulting structure.
- [ ] Optimal search, as the name suggests, can identify the optimal parent set for a given total order of variables. However, the time complexity is exponential; hence the importance of relying on pruning techniques to reduce running time.
Explanation¶
Although DAGs have log-likelihoods equal to or larger than trees, DAG structures are not necessarily better than tree structures. The addition of edges can lead to overfitting and makes the models increasingly more complex. Complex models may require more data to learn and tend to be more costly for inference.
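As a concrete illustration of how complexity can be penalised, one standard choice (an MDL/BIC-style score; the exact form used in the lecture may differ) is
$$\mathrm{score}(G \mid \mathcal{D}) = LL(G \mid \mathcal{D}) - \frac{\log_2 N}{2}\,\dim(G),$$
where $N$ is the number of training examples and $\dim(G)$ is the number of independent parameters of $G$ (the quantity computed in Question 9). Adding an edge can only increase $LL$, but it also increases $\dim(G)$, so the penalised score can decrease.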
Question 4 [5 marks]¶
In Lecture 4, we discussed the concept of I-MAP, D-MAP and P-MAP for Bayesian networks. We can extend this concept to Markov networks, Jointrees and Joingraphs. Also, in Lecture 9, we discussed that our graphical models could be understood as languages to represent independencies. Now, suppose we have a probability distribution $P$ over five variables $A,B,C,D$ and $E$ such that $P(A,B,C,D,E) = \phi(A,B,C)\phi(B,E)\phi(E,D)\phi(C,D)$. Which probabilistic graphical model is a language able to provide a graph that is a P-MAP for the probability distribution $P$?
- [ ] A Bayesian Network.
- [ ] A Markov Network and a Bayesian Network.
- [X] A Joingraph and a Markov Network.
- [ ] A Joingraph, Markov Network and Bayesian Network.
- [ ] A Joingraph, Jointree, Markov Network and Bayesian Network.
- [ ] A Markov Network.
Explanation¶
A Markov network can directly represent this factorisation. From this network, we can derive, using separation, all independency assumptions encoded in the factorisation. They are:
$A \perp D, E | B, C$
$A \perp E | B, D$
$B \perp D | C, E$
$C \perp E | B, D$
Now, we need to find a Bayesian Network, Joingraph or Jointree that is a P-MAP for these independency assumptions. Remember that to be a P-MAP, the PGM must encode only these assumptions, no more or less.
We can find a joingraph that encodes these assumptions, which is not surprising since the joingraph can have the same structure as the Markov network. However, Bayesian networks cannot encode the same assumptions due to the presence of “convergent structures”. Similarly, the jointree cannot represent the same independencies due to its tree shape.
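As a quick sanity check (a sketch, not part of the original solution), we can build the Markov network implied by the factorisation and test a few of these separations with a simple reachability search that skips the conditioning nodes:
# Markov network implied by phi(A,B,C) phi(B,E) phi(E,D) phi(C,D)
mn = {
    'A': ['B', 'C'],
    'B': ['A', 'C', 'E'],
    'C': ['A', 'B', 'D'],
    'D': ['C', 'E'],
    'E': ['B', 'D'],
}

def separated(G, X, Z, Y):
    """True iff every path from a node in X to a node in Y passes through Z."""
    visited, frontier = set(Z), list(X)   # treat Z as already visited, so no path can cross it
    while frontier:
        v = frontier.pop()
        if v in Y:
            return False
        if v in visited:
            continue
        visited.add(v)
        frontier.extend(w for w in G[v] if w not in visited)
    return True

print(separated(mn, ['A'], ['B', 'C'], ['D', 'E']))  # True:  A independent of D,E given B,C
print(separated(mn, ['A'], ['B', 'D'], ['E']))       # True:  A independent of E given B,D
print(separated(mn, ['B'], ['C', 'E'], ['D']))       # True:  B independent of D given C,E
print(separated(mn, ['B'], ['C'], ['D']))            # False: the path B-E-D is not blocked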
Part 2 [50 marks]¶
Part 2 is composed of three open questions. To answer each question, edit the markdown cell after the question statement and insert your answer.
Question 5 [15 marks]¶
As a data scientist, you visit a potential client, a doctor, who poses the following problem to you. They want to use Machine Learning to improve their understanding of cancer patients. All patients in the database were diagnosed with some variation of the disease. The database has information about the patients (such as age, gender, etc.), their medical history (previous and existing conditions) and the disease (cancer type, tumour size, etc.). The doctor has three main requirements:
- They want to understand and validate the model;
- They want the model to incorporate their knowledge about the domain, such as “lung cancer has a high prevalence for smokers”, but the model should also learn other relationships from data;
- They do not have a single query they are interested in; they want to probe the model to find “interesting things”.
Answer:
- [5 Marks] What model would you recommend: generative or discriminative? Briefly explain why.
- [5 Marks] Briefly compare the suitability of two explainable models for this problem: decision trees and Bayesian networks.
- [5 Marks] In the case of Bayesian networks, how would you deal with the requirement of incorporating existing human knowledge in the model? Respond considering the model structure and parameters.
Your answer for 1¶
A generative model is recommended here since the doctor does not have a single query in mind. “Probing” the model means posing many different queries, which involves modelling the joint distribution $P(\textbf{X})$ instead of just $P(Y|\textbf{X})$.
Your answer for 2¶
Decision trees are models that would allow the doctor to “understand and validate the model”. For every classification, we could provide the sequence of decisions (from the root node to the leaf node) that led to that particular prediction.
However, decision trees are classification models, and we cannot easily make them answer different types of queries. The best we can do is to change the class attribute and train a new model. In contrast, Bayesian networks allow posing queries with different attributes as query and evidence variables.
Also, incorporating existing knowledge is not easily done with current implementations of decision trees. The default behaviour is to learn everything from data.
Your answer for 3¶
Bayesian networks can be very flexible in incorporating existing knowledge to learn both structure and parameters.
For instance, in the Chow-Liu algorithm we can force the MST step to include edges provided by the experts. Similarly, the local search algorithms for DAGs can start from an initial (possibly incomplete) structure defined by the doctors and add new edges during the search. Simple modifications can ensure that the search procedure never discards the provided edges (see the sketch below).
Also, automatic structure learning is not good at inferring the direction of edges. However, we can show the learned structure to the experts, and they can set the direction of the edges according to some semantics such as causation.
Similarly, we can learn the parameters from data or use probabilities provided by the expert. This is not as simple as for the structure, since a change in the structure also changes the CPTs, adding or removing variables. The simplest option is to learn the probabilities from data and revise them with the aid of the expert, if necessary.
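A minimal sketch of how required edges could be honoured in a greedy DAG search (the helpers score, creates_cycle and candidate_edges are hypothetical placeholders, not part of the course code): seed the structure with the expert edges and only ever add edges, so the expert edges are never discarded.
def greedy_search_with_expert_edges(variables, expert_edges, candidate_edges, score, creates_cycle):
    """Greedy hill climbing over DAG structures that keeps expert-supplied edges fixed.

    `score(dag)` and `creates_cycle(dag, edge)` are assumed given, e.g. an
    MDL/BIC score and an acyclicity check; `candidate_edges` is the pool of
    edges the search is allowed to add.
    """
    dag = {v: [] for v in variables}
    for u, v in expert_edges:                      # seed the search with expert knowledge
        dag[u].append(v)
    current = score(dag)
    improved = True
    while improved:                                # repeat until no single addition helps
        improved = False
        for u, v in candidate_edges:
            if v in dag[u] or creates_cycle(dag, (u, v)):
                continue                           # skip duplicates and edges that create cycles
            dag[u].append(v)                       # tentatively add the edge
            new = score(dag)
            if new > current:
                current, improved = new, True      # keep the addition
            else:
                dag[u].remove(v)                   # revert it
    return dag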
Question 6 [20 marks]¶
The drive in golf is the first shot in playing a hole. If you drive with a 3-wood (a particular type of golf club), there is a 2% risk of a miss (hitting the ball at the wrong angle, so it goes in the wrong direction), and 1/4 of the drives have a length of 180 m, 1/2 are 200 m, and 1/4 have a length of 220 m. You may also use a driver (another type of golf club). This will increase the length by 20 m, but you will also have three times as high a risk of a miss. Both wind and the slope of the hole may affect the result of the drive. The presence of wind doubles the risk of a miss, and the length is affected by 20 m (longer if the wind is from behind and shorter from the front). A downhill slope yields 20 m longer drives and an uphill slope decreases the length of the drive by 20 m. What is the probability of a miss given a shot greater or equal to 260 m?
- [5 Marks] Show a Bayesian network structure (graph) for this problem. Briefly explain your network. You will have to make reasonable assumptions when constructing your model.
- [5 Marks] Provide a query that solves this problem.
- [10 Marks] Answer the query by solving this as a programming exercise: use the tutorial code to write a program that computes the query. If you do not have information about a certain variable, assume it has a uniform distribution.
# Your answer for 1 - Bayesian network structure
# Define your graph here
graph = {
'Club': ['Miss', 'Length'],
'Wind': ['Miss', 'Length'],
'Slope': ['Length'],
'Miss': [],
'Length': [],
}
# Do not modify this cell, it simply plots the graph above
import graphviz
from graphviz import Digraph
dot = Digraph(engine="neato", comment='Direct graph example')
dot.attr(overlap="false", splines="true")
for v in graph.keys():
dot.node(v)
for v in graph.keys():
for w in graph[v]:
dot.edge(v, w)
dot
Your answer for 1 – Brief explanation¶
The probability of missing a shot is influenced by the club and the presence of wind, while the drive length is influenced by these variables as well as the slope.
Your answer for 2¶
$P(M|L \geq 260) = \frac{P(M,L \geq 260)}{P(L \geq 260)} = \frac{P(M,L=260)+P(M,L=280)}{P(L=260)+P(L=280)}$
Your answer for 3¶
from collections import OrderedDict as odict
from itertools import product, combinations, permutations
from tabulate import tabulate
def prob(factor, *entry):
"""
argument
`factor`, a dictionary of domain and probability values,
`entry`, a list of values, one for each variable in the same order as specified in the factor domain.
Returns p(entry)
"""
if entry in factor['table']:
return factor['table'][entry] # Look up and return the probability of this entry
else:
return 0
def marginalize(f, var, outcomeSpace):
"""
argument
`f`, factor to be marginalized.
`var`, variable to be summed out.
`outcomeSpace`, dictionary with the domain of each variable
Returns a new factor f' with dom(f') = dom(f) - {var}
"""
# Let's make a copy of f domain and convert it to a list. We need a list to be able to modify its elements
new_dom = list(f['dom'])
new_dom.remove(var) # Remove var from the domain of the new factor
table = list() # Create an empty list for the new table; we will fill it in from scratch
for entries in product(*[outcomeSpace[node] for node in new_dom]):
s = 0 # Initialize the accumulator for the sum over var
# We need to iterate over all possible outcomes of the variable var
for val in outcomeSpace[var]:
# To modify the tuple entries, we will need to convert it to a list
entriesList = list(entries)
# We need to insert the value of var in the right position in entriesList
entriesList.insert(f['dom'].index(var), val)
p = prob(f, *tuple(entriesList)) # Calculate the probability of factor f for entriesList
s = s + p # Accumulate the sum over all values of var
# Append a new table entry with the summed-out value
table.append((entries, s))
return {'dom': tuple(new_dom), 'table': odict(table)}
def join(f1, f2, outcomeSpace):
"""
argument
`f1`, first factor to be joined.
`f2`, second factor to be joined.
`outcomeSpace`, dictionary with the domain of each variable
Returns a new factor with a join of f1 and f2
"""
# First, we need to determine the domain of the new factor. It will be union of the domain in f1 and f2
# But it is important to eliminate the repetitions
common_vars = list(f1['dom']) + list(set(f2['dom']) - set(f1['dom']))
# We will build a table from scratch, starting with an empty list. Later on, we will transform the list into a odict
table = list()
# Here is where the magic happens. The product iterator will generate all combinations of variable values
# as specified in outcomeSpace. Therefore, it will naturally respect observed values
for entries in product(*[outcomeSpace[node] for node in common_vars]):
# We need to map the entries to the domain of the factors f1 and f2
entryDict = dict(zip(common_vars, entries))
f1_entry = tuple((entryDict[var] for var in f1['dom']))
f2_entry = tuple((entryDict[var] for var in f2['dom']))
p1 = prob(f1, *f1_entry) # Use the function prob to calculate the probability in factor f1 for entry f1_entry
p2 = prob(f2, *f2_entry) # Use the function prob to calculate the probability in factor f2 for entry f2_entry
# Create a new table entry with the multiplication of p1 and p2
table.append((entries, p1 * p2))
return {'dom': tuple(common_vars), 'table': odict(table)}
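# A small usage example (not part of the original tutorial code) showing the factor
# representation these helpers expect: 'dom' is a tuple of variable names and 'table'
# maps outcome tuples to probabilities.
fA = {'dom': ('A',), 'table': odict([(('0',), 0.3), (('1',), 0.7)])}
fAB = {'dom': ('A', 'B'), 'table': odict([
    (('0', '0'), 0.9), (('0', '1'), 0.1),
    (('1', '0'), 0.4), (('1', '1'), 0.6),
])}
toySpace = {'A': ('0', '1'), 'B': ('0', '1')}
fJ = join(fA, fAB, toySpace)            # factor over (A, B)
fB = marginalize(fJ, 'A', toySpace)     # sum A out: P(B=0) ≈ 0.55, P(B=1) ≈ 0.45
print(dict(fB['table']))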
# Answer
def VE(factors, order, outcomeSpace):
"""
argument
`factors`, a dictionary of factors, each factor is a dictionary of domain and probability values,
`order`, a list of variable names specifying an elimination order,
`outcomeSpace`, a dictionary with variable names and respective domains.
Returns a dictionary with non-eliminated factors
"""
# Let's make a copy of factors, so we can freely modify it without destroying the original dictionary
f = factors.copy()
# We process the factor in elimination order
for i, var in enumerate(order):
# This is the domain of the new factor. We use sets as it is handy to eliminate duplicate variables
newFactorDom = set()
# This is a list of factors that will be removed from f because they were joined with other factors
listFactorsRemove = list()
# This is a flag to indicate if we are processing the first factor
first = True
# Lets iterate over all factors
for f_id in f.keys():
# and select the ones that have the variable to be eliminated
if var in f[f_id]['dom']:
if first:
# We need this code since join requires two factors, so we save the first one in fx and wait for the next
fx = f[f_id]
first = False
else:
# Join fx and f[f_id] and save the result in fx
fx = join(fx, f[f_id], outcomeSpace)
printFactor(fx)
print()
# f_id was joined, so we will need to eliminate it from f later. Let's save that factor id for future removal
listFactorsRemove.append(f_id)
# Now, we need to remove var from the domain of the new factor doing a marginalization
fx = marginalize(fx, var, outcomeSpace)
printFactor(fx)
print()
# Now, we remove all factors that we joined. We do it outside the for loop since it modifies the data structure
for f_id in listFactorsRemove:
del f[f_id]
# We will create a new factor with id equal a sequential number and insert it into f, so it can be used in future joins
f[i] = fx
return f
def printFactor(f):
"""
argument
`f`, a factor to print on screen
"""
# Create an empty list that we will fill in with the probability table entries
table = list()
# Iterate over all keys and probability values in the table
for key, item in f['table'].items():
# Convert the tuple to a list to be able to manipulate it
k = list(key)
# Append the probability value to the list with key values
k.append(item)
# Append an entire row to the table
table.append(k)
# dom is used as table header. We need it converted to list
dom = list(f['dom'])
# Append a 'Pr' to indicate the probability column
dom.append('Pr')
print(tabulate(table,headers=dom,tablefmt='orgtbl'))
def evidence(var, e, outcomeSpace):
"""
argument
`var`, a valid variable identifier.
`e`, the observed value for var.
`outcomeSpace`, dictionary with the domain of each variable
Returns dictionary with a copy of outcomeSpace with var = e
"""
newOutcomeSpace = outcomeSpace.copy() # Make a copy of outcomeSpace
newOutcomeSpace[var] = (e,) # Replace the domain of variable var with a tuple containing the single element e
return newOutcomeSpace
def normalize(f):
"""
argument
`f`, factor to be normalized.
Returns a new factor f' as a copy of f with entries that sum up to 1
"""
table = list()
total = 0
for k, p in f['table'].items():
total = total + p
for k, p in f['table'].items():
table.append((k, p/total))
return {'dom': f['dom'], 'table': odict(table)}
def query(factors, order, outcomeSpace, q_vars, **q_evi):
"""
argument
`factors`, a dictionary of factors
`order`, a list with variable elimination order
`outcomeSpace`, dictionary with variable domains
`q_vars`, list of variables in query head
`q_evi`, dictionary of evidence in the form of variables names and values
Returns a new factor with P(Q, e) or P(Q|e)
"""
# Let's make a copy of these structures, since we will reuse the variable names
outSpace = outcomeSpace.copy()
# First, we set the evidence
for var_evi, e in q_evi.items():
outSpace = evidence(var_evi, e, outSpace)
# Remove the query variables from the elimination order (work on a copy so the caller's list is not modified)
order = [v for v in order if v not in q_vars]
f = VE(factors, order, outSpace)
first = True
for f_id in f.keys():
if first:
# We need this code since join requires two factors, so we save the first one in fx and wait for the next
fx = f[f_id]
first = False
else:
# Join fx and f[f_id] and save the result in fx
fx = join(fx, f[f_id], outSpace)
return fx
outcomeSpace = dict(
C=('3W','driver'),
W=('no','front', 'behind'),
M=('yes','no'),
S=('no', 'down', 'up'),
L=('140','160', '180', '200', '220', '240', '260', '280'),
)
factors = {
'C': {
'dom': ('C',),
'table': odict([
(('3W',), .5),
(('driver',), .5),
])
},
'W': {
'dom': ('W',),
'table': odict([
(('no',), .3333),
(('front',), .3333),
(('behind',), .3333),
])
},
'S': {
'dom': ('S',),
'table': odict([
(('no',), .3333),
(('down',), .3333),
(('up',), .3333),
])
},
'M': {
'dom': ('C', 'W', 'M'),
'table': odict([
(('3W', 'no', 'yes'), .02),
(('3W', 'no', 'no'), .98),
(('3W', 'front', 'yes'), .04),
(('3W', 'front', 'no'), .96),
(('3W', 'behind', 'yes'), .04),
(('3W', 'behind', 'no'), .96),
(('driver', 'no', 'yes'), .06),
(('driver', 'no', 'no'), .94),
(('driver', 'front', 'yes'), .12),
(('driver', 'front', 'no'), .88),
(('driver', 'behind', 'yes'), .12),
(('driver', 'behind', 'no'), .88),
])
},
'L': {
'dom': ('C', 'W', 'S', 'L'),
'table': odict([
(('3W', 'no', 'no', '180'), .25),
(('3W', 'no', 'no', '200'), .50),
(('3W', 'no', 'no', '220'), .25),
(('3W', 'no', 'down', '200'), .25),
(('3W', 'no', 'down', '220'), .50),
(('3W', 'no', 'down', '240'), .25),
(('3W', 'no', 'up', '160'), .25),
(('3W', 'no', 'up', '180'), .50),
(('3W', 'no', 'up', '200'), .25),
(('3W', 'front', 'no', '160'), .25),
(('3W', 'front', 'no', '180'), .50),
(('3W', 'front', 'no', '200'), .25),
(('3W', 'front', 'down', '180'), .25),
(('3W', 'front', 'down', '200'), .50),
(('3W', 'front', 'down', '220'), .25),
(('3W', 'front', 'up', '140'), .25),
(('3W', 'front', 'up', '160'), .50),
(('3W', 'front', 'up', '180'), .25),
(('3W', 'behind', 'no', '200'), .25),
(('3W', 'behind', 'no', '220'), .50),
(('3W', 'behind', 'no', '240'), .25),
(('3W', 'behind', 'down', '220'), .25),
(('3W', 'behind', 'down', '240'), .50),
(('3W', 'behind', 'down', '260'), .25),
(('3W', 'behind', 'up', '180'), .25),
(('3W', 'behind', 'up', '200'), .50),
(('3W', 'behind', 'up', '220'), .25),
(('driver', 'no', 'no', '200'), .25),
(('driver', 'no', 'no', '220'), .50),
(('driver', 'no', 'no', '240'), .25),
(('driver', 'no', 'down', '220'), .25),
(('driver', 'no', 'down', '240'), .50),
(('driver', 'no', 'down', '260'), .25),
(('driver', 'no', 'up', '180'), .25),
(('driver', 'no', 'up', '200'), .50),
(('driver', 'no', 'up', '220'), .25),
(('driver', 'front', 'no', '180'), .25),
(('driver', 'front', 'no', '200'), .50),
(('driver', 'front', 'no', '220'), .25),
(('driver', 'front', 'down', '200'), .25),
(('driver', 'front', 'down', '220'), .50),
(('driver', 'front', 'down', '240'), .25),
(('driver', 'front', 'up', '160'), .25),
(('driver', 'front', 'up', '180'), .50),
(('driver', 'front', 'up', '200'), .25),
(('driver', 'behind', 'no', '220'), .25),
(('driver', 'behind', 'no', '240'), .50),
(('driver', 'behind', 'no', '260'), .25),
(('driver', 'behind', 'down', '240'), .25),
(('driver', 'behind', 'down', '260'), .50),
(('driver', 'behind', 'down', '280'), .25),
(('driver', 'behind', 'up', '200'), .25),
(('driver', 'behind', 'up', '220'), .50),
(('driver', 'behind', 'up', '240'), .25),
])
},
}
L280 = query(factors, ['S', 'C', 'W', 'M', 'L'], outcomeSpace, ('M',), L='280')
L260 = query(factors, ['S', 'C', 'W', 'M', 'L'], outcomeSpace, ('M',), L='260')
print("P(M=yes|L>=260)=",(L260['table'][('yes',)]+L280['table'][('yes',)])/(L260['table'][('yes',)]+L280['table'][('yes',)]+L260['table'][('no',)]+L280['table'][('no',)]))
[Output: the intermediate factor tables printed by VE are omitted here for brevity.]
P(M=yes|L>=260)= 0.09666666666666665
Question 7 [15 marks]¶
In Lecture 15, we discussed techniques to learn parameters from data, and it became evident that learning in the presence of missing data is significantly more expensive than with complete data. In this question, we ask you to analyse the time complexity of the EM algorithm in more detail.
- [10 marks] The Expectation Maximisation (EM) algorithm requires inference on the Bayesian network. Explain how the jointree algorithm can help to improve the running time of the EM algorithm. In particular, queries of the form $P_{\theta^k}(x\textbf{u}|\textbf{d}_i)$ involve setting evidence $\textbf{d}_i$ for every training example $i$. How does it impact the performance of the jointree algorithm in terms of new messages that need to be exchanged?
- [5 marks] It is a common practice to use random restarts with EM since the algorithm is sensitive to the initial parameter estimate $\theta^0$. The idea is to run the algorithm multiple times and return the best estimate. Explain how you would implement random restarts with EM. How would you initialise the algorithm, and how would you select the best estimates?
Your answer for 1¶
The jointree algorithm allows us to compute marginals for the clusters $C_i$. As noted in slide 32, these clusters cover the family marginals of the form $x\textbf{u}$ that appear in the EM queries. Unfortunately, we need to set evidence $\textbf{d}_i$ for each training example, which requires exchanging new messages; for examples that are complete or almost complete, we can expect most of the jointree messages to change. Therefore, a call to exchange messages has to be placed between the two for loops in the algorithm below, which is certainly expensive: the jointree must be updated once per example. On the other hand, after a single round of message passing the jointree lets us compute all the queries $P_{\theta^k}(x\textbf{u}|\textbf{d}_i)$ for that example, one for each family in the network.
For completeness, this is the algorithm of slide 31 modified to incorporate inference with a jointree.
$J \leftarrow$ create_jointree($G$) { create a jointree from the DAG $G$ }<br/>
distribute_factors($G$, $J$) { distribute the factors of $G$ among the clusters of $J$ }<br/>
$k \leftarrow 0$<br/>
$\theta^k \leftarrow$ initial parameter values<br/>
**while** convergence criterion is not met **do**<br/>
$~~~~$ $c_{x\textbf{u}} \leftarrow 0$ for each family instantiation $x\textbf{u}$<br/>
$~~~~$ **for** $i \leftarrow 1$ **to** $N$ **do**<br/>
$~~~~~~~~$ set_evidence($J$, $\textbf{d}_i$) { set evidence in jointree $J$ according to $\textbf{d}_i$ }<br/>
$~~~~~~~~$ exchange_messages($J$, $r$) { exchange messages in jointree $J$ according to root node $r$ }<br/>
$~~~~~~~~$ **for** each family instantiation $x\textbf{u}$ **do**<br/>
$~~~~~~~~~~~~$ $c_{x\textbf{u}} \leftarrow c_{x\textbf{u}} + P_{\theta^k}(x\textbf{u}|\textbf{d}_i)$ { requires inference on network $(G,\theta^k)$, now answered by the jointree $J$ }<br/>
$~~~~$ $\theta_{x|\textbf{u}}^{k+1} \leftarrow c_{x\textbf{u}}/\sum_{x^*}c_{x^*\textbf{u}}$<br/>
$~~~~$ $k \leftarrow k + 1$<br/>
**return** $\theta^k$
Your answer for 2¶
We can start with random initialisations of the parameters, so we provide different starting points for EM. Another possibility is to perform random imputation of the missing values and use the resulting complete datasets as different starting points. Missing-data imputation is a procedure that replaces the missing values with estimates; in this case, the estimates are random values sampled from each variable's outcome space. This second technique has the benefit of providing initial parameter values that naturally respect the probability distribution constraints.
We select as the best estimate the one with the largest log-likelihood.
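A minimal sketch of the restart loop (the helpers random_parameters, run_em and log_likelihood are hypothetical placeholders, not part of the course code):
import random

def em_with_restarts(G, data, outcomeSpace, restarts=10, seed=0):
    """Run EM from several random initialisations and keep the estimate with the best log-likelihood."""
    rng = random.Random(seed)
    best_theta, best_ll = None, float('-inf')
    for _ in range(restarts):
        theta0 = random_parameters(G, outcomeSpace, rng)   # random CPTs whose entries sum to 1
        theta = run_em(G, data, outcomeSpace, theta0)      # EM iterations until convergence
        ll = log_likelihood(G, theta, data)                # training-set log-likelihood of the estimate
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta, best_ll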
Part 3 [30 marks]¶
Part 3 is composed of two programming questions of 15 marks each. Use the code cell after each question statement to enter your answer. You can use the code of the tutorials in your answers.
Question 8 [15 marks]¶
Lecture 9 presented the concept of separation, an independence test for Markov networks. The idea is similar to, but simpler than, d-separation since Markov networks do not have “convergent valves”. The test can be summarized in a single sentence:
Let $\textbf{X}$, $\textbf{Y}$, and $\textbf{Z}$ be three disjoint sets of nodes in a graph $G$. We say that $\textbf{X}$ is separated from $\textbf{Y}$ given $\textbf{Z}$, written $sep_G(\textbf{X},\textbf{Z},\textbf{Y})$, if and only if every path between a node in $\textbf{X}$ and a node in $\textbf{Y}$ passes through a node in $\textbf{Z}$.
An efficient separation test can be implemented by pruning the edges incident to the nodes in $\textbf{Z}$ and testing for connectivity between nodes in $\textbf{X}$ and $\textbf{Y}$.
Implement an efficient separation test for Markov networks. The function separation(G, X, Z, Y)
returns true if $\textbf{X}$ is separated from $\textbf{Y}$ given $\textbf{Z}$ in the graph $G$.
# Write your answer for Question 8 here
def dfs_r(G, x, Y, colour):
"""
argument
`G`, an adjacency list representation of a graph
`x`, next vertex to be visited
`Y`, list of target vertices
`colour`, dictionary with the colour of each node
Returns true if a vertex in Y is reachable from x
"""
if x in Y:
return True
colour[x] = 'grey'
for w in G[x]:
if colour[w] == 'white':
if dfs_r(G, w, Y, colour):
return True
colour[x] = 'black'
return False
def connected(G, X, Y):
"""
argument
`G`, an adjacency list representation of a graph
`X`, list of starting vertices
`Y`, list of target vertices
Returns true if some vertex in Y is reachable from a vertex in X
"""
colour = {node: 'white' for node in G.keys()}
for x in X:
if colour[x] == 'white':
if dfs_r(G, x, Y, colour):
return True
return False
def separation(G, X, Z, Y):
# Prune the outgoing edges of every node in Z: paths may enter Z but can never leave it.
# X is then separated from Y given Z iff X and Y are no longer connected.
G2 = G.copy()
for e in Z:
G2[e] = []
return not connected(G2, X, Y)
# Test code Q8
def time_penalty(t):
if t < 2:
return 0
elif t < 4:
return 1
elif t < 6:
return 2
elif t < 10:
return 3
else:
return 4
def transposeGraph(G):
GT = dict((v, []) for v in G)
for v in G:
for w in G[v]:
GT[w].append(v)
return GT
import numpy as np
res = []
g = {
'A': ['B'],
'B': ['A', 'C', 'D', 'F'],
'C': ['B'],
'D': ['B', 'F'],
'E': ['F'],
'F': ['B', 'D', 'E', 'G'],
'G': ['F'],
}
res.append(separation(g, ['A'], ['D'], ['G']))
res.append(separation(g, ['A'], ['C'], ['G']))
res.append(separation(g, ['A'], ['F'], ['G']))
res.append(separation(g, ['A'], ['D', 'E'], ['G']))
res.append(separation(g, ['A', 'D'], ['C', 'G'], ['E', 'F']))
res.append(separation(g, ['A', 'D'], ['F'], ['E', 'G']))
g = {
0:[1,8],
1:[2,0],
2:[3,1],
3:[4,2],
4:[5,3],
5:[6,4],
6:[7,5],
7:[8,6],
8:[0,7],
}
res.append(separation(g, [0],[],[6]))
res.append(separation(g, [0],[3,8],[6]))
res.append(separation(g, [0],[3],[6]))
res.append(separation(g, [0],[8],[6]))
res.append(separation(g, [1,3,5],[8],[6,2]))
res.append(separation(g, [1,3,5],[6,0],[7,8]))
g = {
0:[1,8],
1:[2,0],
2:[3,1,12],
3:[4,2],
4:[5,3],
5:[6,4],
6:[7,5],
7:[8,6],
8:[0,7,9,11],
9:[10,8],
10:[11,9],
11:[8,10],
12:[2]
}
res.append(separation(g, [0],[],[6]))
res.append(separation(g, [0],[3,8],[6]))
res.append(separation(g, [0],[3],[6]))
res.append(separation(g, [0],[8],[6]))
res.append(separation(g, [1,3,5],[8],[6,2]))
res.append(separation(g, [1,3,5],[6,0],[7,8]))
res.append(separation(g, [4,12,10],[2,5],[6,1]))
res.append(separation(g, [4,12,10],[2,5,11,9],[6,1]))
res.append(separation(g, [4,12,10],[2,5,9],[6,1]))
res.append(separation(g, [9,1],[8,10],[11,7]))
res.append(separation(g, [9,1],[10,8],[11,7]))
res.append(separation(g, [9,1],[10,8,4],[11,7]))
res.append(separation(g, [9,1],[10,8,4,12],[11,7]))
res.append(separation(g, [9,1],[10,8,12],[11,7]))
g = {
0:[1,8],
1:[2,0],
2:[3,1,12],
3:[4,2],
4:[5,3],
5:[6,4],
6:[7,5],
7:[8,6],
8:[0,7,9,11],
9:[10,8],
10:[11,9],
11:[8,10],
12:[2],
'A': ['B'],
'B': ['A', 'C', 'D', 'F'],
'C': ['B'],
'D': ['B', 'F'],
'E': ['F'],
'F': ['B', 'D', 'E', 'G'],
'G': ['F'],
}
res.append(separation(g, ['A'],[],[6]))
res.append(separation(g, ['A'],[1,2,3,4,5,6,7,8,9],['C']))
res.append(separation(g, ['A'],[],[8]))
res.append(separation(g, ['A'],[],[6]))
def make_grid(size):
g = {}
for x in range(size):
for y in range(size):
neighbours = []
for dx,dy in [(1,0),(0,1),(-1,0),(0,-1)]:
if x+dx >= 0 and x+dx < size and y+dy >=0 and y+dy <size:
neighbours.append(str(x+dx)+str(y+dy))
g[str(x)+str(y)] = neighbours
return g
g = make_grid(10)
import time
s = time.time()
res.append(separation(g, ['01','50'],['03','12','22','32','41','52','61','70'],['45','05']))
res.append(separation(g, ['01','50'],['03','12','22','32','41','52','61','40'],['45','05']))
res.append(separation(g, ['01','50'],['03','12','22','32','41','52','61'],['45','05']))
e = time.time()
t8 = e-s
print("Time8", t8)
testResults8 = np.array(res) == [False, False, True, False, False, True, False, True, False, False, False, True, False, True, False, False, False, True, False, True, False, False, False, True, True, False, True, False, True, True, True, False, False]
score8 = max(0, 10*sum(testResults8)/len(testResults8))
if score8 == 10:
# assign 15 marks if perfect score, otherwise assign proportion of test cases passed out of 10
score8 = 15
score8 = max(0,score8-time_penalty(t8))
print("Score8",score8)
# If functions/arguments are passed incorrectly, I removed 4 marks manually
Time8 0.00024580955505371094
Score8 15
Question 9 [15 marks]¶
In Lecture 16, we discussed the importance of measuring model complexity. A common way to do so is to count the number of independent parameters in a Bayesian network.
In this exercise, you will write a function dimension(G, outcomeSpace)
which takes a DAG G
and its associated outcomeSpace
, and returns the dimension of G
.
# Write your answer for Question 9 here
def transposeGraph(G):
GT = dict((v, []) for v in G)
for v in G:
for w in G[v]:
GT[w].append(v)
return GT
def dimension(G, outcomeSpace):
"""
argument
`G`, an adjacency list representation of a graph
`outcomeSpace`, dictionary with the domain of each variable
Returns the dimension of G (number of independent parameters)
"""
d = 0
GT = transposeGraph(G)
for x in GT:
dx = len(outcomeSpace[x])-1
for u in GT[x]:
dx *= len(outcomeSpace[u])
d += dx
return d
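# For reference, the function above computes dim(G) = sum over X of (|X|-1) * prod over U in Pa(X) of |U|.
# A tiny worked example (not part of the test cases below): the collider A -> C <- B with
# binary variables has dimension (2-1) + (2-1) + (2-1)*2*2 = 6.
tiny_graph = {'A': ['C'], 'B': ['C'], 'C': []}
tiny_space = {'A': (0, 1), 'B': (0, 1), 'C': (0, 1)}
print(dimension(tiny_graph, tiny_space))  # prints 6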
print("Question 9:")
res = []
graph = {
'L': ['S', 'V'],
'H': ['S', 'V'],
'S': ['O'],
'V': ['C', 'O'],
'O': ['B'],
'A': ['T'],
'T': ['B'],
'C': [],
'B': [],
}
outSpace = dict(
H=(0,1),
L=(0,1),
A=(0,1),
V=(0,1),
S=(0,1),
T=(0,1),
C=(0,1,2),
O=(0,1,2),
B=(0,1,2),
)
res.append(dimension(graph, outSpace))
graph = transposeGraph(graph)
res.append(dimension(graph, outSpace))
graph = {
'A': ['B'],
'B': ['A', 'C', 'D', 'F'],
'C': ['B'],
'D': ['B', 'F'],
'E': ['F'],
'F': ['B', 'D', 'E', 'G'],
'G': ['F'],
}
outSpace = {
'A': (0,1,2,3,4),
'B': (0,1,2),
'C': (1,3,4,5,6,7,8,9),
'D': ('a',2,3,4,5,6,7),
'E': (0,3,4,5,6,7,8,6,5,),
'F': (2,),
'G': (3,4),
}
res.append(dimension(graph, outSpace))
graph = transposeGraph(graph)
res.append(dimension(graph, outSpace))
outSpace = {
'A': (0,1,2,3,4),
'B': (0,1,2),
'C': (1,3,4,5,6,7,8,9),
'D': ('a',2,5,6,7),
'E': (0,3,4,5,6,7,8,6,5,),
'F': (2,),
'G': (3,4),
}
res.append(dimension(graph, outSpace))
graph = transposeGraph(graph)
res.append(dimension(graph, outSpace))
graph = {
'A': ['B'],
'B': ['C','D'],
'C': [],
'D': [],
'E': ['F','G'],
'F': [],
'G': [],
}
res.append(dimension(graph, outSpace))
graph = transposeGraph(graph)
res.append(dimension(graph, outSpace))
graph = dict((str(i),[]) for i in range(10000))
graph['0'] = [str(i) for i in range(1,10000)]
outSpace = dict((str(i), tuple(j for j in range(1000))) for i in range(10000))
res.append(dimension(graph, outSpace))
import time
s = time.time()
res.append(dimension(graph, outSpace))
e = time.time()
t9 = e-s
print("Time9", t9)
testResults9 = np.array(res) == [37, 35, 620, 620, 454, 454, 64, 120, 9989001999, 9989001999]
score9 = max(0, 10*sum(testResults9)/len(testResults9))
if score9 == 10:
# assign 15 marks if perfect score, otherwise assign proportion of test cases passed out of 10
score9 = 15
score9 = max(0,score9-time_penalty(t9))
print("Score9",score9)
# If functions/arguments are passed incorrectly, I removed 4 marks manually
# E.g. relying on a global variable called "outcomeSpace", instead of the one that was passed to the function
# If the code raised an exception, the score was decided based on the severity of the mistake. Sometimes a try-except statement was added to the tests.
Question 9:
Time9 0.008138656616210938
Score9 15