UNIVERSITY OF ALBERTA CMPUT 397 Winter 2021
(Practice) Midterm Exam Duration: 45 minutes
Name:
Question 1. [20 marks] Part 1: (10)
In this question, we ask you to give an extension of the law of total probability. The law of total probability applied to an unconditional probability P(B) is given by:
$P(B) = \sum_k P(B \mid A_k)\, P(A_k).$
Here we are writing the probability of event B in an expanded form with conditional probabilities, where the conditioning events $A_k$ form a partition of the sample space Ω, which amounts to the following three conditions:
$A_i \neq \{\}\ \forall i$; each event is non-empty,
$A_i \cap A_j = \{\}$, $i \neq j$, $\forall i, j$; mutually exclusive,
$\cup_i A_i = \Omega$; collectively exhaustive.
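As a quick numerical sanity check of this identity, here is a minimal Python sketch on a toy sample space; the outcomes, probabilities, and events below are hypothetical and chosen only for illustration.

# Numerical check of P(B) = sum_k P(B|A_k) P(A_k) on a toy discrete sample space.
# The outcomes, probabilities, and events below are hypothetical illustrations.

omega = {"w1": 0.10, "w2": 0.15, "w3": 0.20, "w4": 0.25, "w5": 0.05, "w6": 0.25}

def prob(event):
    """P(event) for a set of outcomes."""
    return sum(omega[w] for w in event)

def cond_prob(event, given):
    """P(event | given), assuming P(given) > 0."""
    return prob(event & given) / prob(given)

# A partition of Omega: non-empty, mutually exclusive, collectively exhaustive.
A = [{"w1", "w2"}, {"w3", "w4"}, {"w5", "w6"}]
B = {"w2", "w3", "w5"}

lhs = prob(B)
rhs = sum(cond_prob(B, A_k) * prob(A_k) for A_k in A)
print(lhs, rhs)   # both print 0.4 (= 0.15 + 0.20 + 0.05), up to float rounding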
Now, instead of P(B), if we want to apply the law of total probability to the conditional probability P(B|C), what will the formula be? Write all the correct options and the corresponding formulas in your answer.
(a) $P(B \mid C) = \sum_k P(B \mid A_k)\, P(A_k)$
(b) $P(B \mid C) = \sum_k P(B \mid A_k \cap C)\, P(A_k)$
(c) $P(B \mid C) = \sum_k P(B \mid A_k \cap C)\, P(A_k \mid C)$.
The conditional events Ak are still a partition of the sample space here.
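The options can also be compared numerically. Below is a minimal Python sketch that evaluates $P(B \mid C)$ directly and each candidate right-hand side on a toy sample space; the outcomes, probabilities, and events are hypothetical, chosen so that every $A_k \cap C$ is non-empty and all conditional probabilities are defined.

# Comparing the three candidate formulas for P(B|C) on a toy sample space.
# The outcomes, probabilities, and events below are hypothetical illustrations.

omega = {"w1": 0.10, "w2": 0.15, "w3": 0.20, "w4": 0.25, "w5": 0.05, "w6": 0.25}

def prob(event):
    return sum(omega[w] for w in event)

def cond_prob(event, given):
    return prob(event & given) / prob(given)

A = [{"w1", "w2"}, {"w3", "w4"}, {"w5", "w6"}]   # partition of Omega
B = {"w2", "w3", "w5"}
C = {"w1", "w3", "w4", "w6"}

lhs = cond_prob(B, C)
option_a = sum(cond_prob(B, A_k) * prob(A_k) for A_k in A)
option_b = sum(cond_prob(B, A_k & C) * prob(A_k) for A_k in A)
option_c = sum(cond_prob(B, A_k & C) * cond_prob(A_k, C) for A_k in A)
print(lhs, option_a, option_b, option_c)   # compare each option with the left-hand side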
Part 2: (10)
If two non-empty events D and F are independent, then their probabilities do not change if the other event is given as a condition: P(D) = P(D|F) and P(F|D) = P(F).
In part 1, if the events Ak in the partition are all independent of event C, which of the above options (a), (b) and (c) are correct? Write all the correct options and the corresponding formulas in your answer.
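For reference, here is a minimal sketch of the independence definition itself, using a hypothetical two-coin sample space.

# Checking the independence definition on a toy two-fair-coin sample space.
# Outcomes are (first flip, second flip); events below are hypothetical illustrations.

omega = {("H", "H"): 0.25, ("H", "T"): 0.25, ("T", "H"): 0.25, ("T", "T"): 0.25}

def prob(event):
    return sum(omega[w] for w in event)

def cond_prob(event, given):
    return prob(event & given) / prob(given)

D = {w for w in omega if w[0] == "H"}   # first flip is heads
F = {w for w in omega if w[1] == "H"}   # second flip is heads

print(prob(D), cond_prob(D, F))   # 0.5, 0.5 -> conditioning on F does not change P(D)
print(prob(F), cond_prob(F, D))   # 0.5, 0.5 -> conditioning on D does not change P(F)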
Question 2. [20 marks]
In this question, we ask you to derive a formula related to the Bellman equation for the action value $q_\pi$. Recall that $q_\pi(x,c)$ is the action value of the state-action pair $(x, c)$ under policy $\pi$, defined as the expected return:
$q_\pi(x,c) \doteq \mathbb{E}_\pi[G_t \mid S_t = x, A_t = c],$
and $v_\pi(x)$ is the state value of state $x$ under policy $\pi$, defined as the expected return:
$v_\pi(x) \doteq \mathbb{E}_\pi[G_t \mid S_t = x].$
If $g(x, c)$ is the expected reward:
$g(x, c) \doteq \mathbb{E}[R_{t+1} \mid S_t = x, A_t = c],$
then derive the following identity:
$q_\pi(x, c) = g(x, c) + \gamma \sum_{x'} p(x' \mid x, c)\, v_\pi(x'), \qquad \forall x, \forall c,$
where $p(x' \mid x, c) = P(S_{t+1} = x' \mid S_t = x, A_t = c)$ is the probability of next state $x'$ given the current state-action pair $(x, c)$.
Use the linearity of expectation (LE), the law of total expectation (LOTE), the law of the unconscious statistician (LOTUS) and the Markov property (MP) in your derivation. For each step where you use one of these rules, write the name of the rule beside that step as (LE), (LOTE), (LOTUS), and (MP).
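The identity can also be checked numerically. Below is a minimal Python sketch on a hypothetical two-state, two-action MDP (the transition probabilities, expected rewards, policy, and discount factor are all made up for illustration); it evaluates the right-hand side exactly and compares it with a Monte Carlo estimate of the expected return.

# Numerical check of q_pi(x, c) = g(x, c) + gamma * sum_x' p(x'|x, c) v_pi(x')
# on a toy MDP. All transition probabilities, rewards, the policy, and gamma are
# hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# p[x, c, x'] = P(S_{t+1}=x' | S_t=x, A_t=c);  g[x, c] = E[R_{t+1} | S_t=x, A_t=c]
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
g = np.array([[1.0, 0.0],
              [-0.5, 2.0]])
pi = np.array([[0.6, 0.4],    # pi[x, c] = pi(c | x)
               [0.2, 0.8]])

# Exact v_pi from the linear system v = r_pi + gamma * P_pi v.
P_pi = np.einsum("xc,xcy->xy", pi, p)   # state-to-state transition matrix under pi
r_pi = np.einsum("xc,xc->x", pi, g)     # expected one-step reward under pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Right-hand side of the identity for every state-action pair.
q_rhs = g + gamma * np.einsum("xcy,y->xc", p, v_pi)

# Monte Carlo estimate of the left-hand side, E_pi[G_t | S_t=x, A_t=c],
# using rewards that are deterministic and equal to their expectation g(x, c).
def rollout_return(x, c, horizon=100):
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        G += discount * g[x, c]
        discount *= gamma
        x = rng.choice(2, p=p[x, c])     # sample next state
        c = rng.choice(2, p=pi[x])       # sample next action from the policy
    return G

q_mc = np.array([[np.mean([rollout_return(x, c) for _ in range(1000)])
                  for c in range(2)] for x in range(2)])
print(np.round(q_rhs, 2))
print(np.round(q_mc, 2))   # should agree with q_rhs up to Monte Carlo error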
Question 3. [20 marks] (10+10)
In this question, we ask you to give the Bellman optimality equation and value iteration update rule for action values. The Bellman optimality equation for v∗ is given by:
$v_*(y) = \max_b \sum_{y', w} p(y', w \mid y, b)\,\bigl[w + \gamma v_*(y')\bigr], \qquad \forall y, \qquad (1)$
where $p(y', w \mid y, b)$ is the joint probability of next state $y'$ and reward $w$ given the current state-action pair $(y, b)$. Then the value iteration method can be obtained directly from the optimality equation by replacing the optimal state value $v_*$ with the estimate $v_k$ for the $k$th iteration, and replacing the equality $=$ with the assignment $\doteq$:
$v_{k+1}(y) \doteq \max_b \sum_{y', w} p(y', w \mid y, b)\,\bigl[w + \gamma v_k(y')\bigr], \qquad \forall y. \qquad (2)$
Provide the Bellman optimality equation for $q_*$ and the corresponding value iteration update rule for action values, using the state-action pair $(y, b)$.
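For concreteness, here is a minimal Python sketch of the state-value update (2) on a hypothetical two-state, two-action MDP in which the reward is a deterministic function $w(y, b)$ of the state-action pair, so the sum over $w$ in (2) collapses; all numbers are made up for illustration.

# Value iteration per update (2), on a toy MDP where the reward is a deterministic
# function w(y, b) of the state-action pair.
# All transition probabilities and rewards below are hypothetical illustrations.
import numpy as np

gamma = 0.9
# p[y, b, z] = P(next state z | state y, action b);  w[y, b] = reward for taking b in y
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.05, 0.95]]])
w = np.array([[0.0, 1.0],
              [0.5, 2.0]])

v = np.zeros(2)                                   # v_0
for k in range(1000):
    # v_{k+1}(y) = max_b [ w(y, b) + gamma * sum_z p(z | y, b) v_k(z) ]
    v_new = np.max(w + gamma * np.einsum("ybz,z->yb", p, v), axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(np.round(v, 3))   # approximation of v_* for the two states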
Question 4. [20 marks]
Prove that the discounted sum of rewards is always finite if the rewards are bounded: $|R_{t+1}| \leq R_{\max}$ for all $t$, for some finite $R_{\max} \geq 0$, where the discount factor satisfies $0 \leq \gamma < 1$.
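A minimal numerical illustration of the claim, with a hypothetical discount factor and arbitrary bounded rewards, is sketched below; the partial discounted sums settle to a finite value.

# With bounded rewards and gamma < 1, partial discounted sums stabilize at a finite value.
# The rewards and the discount factor below are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(0)
gamma, R_max, T = 0.95, 5.0, 2000

rewards = rng.uniform(-R_max, R_max, size=T)      # arbitrary bounded rewards, |R| <= R_max
partial_sums = np.cumsum(rewards * gamma ** np.arange(T))

print(partial_sums[[10, 100, 1000, -1]])          # partial sums settle to a finite limit
print(R_max / (1 - gamma))                        # |partial sum| stays below this value (100.0)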