CS 188 Introduction to Artificial Intelligence
Spring 2017 Final v0
• You have approximately 165 minutes (2 hours 45 minutes).
• The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.
• Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. All short answer sections can be successfully answered in a few sentences AT MOST.
• For multiple choice questions with circular bubbles, you should mark only ONE option; for those with checkboxes, you should mark ALL that apply (which can range from zero to all options).
First name
Last name
edX username
For staff use only:
Total
/??

Q1. [?? pts] Approximate Q-Learning
Consider the following MDP: We have infinitely many states s ∈ ℤ and actions a ∈ ℤ, each represented as an integer. Taking action a from state s deterministically leads to new state s' = s + a and reward r = s − a. For example, taking action 3 at state 1 results in new state s' = 1 + 3 = 4 and reward r = 1 − 3 = −2.
We perform approximate Q-Learning, with features and initialized weights defined below.

Feature           Initial Weight
f1(s, a) = s      w1 = 1
f2(s, a) = −a²    w2 = 2

(a) [?? pts] Write down Q(s, a) in terms of w1, w2, s, and a.
Q(s, a) = w1·s − w2·a²

(b) [?? pts] Calculate Q(1, 1).
Q(1, 1) = w1·f1(1, 1) + w2·f2(1, 1) = 1·1 − 2·1² = −1

(c) [?? pts] We observe a sample (s, a, r, s') of (1, 1, 0, 2). Assuming a learning rate of α = 0.5 and discount factor of γ = 0.5, compute the new weights after a single update of approximate Q-Learning.
First compute the difference:
diff = (r + γ·max_a Q(2, a)) − Q(1, 1) = (0 + 0.5·max_a Q(2, a)) − (−1)
     = (0.5·max_a (2 − 2a²)) + 1
     = (0.5·(2 + 2·max_a (−a²))) + 1
     = (0.5·(2 + 2·0)) + 1 = 1 + 1 = 2
Then update each weight with w_i ← w_i + α·diff·f_i(s, a):
w1: 1 + 0.5·2·1 = 2
w2: 2 + 0.5·2·(−1²) = 1

(d) [?? pts] Compute the new value of Q(1, 1).
Q(1, 1) = w1·1 + w2·(−1²) = 2·1 + 1·(−1²) = 1
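A minimal Python sketch of the computation in parts (a)-(d), assuming we search max_a Q(s', a) over the window range(-5, 6) (an assumption; any window containing the true maximizer a = 0 gives the same result):

def q_value(w, s, a):
    # Q(s, a) = w1 * f1(s, a) + w2 * f2(s, a), with f1 = s and f2 = -a^2
    return w[0] * s + w[1] * (-a ** 2)

def update(w, s, a, r, s_next, alpha=0.5, gamma=0.5, actions=range(-5, 6)):
    target = r + gamma * max(q_value(w, s_next, b) for b in actions)
    diff = target - q_value(w, s, a)
    # w_i <- w_i + alpha * diff * f_i(s, a)
    return [w[0] + alpha * diff * s, w[1] + alpha * diff * (-a ** 2)]

w = [1.0, 2.0]             # initial weights w1 = 1, w2 = 2
print(q_value(w, 1, 1))    # -1.0, matching part (b)
w = update(w, 1, 1, 0, 2)  # sample (s, a, r, s') = (1, 1, 0, 2)
print(w)                   # [2.0, 1.0], matching part (c)
print(q_value(w, 1, 1))    # 1.0, matching part (d)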

Q2. [?? pts] Who Spoke When
We are given a single audio recording (divided into equal and short time slots) and wish to infer when each person speaks. At every time step exactly one of N people is talking. This problem can be modeled using an HMM. Hidden variable X_t ∈ {1, 2, …, N} represents which person is talking at time step t.
(a) For this part, assume that at each time step:
• with probability p, the current speaker will continue to talk in the next time step.
• with probability 1 − p, the current speaker will be interrupted by another person. Each other person is equally likely to be the interrupter.
Assume that N = 3.
(i) [?? pts] Complete the Markov Chain below and write down the probabilities on each transition.
Self-transitions for all states have probability p, and all other transitions have probability (1 − p)/2.
(ii) [?? pts] What is the stationary probability distribution of this Markov chain? (Again, assume N = 3.)
P(X∞ = 1) = 1/3
P(X∞ = 2) = 1/3
P(X∞ = 3) = 1/3
1/3 for each state, because of the symmetry.
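A quick numerical sketch of this claim, assuming numpy is available and an arbitrary 0 < p < 1; power iteration converges to the uniform distribution:

import numpy as np

p = 0.6                         # any 0 < p < 1
q = (1 - p) / 2                 # probability of each of the two interrupters
T = np.array([[p, q, q],
              [q, p, q],
              [q, q, p]])       # rows are P(X_{t+1} | X_t)

pi = np.array([1.0, 0.0, 0.0])  # start from any distribution
for _ in range(1000):
    pi = pi @ T                 # power iteration: pi <- pi T
print(pi)                       # -> [1/3, 1/3, 1/3]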
(b) [?? pts] What is the number of parameters (or degrees of freedom) needed to model the transition probability P(X_t | X_{t−1})? Assume N people in the meeting and arbitrary transition probabilities.
N(N − 1). P(X_t | X_{t−1}) is an N × N table and each row must sum to one. Significant partial credit will be given for the answer N².
(c) [?? pts] Let's remove the assumption that people are not allowed to talk simultaneously. Now, hidden state X_t ∈ {0, 1}^N will be a binary vector of length N. Each element of the vector corresponds to a person, and whether they are speaking.
Now, what is the number of parameters (or degrees of freedom) needed for modeling the transition probability P(X_t | X_{t−1})?
2^N(2^N − 1). We have 2^N different states, so we need 2^N(2^N − 1), or roughly 2^{2N}, parameters.

One way to decrease the parameter count is to assume independence. Assume that the transition probability between people is independent. The figure below represents this assumption for N = 3, where X_t = [X_t(1), X_t(2), X_t(3)].
(d) [?? pts] Write the following in terms of conditional probabilities from the Bayes Net. Assume N people in the meeting.
Transition probability P(X_t | X_{t−1}):
P(X_t | X_{t−1}) = ∏_{n=1}^{N} P(X_t(n) | X_{t−1}(n))
Emission probability P(Y_t | X_t):
P(Y_t | X_t) = P(Y_t | X_t(1), …, X_t(N))
(e) [?? pts] What is the number of parameters (or degrees of freedom) needed for modeling the transition probability P(X_t | X_{t−1})? Assume N people in the meeting.
2N
We need to define a transition matrix for each person, which requires 2 parameters, and there are N people. Significant partial credit for the answer 4N.

Q3. [?? pts] Naive Bayes
(a) We use a Naive Bayes classifier to differentiate between Pacmen and Ghosts, trained on the following:

F1   F2   Y
0    1    Ghost
1    0    Ghost
0    0    Pac
1    1    Pac

Assume that the distributions generated from these samples perfectly estimate the CPTs. Given features f1, f2, we predict Ŷ ∈ {Ghost, Pacman} using the Naive Bayes decision rule. If P(Y = Ghost | F1 = f1, F2 = f2) = 0.5, assign Ŷ based on flipping a fair coin.

(i) [?? pts] Compute the table P(Ŷ | Y).
Value P(Ŷ = Ghost | Y = Ghost) is the probability of correctly classifying a Ghost, while P(Ŷ = Pacman | Y = Ghost) is the probability of confusing a Ghost for a Pacman.

P(Ŷ | Y)      Ŷ = Ghost   Ŷ = Pacman
Y = Ghost     1/2         1/2
Y = Pacman    1/2         1/2
For each modification below, recompute the table P(Ŷ | Y). The modifications for each part are separate, and do not accumulate.
(ii) [?? pts] Add extra feature F3 = F1 + F2, and modify the Naive Bayes classifier appropriately.

P(Ŷ | Y)      Ŷ = Ghost   Ŷ = Pacman
Y = Ghost     1           0
Y = Pacman    0           1
(iii) [?? pts] Add extra feature F3 = F1 × F2, and modify the Naive Bayes classifier appropriately.

P(Ŷ | Y)      Ŷ = Ghost   Ŷ = Pacman
Y = Ghost     1           0
Y = Pacman    1/2         1/2
7

(iv) [?? pts] Add extra feature F3 = F1 − F2, and modify the Naive Bayes classifier appropriately.

P(Ŷ | Y)      Ŷ = Ghost   Ŷ = Pacman
Y = Ghost     1           0
Y = Pacman    0           1
(v) [?? pts] Perform Laplace Smoothing with k = 1.

P(Ŷ | Y)      Ŷ = Ghost   Ŷ = Pacman
Y = Ghost     1/2         1/2
Y = Pacman    1/2         1/2
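A brute-force sketch in Python that recomputes the part (i) table from the four training samples (the cpt helper is illustrative, not exam-provided); every posterior ties at 0.5, so each row of P(Ŷ | Y) is (1/2, 1/2):

from collections import Counter

data = [((0, 1), 'Ghost'), ((1, 0), 'Ghost'), ((0, 0), 'Pac'), ((1, 1), 'Pac')]
prior = Counter(y for _, y in data)

def cpt(j, f, y):
    # P(F_j = f | Y = y), estimated from the samples with no smoothing
    return sum(1 for x, lab in data if lab == y and x[j] == f) / prior[y]

for x, y in data:
    scores = {lab: (prior[lab] / len(data)) * cpt(0, x[0], lab) * cpt(1, x[1], lab)
              for lab in ('Ghost', 'Pac')}
    print(x, y, scores)  # both scores are 0.125 for every sample -> coin flip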
(b) [?? pts] Now, we reformulate the Naive Bayes classifier so that it can choose more than one class. For example, if we are choosing which genre a book is, we want the ability to say that a romantic comedy is both a romance and a comedy.
To do this, we have multiple label nodes Y = {Y1, …, Yn} which all point to all features F = {F1, …, Fm}.
Select all of the following expressions which are valid Naive Bayes classification rules, i.e., equivalent to arg max_{Y1…Yn} P(Y1, Y2, …, Yn | F1, F2, …, Fm):

☐ arg max_{Y1…Yn} ∏_{i}^{n} [ P(Yi) ∏_{j}^{m} P(Fj | Yi) ]

☐ arg max_{Y1…Yn} ∏_{i}^{n} [ P(Yi) ∏_{j}^{m} P(Fj | Y1…Yn) ]

☒ arg max_{Y1…Yn} ∏_{i}^{n} [P(Yi)] ∏_{j}^{m} [P(Fj | Y1…Yn)]

☐ ∏_{i}^{n} [ arg max_{Yi} { P(Yi) ∏_{j}^{m} P(Fj | Yi) } ]

☐ ∏_{i}^{n} [ arg max_{Yi} { P(Yi) ∏_{j}^{m} P(Fj | Y1…Yn) } ]

Q4. [?? pts] Tracking a cyclist
We are trying to track cyclists as they move around a self-driving car. The car is equipped with 4 "presence detectors" corresponding to:
• Front of the car (F),
• Back of the car (B),
• Left side of the car (L),
• Right side of the car (R).

Figure 1: Autonomous vehicle and detection zones
Unfortunately, the detectors are not perfect and feature the following conditional probabilities for detection D ∈ {0, 1} ("no detection" or "detection", respectively) given cyclist presence C ∈ {0, 1} ("no cyclist" or "cyclist", respectively).

Front detector:
P_F(D | C)   d = 1   d = 0
c = 1        0.8     0.2
c = 0        0.1     0.9

Back detector:
P_B(D | C)   d = 1   d = 0
c = 1        0.6     0.4
c = 0        0.4     0.6

Left & Right detectors:
P_L(D | C) = P_R(D | C)   d = 1   d = 0
c = 1                     0.7     0.3
c = 0                     0.2     0.8

(a) Detection and dynamics
(i) [?? pts] If you could freely choose any detector to equip all four detection zones, which one would be best?
● The front detector. ○ The detector at the back. ○ The left/right detector.
The front detector features the confusion matrix most concentrated on the diagonal.

Dynamics: We have measured the following transition probabilities for cyclists moving around the car when driving. Assume any dynamics are Markovian. Variable X_t ∈ {f, l, r, b} denotes the location of the cyclist at time t, and can be in front, left, right, or back of the car.

P(X_{t+1} | X_t)   X_{t+1} = f   X_{t+1} = l   X_{t+1} = r   X_{t+1} = b
X_t = f            p_ff          p_fl          p_fr          p_fb
X_t = l            p_lf          p_ll          p_lr          p_lb
X_t = r            p_rf          p_rl          p_rr          p_rb
X_t = b            p_bf          p_bl          p_br          p_bb

(ii) [?? pts] Which criterion does this table have to satisfy for it to be a well-defined CPT? (Select all that apply)
☒ Each row should sum to 1. ☐ Each column should sum to 1. ☐ The table should sum to 1.

(b) Let's assume that we have been given a sequence of observations d1, d2, …, dt and computed the posterior probability P(X_t | d1, d2, …, dt), which we represent as a four-dimensional vector.
(i) [?? pts] What is the vector P(X_{t+1} | d1, d2, …, dt) as a function of P(X_{t+1} | X_t) (the 4 × 4 matrix written above) and P(X_t | d1, d2, …, dt)?
P(X_{t+1} | d1, d2, …, dt) = P(X_{t+1} | X_t)^T × P(X_t | d1, d2, …, dt),
or P(X_{t+1} | d1, d2, …, dt) = ∑_{x_t} P(X_{t+1} | X_t = x_t) × P(X_t = x_t | d1, d2, …, dt).
(ii) [?? pts] What is the computational complexity of computing P(X_t | D1 = d1, D2 = d2, …, Dt = dt) as a function of t and the number of states S (using big O notation)?
O(t × S²).
Detailed solution: At each time step of the forward algorithm, we need to multiply a vector of size S by a matrix of size S² to account for the dynamics entailed in P(X_{t+1} | X_t). Then, we need to compute the emission probability of each state, which here costs 2 × S, and normalize (complexity is S). Therefore, the cost of propagating beliefs forward in time through the dynamics dominates as a function of S and is O(S²) for each time step. Hence the final answer.
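A sketch of one forward-algorithm step, assuming numpy, a generic S × S transition matrix T with rows P(X_{t+1} | X_t), and a length-S vector emis with emis[x] = P(d_{t+1} | X_{t+1} = x):

import numpy as np

def forward_step(belief, T, emis):
    predicted = T.T @ belief    # time update: O(S^2)
    joint = emis * predicted    # observation update: O(S)
    return joint / joint.sum()  # normalization: O(S)

Running t such steps costs O(t × S²) overall, dominated by the matrix-vector product, matching the answer above.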
(c) (i) [?? pts] We now add a radar to the system (random variable E ∈ {f, l, r, b}). Assuming the detection by this device is independent from what happens with the pre-existing detectors, which of the probabilistic models could you use? If several variables are in the same node, the node represents a tuple of random variables, which itself is a random variable.

[Figure: four candidate dynamic Bayes net structures, Answers a) through d), over the nodes X_t, X_{t+1}, D_t, D_{t+1}, E_t, E_{t+1}]

Select all that apply.
☒ a) ☒ b) ☐ c) ☒ d)
(ii) [?? pts] ERRATUM: Which of the following values for Z are correct?

P(X_{t+1} | D1, …, D_{t+1}, E1, …, E_{t+1}) = ∑_{x = f,l,r,b} Z · P(X_t = x | D1, …, Dt, E1, …, Et) · P(X_{t+1} | X_t = x) / P(D_{t+1}, E_{t+1} | D1, …, Dt, E1, …, Et).

☒ Z = P(E_{t+1}, D_{t+1} | X_{t+1}, X_t, D1, …, Dt, E1, …, Et)
☒ Z = P(E_{t+1} | X_{t+1}) P(D_{t+1} | X_{t+1})
☐ Z = P(E_{t+1} | X_t) P(D_{t+1} | X_t)

☐ Z = P(E_{t+1} | E_t) P(D_{t+1} | D_t)
☒ Z = P(E_{t+1}, D_{t+1} | X_{t+1})
☐ Z = P(E_{t+1}, D_{t+1} | X_t)
P(X_{t+1} | D1, …, D_{t+1}, E1, …, E_{t+1})
= P(X_{t+1}, D_{t+1}, E_{t+1} | D1, …, Dt, E1, …, Et) / P(D_{t+1}, E_{t+1} | D1, …, Dt, E1, …, Et)
= P(D_{t+1}, E_{t+1} | X_{t+1}, D1, …, Dt, E1, …, Et) P(X_{t+1} | D1, …, Dt, E1, …, Et) / P(D_{t+1}, E_{t+1} | D1, …, Dt, E1, …, Et)
= P(D_{t+1} | X_{t+1}) P(E_{t+1} | X_{t+1}) ∑_x P(X_{t+1} | X_t = x, D1, …, Dt, E1, …, Et) P(X_t = x | D1, …, Dt, E1, …, Et) / P(D_{t+1}, E_{t+1} | D1, …, Dt, E1, …, Et)
= P(D_{t+1} | X_{t+1}) P(E_{t+1} | X_{t+1}) ∑_x P(X_{t+1} | X_t = x) P(X_t = x | D1, …, Dt, E1, …, Et) / P(D_{t+1}, E_{t+1} | D1, …, Dt, E1, …, Et)

Q5. [?? pts] MDP: Left or Right
Consider the following MDP:

[Figure: a chain of states T - A - B - T, where T is terminal. Going from A to T (left) yields reward 8; going from A to B yields reward 2; going from B to A or from B to T yields reward 0.]

The state space S and action space A are
S = {A, B, T}
A = {left, right}

where T denotes the terminal state (both T states are the same). When in a terminal state, the agent has no more actions and gets no more reward. In non-terminal states, the agent can only go left or right, but their action only succeeds (goes in the intended direction) with probability p. If their action fails, then they go in the opposite direction. The numbers on the arrows denote the reward associated with going from one state to another.
For example, at state A taking action left:
• with probability p, the next state will be T and the agent will get a reward of 8. The episode is then terminated.
• with probability 1 − p, the next state will be B and the reward will be 2.
For this problem, the discount factor γ is 1. Let π*_p be the optimal policy, which may or may not depend on the value of p. Let Q^{π*_p} and V^{π*_p} be the corresponding Q and V functions of π*_p.

(a) [?? pts] If p = 1, what is π*_p? (Select one)
○ π*_p(A) = left,  π*_p(B) = left
○ π*_p(A) = left,  π*_p(B) = right
● π*_p(A) = right, π*_p(B) = left
○ π*_p(A) = right, π*_p(B) = right
The optimal policy just goes back and forth between A and B getting infinite points.

(b) [?? pts] If p = 0, what is π*_p(A)? (Select one)
○ π*_p(A) = left,  π*_p(B) = left
● π*_p(A) = left,  π*_p(B) = right
○ π*_p(A) = right, π*_p(B) = left
○ π*_p(A) = right, π*_p(B) = right
Since p = 0, it's the same as p = 1 but you just need to take the opposite action.

(c) [?? pts] Suppose π*_p(A) = left. Which of the following statements must be true? (Select all that apply)
Hint: Don't forget that if x = y, then x ≥ y and x ≤ y.
☐ Q^{π*_p}(A, left) ≤ Q^{π*_p}(A, right)
☒ Q^{π*_p}(A, left) ≥ Q^{π*_p}(A, right)
☐ Q^{π*_p}(A, left) = Q^{π*_p}(A, right)
☐ V^{π*_p}(A) ≤ V^{π*_p}(B)
☒ V^{π*_p}(A) ≥ V^{π*_p}(B)
☐ V^{π*_p}(A) = V^{π*_p}(B)
☒ V^{π*_p}(A) ≤ Q^{π*_p}(A, left)
☒ V^{π*_p}(A) ≥ Q^{π*_p}(A, left)
☒ V^{π*_p}(A) = Q^{π*_p}(A, left)
☐ V^{π*_p}(A) ≤ Q^{π*_p}(A, right)
☒ V^{π*_p}(A) ≥ Q^{π*_p}(A, right)
☐ V^{π*_p}(A) = Q^{π*_p}(A, right)
For left to be the optimal action, it must be the case that Q^{π*_p}(A, left) ≥ Q^{π*_p}(A, right). Therefore, we also get that V^{π*_p}(A) = max_a Q^{π*_p}(A, a) = Q^{π*_p}(A, left). Also, it's always the case that V^{π*_p}(A) ≥ V^{π*_p}(B) for this problem.
(d) Assume p ≥ 0.5 below.
(i) [?? pts] V*(B) = αV*(A) + β. Find α and β in terms of p.
• α = p
• β = 0
Since p ≥ 0.5, it's always optimal to go left from B. So
V*(B) = Q*(B, left) = p(0 + V*(A)) + (1 − p)(0 + V*(T)) = pV*(A)

(ii) [?? pts] Q^{π*_p}(A, left) = αV*(B) + β. Find α and β in terms of p.
• α = 1 − p
• β = 2 + 6p
Q^{π*_p}(A, left) = p(8 + V*(T)) + (1 − p)(2 + V*(B))
                  = (1 − p)V*(B) + 2 + 6p

(iii) [?? pts] Q^{π*_p}(A, right) = αV*(B) + β. Find α and β in terms of p.
• α = p
• β = 8 − 6p
Q^{π*_p}(A, right) = (1 − p)(8 + V*(T)) + p(2 + V*(B))
                   = pV*(B) + 8 − 6p
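A sketch that checks these closed forms by value iteration, assuming 0.5 ≤ p < 1 so that the values stay finite under γ = 1:

def solve(p, iters=10000):
    VA = VB = 0.0
    for _ in range(iters):
        QA_left = p * 8 + (1 - p) * (2 + VB)   # A -> T (r = 8) or A -> B (r = 2)
        QA_right = (1 - p) * 8 + p * (2 + VB)
        QB_left = p * VA                       # B -> A (r = 0) or B -> T (r = 0)
        QB_right = (1 - p) * VA
        VA, VB = max(QA_left, QA_right), max(QB_left, QB_right)
    return VA, VB

p = 0.7
VA, VB = solve(p)
print(VB, p * VA)                # equal: V*(B) = p V*(A), part (i)
print((1 - p) * VB + 2 + 6 * p)  # Q(A, left), part (ii)
print(p * VB + 8 - 6 * p)        # Q(A, right), part (iii); V*(A) is the max of the two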


Q6. [?? pts] Take Actions
An agent is acting in the following gridworld MDP, with the following characteristics.
• Discount factor γ < 1.
• Agent gets reward R > 0 for entering the terminal state T, and 0 reward for all other transitions.
• When in terminal state T, the agent has no more actions and gets no more reward.
• In non-terminal states {A, B, C}, the agent can take an action {Up, Down, Left, Right}.
• Assume perfect transition dynamics. For example, taking action Right at state A will always result in state C in the next time step.
• If the agent hits an edge, it stays in the same state in the next time step. For example, after taking action Right at C, the agent remains in state C.

[Figure: a 2 × 2 gridworld with B and T in the top row, and A and C in the bottom row.]

(a) (i) [?? pts] What are all the optimal deterministic policies? Each cell should contain a single action {Up, Down, Left, Right}. Each row corresponds to a different optimal policy. You may not need all rows.

State                           A       B       C
Optimal policy 1                Up      Right   Up
Optimal policy 2 (if needed)    Right   Right   Up
Optimal policy 3 (if needed)    (unused)

(ii) [?? pts] Suppose the agent uniformly randomly chooses between the optimal policies in (i). In other words, at each state, the agent picks randomly between the actions in the corresponding column with equal probability. The agent's location at each time step is then a Markov process where state X_t ∈ {A, B, C, T}. Fill in the following transition probabilities for the Markov process.
• P(X_{t+1} = B | X_t = A) = 0.5
• P(X_{t+1} = A | X_t = B) = 0
• P(X_{t+1} = T | X_t = C) = 1

(b) Suppose the agent is acting in the same gridworld as above, but does not get to observe their exact state X_t. Instead, the agent only observes O_t ∈ {black, green, pink}. The observation probability as a function of the state P(O_t | X_t) is specified in the table below. This becomes a partially-observable Markov decision process (POMDP). The agent is equally likely to start in the non-terminal states {A, B, C}.

State   P(O_t | X_t)
A       0.5 black, 0.5 pink
B       0.5 black, 0.5 green
C       0.5 pink, 0.5 green

(i) [?? pts] If the agent can only act based on its current observation, what are all deterministic optimal policies? You may not need all rows.

                                Black   Green   Pink
Optimal policy 1                Right   Right   Up
Optimal policy 2 (if needed)    Right   Up      Up
Optimal policy 3 (if needed)    (unused)
(ii) [?? pts] Suppose that the agent follows the policy π(Black) = Right, π(Green) = Right, and π(Pink) = Up. Let V(S) be the agent's expected reward from state S. Your answer should be in terms of γ and R. Note that V(S) is the expected value before we know the observation, so you must consider all possible observations at state S.
• V(A) = γ(½R + ½ · R/(2 − γ))
• V(B) = R
• V(C) = R/(2 − γ)
V(B) = R because we always go right in state B.
V(C) = ½R + ½γV(C) ⟹ V(C) = R/(2 − γ).
V(A) = γ(½V(B) + ½V(C)).
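A sketch that spot-checks V(C) by Monte Carlo, assuming arbitrary values γ = 0.9 and R = 10: from C, the policy reaches T immediately on pink (probability 1/2) and stays at C on green (Right hits the edge):

import random

gamma, R = 0.9, 10.0
closed_form = R / (2 - gamma)

def rollout_C():
    discount = 1.0
    while random.random() < 0.5:  # observe green: Right hits the edge, stay at C
        discount *= gamma
    return discount * R           # observe pink: Up reaches T for reward R

random.seed(0)
estimate = sum(rollout_C() for _ in range(100_000)) / 100_000
print(closed_form, estimate)      # ~9.09 for both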
Now suppose that the agent's policy can also depend on all past observations and actions. Assume that when the agent is starting (and has no past observations), it behaves the same as the policy in the previous part: π([Black]) = Right, π([Green]) = Right, π([Pink]) = Up. In all cases where the agent has more than one observation (for example, observed Pink in the previous time step and now observes Green), π acts optimally.
(iii) [?? pts] For each of the following sequences of two observations, write the optimal action that the policy π would take.
Black, Pink → Up
Black, Green → Up
Green, Pink → Up
Green, Green → Up
Pink, Black → Right
Pink, Green → Right

(iv) [?? pts] In this part only, let V(S) refer to the expected sum of discounted rewards following π if we start from state S (and thus have no previous observations yet). As in the previous part, this is the expected value before knowing the observation, so you must consider all possible observations at S.
Hint: since π now depends on sequences of observations, the way we act at states after the first state may be different, and this affects the value at the first state.
• V(A) = γR
• V(B) = R
• V(C) = ½(1 + γ)R
(c) Boba POMDP
May is a venture capitalist who knows that Berkeley students love boba. She is picking between investing in Sharetea or Asha. If she invests in the better one, she will make a profit of $1000. If she invests in the worse one, she will make no money.
At the start, she believes both Asha and Sharetea have an equal chance of being better. However, she can pay to have students taste test. At each time step, she can either choose to invest or to pay for a student taste test. Each student has a p = 0.9 probability of picking the correct place (independent of other students).
(i) [?? pts] What is the expected profit if May invests optimally after one (free) student test?
900
(ii) [?? pts] If she had to invest after one student test, what is the highest May should pay for the test?
400
(iii) [?? pts] Suppose after n student tests, it turns out that all students have chosen the same store. What is her expected profit after observing these n student tests?
○ 1000(0.9^n)
● 1000(0.9^n / (0.9^n + 0.1^n))
○ 1000(0.9^n − 0.1^n)
○ 1000((0.9^n − 0.1^n) / 0.9^n)
○ 1000((0.9^n − 0.1^n) / (0.9^n + 0.1^n))
○ 1000(1 − 0.1^n)
(iv) [?? pts] How many tests should May pay for if each one costs $100?
Hint: Think about the maximum possible value of information. How does this compare to the expected value of information?
1
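A small sketch of the posterior behind (iii): after n independent tests that all agree, the better store explains the data with likelihood 0.9^n and the worse one with 0.1^n, starting from a uniform prior:

def expected_profit(n, p=0.9, prize=1000.0):
    posterior = p ** n / (p ** n + (1 - p) ** n)  # P(chosen store is the better one)
    return prize * posterior

for n in range(4):
    print(n, expected_profit(n))  # 500.0, 900.0, ~987.8, ~998.6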

Q7. [?? pts] Graph Search
You are trying to plan a road trip from city A to city B. You are given an undirected graph of roads of the entire country, together with the distance along each road between any city X and any city Y: length(X, Y). (For the rest of this question, "shortest path" is always in terms of length, not number of edges.) You would like to run a search algorithm to find the shortest way to get from A to B (assume no ties).
Suppose C is the capital, and thus you know the shortest paths from city C to every other city, and you would like to be able to use this information.
Let path_opt(X → Y) denote the shortest path from X to Y, and let cost(X, Y) denote the cost of the shortest path between cities X and Y. Let [path(X → Y), path(Y → Z)] denote the concatenation.
(a) [?? pts] Suppose the distance along any edge is 1. You decide to initialize the queue with A, plus a list of all cities X, with path(A → X) = [path_opt(A → C), path_opt(C → X)]. You run BFS with this initial queue (sorted in order of path length). Which of the following is correct? (Select all that apply)
☐ You always expand the exact same nodes as you would have if you ran standard BFS.
☐ You might expand a different set of nodes, but still find the shortest path.
☒ You might expand a different set of nodes, and find a sub-optimal path.
Consider a graph of 5 nodes A, B, C, D, E and edges (A,C), (C,E), (E,B), (A,D), (D,B). Then our initial queue (in order) is
1. C: A-C
2. E: A-C-E
3. B: A-C-E-B
4. D: A-C-A-D
The path returned will be A-C-E-B, even though A-D-B is shorter.
(b) [?? pts] You decide to initialize the priority queue with A, plus a list of all cities X, with path(A → X) = [path_opt(A → C), path_opt(C → X)] and cost(A, X) = cost(A, C) + cost(C, X). You run UCS with this initial priority queue. Which of the following is correct? (Select all that apply)
☒ You always expand the exact same nodes as you would have if you ran standard UCS.
☐ You might expand a different set of nodes, but still find the shortest path.
☐ You might expand a different set of nodes, and find a sub-optimal path.
Regardless of what is on the queue, UCS will explore nodes in order of their shortest-path distance to A, so the set of explored nodes is always {nodes X : dist(A, X) less than dist(A, B)}.
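A sketch of the seeded UCS from part (b), assuming an adjacency-list graph of the form {node: [(neighbor, length), ...]} and seed entries of the form (cost(A,C) + cost(C,X), X, path):

import heapq

def ucs_with_seed(graph, start, goal, seed):
    frontier = [(0.0, start, [start])] + list(seed)
    heapq.heapify(frontier)
    closed = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node in closed:
            continue                # skip stale or dominated entries
        closed.add(node)
        if node == goal:
            return cost, path
        for nbr, w in graph[node]:
            if nbr not in closed:
                heapq.heappush(frontier, (cost + w, nbr, path + [nbr]))
    return None

Because every seeded entry's cost is an upper bound on the true distance dist(A, X), nodes are still popped in order of dist(A, X), which is why the expansion order (and hence optimality) is unchanged.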

Q8. [?? pts] Bayes Net Inference
A plate representation is useful for capturing replication in Bayes Nets. For example, Figure 2(a) is an equivalent representation of Figure 2(b). The N in the lower right corner of the plate stands for the number of replicas.
Figure 2
Now consider the Bayes Net in Figure 3. We use X_{1:N} as shorthand for (X_1, …, X_N). We would like to compute the query P(X_{1:N} | Y_{1:N} = y_{1:N}). Assume all variables are binary.
Figure 3
(a) [?? pts] What is the number of rows in the largest factor generated by inference by enumeration, for this query?
○ 2^{2N}   ○ 2^{3N}   ○ 2^{2N+2}   ● 2^{3N+2}
In inference by enumeration, the full joint probability P(X_{1:N}, Y_{1:N} = y_{1:N}, W_{1:N}, Z_{1:N}, A, B) is computed, which has size 2^{3N+2}.
(b) [?? pts] Mark all of the following variable elimination orderings that are optimal for calculating the answer for the query P(X_{1:N} | Y_{1:N} = y_{1:N}). (A variable elimination ordering is optimal if the largest factor generated is smallest among all possible elimination orderings.)
☒ Z_1, …, Z_N, W_1, …, W_N, B, A
☒ W_1, …, W_N, Z_1, …, Z_N, B, A
☐ A, B, W_1, …, W_N, Z_1, …, Z_N
☐ A, B, Z_1, …, Z_N, W_1, …, W_N
The only thing that matters is the size of the maximum factor generated during elimination, and the final factor P(X_{1:N} | y_{1:N}) has size 2^N. Eliminating anything other than A does not generate a factor which depends on more than one time index i, so as long as A is eliminated last, no factor of size greater than 2^N is generated,

so the first two orderings are both optimal. (In fact, as long as A is eliminated after B, the ordering will be optimal, but it was not necessary to notice this to distinguish among the given options.)
(c) [?? pts] Which of the following variables can be deleted before running variable elimination, without affecting the inference result? Deleting a variable means not putting its CPT in our initial set of factors when starting the algorithm.
☒ W_1   ☒ Z_1   ☐ A   ☒ B   ☐ None
B, W_1, and Z_1 can all be deleted. In general, any variable which has no descendants that are query variables (in this case X_i) or evidence variables (in this case y_i) can be deleted. This is because when we eliminate the subgraph of all the descendants of such a variable, we will end up with a factor in which all the entries are equal to 1, and thus it does not affect the results whatsoever when joined with other factors.
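A tiny numpy sketch of why this works: summing a (normalized) CPT over the eliminated child gives a factor of all ones:

import numpy as np

rng = np.random.default_rng(0)
cpt = rng.random((2, 2))               # rows will be P(Z = z | W = w)
cpt /= cpt.sum(axis=1, keepdims=True)  # normalize each row
print(cpt.sum(axis=1))                 # [1. 1.] -- the all-ones factor f(W)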

Q9. [?? pts] Deep Learning
(a) [?? pts] Data Separability
The plots above show points in feature space (x1, x2), also referred to as feature vectors x = [x1 x2]^T.
For each of the following, we will define a function h(x) as a composition of some functions f_i and g_i. For each one, consider the decision rule
y(x) = × if h(x) ≥ 0, ○ if h(x) < 0.
Under each composition of functions h, select the datasets for which there exist some linear functions f_i and some nonlinear functions g_i such that the corresponding decision rule perfectly classifies the data. (Select all that apply)
A composition of linear functions will always be linear. Parts (i), (ii), (iv) are linear. Plot (b) is linearly separable, and can be separated by linear or nonlinear decision boundaries. Plots (a), (c) require a nonlinear function to perfectly separate them.
(i) h(x) = f1(x): (a) ☐ (b) ☒ (c) ☐
(ii) h(x) = f2(f1(x)): (a) ☐ (b) ☒ (c) ☐
(iii) h(x) = f2(g1(f1(x))): (a) ☒ (b) ☒ (c) ☒
(iv) h(x) = f4(f3(f2(f1(x)))): (a) ☐ (b) ☒ (c) ☐
(v) h(x) = g2(g1(x)): (a) ☒ (b) ☒ (c) ☒

(b) Backpropagation
Below is a deep network with input x. Values x, h1, h2, z are all scalars.
h1 = f1(x),  h2 = f2(x),  z = h1·h2    (1)
Derive the following gradients in terms of x, h1, h2, ∂f1/∂x, ∂f2/∂x.
(i) [?? pts] Derive ∂z/∂h1.
h2. When taking the partial derivative of z with respect to h1, we treat h2 as a constant.
(ii) [?? pts] Derive ∂z/∂h2.
h1. When taking the partial derivative of z with respect to h2, we treat h1 as a constant.
(iii) [?? pts] Derive ∂z/∂x.
h2·(∂f1/∂x) + h1·(∂f2/∂x). We use the product rule and the chain rule.
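A numerical sketch of (iii) via finite differences, with illustrative choices f1(x) = sin(x) and f2(x) = x² (assumptions; any differentiable f1, f2 would do):

import math

f1, df1 = math.sin, math.cos
f2, df2 = (lambda x: x * x), (lambda x: 2 * x)

def z(x):
    return f1(x) * f2(x)                        # z = h1 * h2

x, eps = 1.3, 1e-6
analytic = f2(x) * df1(x) + f1(x) * df2(x)      # h2 * df1/dx + h1 * df2/dx
numeric = (z(x + eps) - z(x - eps)) / (2 * eps)
print(analytic, numeric)                        # agree to ~1e-9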

(c) Deep Network
Below is a deep network with inputs x1, x2. The internal nodes are computed below. All variables are scalar values.
h1 = w11·x1 + w12·x2    h2 = w21·x1 + w22·x2    h3 = w31·x1 + w32·x2
r1 = max(h1, 0)         r2 = max(h2, 0)         r3 = max(h3, 0)
s1 = max(r2, r3)
y1 = exp(r1) / (exp(r1) + exp(s1))
y2 = exp(s1) / (exp(r1) + exp(s1))
z = y1 + y2    (2)
(i) [?? pts] Forward propagation. Now, given x1 = 1, x2 = −2, w11 = 6, w12 = 2, w21 = 4, w22 = 7, w31 = 5, w32 = 1, compute the values of the internal nodes. Please simplify any fractions.

h1 = 2    h2 = −10    h3 = 3    r1 = 2    r2 = 0
r3 = 3    s1 = 3      y1 = 1/(1+e)    y2 = e/(1+e)    z = 1
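A direct Python transcription of the forward pass, reproducing the table's values:

import math

x1, x2 = 1, -2
w11, w12, w21, w22, w31, w32 = 6, 2, 4, 7, 5, 1

h1, h2, h3 = w11*x1 + w12*x2, w21*x1 + w22*x2, w31*x1 + w32*x2
r1, r2, r3 = max(h1, 0), max(h2, 0), max(h3, 0)
s1 = max(r2, r3)
y1 = math.exp(r1) / (math.exp(r1) + math.exp(s1))
y2 = math.exp(s1) / (math.exp(r1) + math.exp(s1))
z = y1 + y2

print(h1, h2, h3, r1, r2, r3, s1)  # 2 -10 3 2 0 3 3
print(y1, 1 / (1 + math.e))        # both equal 1/(1+e)
print(z)                           # 1.0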
(ii) [?? pts] Bounds on variables.
Find the tightest bounds on y1.
y1 ∈ (0, 1). The output of a softmax is a probability distribution. Each element of the output is between 0 and 1.
Find the tightest bounds on z.
z = 1. The sum of the probability distribution is 1.

(iii) [?? pts] Backpropagation. Compute the following gradients analytically. The answer should be an expression of any of the nodes in the network (x1, x2, h1, h2, h3, r1, r2, r3, s1, y1, y2, z) or weights w11, w12, w21, w22, w31, w32.
Hint: Recall that for functions of the form g(x) = 1/(1 + exp(a − x)), ∂g/∂x = g(x)(1 − g(x)). Also, your answer may be a constant or a piecewise function.
∂h1/∂w12 = x2
∂h1/∂x1 = w11
∂r1/∂h1 = 1[h1 > 0]
∂y1/∂r1 = y1(1 − y1)
∂y1/∂s1 = −y1·y2
∂z/∂y1 = 1
∂z/∂x1 = 0
∂s1/∂r2 = 1[r2 > r3]
Expanded solutions for selected examples below:

r1 = max(h1, 0). This is known as a ReLU (rectified linear unit) function. When h1 is positive, r1 = h1, so the derivative is 1. When h1 is negative, r1 is flat, so the derivative is 0.

y1 = exp(r1) / (exp(r1) + exp(s1)) = 1 / (1 + exp(s1 − r1))

dy1/dr1 = (−1 / (1 + exp(s1 − r1))²) × (−exp(s1 − r1)), by the chain rule
        = (1 / (1 + exp(s1 − r1))) × (exp(s1 − r1) / (1 + exp(s1 − r1)))
        = y1(1 − y1) = y1·y2

dy1/ds1 = (−1 / (1 + exp(s1 − r1))²) × (exp(s1 − r1)), by the chain rule. Notice that this is identical to the case above, but with a negative sign missing on the second term.
        = −(1 / (1 + exp(s1 − r1))) × (exp(s1 − r1) / (1 + exp(s1 − r1)))
        = −y1(1 − y1) = −y1·y2

No matter how x1, x2 change, z is always 1, so the gradient with respect to x1 is 0.

When r2 > r3, s1 = r2, so ∂s1/∂r2 = 1. When r2 < r3, s1 = r3, so ∂s1/∂r2 = 0.