
CS 188 Introduction to
Fall 2019 Artificial Intelligence Final Exam
• You have approximately 170 minutes. The time will be projected at the front of the room. You may not leave during the last 10 minutes of the exam.
• The exam is closed book, closed calculator, and closed notes except your two-page crib sheet.
• Mark your answers ON THE EXAM IN THE DESIGNATED ANSWER AREAS. Provide a brief explanation if applicable.
• In the interest of fairness, we want everyone to have access to the same information. To that end, we will not be answering questions about the content. If a clarification is needed, it will be projected at the front of the room. Make sure to periodically check the clarifications.
• For multiple choice questions,
– □ means mark all options that apply
– ○ means mark a single choice
– When selecting an answer, please fill in the bubble or square completely (● and ■)
– If you need to undo a selection, either erase it completely or lay a giant cross (X) over the box. The staff reserves the right to subtract points from ambiguous answers
First name
Last name
SID
Student to your right
Student to your left
For staff use only:
Q1. Coin Stars                          /10
Q2. Five Card Game                      /14
Q3. Vehicle Perception Indication       /10
Q4. Hidden CSP                          /14
Q5. Snail Bayes                         /15
Q6. We Are Getting Close…               /11
Q7. Generating and classifying text     /10
Q8. Deep “Blackjack”                    /16
Total                                   /100

Q1. [10 pts] Coin Stars
In a new online game called Coin Stars, all players are walking around an M x N grid to collect hidden coins, which only appear when you’re on top of them. There are also power pellets scattered across the board, which are visible to all players. If you walk onto a square with a power pellet, your power level goes up by 1, and the power pellet disappears. Players will also attack each other if one player enters a square occupied by another player. In an attack, the player with a higher power level will steal all the coins from the other player. If they have equal power levels, nothing happens. Each turn, players go in order to move in one of the following directions: {N, S, E, W}.
In this problem, you and your friend Amy are playing Coin Stars against each other. You are player 1, and your opponent Amy is player 2. Our state space representation includes the locations of the power pellets (𝑥𝑝𝑗 , 𝑦𝑝𝑗 ) and the following player information:
• Each player’s location (𝑥𝑖, 𝑦𝑖)
• Each player’s power level 𝑙𝑖
• Each player’s coin count 𝑐𝑖
(a)
(i) [2 pts] Suppose a player wins by collecting more coins at the end of a number of rounds, so we can formulate this as a minimax problem with the value of the node being 𝑐1 − 𝑐2. Consider the following game tree where you are the maximizing player (maximizing your net advantage, as seen above) and the opponent is the minimizer. Assuming both players act optimally, if a branch can be pruned, fill in its square completely; otherwise leave the square unmarked.
● None of the above can be pruned
If you traverse the tree with 𝛼-𝛽 pruning, you will see that no branches can be pruned.
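For reference, a minimal Python sketch of minimax with alpha-beta pruning; the tree and leaf values below are hypothetical illustrations, not the game tree pictured in the exam.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    value: Optional[float] = None          # leaf utility, e.g. c1 - c2
    children: List["Node"] = field(default_factory=list)

def alpha_beta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    if not node.children:                  # leaf node
        return node.value
    best = float("-inf") if maximizing else float("inf")
    for child in node.children:
        val = alpha_beta(child, alpha, beta, not maximizing)
        if maximizing:
            best = max(best, val)
            alpha = max(alpha, best)
        else:
            best = min(best, val)
            beta = min(beta, best)
        if alpha >= beta:                  # prune the remaining children
            break
    return best

# Tiny hypothetical tree: a maximizer over two minimizer nodes.
root = Node(children=[Node(children=[Node(value=3), Node(value=5)]),
                      Node(children=[Node(value=2), Node(value=9)])])
print(alpha_beta(root))                    # 3 (the leaf with value 9 is pruned)
```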
(ii) [1 pt] Suppose that instead of the player with more coins winning, every player receives payout equal to the number of coins they’ve collected. Can we still use a multi-layer minimax tree (like the one above) to find the optimal action?
○ Yes, because the update in payout policy does not affect the minimax structure of the game.
○ Yes, but not for the reason above
○ No, because we can no longer model the game under the updated payout policy with a game tree.
● No, but not for the reason above
No, because the game is no longer zero-sum: your opponent obtaining more coins does not necessarily make you worse off, and vice versa. We can still model this game with a game-tree, where each node contains a tuple of two values, instead of a single value. But this means the tree is no longer a minimax tree.
An example of using the minimax tree but not optimizing the number of coins collected: when given a choice between gathering 3 coins or stealing 2 coins from the opponent, the minimax solution with 𝑐1 − 𝑐2 will steal the 2 coins (net gain of 4), even though this will cause it to end up with fewer coins.

(b) Suppose we want to train a Reinforcement Learning (RL) agent using Q-learning to play against a wide range of human participants, with the objective to obtain more coins than its opponent like in (a.i).
Your friend Blake discovers that a computer scientist and expert Coin Stars player Oswald has published (open-sourced) his policy 𝜋𝑒, which consists of a set of expert features 𝐹𝑒 = [𝑓1, 𝑓2, …, 𝑓𝑛], all of which can be computed using the current state representation mentioned in the problem statement. Oswald has hand-tuned the weights 𝑊𝑒 = [𝑤1, 𝑤2, …, 𝑤𝑛] of the policy 𝜋𝑒 such that 𝜋𝑒 is near optimal assuming the opponent also plays optimally.
You decide to use the same set of features as Oswald’s 𝐹𝑒, and come up with four proposed training procedures, all of which will learn the optimal policy (i.e. the best response) against that particular opponent:
I Playing against an opponent who plays 𝜋𝑒, Oswald’s expert-tuned policy that assumes our agent plays optimally
II Playing against an opponent who plays a fixed policy 𝜋𝑝, a sub-optimal policy that moves to maximize the collection of power pellets, and chases its opponent afterwards.
III Playing against an opponent who plays a fixed policy 𝜋𝑐, a sub-optimal policy that ignores power pellets and its opponent, and moves to maximize the collection of hidden coins.
(i) [1 pt] With which of the training procedures is our agent most likely to learn a set of weights that can achieve the highest win rate against Oswald, an expert human participant who plays according to 𝜋𝑒?
● (I)
○ (II)
○ (III)
○ There is not enough information to distinguish among the three
Procedure (I) is the direct fit to the expert human players, and thus is selected. Procedure (II)’s and (III)’s opponents are not expert human players, thus it’s unlikely that the policies trained to cope with these opponents can win as frequently as the policy from Procedure (I).
(ii) [1 pt] Is either of (II) or (III) guaranteed to result in an agent who achieves a higher win rate than (I) when playing against novice human participants who play sub-optimally, in an unknown way?
○ Yes, because (II) and (III) are trained with sub-optimal opponents, whereas (I) is trained with expert opponents.
○ Yes, but not for the reason above.
● No, because the novice human may not be sub-optimal in the same way as the opponents in (II) or (III).
○ No, but not for the reason above.
Procedure (I) could yield a policy that is robust enough to still beat sub-optimal human players, but the policy also makes the incorrect assumption that the human opponent plays optimally, which is not true for novice human players. There are so many ways to play sub-optimally that, similar to Procedure (I), with high probability Procedures (II) and (III) make incorrect assumptions about the novice human player’s strategy and therefore cannot guarantee a higher win rate.
(iii) [2 pts] Suppose instead of training an RL agent against an agent with a fixed strategy, we instead train the RL agent against itself. That is, the RL agent plays against an opponent with its weights, and the features are computed from each agent’s point of view.
Which of the following are potential issues in this self-play Reinforcement Learning approach?
■ The RL agent’s weights are not guaranteed to converge because it is playing against itself.
□ Because the opponent you play against can be arbitrarily bad, the RL agent cannot improve at the game at all.
□ Because the RL agent is playing against the same opponent (itself), the best (highest reward) policy from its perspective does not change over the course of training.
□ Because the RL agent is playing against itself, it does not need to do exploration (for example, it can just do the greedy action with respect to its current Q-values).
○ None of the above
The 1st choice: suppose there exist three strategies in the game A, B, C, such that A beats B, B beats C, and C beats A. (For an analogy, consider rock-paper-scissors.) If the agent is currently playing B, then it learns that playing C is better, and will change its weights to play C. If the agent is currently playing C, then playing A will look better, and so it will play A. This will cause it to play B, then C, then A, and so forth, ad infinitum. So this choice is true.
The 2nd choice: even if the opponent is arbitrarily bad, you still might learn some basic skills such as how to gather coins. Then the opponent is better in the next iteration, and the RL agent might get better yet.
The 3rd choice: again, consider the rock-paper-scissors example. Over time, the best policy (against the previous iteration of the agent) changes.
The 4th choice: You need to do exploration, as, for example, your initial Q-values can be arbitrarily bad.
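As a toy illustration of the non-convergence issue (not part of the exam), one can simulate iterated best responses in rock-paper-scissors and watch the "policy" cycle forever; the names and loop below are purely illustrative.

```python
# Toy example: iterated best response in rock-paper-scissors cycles forever,
# mirroring why self-play weights need not converge.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

policy = "rock"
history = []
for step in range(9):
    history.append(policy)
    policy = BEATS[policy]      # best response to the current policy
print(history)                  # rock, paper, scissors, rock, ... (never settles)
```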

(c) [1 pt] For this part only, suppose the game randomly reveals some hidden coins throughout the game, making them visible to both you and your friend. Under which of the conditions below will both players benefit the most from the feature “Distance to closest visible coins”?
● Coins are sparse (∼ 1 coin per 50 squares), reveal frequency high (∼ 1 coin becomes visible per 1 move)
○ Coins are sparse (∼ 1 coin per 50 squares), reveal frequency low (∼ 1 coin becomes visible per 50 moves)
○ Coins are dense (∼ 1 coin per 2 squares), reveal frequency high (∼ 1 coin becomes visible per 1 move)
○ Coins are dense (∼ 1 coin per 2 squares), reveal frequency low (∼ 1 coin becomes visible per 50 moves)
● Equally useful in all conditions above
Sparse coins make information about their locations more valuable: if coins are dense, agents are able to collect lots of coins without knowing their locations, just by walking on the grid. A higher reveal frequency is strictly better because the players can make their plans more efficient with strictly more information. We also accepted “Equally useful in all conditions above” as an answer, since “benefit” can be interpreted as a better chance to win the game, and this situation equally benefits both players in all coin situations.
(d) Using the original problem setup, we have the following features and weights for a given state 𝑠:

Feature                                                                        Initial Weight
𝑓1(𝑠,𝑎) = |𝑥1 − 𝑥2| + |𝑦1 − 𝑦2|                                                𝑤1 = 1
𝑓2(𝑠,𝑎) = 𝑙1 − 𝑙2                                                              𝑤2 = 4
𝑓3(𝑠,𝑎) = 1 ∕ (the Manhattan distance from player 1 to their closest pellet)   𝑤3 = 2
(i) [1 pt] Calculate 𝑄(𝑠, 𝑎) for the q-state where player 1 is at (1, 1) and player 2 is at (5, 4). Player 1 has power level 7, and player 2 has power level 3. The closest pellet to player 1 is located at (2, 1).
○ 24   ● 25   ○ 26   ○ 27   ○ 28   ○ A value different from all of the above
○ There is not enough information to calculate this value
𝑄(𝑠, 𝑎) = 1 ⋅ (|1 − 5| + |1 − 4|) + 4 ⋅ (7 − 3) + 2 ⋅ (1∕1) = 1 ⋅ 7 + 4 ⋅ 4 + 2 ⋅ 1 = 25
(ii) [1 pt] We observe a sample (𝑠, 𝑎, 𝑟) for the q-state in the previous part with 𝑟 = 23. Assuming a learning rate of 𝛼 = 0.5 and discount factor of 𝛾 = 0.5, calculate the new weight 𝑤1_𝑛𝑒𝑤 after a single update of approximate Q-Learning. This part will be graded independently of the previous part, so please check your work before proceeding.
𝑤1_𝑛𝑒𝑤 =
○ -13   ○ -6   ○ 1   ○ 8   ○ 15   ○ A value different from all of the above
● There is not enough information to calculate this value
We do not have any information about the values of features for any outgoing state-action pair 𝑠′, 𝑎′ from this current 𝑠, 𝑎 state-action pair.
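A minimal Python sketch of the feature-based Q computation and the approximate Q-learning weight update for part (d); the next-state value is a hypothetical placeholder, which is exactly the missing piece that makes part (d)(ii) uncomputable.

```python
# Sketch of approximate Q-learning with the features/weights from part (d).
def q_value(weights, features):
    return sum(w * f for w, f in zip(weights, features))

weights = [1.0, 4.0, 2.0]
features = [abs(1 - 5) + abs(1 - 4),    # f1: Manhattan distance between players = 7
            7 - 3,                      # f2: power level difference = 4
            1 / 1]                      # f3: 1 / distance to closest pellet = 1
q_sa = q_value(weights, features)       # 1*7 + 4*4 + 2*1 = 25

alpha, gamma, r = 0.5, 0.5, 23
max_q_next = 0.0                        # hypothetical: the exam gives no features for s', a'
difference = (r + gamma * max_q_next) - q_sa
new_weights = [w + alpha * difference * f for w, f in zip(weights, features)]
```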

Q2. [14 pts] Five Card Game
There is a two-player zero-sum card game called Five Card Game. Player A plays odd-numbered turns (and thus plays first), and he initially has the cards labeled 1, 3, 5 in his hand. Player B plays even-numbered turns, and she initially has the cards labeled 2 and 4 in her hand. They take turns to give out one of their cards to the judge based on their policies. The game ends after 5 turns and forms the sequence 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5, where 𝑋1, 𝑋3, 𝑋5 is a permutation of 1, 3, 5 provided by player A, and 𝑋2, 𝑋4 is a permutation of 2, 4 provided by player B.
However, neither of the two players have access to what cards their opponent plays throughout the game. The only hint they receive is from the judge: when 𝑋𝑡 has been determined by either of the players (𝑡 ≥ 2), the judge of the game will compare 𝑋𝑡 and 𝑋𝑡−1, and assign one point to either of the players. With probability 𝑝𝑡 (0 ≤ 𝑝𝑡 ≤ 1), the point will be given to the player whose card is larger, and with probability 1 − 𝑝𝑡, the point will be given to the player whose card is smaller. 𝑝2, 𝑝3, 𝑝4, 𝑝5 are pre-defined before the game and everyone knows their values. Both the players and the audience know the point assignment right after the judge assigns it. The player who wins more points wins the game.
We denote the point that player A earned at each time step as 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴 (value can be either 0 or 1), and thus all the variables are determined in the following order: 𝑋1, 𝑋2, 𝑈2𝐴, 𝑋3, 𝑈3𝐴, 𝑋4, 𝑈4𝐴, 𝑋5, 𝑈5𝐴. Note player A determines 𝑋3 after knowing 𝑈2𝐴, and player B determines 𝑋4 after knowing 𝑈2𝐴 and 𝑈3𝐴, and player A determines 𝑋5 after knowing 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴.
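For concreteness, here is a small Python sketch of the judge’s scoring rule described above (the helper name and the use of code are ours, not part of the exam); with 𝑝𝑡 = 1 it reproduces the illustration in the next paragraph.

```python
import random

# At each t >= 2, with probability p[t] the point goes to the owner of the larger
# of X_{t-1}, X_t; otherwise to the owner of the smaller card. Odd positions
# (1, 3, 5) belong to player A, even positions (2, 4) to player B.
def sample_points(X, p, rng=random.random):
    U_A = []
    for t in range(2, 6):                                   # turns 2..5
        larger_is_A = (X[t - 1] > X[t - 2]) == (t % 2 == 1)
        reward_larger = rng() < p[t]
        U_A.append(int(larger_is_A == reward_larger))
    return U_A

# X = 3,2,5,4,1 with p_t = 1 gives U^A = [1, 1, 1, 0], matching the illustration below.
print(sample_points([3, 2, 5, 4, 1], {2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}, rng=lambda: 0.0))
```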
As an illustration, if 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5 is 3, 2, 5, 4, 1, and 𝑝𝑡 = 1 for all 𝑡 (so that the bigger card’s owner always gets the point), then 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴 is 1, 1, 1, 0 and player A wins this game. In this example, 𝑈2𝐴 = 1 because 𝑋2 < 𝑋1 and hence the point is awarded to player A, while 𝑈5𝐴 = 0 because 𝑋5 < 𝑋4 and hence the point is awarded to player B.
(a) Suppose 𝑝𝑡 = 1 for all 𝑡.
(i) [1 pt] Each player uses their own optimal deterministic policy and is aware that their opponent is also using the optimal deterministic policy. How many points will Player A obtain? Hint: Drawing a game tree might help you here.
Answer: 2. Player A is the maximizer, and player B is the minimizer. Utility of the root node = 2.
(ii) [1 pt] Player B decides to give out her cards randomly (with equal probability for each ordering). If Player A becomes aware of this, what is the expected number of points Player A will obtain if he plays optimally?
Answer: 2.5. Player A is again the maximizer, but here player B has the chance nodes. Utility of the root node = 2.5.
Now we assume both the players have their own random policies. Suppose a player’s policy for each 𝑋𝑡 is a conditional probability table, conditioned on the MINIMUM set of DEPENDENT information they know when playing the card 𝑋𝑡. Then at each time step 𝑡, they randomly sample which card to draw according to their probability table. During the game, each player only knows their own policy. In addition, the policies do not change over time.
We want to construct a Bayes Net with the minimum number of edges to investigate the relations between the nodes 𝑋1, 𝑋2, 𝑈2𝐴, 𝑋3, 𝑈3𝐴, 𝑋4, 𝑈4𝐴, 𝑋5, 𝑈5𝐴.
Hint 1: Drawing the Bayes Net with nodes 𝑋1, 𝑋2, 𝑈2𝐴, 𝑋3, 𝑈3𝐴, 𝑋4, 𝑈4𝐴, 𝑋5, 𝑈5𝐴 while answering part (b) will help you in part (c). But only your response to the questions will be graded. No partial credit for the drawings.
Hint 2: Do Player A and Player B know each other’s card sequence exactly? Do they know their own card sequence exactly?
Hint 3: As noted in the problem statement, before any player determines his or her turn 𝑋𝑡, he or she knows all the existing hints 𝑈2𝐴, …, 𝑈𝑡−1𝐴.
The Bayes Net: special note about 𝑈2𝐴 → 𝑋3: As indicated in the problem, player A determines 𝑋3 after getting the value of 𝑈2𝐴 from the judge. 𝑈2𝐴 would be a valuable indication of 𝑋2 played by player B if 𝑋1 = 3. However, for 𝑋4, player B has no other choice: she only has the card that is different from 𝑋2 at hand.
(b) (i) [1 pt] Which of the following are directly reflected in the Bayes net’s probability tables (including prior probability tables and conditional probability tables)?
■ The judge’s point assignment probability.
□ The player A’s winning probability.
○ None of the above
(ii) [2 pts] For each table, mark the minimal set of variables that each probability table should be conditioned on.
Player B’s policy probability table to determine 𝑋2:
□ 𝑋1   ● None of the above
Player B has no information about 𝑋1, because the judge has not stepped in to give hints.
Player A’s policy probability table to determine 𝑋3:
■ 𝑋1   □ 𝑋2   ■ 𝑈2𝐴   ○ None of the above
𝑋1 eliminates some options for 𝑋3, and 𝑈2𝐴 gives a clue to what 𝑋2 could have been, and thus gives a clue to what 𝑋3 would be.
Player B’s policy probability table to determine 𝑋4:
□ 𝑋1   ■ 𝑋2   □ 𝑋3   □ 𝑈2𝐴   □ 𝑈3𝐴   ○ None of the above
Note that 𝑋4 is precisely determined by 𝑋2, and hence it does not condition on any other variables.
Player A’s policy probability table to determine 𝑋5:
■ 𝑋1   □ 𝑋2   ■ 𝑋3   □ 𝑋4   □ 𝑈2𝐴   □ 𝑈3𝐴   □ 𝑈4𝐴   ○ None of the above
𝑋1 and 𝑋3 reduce the options for 𝑋5 to a single option.
(c) Analyze the independence or conditional independence between variables.
Hint: Feel free to use the Bayes Net you constructed from part (b), but this part will be graded independently. Please double check your work in part (b).
(i) [1 pt] Which of the following pairs of variables are independent, given no observations?
□ 𝑋2 and 𝑋3   ■ 𝑋1 and 𝑋4   ○ None of the above
(ii) [1 pt] Which of the following pairs of variables are independent, given only 𝑈2𝐴?
□ 𝑋2 and 𝑋3   □ 𝑋1 and 𝑋4   ● None of the above
By using D-Separation on the Bayes Net constructed for this question, we can obtain the correct answer. For the second question, although the route 𝑋2 − 𝑈2𝐴 − 𝑋3 is cut off, there is still another path 𝑋2 − 𝑈2𝐴 − 𝑋1 − 𝑋3 that connects them.
Now the game is over and you have only observed the values of 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴, and want to decode the values of 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5. You want to solve this problem by searching over all the possible states. Each of your states contains the card sequence at some point during the game. For example, one search path from the start state to one goal state could be: () → (3,) → (3, 2) → (3, 2, 5) → (3, 2, 5, 4) → (3, 2, 5, 4, 1) (Goal State).
(d) [1 pt] How many goal states are there in your search problem?
Answer: 12. There are 12 leaf nodes in the game tree, and hence there are 12 goal states.
Which data structure is more similar to the search graph?
● Your game tree from part (a)
○ Your Bayes net from part (b)
There is no valid start state or goal state in the Bayes Net. In the game tree, the root node is the start state, while the leaf nodes are the goal states.
Now we wish to decode 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5 given 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴 by solving the following problem:
argmax over 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5 of 𝑃(𝑈2𝐴|𝑋1, 𝑋2) 𝑃(𝑈3𝐴|𝑋2, 𝑋3) 𝑃(𝑈4𝐴|𝑋3, 𝑋4) 𝑃(𝑈5𝐴|𝑋4, 𝑋5)
(e) We then assign a proper cost to each edge in the search graph.
(i) [2 pts] Which is a correct edge cost configuration for the edge connecting the states (𝑋1, ..., 𝑋𝑡−1) and (𝑋1, ..., 𝑋𝑡) (where 𝑡 ≥ 2)? Note a correct configuration means that the optimal path of your search problem should always be the correct argmax solution above for any valid player policy probability table values and 𝑝𝑡 values.
○ 𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡)
○ −𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡)
○ ln 𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡)
● − ln 𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡)
○ 𝑒^(𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡))
○ 𝑒^(−𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡))
○ None of the above
Turning multiplication (argmax) into summation (total cost in search), we need the logarithm. Turning argmax into argmin (search finds the minimum cost), we need the negative sign.
Now suppose we observe that 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴 all equal 1.
(ii) [2 pts] What is the value for 𝑃(𝑈5𝐴|𝑋4, 𝑋5) used to compute the edge cost from the state (3, 2, 5, 4) to the state (3, 2, 5, 4, 1)? Your answer should just be the value of 𝑃(𝑈5𝐴|𝑋4, 𝑋5) rather than the edge cost computed using your answer to part (e)(i).
○ 𝑝5   ● 1 − 𝑝5   ○ Cannot be determined only by 𝑝𝑡 - we also need the players’ policy probability table.
What is the value for 𝑃(𝑈3𝐴|𝑋2, 𝑋3) used to compute the edge cost from the state (3, 2) to the state (3, 2, 5)? Your answer should just be the value of 𝑃(𝑈3𝐴|𝑋2, 𝑋3) rather than the edge cost computed using your answer to part (e)(i).
● 𝑝3   ○ 1 − 𝑝3   ○ Cannot be determined only by 𝑝𝑡 - we also need the players’ policy probability table.
Note we still did not assign the edge cost to the edge connecting () and (𝑋1,).
Let us assign 1 as the cost to all edges of this type.
(f) [2 pts] Now you will conduct A* search on your constructed search graph with the proper edge cost from part (e). The heuristic for each node will be set to be the number of time steps remaining to the game end. For example, the heuristic for the state (3, 2, 5) is 2 (2 steps from the game end). Which of the following statements is correct?
○ Neither tree search nor graph search will be complete.
● Both tree search and graph search will be complete, but neither is guaranteed to be optimal.
○ Both tree search and graph search will be complete, and only the tree search is guaranteed to be optimal.
○ Both the tree search and the graph search will be complete and optimal.
In this question, the graph search is exactly the same as the tree search because the search graph is a tree and all the edges are directed from the parent node to the child node. The heuristic is not admissible because if 𝑝𝑡 > 1∕𝑒, then − ln 𝑝𝑡 < 1. Similarly, if 𝑝𝑡 < 1 − 1∕𝑒, then − ln(1 − 𝑝𝑡) < 1. So this heuristic can overestimate the true cost. A counterexample for optimality (for both tree search and graph search) can be: 𝑝2 = 𝑝3 = 1, 𝑝4 = 0.5𝑒^𝜖, 𝑝5 = (1∕𝑒)𝑒^𝛿, where 𝜖 and 𝛿 are very small positive numbers satisfying 𝛿 > 2𝜖, and we have 𝑈2𝐴 = 𝑈3𝐴 = 𝑈4𝐴 = 1 and 𝑈5𝐴 = 0. Then the optimal path should reach the goal state (1, 2, 3, 4, 5). But the A* search will return the path reaching the goal state (1, 2, 5, 4, 3) or (3, 4, 5, 2, 1) (the tie breaker will decide which). It is worth pointing out that if all the edge costs are set to be − log𝑎 𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡), and if 1 < 𝑎 < 2, both the tree search and the graph search versions of A* will be complete and optimal. Try a proof if you are interested. Hint: the edge costs at the same tree depth can only take two values 𝐴 and 𝐵, and we have 𝑎^(−𝐴) + 𝑎^(−𝐵) = 1.

Q3. [10 pts] Vehicle Perception Indication
A vehicle is trying to identify the situation of the world around it using a set of sensors located around the vehicle. Each sensor reading is based off of an object’s location (LOC) and an object’s movement (MOVE). The sensor reading will then produce various values for its predicted location, predicted movement, and image rendered, which is then sent back to the user.
(a) The vehicle takes an action, and we assign some utility to the action based on the object’s location and movement. Possible actions are MOVE TOWARDS, MOVE AWAY, and STOP. Suppose the decision network faced by the vehicle is the following.
(i) [2 pts] Based on the diagram above, which of the following could possibly be true?
■ VPI (Image) = 0
□ VPI (SRead) < 0
□ VPI (SMove, SRead) > VPI (SRead)
■ VPI (Feedback) = 0
○ None of the above
VPI(Image) = 0 because there is no active path connecting Image and U.
VPI cannot be negative, so option 2 is not selected.
VPI(SMove, SRead) = VPI(SMove | SRead) + VPI(SRead), therefore we can cancel VPI(SRead) from both sides, and it becomes asking whether VPI(SMove | SRead) > 0. And we can see that cannot be true, because shading in SRead, there is no active path connecting SMove and U.
There is an active path connecting Feedback and U, therefore VPI(Feedback) ≥ 0. It could still be 0 because an active path only gives the possibility of > 0.
(ii) [2 pts] Based on the diagram above, which of the following must necessarily be true?
■ VPI (Image) = 0
□ VPI (SRead) = 0
■ VPI (SMove, SRead) = VPI (SRead)
□ VPI (Feedback) = 0
○ None of the above
VPI(Image) = 0 because there is no active path connecting Image and U.
VPI(SRead) could be > 0 because SRead-MOVE-U is an active path between SRead and U.
VPI(SMove, SRead) = VPI(SMove | SRead) + VPI(SRead), therefore we can cancel VPI(SRead) from both sides, and it becomes asking whether VPI(SMove | SRead) = 0. And we can see that must be true, because shading in SRead, there is no active path connecting SMove and U.
VPI(Feedback) could be >0 because feedback-SLoc-SRead-MOVE-U is an active path
Let’s assume that your startup has less money, so we use a simpler sensor network. One possible sensor network can be represented as follows.
You have distributions of 𝑃(LOC), 𝑃(MOVE), 𝑃(𝑆𝑅𝑒𝑎𝑑|LOC, MOVE), 𝑃(𝑆𝐿𝑜𝑐|𝑆𝑅𝑒𝑎𝑑) and utility values 𝑈(𝑎, 𝑙, 𝑚).
(b) Complete the equation for determining the expected utility for some ACTION 𝑎:
𝐸𝑈(𝑎) = (i) (ii) (iii) 𝑈(𝑎, 𝑙, 𝑚)
(i) [1 pt]
● ∑𝑙 𝑃(𝑙)   ○ ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙)   ○ ∑𝑙 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙)   ○ 1
(ii) [1 pt]
● ∑𝑚 𝑃(𝑚)   ○ ∑𝑚 𝑃(𝑠𝑙𝑜𝑐|𝑚)   ○ 1
(iii) [1 pt]
○ + ∑𝑙 ∑𝑚 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙, 𝑚)𝑃(𝑙)𝑃(𝑚)
○ + ∑𝑙 ∑𝑚 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙)𝑃(𝑠𝑙𝑜𝑐|𝑚)
○ ∗ ∑𝑙 ∑𝑚 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙)𝑃(𝑠𝑙𝑜𝑐|𝑚)
● ∗ 1
We can eliminate SRead and SLoc via marginalization, so they don’t need to be included in the expression.
𝐸𝑈(𝑎) = ∑𝑙 𝑃(𝑙) ∑𝑚 𝑃(𝑚) 𝑈(𝑎, 𝑙, 𝑚)
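A minimal Python sketch of this expected-utility computation; the probability tables and utility values below are hypothetical placeholders, not the ones from the exam figure.

```python
# Sketch of EU(a) = sum_l P(l) * sum_m P(m) * U(a, l, m) with hypothetical tables.
P_loc = {"near": 0.3, "far": 0.7}            # P(LOC): illustrative numbers only
P_move = {"moving": 0.4, "still": 0.6}       # P(MOVE): illustrative numbers only

def utility(action, loc, move):              # U(a, l, m): illustrative placeholder
    return 10 if (action == "STOP" and loc == "near" and move == "moving") else 1

def expected_utility(action):
    return sum(P_loc[l] * P_move[m] * utility(action, l, m)
               for l in P_loc for m in P_move)

best_action = max(["MOVE TOWARDS", "MOVE AWAY", "STOP"], key=expected_utility)
```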
(c) Your colleague Bob invented a new sensor to observe values of 𝑆𝐿𝑜𝑐.
(i) [1 pt] Suppose that your company had no sensors till this point. Which of the following expressions is equivalent to VPI(𝑆𝐿𝑜𝑐)?
■ VPI(𝑆𝐿𝑜𝑐) = (∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐) MEU(𝑆𝐿𝑜𝑐 = 𝑠𝑙𝑜𝑐)) − max𝑎 EU(𝑎)
■ VPI(𝑆𝐿𝑜𝑐) = MEU(𝑆𝐿𝑜𝑐) − MEU(∅)
□ VPI(𝑆𝐿𝑜𝑐) = max𝑠𝑙𝑜𝑐 MEU(𝑆𝐿𝑜𝑐 = 𝑠𝑙𝑜𝑐) − MEU(∅)
○ None of the above
Option 2 is correct by definition, and option 1 is the expanded version of option 2.
(ii) [2 pts] Gaagle, an established company, wants to sell your startup a device that gives you 𝑆𝑅𝑒𝑎𝑑. Given that you already have Bob’s device (that gives you 𝑆𝐿𝑜𝑐), what is the maximum amount of money you should pay for Gaagle’s device? Suppose you value $1 at 1 utility.
□ VPI(𝑆𝑅𝑒𝑎𝑑)
■ VPI(𝑆𝑅𝑒𝑎𝑑) − VPI(𝑆𝐿𝑜𝑐)
□ VPI(𝑆𝑅𝑒𝑎𝑑, 𝑆𝐿𝑜𝑐)
■ VPI(𝑆𝑅𝑒𝑎𝑑, 𝑆𝐿𝑜𝑐) − VPI(𝑆𝐿𝑜𝑐)
■ VPI(𝑆𝑅𝑒𝑎𝑑|𝑆𝐿𝑜𝑐)

○ None of the above
Choices 4 and 5 are correct by definition.
Choice 2 is true because VPI(𝑆𝐿𝑜𝑐 | 𝑆𝑅𝑒𝑎𝑑) = 0, and thus
VPI(𝑆𝑅𝑒𝑎𝑑) = VPI(𝑆𝑅𝑒𝑎𝑑) + 0 = VPI(𝑆𝑅𝑒𝑎𝑑) + VPI(𝑆𝐿𝑜𝑐|𝑆𝑅𝑒𝑎𝑑) = VPI(𝑆𝑅𝑒𝑎𝑑, 𝑆𝐿𝑜𝑐), which makes choice 2 the same as choice 4.
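For reference, a minimal Python sketch of how a VPI value can be computed as (∑𝑒 𝑃(𝑒) MEU(𝐸 = 𝑒)) − MEU(∅); all tables below are hypothetical, not the exam’s network.

```python
# Sketch of VPI(E) with hypothetical prior, sensor model, and utilities.
P_state = {"near": 0.3, "far": 0.7}
P_evidence_given_state = {("on", "near"): 0.9, ("on", "far"): 0.2,
                          ("off", "near"): 0.1, ("off", "far"): 0.8}
U = {("STOP", "near"): 10, ("STOP", "far"): 2, ("GO", "near"): -5, ("GO", "far"): 8}
actions = ["STOP", "GO"]

def meu(belief):
    return max(sum(belief[s] * U[(a, s)] for s in belief) for a in actions)

def vpi():
    total = 0.0
    for e in ["on", "off"]:
        p_e = sum(P_evidence_given_state[(e, s)] * P_state[s] for s in P_state)
        posterior = {s: P_evidence_given_state[(e, s)] * P_state[s] / p_e for s in P_state}
        total += p_e * meu(posterior)
    return total - meu(P_state)        # VPI is never negative
```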

Q4. [14 pts] Hidden CSP
We have studied lots of algorithms to solve a CSP with a list of known constraints. But can we solve a CSP without explicitly
knowing the constraints?
The wildfire reaches Berkeley in the middle of CS188 lecture, and all 𝑁 ≥ 1 students need to be evacuated on 𝑀 ≥ 1 buses
(each with capacity 𝑁).
Suppose the only constraints are 𝑘 ≥ 0 pairs of students who have to be on two different buses (for example, “Zelda and Yoda” constitute one pair of students who have to be on two different buses). You don’t know these constraints explicitly, but you can observe the yelling from the buses, which is correlated with whether all constraints are satisfied.
In this question, 𝑆 = +𝑠 means that the current assignment satisfies all the constraints, and 𝑆 = −𝑠 means it violates at least 1 constraint. We also denote 𝑌 = +𝑦 as the event where there is yelling from the buses, and 𝑌 = −𝑦 as the event where there is no yelling from the buses.
(a) Your friend Alice suggests starting with a set of random assignments 𝑆0 (where you randomly assign each of 𝑁 students to one of 𝑀 buses).
(i) [1 pt] What is 𝑝0 = 𝑃 (𝑆0 = +𝑠), the probability that the randomly generated assignment satisfies all the constraints, for any choice of 𝑁, 𝑀, 𝑘. (Hint: try writing down the expression for the probability that it satisfies two different
constraints)
○ 0   ○ 1 − (1∕(𝑀 choose 2))^𝑘   ○ 1 − (1∕𝑀)^𝑘   ○ (1 − 1∕(𝑀 choose 2))^𝑘   ○ (1 − 1∕𝑀)^𝑘   ○ 1   ● None of the above
We are presenting solutions from 3 different angles.
Solution 1: Observe the unwarranted independence assumption in the Chain Rule.
First of all, 𝑃(violating 𝑐𝑖) = 1∕𝑀 for every constraint 𝑐𝑖, because that is the probability of assigning the two students in conflict to the same bus.
1 − (1∕𝑀)^𝑘 is wrong because it is one minus the probability of violating all constraints (which is already the wrong event), and it makes unwarranted independence assumptions.
The bogus solution (1 − 1∕𝑀)^𝑘 is wrong because it makes the unwarranted independence assumption that non-violations of different constraints are independent.
We know 𝑃(not violating 𝑐𝑖) = 1 − 1∕𝑀. But
𝑃(not violating 𝑐𝑖, not violating 𝑐𝑗, not violating 𝑐ℎ)
= 𝑃(not violating 𝑐ℎ | not violating 𝑐𝑗, not violating 𝑐𝑖) ∗ 𝑃(not violating 𝑐𝑗 | not violating 𝑐𝑖) ∗ 𝑃(not violating 𝑐𝑖)
≠ 𝑃(not violating 𝑐ℎ) ∗ 𝑃(not violating 𝑐𝑗) ∗ 𝑃(not violating 𝑐𝑖) in general.
Since 𝑃(not violating 𝑐𝑖, not violating 𝑐𝑗, not violating 𝑐ℎ) ≠ (1 − 1∕𝑀)^3 in general, it also is not true for any 𝑘 ≥ 3.
-------------------
Solution 2: Observe how the probability should intuitively behave.
It is clear that 0 and 1 are wrong, because if k = 0 then the probability is 1, and if M = 1 and k >= 1 then the probability is 0. So the probability must depend on M and k. As k -> infinity with M fixed, the probability should go to 0. Indeed, for M = 1, k >= 1, it should be 0! Plugging this in, we immediately eliminate the second and third probabilities. Now, let's consider the other two probabilities. Suppose that we have N = 3 students, M = 2 buses, and k = 3 constraints. If the constraints are:
Student A not on same bus as student B. Student B not on same bus as student C. Student A not on same bus as student C.
Then the probability is 0. Yet neither of the probabilities is 0 in this case. Therefore, it has to be none of the above.
-------------------
Solution 3: Find one good counterexample.
With M = 2, k = 3, we see that none of the expressions work, as the probability depends on the constraints themselves. For example, if the constraints are:
Student A not on same bus as student B. Student B not on same bus as student C. Student A not on same bus as student C.
Then the probability is 0.

If the constraints are: Student A not on same bus as student B. Student B not on same bus as student C. Student C not on same bus as student D.
Then the probability is 1/8 > 0. So we can see that there’s no expression for the probability that depends only on M, N, k.
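A brute-force Python check of the two counterexamples above (the helper function is ours, not part of the exam); it illustrates why the satisfaction probability cannot be written as a function of M, N, k alone.

```python
from itertools import product

def p_satisfied(num_students, num_buses, constraints):
    assignments = list(product(range(num_buses), repeat=num_students))
    ok = sum(all(a[i] != a[j] for i, j in constraints) for a in assignments)
    return ok / len(assignments)

# Triangle A-B, B-C, A-C: unsatisfiable with 2 buses -> probability 0.
print(p_satisfied(3, 2, [(0, 1), (1, 2), (0, 2)]))   # 0.0
# Chain A-B, B-C, C-D: probability 1/8 with 2 buses.
print(p_satisfied(4, 2, [(0, 1), (1, 2), (2, 3)]))   # 0.125
```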
(ii) [1 pt] Since we don’t know the constraints explicitly, our observation of yelling is the only way to infer whether all constraints are satisfied.
Alice, an expert in the human behavior of yelling, constructs the table 𝑃(𝑌𝑡|𝑆𝑡) below to describe the relation between yelling (𝑌𝑡) and the hidden assignment being satisfying (𝑆𝑡). What is 𝑃(𝑆0 = +𝑠|𝑌0 = +𝑦)?

𝑆𝑡    𝑌𝑡    𝑃(𝑌𝑡|𝑆𝑡)
+𝑠    +𝑦    0.1
+𝑠    −𝑦    –
−𝑠    +𝑦    –
−𝑠    −𝑦    0.2

Note: “–” in the tables means the value is not given in the problem.

○ 0.1
○ 0.1(𝑝0) + 0.2(1 − 𝑝0)
○ 0.1(𝑝0) + 0.8(1 − 𝑝0)
○ 0.1(𝑝0) ∕ (0.1(𝑝0) + 0.2(1 − 𝑝0))
● 0.1(𝑝0) ∕ (0.1(𝑝0) + 0.8(1 − 𝑝0))
○ None of the above
Please note that we can use the law of total probability to get 𝑃(+𝑦| − 𝑠) = 1 − 𝑃(−𝑦| − 𝑠) = 1 − 0.2 = 0.8.
𝑃(+𝑦) = 𝑃(+𝑦, +𝑠) + 𝑃(+𝑦, −𝑠) = 𝑃(+𝑦| + 𝑠)𝑃(+𝑠) + 𝑃(+𝑦| − 𝑠)𝑃(−𝑠) = 0.1 ∗ 𝑝0 + 0.8 ∗ (1 − 𝑝0)
Using Bayes’ Rule, 𝑃(+𝑠| + 𝑦) = 𝑃(+𝑦, +𝑠) ∕ 𝑃(+𝑦) = 0.1(𝑝0) ∕ (0.1(𝑝0) + 0.8(1 − 𝑝0))
(b) Your friend Bob suggests iteratively updating the assignment: at each time step, we randomly select a student and randomly assign them to a different bus. Assume for the remainder of Q4 that there are at least 2 buses (𝑀 ≥ 2).
The resulting transition probability table is listed below, where 𝑃(𝑆𝑡+1|𝑆𝑡) is the probability of transitioning from 𝑆𝑡 to 𝑆𝑡+1 after the iterative improvement at the end of time 𝑡.

𝑆𝑡    𝑆𝑡+1    𝑃(𝑆𝑡+1|𝑆𝑡)
+𝑠    +𝑠      𝑟++
+𝑠    −𝑠      –
−𝑠    +𝑠      –
−𝑠    −𝑠      𝑟−−

𝑆0    𝑃(𝑆0)
+𝑠    𝑝0
−𝑠    1 − 𝑝0
(i) [2 pts] Which of the following conditions are sufficient for 𝑟++ = 𝑃(𝑆𝑡+1 = +𝑠|𝑆𝑡 = +𝑠) = 0? Select all that apply.
□ The underlying constraint graph is fully connected
□ The underlying constraint graph does not contain a cycle
■ There exists exactly one satisfying assignment
○ None of the above
Reason for 1 and 2 being unselected: When there are more buses than students (𝑀 > 𝑁), you can always move one student to any one of the vacant 𝑀 − 𝑁 buses and still satisfy all constraints. Since the probability of doing so is (1∕𝑁) ∗ (𝑀 − 𝑁)∕(𝑀 − 1) > 0, 𝑟++ is definitely non-zero.
Reason for 3 being selected: If there is only one satisfying assignment, changing any variable to any other value will violate at least one constraint.
(ii) [2 pts] Ben claims that although 𝑟++ and 𝑟−− can be approximated with constant numbers, they actually fluctuate throughout the execution of the program. Is Ben right and why?
○ Ben is right, because 𝑟++ and 𝑟−− are non-constant functions of time-step 𝑡.
○ Ben is right, because enforcing assigning the randomly selected student to a different bus changes the belief of the hidden states.
○ Ben is right, but not for the reasons above.
○ Ben is wrong, because 𝑟++ and 𝑟−− will only change monotonically (either always going up or always going down), and never fluctuate.
● Ben is wrong, because 𝑟++ and 𝑟−− are always constants.
○ Ben is wrong, but not for the reasons above.
Let 𝑎𝑖, 𝑎𝑗 ∈ 𝐴𝑡, the set of all possible assignments, whose domain is {(1, 1, 1, 1, 1, …, 1), (1, 1, 1, 1, 1, …, 2), (1, 1, 1, 1, 1, …, 3), …, (𝑀, 𝑀, 𝑀, 𝑀, 𝑀, …, 𝑀 − 1), (𝑀, 𝑀, 𝑀, 𝑀, 𝑀, …, 𝑀)}, for all time 𝑡.
(1) 𝑃(𝐴𝑡 = 𝑎𝑖) is a constant for all 𝑡, as the uniform distribution is the stationary distribution, because for all 𝑎𝑖 there are 𝑁(𝑀 − 1) adjacent states with equal probability. The number 𝑁(𝑀 − 1) comes from the fact that each of the 𝑁 students could be re-assigned to any of the 𝑀 − 1 buses other than the one they are currently on.
(1.analogy) If we model the distribution of 𝑃 (𝐴𝑡 = 𝑎𝑖) as water level in many connected buckets. The analogy here is that, if all buckets start at the same level and flow to each other the same way, there will be no change in the distribution of water across all bucket.
(2) Since 𝑃(𝐴𝑡+1 = 𝑎𝑗|𝐴𝑡 = 𝑎𝑖) is a constant for all 𝑡 (as the swapping procedure is the same over time), it follows that 𝑃(𝐴𝑡+1 = 𝑎𝑗,𝐴𝑡 = 𝑎𝑖) is a constant for all 𝑡.
(3) Since 𝑆 = +𝑠 is just summing over all values of 𝑎𝑖 that satisfy the constraints, and 𝑆 = −𝑠 is just summing over all the values of 𝑎𝑖 that don’t satisfy them, it follows that
(a) 𝑃 (𝑆𝑡 = +𝑠) is a constant and
(b) 𝑃 (𝑆𝑡+1 = +𝑠, 𝑆𝑡 = +𝑠) is a constant.
So the conditional probability 𝑟++ = 𝑃 (𝑆𝑡+1 = +𝑠|𝑆𝑡 = +𝑠) is also a constant.
(iii) [1 pt] Does your selection above change to one of the other 5 choices if, hypothetically, we initially assign all students to bus 1 instead?
○ Yes, because “𝑟++ = undefined” is true immediately (at time 𝑡 = 0) if we initially assign all students to bus 1, but not true if the initial assignment is randomly generated.
● Yes, but not for the reason above.
○ No, because the properties of 𝑟++ and 𝑟−− are independent of how we generate the initial assignment.
○ No, but not for the reason above.
Choice 1 is not selected because if 𝑘 = 0, all assignments are satisfying, and thus 𝑟++ = 1, not undefined. Also, 𝑟++ = undefined can be true even if we assign students uniformly, if no satisfying assignment exists.
The real reason is that “initially assigning all students to bus 1” creates a huge bias in the distribution over the assignment space 𝐴0: now (1, 1, 1, 1, 1, 1, …, 1) has probability 1 in the assignment space while all other assignments, including (2, 1, 1, 1, 1, 1, …, 1), have probability 0. This drastic difference will be balanced out after we start to re-assign students to different buses, because now (2, 1, 1, 1, 1, 1, …, 1) has non-zero probability in the assignment space, and thus the probability of its projection into the +𝑠, −𝑠 space will change, and thus 𝑟++ will change over time.
(analogy) Following the analogy in the previous part, if all buckets are connected and one bucket starts full while the others start empty, then over time all buckets will reach the same level, but during this process the distribution of water across the buckets will fluctuate instead of staying constant. (In this case, (1, 1, 1, 1, 1, 1, …, 1) starts full and all others start empty.)
(c) Your friend Charlie suggests a policy that, since the absence of yelling (−𝑦) is a good indicator that the current assignment satisfies all constraints (+𝑠), we will not alter any variable when we currently observe no yelling (−𝑦).
𝑆𝑡    𝑌𝑡    𝑆𝑡+1    𝑃(𝑆𝑡+1|𝑆𝑡, 𝑌𝑡)
+𝑠    +𝑦    +𝑠      𝑟++
+𝑠    +𝑦    −𝑠      (I)
+𝑠    −𝑦    +𝑠      (II)
+𝑠    −𝑦    −𝑠      (III)
−𝑠    +𝑦    +𝑠      (IV)
−𝑠    +𝑦    −𝑠      𝑟−−
−𝑠    −𝑦    +𝑠      (V)
−𝑠    −𝑦    −𝑠      (VI)

𝑆𝑡    𝑌𝑡    𝑃(𝑌𝑡|𝑆𝑡)
+𝑠    +𝑦    0.1
+𝑠    −𝑦    –
−𝑠    +𝑦    –
−𝑠    −𝑦    0.2

𝑆0    𝑃(𝑆0)
+𝑠    𝑝0
−𝑠    1 − 𝑝0

Note: “–” in the tables means the value is not given in the problem.
(i) [1 pt] Which of the quantities in the 𝑃(𝑆𝑡+1|𝑆𝑡, 𝑌𝑡) table are guaranteed to be 1, per Charlie’s policy?
□ (I)   ■ (II)   □ (III)   □ (IV)   □ (V)   ■ (VI)
○ None of the above
Both (II) and (VI) correspond to −𝑦 and keep 𝑠 on the same side, which is guaranteed to be 1. Also (III) and (V) are guaranteed to be 0, since we do not swap values when no yelling is observed.
(ii) [1 pt] We are following Charlie’s algorithm, and for the first 𝑇 time-steps, all observations are no yelling (−𝑦), and thus we have not altered any variables. Conditioned on 𝑌0 = 𝑌1 = … = 𝑌𝑇−1 = −𝑦, what is the probability that the initial assignment indeed satisfies all constraints (+𝑠)?

○ 𝑝0
○ 0.9(𝑝0)
○ (0.9)^𝑇 𝑝0
○ (0.9(𝑝0))^𝑇
○ 0.9(𝑝0) ∕ (0.9(𝑝0) + 0.2(1 − 𝑝0))
● (0.9)^𝑇 𝑝0 ∕ ((0.9)^𝑇 𝑝0 + (0.2)^𝑇 (1 − 𝑝0))
○ (0.9(𝑝0))^𝑇 ∕ ((0.9(𝑝0))^𝑇 + (0.2(1 − 𝑝0))^𝑇)
○ None of the above
Key observation: since we never changed the assignments, 𝑆0 = 𝑆1 = … = 𝑆𝑇 = +𝑠 or 𝑆0 = 𝑆1 = … = 𝑆𝑇 = −𝑠.
𝑃(𝑆0 = +𝑠, 𝑌0 = 𝑌1 = … = 𝑌𝑇−1 = −𝑦)
= 𝑃(𝑆𝑇−1 = +𝑠, 𝑌0 = 𝑌1 = … = 𝑌𝑇−1 = −𝑦) (because the assignment is never altered)
= 𝑃(𝑆𝑇−2 = +𝑠, 𝑌0 = … = 𝑌𝑇−2 = −𝑦) ∗ 𝑃(𝑌𝑇−1 = −𝑦|𝑆𝑇−2 = +𝑠) ∗ 𝑃(𝑆𝑇−1 = +𝑠|𝑆𝑇−2 = +𝑠, 𝑌𝑇−2 = −𝑦) + 𝑃(𝑆𝑇−2 = −𝑠, 𝑌0 = … = 𝑌𝑇−2 = −𝑦) ∗ 𝑃(𝑌𝑇−1 = −𝑦|𝑆𝑇−2 = −𝑠) ∗ 𝑃(𝑆𝑇−1 = +𝑠|𝑆𝑇−2 = −𝑠, 𝑌𝑇−2 = −𝑦)
= 𝑃(𝑆𝑇−2 = +𝑠, 𝑌0 = … = 𝑌𝑇−2 = −𝑦) ∗ 0.9 ∗ 1 + 𝑃(𝑆𝑇−2 = −𝑠, 𝑌0 = … = 𝑌𝑇−2 = −𝑦) ∗ 0.2 ∗ 0
= 0.9 ∗ 𝑃(𝑆𝑇−2 = +𝑠, 𝑌0 = 𝑌1 = … = 𝑌𝑇−2 = −𝑦) (because the second term has a 0 factor)
= (0.9)^2 ∗ 𝑃(𝑆𝑇−3 = +𝑠, 𝑌0 = 𝑌1 = … = 𝑌𝑇−3 = −𝑦)
= …
= (0.9)^(𝑇−1) ∗ 𝑃(𝑌0 = −𝑦|𝑆0 = +𝑠) ∗ 𝑃(𝑆0 = +𝑠)
= (0.9)^(𝑇−1) ∗ 0.9 ∗ 𝑝0
= (0.9)^𝑇 ∗ 𝑝0
Similarly, 𝑃(𝑆0 = −𝑠, 𝑌0 = 𝑌1 = … = 𝑌𝑇−1 = −𝑦) = (0.2)^𝑇 (1 − 𝑝0).
Therefore, 𝑃(𝑆0 = +𝑠|𝑌0 = 𝑌1 = … = 𝑌𝑇−1 = −𝑦)
= 𝑃(𝑆0 = +𝑠, 𝑌0 = … = 𝑌𝑇−1 = −𝑦) ∕ (𝑃(𝑆0 = +𝑠, 𝑌0 = … = 𝑌𝑇−1 = −𝑦) + 𝑃(𝑆0 = −𝑠, 𝑌0 = … = 𝑌𝑇−1 = −𝑦))
= (0.9)^𝑇 𝑝0 ∕ ((0.9)^𝑇 𝑝0 + (0.2)^𝑇 (1 − 𝑝0))
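A quick numerical Python check of the closed form above; the values of 𝑝0 and 𝑇 are arbitrary illustrations.

```python
# Under Charlie's policy the state never changes while -y is observed, so the
# posterior just accumulates emission likelihoods step by step.
p0, T = 0.3, 5
b_plus, b_minus = p0, 1 - p0
for _ in range(T):
    b_plus *= 0.9      # P(-y | +s)
    b_minus *= 0.2     # P(-y | -s)
print(b_plus / (b_plus + b_minus))                       # recursive computation
print(p0 * 0.9**T / (p0 * 0.9**T + (1 - p0) * 0.2**T))   # closed form, same value
```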

(iii) [1 pt] In which of the following ways does Charlie’s suggestion improve on Bob’s algorithm?
□ When the assignment is likely to violate a lot of constraints, Charlie’s suggestion helps reduce the number of violated constraints better than Bob’s
■ When the assignment is likely to satisfy all constraints, Charlie’s suggestion helps retain that state better than Bob’s
○ None of the above
Because 𝑃(−𝑦| − 𝑠) = 0.2 is not negligible, and we preserve the assignment when observing −𝑦, Charlie’s algorithm actually makes it harder to get out of −𝑠 than Bob’s does.
(iv) [3 pts] Charlie’s algorithm is likely to work well to find a satisfying assignment (if one exists) in which of the following scenarios?
■ When the constraint graph is sparse
□ When the constraint graph is dense
■ When 𝑟++ is close to 1
□ When 𝑟++ is close to 0
□ When 𝑟−− is close to 1
■ When 𝑟−− is close to 0
○ None of the above
Since this algorithm is largely random and it will stop swapping when yelling stops being observed, it heavily relies on 2 aspects:
1. When we have a non-satisfying assignment (−𝑠), the algorithm has a high probability of jumping out of it (and thus a low probability of returning to another non-satisfying assignment (−𝑠)), indicated by a low 𝑟−−.
2. When we have a satisfying assignment (+𝑠), the algorithm is not going to shortly fall into another non-satisfying assignment (−𝑠) as a result of accidentally observing the noise (𝑃(+𝑦| + 𝑠) = 0.1 > 0). Therefore, having a high retention rate for satisfying assignments is also important (indicated by a high 𝑟++).
A sparse constraint graph tends to have the two properties above, while a dense graph tends to have the opposite of the two properties.
(d) [1 pt] The approach used in this question (especially in Charlie’s algorithm) is the most similar to which of the following concepts?
○ AC-3 Algorithm
● Local Search
○ Forward Checking
○ Likelihood Weighting
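To make the local-search analogy concrete, here is a small Python sketch of a Charlie-style loop (the constraints, probabilities, and function name are hypothetical illustrations, not part of the exam).

```python
import random

# Noisy local search: only re-assign a random student when yelling is heard.
def charlie_search(num_students, num_buses, constraints, steps, rng=random.Random(0)):
    assignment = [rng.randrange(num_buses) for _ in range(num_students)]
    for _ in range(steps):
        satisfied = all(assignment[i] != assignment[j] for i, j in constraints)
        yelling = rng.random() < (0.1 if satisfied else 0.8)   # P(+y|+s)=0.1, P(+y|-s)=0.8
        if yelling:
            student = rng.randrange(num_students)
            choices = [b for b in range(num_buses) if b != assignment[student]]
            assignment[student] = rng.choice(choices)
    return assignment

print(charlie_search(4, 3, [(0, 1), (1, 2), (2, 3)], steps=200))
```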

Q5. [15 pts] Snail Bayes
Celebrating the near-end of the semester, the CS188 TAs have gathered around the staff aquarium to check up on the snails and their search for love. To our excitement, two snails decided to go on a date! We don’t know who the snails are, but we spent enough time around the terrarium to know that the first one (𝑆1) is either Alex (a) or Bubbles (b), and the second one (𝑆2) is either Cuddles (c) or Scorpblorg (s). On the date, the snails will eat some dinner (D), which can be a beautiful flower (+d) or a poisonous mushroom (-d), and they will walk (W) around wonderful rocks (+w) or some treacherous puddle (-w). The snails are in the quest for love (L), which, depending on how the date goes, they can find (+l) or not (-l).
(a) [1 pt] What is the probability of an outcome (𝑆1 = 𝑎, 𝑆2 = 𝑐, 𝐷 = −𝑑, 𝑊 = +𝑤, 𝐿 = +𝑙), the probability that Cuddles and Alex are on a date, where they share a poisonous mushroom, walk around the wonderful rocks and find love?
○ 0.5 ∗ 0.4 ∗ 0.7 ∗ 0.5 ∗ 0.4
○ 0.4 ∗ 0.6 ∗ 0.7 ∗ 0.5 ∗ 0.4
● 0.6 ∗ 0.6 ∗ 0.7 ∗ 0.5 ∗ 0.4
○ None of the above
𝑃(𝑎, 𝑐, −𝑑, +𝑤, +𝑙) = 𝑃(+𝑙 ∣ 𝑎, 𝑐) ∗ 𝑃(𝑎 ∣ −𝑑, +𝑤) ∗ 𝑃(𝑐 ∣ −𝑑, +𝑤) ∗ 𝑃(−𝑑) ∗ 𝑃(+𝑤) = 0.6 ∗ 0.6 ∗ 0.7 ∗ 0.5 ∗ 0.4 = 0.0504
(b) [2 pts] Which of the following independence statements are guaranteed to be true by the Snail Bayes’ Net graph structure?
□ 𝑆1 ⫫ 𝑆2 ∣ 𝐿
■ 𝐷 ⫫ 𝑊
□ 𝐷 ⫫ 𝑊 ∣ 𝐿
■ 𝑆1 ⫫ 𝑆2 ∣ 𝐷, 𝑊
■ 𝐿 ⫫ 𝑊 ∣ 𝑆1, 𝑆2
□ 𝐷 ⫫ 𝑊 ∣ 𝑆1, 𝑆2
○ None of the above
We can use the D-Separation algorithm to check for independence guarantees.
The date is about to start and people are making guesses for what’s going to happen. One TA notices how adorable it would be if the snails were Bubbles and Cuddles.
(c) If Bubbles and Cuddles are on the date, we want to compute the probability of them eating a beautiful flower and walking around the wonderful rocks.
(i) [1 pt] What is the equivalent expression for this probability?
○ 𝑃(𝑏, 𝑐, +𝑑, +𝑤)   ○ 𝑃(𝑏, 𝑐 ∣ +𝑑, +𝑤)   ● 𝑃(+𝑑, +𝑤 ∣ 𝑏, 𝑐)   ○ 𝑃(+𝑑, +𝑤)
(ii) [1 pt] What minimal set of probability tables will we use in calculating this probability?
■ 𝑃(𝐷)   ■ 𝑃(𝑊)   ■ 𝑃(𝑆1 ∣ 𝐷, 𝑊)   ■ 𝑃(𝑆2 ∣ 𝐷, 𝑊)   □ 𝑃(𝐿 ∣ 𝑆1, 𝑆2)   ○ None of the above
We need all the conditional probability table containing 𝐷, 𝑊 , 𝑆1, 𝑆2 on the unconditioned side, which are the first four. The last table is left out because 𝐿 is not part of the query and none of 𝐷, 𝑊 , 𝑆1, 𝑆2 is conditioned on 𝐿
The snails are starving, so the date begins with a delightful dinner. The snails start sharing a mushroom as their dish of choice.
(d) Given their choice of dinner, what is 𝑃 (𝑆1 ∣ −𝑑), the belief over which snail is S1? Please answer in decimal numbers. You should not need a calculator.
(i) [1 pt] 𝑃(𝑆1 = 𝑎 ∣ 𝐷 = −𝑑) =
𝑃 (𝑆1 = 𝑎 ∣ −𝑑) = ∑𝑤 𝑃 (𝑎 ∣ −𝑑, 𝑤) ∗ 𝑃 (𝑤) = 0.6 ∗ 0.4 + 0.2 ∗ 0.6 = 0.36.
A late TA rushes in and is thrilled to see us around the aquarium. You see, he spent many hours befriending the little ones and immediately recognizes the second snail as Cuddles! Apparently, Cuddles is the ooziest and is easiest to identify.
(e) [1 pt] What is 𝑃 (𝑆1 = 𝑏 ∣ 𝑆2 = 𝑐, 𝐷 = −𝑑), the probability that the first snail is Bubbles given that the second is Cuddles and they ate a mushroom?
○ [∑𝑤 𝑃(𝑏 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤) ∗ 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] ∕ 𝑃(𝑐 ∣ −𝑑)
○ [∑𝑤 𝑃(𝑏 ∣ −𝑑, 𝑤) ∗ 𝑃(−𝑑) ∗ 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] ∕ 𝑃(𝑐)
○ [∑𝑤 𝑃(𝑏 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤) ∗ 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] ∕ [∑𝑤 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(−𝑑)]
● [∑𝑤 𝑃(𝑏 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] ∕ 𝑃(𝑐 ∣ −𝑑)
○ None of the above
𝑃(𝑏 ∣ 𝑐, −𝑑) = 𝑃(𝑏, 𝑐 ∣ −𝑑) ∕ 𝑃(𝑐 ∣ −𝑑) = [∑𝑤 𝑃(𝑏, 𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] ∕ 𝑃(𝑐 ∣ −𝑑) = [∑𝑤 𝑃(𝑏 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] ∕ 𝑃(𝑐 ∣ −𝑑),
where we used 𝑆1 ⫫ 𝑆2 ∣ 𝐷, 𝑊 from part (b) for the numerator.
= (0.4 ∗ 0.7 ∗ 0.4 + 0.8 ∗ 0.8 ∗ 0.6) ∕ 0.76 = 0.6526

(f) [2 pts] What is 𝑃 (𝐿 = +𝑙 ∣ 𝑆2 = 𝑐, 𝐷 = −𝑑), the probability that the snails will find love given that the second snail is Cuddles and they ate a mushroom?
□ [∑𝑠1 𝑃(+𝑙 ∣ 𝑐, 𝑠1) ∗ 𝑃(−𝑑 ∣ 𝑐) ∗ 𝑃(𝑠1 ∣ −𝑑, 𝑐)] ∕ 𝑃(𝑐 ∣ −𝑑)
□ [∑𝑠1 𝑃(+𝑙 ∣ 𝑐, 𝑠1) ∗ 𝑃(𝑐 ∣ −𝑑) ∗ 𝑃(𝑠1 ∣ −𝑑, 𝑐)] ∕ 𝑃(−𝑑 ∣ 𝑐)
■ ∑𝑠1 𝑃(+𝑙 ∣ 𝑐, 𝑠1) ∗ 𝑃(𝑠1 ∣ −𝑑, 𝑐)
■ [∑𝑠1 𝑃(+𝑙, −𝑑 ∣ 𝑐, 𝑠1) ∗ 𝑃(𝑠1 ∣ 𝑐)] ∕ 𝑃(−𝑑 ∣ 𝑐)
○ None of the above
𝑃(+𝑙 ∣ 𝑐, −𝑑) = 𝑃(+𝑙, −𝑑 ∣ 𝑐) ∕ 𝑃(−𝑑 ∣ 𝑐)
= [∑𝑠1 𝑃(+𝑙, −𝑑 ∣ 𝑐, 𝑠1) ∗ 𝑃(𝑠1 ∣ 𝑐)] ∕ 𝑃(−𝑑 ∣ 𝑐)
= [∑𝑠1 𝑃(+𝑙 ∣ 𝑐, 𝑠1) ∗ 𝑃(−𝑑 ∣ 𝑐, 𝑠1) ∗ 𝑃(𝑠1 ∣ 𝑐)] ∕ 𝑃(−𝑑 ∣ 𝑐), where we used 𝐿 ⫫ 𝑊 ∣ 𝑆1, 𝑆2 and the answer from part (e),
= [∑𝑠1 𝑃(+𝑙 ∣ 𝑐, 𝑠1) ∗ 𝑃(−𝑑 ∣ 𝑐) ∗ 𝑃(𝑠1 ∣ −𝑑, 𝑐)] ∕ 𝑃(−𝑑 ∣ 𝑐)
= ∑𝑠1 𝑃(+𝑙 ∣ 𝑐, 𝑠1) ∗ 𝑃(𝑠1 ∣ −𝑑, 𝑐)
= 0.6 ∗ 0.3474 + 0.5 ∗ 0.6526 = 0.53474
The snails found love! We are now trying to find the probability that the other snail was Bubbles given all evidence so far, 𝑃(𝑏 ∣ 𝑐,−𝑑,+𝑙). The TAs are tired of multiplying probabilities, so they instead try another way. The late TA actually wrote down memories of previous dates he has witnessed in a notebook. He can sample some of his memories from the notebook and help us learn probabilities.
(g) [1 pt] If the TA uses prior sampling, what is the probability of obtaining the sample [𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑏, 𝑆2 = 𝑐, 𝐿 = −𝑙]?
○ 0.5 ∗ 0.4 ∗ 0.6 ∗ 0.3 ∗ 0.2
○ 0.6 ∗ 0.1 ∗ 0.7 ∗ 0.1 ∗ 0.2
○ 0.25 ∗ 0.24 ∗ 0.9 ∗ 0.1 ∗ 0.5
○ 0.4 ∗ 0.4 ∗ 0.9 ∗ 0.2 ∗ 0.8
● 0.5 ∗ 0.4 ∗ 0.4 ∗ 0.7 ∗ 0.5
○ 0.4 ∗ 0.5 ∗ 0.24 ∗ 0.21 ∗ 0.25
○ None of the above
Prior sampling samples without taking the evidence into account, so the probability of the sample is 𝑃(−𝑑)𝑃(+𝑤)𝑃(𝑏 ∣ −𝑑, +𝑤)𝑃(𝑐 ∣ −𝑑, +𝑤)𝑃(−𝑙 ∣ 𝑏, 𝑐).
(h) [1 pt] If the TA samples [𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑏, 𝑆2 = 𝑐, 𝐿 = −𝑙], would rejection sampling discard the memory?
● Yes   ○ No
Rejection sampling discards samples inconsistent with the evidence. In this case, −𝑙 is inconsistent with the fact that our
snails did in fact find love.
(i) [1 pt] Assuming that the TA actually sampled using likelihood weighing and obtained [𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑏, 𝑆2 = 𝑐, 𝐿 = +𝑙], what is the weight of this sample?
○ 0.5 ∗ 0.5 ∗ 0.5
○ 0.4 ∗ 0.9 ∗ 0.6
○ 0.4 ∗ 0.24 ∗ 0.6
● 0.5 ∗ 0.7 ∗ 0.5
○ 0.5 ∗ 0.3 ∗ 0.5
○ 0.6 ∗ 0.3 ∗ 0.6
○ None of the above
The weight of a sample in likelihood weighting is the product of the probabilities of the evidence variables given their parents: 𝑃(−𝑑)𝑃(𝑐 ∣ −𝑑, +𝑤)𝑃(+𝑙 ∣ 𝑏, 𝑐).
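For reference, a minimal Python sketch of likelihood weighting on a tiny two-node network; the CPT numbers below are hypothetical, not the Snail Bayes net (whose full tables appear only in the exam figure).

```python
import random

# Likelihood weighting on a chain A -> B with hypothetical CPTs.
P_A = {True: 0.5, False: 0.5}
P_B_given_A = {True: 0.7, False: 0.2}            # P(B = true | A)

def weighted_sample(evidence_B, rng=random.random):
    a = rng() < P_A[True]                        # sample the non-evidence variable A
    weight = P_B_given_A[a] if evidence_B else 1 - P_B_given_A[a]
    return a, weight                             # evidence B is fixed and contributes weight

def estimate_P_A_given_B(evidence_B, n=100_000):
    num = den = 0.0
    for _ in range(n):
        a, w = weighted_sample(evidence_B)
        num += w * a
        den += w
    return num / den                             # unbiased in the limit of many samples
```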
(j) [1 pt] Sampling using likelihood weighting will systematically underestimate the probability of a variable conditioned on one of its ancestors.
○ Yes, because likelihood weighting does not sample all the variables, and thus creates a bias
○ Yes, but not for the reason above
● No, because likelihood weighting is unbiased
○ No, but not for the reason above
Likelihood weighting is an unbiased sampling procedure.
(k) [2 pts] To estimate 𝑃 (𝑏 ∣ 𝑐, −𝑑, +𝑙), the TA samples five memories in a row: [𝐷=−𝑑,𝑊 =+𝑤,𝑆1 =𝑏,𝑆2 =𝑐,𝐿=+𝑙],
[𝐷=−𝑑,𝑊 =+𝑤,𝑆1 =𝑏,𝑆2 =𝑐,𝐿=+𝑙],
[𝐷=−𝑑,𝑊 =+𝑤,𝑆1 =𝑎,𝑆2 =𝑐,𝐿=+𝑙],
[𝐷=−𝑑,𝑊 =+𝑤,𝑆1 =𝑎,𝑆2 =𝑐,𝐿=+𝑙],
[𝐷=−𝑑,𝑊 =+𝑤,𝑆1 =𝑏,𝑆2 =𝑐,𝐿=+𝑙].
Could these memories have been generated using Gibbs sampling?
○ Yes, because all evidence variables are consistent with their values in the query 𝑃(𝑏 ∣ 𝑐, −𝑑, +𝑙).
● Yes, but the reason above is incorrect because there exist other sample sequences that fit the condition in the previous choice but cannot be generated by Gibbs sampling.
○ No, because the sequence of samples only differs by 𝑆1, the query variable. The values of 𝑊, a variable that is not part of the query, never change throughout the sequence.
○ No, but the reason above is incorrect because there exist other sample sequences that fit the condition in the previous choice but cannot be generated by Gibbs sampling.
Since each neighboring sample differs by at most one variable value AND the evidence variables are not changed, this sequence could be generated via Gibbs sampling, which eliminates choices 3 and 4. [0->1: no change, 1->2: no change, 2->3: only changes 𝑆1, 3->4: no change, 4->5: no change]
But choice one “all evidence variables are consistent with their values in the query 𝑃 (𝑏 ∣ 𝑐, −𝑑, +𝑙)” is insufficient, because the following sequence fits the condition, yet cannot be generated by Gibbs Sampling:
[𝐷=−𝑑,𝑊 =+𝑤,𝑆1 =𝑏,𝑆2 =𝑐,𝐿=+𝑙], [𝐷=−𝑑,𝑊 =−𝑤,𝑆1 =𝑎,𝑆2 =𝑐,𝐿=+𝑙], [𝐷=−𝑑,𝑊 =−𝑤,𝑆1 =𝑏,𝑆2 =𝑐,𝐿=+𝑙].
Because 0->1: changes both 𝑊 and 𝑆1
Just a quick note about choice 3: it is totally possible that when it is 𝑊’s turn to be re-sampled from 𝑃(𝑊 | everything else), the result just so happens to remain +𝑤 every time throughout the 5 consecutive samples.
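A small Python sketch of the consistency check used in the explanation above (the function name and sequence are illustrative): a sample sequence can come from Gibbs sampling only if the evidence never changes and consecutive samples differ in at most one non-evidence variable.

```python
def could_be_gibbs(samples, evidence):
    # Necessary (not sufficient) conditions for a Gibbs-sampled sequence.
    if any(s[var] != val for s in samples for var, val in evidence.items()):
        return False
    return all(sum(a[k] != b[k] for k in a) <= 1
               for a, b in zip(samples, samples[1:]))

seq = [{"D": "-d", "W": "+w", "S1": "b", "S2": "c", "L": "+l"},
       {"D": "-d", "W": "+w", "S1": "a", "S2": "c", "L": "+l"}]
print(could_be_gibbs(seq, {"D": "-d", "S2": "c", "L": "+l"}))   # True
```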

Q6. [11 pts] We Are Getting Close…
The CS 188 TAs have built an autonomous vehicle, and it’s finally on the street! Approaching a crossroad, our vehicle must avoid bumping into pedestrians. However, how close are we?
X is the signal received from sensors on our vehicle. We have a estimation model E, which estimates the current distance of any object in our view. Our vehicle also needs a model to detect objects and label their classes as one of {pedestrian, stop sign, road, other}. The TAs trained a detection model D that does the above and with a simple classification, outputs one of {no pedestrian, pedestrian on the road, pedestrian beside the stop sign}. Our vehicle has a control operator C, which determines the velocity by changing the acceleration.
(a) [5 pts] For the above Dynamic Bayes Net, complete the equations for performing updates. (Hint: think about the predic- tion update and observation update equations in the forward algorithm for HMMs.)
Time elapse:
(i) = (ii) (iii) (iv) 𝑃(𝑥𝑡−1|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1)
Selected answers: (i) ● 𝑃(𝑥𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1), (ii) ● 1, (iii) ● ∑𝑥𝑡−1, (iv) ● 𝑃(𝑥𝑡|𝑥𝑡−1, 𝑐𝑡−1), i.e.
𝑃(𝑥𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1) = ∑𝑥𝑡−1 𝑃(𝑥𝑡|𝑥𝑡−1, 𝑐𝑡−1) 𝑃(𝑥𝑡−1|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1).
Recall the prediction update of the forward algorithm: 𝑃(𝑥𝑡|𝑜0∶𝑡−1) = ∑𝑥𝑡−1 𝑃(𝑥𝑡|𝑥𝑡−1) 𝑃(𝑥𝑡−1|𝑜0∶𝑡−1), where 𝑜 is the observation. Here it is similar, except that there are several observations at each time, which means 𝑜𝑡 corresponds to 𝑒𝑡, 𝑑𝑡, 𝑐𝑡 for each 𝑡, and that 𝑋 depends on the 𝐶 value of the previous time, so we need 𝑃(𝑥𝑡|𝑥𝑡−1, 𝑐𝑡−1) instead of 𝑃(𝑥𝑡|𝑥𝑡−1). Also note that 𝑋𝑡 is independent of 𝐷𝑡−1, 𝐸𝑡−1 given 𝐶𝑡−1, 𝑋𝑡−1.
Update to incorporate new evidence at time 𝑡:
𝑃(𝑥𝑡|𝑒0∶𝑡, 𝑑0∶𝑡, 𝑐0∶𝑡) = (v) (vi) (vii) [your choice for (i)]
Selected answers: (v) ● (𝑃(𝑒𝑡, 𝑑𝑡, 𝑐𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1))^(−1), (vi) ● 1, (vii) ■ 𝑃(𝑒𝑡, 𝑑𝑡|𝑥𝑡) 𝑃(𝑐𝑡|𝑒𝑡, 𝑑𝑡, 𝑐𝑡−1), i.e.
𝑃(𝑥𝑡|𝑒0∶𝑡, 𝑑0∶𝑡, 𝑐0∶𝑡) = 𝑃(𝑒𝑡, 𝑑𝑡|𝑥𝑡) 𝑃(𝑐𝑡|𝑒𝑡, 𝑑𝑡, 𝑐𝑡−1) 𝑃(𝑥𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1) ∕ 𝑃(𝑒𝑡, 𝑑𝑡, 𝑐𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1).
Recall the observation update of the forward algorithm: 𝑃(𝑥𝑡|𝑜0∶𝑡) ∝ 𝑃(𝑥𝑡, 𝑜𝑡|𝑜0∶𝑡−1) = 𝑃(𝑜𝑡|𝑥𝑡) 𝑃(𝑥𝑡|𝑜0∶𝑡−1).
Here the observation 𝑜𝑡 corresponds to 𝑒𝑡, 𝑑𝑡, 𝑐𝑡 for each 𝑡. Applying the Chain Rule, we have
𝑃(𝑥𝑡|𝑒0∶𝑡, 𝑑0∶𝑡, 𝑐0∶𝑡) ∝ 𝑃(𝑥𝑡, 𝑒𝑡, 𝑑𝑡, 𝑐𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1) = 𝑃(𝑒𝑡, 𝑑𝑡, 𝑐𝑡|𝑥𝑡, 𝑐𝑡−1) 𝑃(𝑥𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1)
= 𝑃(𝑒𝑡, 𝑑𝑡|𝑥𝑡) 𝑃(𝑐𝑡|𝑒𝑡, 𝑑𝑡, 𝑐𝑡−1) 𝑃(𝑥𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1).
Note that in 𝑃(𝑒𝑡, 𝑑𝑡, 𝑐𝑡|𝑥𝑡, 𝑐𝑡−1), we cannot omit 𝑐𝑡−1 due to the arrow between 𝑐𝑡 and 𝑐𝑡−1.
To calculate the normalizing constant, use Bayes’ Rule: 𝑃(𝑥𝑡|𝑒0∶𝑡, 𝑑0∶𝑡, 𝑐0∶𝑡) = 𝑃(𝑥𝑡, 𝑒𝑡, 𝑑𝑡, 𝑐𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1) ∕ 𝑃(𝑒𝑡, 𝑑𝑡, 𝑐𝑡|𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1).
(viii) Suppose we want to do the above updates in one step and use normalization to reduce computation. Select all the terms that are not explicitly calculated in this implementation. DO NOT include the choices if their values are 1.
□ (ii)   □ (iii)   □ (iv)   ■ (v)   □ (vi)   □ (vii)   ○ None of the above
(v) is a constant, so we don’t calculate it during implementation and simply do a normalization instead. Everything else is necessary.
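A minimal Python sketch of the two updates above for a generic DBN belief over X; the helper functions and CPT callables are hypothetical placeholders, not the exam’s network.

```python
# Sketch of the prediction and observation updates with normalization.
def time_elapse(belief, transition, c_prev):
    # belief[x] = P(x_{t-1} | e_{0:t-1}, d_{0:t-1}, c_{0:t-1});
    # transition(x_t, x_prev, c_prev) = P(x_t | x_{t-1}, c_{t-1}).
    return {x_t: sum(transition(x_t, x_prev, c_prev) * p
                     for x_prev, p in belief.items())
            for x_t in belief}

def observe(belief, e_t, d_t, c_t, c_prev, emit_ed, emit_c):
    # Multiply in P(e_t, d_t | x_t) P(c_t | e_t, d_t, c_{t-1}) and renormalize;
    # the normalizer plays the role of term (v), which is never computed explicitly.
    unnorm = {x: p * emit_ed(e_t, d_t, x) * emit_c(c_t, e_t, d_t, c_prev)
              for x, p in belief.items()}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}
```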
(b) Suppose X outputs 1024 × 1024 greyscale images and our vehicle stays stationary. As before, E includes precise estimation of the distance between our vehicle and the pedestrian evaluated from outputs of X. Unfortunately, a power outage happened, and before the power is restored, E will not be available for our vehicle. But we still have the detection model D, which outputs one of {no pedestrian, pedestrian on the road, pedestrian beside the stop sign} for each state.
(i) [1 pt] During the power outage, it is best to
○ do particle filtering because the particles are easier to track for D than for both D and E
● do particle filtering because of memory constraints
○ do particle filtering, but not for the reasons above
○ do exact inference because it saves computation
○ do exact inference, but not for the reason above
E is unavailable and C does not change its value since our vehicle stays stationary, so we only consider X and D. Although D has only 3 possible values, X is huge and it is impossible to store the belief distribution.
(ii) [1 pt] The power outage was longer than expected. As the sensor outputs of X have degraded to 2 × 2 binary images, it is best to
○ do particle filtering because the particles are easier to track for D than for both D and E
○ do particle filtering because of memory constraints
○ do particle filtering, but not for the reasons above
○ do exact inference because it saves computation
● do exact inference, but not for the reason above
In this case we do not have the “X is huge” problem from (i), and we can do exact inference, which is always more accurate than particle filtering and thus more favorable in this setting.
(iii) [1 pt] After power is restored and we have E, it is reasonable to
● do particle filtering because of memory constraints
○ do particle filtering, but not for the reason above
○ do exact inference because E gives more valuable information than D
○ do exact inference because it’s impractical to do particle filtering for E
○ do exact inference, but not for the reasons above
The belief distribution is too big to store in memory.

(c) Now we formulate the Dynamic Bayes Net for this question into a non-deterministic two-player game (analogous to an MDP in the single-player setting). Each state 𝑆 = (𝑋, 𝐸, 𝐷).
There are 2 agents in the game: our vehicle (with action set A), and a pedestrian (with action set B). The vehicle and the pedestrian take turns to perform their actions.
The TAs implemented 3 modes for the autonomous vehicle to act in the same space with the kind pedestrian, the confused pedestrian, and the naughty pedestrian. In each round of testing, a TA will be the pedestrian, and one of the modes will be tested. The vehicle and the pedestrian are both in the corresponding mode.
Below, 𝑄∗𝑣 is the Q-function for the autonomous vehicle. For each subquestion, given the standard notation for an MDP, select the expression 𝑓𝑛 that would complete the blank part of the Q-Value Iteration under the specified formulation.
𝑄∗𝑣(𝑠, 𝑎) = ∑𝑠′ 𝑇(𝑠, 𝑎, 𝑠′) [𝑅(𝑠, 𝑎, 𝑠′) + 𝛾 ______ ]
𝑓1 = ∑𝑏∈𝐵 ∑𝑠′′ (𝑇(𝑠′, 𝑏, 𝑠′′) [𝑅(𝑠′, 𝑏, 𝑠′′) + 𝛾 max𝑎′∈𝐴 𝑄∗𝑣(𝑠′′, 𝑎′)])
𝑓2 = max𝑏∈𝐵 ∑𝑠′′ (𝑇(𝑠′, 𝑏, 𝑠′′) [𝑅(𝑠′, 𝑏, 𝑠′′) + 𝛾 max𝑎′∈𝐴 𝑄∗𝑣(𝑠′′, 𝑎′)])
𝑓3 = min𝑏∈𝐵 ∑𝑠′′ (𝑇(𝑠′, 𝑏, 𝑠′′) [𝑅(𝑠′, 𝑏, 𝑠′′) + 𝛾 max𝑎′∈𝐴 𝑄∗𝑣(𝑠′′, 𝑎′)])
𝑓4 = ∑𝑏∈𝐵 ∑𝑠′′ (𝑇(𝑠′, 𝑏, 𝑠′′) [𝑅(𝑠′, 𝑏, 𝑠′′) + 𝛾 (1∕|𝐵|) max𝑎′∈𝐴 𝑄∗𝑣(𝑠′′, 𝑎′)])
𝑓5 = max𝑎′∈𝐴 𝑄∗𝑣(𝑠′, 𝑎′)
𝑓6 = min𝑎′∈𝐴 𝑄∗𝑣(𝑠′, 𝑎′)
𝑓7 = (1∕|𝐴|) ∑𝑎′∈𝐴 𝑄∗𝑣(𝑠′, 𝑎′)
(i) [1 pt] The kind pedestrian that acts friendly, maximizing the vehicle’s utility.
□ 𝑓1   ■ 𝑓2   □ 𝑓3   □ 𝑓4   □ 𝑓5   □ 𝑓6   □ 𝑓7   ○ None of the above
(ii) [1 pt] The confused pedestrian that acts randomly.
□ 𝑓1   □ 𝑓2   □ 𝑓3   □ 𝑓4   ■ 𝑓5   □ 𝑓6   □ 𝑓7   ○ None of the above
(iii) [1 pt] The naughty pedestrian that performs adversarial actions, minimizing the vehicle’s utility.
□ 𝑓1   □ 𝑓2   ■ 𝑓3   □ 𝑓4   □ 𝑓5   □ 𝑓6   □ 𝑓7   ○ None of the above
Recall the Q-value iteration formula: 𝑄𝑘+1(𝑠, 𝑎) ← ∑𝑠′ 𝑇(𝑠, 𝑎, 𝑠′) [𝑅(𝑠, 𝑎, 𝑠′) + 𝛾 max𝑎′ 𝑄𝑘(𝑠′, 𝑎′)].
Here it is similar, but in addition to the vehicle, there is also a pedestrian taking actions from a different set, so we need to do something similar for the pedestrian, as in 𝑓1, 𝑓2, 𝑓3, 𝑓4. That is, instead of using the maximum 𝑄𝑣 right away, we substitute it with the Q-value iteration formula for the pedestrian with respect to 𝑄𝑣.
The value of this formula is maximized (as in 𝑓2) in the case of the kind pedestrian, minimized (as in 𝑓3) in the case of the naughty pedestrian, and averaged in the case of the confused pedestrian, which would be
(1∕|𝐵|) ∑𝑏∈𝐵 ∑𝑠′′ (𝑇(𝑠′, 𝑏, 𝑠′′) [𝑅(𝑠′, 𝑏, 𝑠′′) + 𝛾 max𝑎′∈𝐴 𝑄∗𝑣(𝑠′′, 𝑎′)]).
Hence 𝑓1 and 𝑓4 are incorrect. Since the pedestrian is acting completely randomly, we can include the pedestrian in the transition dynamics and just use regular q-value iteration, which is 𝑓5.
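A small Python sketch of one such backup for the three pedestrian modes; the transition table layout, function names, and numbers are hypothetical, and the "confused" branch mirrors the averaged formula quoted in the explanation above.

```python
# T[(state, action)] is a list of (next_state, prob, reward) triples (hypothetical layout).
def pedestrian_value(Q, s1, T, A, B, gamma, mode):
    vals = [sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in A))
                for s2, p, r in T[(s1, b)])
            for b in B]
    if mode == "kind":
        return max(vals)                 # f2: pedestrian maximizes the vehicle's value
    if mode == "naughty":
        return min(vals)                 # f3: pedestrian minimizes the vehicle's value
    return sum(vals) / len(vals)         # confused: uniform average over pedestrian actions

def q_backup(Q, s, a, T, A, B, gamma, mode):
    return sum(p * (r + gamma * pedestrian_value(Q, s1, T, A, B, gamma, mode))
               for s1, p, r in T[(s, a)])
```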

Q7. [10 pts] Generating and classifying text
In this question, we are interested in modelling sentences. Assume each sentence is represented using a bag-of-words (i.e. a vector which contains counts for each word in a sentence). We are interested in classifying whether a sentence (represented as a bag of words 𝑋) is a positive-sounding (class 𝐶 = 1) or negative-sounding (𝐶 = −1) sentence. 𝑋1, 𝑋2, …, 𝑋𝑝 represent the entries for individual words in the bag-of-words representation 𝑋.
(a) In this question, we are interested in the basics of Naive Bayes and logistic regression.
(i) [1 pt] Which of these are modelled explicitly in Naive Bayes? Select all that apply.
□ 𝑃(𝐶|𝑋)
■ 𝑃(𝑋|𝐶)
■ 𝑃(𝐶)
○ None of the above
(ii) [1 pt] Which of these decompositions reflects the naive assumption in Naive Bayes?
○ 𝑃(𝐶) = 𝑃(𝐶1) ⋅ 𝑃(𝐶2)
● 𝑃(𝑋|𝐶) = 𝑃(𝑋1|𝐶) ⋅ 𝑃(𝑋2|𝐶) ⋅ 𝑃(𝑋3|𝐶) ⋅ 𝑃(𝑋4|𝐶) ⋅ …
○ 𝑃(𝑋) = 𝑃(𝑋1) ⋅ 𝑃(𝑋2|𝑋1) ⋅ 𝑃(𝑋3|𝑋2, 𝑋1) ⋅ …
○ 𝑃(𝑋1) = (𝑃(𝑋1) + 1) ∕ (∑𝑥𝑖 𝑃(𝑋𝑖))
○ None of the above
The naive assumption is that each feature is independent of all other features given the class label. That is, 𝑋𝑖s are conditionally independent given C, which is reflected in the second choice.
(iii) [1 pt] Which of these are modelled directly in Logistic Regression? Select all that apply.
■ 𝑃(𝐶|𝑋)
□ 𝑃(𝑋|𝐶)
□ 𝑃(𝐶)
□ 𝑃(𝑋)
○ None of the above
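For reference, a minimal bag-of-words Naive Bayes sketch in Python; the toy corpus and Laplace smoothing are illustrative assumptions, not part of the exam.

```python
from collections import Counter
import math

docs = [("good great fun", 1), ("bad awful boring", -1), ("great fun", 1), ("bad fun", -1)]
classes = [1, -1]
prior = {c: sum(1 for _, y in docs if y == c) / len(docs) for c in classes}      # P(C)
counts = {c: Counter(w for text, y in docs if y == c for w in text.split()) for c in classes}
vocab = {w for text, _ in docs for w in text.split()}

def log_likelihood(word, c):                 # log P(X_i = word | C = c), Laplace-smoothed
    total = sum(counts[c].values())
    return math.log((counts[c][word] + 1) / (total + len(vocab)))

def predict(text):
    scores = {c: math.log(prior[c]) + sum(log_likelihood(w, c) for w in text.split())
              for c in classes}
    return max(scores, key=scores.get)

print(predict("great fun"))                  # 1
```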
(b) In this part, we will consider different scenarios about the data we have.
(i) [1 pt] Assume we only have two points in our training dataset which are linearly separable using the feature 𝑋1. Which of the following methods will be able to achieve zero training error? Select all that apply.
■ Naive Bayes
■ Bayes classifier (i.e. naive bayes with no naive assumption)
■ Logistic Regression
■ Perceptron
■ A very large neural network with many nonlinearities
○ None of the above
(ii) [1 pt] Assume we now have a very large training dataset which is not linearly separable using any individual variable, but is separable given 𝑋1 ⋅ 𝑋2. Which of the following methods will be able to achieve zero training error (without augmenting the data set with additional features)? Select all that apply.
□ Naive Bayes
■ Bayes classifier (i.e. naive bayes with no naive assumption)
□ Logistic Regression
□ Perceptron (with no softmax on the output)
■ A very large neural network with many nonlinearities
○ None of the above
Bayes classifier is able to model 𝑃 (𝑋1, 𝑋2), which lets us classify correctly. A neural network is able to model the interaction between 𝑋1 and 𝑋2
(iii) [1 pt] Now assume that the true dataset is linearly separable but our training set has a single mis-labeled data point. Which of the following are true? Select all that apply.
□ Logistic regression may output probabilities greater than 1 27
SID:
Q7. [10 pts] Generating and classifying text
In this question, we are interested in modelling sentences. Assume each sentence is represented using a bag-of-words (i.e. a vector which contains counts for each word in a sentence). We are interested in classifying whether a sentence (represented as a bag of words 𝑋) is a positive-sounding (class 𝐶 = 1) or negative-sounding (𝐶 = −1) sentence. 𝑋1, 𝑋2, …𝑋𝑝 represent the entries for individual words in the bag-of-words representation 𝑋.
(a) In this question, we are interested in the basics of Naive Bayes and logistic regression. (i) [1 pt] Which of these are modelled explicitly in Naive Bayes? Select all that apply.
□ 𝑃(𝐶|𝑋)
■ 𝑃(𝑋|𝐶)
■ 𝑃(𝐶)
􏰁 None of the above

■ Perceptron (with no softmax on the output) may not converge
􏰁 None of the above
Perceptron updates may continue to oscillate due to the misclassified outlier.
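As a quick, made-up demonstration of that oscillation (the 1-D dataset, labels, and epoch count below are illustrative assumptions, not from the exam):

```python
# Illustrative sketch: a perceptron run on a 1-D dataset that is linearly
# separable except for one mislabeled point, so the updates never settle.
data = [(-2.0, -1), (-1.0, -1), (1.0, +1), (2.0, +1), (1.5, -1)]  # last point mislabeled

w, b = 0.0, 0.0
for epoch in range(5):
    mistakes = 0
    for x, y in data:
        if y * (w * x + b) <= 0:        # misclassified -> perceptron update
            w += y * x
            b += y
            mistakes += 1
    print(f"epoch {epoch}: w={w:.1f}, b={b:.1f}, mistakes={mistakes}")
# Because no linear threshold separates all five points, every pass must
# misclassify something, so the weights keep changing forever.
```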
(iv) [1 pt] Assume we initially have a model trained on a very large dataset with the same number of positive and negative examples. Then, we duplicate all the positive examples in our training set and re-add them (resulting in a training set 1.5 times the size of the original training set). We then train a second model on the new training set. Which of the following is true? Select all that apply.
■ In logistic regression, 𝑃(𝐶 = 1|𝑋) will generally be higher for the retrained model than the original model
□ In naive bayes, 𝑃(𝑋 = 𝑥|𝐶 = 1) will be higher for some 𝑥 for the retrained model than the original model
■ In naive bayes, 𝑃(𝐶 = 1) will generally be higher for the retrained model than the original model
□ In naive bayes, 𝑃(𝑋 = 1) will generally be higher for the retrained model than the original model
􏰁 None of the above
Duplicating all the positive samples does not change the distribution of X given that the example is positive.
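A small, self-contained sketch of this point, using a made-up toy dataset and maximum-likelihood Naive Bayes estimates (all names and counts are illustrative assumptions): the prior 𝑃(𝐶 = 1) shifts, but 𝑃(word | 𝐶 = 1) does not.

```python
from collections import Counter

# Hypothetical toy dataset: (label, word) pairs for a one-word "bag of words".
data = [(+1, "good"), (+1, "great"), (-1, "bad"), (-1, "awful")]

def nb_estimates(dataset):
    """Maximum-likelihood Naive Bayes estimates P(C) and P(word | C)."""
    class_counts = Counter(label for label, _ in dataset)
    prior = {c: class_counts[c] / len(dataset) for c in class_counts}
    word_counts = Counter(dataset)  # counts of (label, word) pairs
    likelihood = {(w, c): word_counts[(c, w)] / class_counts[c]
                  for (c, w) in word_counts}
    return prior, likelihood

original = nb_estimates(data)
duplicated = nb_estimates(data + [x for x in data if x[0] == +1])

print("P(C=1) before:", original[0][+1])       # 0.5
print("P(C=1) after: ", duplicated[0][+1])     # ~0.667 -- the prior shifts
print("P(good|C=1) before:", original[1][("good", +1)])    # 0.5
print("P(good|C=1) after: ", duplicated[1][("good", +1)])  # 0.5 -- unchanged
```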
(c) We are now interested in generating text (still in the form of bag-of-words) from each of our classes.
(i) [1 pt] If we have already fit naive bayes for text classification, which distribution can we use to generate text for the positive class?
􏰁 𝑃(𝐶)
􏰁 𝑃(𝑋)
􏰀 𝑃(𝑋|𝐶)
􏰁 𝑃(𝐶|𝑋)
􏰁 None of the above
(ii) [1 pt] Assuming we have an infinite amount of data, which of the following modeling assumptions is generally able to more accurately model the true distribution of 𝑃(𝑋) (and thus to generate more realistic bag-of-words sentences)?
􏰁 𝑃(𝑋) = 𝑃(𝑋1) ⋅ 𝑃(𝑋2) ⋅ 𝑃(𝑋3) ⋅ 𝑃(𝑋4) ⋅ …
􏰀 𝑃(𝑋) = 𝑃(𝑋1) ⋅ 𝑃(𝑋2|𝑋1) ⋅ 𝑃(𝑋3|𝑋2, 𝑋1) ⋅ …
􏰁 They are the same
The chain-rule factorization makes no independence assumptions, so with infinite data it can represent 𝑃(𝑋) exactly, while the fully factored product cannot.
(iii) [1 pt] Which of the following will best help us generate text with words in the correct order?
􏰁 Using Laplace smoothing
􏰁 Predicting 𝑃(𝑋|𝐶) using a neural network
􏰀 Changing the representation of the text (to use something other than bag-of-words)
􏰁 None of the above
Regardless of the model, bag of words cannot represent the order of words.
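As a sketch of what sampling from 𝑃(𝑋|𝐶) could look like for a fitted bag-of-words Naive Bayes model (the vocabulary, probabilities, and sentence length are made-up assumptions, not from the exam), note that the output is a bag of counts, which is exactly why word order cannot be recovered:

```python
import random

# Hypothetical per-class word distributions P(word | C) from a fitted Naive
# Bayes model over a tiny vocabulary; the numbers are made up for illustration.
p_word_given_class = {
    +1: {"good": 0.5, "great": 0.4, "bad": 0.1},
    -1: {"good": 0.1, "bad": 0.5, "awful": 0.4},
}

def generate_bag_of_words(c, length=5, seed=0):
    """Sample a bag of words from P(X | C=c) by drawing `length` words i.i.d.
    (the naive assumption makes word draws independent given the class)."""
    rng = random.Random(seed)
    words = list(p_word_given_class[c])
    probs = [p_word_given_class[c][w] for w in words]
    sample = rng.choices(words, weights=probs, k=length)
    return {w: sample.count(w) for w in set(sample)}  # counts, no word order

print(generate_bag_of_words(+1))  # e.g. {'good': 3, 'great': 2}
print(generate_bag_of_words(-1))  # e.g. {'bad': 3, 'awful': 2}
```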
Q8. [16 pts] Deep “Blackjack”
To celebrate the end of the semester, you visit Las Vegas and decide to play a good, old fashioned game of “Blackjack”!
Recall that the game has states 0, 1, …, 8, corresponding to dollar amounts, and a 𝐷𝑜𝑛𝑒 state where the game ends. The player starts with $2, i.e. at state 2. The player has two actions: Stop (𝑎 = 0) and Roll (𝑎 = 1), and is forced to take the Stop action at states 0, 1, and 8.
When the player takes the Stop action (𝑎 = 0), they transition to the 𝐷𝑜𝑛𝑒 state and receive reward equal to the amount of dollars of the state they transitioned from: e.g. taking the stop action at state 3 gives the player $3. The game ends when the player transitions to 𝐷𝑜𝑛𝑒.
The Roll action (𝑎 = 1) is available from states 2-7. The player rolls a biased 6-sided die. If the player Rolls from state s and the die lands on outcome 𝑜, the player transitions to state 𝑠 + 𝑜 − 2, as long as 𝑠 + 𝑜 − 2 ≤ 8 (𝑠 is the amount of dollars of the current state, 𝑜 is the amount rolled, and the negative 2 is the price to roll). If 𝑠 + 𝑜 − 2 > 8, the player busts, i.e. transitions to Done and does NOT receive reward.
As the bias of the die is unknown, you decided to perform some good old-fashioned reinforcement learning (RL) to solve the game. However, unlike in the midterm, you have decided to flex and solve the game using approximate Q-learning. Not only that, you decided not to design any features – the features for the Q-value at (𝑠, 𝑎) will simply be the vector [𝑠 𝑎], where 𝑠 is the state and 𝑎 is the action.
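To make the game dynamics concrete before the questions, here is a minimal sketch of a single transition as described above; since the die's bias is unknown in the problem, the uniform roll below is only an illustrative placeholder.

```python
import random

DONE = "Done"

def step(state, action, rng=random):
    """One transition of the 'Blackjack' MDP described above.
    action 0 = Stop, action 1 = Roll.  Returns (next_state, reward).
    The die bias is unknown in the problem; a fair die is used here
    purely as an illustrative stand-in."""
    if action == 0 or state in (0, 1, 8):   # Stop is forced at states 0, 1, 8
        return DONE, state                   # cash out current dollar amount
    outcome = rng.randint(1, 6)              # die roll o
    nxt = state + outcome - 2                # pay $2 to roll, gain o
    if nxt > 8:                              # bust: transition to Done, no reward
        return DONE, 0
    return nxt, 0

def features(state, action):
    """Feature vector [s a] used for approximate Q-learning."""
    return [state, action]

# Example: roll once from the starting state ($2).
print(step(2, 1, random.Random(0)))
```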
(a) First, we will investigate how your choice of features impacts whether or not you can learn the optimal policy. Suppose the unique optimal policy in the MDP is the following:
State:   2     3     4     5     6     7
𝜋∗(𝑠):  Roll  Roll  Roll  Stop  Stop  Stop
For each of the cases below, select “Possible with large neural net” if the policy can be expressed by using a large neural net to represent the Q-function using the features specified as input. (That is, the greedy policy with respect to some Q-function representable with a large neural network is the optimal policy: 𝑄(𝑠,𝜋∗(𝑠)) > 𝑄(𝑠,𝑎) for all states 𝑠 and actions 𝑎 ≠ 𝜋∗(𝑠).) Select “Possible with weighted sum” if the policy can be expressed by using a weighted linear sum to represent the Q-function. Select “Not Possible” if expressing the policy with given features is impossible no matter the function.
(i) [1 pt] Suppose we decide to use the state 𝑠 and action 𝑎 as the features for 𝑄(𝑠, 𝑎).
■ Possible with large neural network □ Possible with linear weighted sum of features 􏰁 Not possible
A sufficiently large neural network could represent the true optimal 𝑄-function using this feature representation. The optimal 𝑄-function satisfies the desired property (there are no ties as the optimal policy is unique). Alternatively, a sufficiently large neural network could represent a function that is 1 for the optimal action in each state, and 0 otherwise, which also suffices.
No linear weighted sum of features can represent this optimal policy. To see this, let our linear weighted sum be 𝑤0𝑠 + 𝑤1𝑎 + 𝑤2. We need 𝑄(4, 1) > 𝑄(4, 0) and 𝑄(5, 0) > 𝑄(5, 1). Plugging in the expression for Q, the former inequality requires that 𝑤1 > 0, while the second requires that 𝑤1 < 0, a contradiction. So we cannot represent the policy with a weighted sum of features.
(ii) [1 pt] Now suppose we decide to use 𝑠 + 𝑎 as the feature for 𝑄(𝑠, 𝑎).
■ Possible with large neural network □ Possible with linear weighted sum of features 􏰁 Not possible
Indeed, it is possible that no neural network can represent the optimal 𝑄-function with this feature representation, as 𝑄(4, 1) does not have to equal 𝑄(5, 0). However, the question is not asking about representing the optimal 𝑄-function, but instead the optimal policy, which merely requires that 𝑄(𝑠, 1) > 𝑄(𝑠, 0) for 𝑠 ≤ 4 and 𝑄(𝑠, 0) > 𝑄(𝑠, 1) for 𝑠 ≥ 5. This can be done with this feature representation. For example, the neural network in part (d) can represent the following function (using 𝑤0 = −5 and 𝑤1 = −2), shown in Figure 1, which represents the optimal policy:
Again, no linear weighted sum of features can represent this optimal policy. To see this, let our linear weighted sum be 𝑤0𝑠+𝑤0𝑎+𝑤1. This is a special case of the linear weighted sums in part (i), which we know cannot represent the optimal policy.
Figure 1: A Q-function from 𝑠 + 𝑎, that represents the optimal policy.
(iii) [1 pt] Now suppose we decide to use 𝑎 as the feature for 𝑄(𝑠, 𝑎).
□ Possible with large neural network □ Possible with linear weighted sum of features 􏰀 Not possible
This isn’t possible, regardless of what function you use. With this representation, we will have 𝑄(4, 1) = 𝑄(5, 1) and 𝑄(4, 0) = 𝑄(5, 0). So we cannot both have 𝑄(4, 1) > 𝑄(4, 0) and 𝑄(5, 0) > 𝑄(5, 1).
(iv) [1 pt] Now suppose we decide to use sign(𝑠 − 4.5) ⋅ 𝑎 as the feature for 𝑄(𝑠, 𝑎), where sign(𝑥) is −1 if 𝑥 < 0, 1 if 𝑥 > 0, and 0 if 𝑥 = 0.
■ Possible with large neural network ■ Possible with linear weighted sum of features 􏰁 Not possible
𝑄(𝑠, 𝑎) = −sign(𝑠 − 4.5) ⋅ 𝑎 is sufficient to represent the optimal policy, as we have 𝑄(𝑠, 0) = 0 for all 𝑠, 𝑄(𝑠, 1) = 1 > 𝑄(𝑠, 0) for 𝑠 ≤ 4, and 𝑄(𝑠, 1) = −1 < 𝑄(𝑠, 0) for 𝑠 ≥ 5. This is a linear function of the input, and so can also be represented using a neural network.
(b) [4 pts] Next, we investigate the effect of different neural network architectures on your ability to learn the optimal policy. Recall that our features for the Q-value at (𝑠, 𝑎) will simply be the vector [𝑠 𝑎], where 𝑠 is the state and 𝑎 is the action. In addition, suppose that the unique optimal policy is the following:
State:   2     3     4     5     6     7
𝜋∗(𝑠):  Roll  Roll  Roll  Stop  Stop  Stop
Which of the following neural network architectures can express Q-values that represent the optimal policy? That is, the greedy policy with respect to some Q-function representable with the given neural network is the optimal policy: 𝑄(𝑠, 𝜋∗(𝑠)) > 𝑄(𝑠, 𝑎) for all states 𝑠 and actions 𝑎 ≠ 𝜋∗(𝑠).
Hint: Recall that $\mathrm{ReLU}(x) = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases}$
□ Neural Network 1:
□ Neural Network 2:
□ Neural Network 3:
□ Neural Network 4:
■ Neural Network 5:
􏰁 None of the above.
Recall from the previous question that no linear function of the features [𝑠 𝑎] can represent the optimal policy. So network
1, which is linear (as it has no activation function), cannot represent the optimal policy.
Network 2 cannot represent the optimal function, as it does not take as input the action. So 𝑄(𝑠, 0) = 𝑄(𝑠, 1) for all states
𝑠.
Network 3 cannot simultaneously satisfy 𝑄(4, 0) < 𝑄(4, 1) and 𝑄(5, 0) > 𝑄(5, 1). This is because the rectified linear unit is a monotonic function: if 𝑥 ≥ 𝑦, then 𝑅𝑒𝐿𝑈 (𝑥) ≥ 𝑅𝑒𝐿𝑈 (𝑦). Since we cannot represent the optimal policy using a linear function of 𝑠, 𝑎, we cannot represent it with a ReLU of a linear function of 𝑠, 𝑎.
Network 4 is always 0, so it cannot represent the (unique) optimal policy.
Network 5 can represent the optimal policy. For example, 𝑤0 = −4, 𝑤1 = −2 represents the optimal policy.
(c) [1 pt] As with linear approximate Q-learning, you decide to minimize the squared error of the Bellman residual. Let 𝑄𝐰(𝑠, 𝑎) be the approximate 𝑄-value of 𝑠, 𝑎. After taking action 𝑎 in state 𝑠 and transitioning to state 𝑠′ with reward 𝑟, you first compute the target, $\text{target} = r + \gamma \max_{a'} Q_{\mathbf{w}}(s',a')$. Then your loss is:
$\text{loss}(\mathbf{w}) = \frac{1}{2}\,\big(Q_{\mathbf{w}}(s,a) - \text{target}\big)^2$
You then perform gradient descent to minimize this loss. Note that we will not take the gradient through the target – we treat it as a fixed value.
Which of the following updates represents one step of gradient descent on the weight parameter 𝑤𝑖 with learning rate 𝛼 ∈ (0, 1) after taking action 𝑎 in state 𝑠 and transitioning to state 𝑠′ with reward 𝑟? [Hint: which of these is equivalent to the normal approximate Q-learning update when $Q_{\mathbf{w}}(s,a) = \mathbf{w} \cdot \mathbf{f}(s,a)$?]
􏰁 $w_i = w_i + \alpha \left(Q_{\mathbf{w}}(s,a) - \big(r + \gamma \max_{a'} Q_{\mathbf{w}}(s',a')\big)\right) \frac{\partial Q_{\mathbf{w}}(s,a)}{\partial w_i}$
􏰀 $w_i = w_i - \alpha \left(Q_{\mathbf{w}}(s,a) - \big(r + \gamma \max_{a'} Q_{\mathbf{w}}(s',a')\big)\right) \frac{\partial Q_{\mathbf{w}}(s,a)}{\partial w_i}$
􏰁 $w_i = w_i + \alpha \left(Q_{\mathbf{w}}(s,a) - \big(r + \gamma \max_{a'} Q_{\mathbf{w}}(s',a')\big)\right) s$
􏰁 $w_i = w_i - \alpha \left(Q_{\mathbf{w}}(s,a) - \big(r + \gamma \max_{a'} Q_{\mathbf{w}}(s',a')\big)\right) s$
􏰁 None of the above.
Note that the gradient of the loss with respect to the parameter, via the chain rule, is:
$\left(Q_{\mathbf{w}}(s,a) - \big(r + \gamma \max_{a'} Q_{\mathbf{w}}(s',a')\big)\right) \frac{\partial Q_{\mathbf{w}}(s,a)}{\partial w_i}$
The second option performs gradient descent, the first option is gradient ascent, and the other options compute the gradient incorrectly.
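As an illustrative sketch (not the staff's code) of the selected update applied to a linear Q-function $Q_{\mathbf{w}}(s,a) = \mathbf{w} \cdot \mathbf{f}(s,a)$, where the gradient is just the feature vector; the bias feature and the constants below are assumptions added for the example:

```python
import numpy as np

# Minimal sketch of the selected semi-gradient update for a linear Q-function
# Q_w(s, a) = w . f(s, a); the feature map and numbers are illustrative.
def f(s, a):
    return np.array([s, a, 1.0])   # features [s, a] plus a bias term (assumption)

def q(w, s, a):
    return float(np.dot(w, f(s, a)))

def q_update(w, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One step of gradient descent on 1/2 (Q_w(s,a) - target)^2,
    treating the target as a constant (no gradient through it)."""
    target = r + gamma * max(q(w, s_next, a2) for a2 in actions)
    error = q(w, s, a) - target
    grad = f(s, a)                  # dQ_w/dw_i for a linear Q-function
    return w - alpha * error * grad

w = np.zeros(3)
w = q_update(w, s=2, a=1, r=0.0, s_next=3, actions=[0, 1])
print(w)
```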
(d) While programming the neural network, you’re getting some bizarre errors. To debug these, you decide to calculate the gradients by hand and compare them to the result of your code.
Suppose your neural network is the following:
Figure 2: Neural Network 6. That is, $Q_{\mathbf{w}}(s,a) = s + a + w_1 \cdot \mathrm{ReLU}(w_0 + s + a)$.
You are able to recall that $\frac{d}{dx}\,\mathrm{ReLU}(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}$.
(i) [1 pt] Suppose 𝑤0 = −4 and 𝑤1 = −1. What is 𝑄𝐰(5, 0)?
𝑄𝐰(5, 0) = 4
Plugging in values, we get 𝑄𝐰(5, 0) = 5 − 𝑅𝑒𝐿𝑈(−4 + 5) = 4.
(ii) [2 pts] Suppose 𝑤0 = −4 and 𝑤1 = −1. What is the gradient with respect to 𝑤0, evaluated at 𝑠 = 5, 𝑎 = 0?
$\frac{\partial}{\partial w_0} Q_{\mathbf{w}}(5, 0) = -1$
Since the input to the ReLU is positive, the gradient of the ReLU is 1. Applying the chain rule, we get that $\frac{\partial}{\partial w_0} Q_{\mathbf{w}}(5, 0) = w_1 = -1$.
(iii) [2 pts] Suppose 𝑤0 = −4 and 𝑤1 = −1. What is the gradient with respect to 𝑤0, evaluated at 𝑠 = 3, 𝑎 = 0?
$\frac{\partial}{\partial w_0} Q_{\mathbf{w}}(3, 0) = 0$
Since the input to the ReLU is negative, the gradient of the ReLU is 0. Applying the chain rule, the gradient with respect to 𝑤0 has to be 0.
(e) After picking a feature representation, neural network architecture, and update rule, as well as calculating the gradients, it's time to turn to the age-old question... will this even work?
(i) [1 pt] Without any other assumptions, is it guaranteed that your approximate 𝑄-values will converge to the optimal policy, if each 𝑠, 𝑎 pair is observed an infinite amount of times?
􏰁 Yes 􏰀 No
(ii) [1 pt] Without any other assumptions, is it guaranteed that your approximate 𝑄-values will converge to the optimal policy, if each 𝑠, 𝑎 pair is observed an infinite amount of times and there exists some 𝑤 such that 𝑄𝐰(𝑠, 𝑎) = 𝑄∗(𝑠, 𝑎)?
􏰁 Yes 􏰀 No
Note that there is no guarantee that your neural network will converge in this case. For example, the learning rate can be too large! (As in the RL Blackjack question on the midterm.)
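Looking back at part (d), here is a small sketch that evaluates Network 6, 𝑄𝐰(𝑠, 𝑎) = 𝑠 + 𝑎 + 𝑤1 ⋅ ReLU(𝑤0 + 𝑠 + 𝑎), and checks the hand-computed gradients; the finite-difference check is an illustrative addition, not part of the exam.

```python
def relu(x):
    return x if x > 0 else 0.0

def q(w0, w1, s, a):
    """Network 6 from Figure 2: Q_w(s, a) = s + a + w1 * ReLU(w0 + s + a)."""
    return s + a + w1 * relu(w0 + s + a)

def dq_dw0(w0, w1, s, a):
    """Analytic gradient: w1 * ReLU'(w0 + s + a)."""
    return w1 * (1.0 if w0 + s + a >= 0 else 0.0)

def numeric_dq_dw0(w0, w1, s, a, eps=1e-6):
    """Finite-difference check of the same gradient (illustrative only)."""
    return (q(w0 + eps, w1, s, a) - q(w0 - eps, w1, s, a)) / (2 * eps)

w0, w1 = -4.0, -1.0
print(q(w0, w1, 5, 0))               # 4.0, matching part (d)(i)
print(dq_dw0(w0, w1, 5, 0))          # -1.0, matching part (d)(ii)
print(dq_dw0(w0, w1, 3, 0))          # 0.0, matching part (d)(iii)
print(numeric_dq_dw0(w0, w1, 5, 0))  # approximately -1.0
```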