
CS 188 Introduction to Artificial Intelligence
Fall 2019 Final Exam
• You have approximately 170 minutes. The time will be projected at the front of the room. You may not leave during the last 10 minutes of the exam.
• The exam is closed book, closed calculator, and closed notes except your two-page crib sheet.
• Mark your answers ON THE EXAM IN THE DESIGNATED ANSWER AREAS. Provide a brief explanation if applicable.
• In the interest of fairness, we want everyone to have access to the same information. To that end, we will not be answering questions about the content. If a clarification is needed, it will be projected at the front of the room. Make sure to periodically check the clarifications.
• For multiple choice questions,
– □ means mark all options that apply
– 􏰁 means mark a single choice
– When selecting an answer, please fill in the bubble or square completely (􏰀 and ■)
– If you need to undo a selection, either erase it completely or lay a giant cross (X) over the box. The staff reserves the right to subtract points from ambiguous answers.
First name
Last name
SID
Student to your right
Student to your left
For staff use only:
Q1. Coin Stars /10
Q2. Five Card Game /14
Q3. Vehicle Perception Indication /10
Q4. Hidden CSP /14
Q5. Snail Bayes /15
Q6. We Are Getting Close… /11
Q7. Generating and classifying text /10
Q8. Deep “Blackjack” /16
Total /100

Q1. [10 pts] Coin Stars
In a new online game called Coin Stars, all players are walking around an M × N grid to collect hidden coins, which only appear when you’re on top of them. There are also power pellets scattered across the board, which are visible to all players. If you walk onto a square with a power pellet, your power level goes up by 1, and the power pellet disappears. Players will also attack each other if one player enters a square occupied by another player. In an attack, the player with the higher power level steals all the coins from the other player. If they have equal power levels, nothing happens. Each turn, players take turns moving in one of the following directions: {N, S, E, W}.
In this problem, you and your friend Amy are playing Coin Stars against each other. You are player 1, and your opponent Amy is player 2. Our state space representation includes the locations of the power pellets (𝑥𝑝𝑗 , 𝑦𝑝𝑗 ) and the following player information:
• Each player’s location (𝑥𝑖, 𝑦𝑖)
• Each player’s power level 𝑙𝑖
• Each player’s coin count 𝑐𝑖
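As a concrete, non-graded illustration of this state space, here is a minimal Python sketch; the class and field names are assumptions for illustration only and are not part of the problem statement.

from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class CoinStarsState:
    # Locations (x_pj, y_pj) of the power pellets still on the board
    pellets: FrozenSet[Tuple[int, int]]
    # Per-player information, indexed by player i
    locations: Tuple[Tuple[int, int], ...]   # (x_i, y_i)
    power_levels: Tuple[int, ...]            # l_i
    coin_counts: Tuple[int, ...]             # c_i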
(a)
(i) [2 pts] Suppose a player wins by collecting more coins at the end of a number of rounds, so we can formulate this as a minimax problem with the value of the node being 𝑐1 − 𝑐2. Consider the following game tree where you are the maximizing player (maximizing your net advantage, as seen above) and the opponent is the minimizer. Assuming both players act optimally, if a branch can be pruned, fill in its square completely, otherwise leave the square unmarked.
􏰁 None of the above can be pruned
(ii) [1 pt] Suppose that instead of the player with more coins winning, every player receives payout equal to the number of coins they’ve collected. Can we still use a multi-layer minimax tree (like the one above) to find the optimal action?
􏰁 Yes, because the update in payout policy does not affect the minimax structure of the game.
􏰁 Yes, but not for the reason above
􏰁 No, because we can no longer model the game under the updated payout policy with a game tree.
􏰁 No, but not for the reason above

(b) Suppose we want to train a Reinforcement Learning (RL) agent using Q-learning to play against a wide range of human participants, with the objective to obtain more coins than its opponent like in (a.i).
Your friend Blake discovers that a computer scientist and expert Coin Stars player Oswald has published (open-sourced) his policy 𝜋𝑒, which consists of a set of expert features 𝐹𝑒 = [𝑓1, 𝑓2, …, 𝑓𝑛], all of which can be computed using the current state representation mentioned in the problem statement. Oswald has hand-tuned the weights 𝑊𝑒 = [𝑤1, 𝑤2, …, 𝑤𝑛] of the policy 𝜋𝑒 such that 𝜋𝑒 is near optimal assuming the opponent also plays optimally.
You decide to use the same set of features as Oswald’s 𝐹𝑒, and come up with three proposed training procedures, all of which will learn the optimal policy (i.e. the best response) against that particular opponent:
I Playing against an opponent who plays 𝜋𝑒, Oswald’s expert-tuned policy that assumes our agent plays optimally
II Playing against an opponent who plays a fixed policy 𝜋𝑝, a sub-optimal policy that moves to maximize the collection of power pellets, and chases its opponent afterwards.
III Playing against an opponent who plays a fixed policy 𝜋𝑐, a sub-optimal policy that ignores power pellets and its opponent, and moves to maximize the collection of hidden coins.
(i) [1 pt] With which of the training procedures is our agent most likely to learn a set of weights that can achieve the highest win rate against Oswald, an expert human participant who plays according to 𝜋𝑒?
􏰁 (I)
􏰁 (II)
􏰁 (III)
􏰁 There is not enough information to distinguish among the three
(ii) [1 pt] Is either of (II) or (III) guaranteed to result in an agent who achieves a higher win rate than (I) when playing against novice human participants who play sub-optimally, in an unknown way?
􏰁 Yes, because (II) and (III) are trained with sub-optimal opponents, whereas (I) is trained with expert opponents.
􏰁 Yes, but not for the reason above.
􏰁 No, because the novice human may not be sub-optimal in the same way as the opponents in (II) or (III).
􏰁 No, but not for the reason above.
(iii) [2 pts] Suppose instead of training an RL agent against an agent with a fixed strategy, we instead train the RL agent against itself. That is, the RL agent plays against an opponent with its weights, and the features are computed from each agent’s point of view.
Which of the following are potential issues in this self-play Reinforcement Learning approach?
□ The RL agent’s weights are not guaranteed to converge because it is playing against itself.
□ Because the opponent you play against can be arbitrarily bad, the RL agent cannot improve at the game at all.
□ Because the RL agent is playing against the same opponent (itself), the best (highest reward) policy from its perspective does not change over the course of training.
□ Because the RL agent is playing against itself, it does not need to do exploration (for example, it can just do the greedy action with respect to its current Q-values).
􏰁 None of the above
(c) [1 pt] For this part only, suppose the game randomly reveals some hidden coins throughout the game, making them visible to both you and your friend. Under which of the conditions below will both players benefit the most from the feature “Distance to closest visible coins”?
􏰁 Coins are sparse (∼ 1 coin per 50 squares), reveal frequency high (∼ 1 coin becomes visible per 1 move)
􏰁 Coins are sparse (∼ 1 coin per 50 squares), reveal frequency low (∼ 1 coin becomes visible per 50 moves)
􏰁 Coins are dense (∼ 1 coin per 2 squares), reveal frequency high (∼ 1 coin becomes visible per 1 move)
􏰁 Coins are dense (∼ 1 coin per 2 squares), reveal frequency low (∼ 1 coin becomes visible per 50 moves)
􏰁 Equally useful in all conditions above
(d) Using the original problem setup, we have the following features and weights for a given state 𝑠:

Feature                                                                          Initial Weight
𝑓1(𝑠, 𝑎) = |𝑥1 − 𝑥2| + |𝑦1 − 𝑦2|                                                 𝑤1 = 1
𝑓2(𝑠, 𝑎) = 𝑙1 − 𝑙2                                                               𝑤2 = 4
𝑓3(𝑠, 𝑎) = 1 / (the Manhattan distance from player 1 to their closest pellet)    𝑤3 = 2

(i) [1 pt] Calculate 𝑄(𝑠, 𝑎) for the q-state where player 1 is at (1, 1) and player 2 is at (5, 4). Player 1 has power level 7, and player 2 has power level 3. The closest pellet to player 1 is located at (2, 1).
􏰁 24  􏰁 25  􏰁 26  􏰁 27  􏰁 28  􏰁 A value different from all of the above
􏰁 There is not enough information to calculate this value
(ii) [1 pt] We observe a sample (𝑠, 𝑎, 𝑟) for the q-state in the previous part with 𝑟 = 23. Assuming a learning rate of 𝛼 = 0.5 and discount factor of 𝛾 = 0.5, calculate the new weight 𝑤1_𝑛𝑒𝑤 after a single update of approximate Q-Learning. This part will be graded independently of the previous part, so please check your work before proceeding.
𝑤1_𝑛𝑒𝑤 =
􏰁 -13  􏰁 -6  􏰁 1  􏰁 8  􏰁 15  􏰁 A value different from all of the above
􏰁 There is not enough information to calculate this value
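For reference, the generic approximate Q-learning computation used throughout this part can be sketched as follows; the function and variable names are illustrative assumptions, and no exam-specific numbers are filled in.

from typing import List

def q_value(weights: List[float], features: List[float]) -> float:
    # Linear function approximation: Q(s, a) = sum_i w_i * f_i(s, a)
    return sum(w * f for w, f in zip(weights, features))

def q_learning_update(weights: List[float],
                      feats_sa: List[float],   # f_i(s, a)
                      reward: float,
                      max_q_next: float,       # max_a' Q(s', a'); 0 if s' is terminal
                      alpha: float,
                      gamma: float) -> List[float]:
    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    difference = (reward + gamma * max_q_next) - q_value(weights, feats_sa)
    # w_i <- w_i + alpha * difference * f_i(s, a)
    return [w + alpha * difference * f for w, f in zip(weights, feats_sa)]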

Q2. [14 pts] Five Card Game
There is a two-player zero-sum card game called Five Card Game. Player A plays odd-numbered turns (and thus plays first), and he initially has the cards labeled 1, 3, 5 in his hand. Player B plays even-numbered turns, and she initially has the cards labeled 2 and 4 in her hand. They take turns to give out one of their cards to the judge based on their policies. The game ends after 5 turns and forms the sequence 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5, where 𝑋1, 𝑋3, 𝑋5 is a permutation of 1, 3, 5 provided by player A, and 𝑋2, 𝑋4 is a permutation of 2, 4 provided by player B.
However, neither of the two players has access to what cards their opponent plays throughout the game. The only hint they receive is from the judge: when 𝑋𝑡 has been determined by either of the players (𝑡 ≥ 2), the judge of the game will compare 𝑋𝑡 and 𝑋𝑡−1, and assign one point to either of the players. With probability 𝑝𝑡 (0 ≤ 𝑝𝑡 ≤ 1), the point will be given to the player whose card is larger, and with probability 1 − 𝑝𝑡, the point will be given to the player whose card is smaller. 𝑝2, 𝑝3, 𝑝4, 𝑝5 are pre-defined before the game and everyone knows their values. Both the players and the audience know the point assignment right after the judge assigns it. The player who wins more points wins the game.
We denote the point that player A earned at each time step as 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴 (value can be either 0 or 1), and thus all the variables are determined in the following order: 𝑋1, 𝑋2, 𝑈2𝐴, 𝑋3, 𝑈3𝐴, 𝑋4, 𝑈4𝐴, 𝑋5, 𝑈5𝐴. Note player A determines 𝑋3 after knowing 𝑈2𝐴, and player B determines 𝑋4 after knowing 𝑈2𝐴 and 𝑈3𝐴, and player A determines 𝑋5 after knowing 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴.
As an illustration, if 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5 is 3, 2, 5, 4, 1, and 𝑝𝑡 = 1 for all 𝑡 (so that the bigger card’s owner always gets the point), then 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴 is 1, 1, 1, 0 and player A wins this game. In this example, 𝑈2𝐴 = 1 because 𝑋2 < 𝑋1 and hence the point is awarded to player A, while 𝑈5𝐴 = 0 because 𝑋5 < 𝑋4 and hence the point is awarded to player B.
(a) Suppose 𝑝𝑡 = 1 for all 𝑡.
(i) [1 pt] Each player uses their own optimal deterministic policy and is aware that their opponent is also using the optimal deterministic policy. How many points will Player A obtain? Hint: Drawing a game tree might help you here.
Answer:
(ii) [1 pt] Player B decides to give out her cards randomly (with equal probability for each ordering). If Player A becomes aware of this, what is the expected number of points Player A will obtain if he plays optimally?
Answer:
Now we assume both the players have their own random policies. Suppose a player’s policy for each 𝑋𝑡 is a conditional probability table, conditioned on the MINIMUM set of DEPENDENT information they know when playing the card 𝑋𝑡. Then at each time step 𝑡, they randomly sample which card to draw according to their probability table. During the game, each player only knows their own policy. In addition, the policies do not change over time. We want to construct a Bayes Net with the minimum number of edges to investigate the relations between the nodes 𝑋1, 𝑋2, 𝑈2𝐴, 𝑋3, 𝑈3𝐴, 𝑋4, 𝑈4𝐴, 𝑋5, 𝑈5𝐴.
Hint 1: Drawing the Bayes Net with nodes 𝑋1, 𝑋2, 𝑈2𝐴, 𝑋3, 𝑈3𝐴, 𝑋4, 𝑈4𝐴, 𝑋5, 𝑈5𝐴 while answering part (b) will help you in part (c). But only your response to the questions will be graded. No partial credit for the drawings.
Hint 2: Do Player A and Player B know each other’s card sequence exactly? Do they know their own card sequence exactly?
Hint 3: As noted in the problem statement, before any player determines his or her turn 𝑋𝑡, he or she knows all the existing hints 𝑈2𝐴, …, 𝑈𝑡−1𝐴.
(b) (i) [1 pt] Which of the following are directly reflected in the Bayes net’s probability tables (including prior probability tables and conditional probability tables)?
□ The judge’s point assignment probability.
□ The player A’s winning probability.
􏰁 None of the above
(ii) [2 pts] For each table, mark the minimal set of variables that each probability table should be conditioned on.
Player B’s policy probability table to determine 𝑋2: □ 𝑋1 􏰁 None of the above
Player A’s policy probability table to determine 𝑋3: □ 𝑋1 □ 𝑋2 □ 𝑈2𝐴 􏰁 None of the above
Player B’s policy probability table to determine 𝑋4: □ 𝑋1 □ 𝑋2 □ 𝑋3 □ 𝑈2𝐴 □ 𝑈3𝐴 􏰁 None of the above
Player A’s policy probability table to determine 𝑋5: □ 𝑋1 □ 𝑋2 □ 𝑋3 □ 𝑋4 □ 𝑈2𝐴 □ 𝑈3𝐴 □ 𝑈4𝐴 􏰁 None of the above
(c) Analyze the independence or conditional independence between variables.
Hint: Feel free to use the Bayes Net you constructed from part (b), but this part will be graded independently. Please double check your work in part (b).
(i) [1 pt] Which of the following pairs of variables are independent, given no observations?
□ 𝑋2 and 𝑋3 □ 𝑋1 and 𝑋4 􏰁 None of the above
(ii) [1 pt] Which of the following pairs of variables are independent, given only 𝑈2𝐴?
□ 𝑋2 and 𝑋3 □ 𝑋1 and 𝑋4 􏰁 None of the above
Now the game is over and you have only observed the values of 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴, and want to decode the values of 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5. You want to solve this problem by searching over all the possible states. Each of your states contains the card sequence at some point during the game.
For example, one search path from the start state to one goal state could be: () → (3,) → (3, 2) → (3, 2, 5) → (3, 2, 5, 4) → (3, 2, 5, 4, 1) (Goal State).
(d) [1 pt] How many goal states are there in your search problem?
Answer:
Which data structure is more similar to the search graph?
􏰁 Your game tree from part (a) 􏰁 Your Bayes net from part (b)
Now we wish to decode 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5 given 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴 by solving the following problem:
argmax over 𝑋1, 𝑋2, 𝑋3, 𝑋4, 𝑋5 of 𝑃(𝑈2𝐴|𝑋1, 𝑋2) 𝑃(𝑈3𝐴|𝑋2, 𝑋3) 𝑃(𝑈4𝐴|𝑋3, 𝑋4) 𝑃(𝑈5𝐴|𝑋4, 𝑋5)
(e) We then assign proper cost to each edge in the search graph.
(i) [2 pts] Which is a correct edge cost configuration for the edge connecting the states (𝑋1, ..., 𝑋𝑡−1) and (𝑋1, ..., 𝑋𝑡) (where 𝑡 ≥ 2)? Note a correct configuration means that the optimal path of your search problem should always be the correct argmax solution above for any valid player policy probability table values and 𝑝𝑡 values.
􏰁 𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡)
􏰁 −𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡)
􏰁 ln 𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡)
􏰁 − ln 𝑃(𝑈𝑡𝐴|𝑋𝑡−1, 𝑋𝑡)
􏰁 𝑒^𝑃(𝑈𝑡𝐴|𝑋𝑡−1,𝑋𝑡)
􏰁 𝑒^−𝑃(𝑈𝑡𝐴|𝑋𝑡−1,𝑋𝑡)
􏰁 None of the above
Now suppose we observe that 𝑈2𝐴, 𝑈3𝐴, 𝑈4𝐴, 𝑈5𝐴 all equal 1.
(ii) [2 pts] What is the value for 𝑃(𝑈5𝐴|𝑋4, 𝑋5) used to compute the edge cost from the state (3, 2, 5, 4) to the state (3, 2, 5, 4, 1)? Your answer should just be the value of 𝑃(𝑈5𝐴|𝑋4, 𝑋5) rather than the edge cost computed using your answer to part (e)(i).
􏰁 𝑝5 􏰁 1 − 𝑝5 􏰁 Cannot be determined only by 𝑝𝑡 - we also need the players’ policy probability table.
What is the value for 𝑃(𝑈3𝐴|𝑋2, 𝑋3) used to compute the edge cost from the state (3, 2) to the state (3, 2, 5)? Your answer should just be the value of 𝑃(𝑈3𝐴|𝑋2, 𝑋3) rather than the edge cost computed using your answer to part (e)(i).
􏰁 𝑝3 􏰁 1 − 𝑝3 􏰁 Cannot be determined only by 𝑝𝑡 - we also need the players’ policy probability table.
Note we still did not assign the edge cost to the edge connecting () and (𝑋1, ). Let us assign 1 as the cost to all edges of this type.
(f) [2 pts] Now you will conduct A* search on your constructed search graph with the proper edge cost from part (e). The heuristic for each node will be set to be the number of time steps remaining to the game end. For example, the heuristic for the state (3, 2, 5) is 2 (2 steps from the game end). Which of the following statements is correct?
􏰁 Neither tree search nor graph search will be complete.
􏰁 Both tree search and graph search will be complete, but neither is guaranteed to be optimal.
􏰁 Both tree search and graph search will be complete, and only the tree search is guaranteed to be optimal.
􏰁 Both the tree search and the graph search will be complete and optimal.

Q3. [10 pts] Vehicle Perception Indication
A vehicle is trying to identify the situation of the world around it using a set of sensors located around the vehicle. Each sensor reading is based off of an object’s location (LOC) and an object’s movement (MOVE). The sensor reading will then produce various values for its predicted location, predicted movement, and image rendered, which is then sent back to the user.
(a) The vehicle takes an action, and we assign some utility to the action based on the object’s location and movement. Possible actions are MOVE TOWARDS, MOVE AWAY, and STOP. Suppose the decision network faced by the vehicle is the following.
(i) [2 pts] Based on the diagram above, which of the following could possibly be true?
□ VPI (Image) = 0
□ VPI (SRead) < 0
□ VPI (SMove, SRead) > VPI (SRead)
□ VPI (Feedback) = 0
􏰁 None of the above
(ii) [2 pts] Based on the diagram above, which of the following must necessarily be true?
□ VPI (Image) = 0
□ VPI (SRead) = 0
□ VPI (SMove, SRead) = VPI (SRead)
□ VPI (Feedback) = 0
􏰁 None of the above
Let’s assume that your startup has less money, so we use a simpler sensor network. One possible sensor network can be represented as follows.

You have distributions of 𝑃(LOC), 𝑃(MOVE), 𝑃(𝑆𝑅𝑒𝑎𝑑 | LOC, MOVE), 𝑃(𝑆𝐿𝑜𝑐 | 𝑆𝑅𝑒𝑎𝑑) and utility values 𝑈(𝑎, 𝑙, 𝑚).
(b) Complete the equation for determining the expected utility for some ACTION 𝑎.
𝐸𝑈(𝑎) = (i) ( (ii) (iii) 𝑈(𝑎, 𝑙, 𝑚) )
(i) [1 pt]
􏰁 ∑𝑙 𝑃(𝑙)
􏰁 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙)
􏰁 ∑𝑙 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙)
􏰁 ∑𝑙 ∑𝑚 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙)𝑃(𝑠𝑙𝑜𝑐|𝑚)
􏰁 1
(ii) [1 pt]
􏰁 ∑𝑚 𝑃(𝑚)
􏰁 ∑𝑚 𝑃(𝑠𝑙𝑜𝑐|𝑚)
􏰁 ∗ ∑𝑙 ∑𝑚 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙)𝑃(𝑠𝑙𝑜𝑐|𝑚)
􏰁 1
(iii) [1 pt]
􏰁 + ∑𝑙 ∑𝑚 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙, 𝑚)𝑃(𝑙)𝑃(𝑚)
􏰁 + ∑𝑙 ∑𝑚 ∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐|𝑙)𝑃(𝑠𝑙𝑜𝑐|𝑚)
􏰁 ∗ 1
􏰁 1
(c) Your colleague Bob invented a new sensor to observe values of 𝑆𝐿𝑜𝑐.
(i) [1 pt] Suppose that your company had no sensors till this point. Which of the following expressions is equivalent to VPI(𝑆𝐿𝑜𝑐)?
□ VPI(𝑆𝐿𝑜𝑐) = (∑𝑠𝑙𝑜𝑐 𝑃(𝑠𝑙𝑜𝑐) MEU(𝑆𝐿𝑜𝑐 = 𝑠𝑙𝑜𝑐)) − max𝑎 EU(𝑎)
□ VPI(𝑆𝐿𝑜𝑐) = MEU(𝑆𝐿𝑜𝑐) − MEU(∅)
□ VPI(𝑆𝐿𝑜𝑐) = max𝑠𝑙𝑜𝑐 MEU(𝑆𝐿𝑂𝐶 = 𝑠𝑙𝑜𝑐) − MEU(∅)
􏰁 None of the above
(ii) [2 pts] Gaagle, an established company, wants to sell your startup a device that gives you 𝑆𝑅𝑒𝑎𝑑. Given that you already have Bob’s device (that gives you 𝑆𝐿𝑜𝑐), what is the maximum amount of money you should pay for Gaagle’s device? Suppose you value $1 at 1 utility.
□ VPI(𝑆𝑅𝑒𝑎𝑑)
□ VPI(𝑆𝑅𝑒𝑎𝑑) − VPI(𝑆𝐿𝑜𝑐)
□ VPI(𝑆𝑅𝑒𝑎𝑑, 𝑆𝐿𝑜𝑐)
□ VPI(𝑆𝑅𝑒𝑎𝑑, 𝑆𝐿𝑜𝑐) − VPI(𝑆𝐿𝑜𝑐)
□ VPI(𝑆𝑅𝑒𝑎𝑑|𝑆𝐿𝑜𝑐)
􏰁 None of the above
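A minimal sketch of how MEU and VPI could be computed by enumeration for a decision network like the ones in this question; the prior over the world, the observation model, and the utility table passed in are assumed placeholders (here a "world" would be a (LOC, MOVE) pair), since the actual numbers appear only in the figures.

def expected_utility(action, posterior, utility):
    # EU(a | evidence) = sum_w P(w | evidence) * U(a, w)
    return sum(p * utility[(action, w)] for w, p in posterior.items())

def max_expected_utility(posterior, utility, actions):
    return max(expected_utility(a, posterior, utility) for a in actions)

def vpi(prior, p_obs_given_w, utility, actions):
    # VPI(O) = (sum_o P(o) * MEU(O = o)) - MEU(no evidence), all by enumeration.
    meu_no_evidence = max_expected_utility(prior, utility, actions)
    observations = {o for w in prior for o in p_obs_given_w[w]}
    expected_meu = 0.0
    for o in observations:
        p_o = sum(p_obs_given_w[w].get(o, 0.0) * prior[w] for w in prior)
        if p_o == 0:
            continue
        # Posterior over worlds by Bayes' rule: P(w | o) = P(o | w) P(w) / P(o)
        posterior = {w: p_obs_given_w[w].get(o, 0.0) * prior[w] / p_o for w in prior}
        expected_meu += p_o * max_expected_utility(posterior, utility, actions)
    return expected_meu - meu_no_evidence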

Q4. [14 pts] Hidden CSP
We have studied lots of algorithms to solve a CSP with a list of known constraints. But can we solve a CSP without explicitly
knowing the constraints?
The wildfire reaches Berkeley in the middle of CS188 lecture, and all 𝑁 ≥ 1 students need to be evacuated on 𝑀 ≥ 1 buses
(each with capacity 𝑁).
Suppose the only constraints are 𝑘 ≥ 0 pairs of students who have to be on two different buses (for example, “Zelda and Yoda” constitute one pair of students who have to be on two different buses). You don’t know these constraints explicitly, but you can observe the yelling from the buses, which is correlated with whether all constraints are satisfied.
In this question, 𝑆 = +𝑠 means that the current assignment satisfies all the constraints, and 𝑆 = −𝑠 means it violates at least 1 constraint. We also denote 𝑌 = +𝑦 as the event where there is yelling from the buses, and 𝑌 = −𝑦 as the event where there is no yelling from the buses.
(a) Your friend Alice suggests starting with a set of random assignments 𝑆0 (where you randomly assign each of 𝑁 students to one of 𝑀 buses).
(i) [1 pt] What is 𝑝0 = 𝑃(𝑆0 = +𝑠), the probability that the randomly generated assignment satisfies all the constraints, for any choice of 𝑁, 𝑀, 𝑘? (Hint: try writing down the expression for the probability that it satisfies two different constraints)
􏰁 0
􏰁 1 − (1/𝑀²)^𝑘
􏰁 1 − (1/𝑀)^𝑘
􏰁 (1 − 1/𝑀²)^𝑘
􏰁 (1 − 1/𝑀)^𝑘
􏰁 1
􏰁 None of the above
(ii) [1 pt] Since we don’t know the constraints explicitly, our observation of yelling is the only way to infer whether all constraints are satisfied.
Alice, an expert in the human behavior of yelling, constructs the table 𝑃(𝑌𝑡|𝑆𝑡) below to describe the relation be- tween yelling (𝑌𝑡) and the hidden assignment being satisfying (𝑆𝑡). What is 𝑃 (𝑆0 = +𝑠|𝑌0 = +𝑦)?
􏰁 0.1
􏰁 0.1(𝑝0) + 0.2(1 − 𝑝0)
􏰁 0.1(𝑝0) + 0.8(1 − 𝑝0)
􏰁 0.1(𝑝0) / (0.1(𝑝0) + 0.2(1 − 𝑝0))
􏰁 0.1(𝑝0) / (0.1(𝑝0) + 0.8(1 − 𝑝0))
􏰁 None of the above
Note: “–” in the tables means the value is not given in the problem
(b) Your friend Bob suggests iteratively updating the assignment: at each time step, we randomly select a student and randomly assign them to a different bus. Assume for the remainder of Q4 that there are at least 2 buses (𝑀 ≥ 2).
The resulting transition probability table is listed below, where 𝑃(𝑆𝑡+1|𝑆𝑡) is the probability of transitioning from 𝑆𝑡 to 𝑆𝑡+1 after the iterative improvement at the end of time 𝑡.
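A minimal sketch of Bob's single improvement step under the stated assumption M ≥ 2; the satisfies helper is included only to make the meaning of +s and −s concrete, since in the actual problem the constraint pairs are hidden and only the yelling is observed.

import random
from typing import List, Tuple

def satisfies(assignment: List[int], constraint_pairs: List[Tuple[int, int]]) -> bool:
    # S = +s iff every constrained pair of students rides two different buses.
    return all(assignment[i] != assignment[j] for i, j in constraint_pairs)

def bob_step(assignment: List[int], num_buses: int) -> List[int]:
    # Randomly pick a student and move them to a *different* random bus.
    new_assignment = list(assignment)
    student = random.randrange(len(assignment))
    other_buses = [b for b in range(num_buses) if b != assignment[student]]
    new_assignment[student] = random.choice(other_buses)
    return new_assignment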
(i) [2 pts] Which of the following conditions are sufficient for 𝑟++ = 𝑃 (𝑆𝑡+1 = +𝑠|𝑆𝑡 = +𝑠) = 0? Select all that apply. □ The underlying constraint graph is fully connected
□ The underlying constraint graph does not contain a cycle
□ There exists exactly one satisfying assignment
􏰁 None of the above
𝑆𝑡    𝑌𝑡    𝑃(𝑌𝑡|𝑆𝑡)
+𝑠    +𝑦    0.1
+𝑠    −𝑦    –
−𝑠    +𝑦    –
−𝑠    −𝑦    0.2

𝑆𝑡    𝑆𝑡+1    𝑃(𝑆𝑡+1|𝑆𝑡)
+𝑠    +𝑠      𝑟++
+𝑠    −𝑠      –
−𝑠    +𝑠      –
−𝑠    −𝑠      𝑟−−

𝑆0    𝑃(𝑆0)
+𝑠    𝑝0
−𝑠    1 − 𝑝0

(ii) [2 pts] Ben claims that although 𝑟++ and 𝑟−− can be approximated with constant numbers, they actually fluctuate throughout the execution of the program. Is Ben right and why?
􏰁 Ben is right, because 𝑟++ and 𝑟−− are non-constant functions of time-step 𝑡.
􏰁 Ben is right, because enforcing assigning the randomly selected student to a different bus changes the belief of the hidden states.
􏰁 Ben is right, but not for the reasons above.
􏰁 Ben is wrong, because 𝑟++ and 𝑟−− will only change monotonically (either always going up or always going down), and never fluctuate.
􏰁 Ben is wrong, because 𝑟++ and 𝑟−− are always constants
􏰁 Ben is wrong, but not for the reasons above.
(iii) [1 pt] Does your selection above change to one of the other 5 choices, if hypothetically, we initially assign all students to bus 1 instead?
􏰁 Yes, because “𝑟++ =undefined” is true immediately (at time 𝑡 = 0) if we initially assign all students to bus 1, but not true if the initial assignment is randomly generated.
􏰁 Yes, but not for the reason above.
􏰁 No, because the properties of 𝑟++ and 𝑟−− are independent of how we generate the initial assignment.
􏰁 No, but not for the reason above.
(c) Your friend Charlie suggests a policy that, since the absence of yelling (−𝑦) is a good indicator that the current assignment satisfies all constraints (+𝑠), we will not alter any variable when we currently observe no yelling (−𝑦).
𝑆𝑡    𝑌𝑡    𝑆𝑡+1    𝑃(𝑆𝑡+1|𝑆𝑡, 𝑌𝑡)
+𝑠    +𝑦    +𝑠      𝑟++
+𝑠    +𝑦    −𝑠      (I)
+𝑠    −𝑦    +𝑠      (II)
+𝑠    −𝑦    −𝑠      (III)
−𝑠    +𝑦    +𝑠      (IV)
−𝑠    +𝑦    −𝑠      𝑟−−
−𝑠    −𝑦    +𝑠      (V)
−𝑠    −𝑦    −𝑠      (VI)

𝑆𝑡    𝑌𝑡    𝑃(𝑌𝑡|𝑆𝑡)
+𝑠    +𝑦    0.1
+𝑠    −𝑦    –
−𝑠    +𝑦    –
−𝑠    −𝑦    0.2

𝑆0    𝑃(𝑆0)
+𝑠    𝑝0
−𝑠    1 − 𝑝0
Note: “–” in the tables means the value is not given in the problem
(i) [1 pt] Which of the quantities in the 𝑃(𝑆𝑡+1|𝑆𝑡, 𝑌𝑡) table are guaranteed to be 1, per Charlie’s policy?
□ (I) □ (II) □ (III) □ (IV) □ (V) □ (VI)
􏰁 None of the above
(ii) [1 pt] We are following Charlie’s algorithm, and for the first 𝑇 time-steps, all observations are no yelling (−𝑦), and thus we have not altered any variables. Conditioned on 𝑌0 = 𝑌1 = … = 𝑌𝑇−1 = −𝑦, what’s the probability that the initial assignment indeed satisfies all constraints (+𝑠)?
􏰁 𝑝0
􏰁 0.9(𝑝0)
􏰁 (0.9)^𝑇 𝑝0
􏰁 (0.9(𝑝0))^𝑇
􏰁 0.9(𝑝0) / (0.9(𝑝0) + 0.2(1 − 𝑝0))
􏰁 (0.9)^𝑇 𝑝0 / ((0.9)^𝑇 𝑝0 + (0.2)^𝑇 (1 − 𝑝0))
􏰁 (0.9(𝑝0))^𝑇 / ((0.9(𝑝0))^𝑇 + (0.2(1 − 𝑝0))^𝑇)
􏰁 None of the above

(iii) [1 pt] In which of the following way does Charlie’s suggestion improve on Bob’s algorithm?
□ When the assignment is likely to violate a lot of constraints, Charlie’s suggestion helps reduce the number of violated constraints better than Bob’s
□ When the assignment is likely to satisfy all constraints, Charlie’s suggestion helps retain that state better than Bob’s
􏰁 None of the above
(iv) [3 pts] Charlie’s algorithm is likely to work well to find a satisfying assignment (if one exists) in which of the following scenarios?
□ When the constraint graph is sparse
□ When the constraint graph is dense
□ When 𝑟++ is close to 1
□ When 𝑟++ is close to 0
□ When 𝑟−− is close to 1
□ When 𝑟−− is close to 0
􏰁 None of the above
(d) [1 pt] The approach used in this question (especially in Charlie’s algorithm) is the most similar to which of the following concepts?
􏰁 AC-3 Algorithm
􏰁 Local Search
􏰁 Forward Checking
􏰁 Likelihood Weighting

Q5. [15 pts] Snail Bayes
Celebrating the near-end of the semester, the CS188 TAs have gathered around the staff aquarium to check up on the snails and their search for love. To our excitement, two snails decided to go on a date! We don’t know who the snails are, but we spent enough time around the terrarium to know that the first one (𝑆1) is either Alex (a) or Bubbles (b), and the second one (𝑆2) is either Cuddles (c) or Scorpblorg (s). On the date, the snails will eat some dinner (D), which can be a beautiful flower (+d) or a poisonous mushroom (-d), and they will walk (W) around wonderful rocks (+w) or some treacherous puddle (-w). The snails are in the quest for love (L), which, depending on how the date goes, they can find (+l) or not (-l).
(a) [1 pt] What is the probability of an outcome (𝑆1 = 𝑎, 𝑆2 = 𝑐, 𝐷 = −𝑑, 𝑊 = +𝑤, 𝐿 = +𝑙), the probability that Cuddles and Alex are on a date, where they share a poisonous mushroom, walk around the wonderful rocks and find love?
􏰁 0.5∗0.4∗0.7∗0.5∗0.4 􏰁 0.4∗0.6∗0.7∗0.5∗0.4 􏰁 0.6∗0.6∗0.7∗0.5∗0.4 􏰁 None of the above
(b) [2 pts] Which of the following independence statements are guaranteed to be true by the Snail Bayes’ Net graph structure?
□ 𝑆1 ⫫ 𝑆2 ∣ 𝐿
□ 𝐷 ⫫ 𝑊
□ 𝐷 ⫫ 𝑊 ∣ 𝐿
□ 𝑆1 ⫫ 𝑆2 ∣ 𝐷, 𝑊
□ 𝐿 ⫫ 𝑊 ∣ 𝑆1, 𝑆2
□ 𝐷 ⫫ 𝑊 ∣ 𝑆1, 𝑆2
􏰁 None of the above

The date is about to start and people are making guesses for what’s going to happen. One TA notices how adorable it would be if the snails were Bubbles and Cuddles.
(c) If Bubbles and Cuddles are on the date, we want to compute the probability of them eating a beautiful flower and walking around the wonderful rocks.
(i) [1 pt] What is the equivalent expression for this probability?
􏰁 𝑃(𝑏,𝑐,+𝑑,+𝑤) 􏰁 𝑃(𝑏,𝑐 ∣ +𝑑,+𝑤) 􏰁 𝑃(+𝑑,+𝑤 ∣ 𝑏,𝑐) 􏰁 𝑃(+𝑑,+𝑤)
(ii) [1 pt] What minimal set of probability tables will we use in calculating this probability?
□𝑃(𝐷) □𝑃(𝑊) □𝑃(𝑆1 ∣𝐷,𝑊) □𝑃(𝑆2 ∣𝐷,𝑊) □𝑃(𝐿∣𝑆1,𝑆2) 􏰁 None of the above
The snails are starving, so the date begins with a delightful dinner. The snails start sharing a mushroom as their dish of choice.
(d) Given their choice of dinner, what is 𝑃 (𝑆1 ∣ −𝑑), the belief over which snail is S1? Please answer in decimal numbers. You should not need a calculator.
(i) [1 pt] 𝑃(𝑆1 = 𝑎 ∣ 𝐷 = −𝑑) =
A late TA rushes in and is thrilled to see us around the aquarium. You see, he spent many hours befriending the little ones and
immediately recognizes the second snail as Cuddles! Apparently, Cuddles is the ooziest and is easiest to identify.
(e) [1 pt] What is 𝑃 (𝑆1 = 𝑏 ∣ 𝑆2 = 𝑐, 𝐷 = −𝑑), the probability that the first snail is Bubbles given that the second is Cuddles and they ate a mushroom?
􏰁 [∑𝑤 𝑃(𝑏 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤) ∗ 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] / 𝑃(𝑐 ∣ −𝑑)
􏰁 [∑𝑤 𝑃(𝑏 ∣ −𝑑, 𝑤) ∗ 𝑃(−𝑑) ∗ 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] / 𝑃(𝑐)
􏰁 [∑𝑤 𝑃(𝑏 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] / 𝑃(𝑐 ∣ −𝑑)
􏰁 [∑𝑤 𝑃(𝑏 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤) ∗ 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(𝑤)] / [∑𝑤 𝑃(𝑐 ∣ −𝑑, 𝑤) ∗ 𝑃(−𝑑)]
􏰁 None of the above

(f) [2 pts] What is 𝑃 (𝐿 = +𝑙 ∣ 𝑆2 = 𝑐, 𝐷 = −𝑑), the probability that the snails will find love given that the second snail is Cuddles and they ate a mushroom?
□ [∑𝑠1 𝑃(+𝑙 ∣ 𝑐, 𝑠1) ∗ 𝑃(−𝑑 ∣ 𝑐) ∗ 𝑃(𝑠1 ∣ −𝑑, 𝑐)] / 𝑃(𝑐 ∣ −𝑑)
□ [∑𝑠1 𝑃(+𝑙 ∣ 𝑐, 𝑠1) ∗ 𝑃(𝑐 ∣ −𝑑) ∗ 𝑃(𝑠1 ∣ −𝑑, 𝑐)] / 𝑃(−𝑑 ∣ 𝑐)
□ ∑𝑠1 𝑃(+𝑙 ∣ 𝑐, 𝑠1) ∗ 𝑃(𝑠1 ∣ −𝑑, 𝑐)
□ [∑𝑠1 𝑃(+𝑙, −𝑑 ∣ 𝑐, 𝑠1) ∗ 𝑃(𝑠1 ∣ 𝑐)] / 𝑃(−𝑑 ∣ 𝑐)
􏰁 None of the above
The snails found love! We are now trying to find the probability that the other snail was Bubbles given all evidence so far, 𝑃(𝑏 ∣ 𝑐,−𝑑,+𝑙). The TAs are tired of multiplying probabilities, so they instead try another way. The late TA actually wrote down memories of previous dates he has witnessed in a notebook. He can sample some of his memories from the notebook and help us learn probabilities.
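A minimal sketch of likelihood-weighted sampling (as used in parts (i) and (j) below) for this network, following the factorization 𝑃(𝐷), 𝑃(𝑊), 𝑃(𝑆1 ∣ 𝐷, 𝑊), 𝑃(𝑆2 ∣ 𝐷, 𝑊), 𝑃(𝐿 ∣ 𝑆1, 𝑆2) listed in part (c)(ii); the CPT dictionaries are assumed inputs, since the actual numbers appear only in the figure.

import random

def sample_from(dist):
    # dist maps a value (e.g. "+d") to its probability.
    r, cum = random.random(), 0.0
    for value, p in dist.items():
        cum += p
        if r <= cum:
            return value
    return value  # fall back to the last value on floating-point round-off

def likelihood_weighted_sample(cpt_d, cpt_w, cpt_s1, cpt_s2, cpt_l, evidence):
    # Sample D, W, S1, S2, L in topological order; evidence variables are fixed
    # and the sample is weighted by their conditional probability.
    sample, weight = {}, 1.0
    for var, cpt in (("D", cpt_d), ("W", cpt_w)):          # root variables
        if var in evidence:
            sample[var] = evidence[var]
            weight *= cpt[evidence[var]]
        else:
            sample[var] = sample_from(cpt)
    for var, cpt, parents in (("S1", cpt_s1, ("D", "W")),
                              ("S2", cpt_s2, ("D", "W")),
                              ("L",  cpt_l,  ("S1", "S2"))):
        local = cpt[tuple(sample[p] for p in parents)]
        if var in evidence:
            sample[var] = evidence[var]
            weight *= local[evidence[var]]
        else:
            sample[var] = sample_from(local)
    return sample, weight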
(g) [1 pt] If the TA uses prior sampling, what is the probability of obtaining the sample [𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑏, 𝑆2 = 𝑐, 𝐿 = −𝑙]?
􏰁 0.5*0.4*0.6*0.3*0.2
􏰁 0.6*0.1*0.7*0.1*0.2
􏰁 0.25*0.24*0.9*0.1*0.5
􏰁 0.4*0.4*0.9*0.2*0.8
􏰁 0.5*0.4*0.4*0.7*0.5
􏰁 0.4*0.5*0.24*0.21*0.25
􏰁 None of the above
(h) [1 pt] If the TA samples [𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑏, 𝑆2 = 𝑐, 𝐿 = −𝑙], would rejection sampling discard the memory? 􏰁 Yes 􏰁 No
(i) [1 pt] Assuming that the TA actually sampled using likelihood weighing and obtained [𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑏, 𝑆2 = 𝑐, 𝐿 = +𝑙], what is the weight of this sample?
􏰁 0.5*0.5*0.5
􏰁 0.5*0.7*0.5
􏰁 0.5*0.3*0.5
􏰁 0.4*0.9*0.6
􏰁 0.4*0.24*0.6
􏰁 0.6*0.3*0.6
􏰁 None of the above
(j) [1 pt] Sampling using likelihood weighting will systematically underestimate the probability of a variable conditioned on one of its ancestors.
􏰁 Yes, because likelihood weighting does not sample all the variables, and thus creates a bias
􏰁 Yes, but not for the reason above
􏰁 No, because likelihood weighting is unbiased
􏰁 No, but not for the reason above
(k) [2 pts] To estimate 𝑃(𝑏 ∣ 𝑐, −𝑑, +𝑙), the TA samples five memories in a row:
[𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑏, 𝑆2 = 𝑐, 𝐿 = +𝑙],
[𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑏, 𝑆2 = 𝑐, 𝐿 = +𝑙],
[𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑎, 𝑆2 = 𝑐, 𝐿 = +𝑙],
[𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑎, 𝑆2 = 𝑐, 𝐿 = +𝑙],
[𝐷 = −𝑑, 𝑊 = +𝑤, 𝑆1 = 𝑏, 𝑆2 = 𝑐, 𝐿 = +𝑙].
Could these memories have been generated using Gibbs sampling?
􏰁 Yes, because all evidence variables are consistent with their values in the query 𝑃 (𝑏 ∣ 𝑐, −𝑑, +𝑙).
􏰁 Yes, but the reason above is incorrect because there exist other samples sequences that fit the condition in the previous
choice but cannot be generated by Gibbs sampling.
􏰁 No, because the sequence of samples only differs by 𝑆1, the query variable. The values of 𝑊 , a variable that is not part of the query, never changes throughout the sequence.
􏰁 No, but the reason above is incorrect because there exist other samples sequences that fit the condition in the previous choice but cannot be generated by Gibbs sampling.

Q6. [11 pts] We Are Getting Close…
The CS 188 TAs have built an autonomous vehicle, and it’s finally on the street! Approaching a crossroad, our vehicle must avoid bumping into pedestrians. However, how close are we?
X is the signal received from sensors on our vehicle. We have an estimation model E, which estimates the current distance of any object in our view. Our vehicle also needs a model to detect objects and label their classes as one of {pedestrian, stop sign, road, other}. The TAs trained a detection model D that does the above and with a simple classification, outputs one of {no pedestrian, pedestrian on the road, pedestrian beside the stop sign}. Our vehicle has a control operator C, which determines the velocity by changing the acceleration.
(a) [5 pts] For the above Dynamic Bayes Net, complete the equations for performing updates. (Hint: think about the prediction update and observation update equations in the forward algorithm for HMMs.)
Time elapse:
(i) = (ii) (iii) (iv) 𝑃(𝑥𝑡−1 | 𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1)
(i)
􏰁 𝑃(𝑥𝑡 | 𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1)
􏰁 𝑃(𝑥𝑡)
􏰁 𝑃(𝑐0∶𝑡−1)
􏰁 𝑃(𝑒0∶𝑡, 𝑑0∶𝑡, 𝑐0∶𝑡)
􏰁 𝑃(𝑥0∶𝑡−1, 𝑐0∶𝑡−1)
􏰁 𝑃(𝑒𝑡, 𝑑𝑡, 𝑐𝑡 | 𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1)
􏰁 𝑃(𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1)
􏰁 1
(ii)
􏰁 Σ𝑥𝑡−1  􏰁 Σ𝑥𝑡  􏰁 max𝑥𝑡−1  􏰁 max𝑥𝑡  􏰁 1
(iii)
􏰁 𝑃(𝑥𝑡−1 | 𝑥𝑡−2)  􏰁 𝑃(𝑥𝑡 | 𝑥𝑡−1)  􏰁 𝑃(𝑥𝑡 | 𝑥𝑡−1, 𝑐𝑡−1)  􏰁 𝑃(𝑥𝑡−1, 𝑥𝑡−2)  􏰁 𝑃(𝑥𝑡, 𝑥𝑡−1)  􏰁 𝑃(𝑥𝑡, 𝑥𝑡−1, 𝑐𝑡−1)  􏰁 1
(iv)
􏰁 𝑃(𝑥𝑡 | 𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1)  􏰁 𝑃(𝑥𝑡, 𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1)  􏰁 1
Update to incorporate new evidence at time 𝑡:
𝑃(𝑥𝑡 | 𝑒0∶𝑡, 𝑑0∶𝑡, 𝑐0∶𝑡) = (v) (vi) (vii) [Your choice for (i)]
(v)
􏰁 (𝑃(𝑐𝑡 | 𝑐0∶𝑡−1))^−1
􏰁 (𝑃(𝑒𝑡 | 𝑒0∶𝑡−1) 𝑃(𝑑𝑡 | 𝑑0∶𝑡−1) 𝑃(𝑐𝑡 | 𝑐0∶𝑡−1))^−1
􏰁 (𝑃(𝑒𝑡, 𝑑𝑡, 𝑐𝑡 | 𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1))^−1
􏰁 𝑃(𝑒0∶𝑡−1, 𝑑0∶𝑡−1, 𝑐0∶𝑡−1 | 𝑒𝑡, 𝑑𝑡, 𝑐𝑡)
􏰁 𝑃(𝑒0∶𝑡−1 | 𝑒𝑡) 𝑃(𝑑0∶𝑡−1 | 𝑑𝑡) 𝑃(𝑐0∶𝑡−1 | 𝑐𝑡)
􏰁 1
(vi)
􏰁 Σ𝑥𝑡−1  􏰁 Σ𝑥𝑡  􏰁 Σ𝑥𝑡−1,𝑥𝑡  􏰁 max𝑥𝑡−1  􏰁 max𝑥𝑡  􏰁 1
(vii)
□ 𝑃(𝑒𝑡, 𝑑𝑡 | 𝑥𝑡) 𝑃(𝑐𝑡 | 𝑒𝑡, 𝑑𝑡, 𝑐𝑡−1)
□ 𝑃(𝑥𝑡, 𝑒𝑡, 𝑑𝑡, 𝑐𝑡)
□ 𝑃(𝑥𝑡, 𝑒𝑡, 𝑑𝑡, 𝑐𝑡, 𝑐𝑡−1)
□ 𝑃(𝑒𝑡, 𝑑𝑡, 𝑐𝑡 | 𝑥𝑡)
□ 𝑃(𝑥𝑡 | 𝑒𝑡, 𝑑𝑡, 𝑐𝑡)
□ 𝑃(𝑥𝑡 | 𝑒𝑡, 𝑑𝑡, 𝑐𝑡, 𝑐𝑡−1)
􏰁 1
(viii) Suppose we want to do the above updates in one step and use normalization to reduce computation. Select all the terms that are not explicitly calculated in this implementation.
DO NOT include the choices if their values are 1.
□ (ii)  □ (iii)  □ (iv)  □ (v)  □ (vi)  □ (vii)  􏰁 None of the above
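A minimal, generic sketch of the two updates above (prediction, then observation with normalization); the transition and obs_likelihood callables are assumed inputs, and their exact factorization should match whatever you selected in (i) through (viii).

def time_elapse(belief, transition, c_prev):
    # Prediction: B'(x_t) = sum_{x_{t-1}} P(x_t | x_{t-1}, c_{t-1}) * B(x_{t-1}).
    # Assumes X has the same domain at every time step (the keys of belief).
    predicted = {}
    for x_next in belief:
        predicted[x_next] = sum(transition(x_next, x_prev, c_prev) * p
                                for x_prev, p in belief.items())
    return predicted

def observe(predicted, obs_likelihood, e_t, d_t, c_t):
    # Observation: B(x_t) is proportional to P(evidence_t | x_t, ...) * B'(x_t), then normalize.
    unnormalized = {x: obs_likelihood(e_t, d_t, c_t, x) * p for x, p in predicted.items()}
    z = sum(unnormalized.values())
    return {x: (p / z if z > 0 else 0.0) for x, p in unnormalized.items()}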

(b) Suppose X outputs 1024 × 1024 greyscale images and our vehicle stays stationary. As before, E includes precise estima- tion of the distance between our vehicle and the pedestrian evaluated from outputs of X. Unfortunately, a power outage happened, and before the power is restored, E will not be available for our vehicle. But we still have the detection model D, which outputs one of {no pedestrian, pedestrian on the road, pedestrian beside the stop sign} for each state.
(i) [1 pt] During the power outage, it is best to
􏰁 do particle filtering because the particles are easier to track for D than for both D and E
􏰁 do particle filtering because of memory constraints
􏰁 do particle filtering, but not for the reasons above
􏰁 do exact inference because it saves computation
􏰁 do exact inference, but not for the reason above
(ii) [1 pt] The power outage was longer than expected. As the sensor outputs of X have degraded to 2 × 2 binary images, it is best to
􏰁 do particle filtering because the particles are easier to track for D than for both D and E
􏰁 do particle filtering because of memory constraints
􏰁 do particle filtering, but not for the reasons above
􏰁 do exact inference because it saves computation
􏰁 do exact inference, but not for the reason above
(iii) [1 pt] After power is restored and we have E, it is reasonable to
􏰁 do particle filtering because of memory constraints
􏰁 do particle filtering, but not for the reason above
􏰁 do exact inference because E gives more valuable information than D
􏰁 do exact inference because it’s impractical to do particle filtering for E
􏰁 do exact inference, but not for the reasons above

(c) Now we formulate the Dynamic Bayes Net for this question into a non-deterministic two-player game (analogous to an MDP in the single-player setting). Each state 𝑆 = (𝑋, 𝐸, 𝐷).
There are 2 agents in the game: our vehicle (with action set A), and a pedestrian (with action set B). The vehicle and the pedestrian take turns to perform their actions.
The TAs implemented 3 modes for the autonomous vehicle to act in the same space with the kind pedestrian, the confused pedestrian, and the naughty pedestrian. In each round of testing, a TA will be the pedestrian, and one of the modes will be tested. The vehicle and the pedestrian are both in the corresponding mode.
Below, 𝑄∗𝑣 is the Q-function for the autonomous vehicle. For each subquestion, given the standard notation for an MDP, select the expression 𝑓𝑛 that would complete the blank part of the Q-Value Iteration under the specified formulation.
𝑄∗𝑣(𝑠, 𝑎) = ∑𝑠′ 𝑇(𝑠, 𝑎, 𝑠′)[𝑅(𝑠, 𝑎, 𝑠′) + 𝛾 ______ ]
𝑓1 = ∑𝑏∈𝐵 ∑𝑠′′ (𝑇(𝑠′, 𝑏, 𝑠′′)[𝑅(𝑠′, 𝑏, 𝑠′′) + 𝛾 max𝑎′∈𝐴 𝑄∗𝑣(𝑠′′, 𝑎′)])
𝑓2 = max𝑏∈𝐵 ∑𝑠′′ (𝑇(𝑠′, 𝑏, 𝑠′′)[𝑅(𝑠′, 𝑏, 𝑠′′) + 𝛾 max𝑎′∈𝐴 𝑄∗𝑣(𝑠′′, 𝑎′)])
𝑓3 = min𝑏∈𝐵 ∑𝑠′′ (𝑇(𝑠′, 𝑏, 𝑠′′)[𝑅(𝑠′, 𝑏, 𝑠′′) + 𝛾 max𝑎′∈𝐴 𝑄∗𝑣(𝑠′′, 𝑎′)])
𝑓4 = ∑𝑏∈𝐵 ∑𝑠′′ (𝑇(𝑠′, 𝑏, 𝑠′′)[𝑅(𝑠′, 𝑏, 𝑠′′) + 𝛾 (1/|𝐵|) max𝑎′∈𝐴 𝑄∗𝑣(𝑠′′, 𝑎′)])
𝑓5 = max𝑎′∈𝐴 𝑄∗𝑣(𝑠′, 𝑎′)
𝑓6 = min𝑎′∈𝐴 𝑄∗𝑣(𝑠′, 𝑎′)
𝑓7 = (1/|𝐴|) ∑𝑎′∈𝐴 𝑄∗𝑣(𝑠′, 𝑎′)
(i) [1 pt] The kind pedestrian that acts friendly, maximizing the vehicle’s utility.
□ 𝑓1  □ 𝑓2  □ 𝑓3  □ 𝑓4  □ 𝑓5  □ 𝑓6  □ 𝑓7  􏰁 None of the above
(ii) [1 pt] The confused pedestrian that acts randomly.
□ 𝑓1  □ 𝑓2  □ 𝑓3  □ 𝑓4  □ 𝑓5  □ 𝑓6  □ 𝑓7  􏰁 None of the above
(iii) [1 pt] The naughty pedestrian that performs adversarial actions, minimizing the vehicle’s utility.
□ 𝑓1  □ 𝑓2  □ 𝑓3  □ 𝑓4  □ 𝑓5  □ 𝑓6  □ 𝑓7  􏰁 None of the above
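As an illustration, here is a minimal sketch of Q-value iteration for the adversarial case, i.e. plugging the min-over-b expression 𝑓3 into the blank above; replacing the inner min with a max or an average would correspond to the other pedestrian modes. The callables and containers are assumed placeholders.

def q_value_iteration(states, vehicle_actions, pedestrian_actions,
                      T, R, gamma, iterations):
    # T(s, action, s') is the transition probability for whichever agent acts;
    # R(s, action, s') is the reward received by the vehicle.
    states = list(states)
    vehicle_actions = list(vehicle_actions)
    pedestrian_actions = list(pedestrian_actions)
    Q = {(s, a): 0.0 for s in states for a in vehicle_actions}
    for _ in range(iterations):
        newQ = {}
        for s in states:
            for a in vehicle_actions:
                total = 0.0
                for s1 in states:
                    if T(s, a, s1) == 0.0:
                        continue
                    # Pedestrian moves adversarially from s1: minimize over b.
                    worst = min(
                        sum(T(s1, b, s2) * (R(s1, b, s2) +
                            gamma * max(Q[(s2, a2)] for a2 in vehicle_actions))
                            for s2 in states)
                        for b in pedestrian_actions)
                    total += T(s, a, s1) * (R(s, a, s1) + gamma * worst)
                newQ[(s, a)] = total
        Q = newQ
    return Q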

Q7. [10 pts] Generating and classifying text
In this question, we are interested in modelling sentences. Assume each sentence is represented using a bag-of-words (i.e. a vector which contains counts for each word in a sentence). We are interested in classifying whether a sentence (represented as a bag of words 𝑋) is a positive-sounding (class 𝐶 = 1) or negative-sounding (𝐶 = −1) sentence. 𝑋1, 𝑋2, … 𝑋𝑝 represent the entries for individual words in the bag-of-words representation 𝑋.
(a) In this question, we are interested in the basics of Naive Bayes and logistic regression.
(i) [1 pt] Which of these are modelled explicitly in Naive Bayes? Select all that apply.
□ 𝑃(𝐶|𝑋)
□ 𝑃(𝑋|𝐶)
□ 𝑃(𝐶)
􏰁 None of the above
(ii) [1 pt] Which of these decompositions reflect the naive assumption in Naive Bayes?
􏰁 𝑃(𝐶) = 𝑃(𝐶1) ⋅ 𝑃(𝐶2)
􏰁 𝑃(𝑋|𝐶) = 𝑃(𝑋1|𝐶) ⋅ 𝑃(𝑋2|𝐶) ⋅ 𝑃(𝑋3|𝐶) ⋅ 𝑃(𝑋4|𝐶) ⋅ …
􏰁 𝑃(𝑋) = 𝑃(𝑋1) ⋅ 𝑃(𝑋2|𝑋1) ⋅ 𝑃(𝑋3|𝑋2, 𝑋1) …
􏰁 𝑃(𝑋1) = (𝑃(𝑋1) + 1) / (∑𝑥𝑖 𝑃(𝑋𝑖))
􏰁 None of the above
(iii) [1 pt] Which of these are modelled directly in Logistic Regression? Select all that apply.
□ 𝑃(𝐶|𝑋)
□ 𝑃(𝑋|𝐶)
□ 𝑃(𝐶)
□ 𝑃(𝑋)
􏰁 None of the above
(b) In this part, we will consider different scenarios about the data we have.
(i) [1 pt] Assume we only have two points in our training dataset which are linearly separable using the feature 𝑋1. Which of the following methods will be able to achieve zero training error? Select all that apply.
□ Naive Bayes
□ Bayes classifier (i.e. naive bayes with no naive assumption)
□ Logistic Regression
□ Perceptron
□ A very large neural network with many nonlinearities
􏰁 None of the above
(ii) [1 pt] Assume we now have a very large training dataset which is not linearly separable using any individual variable, but is separable given 𝑋1 ⋅ 𝑋2. Which of the following methods will be able to achieve zero training error (without augmenting the data set with additional features)? Select all that apply.
□ Naive Bayes
□ Bayes classifier (i.e. naive bayes with no naive assumption)
□ Logistic Regression
□ Perceptron (with no softmax on the output)
□ A very large neural network with many nonlinearities
􏰁 None of the above
(iii) [1 pt] Now assume that the true dataset is linearly separable but our training set has a single mis-labeled data point. Which of the following are true? Select all that apply.
□ Logistic regression may output probabilities greater than 1
□ Perceptron (with no softmax on the output) may not converge
􏰁 None of the above

(iv) [1pt]Assumeweinitiallyhaveamodeltrainedonaverylargedatasetwiththesamenumberofpositiveandnegative examples. Then, we duplicate all the positive examples in our training set and re-add them (resulting in a training set 1.5 times the size of the original training set). We then train a second model on the new training set. Which of the following is true? Select all that apply.
□ In logistic regression, 𝑃(𝐶 = 1|𝑋) will generally be higher for the retrained model than the original model
□ In naive bayes, 𝑃(𝑋 = 𝑥|𝐶 = 1) will be higher for some 𝑥 for the retrained model than the original model
□ In naive bayes, 𝑃(𝐶 = 1) will generally be higher for the retrained model than the original model
□ In naive bayes, 𝑃(𝑋 = 1) will generally be higher for the retrained model than the original model
􏰁 None of the above
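A minimal sketch of the Naive Bayes pieces discussed in parts (a) and (b): fitting P(C) and P(word | C) by counting with Laplace smoothing on bag-of-words data, then classifying with the naive factorization. The data format is an assumption made only for illustration.

import math
from collections import defaultdict

def train_naive_bayes(data, k=1.0):
    # data: list of (bag_of_words_counts, class_label) with labels in {+1, -1}.
    # Estimates P(C) and P(word | C) with Laplace smoothing parameter k.
    class_counts = defaultdict(float)
    word_counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for bag, c in data:
        class_counts[c] += 1
        for word, count in bag.items():
            word_counts[c][word] += count
            vocab.add(word)
    priors = {c: n / len(data) for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        likelihoods[c] = {w: (word_counts[c][w] + k) / (total + k * len(vocab))
                          for w in vocab}
    return priors, likelihoods

def classify(bag, priors, likelihoods):
    # argmax_c  log P(C = c) + sum_w count(w) * log P(w | C = c)
    def score(c):
        s = math.log(priors[c])
        for w, n in bag.items():
            if w in likelihoods[c]:
                s += n * math.log(likelihoods[c][w])
        return s
    return max(priors, key=score)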
(c) We are now interested in generating text (still in the form of bag-of-words) from each of our classes.
(i) [1 pt] If we have already fit naive bayes for text classification, which distribution can we use to generate text for the positive class?
􏰁 𝑃(𝐶)
􏰁 𝑃(𝑋)
􏰁 𝑃(𝑋|𝐶)
􏰁 𝑃(𝐶|𝑋)
􏰁 None of the above
(ii) [1 pt] Assuming we have an infinite amount of data, which of the following modeling assumptions is generally able to more accurately model the true distribution of 𝑃(𝑋) (and thus to generate more realistic bag-of-words sentences)?
􏰁 𝑃(𝑋) = 𝑃(𝑋1) ⋅ 𝑃(𝑋2) ⋅ 𝑃(𝑋3) ⋅ 𝑃(𝑋4) ⋅ …
􏰁 𝑃(𝑋) = 𝑃(𝑋1) ⋅ 𝑃(𝑋2|𝑋1) ⋅ 𝑃(𝑋3|𝑋2, 𝑋1) ⋅ …
􏰁 They are the same
(iii) [1 pt] Which of the following will best help us generate text with words in the correct order? 􏰁 Using Laplace smoothing
􏰁 Predicting 𝑃(𝑋|𝐶) using a neural network
􏰁 Changing the representation of the text (to use something other than bag-of-words)
􏰁 None of the above

Q8. [16 pts] Deep “Blackjack”
To celebrate the end of the semester, you visit Las Vegas and decide to play a good, old fashioned game of “Blackjack”!
Recall that the game has states 0,1,…,8, corresponding to dollar amounts, and a 𝐷𝑜𝑛𝑒 state where the game ends. The player starts with $2, i.e. at state 2. The player has two actions: Stop (𝑎 = 0) and Roll (𝑎 = 1), and is forced to take the Stop action at states 0,1,and 8.
When the player takes the Stop action (𝑎 = 0), they transition to the 𝐷𝑜𝑛𝑒 state and receive reward equal to the amount of dollars of the state they transitioned from: e.g. taking the stop action at state 3 gives the player $3. The game ends when the player transitions to 𝐷𝑜𝑛𝑒.
The Roll action (𝑎 = 1) is available from states 2-7. The player rolls a biased 6-sided die. If the player Rolls from state s and the die lands on outcome 𝑜, the player transitions to state 𝑠 + 𝑜 − 2, as long as 𝑠 + 𝑜 − 2 ≤ 8 (𝑠 is the amount of dollars of the current state, 𝑜 is the amount rolled, and the negative 2 is the price to roll). If 𝑠 + 𝑜 − 2 > 8, the player busts, i.e. transitions to Done and does NOT receive reward.
As the bias of the dice is unknown, you decided to perform some good-old fashioned reinforcement learning (RL) to solve the game. However, unlike in the midterm, you have decided to flex and solve the game using approximate Q-learning. Not only that, you decided not to design any features – the features for the Q-value at (𝑠, 𝑎) will simply be the vector [𝑠 𝑎], where 𝑠 is the state and 𝑎 is the action.
(a) First, we will investigate how your choice of features impacts whether or not you can learn the optimal policy. Suppose the unique optimal policy in the MDP is the following:

State   2     3     4     5     6     7
𝜋∗(𝑠)   Roll  Roll  Roll  Stop  Stop  Stop

For each of the cases below, select “Possible with large neural net” if the policy can be expressed by using a large neural net to represent the Q-function using the features specified as input. (That is, the greedy policy with respect to some Q-function representable with a large neural network is the optimal policy: 𝑄(𝑠, 𝜋∗(𝑠)) > 𝑄(𝑠, 𝑎) for all states 𝑠 and actions 𝑎 ≠ 𝜋∗(𝑠).) Select “Possible with weighted sum” if the policy can be expressed by using a weighted linear sum to represent the Q-function. Select “Not Possible” if expressing the policy with given features is impossible no matter the function.
(i) [1 pt] Suppose we decide to use the state 𝑠 and action 𝑎 as the features for 𝑄(𝑠, 𝑎).
□ Possible with large neural network  □ Possible with linear weighted sum of features  􏰁 Not possible
(ii) [1 pt] Now suppose we decide to use 𝑠 + 𝑎 as the feature for 𝑄(𝑠, 𝑎).
□ Possible with large neural network  □ Possible with linear weighted sum of features  􏰁 Not possible
(iii) [1 pt] Now suppose we decide to use 𝑎 as the feature for 𝑄(𝑠, 𝑎).
□ Possible with large neural network  □ Possible with linear weighted sum of features  􏰁 Not possible
(iv) [1 pt] Now suppose we decide to use sign(𝑠 − 4.5) ⋅ 𝑎 as the feature for 𝑄(𝑠, 𝑎), where sign(𝑥) is −1 if 𝑥 < 0, 1 if 𝑥 > 0, and 0 if 𝑥 = 0.
□ Possible with large neural network  □ Possible with linear weighted sum of features  􏰁 Not possible

(b) [4 pts] Next, we investigate the effect of different neural network architectures on your ability to learn the optimal policy. Recall that our features for the Q-value at (𝑠, 𝑎) will simply be the vector [𝑠 𝑎], where 𝑠 is the state and 𝑎 is the action. In addition, suppose that the unique optimal policy is the following:

State   2     3     4     5     6     7
𝜋∗(𝑠)   Roll  Roll  Roll  Stop  Stop  Stop

Which of the following neural network architectures can express Q-values that represent the optimal policy? That is, the greedy policy with respect to some Q-function representable with the given neural network is the optimal policy: 𝑄(𝑠, 𝜋∗(𝑠)) > 𝑄(𝑠, 𝑎) for all states 𝑠 and actions 𝑎 ≠ 𝜋∗(𝑠).
Hint: Recall that 𝑅𝑒𝐿𝑈(𝑥) = 𝑥 if 𝑥 > 0, and 0 if 𝑥 ≤ 0.
□ Neural Network 1:
□ Neural Network 2:
□ Neural Network 3:
□ Neural Network 4:
□ Neural Network 5:
􏰁 None of the above.
(c) [1 pt] As with the linear approximate q-learning, you decide to minimize the squared error of the Bellman residual. Let 𝑄𝐰(𝑠, 𝑎) be the approximate 𝑄-values of 𝑠, 𝑎. After taking action 𝑎 in state 𝑠 and transitioning to state 𝑠′ with reward 𝑟, you first compute the target target = 𝑟 + 𝛾 max𝑎′ 𝑄𝐰(𝑠′, 𝑎′). Then your loss is:
loss(𝐰) = (1/2) (𝑄𝐰(𝑠, 𝑎) − target)²
You then perform gradient descent to minimize this loss. Note that we will not take the gradient through the target – we treat it as a fixed value.
Which of the following updates represents one step of gradient descent on the weight parameter 𝑤𝑖 with learning rate 𝛼 ∈ (0, 1) after taking action 𝑎 in state 𝑠 and transitioning to state 𝑠′ with reward 𝑟? [Hint: which of these is equivalent to the normal approximate Q-learning update when 𝑄𝐰(𝑠, 𝑎) = 𝐰 ⋅ 𝐟(𝑠, 𝑎)?]
􏰁 𝑤𝑖 = 𝑤𝑖 + 𝛼 (𝑄𝐰(𝑠, 𝑎) − (𝑟 + 𝛾 max𝑎′ 𝑄𝐰(𝑠′, 𝑎′))) 𝜕𝑄𝐰(𝑠, 𝑎)/𝜕𝑤𝑖
􏰁 𝑤𝑖 = 𝑤𝑖 − 𝛼 (𝑄𝐰(𝑠, 𝑎) − (𝑟 + 𝛾 max𝑎′ 𝑄𝐰(𝑠′, 𝑎′))) 𝜕𝑄𝐰(𝑠, 𝑎)/𝜕𝑤𝑖
􏰁 𝑤𝑖 = 𝑤𝑖 + 𝛼 (𝑄𝐰(𝑠, 𝑎) − (𝑟 + 𝛾 max𝑎′ 𝑄𝐰(𝑠′, 𝑎′))) 𝑠
􏰁 𝑤𝑖 = 𝑤𝑖 − 𝛼 (𝑄𝐰(𝑠, 𝑎) − (𝑟 + 𝛾 max𝑎′ 𝑄𝐰(𝑠′, 𝑎′))) 𝑠
􏰁 None of the above.
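A minimal sketch of one such update with the target held fixed, as described above; q_fn and grad_fn are assumed callables standing in for the network's value and its gradient with respect to the weights.

def bellman_residual_step(w, q_fn, grad_fn, s, a, r, s_next, actions, alpha, gamma):
    # target = r + gamma * max_a' Q_w(s', a'); treated as a constant (no gradient through it).
    # s_next is None for a terminal transition.
    target = r if s_next is None else r + gamma * max(q_fn(w, s_next, a2) for a2 in actions)
    residual = q_fn(w, s, a) - target
    # loss = 0.5 * residual^2, so d(loss)/dw_i = residual * dQ_w(s, a)/dw_i
    grad = grad_fn(w, s, a)
    return [wi - alpha * residual * gi for wi, gi in zip(w, grad)]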
(d) While programming the neural network, you’re getting some bizarre errors. To debug these, you decide to calculate the gradients by hand and compare them to the result of your code.
Suppose your neural network is the following:
Figure 1: Neural Network 6
That is, 𝑄𝐰(𝑠, 𝑎) = 𝑠 + 𝑎 + 𝑤1 𝑅𝑒𝐿𝑈(𝑤0 + 𝑠 + 𝑎).
You are able to recall that 𝑑/𝑑𝑥 𝑅𝑒𝐿𝑈(𝑥) = 1 if 𝑥 ≥ 0, and 0 if 𝑥 < 0.
(i) [1 pt] Suppose 𝑤0 = −4, and 𝑤1 = −1. What is 𝑄𝐰(5, 0)?
𝑄𝐰(5, 0) =
(ii) [2 pts] Suppose 𝑤0 = −4, and 𝑤1 = −1. What is the gradient with respect to 𝑤0, evaluated at 𝑠 = 5, 𝑎 = 0?
𝜕𝑄𝐰(5, 0)/𝜕𝑤0 =
(iii) [2 pts] Suppose 𝑤0 = −4, and 𝑤1 = −1. What is the gradient with respect to 𝑤0, evaluated at 𝑠 = 3, 𝑎 = 0?
𝜕𝑄𝐰(3, 0)/𝜕𝑤0 =
(e) After picking a feature representation, neural network architecture, and update rule, as well as calculating the gradients, it’s time to turn to the age old question... will this even work?
(i) [1 pt] Without any other assumptions, is it guaranteed that your approximate 𝑄-values will converge to the optimal policy, if each 𝑠, 𝑎 pair is observed an infinite amount of times?
􏰁 Yes 􏰁 No
(ii) [1 pt] Without any other assumptions, is it guaranteed that your approximate 𝑄-values will converge to the optimal policy, if each 𝑠, 𝑎 pair is observed an infinite amount of times and there exists some 𝑤 such that 𝑄𝐰(𝑠, 𝑎) = 𝑄∗(𝑠, 𝑎)?
􏰁 Yes 􏰁 No
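To sanity-check hand-computed values like those asked for in part (d), here is a minimal sketch of 𝑄𝐰(𝑠, 𝑎) = 𝑠 + 𝑎 + 𝑤1 𝑅𝑒𝐿𝑈(𝑤0 + 𝑠 + 𝑎) and its partial derivatives, using the derivative convention given above (1 for 𝑥 ≥ 0); no exam-specific answers are filled in.

def relu(x: float) -> float:
    return x if x > 0 else 0.0

def d_relu(x: float) -> float:
    # Convention from the problem: derivative is 1 for x >= 0, else 0.
    return 1.0 if x >= 0 else 0.0

def q_nn6(w0: float, w1: float, s: float, a: float) -> float:
    # Q_w(s, a) = s + a + w1 * ReLU(w0 + s + a)
    return s + a + w1 * relu(w0 + s + a)

def grad_q_nn6(w0: float, w1: float, s: float, a: float):
    pre = w0 + s + a
    dq_dw0 = w1 * d_relu(pre)   # chain rule through the ReLU
    dq_dw1 = relu(pre)
    return dq_dw0, dq_dw1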