UC Berkeley – Computer Science
CS188: Introduction to Artificial Intelligence Josh Hug and Adam Janin
Final, Fall 2016
This test has 10 questions worth a total of 100 points, to be completed in 170 minutes. The exam is closed book, except that you are allowed to use three two-sided hand written cheat sheets. No calculators or other electronic devices are permitted. Give your answers and show your work in the space provided.
Write the statement out below in the blank provided and sign. You may do this before the exam begins.
Any plagiarism, no matter how minor, will result in an F.
“I have neither given nor received any assistance in the taking of this exam.” _________________________________________________________________________________________ _________________________________________________________________________________________
Signature: _______________________________
Name: _____________________________
Your EdX Login: ___________________________
SID: _____________________________
Name of person to left: _______________________
Name of person to right: _____________________
Exam Room: _______________________
Primary TA: _______________________
Tips:
● ◯ indicates that only one circle should be filled in.
● ▢ indicates that more than one box may be filled in.
● Be sure to fill in the ◯ and ▢ boxes completely and erase fully if you change your answer.
● There may be partial credit for incomplete answers. Write as much of the solution as you can, but bear in
mind that we may deduct points if your answers are much more complicated than necessary.
● There are a lot of problems on this exam. Work through the ones with which you are comfortable first. Do
not get overly captivated by interesting problems or complex corner cases you’re not sure about.
● Not all information provided in a problem may be useful.
● There are some problems on this exam with a slow brute force approach and a faster, clever
approach. Think before you start calculating with lots of numbers!
● Write the last four digits of your SID on each page in case pages get shuffled during scanning.
Optional. Mark along the line to show your feelings on the spectrum between :( and ☺.
Before exam: [:( ____________________ ☺]
After exam: [:( ____________________ ☺]
Problem   1   2   3     4     5    6    7   8   9      10
Points    7   6   8.5   8.5   17   14   9   9   10.5   10.5
1. (7 pts) Naive Bayes – Josh Detector
Suppose we want to determine whether Josh is in his office (Y = 1) or not (Y = 0) by analyzing the measurements of the following three sensors deployed in his office:
– a chair pressure sensor informs you if someone is on his chair (Xc = 1) or not (Xc = 0)
– a motion detector tells you if someone moves in the room (Xm = 1) or not (Xm = 0)
– a power meter indicates if someone consumes electricity in the room (Xp = 1) or not (Xp = 0)
Each sensor above only reveals partial information about Josh’s presence. For instance, if Josh is in his office but not sitting on the chair, then the chair pressure sensor cannot reliably indicate Josh’s presence. Suppose we have some historical data logs of sensor measurements and Josh’s presence status:
Day   Xc   Xm   Xp   Y
1     0    0    1    0
2     0    0    1    0
3     0    0    0    0
4     0    0    0    0
5     0    0    1    1
6     1    0    1    1
7     0    0    1    1
8     1    0    1    1
9     0    1    0    1
10    1    1    0    1
i. (2 pts) Let us use Naive Bayes to model the sensor measurements and Josh’s presence status as shown in the Bayes Net model below. Fill in the maximum likelihood estimate (MLE) of each entry in the probability tables.
Y   P(Y)
0   2/5
1   3/5

Y   Xc   P(Xc|Y)
0   0    1
0   1    0
1   0    1/2
1   1    1/2

Y   Xm   P(Xm|Y)
0   0    1
0   1    0
1   0    2/3
1   1    1/3

Y   Xp   P(Xp|Y)
0   0    1/2
0   1    1/2
1   0    1/3
1   1    2/3
ii. (1 pt) Suppose we get a new set of sensor observations: Xc = 0, Xm = 0, Xp = 1. According to the maximum likelihood estimates in the tables above, is Josh more likely to be present or absent?
◯ Present ⬤ Absent ◯ Equally Likely ◯ Not Enough Information
Explanation: We compare:
P(Y = 0, Xc = 0, Xm = 0, Xp = 1) = 2/5 * 1 * 1 * 1/2 = 1/5
P(Y = 1, Xc = 0, Xm = 0, Xp = 1) = 3/5 * 1/2 * 2/3 * 2/3 = 2/15
The former is larger, so we predict that Josh is more likely to be absent.
Notice that this is true even though (Y = 0, Xc = 0, Xm = 0, Xp = 1) and (Y = 1, Xc = 0, Xm = 0, Xp = 1) appeared equally frequently in our samples.
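These numbers are easy to check mechanically. Below is a minimal Python sketch (ours, not part of the exam) that recomputes the MLE tables from the ten logged days and evaluates both joint probabilities for this query:

```python
from collections import Counter

data = [  # (Xc, Xm, Xp, Y), one tuple per day from the log above
    (0, 0, 1, 0), (0, 0, 1, 0), (0, 0, 0, 0), (0, 0, 0, 0),
    (0, 0, 1, 1), (1, 0, 1, 1), (0, 0, 1, 1), (1, 0, 1, 1),
    (0, 1, 0, 1), (1, 1, 0, 1),
]

def mle_cpts(data):
    """Return the MLE prior P(Y) and the per-sensor CPTs P(X_j | Y)."""
    y_counts = Counter(y for *_, y in data)
    prior = {y: c / len(data) for y, c in y_counts.items()}
    cpts = []  # cpts[j] maps (x, y) -> P(X_j = x | Y = y)
    for j in range(3):
        pair_counts = Counter((row[j], row[3]) for row in data)
        cpts.append({(x, y): c / y_counts[y] for (x, y), c in pair_counts.items()})
    return prior, cpts

prior, cpts = mle_cpts(data)
obs = (0, 0, 1)  # Xc = 0, Xm = 0, Xp = 1
for y in (0, 1):
    p = prior[y]
    for j, x in enumerate(obs):
        p *= cpts[j].get((x, y), 0.0)  # unseen (x, y) pairs have MLE 0
    print(f"P(Y={y}, Xc=0, Xm=0, Xp=1) = {p:.4f}")
# Prints 0.2000 (= 1/5) for Y = 0 and 0.1333 (= 2/15) for Y = 1, so absent wins.
```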
iii. (1 pt) Suppose that instead of maximum likelihood, we now use Laplace Smoothing to estimate each entry in the following probability tables. Assume the prior strength k = 1.

Y   P(Y)
0   5/12
1   7/12

Y   Xc   P(Xc|Y)
0   0    5/6
0   1    1/6
1   0    1/2
1   1    1/2
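A quick sketch (ours) of the smoothed estimates: Laplace smoothing with strength k adds k to every count and k times the number of possible values to every denominator.

```python
# Hedged helper: binary variables, so num_values defaults to 2.
def laplace(count, total, k=1, num_values=2):
    return (count + k) / (total + k * num_values)

print(laplace(4, 10))  # P(Y=0)        = 5/12
print(laplace(6, 10))  # P(Y=1)        = 7/12
print(laplace(4, 4))   # P(Xc=0 | Y=0) = 5/6
print(laplace(3, 6))   # P(Xc=0 | Y=1) = 1/2
```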
iv. (1 pt) What happens to our CPTs as we increase k, i.e. what sort of distribution do we get in the large k limit? Solution: In the large k limit, we get uniform distributions in each CPT.
v. (1 pt) Suppose we want to generate exactly two samples from the distribution P(Y, Xc, Xm, Xp) given by the CPTs you computed from MLE estimates in part i. Suppose we run Gibbs sampling for a long time so that the samples are indeed drawn from the correct distribution, and get the resulting sample GS1: (Y = 1, Xc = 0, Xm = 1, Xp = 0). Suppose we then generate a new sample GS2 by resampling from P(Xp | Y = 1, Xc = 0, Xm = 1). What are the different possible values for GS2, and how likely is each?
Possible Value #1 for GS2: Y = __1__, Xc = __0__, Xm = __1__, Xp = __0__, probability: __1/3__
Possible Value #2 for GS2: Y = __1__, Xc = __0__, Xm = __1__, Xp = __1__, probability: __2/3__
Explanation: The probabilities use P(Xp | Y, Xc, Xm) = P(Xp | Y).
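A sketch (ours, with a hypothetical helper resample_xp) of this single Gibbs step; the key point is that in this Naive Bayes net, Xp is conditionally independent of Xc and Xm given Y, so only P(Xp | Y = 1) matters.

```python
import random

def resample_xp(y, p_xp_given_y):
    """p_xp_given_y maps (xp, y) -> probability; returns a sampled Xp."""
    return 1 if random.random() < p_xp_given_y[(1, y)] else 0

p_xp_given_y = {(0, 0): 1/2, (1, 0): 1/2, (0, 1): 1/3, (1, 1): 2/3}
gs1 = {"Y": 1, "Xc": 0, "Xm": 1, "Xp": 0}
gs2 = dict(gs1, Xp=resample_xp(gs1["Y"], p_xp_given_y))
# GS2 keeps Y=1, Xc=0, Xm=1 and has Xp=0 with prob. 1/3, Xp=1 with prob. 2/3.
```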
vi. (1 pt) Suppose we instead use prior sampling to generate samples PS1 and PS2. Which samples are of higher quality, the two samples generated by Gibbs Sampling or the two samples generated by Prior Sampling? We have intentionally left “higher quality” undefined. Use your best judgment. For partial credit in the event of a wrong answer, very briefly explain your answer below.
⬤ GS1 and GS2 are higher quality ⬤ PS1 and PS2 are higher quality ◯ They are the same quality
Intended solution: When we have many samples, prior sampling and Gibbs sampling perform equally well. However, when there are only a few samples, prior sampling is better than Gibbs sampling, because the samples generated by Gibbs sampling are correlated with one another (each successive sample only differs by a single variable). For example, if there are only two samples for four variables, Gibbs sampling will only see one value for three of the variables, while prior sampling avoids this correlation bias.
Alternate interpretation: Some students didn’t realize that the samples are for the same query as in the previous part (v) (where there is no evidence), and considered the possibility that we have a query that does have evidence. In that case, Gibbs sampling is better than prior sampling, since the latter generates samples without paying attention to the evidence.
2. MDPs [6 pts]
Consider the following Markov Decision Process. Unlike most MDP models where actions have many potential outcomes of varying probability, assume that the transitions are deterministic, i.e. occur with T(s, a, s’)=100% probability as shown in the graph below. The reward for each transition is displayed adjacent to each edge:
i. (3 pts) Compute the following quantities, assuming a discount factor of 1. Recall that Vk(s) represents the expected utility of the state s under the optimal policy assuming the game ends in k more time steps, and V*(s) is the expected utility starting in state s if we play optimally forever.
V2(A) = R(A,C) + R(C,D) = 13
V5(A) = R(A,C) + R(C,D) + R(D,B) + R(B,E) + R(E,E) = 26
V*(A) = ∞, from taking the cycle C−D−B repeatedly.
V2(B) = R(B,E) + R(E,E) = 9
V5(B) = R(B,C) + R(C,D) + R(D,B) + R(B,E) + R(E,E) = 17
V*(B) = ∞, from taking the cycle B−C−D repeatedly.
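These quantities can be reproduced with finite-horizon value iteration. The sketch below is ours: the graph figure is omitted here, and while R(A,C) = 7, R(B,C) = −2, R(C,D) = 6, and R(D,B) = 4 follow from the solutions above, the split R(B,E) = 9, R(E,E) = 0 and the rewards R(A,B) = 0 and R(D,E) = 0 are assumptions chosen only to be consistent with them.

```python
# Deterministic graph MDP: edges[s][t] is the reward for moving s -> t.
edges = {
    "A": {"B": 0, "C": 7},   # A -> B reward assumed
    "B": {"C": -2, "E": 9},  # split of R(B,E) + R(E,E) = 9 assumed
    "C": {"D": 6},
    "D": {"B": 4, "E": 0},   # D -> E reward assumed
    "E": {"E": 0},
}

def value_iteration(edges, horizon):
    v = {s: 0.0 for s in edges}  # V_0(s) = 0: the game is over
    for _ in range(horizon):     # gamma = 1, deterministic transitions
        v = {s: max(r + v[t] for t, r in edges[s].items()) for s in edges}
    return v

print(value_iteration(edges, 2)["A"])  # 13.0, matching V2(A)
print(value_iteration(edges, 5)["A"])  # 26.0, matching V5(A)
print(value_iteration(edges, 5)["B"])  # 17.0, matching V5(B)
```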
ii. (1 pt) Consider using feature-based Q-learning to learn what the optimal policy should be. Suppose we use the feature f1(s, a), equal to the smallest number of edges from s to E, and the feature f2(s, a), equal to the smallest number of edges from a’s target to E. For example, f1(A, →B) is 2, and f2(C, →D) is 1.
Let the current weight for these features be [3, 2]. What is the current prediction for Q(C, →D) given these weights for these two features?
Q(C, →D) = f1(C, →D)*w1 + f2(C, →D)*w2 = 2*3 + 1*2 = 8
iii. (2 pts) Suppose we update w based on a single observation of a transition from C to D, assuming α = 0.5 . What is w after updating based on this observation?
w = [6, 3.5]
Explanation:
The difference between the observed target and the predicted Q-value is:
Δ = (R(C, →D) + max_a′ Q(D, a′)) − Q(C, →D) = (6 + 5) − 8 = 3.
We thus update the weights to be: w + α*Δ*f = [3, 2] + 0.5*3*[2, 1] = [6, 3.5].
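A short sketch (ours) of the approximate Q-learning prediction and update in parts ii and iii; the feature values and the target term max_a′ Q(D, a′) = 5 are taken from the solution above.

```python
def q_value(features, weights):
    """Linear approximation: Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(f * w for f, w in zip(features, weights))

w = [3.0, 2.0]
f = [2.0, 1.0]                       # f1(C, ->D) = 2, f2(C, ->D) = 1
alpha, reward, max_next_q = 0.5, 6.0, 5.0

q = q_value(f, w)                    # 2*3 + 1*2 = 8
delta = (reward + max_next_q) - q    # (6 + 5) - 8 = 3
w = [wi + alpha * delta * fi for wi, fi in zip(w, f)]
print(w)                             # [6.0, 3.5]
```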
3. Probability (8.5 pts)
Consider the following CPT, to be used for part i only!
A    B    C    P(A, B, C)
+a   +b   +c   0.15
+a   +b   -c   0.2
+a   -b   +c   0.3
+a   -b   -c   0.1
-a   +b   +c   0.15
-a   +b   -c   0
-a   -b   +c   0.05
-a   -b   -c   0.05
i. (2 pts) Compute the following expressions for the table above. Write each answer as a single numerical value. It is ok to leave your answer as a fraction of decimal numbers.
● P(+a | -c) = P(+a, -c) / P(-c) = (0.2 + 0.1) / (0.2 + 0.1 + 0 + 0.05) = 6/7
● P(+b | -a, -c) = P(-a, +b, -c) / P(-a, -c) = 0 / 0.05 = 0
● P(-a | +b, +c) = P(-a, +b, +c) / P(+b, +c) = 0.15 / (0.15 + 0.15) = 1/2
● P(-b, +c | -a) = P(-a, -b, +c) / P(-a) = 0.05 / (0.15 + 0 + 0.05 + 0.05) = 1/5
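Each query above is a ratio of sums over the joint table, which a few lines of Python (ours) make explicit:

```python
joint = {
    ("+a", "+b", "+c"): 0.15, ("+a", "+b", "-c"): 0.20,
    ("+a", "-b", "+c"): 0.30, ("+a", "-b", "-c"): 0.10,
    ("-a", "+b", "+c"): 0.15, ("-a", "+b", "-c"): 0.00,
    ("-a", "-b", "+c"): 0.05, ("-a", "-b", "-c"): 0.05,
}

def p(**fixed):
    """Marginal probability of a partial assignment, e.g. p(a="+a", c="-c")."""
    idx = {"a": 0, "b": 1, "c": 2}
    return sum(pr for event, pr in joint.items()
               if all(event[idx[var]] == val for var, val in fixed.items()))

print(p(a="+a", c="-c") / p(c="-c"))                  # P(+a | -c)    = 6/7
print(p(a="-a", b="+b", c="+c") / p(b="+b", c="+c"))  # P(-a | +b,+c) = 1/2
print(p(a="-a", b="-b", c="+c") / p(a="-a"))          # P(-b,+c | -a) = 1/5
```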
ii. (1 pt) Let A, B, and C be binary random variables. Select all of the following CPTs that are guaranteed to have 4 rows for ANY distribution, not the table above.
⬛ P(+a|B,C) ⬛ P(B,C|+a) ⬛ P(B|A) ⃞ P(+a,C|+b)
Explanation: Since these are binary random variables, the number of rows in a table is 2 raised to the number of unspecified variables in the distribution.
iii. (2 pts) Select all of the following that are guaranteed to equal 1 for ANY distribution, not the table above.
⬛ P(+a) + P(-a)
⬛ P(+a | -c)P(-c) / (P(+a, -c, +b) + P(+a, -c, -b))
⃞ P(+a | +b)P(+b) + P(-a | -b)P(-b)
⃞ P(-c | -a)P(-a) + P(-c | +a)P(+a)
Explanation:
The first is true by the law of total probability.
The second’s numerator is equal to P(+a, -c) by the product rule, and its denominator sums over B to also get P(+a, -c).
The third is equal to P(+a, +b) + P(-a, -b), which does not simplify further.
The fourth is equal to P(-a, -c) + P(+a, -c) = P(-c), which need not equal 1.
For parts iv – vi: It’s winter break, and you’re staying at Sherdil’s Casino. Every morning, you receive (as a gift) a random odd number of gambling chips, C, in the range 1 to k, where k is some odd integer constant, according to the distribution P(C = c) = 2/(k + 1) for each odd c in {1, 3, …, k}.
Assume the number of chips you receive on a particular day is independent of the number of chips you received on any and all prior days.
iv. (1 pt) Let the number of chips you receive on days 1, 2, and 3 be C1, C2, and C3 respectively. Is C3 conditionally independent of C1 given C2? (Answer yes or no. No explanation is needed.)
⬤ Yes ⬤ No
Explanation: It depends on whether we know k.
If we don’t know k, then the answer is no, because C1 gives us information about k, on which C3 depends. If we know k, then the answer is yes.
v. (1 pt) Suppose k is 5. On average, how many chips can you expect to receive per day? E[# chips] = 3
Explanation: E[C] = 1/3*1+1/3*3+1/3*5 = 3.
vi. (1.5 pts) Suppose you stay for five days and receive the following chip amounts: {5, 11, 1, 9, 9}. Based on your five samples, compute the maximum likelihood estimate for k.
MLE of k = 11
Explanation:
Intuitively, as long as k is at least 11, the smaller the k the larger the probability of getting any particular outcome.
Formally, for k ≥ 11, the log likelihood of the outcome is:
ll(k|C) = log P(C|k) = log ∏ 2/(k+1) = ∑ (log 2 − log(k+1)) = 5 log 2 − 5 log(k+1),
while for k < 11 the likelihood is 0, since the sample 11 would be impossible.
This function is strictly decreasing as k increases, so k = 11 is the maximum likelihood estimate.
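The argument can be checked numerically with a sketch like the following (ours), which scans odd values of k:

```python
import math

samples = [5, 11, 1, 9, 9]

def log_likelihood(k, samples):
    if k < max(samples):
        return -math.inf  # some sample would have probability 0 under this k
    return sum(math.log(2 / (k + 1)) for _ in samples)  # 5*log(2/(k+1))

best_k = max(range(1, 100, 2), key=lambda k: log_likelihood(k, samples))
print(best_k)  # 11
```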
Common Mistakes:
● The question says that k is an odd integer. Any answer that is not an odd integer is not a plausible answer; this can be a good sanity check on the answer.
● The most common incorrect answer was 13. This presumably comes from using a technique like the method of moments (i.e. finding k such that E[C] is equal to the average of the samples), since if k = 13, then E[C] = 7. The method of moments sometimes, but not always, finds the maximum likelihood estimate.
2016 Most Valuable Goat Celebration Zone
4. CSPs (8.5 pts)
In the game of Hidato, we start with a partially filled grid with N blanks, and our goal is to fill in all the blanks with integers 1 through N such that every number is unique, and every number q (except N) appears adjacent to its successor q + 1, with diagonal moves allowed. For example, if we start from the figure on the left, we can get the solution on the right. Black squares act as obstacles and never receive numbers.
We can formulate this as a CSP, where the domain of the variables is {1, …, N}, where N is the number of non-black (i.e. initially blank) squares.
i. (2.5 pts) If we start from the figure below with 4 variables assigned (to 1, 3, 9, and 14), and run AC-3 (our arc-consistency algorithm), what values remain in the domain of the variable marked with an X? Assume that all 10 blank squares (including X and Y) are unassigned.
Arc-consistency involves processing of binary arcs only, so we must start by deciding what binary arcs exist. Our original intention was that you would derive and apply the following two constraints:
● NO_EQUAL_SQUARES (NES): There are binary arcs between all squares which specify that they cannot be equal (and a separate set of constraints that require numbers to be in a sequence)
● NO_DISTANT_SUCCESSORS (NDS): There are binary arcs between non-adjacent squares that specify that they cannot be one away from each other.
In addition, you can also add a 3rd constraint that is a more powerful version of NDS:
● RADIAL_DIFFERENCE_TEST (RDT): There are binary arcs between every pair of squares that forbid assignments in which the values Xi and Xj satisfy |Xi − Xj| < the distance between those squares. It is redundant with NDS as far as solutions go, but it makes arc consistency more powerful.
However: it is also possible to cast the Hidato problem without any binary constraints, in which case arc-consistency won’t do anything at all. For example, consider the following multi-variable constraints:
● ALL_DISTINCT: A single N-ary constraint that ensures all variables are distinct.
● ALL_PRESENT: A single N-ary constraint that says that for every value i in {1, …, N}, some variable is assigned i.
● NEIGHBOR_IS_ADJACENT: For each square, the square holding the successor of its value is one of its neighbors. This is a Q-ary constraint, where Q − 1 is the number of neighbors.
In retrospect, we should have asked you to explicitly enumerate your binary constraints, but we did not, so we gave you credit for ANY valid answer, even if you did not write out your binary constraints.
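For concreteness, here is a generic AC-3 skeleton (ours; the grid from the figure is not reproduced, so the NES/NDS predicates below are only illustrative encodings):

```python
from collections import deque

def ac3(domains, constraints):
    """domains: var -> set of values; constraints: (Xi, Xj) -> predicate(v, w)
    that is True when value v for Xi and value w for Xj are jointly allowed."""
    queue = deque(constraints)
    while queue:
        xi, xj = queue.popleft()
        allowed = constraints[(xi, xj)]
        # Prune values of Xi that have no supporting value in Xj's domain.
        pruned = {v for v in domains[xi]
                  if not any(allowed(v, w) for w in domains[xj])}
        if pruned:
            domains[xi] -= pruned
            # Re-check every arc pointing at Xi, since its domain shrank.
            queue.extend(arc for arc in constraints if arc[1] == xi)
    return domains

def nes(v, w):    # NO_EQUAL_SQUARES, for any pair of squares
    return v != w

def nds(v, w):    # NO_DISTANT_SUCCESSORS, for non-adjacent pairs only
    return v != w and abs(v - w) != 1
```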
Assuming no binary constraints: The right answer is: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
If we apply arc-consistency considering only the arcs from X to 1, 3, 9, and 14, then:
● NES eliminates: {1, 3, 9, 14}
● NDS eliminates: {8, 10}
● RDT eliminates: {7, 8, 10, 11}
Completing the entire AC-3 algorithm yields no further eliminations from X. So any answer that includes {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14} minus any of the entire sets above is eligible for full credit.
Sample correct answers:
Domain of X (assuming no binary constraints): {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Domain of X (assuming NO_EQUAL_SQUARES): {2, 4, 5, 6, 7, 8, 10, 11, 12, 13}
Domain of X (assuming NES and NDS): {2, 4, 5, 6, 7, 11, 12, 13}
Domain of X (assuming NES and RDT): {2, 4, 5, 6, 12, 13}
ii. (2.5 pts) If we assign X=2, and re-run AC-3, what will Y’s domain be after AC-3 completes? Assume that all 9 blank squares (including Y) other than X are unassigned.
Similar to above, the answer depends on which binary constraints you came up with and applied. In this case, the possibilities are:
● NES eliminates {1, 2, 3, 9, 14}
● NDS eliminates: {8, 10, 13}
● RDT eliminates: (nothing)
As before, any combination that involves eliminating zero or more of the entire sets above is fine.
For example:
Domain of Y (assuming no binary arcs): {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Domain of Y (assuming NES): {4, 5, 6, 7, 8, 10, 11, 12, 13}
Domain of Y (assuming NES and NDS): {4, 5, 6, 7, 11, 12}
Domain of Y (assuming NDS only): {1, 2, 3, 4, 5, 6, 7, 9, 11, 12, 14}
iii. (1 pt) If we ensure arc-consistency after every assignment in any Hidato puzzle, we’ll never backtrack.
◯ True ⬤ False
Explanation: Even if we have only binary constraints, the constraint graph is not a tree, so backtracking is possible.
iv. (2.5 pts) Repeat part i, but suppose we run an algorithm that ensures 3-consistency instead of arc-consistency (recall that arc-consistency is just 2-consistency). In this case, what is the domain of X?
Explanation: In the CSP lecture, we strongly implied that you didn’t need to know how k-consistency works for k > 2. Oops.
Domain of X: ___[Beyond scope of course]_____
5. (17 pts) Potpourri
i. (2.5 pts) You are given a choice between the following:
● A random lottery in which you receive either $A or $B uniformly at random.
● An event in which you receive $k, where k is a positive real-valued constant.
Your utility function for money is U(d) = d², where U(d) is the utility corresponding to d dollars. For what values of k should you take the lottery? Give your answer as an interval or a union of intervals as appropriate, e.g., [12, ∞) or [5, 9] ⋃ [7, ∞). You may handle ties however you wish. Assume that A, B, and k are non-negative.
k ∈ [0, √((A² + B²)/2)]
Explanation: We value the lottery at E[U(L)] = (A² + B²)/2. Thus, we take the lottery whenever U(k) = k² is less than or equal to this, i.e. whenever k ≤ √((A² + B²)/2).
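A quick numeric check (ours) of the threshold under this risk-seeking utility:

```python
# Take the lottery whenever U(k) = k**2 <= E[U(lottery)] = (A**2 + B**2) / 2.
def take_lottery(k, a, b):
    return k**2 <= (a**2 + b**2) / 2

print(take_lottery(5, 6, 8))  # True:  25 <= 50
print(take_lottery(8, 6, 8))  # False: 64 >  50
```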
For parts ii and iii: Let X, Y, and Z be independent random variables with the following distributions:

X    P(X)
3    2/3
−6   1/3

Y    P(Y)
2    1/2
10   1/2

Z    P(Z)
−2   1/4
6    3/4
You can take either action 1 or action 2.
● Action 1 has utility U1(X, Y, Z) = 10*X − 3*Y*Z
● Action 2 has utility U2(X, Y, Z) = 10*X + 5*Y − 5*Y*Z
ii. (3.5 pts) Calculate the maximum expected utility. Hint: to avoid tedious calculations, recall from your prior coursework that E[A*B] = E[A]*E[B] so long as A and B are independent random variables, and that E[X+Y] = E[X] + E[Y] for any random variables X and Y.
MEU({}) = −72
Explanation:
We first compute E[X] = 0, E[Y] = 6, E[Z] = 4. Since Y and Z are independent, E[YZ] = E[Y]E[Z] = 24. We can then calculate E[U1] = 10E[X] − 3E[YZ] = −72 and E[U2] = 10E[X] + 5E[Y] − 5E[YZ] = −90, so we should take action 1, getting MEU({}) = −72.
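The arithmetic can be verified with a short sketch (ours) that uses linearity of expectation and E[YZ] = E[Y]E[Z] for independent Y and Z:

```python
def expect(dist):
    """Expected value of a finite distribution given as {value: probability}."""
    return sum(v * p for v, p in dist.items())

ex = expect({3: 2/3, -6: 1/3})   # E[X] = 0
ey = expect({2: 1/2, 10: 1/2})   # E[Y] = 6
ez = expect({-2: 1/4, 6: 3/4})   # E[Z] = 4
eyz = ey * ez                    # 24, by independence of Y and Z

eu1 = 10*ex - 3*eyz              # -72
eu2 = 10*ex + 5*ey - 5*eyz       # -90
print(max(eu1, eu2))             # MEU({}) = -72
```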
iii. (2 pts) Which variable has the lowest VPI? Justify your answer either in words or numerically. Please restrict any answers to fewer than 20 words.
⬤ X ◯ Y ◯ Z
Explanation: VPI(X) = 0, because the way X affects the utility is independent of the action. Since VPIs are nonnegative, this variable must have the lowest VPI (or be tied for this property, but we can verify that the VPIs of the other two are positive).
Specifically, one way of seeing this is to note that U1(X, Y, Z) − U2(X, Y, Z) = 2YZ − 5Y, which does not depend on X.
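A numeric check (ours) that observing X cannot change the decision, so VPI(X) = 0:

```python
ey, eyz = 6, 24  # E[Y] and E[YZ] from part ii

def meu_given_x(x):
    # Both actions shift by the same 10*x, so the argmax never changes.
    return max(10*x - 3*eyz, 10*x + 5*ey - 5*eyz)

px = {3: 2/3, -6: 1/3}
meu_with_x = sum(p * meu_given_x(x) for x, p in px.items())
print(meu_with_x)  # -72.0 = MEU({}), hence VPI(X) = MEU with X - MEU({}) = 0
```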
Common Mistakes:
● Saying that VPI(X) = 0 because E[X] = 0, or that E[X]