
UNIVERSITY OF MICHIGAN
Department of Electrical Engineering and Computer Science
EECS 445 — Introduction to Machine Learning, Winter 2022
Homework 4
1 Maximum Likelihood Estimate [24 pts]


Sanjeev is gearing up for summer vacation and needs to buy some sunscreen this weekend. However, due to time constraints, he can only go to one of the two grocery stores near him: Kroger or Costco. Sanjeev wants to go to the grocery store that has the larger stock of sunscreen, so he asks his friends for advice. Over the past several weeks, each of his friends has chosen to visit Kroger with the same probability γ, and Costco with probability 1 − γ. While at the store, they count the number of sunscreen bottles in stock.
• The number of sunscreen bottles at Kroger follows a distribution with parameter λ,
$$p_K(x \mid \lambda) = \frac{2^x \lambda^x e^{-2\lambda}}{x!}.$$
• The number of sunscreen bottles at Costco follows a distribution with parameter μ,
$$p_C(x \mid \mu) = \frac{\mu^x e^{-\mu/2}}{2^x \, x!}.$$
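Note that both pmfs are ordinary Poisson distributions in disguise: $p_K(x \mid \lambda)$ is the Poisson pmf with mean $2\lambda$, and $p_C(x \mid \mu)$ is the Poisson pmf with mean $\mu/2$. A minimal numerical check of this identity (scipy assumed available; the parameter values are hypothetical):

```python
from math import exp, factorial

from scipy.stats import poisson  # assumed available

lam, mu = 1.5, 3.0  # hypothetical parameter values

for x in range(6):
    # The two pmfs exactly as written in the problem statement.
    p_k = (2 ** x) * (lam ** x) * exp(-2 * lam) / factorial(x)
    p_c = (mu ** x) * exp(-mu / 2) / ((2 ** x) * factorial(x))
    # They match Poisson(2*lam) and Poisson(mu/2), respectively.
    assert abs(p_k - poisson.pmf(x, 2 * lam)) < 1e-12
    assert abs(p_c - poisson.pmf(x, mu / 2)) < 1e-12
```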
REMEMBER to submit your completed assignment to Gradescope by 10:00pm ET on Wednesday, April 13. Your submission can be either handwritten or typed. Assign all pages to their respective questions. Incorrectly assigned pages will not be graded.

The following table records his friends' observations over the past several weeks.

Observation i    1       2       3       4       5       6       7       8       9       10
Store z^(i)      Kroger  Kroger  Costco  Kroger  Costco  Costco  Kroger  Costco  Costco  Costco
Stock x^(i)      2       …

(a) [4 pts] Write out the likelihood of this dataset, substituting in the probability distributions p_K(x) and p_C(x). Leave your answer as a product of terms, without substituting in any numerical values.
Solution: We let z^(i) be an indicator denoting whether observation i was made at Kroger or Costco, and let K and C be the sets of indices i for Kroger and Costco visits, respectively.
$$
\begin{aligned}
L(S_n) &= \prod_{i=1}^{n} P(x^{(i)}, z^{(i)}) \\
&= \prod_{i \in K} \gamma\, p_K(x^{(i)} \mid \lambda) \cdot \prod_{i \in C} (1-\gamma)\, p_C(x^{(i)} \mid \mu) \\
&= \prod_{i \in K} \gamma\, \frac{2^{x^{(i)}} \lambda^{x^{(i)}} e^{-2\lambda}}{x^{(i)}!} \cdot \prod_{i \in C} (1-\gamma)\, \frac{\mu^{x^{(i)}} e^{-\mu/2}}{2^{x^{(i)}}\, x^{(i)}!}
\end{aligned}
$$
(b) [10 pts] Recall that with MLE, we want to find the parameters that maximize the dataset’s likelihood. Sometimes, it is easier to maximize the log-likelihood rather than directly maximizing the likelihood. Using your answer from part (a), find the log-likelihood for this dataset. Represent your final answer in terms of
$$n_K = |K|, \quad n_C = |C|$$
$$m_K = \sum_{i \in K} x^{(i)}, \quad m_C = \sum_{i \in C} x^{(i)}$$
$$S_K = \sum_{i \in K} \ln(x^{(i)}!), \quad S_C = \sum_{i \in C} \ln(x^{(i)}!)$$
where K and C are the sets of indices i for Kroger and Costco visits, respectively.
Solution: We begin by writing out the log-likelihood and simplifying as follows:
$$
\begin{aligned}
\ln(L(S_n)) &= \ln\left( \prod_{i \in K} \gamma\, \frac{2^{x^{(i)}} \lambda^{x^{(i)}} e^{-2\lambda}}{x^{(i)}!} \cdot \prod_{i \in C} (1-\gamma)\, \frac{\mu^{x^{(i)}} e^{-\mu/2}}{2^{x^{(i)}}\, x^{(i)}!} \right) \\
&= \sum_{i \in K} \left[ \ln(\gamma) + x^{(i)} \ln(2\lambda) - 2\lambda - \ln(x^{(i)}!) \right] + \sum_{i \in C} \left[ \ln(1-\gamma) + x^{(i)} \ln\left(\tfrac{\mu}{2}\right) - \tfrac{\mu}{2} - \ln(x^{(i)}!) \right] \\
&= \sum_{i \in K} \ln(\gamma) + \sum_{i \in K} x^{(i)} \ln(2\lambda) - \sum_{i \in K} 2\lambda - \sum_{i \in K} \ln(x^{(i)}!) \\
&\quad + \sum_{i \in C} \ln(1-\gamma) + \sum_{i \in C} x^{(i)} \ln\left(\tfrac{\mu}{2}\right) - \sum_{i \in C} \tfrac{\mu}{2} - \sum_{i \in C} \ln(x^{(i)}!) \\
&= n_K \ln(\gamma) + m_K \ln(2\lambda) - 2\lambda\, n_K - S_K + n_C \ln(1-\gamma) + m_C \ln\left(\tfrac{\mu}{2}\right) - \tfrac{\mu}{2}\, n_C - S_C
\end{aligned}
$$
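In code, the simplified log-likelihood is a short function of the parameters and summary statistics. A minimal sketch (the function name is mine; $S_K$ and $S_C$ are dropped since they do not depend on the parameters and so cannot affect the argmax):

```python
import math

def log_likelihood(gamma, lam, mu, n_K, n_C, m_K, m_C):
    # Simplified log-likelihood from part (b), up to the additive
    # constants -S_K - S_C, which do not depend on (gamma, lam, mu).
    return (n_K * math.log(gamma) + m_K * math.log(2 * lam) - 2 * lam * n_K
            + n_C * math.log(1 - gamma) + m_C * math.log(mu / 2) - mu / 2 * n_C)
```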

(c) [10 pts] Calculate the maximum likelihood estimates of γ, λ, and μ by maximizing the expression for the log-likelihood that you derived in part (b). Express your final answers as numbers, and be sure to show all your work.
Solution: We take the partial derivatives of ln(L(S_n)) with respect to γ, λ, and μ, set each derivative to 0, and solve to obtain the maximum likelihood estimate of each parameter.

$$\frac{\partial \ln(L(S_n))}{\partial \gamma} = \frac{n_K}{\gamma} - \frac{n_C}{1-\gamma} = 0 \implies \gamma = \frac{n_K}{n_K + n_C} = \frac{4}{10} = \frac{2}{5}$$

$$\frac{\partial \ln(L(S_n))}{\partial \lambda} = \frac{m_K}{\lambda} - 2 n_K = 0 \implies \lambda = \frac{m_K}{2 n_K} = \frac{23}{8}$$

$$\frac{\partial \ln(L(S_n))}{\partial \mu} = \frac{m_C}{\mu} - \frac{n_C}{2} = 0 \implies \mu = \frac{2 m_C}{n_C} = \frac{42}{6} = 7$$
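A minimal sketch verifying these estimates numerically, using the summary statistics from the derivation above (the full stock-count column of the table is not reproduced here):

```python
# Closed-form MLEs from part (c), computed from the summary statistics.
n_K, n_C = 4, 6     # number of Kroger / Costco visits
m_K, m_C = 23, 21   # total bottles counted at Kroger / Costco

gamma_hat = n_K / (n_K + n_C)   # d/dgamma = 0  ->  2/5
lambda_hat = m_K / (2 * n_K)    # d/dlambda = 0 ->  23/8
mu_hat = 2 * m_C / n_C          # d/dmu = 0     ->  7

print(gamma_hat, lambda_hat, mu_hat)  # 0.4 2.875 7.0
```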
2 Expectation Maximization [16 pts]
You are a student stuck in BBB doing your latest EECS 445 project. However, your friend is on vacation for 6 months. At the start of each month, they plan to visit one of three cities A, B, and C with equal probability. Each time, they stay in the city they visit for exactly 10 days. Each of these cities has a probability of having sunny weather on any given day, θ_A, θ_B, and θ_C, unknown to you. Your friend records the weather on each day and sends it to you. Armed with your understanding of Maximum Likelihood Estimation, you plan to infer the weather patterns of each city, as captured by the parameters θ_A, θ_B, and θ_C, thus impressing your friend. For the i-th month, we define:
• z^(i) is the city visited in month i, z^(i) ∈ {A, B, C}
• x_j^(i) is the weather on the j-th day of month i, x_j^(i) ∈ {S, R}, where S represents a sunny day and R represents a rainy day

• s_i is the number of sunny days recorded in month i over the 10-day period.
The results of our experiment (with N = 6 months and M = 10 days per month) are shown below.
Month i    x̄^(i) = [x_1^(i), …, x_M^(i)]
1          SSSSSSRRSS
2          SRRRRSRSRR
3          RSSRSRRSRS
4          RSRRRRRRRS
5          SRRSSRRSSS
6          SSSRRSSRSS
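For reference, the sunny-day counts s_i can be read directly off the table; a one-line sketch:

```python
# Sunny-day counts s_i for each month in the table above.
weather = ["SSSSSSRRSS", "SRRRRSRSRR", "RSSRSRRSRS",
           "RSRRRRRRRS", "SRRSSRRSSS", "SSSRRSSRSS"]
print([x.count("S") for x in weather])  # [8, 3, 5, 2, 6, 7]
```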
The Classification-Maximization algorithm estimates the model parameters by alternating between two steps: 1) determining the city assignment for each month, and 2) maximizing the likelihood. More precisely, it alternates between the following two steps after randomly initializing the parameters $\bar{\theta}^{(0)} = [\theta_A, \theta_B, \theta_C]^T$:
C-step: Compute $\hat{z}^{(i)} = \arg\max_{k}\; p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})$ for $i = 1, \ldots, N$ (ties broken arbitrarily)

M-step: Update $\bar{\theta}^{(t+1)} = \arg\max_{\bar{\theta}}\; p(\bar{x}^{(1)}, \ldots, \bar{x}^{(N)}, \hat{z}^{(1)}, \ldots, \hat{z}^{(N)}; \bar{\theta})$ (maximum likelihood estimation given fixed city assignments)
For this section, your task is to derive a Classification-Maximization procedure for the city detection experiment.
(a) (10 pts) Write down an expression for $p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})$, the probability that month i is in city $k \in \{A, B, C\}$ given the data $\bar{x}^{(i)}$ and the current parameter estimate $\bar{\theta}^{(t)}$. Express your answer in terms of $s_i$, $M$, and the model parameters $\theta_A^{(t)}, \theta_B^{(t)}, \theta_C^{(t)}$.

Solution:
$$
\begin{aligned}
P(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})
&= \frac{P(z^{(i)} = k, \bar{x}^{(i)}; \bar{\theta}^{(t)})}{P(\bar{x}^{(i)}; \bar{\theta}^{(t)})} \\
&= \frac{P(\bar{x}^{(i)} \mid z^{(i)} = k; \bar{\theta}^{(t)})\, P(z^{(i)} = k; \bar{\theta}^{(t)})}{P(\bar{x}^{(i)}, z^{(i)} = A; \bar{\theta}^{(t)}) + P(\bar{x}^{(i)}, z^{(i)} = B; \bar{\theta}^{(t)}) + P(\bar{x}^{(i)}, z^{(i)} = C; \bar{\theta}^{(t)})} \\
&= \frac{P(\bar{x}^{(i)} \mid z^{(i)} = k; \bar{\theta}^{(t)})\, P(z^{(i)} = k; \bar{\theta}^{(t)})}{\sum_{l \in \{A,B,C\}} P(\bar{x}^{(i)} \mid z^{(i)} = l; \bar{\theta}^{(t)})\, P(z^{(i)} = l; \bar{\theta}^{(t)})} \\
&= \frac{\tfrac{1}{3} P(\bar{x}^{(i)} \mid z^{(i)} = k; \bar{\theta}^{(t)})}{\tfrac{1}{3} P(\bar{x}^{(i)} \mid z^{(i)} = A; \bar{\theta}^{(t)}) + \tfrac{1}{3} P(\bar{x}^{(i)} \mid z^{(i)} = B; \bar{\theta}^{(t)}) + \tfrac{1}{3} P(\bar{x}^{(i)} \mid z^{(i)} = C; \bar{\theta}^{(t)})} \\
&= \frac{(\theta_k^{(t)})^{s_i} (1-\theta_k^{(t)})^{M - s_i}}{(\theta_A^{(t)})^{s_i} (1-\theta_A^{(t)})^{M-s_i} + (\theta_B^{(t)})^{s_i} (1-\theta_B^{(t)})^{M-s_i} + (\theta_C^{(t)})^{s_i} (1-\theta_C^{(t)})^{M-s_i}}
\end{aligned}
$$
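A minimal sketch of this posterior in code (the function name and signature are mine):

```python
def city_posterior(s_i, M, theta):
    """Posterior over cities for one month with s_i sunny days out of M.
    theta maps city -> current sunny-day probability; the uniform 1/3
    priors cancel, so only the Bernoulli likelihoods matter."""
    lik = {k: p ** s_i * (1 - p) ** (M - s_i) for k, p in theta.items()}
    total = sum(lik.values())
    return {k: v / total for k, v in lik.items()}

# e.g. month 1 has s_1 = 8 sunny days out of M = 10:
print(city_posterior(8, 10, {"A": 0.7, "B": 0.4, "C": 0.6}))
# -> approximately {'A': 0.6396, 'B': 0.0291, 'C': 0.3313}
```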
(b) (6 pts) Perform one iteration of Classification-Maximization for the sample data (either with code or by hand) and report your values of $p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})$, $\hat{z}^{(i)}$, and $\bar{\theta}^{(t)}$. Initialize $\bar{\theta}^{(0)} = (0.7, 0.4, 0.6)$. In the case of a tie, choose the lexicographically smaller city, i.e., A < B < C.
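A sketch of one full iteration in code (variable names are mine); its printed values match the hand computation reported in the solution below:

```python
weather = ["SSSSSSRRSS", "SRRRRSRSRR", "RSSRSRRSRS",
           "RSRRRRRRRS", "SRRSSRRSSS", "SSSRRSSRSS"]
M = 10
theta = {"A": 0.7, "B": 0.4, "C": 0.6}  # theta^(0)

# C-step: posterior over cities per month; ties break toward the
# lexicographically smaller city because max() keeps the first maximum
# it sees when iterating over the sorted keys.
assignments = []
for x_bar in weather:
    s_i = x_bar.count("S")
    lik = {k: p ** s_i * (1 - p) ** (M - s_i) for k, p in theta.items()}
    total = sum(lik.values())
    post = {k: v / total for k, v in lik.items()}
    z_hat = max(sorted(post), key=lambda k: post[k])
    assignments.append(z_hat)
    print({k: round(v, 4) for k, v in post.items()}, "->", z_hat)

# M-step: per-city MLE = fraction of sunny days over the assigned months
# (every city gets at least one month here, so no division by zero).
theta = {k: sum(x.count("S") for x, z in zip(weather, assignments) if z == k)
            / (M * assignments.count(k))
         for k in "ABC"}
print(theta)  # {'A': 0.75, 'B': 0.3333..., 'C': 0.6}
```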
Solution:

θ̄^(0) = (0.7, 0.4, 0.6)

ITERATION 1:

Month i   P(z^(i)=A | x̄^(i); θ̄^(0))   P(z^(i)=B | x̄^(i); θ̄^(0))   P(z^(i)=C | x̄^(i); θ̄^(0))   ẑ^(i)
1         0.6396                       0.0291                       0.3313                       A
2         0.0338                       0.8068                       0.1594                       B
3         0.2041                       0.3979                       0.3979                       B
4         0.0109                       0.9093                       0.0798                       B
5         0.3558                       0.1982                       0.4460                       C
6         0.5089                       0.0810                       0.4101                       A

θ̄^(1) = (0.75, 0.33, 0.6)

3 Bayesian Networks [20 pts]

Consider the following Bayesian Network:

Figure 1: A Very Cool Bayesian Network

(a) (12 pts) Check the following independence conditions of the graph and justify your answers:
• Are Z1 and Z6 conditionally independent given exactly one other node? If so, which node?
• Are Z5 and Z6 marginally independent? Are they conditionally independent given Z1? Are they conditionally independent given Z4? Are they conditionally independent given both Z1 and Z4?

Solution:
• No, Z1 and Z6 are not conditionally independent given any single other node. Running the d-separation algorithm, we see that the direct edge from Z1 to Z6 forms a path that no conditioning set can block.
• They are not marginally independent, since there exists a path between Z5 and Z6 in the moralized graph. By examining the moralized graph, we also see that Z5 and Z6 are not conditionally independent given Z1 alone or Z4 alone. However, Z5 and Z6 are conditionally independent given both Z1 and Z4.

(b) (6 pts) Write down the factored joint probability distribution associated with the Bayesian Network.

$$P(Z_1, Z_2, Z_3, Z_4, Z_5, Z_6) = P(Z_1)\, P(Z_2 \mid Z_1)\, P(Z_3 \mid Z_1)\, P(Z_4 \mid Z_1, Z_2)\, P(Z_6 \mid Z_1, Z_4)\, P(Z_5 \mid Z_2)$$

(c) (2 pts) If each of the random variables can take on one of n distinct values, how many parameters are there in the joint probability distribution associated with the graph above?

$$(n-1) + (n-1)n + (n-1)n + (n-1)n^2 + (n-1)n + (n-1)n^2 = 2n^3 + n^2 - 2n - 1$$

4 Another Bayesian Network [8 pts]

Consider the following Bayesian Network and associated conditional probability tables.

Figure 2: Bayesian network for humidity

Suppose we observe that it is humid (i.e., H = T). What is the probability that it is raining? Show your work and round your answer to 2 decimal places.

Solution:

$$P(R{=}T \mid H{=}T) = \frac{P(R{=}T, H{=}T)}{P(H{=}T)} = \frac{\sum_{s}\sum_{w} P(R{=}T, H{=}T, S{=}s, W{=}w)}{\sum_{s}\sum_{r}\sum_{w} P(H{=}T, S{=}s, R{=}r, W{=}w)}$$

$$P(R{=}T, H{=}T) = \sum_{s}\sum_{w} P(S{=}s)\, P(R{=}T \mid S{=}s)\, P(W{=}w \mid S{=}s)\, P(H{=}T \mid R{=}T, W{=}w)$$
$$= 0.5 \cdot 0.1 \cdot 0.8 \cdot 0.99 + 0.5 \cdot 0.1 \cdot 0.2 \cdot 0.9 + 0.5 \cdot 0.5 \cdot 0.2 \cdot 0.99 + 0.5 \cdot 0.5 \cdot 0.8 \cdot 0.9 = 0.2781$$

$$P(H{=}T) = \sum_{s}\sum_{r}\sum_{w} P(S{=}s)\, P(R{=}r \mid S{=}s)\, P(W{=}w \mid S{=}s)\, P(H{=}T \mid R{=}r, W{=}w)$$
$$= 0.5 \cdot (0.1 \cdot 0.8 \cdot 0.99 + 0.1 \cdot 0.2 \cdot 0.9 + 0.9 \cdot 0.8 \cdot 0.9 + 0) + 0.5 \cdot (0.5 \cdot 0.2 \cdot 0.99 + 0.5 \cdot 0.2 \cdot 0.9 + 0.5 \cdot 0.8 \cdot 0.9 + 0) = 0.6471$$

$$P(R{=}T \mid H{=}T) = \frac{0.2781}{0.6471} \approx 0.43$$
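A sketch verifying this by brute-force enumeration. The structure is S → R, S → W, (R, W) → H, with CPT entries read off the arithmetic above, since Figure 2 itself is not reproduced here:

```python
from itertools import product

P_S = {True: 0.5, False: 0.5}                     # P(S = s)
P_R = {True: 0.1, False: 0.5}                     # P(R = T | S = s)
P_W = {True: 0.8, False: 0.2}                     # P(W = T | S = s)
P_H = {(True, True): 0.99, (True, False): 0.9,    # P(H = T | R = r, W = w)
       (False, True): 0.9, (False, False): 0.0}

def joint_h_true(s, r, w):
    """P(S=s, R=r, W=w, H=T), following the network factorization."""
    p = P_S[s]
    p *= P_R[s] if r else 1 - P_R[s]
    p *= P_W[s] if w else 1 - P_W[s]
    return p * P_H[(r, w)]

num = sum(joint_h_true(s, True, w) for s, w in product([True, False], repeat=2))
den = sum(joint_h_true(s, r, w) for s, r, w in product([True, False], repeat=3))
print(round(num, 4), round(den, 4), round(num / den, 2))  # 0.2781 0.6471 0.43
```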