Naitian creates a proof of concept HMM for a different problem
with three hidden states (labeled D, E, F) and three observed
states (labeled X, Y, Z) in order to test his code. Given the
parameters in the table below, what is the likeliest sequence
of hidden states for the observed sequence $$ZX$$? For your
convenience, we’ve run the Viterbi algorithm and included the
resulting output for the $$\psi(\cdot,\cdot)$$ function here.
**Enter the hidden state sequence in the text box. _In
addition_, either enter your work in the text box _or_ upload
your work as an image.**
![Q8.4a.png](/files/9622b95b-e8b0-4e7f-8556-bb989e92ca8a)
![Q8.4b.png](/files/4b7e91f6-dba6-414a-89df-7084b6a1420d)
[[First, we can compute the hidden state at time $$t=2$$ by
taking the max value of the second row of the Viterbi table. We
compute $$\max\{0.028, 0.0016, 0.0168\} = 0.028$$, so $$z_2 = D$$.]]
[[Next, we can find the hidden state at time $$t=1$$ by following
the second equation given in the previous slide. We compute
$$\max\{0.05*0.8, 0.04*0.2, 0.07*0\} = 0.04$$, so $$z_1 = D$$.]]
[[Therefore, the most likely hidden sequence is $$DD$$. ]]
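To make the backtracking concrete, here is a minimal Python sketch that reproduces the two steps above. The $$\delta$$ values and the transition column into the chosen state are taken from the numbers in the worked solution (the full tables are in the attached images); everything else is assumed purely for illustration.

```python
import numpy as np

# delta values from the worked solution (rows: t = 1, 2; columns: states D, E, F)
delta = np.array([
    [0.05, 0.04, 0.07],       # t = 1 (as used in the max{0.05*0.8, ...} step)
    [0.028, 0.0016, 0.0168],  # t = 2 (the "second row of the Viterbi table")
])
# transition probabilities into the chosen t = 2 state, same state order (from the same step)
a_into_z2 = np.array([0.8, 0.2, 0.0])

states = ["D", "E", "F"]
z2 = states[int(np.argmax(delta[1]))]              # argmax over the last row -> "D"
z1 = states[int(np.argmax(delta[0] * a_into_z2))]  # backtrack one step       -> "D"
print(z1 + z2)                                     # "DD"
```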
Q1 Neural Networks 0 Points

Consider a 2-layer feed-forward neural network that takes as input two-dimensional examples $$\bar{x} \in \mathbb{R}^2$$ and has two ReLU hidden units.
The values of the weights $$W^{(1)}_{ij} \in \{-1, 1, 0\}$$ in the hidden layer are set such that they result in the $$z_1$$ and $$z_2$$ “classifiers” given in the figure below. These classifiers are indicated by the “decision boundaries” and the corresponding normal vectors marked (1) for $$z_1$$ and (2) for $$z_2$$.
Map the input data to the 2-dimensional space of hidden unit activations $$h_1$$ and $$h_2$$. Only map points marked ‘a’ through ‘d’. The label for each of these points has been specified on the graph as a blue square ■ or a red circle ∙.
For each point, choose the number that corresponds to its mapped position. Zero, one, or multiple points may map to each position.

Point a maps to the position: ( ) 1
Point b maps to the position: ( ) 1
Point c maps to the position: (•) 1

Point d maps to the position: ( ) 1
Suppose we are given a binary classification task, in which we want to separate the ■’s from the ∙’s in the figure from Q2.1 above. Suppose we keep the first hidden layer’s parameters fixed, but add and train additional hidden layers (applied after the first layer) to further transform the data. Could the resulting neural network solve this classification problem (fit all points perfectly)? Justify your answer in 1-2 sentences.
( ) Yes, because…  (•) No, because…
EXPLANATION
No, because both point a and point d map to (0, 0) and they have different labels. In future layers, these points will never be separated, so the neural network will always mislabel at least one of them.

Suppose we stick to the current 2-layer architecture but add many more ReLU hidden units within the first layer. Given this new architecture, is it possible to train a model to perfectly separate the points in the figure from Q2.1? Justify your answer in 1-2 sentences.
(•) Yes, because…  ( ) No, because…
EXPLANATION
Yes, because you could conceivably learn 4 different “classifiers”, forming a decision boundary between squares and circles on each of the four sides.
In general, according to the Universal Approximation Theorem, a neural network with one hidden layer can approximate any continuous function arbitrarily well given sufficient width (more hidden units).
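To illustrate the “one classifier per side” idea (this is not the specific figure from Q2.1, whose weights are not reproduced here; the weights and points below are hypothetical), a single hidden layer with four hand-set ReLU units can carve out a square decision region:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

# Four hand-set ReLU "classifiers", one per side of the square [-1, 1]^2 (hypothetical weights).
# Each unit is zero inside the square and grows linearly once a point crosses its side.
W = np.array([[ 1.0,  0.0],   # fires when x1 > 1
              [-1.0,  0.0],   # fires when x1 < -1
              [ 0.0,  1.0],   # fires when x2 > 1
              [ 0.0, -1.0]])  # fires when x2 < -1
b = np.array([-1.0, -1.0, -1.0, -1.0])

def score(x):
    h = relu(W @ x + b)   # hidden activations; all zero iff x lies inside the square
    return 0.5 - h.sum()  # second layer: positive inside, negative once any unit fires enough

inside  = [np.array([0.0, 0.0]), np.array([0.9, -0.9])]
outside = [np.array([2.0, 0.0]), np.array([-2.0, 2.0])]
print([score(x) > 0 for x in inside])   # [True, True]
print([score(x) > 0 for x in outside])  # [False, False]
```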
Suppose neural network A achieves lower training error on the same dataset as neural network B. For a given held-out set, neural network A will always achieve higher test error than neural network B.
( ) True  (•) False
EXPLANATION
Training performance gives no guarantee about test performance, as the test data is completely held out from the training stage of the model; a model with lower training error may simply be overfitting.
Q2 Convolutional Neural Networks 0 Points

Consider a neural network architecture that takes an input from $$\mathbb{R}^7$$ and produces a scalar output. Layers are configured as in the table below.
We have expressed the neural network using the following notation:
A CONV-N layer is the N-th convolutional layer with C filters of width W and height 1. An FC-N layer is the N-th fully connected layer with output size H. Assume stride = 1 and padding = 0 for the convolutional layers. None of the layers have bias terms.
| Layer | Hyperparameters | Activation |
|--------|-----------------|------------|
| CONV-1 | W = 3, C = 2 | ReLU |
| CONV-2 | W = 2, C = 1 | ReLU |
| FC-1 | H = 2 | ReLU |
| FC-2 | H = 1 | Linear |

Suppose the input to the CONV-2 layer has 2 channels of size 5. Calculate the number of parameters and the output size for this layer.

The number of parameters
The output size

EXPLANATION
There is one filter (C = 1) of height 1 and width 2, and the input has 2 channels, so the number of parameters is 2 ⋅ 1 ⋅ 2 ⋅ 1 = 4 (width ⋅ height ⋅ input channels ⋅ filters).
The output width is 5 − 2 + 1 = 4 with stride 1 and no padding, so the output dimension is 1 × 4 × 1, where the notation is H × W × C.
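These counts can be double-checked by instantiating the layer in a framework. The sketch below uses PyTorch purely for illustration (the question itself is framework-agnostic) and reproduces the 4 parameters and the length-4, single-channel output.

```python
import torch
import torch.nn as nn

# CONV-2 as described above: C = 1 filter of width 2 (height 1) applied to a
# 2-channel input of size 5, with stride 1, padding 0, and no bias.
conv2 = nn.Conv1d(in_channels=2, out_channels=1, kernel_size=2, stride=1, padding=0, bias=False)
print(sum(p.numel() for p in conv2.parameters()))  # 4 parameters (2 * 1 * 2 * 1)

x = torch.randn(1, 2, 5)  # (batch, channels, length)
print(conv2(x).shape)     # torch.Size([1, 1, 4]) -> output of size 4 with 1 channel
```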
Suppose the input to the FC-2 layer is of size 2. Calculate the number of parameters for this layer.
The number of parameters
Q3 Clustering 0 Points
Recall the k-means algorithm as specified in lecture. The following table provides the cluster assignments after the first two iterations of the algorithm with k = 3.
|  | $$x^{(1)}$$ | $$x^{(2)}$$ | $$x^{(3)}$$ | $$x^{(4)}$$ | $$x^{(5)}$$ |
|---|---|---|---|---|---|
| Iteration 1 | 1 | 1 | 3 | 2 | 2 |
| Iteration 2 | 1 | 2 | 3 | 2 | 3 |

Is the following clustering assignment possible in the next iteration (i.e., iteration 3)?
$$x^{(1)}: 1,\; x^{(2)}: 1,\; x^{(3)}: 3,\; x^{(4)}: 2,\; x^{(5)}: 2$$

( ) Yes  (•) No
Justify your answer below.
EXPLANATION
The k-means objective J monotonically decreases with each iteration until convergence. Because the assignments changed from iteration 1 to iteration 2, the algorithm had not yet converged. The proposed iteration-3 assignment is identical to the iteration-1 assignment, and revisiting an earlier assignment would require the objective to increase, which cannot happen.
[OPTIONAL] You may upload your work as an image.
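To see the monotonicity argument in action, the following NumPy sketch (on hypothetical random data, not the dataset from the question) runs the standard assignment/update loop and asserts that the objective J never increases between iterations:

```python
import numpy as np

def kmeans_objective(X, assign, centers):
    """k-means objective J: total squared distance of points to their assigned centers."""
    return sum(np.sum((X[assign == k] - centers[k]) ** 2) for k in range(len(centers)))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                                   # hypothetical data
centers = X[rng.choice(len(X), size=3, replace=False)].copy()

prev_J = np.inf
for _ in range(10):
    # assignment step: each point goes to its nearest center
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    # update step: each center moves to the mean of its assigned points
    for k in range(3):
        if np.any(assign == k):
            centers[k] = X[assign == k].mean(axis=0)
    J = kmeans_objective(X, assign, centers)
    assert J <= prev_J + 1e-12                                 # J never increases
    prev_J = J
print("final objective:", prev_J)
```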
Consider the linkage criterion used to combine clusters in hierarchical clustering. Allen proposes a new linkage criterion he calls “centroid linkage”, where the distance between two clusters is computed as the distance between the two cluster centroids.
Hierarchical clustering with centroid linkage will always produce the same results as using average linkage.
( ) True  (•) False
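The two linkages measure different quantities, which a tiny NumPy example (with made-up points) makes explicit: they can assign different distances to the same pair of clusters, so the merge order, and therefore the final result, can differ.

```python
import numpy as np

# Two hypothetical clusters, chosen only to show that the two criteria can disagree.
c1 = np.array([[0.0, 0.0], [2.0, 0.0]])
c2 = np.array([[1.0, 1.0], [1.0, 3.0]])

# Centroid linkage: distance between the two cluster centroids
centroid_linkage = np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0))
# Average linkage: mean of all pairwise distances between the clusters
average_linkage = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1).mean()

print(centroid_linkage)  # 2.0
print(average_linkage)   # ~2.29 -- a different value, so the merge decisions can differ
```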

Q4 Spectral Clustering 0 Points
Given the graph above, answer the following questions about the Spectral Clustering algorithm. Assume that the similarity of a node to itself is 1 and the following matrices order the nodes as A, B, C, D, E.
What is row 3 (corresponding to node C) of the similarity matrix W in our Spectral Clustering algorithm?
( ) [0.6, 0.8, 1, 0, 0]
( ) [0.8, 0, 1, 0, 0.6]
(•) [0.8, 0, 1, 0.6, 0]
( ) [0.6, 0.8, 1, 0, 0]
( ) [0.8, 0.6, 1, 0.8, 0]

What is row 3 (corresponding to node C) of the degree matrix D in our Spectral Clustering algorithm?
( ) [0, 0, 2.3, 0, 0]
(•) [0, 0, 2.4, 0, 0]
( ) [2.7, 0, 0, 0, 0]
( ) [0, 0, 0, 2.4, 0]
( ) [0, 0, 0, 0, 2.9]
What is row 3 (corresponding to node C) of the Laplacian matrix, L?
( ) [−0.8, 0, 1.4, 0.6, 0]
( ) [0, 2.7, 0, 0, 0]
( ) [0.8, 0, −1.4, 0.6, 0]
( ) [0, −0.3, 0, −0.4, 0.7]
(•) [−0.8, 0, 1.4, −0.6, 0]
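The degree and Laplacian rows follow mechanically from the similarity row; here is a short NumPy check, assuming the unnormalized Laplacian $$L = D - W$$ (which is what the options above use):

```python
import numpy as np

# Row of the similarity matrix W for node C, from the selected option above
w_C = np.array([0.8, 0.0, 1.0, 0.6, 0.0])

deg_C = w_C.sum()        # degree of C: 0.8 + 1 + 0.6 = 2.4
D_row = np.zeros(5)
D_row[2] = deg_C         # row 3 of the diagonal degree matrix D
L_row = D_row - w_C      # row 3 of the unnormalized Laplacian L = D - W

print(deg_C)   # 2.4
print(L_row)   # [-0.8  0.   1.4 -0.6  0. ]
```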
The following table contains the eigenvalues and corresponding eigenvectors for a given graph Laplacian. Run k-means clustering to split the underlying dataset given by points {P, Q, X, Y, Z} into k = 3 clusters. Assume the graph Laplacian uses the node ordering P, Q, X, Y, Z.

( ) {P, Z}, {Q, Y}, {X}
(•) {P, X}, {Q, Y}, {Z}
( ) {P, Z}, {X, Y}, {Q}
( ) {P, Q}, {X, Y}, {Z}
( ) None of the above
EXPLANATION
With k = 3, we sort the eigenvalues in increasing order and use the eigenvectors of the three smallest eigenvalues as the embedding coordinates. In the first dimension all five points take the same value ($$x_1 = -0.4$$), but in the remaining dimensions they form three distinct groups: P and X close together, Q and Y near each other, and Z by itself. Running k-means on these spectral embeddings therefore yields {P, X}, {Q, Y}, {Z}.
Recall that the Gaussian kernel similarity metric between two data points $$\bar{x}^{(i)}$$ and $$\bar{x}^{(j)}$$ is defined as
$$w_{ij} = \exp\left\{-\frac{\lVert \bar{x}^{(i)} - \bar{x}^{(j)} \rVert^2}{2\sigma^2}\right\}$$
Annika computed the graph Laplacian twice on the same dataset using the Gaussian kernel similarity metric with two different settings of $$\sigma^2$$ (and no other changes). She also computed the eigenvectors and corresponding eigenvalues for each of the two graph Laplacians. As we did in homework, she then plotted the eigenvalues of each of the graph Laplacians as shown below. The y-axis represents the value of the k-th smallest eigenvalue, and both plots have the same scale on the y-axis.

Zachary offered to label them, but unfortunately misplaced the labels of the plots! Is it possible to identify which plot corresponds to the lower setting of $$\sigma^2$$?
(•) Plot A corresponds to the lower value of $$\sigma^2$$
( ) Plot B corresponds to the lower value of $$\sigma^2$$
( ) Insufficient information given
Briefly justify your answer below.
EXPLANATION
In plot A the jump in the eigenvalues occurs around k = 3, whereas in plot B it occurs around k = 2. Both Laplacians were computed on the same dataset, and a larger $$\sigma^2$$ makes more points similar to one another (more points effectively fall within each neighborhood), merging them into fewer clusters. Since plot A's eigenvalue gap suggests more clusters, i.e. fewer points per neighborhood, plot A corresponds to the smaller $$\sigma^2$$.
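A small NumPy experiment (with made-up 1-D data, not the dataset from the question) illustrates the effect: a smaller $$\sigma^2$$ leaves more near-zero eigenvalues before the jump, i.e. the spectrum suggests more clusters.

```python
import numpy as np

def laplacian_eigenvalues(X, sigma2):
    """Sorted eigenvalues of the unnormalized graph Laplacian with Gaussian-kernel similarities."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2.0 * sigma2))  # self-similarity is exp(0) = 1
    L = np.diag(W.sum(axis=1)) - W
    return np.sort(np.linalg.eigvalsh(L))

# Hypothetical 1-D data with three well-separated groups (purely illustrative)
X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])
print(laplacian_eigenvalues(X, sigma2=0.1))    # ~3 near-zero eigenvalues before the jump (small sigma^2)
print(laplacian_eigenvalues(X, sigma2=100.0))  # only 1 near-zero eigenvalue before the jump (large sigma^2)
```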
Q5 Maximum Likelihood Estimates 0 Points
Thomas has a coin, but unfortunately, he doesn't know the probability of heads, θ. As we saw in class, one way to estimate θ is to compute a maximum likelihood estimate of this parameter, $$\theta_{MLE}$$, based on a given dataset. Thomas collects a dataset, $$S_n$$, consisting of n coin flips, out of which x are heads. However, Thomas also wants his parameter estimate to incorporate an initial belief, $$\alpha = \frac{1}{2}$$, on the probability of heads. For this, he finds an augmented dataset, $$S_n'$$, whose MLE estimate, $$\theta'$$, satisfies:
$$\theta' = \gamma\alpha + (1-\gamma)\theta_{MLE}$$
where $$\gamma = \frac{1}{5}$$ is the weight on the initial belief. Find a possible pair of values (y, N), where y is the total number of heads in $$S_n'$$ and N is the total number of coin flips in $$S_n'$$. You may assume that both x and n are positive integers and 0 < x < n. Express your answers in terms of x and n.

EXPLANATION

$$\theta' = \frac{1}{5}\cdot\frac{1}{2} + \frac{4}{5}\cdot\frac{x}{n} = \frac{1}{10} + \frac{8x}{10n} = \frac{n+8x}{10n}$$

This implies the number of heads is n + 8x and the number of flips is 10n. Note that positively scaled versions of these answers were also acceptable.
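The algebra can be verified with exact fractions; the (x, n) pair below is hypothetical:

```python
from fractions import Fraction

# Hypothetical counts for the original dataset S_n, only to check the algebra above.
x, n = 3, 7                      # x heads out of n flips, with 0 < x < n
gamma, alpha = Fraction(1, 5), Fraction(1, 2)

theta_mle = Fraction(x, n)
theta_prime = gamma * alpha + (1 - gamma) * theta_mle

y, N = n + 8 * x, 10 * n         # proposed augmented dataset: y heads out of N flips
print(theta_prime == Fraction(y, N))  # True: the augmented-dataset MLE matches theta'
```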
Q6 Mixture Models 0 Points

Suppose we have three cities, A, B, and C, each with a probability of having sunny weather, $$\theta_A, \theta_B, \theta_C$$, unknown to us. To predict the weather of each city, we conduct an experiment and construct a dataset as follows: The experiment consists of N trials. For each trial, we choose one city from A, B, and C at random (each with equal probability), detect the weather of that city over M days, and record the sequence of sunny / rainy weather. On the ith trial, we define:

$$z^{(i)}$$ is the city chosen for this trial.
$$x_j^{(i)}$$ is the jth detection for this trial.
$$s_i$$ is the number of sunny days recorded in this trial.

The data are generated as follows:

for trial $$i \in 1, ..., N$$:
  $$z^{(i)} \sim \text{EquiProb}(A, B, C)$$
  for detection $$j \in 1, ..., M$$:
    $$x_j^{(i)} \sim \text{Bernoulli}(\theta_{z^{(i)}})$$

EquiProb(A, B, C) represents a distribution where each of A, B, and C is chosen with probability $$\frac{1}{3}$$, and $$\theta_{z^{(i)}}$$ represents the probability of sunny weather for city $$z^{(i)}$$.

Given $$X = \{\bar{x}^{(i)}\}_{i=1}^n$$, where each $$\bar{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}, ..., x_M^{(i)}]^T$$, and $$Z = \{z^{(i)}\}_{i=1}^n$$, what is the expression for the complete data log-likelihood of the entire dataset?

( ) $$\sum_{i=1}^n \frac{1}{3} + s_i \ln\theta_{z^{(i)}} + (M - s_i)\ln\left(1 - (\theta_{z^{(i)}})^{s_i}\right)$$
( ) $$\sum_{i=1}^n \ln\frac{1}{3} + s_i \ln\theta_{z^{(i)}} + M\ln\left(1 - \theta_{z^{(i)}}\right)$$
( ) $$\sum_{i=1}^n \ln\frac{1}{3} + s_i M \ln\theta_{z^{(i)}} - s_i\ln\left(1 - \theta_{z^{(i)}}\right)$$
(•) $$\sum_{i=1}^n \ln\frac{1}{3} + s_i \ln\theta_{z^{(i)}} + (M - s_i)\ln\left(1 - \theta_{z^{(i)}}\right)$$
( ) None of the above

[OPTIONAL] You may upload your work as an image.

EXPLANATION
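One way to sanity-check the selected expression is to simulate the generative process with hypothetical $$\theta$$ values and compare the closed form against a direct per-observation computation; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = {"A": 0.7, "B": 0.4, "C": 0.2}  # hypothetical sunny-weather probabilities
N, M = 6, 5

cities = rng.choice(list(theta), size=N)                  # z^(i) ~ EquiProb(A, B, C)
X = np.array([rng.random(M) < theta[c] for c in cities])  # x_j^(i) ~ Bernoulli(theta_{z^(i)})
s = X.sum(axis=1)                                         # s_i = number of sunny days in trial i

# Selected option: sum_i [ ln(1/3) + s_i ln(theta_{z^(i)}) + (M - s_i) ln(1 - theta_{z^(i)}) ]
t = np.array([theta[c] for c in cities])
closed_form = np.sum(np.log(1 / 3) + s * np.log(t) + (M - s) * np.log(1 - t))

# Direct computation: log P(z^(i)) plus the log-probability of every individual observation
direct = np.sum([np.log(1 / 3) + sum(np.log(theta[c] if obs else 1 - theta[c]) for obs in X[i])
                 for i, c in enumerate(cities)])

print(np.isclose(closed_form, direct))  # True
```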
Q7 Expectation Maximization 0 Points

Suppose the parameters for EM are initialized as

$$\bar{\theta}^{(0)} = [\gamma_1, \gamma_2, \mu_1, \mu_2, \sigma_1, \sigma_2]^T = [0.6, 0.4, 5, 6, 1, 1]^T$$

Using the notation established in class, which of the following is true of the posterior probability of data point $$x^{(1)} = 5$$ after the first E-step?

(•) $$p(1 \mid 1) > p(2 \mid 1)$$
( ) $$p(1 \mid 1) = p(2 \mid 1)$$
( ) $$p(1 \mid 1) < p(2 \mid 1)$$

EXPLANATION

$$x^{(1)}$$ lies on the mean of cluster 1, and cluster 1 has a larger prior probability.
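A quick numerical version of this argument computes the first E-step posterior for $$x^{(1)} = 5$$ (treating the $$\sigma$$'s as standard deviations; with $$\sigma = 1$$ the variance/standard-deviation distinction does not matter):

```python
import math

# Initial parameters from the question: mixing weights, means, standard deviations
gamma = [0.6, 0.4]
mu    = [5.0, 6.0]
sigma = [1.0, 1.0]
x1 = 5.0

def normal_pdf(x, m, s):
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

# E-step: p(j | 1) is proportional to gamma_j * N(x^(1); mu_j, sigma_j^2)
unnorm = [g * normal_pdf(x1, m, s) for g, m, s in zip(gamma, mu, sigma)]
post = [u / sum(unnorm) for u in unnorm]
print(post)               # roughly [0.712, 0.288]
print(post[0] > post[1])  # True: p(1|1) > p(2|1)
```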
Select the statements that are true about EM:

[ ] The EM algorithm always converges to the global optimum.
[ ] The log-likelihood of the objective function can decrease after an EM iteration.
[ ] The EM algorithm can only be used with GMM.
[x] None of the above.

EXPLANATION

The EM algorithm might get stuck in local maxima. The log-likelihood of the objective function can stay the same or increase, but it cannot decrease. The EM algorithm can be used with models other than GMMs. Thus, the answer is None of the above.

Q8 Bayesian Networks 0 Points

Sanjeev constructs a Bayesian network, shown below.

Which of the following statements must be true? Select all that apply.

$$P(Z_5, Z_7) = P(Z_5 \mid Z_7)P(Z_7)$$
$$P(Z_1)P(Z_2) = P(Z_1, Z_2)$$
$$P(Z_1, Z_3 \mid Z_2) = P(Z_1)P(Z_2)P(Z_3 \mid Z_1, Z_2)$$
$$P(Z_6, Z_8 \mid Z_3) = P(Z_6 \mid Z_3)P(Z_8 \mid Z_3)$$
None of the above

Naitian is interested in using a Bayesian Network to model the probability he will drink a hot chocolate. He identifies three variables that have influence over his decision:

The main activity he did today, A: either sledding, building a snowman, or coding indoors.
His current marshmallow supply, M: either he has marshmallows or he does not.
The time of day, T: either morning or evening.

He also has a fourth variable, lightness, denoted L, which represents whether it is dark or light outside. The variable H is a binary variable where 1 indicates he drank hot chocolate and 0 indicates he did not. The full Bayesian Network is shown below.

How many independent parameters are there in this model?

( ) None of the above

EXPLANATION

The factored joint probability of this graph is $$P(A)P(M)P(T)P(H \mid A, M, T)P(L \mid T)$$. We need two independent parameters for P(A) since A takes on three values, one independent parameter for each of P(M) and P(T) since M and T are binary variables, and one independent parameter for each combination of the parents' values in $$P(H \mid A, M, T)$$ and $$P(L \mid T)$$ since H and L are binary as well. So there are 3 ⋅ 2 ⋅ 2 = 12 independent parameters for $$P(H \mid A, M, T)$$ and 2 for $$P(L \mid T)$$. This means there are a total of 2 + 1 + 1 + 12 + 2 = 18 independent parameters.

Naitian reflects on his past hot chocolate drinking patterns and comes up with the following observation table. However, he is unsure of some of the table values. He recalls that the MLE estimate based on the observation table for $$\hat{\theta}(H = 1 \mid A = \text{Snowman}, T = \text{Evening}, M = \text{No}, L = \text{Light})$$ is $$\frac{2}{3}$$, $$\hat{\theta}(L = \text{Dark} \mid T = \text{Evening})$$ is $$\frac{3}{5}$$, and $$\hat{\theta}(A = \text{Sledding})$$ is …. Using that information about the MLE and the Bayesian Network in Q9.2, fill out the table's missing values, a and b:

| A | T | M | L | H |
|---|---|---|---|---|
| … | … | No | Light | a |
| … | … | No | Dark | 0 |
| Coding | Morning | Yes | Dark | 1 |
| Sledding | Evening | No | Dark | 1 |
| Sledding | Morning | No | Dark | 1 |

EXPLANATION

Solving for a: We are given that $$\hat{\theta}_H(H = 1 \mid A = \text{Snowman}, T = \text{Evening}, M = \text{No}, L = \text{Light})$$ is $$\frac{2}{3}$$, and we can also notice that L and H are independent, so we need only look at the rows that contain A = Snowman, T = Evening, M = No. There are 3 such rows: the one with the variable a, one with H = 0, and one with H = 1. Since we want the MLE to be $$\frac{2}{3}$$, we must set a = 1.

Solving for b: We are given that $$\hat{\theta}(L = \text{Dark} \mid T = \text{Evening})$$ is $$\frac{3}{5}$$, and we see that 5 rows have T = Evening, 3 of which have L = Dark. This means we should have b = Light to preserve the MLE value.

Q9 Hidden Markov Models 0 Points

A second order HMM is an HMM where the hidden states depend on the previous two states rather than just the one before. Below is a diagram of such an HMM.

Suppose we are using a second order HMM to model whether students will fall asleep during class after eating lunch. The observable states are binary, either asleep or awake, and the hidden states contain information about their lunch. Each day's lunch can include any one or more of bread, soda, and/or fruits.

How many hidden states are there?

EXPLANATION

Since each day's lunch can include any one or more of bread, soda, and/or fruits, there are $$2^3 - 1 = 7$$ possible lunches, resulting in 7 hidden states.

How many independent parameters are in the transition probability table?

EXPLANATION

Two previous hidden states determine the distribution over the current state, and the transition probability for the 7th value of the current state is fully determined by the other 6, so the number of independent parameters in the transition probability table is 7 ⋅ 7 ⋅ (7 − 1) = 294.

How many independent parameters are in the emission probability table?

EXPLANATION

There are 7 hidden states, each of which can emit either asleep or awake. The emission probability for asleep fully determines the probability for awake, so the number of independent parameters in the emission probability table is 7 ⋅ (2 − 1) = 7.

Write an expression for the likelihood function of the second order HMM.

[OPTIONAL] You may upload your work as an image.

EXPLANATION

$$L = P(Z_1) \cdot P(Z_2 \mid Z_1) \cdot \prod_{t=3}^{T} P(Z_t \mid Z_{t-1}, Z_{t-2}) \cdot \prod_{t=1}^{T} P(X_t \mid Z_t)$$

For the same dataset, we will never be able to find parameters that achieve a better likelihood value using the second order HMM than a first order HMM.

( ) True  (•) False

Q10 HMM and Viterbi 0 Points

Naitian creates a proof of concept HMM for a different problem with three hidden states (labeled D, E, F) and three observed states (labeled X, Y, Z) in order to test his code. Given the parameters in the table below, what is the likeliest sequence of hidden states for the observed sequence ZX? For your convenience, we've run the Viterbi algorithm and included the resulting output for the $$\psi(\cdot, \cdot)$$ function here.

Enter the hidden state sequence in the text box. In addition, either enter your work in the text box or upload your work as an image.

EXPLANATION

First, we can compute the hidden state at time t = 2 by taking the max value of the second row of the Viterbi table. We compute max{0.028, 0.0016, 0.0168} = 0.028, so $$z_2 = D$$. Next, we can find the hidden state at time t = 1 by following the second equation given in the previous slide: max{0.05 ⋅ 0.8, 0.04 ⋅ 0.2, 0.07 ⋅ 0} = 0.04, so $$z_1 = D$$. Therefore, the most likely hidden sequence is DD.