UNIVERSITY OF MICHIGAN
Department of Electrical Engineering and Computer Science
EECS 445 — Introduction to Machine Learning, Winter 2022
Homework 3, Due: Wed. 4/6 at 10:00pm

1 Backpropagation [10 pts]
Consider the fully connected neural network given below. We name the output vectors and weights of these layers using the convention specified in the figure: here $w^{(1)}_{ji}$ is the weight on the edge connecting $x_i$ to $h^{(2)}_j$,
and $w^{(2)}_{1j}$ is the weight connecting $h^{(2)}_j$ to $z^{(3)}$. As before, interpret $w^{(1)}_{j0}$ and $w^{(2)}_{10}$ as the weights on the edges connecting the bias nodes to $h^{(2)}_j$ and $z^{(3)}$, respectively.
[Figure: two-layer fully connected network with inputs $x_1, \dots, x_4$ plus a bias node $+1$, hidden units $h^{(2)}_1, h^{(2)}_2$ plus a bias node, and output $z^{(3)}$; edges carry weights $w^{(1)}_{ji}$ and $w^{(2)}_{1j}$.]
The current weights are given in the tables below (the $+1$ rows denote the bias weights $w^{(1)}_{j0}$ and $w^{(2)}_{10}$):

  Weights     h^(2)_1   h^(2)_2
  x1            +1        −1
  x2            −1         0
  x3             0        +1
  x4            −1         0
  +1            +1        −1

  Weights     z^(3)
  h^(2)_1      +1
  h^(2)_2      −1
  +1           +1
Suppose the loss function is squared loss: $L(y, z) = \frac{1}{2}(y - z)^2$, and the hidden neurons apply the ReLU activation function. Assume the output neuron does not use an activation function.
(a) [3 pts] Write out the expression for $h^{(2)}_1$ and $h^{(2)}_2$ in terms of the weights of the first layer $w^{(1)}_{ji}$ and the inputs $x_i$. Then, write out the expression for $z^{(3)}$ in terms of the weights of the second layer $w^{(2)}_{1j}$ and the hidden neurons $h^{(2)}_j$.
$$h^{(2)}_1 = \text{ReLU}\Big(\sum_{i=1}^{4} w^{(1)}_{1i} x_i + w^{(1)}_{10}\Big)$$
$$h^{(2)}_2 = \text{ReLU}\Big(\sum_{i=1}^{4} w^{(1)}_{2i} x_i + w^{(1)}_{20}\Big)$$
$$z^{(3)} = \sum_{i=1}^{2} w^{(2)}_{1i} h^{(2)}_i + w^{(2)}_{10}$$
(b) [2 pts] Consider an arbitrary weight in the first layer, $w^{(1)}_{ji}$, and suppose we wanted to update this weight via backpropagation and gradient descent. Write out the update step for this weight in terms of $w^{(1)}_{ji,\text{new}}$, $w^{(1)}_{ji,\text{old}}$, $\eta$, and the expanded form of $\frac{\partial L}{\partial w^{(1)}_{ji}}$. There is no need to compute the derivatives in this part.
$\frac{\partial L}{\partial w^{(1)}_{ji}}$ can be expanded as:
$$\frac{\partial L}{\partial w^{(1)}_{ji}} = \frac{\partial L}{\partial z^{(3)}} \cdot \frac{\partial z^{(3)}}{\partial h^{(2)}_j} \cdot \frac{\partial h^{(2)}_j}{\partial z^{(2)}_j} \cdot \frac{\partial z^{(2)}_j}{\partial w^{(1)}_{ji}}$$
The update step is then:
$$w^{(1)}_{ji,\text{new}} = w^{(1)}_{ji,\text{old}} - \eta \cdot \frac{\partial L}{\partial z^{(3)}} \cdot \frac{\partial z^{(3)}}{\partial h^{(2)}_j} \cdot \frac{\partial h^{(2)}_j}{\partial z^{(2)}_j} \cdot \frac{\partial z^{(2)}_j}{\partial w^{(1)}_{ji}}$$
(c) [5 pts] Now we will perform an update. Calculate the updated weight $w^{(1)}_{11,\text{new}}$ after a single SGD iteration with the step size $\eta = 1$, using a training input $\bar{x} = [1, -1, 1, 0]^T$ and a label $y = 3$, and show your work below.
Solution: First we forward propagate $\bar{x}$ through the network.
$$z^{(2)}_1 = \sum_{i=1}^{4} w^{(1)}_{1i} x_i + w^{(1)}_{10} = (1)(1) + (-1)(-1) + (0)(1) + (-1)(0) + (1) = 3 \;\Longrightarrow\; h^{(2)}_1 = \text{ReLU}(z^{(2)}_1) = 3$$
$$z^{(2)}_2 = \sum_{i=1}^{4} w^{(1)}_{2i} x_i + w^{(1)}_{20} = (-1)(1) + (0)(-1) + (1)(1) + (0)(0) + (-1) = -1 \;\Longrightarrow\; h^{(2)}_2 = \text{ReLU}(z^{(2)}_2) = 0$$
$$z^{(3)} = \sum_{i=1}^{2} w^{(2)}_{1i} h^{(2)}_i + w^{(2)}_{10} = (1)(3) + (-1)(0) + (1) = 4$$
Now we can write out the backpropagation formula to update weight $w^{(1)}_{11}$:
$$\frac{\partial L}{\partial w^{(1)}_{11}} = \frac{\partial L}{\partial z^{(3)}} \cdot \frac{\partial z^{(3)}}{\partial h^{(2)}_1} \cdot \frac{\partial h^{(2)}_1}{\partial z^{(2)}_1} \cdot \frac{\partial z^{(2)}_1}{\partial w^{(1)}_{11}}$$
$$\frac{\partial L}{\partial z^{(3)}} = 2 \cdot \frac{1}{2}(y - z^{(3)}) \cdot (-1) = -(y - z^{(3)}) = -(3 - 4) = 1$$
$$\frac{\partial z^{(3)}}{\partial h^{(2)}_1} = \frac{\partial \big[\sum_{i=1}^{2} w^{(2)}_{1i} h^{(2)}_i + w^{(2)}_{10}\big]}{\partial h^{(2)}_1} = w^{(2)}_{11} = 1$$
$$\frac{\partial h^{(2)}_1}{\partial z^{(2)}_1} = \frac{\partial \big[\text{ReLU}(z^{(2)}_1)\big]}{\partial z^{(2)}_1} = [[z^{(2)}_1 > 0]] = [[3 > 0]] = 1$$
$$\frac{\partial z^{(2)}_1}{\partial w^{(1)}_{11}} = \frac{\partial \big[\sum_{i=1}^{4} w^{(1)}_{1i} x_i + w^{(1)}_{10}\big]}{\partial w^{(1)}_{11}} = x_1 = 1$$
So, we have:
$$\frac{\partial L}{\partial w^{(1)}_{11}} = (1) \cdot (1) \cdot (1) \cdot (1) = 1$$
The actual weight update is calculated as follows:
$$w^{(1)}_{11,\text{new}} = w^{(1)}_{11,\text{old}} - \eta \frac{\partial L}{\partial w^{(1)}_{11}} = 1 - (1)(1) = 0.$$
Therefore, the updated value of $w^{(1)}_{11}$ is 0.
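For readers who want to double-check the arithmetic, here is a small numerical sketch (not part of the assignment's starter code) that reproduces the forward pass and the gradient computed above:

    import numpy as np

    # Weights from the tables above (rows: hidden unit; columns: x1..x4, then bias).
    W1 = np.array([[1., -1., 0., -1., 1.],
                   [-1., 0., 1., 0., -1.]])
    w2 = np.array([1., -1., 1.])          # [w^(2)_11, w^(2)_12, bias w^(2)_10]
    x = np.array([1., -1., 1., 0., 1.])   # training input with bias term appended
    y = 3.0

    z2 = W1 @ x                           # pre-activations: [3, -1]
    h2 = np.maximum(z2, 0.0)              # ReLU outputs: [3, 0]
    z3 = w2[:2] @ h2 + w2[2]              # network output: 4

    dL_dz3 = -(y - z3)                    # 1
    dz3_dh1 = w2[0]                       # 1
    dh1_dz1 = float(z2[0] > 0)            # 1
    dz1_dw11 = x[0]                       # 1
    grad_w11 = dL_dz3 * dz3_dh1 * dh1_dz1 * dz1_dw11
    print(z3, grad_w11, W1[0, 0] - 1.0 * grad_w11)   # 4.0 1.0 0.0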
2 Clustering [21 pts]
In this problem, we will implement spectral and k-means clustering and compare the results of the two algorithms. For all conceptual questions, assume that the number of clusters k ≤ n, the number of points.
2.1 Spectral Clustering [9 pts]
In this problem, we will be exploring and implementing spectral clustering (you are allowed to use scikit-learn functions in your implementation). You will write your code in the file spectral.py. Consider the dataset mickey.csv that we have provided, consisting of 400 data points in a two-dimensional feature space.
(a) [2 pts] Implement the plot_spectral_num_cluster function in spectral.py by following the instructions in the code comments. Include a screenshot (or equivalent) of your implemented function as your solution to this question.
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(gamma=gamma).fit(X)  # fit so affinity_matrix_ is available; X: data, gamma: RBF kernel width
lap_matrix = laplacian(spectral.affinity_matrix_, normed=True)  # normalized graph Laplacian
eigenvalues, _ = eigh(lap_matrix)  # eigenvalues returned in ascending order
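A minimal sketch of how the rest of the function might plot the result (the output filename is illustrative, not part of the starter code; eigenvalues comes from the snippet above):

    import matplotlib.pyplot as plt

    plt.plot(range(1, 11), eigenvalues[:10], marker="o")   # the 10 smallest eigenvalues
    plt.xlabel("index")
    plt.ylabel("eigenvalue of the normalized Laplacian")
    plt.savefig("spectral_num_cluster.png")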
(b) [2 pts] Implement the visualize_spectral function in spectral.py by following the instructions in the code comments. Include a screenshot (or equivalent) of your implemented function as your solution to this question.
spectral = SpectralClustering(n_clusters=n_clusters, gamma=gamma)
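One possible shape for this function, sketched under the assumption that it loops over the listed gamma values and scatter-plots the resulting assignments (the signature, filename, and figure layout are illustrative, not the actual starter code):

    import matplotlib.pyplot as plt
    from sklearn.cluster import SpectralClustering

    def visualize_spectral(X, n_clusters=3, gammas=(0.1, 1, 10)):
        fig, axes = plt.subplots(1, len(gammas), figsize=(12, 4))
        for ax, gamma in zip(axes, gammas):
            labels = SpectralClustering(n_clusters=n_clusters, gamma=gamma).fit_predict(X)
            ax.scatter(X[:, 0], X[:, 1], c=labels, s=10)   # color points by cluster assignment
            ax.set_title(f"gamma = {gamma}")
        plt.savefig("spectral_clusters.png")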
(c) Run spectral.py and answer the following questions. Please include all generated plots in your write-up!
i. [2 pts] Plot the smallest 10 eigenvalues of the Laplacian matrix as the output of the
plot_spectral_num_cluster function and include the plot in your write-up. Explain which number of clusters you would choose based on the plot.
ii. [3 pts] Visualize the spectral clustering assignments for gammas = [0.1, 1, 10] as the output of the visualize_spectral function and include your plot in your write-up. Explain which value of gamma you would choose based on the visualization and why this value of gamma works well for this dataset.
Solution: The plot should look like Figure 1.
Figure 1: The smallest 10 eigenvalues of the Laplacian
Any number of clusters is acceptable as long as the reasoning is sound. For example, we can choose 3 clusters since the eigenvalue gap has its first large jump after k = 3.
Solution: The plot should look like Figure 2.
Figure 2: Visualization of the results of spectral clustering
The visualization indicates gamma = 10 is the best. The larger value of gamma works well because larger gamma weighs closer connections more heavily. Thus, for the larger gamma, cluster assignments are based on regions of high point density rather than distance to centroids.
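For reference, scikit-learn's SpectralClustering builds the affinity matrix with the RBF kernel by default, $A_{kl} = \exp(-\gamma \, \|\bar{x}^{(k)} - \bar{x}^{(l)}\|^2)$ (we assume the starter code keeps this default); with a larger gamma, the affinity between far-apart points decays toward zero, so only tight neighborhoods remain strongly connected in the similarity graph.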
2.2 k-Means Clustering [9 pts]
In this problem, you will be exploring the k-means clustering algorithm as well as a variant of the algorithm, k-means++, which has the additional property of encouraging good cluster centroid initializations. These are simple yet powerful unsupervised methods for grouping similar examples within a dataset. Consider the dataset unbalance.csv, consisting of 6500 points with two features each.
(a) [1 pt] Why are poor centroid initializations often a problem?
(b) [2 pts] Implement the visualize_kmeans function in kmeans.py by following the instructions in the code comments. Include a screenshot (or equivalent) of your implemented function as your solution to this question.
Solution: Poor centroid initialization could cause the model to converge to an arbitrarily bad local minimum. Specifically for k-means, each iteration makes a greedy update with no regard for global optimality.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=n_clusters, init=init, random_state=20).fit(X)  # X: the data points
print(kmeans.inertia_)  # final value of the k-means objective (sum of squared distances to centroids)
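As a usage sketch (assuming X already holds the 6500 points loaded from unbalance.csv), the two initialization methods can be compared on the final objective like this:

    from sklearn.cluster import KMeans

    for init in ("random", "k-means++"):
        km = KMeans(n_clusters=8, init=init, random_state=20).fit(X)
        print(init, km.inertia_)   # lower inertia_ means a better clustering objective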
(c) Run kmeans.py and answer the following questions. The program takes in three command-line arguments:
• Data file: a .csv file
• k: an integer number of clusters
• Initialization method: either k-means++ or random
In this part, use the following command, where the initialization method (either k-means++ or random) is appended as the final argument:
$ python3 kmeans.py unbalance.csv 8
i. [2 pts] Print the final value of the objective function using random initialization as well as k-means++. Make sure to set random_state = 20 in the KMeans object to allow for proper comparison. Compare the two values and explain any discrepancies you might see.
ii. [2 pts] Visualize the resulting clusters using both random initialization and k-means++. Include your plots below.
Solution: The final value of the objective function is lower using k-means++ than when using random initialization. This is because k-means++ is less susceptible to ending up in local minima than random initialization, since the initial centroids are chosen to be as far from each other as possible.
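To make that last point concrete, here is a minimal sketch of the k-means++ seeding rule (illustrative only; the assignment uses scikit-learn's built-in init='k-means++'): each new centroid is sampled with probability proportional to the squared distance to the nearest centroid chosen so far.

    import numpy as np

    def kmeans_pp_init(X, k, seed=20):
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]            # first centroid: uniform at random
        for _ in range(k - 1):
            C = np.array(centers)
            d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
            probs = d2 / d2.sum()                      # D^2 weighting favors far-away points
            centers.append(X[rng.choice(len(X), p=probs)])
        return np.array(centers)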
Solution: The plots should look like Figures 3 and 4.
Figure 3: Visualization of the results of k-means++
Figure 4: Visualization of the results of k-means, random initialization
iii. [2 pts] Compare the two generated plots. Which method provides a better solution, and why does that method work better?

Solution: The visualization indicates that k-means++ performs better than k-means with random initialization. Since k-means++ avoids picking centroids that are close to each other, we avoid the situation shown in the second plot, where the clusters vary greatly in size. That situation corresponds to getting "stuck" in a local minimum of the objective function.
2.3 Comparison [3 pts]
We now revisit the mickey.csv dataset to compare the performance of both the k-means and spectral clustering algorithms.
(a) [1 pt] Run kmeans.py on mickey.csv with 3 clusters and k-means++ initialization. Include your plot below – you may want to change the filename the plot is saved under so that your earlier plot for Question 2.2(c)(ii) is not overwritten. Then, run the following command:
$ python3 kmeans.py mickey.csv 3 k-means++
Solution: The plot should look like Figure 5.
Figure 5: Visualization of k-means Clusters
(b) [2 pts] Compare the graph from 2.1(c)(ii) to the one above. Which method provides a better solution, and why does that method work better?
Solution: Comparing spectral and k-means clustering for the case k = 3, the spectral clustering better captures the three natural clusters, especially the “ears”. This is because k-means clustering works best on homogeneous, globular clusters, and the clusters here are of varying density, which spectral clustering is better at handling.
3 Hierarchical Clustering [8 pts]
In this problem, we will use scikit-learn functions to implement hierarchical clustering. Consider the dataset campus.csv, which represents the expected walking time between six buildings on campus as seen in Table 1.
Table 1: The expected time of arrival between six campus buildings

  ETA (min.)   East Hall  Mason Hall  GGBL  BBB  Chrysler  FXB
  East Hall        0          9        47    44     40      48
  Mason Hall       9          0        48    45     40      49
  GGBL            47         48         0     4     10       2
  BBB             44         45         4     0      8       5
  Chrysler        40         40        10     8      0       8
  FXB             48         49         2     5      8       0
(a) [4 pts] Implement the plot_hierarchical_dendrogram function in hierarchical.py by following the instructions in the code comments.
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

agglomerative = AgglomerativeClustering(n_clusters=None, affinity='precomputed',
                                        linkage='complete', distance_threshold=0).fit(dist_matrix)  # dist_matrix: the precomputed ETA matrix
dendrogram(linkage_matrix, labels=buildings)  # linkage_matrix is built from the fitted model (see below); buildings: list of names
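The linkage matrix that scipy's dendrogram expects is not produced directly by scikit-learn; a common way to build it from the fitted model (adapted from the scikit-learn dendrogram example, with agglomerative and buildings as the names assumed above) is:

    import numpy as np

    # Count the original samples under each merge so the dendrogram is scaled correctly.
    counts = np.zeros(agglomerative.children_.shape[0])
    n_samples = len(agglomerative.labels_)
    for i, merge in enumerate(agglomerative.children_):
        counts[i] = sum(1 if child < n_samples else counts[child - n_samples] for child in merge)
    linkage_matrix = np.column_stack(
        [agglomerative.children_, agglomerative.distances_, counts]
    ).astype(float)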
(b) [4 pts] Plot the dendrogram as the output of the plot_hierarchical_dendrogram function. List the cluster assignments when there are 2 and 4 total clusters, respectively, based on your dendrogram.
Solution: The plot should look like Figure 6.
Figure 6: Dendrogram for hierarchical clustering
Two clusters: {Mason Hall, East Hall}, {Chrysler Center, BBB, GGBL, FXB}.
Four clusters: {Mason Hall}, {East Hall}, {Chrysler Center}, {BBB, GGBL, FXB}. Another acceptable solution swaps "East Hall" and "Mason Hall", as shown below:
Figure 7: Dendrogram for hierarchical clustering
Two clusters: {East Hall, Mason Hall}, {Chrysler Center, BBB, GGBL, FXB}.
Four clusters: {East Hall}, {Mason Hall}, {Chrysler Center}, {BBB, GGBL, FXB}.
4 UV Decomposition [14 pts]
We can think of collaborative filtering as matrix factorization. Given a binary $n \times m$ matrix $Y$ in which there are empty cells, we want to use the low-rank matrix $UV^T$ to approximate the observed binary labels in $Y$. Let $Y_{ij} \in \{-1, 1\}$ if the $(i, j)$ entry is observed, and $Y_{ij} = 0$ otherwise. Then we would like to find $U$ and $V$ such that the following condition is satisfied:
$$Y_{ij}[UV^T]_{ij} = Y_{ij}(\bar{u}^{(i)} \cdot \bar{v}^{(j)}) > 0 \quad \text{whenever } Y_{ij} \neq 0$$
It is often advantageous to allow some of the constraints to be violated, e.g., if some of the labels may be mistakes. We might also suspect that we gain something by turning the estimation problem into multiple maximum-margin problems: one for the $\bar{u}^{(i)}$'s given $\{\bar{v}^{(j)}\}$ and one for the $\bar{v}^{(j)}$'s given $\{\bar{u}^{(i)}\}$. The formulation with slack is given by:
$$\min_{U, V, \xi} \;\sum_{i=1}^{n} \frac{1}{2}\|\bar{u}^{(i)}\|^2 + \sum_{j=1}^{m} \frac{1}{2}\|\bar{v}^{(j)}\|^2 + C \sum_{i,j:\, Y_{ij} \neq 0} \xi_{ij}$$
subject to $Y_{ij}(\bar{u}^{(i)} \cdot \bar{v}^{(j)}) \geq 1 - \xi_{ij}$,
$\xi_{ij} \geq 0$ for all $(i, j)$ where $Y_{ij} \neq 0$.
(a) [3 pts] If we fix $U$, i.e., fix all $\bar{u}^{(1)}, \dots, \bar{u}^{(n)}$, the optimization problem reduces to $m$ independent optimization problems. Define these optimization problems for $\bar{v}^{(j)}$ where $j = 1, \dots, m$. What previous algorithm that we have learned does this optimization problem resemble? (Hint: write out the minimization problem with respect to each $\bar{v}^{(j)}$.)
Solution: If we fix $U$, i.e., fix all $\bar{u}^{(1)}, \dots, \bar{u}^{(n)}$, we essentially reduce the problem to, for each $\bar{v}^{(j)}$, $j = 1, \dots, m$,
$$\min_{\bar{v}^{(j)}, \xi_{ij}} \;\frac{1}{2}\|\bar{v}^{(j)}\|^2 + C \sum_{i:\, Y_{ij} \neq 0} \xi_{ij}$$
subject to $Y_{ij}(\bar{u}^{(i)} \cdot \bar{v}^{(j)}) \geq 1 - \xi_{ij}$,
$\xi_{ij} \geq 0$ for all $i$ where $Y_{ij} \neq 0$.
Namely, we now have $m$ independent soft-margin SVM problems, one for each $\bar{v}^{(j)}$, with labeled examples $\{(\bar{u}^{(i)}, Y_{ij})\}_{i:\, Y_{ij} \neq 0}$.
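To connect this reduction to code, here is a hedged sketch of solving one such per-column subproblem with an off-the-shelf SVM solver (an illustration only, not something the homework requires): LinearSVC with fit_intercept=False and loss='hinge' minimizes $\frac{1}{2}\|w\|^2 + C \sum \text{hinge}$, which matches the form above up to solver tolerance, assuming the observed column contains both +1 and -1 labels.

    import numpy as np
    from sklearn.svm import LinearSVC

    def solve_v_j(U, y_col, C=1.0):
        """Fit v^(j) given fixed U (n x d) and the j-th column of Y (entries in {-1, 0, +1})."""
        observed = y_col != 0                     # only rated entries contribute constraints
        svm = LinearSVC(C=C, loss="hinge", fit_intercept=False, dual=True)
        svm.fit(U[observed], y_col[observed])     # examples (u^(i), Y_ij) with Y_ij != 0
        return svm.coef_.ravel()                  # learned v^(j)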
(b) [3 pts] Similar to part (a), if we fix $V$, what does the optimization problem reduce to? And what previous algorithm that we have learned does this optimization problem resemble?
Solution: By symmetry, if we fix $V$, i.e., fix all $\bar{v}^{(1)}, \dots, \bar{v}^{(m)}$, the problem reduces to $n$ independent soft-margin SVM problems, one for each $\bar{u}^{(i)}$, with labeled examples $\{(\bar{v}^{(j)}, Y_{ij})\}_{j:\, Y_{ij} \neq 0}$:
$$\min_{\bar{u}^{(i)}, \xi_{ij}} \;\frac{1}{2}\|\bar{u}^{(i)}\|^2 + C \sum_{j:\, Y_{ij} \neq 0} \xi_{ij}$$
subject to $Y_{ij}(\bar{u}^{(i)} \cdot \bar{v}^{(j)}) \geq 1 - \xi_{ij}$,
$\xi_{ij} \geq 0$ for all $j$ where $Y_{ij} \neq 0$.
Another acceptable optimization formulation is:
$$\min_{U, \xi_{ij}} \;\frac{1}{2}\|\bar{u}^{(i)}\|^2 + C \sum_{j:\, Y_{ij} \neq 0} \xi_{ij}$$
subject to $Y_{ij}(\bar{u}^{(i)} \cdot \bar{v}^{(j)}) \geq 1 - \xi_{ij}$,
$\xi_{ij} \geq 0$ for all $j$ where $Y_{ij} \neq 0$.
(c) [8 pts] Consider a simple two-user and two-movie example, and let the observed rating matrix and our initial guess for $V$ be given by:
$$Y = \begin{bmatrix} 1 & 1 \\ -1 & 0 \end{bmatrix}, \qquad V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
i. [2 pts] Now we would like to solve by hand for $\hat{U} = [\bar{u}^{(1)}, \bar{u}^{(2)}]^T$, given the initial $V$. First, write out the explicit form of the optimization problems for $\bar{u}^{(1)}$ and $\bar{u}^{(2)}$ (i.e., no $i, j$ indices or summation symbols; substitute in the relevant values from the $Y$ and $V$ matrices).
For $\bar{u}^{(1)}$, the minimization problem is:
$$\min_{\bar{u}^{(1)}, \xi_{11}, \xi_{12}} \;\frac{1}{2}\|\bar{u}^{(1)}\|^2 + C\xi_{11} + C\xi_{12}$$
subject to $Y_{11}(\bar{u}^{(1)} \cdot \bar{v}^{(1)}) \geq 1 - \xi_{11}$, $Y_{12}(\bar{u}^{(1)} \cdot \bar{v}^{(2)}) \geq 1 - \xi_{12}$.
Plugging in $\bar{v}^{(1)}, \bar{v}^{(2)}, Y_{11}, Y_{12}$, we have constraints:
$$(1)\,\bar{u}^{(1)}_1 \geq 1 - \xi_{11}, \qquad (1)\,\bar{u}^{(1)}_2 \geq 1 - \xi_{12}$$
For $\bar{u}^{(2)}$, the minimization problem is:
$$\min_{\bar{u}^{(2)}, \xi_{21}} \;\frac{1}{2}\|\bar{u}^{(2)}\|^2 + C\xi_{21}$$
subject to $Y_{21}(\bar{u}^{(2)} \cdot \bar{v}^{(1)}) \geq 1 - \xi_{21}$.
Plugging in $\bar{v}^{(1)}, Y_{21}$, we have the constraint:
$$(-1)\,\bar{u}^{(2)}_1 \geq 1 - \xi_{21}$$
ii. [4 pts] Let's take $C = \infty$. Solve the optimization problems for $\bar{u}^{(1)}$ and $\bar{u}^{(2)}$.
Note that when $C = \infty$, these become hard-margin SVM problems, i.e., $\xi_{11} = \xi_{12} = \xi_{21} = 0$. For $\bar{u}^{(1)}$, the minimization problem is:
$$\min_{\bar{u}^{(1)}} \;\frac{1}{2}\|\bar{u}^{(1)}\|^2$$
subject to $Y_{11}(\bar{u}^{(1)} \cdot \bar{v}^{(1)}) \geq 1$, $Y_{12}(\bar{u}^{(1)} \cdot \bar{v}^{(2)}) \geq 1$.
Plugging in $\bar{v}^{(1)}, \bar{v}^{(2)}, Y_{11}, Y_{12}$, we have constraints:
$$(1)\,\bar{u}^{(1)}_1 \geq 1, \qquad (1)\,\bar{u}^{(1)}_2 \geq 1$$
The original optimization problem is minimized when:
$$\bar{u}^{(1)}_1 = 1, \qquad \bar{u}^{(1)}_2 = 1$$
Therefore, we have $\bar{u}^{(1)*} = [1, 1]^T$. For $\bar{u}^{(2)}$, the minimization problem is:
$$\min_{\bar{u}^{(2)}} \;\frac{1}{2}\|\bar{u}^{(2)}\|^2$$
subject to $Y_{21}(\bar{u}^{(2)} \cdot \bar{v}^{(1)}) \geq 1$.
Plugging in $\bar{v}^{(1)}, Y_{21}$, we have the constraint:
$$(-1)\,\bar{u}^{(2)}_1 \geq 1$$
The original optimization problem is minimized when:
$$\bar{u}^{(2)}_1 = -1, \qquad \bar{u}^{(2)}_2 = 0$$
Therefore, we have $\bar{u}^{(2)*} = [-1, 0]^T$.
iii. [1 pt] Given the initial $V$ and your solution $\hat{U}$, can you predict $Y_{22}$ as label $-1$ or $1$? If so, what is your prediction? If not, please explain.
iv. [1 pt] In real-life recommender system development, we sometimes run into the issue where the predictions of some labels in $\hat{Y}$ are never clear to us; a specific instance of this problem is the Cold Start Problem. If you were a recommender system developer, what would you do if you encountered such a problem (i.e., no initial label, and the algorithm you are using does not give a reasonable prediction or does not give a prediction at all)?
Solution: Since $\hat{Y}_{22} = \bar{u}^{(2)*} \cdot \bar{v}^{(2)} = [-1, 0]^T \cdot [0, 1]^T = 0$, it is unclear whether $\hat{Y}_{22}$ should be predicted as $-1$ or $1$.
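A quick numeric sanity check of parts (ii) and (iii) (a sketch, not required by the assignment):

    import numpy as np

    U_hat = np.array([[1, 1], [-1, 0]])   # rows: u^(1)*, u^(2)*
    V = np.eye(2)                         # initial V: v^(1), v^(2)
    print(U_hat @ V.T)                    # [[ 1.  1.] [-1.  0.]] -- observed signs match Y; entry (2,2) is 0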
Solution: Open-ended question.
A sample answer could be: change the algorithm, or make an initial guess (e.g., the average rating, a nearest-neighbor estimate, a random guess, etc.).