
• Section A. Linear Regression (10 points)
1. (1 point) Linear regression:
(A) is a form of unsupervised learning
(B) is a form of supervised learning
(C) can be used in classification, where the loss function does not differentiate between
“easy” and “hard” training samples.
(D) can be used in classification but is not robust to outliers.
Your choice(s): BCD
1 correct answer: 0.33 2 correct answers: 0.67 3 correct answers: 1
2. (2 points) You have been hired by a restaurant to model how many people are going to eat there on a given day. The dataset you have consists of pairs (x, y), where x is the number of days since the restaurant opened and y is the number of people who ate there on that day. For example, the pair (0, 61) indicates that 61 people ate at the restaurant on its opening night. Your task is to use linear regression to predict y as a function of x. Given that you know the number of customers depends primarily on cyclic factors like the season and day of the week, use no more than 2 sentences to describe what might be a suitable choice of features.
3. Assume that a dataset has N training samples $(x_i, y_i)$, $i = 1, \dots, N$, where $x_i \in \mathbb{R}$, $y_i \in \mathbb{R}$. We assume the following linear relationship between x and y:
$$y = x + b$$
In this model, we have only one parameter to optimize, i.e., b.
(a) (3 points) Using mean squared error, derive the closed-form solution for the parameter b in terms of the training samples $(x_i, y_i)$, $i = 1, \dots, N$. (Hint: linear regression from the statistics point of view)
(b) (2 points) Now we have N = 2 samples, (1, 2) and (2, 1). Calculate b and the value of the mean squared error in 3a.
(c) (2 points) Is y = x + b the optimal model for the 2 points in 3b? On the figure below, draw the fitted line that you think is optimal.
Your answer: (1pt) We can use sine or cosine functions of x as our choice of features. If students write polynomials, they should receive 0 marks.
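For illustration, a minimal Python sketch of this feature choice (the weekly and yearly periods and the toy data are assumptions for the example, not part of the question):

import numpy as np

# Hypothetical example: encode suspected weekly and yearly cycles as
# sine/cosine features, then fit ordinary least squares.
def cyclic_features(x):
    return np.column_stack([
        np.sin(2 * np.pi * x / 7),      np.cos(2 * np.pi * x / 7),       # weekly cycle
        np.sin(2 * np.pi * x / 365.25), np.cos(2 * np.pi * x / 365.25),  # yearly cycle
        np.ones_like(x),                                                 # bias term
    ])

x = np.arange(400, dtype=float)          # days since the restaurant opened
y = 60 + 15 * np.sin(2 * np.pi * x / 7)  # toy customer counts with a weekly cycle
w, *_ = np.linalg.lstsq(cyclic_features(x), y, rcond=None)
y_pred = cyclic_features(x) @ w          # fitted daily customer counts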
[Figure: plot of the two training samples $(x_1, y_1) = (1, 2)$ and $(x_2, y_2) = (2, 1)$, with both the x and y axes running from 0 to 3.]
Write your answers to 3(a), 3(b), 3(c) on your paper. The fitted line in 3(c) should be drawn on the figure provided on Page 2.
Your answer:
(a)
We first formulate the objective function:
$$L(x, y, b) = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - (x_i + b)\big)^2 = \frac{1}{N}\sum_{i=1}^{N}\big((y_i - x_i) - b\big)^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - x_i)^2 - \frac{2b}{N}\sum_{i=1}^{N}(y_i - x_i) + b^2$$
Therefore, we can derive the closed-form solution by calculating the gradient and setting it to zero:
$$\frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}(y_i - x_i) + 2b = 0$$
Therefore,
$$b = \frac{1}{N}\sum_{i=1}^{N}(y_i - x_i)$$
(b) Substituting the two samples into our equations, we have
$$b = \frac{1}{2}\big((2 - 1) + (1 - 2)\big) = 0$$
and
$$L(x, y, b) = \frac{1}{2}\big((2 - (1 + 0))^2 + (1 - (2 + 0))^2\big) = 1$$
(c) No. The optimal fit is the straight line through (1, 2) and (2, 1), i.e. $y = -x + 3$.
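As a quick numerical check of 3(a) and 3(b), the following Python sketch (assuming numpy is available) computes $b = \frac{1}{N}\sum_i (y_i - x_i)$ and the resulting MSE for the two samples:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 1.0])
b = np.mean(y - x)                 # closed-form solution from 3(a)
mse = np.mean((y - (x + b)) ** 2)  # mean squared error from 3(b)
print(b, mse)                      # 0.0 1.0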
• Section B. Probability (10 points)
I have a bag, containing three six-sided dice: a green die g, a red die r, and a blue die b. These
dice have different sets of numbers on their faces, as described below,
g = {2,2,2,2,2,3} r = {1,2,3,4,5,6} b = {0,1,1,1,8,11}
Figure 1: Two dice (the numbers on them are different from our question). “Die” is the singular form of “dice”.
(1) (2 points) Whatever number you roll, you get that many dollars. You want to win as much money as possible on average. Which die do you choose, and why?
Your answer:
Since we would get dollars according to the number we roll, we calculate the expected value for each die. Let x be the money we get. Then
$$E_g(x) = 2 \times \tfrac{5}{6} + 3 \times \tfrac{1}{6} = \tfrac{13}{6}$$
$$E_r(x) = \sum_{i=1}^{6} i \times \tfrac{1}{6} = \tfrac{21}{6} = 3.5$$
$$E_b(x) = 0 \times \tfrac{1}{6} + 1 \times \tfrac{3}{6} + 8 \times \tfrac{1}{6} + 11 \times \tfrac{1}{6} = \tfrac{22}{6} = \tfrac{11}{3}$$
As $E_b(x) > E_r(x) > E_g(x)$ and we want to win as much as possible, we should choose the blue die.
(2) (2 points) What would be a reasonable way to define how random a die is? Justify why you think your definition is a suitable definition of randomness. Which die under your definition is the most random one?
Your answer:
We can use the standard deviation of each die as an indicator. That is,
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
where N = 6 for all the dice, $x_i$ is the number on each face, and $\mu$ is the mean calculated in (1). The larger the variance, the more uncertain the outcome of the die, so the more random it is. Under this definition, we have
$$\sigma_g^2 = \tfrac{5}{36}, \quad \sigma_r^2 = \tfrac{35}{12}, \quad \sigma_b^2 = \tfrac{161}{9}$$
Therefore, the blue die is the most random one.
Some equivalent measures (e.g. variance) are also valid.
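The expected values from (1) and the variances above can be checked exactly with Python's fractions module:

from fractions import Fraction

dice = {
    "green": [2, 2, 2, 2, 2, 3],
    "red":   [1, 2, 3, 4, 5, 6],
    "blue":  [0, 1, 1, 1, 8, 11],
}
for name, faces in dice.items():
    faces = [Fraction(f) for f in faces]
    mu = sum(faces) / 6                           # expected winnings
    var = sum((f - mu) ** 2 for f in faces) / 6   # population variance
    print(name, mu, var)  # green 13/6 5/36, red 7/2 35/12, blue 11/3 161/9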
(3) (2 points) Suppose I choose the blue die, and you choose the green die. We both roll our dice, and whoever has the higher number wins. With what probability do you win this game?
Your answer:
Let x be a random variable representing the number rolled. We calculate the probability by enumerating all the cases. That is,
$$P_g(x = 2)\,P_b(x \in \{0, 1\}) + P_g(x = 3)\,P_b(x \in \{0, 1\}) = \tfrac{5}{6} \times \tfrac{4}{6} + \tfrac{1}{6} \times \tfrac{4}{6} = \tfrac{2}{3}$$
Therefore, the probability of winning with the green die is $\tfrac{2}{3}$.
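This can also be verified by brute-force enumeration of the 36 equally likely (green, blue) outcomes, for example in Python:

from fractions import Fraction
from itertools import product

green = [2, 2, 2, 2, 2, 3]
blue = [0, 1, 1, 1, 8, 11]
# count outcomes where the green roll strictly beats the blue roll
wins = sum(g > b for g, b in product(green, blue))
print(Fraction(wins, len(green) * len(blue)))  # 2/3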
(4) (4 points) Suppose I put the three dice in a bag, shake them up, and draw one out randomly. I roll the die, and you observe the result to be a 1. Suppose furthermore that you are colourblind, and can’t tell the colours of the dice apart.¹ With what probability did I select the blue die from the bag? The red die? The green die?
(Hint: You can shortcut a lot of work once you know the probability of choosing the blue die.)
¹ Otherwise this is a rather trivial question to answer.
Your answer:
Let $d \in \{g, r, b\}$ be a random variable representing the colour of the chosen die and x be a random variable indicating the number rolled. Since the green die doesn’t have a 1 on its faces, $P(d = g \mid x = 1) = 0$.
We calculate the remaining probabilities by Bayes’ theorem. Since only the red and blue dice can produce a 1, we can condition on $d \in \{r, b\}$ with prior 1/2 each. That is,
$$p(d = r \mid x = 1) = \frac{p(x = 1 \mid d = r)\,p(d = r)}{p(x = 1)} = \frac{1/6 \times 1/2}{4/12} = \frac{1}{4}$$
$$p(d = b \mid x = 1) = 1 - p(d = r \mid x = 1) = \frac{3}{4}$$
Therefore, given the result is 1, the probabilities are $P(d = g) = 0$, $P(d = r) = 1/4$, $P(d = b) = 3/4$.
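The same posterior can be computed directly with the uniform 1/3 prior over all three dice, as in this Python sketch:

from fractions import Fraction

dice = {"green": [2, 2, 2, 2, 2, 3], "red": [1, 2, 3, 4, 5, 6], "blue": [0, 1, 1, 1, 8, 11]}
prior = Fraction(1, 3)                                                      # uniform prior over the dice
likelihood = {d: Fraction(faces.count(1), 6) for d, faces in dice.items()}  # P(x = 1 | d)
evidence = sum(prior * likelihood[d] for d in dice)                         # P(x = 1) = 2/9
posterior = {d: prior * likelihood[d] / evidence for d in dice}
print(posterior)  # {'green': Fraction(0, 1), 'red': Fraction(1, 4), 'blue': Fraction(3, 4)}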
• Section C. Gaussian Mixture Models (10 points)
1. (1 point) Let n index the data points and k index the mixture components. The parameters
$r_{nk}$ in Gaussian mixture models are
(A) Used identically to the $r_{nk}$ in K-means clustering.
(B) Conceptually similar to the $r_{nk}$ in K-means clustering, but can take on more values.
(C) Completely unrelated to the $r_{nk}$ in K-means clustering.
(D) Updated iteratively during training.
Your choice(s): BD
2. (1 point) Let K denote the number of Gaussian components. How do we determine the parameter K in a GMM?
(A) It’s a hyper-parameter, so we use prior knowledge and empirical results
(B) Gradient descent
(C) The EM algorithm
(D) There is a closed-form solution that involves inverting a matrix
Your choice(s): A
3. Consider the pictured probability density function p:
[Figure: a probability density function with two peaks, located around x = 1 and x = 3.]

We want to approximate p as a Gaussian mixture model.
(1) (1 point) What would be a suitable choice of K? Use one sentence to justify your
answer.
(2) (1 point) For each $k \in \{1, \dots, K\}$, suggest a suitable choice of $\mu_k$ (approximately).
(3) (2 points) Suppose we only observe two points $x_1 = 1$, $x_2 = 2$. Their responsibilities $r_{nk}$ are: $r_{11} = 0.9$, $r_{21} = 0.1$. Calculate $\mu_1$ and $\mu_2$.
(4) (2 points) Given that p is a probability density function, what is an advantage of approximating it with a Gaussian mixture rather than linear regression? (no more than 3 sentences)
Your answer:
K=2, because there are two peaks.
Your answer:
$\mu_1 = 1$, $\mu_2 = 3$.
Your answer: Using the M-step update $\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$ with $r_{12} = 1 - r_{11} = 0.1$ and $r_{22} = 1 - r_{21} = 0.9$, we get $\mu_1 = \frac{0.9 \times 1 + 0.1 \times 2}{0.9 + 0.1} = 1.1$ and $\mu_2 = \frac{0.1 \times 1 + 0.9 \times 2}{0.1 + 0.9} = 1.9$. ($\mu_1$ and $\mu_2$ are worth 1 pt each.)
Correctly writing down the formula, but with incorrect calculations, will get 1 pt.
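A short numerical check of this M-step update (assuming numpy):

import numpy as np

x = np.array([1.0, 2.0])
r = np.array([[0.9, 0.1],    # responsibilities of x1 for components 1 and 2
              [0.1, 0.9]])   # responsibilities of x2 for components 1 and 2
mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)  # weighted mean per component
print(mu)  # [1.1 1.9]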
Your answer: Possible answers:
1) By its nature, linear regression does not predict probabilities.
2) It is hard to find an appropriate feature for linear regression.
3) Every component has a marginal distribution that is Gaussian.
1 correct advantage is worth 1 pt.
1 incorrect advantage will lose 1 pt.
(5) (2 points) Suppose we observe three points from this distribution, $x_1 = 1.5$, $x_2 = 1.8$, and $x_3 = 3$. We use k-means and GMM to cluster these points, respectively, using the same K. Will these two algorithms give us the same means? Explain your answer in 3 sentences.
Your answer:
No.
1) k-means makes hard assignments, while GMM makes soft assignments.
2) Of the three points, x1 and x2 form one cluster (component), and x3 forms another cluster (component).
3) For GMM, x3 still contributes (with a small but nonzero responsibility) to the mean of the component containing x1 and x2, while for k-means, x3 contributes nothing to the mean of the cluster containing x1 and x2.
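A minimal sketch of this comparison, assuming scikit-learn is available (the exact GMM means depend on initialization; the point is that GMM responsibilities are soft):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.5], [1.8], [3.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(km.cluster_centers_.ravel())  # hard-assignment means, e.g. [1.65, 3.0]
print(gmm.means_.ravel())           # soft-assignment means, generally not identical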
• Section D. Matrix Decomposition (10 points)
1. (2 points) For what values of $x \in \mathbb{R}$ is the matrix
$$A = \begin{bmatrix} 1 & 0 & 3 \\ 0 & x & -5 \\ 3 & 4 & x \end{bmatrix}$$
invertible? Write your answer on your paper.
Your answer:
A is invertible if and only if A is full rank, so we derive the condition via Gaussian elimination:
$$\begin{bmatrix} 1 & 0 & 3 \\ 0 & x & -5 \\ 3 & 4 & x \end{bmatrix} \xrightarrow{R_3 \leftarrow R_3 - 3R_1} \begin{bmatrix} 1 & 0 & 3 \\ 0 & x & -5 \\ 0 & 4 & x - 9 \end{bmatrix}$$
When x = 0, we have
$$\begin{bmatrix} 1 & 0 & 3 \\ 0 & 0 & -5 \\ 0 & 4 & -9 \end{bmatrix}$$
which is full rank. When $x \neq 0$, we have
$$\xrightarrow{R_3 \leftarrow R_3 - \frac{4}{x}R_2} \begin{bmatrix} 1 & 0 & 3 \\ 0 & x & -5 \\ 0 & 0 & x - 9 + \frac{5 \times 4}{x} \end{bmatrix}$$
To make it full rank,
$$x - 9 + \frac{5 \times 4}{x} \neq 0$$
So
$$x^2 - 9x + 20 \neq 0 \quad\Longleftrightarrow\quad (x - 4)(x - 5) \neq 0 \quad\Longleftrightarrow\quad x \neq 4,\ x \neq 5$$
In conclusion, A is invertible when $x \in \mathbb{R}\setminus\{4, 5\}$.
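The same condition falls out of the determinant, e.g. symbolically with sympy:

import sympy as sp

x = sp.symbols("x")
A = sp.Matrix([[1, 0, 3], [0, x, -5], [3, 4, x]])
# A is singular exactly at the roots of det(A) = 0
print(sp.factor(A.det()))  # (x - 4)*(x - 5)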
2. Let
$$B = \begin{bmatrix} 4 & 3 \\ -2 & -3 \end{bmatrix}$$
(1) (2 points) Find all eigenvalues of B.
(2) (2 points) Find a set of eigenvectors of B that spans R2.
(3) (2 points) Compute the eigendecomposition of B.
(4) (2 points) Describe how you could use the eigendecomposition of B to quickly compute $B^n$ for large integers n. (You don’t need to give the closed-form expression for $B^n$.) Write your answer on your paper.
Your answer:
(1) We form the characteristic polynomial and set it to zero. That is,
$$p_B(\lambda) = \det\begin{bmatrix} 4 - \lambda & 3 \\ -2 & -3 - \lambda \end{bmatrix} = (4 - \lambda)(-3 - \lambda) + 6 = (\lambda - 3)(\lambda + 2)$$
Therefore, the eigenvalues of B are 3 and −2.
(2) We find the corresponding eigenvector of each eigenvalue by solving $(B - \lambda I)x = 0$, where $x \in \mathbb{R}^2$.
For $\lambda = 3$, we obtain
$$\begin{bmatrix} 4 - 3 & 3 \\ -2 & -3 - 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 3 \\ -2 & -6 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0$$
We solve this homogeneous system and obtain the first eigenvector $p_1 = \begin{bmatrix} 3 \\ -1 \end{bmatrix}$.
For $\lambda = -2$, we obtain
$$\begin{bmatrix} 4 + 2 & 3 \\ -2 & -3 + 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 6 & 3 \\ -2 & -1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0$$
We solve this homogeneous system and obtain the second eigenvector $p_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$.
(3) The eigenvectors $p_1, p_2$ form a basis of $\mathbb{R}^2$. Therefore, we can diagonalize B as $B = PDP^{-1}$, where
$$P = \begin{bmatrix} p_1 & p_2 \end{bmatrix} = \begin{bmatrix} 3 & 1 \\ -1 & -2 \end{bmatrix}, \quad P^{-1} = \frac{1}{5}\begin{bmatrix} 2 & 1 \\ -1 & -3 \end{bmatrix}, \quad D = P^{-1}BP = \begin{bmatrix} 3 & 0 \\ 0 & -2 \end{bmatrix}$$
(4) We derive the computation of $B^n$ for large integer n via the eigendecomposition as below:
$$B^n = (PDP^{-1})^n = PDP^{-1}PDP^{-1}\cdots PDP^{-1} = PD^nP^{-1}$$
Since D is a diagonal matrix, it is easy to compute $D^n$.
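A numerical sketch of this trick (assuming numpy); note that np.linalg.eig returns the eigenvectors as the columns of P:

import numpy as np

B = np.array([[4.0, 3.0], [-2.0, -3.0]])
eigvals, P = np.linalg.eig(B)   # eigenvalues 3 and -2; columns of P are eigenvectors
n = 20
Dn = np.diag(eigvals ** n)      # D^n: just two scalar powers
Bn = P @ Dn @ np.linalg.inv(P)  # B^n = P D^n P^{-1}
print(np.allclose(Bn, np.linalg.matrix_power(B, n)))  # True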
• Section E. Principal Component Analysis (10 points)
You are doing some research that uses PCA to analyze cities. You choose to describe a city using the following features: population, area (km²), average salary (k AUD/year), temperature (°C). Each column represents a sample, and each row is a feature. From left to right, the columns denote Sydney, Melbourne, Perth and Canberra, respectively.
$$X = \begin{bmatrix} 4600000 & 4800000 & 2000000 & 400000 \\ 12368 & 9990 & 6418 & 814.2 \\ 68.2 & 62.3 & 63.4 & 74 \\ 21.8 & 20.4 & 24.7 & 20.2 \end{bmatrix}$$
After data standardization, we obtain the eigenvectors and eigenvalues from the data covariance matrix. The standardized matrix A and covariance matrix S are shown below.
$$A = \begin{bmatrix} 0.75 & 0.89 & -0.45 & -1.19 \\ 0.99 & 0.52 & -0.19 & -1.31 \\ 0.23 & -0.88 & -0.67 & 1.32 \\ 0.01 & -0.66 & 1.41 & -0.76 \end{bmatrix}$$
$$S = \begin{bmatrix} 0.40 & 0.24 & -0.17 & -0.48 \\ 0.24 & 0.57 & -0.21 & -0.60 \\ -0.17 & -0.21 & 0.70 & -0.29 \\ -0.48 & -0.60 & -0.29 & 1.36 \end{bmatrix}$$
The eigenvalues and their corresponding eigenvectors (columns) are:
$$\lambda = \begin{bmatrix} 1.26 \times 10^{-12} & 0.23 & 0.89 & 1.88 \end{bmatrix}$$
$$U = \begin{bmatrix} 0.50 & 0.75 & 0.27 & -0.33 \\ 0.50 & -0.65 & 0.37 & -0.43 \\ 0.50 & -0.06 & -0.86 & -0.08 \\ 0.50 & -0.04 & 0.22 & 0.84 \end{bmatrix}$$
(1) (2 points) Explain the necessity of doing data standardization in this example (no more than 3 sentences).
Your answer:
The data has different scales for different features.
If standardization is not performed, the largest variances might be found along the features with the largest scale, which might not reflect the true structure of the data.
(2) (3 points) By looking at the covariance matrix S, list three observations (each with no more than 2 sentences) you can make.
Your answer:
Population and area are positively correlated. Population and salary are negatively correlated. Salary and temperature are negatively correlated. Area and salary are negatively correlated.
Each correct observation is worth 1 point. An incorrect observation will lose 1 point.
(3) (2 points) If you want to capture 90% of the data variance, which eigenvalue(s) should you choose?
Your answer:
The student should find the largest eigenvalues whose sum contributes at least 90% of the sum of all the eigenvalues.
The student will get 2 points if they correctly choose 0.89 and 1.88, which contribute 92% of the total variance.
If a student knows the captured variance corresponds to the values of the eigenvalues, they will get 0.5.
If a student incorrectly calculates the percentage of the eigenvalues, they will get 0.5.
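The percentages follow directly from the eigenvalues given above, e.g.:

import numpy as np

eigvals = np.array([1.26e-12, 0.23, 0.89, 1.88])
ratios = np.sort(eigvals)[::-1] / eigvals.sum()  # variance fractions, largest first
print(np.cumsum(ratios))  # ~[0.63, 0.92, 1.00, 1.00]; the top two reach ~92%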
(4) (3 points) If we project A onto the first principal component (corresponding to the largest eigenvalue), we obtain the following coordinates: -0.68, -1.00, 1.47, 0.22 for Sydney, Melbourne, Perth and Canberra, respectively. Therefore, Perth has the largest coordinate. Describe Perth (regarding its population, area, salary, and temperature) based on this (no more than 4 sentences). (Hint: does Perth have a large population among the four cities?)
Your answer:
Among the 4 cities, Perth has a relatively small population. Perth has a relatively small area.
Perth has a relatively low salary.
Perth has a relatively high temperature.
Each correct observation is worth 1 point.
An incorrect observation will lose 1 point.
• Section F. Classification (10 points)
1. (1 point) Given N data points $x_i \in \mathbb{R}^D$, $i = 1, \dots, N$, you are using logistic regression for a binary classification problem and can obtain both training accuracy and testing accuracy. Now, you add a new dimension of feature to each data point, such that $\tilde{x}_i = [x_i, u_i] \in \mathbb{R}^{D+1}$. You re-train the classifier until convergence and obtain the new training accuracy and testing accuracy. Select the correct statement(s) below.
(A) The new training accuracy is lower than the previous training accuracy.
(B) The new training accuracy is no lower than the previous training accuracy.
(C) The new testing accuracy is lower than the previous testing accuracy.
(D) The new testing accuracy is no lower than the previous testing accuracy.
Your choice(s): B
2. We are interested in a 2-class classification problem shown below. We’d like to use logistic regression:
$$h(x; \theta_1, \theta_2) = P(y = +1 \mid x, \theta_1, \theta_2) = \frac{1}{1 + \exp(-\theta_1 x_1 - \theta_2 x_2)}.$$
We successfully find the decision boundary L0 (red line).
[Figure: a 2-class dataset in the $(x_1, x_2)$ plane, with candidate decision boundaries L0 (red), L1, L2 and L3.]
(1) (1 point) Can a Perceptron classifier find a suitable decision boundary? Use one sentence to justify your answer.
Your answer: Yes, because the dataset is linearly separable.
(2) Consider the regularized logistic regression. We maximize the following function:
$$\sum_{i=1}^{N} \log P(y_i \mid x_i, \theta_1, \theta_2) - \frac{\lambda}{2}\theta_1^2,$$
where $\lambda \geq 0$ is the weighting parameter. We re-train the classifier using this equation. Answer the following questions regarding the new decision boundary.
(a) (2 points) Will the decision boundary be L1? Provide a short justification (no more than 3 sentences).
Your answer: No.
The decision boundary is $x_2 = -\frac{\theta_1}{\theta_2} x_1$.
When the magnitude of $\theta_1$ is constrained to be small, the absolute value of the slope $\frac{\theta_1}{\theta_2}$ becomes smaller, so the decision boundary becomes flatter.
(b) (2 points) Will the decision boundary be L2? Provide a short justification (no more than 3 sentences).
Your answer: L2 is possible.
When $\theta_1$ is constrained to be small, the new decision boundary should be flatter than L0.
L2 is flatter than L1.
Compared with L3, which has a similar slope, the training error of L2 is smaller, so L2 is possible.
Note: if the first point is answered in question (a), then the student automatically gets 0.5 points.
Note: if the third point is answered in (c), the student automatically gets 0.5 points.
(c) (2 points) Will the decision boundary be L3? Provide a short justification (no more than 3 sentences).
Your answer: L3 is flatter than L0, and has a slope magnitude similar to L2.
(1 pt) Comparing L3 and L2, we find L3 has a larger training error (two blue points are misclassified) than L2 (one blue point is misclassified).
Therefore, training will at least converge from L3 to L2, where the overall loss is smaller.
(3) (2 points) If we increase the value of $\lambda$ starting from $\lambda = 0$, how will the average log-probability on the test set change? Use no more than 5 sentences to justify your answer.
Your answer:
When $\lambda$ is relatively small, regularization is effective, so usually the average log-probability on the test set will become higher.
When $\lambda$ becomes relatively large, the regularization effect is unnecessarily strong, so usually the average log-probability on the test set will become lower.
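A minimal sketch of this behaviour, under assumptions not in the question (synthetic 2-D data and hand-rolled gradient ascent, since the penalty here applies to $\theta_1$ only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200))  # noisy labels in {-1, +1}
X_tr, y_tr, X_te, y_te = X[:100], y[:100], X[100:], y[100:]

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit(lam, steps=2000, lr=0.5):
    theta = np.zeros(2)
    for _ in range(steps):
        # gradient of sum_i log sigma(y_i theta^T x_i) - (lam/2) theta_1^2
        g = X_tr.T @ (y_tr * sigmoid(-y_tr * (X_tr @ theta)))
        g[0] -= lam * theta[0]            # penalise theta_1 only
        theta += lr * g / len(y_tr)
    return theta

for lam in [0.0, 1.0, 10.0, 100.0]:
    theta = fit(lam)
    test_ll = np.mean(np.log(sigmoid(y_te * (X_te @ theta))))
    print(lam, round(test_ll, 4))  # average test log-probability as lam grows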
——— End of the paper ——–