EECS 189 Introduction to Machine Learning Fall 2020
This homework is due Tuesday, September 15 at 11:59 p.m. 2 Nonlinear Classification Boundaries
HW2
Make sure to submit the code you write in this problem to “HW2 Code” on Gradescope.
In this problem we will learn how to use polynomial features to learn nonlinear classification boundaries.
In Problem ”A Simple Classification Approach” on HW1, we found that linear regression can be quite effective for classification. We applied it in the setting where the training data points were approximately linearly separable. The term “linearly separable” in classification means there exists a hyperplane such that most of the training data points in the first class are on one side of the hyperplane and most training data points in the second class are on the other side of the hyperplane.
However, often times in practice classification datasets are not linearly separable in their native representation. In such cases we can often find a representation that is linearly separable by aug- menting the original data with appropriate features. Sometimes, the polynomial features that we saw in Problem ”Learning 1-D functions” of HW1 are good enough to let us do this, because after all, they are from a universal feature family. This “lifting” embeds the data points into a higher dimensional space where they are more likely to be linearly separable.
In this problem we consider a simple dataset of points (xi, yi) ∈ R2, each associated with a binary label bi which is −1 or +1. The dataset was generated by sampling data points with label −1 from a disk of radius 1.0 and data points with label +1 from a ring with inner radius 0.8 and outer radius 2.0.
(a) Run the starter code to load and visualize the dataset and submit a scatterplot of the points with your homework. Why can’t these points be classified with a linear classifica- tion boundary in the x,y plane?
Solution:
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 1
If we want to find a hyperplane that separates these points approximately, one of the half-spaces must contain most of the points from the outer ring (because they form a class). However due to the shape of the dataset, this half space will also contain most of the points in the inner circle and therefore the points cannot be classified with a linear classification boundary.
(b) Classify the points with the technique from Problem 7 on HW1 (“A Simple classification approach”). Use the feature matrix X whose first column consists of the x-coordinates of the training points and whose second column consists of the y-coordinates of the training points. The target vector b consists of the class label −1 or +1. Perform the linear regression w1 = arg minw ∥Xw − b∥2. Report the classification accuracy on the test set.
Solution:
To solve the regression you can run the following code:
w1 = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
Then the classification and its accuracy can be computed with:
The accuracy on the test set is 0.5075, so the classifier only gets about half of the predictions right. This is because the data is not linearly separable as seen in (a).
ypred = -1.0*(Xtest.dot(w1) < 0.0) + 1.0*(Xtest.dot(w1) > 0.0)
print(“prediction accuracy on the test set is “, 1.0*sum(ypred == ytest
→ ) / n)
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 2
(c) Now augment the data matrix X with polynomial features 1, x2, xy, y2 and classify the points again, i.e. create a new feature matrix
22 x1 y1 x x1y1 y 1
11 22 x x x xy y 1 2222 22
Φ=….. ….. …..
x n y n x n2 x n y n y 2n 1
and perform the linear regression w2 = arg minw ∥Φw − b∥2. Report the classification accu-
racy on the test set.
Solution: Continuing from the code in the last part, the matrix Φ and the classification of the data can be computed in the following way:
After that, the accuracy on the test set can be obtained with:
Phi = np.vstack([X[:,0], X[:,1], X[:,0]**2, X[:,0] * X[:,1], X[:,1]**2, → 1.0 + 0.0*X[:,1]]).T
w2 = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))
Phitest = np.vstack([Xtest[:,0], Xtest[:,1], Xtest[:,0]**2, Xtest[:,0] → * Xtest[:,1], Xtest[:,1]**2, 1.0 + 0.0*Xtest[:,1]]).T
ypred2 = -1.0*(Phitest.dot(w2) < 0.0) + 1.0*(Phitest.dot(w2) > 0.0)
print(“prediction accuracy on the test set is “, 1.0*sum(ypred2 == → ytest) / n)
The accuracy on the test set is now 0.89 which shows that polynomial features are able to classify most of the points in the test set correctly.
(d) Reporttheweightvectorthatwasfoundinthefeaturespacewiththepolynomialfeatures. Show that up to small error the classification rule has the form αx2 + αy2 ≤ β. What is the interpretation of β/α here? Why did the classification in the augmented space work?
Solution: The weight vector can be printed with
print(“w2 = “, w2)
The result is
w2 = array([ 0.08278755, 0.04680803, 0.65422045,
-0.03610703, 0.71621139, -0.81494877])
so we have α ≈ 0.7 and β ≈ 0.8. The interpretation of β/α is that r2 = β/α is about the radius of the inner disk squared, so r ≈ 1 which is the approximate radius of the inner disk.
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 3
3
The classification in the polynomial feature space worked, because the polynomial features are rich enough to encode the exact classification rule (x2 + y2 ≤ r2).
Later on in the course, when we have access to nonlinear optimization techniques, we will see how the same ideas of using features can be combined with better loss functions for classifi- cation. But this problem should show you the power of features in classification problems as well as in regression ones.
Blair and their giant peaches
Make sure to submit the code you write in this problem to “HW2 Code” on Gradescope.
Blair is a mage testing how long they can fly a collection of giant peaches. They has n training peaches – with masses given by x1, x2, . . . xn – and flies these peaches once to collect training data. The experimental flight time of peach i is given by yi. They believes that the flight time is well approximated by a polynomial function of the mass
y i ≈ w 0 + w 1 x i + w 2 x i2 · · · + w D x iD
where their goal is to fit a polynomial of degree D to this data. Include all text responses and plots in your write-up.
(a)
Show how Blair’s problem can be formulated as a linear regression problem.
Solution: The problem is to find the coefficients wd such that the squared error is minimized:
min (w0 +w1xi +w2xi2 ···+wDxiD −yi)2
w0 ,…,wD
i
Assume that we construct the following matrix:
(b)
You are given data of the masses {xi}ni=1 and flying times {yi}ni=1 in the “x train” and “y train”
keys of the file 1D poly.mat with the masses centered and normalized to lie in the range
[−1, 1]. Fill missing part in part (b) of the iPython notebook to do a least-squares fit
(taking care to include a constant term) of a polynomial function of degree D to the data.
Letting f denote the fitted polynomial, plot the average training error R(D) = 1 n (y − D n i=1 i
Then the problem can be written as:
2 D 1 x1 x … x
11 2 D 1 x x … x 2 22
X=. . . . . . . . . . . . . . .
1 x x2 … xD
nnn
min ||Xw − y||2 w
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 4
fD(xi))2 against D in the range D ∈ {0,1,2,3,…,n − 1}. You may not use any library other than numpy and numpy.linalg for computation.
Solution:
See Figure 1 for the plot. See code below:
for d in range(n -1): D=d+1
for i in range(D + 1):
if i == 0:
Xf = np.array([1] * x_train.size)
else:
Xf = np.vstack([np.power(x_train, i), Xf])
Xf = Xf.T
w = lstsq(Xf, y_train)
y_predicted = Xf @ w
err[d] = (np.linalg.norm(y_train – y_predicted)**2) / n
#print(err)
35 30 25 20 15 10
5 0
0.0 2.5
5.0 7.5
Degree of Polynomial
15.0 17.5
10.0 12.5
Figure1:TrainingError: Resultfor1Dpoly.mat
(c) HowdoestheaveragetrainingerrorbehaveasafunctionofD,andwhy?Whathappens
if you try to fit a polynomial of degree n with a standard matrix inversion method?
Solution: The training error decreases since we have more degrees of freedom to fit the dataset. If we try to fit a polynomial of degree n − 1, we will have enough parameters to fit the data exactly, so the training error will be zero. Using a polynomial of degree n results in a non- invertible data matrix, and so we cannot use a standard matrix inversion method to find our solution.
(d) Blair has taken CS189 so decides that they needs to run another experiment before deciding that their prediction is true. They run another fresh experiment of flight times using the same peaches, to obtain the data with key “y fresh” in 1D poly.mat. Denoting the fresh flight time
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 5
Training Error
of peach i by y ̃i, by completing part (d) of the iPython notebook, plot the average error R ̃(D) = 1 n (y ̃ − f (x ))2 for the same values of D as in part (b) using the polynomial
n i=1 i D i
approximations fD also from the previous part. How does this plot differ from the plot in
(b) and why?
Solution: The plots are shown in Figure 2. The plots are different from the training errors since the data was generated afresh. While increasing the polynomial degree served to fit the noise in the training data, we cannot hope to fit noise by using the same model for fresh data. Hence, our performance degrades as we increase polynomial degree, with a minimum seen at the true model order.
for d in range(n – 1):
D=d+1
for i in range(D + 1):
if i == 0:
Xf = np.array([1] * n)
else:
Xf = np.vstack([np.power(x_train, i), Xf])
Xf = Xf.T
w = lstsq(Xf, y_train)
err_train[d] = (np.linalg.norm(y_train – Xf @ w)**2) / n
err_fresh[d] = (np.linalg.norm(y_fresh – Xf @ w)**2) / n
6 5 4 3 2 1 0
0.0 2.5 5.0
7.5 10.0 12.5
15.0 17.5
train fresh
Figure2:FreshErroryfresh: Resultfor1Dpoly.mat
(e) How do you propose using the two plots from parts (b) and (d) to “select” the right
polynomial model for Blair?
Solution: The right model is the one that minimizes the error in the fresh dataset. The mini- mizer is at D = 4.
(f) Blair has a new hypothesis – the flying time is actually a function of the mass, smoothness, size, and sweetness of the peach, and some multivariate polynomial function of all of these
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 6
parameters. A D-multivariate polynomial function looks like f (x)=α xpji,
Dji ji
where ∀j : p ≤ D. Here α is the scale constant for jth term and p is the exponent ijij ji
of xi in jth term. The data in polynomial regression samples.mat (100000 × 5) with columnscorrespondingtothe5attributesofthepeach. Use4-foldcross-validationtodecide which of D ∈ {0, 1, 2, 3, 4, 5} is the best fit for the data provided. For this part, compute the polynomial coefficients via ridge regression with penalty λ = 0.1, instead of ordinary least squares. You are not allowed to use any library other than numpy and numpy.linalg. Write your implementation by completing the part (g) of the iPython notebook.
Solution: Please refer to the next part for the general implementation.
(g) Now redo the previous part, but use 4-fold cross-validation on all combinations of D ∈ {1, 2, 3, 4, 5} and λ ∈ {0.05, 0.1, 0.15, 0.2} – this is referred to as a grid search. Find the best D and λ that best explains the data using ridge regression. Print the average training/val- idation error per sample for all D and λ. Again, write your implementation by completing the part (g) of the iPython notebook.
Solution: Minimum of cross validation error happens at D = 4 and λ = 0.15.
Average train error:
[[0.05856762013 0.05856762066 0.05856762225 0.05856762491 0.05856762862]
[0.05848920002 0.05848948229 0.0584902034 0.05849122 0.05849243261]
[0.05846063295 0.05848702853 0.05848893458 0.05849036454 0.05849178744]
[0.05839260718 0.05848700898 0.05848892445 0.05849035741 0.05849178171]
[0.05831118703 0.05848700861 0.05848892426 0.05849035728 0.05849178161]]
Average valid error:
[[0.05857468619 0.05857468536 0.0585746856 0.05857468691 0.05857468927]
[0.05852250667 0.05852112813 0.05852038752 0.05852010745 0.05852016181]
[0.05854617135 0.0585210268 0.05852032221 0.05852006042 0.0585201252 ]
[0.05860046897 0.05852102354 0.05852032035 0.05852005896 0.05852012386]
[0.05869725681 0.05852102357 0.05852032036 0.05852005896 0.05852012386]]
In the following implementation, we run cross-validation to find errors for each D separately, by varying the value of λ. This is purely for pedagogical purposes. You can combine these two steps by running through D in an outer loop. Note that we can collect all of these errors, and then choose the model that minimizes the cross-validation error.
Code to compute the average error per sample:
Ns = int(data_x.shape[0] * (Kc – 1) / Kc) # training
Nv = int(Ns / (Kc – 1)) # validation
Etrain = np.zeros(4)
Evalid = np.zeros(4)
for c in range(4):
valid_x = feat_x[c * Nv:(c + 1) * Nv]
valid_y = data_y[c * Nv:(c + 1) * Nv]
train_x = np.delete(feat_x, list(range(c * Nv, (c + 1) * Nv)), axis=0)
train_y = np.delete(data_y, list(range(c * Nv, (c + 1) * Nv)))
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 7
w = lstsq(train_x, train_y, lambda_=lambda_)
Etrain[c] = np.mean((train_y – train_x @ w)**2)
Evalid[c] = np.mean((valid_y – valid_x @ w)**2)
return np.mean(Etrain), np.mean(Evalid)
Code to find best D and λ:
4 Outlier Removal via OMP (Part 2)
In the previous homework (and in 16B), we looked at how the orthogonal matching pursuit (OMP) can be used to remove an effect of outliers in the training data. This problem can be viewed as a continuation of that problem that forms a natural second part. We will look into why the concept of validation is needed here as well as how to use a validation set that may itself be corrupted or contains outliers. We will assume the same setting as the first part. To summarize, we want to perform linear regression on training data that contains an unknown fraction of outliers. Part (a) – (c) involve coding in a Jupyter notebook and some short answers. Please read more detailed instructions in the notebook itself.
Make sure to submit the code you write in this problem to “HW2 Code” on Gradescope.
(a) From the training error plot, we will see that the training error always decreases as the number of non-zeros in the weight or the number of the removed outliers increases. However, this does not reflect the true error in the recovered pattern. For example, if we had a clean or uncorrupted test set, we could get an idea of what that true error is like. Fill in the missing code to plot the test error using the weights obtained from the OMP outlier-removal method. Then describe in words the trends of the test error you see, and based on the plot alone, tell us how many outliers should be removed by OMP?
Solution: From Figure 3, you should see the test error decreases as the number of removed outliers increases until it hits the minimum when 20 outliers are removed, or equivalently, the OMP weight has 21 non-zero entries. After this point, the test error should keep rising. We want the model that performs best on the clean test set. Hence, based on this the plot, we can set the number of outliers to be 20 for the OMP.
Coding solution:
(b) In the real world, we often cannot evaluate our models on a clean test set so we will need to carry out the validation process based on the data we have, i.e. potentially corrupted data. It is
D, i = np.unravel_index(Evalid.argmin(), Evalid.shape)
print(“D =”, D + 1)
print(“lambda =”, LAMBDA[i])
for w in weights:
test_err = np.mean((X_test.dot(w[:d]) – y_test) ** 2)
test_errs.append(test_err)
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 8
Figure 3: (Part a) mean error on the clean test set.
no longer safe to compute the mean of the error as before. Computing the mean error on the corrupted validation set will result in an unreliable representation of the errors on uncorrupted data since that mean will be heavily skewed by the outliers.
Luckily, there are multiple ways of measuring the central tendency, other than the mean value. In fact, median is another option which is known to be robust to outliers. To see how median is more robust compared to mean, compute the validation errors based on the weights at the specified indices. For each of the weights, plot a histogram of the validation errors and compute their mean and median. See where the values of mean and median fall in each histogram.
Solution: As in Figure 4, you should see that majority of the errors concentrate in the first or the smallest bin close to zero while there are a smaller number of samples in the rest of the bins with much larger errors. The median is much smaller than the mean and is a better representative of the majority of the data. This is because the mean is skewed heavily due to errors incurred by the outliers.
Coding solution:
val_err = (X_val.dot(weights[idx][:d]) – y_val) ** 2
Figure 4: (Part b) histograms of the errors computed at five OMP weights with different numbers of points removed.
(c) Does the idea of median being a robust estimator give you some idea for how to do validation in the presence of outliers? Plot mean and median of the validation errors against the number of the non-zero entires in OMP solutions (all the weights we obtained from part
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 9
(a)). Now determine from the median plot the number of the outliers. Why does the median work better than the mean in this case?
Solution: The trend in the median plot of the validation error should be roughly similar to the mean plot of the clean test error (see Figure 5). This suggests that, even in the presence of many outliers, we can still use the corrupted validation set to accurately determine a good hyperparameter. In this case, the appropriate number of non-zero weights for OMP is 21 as mentioned in part (b), but from the plot, we also accept any value from 20 to 25.
Coding solution:
Figure 5: (Part c) median error on the corrupted validation set.
Now that we have seen how useful the median is against corrupted data and outliers, we will do some theoretical analysis on median. First, we’ll look its convergence and second, we’ll verify its robustness.
(d) From the law of large numbers, we have seen that with a large number of samples, the sample mean converges to the population mean or expected value. More rigorously, the weak law of large numbers states the following: For any positive number ε,
lim P Xn − μ > ε = 0
for w in weights:
val_err = (X_val.dot(w[:d]) – y_val) ** 2
mean_val_errs.append(np.mean(val_err))
median_val_errs.append(np.median(val_err))
n→∞
where μ is the expectation, and X = 1 n X is the sample mean. Here, we would like to make
n n i=1 i
a similar statement for sample median and population median. Given a sequence of n random
variables i.i.d. drawn from the same distribution, {X1, X2, …, Xn}, let’s denote the population median as med(X) and the sample median as X ̃n. We want to make no other assumption on the distribution of Xi’s. The goal is to give a proof of the following statement:
̃ limP Xn −med(X)>ε =0
n→∞
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 10
But to make the proof easier to follow and to understand things in terms of their natural de-
pendencies, we will modify the above statement to involve quantiles of X. For every ε > 0 for
which the ( 1 − ε) quantile is different from the median and the ( 1 + ε) quantile is also different 22
from the median, we have: ̃ ̃
lim P Xn < (1/2 − ε)-quantile or Xn > (1/2 + ε)-quantile = 0. n→∞
Here, (for simplicity) a p-quantile of a random variable is a value x for which the CDF P(X ≤ x) = p. [To be precise, a p-quantile is an x for which P(X < x) ≤ p and P(X ≤ x) ≥ p. This allows the distribution of X to have atoms in it and for quantiles to still be defined in a reasonable manner.]
Notice that by choosing an appropriate value of ε, we can recover the desired ε, and hence, the two statements are equivalent.
(First hint: Consider a Bernoulli random variable: Yi = 1 Xi > (1/2 + ε)-quantile)
(Second hint: Think about a relevant Chernoff bound and use it. You don’t have to use a
Chernoff bound to prove it, but it helps in understanding the speed of this convergence.)
Solution: We will first exploit the symmetry of quantiles and try to prove one side of it at a time. Here, we will first upper bound probability of the second event in the desired expression:
̃ P Xn > (1/2 + ε)-quantile
NoticethatX ̃n >(1/2+ε)-quantileisequivalenttosayingthatthesamplemedianof{X1,X2,…,Xn} is larger than its (1/2 + ε)-quantile, and this can happen only when at least half of all Xi’s are larger than its (1/2 + ε)-quantile.
Now it make sense to consider a Bernoulli random variable which depends on Xi in the follow- ing manner:
Yi = 1 Xi > (1/2 + ε)-quantile
TheprobabilityofY being1isthensimplyp=1/2−ε,andE Y =np.Byrewriting
n i i=1 i
the previous statement using Yi, we can see that
̃ nn
Xn > (1/2 + ε)-quantile ⇐⇒
Yi ≥ 2
i=1 n
̃
P X >(1/2+ε)-quantile =P Y ≥(1+δ)np ni
i=1
whereδ= 1 −1= 1 −1= 2ε .Nowwearereadytoapplythesecondhint,theChernoff
2 p 1−2ε 1−2ε
bound. There are many different Chernoff bounds out there, and in this case, the following one
(often referred to as the multiplicative form of the bound) on Y ̄ = n Y results in: n i=1i
n
2 npδ
Y ≥(1+δ)np≤exp−
i3
P
i=1
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 11
2
n 2ε n(1/2−ε) n 1−2ε
P Y≥ ≤exp−
i
2 3 2
i = 1
2
2nε
= exp − .
3(1−2ε)
Repeating the same argument for the other half of the statement, we end up with the two
Notice this is falling exponentially in n. following bounds:
̃ 2nε P X >(1/2+ε)-quantile ≤exp−
n 3(1−2ε) 2
They can be combined using a union bound, and notice that as n approaches infinity, both of them go to zero with, at least, an exponential rate in n. Now we have completed the proof since this shows that the empirical mean can only rarely be outside of the range between the
1 − ε-quantile and the 1 + ε-quantile. 22
(e) For this part, we want to also show that median has a robustness property against a strong adversary who can completely change portions of the data. We will use a similar setting as in the previous part but asssume that the samples are fixed. In other words, we will compute the median of n samples {x1, x2, …, xn}, but there is now an adversary who can completely change γn samples or an γ portion where 0 < γ < 1/2. You can assume that γn is an integer. The goal is to find the largest deviation from the true median of the n samples that the adversary can cause as a function of γ. Your answer should involve quantiles.
This should be more straightforward than the previous part.
(Hint: Suppose that n = 100 and the samples were 1, 2, 3, . . . , 99, 100. Suppose that the adversary wanted to make your median be as small as possible and was allowed to change ten values. Which ten would the adversary change and what would they change them to? What would the resulting median be of the modified data? Working out this example should help you understand the general case.)
Solution: Let’s denote the true median with x ̃. This implies that there are an equal number of samples that are larger than x ̃ and that are smaller. Now suppose that the adversary wishes to make the corrupted median as small as possible. The way to accomplish this is to change the γn samples that are larger than x ̃ into a very small value, e.g., negative infinity. Consequently, the corrupted median will become the (1/2 − γ)-quantile, and no matter what the adversary does, it cannot go any lower. By symmetry, the largest value of the corrupted median is the (1/2 + γ)-quantile. Hence, the corrupted median will always lie between the (1/2 − γ)-quantile and the (1/2 + γ)-quantile.
2nε P X <(1/2−ε)-quantile ≤exp−
̃
n 3(1−2ε)
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 12
5 Simple Regularization Tradeoff, with a Final Twist
In an earlier HW, you saw how ridge-regression can be reconceptualized as adding fake data points at zero and just doing ordinary least-squares on the augmented data set. Because ordinary least- squares can be viewed as a generalization of the simple act of taking an average, it is worth stepping back and exploring regularization in this extremely simple context.
The advantage of the extremely simple context is that it lets us easily use probability-based models to sharpen our understanding of what is going on, and to understand what the in-principle best amount of regularization should be.
Consider a random variable X, which has unknown mean μ and unknown variance σ2. Given n iid realizations of training samples X1 = x1, X2 = x2, . . . , Xn = xn from the random variable, we wish to have the best constant predictor for new realizations from this distribution. If we had access to the distribution of X, the mean μ would be the best constant predictor in the squared error sense since the mean μ minimizes E[(X − μ)2].
However, we don’t have access to the distribution of X and only have data. We will call our predictor for X the random variable Xˆ, which will have its own mean μˆ. There are a few ways we can estimate μ given the realizations of the n samples:
1. Average the n samples: x1+x2+···+xn . n
2. Average the n samples and one sample of 0: x1+x2+···+xn . n+1
3. Average the n samples and n0 samples of 0: x1+x2+···+xn . n+n0
4. Ignore the samples: just return 0.
In the parts of this question, we will measure the bias and variance of each of our estimators. The bias is defined as E[Xˆ − μ] and the variance is defined as Var[Xˆ]. Notice that both of these do not depend on the actual realizations of the n samples.
(a) What is the bias of each of the four estimators above?
Solution: Using the linearity of expectation, we write E[Xˆ − X] as E[Xˆ] − E[X] = E[Xˆ] − μ,
so we have the following biases:
(a) E[Xˆ] = E[X1+X2+...+Xn ] = nμ nn
(b) E[Xˆ] = E[X1+X2+...+Xn ] = nμ n+1 n+1
=⇒ bias = 0
=⇒ bias = − 1 μ
n+1 (c) E[Xˆ] = E[X1+X2+...+Xn ] = nμ =⇒ bias = − n0 μ
n+n0 n+n0 n+n0 (d) E[Xˆ]=0 =⇒ bias=−μ
(b) What is the variance of each of the four estimators above?
Solution: The two key identities to remember are Var[A + B] = Var[A] + Var[B] (when A and B are independent) and Var[kA] = k2Var[A], where A and B are random variables and k is a constant.
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 13
(a) Var[Xˆ]=Var[X1+X2+...+Xn]= 1 Var[X +X +...+X ]= 1 (nσ2)= σ2 n n2 1 2 n n2 n
(b) Var[Xˆ]=Var[X1+X2+...+Xn]= 1 n+1 (n+1)2
Var[X +X +...+X ]= 1 2 n
1 (n+1)2
(nσ2)= n σ2 (n+1)2
(c) Var[Xˆ]=Var[X1+X2+...+Xn]= 1 n+n0 (n+n0)2
(d) Var[Xˆ]=0
Var[X +X +...+X ]= 1
1 2 n (n+n0)2
(nσ2)= n (n+n0)2
σ2
(c) Suppose we have constructed an estimator Xˆ from some samples of X. We now want to know how well Xˆ estimates a fresh (new) sample of X. Denote this fresh sample by X′. Note that X′ is an i.i.d. copy of the random variable X. Derive a general expression for the expected squared error E[(Xˆ − X′)2] in terms of σ2 and the bias and variance of the estimator Xˆ. Similarly, derive an expression for the expected squared error E[(Xˆ − μ)2]. Compare the two expressions and comment on the differences between them, if any.
Solution: Since Xˆ is a function of X, we conclude that the random variables Xˆ and X′ are independent of each other. Now we provide two ways to solve the first problem.
Method 1: In this method, we use the trick of adding and subtracting a term to derive the desired expression:
E[(Xˆ − X′)2] = E[(Xˆ − μ + μ − X′)2]
= E[(Xˆ − μ)2 + E[(μ − X′)2]
Method 2:
=Var(X′)=σ2
= E[(Xˆ − μ)2 + σ2
= E[(Xˆ −E[Xˆ]+E[Xˆ]−μ)2 +σ2
= E[(Xˆ − E[Xˆ ])2 + (E[Xˆ ] − μ)2 +2 E[(Xˆ − E[Xˆ ]) · (E[Xˆ ] − μ)] +σ2
=Var(Xˆ ) =bias2 =0
In this method, we make use of the definition of variance. We have
E[(Xˆ − X′)2] = E[Xˆ2] + E[X′2] − 2E[XˆX′]
= (Var(Xˆ) + (E[Xˆ])2) + (Var(X′) + (E[X′])2) − 2E[XˆX′]
= ((E[Xˆ])2 − 2E[XˆX′] + (E[X′])2) + Var(Xˆ) + Var(X′)
= (E[Xˆ] − E[X′])2 + Var(Xˆ) + Var(X)
=E[X]=μ
= (E[Xˆ] − μ)2 +Var(Xˆ) + σ2
=bias2
The first term is equivalent to the bias of our estimator squared, the second term is the variance of the estimator, and the last term is the irreducible error.
Now let’s do E[(Xˆ − μ)2].
E[(Xˆ − μ)2] = E[Xˆ2] + E[μ2] − 2E[Xˆμ] (1)
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 14
=Var(X)
= (Var(Xˆ) + E[Xˆ]2) + (Var(μ) + E[μ]2) − 2E[Xˆμ] (2) = (E[Xˆ]2 − 2E[Xˆμ] + E[μ]2) + Var(Xˆ) + Var(μ) (3) = (E[Xˆ] − E[μ])2 + Var(Xˆ) + Var(μ) (4) = (E[Xˆ] − μ)2 + Var(Xˆ). (5)
Notice that these two expected squared errors resulted in the same expressions except for the σ2 in E[(Xˆ − X′)2]. The error σ2 is considered “irreducible error” because it is associated with the noise that comes from sampling from the distribution of X. This term is not present in the second derivation because μ is a fixed value that we are trying to estimate.
(d) For the following parts, we will refer to expected total error as E[(Xˆ − μ)2]. It is a common mistake to assume that an unbiased estimator is always “best.” Let’s explore this a bit further. Compute the expected squared error for each of the estimators above. Solution: Adding
the previous two answers: (a) σ2
n
(b) 1 (μ2 + nσ2)
(n+1)2
(c) 1 (n2μ2 + nσ2)
(n+n0 )2 0 (d) μ2
(e) Demonstrate that the four estimators are each just special cases of the third estimator, but with different instantiations of the hyperparameter n0.
Solution: The derivation for the third estimator works for any value of n0. The first estimator is just the third estimator with n0 set to 0:
x1 +x2 +...+xn = x1 +x2 +...+xn + x1 +x2 +...+xn. n+n0 n+0 n
The second estimator is just the third estimator with n0 set to 1:
x1 +x2 +...+xn = x1 +x2 +...+xn.
n+n0 n+1
The last estimator is the limiting behavior as n0 goes to ∞. In other words, we can get arbitrarily
close to the fourth estimator by setting n0 very large:
lim x1 +x2 +...+xn =0.
n0→∞ n+n0
(f) What happens to bias as n0 increases? What happens to variance as n0 increases?
Solution:
One reason for increasing the samples of n0 is if you have reason to believe that X is centered around 0. In increasing the number of zeros we are injecting more confidence in our belief
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 15
that the distribution is centered around zero. Consequently, in increasing the number of ”fake”
data, the variance decreases because your distriubtion becomes more peaked. Examining the
expressions for bias and variance for the third estimator, we can see that larger values of n0
result in decreasing variance ( n σ2) but potentially increasing bias ( n0μ ). Hopefully you (n+n0 )2 n+n0
can see that there is a trade-off between bias and variance. Using an unbiased estimator is not always optimal nor is using an estimator with small variance always optimal. One has to carefully trade-off the two terms in order to obtain minimum squared error.
(g) Say that n0 = αn. Find the setting for α that would minimize the expected total error, assuming you secretly knew μ and σ. Your answer will depend on σ, μ, and n.
In other words, how many fake 0s should we add to our data set?
Solution: First, we write our expression for the total error in terms of α: 1 ( n 20 μ 2 + n σ 2 )
(n + n0)2
1 ((αn)2μ2 + nσ2)
(n + αn)2
1 1 (α2n2μ2 + nσ2)
(1 + α)2 n2
1 22 σ2
(1+α)2(αμ + n) α2 1σ2
(1+α)2μ2 + (1+α)2 n Now take the derivative with respect to α and set it equal to 0:
2α2 2σ2 (1+α)3μ −(1+α)3 n =0.
2α μ2= 2 σ2 (1+α)3 (1+α)3 n
2αμ2 =2σ2 n
σ2
α = nμ2 . (6)
(h) What happens to the optimal α in the previous part as you get more samples and n gets bigger? What does this mean in terms of the number n0 of fake data points we should add?
Solution:
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 16
Wecanseefrom(6)thattheαdecreasesasnincreases.Computingtheoptimaln =αn=σ2 0 μ2
we see that this is a constant that is the ratio of the variance to the mean squared, or the square of the ratio of the standard deviation to the mean.
This shows that as the number of training samples increases, the impact of the optimal ridge- style regularization (by adding fake data points) goes away.
(i) For this part, let’s assume that we had some reason to believe that μ should be small (close to 0) and σ should be large. In this case, what happens to the expression for the optimal α that you computed in a previous part? Solution: The value of α can be quite large, since
the solution has a small value of μ in the denominator. In mathematical terms, plugging in (6), we could write the limit as μ goes to 0:
σ2
lim = ∞.
μ→0 nμ2
We can also see this in terms of the number of fake data points that we want as we calculated
in the previous part.
(j) Draw a connection between α in this problem and the regularization parameter λ in the ridge- regression version of least-squares.
What is the optimal value for λ as a function of n, σ, μ?
Solution: The answer depends on the exact form of ridge-regression we use. One ridge-
regression version of least-squares can be defined as
n
1 minR(θ)= (θ−x)2 +λθ2,
i θ ni=1
Ifwefindθˆsuchthat∂R(θ)|θˆ =0.wecanget ∂θ
ni xi ni xi θˆ=(1+λ)n=n+ λn ,
n0
where we find that λ plays exactly the same role as α defined in Part (g). Thus the optimal
value for λ is
The interesting thing here is that in this variant, the optimal λ drops as n increases. This is
because there is a 1 out in front of the first term.
n
σ2 λ = nμ2 .
If we used the variant of ridge regression that didn’t have the 1 in front of the least-squares
n
term, then the optimal λ would not have the n in the denominator. Notice how this matches the
“fake data points” case that you saw in the earlier homework.
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 17
(k) Now, we can use this setup to take a more skeptical look at whether a simple validation ap- proach — taking some extra nv validation points and using them to tune the hyperparameter of interest — can work reasonably in this particular setting.
Suppose we had nv validation points xn+1, . . . , xn+nv that were all drawn i.i.d. according to the distribution for X. Describe what would happen if you used these validation points to pick the hyperparameter α (you can also answer in terms of λ if you wish) that minimizes the validation error and comment on whether the resulting estimator is good or not.
How does this help you understand the importance of having a test set that is different from the validation set?
(HINT: Try this out and see what happens. If it helps, suppose that some adversary just made your validation points come from a different distribution. How much can the adversary impact you?)
Solution: Suppose we are using the validation points to pick the hyperparameter α, then we are minimizing the following square loss over α:
nv 2 x1 + x2 + · · · + xn
min −xn+j . α n+αn
j=1 μˆ α
Taking the derivative and find the α⋆ such that the derivative equals to zero, then we could get α ⋆ = μˆ 0 − 1 ,
μˆ val
whereμˆ =1n xandμˆ =1nv x .Thenif|μˆ |0andtheresultingα⋆>0,the
0 n i=1 i val nv i=1 n+i val hyperparameter picked by simply using validation points is
μˆα⋆ =x1+x2+···+xn =μˆval, n+α⋆n
which that our final estimated mean completely ignores the training points and “over-fits” to the mean of the validation points.
The alternative condition occurs when μˆval > μˆ0. In that case, we just pick α∗ = 0 and the estimate is just the mean of the training points. Once again, it is as though we have not deployed any regularization at all.
Either way, we are not getting the full benefit of the regularization. We’re just dividing our data into two blocks and taking the smaller mean as our estimate.
This tells us that it is really important to have a separate test set to evaluate how well our learned function actually behaves. The performance on the validation set alone can be misleading.
The issue in this problem arises from the fact that there is only one parameter and the single hyperparameter can essentially overpower it. In most practical machine learning settings, we have far more parameters than hyperparameters and so the hyperparameters don’t have this kind of “veto power” over the training data.
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 18
6 Hyperparameter Selection for Ridge Regression
In the previous question, we understood the case of learning a constant predictor. In this problem, we will increase the complexity of the model that we are learning so that we can better understand the role of hyperparameter selection.
(a) For the first part of this problem, we will focus on the special case when d = 1, working with the observation model:
y = xw + ε
where w is the true parameter.
Assume that we have n observations, each with xi of dimension d = 1, so x is an n × d matrix, just a column. ε is some random source of error in our observations. For this problem, we will model each component of the error as an independent Gaussian random variable, such that
εi ∼N(0,σ2).
Let wˆ (λ) be the ridge-regression estimate. If w is the true parameter value that generated the data, give an expression for E[(wˆ(λ) − w)2] and an expression for the optimal choice of λ as a possible function of n, x, w, σ2.
Of course, in many machine-learning contexts, you don’t already know w or σ2. This part is just asking you to understand how the in-principle optimal value of λ depends on these parameters.
Solution:
We know that
wˆ ( λ ) = ( X T X + λ I ) − 1 X T y .
is the solution to the standard ridge-regression problem which minimizes E[(wˆ (λ) − w)2 ] for a
given λ.
We want to find λ∗ which minimizes this expectation, so let’s start by plugging in what we
know.
By direct substitution, we see that
2T −1T T −1T 2 E wˆ(λ)−w =E (x x+λ) x xw+(x x+λ) x ε−w
T −1T T −12 =E (x x+λ) x ε−(x x+λ) λw
= σ2(xT x + λ)−2xT x + ((xT x + λ)−1λw)2.
To minimize this quantity, we can differentiate with respect to λ and equate to zero. This yields
the condition
2xTx 2λ 2λ2 σ2 −(xTx+λ)3σ2+(xTx+λ)2w2−(xTx+λ)3w2 =0 =⇒ λ∗ =w2.
Notice the similarity to what you saw in the previous problem.
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 19
(b) Now, we will consider the case when d > 1, so each of our xi have d components, not scalars. Similarly, the true weight vector w has d components. The observation noise model is the same as the previous part.
To make the calculations more tractable, we will consider the special case when all the columns of our training data matrix X are orthonormal, so XT X = I. (For example, from the previous homework, you should see how this situation can arise if the features are Fourier features and the training samples are taken in a regularly spaced manner.)
We are going to keep to a single scalar hyperparameter λ and must find a single λ which does a good job of balancing the needs of the different components of w.
Solve for the scalar value of λ that minimizes E[(wˆ (λ) − w)2] exactly, as a function of n, d, w, λ, and σ2.
Solution: Repeating our solution, we find that
2T −1T T −1T 2
E wˆ(λ)−w =E (X X+λI) X Xw+(X X+λI) X ε−w −1 T −1 2
=E (1+λ) X ε−(1+λ) λw = dσ2(1 + λ)−2 + ((1 + λ)−1λw)2.
Differentiating with respect to λ and equating to zero, we obtain
2d 2λ 2λ2 dσ2
−(1+λ)3σ2+(1+λ)2wTw−(1+λ)3wTw=0 =⇒ λ∗ =wTw.
(c) In the cases considered in the previous parts, we were able to analytically compute the optimal value of λ. Can we use that closed form expession in practice? Unfortunately not, as it depends on w, which is of course exactly the quantity that we are trying to estimate. Instead, in the real- world, we have to conduct hyperparameter tuning to pick a reasonable λ in a data driven way. Essentially, we split our dataset into two groups: training and validation. For various values of λ, we train our model on the training set, and observe its performance on the validation set. Then, we choose the value of λ that minimizes the error on the validation set, and use that λ. How do we use that λ? For example, we can use it to train our model on the entire input dataset.
Using this method, we’d expect the optimized value of λ to approximately match the predic- tions from the previous parts.
Fill in the blanks in the Jupyter notebook to compute the optimal value λopt for a toy model, then fill in the blanks to implement hyperparameter tuning, and compute the tuned value of λtuned. Report both values here for some σ and validation fraction, and compare them. Also report the σ and validation fraction you chose.
[Hint: You should expect λopt ≈ λtuned. If they differ, you have done something wrong, so you can use this part to verify your answers to the previous parts.]
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 20
7
Solution: For a particular choice of random seed, we obtain λopt = 0.357 and λtuned ≈ 0.349, which are fairly close. These values are for σ = 0.6 with a validation fraction of 0.5. Your numbers will differ, but you should expect to again see that λopt ≈ λtuned most of the time. You can confirm this by looking at the λ vs frequency plot in the notebook.
However, from that plot, you should see that it is not unlikely for λtuned and λpred to differ significantly, even for large n. This is because the exact choice of λ does not have a huge impact on the expected validation error, as you should be able to see from the plot of λ against validation error, as the classifier obtained from ridge regression works reasonably well for a wide range of λ. This is reassuring, since in real life we can never calculate the optimal λ and we hope that as long as we are reasonably close our error will not change much. However, this is not always the case. There are many cases where the choice of hyperparameter has a strong influence on the final error and tuning a hyperparameter requires an exhaustive search and huge amounts of computation.
Ridge Regression Through Feature Augmentation
In Homework 1, you showed that ridge regression can be accomplished through augmenting the feature space of a least-squares problem with an appropriately scaled identity. In this problem, you will see this equivalence on a real dataset, and practice predicting with a model trained using such augmented features.
Before beginning the body of this problem, you should copy over helper functions from Problem 7.
(a) First, you will perform hyperparameter tuning and explicit ridge regression to establish a base- line for later parts. Report the tuned λ you obtain through validation here, and your final MSE on the training data.
Solution: Your training MSE should be around 0.73. Code:
(b) Next, augment the feature space of the training data with an appropriate identity and use the minimum-norm least-squares solution to find wˆ . You should find that you reach an identical parameter vector, but a different training MSE. Explain why the training MSE is different when solving with this method of regularization.
Solution: The training MSE should be almost exactly 0. If we had infinite numerical precision it would of course be exactly 0. The reason the MSE is 0 is that all of the residuals are absorbed by the augmented features. If you compare the fake features in η to the residuals from the explicit ridge regression part, you will see that they are exactly proportional.
lambd = 40
w_rr = ridge_regress(X, y, lambd)
mse = np.mean((X @ w_rr – y) ** 2)
print(“MSE:”, mse)
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 21
Augmentation code:
lambd = 40
alpha = np.sqrt(lambd)
X_aug[:, :d_raw] = X
X_aug[:, d_raw:] = np.eye(n) * alpha
Min-norm LS code:
η and wˆ plots code:
(c) Now that we have shown the equivalence of our two regression methods, practice predicting on the test set by extracting wˆ from η and by appropriately augmenting the test data. Report what you augmented the test set with. You should see different training MSEs but identical test MSEs with the two methods.
One of the important points of this question is to help you understand what the concept of training MSE is, and its dependence on context.
Solution: The test set should be augmented with 0s, since the scaled residuals from training should not be useful predictors for testing.
Prediction method 1 code:
Prediction method 2 code:
# Min-norm solution for underdetermined (fat X) system
eta = X_aug.T @ np.linalg.solve(np.dot(X_aug, X_aug.T), y)
residuals = np.dot(X_aug, eta) – y
epsilon = np.linalg.norm(residuals)**2
print(“regression residual squared error is {}”.format(epsilon))
print(“weights are {}”.format(eta))
plt.plot(eta)
plt.title(“Min-Norm LS eta”)
plt.show()
plt.plot(eta[:d_raw])
plt.title(“Weights for original features”)
w = eta[:d_raw]
train_mse = np.mean((X @ w – y) ** 2)
print(“MSE on the train dataset is {}”.format(train_mse))
test_mse = np.mean((X_test @ w – y_test) ** 2)
print(“MSE on the test dataset is {}”.format(test_mse))
train_mse = np.mean((X_aug @ eta – y) ** 2)
print(“MSE on the train dataset is {}”.format(train_mse))
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 22
# Augment test data **with zeros**
test_augmented = np.zeros((n_test, d_raw + n_test))
test_augmented[:, :d_raw] = X_test
# What is the new eta? Just the first d_raw + n_test features from
→ train eta
test_eta = eta[:d_raw + n_test]
test_mse = np.mean((test_augmented @ test_eta – y_test) ** 2) print(“MSE on the test dataset is {}”.format(test_mse))
(d)
Finally, try augmenting the test set with various scaled random variables on the diagonal. Try different distributions and scalings describe how the augmented features impact the test performance as you vary the scale and distribution.
This part of this problem is somewhat open ended, but your explorations here are setting the stage for a problem that will come in a later homework that connects to a very important issue in modern machine learning.
Solution: As long as the random augmentations are small, they have almost no impact on the test performance. A natural distribution to use is the exponential distribution since its values, like our identity matrix, are strictly positive. However, you should find that a Gaussian distribution scales similarly since the residuals are roughly 0-mean.
Code:
divs = np.logspace(0, 3, 50)
for div in divs:
#
test_augmented[:, d_raw:] = np.diag(np.random.exponential(alpha / → div, n_test))
test_augmented[:, d_raw:] = np.diag(np.random.randn(n_test)) * → alpha / div
test_mse = np.mean((test_augmented @ test_eta – y_test) ** 2)
test_mses.append(test_mse)
scalings = alpha / divs
8
Your Own Question
Write your own question, and provide a thorough solution.
Writing your own problems is a very important way to really learn the material. The famous “Bloom’s Taxonomy” that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top-level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself. (e.g. make yourself flashcards) But we don’t want the same to be true about the highest level.
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 23
As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don’t have to achieve this every week. But unless you try every week, it probably won’t happen ever.
9 Exam Policy and Practice
Please read through the entirety of the EECS189 exam proctoring policy (click here) carefully before proceeding. This question is designed to familiarize you with some of the things you will have to do during the exam.
(a) After reading through the Exam Policy carefully, please answer the following questions.
(i) Given you experience no disruptions during the exam, how many total minutes do you have for scanning and submission? Does it matter if you are using a tablet or whether you are using paper to write your answers?
(ii) Are you required to record locally during the exam? How much space should you have available on your computer for a local recording?
(iii) How should you contact the course staff for an emergency situation during the exam?
Solution:
(i) You have a total of 20 minutes for scanning and submission if you experience no disrup- tion and are using paper. 10 minutes if you are using a tablet. If you experience x minutes of disruption during the exam, you may work for min(x, 15) minutes past the end of the exam.
(ii) You are not required to record locally; you may do a Zoom cloud recording. You should have 5 GB available on your computer if you are doing a local recording.
(iii) You should contact the course staff by making a private post on Piazza. However, you should not be looking at Piazza during the exam other than to make a private post in the case of an emergency.
(b) Please configure your Zoom link.
(i) Fill out the following Google Form (click here) to submit the Zoom link you will be using. You must use this Zoom link for this assignment as well as for the exams. This can easily be done by submitting your Personal Meeting Room link and setting your Personal Meeting ID as your default on all devices you will be using for the final.
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 24
(ii) Ensure that anyone can join your Zoom link and that there is no waiting room for your Zoom meeting. You can try to do this by entering the meeting on one device that is logged in to your official Berkeley Zoom account and then entering the meeting on another de- vice that is logged into some other Zoom account. If you are able to do that, then your Zoom link is joinable. If you are not put into a waiting room, then your Zoom meeting will not have a waiting room. (You might want to have a member of your study group try this out with you if you don’t already have two Zoom accounts.)
Solution: Ensure your Zoom link is joinable and that the Zoom link in the form is the correct link which you will be using for the exam.
(c) You will now conduct a Zoom recording. You should use this recording to work through a homework problem or other study material to simulate the actual circumstances of the final exam.
(i) Start the Zoom call for the link you provided above. Turn on your microphone and recording device (webcam, phone camera). Turn off your speaker. Share your entire desktop (not just a particular window).
(ii) Start recording via Zoom. You may record locally or on the cloud (see the exam policy document for details).
(iii) HoldyourCalIDnexttoyourfaceandrecordyourselfsayingyournameintothewebcam. Both your face and your entire CalID should be visible in the video. We should be able to read your name and SID. This step should take at least 3 seconds. See figure 2. If you do not have a CalID for some reason, please hold up some document which has an image of you and proves your identity, such as a driver’s license.
Figure 6: ID card demonstration. Do not actually black out your SID and name.
(iv) Turn your recording device (webcam, phone) around 360◦ slowly so that we can see your entire room clearly. There should be no uncovered screens anywhere in the room during your exam. Only admin TAs and instructors will be able to see your videos (both for this assignment and for the actual exams).
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 25
(v) Position your recording device in such a way that we can see your workspace and your hands. We don’t care if we can see your face or not. If you are not using your phone to record your workspace, then it should be visible in the recording, face down. See figure 3. Think during this test run. On the actual exam, you will want to use the desktop to see the exam itself. If you are using your laptop’s built-in camera, the angle might be bad. In this case, think about how you might position the laptop or consider using your phone or an external webcam on a stand to record your workspace. We want you to iron out these details ahead of the actual exam so that the exam itself has no additional stress due to proctoring logistics.
Figure 7: Demonstration of taking your exam. Your setup should look like this while you are taking the exam.
(vi) Your microphone should be on at all times. We should be able to see the time on your desktop at all times.
(vii) Record for a full two and a half hours. You should use this time to work through a homework problem or other study material for the course. The more realistic it is to actually taking an exam, the better practice it will be for you. (We also want to make sure that your computer can handle the video encoding task if your are doing a local recording.)
(viii) After two and a half hours, stop the recording. Check your recording to confirm that it contains your video as well as your desktop throughout its duration. Upload your video to Google drive and submit the link to the video using this Google Form (click here). You must make sure that the link sharing permissions are set so that we may view the video.
Solution: Ensure that you were able to submit your video to the Google form.
(d) A Midterm Google document should be shared with you with instructions for the midterm and the time which you will start taking the exam a few days before the midterm. More details will be made available closer to the exam date.
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 26
Link for policy:
https://docs.google.com/document/d/14qO3xRt724j8Fs12i9d4E-pMWuyqk4h8b5gBbnd3hVo/
edit
Form to submit Zoom link:
https://forms.gle/P7CDYVpaqLBV4zez5
Form to submit video link:
https://forms.gle/F5V1J89NKf21Rj7Y6
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 27
Contributors:
• Alexander Tsigler • Anant Sahai
• Ashwin Pananjady • Chawin Sitawarin • Josh Sanz
• Kate Sanders • Khalil Sarwari • Lydia T. Liu
• Peter Wang
• Philipp Moritz • Rahul Arya
• Satish Rao
• Yichao Zhou
HW2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 28