CS229 Problem Set #2 1
CS 229, Autumn 2022 Problem Set #2
Due Wednesday, October 26 at 11:59 pm on Gradescope.
Notes: (1) These questions require thought, but do not require long answers. Please be as
Copyright By PowCoder代写 加微信 powcoder
concise as possible.
(2) If you have a question about this homework, we encourage you to post your question on our
Ed forum, at https://edstem.org/us/courses/30055/discussion/.
(3) If you missed the first lecture or are unfamiliar with the collaboration or honor code policy,
please read the policy on the course website before starting work.
(4) For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) The due date is Wednesday, October 26 at 11:59 pm. If you submit after Wednesday, October 26 at 11:59 pm, you will begin consuming your late days. The late day policy can be found in the course website: Course Logistics and FAQ.
All students must submit an electronic PDF version of the written question including plots generated from the codes. We highly recommend typesetting your solutions via LATEX. All students must also submit a zip file of their source code to Gradescope, which should be created using the make zip.py script. You should make sure to (1) restrict yourself to only using libraries included in the environment.yml file, and (2) make sure your code runs without errors. Your submission may be evaluated by the auto-grader using a private test set, or used for verifying the outputs reported in the writeup. Please make sure that your PDF file and zip file are submitted to the corresponding Gradescope assignments respectively. We reserve the right to not give any points to the written solutions if the associated code is not submitted.
Honor code: We strongly encourage students to form study groups. Students may discuss and work on homework problems in groups. However, each student must write down the solution independently, and without referring to written notes from the joint session. Each student must understand the solution well enough in order to reconstruct it by him/herself. It is an honor code violation to copy, refer to, or look at written or code solutions from a previous year, including but not limited to: official solutions from a previous year, solutions posted online, and solutions you or someone else may have written up in a previous year. Furthermore, it is an honor code violation to post your assignment solutions online, such as on a public git repo. We run plagiarism-detection software on your code against past solutions as well as student submissions from previous years. Please take the time to familiarize yourself with the Stanford Honor Code1 and the Stanford Honor Code2 as it pertains to CS courses.
1https://communitystandards.stanford.edu/policies-and-guidance/honor-code 2https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1164/handouts/honor-code.pdf
CS229 Problem Set #2 2 1. [12 points] Logistic Regression: Training stability
In this problem, we will be delving deeper into the workings of logistic regression. The goal of this problem is to help you develop your skills debugging machine learning algorithms (which can be very different from debugging software in general).
We have provided an implementation of logistic regression in src/stability/stability.py, and two labeled datasets A and B in src/stability/ds1 a.csv and src/stability/ds1 b.csv.
Please do not modify the code for the logistic regression training algorithm for this problem. First, run the given logistic regression code to train two different models on A and B. You can run the code by simply executing python stability.py in the src/stability directory.
(a) [2 points] What is the most notable difference in training the logistic regression model on datasets A and B?
(b) [5 points] Investigate why the training procedure behaves unexpectedly on dataset B, but not on A. Provide hard evidence (in the form of math, code, plots, etc.) to corroborate your hypothesis for the misbehavior. Remember, you should address why your explanation does not apply to A.
Hint: The issue is not a numerical rounding or over/underflow error.
(c) [5 points] For each of these possible modifications, state whether or not it would lead to
the provided training algorithm converging on datasets such as B. Justify your answers.
i. Using a different constant learning rate.
ii. Decreasing the learning rate over time (e.g. scaling the initial learning rate by 1/t2, where t is the number of gradient descent iterations thus far).
iii. Linear scaling of the input features.
iv. Adding a regularization term ∥θ∥2 to the loss function.
v. Adding zero-mean Gaussian noise to the training data or labels.
CS229 Problem Set #2 3
2. [20 points] Spam classification
In this problem, we will use the naive Bayes algorithm to build a spam classifier.
In recent years, spam on electronic media has been a growing concern. Here, we’ll build a classifier to distinguish between real messages, and spam messages. For this class, we will be building a classifier to detect SMS spam messages. We will be using an SMS spam dataset developed by Tiago A. Almedia and Jos ́e Mar ́ıa G ́omez Hidalgo which is publicly available on http://www.dt.fee.unicamp.br/~tiago/smsspamcollection 3
We have split this dataset into training and testing sets and have included them in this assignment as src/spam/spam train.tsv and src/spam/spam test.tsv. See src/spam/spam readme.txt for more details about this dataset. Please refrain from redistributing these dataset files. The goal of this assignment is to build a classifier from scratch that can tell the difference the spam and non-spam messages using the text of the SMS message.
(a) [5 points] Implement code for processing the the spam messages into numpy arrays that can
be fed into machine learning models. Do this by completing the get words, create dictionary, and transform text functions within our provided src/spam.py. Do note the correspond-
ing comments for each function for instructions on what specific processing is required.
The provided code will then run your functions and save the resulting dictionary into spam dictionary and a sample of the resulting training matrix into
spam sample train matrix. In your writeup, report the vocabular size after the pre- processing step. You do not need to include any other output for this subquestion.
(b) [10 points] In this question you are going to implement a naive Bayes classifier for spam classification with multinomial event model and Laplace smoothing.
Code your implementation by completing the fit naive bayes model and
predict from naive bayes model functions in src/spam/spam.py.
Now src/spam/spam.py should be able to train a Naive Bayes model, compute your predic- tion accuracy and then save your resulting predictions to spam naive bayes predictions. In your writeup, report the accuracy of the trained model on the test set.
Remark. If you implement naive Bayes the straightforward way, you will find that the computed p(x|y) = i p(xi|y) often equals zero. This is because p(x|y), which is the product of many numbers less than one, is a very small number. The standard computer representation of real numbers cannot handle numbers that are too small, and instead rounds them off to zero. (This is called “underflow.”) You’ll have to find a way to compute Naive Bayes’ predicted class labels without explicitly representing very small numbers such as p(x|y). [Hint: Think about using logarithms.]
(c) [5 points] Intuitively, some tokens may be particularly indicative of an SMS being in a particular class. We can try to get an informal sense of how indicative token i is for the SPAM class by looking at:
p(xj =i|y=1) P(tokeni|emailisSPAM) logp(xj =i|y=0)=log P(tokeni|emailisNOTSPAM) .
Complete the get top five naive bayes words function within the provided code using the above formula in order to obtain the 5 most indicative tokens. Report the top five words in your writeup.
3Almeida, T.A., Go ́mez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG’11), Mountain View, CA, USA, 2011.
CS229 Problem Set #2 4 3. [15 points] Kernelizing the Perceptron
Let there be a binary classification problem with y ∈ {0,1}. The perceptron uses hypotheses of the form hθ(x) = g(θT x), where g(z) = sign(z) = 1 if z ≥ 0, 0 otherwise. In this problem we will consider a stochastic gradient descent-like implementation of the perceptron algorithm where each update to the parameters θ is made using only one training example. However, unlike stochastic gradient descent, the perceptron algorithm will only make one pass through the entire training set. The update rule for this version of the perceptron algorithm is given by
θ(i+1) := θ(i) + α(y(i+1) − hθ(i) (x(i+1)))x(i+1)
where θ(i) is the value of the parameters after the algorithm has seen the first i training examples.
Prior to seeing any training examples, θ(0) is initialized to ⃗0.
(a) [3 points] Let K be a kernel corresponding to some very high-dimensional feature mapping φ. Suppose φ is so high-dimensional (say, ∞-dimensional) that it’s infeasible to ever represent φ(x) explicitly. Describe how you would apply the “kernel trick” to the perceptron to make it work in the high-dimensional feature space φ, but without ever explicitly computing φ(x). [Note: You don’t have to worry about the intercept term. If you like, think of φ as having the property that φ0(x) = 1 so that this is taken care of.] Your description should specify:
i. [1 points] How you will (implicitly) represent the high-dimensional parameter vector θ(i), including how the initial value θ(0) = 0 is represented (note that θ(i) is now a vector whose dimension is the same as the feature vectors φ(x));
ii. [1 points] How you will efficiently make a prediction on a new input x(i+1). I.e., how
you will compute hθ(i) (x(i+1)) = g(θ(i)T φ(x(i+1))), using your representation of θ(i);
iii. [1 points] How you will modify the update rule given above to perform an update to θ
on a new training example (x(i+1),y(i+1)); i.e., using the update rule corresponding to the feature mapping φ:
θ(i+1) := θ(i) + α(y(i+1) − hθ(i) (x(i+1)))φ(x(i+1))
(b) [10 points] Implement your approach by completing the initial state, predict, and update state methods of src/perceptron/perceptron.py.
We provide three functions to be used as kernel, a dot-product kernel defined as:
K(x, z) = x⊤z, (1) a radial basis function (RBF) kernel, defined as:
K(x,z)=exp − 2σ2 , (2)
and finally the following function:
−1 x=z 0 x ̸= z
Note that the last function is not a kernel function (since its corresponding matrix is not a PSD matrix). However, we are still interested to see what happens when
CS229 Problem Set #2 5
the kernel is invalid. Run src/perceptron/perceptron.py to train kernelized per- ceptrons on src/perceptron/train.csv. The code will then test the perceptron on src/perceptron/test.csv and save the resulting predictions in the src/perceptron/ folder. Plots will also be saved in src/perceptron/.
Include the three plots (corresponding to each of the kernels) in your writeup, and indicate which plot belongs to which function.
(c) [2 points] One of the choices in Q4b completely fails, one works a bit, and one works well in classifying the points. Discuss the performance of different choices and why do they fail or perform well?
CS229 Problem Set #2 6 4. [30 points] Neural Networks: MNIST image classification
Note: This question may requires knowledge on backpropagation covered on Monday of Week 5.
In this problem, you will implement a simple neural network to classify grayscale images of handwritten digits (0 – 9) from the MNIST dataset. The dataset contains 60,000 training images and 10,000 testing images of handwritten digits, 0 – 9. Each image is 28×28 pixels in size, and is generally represented as a flat vector of 784 numbers. It also includes labels for each example, a number indicating the actual digit (0 – 9) handwritten in that image. A sample of a few such images are shown below.
The data and starter code for this problem can be found in
• src/mnist/nn.py
• src/mnist/images train.csv • src/mnist/labels train.csv • src/mnist/images test.csv • src/mnist/labels test.csv
The starter code splits the set of 60,000 training images and labels into a set of 50,000 examples as the training set, and 10,000 examples for dev set.
To start, you will implement a neural network with a single hidden layer and cross entropy loss, and train it with the provided data set. Use the sigmoid function as activation for the hidden layer, and softmax function for the output layer. Recall that for a single example (x,y), the cross entropy loss is:
CE(y,yˆ)=−y logyˆ, kk
where yˆ ∈ RK is the vector of softmax outputs from the model for the training example x, and y∈RK istheground-truthvectorforthetrainingexamplexsuchthaty=[0,…,0,1,0,…,0]⊤ contains a single 1 at the position of the correct class (also called a “one-hot” representation).
CS229 Problem Set #2 7
For clarity, we provide the forward propagation equations below for the neural network with a single hidden layer. We have labeled data (x(i),y(i))ni=1, where x(i) ∈ Rd, and y(i) ∈ RK is a one-hot vector as described above. Let h be the number of hidden units in the neural network, so that weight matrices W[1] ∈ Rd×h and W[2] ∈ Rh×K. We also have biases b[1] ∈ Rh and b[2] ∈ RK . The forward propagation equations for a single input x(i) then are:
a(i) =σW[1]⊤x(i) +b[1]∈Rh z(i) = W [2]⊤a(i) + b[2] ∈ RK
yˆ(i) = softmax(z(i)) ∈ RK
where σ is the sigmoid function.
For n training examples, we average the cross entropy loss over the n examples.
J(W[1],W[2],b[1],b[2]) = CE(y(i),yˆ(i)) = − y(i) logyˆ(i). n nkk
i=1 i=1 k=1
The starter code already converts labels into one hot representations for you.
Instead of batch gradient descent or stochastic gradient descent, the common practice is to use mini-batch gradient descent for deep learning tasks. In this case, the cost function is defined as follows:
where B is the batch size, i.e. the number of training example in each mini-batch.
(a) [5 points]
For a single input example x(i) with one-hot label vector y(i), show that
∇z(i) CE(y(i), yˆ(i)) = yˆ(i) − y(i) ∈ RK where z(i) ∈ RK is the input to the softmax function, i.e.
yˆ(i) = softmax(z(i))
(Note: in deep learning, z(i) is sometimes referred to as the ”logits”.)
Hint: To simplify your answer, it might be convenient to denote the true label of x(i) as l ∈ {1,…,K}. Hence l is the index such that that y(i) = [0,…,0,1,0,…,0]⊤ contains a
single 1 at the l-th position. You may also wish to compute ∂CE(y(i),yˆ(i)) for j ̸= l and ∂z(i)
j = l separately. (b) [15 points]
Implement both forward-propagation and back-propagation for the above loss function JMB = B1 Bi=1 CE(y(i), yˆ(i)). Initialize the weights of the network by sampling values from a standard normal distribution. Initialize the bias/intercept term to 0. Set the num- ber of hidden units to be 300, and learning rate to be 5. Set B = 1, 000 (mini batch size).
CE(y(i),yˆ(i))
CS229 Problem Set #2 8
This means that we train with 1,000 examples in each iteration. Therefore, for each epoch, we need 50 iterations to cover the entire training data. The images are pre-shuffled. So you don’t need to randomly sample the data, and can just create mini-batches sequentially. Train the model with mini-batch gradient descent as described above. Run the training for 30 epochs. At the end of each epoch, calculate the value of loss function averaged over the entire training set, and plot it (y-axis) against the number of epochs (x-axis). In the same image, plot the value of the loss function averaged over the dev set, and plot it against the number of epochs.
Similarly, in a new image, plot the accuracy (on y-axis) over the training set, measured as the fraction of correctly classified examples, versus the number of epochs (x-axis). In the same image, also plot the accuracy over the dev set versus number of epochs.
Submit the two plots (one for loss vs epoch, another for accuracy vs epoch) in your writeup.
Also, at the end of 30 epochs, save the learnt parameters (i.e all the weights and biases) into a file, so that next time you can directly initialize the parameters with these values from the file, rather than re-training all over. You do NOT need to submit these parameters. Hint: Be sure to vectorize your code as much as possible! Training can be very slow otherwise.
(c) [7 points] Now add a regularization term to your cross entropy loss. The loss function will
CE(y(i),yˆ(i)) +λ ||W[1]||2 +||W[2]||2
Be careful not to regularize the bias/intercept term. Set λ to be 0.0001. Implement the regularized version and plot the same figures as part (a). Be careful NOT to include the regularization term to measure the loss value for plotting (i.e., regularization should only be used for gradient calculation for the purpose of training).
Submit the two new plots obtained with regularized training (i.e loss (without regularization term) vs epoch, and accuracy vs epoch) in your writeup. Compare the plots obtained from the regularized model with the plots obtained from the non-regularized model, and summarize your observations in a couple of sentences.
As in the previous part, save the learnt parameters (weights and biases) into a different file so that we can initialize from them next time.
(d) [3 points] All this while you should have stayed away from the test data completely. Now that you have convinced yourself that the model is working as expected (i.e, the observations you made in the previous part matches what you learnt in class about regularization), it is finally time to measure the model performance on the test set. Once we measure the test set performance, we report it (whatever value it may be), and NOT go back and refine the model any further.
Initialize your model from the parameters saved in part (a) (i.e, the non-regularized model), and evaluate the model performance on the test data. Repeat this using the parameters saved in part (b) (i.e, the regularized model).
Report your test accuracy for both regularized model and non-regularized model. Briefly (in one sentence) explain why this outcome makes sense” You should have accuracy close to 0.92870 without regularization, and 0.96760 with regularization. Note: these accuracies assume you implement the code with the matrix dimensions as specified in the comments,
CS229 Problem Set #2 9
which is not the same way as specified in your code. Even if you do not precisely these numbers, you should observe good accuracy and better test accuracy with regularization.
CS229 Problem Set #2 10 5. [12 points] Double Descent on Linear Models
Note: This question may require knowledge on double descent that is covered on Wed of Week 5.
Background: In this question, you will empirically observe the sample-wise double descent phenomenon. That is, the validation losses of some learning algorithms or estimators do not monotonically decrease as we have more training examples, but instead have a curve with two U-shaped parts. The double descent phenomenon can be observed even for simple linear models. In this question, we consider the following setup. Let {(x(i),y(i))}ni=1 be the training dataset. Let X ∈ Rn×d be the matrix representing the inputs (i.e., the i-th row of X corresponds to x(i))), and ⃗y ∈ Rn the vector representing the labels (i.e., the i-th row of ⃗y corresponds to y(i))):
− x(1) −
− x(2) − X=. . .,
of the validation dataset. We assume that the data are generated with d = 500.
In this question, we consider regularized linear regression. For a regularization level λ ≥ 0, defin
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com