HOMEWORK 4: LOGISTIC REGRESSION
10-301 / 10-601 Introduction to Machine Learning
http://www.cs.cmu.edu/~mgormley/courses/10601/
OUT: Friday, February 18th DUE: Sunday, February 27th TAs: Sana, Hayden, Prasoon, Tori,
START HERE: Instructions
• Collaboration Policy: Please read the collaboration policy here: http://www.cs.cmu.edu/~mgormley/courses/10601/syllabus.html
• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/~mgormley/courses/10601/syllabus.html
• Submitting your work: You will use Gradescope to submit answers to all questions and code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.
– Written: For written problems such as short answer, multiple choice, derivations, proofs, or plots, please use the provided template. Submissions can be handwritten onto the template, but should be labeled and clearly legible. If your writing is not legible, you will not be awarded marks. Alternatively, submissions can be written in LaTeX. Each derivation/proof should be completed in the boxes provided. You are responsible for ensuring that your submission contains exactly the same number of pages and the same alignment as our PDF template. If you do not follow the template, your assignment may not be graded correctly by our AI assisted grader.
– Programming: You will submit your code for programming questions on the homework to Gradescope (https://gradescope.com). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). When you are developing, check that the version number of the programming language environment (e.g. Python 3.9.6) and versions of permitted libraries (e.g. numpy 1.21.2 and scipy 1.7.1) match those used on Gradescope. You have 10 free Gradescope programming submissions. After 10 submissions, you will begin to lose points from your total programming score. We recommend debugging your implementation on your local machine (or the Linux servers) and making sure your code is running correctly first before submitting your code to Gradescope.
• Materials: The data that you will need in order to complete this assignment is posted along with the writeup and template on the course website.
In this assignment, you will build a sentiment polarity analyzer, which will be capable of analyzing the overall sentiment polarity (positive or negative) of a movie review. In the Written component, you will warm up by deriving stochastic gradient descent updates for logistic regression. Then in the Programming component, you will implement a logistic regression model as the core of your natural language processing system.
Instructions for Specific Problem Types
For “Select One” questions, please fill in the appropriate bubble completely. If you need to change your answer, you may cross out the previous answer and bubble in the new answer.

For “Select all that apply” questions, please fill in all appropriate squares completely. Again, if you need to change your answer, you may cross out the previous answer(s) and fill in the new answer(s).

For questions where you must fill in a blank, please make sure your final answer is fully included in the given space. You may cross out answers or parts of answers, but the final answer must still be within the given space.
Written Questions (51 points)

1 LaTeX Bonus Point (1 point)
1. (1 point) Select one: Did you use LaTeX for the entire written portion of this homework?
⃝ Yes
⃝ No
2 Linear Regression (5 points)
1. We would like to fit a linear regression model to the dataset
$D = \left\{ \left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \cdots, \left(x^{(N)}, y^{(N)}\right) \right\}$ with $x^{(i)} \in \mathbb{R}^M$ by minimizing the ordinary least squares (OLS) objective function:
$$J(w) = \frac{1}{2} \sum_{i=1}^{N} \left( y^{(i)} - \sum_{j=1}^{M} w_j x_j^{(i)} \right)^2$$

(a) (2 points) Select one: Specifically, we solve for each coefficient $w_k$ ($1 \le k \le M$) by deriving an expression for $w_k$ from the critical point $\frac{\partial J(w)}{\partial w_k} = 0$. What is the expression for each $w_k$ in terms of the dataset $\left(x^{(1)}, y^{(1)}\right), \cdots, \left(x^{(N)}, y^{(N)}\right)$ and $w_1, \cdots, w_{k-1}, w_{k+1}, \cdots, w_M$?

⃝ $w_k = \dfrac{\sum_{i=1}^{N} x_k^{(i)} \left( y^{(i)} - \sum_{j=1, j \neq k}^{M} w_j x_j^{(i)} \right)}{\sum_{i=1}^{N} \left( x_k^{(i)} \right)^2}$

⃝ $w_k = \dfrac{\sum_{i=1}^{N} x_k^{(i)} \left( y^{(i)} - \sum_{j=1, j \neq k}^{M} w_j x_j^{(i)} \right)}{\sum_{i=1}^{N} \left( y^{(i)} \right)^2}$

⃝ $w_k = \sum_{i=1}^{N} x_k^{(i)} \left( y^{(i)} - \sum_{j=1, j \neq k}^{M} w_j x_j^{(i)} \right)$

⃝ $w_k = \dfrac{\sum_{i=1}^{N} x_k^{(i)} \left( y^{(i)} - \sum_{j=1, j \neq k}^{M} w_j x_j^{(i)} \right)}{\sum_{i=1}^{N} \left( x_k^{(i)} y^{(i)} \right)^2}$

(b) (1 point) Select one: How many coefficients ($w_k$) do you need to estimate? When solving for these coefficients, how many equations do you have?

⃝ M coefficients, M equations
⃝ M coefficients, N equations
⃝ N coefficients, M equations
⃝ N coefficients, N equations
2. (2 points) Consider a dataset $D$ such that we fit a line $y = w_1 x + b_1$. Let $\bar{x}$ and $\bar{y}$ be the means of the $x$ and $y$ coordinates, respectively. After mean-centering the dataset to create $D_{\text{new}} = \left\{ (x^{(1)} - \bar{x}, y^{(1)} - \bar{y}), \ldots, (x^{(n)} - \bar{x}, y^{(n)} - \bar{y}) \right\}$, let the solution to linear regression on $D_{\text{new}}$ be $y = w_2 x + b_2$. Explain how $w_2$ compares to $w_1$ and justify your answer.
Your Answer
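If it helps to check your reasoning empirically, here is a minimal NumPy sketch (an optional aid, not part of the required answer; the data values are made up for illustration) that fits a least-squares line before and after mean-centering so you can compare the slopes and intercepts:

import numpy as np

# Hypothetical toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = w1*x + b1 on the raw data (degree-1 least squares)
w1, b1 = np.polyfit(x, y, deg=1)

# Mean-center both coordinates, then fit y = w2*x + b2
w2, b2 = np.polyfit(x - x.mean(), y - y.mean(), deg=1)

print(w1, w2)  # compare the two slopes
print(b1, b2)  # compare the two intercepts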
3 Logistic Regression: Warm-Up (7 points)

1. (2 points) Select all that apply: Which of the following are true about logistic regression?

□ Our formulation of binary logistic regression will work with both continuous and binary features.
□ Binary logistic regression will form a linear decision boundary in our feature space.
□ The function $\sigma(x) = \frac{1}{1 + e^{-x}}$ is convex.
□ The negative log-likelihood function for logistic regression, $-\frac{1}{N} \sum_{i=1}^{N} \log(\sigma(x^{(i)}))$, is not convex, so gradient descent may get stuck in a sub-optimal local minimum.
□ None of the above.

2. (1 point) Select one: The negative log-likelihood $J(\theta)$ for binary logistic regression can be expressed as
$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ -y^{(i)} \theta^T x^{(i)} + \log\left( 1 + \exp(\theta^T x^{(i)}) \right) \right]$$
where $x^{(i)} \in \mathbb{R}^{M+1}$ is the column vector of the feature values of the $i$-th data point, $y^{(i)} \in \{0, 1\}$ is the $i$-th class label, and $\theta \in \mathbb{R}^{M+1}$ is the weight vector. When we want to perform logistic ridge regression (i.e., with $\ell_2$ regularization), we modify our objective function to be
$$f(\theta) = J(\theta) + \frac{\lambda}{2} \sum_j \theta_j^2$$
where $\lambda$ is the regularization weight and $\theta_j$ is the $j$-th element of the weight vector $\theta$. Suppose we are updating $\theta_k$ with learning rate $\alpha$; which of the following is the correct expression for the update?

⃝ $\theta_k \leftarrow \theta_k + \alpha \frac{\partial f(\theta)}{\partial \theta_k}$, where $\frac{\partial f(\theta)}{\partial \theta_k} = \frac{1}{N} \sum_{i=1}^{N} x_k^{(i)} \left( y^{(i)} - \frac{\exp(\theta^T x^{(i)})}{1 + \exp(\theta^T x^{(i)})} \right) + \lambda \theta_k$

⃝ $\theta_k \leftarrow \theta_k + \alpha \frac{\partial f(\theta)}{\partial \theta_k}$, where $\frac{\partial f(\theta)}{\partial \theta_k} = \frac{1}{N} \sum_{i=1}^{N} x_k^{(i)} \left( -y^{(i)} + \frac{\exp(\theta^T x^{(i)})}{1 + \exp(\theta^T x^{(i)})} \right) - \lambda \theta_k$

⃝ $\theta_k \leftarrow \theta_k - \alpha \frac{\partial f(\theta)}{\partial \theta_k}$, where $\frac{\partial f(\theta)}{\partial \theta_k} = \frac{1}{N} \sum_{i=1}^{N} x_k^{(i)} \left( -y^{(i)} + \frac{\exp(\theta^T x^{(i)})}{1 + \exp(\theta^T x^{(i)})} \right) + \lambda \theta_k$

⃝ $\theta_k \leftarrow \theta_k - \alpha \frac{\partial f(\theta)}{\partial \theta_k}$, where $\frac{\partial f(\theta)}{\partial \theta_k} = \frac{1}{N} \sum_{i=1}^{N} x_k^{(i)} \left( -y^{(i)} - \frac{\exp(\theta^T x^{(i)})}{1 + \exp(\theta^T x^{(i)})} \right) + \lambda \theta_k$
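Whichever update you select, a candidate gradient can be sanity-checked numerically. The sketch below is purely illustrative (the data, λ, and tolerance are my own choices, not part of the assignment): it compares an analytic gradient of the ℓ2-regularized negative log-likelihood against a centered finite-difference estimate.

import numpy as np

def f(theta, X, y, lam):
    """Regularized objective f(theta) = J(theta) + (lam/2) * sum(theta**2)."""
    z = X @ theta
    J = np.mean(-y * z + np.log1p(np.exp(z)))
    return J + 0.5 * lam * np.sum(theta ** 2)

def grad_f(theta, X, y, lam):
    """Analytic gradient: (1/N) * X^T (sigmoid(X theta) - y) + lam * theta."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y) / len(y) + lam * theta

# Made-up data, just to exercise the check
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0., 1., 1., 0., 1.])
theta, lam, eps = rng.normal(size=3), 0.1, 1e-6

# Centered finite differences, one coordinate at a time
num = np.zeros_like(theta)
for k in range(len(theta)):
    e = np.zeros_like(theta)
    e[k] = eps
    num[k] = (f(theta + e, X, y, lam) - f(theta - e, X, y, lam)) / (2 * eps)

print(np.allclose(num, grad_f(theta, X, y, lam), atol=1e-5))  # expect True

This kind of finite-difference check is also a useful debugging tool later, in the programming component.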
3. (2 points) Data is separable in one dimension if there exists a threshold t such that all values less than t have one class label and all values greater than or equal to t have the other class label. If you train an unregularized logistic regression model for infinite iterations on training data that is separable in at least one dimension, the corresponding weight(s) can go to infinity in magnitude. What is an explanation for this phenomenon?
Hint: Think about what happens to the probabilities if we train an unregularized logistic regression model, and the role of the weights when calculating such probabilities.
Your Answer
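The phenomenon itself is easy to reproduce. The snippet below is a small illustrative experiment of my own (not part of the assignment): unregularized logistic regression trained by gradient descent on a 1-D separable dataset, printing the weight as training proceeds; the weight keeps growing as the model pushes its predicted probabilities toward 0 and 1.

import numpy as np

# 1-D separable data: all x < 0 are class 0, all x > 0 are class 1
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0., 0., 0., 1., 1., 1.])

w, lr = 0.0, 1.0
for step in range(1, 50001):
    p = 1.0 / (1.0 + np.exp(-w * x))   # predicted P(y=1 | x)
    w -= lr * np.mean(x * (p - y))     # unregularized NLL gradient step
    if step % 10000 == 0:
        print(step, w)                 # w grows without bound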
4. (2 points) Select all that apply: How does regularization (such as $\ell_1$ and $\ell_2$) help correct the problem in the previous question?
□ $\ell_1$ regularization prevents weights from going to infinity by penalizing the count of non-zero weights.
□ $\ell_1$ regularization prevents weights from going to infinity by reducing some of the weights to 0, effectively removing some of the features.
□ $\ell_2$ regularization prevents weights from going to infinity by reducing the value of some of the weights to close to 0 (reducing the effect of a feature but not necessarily removing it).
□ None of the above.
4 Logistic Regression: Small Dataset (5 points)
The following questions should be completed before you start the programming component of this assignment.
The following dataset consists of 4 training examples, where $x_k^{(i)}$ denotes the $k$-th dimension of the $i$-th training example $x^{(i)}$, and $y^{(i)}$ is the corresponding label ($k \in \{1, 2, 3\}$ and $i \in \{1, 2, 3, 4\}$).

i   x1   x2   x3   y
1   0    0    1    0
2   0    1    0    1
3   0    1    1    1
4   1    0    0    0

A binary logistic regression model is trained on this dataset, and the parameter vector $\theta$ after training is $\theta = \begin{bmatrix} 1.5 & 2 & 1 \end{bmatrix}^T$.
Note: There is no intercept term used in this problem.
Use the data above to answer the following questions. For all numerical answers, please use one number rounded to the fourth decimal place; e.g., 0.1234. Showing your work in these questions is optional, but it is recommended to help us understand where any misconceptions may occur.
1. (2 points) Calculate $J(\theta)$, $\frac{1}{N}$ times the negative log-likelihood over the given data, for the trained parameters above. (Note that here we are using the natural log, i.e., the base is $e$.)
2. (2 points) Calculate the gradients $\frac{\partial J(\theta)}{\partial \theta_j}$ with respect to $\theta_j$ for all $j \in \{1, 2, 3\}$.
$\partial J(\theta)/\partial \theta_1$ =
$\partial J(\theta)/\partial \theta_2$ =
$\partial J(\theta)/\partial \theta_3$ =
3. (1 point) Update the parameters following the parameter update step $\theta_j \leftarrow \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$ and write the updated (numerical) value of the vector $\theta$. Use learning rate $\alpha = 1$.
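If you want to verify your hand calculations for this section, the sketch below (an optional checking aid; showing work is optional per the instructions above) computes the average negative log-likelihood, its gradient, and one gradient-descent update for the dataset and $\theta$ given above:

import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 0],
              [0, 1, 1],
              [1, 0, 0]], dtype=np.float64)
y = np.array([0., 1., 1., 0.])
theta = np.array([1.5, 2.0, 1.0])
alpha = 1.0  # learning rate from question 3

z = X @ theta
J = np.mean(-y * z + np.log1p(np.exp(z)))          # average negative log-likelihood
grad = X.T @ (1 / (1 + np.exp(-z)) - y) / len(y)   # dJ/dtheta_j for j = 1, 2, 3
theta_new = theta - alpha * grad                   # one full-batch update

print(round(J, 4), np.round(grad, 4), np.round(theta_new, 4))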
5 Logistic Regression: Adversarial Attack (7 points)
An image can be represented numerically as a vector of values for each pixel. Image classification tasks then use this vector of pixel values as features to predict an image label.
An automobile company is trying to gather data by asking participants to submit grayscale images of cars. Each pixel has an intensity value in the continuous range [0, 1], zero being the darkest. The company then runs a logistic regression model to predict if the photo actually contains a car. After training the model on a training dataset, the company achieves a mediocre test error. The company wants to improve the model and offers monetary compensation to people who can submit photos that contain a car and make the model predict “false” (i.e., a false negative), as well as photos that do not contain a car and make the model predict “true” (i.e., a false positive). Furthermore, the company releases the parameters of their learned logistic regression model. Let’s investigate how to use these parameters to understand the model’s weaknesses.
1. (2 points) Given the company’s model parameters θ (i.e., the logistic regression coefficients), gradient ascent can be used to find the vector of pixel values that maximizes the “car” prediction. What is the gradient update rule to do so? Write (1) the objective function, (2) the gradient of the objective function with respect to the feature values, and (3) the update rule. Use x as the input vector. Hint: You are updating x to produce an input that confuses the model.
Your Answer
2. (1 point) Modify the procedure in the previous question to find the image that minimizes the “car” prediction.
Your Answer
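For concreteness, here is one possible shape such a procedure could take, assuming the objective is taken to be the model's predicted probability σ(θᵀx). This is an illustrative sketch of my own, not necessarily the exact formulation expected in the boxes above:

import numpy as np

def adversarial_ascent(theta, x0, lr=0.1, steps=100):
    """Gradient ascent on the input pixels x to increase sigma(theta @ x).

    Assumes theta and x0 are 1-D float ndarrays of equal length; illustrative only.
    """
    x = x0.astype(np.float64).copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(theta @ x)))  # current "car" probability
        x += lr * p * (1 - p) * theta           # d/dx sigma(theta@x) = sigma'(z) * theta
        # To minimize the "car" prediction instead, subtract the gradient.
    return x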
3. (2 points) To generate an image, we require the feature values to be in the range [0, 1]. Propose a different procedure that optimizes the “car” prediction subject to this constraint and does not require a gradient calculation. What is the runtime of this procedure?
Your Answer
4. (2 points) Select all that apply: Now let’s consider whether logistic regression is well-suited for this task. Suppose the exact same white car in a dark background was used to generate the training set. The training photos were captured with the side view of the car centered in the photo at a distance of between 30-50 meters from the camera. Which (if any) of the below descriptions of a test image would the model predict as “car”?
□ A new photo with the same car centered and 60 meters away from the camera.
□ A new photo with the same car in the upper right corner of the image.
□ Identical to one of the training photos, but the car replaced with an equal size white cardboard cutout of the car.
□ Identical to one of the training photos, but the background changed to white.
□ None of the above.
6 Vectorization and Pseudocode (10 points)
The following questions should be completed before you start the programming component of this assignment. Assume the dtypes of all ndarrays are np.float64. Vectors are 1D ndarrays.
1. (2 points) Select all that apply: Consider a matrix $X \in \mathbb{R}^{N \times M}$ and vector $v \in \mathbb{R}^M$. We can create a new vector $u \in \mathbb{R}^N$ whose $i$-th element is the dot product between $v$ and the $i$-th row of $X$ using NumPy as follows:

# X and v are numpy ndarrays
# X.shape == (N, M), v.shape == (M,)
u = np.zeros(X.shape[0])
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        u[i] += X[i, j] * v[j]

Which of the following produces the same result?

□ u = np.dot(X, v)
□ u = np.dot(v, X)
□ u = np.matmul(X, v)
□ u = np.matmul(v, X)
□ u = X * v
□ u = v * X
□ None of the above.
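A practical way to evaluate each candidate is a small test harness (my own construction, not part of the assignment): compute the reference result with the loop, then compare each option against it with np.allclose.

# Harness to test any candidate expression against the loop above (illustrative)
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3
X, v = rng.normal(size=(N, M)), rng.normal(size=M)

u_ref = np.zeros(N)
for i in range(N):
    for j in range(M):
        u_ref[i] += X[i, j] * v[j]

candidate = np.dot(X, v)  # swap in each option; some options may even raise an error
print(np.allclose(candidate, u_ref), candidate.shape)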
2. Consider a matrix $X \in \mathbb{R}^{N \times M}$ and vector $w \in \mathbb{R}^N$. Let
$$\Omega = \sum_{i=0}^{N-1} w_i (x_i - \bar{x}_i)(x_i - \bar{x}_i)^T$$
where $x_i \in \mathbb{R}^M$ is the column vector denoting the $i$-th row of $X$, $\bar{x}_i \in \mathbb{R}$ is the mean of $x_i$, and $w_i \in \mathbb{R}$ is the $i$-th element of $w$ ($i \in \{0, 1, \cdots, N-1\}$). For the following questions, use X and w for $X$ and $w_i$, respectively; X.shape == (N, M), w.shape == (N,). You must use NumPy and vectorize your code for full credit. Do not use functions which are essentially wrappers for Python loops and provide little performance gain, such as np.vectorize.
(a) (2 points) Write one line of valid Python code that constructs a matrix whose $i$-th row is $(x_i - \bar{x}_i)^T$.

Your Answer (CASE SENSITIVE)
(b) (2 points) Assume the result from (a) is stored in M. Write one line of valid Python code that computes Ω from M.
Your Answer (CASE SENSITIVE)
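To verify your one-line vectorized answers, you can compare them against a direct loop over the definition of $\Omega$. The reference implementation below is a checking aid only (my own addition); it would not receive credit itself, since the question requires vectorized code.

# Reference (non-vectorized) computation of Omega, for checking only
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3
X, w = rng.normal(size=(N, M)), rng.normal(size=N)

Omega_ref = np.zeros((M, M))
for i in range(N):
    d = X[i] - X[i].mean()              # x_i minus its (scalar) mean
    Omega_ref += w[i] * np.outer(d, d)  # w_i * (x_i - xbar_i)(x_i - xbar_i)^T

# Compare with np.allclose(Omega_ref, your_vectorized_Omega)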
3. Now we will compare two different optimization methods using pseudocode. Consider a model with parameter $\theta \in \mathbb{R}^M$ being trained with a design matrix $X \in \mathbb{R}^{N \times M}$ and labels $y \in \mathbb{R}^N$. Say we update $\theta$ using the objective function $J(\theta \mid X, y) = \frac{1}{N} \sum_{i=1}^{N} J^{(i)}(\theta \mid x^{(i)}, y^{(i)}) \in \mathbb{R}$. Recall that an epoch refers to one complete cycle through the dataset.
(a) (2 points) Complete the pseudocode for gradient descent.

def dJ(theta, X, y, i):
    (omitted)  # Returns ∂J^(i)(θ | x^(i), y^(i)) / ∂θ
    # You may call this function in your pseudocode.

def GD(theta, X, y, learning_rate):
    for epoch in range(num_epoch):
        # Complete this section with the update rule
    return theta  # return the updated theta

Your Answer (CASE SENSITIVE, 7 lines max)
(b) (2 points) Complete the pseudocode for stochastic gradient descent that samples without replacement.

def dJ(theta, X, y, i):
    (omitted)  # Returns ∂J^(i)(θ | x^(i), y^(i)) / ∂θ
    # You may call this function in your pseudocode.

def SGD(theta, X, y, learning_rate):
    for epoch in range(num_epoch):
        indices = shuffle(range(len(X)))
        for i in indices:
            # Complete this section with the update rule
    return theta  # return the updated theta

Your Answer (CASE SENSITIVE, 7 lines max)
7 Programming Empirical Questions (16 points)
The following questions should be completed as you work through the programming component of this assignment. Please ensure that all plots are computer-generated.
1. (2 points) For Model 1, using the data in the largedata folder in the handout, make a plot that shows the average negative log-likelihood for the training and validation data sets after each of 5,000 epochs. The y-axis should show the negative log-likelihood and the x-axis should show the number of epochs. (Note that running the code for 5,000 epochs might take longer than one minute. This is okay since we won’t run your code for more than 500 epochs during auto-grading.)
Your Answer
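In case it is useful, a minimal matplotlib sketch for this kind of plot follows; the NLL values shown are placeholders that you would replace with the per-epoch numbers your training loop records.

# Illustrative plotting sketch; replace the placeholder values with the
# per-epoch average negative log-likelihoods your training loop records
import matplotlib.pyplot as plt

train_nll = [0.69, 0.60, 0.52, 0.47, 0.43]  # placeholder values
val_nll   = [0.70, 0.63, 0.58, 0.56, 0.55]  # placeholder values

epochs = range(1, len(train_nll) + 1)
plt.plot(epochs, train_nll, label="train")
plt.plot(epochs, val_nll, label="validation")
plt.xlabel("Epoch")
plt.ylabel("Average negative log-likelihood")
plt.legend()
plt.savefig("model1_nll.png")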
2. (2 points) For Model 2, make a plot as in the previous question.

Your Answer
3. (2 points) Write a few sentences explaining the output of the above experiments. In particular, do the training and validation log-likelihood curves look the same, or different? Why?
Your Answer
4. (2 points) Make a table with your train and test error for the large data set (found in the largedata folder in the handout) for each of the two models after running for 5,000 epochs. Please use one number rounded to the fourth decimal place, e.g., 0.1234.
Error Rates
            Train Error    Test Error
Model 1     ?              ?
Model 2     ?              ?

Table 1: “Large Data” Results
5. (2 points) For Model 1, using the data in the largedata folder of the handout, make a plot comparing the training average negative log-likelihood over epochs for three different values of the learning rate, $\alpha \in \{10^{-4}, 10^{-5}, 10^{-6}\}$. The y-axis should show the negative log-likelihood, the x-axis should show the number of epochs (from 1 to 5,000 epochs), and the plot should contain three curves corresponding to the three values of $\alpha$. Provide a legend that indicates the learning rate $\alpha$ for each curve.
Your Answer
6. (2 points) Compare how quickly each curve in the previous question converges.
Your Answer
7. (2 points) Now we will compare the effectiveness of bag-of-words and word2vec. Consider Model 3, which is Model 1 with the size of the dictionary reduced to 300 to match Model 2’s embedding. We provided you the validation average negative log-likelihood over 5,000 epochs in model3_val_nll.txt. Using this, make a plot that compares the validation average negative log-likelihood of all three models over 5,000 epochs. The y-axis should show the negative log-likelihood and the x-axis should show the number of epochs.
Your Answer
8. (2 points) Compare and contrast the performance of the three models based on the curves in the previ- ous question. Recall that a better model is one that attains lower negative log-likelihood faster. Explain the relative difference in performance focusing on the dimensions and design of the input data.
Your Answer
8 Collaboration Questions
After you have completed all other components of this assignment, report your answers to these questions regarding the collaboration policy. Details of the policy can be found in the syllabus: http://www.cs.cmu.edu/~mgormley/courses/10601/syllabus.html
1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.
2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.
3. Did you find or come across code that implements any part of this assignment? If so, include full details.
Your Answer
9 Programming (70 points)
Your goal in this assignment is to implement a working Natural Language Processing (NLP) system using binary logistic regression. Your algorithm will determine whether a movie review is positive or negative. You will also explore various approaches to feature engineering for this task.

Note: Before starting the programming, you should work through the written component to get a good understanding of the important concepts that are useful for this programming component.
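One implementation detail worth noting before you start (a suggestion on my part, not a requirement stated in the handout): computing σ(z) or log(1 + exp(z)) naively can overflow for large |z|, so a numerically stable sigmoid helper is a common building block.

# A numerically stable sigmoid, a common helper for this kind of assignment
# (illustrative; the handout may prescribe its own program structure)
import numpy as np

def sigmoid(z):
    """Compute 1 / (1 + exp(-z)) for a float ndarray z without overflow."""
    out = np.empty_like(z, dtype=np.float64)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])          # safe: z < 0 here, so exp(z) <= 1
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5 1.] with no warnings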