HW VI
EE 588: Optimization for the information and data sciences
University of Southern California
Assigned on: November 16, 2018 Due date: beginning of class on November 28, 2018
1- Logistic regression with momentum. In this problem we will try to predict whether points on a two-dimensional graph represent a hill or a valley. We are going to use logistic regression to do this. In logistic regression we have a data set of pairs (x_i, y_i), where x_i ∈ R^n represents the features and y_i ∈ {0, 1} represents the output. Each data point contains the coordinates of a two-dimensional graph such that, when plotted, it shows a hill or a valley. The output y_i indicates whether the graph is a hill or a valley (1 for hill and 0 for valley). Using our training data we will try to fit a logistic model of the form
\[
(\hat{w}, \hat{b}) = \arg\min_{(w,b)} \; f(w,b) := \sum_{i=1}^{N} \Big[ -y_i \big(w^T x_i + b\big) + \log\big(1 + \exp(w^T x_i + b)\big) \Big] + \frac{\lambda}{2} \|w\|_{\ell_2}^2. \qquad (1)
\]
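For later reference in parts (c)–(e) (and, via one more differentiation, the Hessian needed in part (f)), the gradient of f in (1) is, writing σ(z) = 1/(1 + e^{−z}) for the logistic function,
\[
\nabla_w f(w,b) = \sum_{i=1}^{N} \big(\sigma(w^T x_i + b) - y_i\big)\, x_i + \lambda w, \qquad \frac{\partial f}{\partial b}(w,b) = \sum_{i=1}^{N} \big(\sigma(w^T x_i + b) - y_i\big).
\]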
Given a new instance we again extract its features x. We then use our trained model to determine the probability that it represents a hill via
\[
p = \frac{1}{1 + e^{-(\hat{w}^T x + \hat{b})}}.
\]
Our final prediction for the output, i.e. hill or valley, will be based on
\[
y = \begin{cases} 1 \ \text{(hill)} & \text{if } p \ge \tfrac{1}{2}, \\ 0 \ \text{(valley)} & \text{if } p < \tfrac{1}{2}. \end{cases}
\]
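As a concrete illustration, here is a minimal NumPy sketch of this decision rule (the names predict, w_hat, and b_hat are placeholders for the trained model, not part of the assignment):

```python
import numpy as np

def predict(X, w_hat, b_hat):
    # p_i = 1 / (1 + exp(-(w_hat^T x_i + b_hat))) for every row x_i of X
    p = 1.0 / (1.0 + np.exp(-(X @ w_hat + b_hat)))
    # label 1 (hill) if p_i >= 1/2, label 0 (valley) otherwise
    return (p >= 0.5).astype(int)
```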
The data is available in a comma-separated file called Hill-Valley-without-noise-Training.data.txt, available on Blackboard. Each line lists the coordinates of one graph, e.g.
1317.265789, 1315.220951, 1312.770581, 1309.834252, . . . .
The first row contains the header of the data. Each row represents 100 points on a two-dimensional graph. The last column of the data is the output, indicating whether the plot is a hill or a valley. There are a total of 606 instances in this dataset, each with 100 features. We will use 500 of these instances as our training data and the other 106 instances as our test data set. We will train the model, i.e. solve (1), using the training data set (N = 500) and then report our predictions on the test data set. The train/test split is made at random: we pick 106 instances at random from the total of 606, put them in the test category, and put the remaining 500 in the training category (one possible implementation is sketched below). In all of our experiments we will perform 100 random partitions of the data into train/test and then report the final result as the average over these trials.
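One possible way to form such a random partition, sketched in NumPy (the function random_split and its seeding are illustrative only):

```python
import numpy as np

def random_split(X, y, n_test=106, seed=None):
    # randomly assign 106 instances to the test set and the remaining 500 to training
    perm = np.random.default_rng(seed).permutation(len(y))
    test, train = perm[:n_test], perm[n_test:]
    return X[train], y[train], X[test], y[test]
```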
- (a) First read the data from Hill-Valley-without-noise-Training.data.txt, remove the header, and separate the features from the outputs. For MATLAB users, the importdata function may be useful (a loading sketch is also given below).
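For Python users, a minimal loading sketch, assuming every column of the file (including the hill/valley label in the last column) is numeric:

```python
import numpy as np

# skip the header row; fields are comma separated
data = np.loadtxt("Hill-Valley-without-noise-Training.data.txt", delimiter=",", skiprows=1)
X = data[:, :-1]              # 606 x 100 matrix of features
y = data[:, -1].astype(int)   # last column: 1 = hill, 0 = valley
```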
- (b) Before we proceed any further we must normalize our data. This is the first thing to do when dealing with real data. Often the right form of normalization is to make sure our data set is zero mean and our feature vectors have the same Euclidean norm. More specifically, calculate the mean feature vector across the whole data set, that is,
\[
\bar{x} = \frac{1}{606} \sum_{i=1}^{606} x_i.
\]
Now subtract the mean from each instance, i.e. set x_i ← x_i − \bar{x}. Then normalize the data via x_i ← x_i / ‖x_i‖_{ℓ2} (see the sketch below).
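A sketch of this normalization in NumPy (the function name normalize is illustrative; X is the 606 × 100 feature matrix from part (a)):

```python
import numpy as np

def normalize(X):
    # subtract the mean feature vector computed over all 606 instances: x_i <- x_i - x_bar
    X = X - X.mean(axis=0)
    # give every feature vector unit Euclidean norm: x_i <- x_i / ||x_i||_2
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```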
- (c) Now that we have normalized data, partition it into train/test sets at random 100 times as discussed earlier. In each trial, learn the model by solving (1) with λ = 0.01. To do this, run gradient descent for T = 500 iterations, then use the trained model to make predictions on the test data and calculate the test error (the number of misclassified instances on the test data) for each trial. Report the average over the 100 trials. The value of the step size you use does not matter too much; however, make sure that the algorithm has converged. (A gradient descent sketch is given below.)
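A minimal gradient descent sketch for (1), using the gradient derived after equation (1); the step size and the helper names grad_f, gradient_descent, and test_error are placeholders, not prescribed values:

```python
import numpy as np

def grad_f(w, b, X, y, lam):
    # gradient of (1): sum_i (sigma(w^T x_i + b) - y_i) x_i + lam * w, and sum_i (sigma(.) - y_i) for b
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return X.T @ (p - y) + lam * w, np.sum(p - y)

def gradient_descent(X, y, lam=0.01, step=1.0, T=500):
    # plain gradient descent on (1) for a fixed number of iterations
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(T):
        gw, gb = grad_f(w, b, X, y, lam)
        w, b = w - step * gw, b - step * gb
    return w, b

def test_error(w, b, X_test, y_test):
    # number of misclassified test instances under the rule y = 1 iff p >= 1/2
    p = 1.0 / (1.0 + np.exp(-(X_test @ w + b)))
    return int(np.sum((p >= 0.5).astype(int) != y_test))
```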
- (d) Perform the experiment with the same step size you used before, but now report the number of iterations it takes to reach an accuracy of 10^{-6}, defined as the first iteration t at which the following inequality holds:
\[
\|\nabla f(w_t, b_t)\|_{\ell_2}^2 \le 10^{-6}\,\big(1 + |f(w_t, b_t)|\big). \qquad (2)
\]
The number you should report is the average of this iteration count over the 100 trials. (A sketch of this stopping test appears below.)
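A sketch of the stopping test (f_val and converged are illustrative helper names; np.logaddexp(0, z) computes log(1 + exp(z)) without overflow):

```python
import numpy as np

def f_val(w, b, X, y, lam):
    # objective f(w, b) from (1)
    z = X @ w + b
    return np.sum(-y * z + np.logaddexp(0.0, z)) + 0.5 * lam * np.dot(w, w)

def converged(w, b, X, y, lam, tol=1e-6):
    # criterion (2): ||grad f(w_t, b_t)||^2 <= 10^{-6} (1 + |f(w_t, b_t)|)
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    gw = X.T @ (p - y) + lam * w
    gb = np.sum(p - y)
    return gw @ gw + gb * gb <= tol * (1.0 + abs(f_val(w, b, X, y, lam)))
```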
- (e) Perform the experiment of part (d), but now add a momentum term: (1) using the heavy ball method and (2) using Nesterov's accelerated scheme (both are sketched below). In all cases keep the same step size as in part (d), but fine-tune the momentum parameter to get the smallest number of iterations for convergence based on the stopping criterion (2) (again averaged over the 100 trials). Plot the convergence of the three algorithms, gradient descent, heavy ball, and Nesterov's accelerated scheme, for one trial. That is, plot the ratio
\[
\frac{\|\nabla f(w_t, b_t)\|_{\ell_2}^2}{1 + |f(w_t, b_t)|},
\]
as a function of the iteration number t = 1, 2, ..., 500. Which algorithm would you use and why?
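Minimal sketches of the two momentum variants (the default momentum parameter beta = 0.9 is an arbitrary starting point to be tuned, and grad is the same gradient as in part (c)):

```python
import numpy as np

def grad(w, b, X, y, lam):
    # gradient of (1), as in part (c)
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return X.T @ (p - y) + lam * w, np.sum(p - y)

def heavy_ball(X, y, lam=0.01, step=1.0, beta=0.9, T=500):
    # heavy ball: gradient step plus beta times the previous displacement
    w, b = np.zeros(X.shape[1]), 0.0
    w_prev, b_prev = w.copy(), b
    for _ in range(T):
        gw, gb = grad(w, b, X, y, lam)
        w_next = w - step * gw + beta * (w - w_prev)
        b_next = b - step * gb + beta * (b - b_prev)
        w_prev, b_prev, w, b = w, b, w_next, b_next
    return w, b

def nesterov(X, y, lam=0.01, step=1.0, beta=0.9, T=500):
    # Nesterov: evaluate the gradient at the extrapolated (look-ahead) point
    w, b = np.zeros(X.shape[1]), 0.0
    w_prev, b_prev = w.copy(), b
    for _ in range(T):
        vw, vb = w + beta * (w - w_prev), b + beta * (b - b_prev)
        gw, gb = grad(vw, vb, X, y, lam)
        w_next, b_next = vw - step * gw, vb - step * gb
        w_prev, b_prev, w, b = w, b, w_next, b_next
    return w, b
```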
- (f) Repeat the experiment of part (d) using Newton's method and damped Newton's method with appropriate step sizes (a sketch is given below). Then plot the ratio from part (e) as a function of the iteration number t = 1, 2, ..., 500.
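A sketch of the Newton update for (1) over the joint variable (w, b); step = 1 gives pure Newton, while a smaller or backtracking step gives damped Newton. The helper names are illustrative:

```python
import numpy as np

def newton_direction(w, b, X, y, lam):
    # solve H * delta = grad f, where H is the Hessian of (1) in the joint variable (w, b)
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    d = p * (1.0 - p)                                     # sigma'(z_i)
    n = X.shape[1]
    H = np.zeros((n + 1, n + 1))
    H[:n, :n] = X.T @ (d[:, None] * X) + lam * np.eye(n)  # d^2 f / dw dw^T
    H[:n, n] = X.T @ d                                    # d^2 f / dw db
    H[n, :n] = X.T @ d
    H[n, n] = d.sum()                                     # d^2 f / db^2
    g = np.concatenate([X.T @ (p - y) + lam * w, [np.sum(p - y)]])
    return np.linalg.solve(H, g)

def newton(X, y, lam=0.01, step=1.0, T=500):
    # step = 1.0: Newton's method; a fixed step < 1 or a line-search step: damped Newton
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(T):
        delta = newton_direction(w, b, X, y, lam)
        w, b = w - step * delta[:-1], b - step * delta[-1]
    return w, b
```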