代写 C algorithm math python scala statistic software network cuda theory CS 361: Probability and Statistics for Computer Science (November 2019) Stochastic First Order Optimization

CS 361: Probability and Statistics for Computer Science (November 2019) Stochastic First Order Optimization
Ehsan Saleh
CS 361 Project
1 Stochastic Approximation 1.1 Review
The classical root finding problem is finding the solution for the following equation given arbitrary function h: h(x) = 0 (1.1)
Exercise 1.
(8 points) For review, fill in the following table and feel free to choose a method of your choice for the last row:
Table 1: Root finding algorithms
Clarification: ”Needs Value?” in the above table refer to whether the method needs to query the function value at different points. The two middle columns are True/False columns.
Later, we will focus on a noisy version of the root finding problem that we call stochastic approximation. We will assume that we do not have direct access to function h for evaluation. However, we have access to an oracle that given a point x returns a random variable y with the following property
y = h(x) + z, E[z] = 0 (1.2)
In other words, you have access to a noisy version of h(x), and not itself.
Clarification: We will be using the term “oracle” in this project many times. Oracle is like a black box that
takes an x as an input point, and returns an estimate of the value of a function. Some oracle may also report an estimate of the function gradient. This “oracle” definition should not be confused with Random Number Generators(RNGs) which are sometimes referred to as “noise oracles” in other literature.
You can have a very accurate but slow oracle. You can also have a less accurate, but faster oracle. The main point of the theoretical part, is to just see which oracle makes more sense to be used, and under what circumstances; Does the noisy fast oracle work better? or is it the opposite? Does it depend on the problem? What circumstances can make one better than the other?
Exercise 2.
(2 points) For the purpose of root finding, does it make sense to use the secant method when you only have access to the noisy function oracle (i.e. having access to only y given a point x)? why?
1.2 Stochastic Approximation in simple setting
For this part, we will make the following simplifying assumptions. The algorithm needs a sequence of positive learning rates that we denote as {an}n≥1.
Algorithm
Bisection method
Needs Value?
Needs Derivative?
6-word description of the method
False position method
Secant method
Newton–Raphson method
1

CS 361 Project – Stochastic First Order Optimization
2
1. The function h has a unique root x∗, which is a positive zero crossing of h. In other words: Simplified Mathematical Statement:
h(x∗)=0, x>x∗ ⇒h(x)>0, x0: 0<ε<ε0 ⇒h(x∗ +ε)>h(x∗ −ε)
2. We have bounded noise. In other words, if y = h(x) + z for a given x, we have
Simplified Mathematical Statement:
(1.3)
(1.4)
There is some fixed positive c that is not infinity. We may or may not know c, but it exists and it is fixed. For any input argument x, the random variable y (i.e. the noisy function value of x) has the following property
Exact Mathematical Statement:
P{|y| < c} = 1 (1.5) 􏰃􏰄 ∃c<∞: ∀x:P{|y| 1/2.
1. (0.5 point) Start with e2n ∼ cn−2β as an initial guess in 1.20 relationship, where β,c are positive
unknowns. Prove the following equality.
− 2cβn−1−2β − o(n−2−2β ) = n−2α − 2cn−α−2β + cn−2β−2α (1.25)
Hint: Using Taylor expansion, prove that (1 + 1 )−γ = 1 − γn−1 + o(n−2) for γ > 0. Then use this in n
your derivation.
Revision: The Taylor expansion was wrongly written in the initial two releases. It is corrected now.
2. (0.5 point) Some of the terms in 1.25 are clearly weaker than others, and never can be dominant under 1 ≥ α > 1/2 and β > 0 assumptions. Identify them and justify why they are weak. Then drop them in the equality to obtain an asymptotic balance relation.
3. (0.5 point) Assume 1 > α > 1/2. Using the asymptotic balance relation that you found, prove that c = 1 and β = α are the asymptotic balance solution.
22
4. (0.5 point) Assume α = 1. Prove that c = 1 and β = 1 are the asymptotic balance solution. 2
5. (0.5 point) Which learning rate sequence (i.e. which α in 1 ≥ α > 1/2) yields the fastest rate of convergence for e2n?
Exercise 7.
This whole exercise is not mandatory, and is for extra credit.
(1.5 points) Assume the more general form h(x) = |x|2γ+1sgn(x) for some γ > 0. It is obvious that zero is the root for h, and that h is flat near its unique root. So, this can be a more difficult SA problem. Assume the same z noise assumptions (i.e. having a mean of zero, and unit variance). Derive an asymptotic balance relation, and show that
α=γ+1,β=1 2γ + 1 4γ + 2
can be a solution for the asymptotic balance relation.
2 Stochastic Optimization 2.1 Review
Let us assume that the final goal of optimization is to find x∗ = argminxf(x) where f : Rd → R. However, f is either unknown, or very expensive to collect. On the bright side, we have access to a noisy version of f that we call g(x). In other words, g(x) = f(x) + z. You cannot control the additive noise z or predict it, but for simplicity you can assume that it is independent of x, and has a mean of zero.
E[g(x)] = f(x)
We also assume that we have access to the gradient of g, which is also noisy. In other words
E[∇g(x)] = ∇f(x)
2.1.1 Gradient Descent Algorithm
• Initialize X1
• For n = 1,2,··· Then do the following update
Xn+1 = Xn − an∇f(Xn)
(2.1)
(2.2)
(2.3)

CS 361 Project – Stochastic First Order Optimization 6
Exercise 8.
(2 point) Do you agree with the following statement? Why?
“Gradient descent is essentially a root finding algorithm”. Your answer should be limited to three lines.
Clarification: Here are a set of valid/invalid argument templates. You could fill in each of the valid ones and add the necessary and relevant information.
• A valid argument: “GD is essentially a root finding algorithm, because it fits the definition and the procedure, and uses exactly the same update rule when the root of the function we’re looking for is X. GD doesn’t require extra input other than X, and it returns the same kind of output Y.”
• A valid argument: “GD is not a root finding algorithm, because it violates the W and Z assumptions of the root finding procedure. This is due to · · · .”
• A valid argument: “GD is not a pure root finding algorithm, because it has extra unrelated algorithmic parameters that do not exist in the root finding procedure”
• An invalid argument: “GD is essentially a root finding algorithm, because it looks like it.”
Do not submit a templates as it is. For instance, if you are arguing about “unrelated algorithmic parameters”
you should specifically mention which algorithmic parameters you are referring.
2.1.2 Stochastic Gradient Descent Algorithm
• Initialize X1
• For n = 1, 2, · · · . Obtain a noisy version of the gradient (i.e. ∇g(Xn)). Then do the following update
Xn+1 = Xn − an∇g(Xn)
Exercise 9.
(2 point) Do you agree with the following statement? Why?
“Stochastic gradient descent is essentially a stochastic approximation algorithm”. Your answer should be limited to three lines.
(2.4)
Clarification: Here is an extra invalid type of argument.
• An invalid argument: “SGD is not essentially a stochastic approximation algorithm, because we shouldn’t
get the delusion of noise when we are not adding artificial noise.”
2.1.3 Empirical Risk Minimization
In almost all of the machine learning problems, the training problem boils down to the following format: We are trying to minimize function Q(x).
1 􏰇k
Q(x) = k
where Qi(x) is the loss function for ith data point where we have k training data points. Exercise 10.
(4 points) Define f to be Q. Define the g oracle this way:
1. Take x as the input
2. Pick index i at random and uniformly from 1 to k.
3. Compute the noisy function value as g(x) = Qi(x).
4. Compute the noisy function gradient as ∇g(x) = ∇Qi(x) 5. Return g(x) and ∇g(x) as the noisy value and gradient.
You can define the noise to be z = Qi(x) − Q(x). Therefore g(x) = f(x) + z.
Prove that under this definition of f and g, equations 2.2 and 2.1 are satisfied. Also, show that E[z] = 0.
Qj(x) (2.5)
j=1

CS 361 Project – Stochastic First Order Optimization 7 2.2 Convergence rates for SGD and GD
Again, we want to optimize function f with learning rate sequence an. Here are some theorems that we will not prove, but use later.
By the way, we will mention convexity (and sometimes strong convexity) as an assumption for the rest of the project, but you do not need to know anything about it for this project since we will not check or use it in our derivations. However, the following theorems make convexity assumption, and we will just always assume that our function is convex so that it makes our lives easier.
Theorem 2.1 Assume f has a unique root x∗. If f is twice continuously differentiable and strongly convex,
and an = O(n−1), then to reach an approximation error ε = |E[f(Xn)] − f(x∗)|, stochastic gradient descent
requires n = O( 1 ) updates. ε
Theorem 2.2 Assume f has a unique root x∗. If either the smoothness assumptions about f in theorem 2.1
or an = O(n−1) are not met, then to reach an approximation error ε = |E[f(Xn)] − f(x∗)|, stochastic gradient
descent requires n = O( 1 ) updates. ε2
Theorem 2.3 Assume f has a unique root x∗. To reach an approximation error ε = |E[f(Xn)] − f(x∗)|, gradient descent requires n = O(ln( 1 )) updates.
ε
Exercise 11.
Considerthestronglyconvexandonedimensionalfunctionf(x)= 1×2. Also,assumethatg(x)= 1(x+z)2−1, 222
where z is a bounded random noise with mean of zero and a unit variance.
1. 2. 3. 4. 5.
6.
(2 points) Find d g(x). dx
(2 points) Does f have a unique root?
(2 points) Verify that E[g(x)] = f(x).
(2 points) Verify that E[ d g(x)] = d f(x). dx dx
(2 points) Using one of the theorems above, find the order of updates required by SGD to reach approx-
imation error ε = |E[f(Xn)] − f(x∗)| when using g as the noisy oracle. Consider using he learning rate
sequence an = 1 . n
Revision: More instructions were added to the last about what learning rate sequence to use. (2 points) Validate part 5 by what you have proved in exercise 6.
Do this for α = 1 (i.e. the learning rate sequence an = 1 ). You can find the derived asymptotic solution n
parameters c, β for α = 1 and what the order of e2n would be using those parameters in exercise 6. Note that e2n = E[Xn2] = 2E[f(Xn)]. You do not need to prove anything.
Revision: More instructions were added to the last part for clarifying what you need to show.
Comparing SGD vs GD for Empirical Risk Minimization
3
Let us consider the f minimization task discussed earlier for the ERM task.
• f(x) = Q(x) and g is described in exercise 10.
• Let’s assume that f is strongly convex.
• Let’s assume that f is twice continuously differentiable. This is often true for the classes of functions we are optimizing over.
• Let’s consider using the an = 1 learning rate sequence. n
• You have k data points.
• In this part, we are determined to achieve a training precision of ρ = |E[f(Xn)] − f(x∗)|.
• Today’s automatic differentiation software scale linearly with respect to the number of data points. In other words, if finding ∇Qi(x) takes 1 seconds, finding ∇Q(x) will require k seconds.

CS 361 Project – Stochastic First Order Optimization 8
Computational Cost per Update
SGD
GD
O(1)
Number of updates to reach ρ
Total Cost
Table 2: Comparing SGD vs GD in terms of training precision
Table 3: Comparing SGD vs GD in terms of testing precision
Revision: The twice continuously differentiablity and the learning rate sequence assumptions were missing in the initial two releases.
Exercise 12.
(15 points) Fill in the table 2, with complexity orders in terms of k and ρ
Explain your answer for each element of the second row by providing a reference to the theorem mentioned in this document. For other elements, make sure you state the reason no matter how obvious it looks.
4 Comparing SGD vs GD with respect to test loss
Assume the same setting as the previous part. Last exercise was about finding the computational complexity required to reach a specific precision with respect to the optimal training loss (This is a technically wrong statement due to oversimplification of the subject, but you can ignore that). However, a specific precision of training loss does not translate to the same precision of test loss. Here are a couple of interesting facts:
• To achieve a specific test loss precision ε, you need large enough number of training samples k. The relation
ε = O(k−γ)
must hold for most of functions where 1 ≥ γ > 0 is an unknown constant.
• Assume that someone told you ρ = O(ε).
Exercise 13.
(16 points) Fill in the table 3, with complexity orders in terms of ε and γ. You can refer to the previous table,
and just replace the elements with appropriate values. Make sure you state the reason for each element no matter how obvious the reason looks.
For a typical γ = 0.2, Explain why choosing GD over SGD can be a terrible idea.
SGD
GD
Computational Cost per Update
O(1)
Number of updates to reach ε
Total Cost

CS 361: Probability and Statistics for Computer Science (November 2019) Stochastic First Order Optimization Implementation
Anay Pattanaik, Ehsan Saleh CS 361 Project
Objective: Implement state-of-the-art stochastic optimization algorithms for Machine Learning Problems, Linear Regression and Classification (using Neural Networks).
4.1 Adaptive Momentum Estimation (ADAM) Gradient Descent Algorithm
It has been observed that SGD as described in 2.1.2 is inefficient when the function is highly non-convex. Unfortunately, most of the modern applications of Machine Learning deal with these kinds of non-convex optimization landscape. One of the most widely adopted stochastic optimization algorithms that has been adopted to deal with non-convexity is ADAM. ADAM uses past gradient information to ”speed” up optimization by smoothing. Specifically, past gradients and squared gradients are used for the parameter update. You need to implement the ADAM stochastic optimization algorithm for a Linear Regression problem.
The pseudo-code for Adam can be found at [KB14](Algorithm 1) with corresponding explanation in Section 2 of [KB14]. Note that the pseudo-code is self contained.
Exercise 14.
Consider the following problem setting:
• ith data point is represented by x(i) and is a 10-dimensional vector.
• θtrue (i.e. the true parameter set) and θ (i.e. the variable indicating a candidate parameter set) are also 10-dimensional vectors.
• The number of training data points is k = 1000.
• The number of test data points is 50.
• Each element of xi is being generated from a uniform distribution over the interval [−1, 1].
• Data is being generated according to model y(j) = θT x(j) + 0.1 · ε (where ε ∼ N (0, 1) is a sample from
true the normal distribution). The label y(j) is a scalar.
• Assume that the true θtrue is all ones. However, we will pretend that we do not know it and we want to estimate it using the training data.
• Let us assume that all the variants start from the same initial θ, whose elements are picked uniformly in the interval [0, 0.1].
• Use 1000 updates for any SGD variant that you will implement.
Since this is a learning problem, we need to define a loss function. Let’s use the following format
1 􏰇k
(4.1)
(4.2)
􏰀􏰀
Where γ is a hyper-parameter that we control and defines the objective.
Qj(θ)
􏰀 􏰀γ
Q(θ) = k
Qj(θ) = 􏰀􏰀θT x(j) − y(j)􏰀􏰀
j=1
• (2 points) For γ = 2, The problem becomes the least square regression that you learnt in MATH 415, MATH 416, or MATH 225. Use the closed form solution that you learnt to estimate θ from the training data. Implement this, and report whether it is close to the true value.
Revision: Reporting the closeness of the implemented closed form solution was added in the third release.
1

CS 361 Project – Stochastic First Order Optimization Implementation 2 • (2 points) For Qj(θ), find an expression that gives you the gradient ∇Qj(θ). Report this expression,
and implement it in the appropriate function.
Revision: A misplaced statement was replaced in the question.
hint: For r(θ) = h(e(θ)) = h(θT x − y), the gradient can easily be written as ∇r(θ) = ∂h · ∇e(θ) = ∂h · x ∂e ∂e
according to the chain rule.
Revision: in the initial release, there was a typo putting ∇h(θ) instead of ∇e(θ) in the hint. The current relation is correct.
• (3 points) Code the ADAM optimization algorithm (with default hyper-parameters such as the learning rate as mentioned in the [KB14]) and SGD to find the best parameter θ. Use a batch size of 1 for this problem, and perform 1000 parameter updates.
hint: You are given a starter code that pretty much does everything. You are also free to use any language that you would like and ignore the starter code, but you should exactly use the same problem setting described above.
You could either print the final set of parameters for a single run using the default setting, and optionally compare it to the least square solution. If you’re doing the next question, you can put a plot generated as the evidence for implementation. Any kind of evidence/result that shows you have implemented ADAM is going to be fine. Please do not include code snippets in your report.
Clarification: More instructions about what to include as an answer were added.
• (1 points) Since this is an exercise that depends on random sampling, one run is not going to be enough. Run the training process multiple times from the same generated data and the same initial θ. Compute the average test loss and its standard error at multiple stages during training. To determine the right number of replicates, repeat the exercise as many times as the standard error bars of different methods would not intersect with each other. Always report a training curve, where you show the test loss at different stages (i.e. number of updates) of training. The horizontal axis should be the number of updates, and the vertical axis should be the test loss.
hint: You are given a starter code that pretty much does everything. You are also free to use any language that you would like and ignore the starter code, but you should exactly use the same problem setting described above.
• (8 points) ADAM is the most popular SGD variant, and is believed to be the best. However, this common belief is more driven by matters like insight or practical performance under specific settings, and it is less substantiated by concrete generalize-able theory (just like most of ML algorithms).
Do the training for multiple γ values including at least 0.4,0.7,1,2,3,5. State your observation clearly, and analyze the trends you are seeing. State whether ADAM seems to save the day consistently. Your analysis should include the reason why one method outperforms the other under each setting.
• (8 points)
Another SGD variant is called ADAGRAD[DHS11]. Basically the update rule is:
– Given noisy gradient gt at step t, construct a running squared sum Gt = 􏰅tn=1 gn ⊙ gn, where ⊙ represents the element-wise multiplication.
– The update rule at time step t is:
θt=θt−1−√ α gt (4.3) Gt + ε
where α is the learning rate and ε is the division by zero control hyper-parameter (as mentioned in [KB14]).
– Implement ADAGRAD, and show a training plot for the default setting where γ = 2 and α and ε are just like the ones in [KB14]. How does ADAGRAD do? Why?
– Let us assume that we believe ADAGRAD is better at handling larger learning rates. Of course “belief” is not a scientific concept, but forget about that. Use a learning rate of α = 0.1 only for ADAGRAD, and leave ADAM and SGD learning rates the same as before. How do the results change? Does this confirm our belief? Is this a fair experiment?
hint: Providing a two line concise analysis and a training plot is just fine for this part.

CS 361 Project – Stochastic First Order Optimization Implementation 3
– Keep using α = 0.1 only for ADAGRAD. If you think about the underlying math of ADAGRAD, you can easily notice theoretical issues. Now that you know about the variance of different estimation problems such as average or summation, you should be able to tell some of ADAGRAD problems. Tweak the problem hyper-parameters so that ADAGRAD loses its advantage to its next better method. If ADAGRAD seems the worst, try and tweak the hyper-parameters of the problem so that ADAGRAD is not the worst anymore.
You are free to tweak things like the objective, adding artificial noise to the gradient, starting from unusual initial θ, changing the data generating model, changing the data dimension/size, etc. However, you should be able to justify the change based on some insight gained from theory (i.e. randomly changing the setting with no good reason until a desirable outcome appears is not accepted).
Technical clarifications:
For python, you’re only allowed to use numpy library . Please do not ask whether matplotlib, os, copy, etc. could be imported in python. They obviously have nothing to do with ADAM, and are okay to import. The same goes for other languages; using a library that has already implemented ADAM is prohibited. Unrelated libraries are okay to load/import/include. For this exercise, you may not use any off the shelf ADAM imple- mentation code that you find on the internet or anywhere else.
Since you are given a starter code, submitting just code or a jupyter notebook or screenshots will not get you any points. You should state your findings in a report, provide analysis, and substantiate them with necessary plots in a clean way. A messy report with unorganized plots and many pages of figures without references/captions will get a substantial amount of points deducted. Always read the report before submis- sion. Make sure the figures are grouped nicely together, are captioned and referenced well, and convey a clear message. Dumping multiple pages of figures will not get you any points.
Simplicity was a design principle for this exercise. As you can see, we are trying to keep the setting as simple as possible. We do not ask for large number of iterations. We also keep a fixed learning rate most of the time. The running time on an average computer is a couple of seconds. If you’re implementing/changing anything that causes your code to take long, you are probably doing something wrong or computationally unjustified.
Exercise 15.
(2.5 extra credit points) Here is an article about different variants of SGD that is as good as it gets. Read the article, but think as a mathematician. Specifically, think about the possible advantages or negative consequences of the algorithmic decisions made in each variant. These can essentially be the same weaknesses or strengths that cause a variant to over-perform or under-perform under practical data. You do not need to submit anything as a proof of reading.
Do the same last part for two other variants. i.e. implement the two variants and artificially make them perform better/worse than others.
4.2 Classifying Handwritten Digits with Neural Networks
Next, we’ll use Neural Networks to classify handwritten digits from MNIST dataset (dataset of handwritten digits). The objective is to use different stochastic optimization algorithms that you have seen so far and compare their performances (GD, SGD, ADAM). You will train the model and then classify handwritten digits. A self contained starter code has been provided for your reference. Please note that the starter code is fully functional. You need to change a few lines of code to answer the exercise questions. The starter code is in Python.
Fun fact: One of the co-creators of MNIST dataset (Dr. Yann LeCun) is also the co-recicpient of 2018 ACM Turing award for his work on Neural Networks.
Exercise 16.
We will run the starter code provided, understand different blocks of the code, and then run different gradient descent algorithms.
The starter code heavily relies upon PyTorch library that you can install from here. Before installation, please select ”None” for CUDA option. Make sure you have a 64-bit installation of python, since no pre-compiled library exists for the 32-bit versions of python. Please note that most the effort in this exercise will go into installing the libraries correctly. Please use office hours if you can’t figure it out. Apart from this, it also uses Matplotlib library Matplotlib.
1. To run the Gradient Descent (GD) Algorithm

CS 361 Project – Stochastic First Order Optimization Implementation 4
Set the (mini) batch size to 60000 (the size of MNIST dataset)
Run SGD optimizer with learning rate of 0.003
(1 points) Why is this the same as the GD algorithm if we are using SGD optimizer? (2 points) What do you observe? Report the accuracy.
(2 points) List at least 2 ways to improve the accuracy.
2. To run the Stochastic Gradient Descent (SGD) Algorithm
Set the (mini) batch size to 64
Run SGD optimizer with learning rate of 0.003
(2 points) What do you observe? Report the accuracy.
(1 points) List at least one way to improve accuracy further.
3. To run the ADAM algorithm
Set the mini-bacth size to 64
Run the ADAM optimizer with default learning rates (Algorithm 1) ([KB14])
Hint: You can use Pytorch (or any other library in any language) for setting up the ADAM optimizer. (2 points) What do you observe? Report the accuracy.
(1 points) Why does ADAM converge faster than SGD?
References
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradient methods for online learning and stochastic optimization”. In: Journal of Machine Learning Research 12.Jul (2011), pp. 2121–2159.
[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv: 1412.6980 [cs.LG].