CS 361: Probability and Statistics for Computer Science (Spring 2021) Stochastic Optimization Project
1 Stochastic Optimization Theory Part
1.1 A common stochastic optimization task
In many machine learning problems, we are trying to minimize a function f(θ) of the following form:

f(θ) = (1/k) Σ_{j=1}^{k} Q(θ, j)    (1.1)
where Q(θ, j) is the loss for the jth data point, and we have k training data points in the data set D. Here, θ is the set of parameters of the model that we try to optimize; that is, we want to find θ∗ = arg minθ f(θ). However, f is either unknown or very expensive to compute, but we have access to the noisy version g(θ). Using g(θ) to find the optimal θ∗ for f(θ) is a stochastic optimization task.
Exercise 1.
(4 points) Define the random variable g(θ) specifically here to be Q(θ, i), where i is a random variable uniformly distributed over the integer set [1 : k]. Accordingly, ∇g(θ) is ∇Q(θ, i). Further, we define the noise z = Q(θ, i) − f(θ); hence g(θ) = Q(θ, i) = f(θ) + z. That is why we call g the noisy version of f. Given this definition of f and g, prove that the equations E[g(θ)] = f(θ), E[∇g(θ)] = ∇f(θ), and E[z] = 0 are satisfied.
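The first identity can also be checked numerically before proving it. The following sketch uses a hypothetical quadratic per-point loss (not part of the project) and 0-based indices (i uniform over {0, ..., k−1} rather than [1 : k]); the two printed values should nearly agree:

import numpy as np

rng = np.random.default_rng(0)
k = 5
theta = 1.5
offsets = rng.normal(size=k)                         # defines a hypothetical per-point loss

def Q(theta, j):                                     # Q(theta, j) = (theta - offset_j)^2
    return (theta - offsets[j]) ** 2

f_theta = np.mean([Q(theta, j) for j in range(k)])   # f(theta) = (1/k) * sum_j Q(theta, j)
i = rng.integers(0, k, size=200_000)                 # i ~ Uniform over {0, ..., k-1}
g_mean = np.mean((theta - offsets[i]) ** 2)          # Monte Carlo estimate of E[g(theta)]
print(f_theta, g_mean)                               # the two numbers should nearly agree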
1.2 Review of GD and SGD
The goal of optimization is to find the θ∗ that minimizes f(θ), which could be more general than the form in (1.1). Again, sometimes f is either unknown or very expensive to compute, but we have access to the noisy version g(θ). The following are two methods for approximating θ∗ iteratively.
1.2.1 Gradient Descent Algorithm (GD)
• Initialize θ1
• For t = 1, 2, · · · , do the following update:
θt+1 = θt − ηt∇f(θt)    (1.2)
1.2.2 Stochastic Gradient Descent Algorithm (SGD)
• Initialize θ1
• For t = 1, 2, · · · , obtain a noisy version of the gradient (i.e., ∇g(θt)), then do the following update:
θt+1 = θt − ηt∇g(θt)    (1.3)
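To make the contrast concrete, here is a minimal sketch of both update rules on a hypothetical one-dimensional problem (the data x and the loss Q(θ, j) = (θ − xj)² are illustrative assumptions, not part of the project):

import numpy as np

rng = np.random.default_rng(0)
k = 100
x = rng.normal(size=k)                 # hypothetical data; Q(theta, j) = (theta - x_j)^2

def grad_f(th):                        # exact gradient: averages over all k points
    return np.mean(2 * (th - x))

def grad_g(th):                        # noisy gradient: a single random data point
    return 2 * (th - x[rng.integers(k)])

theta_gd, theta_sgd = 5.0, 5.0
for t in range(1, 1001):
    eta = 0.5 / t                                    # learning rate eta_t = O(1/t)
    theta_gd = theta_gd - eta * grad_f(theta_gd)     # GD update (1.2)
    theta_sgd = theta_sgd - eta * grad_g(theta_sgd)  # SGD update (1.3)
print(theta_gd, theta_sgd)             # both should approach the minimizer, mean(x)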
1.3 Convergence rates for SGD and GD
Suppose we want to minimize a function f with learning rate sequence ηt. We have the following theorems:
Theorem 1.1 Assume ∇f has a unique root θ∗. If f is twice continuously differentiable and strongly convex, and ηt = O(1/t), then to reach an approximation error ε = |E[f(θt)] − f(θ∗)|, stochastic gradient descent requires t = O(1/ε) updates.
Theorem 1.2 Assume ∇f has a unique root θ∗. If either the smoothness assumptions about f in Theorem 1.1 or ηt = O(1/t) is not met, then to reach an approximation error ε = |E[f(θt)] − f(θ∗)|, stochastic gradient descent requires t = O(1/ε²) updates.
Theorem 1.3 Assume ∇f has a unique root θ∗. To reach an approximation error ε = |E[f(θt)] − f(θ∗)|, gradient descent requires t = O(ln(1/ε)) updates.
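For intuition, take a target precision of ε = 10⁻⁴. Theorem 1.1 puts SGD at on the order of 1/ε = 10⁴ updates, Theorem 1.2 puts SGD at on the order of 1/ε² = 10⁸ updates when the assumptions of Theorem 1.1 fail, while Theorem 1.3 puts GD at only on the order of ln(1/ε) ≈ 9.2 updates. The number of updates, however, is only half the story; the cost of each update is examined next.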
2 Comparing SGD vs GD in terms of running time
Let us consider the f minimization task discussed earlier in (1.1).
• f(θ) and g(θ) are as described in Exercise 1, and ∇f(θ∗) = 0.
• Let’s assume that f is strongly convex, and is twice continuously differentiable.
• We use the learning rate sequence ηt = 1/t.
• We have k data points.
• We want to achieve a training precision of ε = |E[f(θt)] − f(θ∗)|.
• Assume finding ∇Q(θ,j) takes c time, and finding ∇f(θ) takes kc time.
Exercise 2.
(15 points) Fill in Table 1 with complexity orders in terms of k and ε.
Table 1 Comparing SGD vs GD in terms of training precision

                                  SGD    GD
Computational Cost per Update     O(1)
Number of updates to reach ε
Total Cost
Explain your answer for each element of the second row by providing a reference to the relevant theorem in this document. For every other element, explain your answer.
3 Comparing SGD vs GD with respect to test loss
The last exercise was about finding the computational complexity required to reach a specific precision with respect to the optimal training loss. However, a specific precision of training loss does not translate to the same precision of test loss. Here are some helpful facts:
• To achieve a specific test loss precision ρ, you need a large enough number of training samples k. The relation

ρ = O(k^(−γ))

must hold for most loss functions, where 0 < γ ≤ 1 is an unknown constant.
• Let’s assume ε = Θ(ρ); for simplicity, you can treat this as ε = c × ρ, where c is a positive constant.
Exercise 3.
(16 points) 1) Fill in Table 2 with complexity orders in terms of ρ and γ. You can refer to the previous table and replace the elements with appropriate values. Make sure you state the reason for each element even if it looks obvious.
Table 2 Comparing SGD vs GD in terms of testing precision

                                  SGD    GD
Computational Cost per Update     O(1)
Number of updates to reach ρ
Total Cost
2) For a typical γ = 0.2, explain why choosing GD over SGD can be very unwise.
CS 361: Probability and Statistics for Computer Science (Spring 2021) Stochastic Optimization Implementation
Objective: Implement state-of-the-art stochastic optimization algorithms for Machine Learning problems: Linear Regression and Classification (using Neural Networks).
3.1 Adaptive Momentum Estimation (ADAM) Gradient Descent Algorithm
SGD can be ineffective when the function is highly non-convex. Unfortunately, most modern applications of Machine Learning involve non-convex optimization problems. One stochastic optimization algorithm that works well even under non-convexity is ADAM [KB14]. ADAM uses past gradient information to “speed up” optimization by smoothing, and the algorithm has been further improved [SSS18]. You will implement the ADAM stochastic optimization algorithm for a Linear Regression problem.
The pseudo-code for ADAM has been reproduced here from this paper. Credits to [KB14]. Disclaimer: The textbook, in Chapter 13, uses β for parameters but we will be using θ.
Algorithm 1: gt² indicates the elementwise square gt ⊙ gt. Good default settings for the tested machine learning problems are α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10⁻⁸. All operations on vectors are element-wise. With β1ᵗ and β2ᵗ, we denote β1 and β2 to the power t.
Require: α: Stepsize
Require: ε: Division-by-zero control parameter
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ.
Require: θ0: Initial parameter vector
m0 ← 0 (Initialize 1st moment vector)
v0 ← 0 (Initialize 2nd moment vector)
t ← 0 (Initialize timestep)
while θt not converged do
t ← t + 1
gt ← ∇θft(θt−1) (Get gradients w.r.t. stochastic objective at timestep t)
mt ← β1 · mt−1 + (1 − β1) · gt (Update biased first moment estimate)
vt ← β2 · vt−1 + (1 − β2) · gt2 (Update biased second raw moment estimate)
m̂t ← mt/(1 − β1ᵗ) (Compute bias-corrected first moment estimate)
v̂t ← vt/(1 − β2ᵗ) (Compute bias-corrected second raw moment estimate)
θt ← θt−1 − α · m̂t/(√v̂t + ε) (Update parameters)
end while
return θt (Resulting parameters)
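For reference, the pseudo-code translates into a short NumPy routine. This is a hedged sketch, not the required notebook solution; grad_g stands for whatever stochastic-gradient oracle you supply:

import numpy as np

def adam(grad_g, theta0, n_updates=1000, alpha=0.001,
         beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal ADAM loop following Algorithm 1; grad_g(theta) returns a
    stochastic gradient (e.g., the gradient at one random data point)."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)                 # 1st moment vector
    v = np.zeros_like(theta)                 # 2nd moment vector
    for t in range(1, n_updates + 1):
        g = grad_g(theta)                    # stochastic gradient at timestep t
        m = beta1 * m + (1 - beta1) * g      # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # biased second raw moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)         # bias-corrected second raw moment
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta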
Exercise 4.
Consider the following problem setting:
• The number of training data points is k = 1000. The number of test data points is 50.
• Generation of the data set used to fit the model: the jth data point is represented by xj and is a 10-dimensional vector. Each element of xj is generated from a uniform distribution over the interval [−1, 1].
• θtrue (i.e. the true parameter set) and θ (i.e. the variable indicating a candidate parameter set) are also 10-dimensional vectors. Assume that θtrue is all ones. However, we will pretend that we do not know it and we want to estimate it using the training data.
• Data is generated according to yj = xjᵀθtrue + 0.1 · ξ (where ξ ∼ N(0, 1) is a sample from the standard normal distribution). The label yj is a scalar.
• Let us assume that all the algorithm variants start from the same initial θ, whose elements are picked
uniformly in the interval [0, 0.1].
• Use 1000 updates.
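The setup above can be generated with a few lines of NumPy; the following sketch is one possible reading of the bullets (the variable names are our own, not from the starter code):

import numpy as np

rng = np.random.default_rng(0)
k, d = 1000, 10
theta_true = np.ones(d)                      # true parameter set (all ones)
X = rng.uniform(-1, 1, size=(k, d))          # k training points, each a 10-dim vector
y = X @ theta_true + 0.1 * rng.standard_normal(k)   # y_j = x_j^T theta_true + 0.1 * xi
X_test = rng.uniform(-1, 1, size=(50, d))    # 50 test points
y_test = X_test @ theta_true + 0.1 * rng.standard_normal(50)
theta0 = rng.uniform(0, 0.1, size=d)         # common initial theta for all variants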
Since this is an optimization problem, we need to define the loss function. Let's use the following format:

F(θ) = (1/k) Σ_{j=1}^{k} Q(θ, j)    (3.1)

Q(θ, j) = |xjᵀθ − yj|^γ    (3.2)

where γ is a hyper-parameter that we use to control and define the loss function.
For the following problems, state findings, provide analysis, and substantiate them in your write-up with screenshots of the plots from the Gradient Descent Python notebook code. Please save the notebook to your own Google Drive so that your work can be preserved after the runtime expires.
1. (5 points) For γ = 2, the problem becomes the least squares regression that you learned from Chapter 13. State the closed form solution (i.e. θ = ...) in your report, and then implement the solution to solve for θ. Provide the results of the experiment and state whether it is close to the true value.
2. (5 points) For Q(θ,j), find an expression that gives you the gradient ∇Q(θ,j). Report this expression, and implement it in the appropriate function.
hint: For r(θ) = h(e(θ)) = h(xjᵀθ − yj), the gradient can be written as ∇r(θ) = (∂h/∂e) · ∇e(θ) = (∂h/∂e) · xj according to the chain rule.
hint: The sign function, sgn(x), may be useful.
3. (5 points) Code the ADAM optimization algorithm (with default hyper-parameters such as the learning rate as mentioned in the pseudocode above) and SGD to find the best parameter θ. Use a batch size of 1 for this problem, and perform 1000 parameter updates. Report the final set of parameters.
4. (5 points) Update your code to compute the average test loss at every 50th update, and plot these values. You might notice that the error bars of ADAM and SGD overlap. This is due to the inherent randomness from sampling values. To avoid this probabilistic overlap, increase the number of replicates (num_rep in the starter code) until the error bars between ADAM and SGD do not overlap. Report this curve.
5. (9 points) Run the ADAM method for each of γ = 0.4, 0.7, 1, 2, 3, and 5. Report your observations clearly, and analyze the trends you are seeing. State whether ADAM consistently outperforms SGD. Your analysis should include the reason why one method outperforms the other under each setting.
Example Plot
Note: Your plots will probably not end up looking exactly like this one.
3.2 Classifying Handwritten Digits with Neural Networks
Next, we’ll use Neural Networks to classify handwritten digits from the MNIST dataset. The objective is to use the different stochastic optimization algorithms that you have seen so far (GD, SGD, ADAM) and compare their performances. You will train the model and then classify handwritten digits.
Fun fact: One of the co-creators of the MNIST dataset (Dr. Yann LeCun) is also a co-recipient of the 2018 ACM Turing Award for his work on Neural Networks.
Exercise 5.
For the following problems, please modify the Neural Networks Python notebook code.
1. To run the Gradient Descent (GD) Algorithm: Set the (mini) batch size to 60000 (the size of the MNIST dataset). Run the SGD optimizer with a learning rate of 0.003.
(1 point) Why is this the same as the GD algorithm if we are using the SGD optimizer?
(2 points) What do you observe? Report the accuracy.
(2 points) List at least 2 ways to improve the accuracy.
2. To run the Stochastic Gradient Descent (SGD) Algorithm: Set the (mini) batch size to 64. Run the SGD optimizer with a learning rate of 0.003.
(2 points) What do you observe? Report the accuracy.
(1 point) List at least one way to improve the accuracy further.
3. To run the ADAM algorithm: Set the (mini) batch size to 64. Run the ADAM optimizer with the default learning rate (Algorithm 1).
Hint: You can use PyTorch (or any other library in any language) for setting up the ADAM optimizer.
(2 points) What do you observe? Report the accuracy.
(1 point) Why does ADAM converge faster than SGD?
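For the optimizer setup, a minimal PyTorch sketch might look as follows; the two-layer network here is a hypothetical stand-in for whatever model the notebook actually defines:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_dataset = datasets.MNIST("./data", train=True, download=True,
                               transform=transforms.ToTensor())
loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # 60000 for the GD setting
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

optimizer = torch.optim.Adam(model.parameters())                 # default lr = 0.001
# optimizer = torch.optim.SGD(model.parameters(), lr=0.003)      # for parts 1 and 2

for images, labels in loader:                                    # one epoch of updates
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()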
References
[RM51] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: The Annals of Mathematical Statistics (1951), pp. 400–407.
[Sac58] Jerome Sacks. “Asymptotic distribution of stochastic approximation procedures”. In: The Annals of Mathematical Statistics 29.2 (1958), pp. 373–405.
[NY83] Arkadi Semenovich Nemirovsky and David Borisovich Yudin. “Problem complexity and method efficiency in optimization.” In: Wiley-Interscience Series in Discrete Mathematics (1983).
[CZ07] Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Vol. 24. Cambridge University Press, 2007.
[Nem+09] Arkadi Nemirovski et al. “Robust stochastic approximation approach to stochastic programming”. In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609.
[Bot10] Léon Bottou. “Large-scale machine learning with stochastic gradient descent”. In: Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186.
[Bot12] Léon Bottou. “Stochastic gradient descent tricks”. In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[DGL13] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Vol. 31. Springer Science & Business Media, 2013.
[JZ13] Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive variance reduction”. In: Advances in Neural Information Processing Systems. 2013, pp. 315–323.
[Vap13] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv: 1412.6980 [cs.LG].
[MV18] Pierre Moulin and Venugopal Veeravalli. Statistical Inference for Engineers and Data Scientists. Cambridge University Press, 2018.
[SSS18] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. 2018.
Acknowledgements
Ehsan Saleh and Anay Pattanik created the first draft of the project outline.
Hongye Liu, Ajay Fewell, Chenhui Zhang, Muhammed Imran, Brian Yang, and Jinglin contributed to the newest edition.
This document was compiled from and inspired by the work and ideas shown in [RM51; Sac58; NY83; Nem+09; Bot10; JZ13; Bot12; Vap13; DGL13; CZ07; MV18].
CS 361: Probability and Statistics for Computer Science (Spring 2021) Stochastic Optimization - Extra Credit Problems
DISCLAIMER: The exercises in this document are NOT mandatory, and are for extra credit (up to 15 additional points). These exercises were removed from the original iteration of the project outline as they focus on specific mathematical details, which can help provide a deeper understanding of the underlying theory behind the success of the algorithm, but are not necessary for gaining a general intuition.
Background
The classical root finding problem is finding the solution of the following equation, given an arbitrary function f:
f(θ) = 0 (0.1)
Finding the root here can be done with a variety of methods: the Bisection method, the False position method, the Secant method, the Newton-Raphson method, etc.
Now suppose that we do not have f, but rather we have a noisy version of f known as g. More precisely, we define g as follows:
g(θ) = f(θ) + z,  E[z] = 0    (0.2)

To reiterate, g acts as an estimate for f, but due to random noise, the values we see are slightly off. The method by which we aim to find the root of f given only g is called Stochastic Approximation.

0.1 Stochastic Approximation in a simple setting
We will make the following simplifying assumptions. The algorithm needs a sequence of positive learning rates that we denote as {ηt}t≥1.
1. The function f has a unique root θ∗ (i.e., f(θ∗) = 0 for a unique θ∗). This unique root is a positive zero crossing of f. In other words:

f(θ∗) = 0,  θ > θ∗ ⇒ f(θ) > 0,  θ < θ∗ ⇒ f(θ) < 0    (0.3)

2. g has a finite upper and lower bound. In other words, we have bounded noise:

P(|g| < c) = 1 where c ≥ 0    (0.4)

3. The noise is independent of θ:

∀θ : E[g] = f(θ),  P(z|θ) = P(z)    (0.5)

4. The learning rates ηt do not approach 0 too quickly or too slowly. More formally:

Σ_{t=1}^∞ ηt = ∞,  Σ_{t=1}^∞ ηt² = c    (0.6)

for some positive c. In other words, the sum of the learning rates is unbounded, but the sum of their squares is bounded. An example sequence that satisfies these constraints is ηt = t^(−0.75), t ≥ 1.
The algorithm is defined as follows, where Θt is the tth approximation of θ∗:
• Let Θ1 be some initial value or guess
• For t = 1, 2, · · · perform the following iteration
Θt+1 = Θt − ηtGt

where Gt = f(Θt) + Zt, just as mentioned previously.
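A minimal simulation of this scheme, with a hypothetical f(θ) = tanh(θ) and bounded uniform noise chosen purely for illustration:

import numpy as np

rng = np.random.default_rng(0)

def f(th):
    return np.tanh(th)                       # hypothetical f with unique root theta* = 0

theta = 3.0                                  # Theta_1: initial guess
for t in range(1, 10_001):
    eta = t ** -0.75                         # sum(eta) diverges, sum(eta^2) converges
    G = f(theta) + rng.uniform(-0.5, 0.5)    # G_t = f(Theta_t) + Z_t, bounded noise
    theta = theta - eta * G                  # Theta_{t+1} = Theta_t - eta_t * G_t
print(theta)                                 # should be close to the root 0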
1 Stochastic Approximation
1.1 Convergence proof of SA under simplified settings
Define the Mean Squared Error at step t as follows:

et² = E{(Θt − θ∗)²}    (1.1)
We will prove this value converges to 0 as the SA algorithm executes.
Extra Credit Ex. 1.
1. (2 points) Prove the following relationship for any two random variables u, v:
Eu[f(u)] = Ev[Eu|v[f(u)|v]]    (1.2)
Do not assume any kind of independence. It may help to consider the following simplified relation first: E[A] = E[E[A|B]].
Hint: You will need to have a minimal understanding of conditional expectation (Eu|v[f(u)|v]). Here (blogpost or video) are some resources to learn about conditional expectation.
2. (0.5 point) Using the last part, and the unbiasedness and independence assumptions presented in Eq 0.5, prove the following recurrence relation:

et+1² = et² − 2ηtρt + ηt²E{Gt²}    (1.3)

where ρt := E{(Θt − θ∗)f(Θt)}.

3. (0.5 point) Unfold the recurrence relation of the last part to get the following:

et+1² = e1² − 2 Σ_{i=1}^t ηiρi + Σ_{i=1}^t ηi²E[Gi²]    (1.4)
hint: Unfolding means turning a recurrence relation of type f(θt, θt−1) = 0 into a relation of type f(θt, θ0) = 0. Usually if you have f(θt, θt−1) = θt − θt−1 − ct = 0, you can obtain θt = θ0 + Σ_i ci by summing over the telescopic differences θi − θi−1.
4. (0.5 point) Using the bounded noise assumption, prove that E[Gt²] is bounded.
5. (0.5 point) Prove that Σ_{i=1}^t ηi²E[Gi²] is bounded.
6. (0.5 point) Use assumption 1 to prove that ρt is non-negative.
7. (0.5 point) Using the last two parts, and the unfolded relation 1.4, prove the following:

0 ≤ Σ_{i=1}^t ηiρi ≤ C < ∞    (1.5)

and write down a valid bound C that does not depend on t.
8. (0.5 point) Letting t → ∞, the last part and 1.4 jointly guarantee that the sequence et converges to a
finite number. Does this guarantee that the sequence et converges to zero?
9. (1.5 points) Define bt := |Θ1| + c Σ_{i=1}^t ηi, and dt := min_{|θ| ≤ bt, θ ≠ θ∗} f(θ)/(θ − θ∗). Prove the following three properties:

∀t ≥ 1 : |Θt| ≤ bt    (1.6)

∀t ≥ 1 : ρt ≥ dt et²    (1.7)

∀t ≥ 1 : dt ≥ 0    (1.8)
Also, it is straightforward to prove the following property, but you do not need to prove it. You can refer to and use this property in later questions:

Σ_{i=1}^∞ ηidi = ∞    (1.9)
10. (0.5 point) Using the properties in the last part, prove that Σ_{i=1}^∞ ηidiei² < ∞.
11. (1.5 points) Using the evidence you have collected, finalize the proof with the last piece:

lim_{t→∞} et² = 0    (1.10)
1.2 Simple example of SA
Let us assume the following simplified setting:
• Assume f(θ) = θ. This means that the root is θ∗ = 0.
• Assume that ηi < 1 for all i.
• Assume that Zi has a mean of zero and variance of 1; that is, ∀i : E[Zi] = 0 and E[Zi²] = 1.
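Before starting the proofs, a quick simulation of this exact setting can make the claim that et² → 0 plausible (illustration only; n_rep, T, and the learning rate schedule are our own arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n_rep, T = 2000, 2000
theta = np.full(n_rep, 5.0)               # n_rep independent runs, each with Theta_1 = 5
for t in range(1, T + 1):
    eta = 0.9 * t ** -0.75                # eta_i < 1, satisfying the rate conditions
    Z = rng.standard_normal(n_rep)        # E[Z_i] = 0, E[Z_i^2] = 1
    theta = theta - eta * (theta + Z)     # G_t = f(Theta_t) + Z_t with f(theta) = theta
print(np.mean(theta ** 2))                # Monte Carlo estimate of e_T^2; small for large T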
Extra Credit Ex. 2.
1. (0.5 point) Prove that

et+1² = (1 − ηt)²et² + ηt²    (1.11)

2. (0.5 point) Define ut as follows:

t ≥ 2 : ut = et² / Π_{i=1}^{t−1}(1 − ηi)²    (1.12)

and u1 = e1². Prove the following:

ut+1 = ut + ηt² / Π_{i=1}^{t}(1 − ηi)²    (1.13)

3. (0.5 point) Unfold the previous recurrence relation and prove that

ut+1 = u1 + Σ_{j=1}^{t} ηj² / Π_{i=1}^{j}(1 − ηi)²    (1.14)

4. (0.5 point) Prove the following closed form solution for et²:

et² = (Π_{i=1}^{t−1}(1 − ηi)²) e1² + Σ_{j=1}^{t−1} ηj² Π_{i=j+1}^{t−1}(1 − ηi)²
CS 361: Probability and Statistics for Computer Science (Spring 2021) Stochastic Optimization Implementation - Extra Credit
Extra Credit Ex. 3.
(4 points) This is an article about different variants of SGD. Please read the article and think about the possible advantages or negative consequences of the algorithmic decisions made in each variant. These can essentially be the same weaknesses or strengths that cause a variant to over-perform or under-perform on practical data. You do not need to submit anything as proof of reading.
Repeat Exercise 4’s ADAM implementation for two other variants. That is, implement the two variants and artificially make them perform better/worse than the others. Include some discussion or explanation as well as your plots.