CAP 6617 Advanced Machine Learning, Fall 2022
Homework 2
Due 10/6/2022 11:59PM
1. Proximal (stochastic) subgradient method? Consider the problem of minimizing L(θ) + λr(θ) where both L and r are nonsmooth, but r admits an efficient proximal operator. Instead of simply taking the subgradient of the combined function L + λr, one may be tempted to try the “proximal subgradient method” to utilize the efficient proximal operator of r:
obtain g(t) ∈ ∂L(θ(t))
θ(t+1) ← Prox_{γ(t)λr}(θ(t) − γ(t)g(t))
In this exercise we show that this does not hurt the theoretical convergence of the subgradient method. In fact, it requires only the slightly milder condition that the subgradients of L are bounded, rather than those of the combined function L + λr, as the plain subgradient method would require.
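For concreteness, a minimal NumPy sketch of one run of this iteration is given below. The choices r(θ) = ∥θ∥₁ (whose proximal operator is soft-thresholding) and L(θ) = ∥Aθ − b∥₁, as well as the names prox_l1 and subgrad_L, are illustrative assumptions only; the assignment does not fix L or r.

    import numpy as np

    def prox_l1(z, t):
        """Proximal operator of t*||.||_1 (soft-thresholding), standing in for Prox_{γ(t)λr}."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def subgrad_L(theta, A, b):
        """One subgradient of the illustrative nonsmooth loss L(θ) = ||Aθ − b||_1."""
        return A.T @ np.sign(A @ theta - b)

    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(50, 10)), rng.normal(size=50)
    lam, theta = 0.1, np.zeros(10)
    for t in range(1, 201):
        gamma = 1.0 / np.sqrt(t)                         # diminishing step size γ(t)
        g = subgrad_L(theta, A, b)                       # obtain g(t) ∈ ∂L(θ(t))
        theta = prox_l1(theta - gamma * g, gamma * lam)  # θ(t+1) ← Prox_{γ(t)λr}(θ(t) − γ(t)g(t))
    print(np.abs(A @ theta - b).sum() + lam * np.abs(theta).sum())  # final L(θ) + λr(θ)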
(a) Rewrite the algorithm as
θ(t+1) ← Prox_{γ(t)λr}(θ(t))
obtain g(t+1) ∈ ∂L(θ(t+1))
θ(t+1) ← θ(t+1) − γ(t)g(t+1)
and show that, for any θ,
L(θ(t+1)) + λr(θ(t+1)) − L(θ) − λr(θ) ≤ (1/γ(t)) (θ(t+1) − θ(t))⊤(θ − θ(t+1)).
(b) We then add and subtract the same term on the right-hand side to get
L(θ(t+1)) + λr(θ(t+1)) − L(θ) − λr(θ)
≤ (1/γ(t)) (θ(t+1) − θ(t))⊤(θ − θ(t+1)) − (θ(t+1) − θ(t))⊤g(t+1)
≤ (1/(2γ(t))) ∥θ − θ(t)∥² − (1/(2γ(t))) ∥θ − θ(t+1)∥² − (1/(2γ(t))) ∥θ(t+1) − θ(t)∥² − (θ(t+1) − θ(t))⊤g(t+1)
= (1/(2γ(t))) ∥θ − θ(t)∥² − (1/(2γ(t))) ∥θ − θ(t+1)∥² − (1/(2γ(t))) (θ(t+1) − θ(t))⊤(θ(t+1) − θ(t) + 2γ(t)g(t+1)).
Show that
(θ(t+1) − θ(t))⊤(θ(t+1) − θ(t) + 2γ(t)g(t+1)) = ∥θ(t+1) − θ(t) + γ(t)g(t+1)∥² − γ(t)²∥g(t+1)∥² ≥ −γ(t)²∥g(t+1)∥².
(You only need to show this inequality. All the steps before that are provided for you as part of the overall proof, which you don’t need to show.)
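As a quick numerical sanity check (not a proof) of the identity and inequality above, one can compare both sides on random vectors; the dimension, step size, and variable names below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    d, gamma = 5, 0.3
    theta_next, theta_curr, g = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
    a = theta_next - theta_curr                          # plays the role of θ(t+1) − θ(t)
    lhs = a @ (a + 2 * gamma * g)
    mid = np.linalg.norm(a + gamma * g) ** 2 - gamma ** 2 * np.linalg.norm(g) ** 2
    print(np.isclose(lhs, mid))                          # completion-of-square identity
    print(lhs >= -gamma ** 2 * np.linalg.norm(g) ** 2)   # the claimed lower bound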
(c) Conclude that
2γ(t) [L(θ(t+1)) + λr(θ(t+1)) − L(θ) − λr(θ)] ≤ ∥θ − θ(t)∥² − ∥θ − θ(t+1)∥² + γ(t)²∥g(t+1)∥².
We can then follow the steps on page 25 of lec3.pdf to prove convergence, assuming ∥g(t+1)∥² ≤ G for all t.
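For reference, a sketch of that argument under a constant step size γ(t) = γ and with θ taken to be a minimizer θ⋆ (the assumption being that lec3.pdf performs the analogous summation): summing the bound in (c) over t = 0, . . . , T − 1 telescopes the distance terms,
2γ Σ_{t=0}^{T−1} [L(θ(t+1)) + λr(θ(t+1)) − L(θ⋆) − λr(θ⋆)] ≤ ∥θ⋆ − θ(0)∥² − ∥θ⋆ − θ(T)∥² + γ² Σ_{t=0}^{T−1} ∥g(t+1)∥² ≤ ∥θ⋆ − θ(0)∥² + Tγ²G,
so dividing by 2γT bounds the average (and hence the best) objective gap by ∥θ⋆ − θ(0)∥²/(2γT) + γG/2.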
(d) Now consider g(t+1) being a stochastic subgradient at θ(t+1) satisfying E[g(t+1)] ∈ ∂L(θ(t+1)), where the expectation is conditioned on θ(t+1). Outline how to show convergence in expectation.
2. Hand-written digit classification. The MNIST data set is a famous data set for multi-class classification, which can be downloaded here: http://yann.lecun.com/exdb/mnist/. In this question you will implement various algorithms for multi-class logistic regression with quadratic regularization, which solves the following optimization problem:
minimize_Θ  Σ_{i=1}^{n} [ log Σ_{c=1}^{C} exp(xi⊤θc) − xi⊤θyi ] + λ∥Θ∥²,
where Θ = [θ1, . . . , θC] collects the per-class weight vectors, (xi, yi), i = 1, . . . , n, are the training examples, and yi ∈ {1, . . . , C} is the label of xi.
Here we simply assume that the features are the image pixels themselves (we even ignore the constant feature 1, i.e., the bias term).
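For concreteness, a minimal NumPy sketch of how this objective can be evaluated for a given parameter matrix is shown below; the variable names (Theta, X, y, lam) and the log-sum-exp stabilization are illustrative assumptions, not requirements of the assignment.

    import numpy as np

    def objective(Theta, X, y, lam):
        """Regularized multi-class logistic loss.
        Theta: (d, C) matrix whose columns are the class weight vectors θc.
        X:     (n, d) matrix of pixel features xi.
        y:     (n,) integer labels in {0, ..., C-1}.
        lam:   regularization weight λ.
        """
        scores = X @ Theta                           # entry (i, c) is xi⊤θc
        shift = scores.max(axis=1, keepdims=True)    # subtract row max for numerical stability
        logsumexp = shift[:, 0] + np.log(np.exp(scores - shift).sum(axis=1))
        correct = scores[np.arange(len(y)), y]       # xi⊤θyi
        return (logsumexp - correct).sum() + lam * (Theta ** 2).sum()

    # Toy usage on random data with MNIST-like shapes (784 pixels, 10 classes).
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 784)), rng.integers(0, 10, size=100)
    print(objective(np.zeros((784, 10)), X, y, lam=0.1))  # equals 100*log(10) at Θ = 0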
(a) Derive the gradient descent (GD), stochastic gradient descent (SGD), and (Nesterov's) accelerated gradient descent (AGD) algorithms for solving it. At iteration t, you can simply denote the step size as γ(t) (and similarly for the extrapolation parameter δ(t) in AGD).
(b) Implement the algorithms in your favorite programming language.
(c) Run the algorithms on the training set of MNIST for two scenarios: weakly convex (λ = 0) and strongly convex (λ = 0.1). Use a constant step size γ(t) ∈ {0.001, 0.01, 0.1, 1} and report the best result of each algorithm in a figure whose horizontal axis is the number of full gradient evaluations and whose vertical axis is the prediction error rate on the test set. Note that both GD and AGD require a full gradient evaluation in each iteration, while SGD can run n iterations at the same cost as one full gradient evaluation.
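To make the horizontal axis concrete, here is a minimal bookkeeping sketch; the names run, update_epoch, and error_rate are hypothetical, and the do-nothing update in the toy usage is only a placeholder for the GD/AGD step (or the n SGD steps) you derive in part (a).

    import numpy as np

    def error_rate(Theta, X, y):
        """Fraction of examples whose highest-scoring class is not the true label."""
        return np.mean(np.argmax(X @ Theta, axis=1) != y)

    def run(update_epoch, Theta0, X_test, y_test, num_passes):
        """Record test error after each unit of work equal to one full gradient evaluation.
        update_epoch(Theta) should perform one GD/AGD iteration, or n SGD iterations,
        and return the updated parameters.
        """
        Theta, errors = Theta0, [error_rate(Theta0, X_test, y_test)]
        for _ in range(num_passes):
            Theta = update_epoch(Theta)
            errors.append(error_rate(Theta, X_test, y_test))
        return errors

    # Toy usage with a do-nothing update and random data, just to show the bookkeeping.
    rng = np.random.default_rng(0)
    X_test, y_test = rng.normal(size=(100, 784)), rng.integers(0, 10, size=100)
    print(run(lambda Theta: Theta, np.zeros((784, 10)), X_test, y_test, num_passes=3))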