Coordinate descent
Can Yang
Department of Mathematics
The Hong Kong University of Science and Technology
The first nine slides are from Prof. Ryan Tibshirani at CMU.
Spring, 2020
Outline
Coordinate Descent
Coordinate descent for linear regression with convex penalties
Coordinate descent
Our focus today is a very simple technique that can be surprisingly efficient and scalable: coordinate descent, i.e., coordinate-wise minimization
Q: Given convex, differentiable $f : \mathbb{R}^n \to \mathbb{R}$, if we are at a point $x$ such that $f(x)$ is minimized along each coordinate axis, have we found a global minimizer?
I.e., does $f(x + \delta \cdot e_i) \ge f(x)$ for all $\delta, i$ imply $f(x) = \min_z f(z)$? (Here $e_i = (0, \ldots, 1, \ldots, 0) \in \mathbb{R}^n$ is the $i$th standard basis vector.)
[Figure: a convex, differentiable $f$ plotted over $(x_1, x_2)$.]
A: Yes! Proof:
$$\nabla f(x) = \Big(\frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_n}(x)\Big) = 0$$
Q: Same question, but for f convex (not differentiable) … ?
[Figure: surface and contour plot of a convex, non-differentiable $f$ over $(x_1, x_2)$: the counterexample.]
A: No! Look at the above counterexample
Q: Same question again, but now $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$, with $g$ convex, differentiable and each $h_i$ convex ... ? (The non-smooth part here is called separable)
[Figures: contour plot of the counterexample with the coordinate-wise point marked; surface and contour plots of $f$ over $(x_1, x_2)$ for the separable case.]
A: Yes! Proof: for any $y$,
$$f(y) - f(x) \ge \nabla g(x)^T (y - x) + \sum_{i=1}^n \big[h_i(y_i) - h_i(x_i)\big] = \sum_{i=1}^n \underbrace{\big[\nabla_i g(x)(y_i - x_i) + h_i(y_i) - h_i(x_i)\big]}_{\ge 0} \ge 0$$
Coordinate descent
This suggests that for $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$ (with $g$ convex, differentiable and each $h_i$ convex) we can use coordinate descent to find a minimizer: start with some initial guess $x^{(0)}$, and repeat
$$x_1^{(k)} \in \operatorname*{argmin}_{x_1} f\big(x_1, x_2^{(k-1)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_2^{(k)} \in \operatorname*{argmin}_{x_2} f\big(x_1^{(k)}, x_2, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_3^{(k)} \in \operatorname*{argmin}_{x_3} f\big(x_1^{(k)}, x_2^{(k)}, x_3, \ldots, x_n^{(k-1)}\big)$$
$$\ldots$$
$$x_n^{(k)} \in \operatorname*{argmin}_{x_n} f\big(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, \ldots, x_n\big)$$
for $k = 1, 2, 3, \ldots$ (note: after we solve for $x_i^{(k)}$, we use its new value from then on!)
Seminal work of Tseng (2001) proves that for such $f$ (provided $f$ is continuous on the compact set $\{x : f(x) \le f(x^{(0)})\}$ and $f$ attains its minimum), any limit point of $x^{(k)}$, $k = 1, 2, 3, \ldots$ is a minimizer of $f$.*
Notes:
• The order of the cycle through the coordinates is arbitrary: we can use any permutation of $\{1, 2, \ldots, n\}$
• We can everywhere replace individual coordinates with blocks of coordinates
• The "one-at-a-time" update scheme is critical; an "all-at-once" scheme does not necessarily converge
* Using real analysis, we know that $x^{(k)}$ has a subsequence converging to $x^\star$ (Bolzano-Weierstrass), and $f(x^{(k)})$ converges to $f^\star$ (monotone convergence)
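To make the scheme concrete, here is a minimal MATLAB sketch (not from the original slides) of cyclic coordinate descent on the smooth convex quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$ with $A$ positive definite; each coordinate minimization has the closed form $x_i = \big(b_i - \sum_{j \ne i} A_{ij}x_j\big)/A_{ii}$. The matrix A and vector b below are hypothetical example data.

% Cyclic coordinate descent on f(x) = 0.5*x'*A*x - b'*x, A positive definite
A = [4 1 0; 1 3 1; 0 1 2];               % example data (any symmetric PD matrix works)
b = [1; 2; 3];
n = length(b);
x = zeros(n, 1);                         % initial guess x^(0)
for k = 1:100                            % outer cycles
    for i = 1:n                          % one-at-a-time coordinate updates
        r = b(i) - A(i,:)*x + A(i,i)*x(i);   % b_i - sum_{j~=i} A_ij * x_j
        x(i) = r / A(i,i);               % exact minimizer along coordinate i
    end
end
% x now approximates A\b, the global minimizer of f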
Outline
Coordinate Descent
Coordinate descent for linear regression with convex penalties
Linear Regression
Consider linear regression
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|^2$$
where $y \in \mathbb{R}^n$, and $X \in \mathbb{R}^{n \times p}$ with columns $X_1, \ldots, X_p$. Minimizing over $\beta_i$, with all $\beta_j$, $j \ne i$ fixed:
$$0 = \nabla_i f(\beta) = X_i^T(X\beta - y) = X_i^T(X_i\beta_i + X_{-i}\beta_{-i} - y)$$
i.e., we take
$$\beta_i = \frac{X_i^T(y - X_{-i}\beta_{-i})}{X_i^T X_i}$$
Coordinate descent repeats this update for i = 1,2,…,p,1,2,…
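A minimal MATLAB sketch of this update (not from the original slides); the data below are hypothetical, and any $X$ with no all-zero columns works:

% Coordinate descent for min_beta 0.5*||y - X*beta||^2
X = randn(20, 5);  y = randn(20, 1);          % hypothetical example data
[n, p] = size(X);
beta = zeros(p, 1);
for k = 1:100                                 % cycles i = 1,...,p, repeated
    for i = 1:p
        r = y - X*beta + X(:,i)*beta(i);      % partial residual y - X_{-i}*beta_{-i}
        beta(i) = (X(:,i)' * r) / (X(:,i)' * X(:,i));
    end
end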
Soft shrinkage operator
Consider
$$\min_\beta \frac{1}{2}(\tilde\beta - \beta)^2 + \lambda|\beta|$$
If $\beta > 0$, setting the gradient to zero gives $\beta - \tilde\beta + \lambda = 0$, so the solution is $\beta = \tilde\beta - \lambda$ as long as $\tilde\beta > 0$ and $\lambda < \tilde\beta$; otherwise 0 is the solution.
If $\beta < 0$, we can argue similarly.
The soft shrinkage operator is
$$\beta = S(\tilde\beta, \lambda) = \operatorname*{argmin}_\beta \frac{1}{2}(\tilde\beta - \beta)^2 + \lambda|\beta| = \begin{cases} \tilde\beta - \lambda & \text{if } \tilde\beta > 0 \text{ and } \lambda < |\tilde\beta| \\ \tilde\beta + \lambda & \text{if } \tilde\beta < 0 \text{ and } \lambda < |\tilde\beta| \\ 0 & \text{if } \lambda \ge |\tilde\beta| \end{cases} = \operatorname{sign}(\tilde\beta)\,(|\tilde\beta| - \lambda)_+.$$
Matlab code
function b = softshrinkage(a, lam)
% Soft shrinkage (soft thresholding): b = sign(a).*(|a| - lam)_+,
% applied elementwise, so a may be a scalar, vector, or matrix.
b = max(0, a-lam) - max(0, -a-lam);
end
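For example, a quick check of the operator (assuming softshrinkage above is on the MATLAB path):

softshrinkage([-3 -1 0 1 3], 2)   % returns [-1 0 0 0 1]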
Shrinkage operators
Similarly, we have the shrinkage operator for Ridge regression:
$$\beta^{\mathrm{Ridge}} = \operatorname*{argmin}_\beta \frac{1}{2}(\tilde\beta - \beta)^2 + \frac{\lambda}{2}\beta^2 = \frac{\tilde\beta}{1 + \lambda}.$$
[Figure: the shrinkage operators for Ridge (left) and Lasso (right) as functions of $\tilde\beta$.]
Coordinate Descent for Lasso. I
Consider Lasso regression
$$\min_\beta \frac{1}{2}\sum_{i=1}^n\Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p |\beta_j|$$
Suppose $y$ has been centered and each $X_j$ has been standardized (i.e., $\sum_i x_{ij}^2 = 1$). Then the problem becomes
$$\min_\beta \frac{1}{2}\sum_{i=1}^n\Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p |\beta_j|$$
Suppose we are solving for $\beta_j$ while the other $\beta_k$, $k \ne j$, are fixed. Rearranging the above objective,
$$\min_{\beta_j} R(\beta_j) = \min_{\beta_j} \frac{1}{2}\sum_{i=1}^n\Big(y_i - \underbrace{\sum_{k \ne j} x_{ik}\beta_k}_{\tilde y_i^{(j)}} - x_{ij}\beta_j\Big)^2 + \lambda\sum_{k \ne j}|\beta_k| + \lambda|\beta_j|$$
Coordinate Descent for Lasso. II
Minimizing $R$ w.r.t. $\beta_j$,
$$\frac{\partial R}{\partial \beta_j} = -\sum_i x_{ij}\big(y_i - \tilde y_i^{(j)} - x_{ij}\beta_j\big) + \lambda\,\operatorname{sign}(\beta_j) = 0,$$
we have a closed-form solution given by soft shrinkage:
$$\beta_j = S\Big(\sum_i x_{ij}\big(y_i - \tilde y_i^{(j)}\big),\ \lambda\Big)$$
Algorithm: Initialize all $\beta_j = 0$. Cycle over $j = 1, 2, \ldots, p, 1, 2, \ldots$ till convergence:
• Compute $\tilde y_i^{(j)} = \sum_{k \ne j} x_{ik}\beta_k$;
• $\beta_j \leftarrow S\Big(\sum_i x_{ij}\big(y_i - \tilde y_i^{(j)}\big),\ \lambda\Big)$.
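A minimal MATLAB sketch of this algorithm (not from the original slides): it assumes $y$ is centered, each column of $X$ satisfies $\sum_i x_{ij}^2 = 1$, a fixed number of cycles stands in for a convergence check, and softshrinkage is the function defined earlier.

% Coordinate descent for the Lasso with standardized columns (X(:,j)'*X(:,j) = 1)
function beta = lasso_cd(X, y, lambda, maxiter)
p = size(X, 2);
beta = zeros(p, 1);                           % initialize all beta_j = 0
for k = 1:maxiter                             % cycle j = 1,...,p, repeated
    for j = 1:p
        ytilde = X*beta - X(:,j)*beta(j);     % ytilde_i^(j) = sum_{k~=j} x_ik * beta_k
        z = X(:,j)' * (y - ytilde);           % sum_i x_ij * (y_i - ytilde_i^(j))
        beta(j) = softshrinkage(z, lambda);   % soft-shrinkage update
    end
end
end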
Other applications
The elastic net:
$$\min_\beta \frac{1}{2}\sum_{i=1}^n\Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda_1\sum_{j=1}^p|\beta_j| + \lambda_2\sum_{j=1}^p \beta_j^2/2.$$
The coordinate-wise update has the form
$$\beta_j \leftarrow \frac{S\big(\sum_{i=1}^n x_{ij}(y_i - \tilde y_i^{(j)});\ \lambda_1\big)}{1 + \lambda_2}.$$
Ridge regression (a special case of the elastic net):
$$\min_\beta \frac{1}{2}\sum_{i=1}^n\Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \frac{\lambda}{2}\sum_{j=1}^p \beta_j^2.$$
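Under the same assumptions as the Lasso sketch above (standardized columns, centered y, softshrinkage defined earlier, a fixed number of cycles in place of a convergence check), the coordinate update only gains the $1 + \lambda_2$ scaling:

% Coordinate descent for the elastic net (a sketch; columns of X standardized)
function beta = enet_cd(X, y, lambda1, lambda2, maxiter)
p = size(X, 2);
beta = zeros(p, 1);
for k = 1:maxiter
    for j = 1:p
        ytilde = X*beta - X(:,j)*beta(j);     % fit excluding predictor j
        z = X(:,j)' * (y - ytilde);
        beta(j) = softshrinkage(z, lambda1) / (1 + lambda2);   % soft-shrink, then scale
    end
end
end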
Other applications
The nonnegative garrote:
$$\min_c \frac{1}{2}\sum_{i=1}^n\Big(y_i - \sum_{j=1}^p x_{ij}c_j\hat\beta_j\Big)^2 + \lambda\sum_{j=1}^p c_j \quad \text{subject to } c_j \ge 0,$$
where the $\hat\beta_j$ are the usual least squares estimates ($p \le n$). The coordinate-wise update has the form
$$c_j \leftarrow \left(\frac{\tilde\beta_j\hat\beta_j - \lambda}{\hat\beta_j^2}\right)_+,$$
where $\tilde\beta_j = \sum_{i=1}^n x_{ij}\big(y_i - \tilde y_i^{(j)}\big)$ and $\tilde y_i^{(j)} = \sum_{k \ne j} x_{ik}c_k\hat\beta_k$; see the sketch after this slide.
Group Lasso:
$$\min_\beta \Big\|y - \sum_{j=1}^m X_j\beta_j\Big\|^2 + \sum_{j=1}^m \lambda_j\|\beta_j\|_2,$$
where $X_j$ is an $N \times p_j$ orthogonal matrix and $\lambda_j = \lambda\sqrt{p_j}$.
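Returning to the nonnegative garrote, a hedged MATLAB sketch of its coordinate updates (not from the slides; it assumes $p \le n$, standardized columns of $X$ so the update above applies, and nonzero least squares coefficients):

% Coordinate descent for the nonnegative garrote (a sketch)
function c = garrote_cd(X, y, lambda, maxiter)
p = size(X, 2);
betahat = X \ y;                       % usual least squares estimates (p <= n)
c = ones(p, 1);                        % start from the unshrunk fit
for k = 1:maxiter
    for j = 1:p
        fit = X*(c.*betahat) - X(:,j)*c(j)*betahat(j);   % ytilde_i^(j)
        betatilde = X(:,j)' * (y - fit);                 % beta~_j
        c(j) = max(0, (betatilde*betahat(j) - lambda) / betahat(j)^2);
    end
end
end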
[Figure: Lasso coefficient paths (beta versus log(lambda)) computed by glmnet, my CD with active set, my CD, my GD, and my GD with momentum.]
[Figure: error versus iteration (log scale) for the methods CD, ACD, GD, and AGD.]
Reference I
T. Hastie, R. Tibshirani, J. Friedman
The Elements of Statistical Learning (2nd ed.). Springer, 2009.
R. Mazumder, J. Friedman and T. Hastie.
SparseNet: Coordinate Descent with Non-Convex Penalties. Journal of Machine Learning Research, Stanford University, 2010.
R. Mazumder, T. Hastie.
Spectral Regularization Algorithms for Learning Large Incomplete Matrices. Journal of Machine Learning Research, Stanford University, 2010.
S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein
Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.
Tech. Report, Stanford University, Oct., 2010.
L. Breiman
Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37:373-384, 1995.
P. Zhao and B. Yu
On model selection consistency of Lasso.
The Journal of Machine Learning Research, 2006.
Reference II
D. Donoho and I. Johnstone
Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425-455, 1994.
S. Chen, D. Donoho and M. Saunders
Atomic decomposition by basis pursuit. SIAM Journal on Scientific computing, 1998.
R. Tibshirani
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 1996.
J. Friedman, T. Hastie, H. Hoefling and R. Tibshirani
Pathwise coordinate optimization. Annals of applied statistics, 2007.
D. Donoho
Compressed sensing.
IEEE Tran. on Information Theory, 2006.
E. Candes and T. Tao
Near-optimal signal recovery from random projections: universal encoding strategies. IEEE Tran. on Information Theory, 2006.
Reference III
B. Efron, T. Hastie, I. Johnstone and R. Tibshirani
Least angle regression. Annals of statistics, 2004.
H. Gao and A. Bruce
WaveShrink with firm shrinkage. Statistica Sinica, 1997.
C. Zhang
Nearly unbiased variable selection under minimax concave penalty. Annals of statistics, 2011.
J. Friedman, T. Hastie, S. Rosset, R. Tibshirani and J. Zhu
Discussion of three Boosting papers. Annals of statistics, 2004.
P. Buhlmann
Invited Discussion on “Regression shrinkage and selection via the Lasso: a retrospective (Tibshirani)”.
Journal of the Royal Statistical Society: Series B, 2011.
J. Friedman, T. Hastie and R. Tibshirani
A Note on the Group Lasso and a Sparse Group Lasso. Technical report, Stanford University, 2010.
Reference IV
R. Tibshirani et al.
Strong Rules for Discarding Predictors in Lasso-type Problems. Tech. report, Stanford University, 2010.
J. Friedman, T. Hastie and R. Tibshirani
Applications of the lasso and grouped lasso to the estimation of sparse graphical models. Technical report, Stanford University, 2010.
J. Fan and R. Li
Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 2001.
J. Fan and J. Lv
Sure independence screening for ultrahigh dimensional feature space.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008.