Coordinate descent
Can Yang
Department of Mathematics
The Hong Kong University of Science and Technology
The first nine slides are from Prof. Ryan Tibshirani at CMU.
Spring, 2020
Outline
Coordinate Descent
Coordinate descent for linear regression with convex penalties
Coordinate descent
Our focus today is a very simple technique that can be surprisingly efficient and scalable: coordinate descent, i.e., coordinate-wise minimization
Q: Given convex, differentiable $f : \mathbb{R}^n \to \mathbb{R}$, if we are at a point $x$ such that $f(x)$ is minimized along each coordinate axis, have we found a global minimizer?
I.e., does $f(x + \delta \cdot e_i) \ge f(x)$ for all $\delta, i$ imply $f(x) = \min_z f(z)$? (Here $e_i = (0, \ldots, 1, \ldots, 0) \in \mathbb{R}^n$ is the $i$th standard basis vector.)
[Figure: a convex, differentiable $f$ plotted over $(x_1, x_2)$.]
A: Yes! Proof:
$$\nabla f(x) = \Big(\frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_n}(x)\Big) = 0$$
Q: Same question, but for f convex (not differentiable) … ?
[Figure: surface and contour plot of a convex, non-differentiable $f$ over $(x_1, x_2)$: the counterexample.]
A: No! Look at the above counterexample
Q: Same question again, but now $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$, with $g$ convex, differentiable and each $h_i$ convex ... ? (The non-smooth part here is called separable)
[Figures: contour plot of the counterexample with the coordinate-wise point marked; surface and contour plots of $f$ over $(x_1, x_2)$ for the separable case.]
A: Yes! Proof: for any $y$,
$$f(y) - f(x) \ge \nabla g(x)^T (y - x) + \sum_{i=1}^n \big[h_i(y_i) - h_i(x_i)\big] = \sum_{i=1}^n \underbrace{\big[\nabla_i g(x)(y_i - x_i) + h_i(y_i) - h_i(x_i)\big]}_{\ge 0} \ge 0$$
Coordinate descent
This suggests that for $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$ (with $g$ convex, differentiable and each $h_i$ convex) we can use coordinate descent to find a minimizer: start with some initial guess $x^{(0)}$, and repeat
$$x_1^{(k)} \in \operatorname*{argmin}_{x_1} f\big(x_1, x_2^{(k-1)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_2^{(k)} \in \operatorname*{argmin}_{x_2} f\big(x_1^{(k)}, x_2, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_3^{(k)} \in \operatorname*{argmin}_{x_3} f\big(x_1^{(k)}, x_2^{(k)}, x_3, \ldots, x_n^{(k-1)}\big)$$
$$\ldots$$
$$x_n^{(k)} \in \operatorname*{argmin}_{x_n} f\big(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, \ldots, x_n\big)$$
for $k = 1, 2, 3, \ldots$ (note: after we solve for $x_i^{(k)}$, we use its new value from then on!)
Seminal work of Tseng (2001) proves that for such $f$ (provided $f$ is continuous on the compact set $\{x : f(x) \le f(x^{(0)})\}$ and $f$ attains its minimum), any limit point of $x^{(k)}$, $k = 1, 2, 3, \ldots$ is a minimizer of $f$.*
Notes:
• The order of the cycle through the coordinates is arbitrary: we can use any permutation of $\{1, 2, \ldots, n\}$
• We can everywhere replace individual coordinates with blocks of coordinates
• The "one-at-a-time" update scheme is critical; an "all-at-once" scheme does not necessarily converge
* Using real analysis, we know that $x^{(k)}$ has a subsequence converging to $x^\star$ (Bolzano-Weierstrass), and $f(x^{(k)})$ converges to $f^\star$ (monotone convergence)
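To make the scheme concrete, here is a minimal MATLAB sketch (not from the original slides) of cyclic coordinate descent on the smooth convex quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$ with $A$ positive definite; each coordinate minimization has the closed form $x_i = \big(b_i - \sum_{j \ne i} A_{ij}x_j\big)/A_{ii}$. The matrix A and vector b below are hypothetical example data.

% Cyclic coordinate descent on f(x) = 0.5*x'*A*x - b'*x, A positive definite
A = [4 1 0; 1 3 1; 0 1 2];               % example data (any symmetric PD matrix works)
b = [1; 2; 3];
n = length(b);
x = zeros(n, 1);                         % initial guess x^(0)
for k = 1:100                            % outer cycles
    for i = 1:n                          % one-at-a-time coordinate updates
        r = b(i) - A(i,:)*x + A(i,i)*x(i);   % b_i - sum_{j~=i} A_ij * x_j
        x(i) = r / A(i,i);               % exact minimizer along coordinate i
    end
end
% x now approximates A\b, the global minimizer of f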
Outline
Coordinate Descent
Coordinate descent for linear regression with convex penalties
Linear Regression
Consider linear regression
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|^2$$
where $y \in \mathbb{R}^n$, and $X \in \mathbb{R}^{n \times p}$ with columns $X_1, \ldots, X_p$. Minimizing over $\beta_i$, with all $\beta_j$, $j \ne i$ fixed:
$$0 = \nabla_i f(\beta) = X_i^T(X\beta - y) = X_i^T(X_i\beta_i + X_{-i}\beta_{-i} - y)$$
i.e., we take
$$\beta_i = \frac{X_i^T(y - X_{-i}\beta_{-i})}{X_i^T X_i}$$
Coordinate descent repeats this update for i = 1,2,…,p,1,2,…
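A minimal MATLAB sketch of this update (not from the original slides); the data below are hypothetical, and any $X$ with no all-zero columns works:

% Coordinate descent for min_beta 0.5*||y - X*beta||^2
X = randn(20, 5);  y = randn(20, 1);          % hypothetical example data
[n, p] = size(X);
beta = zeros(p, 1);
for k = 1:100                                 % cycles i = 1,...,p, repeated
    for i = 1:p
        r = y - X*beta + X(:,i)*beta(i);      % partial residual y - X_{-i}*beta_{-i}
        beta(i) = (X(:,i)' * r) / (X(:,i)' * X(:,i));
    end
end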
Soft shrinkage operator
Consider
$$\min_\beta \frac{1}{2}(\tilde\beta - \beta)^2 + \lambda|\beta|$$
If $\beta > 0$, setting the gradient to zero gives $\beta - \tilde\beta + \lambda = 0$, so the solution is $\beta = \tilde\beta - \lambda$ as long as $\tilde\beta > 0$ and $\lambda < \tilde\beta$; otherwise 0 is the solution.
If $\beta < 0$, we can argue similarly.
The soft shrinkage operator is
$$\beta = S(\tilde\beta, \lambda) = \operatorname*{argmin}_\beta \frac{1}{2}(\tilde\beta - \beta)^2 + \lambda|\beta| = \begin{cases} \tilde\beta - \lambda & \text{if } \tilde\beta > 0 \text{ and } \lambda < |\tilde\beta| \\ \tilde\beta + \lambda & \text{if } \tilde\beta < 0 \text{ and } \lambda < |\tilde\beta| \\ 0 & \text{if } \lambda \ge |\tilde\beta| \end{cases} = \operatorname{sign}(\tilde\beta)\,(|\tilde\beta| - \lambda)_+.$$
Matlab code
function b = softshrinkage(a, lam)
% Soft shrinkage (soft thresholding): b = sign(a).*(|a| - lam)_+,
% applied elementwise, so a may be a scalar, vector, or matrix.
b = max(0, a-lam) - max(0, -a-lam);
end
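For example, a quick check of the operator (assuming softshrinkage above is on the MATLAB path):

softshrinkage([-3 -1 0 1 3], 2)   % returns [-1 0 0 0 1]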
Shrinkage operators
Similarly, we have the shrinkage operator for Ridge regression:
$$\beta^{\mathrm{Ridge}} = \operatorname*{argmin}_\beta \frac{1}{2}(\tilde\beta - \beta)^2 + \frac{\lambda}{2}\beta^2 = \frac{\tilde\beta}{1 + \lambda}.$$
[Figure: the shrinkage operators for Ridge (left) and Lasso (right) as functions of $\tilde\beta$.]
Coordinate Descent for Lasso. I
Consider Lasso regression
$$\min_\beta \frac{1}{2}\sum_{i=1}^n\Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p |\beta_j|$$
Suppose $y$ has been centered and each $X_j$ has been standardized (i.e., $\sum_i x_{ij}^2 = 1$). Then the problem becomes
$$\min_\beta \frac{1}{2}\sum_{i=1}^n\Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p |\beta_j|$$
Suppose we are solving for $\beta_j$ while the other $\beta_k$, $k \ne j$, are fixed. Rearranging the above objective,
$$\min_{\beta_j} R(\beta_j) = \min_{\beta_j} \frac{1}{2}\sum_{i=1}^n\Big(y_i - \underbrace{\sum_{k \ne j} x_{ik}\beta_k}_{\tilde y_i^{(j)}} - x_{ij}\beta_j\Big)^2 + \lambda\sum_{k \ne j}|\beta_k| + \lambda|\beta_j|$$
Coordinate Descent for Lasso. II
Minimizing $R$ w.r.t. $\beta_j$,
$$\frac{\partial R}{\partial \beta_j} = -\sum_i x_{ij}\big(y_i - \tilde y_i^{(j)} - x_{ij}\beta_j\big) + \lambda\,\operatorname{sign}(\beta_j) = 0,$$
we have a closed-form solution given by soft shrinkage:
$$\beta_j = S\Big(\sum_i x_{ij}\big(y_i - \tilde y_i^{(j)}\big),\ \lambda\Big)$$
Algorithm: Initialize all $\beta_j = 0$. Cycle over $j = 1, 2, \ldots, p, 1, 2, \ldots$ till convergence:
• Compute $\tilde y_i^{(j)} = \sum_{k \ne j} x_{ik}\beta_k$;
• $\beta_j \leftarrow S\Big(\sum_i x_{ij}\big(y_i - \tilde y_i^{(j)}\big),\ \lambda\Big)$.
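A minimal MATLAB sketch of this algorithm (not from the original slides): it assumes $y$ is centered, each column of $X$ satisfies $\sum_i x_{ij}^2 = 1$, a fixed number of cycles stands in for a convergence check, and softshrinkage is the function defined earlier.

% Coordinate descent for the Lasso with standardized columns (X(:,j)'*X(:,j) = 1)
function beta = lasso_cd(X, y, lambda, maxiter)
p = size(X, 2);
beta = zeros(p, 1);                           % initialize all beta_j = 0
for k = 1:maxiter                             % cycle j = 1,...,p, repeated
    for j = 1:p
        ytilde = X*beta - X(:,j)*beta(j);     % ytilde_i^(j) = sum_{k~=j} x_ik * beta_k
        z = X(:,j)' * (y - ytilde);           % sum_i x_ij * (y_i - ytilde_i^(j))
        beta(j) = softshrinkage(z, lambda);   % soft-shrinkage update
    end
end
end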
Other applications
The elastic net:
$$\min_\beta \frac{1}{2}\sum_{i=1}^n\Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda_1\sum_{j=1}^p|\beta_j| + \lambda_2\sum_{j=1}^p \beta_j^2/2.$$
The coordinate-wise update has the form
$$\beta_j \leftarrow \frac{S\big(\sum_{i=1}^n x_{ij}(y_i - \tilde y_i^{(j)});\ \lambda_1\big)}{1 + \lambda_2}.$$
Ridge regression (a special case of the elastic net):
$$\min_\beta \frac{1}{2}\sum_{i=1}^n\Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \frac{\lambda}{2}\sum_{j=1}^p \beta_j^2.$$
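Under the same assumptions as the Lasso sketch above (standardized columns, centered y, softshrinkage defined earlier, a fixed number of cycles in place of a convergence check), the coordinate update only gains the $1 + \lambda_2$ scaling:

% Coordinate descent for the elastic net (a sketch; columns of X standardized)
function beta = enet_cd(X, y, lambda1, lambda2, maxiter)
p = size(X, 2);
beta = zeros(p, 1);
for k = 1:maxiter
    for j = 1:p
        ytilde = X*beta - X(:,j)*beta(j);     % fit excluding predictor j
        z = X(:,j)' * (y - ytilde);
        beta(j) = softshrinkage(z, lambda1) / (1 + lambda2);   % soft-shrink, then scale
    end
end
end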
Other applications
The nonnegative garrote:
$$\min_c \frac{1}{2}\sum_{i=1}^n\Big(y_i - \sum_{j=1}^p x_{ij}c_j\hat\beta_j\Big)^2 + \lambda\sum_{j=1}^p c_j \quad \text{subject to } c_j \ge 0,$$
where the $\hat\beta_j$ are the usual least squares estimates ($p \le n$). The coordinate-wise update has the form
$$c_j \leftarrow \left(\frac{\tilde\beta_j\hat\beta_j - \lambda}{\hat\beta_j^2}\right)_+,$$
where $\tilde\beta_j = \sum_{i=1}^n x_{ij}\big(y_i - \tilde y_i^{(j)}\big)$ and $\tilde y_i^{(j)} = \sum_{k \ne j} x_{ik}c_k\hat\beta_k$; see the sketch after this slide.
Group Lasso:
$$\min_\beta \Big\|y - \sum_{j=1}^m X_j\beta_j\Big\|^2 + \sum_{j=1}^m \lambda_j\|\beta_j\|_2,$$
where $X_j$ is an $N \times p_j$ orthogonal matrix and $\lambda_j = \lambda\sqrt{p_j}$.
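Returning to the nonnegative garrote, a hedged MATLAB sketch of its coordinate updates (not from the slides; it assumes $p \le n$, standardized columns of $X$ so the update above applies, and nonzero least squares coefficients):

% Coordinate descent for the nonnegative garrote (a sketch)
function c = garrote_cd(X, y, lambda, maxiter)
p = size(X, 2);
betahat = X \ y;                       % usual least squares estimates (p <= n)
c = ones(p, 1);                        % start from the unshrunk fit
for k = 1:maxiter
    for j = 1:p
        fit = X*(c.*betahat) - X(:,j)*c(j)*betahat(j);   % ytilde_i^(j)
        betatilde = X(:,j)' * (y - fit);                 % beta~_j
        c(j) = max(0, (betatilde*betahat(j) - lambda) / betahat(j)^2);
    end
end
end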
[Figure: Lasso coefficient paths (beta versus log(lambda)) computed by glmnet, my CD with active set, my CD, my GD, and my GD with momentum.]
[Figure: error versus iteration (log scale) for the methods CD, ACD, GD, and AGD.]
Reference I
T. Hastie, R. Tibshirani, J. Friedman
The Elements of Statistical Learning (2nd ed.). Springer, 2009.
R. Mazumder, J. Friedman and T. Hastie.
SparseNet: Coordinate Descent with Non-Convex Penalties. Journal of Machine Learning Research, Stanford University, 2010.
R. Mazumder, T. Hastie.
Spectral Regularization Algorithms for Learning Large Incomplete Matrices. Journal of Machine Learning Research, Stanford University, 2010.
S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein
Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.
Tech. Report, Stanford University, Oct., 2010.
L. Breiman
Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37:373-384, 1995.
P. Zhao and B. Yu
On model selection consistency of Lasso.
The Journal of Machine Learning Research, 2006.
Reference II
D. Donoho and I. Johnstone
Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425-455, 1994.
S. Chen, D. Donoho and M. Saunders
Atomic decomposition by basis pursuit. SIAM Journal on Scientific computing, 1998.
R. Tibshirani
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 1996.
J. Friedman, T. Hastie, H. Hoefling and R. Tibshirani
Pathwise coordinate optimization. Annals of applied statistics, 2007.
D. Donoho
Compressed sensing.
IEEE Tran. on Information Theory, 2006.
E. Candes and T. Tao
Near-optimal signal recovery from random projections: universal encoding strategies. IEEE Tran. on Information Theory, 2006.
Reference III
B. Efron, T. Hastie, I. Johnstone and R. Tibshirani
Least angle regression. Annals of statistics, 2004.
H. Gao and A. Bruce
WaveShrink with firm shrinkage. Statistica Sinica, 1997.
C. Zhang
Nearly unbiased variable selection under minimax concave penalty. Annals of statistics, 2011.
J. Friedman, T. Hastie, S. Rosset, R. Tibshirani and J. Zhu
Discussion of three Boosting papers. Annals of statistics, 2004.
P. Buhlmann
Invited Discussion on “Regression shrinkage and selection via the Lasso: a retrospective (Tibshirani)”.
Journal of the Royal Statistical Society: Series B, 2011.
J. Friedman, T. Hastie and R. Tibshirani
A Note on the Group Lasso and a Sparse Group Lasso. Technical report, Stanford University, 2010.
Reference IV
R. Tibshirani et al.
Strong Rules for Discarding Predictors in Lasso-type Problems. Tech. report, Stanford University, 2010.
J. Friedman, T. Hastie and R. Tibshirani
Applications of the lasso and grouped lasso to the estimation of sparse graphical models. Technical report, Stanford University, 2010.
J. Fan and R. Li
Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 2001.
J. Fan and J. Lv
Sure independence screening for ultrahigh dimensional feature space.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008.