代写 R C algorithm Scheme math scala database statistic network Bayesian theory Journal of Machine Learning Research 16 (2015) 993-1034 Submitted 12/13; Revised 8/14; Published 5/15

Journal of Machine Learning Research 16 (2015) 993-1034 Submitted 12/13; Revised 8/14; Published 5/15
Learning with the Maximum Correntropy Criterion Induced Losses for Regression
Yunlong Feng yunlong.feng@esat.kuleuven.be Xiaolin Huang huangxl06@mails.tsinghua.edu.cn Department of Electrical Engineering, ESAT-STADIUS, KU Leuven
Kasteelpark Arenberg 10, Leuven, B-3001, Belgium
Lei Shi leishi@fudan.edu.cn Shanghai Key Laboratory for Contemporary Applied Mathematics
School of Mathematical Sciences, Fudan University, Shanghai, 200433, P.R. China
Yuning Yang yuning.yang@esat.kuleuven.be Johan A. K. Suykens johan.suykens@esat.kuleuven.be Department of Electrical Engineering, ESAT-STADIUS, KU Leuven
Kasteelpark Arenberg 10, Leuven, B-3001, Belgium
Editor: Saharon Rosset
Abstract
Within the statistical learning framework, this paper studies the regression model associ- ated with the correntropy induced losses. The correntropy, as a similarity measure, has been frequently employed in signal processing and pattern recognition. Motivated by its empirical successes, this paper aims at presenting some theoretical understanding towards the maximum correntropy criterion in regression problems. Our focus in this paper is two- fold: first, we are concerned with the connections between the regression model associated with the correntropy induced loss and the least squares regression model. Second, we study its convergence property. A learning theory analysis which is centered around the above two aspects is conducted. From our analysis, we see that the scale parameter in the loss function balances the convergence rates of the regression model and its robustness. We then make some efforts to sketch a general view on robust loss functions when being applied into the learning for regression problems. Numerical experiments are also implemented to verify the effectiveness of the model.
Keywords: correntropy, the maximum correntropy criterion, robust regression, robust loss function, least squares regression, statistical learning theory
1. Introduction and Motivation
Recently, a generalized correlation function named correntropy (see Santamar ́ıa et al., 2006) has drawn much attention in the signal processing and machine learning community (see Liu et al., 2007; Gunduz and Pr ́ıncipe, 2009; He et al., 2011a,b). Formally speaking, correntropy is a generalized similarity measure between two scalar random variables U and V , which is defined by Vσ(U,V) = EKσ(U,V). Here Kσ is a Gaussian kernel given by Kσ(u,v) = exp 􏰂−(u − v)2/σ2􏰃 with the scale parameter σ > 0, (u, v) being a realization of (U, V ).
⃝c 2015 Yunlong Feng, Xiaolin Huang, Lei Shi, Yuning Yang, and Johan A.K. Suykens.

Feng, Huang, Shi, Yang, and Suykens
In this paper, we are interested in the application of the similarity measure Vσ in re- gression problems, namely, the maximum correntropy criterion for regression. Therefore, we first assume that the data generation model is given as
Y=f⋆(X)+ε, E(ε|X=x)=0. (1)
In model (1), X is the explanatory variable that takes values in a separable metric space X and Y ∈ Y = R stands for the response variable. The main purpose of the regression problem is to estimate f∗ from a set of observations generated by (1). The underlying unknown probability distribution on Z := X × Y is denoted as ρ.
Under the regression model (1), probably the most widely employed methodology for quantifying the regression efficiency is the mean squared error. This is the classical tool that minimizes the variance of ε and belongs to the second-order statistics. The drawback of the second-order statistics is that its optimality depends heavily on the assumption of Gaussianity. However, in many real-life applications, data may be contaminated by non- Gaussian noise or outliers. This motivates the introduction of the maximum correntropy criterion into the regression problems.
Given a set of i.i.d observations z = {(xi,yi)}mi=1, for any f : X → Y, the empirical estimator of the correntropy between f(X) and Y is given as
1 􏰓m
V􏰖σ,z(f) = m
The maximum correntropy criterion for regression models the output function via maximiz-
ing the empirical estimator of Vσ as follows
fz =argmaxV􏰖σ,z(f),
f∈H
where H is a certain underlying hypothesis space. The maximum correntropy criterion in regression problems has shown its efficiency for cases when the noises are non-Gaussian, and also with large outliers (see Santamar ́ıa et al., 2006; Liu et al., 2007; Pr ́ıncipe, 2010; Wang et al., 2013). It has also succeeded in many real-world applications, e.g., wind power forecasting (see Bessa et al., 2009) and pattern recognition (see He et al., 2011b).
In this paper, we attempt to present a theoretical understanding on the maximum correntropy criterion for regression (MCCR) within the statistical learning framework. To this end, we first generalize the idea of the maximum correntropy criterion in regression problems using the following supervised regression loss:
Definition 1 The correntropy induced regression loss lσ : R × R → [0, +∞) is defined as 2 􏰉 −(y−t)2 􏰊
lσ(y,t)=σ 1−e σ2 , y∈Y,t∈R, with σ > 0 being a scale parameter.
Figure 1 plots the correntropy induced loss function lσ (the lσ loss for short in what follows) with different choices of σ. Associated with this regression loss, the MCCR model
Kσ(yi,f(xi)).
i=1
994

Maximum Correntropy Criterion in Regression
1.5
1
σ = 0.8 0.5
σ = 1.1
σ = 0.6
0
−6 −4 −2 0 2 4 6
y−t
Figure 1: Plots of lσ(y, t) = σ2(1 − e−(y−t)2/σ2 ) with respect to y − t for different σ values: σ = 0.6 (the dashed curve), σ = 0.8 (the dotted-dashed curve), and σ = 1.1 (the dotted curve).
that we will investigate is the following empirical risk minimization (ERM) scheme 1 􏰓m
fz = argmin lσ(yi,f(xi)), (2) f∈H m i=1
where, throughout, the hypothesis space H is assumed to be a compact subset of C(X). Here C (X ) is the Banach space of continuous functions on X with the norm ∥f ∥∞ = supx∈X |f(x)|. Note that the compactness of H ensures the existence of the empirical target function fz.
We remark that the lσ loss is in fact a variant of the Welsch function, which was originally introduced in robust statistics (see Holland and Welsch, 1977; Dennis and Welsch, 1978). Consequently, the estimator from the MCCR model (2) is essentially a non-parametric M- estimator. For linear regression models, the robustness and the consistency properties of the M-estimator induced by the lσ loss have been investigated in Wang et al. (2013). In Santamar ́ıa et al. (2006) and Liu et al. (2007), an information-theoretical interpretation of the lσ loss by viewing it as a correlation measurement is provided.
However, existing theoretical results on understanding the lσ loss and the MCCR model are still very limited, the barriers of which lie in their non-convexity properties. From Taylor’s expansion, it is easy to see that there holds lσ(t) ≈ t2 for sufficiently large σ. Therefore, in some existing empirical studies, the lσ loss has been roughly taken as the least squares loss when σ is large enough. However, our studies in this paper suggest that this is in general not the case. On the other hand, the consistency property and the convergence rates of the MCCR model are yet unknown, which are the central focuses of the statistical learning research. In view of the above considerations, in this paper, our main concerns are the following two aspects:
995
lσ(y,t)

fz=argmin −
G
,
􏰗

Feng, Huang, Shi, Yang, and Suykens
– We are concerned with the connections between the lσ loss and the least squares loss when they are employed in regression problems. Therefore, we will study the relations between the MCCR model (2) and the ERM-based least squares regression (LSR) model.
– We are concerned with the approximation ability of the output function fz modeled by (2). More concretely, we aim at carrying out a learning theory analysis to bound the difference between fz and f⋆.
It should be mentioned that our study on the MCCR model (2) is inspired by Hu et al. (2013), which presented comprehensive and thorough studies on the minimum error entropy criterion from a learning theory viewpoint. According to Hu et al. (2013), a specific form of the minimum error entropy criterion for regression (MEECR) can be stated as
m 2
σ2 􏰓􏰓 f∈H m(m−1)i=1j̸=i
􏰋[(yi −f(xi))−(yj −f(xj))] 􏰌 2
where G(·) is a window function and can be chosen as G(t) = exp(−t). Hu et al. (2013,
2014) presented the first results concerning the regression consistency and convergence rates
of the above MEECR model and its regularized variant when σ becomes large. Concerning
the two regression models, we can see that MEECR models the empirical target function
fz via a pairwise empirical risk minimization scheme while the MCCR model learns in a point-wise fashion. More discussions on the two different learning schemes will be provided in Section 2.
The rest of this paper is organized as follows. In Section 2, results on the convergence rates of the MCCR model (2) in different situations are provided. Discussions and compar- isons with related studies will be also presented. Section 3 concerns connections between the two regression models: MCCR and LSR, which are explored from three aspects. Section 4 is dedicated to analyzing the MCCR model and giving proofs of theoretical results stated in Section 2. Discussions on the role that the scale parameter σ in the lσ loss plays is given in Section 5. Section 6 makes some efforts in sketching a general view of learning with robust regression losses. Numerical experiments are implemented in Section 7. We end this paper with concluding remarks in Section 8.
2. Theoretical Results on Convergence Rates and Discussions
In this section, we provide theoretical results on the convergence rates of the MCCR model (2). Explicitly, denoting ρX as the marginal distribution of ρ on X, we are going to bound
􏰗
∥fz − fρ∥2L2ρ
Y
and is assumed to satisfy that fρ ∈ L∞ρX throughout this paper. Due to the zero-mean noise
assumption in the data generation model (1), almost surely there holds fρ = f⋆. To analyze 996
X
, where fρ is defined as
fρ(x)= ydρ(y|x), x∈X,
􏰔

Maximum Correntropy Criterion in Regression
the convergence of the model, we need to introduce the following target function in H 􏰔
fH =argmin (f(x)−y)2dρ. f∈H Z
In addition, the convergence rates that we are going to present are obtained by controlling the complexity of the hypothesis space H. Therefore, we need the following definitions and assumptions to state our main results.
2.1 Definitions and Assumptions
Definition 2 (Covering Number) The covering number of the hypothesis space H, which
is denoted as N (H, η) with the radius η > 0, is defined as 􏰏l􏰐
N(H,η) := inf l ≥ 1 : there exist f1,…,fl ∈ H,such that H ⊂ 􏰕 B(fi,η) , i=1
where B(f, η) = {g ∈ H : ∥f − g∥∞ ≤ η} denotes the closed ball in C(X ) with center f ∈ H and radius η.
Definition 3 (l2-Empirical Covering Number) Let x = {x1,x2,…,xn} ⊂ Xn. The l2-empirical covering number of the hypothesis space H, which is denoted as N2 (H, η) with radius η > 0, is defined by
N2 (H, η) := sup sup inf 􏰘l ∈ N : ∃{fi}li=1 ⊂ H such that for each f ∈ H, there exists some n∈N x∈X n
1 􏰓n 􏰙
logN(H,η)≤cI,pη−p, ∀η>0.
Assumption 2 (Complexity Assumption II) There exist positive constants s and cII,s
with 0 0.
In learning theory, the covering number is frequently used to measure the capacity of the hypothesis spaces (see Anthony and Bartlett, 1999; Zhou, 2002). As explained in Zhou (2002), the Complexity Assumption I is typical in the statistical learning theory literature. For instance, it holds when H is chosen as a ball of reproducing kernel Hilbert spaces induced by Sobolev smooth kernels. The l2-empirical covering number is another data-dependent complexity measurement and usually leads to sharper convergence rates. Several examples of hypothesis spaces satisfying the Complexity Assumption II can be found in Guo and Zhou (2013).
|f(xj)−fi(xj)|2 ≤η2 .
i∈{1,2,…,l}with n
Assumption 1 (Complexity Assumption I) There exist positive constants p and cI,p
such that
j=1
997

holds
∥f −f ∥22 ≤3∥f −f ∥2 +C log(2/δ)􏰇σ−2 +σm−1/(1+p)􏰈, z ρ LρX H ρ L2ρX H,ρ
Feng, Huang, Shi, Yang, and Suykens
Assumption 3 (Moment Assumption) Assume that the tail behavior of the response variable Y satisfies 􏰒 y4dρ < ∞. We will give some discussions on the above Moment Assumption in Subsection 2.3. In our study, the Moment Assumption will be employed to analyze the convergence of the MCCR model. For some specific situations of the regression model (1), in our study we will also restrict ourselves to the noise that satisfies the following Noise Assumption. Assumption 4 (Noise Assumption) The density function of the noise variable ε for any given X = x, which is denoted as pε|X=x, is symmetric and uniformly bounded by the interval [−M0, M0] with M0 > 0.
2.2 Theoretical Results on Convergence Rates
We are now ready to state our main results on the convergence rates of the MCCR model (2). Our first result considers a general case of the regression model (1), where the Moment Assumption is assumed to hold.
Theorem 4 Assume that the Complexity Assumption I with p > 0 and the Moment As- sumption hold. Let fz be produced by (2). For any 0 < δ < 1, with confidence 1 − δ, there Z where CH,ρ is a positive constant independent of m, σ or δ and will be given explicitly in the proof. Discussions on the convergence rates established in Theorem 4 are postponed to Sub- section 2.3. Here we remark that the moment condition in the Moment Assumption which is used in Theorem 4 can be relaxed to a weaker moment condition, i.e., 􏰒 |y|ldρ < Z ∞ with l > 2, where meaningful convergence rates can be still derived. Meanwhile, when the condition in the Moment Assumption is further strengthened, refined convergence rates can be derived. For instance, when |y| ≤ M almost surely for some M > 0, we can get the following improved convergence rates:
Theorem 5 Assume that the Complexity Assumption II with 0 < s < 2 holds, and |y| ≤ M almost surely for some M > 0. Let fρ ∈ H and fz be produced by (2) with σ = m1/(2+s). For any 0 < δ < 1, with confidence 1 − δ, there holds ∥f−f∥22 ≤C′ log(2/δ)m−2 , zρLρXH,ρ 2+s where C′ is a positive constant independent of m, σ or δ and will be given explicitly in H,ρ the proof. From Theorem 4 and Theorem 5, we can see that meaningful convergence rates can be obtained when σ is properly chosen, e.g., σ = O(mα) with some α > 0. That is, σ has to grow in accordance with the sample size m to ensure non-trivial convergence rates. In view of this, it is natural to ask whether one can also get consistency properties or even convergence rates for the MCCR model (2) when σ is fixed. Under certain conditions, we give a positive answer in the following theorem.
998

Maximum Correntropy Criterion in Regression
Theorem 6 Assume that the Complexity Assumption II with 0 < s < 2 and the Noise Assumption hold. Let fρ ∈ H, fz be produced by (2) with σ being fixed and σ > σH,ρ where
σ =√2􏰇M+∥f∥+sup∥f∥􏰈. H,ρ 0ρ∞ ∞
f∈H For any 0 < δ < 1, with confidence 1 − δ, there holds ∥f−f∥22 ≤C log(1/δ)m−2 , z ρ LρX H,σ,ρ 2+s where CH,σ,ρ is a positive constant independent of m or δ and will be given explicitly in the proof. Proofs of the above theorems will be given in Subsection 4.3. 2.3 Discussions and Comparisons We now give some discussions on the obtained convergence rates, the Moment Assumption and also comparisons with related studies. 2.3.1 Convergence Rates As shown in Theorem 4, under the Moment Assumption, the convergence rates of the MCCR model depend on the choice of σ and the regularity of fρ. In the case when fρ ∈ H and σ = O(m1/(2+2p)), the convergence rate of O(m−2/(3+3p)) can be obtained. We then show in Theorem 5 and Theorem 6 that under the boundedness assumption on Y , or with the Noise Assumption, refined convergence rates of O(m−2/(2+s)) can be derived. Note that when s tends to zero which corresponds to the case where functions in H are smooth enough, convergence rates established in Theorem 5 and Theorem 6 tend to O(m−1), which are considered as the optimal rates in learning theory according to the law of large numbers (see Caponnetto and De Vito, 2007; Steinwart et al., 2009; Mendelson and Neeman, 2010; Wang and Zhou, 2011). The established convergence rates indicate the feasibility of applying the lσ loss in regression problems. 2.3.2 Moment Assumption and Related Studies on Robustness Note that convergence rates in Theorem 4 are obtained under the Moment Assumption, which restricts the tail behavior of Y . In fact, as commented in Christmann and Steinwart (2007), tail properties of Y are frequently used in linear regression as well as nonparametric regression problems. For instance, tail behaviors of Y are usually employed to study the robustness and the consistency properties of M-estimators in linear regression problems, see e.g., Hampel et al. (1986); Davies (1993); Audibert and Catoni (2011) and many others. In the statistical learning literature, some recent studies have also confined the tail properties of Y to explore the robustness of the kernel-based regression schemes, see e.g., Christmann and Steinwart (2007); Christmann and Messem (2008); Steinwart and Christmann (2008); De Brabanter et al. (2009); Debruyne et al. (2010). Note also that in the statistical learning literature there are many existing studies on the robust regression problem. For instance, Suykens et al. (2002a,b) presented a weighted 999 Feng, Huang, Shi, Yang, and Suykens least squares method to pursue a robust approximation to the regression function. Debruyne et al. (2008) addressed the model selection problem in kernel-based robust regression. Some efforts have been made in Steinwart and Christmann (2008) to understand generalization abilities of regression schemes associated with convex robust loss functions, e.g., Huber’s loss, which are also conducted by restricting the tail behavior of Y . As shown in Steinwart and Christmann (2008), under certain conditions, empirical estimators learned from the ERM schemes associated with certain convex robust loss functions can generalize. How- ever, this does not directly indicate the regression consistency property of the empirical estimators, e.g., the convergence from the empirical estimator to the regression function with respect to the L2ρX -distance. On the other hand, as far as we can see, few studies can be found in the statistical learning literature towards understanding regression schemes associated with nonconvex robust loss functions, which are frequently employed in robust statistics (see Huber, 1981; Hampel et al., 1986). 2.3.3 Comparisons with Related Studies As mentioned earlier, our study is motivated by recent work towards understanding the minimum error entropy criterion in regression problems (see Hu et al., 2013). Observing that when being applied to regression problems, both of the two models aim at modeling an empirical estimator that approximates the regression function fρ. Therefore, we can give comparisons on the convergence rates of the two models. Under the same assumptions on the tail behavior of Y and the Complexity Assumption I, when fρ ∈ H, the convergence rates established in Hu et al. (2013) are of the type O(m−2/(3+3p)), which are presented 􏰗 with respect to the variance of fz(X) − fρ(X) due to the mean insensitive property of the MEECR model. In addition, when Y is bounded, under the Complexity Assumption I, Hu et al. (2013) reported convergence rates of the type O(m−1/(1+p)). In view of the convergence rates reported in Theorem 4 and Theorem 5, we conclude that the convergence rates of the two regression models are comparable. This is a nice property of the MCCR model considering that it has a lower computational complexity. 3. Connections between MCCR and LSR As aforementioned, it is not suggested to roughly treat the lσ loss as the least squares loss in regression problems even if σ is sufficiently large. This section is dedicated to explaining this issue and trying to explore the connections between the two different regression models: MCCR and LSR. To this end, we first give some notations. For any measurable function f : X → Y, the generalization error of f under the lσ loss and the least squares loss are defined, respectively, as 􏰔􏰔 Eσ(f)= lσ(y,f(x))dρ(x,y), and E(f)= (y−f(x))2dρ(x,y). ZZ The corresponding target functions with respect to the hypothesis space H are given, re- spectively, by fHσ = argminEσ(f), and fH = argminE(f). f∈H f∈H 1000 Proof Maximum Correntropy Criterion in Regression 3.1 A Useful Lemma We first give a lemma which bounds the deviation of the excess risks of f associated with the lσ loss and the least squares loss for any f ∈ H. It will play an important role in our following analysis. In this context, the excess risk of f with respect to the lσ loss refers to the term Eσ(f) − Eσ(fρ) while the excess risk of f with respect to the least squares loss refers to the term E(f) − E(fρ). Lemma 7 Assume that the Moment Assumption holds. For any f ∈ H, the deviation of the two excess risk terms can be bounded as follows 􏰄􏰄􏰄{Eσ(f)−Eσ(fρ)}−{E(f)−E(fρ)}􏰄􏰄􏰄≤ cH,ρ, σ2 where cH,ρ is a positive constant given by 􏰔 cH,ρ =8 y4dρ+4sup∥f∥4∞+4∥fρ∥4∞. Z f∈H Following the inequality |1 − t − e−t| ≤ t2 for t > 0, one has 2
(3)
􏰄􏰄 (y − f(x))2 􏰄1−
􏰋
−exp −
(y − f(x))2 􏰌􏰄􏰄 (y − f(x))4 􏰄≤
σ2 􏰄 2σ4
.
􏰄 σ2 Simple computations show that
􏰄􏰄σ 􏰔 2􏰄􏰄1􏰔 4
􏰄E (f)− (y−f(x)) dρ􏰄≤ (y−f(x)) dρ. 􏰄Z 􏰄2σ2Z
(4)
(5)
Since fρ ∈ L∞ρX , the same estimation process can be applied to fρ, which gives
􏰄􏰄σ 􏰔 2􏰄􏰄1􏰔 4
􏰄E (fρ)− (y−fρ(x)) dρ􏰄≤ (y−fρ(x)) dρ. 􏰄Z 􏰄2σ2Z
Combining estimates in (4) and (5), we come to the following inequality 􏰄σσ 􏰄1􏰉􏰔444􏰊
􏰄􏰄{E (f)−E (fρ)}−{E(f)−E(fρ)}􏰄􏰄≤ σ2 8
where the basic inequality (a + b)4 ≤ 8a4 + 8b4 for a, b ∈ R has been applied. By denoting
􏰔
y dρ+4∥f∥∞ +4∥fρ∥∞
,
Z
cH,ρ =8 y4dρ+4sup∥f∥4∞+4∥fρ∥4∞, Z f∈H
we complete the proof of Lemma 7.
1001

E (f)=σ where
Feng, Huang, Shi, Yang, and Suykens
3.2 An Equivalence Relation between MCCR and LSR
In this part, we proceed with exploring the connections between the two models: MCCR and LSR. We will show that, when σ is large enough, under certain conditions, there does exist an equivalence relation between the two regression models. By equivalence, we mean that the two regression models admit the same target function when working in the same hypothesis space, i.e., fHσ = fH in our study.
Theorem 8 Suppose that the Noise Assumption holds. Under the condition that fρ ∈ H and σ > σH,ρ with
almost surely we have
σ =√2􏰇M+∥f∥+sup∥f∥􏰈, H,ρ 0ρ∞ ∞
f∈H
f Hσ = f H .
Proof Since fρ ∈ H, it is immediate to see that almost surely we have fH = fρ. To finish the proof, it remains to show that there holds fHσ = fρ. In fact, for any f ∈ H, we know that
σ
2􏰔 􏰉 􏰋 (y−f(x))2􏰌􏰊 2􏰔
1−exp − σ2 dρ(x,y)=σ Fx(f(x)−fρ(x))dρX(x),
ZX
􏰔M0 􏰋 (t−u)2􏰌
exp − σ2 pε|X=x(t)dt, x ∈ X. ′ 􏰔 M0 􏰋 (t−u)2􏰌􏰉t−u􏰊
Fx(u) := 1 −
By taking the derivative of F with respect to u, we get
−M0
Fx(u) = −2 exp − σ2 σ2 pε|X=x(t)dt, x ∈ X. −M0
According to the symmetry property of pε|X=x, we know that Fx′ (0) = 0. Moreover, ′′ 􏰔 M0 􏰋 (t−u)2􏰌􏰉σ2 −2(t−u)2􏰊
Fx(u)=2 exp − σ2 σ4 pε|X=x(t)dt, x∈X. −M0
Obviously, F′′(u) > 0 for all x ∈ X when σ > σ
x H,ρ
. Consequently, u = 0 is the unique minimizer of Fx(·) for any x ∈ X. The proof of Theorem 8 can be completed by recalling
the definitions of fHσ and fρ.
Theorem 8 provides a situation where the equivalence relation between the two regression models holds. In the sense of Theorem 8, one can take the lσ loss as the least squares loss when σ is large enough. However, Theorem 8 also indicates that the equivalence relation holds when the Noise Assumption is valid, fρ ∈ H and σ is sufficiently large. Note that the condition fρ ∈ H imposes a regularity requirement on the regression function fρ while the
1002

Maximum Correntropy Criterion in Regression
Noise Assumption asks for the boundedness and symmetry of the noise. In view of these, we conclude that one is not suggested to simply treat the lσ loss as the least squares loss even if σ is sufficiently large.
We remark that Theorem 8 merely provides a sufficient condition to ensure the existence of the equivalence relation between the two models. It would be meaningful to explore some other relaxed conditions to get a similar equivalence relation. However, we also remark that the non-convexity of the lσ loss makes it non-trivial since in this case there exists more than one local optimum of the MCCR model.
3.3 Comparisons on the Convergence Rates of MCCR and LSR
To further elucidate connections between the two regression models, in this part we move
our attention to comparing the learning performance of their empirical estimators, i.e., the
ERM scheme
convergence rates of ∥f − f ∥2 and ∥f ls − f ∥2 where f ls is modeled by the following
z ρL2ρX z ρL2ρX z
i=1
1 􏰓m
z f∈Hm i i
fls =argmin
(f(x)−y)2. (6)
Noticing that due to the assumption that H is a compact subset of C(X), (6) is in fact a constrained optimization model. When H is taken as a bounded subset of a certain repro- ducing kernel Hilbert space HK, there exists an equivalence relation between the constrained optimization model (6) and the following unconstrained model
i=1
where λ > 0 is a regularization parameter. Therefore, our comparison will be conducted between the MCCR model (2) and the regularized least squares regression model (7), which has been well understood in the statistical learning literature.
When Y is bounded, fρ ∈ H and the Complexity Assumption II with 0 < s < 2 1 􏰓m z,λ f∈HKm i i K fls =arg min (f(x)−y)2+λ∥f∥2, (7) holds, the convergence rate of ∥fz − fρ∥2L2ρ of O(m−2/(2+s)), which is the same as that of the regularized LSR (7) under the same conditions as revealed in Wu et al. (2006). In fact, when H is taken as a bounded subset of HK and the Mercer kernel K is sufficiently smooth, the constant s in the Complexity Assumption II can be arbitrarily small. As mentioned earlier, in this case, learning rates of the type O(m−1) can be derived which are regarded as the optimal learning rates in learning theory according to the law of large numbers. On the other hand, due to the non-robustness of the least squares loss, almost all the existing convergence rates established for (7) are reported under the restriction that the response variable has a sub-Gaussian tail (see Wu et al., 2006; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Mendelson and Neeman, 2010; Wang and Zhou, 2011). However, we see from Theorem 4 that for the MCCR model, convergence rates can be obtained under the Moment Assumption. This shows that the MCCR model can deal with non-Gaussian noise, which consequently distinguishes the two models in terms of conditions needed to establish meaningful convergence rates. X established in Theorem 5 belongs to the type 1003 Feng, Huang, Shi, Yang, and Suykens Before ending this section, let us briefly summarize the connections between MCCR and LSR as follows: • For any given f ∈ H, the difference between the excess risk of f with respect to the two regression models can be upper bounded by O(σ−2); • Under certain conditions, we do see the existence of an equivalence relation between the two models, as commonly expected when σ is large enough. However, this equiv- alence relation might hold only under very specific conditions as suggested by our analysis; • The MCCR model can deal with the heavy-tailed noise while the LSR model can only deal with sub-Gaussian noise. Moreover, when being restricted to cases with the bounded output or with the Gaussian noise, the performance of the two regression models are comparable. Therefore, in the above sense, we suggest that one can count on the MCCR model (2) to solve regression problems. 4. Deriving the Convergence Rates This section presents detailed convergence analysis of the MCCR model (2) and proofs of theorems given in Section 2. The main difficulty in analyzing the model lies in the non-convexity of the loss function lσ, which disables usual techniques for analyzing convex learning models (see Cucker and Zhou, 2007; Steinwart and Christmann, 2008). We over- come this difficulty by introducing a novel error decomposition strategy with the help of Lemma 7. Analysis presented in this section is inspired by Cucker and Zhou (2007); Hu et al. (2013) and Fan et al.. 4.1 Decomposing the Error into Bias-Variance Terms The L2ρX -distance between the empirical target function fz and the regression function fρ can be decomposed into the bias and the variance terms (see Vapnik, 1998; Cucker and Zhou, 2007; Steinwart and Christmann, 2008). Roughly speaking, the bias refers to the data-free error terms while the variance refers to the data-dependent error terms. The spirit of the learning theory approach to analyzing the convergence of learning models is trying to find a compromise between bias and variance by controlling the complexity of the hypothesis space. The following proposition offers a method for such compromise with respect to the MCCR model (2). Proposition 9 Assume that the Moment Assumption holds and let fz be produced by (2). The L2ρX -distance between fz and fρ can be decomposed as follows: ∥f −f∥2 ≤A +A +S(z)+S(z), z ρ L2ρX H,σ,ρ H,ρ 1 2 1004 where Maximum Correntropy Criterion in Regression AH,σ,ρ = 2cH,ρ/σ2, AH,ρ = E(fH) − E(fρ), S1(z) = {Ezσ(fHσ ) − Ezσ(fρ)} − {Eσ(fHσ ) − Eσ(fρ)} , S2(z) = {Eσ(fz) − Eσ(fρ)} − {Ezσ(fz) − Ezσ(fρ)} . Proof Following from Lemma 7, with simple computations, we see that ∥f −f ∥2 ≤Eσ(f )−Eσ(f )+c /σ2 zρL2ρ zρH,ρ X ≤ {Eσ(fz) − Ezσ(fz)} + {Ezσ(fz) − Ezσ(fHσ )} + {Ezσ(fHσ ) − Eσ(fHσ )} . + {Eσ(fHσ ) − Eσ(fH)} + {Eσ(fH) − Eσ(fρ)} + cH,ρ/σ2 ≤ {Eσ(fz) − Ezσ(fz)} + {Ezσ(fz) − Ezσ(fHσ )} + {Ezσ(fHσ ) − Eσ(fHσ )} . + {Eσ(fHσ ) − Eσ(fH)} + {E(fH) − E(fρ)} + 2cH,ρ/σ2. The definitions of fz, fHσ and fH tell us that the second and the fourth terms of right-hand side of the last inequality are at most zero. By introducing intermediate terms Ezσ(fρ), Eσ(fρ) and corresponding notations, we finish the proof of Proposition 9. As shown in Proposition 9, the L2ρX -distance between fz and fρ are decomposed into four error terms: AH,σ,ρ, AH,ρ, S1(z), and S2(z). It is easy to see that the first two error terms are data-independent and correspond to the bias while the last two terms are data-dependent, which consequently are referred as the sample error (variance). The quantity AH,ρ can be translated as the approximation ability of fH to fρ, the estimation of which belongs to the topics of the approximation theory and has been well conducted. For instance, when H is chosen as a bounded subset of a certain reproducing kernel Hilbert space (RKHS), a comprehensive study on this term can be found in Smale and Zhou (2003). On the other hand, we remind that the bias term AH,σ,ρ is introduced into the above error decomposition method, which not only depends on the hypothesis space H and the underlying probability distribution ρ, but also relies on the scale parameter σ. As explained later, this is caused by the introduction of the robustness into the regression model. This makes the decomposition strategy for the MCCR model different from those for convex regression models (see Cucker and Zhou, 2007; Steinwart and Christmann, 2008). As a consequence of Proposition 9, to bound ∥f − f ∥2 , it suffices to estimate the z ρL2ρX two sample error terms: S1(z) and S2(z), which will be tackled in the next subsection. 4.2 Concentration Estimates of Sample Error Terms This part presents concentration estimates for the sample error terms S1(z) and S2(z) when the Moment Assumption is assumed. In learning theory, this is typically done by applying concentration inequalities to certain random variables that may be function-space valued. In our study, for this purpose we introduce the following two random variables, ξ1(z) and ξ2(z) with z ∈ Z, which are defined by ξ1(z) := −σ2 exp 􏰂−(y − fHσ (x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 , 1005 and Feng, Huang, Shi, Yang, and Suykens ξ2(z) := −σ2 exp 􏰂−(y − fz(x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 . By applying the one-sided Bernstein’s inequality in Lemma 12 to the random variable ξ1, we can get the concentrated estimate for the sample error term S1(z). However, the estimation of the sample error term S2(z) requires us to apply concentration inequalities to the function-space valued random variable ξ2 and consequently depends on the capacity of the hypothesis space H. This is due to the fact that the random variable ξ2 is dependent with fz which varies in accordance with the sample z. Concentrated estimates for S1(z) and S2(z) are presented in the following two proposi- tions, the proofs of which are given in Subsection 4.3. Proposition 10 Assume that the Moment Assumption holds. For any 0 < δ < 1, with confidence 1 − δ/2, there holds 1􏰅 􏰅2 􏰉2􏰊􏰉σ1􏰊 S1(z)≤ 􏰅fH−fρ􏰅L2 +CH,ρ,1 log + 2 , 2ρX δmσ where CH,ρ,1 is a positive constant independent of m, σ or δ and will be given explicitly in the proof. Proposition 11 Assume that the Complexity Assumption I with p > 0 and the Moment
Assumption hold. For any 0 < δ < 1, with confidence 1 − δ/2, there holds 1 12 􏰉2􏰊􏰋1σ􏰌 S2(z)≤ (S1(z)+S2(z))+ ∥fH −fρ∥L2 +CH,ρ,2 log 2 + 1 , where CH,ρ,2 is a positive constant independent of m, σ or δ and will be given explicitly in the proof. 4.3 Proofs 4.3.1 Lemmas We first list several lemmas that will be used in the proofs. Lemma 12 and Lemma 13 are one-sided Bernstein’s concentration inequalities, which were introduced in Bernstein (1946) and can be found in many statistical learning textbooks, see e.g., Cucker and Zhou (2007); Steinwart and Christmann (2008). Lemma 14 was proved in Wu et al. (2007). Lemma 12 Let ξ be a random variable on a probability space Z with variance σ⋆2 satisfying |ξ − Eξ| ≤ Mξ almost surely for some constant Mξ and for all z ∈ Z. Then 22ρX δσm1+p 􏰏m􏰐􏰏􏰐 1􏰓 mε2 Probz∈Zm m ξ(zi) − Eξ ≥ ε ≤ exp −2(σ2 + 1M ε) . i=1 ⋆3ξ 1006 Maximum Correntropy Criterion in Regression Lemma 13 Let ξ be a random variable on a probability space Z with variance σ⋆2 satisfying |ξ−Eξ|≤Mξ almostsurelyforsomeconstantMξ andforallz∈Z. Thenforany0<δ<1, with confidence 1 − δ, we have 1 m 2Mξlog1 􏰜2σ2log1 􏰓ξ(zi)−Eξ≤ δ+ ⋆ δ. m i=1 3m m Lemma 14 Let F be a class of measurable functions on Z. Assume that there are constants B,c>0andθ∈[0,1]suchthat∥f∥∞≤BandEf2≤c(Ef)θ foreveryf∈F.Ifforsome a>0 and s∈(0,2),
logN2 (F,η) ≤ aη−s, ∀η > 0,
then there exists a constant αp depending only on p such that for any t > 0, with probability
at least 1 − e−t, there holds
1 m 1 􏰉ct􏰊 1 18Bt
􏰓1−θθ 2−θ
Ef − m
4.3.2 Proof of Proposition 10
∀ f ∈ F ,
.
where
i=1
m ,
f (zi ) ≤ 2 γ (Ef ) + αp γ + 2 􏰋2−s 􏰇a􏰈2
m + 2−s􏰇a􏰈2􏰌
γ := max c 4−2θ+sθ 4−2θ+sθ mm
, B 2+s 2+s
Proof To bound the sample error term S1(z), we apply the one-sided Bernstein’s inequality in Lemma 13 to the random variable ξ1 introduced in Subsection 4.2. To this end, we need to verify conditions in Lemma 13.
We first verify the boundedness condition. Recall that the random variable ξ1 is defined as
ξ1(z) := −σ2 exp 􏰂−(y − fHσ (x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 , z ∈ Z.
Introducing the auxiliary function h(t) = exp{−t2} with t ∈ R, it is easy to see that ∥h′∥∞ = 􏰚2/e. By taking t1 = (y − fHσ (x))/σ, t2 = (y − fρ(x))/σ and applying the mean value theorem to h, we see that
|ξ1(z)| ≤ 􏰚2/eσ|fHσ (x) − fρ(x)| ≤ 􏰚2/eσ∥fHσ − fρ∥∞, z ∈ Z.
|ξ1 − Eξ1| ≤ 2∥ξ1∥∞ ≤ 2􏰚2/eσ∥fHσ − fρ∥∞ ≤ 2􏰚2/eσ sup ∥f − fρ∥∞.
f∈H
We are now in a position to bound the variance of the random variable ξ1, which is
denoted as var(ξ1). Applying the mean value theorem to the auxiliary function h1(t) = 1007
Consequently,

Feng, Huang, Shi, Yang, and Suykens
exp(−t) at t1 = (y − fHσ (x))2/σ2, t2 = (y − fρ(x))2/σ2 and recalling that ∥h′1∥∞ ≤ 1, we get
var(ξ1) = Eξ12 − (Eξ1)2 ≤ Eξ12
≤ E 􏰀(fHσ (x) − fρ(x))2(2y − fHσ (x) − fρ(x))2􏰁
􏰔􏰇222􏰈􏰔σ2
≤ 12y + 3 sup ∥f∥∞ + 3∥fρ∥∞ dρ(y|x) (fH(x) − fρ(x)) dρX (x)
Yf∈H X
􏰔
(fHσ (x) − fρ(x))2dρX (x),
where the second inequality is from the elementary inequality (a + b + c)2 ≤ 3(a2 + b2 + c2)
= cH,ρ,0 fora,b,c∈RandthepositiveconstantcH,ρ,0 isdenotedas
X
cH,ρ,0 =12 y2dρ+3sup∥f∥2∞+3∥fρ∥2∞. (8) Z f∈H
Now applying Lemma 13 to the random variable ξ1, we see that for any 0 < δ < 1, with confidence 1 − δ/2, there holds 􏰔 􏰚 􏰜2c 4 2/esupf∈H∥f−fρ∥∞σlog(2/δ) H,ρ,0 log(2/δ)∥fσ −f ∥2 H ρ L2ρX S1(z)≤ 3 m + The elementary inequality √ab ≤ (a + b)/2 for a, b ≥ 0 gives1 􏰜2cH,ρ,0 log(2/δ)∥fHσ − fρ∥2L2ρX ≤ 1∥fHσ − fρ∥2L2 m 2 ρX m In addition, as a consequence of Lemma 7, we have . (9) (10) (11) m + cH,ρ,0 log(2/δ). ∥fHσ −fρ∥2L2ρ X ≤Eσ(fHσ)−Eσ(fρ)+cH,ρ/σ2 = Eσ(fHσ ) − Eσ(fH) + Eσ(fH) − Eσ(fρ) + cH,ρ/σ2 ≤ ∥fH − fρ∥2L2ρ + 2cH,ρ/σ2, X where the last inequality is due to the fact that fHσ is the minimizer of the risk functional Eσ(·) in H. Combining estimates in (9), (10), and (11), we come to the conclusion that for any 0 < δ < 1, with confidence 1 − δ/2, there holds 1􏰅 􏰅2 􏰉2􏰊􏰉σ1􏰊 S1(z)≤ 􏰅fH−fρ􏰅L2 +CH,ρ,1 log + 2 , 2ρX δmσ where CH,ρ,1 is a positive constant independent of m, σ or δ and given by CH,ρ,1 = (4/3)􏰚2/e sup ∥f − fρ∥∞ + 2cH,ρ + cH,ρ,0. f∈H Thus we have completed the proof of Proposition 10. 1. Refined estimate can be derived here by applying Young’s inequality ab ≤ ta2 + b2 for a, b ∈ R, t > 0. 2 2t
In our proof, we choose t = 1 for simplification. 1008

Maximum Correntropy Criterion in Regression
4.3.3 Proof of Proposition 11
To prove Proposition 11, we first need to prove the following intermediate conclusion, which is in fact a concentrated estimate for function-space valued random variables.
Proposition 15 Assume that the Moment Assumption holds. Let ε satisfy ε ≥ cH,ρ/σ2. For any 0 < δ < 1, with confidence 1 − δ/2, there holds 􏰏 ≤N H,􏰚2/eσ exp −4􏰚2/esup ∥f−f ∥ σ+6c , f∈H ρ ∞ H,ρ,0 where cH,ρ is given in (3) and cH,ρ,0 is given in (8), both of which are positive constants independent of m, σ or δ. Proof To derive the desired estimate, we will apply the one-sided Bernstein’s inequality in Lemma 13 to the function set H by taking its capacity into account. For any f ∈ H, we redefine the random variable ξ2(z) as follows ξ2(z) = −σ2 exp 􏰂−(y − f(x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 , z ∈ Z. Following from the proof of Proposition 10, we know that ∥ξ2∥∞ ≤ 􏰚2/eσ∥f − fρ∥∞ and |ξ2 − Eξ2| ≤ 2􏰚2/eσ sup ∥f − fρ∥∞. f∈H Meanwhile, we also know from the proof of Proposition 10 that Eξ2≤cH,ρ,0∥f−fρ∥2L2ρ , X where the constant cH,ρ,0 is given in (8). Consider a function set {fj }Jj=1 ⊂ H with J = N (H, ε/(􏰚2/eσ)). The compactness of H ensures the existence and finiteness of J. Now we let μ = 􏰛Eσ(fj) − Eσ(fρ) + 2ε, and choose ε such that ε ≥ cH,ρ/σ2. Applying the one-sided Bernstein’s inequality in Lemma 12 to the following group of random variables ξ2,j(z)=−σ2exp􏰂−(y−fj(x))2/σ2􏰃+σ2exp􏰂−(y−fρ(x))2/σ2􏰃, j=1,...,J, 1009 (Eσ(f) − Eσ(fρ)) − (Ezσ(f) − Ezσ(fρ)) √ 􏰐 􏰚 σ σ >4 ε
E (f)−E (fρ)+2ε 􏰍ε􏰎􏰏 3mε 􏰐
Probz∈Zm sup f∈H

we come to the following conclusion
Feng, Huang, Shi, Yang, and Suykens
Probz∈Zm
􏰏(Eσ(fj) − Eσ(fρ)) − (Ezσ(fj) − Ezσ(fρ)) √ 􏰐 􏰚Eσ(fj) − Eσ(fρ) + 2ε > ε
  3mεμ2 
≤exp −4􏰚2/e∥f −f ∥ √εμσ+6c ∥f −f ∥2
 jρ∞ H,ρ,0jρL2ρX
􏰏 3mεμ2 􏰐 ≤exp −4􏰚2/e∥f −f ∥ √εμσ+6c μ2
j ρ∞ H,ρ,0
􏰏 3mε 􏰐
≤exp −4􏰚2/esup ∥f−f∥ σ+6c , f∈H ρ ∞ H,ρ,0
where the last two inequalities follow from the inequality in Lemma 7, the equation that E(fj)−E(fρ)=∥fj−fρ∥2L2ρ ,thefactthatε≥cH,ρ/σ2and
X
μ2 =Eσ(fj)−Eσ(fρ)+2ε≥Eσ(fj)−Eσ(fρ)+cH,ρ/σ2 +ε≥E(fj)−E(fρ)+ε≥ε.
From the choice of fj, we know that for each f ∈ H, there exists some j such that ∥f − fj∥∞ ≤ ε/(􏰚2/eσ). Therefore |Eσ(f) − Eσ(fj)| and |Ezσ(f) − Ezσ(fj)| can be both upper bounded by ε, which yields
|(Ezσ(f) − Ezσ(fρ)) − (Ezσ(fj) − Ezσ(fρ))| 􏰚Eσ(f) − Eσ(fρ) + 2ε ≤
|(Eσ(f) − Eσ(fρ)) − (Eσ(fj) − Eσ(fρ))| 􏰚Eσ(f) − Eσ(fρ) + 2ε ≤


ε (12)
and
ε. (13) The latter inequality together with the fact that ε ≤ Eσ(f) − Eσ(fρ) + 2ε implies
Eσ(fj) − Eσ(fρ) + 2ε = (Eσ(fj) − Eσ(fρ)) − (Eσ(f) − Eσ(fρ)) + Eσ(f) − Eσ(fρ) + 2ε
≤ √ε􏰛(Eσ(f) − Eσ(fρ)) + 2ε + Eσ(f) − Eσ(fρ) + 2ε (14)
≤ 2(Eσ(f) − Eσ(fρ) + 2ε). For any f ∈ H, if the following inequality holds
(Eσ(f) − Eσ(fρ)) − (Ezσ(f) − Ezσ(fρ)) √ 􏰚Eσ(f) − Eσ(fρ) + 2ε > 4 ε,
then combining estimates in (12) and (13) we know that there holds
(Eσ(fj) − Eσ(fρ)) − (Ezσ(fj) − Ezσ(fρ)) √ 􏰚Eσ(f) − Eσ(fρ) + 2ε > 2
ε.
1010

Maximum Correntropy Criterion in Regression
This together with inequality (14) tells us that the following inequality holds
(Eσ(fj) − Eσ(fρ)) − (Ezσ(fj) − Ezσ(fρ)) √ 􏰚Eσ(fj) − Eσ(fρ) + 2ε >
ε.
Consequently, based on the above estimates, we come to the following conclusion
Probz∈Zm J
sup
􏰏
(Eσ(f) − Eσ(fρ)) − (Ezσ(f) − Ezσ(fρ)) √ 􏰐 􏰚 σ σ >4 ε
E (f)−E (fρ)+2ε 􏰏􏰐
(Eσ(fj) − Eσ(fρ)) − (Ezσ(fj) − Ezσ(fρ)) √ 􏰚Eσ(fj) − Eσ(fρ) + 2ε >

􏰓
f∈H Probz∈Zm
ε 􏰍ε􏰎􏰏 3mε 􏰐
j=1
≤N H,􏰚2/eσ exp −4􏰚2/esup ∥f−f ∥ σ+6c . f∈H ρ ∞ H,ρ,0
This completes the proof of Proposition 15.
Proof [Proof of Proposition 11] From the Complexity Assumption I, we know that N 􏰇H, ε􏰆(􏰚2/eσ)􏰈 ≤ exp 􏰘cI,p(􏰚2/e)pσp/εp􏰙 .
This in connection with Proposition 15 yields
􏰏
≤ exp εp
where Ap and BH,ρ are positive constants given by
√ 􏰐 >4 ε
Probz∈Zm 􏰋Apσp
sup
f∈H
(Eσ(f) − Eσ(fρ)) − (Ezσ(f) − Ezσ(fρ)) 􏰚 σ σ
mε + 2c
E (f)−E (fρ)+2ε 􏰌
− σB
H,ρ H,ρ,0
,
By setting
we obtain
􏰋Apσp mε􏰌δ
Ap = cI,p(􏰚2/e)p and BH,ρ = 4􏰚2/e sup ∥f − fρ∥∞/3. f∈H
exp
εp − σB H,ρ
+ 2c ≤ 2, H,ρ,0
Ap(σBH,ρ+2cH,ρ,0)σp ε− m ε− m ≥0.
log(2/δ)(σBH,ρ+2cH,ρ,0) p
Lemma 7.2 in Cucker and Zhou (2007) tells us that the above inequality holds if
p+1
􏰏cH,ρ 2log(2/δ)(σBH,ρ+2cH,ρ,0) 􏰉2Ap(σBH,ρ+2cH,ρ,0)σp􏰊1/(1+p)􏰐 ε≥maxσ2, m , m .
1011

Feng, Huang, Shi, Yang, and Suykens
In view of the above condition, we choose a sufficient large εH,ρ as follows εH,ρ = cH,ρ,1 log(2/δ)(σ−2 + σm−1/(1+p)),
where cH,ρ,1 is a positive constant independent of m, σ or δ and given by cH,ρ,1 = 2cH,ρ + 2(Ap + 1)(BH,ρ + 2cH,ρ,0).
With the above choice of εH,ρ and following the above discussions, we see that for any 0 < δ < 1, with confidence 1 − δ/2, there holds , sup􏰘((Eσ(f)−Eσ(f ))−(Eσ(f)−Eσ(f )))􏰑􏰛Eσ(f)−Eσ(f )+ε 􏰙≤4√ε ρ z z ρ ρ H,ρ H,ρ f∈H which yields (Eσ(f )−Eσ(f ))−(Eσ(f )−Eσ(f ))≤4√ε 􏰛Eσ(f )−Eσ(f )+2ε . z ρ zz zρ H,ρ z ρ H,ρ √ with confidence 1 − δ, there holds2 S2(z) = (Eσ(fz) − Eσ(fρ)) − (Ezσ(fz) − Ezσ(fρ)) ≤ 1(Eσ(fz) − Eσ(fρ)) + 9εH,ρ. (15) 2 Applying the basic inequality ab ≤ (a+b)/2 for a,b ≥ 0, we know that for any 0 < δ < 1, Proposition 9 tells us that Eσ(fz) − Eσ(fρ) = Eσ(fz) − Eσ(fHσ ) + Eσ(fHσ ) − Eσ(fρ) + cH,ρ/σ2, where the above inequality is due to Lemma 7 and the observation that Eσ(fHσ ) − Eσ(fρ) = Eσ(fHσ ) − Eσ(fH) + Eσ(fH) − Eσ(fρ) ≤ Eσ(fH) − Eσ(fρ) (16) ≤ S1(z) + S2(z) + ∥fH − fρ∥2L2ρ X + cH,ρ/σ2. Combining estimates in (15) and (16), we come to the conclusion that for any 0 < δ < 1, ≤ ∥fH − fρ∥2L2ρ with confidence 1 − δ/2, there holds X 1 12 􏰉2􏰊􏰋1σ􏰌 S2(z) ≤ (S1(z) + S2(z)) + ∥fH − fρ∥L2 + CH,ρ,2 log 2 + 1/(1+p) , where CH,ρ,2 is a positive constant independent of m, σ or δ and given by CH,ρ,2 = 22ρX δσm 2cH,ρ + 9cH,ρ,1. This completes the proof of Proposition 11. 2. Similarly, refined estimate can be also derived here by using Young’s inequality ab ≤ ta2 + b2 for a, b ∈ R, 2 2t t > 0. In our proof, again we choose t = 1 for simplification. 1012

there holds
Maximum Correntropy Criterion in Regression
4.3.4 Proof of Theorem 4
Proof From Lemma 7 and Proposition 9, we know that
∥f −f ∥2 ≤S (z)+S (z)+∥f −f ∥2 +2c /σ2. (17)
z ρL2ρ 1 2 H ρL2ρ H,ρ XX
Combining estimates in Proposition 10 and Proposition 11 for the sample error terms S1(z) and S2(z), we know that for any 0 < δ < 1, with confidence 1 − δ, there holds S (z)+S (z)≤2∥f −f ∥2 +(2C +4C )log(2/δ){σ−2 +σm−1/(1+p)}. 1 2 H ρL2ρX H,ρ,1 H,ρ,2 This in connection with the estimate in (17) tells us that for any 0 < δ < 1, with confidence 1 − δ, there holds ∥f −f ∥2 ≤3∥f −f ∥2 +C log(2/δ){σ−2 +σm−1/(1+p)}, z ρ L2ρX H ρ L2ρX H,ρ where CH,ρ = 2CH,ρ,1 + 4CH,ρ,2 + 4cH,ρ. This completes the proof of Theorem 4. 4.3.5 Proof of Theorem 5 The proof of Theorem 5 can be similarly conducted as that of Theorem 4, since the error decomposition in Proposition 9 holds when Y is bounded. Therefore, we also need to bound the two sample error terms S1(z) and S1(z), respectively. Proposition 16 Assume that |y| ≤ M almost surely for some M > 0, and fρ ∈ H. For any 0 < δ < 1, with confidence 1 − δ/2, there holds S (z) ≤ C′ log(2/δ)(σ−2 + m−1), 1 H,ρ,1 where C′ is a positive constant that independent of m, σ or δ and will be given explicitly H,ρ,1 in the proof. Proof We will finish the proof by following similar process as done for Proposition 10. We first introduce the random variable ξ ̄ (z) as follows 1 ξ ̄(z)=−σ2exp􏰂−(y−fσ(x))2/σ2􏰃+σ2exp􏰂−(y−f (x))2/σ2􏰃, z∈Z. 1Hρ It follows from the proof of Proposition 10 and the boundedness of Y that for any z ∈ Z, |ξ ̄(z)|≤􏰄􏰄(2y−fσ(x)−f (x))(fσ(x)−f (x))􏰄􏰄 1 HρHρ ≤􏰇2M+∥f ∥ +sup∥f∥ 􏰈sup∥f−f ∥ . ρ∞∞ρ∞ Consequently, the following estimate holds f∈H f∈H 􏰇􏰈′ 􏰄􏰄ξ ̄ −Eξ ̄􏰄􏰄≤2 2M+∥f ∥ +sup∥f∥ sup∥f−f ∥ :=c . 11 ρ∞ ∞ ρ∞H,ρ,0 f∈H f∈H 1013 var(ξ1) = Eξ12 − (Eξ1)2 ̄ Feng, Huang, Shi, Yang, and Suykens Denote the variance of the random variable ξ ̄ as var(ξ ̄ ). From the proof of Proposition 11 10 and the boundedness of Y , we have ̄ ̄ ̄ ≤ Eξ12 ≤ E 􏰀(fHσ (x) − fρ(x))2(2y − fHσ (x) − fρ(x))2􏰁 􏰇222􏰈􏰔σ2 ≤ 12M + 3 sup ∥f∥∞ + 3∥fρ∥∞ (fH(x) − fρ(x)) dρX (x). f∈H X Recalling the fact that fρ ∈ H, as a consequence of Lemma 7, we obtain 􏰔 σ 2 􏰔 2 2cH,ρ 2cH,ρ (fH(x) − fρ(x)) dρX (x) ≤ (fH(x) − fρ(x)) dρX (x) + σ2 = σ2 . XX Combining the above two estimates, we obtain the following upper bound for the variance of ξ ̄ : var(ξ ̄)≤c′ /σ2 with c′ =2c 􏰇12M2+3sup∥f∥2 +3∥f∥2 􏰈. f∈H and with simple computations, we come to the conclusion that for any 0 < δ < 1, with confidence 1 − δ/2, there holds S (z) ≤ C′ log(2/δ)(σ−2 + m−1), 1 H,ρ,1 1 1 H,ρ,1 H,ρ,1 H,ρ Applying the one-sided Bernstein’s inequality in Lemma 13 to the random variable ξ ̄ where C′ is a positive constant independent of m, σ or δ and given by C′ = H,ρ,1 2 + c′H,ρ,1/2 + 2c′H,ρ,0/3. This completes the proof. H,ρ,1 We now turn to bound the sample error term S2(z) when Y is bounded. Proposition 17 Assume that the Complexity Assumption II with 0 < s < 2 holds, |y| ≤ M almost surely for some M > 0. Let fρ ∈ H and σ ≥ 1. For any f ∈ H and 0 < δ < 1, with confidence 1 − δ/2, there holds {Eσ(f)−Eσ(f )}−{Eσ(f)−Eσ(f )}≤ 1{Eσ(f)−Eσ(f )}+C′ log(2/δ)m− 2 , ρzzρ2 ρH,ρ,2 2+s where C′ is a positive constant independent of m, σ or δ and will be given explicitly in H,ρ,2 the proof. Proof To prove the proposition, we apply Lemma 14 to the function set FH, which is defined as FH =􏰘g􏰄􏰄􏰄g(z)=lσ(y,f(x))−lσ(y,fρ(x))+cH,ρ,f∈H,z∈Z􏰙. σ2 According to the definition of FH, for any g ∈ FH, it can be explicitly expressed as g(z) = −σ2 exp 􏰂−(y − f(x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 + cH,ρ , σ2 ∞ ρ∞ 1 1014 2 􏰔 􏰉 2 􏰋 Eg = −σ exp − (y − f(x))2 􏰌 2 􏰋 (y − fρ(x))2 􏰌􏰊2 σ2 Maximum Correntropy Criterion in Regression with z ∈ Z and f ∈ H. Recalling that |y| ≤ M almost surely and σ ≥ 1, simple computa- tions show that ∥g∥ ≤􏰇2M+∥f∥ +sup∥f∥ 􏰈sup∥f−f∥ +c . ∞ ρ∞ ∞ ρ∞H,ρ f∈H f∈H Applying the mean value theorem again as done in the proof of Proposition 10, we get 􏰀−σ2 exp 􏰂−(y − f(x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃􏰁2 ≤ 􏰀(y − f(x))2 − (y − fρ(x))2􏰁2 􏰇 􏰈2 2 ≤ 2M + sup ∥f∥∞ + ∥fρ∥∞ (f(x) − fρ(x)) , f∈H where the last inequality is again due to the boundedness of Y . This in connection with Lemma 7 tells us that Z 2cH,ρ + σ2 􏰔 􏰉 2 −σ exp − (y − f(x))2 􏰌 σ2 􏰋 c2H,ρ dρ+ σ4 Z ≤ 2M + sup∥f∥∞ +∥fρ∥∞ +σ exp − 2cH,ρ 􏰇 σ σ2 􏰋 +σ exp − dρ (y − fρ(x))2 􏰌􏰊 2 σ2 (E(f)−E(fρ))+ σ2 E (f)−E (fρ)+ σ2 cH,ρ􏰈 ≤􏰇2M+sup∥f∥ +∥f ∥ 􏰈2􏰇Eσ(f)−Eσ(f )+cH,ρ􏰈+2cH,ρ 􏰇Eσ(f)−Eσ(f )+cH,ρ􏰈 􏰇 􏰍􏰇 f∈H 􏰍􏰇 2M + sup ∥f∥∞ + ∥fρ∥∞ f∈H σ 􏰈2 f∈H f∈H 2M + sup∥f∥∞ +∥fρ∥∞ ∞ ρ∞ ρ σ2 σ2 ρ σ2 = = ≤ where the last inequality is due to the assumption that σ ≥ 1. 􏰍􏰇 􏰈2􏰎 2M + sup ∥f∥∞ + ∥fρ∥∞ + 2cH,ρ Eg, f∈H 􏰈2 2c 􏰎􏰇 + H,ρ c 􏰈 H,ρ σ2 + H,ρ σ2 Eσ(f)−Eσ(fρ)+ Eg σ2 􏰈2 2c 􏰎 For any g1, g2 ∈ FH, there exist f1, f2 ∈ H such that g1(z) = −σ2 exp 􏰂−(y − f1(x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 + cH,ρ and σ2 g2(z) = −σ2 exp 􏰂−(y − f2(x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 + cH,ρ . σ2 Applying the mean value theorem and noticing the boundedness of Y , we have |g (z)−g (z)|≤2􏰇M+sup∥f∥ 􏰈∥f −f ∥ , z∈Z. 12 ∞12∞ f∈H 1015 Feng, Huang, Shi, Yang, and Suykens Under the Complexity Assumption II with 0 < s < 2, the following relation between the l2-empirical covering numbers of FH and H holds 􏰇 􏰑􏰇 log N2(FH, η) ≤ log N2 H, η 2M + 2 sup ∥f∥∞ f∈H For notation simplification, we denote 􏰈􏰈 􏰇􏰇 ≤ cII,s 2M + 2 sup ∥f∥∞ f∈H 􏰈􏰑􏰈s η . ′􏰇 􏰈2 cH,ρ,2 = 2M + sup ∥f∥∞ + ∥fρ∥∞ + 2cH,ρ, B′ =􏰇2M+sup∥f∥ +∥f ∥ 􏰈sup∥f−f ∥ +c , f∈H f∈H ∞ ρ∞ ρ∞ H,ρ f∈H H,ρ aH,s =cII,s 2M+2sup∥f∥∞ . 􏰇 􏰈s f∈H Applying Lemma 14 to the function set FH, with simple computations, we come to the conclusion that when σ ≥ 1, for any 0 < δ < 1 with confidence 1 − δ/2, there holds {Eσ(f)−Eσ(f )}−{Eσ(f)−Eσ(f )}≤ 1{Eσ(f)−Eσ(f )}+C′ log(2/δ)m− 2 , ρzzρ2 ρH,ρ,2 2+s where C′ is a positive constant independent of m, σ or δ and given by H,ρ,2 C′ = 18B′ + 2c′ + 2a a2/(2+s)(c′ + B′ )(2−s)/(2+s), H,ρ,2 H,ρ H,ρ,2 s H,s H,ρ,2 H,ρ and as is a positive constant depending only on s. This completes the proof of Proposition 17. Proof [Proof of Theorem 5] Following from the estimate in inequality (11), and recalling that fρ ∈ H, we have ∥f −f ∥2 ≤S (z)+S (z)+2c /σ2. (18) zρL2ρ 1 2 H,ρ X As a consequence of Proposition 17, we know that when σ ≥ 1, for any 0 < δ < 1 with confidence 1 − δ/2, there holds {Eσ(f )−Eσ(f )}−{Eσ(f )−Eσ(f )}≤ 1{Eσ(f )−Eσ(f )}+C′ log(2/δ)m− 2 . zρzzzρ2zρH,ρ,2 2+s The above inequality together with Lemma 7 yields S2(z) = {Eσ(fz) − Eσ(fρ)} − {Ezσ(fz) − Ezσ(fρ)} ≤1∥f −f∥22 +cH,ρ +C′ log(2/δ)m− 2 . 2 z ρ LρX 2σ2 H,ρ,2 2+s This in connection with the upper bound for the sample error term S1(z) in Proposition 16 and inequality (18), with the choice σ = m1/(2+s), yields that for any 0 < δ < 1, with confidence 1 − δ, there holds ∥f−f∥22 ≤C′ log(2/δ)m−2 , zρLρXH,ρ 2+s where C′ = 2C′ + C′ + 3cH,ρ. This completes the proof of Theorem 5. H,ρ H,ρ,1 H,ρ,2 1016 Maximum Correntropy Criterion in Regression 4.3.6 Proof of Theorem 6 To prove Theorem 6, we first prove the following conclusion. Lemma 18 Assume that the Noise Assumption holds, and fρ ∈ H. Let σ be fixed and satisfy σ>σ =√2􏰇M+∥f∥+sup∥f∥􏰈. H,ρ 0ρ∞ ∞
f∈H
For any f ∈ H, there exists a positive constant cH,σ,ρ ∈ (0, 1), such that
cH,σ,ρ {E(f) − E(fρ)} ≤ Eσ(f) − Eσ(fρ).
Proof Under the Noise Assumption, when σ > σH,ρ, Theorem 8 shows that for any f ∈ H, Eσ(f) − Eσ(fHσ ) = Eσ(f) − Eσ(fH) = Eσ(f) − Eσ(fρ).
Foranyx∈X,againwedenoteF (u)=1−􏰒M0 exp􏰘−(t−u)2􏰙p (t)dt,then x −M0 σ2 ε|X=x
􏰔
Eσ(f) − Eσ(fρ) = σ2 = σ2
X2
where the last equality follows from the fact that Fx′(0) = 0 and ξx falls between 0 and
f(x)−fρ(x)foranyx∈X. Itiseasytoseethatwhenσisfixedandσ>σH,ρ,wehave ′′ 􏰔 M0 􏰋 (t−ξx)2􏰌􏰉σ2 −2(t−ξx)2􏰊
(Fx(f(x) − fρ(x)) − Fx(0)) dρX (x)
􏰔􏰋 F′′(ξ) 􏰌
X X2
Fx′ (0)(f(x) − fρ(x)) + x x (f(x) − fρ(x))2 dρX (x)
􏰔 σ2F′′(ξ )
= x x (f(x) − fρ(x))2dρX (x),
Fx (ξx) = 2 exp − σ2 σ4 pε|X=x(t)dt −M0
≥ (2σ2 − 2σ2 H,ρ
)/σ4 exp(−σ2 /σ2), for any x ∈ X, H,ρ
where the last inequality is due to the following fact

|t−ξx|≤ 2σH,ρ/2, t∈[−M0,M0],x∈X.
As a result, we come to the conclusion that Eσ(f)−Eσ(fρ)≥cH,σ,ρ {E(f)−E(fρ)},
The proof of Theorem 6 is different from the proofs of Theorem 4 and Theorem 5. This is because when σ is fixed, σ−1 does not tend to zero and consequently we cannot get
where c = (σ2 − σ2 )/σ2 exp(−σ2 /σ2). Noticing that 0 < c < 1, we have verified H,σ,ρ our assertion. H,ρ H,ρ H,σ,ρ 1017 σ is fixed and satisfies Feng, Huang, Shi, Yang, and Suykens meaningful convergence rates via the error decomposition in Proposition 9. However, from Lemma 18, we know that ∥f −f ∥22 ≤c−1 {Eσ(f )−Eσ(f )}=c−1 (S (z)+S (z)), z ρLρX H,σ,ρ z ρ H,σ,ρ 1 2 where the definitions of S1(z) and S2(z) are inherited from Proposition 9. We notice that under the condition that the Noise Assumption holds, and fρ ∈ H, when σ>σ =√2􏰇M+∥f∥+sup∥f∥􏰈, H,ρ 0ρ∞ ∞
f∈H
Theorem 8 tells us that almost surely fHσ = fρ. In this situation, almost surely we have S1(z) = 0. Therefore, to prove Theorem 6, it suffices to bound the sample error term S2(z). This can be done by applying Lemma 14 to the function set
FH =􏰂g􏰄􏰄g(z)=lσ(y,f(x))−lσ(y,fρ(x)):f ∈H,z∈Z􏰃.
Proposition 19 Assume that the Complexity Assumption II with 0 < s < 2 and the Noise Assumption hold. Let fρ ∈ H, σ be fixed and satisfy σ>σ =√2􏰇M+∥f∥+sup∥f∥􏰈.
H,ρ 0ρ∞ ∞ f∈H
For any f ∈ H and 0 < δ < 1, with confidence 1 − δ, there holds {Eσ(f)−Eσ(f )}−{Eσ(f)−Eσ(f )}≤ 1{Eσ(f)−Eσ(f )}+C log(1/δ)m− 2 , ρzzρ2 ρH,σ,ρ,1 2+s where CH,σ,ρ,1 is a positive constant independent of m or δ and will be given explicitly in the proof. Proof For any g ∈ FH, we know from the definition of FH that g can be expressed as g(z) = −σ2 exp 􏰂−(y − f(x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 , z ∈ Z, for some f ∈ H. Following from the proof of Proposition 10, we know that ∥g∥∞ ≤ 􏰚2/eσ sup ∥f − fρ∥∞ := BH,σ,ρ. f∈H When the Noise Assumption holds, fρ ∈ H, and σ > σH,ρ, we have
Eg2 ≤E􏰀(f(x)−fρ(x))2(2y−f(x)−fρ(x))2􏰁
􏰔􏰇222􏰈􏰔2
≤ 12y + 3 sup ∥f∥∞ + 3∥fρ∥∞ dρ(y|x) (f(x) − fρ(x)) dρX (x)
Yf∈H X −1 􏰇 􏰔 2 2 2 􏰈
≤ cH,σ,ρ 12 y dρ + 3 sup ∥f∥∞ + 3∥fρ∥∞ Eg := cH,σ,ρ,1Eg, Z f∈H
1018

and
g1(z) = −σ2 exp 􏰂−(y − f1(x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 g2(z) = −σ2 exp 􏰂−(y − f2(x))2/σ2􏰃 + σ2 exp 􏰂−(y − fρ(x))2/σ2􏰃 .
Maximum Correntropy Criterion in Regression
where the last inequality follows from Lemma 18. For any g1, g2 ∈ FH, there exist f1, f2 ∈ H such that
From the proof of Proposition 10, we know that |g1 − g2| ≤ 􏰚2/eσ∥f1 − f2∥∞. This in connection with the Complexity Assumption II yields
􏰇 􏰚 􏰈 􏰇􏰚 􏰈s −s log N2(FH, η) ≤ log N2 H, η/( 2/eσ) ≤ cII,s 2/eσ/η := aσ,sη
.
Applying Lemma 14 to the function set FH, with simple computations, we see that for any 0 < δ < 1 with confidence 1−δ, there holds {Eσ(f) − Eσ(fρ)} − {Ezσ(f) − Ezσ(fρ)} ≤ 1 {Eσ(f) − Eσ(fρ)} + CH,σ,ρ,1 log(1/δ)m−2/(2+s), 2 where CH,ρ,σ,1 is a positive constant independent of m or δ and given by C = 18B + 2c + 2a′ a2/(2+s)(c + B )(2−s)/(2+s), H,ρ,σ,1 H,σ,ρ H,σ,ρ,1 s σ,s H,σ,ρ,1 H,σ,ρ and a′s is a positive constant depending only on s. This completes the proof of Proposition 19. Proof [Proof of Theorem 6] As a consequence of Proposition 19, we see that for any 0 < δ < 1, with confidence 1 − δ, there holds S2(z) ≤ 1 {Eσ(fz) − Eσ(fρ)} + CH,σ,ρ,1 log(1/δ)m−2/(2+s). 2 Following from Lemma 18 and recalling that S1(z) = 0, we come to the conclusion that for any 0 < δ < 1, with confidence 1 − δ, there holds ∥f −f ∥22 ≤c−1 {Eσ(f )−Eσ(f )}=c−1 S (z)≤2c−1 C log(1/δ)m−2/(2+s). z ρ LρX H,σ,ρ z ρ H,σ,ρ 2 H,σ,ρ H,σ,ρ,1 By denoting C = 2c−1 C , we complete the proof of Theorem 6. H,σ,ρ H,σ,ρ H,σ,ρ,1 5. Towards the Role that σ Plays We now move our attention to discuss the scale parameter σ in the lσ loss by making some attempts to interpret the role that σ plays from a learning theory viewpoint. The first observation on the parameter σ in the lσ loss is that it determines the ro- bustness of the regression models. For linear regression models, this observation has been quantitatively described in terms of the influence function and finite-sample breakdown 1019 Feng, Huang, Shi, Yang, and Suykens point in Wang et al. (2013). For nonlinear regression models, similar observations on the robustness have been also empirically reported. For instance, the robustness of the regres- sion models induced by the lσ losses can be enhanced with a decreasing value of σ. In fact, this is reasonable if we look at the lσ loss in which a smaller σ would limit the influ- ence of the outliers in the response variable. In addition, in the learning theory literature, the robustness property of kernel-based regression models has been studied by considering the growth type of the loss function and investigating the existence and boundedness of the corresponding influence function (see Christmann and Steinwart, 2007; Steinwart and Christmann, 2008). From Chapter 2 in Steinwart and Christmann (2008), it is easy to check that the lσ loss is of upper growth type 1 due to its Lipschitz continuity property and con- sequently can be used to deal with unbounded Y . It would be also worthwhile to derive a quantitative description on the robustness of the MCCR model (2) in terms of the influence function as done in Christmann and Steinwart (2007) and Christmann and Messem (2008) for convex regression models. However, we remark that due to the non-convexity of the lσ loss, the deduction of the influence function of the MCCR model in H (which is possibly infinite dimensional) can be much involved and is worthy for further study. On the other hand, we realize that in the robustness literature, the scale parameter not only controls the robustness property of the regression model associated with the lσ loss but also specifies its efficiency and plays a trade-off role. Considering the nonparametric setting in our study and given that our primary concern is the convergence rates of the MCCR model (2), we restrict ourselves to discussions of the influence of the scale parameter σ on the convergence rates. To this end, we recall the following relation from the error decomposition in Proposition 9: ∥f −f ∥2 ≤􏰂Eσ(f )−Eσ(fσ)􏰃+∥f −f ∥2 +A zρL2ρ zHHρL2ρH,σ,ρ XX . On the right-hand side of the above inequality, the first term is the excess risk of the empirical estimator modeled by the MCCR model, the convergence of which can be ensured by controlling the complexity of the hypothesis space H and confining the tail behavior of the response variable. The second term ∥f − f ∥2 represents the approximation error H ρL2ρX and is independent of the scale parameter σ. The influence of the scale parameter σ on the convergence rates can be revealed from the bias term AH,σ,ρ. According to Proposition 9, we know that AH,σ,ρ = 2cH,ρ/σ2. Therefore, a decreasing value of σ will lead to increasing bias and consequently yields slower convergence rates. From the above discussions, we can see that the parameter σ in the lσ loss balances the robustness of the MCCR model (2) and its convergence rates. We will continue our discussion on the role that σ plays by trying to extend our preceding analysis for the lσ loss to other robust regression loss functions in the next section. 6. Generalization to Other Robust Loss Functions In the preceding sections, motivated by the information-theoretic interpretation of the max- imum correntropy criterion and its empirical successes in real-world applications, we gener- alize the idea of the maximum correntropy criterion in regression with the lσ loss. We then present a theoretical understanding towards the maximum correntropy criterion in regres- sion by conducting a learning theory analysis for ∥fz − fρ∥2L2ρ 1020 . We conclude that one can X Maximum Correntropy Criterion in Regression rely on the lσ loss to solve regression problems with non-Gaussian as well as Gaussian noise. However, one may argue that from a regression viewpoint, the lσ loss is merely a special case of robust loss functions arise in robust statistics. In view of this, in this section we try to generalize our previous analysis to other robust loss functions and see what happens when a robust loss function is applied into the learning for regression scenarios. The robust loss functions refer to those used to obtain robust M-estimators in linear regression models. As mentioned earlier, the MCCR model can be viewed as a nonpara- metric M-estimator. Therefore, we first give a glimpse of the robust M-estimation methods in linear regression models to distinguish them from the robust nonparametric M-estimator we investigate in this paper. In linear regression models, it is assumed that the observations z are drawn i.i.d from Z = X ×Y with X = Rd and Y = R. In this setting, the regression function f⋆(x) := xTθ⋆, where θ⋆ ∈ Θ := Rd is unknown and one of the main tasks in linear regression problem is to estimate the regression parameter θ⋆. A common approach to obtaining a robust estimator θˆ for θ⋆ is to solve the following optimization problem ˆ θ = arg min θ∈Rd 􏰓m 􏰉yi−xTiθ􏰊 φ , (19) i=1 σ where σ > 0 is the scale parameter and φ is a robust loss function that downweights large residual errors. In fact, by using the above robust loss function φ, concerning the nonlinear regression model (1), one can also propose the following robust nonparametric ERM-based regression scheme
ˆ 􏰓m fz = arg min
f∈H i=1
􏰉 y i − f ( x i ) 􏰊
φ . (20)
σ
Notice that (19) aims at estimating a vector in Rd while (20) is proposed to estimate a function in a function space H that can have an infinite dimension. This gives the main difference between the two models. Denoting φσ(t) := φ(t/σ), besides the lσ loss investigated in this paper, several frequently employed robust loss functions include:
• Huber’s loss: φσ(t) = t2I{|t|≤σ} + (2σ|t| − σ2)I{|t|>σ};
• Cauchyloss: φσ(t)=σ2log􏰀1+t2/σ2􏰁;
• Tukey’s biweight loss: φσ(t) = (σ2/6)(1 − (1 − (t/σ)2)3)I{|t|≤σ} + (σ2/6)I{|t|>σ}.
In the above loss functions, IS is an indicator function which takes the value 1 if S is true and gets the value 0 otherwise.
Recall that our previous analysis on the lσ loss and the MCCR model (2) relies heavily on Lemma 7. From the proof of Lemma 7, we know that similar analysis can be also applied to other robust loss functions that are sufficiently smooth and satisfy certain conditions, e.g., the Cauchy loss given above. On the other hand, although our analysis cannot cover all the robust loss functions, following from our previous analyzing process, we can still get a general view on the robust loss functions and see what happens when a robust loss function is employed from a learning theory viewpoint.
1021

Feng, Huang, Shi, Yang, and Suykens
fH I
fls z
H
II

Figure 2: The statistical learning approach to bounding the L2ρX -distance between fz and fρ for the ERM scheme (6), which is induced by the least squares loss.
H

fz
Figure 3: The statistical learning approach to bounding the L2ρX -distance between fz and
fρ for the ERM scheme induced by a robust loss function φσ.
fHσ fH
IIII II
1022

Maximum Correntropy Criterion in Regression
To illustrate this, we first recall that to analyze the convergence of an ERM scheme
associated with the least squares loss (e.g., the unconstrained regression model (6)), a
typical statistical learning approach is proceeded as follows: instead of directly measuring
the L2 -distance between fls and f , one first introduces the projection of f in H, i.e., f . ρX zρ ρH
With the help of fH, one can decompose the distance into sample error and approximation error as follows:
∥fls−f∥22 ≤∥fls−f ∥22 +∥f −f∥22 . zρLρX zHLρX HρLρX
The idea of the above decomposition is depicted in Figure 2, where I represents the sample error ∥f ls − f ∥2 while II gives the approximation error ∥f − f ∥2 .
z HL2ρ H ρL2ρ XX
However, situations will be quite different if a robust regression loss φσ is employed. To explain this, we redefine fHσ as the target function of the regression model induced by a general robust loss φσ and fz as the corresponding empirical target function, definitions of which are given as follows
σ􏰔 1􏰓m
fH = arg min φσ(y − f(x))dρ and fz = arg min φσ(yi − f(xi)).
f∈H Z f∈Hmi=1
The analysis in our study indicates that to analyze the convergence of a regression model
induced by a robust loss function φσ, one may proceed via the following decomposition ∥fz − fρ∥2L2ρ ≤ ∥fz − fHσ ∥2L2ρ + ∥fHσ − fH∥2L2ρ + ∥fH − fρ∥2L2ρ .
XXXX
Figure 3 gives an intuitive description on the above decomposition. Similarly, in Figure 3, I
represents the sample error term ∥fz − fHσ ∥2L2ρ
, II stands for the approximation error term
∥fH −fρ∥2L2ρ
X
X
while III measures the L2ρX -distance between fHσ and fH. Notice that the bias
term III is caused by the introduction of the scale parameter σ that delivers the robustness
to the model. Due to the non-robustness of LSR and the fact that fH is the target function of LSR, again we conclude that the smaller of the L2ρX -distance between fHσ and fH is, the less robustness the regression model associated with the φσ loss possesses.
Taking the lσ loss for example, we know from our previous analysis that under very specific conditions the two points, fHσ and fH, meet and consequently the bias term III disappears. Technically speaking, a nice point of the lσ loss lies in that it is sufficiently smooth which makes it possible to bound the L2ρX -distance between fHσ and fH explicitly. For instance, when the Moment Assumption holds and fρ ∈ H, as a consequence of Lemma 7, we see that
≤ cH,ρ/σ2.
As mentioned in the previous section, the above estimate reveals that when the value of σ decreases, the upper bound of the bias term III increases.
Based on the above discussions, we conclude that when a robust loss function is employed in nonparametric regression problems, the enhancement of robustness is at the sacrifice of the convergence rate of the model and what one needs to do is to find a good compromise.
∥fHσ − fH∥2L2ρ
X
1023

Feng, Huang, Shi, Yang, and Suykens
7. Numerical Experiments
Studies in this paper are motivated by empirical success of the MCCR model. However, for the sake of completeness, in this section, we carry out numerical experiments on synthetic and real data sets to show the effectiveness of the MCCR model (2).
7.1 Experimental Setup
Notice that the MCCR model (2) is a constrained optimization model since H is assumed to be a compact subset of C(X). As mentioned previously, a typical choice of H is a bounded subset of a certain reproducing kernel Hilbert space HK induced by some Mercer kernel K. However, to determine the diameter of this bounded subset in applications, prior information is usually required. In our experiments, instead of evaluating the optimization model (2), we focus on its unconstrained version
1 􏰓m
fz =arg min lσ(yi,f(xi))+λ∥f∥2K, (21)
where λ is a positive regularization parameter.
The representer theorem ensures that we can search within the function set HK,z for
the minimizer of the optimization model (21), where
􏰏m􏰐 HK,z = 􏰓αiK(x,xi)+b,b∈R,αi ∈R,i=1,···,m ,
i=1
with b being an offset. In our experiments, we use the Gaussian kernel
Kh(xi,xj)=exp􏰀−∥xi −xj∥2/h2􏰁,
with the parameter h to be determined. To show the effectiveness of the MCCR model, we compare the empirical performance of (21) with other robust regression schemes, including robust regression models based on the Huber’s loss and the least absolute deviation loss. These robust regression schemes are obtained by replacing the lσ loss in (21) with the Huber’s loss and the least absolute deviation loss, respectively. Explicit definitions of the two loss functions are given as follows:
Huber 􏰋 (u−v)2, if |u−v|≤a LAD
φa (u,v)= 2a|u−v|, if |u−v|>a and φ (u,v)=|u−v|, u,v∈R.
For notation simplifications, we denote the two robust regression models as Huber and LAD, respectively.
To solve (21), we apply the iteratively reweighted least squares method (IRLS). The basic procedure is to iteratively solve the weighted least squares problem and give weights according to the current solution. Due to the non-convexity of the MCCR model, solving (21) by using IRLS only guarantees a stationary point. In our experiment, we use the result of the least squares method as the starting point.
In our experiment, noise added to the toy examples is given as follows
noise := τ1ε1 + τ2εp2, (22)
f∈HK m i=1
1024

Maximum Correntropy Criterion in Regression
where ε1 follows the standard Gaussian distribution and εp2 is an impulse noise (outliers) defined as
1−p, t=0, Prob(εp2 = t) = p/2, t = 1,
 p/2, t = −1.
τ1 and τ2 are introduced to set the variance of the Gaussian noise and the magnitude of the impulsive noise. In our experiment, we always set p = 0.1, i.e., 10% samples are contaminated by impulsive noise. In addition, in some of our experiments on synthetic data sets, we will also consider the noise ε1 that is drawn from the Student’s t-distribution with 3 degrees of freedom, and Cauchy distribution.
7.2 Example of the Noisy Sinc Function
We first choose the sinc function as the regression function. The one-dimension sinc function is given as
f (x) = sin(πx)/(πx), x ∈ [−4, 4], (23)
which is frequently adopted to illustrate the regression models (see Vapnik, 1998; Suykens et al., 2002a,b; Sch ̈olkopf et al., 2000; Smola and Sch ̈olkopf, 2004).
In our experiment, we first draw a training set of size 100 from the sinc function (23) that are corrupted by the Gaussian noise. We then draw another training set with the same size corrupted by the Gaussian noise and the outliers. With each training set, the fitting results of the sinc function are plotted in Figure 4, in which the red dot-dashed curve is the one reconstructed by MCCR, the blue dashed curve represents the one from Huber while LAD gives the green dotted curve.
From Figure 4, one can see that all of the three models can fit the curve of the sinc function well when the data is only contaminated by the Gaussian noise. When the train- ing data are also corrupted by outliers, all of the three robust regression models can still successfully reconstruct the curve. However, we can see that MCCR gives the best fitting results, especially at positions where data are corrupted by outliers.
7.3 Example of the Noisy Friedman’s Benchmark Functions
Our second numerical experiment on toy examples considers multiple dimensional regression problems. We now use the Friedman’s benchmark functions as our test functions, which were introduced in Friedman (1991) and have become widely employed models when studying regression problems (see Tipping, 2001; Brown et al., 2005; Debruyne et al., 2010).
The Friedman’s benchmark functions are listed as follows: • f1(x) = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10×4 + 5×5;
• f2(x) = 􏰛(x1)2 + (x2x3 − 1/(x2x4))2;
• f3(x) = arctan 􏰀1/x1 􏰀x2x3 − 1/(x2x4)􏰁􏰁.
1025

3
2
1
0
−1
−2
−3
−4 −3.2
3
2
1
0
−1
−2
−3
−4 −3.2
−2.4 −1.6 −0.8 0 x
0.8 1.6 2.4 3.2 4
Feng, Huang, Shi, Yang, and Suykens
−2.4 −1.6 −0.8 0 x
0.8 1.6 2.4 3.2 4
Figure 4: Sinc function (black solid curves) and the regression results (MCCR: red dot- dashed curve; Huber: blue dashed curve; LAD: green dotted curve). (top) The training data (crosses) are corrupted by Gaussian noise; (bottom) Some observed data are outliers (marked by squares).
1026
y = sin(πx)/(πx) y = sin(πx)/(πx)

−5 −10
500 250
−250 −500
MCCR
f2 without outliers
10
−5 −10
500
MCCR Huber LAD f2 with outliers
Maximum Correntropy Criterion in Regression
f1 without outliers
f1 with outliers
10 55 00
Huber LAD
250 00
MCCR
f3 without outliers
MCCR Huber LAD f3 with outliers
Huber LAD
11
00
−1 −1 MCCR Huber LAD
MCCR Huber LAD
residuals
residuals
residuals
residuals
residuals
residuals
Figure 5: Box-plots of the residuals of Friedman’s benchmark functions for the case of Gaus- sian noise. Each box-plot features a lower quartile (25 percentile) line, a median (50 percentile) line and an upper quartile (75 percentile) line for the residuals on test data.
1027
−250 −500

ˆ 􏰓 􏰇 RSSE(f) =
x∈T
where f ̄ is the mean value of f(x) on T. T
̄ 􏰁2 f(x) − fT ,
Feng, Huang, Shi, Yang, and Suykens
For f1, x = (x1,…,x10) where each xj, j = 1,…,10, is uniformly distributed in [0,1] and x6, …, x10 are noisy variables. For f2 and f3, x = (x1,x2,x3,x4) and each is uniformly distributed in the following intervals: x1 ∈ [0, 100], x2 ∈ [40π, 560π], x3 ∈ [0, 1] and x4 ∈ [1, 11].
For each function, 1000 observations are randomly taken from corresponding domain for
training and cross-validating. Another independent 1000 observations are also randomly
drawn as the test set. Noise and outliers are then added according to (22). For f1, we set
τ1 =1. Forf2 andf3,τ1 issetsuchthattheratioofthesignalpowertothepowerofε1 is3.
In the outlier-free cases, we set τ2 = 0. To observe the performance for the three models in
the presence of outliers in the training data sets, we set τ2 = maxx∈D f (x) − minx∈D f (x),
where D is the domain of each benchmark function. For each regression model, the width
of the Gaussian kernel h, the regularization parameter λ and the scale parameter in the loss
function (no scale parameter for the LAD loss) are all tuned via a 10-fold cross-validation
under the mean squared error criterion. The residuals {yi − f(xi)}1000 are recorded. For i=1
the case of Gaussian noise, we boxplot all the residuals in Figure 5. Each box-plot features a lower quartile (25 percentile) line, a median (50 percentile) line and an upper quartile (75 percentile) line.
In Table 1, we also report the relative sum of squared error (RSSE) on the test data set T, i.e.,
ˆ 􏰈2 􏰑 􏰓 􏰀 f(x) − f(x)
x∈T
test function
f1 f2 f3
f1 f2 f3
f1 f2 f3
noise
Gaussian noise, no outliers Gaussian noise, outliers Gaussian noise, no outliers Gaussian noise, outliers Gaussian noise, no outliers Gaussian noise, outliers
Cauchy noise, no outliers Cauchy noise, outliers Cauchy noise, no outliers Cauchy noise, outliers Cauchy noise, no outliers Cauchy noise, outliers
Student noise, no outliers Student noise, outliers Student noise, no outliers Student noise, outliers Student noise, no outliers Student noise, outliers
MCCR Huber LAD
0.048 0.049 0.062 0.073 0.020 0.021 0.023 0.032 0.091 0.117 0.062 0.073
0.042 0.042 0.045 0.049 0.005 0.005 0.006 0.006 0.180 0.195
0.219 0.143 0.040 0.040
0.046 0.075 0.017 0.017 0.023 0.024 0.423 0.429 0.471 0.544
0.103 0.157 0.136 0.156 0.136 0.157
0.116 0.089 0.025 0.021 0.177 0.154
0.101 0.092 0.129 0.123 0.430 0.434
Table 1: The relative sum of squared error on the test data
1028

Maximum Correntropy Criterion in Regression
7.4 Evaluation on Real Data Sets
We also evaluate the three robust regression models on four real data sets downloaded from UCI repository of machine learning databases (see Bache and Lichman, 2013): Concrete Compressive Strength Data Set, Housing Data Set, Yacht Hydrodynamics Data Set and Airfoil Self-Noise Data Set.
For each data set, two third of the instances are used for training and the remaining are used for test. We repeat our experiment as done for the Friedman’s benchmark functions for ten times. The residuals for the three robust regression models are displayed by box- plots in Figure 6, the accuracy of which are measured by RSSE. Experimental results on the RSSEs and the details of training data, including the size of features n and the size of instances m, are reported in Table 2.
data sets
concrete
house yacht-hydrodynamics airfoil
n m
9 686 14 338 7 205
MCCR
0.061
Huber LAD 0.061 0.062
0.126 0.175 0.024 0.159 0.195 0.238
6 1000
Table 2: The relative sum of squared error on real data
0.128 0.022 0.184
In the above numerical evaluations on toy examples and real data sets, our experiments show that when the data is only contaminated by Gaussian noise, a large sigma value in the MCCR model and a large a value in the regression model based on the Huber’s criterion will be selected via cross-validation. However, for other noise and in the presence and absence of outliers, smaller values of the scale parameters in the two regression models will be selected. These coincide with our understandings on the robust regression models.
From the above experimental results, we can see the effectiveness of MCCR especially for the cases in the presence of impulsive noise.
8. Concluding Remarks
In this paper, we presented a statistical learning interpretation of the regression model associated with the correntropy induced regression loss. We investigated its connections with the least squares regression. We found that the correntropy induced loss could help for regression with non-Gaussian noise. Meanwhile, comparable performance could be obtained by applying this regression model when the noise is Gaussian. Convergence rates of the proposed model under various circumstances were derived explicitly. We showed that the scale parameter in the loss function balanced the convergence rates and the robustness of the model. We also made some efforts to extend our analysis to other robust loss functions and gave a general view on analyzing regression models induced by general robust loss functions. It is expected that our observations can shed some light towards future real-life applications.
1029

40 20 0 −20 −40
20
MCCR
Huber LAD
20 0 −20 −40
20
MCCR Huber LAD
Feng, Huang, Shi, Yang, and Suykens
00
−20 −20
MCCR Huber LAD
MCCR Huber LAD
Figure 6: Box-plots of the residuals on four real data sets. Each box-plot features a lower quartile (25 percentile) line, a median (50 percentile) line and an upper quartile (75 percentile) line for the residuals on test data. (top left) concrete; (top right) Boston house; (bottom left) yacht hydrodynamics; (bottom right) airfoil.
1030

Maximum Correntropy Criterion in Regression
Acknowledgments
The authors would like to thank the action editor and the reviewers for their insightful comments and constructive suggestions, which improved the quality of this paper. The authors would also like to thank Dr. Jun Fan from Department of Statistics, University of Wisconsin-Madison for helpful discussions.
EU: The research leading to these results has received funding from the European Re- search Council under the European Union’s Seventh Framework Programme (FP7/2007- 2013) / ERC AdG A-DATADRIVE-B (290923). This paper reflects only the authors’ views, the Union is not liable for any use that may be made of the contained information. Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/Postdoc grants. Flemish Government: FWO: projects: G.0377.12 (Structured systems), G.088114N (Tensor based data similarity); PhD/Postdoc grants. IWT: projects: SBO POM (100031); PhD/Postdoc grants. iMinds Medical Information Technologies SBO 2014. Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, Dynamical systems, control and optimiza- tion, 2012-2017). L. Shi is supported by the National Natural Science Foundation of China Project No. 11201079), the Joint Research Fund by National Natural Science Foundation of China and Research Grants Council of Hong Kong (Project No. 11461161006 and Project No. CityU 104012) and the Fundamental Research Funds for the Central Universities of China (Project No. 20520133238, Project No. 20520131169). Johan Suykens is a professor at KU Leuven, Belgium.
References
Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.
Jean-Yves Audibert and Olivier Catoni. Robust linear least squares regression. Annals of Statistics, 39(5):2766–2794, 2011.
Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013. URL http: //archive.ics.uci.edu/ml.
Sergei N. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow, 1946.
Ricardo J. Bessa, Vladimiro Miranda, and Jo ̃ao Gama. Entropy and correntropy against minimum square error in offline and online three-day ahead wind power forecasting. Power Systems, IEEE Transactions on, 24(4):1657–1666, 2009.
Gavin Brown, Jeremy L. Wyatt, and Peter Tinˇo. Managing diversity in regression ensem- bles. Journal of Machine Learning Research, 6:1621–1650, 2005.
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
1031

Feng, Huang, Shi, Yang, and Suykens
Andreas Christmann and Arnout Van Messem. Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning Research, 9:915– 936, 2008.
Andreas Christmann and Ingo Steinwart. Consistency and robustness of kernel-based re- gression in convex risk minimization. Bernoulli, 13(3):799–819, 2007.
Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory View- point. Cambridge University Press, Cambridge, 2007.
Patrick L. Davies. Aspects of robust linear regression. Annals of Statistics, 21(4):1843–1899, 1993.
Kris De Brabanter, Kristiaan Pelckmans, Jos De Brabanter, Michiel Debruyne, Johan A. K. Suykens, Mia Hubert, and Bart De Moor. Robustness of kernel based regression: a comparison of iterative weighting schemes. In International Conference on Artificial Neural Networks–ICANN 2009, pages 100–110. Springer, 2009.
Michiel Debruyne, Mia Hubert, and Johan A. K. Suykens. Model selection in kernel based regression using the influence function. Journal of Machine Learning Research, 9:2377– 2400, 2008.
Michiel Debruyne, Andreas Christmann, Mia Hubert, and Johan A. K. Suykens. Robustness of reweighted least squares kernel based regression. Journal of Multivariate Analysis, 101 (2):447–463, 2010.
John E. Dennis and Roy E. Welsch. Techniques for nonlinear least squares and robust regression. Communications in Statistics-Simulation and Computation, 7(4):345–359, 1978.
Jun Fan, Ting Hu, Qiang Wu, and Ding-Xuan Zhou. Consistency analysis of an empirical minimum error entropy algorithm. Applied and Computational Harmonic Analysis, in press. doi: 10.1016/j.acha.2014.12.005.
Jerome H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1): 1–141, 1991.
Aysegul Gunduz and Jos ́e C. Pr ́ıncipe. Correntropy as a novel measure for nonlinearity tests. Signal Processing, 89(1):14–23, 2009.
Zheng-Chu Guo and Ding-Xuan Zhou. Concentration estimates for learning with unbounded sampling. Advances in Computational Mathematics, 38(1):207–223, 2013.
Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, New York, 1986.
Ran He, Wei-Shi Zheng, and Bao-Gang Hu. Maximum correntropy criterion for robust face recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(8): 1561–1576, 2011a.
1032

Maximum Correntropy Criterion in Regression
Ran He, Wei-Shi Zheng, Bao-Gang Hu, and Xiang-Wei Kong. A regularized correntropy framework for robust pattern recognition. Neural Computation, 23(8):2074–2100, 2011b.
Paul W. Holland and Roy E. Welsch. Robust regression using iteratively reweighted least- squares. Communications in Statistics-Theory and Methods, 6(9):813–827, 1977.
Ting Hu, Jun Fan, Qiang Wu, and Ding-Xuan Zhou. Learning theory approach to minimum error entropy criterion. Journal of Machine Learning Research, 14(1):377–397, 2013.
Ting Hu, Jun Fan, Qiang Wu, and Ding-Xuan Zhou. Regularization schemes for minimum error entropy principle. Analysis and Applications, 2014.
Peter J. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.
Weifeng Liu, Puskal P. Pokharel, and Jos ́e C. Pr ́ıncipe. Correntropy: properties and appli- cations in non-gaussian signal processing. Signal Processing, IEEE Transactions on, 55 (11):5286–5298, 2007.
Shahar Mendelson and Joseph Neeman. Regularization in kernel learning. Annals of Statis- tics, 38(1):526–565, 2010.
Jos ́e C. Pr ́ıncipe. Information Theoretic Learning: R ́enyi’s Entropy and Kernel Perspec- tives. Springer, New York, 2010.
Ignacio Santamar ́ıa, Puskal P. Pokharel, and Jos ́e C. Pr ́ıncipe. Generalized correlation function: definition, properties, and application to blind equalization. Signal Processing, IEEE Transactions on, 54(6):2187–2197, 2006.
Bernhard Sch ̈olkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
Steve Smale and Ding-Xuan Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(01):17–41, 2003.
Alex J. Smola and Bernhard Sch ̈olkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, New York, 2008.
Ingo Steinwart, Don Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Conference on Learning Theory, 2009.
Johan A. K. Suykens, Jos De Brabanter, Lukas Lukas, and Joos Vandewalle. Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing, 48(1):85–105, 2002a.
Johan A. K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Van- dewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002b.
1033

Feng, Huang, Shi, Yang, and Suykens
Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.
Vladimir Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998. Cheng Wang and Ding-Xuan Zhou. Optimal learning rates for least squares regularized
regression with unbounded sampling. Journal of Complexity, 27(1):55–67, 2011.
Xueqin Wang, Yunlu Jiang, Mian Huang, and Heping Zhang. Robust variable selection with exponential squared loss. Journal of the American Statistical Association, 108(502): 632–643, 2013.
Qiang Wu, Yiming Ying, and Ding-Xuan Zhou. Learning rates of least-square regularized regression. Foundations of Computational Mathematics, 6(2):171–192, 2006.
Qiang Wu, Yiming Ying, and Ding-Xuan Zhou. Multi-kernel regularized classifiers. Journal of Complexity, 23(1):108–134, 2007.
Ding-Xuan Zhou. The covering number in learning theory. Journal of Complexity, 18(3): 739–767, 2002.
1034