代写 R C algorithm deep learning math scala database graph statistic network CHAPTER 23 Linear Regression

CHAPTER 23 Linear Regression
Given a set of attributes or variables X1,X2, ,Xd, called the predictor, explanatory, or independent variables, and given a realvalued attribute of interest Y, called the response or dependent variable, the aim of regression is to predict the response variable based on the independent variables. That is, the goal is to learn a regression function f , such that
Y f X1 , X2 , , Xd f X
where X X1 , X2 , , Xd T is the multivariate random variable comprising the predictor attributes, and is a random error term that is assumed to be independent of X. In other words, Y is comprised of two components, one dependent on the observed predictor attributes, and the other, coming from the error term, independent of the predictor attributes. The error term encapsulates inherent uncertainty in Y, as well as, possibly the effect of unobserved, hidden or latent variables.
In this chapter we discuss linear regression, where the regression function f is assumed to be a linear function of the parameters of the model. We also discuss regularized linear regression models considering both L2 ridge regression and L1 Lasso regularization. Finally, we use the kernel trick to perform kernel ridge regression that can handle nonlinear models.
23.1 LINEAR REGRESSION MODEL
In linear regression the function f is assumed to be linear in its parameters, that is
d i1
Here, the parameter is the true unknown bias term, the parameter i is the true unknownregressioncoefficientorweightforattributeXi,and1,2,,dT is the true ddimensional weight vector. Observe that f specifies a hyperplane in Rd1, where is the the weight vector that is normal or orthogonal to the hyperplane, and
fX1X1 2X2 dXd
iXi TX 23.1
589

590 Linear Regression
is the intercept or offset term see Section 6.1. We can see that f is completely specified by the d 1 parameters comprising and i , for i 1, , d .
The true bias and regression coefficients are unknown. Therefore, we have to estimate them from the training dataset D comprising n points xi Rd in a ddimensional space, and the corresponding response values yi R, for i 1,2, ,n. Let b denote the estimated value for the true bias , and let wi denote the estimated value for the true regression coefficient i , for i 1, 2, , d . Let w w1 , w2 , , wd T denote the vector of estimated weights. Given the estimated bias and weight values, we can predict the response for any given input or test point x x1,x2, ,xdT, as follows
y bw1x1 wdxd bwTx 23.2
Due to the random error term, the predicted value y will not in general match the observed response y for the given input x. This is true even for the training data. The difference between the observed and predicted response, called the residual error, is given as
o y y y b w T x 2 3 . 3
Note that the residual error o is different from the random statistical error , which measures the difference between the observed and the unknown true response. The residual error o is an estimator of the random error term .
A common approach to predicting the bias and regression coefficients is to use the method of least squares. That is, given the training data D with points xi and response values yi for i 1, ,n, we seek values b and w, so as to minimize the sum of squared residual errors SSE
nn2n 2
SSE oi2 yi yi yi bwTxi 23.4
i1 i1 i1
In the following sections, we will estimate the unknown parameters, by first considering the case of a single predictor variable, and then looking at the general case of multiple predictors.
23.2 BIVARIATE REGRESSION
Let us first consider the case where the input data D comprises a single predictor attribute, X x1,x2, ,xnT, along with the response variable, Y y1,y2, ,ynT. Since f is linear, we have
yi fxibwxi 23.5
Thus, we seek the straight line fx with slope w and intercept b that best fits the data. The residual error, which is the difference between the predicted value also

Bivariate Regression 591
called fitted value and the observed value of the response variable, is given as
o i y i y i 2 3 . 6
Note that oi denotes the vertical distance between the fitted and observed response. The best fitting line minimizes the sum of squared errors
n n n
minSSE oi2 yi yi2 yi bwxi2 23.7
b,w
i1 i1 i1
To solve this objective, we differentiate it with respect to b and set the result to 0,
to obtain
n
bSSE2 yi bwxi0
i1
n n n
b yiw xi i1 i1 i1
1 n
b n
i1
1 n
y i w n bY wX
x i
Therefore, we have
i1
23.8 where Y is the sample mean for the response and X is the sample mean for the
predictor attribute. Similarly, differentiating with respect to w, we obtain n
wSSE2 xiyi bwxi0 i1
n n n
x i y i b x i w x i2 0
i1 i1 i1 substituting b from Eq. 23.8, we have
n n n n
x i y i Y x i w X x i w x i2 0
i1 i1 i1 i1
2 3 . 9
n n w xi2 n2X xi yi
nX Y
23.10
i1 i1 ni1 xi yinX Y
w n x2n2 i1 i X
The regression coefficient w can also be written as
ni1xi Xyi Y XY covX,Y w ni1xiX2 X2 varX
23.11

592 Linear Regression where X2 is the variance of X and XY is the covariance between X and Y. Noting that
the correlation between X and Y is given as XY XY , we can also write w as X Y
w Y XY X
23.12
Observe that the fitted line must pass through the mean value of Y and X; plugging in the optimal value of b from Eq. 23.8 into the regression equation Eq. 23.5, we have
yi bwxi Y wX wxi Y wxi X
That is, when xi X, we have yi Y. Thus, the point X,Y lies on the regression
line.
Example23.1BivariateRegression. Figure23.1showsthescatterplotbetweenthe two attributes petal length X; the predictor variable and petal width Y; the response variable in the Iris dataset. There are a total of n 150 data points. The mean values for these two variables are
X 150
1 1 5 0
Y 150
1 1 5 0
5 6 3 . 8
xi 150 3.7587
1 7 9 . 8
yi 150 1.1987
i1 i1
2.5 2.0 1.5 1.0 0.5

0 0.5

0 1.0
2.0
3.0
X: petal length
6.0
7.0

o35

x35

o9
x9

Figure 23.1. Scatterplot: petal length X versus petal width Y. Solid circle black shows the mean point; residual error is shown for two sample points: x9 and x35.
4.0 5.0
Y: petal width

Bivariate Regression 593
The variance and covariance is given as
1 1 5 0
X2 1 50
1 1 5 0
xi X2 3.0924
yi Y2 0.5785
Y2 1 50
1 1 5 0
i1
i1 i1
xi Xyi Y1.2877
Assuming a linear relationship between the response and predictor variables, we use
XY 150
Eq. 23.8 and Eq. 23.10 to obtain the slope and intercept terms as follows
w XY 1.2877 0.4164 X2 3.0924
bY wX 1.19870.41643.75870.3665 Thus, the fitted regression line is
y 0.3665 0.4164 x
Figure 23.1 plots the bestfitting or regression line; we can observe that the mean point X,Y 3.759,1.199 lies on the line. The figure also shows the residual errors, o9 and o35, for the points x9 and x35, respectively.
Finally, we can compute the SSE value see Eq. 23.4 as follows:
150 150
SSEoi2 yi yi2 6.343
i1 i1
23.2.1 GeometryofBivariateRegression
We now turn to the attributecentric view, which provides important geometric insight for bivariate regression. Recall that we are given n equations in two unknowns, namely yi b w xi, for i 1, ,n. Let X x1,x2, ,xnT be the ndimensional vector denoting the training data sample, Y y1 , y2 , , yn T the sample vector for the response variable, and Y y1,y2, ,ynT the vector of predicted values, then we can expressthenequations,yi bwxi fori1,2,,n,asasinglevectorequation:
Y b 1 w X
2 3 . 1 3
where 1 Rn is the ndimensional vector of all ones. This equation indicates that the predicted vector Y is a linear combination of 1 and X, i.e., it must lie in the column space spanned by 1 and X, given as span1,X. On the other hand, the response vector Y will not usually lie in the same column space. In fact, the residual error vector

594
Linear Regression
x1 Y
o Y Y
1
xn
ndimensional space spanned by the n data points. The plane illustrates the subspace spanned by 1 and X.
Y
Figure 23.2. Geometry of bivariate regression: nonorthogonal basis. All the vectors conceptually lie in the
X
x2
X
X
1

X 1
Figure 23.3. Orthogonal decomposition of X into X and X 1.
oo1,o2,,onT capturesthedeviationbetweentheresponseandpredictedvectors
o Y Y
The geometry of the problem, shown in Figure 23.2, makes it clear that the optimal Y that minimizes the error is the orthogonal projection of Y onto the subspace spanned by 1 and X. The residual error vector o is thus orthogonal to the subspace spanned by 1 and X, and its squared length or magnitude equals the SSE value see Eq. 23.4, since
n n o2 YY2 yi yi2
i1 i1
oi2 SSE
At this point it is worth noting that even though 1 and X are linearly independent and form a basis for the column space, they need not be orthogonal see Figure 23.2. We can create an orthogonal basis by decomposing X into a component along 1 and a component orthogonal to 1 as shown in Figure 23.3. Recall that the scalar projection

XX 1

Bivariate Regression
595
x1
Y
o Y Y
1
Y xn
X
x2
Figure 23.4. Geometry of bivariate regression: orthogonal basis. X X X 1 is the centered attribute vector. The plane illustrates the subspace spanned by the orthogonal vectors 1 andX.
of a vector b onto vector a see Eq. 1.12 is given as projab bTa
aTa
and the orthogonal projection of b on a see Eq. 1.11 is given as
projababTaa aTa
where X X X 1 is the centered attribute vector, obtained by subtracting the mean X from all points.
The two vectors 1 and X form an orthogonal basis for the subspace. We can thus obtain the predicted vector Y by projecting Y onto 1 andX, and summing up these two components, as shown in Figure 23.4. That is,
Yproj Y1proj YXYT11YTXX 1YTXX 23.14 1 X 1T1 XTX Y XTX
On the other hand, from Eq. 23.13, we know that
Yb1wXb1w X 1X bwX1wX 23.15
Since both Eq. 23.14 and Eq. 23.15 are expressions for Y, we can equate them to obtain
bw or b w w YTX YXYX XTX
Now, consider the projection of X onto 1; we have
p r o j 1 X 1 X T 1 1 ni 1 x i 1 X 1
Thus, we can rewrite X as
1T1 n
XX 1XX 1X 1X
Y

w

596 Linear Regression where the bias term b Y w X matches Eq. 23.8, and the weight w also matches
Eq. 23.10, since
Y T X Y T X Y T X X 1 ni 1 y i x i n X Y
wXTXX2XX12 ni1xi2n2X
Example 23.2 Geometry of Regression. Let us consider the regression of petal length X on petal width Y for the Iris dataset, with n 150. First, we center X by subtracting the mean X 3.759. Next, we compute the scalar projections of Y onto 1 andX, to obtain
Y proj1Y YT1 179.8 1.1987 1T1 150
wprojXYYTX 193.16 0.4164 T
Thus, the bias term b is given as
bY wX 1.19870.41643.75870.3665
These values for b and w match those in Example 23.1. Finally, we can compute the SSE value see Eq. 23.4 as the squared length of the residual error vector
SSEo2 YY2 YYTYY6.343
X X 463.86
23.3 MULTIPLE REGRESSION
We now consider the more general case called multiple regression1 where we have multiple predictor attributes X1,X2, ,Xd and a single response attribute Y. The training data sample D Rnd comprises n points xi xi1,xi2, ,xid T in a ddimensional space, along with the corresponding observed response value yi. The vector Y y1 , y2 , , yn T denotes the observed response vector. The predicted response value for input xi is given as
yi bw1xi1 w2xi2 wdxid bwTxi 23.16
whereww1,w2,,wdT istheweightvectorcomprisingtheregressioncoefficients or weights wj along each attribute Xj . Eq. 23.16 defines a hyperplane in Rd1 with bias term b and normal vector w.
Instead of dealing with the bias b separately from the weights wi for each attribute, we can introduce a new constant valued attribute X0 whose value is always fixed at 1, so that each input point xi xi1,xi2, ,xid T Rd is mapped to an augmented
1 We follow the usual terminology and reserve the term multivariate regression for the case when there are multipleresponseattributesY1,Y2,,Yq andmultiplepredictorattributesX1,X2,,Xd.

Multiple Regression 597
point x i xi0,xi1,xi2, ,xid T Rd1, where xi0 1. Likewise, the weight vector w w1,w2, ,wdT is mapped to an augmented weight vector w w0,w1,w2, ,wdT, where w0 b. The predicted response value for an augmented d 1 dimensional point x i can be written as
yi w0xi0 w1xi1 w2xi2 wdxid w Tx i 23.17
Since there are n points, in fact we have n such equations, one per point, and there are d 1 unknowns, namely the elements of the augmented weight vector w . We can compactly write all these n equations as a single matrix equation, given as
Y D w 23.18 nd1
where D R is the augmented data matrix, which includes the constant attribute X0 inadditiontothepredictorattributesX1,X2,,Xd,andYy1,y2,,ynT isthe vector of predicted responses.
The multiple regression task can now be stated as finding the best fitting hyperplane defined by the weight vector w that minimizes the sum of squared errors
2 3 . 1 9
n w i1
minSSE
oi2 o2 YY2
YYTYYYTY2YT YYT Y Y T Y 2 Y T D w D w T D w
Y T Y 2 w T D T Y w T D T D w
where we substituted Y D w from Eq. 23.18, and we used the fact that YT D w D w T Y w T D T Y .
To solve the objective, we differentiate the expression in Eq. 23.19 with respect to w and set the result to 0, to obtain
SSE2D TY2D TD w 0 w D T D w D T Y
Therefore, the optimal weight vector is given as
w D T D 1 D T Y
Substituting the optimal value of w into Eq. 23.18, we get YD w D D TD 1D TYHY
2 3 . 2 0
where H D D TD 1D T is called the hat matrix, since it puts the hat on Y! Notice also that D TD is the uncentered scatter matrix for the training data.

598
Linear Regression
X1
Y
150.0 876.50 563.80 D TD 876.5 5223.85 3484.25
563.8 3484.25 2583.00 We also compute D TY, given as
0.793 0.176 0.064 D TD 1 0.176 0.041 0.017

Figure23.5. Multipleregression:sepallengthX1andpetallengthX2withresponseattributepetal width Y. The vertical bars show the residual error for the points. Points in white are above the plane, whereas points in gray are below the plane.
Example 23.3 Multiple Regression. Figure 23.5 shows the multiple regression of sepal length X1 and petal length X2 on the response attribute petal width Y for the Iris dataset with n 150 points. We first add an extra attribute X0 1150, which is a vector of all ones in R150. The augmented dataset D R1503 comprises n 150 points along three attributes X0, X1, and X2.
Next,wecomputetheuncentered33scattermatrixD TD anditsinverse
179.80 D TY 1127.65
868.97 The augmented weight vector w is then given as
w2 0.45 The bias term is therefore b w0 0.014, and the fitted model is
Y0.0140.082X10.45X2
Figure 23.5 shows the fitted hyperplane. It also shows the residual error for each point. The white colored points have positive residuals i.e., oi 0 or y i yi , whereas
w0
0.014 w w 1 D T D 1 D T Y 0 . 0 8 2

X2

0.064 0.017 0.009

Multiple Regression 599
the gray points have negative residual values i.e., oi 0 or yi y. The SSE value for the model is 6.18.
23.3.1 Geometry of Multiple Regression
Let D be the augmented data matrix comprising the d independent attributes Xi , along
with the new constant attribute X0 1 Rn, given as
D X0 X1 X2 Xd

Let w w0 , w1 , , wd T Rd 1 be the augmented weight vector that incorporates
the bias term b w0. Recall that the predicted response vector is given as
d i0
This equation makes it clear that the predicted vector must lie in the column space of the augmented data matrix D , denoted colD , i.e., it must be a linear combination of the attribute vectors Xi, i 0, ,d.
To minimize the error in prediction, Y must be the orthogonal projection of Y onto the subspace colD . The residual error vector o Y Y is thus orthogonal to the subspace colD , which means that it is orthogonal to each attribute vector Xi . That is,
X Ti o 0
X Ti Y Y 0
X Ti Y X Ti Y
X Ti D w X Ti Y
w0 XTi X0 w1XTi X1 wd XTi Xd XTi Y
We thus have d 1 equations, called the normal equations, in d 1 unknowns, namely the regression coefficients or weights wi including the bias term w0. The solution to these simultaneous equations yields the weight vector w w0 , w1 , , wd T . The d 1 normal equations are
Y b 1 w 1 X 1 w 2 X 2 w d X d
w i X i D w
w0XT0X0w1XT0X1wd XT0Xd XT0Y w0XT1X0w1XT1X1wd XT1Xd XT1Y
. . w0XTdX0w1XTdX1wd XTdXd XTdY
23.21

600 Linear Regression which can be written compactly in matrix notation to solve for w as follows
X T0 X 0 X T1 X 0 . . .
X Td X 0
X T0 X 1 X T1 X 1 . . .
X Td X 1

X T0 X d
X T1 X d T
. . . w D Y
X Td X d
D T D w D T Y
w D TD 1D TY
More insight can be obtained by noting that the attribute vectors comprising the
This matches the expression in Eq. 23.20.
column space of D are not necessarily orthogonal, even if we assume they are linearly

independent. To obtain the projected vector Y , we first need to construct an orthogonal
basis for colD .
Let U0,U1, ,Ud denote the set of orthogonal basis vectors for colD . We can
construct these vectors in a stepwise manner via GramSchmidt orthogonalization, as follows
where
U0 X0
U1 X1 p10 U0
U2 X2 p20 U0 p21 U1
. .
Ud Xd pd0 U0 pd1 U1 pd,d1 Ud1
p j i p r o j U X j X jT U i i Ui2
23.22
denotes the scalar projection of attribute Xj onto the basis vector Ui. Essentially, to obtain Uj , we subtract from vector Xj its scalar projections along all previous basis vectors U0,U1, ,Uj1.
Rearranging the equations above, so that Xj is on the left hand side, we get
X0 U0
X1 p10 U0 U1
X2 p20 U0 p21 U1 U2
. .
Xd pd0 U0 pd1 U1 pd,d1 Ud1 Ud
The GramSchmidt method thus results in the socalled QRfactorization2 of the data matrix, namely D QR, where by construction Q is a n d 1 matrix withi
2 In QRfactorization, the matrix Q is orthogonal, with orthonormal columns, i.e., with orthogonal columns that are normalized to be unit length. However, we keep the basis vectors unnormalized for ease of presentation.

Multiple Regression 601 orthogonal columns
Q U0 U1 U2 Ud
and R is the d 1 d 1 uppertriangular matrix
1 p 1 0 p 2 0 p d 0 0 1 p21 pd1
R0 0 1 pd2 . . . … .
0 0 0 1 p d , d 1 00001
So that, in the column view the QRfactorization of the augmented data matrix is given
as:
0 1 p21 pd1
1 p 1 0 p 2 0 p d 0 X X X XU U U U0 01pd2
0 1 2 d 0 1 2 d . . . … . 0 0 0 1 pd,d1
D Q 00001

R
SincethenewbasisvectorsU0,U1,,Ud formanorthogonalbasisforthecolumn space of D , we can obtain the predicted response vector as a sum of the projections of Y along each new basis vector, given as
Y p r o j U 0 Y U 0 p r o j U 1 Y U 1 p r o j U d Y U d 2 3 . 2 3
Bias Term The geometric approach makes it easy to derive an expression for the bias term b. Note that each of the predictor attributes can be centered by removing its projection along the vector 1. Define Xi to be the centered attribute vector
Xi Xi Xi 1
All the centered vectors Xi lie in the n 1 dimensional space, comprising the orthogonal complement of the span of 1.
From the expression of Y, we have
Y b 1 w 1 X 1 w 2 X 2 w d X d
b1w1 X1 X1 1 wd Xd Xd 1
bw1X1 …wd Xd1w1X1…wd Xd
23.24
On the other hand, since 1 is orthogonal to all Xi , we can obtain another expression for Y in terms of the projection of Y onto the subspace spanned by the vectors

602 Linear Regression
1,X1, ,Xd. Let the new orthogonal basis for these centered attribute vectors be U0,U1,,Ud,whereU0 1.Thus,Y canalsobewrittenas
d i1
d i1
YprojU0YU0
In particular, equating the scalar projections along 1 in Eq. 23.24 and Eq. 23.25, we
projUiYUi proj1Y1 proj1YY bw1X1 …wd Xd, whichimplies
projUiYUi 23.25
get:
bY w1 X1 …wd Xd Y
where we use the fact that the scalar projection of any attribute vector onto 1 yields
the mean value of that attribute. For example,
YT1 1n
proj1Y 1T1 n yi Y i1
23.3.2 Multiple Regression Algorithm
The pseudocode for multiple regression is shown in Algorithm 23.1. The algorithm starts by the QRdecomposition of D QR, where Q is a matrix with orthogonal columns that make up an orthogonal basis, and R is an upper triangular matrix, which can be obtained via GramSchmidt orthogonalization. Note that, by construction, the matrix QTQ is given as
U02 0 0 QTQ 0 U12 0
0 0 … 0 0 0 Ud2
d i1
wi Xi 23.26
Algorithm 23.1: Multiple Regression Algorithm MULTIPLEREGRESSION D,Y:
1 D 1 Daugmenteddatawith X0 1Rn
2 Q,RQRfactorizationD QU0 U1 Ud
1 00 U0 2
0 1 0
3 1 U1 2 reciprocal squared norms
0 0 … 0 0 0 1
1 T Ud2
4 Rw Q Y solve for w by backsubstitution
5 YQ1QTY

Multiple Regression 603
We can observe that the matrix QTQ, denoted , is a diagonal matrix that contains the squared norms of the new orthogonal basis vectors U0 , U1 , , Ud .
Recall that the solution to multiple regression is given via the normal equations Eq. 23.21, which can be compactly written as D T w w D T Y Eq. 23.22; plugging in the QRdecomposition, we get
D T D w D T Y QRTQRw QRTY RTQTQRw RTQTY RTRw RTQTY R w Q T Y R w 1 Q T Y
Note that 1 is a diagonal matrix that records the reciprocal of the squared norms of the new basis vectors U0 , U1 , , Ud . Furthermore, since R is uppertriangular, it is straightforward to solve for w by backsubstitution. Note also that we can obtain the predicted vector Y as follows
Y D w QRR11QTY Q1QTY
It is interesting to note that 1QTY gives the vector of scalar projections of Y onto
each of the orthogonal basis vectors
projU0Y
1QTY projU1Y .
projUd Y
Therefore, Q1QTY, yields the projection formula in Eq. 23.23
projU0Y
YQprojU1Yproj YU proj YU proj YU
. U0 0 U1 1 Ud d projUdY
Example 23.4 Multiple Regression: QRFactorization and Geometric Approach.
Consider the multiple regression of sepal length X1 and petal length X2 on the response attribute petal width Y for the Iris dataset with n 150 points, as shown in Figure 23.5. The augmented dataset D R1503 comprises n 150 points along three attributes X0, X1, and X2, where X0 1. The GramSchmidt orthogonalization results in the following QRfactorization:
1 5.843 3.759 X0 X1 X2 U0 U1 U20 1 1.858
001
D Q R

604 Linear Regression Note that Q R1503 and therefore we do not show the matrix. The matrix , which
records the squared norms of the basis vectors, and its inverse matrix, is given as
150 0 0 0.00667 0 0 0 102.17 0 1 0 0.00979 0
0 0 111.35
We can use backsubstitution to solve for w , as follows
R w 1 Q T Y
1 5.843 3.759 w0 1.1987 0 1 1.858 w1 0.7538 0 0 1 w2 0.4499
In backsubstitution, we start with w2, which is easy to compute from the equation above; it is simply
Next, w1 is given as:
w2 0.4499
w1 1.858 w2 0.7538
w1 0.7538 0.8358 0.082 Finally, w0 can be computed as
w0 5.843w1 3.759w2 1.1987 w0 1.19870.47861.69110.0139
Thus, the multiple regression model is given as
Y 0 . 0 1 4 X 0 0 . 0 8 2 X 1 0 . 4 5 X 2
2 3 . 2 7
which matches the model in Example 23.3.
It is also instructive to construct the new basis vectors U0,U1, ,Ud in terms of
theoriginalattributesX0,X1,,Xd.SinceD QR,wehaveQD R1.Theinverse of R is also uppertriangular, and is given as
1 5.843 7.095 R1 0 1 1.858
001 Therefore, we can write Q in terms of the original attributes as
1 5.843 7.095 U0 U1 U2 X0 X1 X20 1 1.858
001
Q D R1

0 0 0.00898

Multiple Regression 605
which results in
U0 X0
U1 5.843X0X1
U2 7.095X01.858X1X2
The scalar projection of the response vector Y onto each of the new basis vectors yields:
projU0Y 1.199 projU1Y 0.754 projU2Y 0.45 Finally, the fitted response vector is given as:
Y projU0Y U0 projU1Y U1 projU2Y U2
1.199 X0 0.754 5.843 X0 X1 0.45 7.095 X0 1.858 X1 X2
1.1994.4063.193X00.7540.836X10.45X2 0.014 X0 0.082 X1 0.45 X2
which matches Eq. 23.27.
23.3.3 Multiple Regression: Stochastic Gradient Descent
Instead of using the QRfactorization approach to exactly solve the multiple regression problem, we can also employ the simpler stochastic gradient algorithm. Consider the SSE objective given in Eq. 23.19 multiplied by 12:
minSSE 1YTY2w TD TYw TD TD w w 2
The gradient of the SSE objective is given as
w SSED TYD TD w w
23.28
Using gradient descent, starting from an initial weight vector estimate w 0, we can iteratively update w as follows
w t1w tw w tD TYD w t
where w t is the estimate of the weight vector at step t.
In stochastic gradient descent SGD, we update the weight vector by considering
only one random point at each time. Restricting Eq. 23.28 to a single point x k in the training data D , the gradient at the point x k is given as
w x k x kyk x kx Tk w yk x Tk w x k

606
Linear Regression
Algorithm 23.2: Multiple Regression: Stochastic Gradient Descent MULTIPLE REGRESSION: SGD D,Y,,o:
1 2 3 4 5 6 7
8 9
D 1 D augment data
t 0 stepiteration counter
w t random vector in Rd1 initial weight vector repeat
foreach k 1,2, ,n in random order do
w x k y k x Tk w t x k c o m p u t e g r a d i e n t a t x k w t1 w t w x kupdateestimatefor w
tt1 untilwtwt1o
Therefore, the stochastic gradient update rule is given as w t 1 w t w x k
w t y k x Tk w t x k
Algorithm 23.2 shows the pseudocode for the stochastic gradient descent algorithm for multiple regression. After augmenting the data matrix, in each iteration it updates the weight vector by considering the gradient at each point in random order. The method stops when the weight vector converges based on the tolerance o.
Example 23.5 Multiple Regression: SGD. We continue Example 23.4 for multiple regression of sepal length X1 and petal length X2 on the response attribute petal width Y for the Iris dataset with n 150 points.
Using the exact approach the multiple regression model was given as Y 0 . 0 1 4 X 0 0 . 0 8 2 X 1 0 . 4 5 X 2
Using stochastic gradient descent we obtain the following model with 0.001 and o 0.0001:
Y 0 . 0 3 1 X 0 0 . 0 7 8 X 1 0 . 4 5 X 2
The results from the SGD approach as essentially the same as the exact method, with a slight difference in the bias term. The SSE value for the exact method is 6.179, whereas for SGD it is 6.181.
23.4 RIDGE REGRESSION
We have seen that for linear regression, the predicted response vector Y lies in the span of the column vectors comprising the augmented data matrix D . However, often the data is noisy and uncertain, and therefore instead of fitting the model to the data

Ridge Regression 607
exactly, it may be better to fit a more robust model. One way to achieve this is via r e g u l a r i z a t i o n , w h e r e w e c o n s t r a i n t h e s o l u t i o n v e c t o r w t o h a v e a s m a l l n o r m . In o t h e r words, instead of trying to simply minimize the squared residual error Y Y 2 , we add a regularization term involving the squared norm of the weight vector w 2 as follows:
min Jw YY2w 2 YD w 2w 2 23.29 w
Here 0 is a regularization constant that controls the tradeoff between minimizing the squared norm of the weight vector and the squared error. Recall that w 2 di1wi2 is the L2norm of w . For this reason ridge regression is also called L2 regularized regression. When 0, there is no regularization, but as increases there
is more emphasis on minimizing the regression coefficients.
The solve the new regularized objective we differentiate Eq. 23.29 with respect
23.30 23.31
23.32
to w and set the result to 0 to obtain
Jw YD w 2w 20
w w
YTY2w TD TYw TD TD w w Tw 0
w T T
2D Y2D Dw 2w 0
D TD Iw D TY Therefore, the optimal solution is
w D TD I1D TY
where I Rd 1d 1 is the identity matrix. The matrix D T D I is always invertible or nonsingular for 0 even if D T D is not invertible or singular. This is because ifi isaneigenvalueofD TD ,theni isaneigenvalueofD TD I.SinceD TD is positive semidefinite it has nonnegative eigenvalues. Thus, even if an eigenvalue of D TD iszero,e.g.,i0,thecorrespondingeigenvalueofD TD Iisi0.
Regularized regression is also called ridge regression since we add a ridge along the main diagonal of the D TD matrix, i.e., the optimal solution depends on the regularized matrix D TD I. Another advantage of regularization is that if we choose a small positive we are always guaranteed a solution.
Example 23.6 Ridge Regression. Figure 23.6 shows the scatterplot between the two attributes petal length X; the predictor variable and petal width Y; the response variable in the Iris dataset. There are a total of n 150 data points. The uncentered scatter matrix is given as
D TD 150.0 563.8 563.8 2583.0

608
Linear Regression
2.5 2.0 1.5 1.0 0.5
0 0.5

0 10
100

0 1.0
Figure 23.6. Scatterplot: petal length X versus petal width Y. Ridge regression lines for
0, 10, 100.
2.0
3.0
X: petal length
6.0
7.0
4.0 5.0
Using Eq. 23.32 we obtain different lines of best fit for different values of the regularization constant :
0 :Y 0.367 0.416 X, w 2 0.367, 0.416T2 0.308, SSE 6.34 10 :Y 0.244 0.388 X, w 2 0.244, 0.388T2 0.210, SSE 6.75 100 :Y 0.021 0.328 X, w 2 0.021, 0.328T2 0.108, SSE 9.97
Figure 23.6 shows these regularized regression lines. We can see that as increases there is more emphasis on minimizing the squared norm of w . However, since w 2 is more constrained as increases, the fit of the model decreases, as seen from the increase in SSE values.
Unpenalized Bias Term Often in L2 regularized regression we do not want to penalize the bias term w0, since it simply provides the intercept information, and there is no real justification to minimize it. To avoid penalizing the bias term, consider the new regularized objective with w w1,w2, ,wd T, and without w0, given as
min JwYw0 1Dw2 w2 23.33
w
d 2 d Y w 0 1 w i X i w i2
i1 i1
Y: petal width

Ridge Regression
609
Recall from Eq. 23.26 that the bias w0 b is given as
d
w0 bY wi Xi Y Tw
i1
where X1 , X2 , , Xd T is the multivariate mean of
Substituting w0 into the new L2 objective in Eq. 23.33, we get min JwYw0 1Dw2 w2
unaugmented D.
w
Therefore, we have
YY Tw1Dw2 w2 YY 1D1Tw2 w2
min JwYDw2w2 w
whereYYY1isthecenteredresponsevector,andDD1T isthecentered data matrix. In other words, we can exclude w0 from the L2 regularization objective by simply centering the response vector and the unaugmented data matrix.
23.34
Example 23.7 Ridge Regression: Unpenalized Bias. We continue from Exam ple 23.6. When we do not penalize w0, we obtain the following lines of best fit for different values of the regularization constant :
0:Y0.3650.416X 10:Y0.3330.408X 1 0 0 : Y 0 . 0 8 9 0 . 3 4 3 X
w02 w12 0.307 w02 w12 0.277 w 02 w 12 0 . 1 2 5
SSE6.34 SSE6.38 S S E 8 . 8 7
From Example 23.6, we observe that for 10, when we penalize w0, we obtain the following model:
10:Y0.2440.388X w02 w12 0.210 SSE6.75 As expected, we obtain a higher bias term when we do not penalize w0.
23.4.1 Ridge Regression: Stochastic Gradient Descent
Instead of inverting the matrix D T D I as called for in the exact ridge regression solution in Eq. 23.32, we can employ the stochastic gradient descent algorithm. Consider the gradient of the regularized objective given in Eq. 23.30, multiplied by 12 for convenience; we get:
w Jw D TYD TD w w 23.35 w

610
Linear Regression
Algorithm 23.3: Ridge Regression: Stochastic Gradient Descent RIDGE REGRESSION: SGD D,Y,,o:
1 2 3 4 5 6 7
8 9
D 1 D augment data
t 0 stepiteration counter
w t random vector in Rd1 initial weight vector repeat
foreach k 1,2, ,n in random order do w x kykx Tkw tx k w computegradientatx k
n
w t1 w t w x kupdateestimatefor w
tt1 untilwtwt1o
Using batch gradient descent, we can iteratively compute w as follows w t1w tw 1w tD TYD w t
In stochastic gradient descent SGD, we update the weight vector by considering only one random point at each time. Restricting Eq. 23.35 to a single point x k, the gradient at the point x k is given as
w x kx kyk x kx Tk w w yk x Tk w x k w 23.36 nn
Here, we scale the regularization constant by dividing it by n, the number of points in the training data, since the original ridge value is for all the n points, whereas we are now considering only one point at a time. Therefore, the stochastic gradient update rule is given as
w t 1 w t w x k 1 w t y k x Tk w t x k n
Algorithm 23.3 shows the pseudocode for the stochastic gradient descent algorithm for ridge regression. After augmenting the data matrix, in each iteration it updates the weight vector by considering the gradient at each point in random order. The method stops when the weight vector converges based on the tolerance o. The code is for the penalized bias case. It is easy to adapt it for unpenalized bias by centering the unaugmented data matrix and the response variable.
Example 23.8 Ridge Regression: SGD. We apply ridge regression on the Iris dataset n 150, using petal length X as the independent attribute, and petal width Y as the response variable. Using SGD with 0.001 and o 0.0001 we obtain different lines of best fit for different values of the regularization constant , which essentially match the results from the exact method in Example 23.6:
0 : Y 0 . 3 6 6 0 . 4 1 3 X S S E 6 . 3 7 10:Y0.2440.387X SSE6.76
1 0 0 : Y 0 . 0 2 2 0 . 3 2 7 X S S E 1 0 . 0 4

Kernel Regression 611
23.5 KERNEL REGRESSION
We now consider how to generalize linear regression to the nonlinear case, i.e., finding a nonlinear fit to the data to minimize the squared error, along with regularization. For this we will adopt the kernel trick, i.e., we will show that all relevant operations can be carried out via the kernel matrix in input space.
Let correspond to a mapping from the input space to the feature space, so that each point xi in feature space is a mapping for the input point xi. To avoid explicitly dealing with the bias term, we add the fixed value 1 as the first element of xi to obtain the augmented transformed point xi T 1 xi T . Let D denote the augmented dataset in feature space, comprising the transformed points xi for i 1, 2, , n. The augmented kernel function in feature space is given as
K xi,xj xiT xj1xiTxj1Kxi,xj
where Kxi , xj is the standard, unaugmented kernel function.
Let Y denote the observed response vector. Following Eq. 23.18, we model the
predicted response as
Y D w 2 3 . 3 7
where w is the augmented weight vector in feature space. The first element of w denotes the bias in feature space.
For regularized regression, we have to solve the following objective in feature space:
min Jw YY2w 2YD w 2w 2 23.38 w
where 0 is the regularization constant.
Taking the derivative of Jw with respect to w and setting it to the zero vector, we get
Jw YD w 2w 20 w w
YTY2w TD TYw TD TD w w Tw 0 w T T
2DY2DDw 2w 0 w D T Y D T D w
w D T 1 Y D w

n
w D T c c i x i
2 3 . 3 9 wherecc1,c2,,cnT 1YD w .Eq.23.39indicatesthattheweightvectorw
is a linear combination of the feature points, with c specifying the mixture coefficients for the points.
i1

612
Linear Regression
Rearranging the terms in the expression for c, we have c 1 Y D w

c Y D w
Now, plugging in the form of w from Eq. 23.39 we get c Y D D T c
D D T I c Y
c D D T I 1 Y
The optimal solution is therefore given as
cK I1Y
23.40
23.41 where I Rnn is the n n identity matrix, and D D T is the augmented kernel matrix
K , since
D D T xiT xji,j1,2,,n K xi,xji,j1,2,,n K
Putting it all together, we can substitute Eq. 23.41 and Eq. 23.39 into Eq. 23.37 to obtain the expression for the predicted response
Y D w D D T c
T 1 DD KI Y
1 K KI Y
23.42
where K K I1 is the kernel hat matrix. Notice that using 0 ensures that the inverse always exists, which is another advantage of using kernel ridge regression, in addition to the regularization.
Algorithm 23.4 shows the pseudocode for kernel regression. The main step is to compute the augmented kernel matrix K Rnn, and the vector of mixture coefficients c Rn. The predicted response on the training data is then given as Y K c. As for prediction for a new test point z, we use Eq. 23.39 to predict the response, which is given as:
n z T c i x i

y z T w z T D T c
i1
ci K z,xicTK z
n

i1
ci zT xi
n i1

Kernel Regression 613
Algorithm 23.4: Kernel Regression Algorithm KERNELREGRESSION D,Y,K,:
1 K Kxi,xj i,j1,…,n standard kernel matrix
2 K K 1 augmented kernel matrix
3 c K I1 Y compute mixture coefficients
4 Y K c
TESTING z,D,K,c:
5 K z 1Kz,xixiD
6 ycTK z
That is, we compute the vector K z comprising the augmented kernel values of z with respect to all of the data points in D, and take its dot product with the mixture coefficient vector c to obtain the predicted response.
Example23.9. ConsiderthenonlinearIrisdatasetshowninFigure23.7,obtainedvia a nonlinear transformation applied to the centered Iris data. In particular, the sepal length A1 and sepal width attributes A2 were transformed as follows:
XA2
Y 0.2A21 A2 0.1A1A2
We treat Y as the response variable and X is the independent attribute. The points show a clear quadratic nonlinear relationship between the two variables.
1.5
1.0
0.5
0
0.5
1.5 1.0 0.5
0 0.5 1.0 1.5 X

Figure 23.7. Kernel regression on nonlinear Iris dataset.
Y

614 Linear Regression We find the lines of best fit using both a linear and an inhomogeneous quadratic
kernel, with regularization constant 0.1. The linear kernel yields the following fit
Y 0 . 1 6 8 X
On the other hand, using the quadratic inhomogeneous kernel over X comprising constant 1, linear X, and quadratic terms X2, yields the fit
Y0.0860.026X0.922X2
The linear in gray and quadratic in black fit are both shown in Figure 23.7. The SSE error, Y Y2, is 13.82 for the linear kernel and 4.33 for the quadratic kernel. It is clear that the quadratic kernel as expected gives a much better fit to the data.
Example 23.10 Kernel Ridge Regression. Consider the Iris principal components dataset shown in Figure 23.8. Here X1 and X2 denote the first two principal components. The response variable Y is binary, with value 1 corresponding to Irisvirginica points on the top right, with Y value 1 and 0 corresponding to Irissetosa and Irisversicolor other two groups of points, with Y value 0.
X1

X1

Y

X2
a Linear Kernel
Y

X2
b Inhomogeneous Quadratic Kernel

Figure 23.8. Kernel ridge regression: linear and inhomogeneous quadratic kernels.

L1 Regression: Lasso 615
Figure 23.8a shows the fitted regression plane using a linear kernel with ridge value 0.01:
Y0.3330.167X10.074X2
Figure 23.8b shows the fitted model when we use an inhomogeneous quadratic
kernel with 0.01:
Y 0 . 0 3 0 . 1 6 7 X 1 0 . 1 8 6 X 2 0 . 0 9 2 X 21 0 . 1 X 1 X 2 0 . 0 2 9 X 2 2
The SSE error for the linear model is 15.47, whereas for the quadratic kernel it is 8.44, indicating a better fit for the training data.
23.6 L1 REGRESSION: LASSO
The Lasso, which stands for least absolute selection and shrinkage operator, is a regularization method that aims to sparsify the regression weights. Instead of using the L2 or Euclidean norm for weight regularization as in ridge regression see Eq. 23.34, the Lasso formulation uses the L1 norm for regularization
23.43
1 2
min Jw YDw w1
w2
where 0 is the regularization constant and
d w1 wi
i1
Note that, we have added the factor 1 for convenience; it does not change the objective.
2
Furthermore, we assume that the data comprising the d independent attributes X1, X2,
. . . , Xd , and the response attribute Y, have all been centered. That is, we assume that DD1T
Y Y Y 1
where1Rn isthevectorofallones,X1,X2,,XdT isthemultivariatemean for the data, and Y is the mean response value. One benefit of centering is that we do not have to explicitly deal with the bias term b w0 , which is important since we usually do not want to penalize b. Once the regression coefficients have been estimated, we can obtain the bias term via Eq. 23.26, as follows:
d j1
The main advantage of using the L1 norm is that it leads to sparsity in the solution vector. That is, whereas ridge regression reduces the value of the regression coefficients
b w0 Y
wj Xj

616
Linear Regression
3 2 1 0
1 2
Figure 23.9. Absolute value function in black and two of its subgradients in gray.
wi , they may remain small but still nonzero. On the other hand, L1 regression can drive the coefficients to zero, resulting in a more interpretable model, especially when there are many predictor attributes.
T h e L a s s o o b j e c t i v e c o m p r i s e s t w o p a r t s , t h e s q u a r e d e r r o r t e r m Y D w 2 w h i c h is convex and differentiable, and the L1 penalty term w1 di1wi, which is convex but unfortunately nondifferentiable at wi 0. This means that we cannot simply compute the gradient and set it to zero, as we did in the case of ridge regression. However, these kinds of problems can be solved via the generalized approach of subgradients.
23.6.1 Subgradients and Subdifferential Consider the absolute value function f : R R
fww
which is plotted in Figure 23.9 black line.
When w 0, we can see that its derivative is fw 1, and when w 0, its
derivative is f w 1. On the other hand, there is a discontinuity at w 0 where the derivative does not exist.
However, we use the concept of a subgradient that generalizes the notion of a derivative. For the absolute value function, the slope m of any line that passes through w 0 that remains below or touches the graph of f is called a subgradient of f at w 0. Figure 23.9 shows two such subgradients, namely the slopes m 0.5 and m 0.25. The corresponding lines are shown in gray. The set of all the subgradients at w is called the subdifferential, denoted as w. Thus, the subdifferential at w 0 is given as
w1,1
since only lines with slope between 1 and 1 remain below or partially coincide with the absolute value graph.
m0.25
m0.5
3 2 1 0 1 2 3 w
w

L1 Regression: Lasso 617 Considering all the cases, the subdifferential for the absolute value function is
given as:
1 iff w 0
w 1 iff w 0 23.44
1,1 iff w 0
We can see that when the derivative exists, the subdifferential is unique and corresponds to the derivative or gradient, and when the derivative does not exist the subdifferential corresponds to a set of subgradients.
23.6.2 Bivariate L1 Regression
We first consider bivariate L1 regression, where we have a single independent attribute X and a response attribute Y both centered. The bivariate regression model is given
as
y i w x i
The Lasso objective from Eq. 23.43 can then be written as
1 n
min Jw yi wxi2 w
23.45
w 2 i1
We can compute the subdifferential of this objective as follows:
1 n
Jw 2 2yi wxixiw
n
i1
x i y i w
n i1
x i2 w
i1
XTYwX2 w
We can solve for w by setting the subdifferential to zero; we get
Jw0
w X 2 w X TY wwXTY
23.46
where 1X2 0 is a scaling constant.
Corresponding to the three cases for the subdifferential of the absolute value
function in Eq. 23.44, we have three cases to consider: CaseIw0andw1: Inthiscaseweget
w X TY
S i n c e w 0 , t h i s i m p l i e s X T Y o r X T Y .

618 Linear Regression CaseIIw0andw1: Inthiscasewehave
w X TY
S i n c e w 0 , t h i s i m p l i e s X T Y o r X T Y .
CaseIIIw0andw1,1: Inthiscase,weget
w X TY , X TY
H o w e v e r , s i n c e w 0 , t h i s i m p l i e s X TY 0 a n d X TY 0 . I n o t h e r w o r d s , X T Y a n d X T Y . T h e r e f o r e , X T Y .
Let 0 be some fixed value. Define the softthreshold function S : R R as follows:
S z signz max0, z 23.47 Then the above three cases can be written compactly as:
w S X TY 2 3 . 4 8 with , where w is the optimal solution to the bivariate L1 regression problem
in Eq. 23.45.
23.6.3 Multiple L1 Regression
Consider the L1 regression objective from Eq. 23.43 1d 2
min Jw Y wiXi w1 w 2i1
1d dd d 2 YTY2 wiXiTY wiwjXiTXj wi
i1 i1 j1 i1
We generalize the bivariate solution to the multiple L1 formulation by optimizing for each wk individually, via the approach of cyclical coordinate descent. We rewrite the L1 objective by focusing only on the wk terms, and ignoring all terms not involving wk , which are assumed to be fixed:
1 d
min JwkwkXkTY wk2Xk2wk wjXkTXjwk 23.49
wk 2 jk

L1 Regression: Lasso 619 Setting the subdifferential of Jwk to zero, we get
Jwk 0
d
w k X k 2 w k X kT Y w k X k 2 w k X kT Y
j1 wkXk2wkwkXkT2XkT YDw
We can interpret the above equation as specifying an iterative solution for wk . In essence, we let the wk on the left hand side be the new estimate for wk, whereas we treat the wk on the right hand side as the previous estimate. More concretely, let w t represent the weight vector at step t, with wkt denoting the estimate for wk at time t. The new estimate for wk at step t 1 is then given as
t1 1 t1 t 1 T t wk Xk2 wk wkXk2 Xk YDw
wt1 wt1wt XTYDwt k kkk
j k d
w j X kT X j
w j X kT X j w k X kT X k
23.50 where 1Xk2 0 is just a scaling constant. Based on the three cases for wt1
and the subdifferential wt1, following a similar approach as in the bivariate case k
Eq. 23.48, the new estimate for wk can be written compactly as t1 t T t
The pseudocode for L1 regression is shown in Algorithm 23.5. The algorithm starts with a random estimate for w at step t 0, and then cycles through each dimension to estimate wk until convergence. Interestingly, the term XkTY Dwt is in fact the gradient at wk of the squared error term in the Lasso objective, and thus the update equation is the same as gradient descent with step size , followed by the softthreshold operator. Note also that since is just a positive scaling constant, we make it a parameter of the algorithm, denoting the step size for gradient descent.
k
wk S wkXk YDw 23.51
Example 23.11 L1 Regression. We apply L1 regression to the full Iris dataset with n 150 points, and four independent attributes, namely sepalwidth X1, sepallength X2, petalwidth X3, and petallength X4. The Iris type attribute comprises the response variable Y. There are three Iris types, namely Irissetosa, Irisversicolor, and Irisvirginica, which are coded as 0, 1 and 2, respectively.

620
Linear Regression
Algorithm 23.5: L1 Regression Algorithm: Lasso L1REGRESSION D,Y,,,o:
1 2
3 4 5 6 7 8 9 10
11 12 13
meanD compute mean DD1Tcenterthedata
Y YY 1 center the response
t 0 stepiteration counter
w t random vector in Rd initial weight vector repeat
foreach k 1,2, ,d do
w kt X kT Y D w t c o m p u t e g r a d i e n t a t w k
wt1 wt wtupdateestimatefor w kkk k
wt1 S wt1 apply softthreshold function k k
tt1
untilwtwt1o
bY wtTcomputethebiasterm
The L1 regression estimates for different values of with 0.0001 are shown below:
0: Y0.1920.109X10.045X20.226X30.612X4
1: Y0.0770.076X10.015X20.253X30.516X4
5: Y0.5530.0X10.0X20.359X30.170X4 SSE8.82
10: Y0.5750.0X10.0X20.419X30.0X4
The L1 norm values for the weight vectors excluding the bias term are 0.992, 0.86, 0.529, and 0.419, respectively. It is interesting to note the sparsity inducing effect of Lasso, as observed for 5 and 10, which drives some of the regression coefficients to zero.
We can contrast the coefficients for L2 ridge and L1 Lasso regression by comparing models with the same level of squared error. For example, for 5, the L1 model has SSE 8.82. We adjust the ridge value in L2 regression, with 35 resulting in a similar SSE value. The two models are given as follows:
L1 : Y0.5530.0X10.0X20.359X30.170X4 w1 0.529 L2 : Y0.3940.019X10.051X20.316X30.212X4 w1 0.598
where we exclude the bias term when computing the L1 norm for the weights. We can observe that for L2 regression the coefficients for X1 and X2 are small, and therefore less important, but they are not zero. On the other hand, for L1 regression, the coefficients for attributes X1 and X2 are exactly zero, leaving only X3 and X4; Lasso can thus act as an automatic feature selection approach.
SSE6.96 SSE7.09
SSE10.15

Further Reading 621
23.7 FURTHER READING
For a geometrical approach to multivariate statistics see Wickens 2014, and Saville and Wood 2012. For a description of the class of generalized linear models, of which linear regression is a special case, see Agresti 2015. An excellent overview of Lasso and sparsity based methods is given in Hastie, Tibshirani, and Wainwright 2015. For a description of cyclical coordinate descent for L1 regression, and also other approaches for sparse statistical models see Hastie, Tibshirani, and Wainwright 2015.
Agresti, A. 2015. Foundations of Linear and Generalized Linear Models. Hoboken, NJ: John Wiley Sons.
Hastie, T., Tibshirani, R., and Wainwright, M. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton, FL: CRC press.
Saville, D. J. and Wood, G. R. 2012. Statistical Methods: The Geometric Approach. New York: Springer Science Business Media.
Wickens, T. D. 2014. The Geometry of Multivariate Statistics. New York: Psychology Press, Taylor Francis Group.
23.8 EXERCISES
Q1. Consider the data in Table 23.1, with Y as the response variable and X as the independent variable. Answer the following questions:
a Compute the predicted response vector Y for least square regression using the
geometric approach.
b Basedonthegeometricapproachextractthevalueofthebiasandslope,andgive
the equation of the best fitting regression line. Table 23.1. Data for Q1
X
Y
5 0 2 1 2
2 1 1 1 0
Q2. Given data in Table 23.2, let 0.5 be the regularization constant. Compute the equation for ridge regression of Y on X, where both the bias and slope are
penalized. Use the fact that the inverse of the matrix A a b is given as
1 1db cd A detA c a ,withdetAadbc.

622 Linear Regression Table 23.2. Data for Q2
X
Y
1 2 4 6
1 3 4 3
Q3. ShowthatEq.23.11holds,thatis
ni 1 x i y i n X Y ni 1 x i X y i Y
w ni1xi2n2X ni1xi X2
Q4. Derive an expression for the bias term b and the weights w w1,w2, ,wdT in
multiple regression using Eq. 23.16, without adding the augmented column.
Q5. Show analytically i.e., without using the geometry that the bias term in multiple regression, as shown in Eq. 23.26, is given as
w0Yw1X1 w2X2 wdXd
Q6. ShowthatYTo0.
Q7. Provethato2Y2Y2.
Q8. Show that if i is an eigenvalue of DTD corresponding to the eigenvector ui, then
i is an eigenvalue of DT D I for the same eigenvector ui .
Q9. Show that 1QTY gives the vector of scalar projections of Y onto each of the
orthogonal basis vectors in multiple regression projU0Y
1QTY projU1Y .
projUd Y
Q10. FormulateasolutiontotheridgeregressionproblemviaQRfactorization.
Q11. Derive the solution for the weight vector w w1,w2, ,wdT and bias b w0 in ridge regression without subtracting the means from Y and the independent attributes X1 , X2 , , Xd , and without adding the augmented column.
Q12. Showthatthesolutionforridgeregressionandkernelridgeregressionareexactlythe same when we use a linear kernel.
Q13. ShowthatwSXTYEq.23.48isthesolutionforbivariateL1regression.
Q14. Derive the three cases for the subdifferential in Eq. 23.50, and show that they
correspond to the softthreshold update in Eq. 23.51.
Q15. Show that that the gradient at wk of the SSE term in the L1 formulation is given as
wk 1YDw2XkTYDw wk 2
where Y is the response vector, and Xk is the kth predictor vector.

CHAPTER 24 LogisticRegression
GivenasetofpredictorattributesorindependentvariablesX1,X2,,Xd,andgivena categorical response or dependent variable Y, the aim of logistic regression is to predict the probability of the response variable values based on the independent variables. Logistic regression is in fact a classification technique, that given a point xj Rd predicts P ci xj for each class ci in the domain of Y the set of possible classes or values for the response variable. In this chapter, we first consider the binary class problem, where the response variable takes on one of two classes 0 and 1, or positive and negative, and so on. Next, we consider the multiclass case, where there are K classes for the response variable.
24.1 BINARY LOGISTIC REGRESSION
In logistic regression, we are given a set of d predictor or independent variables X1,X2, ,Xd, and a binary or Bernoulli response variable Y that takes on only two values, namely, 0 and 1. Thus, we are given a training dataset D comprising n points xi Rd and the corresponding observed values yi 0, 1. As done in Chapter 23, we augment the data matrix D by adding a new attribute X0 that is always fixed at the value 1foreachpoint,sothatx i 1,×1,x2,,xdT Rd1 denotestheaugmentedpoint,and the multivariate random vector X , comprising all the independent attributes is given as X X0,X1, ,XdT. The augmented training dataset is given as D comprising the n augmentedpointsx i alongwiththeclasslabelsyi fori1,2,,n.
Since there are only two outcomes for the response variable Y, its probability mass function for X x is given as:
PY1X x x PY0X x 1x
where x is the unknown true parameter value, denoting the probability of Y 1
given X x . Further, note that the expected value of Y given X x is
EYX x 1PY1X x 0PY0X x P Y 1 X x x
623

624 Logistic Regression
Therefore, in logistic regression, instead of directly predicting the response value, the goal is to learn the probability, P Y 1X x , which is also the expected value of Y g i v e n X x .
Since P Y 1X x is a probability, it is not appropriate to directly use the linear regression model
fx 0 x0 1 x1 2 x2 d xd Tx
w h e r e 0 , 1 , , d T R d 1 i s t h e t r u e a u g m e n t e d w e i g h t v e c t o r , w i t h 0 t h e true unknown bias term, and i the true unknown regression coefficient or weight for attributeXi.ThereasonwecannotsimplyusePY1X x fx isduetothefact that f x can be arbitrarily large or arbitrarily small, whereas for logistic regression, we require that the output represents a probability value, and thus we need a model that results in an output that lies in the interval 0,1. The name logistic regression comes from the logistic function also called the sigmoid function that meets this requirement. It is defined as follows:
z 1 expz 24.1 1 expz 1 expz
The logistic function squashes the output to be between 0 and 1 for any scalar input z see Figure 24.1. The output z can therefore be interpreted as a probability.
Example 24.1 Logistic Function. Figure 24.1 shows the plot for the logistic func tion for z ranging from to . In particular consider what happens when z is , and 0; we have
1
1 exp
1
1 exp
1 1 1
1 0
1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
0
0
z
Figure 24.1. Logistic function.
z

Binary Logistic Regression 625
0 1 1 0.5 1 exp0 2
As desired, z lies in the range 0, 1, and z 0 is the threshold value in the sense that for z 0 we have z 0.5, and for z 0, we have z 0.5. Thus, interpreting z as a probability, the larger the z value, the higher the probability.
Another interesting property of the logistic function is that
1z1 expz 1expzexpz 1 z 24.2 1 expz 1 expz 1 expz
Using the logistic function, we define the logistic regression model as follows: PY1X x x fx Tx exp Tx 24.3
1 e x p T x
Thus, the probability that the response is Y 1 is the output of the logistic function for
the input T x . On the other hand, the probability for Y 0 is given as PY0X x 1PY1X x Tx 1
1 e x p T x w h e r e w e u s e d E q . 2 4 . 2 , t h a t i s , 1 z z f o r z T x .
Combining these two cases the full logistic regression model is given as PYX x Tx Y Tx 1Y
24.4
since Y is a Bernoulli random variable that takes on either the value 1 or 0. We can observe that PY X x Tx when Y 1 and PY X x Tx when Y 0, as desired.
LogOdds Ratio
Define the odds ratio for the occurrence of Y 1 as follows: oddsY1X x PY1X x Tx
PY0X x Tx exp Tx 1 exp Tx
1 e x p T x e x p T x
The logarithm of the odds ratio, called the logodds ratio, is therefore given as: lnoddsY1X x ln PY1X x lnexp Tx Tx
2 4 . 5
24.6
1 P Y 1 X x 0 x0 1 x1 d xd

626 Logistic Regression The logodds ratio function is also called the logit function, defined as
logitz ln z 1z
It is the inverse of the logistic function. We can see that lnoddsY1X x logitPY1X x
The logistic regression model is therefore based on the assumption that the logodds ratio for Y 1 given X x is a linear function or a weighted sum of the independent attributes. In particular, let us consider the effect of attribute Xi by fixing the values for all other attributes in Eq. 24.6; we get
lnoddsY1X x i xi C
oddsY1X x expi xi Cexpi xiexpC expi xi
where C is a constant comprising the fixed attributes. The regression coefficient i can therefore be interpreted as the change in the logodds ratio for Y 1 for a unit change in Xi , or equivalently the odds ratio for Y 1 increases exponentially per unit change in Xi.
24.1.1 Maximum Likelihood Estimation
Let D be the augmented training dataset comprising the n augmented points x i along with their labels yi. Let w w0,w1, ,wdT be the augmented weight vector for estimating . Note that w0 b denotes the estimated bias term, and wi the estimated weight for attribute Xi. We will use the maximum likelihood approach to learn the weight vector w . Likelihood is defined as the probability of the observed data given the estimated parameters w . We assume that the binary response variables yi are all independent. Therefore, the likelihood of the observed responses is given as
n
n T T
ln Lw yi ln w x i 1yiln w x i 24.7
i1
The negative of the loglikelihood can also be considered as an error function, the
crossentropy error function, given as follows:
n 1 1
Ew ln Lw yi ln w Tx i 1yiln 1w Tx i 24.8 i1
The task of maximizing the loglikelihood is therefore equivalent to minimizing the crossentropy error.
w Tx iyi w Tx i1yi
Instead of trying to maximize the likelihood, we can maximize the logarithm of the
Lw PYw
i1
Pyix i
likelihood, called loglikelihood, to convert the product into a summation as follows:
n i1

Binary Logistic Regression 627
Typically, to obtain the optimal weight vector w , we would differentiate the loglikelihood function with respect to w , set the result to 0, and then solve for w . However, for the loglikelihood formulation in Eq. 24.7 there is no closed form solution to compute the weight vector w . Instead, we use an iterative gradient ascent method to compute the optimal value, since Eq. 24.7 is a concave function, and has a unique global optimal.
The gradient ascent method relies on the gradient of the log-likelihood function, which can be obtained by taking its partial derivative with respect to w̃, as follows:

∇(w̃) = ∂/∂w̃ ( ln L(w̃) ) = ∂/∂w̃ ( Σ_{i=1}^n y_i·ln θ(z_i) + (1 − y_i)·ln θ(−z_i) )    (24.9)

where z_i = w̃^T x̃_i. We use the chain rule to obtain the derivative of ln θ(z_i) with respect to w̃. We note the following facts:

∂ ln θ(z_i) / ∂ θ(z_i) = 1 / θ(z_i)

∂ ln θ(−z_i) / ∂ θ(z_i) = ∂ ln(1 − θ(z_i)) / ∂ θ(z_i) = −1 / (1 − θ(z_i)) = −1 / θ(−z_i)

∂ θ(z_i) / ∂ z_i = θ(z_i)·θ(−z_i)

∂ z_i / ∂ w̃ = ∂ (w̃^T x̃_i) / ∂ w̃ = x̃_i

As per the chain rule, we have

∂ ln θ(z_i) / ∂ w̃ = ( ∂ ln θ(z_i)/∂θ(z_i) )·( ∂θ(z_i)/∂z_i )·( ∂z_i/∂w̃ ) = (1/θ(z_i))·θ(z_i)·θ(−z_i)·x̃_i = θ(−z_i)·x̃_i    (24.10)

Likewise, using the chain rule, we have

∂ ln θ(−z_i) / ∂ w̃ = ( ∂ ln θ(−z_i)/∂θ(z_i) )·( ∂θ(z_i)/∂z_i )·( ∂z_i/∂w̃ ) = −(1/θ(−z_i))·θ(z_i)·θ(−z_i)·x̃_i = −θ(z_i)·x̃_i

Substituting the above into Eq. (24.9) we get

∇(w̃) = Σ_{i=1}^n y_i·θ(−z_i)·x̃_i − (1 − y_i)·θ(z_i)·x̃_i
     = Σ_{i=1}^n y_i·( θ(−z_i) + θ(z_i) )·x̃_i − θ(z_i)·x̃_i        since θ(z_i) + θ(−z_i) = 1
     = Σ_{i=1}^n ( y_i − θ(z_i) )·x̃_i
     = Σ_{i=1}^n ( y_i − θ(w̃^T x̃_i) )·x̃_i    (24.11)
Algorithm 24.1: Logistic Regression: Stochastic Gradient Ascent

LOGISTIC-REGRESSION-SGA (D, η, ε):
1  foreach x_i ∈ D do x̃_i^T ← (1, x_i^T)  // map to R^{d+1}
2  t ← 0  // step/iteration counter
3  w̃^t ← (0, ..., 0)^T ∈ R^{d+1}  // initial weight vector
4  repeat
5      w̃ ← w̃^t  // make a copy of w̃^t
6      foreach x̃_i ∈ D̃ in random order do
7          ∇(w̃, x̃_i) ← ( y_i − θ(w̃^T x̃_i) )·x̃_i  // compute gradient at x̃_i
8          w̃ ← w̃ + η·∇(w̃, x̃_i)  // update estimate for w̃
9      w̃^{t+1} ← w̃  // update w̃^{t+1}
10     t ← t + 1
11 until ‖w̃^t − w̃^{t−1}‖ ≤ ε
The gradient ascent method starts at some initial estimate for w̃, denoted w̃^0. At each step t, the method moves in the direction of steepest ascent, which is given by the gradient vector. Thus, given the current estimate w̃^t, we can obtain the next estimate as follows:

w̃^{t+1} = w̃^t + η·∇(w̃^t)    (24.12)

Here, η > 0 is a user-specified parameter called the learning rate. It should not be too large, otherwise the estimates will vary wildly from one iteration to the next, and it should not be too small, otherwise it will take a long time to converge. At the optimal value of w̃ the gradient will be zero, i.e., ∇(w̃) = 0, as desired.
Stochastic Gradient Ascent
The gradient ascent method computes the gradient by considering all the data points, and it is therefore called batch gradient ascent. For large datasets, it is typically much faster to compute the gradient by considering only one (randomly chosen) point at a time. The weight vector is updated after each such partial gradient step, giving rise to stochastic gradient ascent (SGA) for computing the optimal weight vector w̃.

The pseudo-code for SGA-based logistic regression is shown in Algorithm 24.1. Given a randomly chosen point x̃_i, the point-specific gradient (see Eq. (24.11)) is given as

∇(w̃, x̃_i) = ( y_i − θ(w̃^T x̃_i) )·x̃_i    (24.13)
Unlike batch gradient ascent that updates w by considering all the points, in stochastic gradient ascent the weight vector is updated after observing each point, and the updated values are used immediately in the next update. Computing the full gradient in the batch approach can be very expensive. In contrast, computing the partial gradient at each point is very fast, and due to the stochastic updates to w , typically SGA is much faster than the batch approach for very large datasets.
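The following compact numpy sketch follows the SGA procedure just described, under the stated assumptions (augmented points, sigmoid link, updates in random order). The function names, the default parameter values, and the exact convergence test are illustrative choices, not part of the pseudo-code above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sga(X, y, eta=0.01, eps=1e-4, max_epochs=1000, rng=None):
    """Stochastic gradient ascent for binary logistic regression.
    X: (n, d) data matrix, y: (n,) vector of 0/1 labels."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])      # augment points with a bias column
    w = np.zeros(d + 1)
    for _ in range(max_epochs):
        w_prev = w.copy()
        for i in rng.permutation(n):          # points in random order
            grad = (y[i] - sigmoid(w @ Xa[i])) * Xa[i]   # Eq. (24.13)
            w = w + eta * grad
        if np.linalg.norm(w - w_prev) <= eps:  # stop when the weights stabilize
            break
    return w

def predict(w, X):
    Xa = np.hstack([np.ones((len(X), 1)), X])
    return (sigmoid(Xa @ w) >= 0.5).astype(int)   # threshold at 0.5, Eq. (24.14)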

Figure 24.2. Logistic versus linear regression: Iris principal components data. (a) Logistic regression; (b) linear regression. Misclassified points are shown in dark gray. Circles denote Iris-virginica and triangles denote the other two Iris types.
Once the model has been trained, we can predict the response for any new augmented test point z̃ as follows:

ŷ = 1 if θ(w̃^T z̃) ≥ 0.5,  and  ŷ = 0 if θ(w̃^T z̃) < 0.5    (24.14)
Example 24.2 (Logistic Regression). Figure 24.2(a) shows the output of logistic regression modeling on the Iris principal components data, where the independent attributes X1 and X2 represent the first two principal components, and the binary response variable Y represents the type of Iris flower; Y = 1 corresponds to Iris-virginica, whereas Y = 0 corresponds to the two other Iris types, namely Iris-setosa and Iris-versicolor. The fitted logistic model is given as

w̃ = (w_0, w_1, w_2)^T = (−6.79, 5.07, −3.29)^T

P(Y = 1 | x̃) = θ(w̃^T x̃) = 1 / ( 1 + exp(6.79 − 5.07·x1 + 3.29·x2) )

Figure 24.2(a) plots P(Y = 1 | x̃) for various values of x̃.

Given x̃, if P(Y = 1 | x̃) ≥ 0.5, then we predict ŷ = 1, otherwise we predict ŷ = 0. Figure 24.2(a) shows that five points (shown in dark gray) are misclassified. For example, for x̃ = (1, 0.52, −1.19)^T we have:

P(Y = 1 | x̃) = θ(w̃^T x̃) = θ(−0.24) = 0.44        P(Y = 0 | x̃) = 1 − P(Y = 1 | x̃) = 0.56

Thus, the predicted response for x̃ is ŷ = 0, whereas the true class is y = 1.
Figure 24.2 also contrasts logistic versus linear regression. The plane of best fit in linear regression is shown in Figure 24.2(b), with the weight vector

w = (0.333, 0.167, 0.074)^T        ŷ = f(x) = 0.333 + 0.167·x1 + 0.074·x2

Since the response variable Y is binary, we predict the response class as ŷ = 1 if f(x) ≥ 0.5, and ŷ = 0 otherwise. The linear regression model results in 17 points being misclassified (dark gray points in Figure 24.2(b)). Since there are n = 150 points in total, this results in a training set (in-sample) accuracy of 88.7% for linear regression. On the other hand, logistic regression misclassifies only 5 points, for an in-sample accuracy of 96.7%, which is a much better fit, as is also apparent in Figure 24.2.
24.2 MULTICLASS LOGISTIC REGRESSION
We now generalize logistic regression to the case where the response variable Y can take on K distinct nominal categorical values, called classes, i.e., Y ∈ {c1, c2, ..., cK}. We model Y as a K-dimensional multivariate Bernoulli random variable (see Section 3.1.2). Since Y can assume only one of the K values, we use the one-hot encoding approach to map each categorical value c_i to the K-dimensional binary vector

e_i = (0, ..., 0, 1, 0, ..., 0)^T

with i−1 zeros before and K−i zeros after the 1, so that the ith element e_ii = 1 and all other elements e_ij = 0, and Σ_{j=1}^K e_ij = 1. Henceforth, we assume that the categorical response variable Y is a multivariate Bernoulli variable Y ∈ {e1, e2, ..., eK}, with Y_j referring to the jth component of Y.
The probability mass function for Y given X = x is

P(Y = e_i | X = x) = π_i(x),  for i = 1, 2, ..., K

where π_i(x) is the (unknown) probability of observing class c_i given X = x. Thus, there are K unknown parameters, which must satisfy the following constraint:

Σ_{i=1}^K π_i(x) = Σ_{i=1}^K P(Y = e_i | X = x) = 1

Given that only one element of Y is 1, the probability mass function of Y can be written compactly as

P(Y | X = x) = Π_{j=1}^K π_j(x)^(Y_j)    (24.15)

Note that if Y = e_i, only Y_i = 1 and the rest of the elements Y_j = 0 for j ≠ i.
In multiclass logistic regression, we select one of the values, say c_K, as a reference or base class, and consider the log-odds ratio of the other classes with respect to c_K. We assume that each of these log-odds ratios is linear in x, but with a different augmented weight vector ω̃_i for class c_i. That is, the log-odds ratio of class c_i with respect to class c_K is assumed to satisfy

ln( odds(Y = e_i | X = x) ) = ln( P(Y = e_i | X = x) / P(Y = e_K | X = x) ) = ln( π_i(x) / π_K(x) )
                            = ω̃_i^T x̃ = ω_i0·x_0 + ω_i1·x_1 + ... + ω_id·x_d

where ω_i0 = β_i is the true bias value for class c_i. We can rewrite the above set of equations as follows:

π_i(x) / π_K(x) = exp(ω̃_i^T x̃)   ⟹   π_i(x) = exp(ω̃_i^T x̃)·π_K(x)    (24.16)

Given that Σ_{j=1}^K π_j(x) = 1, we have

Σ_{j≠K} exp(ω̃_j^T x̃)·π_K(x) + π_K(x) = 1   ⟹   π_K(x) = 1 / ( 1 + Σ_{j≠K} exp(ω̃_j^T x̃) )

where π_K(x) = P(Y = e_K | X = x). Substituting the above into Eq. (24.16), we have

π_i(x) = exp(ω̃_i^T x̃)·π_K(x) = exp(ω̃_i^T x̃) / ( 1 + Σ_{j≠K} exp(ω̃_j^T x̃) )

Finally, setting ω̃_K = 0, we have exp(ω̃_K^T x̃) = 1, and thus we can write the full model for multiclass logistic regression as follows:

π_i(x) = exp(ω̃_i^T x̃) / Σ_{j=1}^K exp(ω̃_j^T x̃),  for all i = 1, 2, ..., K    (24.17)

This function is also called the softmax function. When K = 2, this formulation yields exactly the same model as binary logistic regression.
It is also interesting to note that the choice of the reference class is not important, since we can derive the log-odds ratio for any two classes c_i and c_j as follows:

ln( π_i(x) / π_j(x) ) = ln( (π_i(x)/π_K(x)) · (π_K(x)/π_j(x)) )
                      = ln( π_i(x)/π_K(x) ) − ln( π_j(x)/π_K(x) )
                      = ω̃_i^T x̃ − ω̃_j^T x̃ = (ω̃_i − ω̃_j)^T x̃

That is, the log-odds ratio between any two classes can be computed from the difference of the corresponding weight vectors.
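The softmax probabilities of Eq. (24.17) can be computed as in the following sketch; the max-subtraction step is a standard numerical-stability device not discussed in the text, and all names and numeric values are illustrative.

import numpy as np

def softmax_probs(W, x_aug):
    """W: (K, d+1) matrix of augmented weight vectors (row K may be zeros for
    the base class); x_aug: augmented point (1, x1, ..., xd).
    Returns pi_i(x) for i = 1, ..., K as in Eq. (24.17)."""
    scores = W @ x_aug
    scores = scores - scores.max()       # stability trick; the ratios are unchanged
    e = np.exp(scores)
    return e / e.sum()

# the log-odds between classes i and j equals (w_i - w_j)^T x_aug
W = np.array([[0.5, 1.0, -1.0],
              [-0.2, 0.3, 0.8],
              [0.0, 0.0, 0.0]])          # hypothetical weights, base class last
x = np.array([1.0, 2.0, -0.5])
p = softmax_probs(W, x)
print(np.log(p[0] / p[1]), (W[0] - W[1]) @ x)   # equal up to rounding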
24.2.1 Maximum Likelihood Estimation
Let D̃ be the augmented dataset comprising the n points x̃_i and their labels y_i. We assume that y_i is a one-hot encoded (multivariate Bernoulli) response vector, so that y_ij denotes the jth element of y_i. For example, if y_i = e_a, then y_ia = 1 and y_ij = 0 for all j ≠ a. We assume that all the y_i's are independent. Let w̃_i ∈ R^{d+1} denote the estimated augmented weight vector for class c_i, with w_i0 = b_i denoting the bias term.
To find the K sets of regression weight vectors w̃_i, for i = 1, 2, ..., K, we use the gradient ascent approach to maximize the log-likelihood function. The likelihood of the data is given as

L(W̃) = P(Y | W̃) = Π_{i=1}^n P(y_i | x̃_i) = Π_{i=1}^n Π_{j=1}^K π_j(x̃_i)^(y_ij)

where W̃ = {w̃_1, w̃_2, ..., w̃_K} is the set of K weight vectors. The log-likelihood is then given as:

ln L(W̃) = Σ_{i=1}^n Σ_{j=1}^K y_ij·ln π_j(x̃_i) = Σ_{i=1}^n Σ_{j=1}^K y_ij·ln( exp(w̃_j^T x̃_i) / Σ_{a=1}^K exp(w̃_a^T x̃_i) )    (24.18)

Note that the negative of the log-likelihood function can be regarded as an error function, commonly known as the cross-entropy error.
We note the following facts:

∂ ln π_j(x̃_i) / ∂ π_j(x̃_i) = 1 / π_j(x̃_i)

∂ π_j(x̃_i) / ∂ w̃_a = π_a(x̃_i)·(1 − π_a(x̃_i))·x̃_i    if j = a
∂ π_j(x̃_i) / ∂ w̃_a = −π_a(x̃_i)·π_j(x̃_i)·x̃_i         if j ≠ a

Let us consider the gradient of the log-likelihood function with respect to the weight vector w̃_a:

∇(w̃_a) = ∂ ln L(W̃) / ∂ w̃_a
        = Σ_{i=1}^n Σ_{j=1}^K y_ij·( ∂ ln π_j(x̃_i)/∂π_j(x̃_i) )·( ∂π_j(x̃_i)/∂w̃_a )
        = Σ_{i=1}^n ( y_ia·(1/π_a(x̃_i))·π_a(x̃_i)·(1 − π_a(x̃_i))·x̃_i + Σ_{j≠a} y_ij·(1/π_j(x̃_i))·(−π_a(x̃_i)·π_j(x̃_i))·x̃_i )
        = Σ_{i=1}^n ( y_ia·(1 − π_a(x̃_i)) − Σ_{j≠a} y_ij·π_a(x̃_i) )·x̃_i
        = Σ_{i=1}^n ( y_ia − π_a(x̃_i)·Σ_{j=1}^K y_ij )·x̃_i
        = Σ_{i=1}^n ( y_ia − π_a(x̃_i) )·x̃_i

The last step follows from the fact that Σ_{j=1}^K y_ij = 1, since only one element of y_i can be 1.
For stochastic gradient ascent, we update the weight vectors by considering only one point at a time. The gradient of the log-likelihood function with respect to w̃_j at a given point x̃_i is given as

∇(w̃_j, x̃_i) = ( y_ij − π_j(x̃_i) )·x̃_i    (24.19)

which results in the following update rule for the jth weight vector:

w̃_j^{t+1} = w̃_j^t + η·∇(w̃_j^t, x̃_i)    (24.20)

where w̃_j^t denotes the estimate of w̃_j at step t, and η is the learning rate. The pseudo-code for the stochastic gradient ascent method for multiclass logistic regression is shown in Algorithm 24.2. Notice that the weight vector for the base class c_K is never updated; it remains w̃_K = 0, as required.

Once the model has been trained, we can predict the class for any new augmented test point z̃ as follows:

ŷ = argmax_{c_i} { π_i(z̃) } = argmax_{c_i} { exp(w̃_i^T z̃) / Σ_{j=1}^K exp(w̃_j^T z̃) }    (24.21)

That is, we evaluate the softmax function, and then predict the class of z̃ as the one with the highest probability.

Algorithm 24.2: Multiclass Logistic Regression: Stochastic Gradient Ascent

LOGISTIC-REGRESSION-MULTICLASS (D, η, ε):
1  foreach (x_i^T, y_i) ∈ D do
2      x̃_i^T ← (1, x_i^T)  // map to R^{d+1}
3      y_i ← e_j if y_i = c_j  // map y_i to K-dimensional Bernoulli vector
4  t ← 0  // step/iteration counter
5  foreach j = 1, 2, ..., K do
6      w̃_j^t ← (0, ..., 0)^T ∈ R^{d+1}  // initial weight vector
7  repeat
8      foreach j = 1, 2, ..., K−1 do
9          w̃_j ← w̃_j^t  // make a copy of w̃_j^t
10     foreach x̃_i ∈ D̃ in random order do
11         foreach j = 1, 2, ..., K−1 do
12             π_j(x̃_i) ← exp(w̃_j^T x̃_i) / Σ_{a=1}^K exp(w̃_a^T x̃_i)
13             ∇(w̃_j, x̃_i) ← ( y_ij − π_j(x̃_i) )·x̃_i  // compute gradient at w̃_j
14             w̃_j ← w̃_j + η·∇(w̃_j, x̃_i)  // update estimate for w̃_j
15     foreach j = 1, 2, ..., K−1 do
16         w̃_j^{t+1} ← w̃_j  // update w̃_j^{t+1}
17     t ← t + 1
18 until Σ_{j=1}^{K−1} ‖w̃_j^t − w̃_j^{t−1}‖ ≤ ε
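The following numpy sketch mirrors Algorithm 24.2 under the conventions above (one-hot labels, base class c_K held at zero weights). The names, default values, and the exact convergence test are illustrative.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def multiclass_sga(X, Y, K, eta=0.01, eps=1e-4, max_epochs=1000, rng=None):
    """X: (n, d) data; Y: (n, K) one-hot label matrix; returns a (K, d+1) weight
    matrix. The last row (base class c_K) is never updated and stays zero."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])          # augmented points
    W = np.zeros((K, d + 1))
    for _ in range(max_epochs):
        W_prev = W.copy()
        for i in rng.permutation(n):
            pi = softmax(W @ Xa[i])                      # class probabilities
            for j in range(K - 1):                       # skip the base class
                W[j] += eta * (Y[i, j] - pi[j]) * Xa[i]  # Eqs. (24.19)-(24.20)
        if np.linalg.norm(W - W_prev) <= eps:
            break
    return W

def predict(W, X):
    Xa = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(Xa @ W.T, axis=1)                   # Eq. (24.21)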
Figure 24.3. Multiclass logistic regression: Iris principal components data. Misclassified points are shown in dark gray. All the points actually lie in the (X1, X2) plane, but classes c1 and c2 are shown displaced along Y with respect to the base class c3 purely for illustration purposes.
Example 24.3. Consider the Iris dataset, with n = 150 points in a 2D space spanned by the first two principal components, as shown in Figure 24.3. Here, the response variable takes on three values: Y = c1 corresponds to Iris-setosa (shown as squares), Y = c2 to Iris-versicolor (circles), and Y = c3 to Iris-virginica (triangles). Thus, we map Y = c1 to e1 = (1, 0, 0)^T, Y = c2 to e2 = (0, 1, 0)^T, and Y = c3 to e3 = (0, 0, 1)^T.

The multiclass logistic model uses Y = c3 (Iris-virginica; triangles) as the reference or base class. The fitted model is given as:

w̃_1 = (−3.52, 3.62, 2.61)^T
w̃_2 = (−6.95, 5.18, −3.40)^T
w̃_3 = (0, 0, 0)^T
Figure 24.3 plots the decision surfaces corresponding to the softmax functions:

π_1(x) = exp(w̃_1^T x̃) / ( 1 + exp(w̃_1^T x̃) + exp(w̃_2^T x̃) )
π_2(x) = exp(w̃_2^T x̃) / ( 1 + exp(w̃_1^T x̃) + exp(w̃_2^T x̃) )
π_3(x) = 1 / ( 1 + exp(w̃_1^T x̃) + exp(w̃_2^T x̃) )

The surfaces indicate regions where one class dominates over the others. It is important to note that the points for c1 and c2 are shown displaced along Y to emphasize the contrast with c3, which is the reference class.

Overall, the training set accuracy for the multiclass logistic classifier is 96.7%, since it misclassifies only five points (shown in dark gray). For example, for the point x̃ = (1, 0.52, −1.19)^T, we have:

π_1(x̃) ≈ 0        π_2(x̃) = 0.448        π_3(x̃) = 0.552

Thus, the predicted class is ŷ = argmax_{c_i} { π_i(x̃) } = c3, whereas the true class is y = c2.
24.3 FURTHER READING
For a description of the class of generalized linear models, of which logistic regression is a special case, see Agresti (2015).

Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Hoboken, NJ: John Wiley & Sons.
24.4 EXERCISES
Q1. Show that ∂θ(z)/∂z = θ(z)·θ(−z), where θ is the logistic function.

Q2. Show that the logit function is the inverse of the logistic function.

Q3. Given the softmax function

π_j(x) = exp(w_j^T x) / Σ_{i=1}^K exp(w_i^T x)

show that

∂π_j(x)/∂w_a = π_a(x)·(1 − π_a(x))·x    if j = a
∂π_j(x)/∂w_a = −π_a(x)·π_j(x)·x         if j ≠ a
CHAPTER 25 Neural Networks
Artificial neural networks, or simply neural networks, are inspired by biological neuronal networks. A real biological neuron, or nerve cell, comprises dendrites, a cell body, and an axon that leads to synaptic terminals. A neuron transmits information via electrochemical signals. When there is enough concentration of ions at the dendrites of a neuron, it generates an electric pulse along its axon called an action potential, which in turn activates the synaptic terminals, releasing more ions and thus causing the information to flow to the dendrites of other neurons. A human brain has on the order of 100 billion neurons, with each neuron having between 1,000 and 10,000 connections to other neurons. Thus, a human brain is a neuronal network with 100 trillion to a quadrillion (10^15) interconnections! Interestingly, as far as we know, learning happens by adjusting the synaptic strengths, since synaptic signals can be either excitatory or inhibitory, making the post-synaptic neuron either more or less likely to generate an action potential, respectively.
Artificial neural networks comprise abstract neurons that try to mimic real neurons at a very high level. They can be described via a weighted directed graph G = (V, E), with each node v_i ∈ V representing a neuron, and each directed edge (v_i, v_j) ∈ E representing a synaptic-to-dendritic connection from v_i to v_j. The weight of the edge, w_ij, denotes the synaptic strength. Neural networks are characterized by the type of activation function used to generate an output, and by the architecture of the network in terms of how the nodes are interconnected, for example, whether the graph is a directed acyclic graph or has cycles, whether the graph is layered or not, and so on. It is important to note that a neural network is designed to represent and learn information by adjusting the synaptic weights.
25.1 ARTIFICIAL NEURON: ACTIVATION FUNCTIONS
An artificial neuron acts as a processing unit that first aggregates the incoming signals via a weighted sum, and then applies some function to generate an output. For example, a binary neuron will output a 1 whenever the combined signal exceeds a threshold, and 0 otherwise.
Figure 25.1. Artificial neuron: aggregation and activation.

Figure 25.1 shows the schematic of a neuron z_k that has incoming edges from neurons x_1, ..., x_d. For simplicity, both the name and the output value of a neuron are denoted by the same symbol. Thus, x_i denotes neuron i, and also the value of that neuron. The net input at z_k, denoted net_k, is given as the weighted sum

net_k = b_k + Σ_{i=1}^d w_ik·x_i = b_k + w_k^T x    (25.1)

where w_k = (w_1k, w_2k, ..., w_dk)^T ∈ R^d and x = (x_1, x_2, ..., x_d)^T ∈ R^d is an input point. Notice that x_0 is a special bias neuron whose value is always fixed at 1, and the weight from x_0 to z_k is b_k, which specifies the bias term for the neuron. Finally, the output value of z_k is given as some activation function, f, applied to the net input at z_k:

z_k = f(net_k)

The value z_k is then passed along the outgoing edges from z_k to other neurons. Neurons differ based on the type of activation function used. Some commonly used activation functions, illustrated in Figure 25.2, are:

Linear/Identity Function: Identity is the simplest activation function; it simply returns its argument, and is also called a linear activation function:

f(net_k) = net_k    (25.2)

Figure 25.2(a) plots the identity function. To examine the role of the bias term, note that net_k ≥ 0 is equivalent to w^T x ≥ −b_k. That is, the output transitions from negative to positive when the weighted sum of the inputs exceeds −b_k, as shown in Figure 25.2(b).

Step Function: This is a binary activation function, where the neuron outputs a 0 if the net value is negative or zero, and 1 if the net value is positive (see Figure 25.2(c)).

f(net_k) = 0 if net_k ≤ 0;  1 if net_k > 0    (25.3)

Figure 25.2. Neuron activation functions, also illustrating the effect of bias: (a) linear z_k versus net_k, (b) linear z_k versus w^T x, (c) step z_k versus net_k, (d) step z_k versus w^T x, (e) rectified linear z_k versus net_k, (f) rectified linear z_k versus w^T x, (g) sigmoid z_k versus net_k, (h) sigmoid z_k versus w^T x, (i) hyperbolic tangent z_k versus net_k, (j) hyperbolic tangent z_k versus w^T x.

It is interesting to note that the transition from 0 to 1 happens when the weighted sum of the inputs exceeds −b_k, as shown in Figure 25.2(d).

Rectified Linear Unit (ReLU): Here, the neuron remains inactive if the net input is less than or equal to zero, and then increases linearly with net_k, as shown in Figure 25.2(e).

f(net_k) = 0 if net_k ≤ 0;  net_k if net_k > 0    (25.4)

An alternative expression for the ReLU activation is f(net_k) = max{0, net_k}. The transition from zero to linear output happens when the weighted sum of the inputs exceeds −b_k (see Figure 25.2(f)).

Sigmoid: The sigmoid function, illustrated in Figure 25.2(g), squashes its input so that the output ranges between 0 and 1:

f(net_k) = 1 / (1 + exp(−net_k))    (25.5)

When net_k = 0, we have f(net_k) = 0.5, which implies that the transition point where the output crosses 0.5 happens when the weighted sum of the inputs exceeds −b_k (see Figure 25.2(h)).

Hyperbolic Tangent (tanh): The hyperbolic tangent or tanh function is similar to the sigmoid, but its output ranges between −1 and +1 (see Figure 25.2(i)).

f(net_k) = ( exp(net_k) − exp(−net_k) ) / ( exp(net_k) + exp(−net_k) ) = ( exp(2·net_k) − 1 ) / ( exp(2·net_k) + 1 )    (25.6)

When net_k = 0, we have f(net_k) = 0, which implies that the output transitions from negative to positive when the weighted sum of the inputs exceeds −b_k, as shown in Figure 25.2(j).
Softmax: Softmax is a generalization of the sigmoid or logistic activation function. It is mainly used at the output layer of a neural network and, unlike the other functions, its value depends not only on the net input at neuron k, but on the net signal at all the other neurons in the output layer. Thus, given the net input vector net = (net_1, net_2, ..., net_p)^T for all the p output neurons, the output of the softmax function for the kth neuron is given as

f(net_k | net) = exp(net_k) / Σ_{i=1}^p exp(net_i)    (25.7)

Figure 25.3. Softmax: net_k versus net_j.

Figure 25.3 plots the softmax activation for net_k versus the net signal net_j, with all other net values fixed at zero. The output behaves similarly to a sigmoid curve for any given value of net_j.
25.1.1 Derivatives for Activation Functions

For learning using a neural network, we need to consider the derivative of an activation function with respect to its argument. The derivatives for the activation functions are given as follows:

Identity/Linear: The identity or linear activation function has a derivative of 1 with respect to its argument, giving us:

∂f(net_j)/∂net_j = 1    (25.8)

Step: The step function has a derivative of 0 everywhere except for the discontinuity at 0, where the derivative is ∞.

ReLU: The ReLU function (Eq. (25.4)) is non-differentiable at 0; nevertheless, for other values its derivative is 0 if net_j < 0 and 1 if net_j > 0. At 0, we can set the derivative to be any value in the range [0, 1], a simple choice being 0. Putting it all together, we have

∂f(net_j)/∂net_j = 0 if net_j ≤ 0;  1 if net_j > 0    (25.9)

Even though ReLU has a discontinuity at 0, it is a popular choice for training deep neural networks.
Sigmoid: The derivative of the sigmoid function (Eq. (25.5)) is given as

∂f(net_j)/∂net_j = f(net_j)·(1 − f(net_j))    (25.10)

Hyperbolic Tangent: The derivative of the tanh function (Eq. (25.6)) is given as

∂f(net_j)/∂net_j = 1 − f(net_j)^2    (25.11)
Softmax: The softmax activation function (Eq. (25.7)) is a vector-valued function, which maps a vector input net = (net_1, net_2, ..., net_p)^T to a vector of probability values. Softmax is typically used only for the output layer. The partial derivative of f(net_j | net) with respect to net_j is given as

∂f(net_j | net)/∂net_j = f(net_j)·(1 − f(net_j))

whereas the partial derivative of f(net_j | net) with respect to net_k, with k ≠ j, is given as

∂f(net_j | net)/∂net_k = −f(net_k)·f(net_j)

Since softmax is used at the output layer, if we denote the ith output neuron as o_i, then f(net_i) = o_i, and we can write the derivative as:

∂f(net_j | net)/∂net_k = o_j·(1 − o_j) if k = j;  −o_k·o_j if k ≠ j    (25.12)
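For reference, the activation functions and derivatives listed above can be collected in a small numpy module, as in the sketch below. The names are illustrative, and the ReLU derivative at 0 is set to 0, one of the admissible choices mentioned in the text.

import numpy as np

def identity(net):      return net
def d_identity(net):    return np.ones_like(net)             # Eq. (25.8)

def relu(net):          return np.maximum(0.0, net)          # Eq. (25.4)
def d_relu(net):        return (net > 0).astype(float)       # Eq. (25.9), 0 at net = 0

def sigmoid(net):       return 1.0 / (1.0 + np.exp(-net))    # Eq. (25.5)
def d_sigmoid(net):
    f = sigmoid(net)
    return f * (1.0 - f)                                     # Eq. (25.10)

def tanh_act(net):      return np.tanh(net)                  # Eq. (25.6)
def d_tanh(net):        return 1.0 - np.tanh(net) ** 2       # Eq. (25.11)

def softmax(net):
    e = np.exp(net - net.max())
    return e / e.sum()                                       # Eq. (25.7)

def softmax_jacobian(net):
    o = softmax(net)
    return np.diag(o) - np.outer(o, o)                       # Eq. (25.12)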
25.2 NEURAL NETWORKS: REGRESSION AND CLASSIFICATION
Networks of artificial neurons are capable of representing and learning arbitrarily
complex functions for both regression and classification tasks.
25.2.1 Regression
Consider the multiple linear regression problem, where given an input x_i ∈ R^d, the goal is to predict the response as follows:

ŷ_i = b + w_1·x_i1 + w_2·x_i2 + ... + w_d·x_id

Here, b is the bias term, and w_j is the regression coefficient or weight for attribute X_j. Given a training dataset D comprising n points x_i in a d-dimensional space, along with their corresponding true response values y_i, the bias and weights for linear regression are chosen so as to minimize the sum of squared errors between the true and predicted responses over all data points:

SSE = Σ_{i=1}^n ( y_i − ŷ_i )^2

As shown in Figure 25.4(a), a neural network with d+1 input neurons x_0, x_1, ..., x_d, including the bias neuron x_0 = 1, and a single output neuron o, all with identity activation functions and with ŷ = o, represents the exact same model as multiple linear regression. Whereas the multiple regression problem has a closed-form solution, neural networks learn the bias and weights via a gradient descent approach that minimizes the squared error.

Neural networks can just as easily model the multivariate linear regression task, where we have a p-dimensional response vector y_i ∈ R^p instead of a single value y_i. That is, the training data D comprises n points x_i ∈ R^d and their true response vectors y_i ∈ R^p.

Figure 25.4. Linear and logistic regression via neural networks: (a) single output, (b) multiple outputs.
As shown in Figure 25.4(b), multivariate regression can be modeled by a neural network with d+1 input neurons and p output neurons o_1, o_2, ..., o_p, with all input and output neurons using the identity activation function. A neural network learns the weights by comparing its predicted output ŷ = o = (o_1, o_2, ..., o_p)^T with the true response vector y = (y_1, y_2, ..., y_p)^T. That is, training happens by first computing the error function or loss function between o and y. Recall that a loss function assigns a score or penalty for predicting the output to be o when the desired output is y; when the prediction matches the true output the loss should be zero. The most common loss function for regression is the squared error function

E_x = (1/2)·‖y − o‖^2 = (1/2)·Σ_{j=1}^p ( y_j − o_j )^2

where E_x denotes the error on input x. Across all the points in a dataset, the total sum of squared errors is

E = Σ_{i=1}^n E_{x_i} = (1/2)·Σ_{i=1}^n ‖y_i − o_i‖^2
Example 25.1 (Neural Networks for Multiple and Multivariate Regression). Consider the multiple regression of sepal length (X1) and petal length (X2) on the dependent attribute petal width for the Iris dataset with n = 150 points. From Example 23.3 we find that the solution is given as

ŷ = −0.014 − 0.082·x1 + 0.45·x2

The squared error for this optimal solution is 6.179 on the training data.

Using the neural network in Figure 25.4(a), with linear activation for the output and minimizing the squared error via gradient descent, results in the learned parameters b = −0.0096, w1 = −0.087 and w2 = 0.452, yielding the regression model

o = −0.0096 − 0.087·x1 + 0.452·x2

with a squared error of 6.18, which is very close to the optimal solution.

Multivariate Linear Regression: For multivariate regression, we use the neural network architecture in Figure 25.4(b) to learn the weights and biases for the Iris dataset, where we use sepal length and sepal width as the independent attributes, and petal length and petal width as the response or dependent attributes. Therefore, each input point x_i is 2-dimensional, and the true response vector y_i is also 2-dimensional. That is, d = 2 and p = 2 specify the size of the input and output layers. Minimizing the squared error via gradient descent yields the following parameters:

b = (b1, b2)^T = (−1.83, −1.47)^T
W = | w11  w12 | = |  1.72   0.72 |
    | w21  w22 |   | −1.46  −0.50 |

o1 = −1.83 + 1.72·x1 − 1.46·x2
o2 = −1.47 + 0.72·x1 − 0.50·x2

The squared error on the training set is 84.9. Optimal least squares multivariate regression yields a squared error of 84.16, with the following parameters:

ŷ1 = −2.56 + 1.78·x1 − 1.34·x2
ŷ2 = −1.59 + 0.73·x1 − 0.48·x2
25.2.2 Classification
Networks of artificial neurons can also learn to classify the inputs. Consider the binary classification problem, where y = 1 denotes that the point belongs to the positive class, and y = 0 means that it belongs to the negative class. Recall that in logistic regression, we model the probability of the positive class via the logistic or sigmoid function:

π(x) = P(y = 1 | x) = 1 / ( 1 + exp(−(b + w^T x)) )

where b is the bias term and w = (w1, w2, ..., wd)^T is the vector of estimated weights or regression coefficients. On the other hand,

P(y = 0 | x) = 1 − P(y = 1 | x) = 1 − π(x)

A simple change to the neural network shown in Figure 25.4(a) allows it to solve the logistic regression problem. All we have to do is use a sigmoid activation function at the output neuron o, and use the cross-entropy error instead of the squared error. Given input x, true response y, and predicted response o, recall that the cross-entropy error (see Eq. (24.8)) is defined as

E_x = −( y·ln(o) + (1 − y)·ln(1 − o) )

Thus, with sigmoid activation, the output of the neural network in Figure 25.4(a) is given as

o = f(net_o) = sigmoid(b + w^T x) = 1 / ( 1 + exp(−(b + w^T x)) ) = π(x)

which is the same as the logistic regression model.

Multiclass Logistic Regression: In a similar manner, the multiple-output neural network architecture shown in Figure 25.4(b) can be used for multiclass or nominal logistic regression. For the general classification problem with K classes c1, c2, ..., cK, the true response y is encoded as a one-hot vector. Thus, class c1 is encoded as e1 = (1, 0, ..., 0)^T, class c2 as e2 = (0, 1, ..., 0)^T, and so on, with e_i ∈ {0, 1}^K for i = 1, 2, ..., K. Thus, we encode y as a multivariate vector y ∈ {e1, e2, ..., eK}. Recall that in multiclass logistic regression (see Section 24.2) the task is to estimate the per-class bias b_i and weight vector w_i ∈ R^d, with the last class c_K used as the base class with fixed bias b_K = 0 and fixed weight vector w_K = (0, 0, ..., 0)^T ∈ R^d. The probability vector across all K classes is modeled via the softmax function (see Eq. (24.17)):

π_i(x) = exp(b_i + w_i^T x) / Σ_{j=1}^K exp(b_j + w_j^T x),  for all i = 1, 2, ..., K

Therefore, the neural network shown in Figure 25.4(b) with p = K can solve the multiclass logistic regression task, provided we use a softmax activation at the outputs, and use the K-way cross-entropy error (see Eq. (24.18)), defined as

E_x = −( y1·ln(o1) + ... + yK·ln(oK) )

where x is an input vector, o = (o1, o2, ..., oK)^T is the predicted response vector, and y = (y1, y2, ..., yK)^T is the true response vector. Note that only one element of y is 1, and the rest are 0, due to the one-hot encoding. With softmax activation, the output of the neural network in Figure 25.4(b) with p = K is given as

o_i = P(y = e_i | x) = f(net_i | net) = exp(net_i) / Σ_{j=1}^p exp(net_j) = π_i(x)

which matches the multiclass logistic regression task. The only restriction we have to impose on the neural network is that the weights on edges into the last output neuron should be zero, to model the base class weights w_K. However, in practice, we can relax this restriction and just learn a regular weight vector for class c_K.
Example 25.2 (Logistic Regression: Binary and Multiclass). We applied the neural network in Figure 25.4(a), with logistic activation at the output neuron and cross-entropy error function, on the Iris principal components dataset. The output is a binary response indicating Iris-virginica (Y = 1) or one of the other Iris types (Y = 0). As expected, the neural network learns an identical set of weights and bias as shown for the logistic regression model in Example 24.2, namely:

o = θ( −6.79 + 5.07·x1 − 3.29·x2 )

Next, we applied the neural network in Figure 25.4(b), using a softmax activation and cross-entropy error function, to the Iris principal components data with three classes: Iris-setosa (Y = 1), Iris-versicolor (Y = 2) and Iris-virginica (Y = 3). Thus, we need K = 3 output neurons, o1, o2, and o3.
Figure 25.5. Neural networks for multiclass logistic regression: Iris principal components data. Misclassified points are shown in dark gray. Points in classes c1 and c2 are shown displaced with respect to the base class c3 only for illustration.
Further, to obtain the same model as in the multiclass logistic regression from Example 24.3, we fix the incoming weights and bias for output neuron o3 to be zero. The model is given as

o1 = −3.49 + 3.61·x1 + 2.65·x2
o2 = −6.95 + 5.18·x1 − 3.40·x2
o3 = 0 + 0·x1 + 0·x2

which is essentially the same as in Example 24.3. If we do not constrain the weights and bias for o3, we obtain the following model:

o1 = 0.89 + 4.54·x1 + 1.96·x2
o2 = 3.38 + 5.11·x1 + 2.88·x2
o3 = 4.24 + 0.52·x1 + 0.92·x2

The classification decision surface for each class is illustrated in Figure 25.5. The points in class c1 are shown as squares, c2 as circles, and c3 as triangles. This figure should be contrasted with the decision boundaries shown for multiclass logistic regression in Figure 24.3, which has the weights and bias set to 0 for the base class c3.
25.2.3 Error Functions
Typically, for a regression task we use the squared error as the loss function, whereas for classification we use the cross-entropy loss function. Furthermore, when training neural networks we will require the partial derivatives of the error function with respect to the output neurons. Thus, the commonly used error functions and their derivatives are listed below:

Squared Error: Given an input vector x ∈ R^d, the squared error loss function measures the squared deviation between the predicted output vector o ∈ R^p and the true response y ∈ R^p, defined as follows:

E_x = (1/2)·‖y − o‖^2 = (1/2)·Σ_{j=1}^p ( y_j − o_j )^2    (25.13)

where E_x denotes the error on input x. The partial derivative of the squared error function with respect to a particular output neuron o_j is

∂E_x/∂o_j = (1/2)·2·(y_j − o_j)·(−1) = o_j − y_j    (25.14)

Across all the output neurons, we can write this as

∂E_x/∂o = o − y    (25.15)

Cross-Entropy Error: For classification tasks with K classes c1, c2, ..., cK, we usually set the number of output neurons p = K, with one output neuron per class. Furthermore, each of the classes is coded as a one-hot vector, with class c_i encoded as the ith standard basis vector e_i = (e_i1, e_i2, ..., e_iK)^T ∈ {0, 1}^K, with e_ii = 1 and e_ij = 0 for all j ≠ i. Thus, given input x ∈ R^d with the true response y = (y1, y2, ..., yK)^T, where y ∈ {e1, e2, ..., eK}, the cross-entropy loss is defined as

E_x = −Σ_{i=1}^K y_i·ln(o_i) = −( y1·ln(o1) + ... + yK·ln(oK) )    (25.16)

Note that only one element of y is 1 and the rest are 0 due to the one-hot encoding. That is, if y = e_i, then only y_i = 1, and the other elements y_j = 0 for j ≠ i. The partial derivative of the cross-entropy error function with respect to a particular output neuron o_j is

∂E_x/∂o_j = −y_j / o_j    (25.17)

The vector of partial derivatives of the error function with respect to the output neurons is therefore given as

∂E_x/∂o = ( ∂E_x/∂o1, ∂E_x/∂o2, ..., ∂E_x/∂oK )^T = −( y1/o1, y2/o2, ..., yK/oK )^T    (25.18)

Binary Cross-Entropy Error: For classification tasks with binary classes, it is typical to encode the positive class as 1 and the negative class as 0, as opposed to using a one-hot encoding as in the general K-class case. Given an input x ∈ R^d, with true response y ∈ {0, 1}, there is only one output neuron o. Therefore, the binary cross-entropy error is defined as

E_x = −( y·ln(o) + (1 − y)·ln(1 − o) )    (25.19)

Here y is either 1 or 0. The partial derivative of the binary cross-entropy error function with respect to the output neuron o is

∂E_x/∂o = −∂/∂o ( y·ln(o) + (1 − y)·ln(1 − o) )
        = −( y/o − (1 − y)/(1 − o) )
        = ( −y·(1 − o) + (1 − y)·o ) / ( o·(1 − o) )
        = ( o − y ) / ( o·(1 − o) )    (25.20)
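A small sketch of these three loss functions and their output-layer derivatives, matching Eqs. (25.13) through (25.20); the function names are illustrative.

import numpy as np

def squared_error(y, o):
    return 0.5 * np.sum((y - o) ** 2)          # Eq. (25.13)

def d_squared_error(y, o):
    return o - y                               # Eq. (25.15)

def cross_entropy(y, o):
    return -np.sum(y * np.log(o))              # Eq. (25.16), y is one-hot

def d_cross_entropy(y, o):
    return -y / o                              # Eq. (25.18)

def binary_cross_entropy(y, o):
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))   # Eq. (25.19)

def d_binary_cross_entropy(y, o):
    return (o - y) / (o * (1 - o))             # Eq. (25.20)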
25.3 MULTILAYER PERCEPTRON: ONE HIDDEN LAYER
A multilayer perceptron MLP is a neural network that has distinct layers of neurons. The inputs to the neural network comprise the input layer, and the final outputs from the MLP comprise the output layer. Any intermediate layer is called a hidden layer, and an MLP can have one or many hidden layers. Networks with many hidden layers are called deep neural networks. An MLP is also a feedforward network. That is, information flows in the forward direction, and from a layer only to the subsequent layer. Thus, information flows from the input to the first hidden layer, from the first to the second hidden layer, and so on, until it reaches the output layer from the last hidden layer. Typically, MLPs are fully connected between layers. That is, each neuron in the input layer is connected to all the neurons in the first hidden layer, and each neuron in the first hidden layer is connected to all neurons in the second hidden layer, and so on, and finally, each neuron in the last hidden layer is connected to all neurons in the output layer.
For ease of explanation, in this section we will consider an MLP with only one hidden layer, and we will later generalize the discussion to deep MLPs. For example, Figure 25.6 shows an MLP with one hidden layer. The input layer has d neurons, x1, x2, ..., xd, and an additional neuron x0 that specifies the biases for the hidden layer. The hidden layer has m neurons, z1, z2, ..., zm, and an additional neuron z0 that specifies the biases for the output neurons. Finally, the output layer has p neurons, o1, o2, ..., op. The bias neurons have no incoming edges, since their value is always fixed at 1. Thus, in total there are d·m + m·p weight parameters w_ij and a further m + p bias parameters b_i that need to be learned by the neural network. These parameters correspond to the total number of edges in the MLP.
25.3.1 Feedforward Phase
Let D denote the training dataset, comprising n input points x_i ∈ R^d and corresponding true response vectors y_i ∈ R^p.

Figure 25.6. Multilayer perceptron with one hidden layer. Input and output links for neuron z_k are shown in bold. Neurons x_0 and z_0 are bias neurons.
For each pair (x, y) in the data, in the feedforward phase, the point x = (x1, x2, ..., xd)^T ∈ R^d is supplied as an input to the MLP. The input neurons do not use any activation function; they simply pass along the supplied input values as their output values. This is equivalent to saying that the net at input neuron k is net_k = x_k, and the activation function is the identity function f(net_k) = net_k, so that the output value of neuron k is simply x_k.

Given the input neuron values, we compute the output value for each hidden neuron z_k as follows:

z_k = f(net_k) = f( b_k + Σ_{i=1}^d w_ik·x_i )

where f is some activation function, and w_ik denotes the weight between input neuron x_i and hidden neuron z_k. Next, given the hidden neuron values, we compute the value for each output neuron o_j as follows:

o_j = f(net_j) = f( b_j + Σ_{i=1}^m w_ij·z_i )

where w_ij denotes the weight between hidden neuron z_i and output neuron o_j.
We can write the feedforward phase as a series of matrix-vector operations. For this we define the d×m matrix W_h comprising the weights between input and hidden layer neurons, and the vector b_h ∈ R^m comprising the bias terms for the hidden layer neurons, given as

W_h = | w_11  w_12  ...  w_1m |        b_h = ( b_1, b_2, ..., b_m )^T    (25.21)
      | w_21  w_22  ...  w_2m |
      |  ...   ...  ...   ... |
      | w_d1  w_d2  ...  w_dm |

where w_ij denotes the weight on the edge between input neuron x_i and hidden neuron z_j, and b_i denotes the bias weight from x_0 to z_i. The net input and the output for all the hidden layer neurons can be computed via a matrix-vector multiplication operation, as follows:

net_h = b_h + W_h^T x    (25.22)
z = f(net_h) = f( b_h + W_h^T x )    (25.23)

Here, net_h = (net_1, ..., net_m)^T denotes the net input at each hidden neuron (excluding the bias neuron z_0, whose value is always fixed at z_0 = 1), and z = (z_1, z_2, ..., z_m)^T denotes the vector of hidden neuron values. The activation function f applies to, or distributes over, each element of net_h, i.e., f(net_h) = (f(net_1), ..., f(net_m))^T ∈ R^m. Typically, all neurons in a given layer use the same activation function, but they can also be different if desired.
Likewise, let W_o ∈ R^{m×p} denote the weight matrix between the hidden and output layers, and let b_o ∈ R^p be the bias vector for the output neurons, given as

W_o = | w_11  w_12  ...  w_1p |        b_o = ( b_1, b_2, ..., b_p )^T    (25.24)
      | w_21  w_22  ...  w_2p |
      |  ...   ...  ...   ... |
      | w_m1  w_m2  ...  w_mp |

where w_ij denotes the weight on the edge between hidden neuron z_i and output neuron o_j, and b_i the bias weight between z_0 and output neuron o_i. The output vector can then be computed as follows:

net_o = b_o + W_o^T z    (25.25)
o = f(net_o) = f( b_o + W_o^T z )    (25.26)

To summarize, for a given input x ∈ D with desired response y, an MLP computes the output vector via the feedforward process, as follows:

o = f( b_o + W_o^T z ) = f( b_o + W_o^T f( b_h + W_h^T x ) )    (25.27)

where o = (o1, o2, ..., op)^T is the vector of predicted outputs from the single hidden layer MLP.
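As a sketch, the feedforward pass of Eqs. (25.22) through (25.27) can be written in a few lines of numpy, assuming sigmoid activations at both layers; the shapes follow the definitions of W_h, b_h, W_o, b_o above, and the function names are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, Wh, bh, Wo, bo):
    """x: (d,) input; Wh: (d, m), bh: (m,), Wo: (m, p), bo: (p,).
    Returns the hidden vector z and the output vector o."""
    net_h = bh + Wh.T @ x          # Eq. (25.22)
    z = sigmoid(net_h)             # Eq. (25.23)
    net_o = bo + Wo.T @ z          # Eq. (25.25)
    o = sigmoid(net_o)             # Eq. (25.26)
    return z, o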
25.3.2 Backpropagation Phase
Backpropagation is the algorithm used to learn the weights between successive layers in an MLP. The name comes from the manner in which the error gradient is propagated backwards from the output to the input layer via the hidden layers. For simplicity of exposition, we will consider backpropagation for an MLP with a single hidden layer of m neurons, with the squared error function, and with sigmoid activations for all neurons. We will later generalize to multiple hidden layers, and to other error and activation functions.

Let D denote the training dataset, comprising n input points x_i = (x_i1, x_i2, ..., x_id)^T ∈ R^d and corresponding true response vectors y_i ∈ R^p. Let W_h ∈ R^{d×m} denote the weight matrix between the input and hidden layer, and b_h ∈ R^m the vector of bias terms for the hidden neurons from Eq. (25.21). Likewise, let W_o ∈ R^{m×p} denote the weight matrix between the hidden and output layer, and b_o ∈ R^p the bias vector for the output neurons from Eq. (25.24).

For a given input pair (x, y) in the training data, the MLP first computes the output vector o via the feedforward step in Eq. (25.27). Next, it computes the error in the predicted output vis-a-vis the true response y using the squared error function

E_x = (1/2)·‖y − o‖^2 = (1/2)·Σ_{j=1}^p ( y_j − o_j )^2    (25.28)
The basic idea behind backpropagation is to examine the extent to which an output neuron, say o_j, deviates from the corresponding target response y_j, and to modify the weights w_ij between each hidden neuron z_i and o_j as some function of the error: a large error should cause a correspondingly large change in the weight, and a small error should result in smaller changes. Likewise, the weights between all input and hidden neurons should also be updated as some function of the error at the output, as well as of the changes already computed for the weights between the hidden and output layers. That is, the error propagates backwards.
The weight update is done via a gradient descent approach to minimize the error. Let ∇_wij be the gradient of the error function with respect to w_ij, or simply the weight gradient at w_ij. Given the previous weight estimate w_ij, a new weight is computed by taking a small step η in a direction that is opposite to the weight gradient at w_ij:

w_ij = w_ij − η·∇_wij    (25.29)

In a similar manner, the bias term b_j is also updated via gradient descent:

b_j = b_j − η·∇_bj    (25.30)

where ∇_bj is the gradient of the error function with respect to b_j, which we call the bias gradient at b_j.
Updating Parameters Between Hidden and Output Layer
Consider the weight w_ij between hidden neuron z_i and output neuron o_j, and the bias term b_j between z_0 and o_j. Using the chain rule of differentiation, we compute the weight gradient at w_ij and the bias gradient at b_j as follows:

∇_wij = ∂E_x/∂w_ij = ( ∂E_x/∂net_j )·( ∂net_j/∂w_ij ) = δ_j·z_i
∇_bj = ∂E_x/∂b_j = ( ∂E_x/∂net_j )·( ∂net_j/∂b_j ) = δ_j    (25.31)

where we use the symbol δ_j to denote the partial derivative of the error with respect to the net signal at o_j, which we also call the net gradient at o_j:

δ_j = ∂E_x/∂net_j    (25.32)

Furthermore, the partial derivative of net_j with respect to w_ij and b_j is given as

∂net_j/∂w_ij = ∂/∂w_ij ( b_j + Σ_{k=1}^m w_kj·z_k ) = z_i        ∂net_j/∂b_j = ∂/∂b_j ( b_j + Σ_{k=1}^m w_kj·z_k ) = 1

where we used the fact that b_j and all w_kj for k ≠ i are constants with respect to w_ij.

Next, we need to compute δ_j, the net gradient at o_j. This can also be computed via the chain rule:

δ_j = ∂E_x/∂net_j = ( ∂E_x/∂f(net_j) )·( ∂f(net_j)/∂net_j )    (25.33)

Note that f(net_j) = o_j. Thus, δ_j is composed of two terms, namely the partial derivative of the error term with respect to the output (the activation function applied to the net signal), and the derivative of the activation function with respect to its argument. Using the squared error function and Eq. (25.14), for the former we have

∂E_x/∂f(net_j) = ∂E_x/∂o_j = ∂/∂o_j ( (1/2)·Σ_{k=1}^p (y_k − o_k)^2 ) = o_j − y_j

where we used the observation that all o_k for k ≠ j are constants with respect to o_j. Since we assume a sigmoid activation function, for the latter we have, via Eq. (25.10),

∂f(net_j)/∂net_j = o_j·(1 − o_j)

Putting it all together, we get

δ_j = ( o_j − y_j )·o_j·(1 − o_j)

Let δ_o = (δ_1, δ_2, ..., δ_p)^T denote the vector of net gradients at each output neuron, which we call the net gradient vector for the output layer. We can write δ_o as

δ_o = o ⊙ (1 − o) ⊙ (o − y)    (25.34)

where ⊙ denotes the element-wise product (also called the Hadamard product) between vectors, o = (o1, o2, ..., op)^T is the predicted output vector, y = (y1, y2, ..., yp)^T is the true response vector, and 1 = (1, ..., 1)^T ∈ R^p is the p-dimensional vector of all ones.

Let z = (z1, z2, ..., zm)^T denote the vector comprising the values of all hidden layer neurons after applying the activation function. Based on Eq. (25.31), we can compute the gradients ∇_wij for all hidden to output neuron connections via the outer product of z and δ_o:

∇_Wo = | ∇_w11  ∇_w12  ...  ∇_w1p |  =  z·δ_o^T    (25.35)
       | ∇_w21  ∇_w22  ...  ∇_w2p |
       |   ...    ...  ...    ... |
       | ∇_wm1  ∇_wm2  ...  ∇_wmp |

where ∇_Wo ∈ R^{m×p} is the matrix of weight gradients. The vector of bias gradients is given as:

∇_bo = ( ∇_b1, ∇_b2, ..., ∇_bp )^T = δ_o    (25.36)

where ∇_bo ∈ R^p. Once the gradients have been computed, we can update all the weights and biases as follows:

W_o = W_o − η·∇_Wo        b_o = b_o − η·∇_bo    (25.37)

where η is the step size (also called the learning rate) for gradient descent.
Updating Parameters Between Input and Hidden Layer
Consider the weight w_ij between input neuron x_i and hidden neuron z_j, and the bias term b_j between x_0 and z_j. The weight gradient at w_ij and the bias gradient at b_j are computed similarly to Eq. (25.31):

∇_wij = ∂E_x/∂w_ij = ( ∂E_x/∂net_j )·( ∂net_j/∂w_ij ) = δ_j·x_i
∇_bj = ∂E_x/∂b_j = ( ∂E_x/∂net_j )·( ∂net_j/∂b_j ) = δ_j    (25.38)

which follows from

∂net_j/∂w_ij = ∂/∂w_ij ( b_j + Σ_{k=1}^d w_kj·x_k ) = x_i        ∂net_j/∂b_j = ∂/∂b_j ( b_j + Σ_{k=1}^d w_kj·x_k ) = 1

To compute the net gradient δ_j at the hidden neuron z_j we have to consider the error gradients that flow back from all the output neurons to z_j. Applying the chain rule, we get:

δ_j = ∂E_x/∂net_j = Σ_{k=1}^p ( ∂E_x/∂net_k )·( ∂net_k/∂z_j )·( ∂z_j/∂net_j )
    = ( Σ_{k=1}^p δ_k·w_jk )·z_j·(1 − z_j)

where ∂z_j/∂net_j = z_j·(1 − z_j), since we assume a sigmoid activation function for the hidden neurons.
Figure 25.7. Backpropagation of gradients from output to hidden layer.
The chain rule leads to a natural interpretation for backpropagation, namely, to find the net gradient at z_j we have to consider the net gradients at each of the output neurons, δ_k, but weighted by the strength of the connection w_jk between z_j and o_k, as illustrated in Figure 25.7. That is, we compute the weighted sum of gradients Σ_{k=1}^p δ_k·w_jk, which is used to compute δ_j, the net gradient at hidden neuron z_j.

Let δ_o = (δ_1, δ_2, ..., δ_p)^T denote the vector of net gradients at the output neurons, and δ_h = (δ_1, δ_2, ..., δ_m)^T the net gradients at the hidden layer neurons. We can write δ_h compactly as

δ_h = z ⊙ (1 − z) ⊙ ( W_o·δ_o )    (25.39)

where ⊙ is the element-wise product, 1 = (1, 1, ..., 1)^T ∈ R^m is the vector of all ones, and z = (z1, z2, ..., zm)^T is the vector of hidden layer outputs. Furthermore, W_o·δ_o ∈ R^m is the vector of weighted gradients at each hidden neuron, since

W_o·δ_o = ( Σ_{k=1}^p δ_k·w_1k,  Σ_{k=1}^p δ_k·w_2k,  ...,  Σ_{k=1}^p δ_k·w_mk )^T

Let x = (x1, x2, ..., xd)^T denote the input vector; then based on Eq. (25.38) we can compute the gradients ∇_wij for all input to hidden layer connections via the outer product:

∇_Wh = | ∇_w11  ...  ∇_w1m |  =  x·δ_h^T    (25.40)
       | ∇_w21  ...  ∇_w2m |
       |   ...  ...    ... |
       | ∇_wd1  ...  ∇_wdm |

where ∇_Wh ∈ R^{d×m} is the matrix of weight gradients. The vector of bias gradients is given as:

∇_bh = ( ∇_b1, ∇_b2, ..., ∇_bm )^T = δ_h    (25.41)

where ∇_bh ∈ R^m.
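Combining Eqs. (25.34) through (25.41), the per-point gradients reduce to two element-wise products and two outer products. The sketch below assumes sigmoid activations at both layers and squared error, with illustrative names.

import numpy as np

def backprop_gradients(x, y, z, o, Wo):
    """x: (d,) input, y: (p,) target, z: (m,) hidden outputs, o: (p,) outputs,
    Wo: (m, p) hidden-to-output weights. Returns gradients for Wo, bo, Wh, bh."""
    delta_o = o * (1 - o) * (o - y)            # Eq. (25.34)
    grad_Wo = np.outer(z, delta_o)             # Eq. (25.35)
    grad_bo = delta_o                          # Eq. (25.36)
    delta_h = z * (1 - z) * (Wo @ delta_o)     # Eq. (25.39)
    grad_Wh = np.outer(x, delta_h)             # Eq. (25.40)
    grad_bh = delta_h                          # Eq. (25.41)
    return grad_Wo, grad_bo, grad_Wh, grad_bh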

Once the gradients have been computed, we can update all the weights and biases as follows:

W_h = W_h − η·∇_Wh        b_h = b_h − η·∇_bh    (25.42)

where η is the step size or learning rate.

25.3.3 MLP Training

Algorithm 25.1: MLP Training: Stochastic Gradient Descent

MLP-TRAINING (D, m, η, maxiter):
// Initialize bias vectors
1  b_h ← random m-dimensional vector with small values
2  b_o ← random p-dimensional vector with small values
// Initialize weight matrices
3  W_h ← random d × m matrix with small values
4  W_o ← random m × p matrix with small values
5  t ← 0  // iteration counter
6  repeat
7      foreach (x_i, y_i) ∈ D in random order do
           // Feedforward phase
8          z_i ← f( b_h + W_h^T x_i )
9          o_i ← f( b_o + W_o^T z_i )
           // Backpropagation phase: net gradients
10         δ_o ← o_i ⊙ (1 − o_i) ⊙ (o_i − y_i)
11         δ_h ← z_i ⊙ (1 − z_i) ⊙ ( W_o·δ_o )
           // Gradient descent for bias vectors
12         ∇_bo ← δ_o;  b_o ← b_o − η·∇_bo
13         ∇_bh ← δ_h;  b_h ← b_h − η·∇_bh
           // Gradient descent for weight matrices
14         ∇_Wo ← z_i·δ_o^T;  W_o ← W_o − η·∇_Wo
15         ∇_Wh ← x_i·δ_h^T;  W_h ← W_h − η·∇_Wh
16     t ← t + 1
17 until t ≥ maxiter
Algorithm 25.1 shows the pseudocode for learning the weights considering all of the input points in D via stochastic gradient descent. The code is shown for an MLP with a single hidden layer, using a squared error function, and sigmoid activations for all hidden and output neurons. The approach is called stochastic gradient descent since we compute the weight and bias gradients after observing each training point in random order.
The MLP algorithm takes as input the dataset D with points x_i and desired responses y_i for i = 1, 2, ..., n, the number of hidden layer neurons m, the learning rate η, and an integer threshold maxiter that specifies the maximum number of iterations. The size of the input (d) and output (p) layers is determined automatically from D. The MLP first initializes the d×m input to hidden layer weight matrix W_h, and the m×p hidden to output layer matrix W_o, to small values, for example, uniformly random in the range [−0.01, 0.01]. It is important to note that the weights should not be set to 0; otherwise, all hidden neurons will be identical in their values, and so will be the output neurons.
The MLP training takes multiple iterations over the input points. For each input x_i, the MLP computes the output vector o_i via the feedforward step. In the backpropagation phase, we compute the error gradient vector δ_o with respect to the net at the output neurons, followed by δ_h for the hidden neurons. In the stochastic gradient descent step, we compute the error gradients with respect to the weights and biases, which are used to update the weight matrices and bias vectors. Thus, for every input vector x_i, all the weights and biases are updated based on the error incurred between the predicted output o_i and the true response y_i. After each input has been processed, that completes one iteration of training, called an epoch. Training stops when the maximum number of iterations, maxiter, has been reached. On the other hand, during testing, for any input x, we apply the feedforward steps and print the predicted output o.
In terms of computational complexity, each iteration of the MLP training algorithm takes O(dm + mp) time for the feedforward phase, O(p + mp + m) = O(mp) time for the backpropagation of error gradients, and O(dm + mp) time for updating the weight matrices and bias vectors. Thus, the total training time per iteration is O(dm + mp).
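The following self-contained numpy sketch follows Algorithm 25.1 (sigmoid activations, squared error, stochastic gradient descent); the initialization range, the fixed number of epochs, and all names are illustrative choices.

import numpy as np

def mlp_train(X, Y, m, eta=0.1, maxiter=100, rng=None):
    """X: (n, d) inputs, Y: (n, p) targets, m: hidden-layer size.
    Stochastic gradient descent with sigmoid activations and squared error."""
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    rng = np.random.default_rng(rng)
    n, d = X.shape
    p = Y.shape[1]
    Wh = rng.uniform(-0.01, 0.01, (d, m)); bh = rng.uniform(-0.01, 0.01, m)
    Wo = rng.uniform(-0.01, 0.01, (m, p)); bo = rng.uniform(-0.01, 0.01, p)
    for _ in range(maxiter):                       # one epoch per iteration
        for i in rng.permutation(n):
            z = sig(bh + Wh.T @ X[i])              # feedforward, Eq. (25.23)
            o = sig(bo + Wo.T @ z)                 # Eq. (25.26)
            d_o = o * (1 - o) * (o - Y[i])         # net gradients, Eq. (25.34)
            d_h = z * (1 - z) * (Wo @ d_o)         # Eq. (25.39)
            Wo -= eta * np.outer(z, d_o); bo -= eta * d_o     # Eqs. (25.35)-(25.37)
            Wh -= eta * np.outer(X[i], d_h); bh -= eta * d_h  # Eqs. (25.40)-(25.42)
    return Wh, bh, Wo, bo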
Example 25.3 (MLP with One Hidden Layer). We now illustrate an MLP with a hidden layer, using a nonlinear activation function, to learn the sine curve. Figure 25.8(a) shows the training data (the gray points on the curve), which comprises n = 25 points x_i sampled randomly in the range [−10, 10], with y_i = sin(x_i). The testing data comprises 1000 points sampled uniformly from the same range. The figure also shows the desired output curve (thin line). We used an MLP with one input neuron (d = 1), ten hidden neurons (m = 10), and one output neuron (p = 1). The hidden neurons use tanh activations, whereas the output unit uses an identity activation. The step size is η = 0.005.

The input to hidden weight matrix W_h ∈ R^{1×10} and the corresponding bias vector b_h ∈ R^{10} are given as:

W_h = (0.68, 0.77, 0.42, 0.72, 0.93, 0.42, 0.66, 0.70, 0.62, 0.50)
b_h = (4.36, 2.43, 0.52, 2.35, 1.64, 3.98, 0.31, 4.45, 1.03, 4.77)^T

The hidden to output weight matrix W_o ∈ R^{10×1} and the bias term b_o ∈ R are given as:

W_o = (1.82, 1.69, 0.82, 1.37, 0.14, 2.37, 1.64, 1.92, 0.78, 2.17)^T
b_o = 0.16

Figure 25.8(a) shows the output of the MLP on the test set after the first iteration of training (t = 1).

Figure 25.8. MLP for the sine curve: 10 hidden neurons with hyperbolic tangent activation functions. The gray dots represent the training data; the bold line is the predicted response, whereas the thin line is the true response. (a)–(f) Predictions after t = 1, 1000, 5000, 10000, 15000, and 30000 iterations. (g) Test range [−20, 20]: the fit is good only within the training range [−10, 10] (shown in the box).
We can see that initially the predicted response deviates significantly from the true sine response. Figures 25.8(a)–(f) show the output from the MLP after different numbers of training iterations. By t = 15000 iterations the output on the test set comes close to the sine curve, but it takes another 15000 iterations to get a closer fit. The final SSE is 1.45 over the 1000 test points.

We can observe that, even with a very small training set of 25 points sampled randomly from the sine curve, the MLP is able to learn the desired function. However, it is also important to recognize that the MLP model has not really learned the sine function; rather, it has learned to approximate it only in the specified range [−10, 10]. We can see in Figure 25.8(g) that when we try to predict values outside this range, the MLP does not yield a good fit.
Example 25.4 (MLP for Handwritten Digit Classification). In this example, we apply an MLP with one hidden layer to the task of predicting the correct label for a handwritten digit from the MNIST database, which contains 60,000 training images that span the 10 digit labels, from 0 to 9. Each grayscale image is a 28×28 matrix of pixels, with values between 0 and 255. Each pixel is converted to a value in the interval [0, 1] by dividing by 255. Figure 25.9 shows an example of each digit from the MNIST dataset.

Since images are 2-dimensional matrices, we first flatten them into a vector x ∈ R^{784} with dimensionality d = 28 × 28 = 784. This is done by simply concatenating all of the rows of an image to obtain one long vector. Next, since the output labels are categorical values that denote the digits from 0 to 9, we need to convert them into binary numerical vectors, using one-hot encoding. Thus, the label 0 is encoded as e1 = (1,0,0,0,0,0,0,0,0,0)^T ∈ R^{10}, the label 1 as e2 = (0,1,0,0,0,0,0,0,0,0)^T ∈ R^{10}, and so on; finally, the label 9 is encoded as e10 = (0,0,0,0,0,0,0,0,0,1)^T ∈ R^{10}. That is, each input image vector x has a corresponding target response vector y ∈ {e1, e2, ..., e10}. Thus, the input layer for the MLP has d = 784 neurons, and the output layer has p = 10 neurons.

For the hidden layer, we consider several MLP models, each with a different number of hidden neurons m; we try m ∈ {0, 7, 10, 49, 98, 196, 392} to study the effect of increasing the number of hidden neurons from small to large. For the hidden layer we use the ReLU activation function, and for the output layer we use softmax activation, since the target response vector has only one neuron with value 1, with the rest being 0. Note that m = 0 means that there is no hidden layer (the input layer is directly connected to the output layer), which is equivalent to a multiclass logistic regression model. We train each MLP for t = 15 epochs, using step size η = 0.25.

During training, we plot the number of misclassified images after each epoch, on the separate MNIST test set comprising 10,000 images. Figure 25.10 shows the number of errors for each of the models with a different number of hidden neurons m, after each epoch. The final test error at the end of training is given as

m        0      7      10     49     98     196    392
errors   1677   901    792    546    495    470    454

Figure 25.9. MNIST dataset: sample handwritten digits, (a)–(j) one for each label 0 through 9.
We can observe that adding a hidden layer significantly improves the prediction accuracy. Using even a small number of hidden neurons helps, compared to the logistic regression model (m = 0). For example, using m = 7 results in 901 errors (error rate 9.01%) compared to m = 0, which results in 1677 errors (error rate 16.77%). On the other hand, as we increase the number of hidden neurons, the error rate decreases, though with diminishing returns. Using m = 196, the error rate is 4.70%, but even after doubling the number of hidden neurons (m = 392), the error rate goes down only to 4.54%. Further increasing m does not reduce the error rate.
Figure 25.10. MNIST: prediction error (number of misclassified test images) as a function of training epochs, for models with m ∈ {0, 7, 10, 49, 98, 196, 392} hidden neurons.
25.4 DEEP MULTILAYER PERCEPTRONS
We now generalize the feedforward and backpropagation steps for many hidden layers, as well as arbitrary error and neuron activation functions.
Consider an MLP with h hidden layers, as shown in Figure 25.11. We assume that the input to the MLP comprises n points x_i ∈ R^d with corresponding true response vectors y_i ∈ R^p. We denote the input neurons as layer l = 0, the first hidden layer as l = 1, the last hidden layer as l = h, and finally the output layer as layer l = h+1. We use n_l to denote the number of neurons in layer l. Since the input points are d-dimensional, this implies n_0 = d, and since the true response vector is p-dimensional, we have n_{h+1} = p. The hidden layers have n_1 neurons for the first hidden layer, n_2 for the second layer, and n_h for the last hidden layer. The vector of neuron values for layer l (for l = 0, ..., h+1) is denoted as

z^l = ( z_1^l, ..., z_{n_l}^l )^T

Each layer except the output layer has one extra bias neuron, which is the neuron at index 0. Thus, the bias neuron for layer l is denoted z_0^l, and its value is fixed at z_0^l = 1.
Figure 25.11. Deep multilayer perceptron with h hidden layers: (a) detailed view, (b) layer view.
Figure 25.11(a) displays a detailed view of an MLP with h hidden layers, showing the individual neurons in each layer, including the bias neurons. Note that the vector of input neuron values is also written as

x = ( x1, x2, ..., xd )^T = ( z_1^0, z_2^0, ..., z_d^0 )^T = z^0

and the vector of output neuron values is also denoted as

o = ( o1, o2, ..., op )^T = ( z_1^{h+1}, z_2^{h+1}, ..., z_p^{h+1} )^T = z^{h+1}

The weight matrix between layer l and layer l+1 neurons is denoted W_l ∈ R^{n_l × n_{l+1}}, and the vector of bias terms from the bias neuron z_0^l to the neurons in layer l+1 is denoted b_l ∈ R^{n_{l+1}}, for l = 0, 1, ..., h. Thus, W_0 ∈ R^{d × n_1} is the weight matrix between the input and first hidden layer, W_1 ∈ R^{n_1 × n_2} is the weight matrix between the first and second hidden layer, and so on; finally, W_h ∈ R^{n_h × p} is the weight matrix between the last hidden layer and the output layer. For the bias vectors, b_0 ∈ R^{n_1} specifies the biases for the neurons in the first hidden layer, b_1 ∈ R^{n_2} the biases for the neurons in the second hidden layer, and so on; thus, b_h ∈ R^p specifies the biases for the output neurons. Figure 25.11(b) shows a layer view for an MLP with h hidden layers. This is a more compact representation that clearly specifies the architecture or topology of the MLP. Each layer l is represented as a rectangular node, and is marked with the vector of neuron values z^l. The bias neurons are not shown, but are assumed to be present in each layer except the output. An edge between layer l and l+1 is labeled with the weight matrix W_l and bias vector b_l that specify the parameters between those layers.
For training deep MLPs, we will refer to several partial derivative vectors, described next. Define il as the net gradient, i.e., the partial derivative of the error function with respect to the net value at zil
il Ex neti
andletl denotethenetgradientvectoratlayerl,forl1,2,,h1 l1l,,nl T
l
25.43
25.44
Let f l denote the activation function for layer l, for l 0,1, ,h 1, and further let fl denote the vector of the derivatives of the activation function with respect to neti for all neurons zil in layer l:
l flnet1 flnetnlT
f net ,, net 25.45
1 nl
Finally, let Ex denote the vector of partial derivatives of the error function with respect
to the values oi for all output neurons:
Ex Ex Ex T
Ex o,o,,o 25.46
12p

662 Neural Networks 25.4.1 Feedforward Phase
Typically in a deep MLP, the same activation function f l is used for all neurons in a given layer l. The input layer always uses the identity activation, so f 0 is the identity function. Also, all bias neurons also use the identity function with a fixed value of 1. The output layer typically uses sigmoid or softmax activations for classification tasks, or identity activations for regression tasks. The hidden layers typically use sigmoid, tanh, or ReLU activations. In our discussion, we assume a fixed activation function f l for all neurons in a given layer. However, it is easy to generalize to a neuron specific activation function f l for neuron zl in layer l.
feedforward process:
o f h 1 b h W hT z h
fh1b WTfhb WT zh1 h h h1 h1
ii
For a given input pair x, y D, the deep MLP computes the output vector via the
.
fh1b WT fhb WT fh1f2b WT f1b WT x
h h h1 h1 1 1 0 0 Note that each f l distributes over its argument. That is,
f l b l 1 W Tl 1 x f l n e t 1 , f l n e t 2 , , f l n e t n l T 25.4.2 Backpropagation Phase
Consider the weight update between a given layer and another, including between the
input and hidden layer, or between two hidden layers, or between the last hidden layer
and the output layer. Let zl be a neuron in layer l, and zl1 a neuron in the next layer ij
l1.Letwl betheweightbetweenzl andzl1,andletbl denotethebiastermbetween ij ijj
zl andzl1.Theweightandbiasareupdatedusingthegradientdescentapproach 0j
wl wl l blbll ij ij wij j j bj
where wl is the weight gradient and bl is the bias gradient, i.e., the partial derivative ij j
of the error function with respect to the weight and bias, respectively:
l Ex lEx
wij wl bj
ij j
As noted earlier in Eq. 25.31, we can use the chain rule to write the weight and bias gradient, as follows
l Ex Ex netjl1zlzll1 wij wl netj wl j i i j
l Ex Ex netj l1 b j b jl n e t j b jl j
bl
ij ij

Deep Multilayer Perceptrons 663 where l1 is the net gradient Eq. 25.43, i.e., the partial derivative of the error
j
function with respect to the net value at zl1, and we have j
nn n e t j l n e t j l
bl wl zl zl
wl wl j kj k i bl bl j kj k
bl wl zl 1 ij ij k0 j j k0
Given the vector of neuron values at layer l, namely zl z1l , ,znl T, we can l
compute the entire weight gradient matrix via an outer product operation
and the bias gradient vector as:
Wl zll1T 25.47 bl l1 25.48
withl0,1,,h.Herel1 isthenetgradientvectoratlayerl1Eq.25.44. This also allows us to update all the weights and biases as follows
gradients for layer l we need to compute the net gradients l1 at layer l 1. 25.4.3 Net Gradients at Output Layer
Let us consider how to compute the net gradients at the output layer h 1. If all of the output neurons are independent for example, when using linear or sigmoid activations, the net gradient is obtained by differentiating the error function with respect to the net signal at the output neurons. That is,
h1 Ex Ex fh1netjEx fh1netj j netj f h1netj netj oj netj
Thus, the gradient depends on two terms, the partial derivative of the error function with respect to the output neuron value, and the derivative of the activation function with respect to its argument. The net gradient vector across all output neurons is given as
h1 fh1 Ex 25.50
where is the elementwise or Hadamard product, fh1 is the vector of derivatives of the activation function with respect to its argument Eq. 25.45 at the output layer l h 1, and E x is the vector of error derivatives with respect to the output neuron values Eq. 25.46.
Wl Wl Wl bl bl bl
25.49 where is the step size. However, we observe that to compute the weight and bias

664 Neural Networks
On the other hand, if the output neurons are not independent for example, when using a softmax activation, then we have to modify the computation of the net gradient at each output neuron as follows:
h1 Ex p Ex fh1neti j netj i1 fh1neti netj
Across all output neurons, we can write this compactly as follows:
h1 Fh1 Ex 25.51
whereFh1 isthematrixofderivativesofoi fh1netiwithrespecttonetj forall
i,j 1,2, ,p, given as
o1 o1 o1
net1 net2 netp
o2 o2 o2 net1 net2 netp
Typically, for regression tasks, we use the squared error function with linear activation function at the output neurons, whereas for logistic regression and classification, we use the crossentropy error function with a sigmoid activation for binary classes, and softmax activation for multiclass problems. For these common cases, the net gradient vector at the output layer is given as follows:
F
h1
. .
op op op
SquaredError: FromEq.25.15,theerrorgradientisgivenas ExEx oy
.
net1 net2 netp

o The net gradient at the output layer is given as
h1 fh1 Ex
where fh1 depends on the activation function at the output. Typically, for regression tasks, we use a linear activation at the output neurons. In that case, we have f h1 1 see Eq. 25.8.
CrossEntropyErrorbinaryoutput,sigmoidactivation: Considerthebinarycasefirst, with a single output neuron o with sigmoid activation. Recall that the binary crossentropy error Eq. 25.19 is given as
Ex ylno1yln1o From Eq. 25.20 we have
ExEx oy o o1o

Deep Multilayer Perceptrons 665 Further, for sigmoid activation, we have
fh1 fneto o1o neto
Therefore, the net gradient at the output neuron is
h1 Ex fh1 oy o1ooy
o1o
CrossEntropy Error K outputs, softmax activation: Recall that the crossentropy
error function Eq. 25.16 is given as

K i1
Ex Ex ExT y1 y2 yKT Ex o ,o ,, o o ,o ,,o
Ex
Using Eq. 25.17, the vector of error derivatives with respect to the output neurons
is given as
12K12K where p K is the number of output neurons.
Crossentropy error is typically used with the softmax activation so that we get a normalized probability value for each class. That is,
yi lnoi y1 lno1yK lnoK
expnetj oj softmaxnetj
so that the output neuron values sum to one, Kj 1 oj 1. Since an output neuron depends on all other output neurons, we need to compute the matrix of derivatives of each output with respect to each of the net signals at the output neurons see Equations 25.12 and 25.18:
o1 o1 o1
net1 net2 netK o1 1o1
o1 o2 o2 1o2
o2 oK
Ki1 expneti
o2 o2 o2 o1 o2
o1 oK
.
Fh1 net1 net2 . .
netK . . .
.
.
. . . . o1oK o2oK oK1oK
oK oK oK net1 net2 netK

666
Neural Networks
Therefore, the net gradient vector at the output layer is h1 Fh1 Ex

o1 1o1
. . . . . . o1 oK o2 oK
o1 o2 o 1o
25.52 y1
2Ko2 . . . . . .
o o
12 2 2
o1 oK o o
o1
y2
y1 o1 Ki1 yi .K .K
yKyKoK iKyioK yKoK i1yi
y1 o1
y 2 o 2 K
. ,since yK oK
oy
25.4.4 Net Gradients at Hidden Layers
Let us assume that we have already computed the net gradients at layer l 1, namely
l1.Sinceneuronzjl inlayerlisconnectedtoalloftheneuronsinlayerl1except
for the bias neuron zl1, to compute the net gradient at zl , we have to account for the 0j
error from each neuron in layer l 1, as follows:
n
i1
yi 1
oK 1oK yK oK
y1 y1 o1 Ki1 yi o1
y2y2o2Ki2yi o2 y2o2Ki1yi
l
E l1 E net f net
jlx x k j

k1
k1 netk flnetj netj
netj
Sothenetgradientatzjl inlayerldependsonthederivativeoftheactivationfunction
with respect to its netj , and the weighted sum of the net gradients from all the neurons
zl1 at the next layer l 1. k
We can compute the net gradients for all the neurons in level l in one step, as follows:
l fl Wl l1 25.53
where is the elementwise product, and fl is the vector of derivatives of the activation function with respect to its argument Eq. 25.45 at layer l. For the commonly used activation functions at the hidden layer, using the derivatives from
n l fnetj l1
netj
l1 wl k jk

Deep Multilayer Perceptrons
667
Section 25.1.1, we have
For ReLU, we have to apply Eq. 25.9 to each neuron. Note that softmax is generally not used for hidden layers.
The net gradients are computed recursively, starting from the output layer h 1, then hidden layer h, and so on, until we finally compute the net gradients at the first hidden layer l 1. That is,
hfh Whh1 h1 fh1 Wh1 hfh1 Wh1 fh Wh h1
.
1 f1 W1 f2 W2 fh Wh h1
25.4.5 Training Deep MLPs
Algorithm 25.2 shows the pseudocode for learning the weights and biases for a deep MLP. The inputs comprise the dataset D, the number of hidden layers h, the step size or learning rate for gradient descent , an integer threshold maxiter denoting the number of iterations for training, parameters n1,n2, ,nh that denote the number of neurons excluding bias, which will be added automatically for each of the hidden layers l 1,2,,h,andthetypeofactivationfunctionsf1,f2,,fh1 foreachofthelayers other than the input layer that uses identity activations. The size of the input d and output p layers is determined directly from D.
The MLP first initializes the nl nl1 weight matrices Wl between layers l and l 1 with small values chosen uniformly at random, e.g., in the range 0.01, 0.01. The MLP considers each input pair xi , yi D, and computes the predicted response oi via the feedforward process. The backpropagation phase begins by computing the error between oi and true response yi, and computing the net gradient vector h1 at the output layer. These net gradients are backpropagated from layer h 1 to layer h, from h to h 1, and so on until we obtain the net gradients at the first hidden layer l1.ThesenetgradientsareusedtocomputetheweightgradientmatrixWl atlayer l, which can in turn be used to update the weight matrix Wl . Likewise, the net gradients specify the bias gradient vector bl at layer l, which is used to update bl. After each point has been used to update the weights, that completes one iteration or epoch of training. The training stops when maxiter epochs have been reached. On the other hand, during testing, for any input x, we apply the feedforward steps and print the predicted output o.
It is important to note that Algorithm 25.2 follows a stochastic gradient descent approach, since the points are considered in random order, and the weight and bias gradients are computed after observing each training point. In practice, it is common
1
fl zl1zl
for linear
for sigmoid 1zl zl for tanh

668
Neural Networks
Algorithm 25.2: Deep MLP Training: Stochastic Gradient Descent
1 2
3 4 5
6 7 8
9 10
11 12
13 14 15 16
17 18
19 20 21
22 23 24
25 26
DEEPMLPTRAINING D,h,,maxiter,n1,n2, ,nh,f 1,f 2, ,f h1: n0 d input layer size
nh1 p output layer size
Initialize weight matrices and bias vectors
for l 0,1,2, ,h do
bl random nl1 vector with small values
Wl random nl nl1 matrix with small values
t 0 iteration counter repeat
foreach xi , yi D in random order do FeedForward Phase
z0 xi
for l 0,1,2,…,h do
zl1 fl1bl WTl zl
oi zh1
Backpropagation Phase if independent outputs then
h1 fh1 Exi net gradients at output else
Gradient Descent Step
h1 Fh1 Exi net gradients at output for l h, h 1, , 1 do
lfl Wll1netgradientsatlayerl
for l 0,1, ,h do
Wl zll1Tweightgradientmatrixatlayerl bl l1biasgradientvectoratlayerl
for l 0,1, ,h do
Wl Wl Wl update Wl bl bl bl update bl
tt1 until t maxiter
to update the gradients by considering a fixed sized subset of the training points called a minibatch instead of using single points. That is, the training data is divided into minibatches using an additional parameter called batch size, and a gradient descent step is performed after computing the bias and weight gradient from each minibatch. This helps better estimate the gradients, and also allows vectorized matrix operations over the minibatch of points, which can lead to faster convergence and substantial speedups in the learning.

Deep Multilayer Perceptrons 669
One caveat while training very deep MLPs is the problem of vanishing and exploding gradients. In the vanishing gradient problem, the norm of the net gradient can decay exponentially with the distance from the output layer, that is, as we backpropagate the gradients from the output layer to the input layer. In this case the network will learn extremely slowly, if at all, since the gradient descent method will make minuscule changes to the weights and biases. On the other hand, in the exploding gradient problem, the norm of the net gradient can grow exponentially with the distance from the output layer. In this case, the weights and biases will become exponentially large, resulting in a failure to learn. The gradient explosion problem can be mitigated to some extent by gradient thresholding, that is, by resetting the value if it exceeds an upper bound. The vanishing gradients problem is more difficult to address. Typically sigmoid activations are more susceptible to this problem, and one solution is to use alternative activation functions such as ReLU. In general, recurrent neural networks, which are deep neural networks with feedback connections, are more prone to vanishing and exploding gradients; we will revisit these issues in Section 26.2.
Example 25.5 Deep MLP. We now examine deep MLPs for predicting the labels for the MNIST handwritten images dataset that we considered in Example 25.4. Recall that this dataset has n 60000 grayscale images of size 28 28 that we treat as d 784 dimensional vectors. The pixel values between 0 and 255 are converted to the range 0 and 1 by dividing each value by 255. The target response vector is a onehot encoded vector for class labels 0,1,…,9. Thus, the input to the MLP xi has dimensionality d 784, and the output layer has dimensionality p 10. We use softmax activation for the output layer. We use ReLU activation for the hidden layers, and consider several deep models with different number and sizes of the hidden layers. We use step size 0.3 and train for t 15 epochs. Training was done using minibatches, using batch size of 1000.
During the training of each of the deep MLPs, we evaluate its performance on the separate MNIST test datatset that contains 10,000 images. Figure 25.12 plots the
n1 392
n1 196,n2 49
n1 392,n2 196,n3 49
n1 392,n2 196,n3 98,n4 49
5,000
4,000
3,000
2,000
1,000
epochs
Figure 25.12. MNIST: Deep MLPs; prediction error as a function of epochs.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
errors

670 Neural Networks
number of errors after each epoch for the different deep MLP models. The final test error at the end of training is given as
n1 392
n1 196,n2 49
n1 392,n2 196,n3 49
n1 392,n2 196,n3 98,n4 49
We can observe that as we increase the number of layers, we do get performance improvements. The deep MLP with four hidden layers of sizes n1 392, n2 196, n3 98, n4 49 results in an error rate of 2.78 on the training set, whereas the MLP with a single hidden layer of size n1 392 has an error rate of 3.96. Thus, the deeper MLP significantly improves the prediction accuracy. However, adding more layers does not reduce the error rate, and can also lead to performance degradation.
hidden layers
errors
396
303
290
278
25.5 FURTHER READING
Artificial neural networks have their origin in the work of McCulloch and Pitts 1943. The first application of a single neuron, called a perceptron, to supervised learning was by Rosenblatt 1958. Minsky and Papert 1969 pointed out limitations of perceptrons, which were not overcome until the development of the backpropagation algorithm, which was introduced by Rumelhart, Hinton, and Williams 1986 to train general multilayer perceptrons.
McCulloch, W. S. and Pitts, W. 1943. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5 4, 115133.
Minsky, M. and Papert, S. 1969. Perceptron: An Introduction to Computational Geometry. Cambridge, MA: The MIT Press.
Rosenblatt, F. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65 6, 386.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by backpropagating errors. Nature, 323 6088, 533.
25.6 EXERCISES
Q1. Consider the neural network in Figure 25.13. Let bias values be fixed at 0, and let the weight matrices between the input and hidden, and hidden and output layers, respectively, be:
W w1,w2,w3 1,1,1 W w1 ,w2 ,w3 T 0.5,1,2T
Assume that the hidden layer uses ReLU, whereas the output layer uses sigmoid activation. Assume SSE error. Answer the following questions, when the input is x 4 and the true response is y 0:

Exercises
671
z1
w1
x w2 z2 w2 o
w1
w3 Figure 25.13. Neural network for Q1.
a Useforwardpropagationtocomputethepredictedoutput. b Whatisthelossorerrorvalue?
c Computethenetgradientvectorofortheoutputlayer.
d Computethenetgradientvectorhforthehiddenlayer.
e ComputetheweightgradientmatrixW betweenthehiddenandoutputlayers.
f ComputetheweightgradientmatrixWbetweentheinputandhiddenlayers.
Q2. Show that the derivative of the sigmoid function Eq. 25.5 with respect to its
argument is given as
fz fz1fz z
Q3. ShowthatthederivativeofthehyperbolictangentfunctionEq.25.6withrespect to its argument is given as
fz 1fz2 z
Q4. Showthatthederivativeofthesoftmaxfunctionisgivenas fzizfzi1fzi ifji
w3
z3
zj f zi f zj if j i
Q5. Derive an expression for the net gradient vector at the output neurons when using
where z z1,z2, ,zp.
softmax activation with the squared error function.
Q6. Showthatiftheweightmatrixandbiasvectorsareinitializedtozero,thenallneurons in a given layer will have identical values in each iteration.
Q7. Provethatwithlinearactivationfunctions,amultilayerednetworkisequivalenttoa singlelayered neural network.
Q8. Compute the expression for the net gradient vector at the output layer, h1, assuming crossentropy error, Ex Ki1 yi lnoi , with K independent binary output neurons that use sigmoid activation, that is, oi sigmoidneti .
Q9. GivenanMLPwithonehiddenlayer,derivetheequationsforhandousingvector derivatives, i.e., by computing Ex and Ex , where neth and neto are the net input
neth neto vectors at the hidden and output layers.

CHAPTER 26 Deep Learning
In this chapter, we first look at deep neural networks that include feedback from one layer to another. Such network are called recurrent neural networks RNNs, and they can typically be trained by unfolding or unrolling the recurrent connections, resulting in deep networks whose parameters can be learned via the backpropagation algorithm. Since RNNs are prone to the vanishing and exploding gradients problem, we next consider gated RNNs that introduce a new type of layer that endows the network with the ability to selectively read, store, and write the hidden state via an internal memory layer. Gated RNNs are highly effective for prediction tasks on sequence inputs. Finally, we consider convolutional neural networks CNNs that are deep MLPs that exploit spatial or temporal relationships between different elements of each layer to construct a hierarchy of features that can be used for regression or classification tasks. Unlike regular MLPs whose layers are fully connected, CNNs have layers that are localized and sparse. In particular, CNNs are highly effective for image inputs.
26.1 RECURRENT NEURAL NETWORKS
Multilayer perceptrons are feedforward networks in which the information flows in only one direction, namely from the input layer to the output layer via the hidden layers. In contrast, recurrent neural networks RNNs are dynamically driven e.g., temporal, with a feedback loop between two or more layers, which makes such networks ideal for learning from sequence data. For example, Figure 26.1 shows a simple RNN where there is a feedback loop from the hidden layer ht to itself via a temporal delay of one time unit, denoted by the 1 on the loop.
Let X x1,x2, ,x denote a sequence of vectors, where xt Rd is a ddimensional vector t 1,2, ,. Thus, X is an input sequence of length , with xt denoting the input at time step t. Let Y y1,y2, ,y denote a sequence of vectors, with yt Rp a pdimensional vector. Here, Y is the desired target or response sequence, with yt denoting the response vector at time t. Finally, let O o1,o2, ,o denote the predicted or output sequence from the RNN. Here ot Rp is also a pdimensional vector to match the corresponding true response yt . The task of an RNN is to learn a function that predicts the target sequence Y given the input sequence X . That is, the
672

Recurrent Neural Networks
673
Wh,bh 1
xt
ht
Wo,bo
Figure 26.1. Recurrent neural network. Recurrent connection shown in gray.
predicted output ot on input xt should be similar or close to the target response yt , for each time point t.
To learn dependencies between elements of the input sequence, an RNN maintains a sequence of mdimensional hidden state vectors ht Rm , where ht captures the essential features of the input sequences up to time t, as illustrated in Figure 26.1. The hidden vector ht at time t depends on the input vector xt at time t and the previous hidden state vector ht 1 from time t 1, and it is computed as follows:
h t f h W i T x t W hT h t 1 b h 2 6 . 1
Here, f h is the hidden state activation function, typically tanh or ReLU. Also, we need an initial hidden state vector h0 that serves as the prior state to compute h1. This is usually set to the zero vector, or seeded from a prior RNN prediction step. The matrix Wi Rdm specifies the weights between the input vectors and the hidden state vectors. The matrix Wh Rmm specifies the weight matrix between the hidden state vectors at time t 1 and t , with bh Rm specifying the bias terms associated with the hidden states. Note that we need only one bias vector bh associated with the hidden state neurons; we do not need a separate bias vector between the input and hidden neurons.
Given the hidden state vector at time t, the output vector ot at time t is computed as follows:
ot foWoTht bo 26.2
Here, Wo Rmp specifies the weights between the hidden state and output vectors, with bias vector bo . The output activation function f o typically uses linear or identity activation, or a softmax activation for onehot encoded categorical output values.
It is important to note that all the weight matrices and bias vectors are independent of the time t. For example, for the hidden layer, the same weight matrix Wh and bias vector bh is used and updated while training the model, over all time steps t. This is an example of parameter sharing or weight tying between different layers or components of a neural network. Likewise, the input weight matrix Wi , the output weight matrix Wo and the bias vector bo are all shared across time. This greatly reduces the number of parameters that need to be learned by the RNN, but it also relies on the assumption that all relevant sequential features can be captured by the shared parameters.
Figure 26.1 shows an illustration of an RNN, with a recurrent hidden layer ht , denoted by the feedback loop with a time delay of 1 noted on the loop. The figure also shows the shared parameters between the layers. Figure 26.2a shows the same
Wi
ot

674
Deep Learning
t 0
t 1
t 2

t 1
Wo,bo
Wh,bh
Wi
t
Wo,bo
Wi
ot
o1
o2

o 1
o
Wh,bh
Wo,bo 1
Wi
a RNN
Wo,bo
Wh,bh Wi
Wo,bo
Wi
ht
h0
h1
h2

h 1
h
Wh,bh
xt
x1
x2
b RNN: unfolded in time Figure 26.2. RNN unfolded in time.
RNN in vertical format, where the layers are stacked vertically, and Figure 26.2b shows the RNN with the layers unfolded in time; it shows the input xt , hidden ht , and output ot layers at each time step t . We can observe that the feedback loop has been unfolded for time steps, starting with t 1 and ending at t . The time step t 0 is used to denote the previous or initial hidden state h0. We can also explicitly observe the parameter sharing across time, since all weight matrices and bias vectors are independent of t. Figure 26.2b makes it clear that an RNN is a deep neural network with layers, where is the maximum input sequence length.
The training data for the RNN is given as D Xi,Yini1, comprising n input sequences Xi and the corresponding target response sequences Yi, with sequence length i . Given each pair X , Y D, with X x1 , x2 , , x and Y y1 , y2 , , y , the RNN has to update the model parameters Wi,Wh,bh,Wo,bo for the input, hidden and output layers, to learn the corresponding output sequence O o1 , o2 , , o . For training the network, we compute the error or loss between the predicted and response vectors over all time steps. For example, the squared error loss is given as
1
EX Ext 2 yt ot 2
t1 t1
On the other hand, if we use a softmax activation at the output layer, then we use the
crossentropy loss, given as

x 1
x
p
EX Ext yti lnoti
t1 t1 i1
where yt yt1,yt2, ,ytpT Rp and ot ot1,ot2, ,otpT Rp. On training input of length we first unfold the RNN for steps, following which the parameters can be learned via the standard feedforward and backpropagation steps, keeping in mind the connections between the layers.

Recurrent Neural Networks 675 26.1.1 FeedforwardinTime
The feedforward process starts at time t 0, taking as input the initial hidden state vector h0, which is usually set to 0 or it can be userspecified, say from a previous prediction step. Given the current set of parameters, we predict the output ot at each time step t 1,2, , via Eq. 26.1 and Eq. 26.2:
o t f o W oT h t b o
f o W oT f h W i T x t W hT h t 1 b h b o
ht .
foWoTfhWiTxt WhTfhfhWiTx1 WhTh0 bhbhbo
h1
We can observe that the RNN implicitly makes a prediction for every prefix of the inputsequence,sinceot dependsonallthepreviousinputvectorsx1,x2,,xt,butnot on any future inputs xt1, ,x .
26.1.2 Backpropagation in Time
Once the feedforward phase computes the output sequence O o1 , o2 , , o , we can compute the error in the predictions using the squared error or crossentropy loss function, which can in turn be used to compute the net gradient vectors that are backpropagated from the output layers to the input layers for each time step.
For the backpropagation step it is easier to view the RNN in terms of the distinct layers based on the dependencies, as opposed to unfolding in time. Figure 26.3a shows the RNN unfolded using this layer view as opposed to the time view shown in Figure 26.2b. The first layer is l 0 comprising the hidden states h0 and the input vector x1, both of which are required to compute h1 in layer l 1, which also includes the input vector x2. In turn, both h1 and x2 are required to compute h2 in layer l 2, and so on for other layers. Note also that o1 is not output until layer l 2 since we can compute the output only once h1 has been computed in the previous layer. The layer view is essentially indexed by the hidden state vector index, except the final layer l1thatoutputso thatdependsonh fromlayerl.
Let Ext denote the loss on input vector xt from the input sequence X x1,x2, ,x . The unfolded feedforward RNN for X has l 1 layers, as shown inFigure26.3a.Defineot asthenetgradientvectorfortheoutputvectorot,i.e.,the derivative of the error function Ext with respect to the net value at each neuron in ot , given as
oExt ,Ext ,,Ext T t neto neto neto
t1 t2 tp
whereot ot1,ot2,,otpT Rp isthepdimensionaloutputvectorattimet,andneto
ti is the net value at output neuron oti at time t. Likewise, let ht denote the net gradient

676
Deep Learning
l 0
Wh,bh
l 1
l 2 o1
l 3 o2

l o1
l 1 o
h2

Wh,bh
Wh,bh
l 0
Wh h1
l 1
l 2 o1
o1
l 3 o2
o2

h 1 h1
l o1
o 1
l 1 o
o

a Feedforward step
Wh h2
Wh h
b Backpropagation step Figure 26.3. RNN unfolded as layers.
vectorforthehiddenstateneuronsht attimet
hExt ,Ext ,,Ext T
t neth neth neth t1 t2 tm
where ht ht1,ht2, ,htmT Rm is the mdimensional hidden state vector at time t, and neth is the net value at hidden neuron h at time t. Let fh and fo denote the
ti ti activationfunctionsforthehiddenstateandoutputneurons,andletfth andfto denote
the vector of the derivatives of the activation function with respect to the net signal that is, its argument for the hidden and output neurons at time t, given as
fhneth fhneth fhneth T fh t1 , t2 ,, tm

h0
h1
h 1

h
x1
x2
x 1
x
h1 h1
h2 h2
h h
h0

x1
x2

x 1
x
t neth neth neth t1 t2 tm
Wo,bo
Wi
Wo,bo
Wo,bo
Wi
Wo,bo
Wi
Wi
W o o2
W o o
W o 1o
W o o 1

Recurrent Neural Networks 677
foneto foneto fonetoT fo t1 , t2 ,, tp
Finally, let Ext respect to ot :
denote the vector of partial derivatives of the error function with
Ext Ext ,Ext , , Ext T ot1 ot2 otp
t neto neto neto t1 t2 tp
Computing Net Gradients
The key step in backpropagation is to compute the net gradients in reverse order, starting from the output neurons to the input neurons via the hidden neurons. Given the layer view in Figure 26.3a, the backpropagation step reverses the flow direction for computing the net gradients ot and ht , as shown in the backpropagation graph in Figure 26.3b.
Inparticular,thenetgradientvectorattheoutputot canbecomputedasfollows:
ot f t o E x t 2 6 . 3
where is the elementwise or Hadamard product, For example, if Ext is the squared error function, and the output layer uses the identity function, then via Eq. 25.8 and Eq. 25.15 we have
ot 1 o t y t
On the other hand, the net gradients at each of the hidden layers need to account for the incoming net gradients from ot and from ht1 as seen in Figure 26.3b. Thus, generalizingEq.25.53,thenetgradientvectorforht fort1,2,…,1isgiven as
ht fth WootWhht1 26.4
Notethatforh,itdependsonlyono seeFigure26.3b,therefore h f h W o o
For the tanh activation, which is commonly used in RNNs, the derivative of the activationfunctionseeEq.25.11withrespecttothenetvaluesatht isgivenas
fth 1ht ht
Finally, note that the net gradients do not have to be computed for h0 or for any of the input neurons xt , since these are leaf nodes in the backpropagation graph, and thus do not backpropagate the gradients beyond those neurons.
Stochastic Gradient Descent
The net gradients for the output ot and hidden neurons ht at time t can be used to compute the gradients for the weight matrices and bias vectors at each time point.

678 Deep Learning
However, since an RNN uses parameter sharing across time, the gradients are obtained bysummingupallofthecontributionsfromeachtimestept.DefinetWo andtbo asthe gradientsoftheweightsandbiasesbetweenthehiddenneuronsht andoutputneurons ot fortimet.Usingthebackpropagationequations,Eq.25.47andEq.25.48,for deep multilayer perceptrons, these gradients are computed as follows:
bo tbo ot
t1 t1
Wo
T tWo ht ot
t1 t1
Likewise, the gradients of the other shared parameters between hidden layers ht1 and ht,andbetweentheinputlayerxt andhiddenlayerht,areobtainedasfollows:

ht
Wi tWi xt ht
t1 t1
b h
t1
W h
t1
t1
t1
T h t 1 ht
t b h
T
t W h
where tWh and tbh are the gradient contributions from time t for the weights and biases for the hidden neurons, and tWi the gradient contributions for the weights for the input neurons. Finally, we update all the weight matrices and bias vectors as follows
Wi Wi Wi Wh Wh Wh bh bh bh Wo Wo Wo bo bo bo
where is the gradient step size or learning rate. 26.1.3 Training RNNs
26.5
Algorithm 26.1 shows the pseudocode for learning the weights and biases for an RNN. TheinputscomprisethedatasetDXi,Yii1,,n,thestepsizeforgradientdescent, an integer threshold maxiter denoting the number of iterations for training, the size of the hidden state vectors m, and the activation functions f o and f h for the output and hidden layers. The size of the input d and output p layers is determined directly from D. For simplicity we assume that all inputs Xi have the same length , which determines the number of layers in the RNN. It is relatively easy to handle variable length input sequences by unrolling the RNN for different input lengths i .
The RNN first initializes the weight matrices and bias vectors with random values drawn uniformly from a small range, e.g., 0.01,0.01. The RNN considers each input pair X , Y D, and computes the predicted output ot for each time step via the feedforward process. The backpropagation phase begins by computing the error betweenot andtrueresponseyt,andthenthenetgradientvectorot attheoutputlayer for each time step t. These net gradients are backpropagated from the output layers at time t to the hidden layers at time t , which are in turn used to compute the net gradients ht at the hidden layers for all time steps t 1,2, ,. Note that Line 15 shows the case where the output layer neurons are independent; if they are not independent we can replace it by Fo Ext see Eq. 25.51. Next, we compute the weight gradient

Recurrent Neural Networks 679 Algorithm 26.1: RNN Training: Stochastic Gradient Descent
RNNTRAINING D,,maxiter,m,f o,f h:
Initialize bias vectors
1 bh random mdimensional vector with small values
2 bo random pdimensional vector with small values
Initialize weight matrix
3 Wi random d m matrix with small values
4 Wh random m m matrix with small values
5 Wo random m p matrix with small values
6 r 0 iteration counter
7 repeat
8 foreach X,Y D in random order do
9 X length of training sequence
FeedForward Phase
10 h0 0 Rm initialize hidden state
11 for t 1,2,…, do
12 h t f h W i T x t W hT h t 1 b h
13 o t f o W oT h t b o
Backpropagation Phase
14 for t , 1, . . . , 1 do
15 ot fto Ext net gradients at output
16 h fth Wootnetgradientsath
17 for t 1, 2, , 1 do
18 ht fth WootWhht1netgradientsatht
Gradients of weight matrices and bias vectors
19 bot1ot; Wot1htotT
20 bh t1ht ; Wh t1ht1 ht T; Gradient Descent Step
21 bobobo; WoWoWo
22 bh bh bh; Wh Wh Wh;
23 rr1
24 until r maxiter
Wi t1xt ht T Wi Wi Wi
matrices,Wi,Wh,andWo,andbiasgradientvectors,bh andbo.Thesegradients are used to update the weights and biases via stochastic gradient descent. After each point has been used to update the weights, that completes one epoch of training. The training stops when maxiter epochs have been reached.
Note that, whereas Algorithm 26.1 shows the pseudocode for stochastic gradient descent, in practice, RNNs are trained using subsets or minibatches of input sequences instead of single sequences. This helps to speed up the computation and convergence of gradient descent, since minibatches provide better estimates of the bias and weight gradients and allow the use of vectorized operations.

680
Deep Learning
S
2X4 TS
0B1XP6E7 PV
3V5
T
Figure 26.4. Reber grammar automata.
Example 26.1 RNN. We use an RNN to learn the Reber grammar, which is generated according to the automata shown in Figure 26.4. Let B,E,P,S,T,V,X
denote the alphabet comprising the seven symbols. Further, let
symbol. Starting from the initial node, we can generate strings that follow the Reber grammar by emitting the symbols on the edges. If there are two transitions out of a node, each one can be chosen with equal probability. The sequence B,T,S,S,X,X,T,V,V,E is a valid Reber sequence with the corresponding state sequence 0,1,2,2,2,4,3,3,5,6,7. On the other hand, the sequence B,P,T,X,S,E is not a valid Reber sequence, since there is no edge out of state 3 with the symbol X.
The task of the RNN is to learn to predict the next symbol for each of the positions in a given Reber sequence. For training, we generate Reber sequences from theautomata.LetSX s1,s2,,sbeaRebersequence.Thecorrespondingtrue output Y is then given as the set of next symbols from each of the edges leaving the state corresponding to each position in SX . For example, consider the Reber sequence SX B,P,T,V,V,E, with the state sequence 0,1,3,3,5,6,7. The
, where
the terminal symbol. Here, PT denotes that the next symbol can be either P or T. We can see that SY comprises the sequence of possible next symbols from each of the
desired output sequence is then given as SY PT,TV,TV,PV,E,

states in excluding the start state 0.
To generate the training data for the RNN, we have to convert the symbolic
Reber strings into numeric vectors. We do this via a binary encoding of the symbols, as follows:

denote a terminal
is
B E P S T V X
1,0,0,0,0,0,0T 0,1,0,0,0,0,0T 0,0,1,0,0,0,0T 0,0,0,1,0,0,0T 0,0,0,0,1,0,0T 0,0,0,0,0,1,0T 0,0,0,0,0,0,1T 0,0,0,0,0,0,0T

Recurrent Neural Networks 681
That is, each symbol is encoded by a 7dimensional binary vector, with a 1 in the column corresponding to its position in the ordering of symbols in . The terminal symbol is not part of the alphabet, and therefore its encoding is all 0s. Finally, to encode the possible next symbols, we follow a similar binary encoding with a 1 in the column corresponding to the allowed symbols. For example, the choice PT is encoded as 0,0,1,0,1,0,0T. Thus, the Reber sequence SX and the desired output sequence SY are encoded as:
X
Y
x1 x2 x3 x4 x5 x6
y1 y2 y3 y4 y5 y6

BPTVVE
PT TV TV PV E
B E P S T V X
100000 000001 010000 000000 001000 000110 000000
000000 000010 100100 000000 111000 011100 000000
For training, we generate n 400 Reber sequences with a minimum length of 30. The maximum sequence length is 52. Each of these Reber sequences is used to create a training pair X,Y as described above. Next, we train an RNN with m 4 hidden neurons using tanh activation. The input and ouput layer sizes are determined by the dimensionality of the encoding, namely d 7 and p 7. We use a sigmoid activation at the output layer, treating each neuron as independent. We use the binary cross entropy error function. The RNN is trained for r 10000 epochs, using gradient step size 1 and the entire set of 400 input sequences as the batch size. The RNN model learns the training data perfectly, making no errors in the prediction of the set of possible next symbols.
We test the RNN model on 100 previously unseen Reber sequences with minimum length 30, as before. The RNN makes no errors on the test sequences. On the other hand, we also trained an MLP with a single hidden layer, with size m varying between 4 and 100. Even after r 10000 epochs, the MLP is not able to correctly predict any of the output sequences perfectly. It makes 2.62 mistakes on average per sequence for both the training and testing data. Increasing the number of epochs or the number of hidden layers does not improve the MLP performance.
26.1.4 Bidirectional RNNs
AnRNNmakesuseofahiddenstateht thatdependsontheprevioushiddenstateht1 and the current input xt at time t. In other words, it only looks at information from the past. A bidirectional RNN BRNN, as shown in Figure 26.5, extends the RNN model to also include information from the future. In particular, a BRNN maintains a backward hidden state vector bt Rm that depends on the next backward hidden state bt1 and the current input xt. The output at time t is a function of both ht and bt. In

682
Deep Learning
t 0
t 1
Wbo Who, bo
t 2
Wbo Who, bo
t 1 Wbo Who, bo
Wb,bb Wih Wib Wih
t 1
Wh,bh
o1
o2

t

o 1
o
Wh,bh
Wbo Who, bo Wh,bh
h0
h1

h 1
h
b 1
b1
h2
b2

b 1
b
Wb,bb Wih Wib Wih
Wb,bb
Wib
Wib
x1
x2

Figure 26.5. Bidirectional RNN: Unfolded in time.
particular, we compute the forward and backward hidden state vectors as follows:
x 1
x
hfhWTxWTh b t ihtht1h
bt fbWTxt WTbt1 bb ib b
26.6
Also, a BRNN needs two initial state vectors h0 and b1 to compute h1 and b, respectively. These are usually set to 0 Rm. The forward and backward hidden states are computed independently, with the forward hidden states computed by considering the input sequence in the forward direction x1,x2, ,x , and with the backward hidden states computed by considering the sequence in reverse order x,x1,,x1. The output at time t is computed only when both ht and bt are available, and is given as
ot foWT ht WT bt bo ho bo
It is clear that BRNNs need the complete input before they can compute the output. We can also view a BRNN as having two sets of input sequences, namely the forward input sequence X x1,x2,,x and the reversed input sequence Xr x,x1,…,x1,withthecorrespondinghiddenstatesht andbt,whichtogether determine the output ot . Thus, a BRNN is comprised of two stacked RNNs with independent hidden layers that jointly determine the output.
26.2 GATED RNNS: LONG SHORTTERM MEMORY NETWORKS
One of the problems in training RNNs is their susceptibility to either the vanishing gradient or the exploding gradient problem. For example, consider the task of computing the net gradient vector ht for the hidden layer at time t, given as
ht fthWootWhht1

Gated RNNS: Long ShortTerm Memory Networks 683
Assume for simplicity that we use a linear activation function, i.e., fth 1, and let us ignore the net gradient vector for the output layer, focusing only on the dependence on the hidden layers. Then for an input sequence of length , we have
hWh WWh W2h Wth t h t1 h h t2 h t2 h
Inparticular,forexample,attimet1,wehaveh W1h.Wecanobservethat 1h
the net gradient from time affects the net gradient vector at time t as a function of Wt, i.e., as powers of the hidden weight matrix W . Let the spectral radius of W ,
hhh defined as the absolute value of its largest eigenvalue, be given as 1. It turns out that if 1 1, then Whk 0 as k , that is, the gradients vanish as we train on long sequences. On the other hand, if 1 1, then at least one element of Whk becomes unbounded and thus Whk as k , that is, the gradients explode as we train on long sequences. To see this more clearly, let the eigendecomposition of the square mmmatrixWh begivenas
1 0 W U0 2
0
0U1
… . m
where 1 2 m are the eigenvalues of Wh, and U is the matrix comprising the corresponding eigenvectors, u1,u2, ,um, as columns. Thus, we have
0
0 U1
… . km
It is clear that the net gradients scale according to the eigenvalues of Wh. Therefore, if11,then1k 0ask,andsince1iforalli1,2,,m,then necessarily i k 0 as well. That is, the gradients vanish. On the other hand, if 1 1, then 1 k as k , and the gradients explode. Therefore, for the error to neither vanish nor explode, the spectral radius of Wh should remain 1 or very close to it.
Long shortterm memory LSTM networks alleviate the vanishing gradients problem by using gate neurons to control access to the hidden states. Consider the mdimensional hidden state vector ht Rm at time t. In a regular RNN, we update the hidden state as follows as per Eq. 26.1:
h t f h W i T x t W hT h t 1 b h
Let g 0, 1m be a binary vector. If we take the elementwise product of g and ht , namely, g ht , then elements of g act as gates that either allow the corresponding element of ht to be retained or set to zero. The vector g thus acts as logical gate that allows selected elements of ht to be remembered or forgotten. However, for backpropagation we need differentiable gates, for which we use sigmoid activation on the gate neurons so that their value lies in the range 0,1. Like a logical gate, such neurons allow the inputs to be completely remembered if the value is 1, or
h . . 0 0
k1 0 WkU0k2
h . . 0 0

684 Deep Learning forgotten if the value is 0. In addition, they allow a weighted memory, allowing partial
remembrance of the elements of ht , for values between 0 and 1.
Example26.2DifferentiableGates. Asanexample,considerahiddenstatevector ht 0.94 1.05 0.39 0.97 0.90T
First consider a logical gate vector
g0 1 1 0 1T
Their elementwise product gives
ght 0 1.05 0.39 0 0.90T
We can see that the first and fourth elements have been forgotten. Now consider a differentiable gate vector
g0.1 0 1 0.9 0.5T The elementwise product of g and ht gives
g ht 0.094 0 0.39 0.873 0.45T
Now, only a fraction specified by an element of g is retained as a memory after the elementwise product.
26.2.1 Forget Gate
To see how gated neurons work, we consider an RNN with a forget gate. Let ht Rm be the hidden state vector, and let t Rm be a forget gate vector. Both these vectors have the same number of neurons, m.
In a regular RNN, assuming tanh activation, the hidden state vector is updated unconditionally, as follows:
h t t a n h W i T x t W hT h t 1 b h
Instead of directly updating ht , we will employ the forget gate neurons to control how much of the previous hidden state vector to forget when computing its new value, and also to control how to update it in light of the new input xt .
Figure26.6showsthearchitectureofanRNNwithaforgetgate.Giveninputxt and previous hidden state ht 1 , we first compute a candidate update vector ut , as follows:
ut tanhWTxt WT ht1 bu 26.7 u hu
The candidate update vector ut is essentially the unmodified hidden state vector, as in a regular RNN.

Gated RNNS: Long ShortTerm Memory Networks
685
ot
Wo,bo
ht
Wh 1
1
1t
1Whu
t
ut
W,b
Wu,bu
Figure 26.6. RNN with a forget gate t. Recurrent connections shown in gray, forget gate shown doublelined. denotes elementwise product.
Using the forget gate, we can compute the new hidden state vector as follows:
ht t ht1 1tut 26.8
Here is the elementwise product operation. We can see that the new hidden state vector retains a fraction of the previous hidden state values, and a complementary fraction of the candidate update values. Observe that if t 0, i.e., if we want to entirely forgettheprevioushiddenstate,then1t 1,whichmeansthatthehiddenstatewill be updated completely at each time step just like in a regular RNN. Finally, given the hiddenstateht wecancomputetheoutputvectorot asfollows
o t f o W oT h t b o
How should we compute the forget gate vector t ? It makes sense to base it on
xt
both the previous hidden state and the new input value, so we compute it as follows: WTx WT h b 26.9
where we use a sigmoid activation function, denoted , to ensure that all the neuron values are in the range 0, 1, denoting the extent to which the corresponding previous hidden state values should be forgotten.
To summarize, a forget gate vector t is a layer that depends on the previous hidden state layer ht1 and the current input layer xt; these connections are fully connected, and are specified by the corresponding weight matrices Wh and W , and the bias vector b. On the other hand, the output of the forget gate layer t needs to modify the previous hidden state layer ht1, and therefore, both t and ht1 feed into what is essentially a new elementwise product layer, denoted by in Figure 26.6.
t tht1

686 Deep Learning
Finally, the output of this elementwise product layer is used as input to the new hidden layerht thatalsotakesinputfromanotherelementwisegatethatcomputestheoutput fromthecandidateupdatevectorut andthecomplementedforgetgate,1t.Thus, unlike regular layers that are fully connected and have a weight matrix and bias vector between the layers, the connections between t and ht via the elementwise layer are all onetoone, and the weights are fixed at the value 1 with bias 0. Likewise the connections between ut and ht via the other elementwise layer are also onetoone, with weights fixed at 1 and bias at 0.
Example 26.3. Let m 5. Assume that the previous hidden state vector and the candidate update vector are given as follows:
ht1 0.94 1.05 0.39 0.97 0.9T ut 0.5 2.5 1.0 0.5 0.8T Let the forget gate and its complement be given as follows:
t 0.9 1 0 0.1 0.5T 1t 0.1 0 1 0.9 0.5T
The new hidden state vector is then computed as the weighted sum of the previous hidden state vector and the candidate update vector:
ht t ht1 1tut
0.9 1 0 0.1 0.1 0 1 0.9
0.5T 0.94 1.05 0.39 0.97 0.9T
0.5T 0.5 2.5 1.0 0.5 0.8T
0 0.097 0.45T 0.05 0 1.0 0.45 0.40T
0.846 1.05
0.796 1.05 1.0 0.353 0.85T
Computing Net Gradients
It is instructive to compute the net gradients for an RNN with a forget gate, since a similar approach is used to compute the net gradients when training an LSTM network. An RNN with a forget gate has the following parameters it needs to learn, namely the weight matrices Wu, Whu, W, Wh, and Wo, and the bias vectors bu, b and bo. The computation of the hidden state vector ht adds together the inputs from the new elementwise layer that multiplies its incoming edges to compute the net input as opposed to computing a weighted sum. We will look at how to account for the elementwise layers during backpropagation.
Figure 26.7 shows a forget gate RNN unfolded in time for two time steps. Let ot , ht , t , and ut denote the net gradient vectors at the output, hidden, forget gate, and candidate update layers, respectively. During backpropagation, we need to compute the net gradients at each layer. The net gradients at the outputs are computed by

Gated RNNS: Long ShortTerm Memory Networks 687
o1
Wo,bo Wo,bo

h0
h1
h2
1
11 12
u1
2
u2
Wh Whu
Wh
Whu
x1
W,b
Figure 26.7. RNN with a forget gate unfolded in time recurrent connections in gray.
considering the partial derivatives of the activation function fto and the error function Ext :
ot f t o E x t
For the other layers, we can reverse all the arrows to determine the dependencies
between the layers. Therefore, to compute the net gradient for the update layer
ut , notice that in backpropagation it has only one incoming edge from ht via the
elementwise product 1 u . The net gradient u at update layer neuron i at tt ti

time t is given as
u x x ti ti h 1 1u2
Wu,bu
W,b
Wu,bu
E E neth u
ti netu neth uti netu ti ti ti
x2
ti ti ti neth
ti ti ht1,i 1tiuti 1 ti, and we use the fact that the uti uti
where
update layer uses a tanh activation function. Across all neurons, we obtain the net gradient at ut as follows:
ut ht 1t1ut ut
To compute the net gradient vector for the forget gate, we observe from
Figure 26.7 that there are two incoming flows into t during backpropagation one
from ht via the elementwise product t ht1, and the other also from ht via the
elementwise product 1 u . Therefore, the net gradient at forget gate neuron tt ti
i at time t is given as
E E neth
x x ti ti h h u 1
ti net neth ti net ti t1,i ti ti ti ti ti ti
o2

688 Deep Learning neth
where ti ti ht1,i 1tiuti ht1,i uti, and we use the fact that the ti ti
forget gate uses a sigmoid activation function. Across all neurons, we obtain the net gradientatt asfollows:
t ht ht1 utt 1t
Finally, let us consider how to compute ht , the net gradient at the hidden layer at time t. In Figure 26.7 we can observe that if we reverse the arrows, ht depends on the gradients at the output layer ot, the forget gate layer t1, the update layer ut1, and on the hidden layer ht1 via the elementwise product ht1 t1. The output, forget and update layers are treated as in a regular RNN. However, due to the elementwise layer, the flow from ht1 is handled as follows:
E neth h
xt t1,i ti h 1h
neth hti neth t1,i t1,i t1,i t1,i t1,i ti
neth
t1,i t1,i hti 1t1,iut1,i t1,i, and we used the fact that
where
ht implicitly uses an identity activation function. Across all the hidden neurons at time t, the net gradient vector component from ht1 is given as ht1 t1. Considering all the layers, including the output, forget, update and elementwise layers, the complete net gradient vector at the hidden layer at time t is given as:
ht Woot Wht1 Whuut1 ht1 t1
Given the net gradients, we can compute the gradients for all the weight matrices and bias vectors in a manner similar to that outlined for a regular RNN in Section 26.1.2. Likewise, stochastic gradient descent can be used to train the network.
26.2.2 Long ShortTerm Memory LSTM Networks
We now describe LSTMs, which use differentiable gate vectors to control the hidden state vector ht , as well as another vector ct Rm called the internal memory vector. In particular, LSTMs utilize three gate vectors: an input gate vector t Rm, a forget gate vector t Rm, and an output gate vector t Rm, as illustrated in Figure 26.8, which shows the architecture of an LSTM network. Like a regular RNN, an LSTM also maintains a hidden state vector for each time step. However, the content of the hidden vector is selectively copied from the internal memory vector via the output gate, with the internal memory being updated via the input gate and parts of it forgotten via the forget gate.
Let X x1,x2, ,x denote a sequence of ddimensional input vectors of length , Y y1,y2, ,y the sequence of pdimensional response vectors, and O o1 , o2 , , o the pdimensional output sequence from an LSTM. At each time step t, the three gate vectors are updated as follows:
26.10
hti hti
t WTxt WT ht1 b h
WTx WT h b t tht1
tWTxtWT ht1b h

Gated RNNS: Long ShortTerm Memory Networks
689
1 Wh
1
1
Wo,bo

tanh
1
Wh
Whu
Wh
1

Wu,bu
W,b
W,b
W,b
Figure 26.8. LSTM neural network. Recurrent connections shown in gray, gate layers shown doublelined.
Here denotes the sigmoid activation function. We can observe that each gate is a function of the input vector xt at time t, as well as the hidden state ht1 from the previous time step. Each gate vector has a corresponding weight matrix from the input neurons to the gate neurons, and from the hidden state neurons to the gate neurons, as well as a corresponding bias vector. Each of the gate vectors conceptually plays a different role in an LSTM network. The input gate vector t controls how much of the input vector, via the candidate update vector ut , is allowed to influence the memory vectorct.Theforgetgatevectort controlshowmuchofthepreviousmemoryvector to forget, and finally the output gate vector t controls how much of the memory state is retained for the hidden state.
Given the current input xt and the previous hidden state ht1, an LSTM first computesacandidateupdatevectorut afterapplyingthetanhactivation:
ut tanhWTxt WT ht1 bu 26.11 u hu
It then then applies the different gates to compute the internal memory and hidden state vectors:
ct t ut t ct1 ht t tanhct
26.12
ot
ht
ct
t
t
ut
t
xt

690 Deep Learning
The memory vector ct at time t depends on the current update vector ut and the previous memory ct1. However, the input gate t controls the extent to which ut influences ct, and the forget gate t controls how much of the previous memory is forgotten.Ontheotherhand,thehiddenstateht dependsonatanhactivatedinternal memory vector ct , but the output gate t controls how much of the internal memory is reflected in the hidden state. Besides the input vectors xt , the LSTM also needs an initial hidden state vector h0 and an initial memory state vector c0, both typically set to 0Rm.
Finally, the output of the network ot is obtained by applying the output activation function f o to an affine combination of the hidden state neuron values:
o t f o W oT h t b o
LSTMs can typically handle long sequences since the net gradients for the internal memory states do not vanish over long time steps. This is because, by design, the memory state ct1 at time t 1 is linked to the memory state ct at time t via implicit weights fixed at 1 and biases fixed at 0, with linear activation. This allows the error to flow across time steps without vanishing or exploding.
LSTMs can be trained just like regular RNNs by unfolding the layers in time, as illustrated in Figure 26.9, which shows the unfolded layers for two time steps. The first step in training is to use the feedforward steps to compute the error, followed by the backpropagation of the gradients as a second step. The latter has to be modified to accommodatetheelementwiseoperationsusedtoupdatethememorystatect andthe hidden state ht. The connections from ct1 to ct starting from c0 to c, which can be thought of as using unit weight matrices and zero biases, appear as a straight line in the figure indicating that the internal memory state can flow across longer periods of time without the gradients vanishing or exploding.
26.2.3 Training LSTMs
Consider the unfolded LSTM in Figure 26.9. During backpropagation the net gradient vector at the output layer at time t is computed by considering the partial derivatives of the activation function, fto, and the error function, Ext , as follows:
ot f t o E x t
where we assume that the output neurons are independent.
In backpropagation there are two incoming connections to the internal memory
vector ct, one from ht and the other from ct1. Therefore, the net gradient c at the ti
internal memory neuron i at time t is given as
E E neth c E netc c
c x x ti ti x t1,i ti ti netc neth cti netc netc cti netc
ti ti
ti t1,i ti
h 1c2c
ti ti ti t1,i t1,i

Gated RNNS: Long ShortTerm Memory Networks 691
o1
o2
Wo,bo Wo,bo

tanh tanh
h0
h1
h2
Wh Wh Whu
Wh Wh Wh Whu Wh
c0
c1
c2
1
1
u1
1
2
2
u2
2
W,b

W,b Wu,bu W,b W,b W,b Wu,bu
W,b
x1
Figure 26.9. LSTM neural network unfolded in time recurrent connections in gray.
where we use the fact that the internal memory vector implicitly uses an identity activation function, and furthermore
neth
ti

cti ti
ti tanhctiti1c2 netc
x2
cti
t1,i t1,i ut1,i t1,i cti t1,i cti cti
The net gradient vector ct at ct is therefore given as:
ct ht t1ctctct1t1
The forget gate has only one incoming edge in backpropagation, from ct, via the elementwise multiplication t ct1, with sigmoid activation, therefore the net gradient is:
E E netc
x x ti ti cc 1
net ti t1,i ti ti ti ti ti
ti net netc ti
where we used the fact that the forget gate uses sigmoid activation and

∂net^c_{t,i}/∂φ_{t,i} = ∂(κ_{t,i}·u_{t,i} + φ_{t,i}·c_{t−1,i})/∂φ_{t,i} = c_{t−1,i}

Across all forget gate neurons, the net gradient vector is therefore given as

δ^φ_t = δ^c_t ⊙ c_{t−1} ⊙ φ_t ⊙ (1 − φ_t)
The input gate also has only one incoming edge in backpropagation, from c_t, via the elementwise multiplication κ_t ⊙ u_t, with sigmoid activation. In a similar manner as outlined above for δ^φ_t, the net gradient δ^κ_t at the input gate κ_t is given as:

δ^κ_t = δ^c_t ⊙ u_t ⊙ κ_t ⊙ (1 − κ_t)
The same reasoning applies to the update candidate u_t, which also has an incoming edge from c_t via κ_t ⊙ u_t and tanh activation, so the net gradient vector δ^u_t at the update layer is

δ^u_t = δ^c_t ⊙ κ_t ⊙ (1 − u_t ⊙ u_t)
Likewise, in backpropagation, there is one incoming connection to the output gate from h_t via ω_t ⊙ tanh(c_t), with sigmoid activation; therefore

δ^ω_t = δ^h_t ⊙ tanh(c_t) ⊙ ω_t ⊙ (1 − ω_t)
Finally, to compute the net gradients at the hidden layer, note that gradients flow back to h_t from the following layers: u_{t+1}, κ_{t+1}, φ_{t+1}, ω_{t+1} and o_t. Therefore, the net gradient vector at the hidden state vector h_t is given as

δ^h_t = W_o·δ^o_t + W_{hκ}·δ^κ_{t+1} + W_{hφ}·δ^φ_{t+1} + W_{hω}·δ^ω_{t+1} + W_{hu}·δ^u_{t+1}
The gradients for the weight matrix and bias vector at the output layer are given as:

∇_{b_o} = Σ_{t=1}^{τ} δ^o_t    ∇_{W_o} = Σ_{t=1}^{τ} h_t·(δ^o_t)^T

Likewise, the gradients for the weight matrices and bias vectors for the other layers are given as follows:

∇_{b_κ} = Σ_{t=1}^{τ} δ^κ_t    ∇_{b_φ} = Σ_{t=1}^{τ} δ^φ_t    ∇_{b_ω} = Σ_{t=1}^{τ} δ^ω_t    ∇_{b_u} = Σ_{t=1}^{τ} δ^u_t

∇_{W_κ} = Σ_{t=1}^{τ} x_t·(δ^κ_t)^T    ∇_{W_φ} = Σ_{t=1}^{τ} x_t·(δ^φ_t)^T    ∇_{W_ω} = Σ_{t=1}^{τ} x_t·(δ^ω_t)^T    ∇_{W_u} = Σ_{t=1}^{τ} x_t·(δ^u_t)^T

∇_{W_{hκ}} = Σ_{t=1}^{τ} h_{t−1}·(δ^κ_t)^T    ∇_{W_{hφ}} = Σ_{t=1}^{τ} h_{t−1}·(δ^φ_t)^T    ∇_{W_{hω}} = Σ_{t=1}^{τ} h_{t−1}·(δ^ω_t)^T    ∇_{W_{hu}} = Σ_{t=1}^{τ} h_{t−1}·(δ^u_t)^T
Given these gradients, we can use the stochastic gradient descent approach to train the network.
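As a minimal illustration of the backpropagation formulas above, the sketch below computes the per-time-step net gradients at the memory, gate and update layers from δ^h_t and the cached forward values. The function name and calling convention are assumptions made for this example only.

```python
import numpy as np

def lstm_gate_gradients(delta_h, delta_c_next, phi_next, c_t, c_prev,
                        kappa, phi, omega, u):
    """Net gradients at one time step, mirroring the formulas above.
    delta_h: net gradient at the hidden state h_t (already accumulated from
    the output layer and the four gate layers at time t+1)."""
    tanh_c = np.tanh(c_t)
    # internal memory: contributions from h_t and from c_{t+1}
    delta_c = delta_h * omega * (1.0 - tanh_c ** 2) + delta_c_next * phi_next
    # forget, input, update and output gates
    delta_phi   = delta_c * c_prev * phi * (1.0 - phi)
    delta_kappa = delta_c * u * kappa * (1.0 - kappa)
    delta_u     = delta_c * kappa * (1.0 - u ** 2)
    delta_omega = delta_h * tanh_c * omega * (1.0 - omega)
    return delta_c, delta_phi, delta_kappa, delta_u, delta_omega
```

Summing the outer products of these per-step gradients with x_t and h_{t−1} over all time steps then yields the weight and bias gradients listed above.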
Example 26.4 (LSTM). We use an LSTM to learn the embedded Reber grammar, which is generated according to the automata shown in Figure 26.10. This automata has two copies of the Reber automata from Example 26.1. From the state s1, the top automata is reached by following the edge labeled T, whereas the bottom automata is reached via the edge labeled P. The states of the top automata are labeled as t0, t1, ..., t7, whereas the states of the bottom automata are labeled as p0, p1, ..., p7. Finally, note that the state e0 can be reached from either the top or the bottom automata by following the edges labeled T and P, respectively.
Figure 26.10. Embedded Reber grammar automata.
The first symbol is always B and the last symbol is always E. However, the important point is that the second symbol is always the same as the second last symbol, and thus any sequence learning model has to learn this long range dependency. For example, the following is a valid embedded Reber sequence: B,T,B,T,S,S,X,X,T,V,V,E,T,E.
The task of the LSTM is to learn to predict the next symbol for each of the positions in a given embedded Reber sequence. For training, we generate n 400 embedded Reber sequences with a minimum length of 40, and convert them into training pairs X,Y using the binary encoding described in Example 26.1. The maximum sequence length is 64.
Given the long range dependency, we used an LSTM with m = 20 hidden neurons (smaller values of m either need more epochs to learn, or have trouble learning the grammar). The input and output layer sizes are determined by the dimensionality of the encoding, namely d = 7 and p = 7. We use sigmoid activation at the output layer, treating each neuron as independent. Finally, we use the binary cross-entropy error function. The LSTM is trained for r = 10000 epochs using step size 1 and batch size 400; it learns the training data perfectly, making no errors in the prediction of the set of possible next symbols.
We test the LSTM model on 100 previously unseen embedded Reber sequences with minimum length 40, as before. The trained LSTM makes no errors on the test sequences. In particular, it is able to learn the long range dependency between the second symbol and the second last symbol, which must always match.
The embedded Reber grammar was chosen since an RNN has trouble learning the long range dependency. Using an RNN with m = 60 hidden neurons, trained for r = 25000 epochs with a step size of 1, the RNN can perfectly learn the training sequences. That is, it makes no errors on any of the 400 training sequences. However,
on the test data, this RNN makes a mistake in 40 out of the 100 test sequences. In fact, in each of these test sequences it makes exactly one error; it fails to correctly predict the second last symbol. These results suggest that while the RNN is able to memorize the long range dependency in the training data, it is not able to generalize completely on unseen test sequences.
26.3 CONVOLUTIONAL NEURAL NETWORKS
A convolutional neural network (CNN) is essentially a localized and sparse feedforward MLP that is designed to exploit spatial and/or temporal structure in the input data. In a regular MLP all of the neurons in layer l are connected to all of the neurons in layer l+1. In contrast, a CNN connects a contiguous or adjacent subset of neurons in layer l to a single neuron in the next layer l+1. Different sliding windows comprising contiguous subsets of neurons in layer l connect to different neurons in layer l+1. Furthermore, all of these sliding windows use parameter sharing, that is, the same set of weights, called a filter, is used for all sliding windows. Finally, different filters are used to automatically extract features from layer l for use by layer l+1.
26.3.1 Convolutions
We begin by defining the convolution operation for oneway, twoway and threeway inputs. By oneway we mean data in the form of a single vector, by twoway we mean data in the form of a matrix, and by threeway we mean data in the form of a tensor. We also call them 1D, 2D or 3D inputs where the dimensionality refers to the number of axes in the input data. We will discuss the convolution operation in the context of the input layer and the first hidden layer, but the same approach can be applied to subsequent layers in the network.
1D Convolution
Let x = (x1, x2, ..., xn)^T be an input vector (a one-way or 1D input) with n points. It is assumed that the input points x_i are not independent, but rather, there are dependencies between successive points. Let w = (w1, w2, ..., wk)^T be a vector of weights, called a 1D filter, with k ≤ n. Here k is also called the window size. Let x_k(i) denote the window of x of length k starting at position i, given as

x_k(i) = (x_i, x_{i+1}, x_{i+2}, ..., x_{i+k−1})^T

with 1 ≤ i ≤ n − k + 1. Given a vector a ∈ R^k, define the summation operator as one that adds all the elements of the vector. That is,

sum(a) = Σ_{i=1}^{k} a_i

A 1D convolution between x and w, denoted by the asterisk symbol *, is defined as

x * w = ( sum(x_k(1) ⊙ w), ..., sum(x_k(n−k+1) ⊙ w) )^T
where ⊙ is the elementwise product, so that

sum(x_k(i) ⊙ w) = Σ_{j=1}^{k} x_{i+j−1}·w_j    (26.13)

for i = 1, 2, ..., n − k + 1. We can see that the convolution of x ∈ R^n and w ∈ R^k results in a vector of length n − k + 1.

Figure 26.11. 1D convolution: (a)-(e) show the convolution between different sliding windows of x and the filter w with window size k = 3. The final convolution output is shown in (e).
Example 26.5 (1D Convolution). Figure 26.11 shows a vector x = (1, 3, −1, 2, 3, 1, −2)^T with n = 7 and a filter w = (1, 0, 2)^T with window size k = 3. The first window of x of size 3 is x_3(1) = (1, 3, −1)^T. Therefore, as seen in Figure 26.11(a), we have

sum(x_3(1) ⊙ w) = sum((1, 3, −1)^T ⊙ (1, 0, 2)^T) = sum((1, 0, −2)^T) = −1

The convolution steps for the different sliding windows of x with the filter w are shown in Figures 26.11(a)-(e). The convolution x * w has size n − k + 1 = 7 − 3 + 1 = 5, and is given as

x * w = (−1, 7, 5, 4, −1)^T
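A short NumPy sketch of Eq. (26.13). The values of x below are those used in Example 26.5 (reconstructed from Figure 26.11), so the printed result should match the convolution above.

```python
import numpy as np

def conv1d(x, w):
    """1D convolution of Eq. (26.13): slide a window of size k over x and
    sum the elementwise product with the filter w."""
    n, k = len(x), len(w)
    return np.array([np.sum(x[i:i + k] * w) for i in range(n - k + 1)])

x = np.array([1, 3, -1, 2, 3, 1, -2])
w = np.array([1, 0, 2])
print(conv1d(x, w))   # [-1  7  5  4 -1]
```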
2D Convolution
We can extend the convolution operation to matrix input, for example for images. Let X be an n×n input matrix, and let W be a k×k matrix of weights, called a 2D filter, with k ≤ n. Here k is called the window size. Let X_k(i,j) denote the k×k submatrix of X starting at row i and column j, defined as follows:

X_k(i,j) =
| x_{i,j}        x_{i,j+1}        ...  x_{i,j+k−1}      |
| x_{i+1,j}      x_{i+1,j+1}      ...  x_{i+1,j+k−1}    |
| ...                                                    |
| x_{i+k−1,j}    x_{i+k−1,j+1}    ...  x_{i+k−1,j+k−1}  |

with 1 ≤ i, j ≤ n − k + 1. Given a k×k matrix A ∈ R^{k×k}, define the summation operator as one that adds all the elements of the matrix. That is,

sum(A) = Σ_{i=1}^{k} Σ_{j=1}^{k} a_{i,j}

where a_{i,j} is the element of A at row i and column j. The 2D convolution of X and W, denoted X * W, is defined as:

X * W =
| sum(X_k(1,1) ⊙ W)        ...  sum(X_k(1, n−k+1) ⊙ W)       |
| sum(X_k(2,1) ⊙ W)        ...  sum(X_k(2, n−k+1) ⊙ W)       |
| ...                                                         |
| sum(X_k(n−k+1,1) ⊙ W)    ...  sum(X_k(n−k+1, n−k+1) ⊙ W)   |

where ⊙ is the elementwise product of X_k(i,j) and W, so that

sum(X_k(i,j) ⊙ W) = Σ_{a=1}^{k} Σ_{b=1}^{k} x_{i+a−1, j+b−1}·w_{a,b}    (26.14)

for i, j = 1, 2, ..., n − k + 1. The convolution of X ∈ R^{n×n} and W ∈ R^{k×k} results in an (n−k+1)×(n−k+1) matrix.
Example 26.6 (2D Convolution). Figure 26.12 shows a 4×4 matrix X (n = 4) and a 2×2 filter W (window size k = 2), where

X = | 1 2 2 1 |       W = | 1 0 |
    | 3 1 4 2 |           | 0 1 |
    | 2 1 3 4 |
    | 1 2 3 1 |

The convolution of the first window of X, namely X_2(1,1), with W is given as (see Figure 26.12(a))

sum(X_2(1,1) ⊙ W) = sum( |1 2; 3 1| ⊙ |1 0; 0 1| ) = sum( |1 0; 0 1| ) = 1 + 0 + 0 + 1 = 2

The convolution steps for the different 2×2 sliding windows of X with the filter W are shown in Figures 26.12(a)-(i). The convolution X * W has size 3×3, since n − k + 1 = 4 − 2 + 1 = 3, and is given as

X * W = | 2 6 4 |
        | 4 4 8 |
        | 4 4 4 |
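The 2D convolution of Eq. (26.14) can be sketched directly, using the X and W of Example 26.6.

```python
import numpy as np

def conv2d(X, W):
    """2D convolution of Eq. (26.14): sum of the elementwise product of each
    k x k sliding window of X with the filter W."""
    n, k = X.shape[0], W.shape[0]
    out = np.empty((n - k + 1, n - k + 1))
    for i in range(n - k + 1):
        for j in range(n - k + 1):
            out[i, j] = np.sum(X[i:i + k, j:j + k] * W)
    return out

X = np.array([[1, 2, 2, 1],
              [3, 1, 4, 2],
              [2, 1, 3, 4],
              [1, 2, 3, 1]])
W = np.array([[1, 0],
              [0, 1]])
print(conv2d(X, W))   # [[2. 6. 4.] [4. 4. 8.] [4. 4. 4.]]
```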
3D Convolution
We now extend the convolution operation to a three-dimensional matrix, which is also called a 3D tensor. The first dimension comprises the rows, the second the columns, and the third the channels. Let X be an n×n×m tensor, with n rows, n columns and m channels. The assumption is that the input X is a collection of n×n matrices obtained by applying m filters, which specify the m channels. For example, for n×n image inputs, each channel may correspond to a different color filter (red, green or blue).
Figure 26.12. 2D convolution: (a)-(i) show the 2D convolution between different 2×2 sliding windows of X and the filter W. The final 2D convolution output is shown in (i).
Figure 26.13. 3D subtensor X_k(i,j,q): the k×k×r subtensor of X starting at row i, column j, and channel q.
Let W be a k×k×r tensor of weights, called a 3D filter, with k ≤ n and r ≤ m. Let X_k(i,j,q) denote the k×k×r subtensor of X starting at row i, column j and channel q, as illustrated in Figure 26.13, with 1 ≤ i, j ≤ n − k + 1, and 1 ≤ q ≤ m − r + 1. Given a k×k×r tensor A ∈ R^{k×k×r}, define the summation operator as one that adds all the elements of the tensor. That is,

sum(A) = Σ_{i=1}^{k} Σ_{j=1}^{k} Σ_{q=1}^{r} a_{i,j,q}
where a_{i,j,q} is the element of A at row i, column j, and channel q. The 3D convolution of X and W, denoted X * W, is the (n−k+1)×(n−k+1)×(m−r+1) tensor whose (i,j,q)-th element is sum(X_k(i,j,q) ⊙ W), where ⊙ is the elementwise product of X_k(i,j,q) and W, so that

sum(X_k(i,j,q) ⊙ W) = Σ_{a=1}^{k} Σ_{b=1}^{k} Σ_{c=1}^{r} x_{i+a−1, j+b−1, q+c−1}·w_{a,b,c}    (26.15)

for i, j = 1, 2, ..., n − k + 1 and q = 1, 2, ..., m − r + 1. We can see that the convolution of X ∈ R^{n×n×m} and W ∈ R^{k×k×r} results in an (n−k+1)×(n−k+1)×(m−r+1) tensor.
3D Convolutions in CNNs    Typically in CNNs, we use a 3D filter W of size k×k×m, with the number of channels r = m, the same as the number of channels in X ∈ R^{n×n×m}. Let X_k(i,j) be the k×k×m subtensor of X starting at row i and column j. Then the 3D convolution of X and W is the (n−k+1)×(n−k+1) matrix whose (i,j)-th element is sum(X_k(i,j) ⊙ W). We can see that when W ∈ R^{k×k×m}, its 3D convolution with X ∈ R^{n×n×m} results in an (n−k+1)×(n−k+1) matrix, since there is no freedom to move in the third dimension. Henceforth, we will always assume that a 3D filter W ∈ R^{k×k×m} has the same number of channels as the tensor X on which it is applied. Since the number of channels is fixed based on X, the only parameter needed to fully specify W is the window size k.
Example 26.7 (3D Convolution). Figure 26.14 shows a 3×3×3 tensor X with n = 3 and m = 3, and a 2×2×3 filter W with window size k = 2 and r = 3.
Figure 26.14. 3D convolution: (a)-(d) show the convolution between different 2×2×3 sliding windows of X and the filter W. The final 3D convolution output is shown in (d).
The convolution of the first window of X, namely X_2(1,1), with W is computed by summing the elementwise products over all three channels (see Figure 26.14(a)), which gives sum(X_2(1,1) ⊙ W) = 5. The convolution steps for the different 2×2×3 sliding windows of X with the filter W are shown in Figures 26.14(a)-(d). The convolution X * W has size 2×2, since n − k + 1 = 3 − 2 + 1 = 2 and r = m = 3; it is given as

X * W = | 5  11 |
        | 15  5 |
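A sketch of the 3D convolution of Eq. (26.15) for the CNN case r = m. The random tensor below only matches the sizes of Example 26.7; the actual entries of Figure 26.14 are not reproduced here.

```python
import numpy as np

def conv3d(X, W):
    """3D convolution for a filter with r = m channels: each k x k x m
    subtensor of X is multiplied elementwise with W and summed, giving an
    (n-k+1) x (n-k+1) matrix (Eq. (26.15) with q fixed at 1)."""
    n, m = X.shape[0], X.shape[2]
    k = W.shape[0]
    assert W.shape[2] == m, "filter must have the same number of channels as X"
    out = np.empty((n - k + 1, n - k + 1))
    for i in range(n - k + 1):
        for j in range(n - k + 1):
            out[i, j] = np.sum(X[i:i + k, j:j + k, :] * W)
    return out

# A 3 x 3 x 3 tensor and a 2 x 2 x 3 filter, as in Example 26.7.
rng = np.random.default_rng(1)
X = rng.integers(-2, 5, size=(3, 3, 3))
W = rng.integers(-1, 2, size=(2, 2, 3))
print(conv3d(X, W).shape)   # (2, 2)
```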
26.3.2 Bias and Activation Functions
We discuss the role of bias neurons and activation functions in the context of a tensor of neurons at layer l. Let Z^l be an nl×nl×ml tensor of neurons at layer l, so that z^l_{i,j,q} denotes the value of the neuron at row i, column j and channel q for layer l, with 1 ≤ i, j ≤ nl and 1 ≤ q ≤ ml.
Filter Bias
Let W be a k×k×ml 3D filter. Recall that when we convolve Z^l and W, we get an (nl−k+1)×(nl−k+1) matrix at layer l+1. However, so far, we have ignored the role of the bias term in the convolution. Let b ∈ R be a scalar bias value for W, and let Z^l_k(i,j) denote the k×k×ml subtensor of Z^l at position (i,j). Then the net signal at neuron z^{l+1}_{i,j} in layer l+1 is given as

net^{l+1}_{i,j} = sum(Z^l_k(i,j) ⊙ W) + b

and the value of the neuron z^{l+1}_{i,j} is obtained by applying some activation function f to the net signal

z^{l+1}_{i,j} = f( sum(Z^l_k(i,j) ⊙ W) + b )

The activation function can be any of the ones typically used in neural networks, for example, identity, sigmoid, tanh, ReLU and so on. In the language of convolutions, the values of the neurons in layer l+1 are given as follows:

Z^{l+1} = f(Z^l * W ⊕ b)

where ⊕ indicates that the bias term b is added to each element of the (nl−k+1)×(nl−k+1) matrix Z^l * W.
Figure 26.15. Multiple 3D filters.
Multiple 3D Filters
We can observe that one 3D filter W with a corresponding bias term b results in an (nl−k+1)×(nl−k+1) matrix of neurons in layer l+1. Therefore, if we desire m_{l+1} channels in layer l+1, then we need m_{l+1} different k×k×ml filters W_q with corresponding bias terms b_q, to obtain the (nl−k+1)×(nl−k+1)×m_{l+1} tensor of neuron values at layer l+1, given as

Z^{l+1} = { z^{l+1}_{i,j,q} = f( sum(Z^l_k(i,j) ⊙ W_q) + b_q ) }   for i, j = 1, 2, ..., nl−k+1 and q = 1, 2, ..., m_{l+1}

which can be written more compactly as

Z^{l+1} = f( Z^l * W_1 ⊕ b_1, Z^l * W_2 ⊕ b_2, ..., Z^l * W_{m_{l+1}} ⊕ b_{m_{l+1}} )

where the activation function f distributes over all of its arguments.

In summary, a convolution layer takes as input the nl×nl×ml tensor Z^l of neurons from layer l, and then computes the n_{l+1}×n_{l+1}×m_{l+1} tensor Z^{l+1} of neurons for the next layer l+1 via the convolution of Z^l with a set of m_{l+1} different 3D filters of size k×k×ml, followed by adding the bias and applying some nonlinear activation function f. Note that each 3D filter applied to Z^l results in a new channel in layer l+1. Therefore, m_{l+1} filters are used to yield m_{l+1} channels at layer l+1.
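A minimal sketch of such a convolution layer with multiple filters, bias terms and an activation function. The helper name, the ReLU default, and the random inputs are assumptions; the sizes match Example 26.8 below.

```python
import numpy as np

def conv_layer(Z, filters, biases, f=lambda a: np.maximum(a, 0.0)):
    """Convolution layer: each k x k x m_l filter W_q (with scalar bias b_q)
    yields one channel of the next layer, Z_{l+1}[:, :, q] = f(Z_l * W_q + b_q)."""
    n = Z.shape[0]
    k = filters[0].shape[0]
    out = np.empty((n - k + 1, n - k + 1, len(filters)))
    for q, (W, b) in enumerate(zip(filters, biases)):
        for i in range(n - k + 1):
            for j in range(n - k + 1):
                out[i, j, q] = f(np.sum(Z[i:i + k, j:j + k, :] * W) + b)
    return out

# Two 2 x 2 x 2 filters applied to a 4 x 4 x 2 tensor give a 3 x 3 x 2 tensor
# (random values stand in for those shown in Figure 26.15).
rng = np.random.default_rng(2)
Zl = rng.integers(0, 5, size=(4, 4, 2)).astype(float)
filters = [rng.integers(-1, 2, size=(2, 2, 2)).astype(float) for _ in range(2)]
print(conv_layer(Zl, filters, biases=[0.0, 1.0]).shape)   # (3, 3, 2)
```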
Example 26.8 (Multiple 3D Filters). Figure 26.15 shows how applying different filters yields the channels for the next layer. It shows a 4×4×2 tensor Z^l with n = 4 and m = 2. It also shows two different 2×2×2 filters W1 and W2 with k = 2 and r = 2. Since r = m = 2, the convolution of Z^l and Wi for i = 1, 2 results in a 3×3 matrix, since n − k + 1 = 4 − 2 + 1 = 3. However, W1 yields one channel and W2 yields a second channel, so that the tensor for the next layer Z^{l+1} has size 3×3×2, with two channels (one per filter).

26.3.3 Padding and Striding
Figure 26.16. Padding: 2D convolution (a) without padding, and (b) with padding.
One of the issues with the convolution operation is that the size of the tensors will necessarily decrease in each successive CNN layer. If layer l has size nl×nl×ml, and we use filters of size k×k×ml, then each channel in layer l+1 will have size (nl−k+1)×(nl−k+1). That is, the number of rows and columns for each successive tensor will shrink by k−1, and that will limit the number of layers the CNN can have.
Padding
To get around this limitation, a simple solution is to pad each tensor along both the rows and columns in each channel by some default value, typically zero. For uniformity, we always pad by adding the same number of rows at the top and at the bottom, and likewise the same number of columns on the left and on the right. That is, assume that we add p rows both on top and bottom, and p columns both on the left and right. With padding p, the implicit size of the layer l tensor is then (nl + 2p)×(nl + 2p)×ml. Assume that each filter is of size k×k×ml, and assume there are m_{l+1} filters; then the size of the layer l+1 tensor will be (nl + 2p − k + 1)×(nl + 2p − k + 1)×m_{l+1}. Since we want to preserve the size of the resulting tensor, we need to have

nl + 2p − k + 1 = nl,  which implies  p = (k − 1)/2

With padding, we can have arbitrarily deep convolutional layers in a CNN.
Example 26.9 (Padding). Figure 26.16 shows a 2D convolution without and with padding. Figure 26.16(a) shows the convolution of a 5×5 matrix Z^l (n = 5) with a 3×3 filter W (k = 3), which results in a 3×3 matrix since n − k + 1 = 5 − 3 + 1 = 3. Thus, the size of the next layer Z^{l+1} has decreased.

On the other hand, zero padding Z^l using p = 1 results in a 7×7 matrix, as shown in Figure 26.16(b). Since p = 1, we have an extra row of zeros on the top and bottom, and an extra column of zeros on the left and right. The convolution of the zero-padded matrix with W now results in a 5×5 matrix Z^{l+1} since 7 − 3 + 1 = 5, which preserves the size.
Figure 26.17. Striding: 2D convolution with stride s = 2.
If we wanted to apply another convolution layer, we could zero pad the resulting matrix Z^{l+1} with p = 1, which would again yield a 5×5 matrix for the next layer, using a 3×3 filter. This way, we can chain together as many convolution layers as desired, without decrease in the size of the layers.
Striding
Striding is often used to sparsify the number of sliding windows used in the convolutions. That is, instead of considering all possible windows, we increment the index along both rows and columns by an integer value s ≥ 1 called the stride. A 3D convolution of Z^l of size nl×nl×ml with a filter W of size k×k×ml, using stride s, is given as:

Z^l * W =
| sum(Z^l_k(1,1) ⊙ W)         sum(Z^l_k(1, 1+s) ⊙ W)        ...  sum(Z^l_k(1, 1+t·s) ⊙ W)       |
| sum(Z^l_k(1+s, 1) ⊙ W)      sum(Z^l_k(1+s, 1+s) ⊙ W)      ...  sum(Z^l_k(1+s, 1+t·s) ⊙ W)     |
| ...                                                                                            |
| sum(Z^l_k(1+t·s, 1) ⊙ W)    sum(Z^l_k(1+t·s, 1+s) ⊙ W)    ...  sum(Z^l_k(1+t·s, 1+t·s) ⊙ W)   |

where t = ⌊(nl − k)/s⌋. We can observe that using stride s, the convolution of Z^l ∈ R^{nl×nl×ml} with W ∈ R^{k×k×ml} results in a (t+1)×(t+1) matrix.
Example 26.10 (Striding). Figure 26.17 shows 2D convolution using stride s = 2 on a 5×5 matrix Z^l (nl = 5) with a filter W of size 3×3 (k = 3). Instead of the default stride of one, which would result in a 3×3 matrix, we get a (t+1)×(t+1) = 2×2 matrix Z^{l+1}, since

t = ⌊(nl − k)/s⌋ = ⌊(5 − 3)/2⌋ = 1

We can see that the next window index increases by s along the rows and columns. For example, the first window is Z^l_3(1,1) and thus the second window is Z^l_3(1, 1+s) = Z^l_3(1,3) (see Figures 26.17(a) and 26.17(b)). Next, we move down by a stride of s = 2, so that the third window is Z^l_3(1+s, 1) = Z^l_3(3,1), and the final window is Z^l_3(3, 1+s) = Z^l_3(3,3) (see Figures 26.17(c) and 26.17(d)).
26.3.4 Generalized Aggregation Functions: Pooling
Let Z^l be an nl×nl×ml tensor at layer l. Our discussion of convolutions so far assumes that we sum together all of the elements in the elementwise product of the k×k×r subtensor Z^l_k(i,j,q) and the filter W. In fact, CNNs also use other types of aggregation functions in addition to summation, such as average and maximum.

Avg-Pooling
If we replace the summation with the average value over the elementwise product of Z^l_k(i,j,q) and W, we get

avg(Z^l_k(i,j,q) ⊙ W) = avg_{a,b=1,...,k; c=1,...,r} { z^l_{i+a−1, j+b−1, q+c−1}·w_{a,b,c} } = (1/(k²·r))·sum(Z^l_k(i,j,q) ⊙ W)

Max-Pooling
If we replace the summation with the maximum value over the elementwise product of Z^l_k(i,j,q) and W, we get

max(Z^l_k(i,j,q) ⊙ W) = max_{a,b=1,...,k; c=1,...,r} { z^l_{i+a−1, j+b−1, q+c−1}·w_{a,b,c} }    (26.16)

The 3D convolution of Z^l ∈ R^{nl×nl×ml} with filter W ∈ R^{k×k×r} using max-pooling, denoted Z^l *_max W, results in an (nl−k+1)×(nl−k+1)×(ml−r+1) tensor, whose (i,j,q)-th element is given as max(Z^l_k(i,j,q) ⊙ W).

Max-pooling in CNNs    Typically, max-pooling is used more often than avg-pooling. Also, for pooling it is very common to set the stride equal to the filter size (s = k), so that the aggregation function is applied over disjoint k×k windows in each channel of Z^l. More importantly, in pooling, the filter W is by default taken to be a k×k×1 tensor all of whose weights are fixed at 1, so that W = 1_{k×k×1}. In other words, the filter weights are fixed at 1 and are not updated during backpropagation. Further, the filter uses a fixed zero bias (that is, b = 0). Finally, note that pooling implicitly uses an identity activation function. As such, the convolution of Z^l ∈ R^{nl×nl×ml} with W ∈ R^{k×k×1}, using stride s = k, results in a tensor Z^{l+1} of size (nl/s)×(nl/s)×ml.
Figure 26.18. Max-pooling: stride s = 2.
Example 26.11 (Max-pooling). Figure 26.18 shows max-pooling on a 4×4 matrix Z^l (nl = 4), using window size k = 2 and stride s = 2 that equals the window size. The resulting layer Z^{l+1} thus has size 2×2, since nl/s = 4/2 = 2. We can see that the filter W has fixed weights equal to 1.

The convolution of the first window of Z^l, namely Z^l_2(1,1), with W is given as (see Figure 26.18(a))

max(Z^l_2(1,1) ⊙ W) = max( |1 2; 3 1| ⊙ |1 1; 1 1| ) = max{1, 2, 3, 1} = 3

The other convolution steps are shown in Figures 26.18(b) to 26.18(d).
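A sketch of channel-wise max-pooling with stride s. The 4×4 input below is the matrix of Example 26.11 as reconstructed from Figure 26.18, so the printed output should match the figure.

```python
import numpy as np

def max_pool(Z, k, s):
    """Max-pooling per channel with window size k and stride s; the implicit
    filter of ones and the zero bias are never updated during training."""
    n, m = Z.shape[0], Z.shape[2]
    t = (n - k) // s
    out = np.empty((t + 1, t + 1, m))
    for q in range(m):
        for i in range(t + 1):
            for j in range(t + 1):
                out[i, j, q] = Z[i * s:i * s + k, j * s:j * s + k, q].max()
    return out

Zl = np.array([[1, 2, 2, 1],
               [3, 1, 4, 2],
               [2, 1, 3, 4],
               [1, 2, 3, 1]], dtype=float).reshape(4, 4, 1)
print(max_pool(Zl, k=2, s=2)[:, :, 0])   # [[3. 4.] [2. 4.]]
```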
26.3.5 Deep CNNs

In a typical CNN architecture, one alternates between a convolution layer (with summation as the aggregation function, and learnable filter weights and bias term) and a pooling layer (say, with max-pooling and a fixed filter of ones). The intuition is that, whereas the convolution layer learns the filters to extract informative features, the pooling layer applies an aggregation function like max or avg to extract the most important neuron value (or the mean of the neuron values) within each sliding window, in each of the channels.
Starting from the input layer, a deep CNN is comprised of multiple, typically alternating, convolution and pooling layers, followed by one or more fully connected layers, and then the final output layer. For each convolution and pooling layer we need to choose the window size k as well as the stride value s, and whether to use padding p or not. We also have to choose the nonlinear activation functions for the convolution layers, and also the number of layers to consider.
26.3.6 Training CNNs
To see how to train CNNs we will consider a network with a single convolution layer and a max-pooling layer, followed by a fully connected layer, as shown in Figure 26.19. For simplicity, we assume that there is only one channel for the input X, and further, we use only one filter. Thus, X denotes the input matrix of size n0×n0. The filter W0, with bias b0, for the convolution layer l = 1 has size k1×k1, which yields the matrix of neurons Z1 of size n1×n1, where n1 = n0 − k1 + 1 (using stride s1 = 1). Since we use only one filter, this results in a single channel at layer l = 1, i.e., m1 = 1. The next layer l = 2 is a max-pooling layer Z2 of size n2×n2, obtained by applying a k2×k2 filter with stride s2 = k2. Implicitly, the max-pooling layer uses a filter of weights fixed at 1, with 0 as the bias, that is, W1 = 1_{k2×k2} and b1 = 0. We assume that n1 is a multiple of k2, so that n2 = n1/k2.

The output of the max-pooling layer Z2 is recast as a vector z2 of length n2², since it is fully connected to the layer l = 3. That is, all of the n2² neurons in z2 are connected to each of the n3 neurons in z3, with the weights specified by the matrix W2, which has size n2²×n3, and the bias vector b2 ∈ R^{n3}. Finally, all neurons in z3 are connected to the p output layer neurons o, with the weight matrix W3 ∈ R^{n3×p} and bias vector b3 ∈ R^p.

Figure 26.19. Training: Convolutional neural network.
Feedforward Phase
Let D = {(X_i, y_i)}_{i=1}^n denote the training data, comprising n tensors X_i ∈ R^{n0×n0×m0} (with m0 = 1 for ease of explanation) and the corresponding response vectors y_i ∈ R^p. Given a training pair (X, y) ∈ D, in the feedforward phase, the predicted output o is given via the following equations:

Z1 = f1(X * W0 ⊕ b0)
Z2 = Z1 ⊛_{s2,max} 1_{k2×k2}
z3 = f3(W2^T·z2 + b2)
o = fo(W3^T·z3 + b3)

where ⊛_{s2,max} denotes max-pooling with stride s2.
Backpropagation Phase
Given the true response y and predicted output o, we can use any loss function E_X to evaluate the discrepancy between them. The weights and biases are updated by computing the net gradient vector at the output layer, and then backpropagating the net gradients from layer l = 4 to l = 1. Let δ1, δ2, and δ3 denote the net gradient vectors at layers l = 1, 2, 3, respectively, and let δ^o denote the net gradient vector at the output layer. The output net gradient vector is obtained in the regular manner by computing the partial derivatives of the loss function E_X and the activation function fo:

δ^o = ∂fo ⊙ ∂E_X

assuming that the output neurons are independent.

Since layer l = 3 is fully connected to the output layer, and likewise the max-pooling layer l = 2 is fully connected to z3, the net gradients at these layers are computed as in a regular MLP:

δ3 = ∂f3 ⊙ (W3·δ^o)
δ2 = ∂f2 ⊙ (W2·δ3) = W2·δ3
The last step follows from the fact that ∂f2 = 1, since max-pooling implicitly uses an identity activation function. Note that we also implicitly reshape the net gradient vector δ2, whose size is (n2²×n3)·(n3×1) = n2²×1, into an n2×n2 matrix, as desired.

Consider the net gradient δ1_{ij} at neuron z1_{ij} in layer l = 1, where i, j = 1, 2, ..., n1. Since we assume that the stride s2 equals the filter size k2 for the max-pooling layer, each sliding window in the convolution layer contributes only to one neuron at the max-pooling layer. Given stride s2 = k2, the k2×k2 sliding window that contains z1_{ij} is given as Z1_{k2}(a, b), where

a = ⌈i/s2⌉    b = ⌈j/s2⌉

Due to the max aggregation function, the maximum valued element in Z1_{k2}(a, b) specifies the value of neuron z2_{ab} in the max-pooling layer l = 2. That is,

z2_{ab} = max_{i,j = 1,2,...,k2} { z1_{(a−1)·k2+i, (b−1)·k2+j} }
(i*, j*) = argmax_{i,j = 1,2,...,k2} { z1_{(a−1)·k2+i, (b−1)·k2+j} }

where (i*, j*) is the index of the maximum valued neuron in the window Z1_{k2}(a, b). The net gradient δ1_{ij} at neuron z1_{ij} is therefore given as

δ1_{ij} = ∂E_X/∂net1_{ij} = (∂E_X/∂net2_{ab})·(∂net2_{ab}/∂z1_{ij})·(∂z1_{ij}/∂net1_{ij}) = δ2_{ab}·(∂net2_{ab}/∂z1_{ij})·∂f1_{ij}

where net^l_{ij} denotes the net input at neuron z^l_{ij} in layer l. However, since net2_{ab} = z1_{i*,j*}, the partial derivative ∂net2_{ab}/∂z1_{ij} is either 1 or 0, depending on whether z1_{ij} is the maximum element in the window Z1_{k2}(a, b) or not. Putting it all together, we have

δ1_{ij} = δ2_{ab}·∂f1_{ij}   if i = i* and j = j*
δ1_{ij} = 0                  otherwise

In other words, the net gradient at neuron z1_{ij} in the convolution layer is zero if this neuron does not have the maximum value in its window. Otherwise, if it is the maximum, the net gradient backpropagates from the max-pooling layer to this neuron and is then multiplied by the partial derivative of the activation function. The n1×n1 matrix of net gradients δ1 comprises the net gradients δ1_{ij} for all i, j = 1, 2, ..., n1.

From the net gradients, we can compute the gradients of the weight matrices and bias parameters. For the fully connected layers, that is, between l = 2 and l = 3, and l = 3 and l = 4, we have

∇_{W3} = z3·(δ^o)^T    ∇_{b3} = δ^o    ∇_{W2} = z2·(δ3)^T    ∇_{b2} = δ3

where we treat z2 as an n2²×1 vector.
Note that the weight matrix W1 is fixed at 1_{k2×k2} and the bias term b1 is also fixed at 0, so there are no parameters to learn between the convolution and max-pooling layers. Finally, we compute the weight and bias gradients between the input and convolution layer as follows:

∇_{W0} = Σ_{i=1}^{n1} Σ_{j=1}^{n1} X_{k1}(i,j)·δ1_{ij}    ∇_{b0} = Σ_{i=1}^{n1} Σ_{j=1}^{n1} δ1_{ij}

where we used the fact that the stride is s1 = 1, and that W0 is a shared filter for all k1×k1 windows of X, with the shared bias value b0 for all windows. There are n1×n1 such windows, where n1 = n0 − k1 + 1; therefore, to compute the weight and bias gradients, we sum over all the windows. Note that if there were multiple filters (that is, if m1 > 1), then the bias and weight gradients for the jth filter would be learned from the corresponding channel j in layer l = 1.
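A sketch of this shared-filter gradient accumulation, summing the contribution of each of the n1×n1 windows; the function name and the toy inputs are illustrative assumptions.

```python
import numpy as np

def conv_filter_gradients(X, delta1, k1):
    """Gradients for a shared k1 x k1 filter W0 and scalar bias b0 (stride 1):
    sum the contribution of every sliding window, as in the equations above."""
    n1 = delta1.shape[0]              # n1 = n0 - k1 + 1 windows per axis
    grad_W0 = np.zeros((k1, k1))
    grad_b0 = 0.0
    for i in range(n1):
        for j in range(n1):
            grad_W0 += X[i:i + k1, j:j + k1] * delta1[i, j]
            grad_b0 += delta1[i, j]
    return grad_W0, grad_b0

# Toy check on a 5 x 5 input with a 3 x 3 filter (so n1 = 3).
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 5))
delta1 = rng.normal(size=(3, 3))      # net gradients at the convolution layer
gW, gb = conv_filter_gradients(X, delta1, k1=3)
print(gW.shape, round(gb, 3))
```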
Example 26.12 (CNN). Figure 26.20 shows a CNN for handwritten digit recognition. This CNN is trained and tested on the MNIST dataset, which contains 60,000 training images and 10,000 test images. Some examples of handwritten digits from MNIST are shown in Figure 26.21. Each input image is a 28×28 matrix of pixel values between 0 and 255, which are divided by 255, so that each pixel lies in the interval [0,1]. The corresponding true output y_i is a one-hot encoded binary vector that denotes a digit from 0 to 9; the digit 0 is encoded as e1 = (1,0,0,0,0,0,0,0,0,0)^T, the digit 1 as e2 = (0,1,0,0,0,0,0,0,0,0)^T, and so on.

In our CNN model, all the convolution layers use stride equal to one and do not use any padding, whereas all of the max-pooling layers use stride equal to the window size. Since each input is a 28×28 pixel image of a digit with 1 channel (grayscale), we have n0 = 28 and m0 = 1, and therefore the input X = Z0 is an n0×n0×m0 = 28×28×1 tensor. The first convolution layer uses m1 = 6 filters, with k1 = 5 and stride s1 = 1, without padding. Thus, each filter is a 5×5×1 tensor of weights, and across the six filters the resulting layer l = 1 tensor Z1 has size 24×24×6, with n1 = n0 − k1 + 1 = 28 − 5 + 1 = 24 and m1 = 6. The second hidden layer is a max-pooling layer that uses k2 = 2 with a stride of s2 = 2. Since max-pooling by default uses a fixed filter W1 = 1_{k2×k2×1}, the resulting tensor Z2 has size 12×12×6, with n2 = n1/2 = 24/2 = 12, and m2 = 6. The third layer is a convolution layer with m3 = 16 channels, with a window size of k3 = 5 and stride s3 = 1, resulting in the tensor Z3 of size 8×8×16, where n3 = n2 − k3 + 1 = 12 − 5 + 1 = 8.
Figure 26.20. Convolutional neural network: input 28×28×1; convolution (k1 = 5) to 24×24×6; max-pooling (k2 = s2 = 2) to 12×12×6; convolution (k3 = 5) to 8×8×16; max-pooling (k4 = s4 = 2) to 4×4×16; fully connected layers of size 120 and 84; output o of size 10.
Figure 26.21. MNIST dataset: Sample handwritten digits (labels 0 through 9).
This is followed by another max-pooling layer that uses k4 = 2 and s4 = 2, which yields the tensor Z4 of size 4×4×16, where n4 = n3/k4 = 8/2 = 4, and m4 = 16.
The next three layers are fully connected as in a regular MLP. All of the 4·4·16 = 256 neurons in layer l = 4 are connected to layer l = 5, which has 120 neurons. Thus, Z5 is simply a vector of length 120, or it can be considered a degenerate tensor of size 120×1×1. Layer l = 5 is also fully connected to layer l = 6 with 84 neurons, which is the last hidden layer. Since there are 10 digits, the output layer o comprises 10 neurons, with softmax activation function. The convolution layers Z1 and Z3, and the fully connected layers Z5 and Z6, all use ReLU activation.

We train the CNN model on n = 60000 training images from the MNIST dataset; we train for 15 epochs using step size 0.2 and the cross-entropy error, since there are 10 classes. Training was done using minibatches, with a batch size of 1000. After training the CNN model, we evaluate it on the test dataset of 10,000 images. The CNN model makes 147 errors on the test set, resulting in an error rate of 1.47%. Figure 26.22 shows examples of images that are misclassified by the CNN. We show the true label y for each image and the predicted label o (converted back from the one-hot encoding to the digit label). We show three examples for each of the labels. For example, the first three images on the first row are for the case when the true label is y = 0, the next three examples are for true label y = 1, and so on. We can see that several of the misclassified images are noisy, incomplete or erroneous, and would be hard to classify correctly even by a human.

For comparison, we also train a deep MLP with two fully connected hidden layers with the same sizes as the two fully connected layers before the output layer in the CNN shown in Figure 26.20. Therefore, the MLP comprises the layers X, Z5, Z6, and o, with the input 28×28 images viewed as a vector of size d = 784. The first hidden layer has size n1 = 120, the second hidden layer has size n2 = 84, and the output layer has size p = 10. We use the ReLU activation function for all layers, except the output, which uses softmax. We train the MLP model for 15 epochs on the training dataset with n = 60000 images, using step size 0.5. On the test dataset, the MLP made 264 errors, for an error rate of 2.64%. Figure 26.23 shows the number of errors on the test set after each epoch of training for both the CNN and MLP models; the CNN model achieves significantly better accuracy than the MLP.
Figure 26.22. MNIST: Incorrect predictions by the CNN model; y is the true label, o is the predicted label.
Figure 26.23. MNIST: CNN versus deep MLP; prediction error as a function of epochs.
26.4 REGULARIZATION
Consider the squared error loss function, given as

L(y, ŷ) = (y − ŷ)²

where y is the true response and ŷ = o is the predicted response on a given input x. The goal of learning is to minimize the expected loss E[L(y, ŷ)] = E[(y − ŷ)²].

As described in Section 22.3, the expected loss can be decomposed into three terms: noise, average variance, and average bias, given as

E[(y − ŷ)²] = E[(y − E[y])²] + E[(ŷ − E[ŷ])²] + E[(E[ŷ] − E[y])²]    (26.17)

where the first term is the noise, the second the average variance, and the third the average bias.

The noise term is the expected variance of y, since E[(y − E[y])²] = var(y). It contributes a fixed cost to the loss independent of the model. Since it is the inherent uncertainty or variability of y, it can be ignored when comparing different models. The average bias term indicates how much the model deviates from the true but unknown function relating the response to the predictor variables. For example, if the response is a nonlinear function of the predictor variables, and we fit a simple linear model, then this model will have a high bias. We can try to fit a more complex nonlinear model to reduce the bias, but then we run into the issue of overfitting and high variance, captured by the average variance term, which quantifies the variance of the predicted response ŷ, since E[(ŷ − E[ŷ])²] = var(ŷ). That is, our model may try to fit noise and other artifacts in the data, and will therefore be highly susceptible to small changes in the data, resulting in high variance. In general, there is always a tradeoff between reducing bias and reducing variance.
Regularization is an approach whereby we constrain the model parameters to reduce overfitting, by reducing the variance at the cost of increasing the bias slightly. For example, in ridge regression (Eq. (23.29)), we add a constraint on the L2-norm of the weight parameters as follows:

min_w J(w) = ‖Y − Ŷ‖² + α·‖w‖² = ‖Y − D·w‖² + α·‖w‖²    (26.18)

where Y is the true response vector and Ŷ is the predicted response vector over all training instances. The goal here is to drive the weights to be small, depending on the hyperparameter α ≥ 0 called the regularization constant. If α = 0, then there is no regularization, and we get a low bias, but possibly high variance, model. If α → ∞, then the effect is to drive all weights to be nearly zero, which results in a low variance, but high bias, model. An intermediate value of α tries to achieve the right balance between these two conflicting objectives. As another example, in L1 regression or Lasso (Eq. (23.43)), we minimize the L1 norm of the weights:

min_w J(w) = ‖Y − D·w‖² + α·‖w‖₁

where ‖w‖₁ = Σ_{i=1}^d |w_i|. Compared to the L2 norm, which merely makes the weights small, the use of the L1 norm sparsifies the model by forcing many of the weights to zero, acting as a feature subset selection method.
In general, consider any learning model M, where L(y, ŷ) is some loss function for a given input x, θ denotes all the model parameters, and ŷ = M(x). The learning objective is to find the parameters θ that minimize the loss over all instances:

min_θ J(θ) = Σ_{i=1}^n L(y_i, ŷ_i) = Σ_{i=1}^n L(y_i, M(x_i))

With regularization, we add a penalty on the parameters θ, to obtain the regularized objective:

min_θ J(θ) = Σ_{i=1}^n L(y_i, ŷ_i) + α·R(θ)    (26.19)

where α ≥ 0 is the regularization constant.

Let θ be a parameter of the regression model. Typical regularization functions include the L2 norm, the L1 norm, or even a combination of these, called the elastic-net:

R_{L2}(θ) = θ²    R_{L1}(θ) = |θ|    R_{elastic}(θ) = λ·θ² + (1 − λ)·|θ|

with λ ∈ [0, 1].
26.4.1 L2 Regularization for Deep Learning
We now consider approaches for regularizing the deep learning models. We first consider the case of a multilayer perceptron with one hidden layer, and then generalize it to multiple hidden layers. Note that while our discussion of regularization is in the context of MLPs, since RNNs are trained via unfolding, and CNNs are essentially sparse MLPs, the methods described here can easily be generalized for any deep neural network.
MLP with One Hidden Layer
Consider regularization in the context of a feedforward MLP with a single hidden layer, as shown in Figure 26.24. Let the input be x ∈ R^d, the hidden layer z ∈ R^m, and let ŷ = o ∈ R^p. The set of all the parameters of the model is

θ = {Wh, bh, Wo, bo}
Whereas it makes sense to penalize large weights, we usually do not penalize the bias terms, since they are just thresholds that shift the activation function and there is no need to force them to be small values.

Figure 26.24. Multilayer perceptron with one hidden layer.
The L2 regularized objective is therefore given as

min_θ J(θ) = E_x + (α/2)·R_{L2}(Wo, Wh) = E_x + (α/2)·(‖Wh‖²_F + ‖Wo‖²_F)

Here we added the factor 1/2 to the regularization term for convenience. For the L2 norm of the weight matrices, we use the Frobenius norm, which has the usual sense of an L2-norm, since for an n×m matrix A it is defined as

‖A‖²_F = Σ_{i=1}^n Σ_{j=1}^m a_{ij}²

where a_{ij} is the (i,j)th element of A. The regularized objective tries to minimize the individual weights for pairs of neurons between the input and hidden, and hidden and output, layers. This has the effect of adding some bias to the model, but possibly reducing variance, since small weights are more robust to changes in the input data in terms of the predicted output values.
The regularized objective has two separate terms, one for the loss and the other for the L2 norm of the weight matrices. Recall that we have to compute the weight gradients ∇w_{ij} and the bias gradients ∇b_j by computing

∇w_{ij} = ∂J/∂w_{ij} = ∂E_x/∂w_{ij} + (α/2)·∂R_{L2}(Wo, Wh)/∂w_{ij} = δ_j·z_i + α·w_{ij}
∇b_j = ∂J/∂b_j = ∂E_x/∂b_j + (α/2)·∂R_{L2}(Wo, Wh)/∂b_j = δ_j

where we use Eq. (25.31) to note that ∂E_x/∂w_{ij} = δ_j·z_i and ∂E_x/∂b_j = δ_j, and where δ_j = ∂E_x/∂net_j is the net gradient. Further, since the squared L2 norm of a weight matrix is simply the sum of the squared weights, only the term w_{ij}² matters, and all other elements are just constant with respect to the weight w_{ij} between neurons i and j in Wh or Wo. Across all the neuron pairs between the hidden and output layer, we can write the update rule compactly as follows:

∇_{Wo} = z·(δ^o)^T + α·Wo    ∇_{bo} = δ^o

where δ^o is the net gradient vector for the output neurons, and z is the vector of hidden layer neuron values. The gradient update rule using the regularized weight gradient matrix is given as

Wo = Wo − η·∇_{Wo} = Wo − η·(z·(δ^o)^T + α·Wo) = Wo − η·α·Wo − η·z·(δ^o)^T = (1 − η·α)·Wo − η·z·(δ^o)^T

L2 regularization is also called weight decay, since the updated weight matrix uses decayed weights from the previous step, using the decay factor (1 − η·α).
In a similar manner, we get the weight and bias gradients between the input and hidden layers:

∇_{Wh} = x·(δ^h)^T + α·Wh    ∇_{bh} = δ^h
Figure 26.25. Deep multilayer perceptron.
The update rule for the weight matrix between the input and hidden layers is therefore given as

Wh = Wh − η·∇_{Wh} = (1 − η·α)·Wh − η·x·(δ^h)^T

where δ^h is the net gradient vector for the hidden neurons, and x is the input vector.
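A minimal sketch of the weight decay update (1 − ηα)·W − η·∇, applied here to the hidden-to-output weights of a one-hidden-layer MLP; the random toy values are assumptions for illustration.

```python
import numpy as np

def weight_decay_update(W, grad_unreg, eta, alpha):
    """Gradient step with L2 regularization: W <- (1 - eta*alpha)*W - eta*grad,
    where grad is the unregularized gradient (e.g. z * delta^T for a layer)."""
    return (1.0 - eta * alpha) * W - eta * grad_unreg

# One update for the hidden-to-output weights Wo of a toy MLP.
rng = np.random.default_rng(4)
m, p = 5, 3
Wo = rng.normal(size=(m, p))
z = rng.normal(size=(m, 1))           # hidden layer values
delta_o = rng.normal(size=(p, 1))     # net gradient at the output layer
Wo = weight_decay_update(Wo, z @ delta_o.T, eta=0.1, alpha=0.01)
print(Wo.shape)                       # (5, 3)
```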
Deep MLPs
Consider the deep MLP shown in Figure 26.25. We denote the input neurons as layer l = 0, the first hidden layer as l = 1, the last hidden layer as l = h, and the final output layer as l = h+1. The vector of neuron values for layer l, for l = 0, ..., h+1, is denoted as

z^l = (z^l_1, ..., z^l_{nl})^T

where nl is the number of neurons in layer l. Thus, x = z^0 and o = z^{h+1}. The weight matrix between neurons in layer l and layer l+1 is denoted Wl ∈ R^{nl×n_{l+1}}, and the vector of bias terms from the bias neuron z^l_0 to neurons in layer l+1 is denoted bl ∈ R^{n_{l+1}}, for l = 0, 1, ..., h.

Given the error function E_x, the L2 regularized objective function is

min_θ J(θ) = E_x + (α/2)·R_{L2}(W0, W1, ..., Wh) = E_x + (α/2)·Σ_{l=0}^{h} ‖Wl‖²_F

where the set of all the parameters of the model is θ = {W0, b0, W1, b1, ..., Wh, bh}. Based on the derivation for the one hidden layer MLP from above, the regularized gradient is given as:

∇_{Wl} = z^l·(δ^{l+1})^T + α·Wl    (26.20)

and the update rule for the weight matrices is

Wl = Wl − η·∇_{Wl} = (1 − η·α)·Wl − η·z^l·(δ^{l+1})^T    (26.21)

for l = 0, 1, ..., h, where δ^{l+1} is the net gradient vector for the neurons in layer l+1. We can thus observe that incorporating L2 regularization within deep MLPs is relatively straightforward. Likewise, it is easy to incorporate L2 regularization in other models like RNNs, CNNs, and so on. For L1 regularization, we can apply the subgradient approach outlined for L1 regression or Lasso in Section 23.6.
26.4.2 Dropout Regularization
The idea behind dropout regularization is to randomly set a certain fraction of the neuron values in a layer to zero during training time. The aim is to make the network more robust and to avoid overfitting at the same time. By dropping random neurons for each training point, the network is forced to not rely on any specific set of edges. From the perspective of a given neuron, since it cannot rely on all its incoming edges to be present, it has the effect of not concentrating the weight on specific input edges, but rather the weight is spread out among the incoming edges. The net effect is similar to L2 regularization since weight spreading leads to smaller weights on the edges. The resulting model with dropout is therefore more resilient to small perturbations in the input, which can reduce overfitting at a small price in increased bias. However, note that while L2 regularization directly changes the objective function, dropout regularization is a form of structural regularization that does not change the objective function, but instead changes the network topology in terms of which connections are currently active or inactive.
MLP with One Hidden Layer
Consider the one hidden layer MLP in Figure 26.24. Let the input be x ∈ R^d, the hidden layer z ∈ R^m, and let ŷ = o ∈ R^p. During the training phase, for each input x, we create a random mask vector to drop a fraction of the hidden neurons. Formally, let r ∈ [0, 1] be the probability of keeping a neuron, so that the dropout probability is 1 − r. We create an m-dimensional multivariate Bernoulli vector u ∈ {0,1}^m, called the masking vector, each of whose entries is 0 with dropout probability 1 − r, and 1 with probability r. Let u = (u1, u2, ..., um)^T, where

u_i = 0 with probability 1 − r, and u_i = 1 with probability r
The feedforward step is then given as

z = f_h(bh + Wh^T·x)
z̃ = u ⊙ z
o = f_o(bo + Wo^T·z̃)

where ⊙ is the elementwise multiplication. The net effect is that the masking vector zeros out the ith hidden neuron in z if u_i = 0. Zeroing out also has the effect that during the backpropagation phase the error gradients do not flow back from the zeroed out neurons in the hidden layer. The effect is that any weights on edges adjacent to zeroed out hidden neurons are not updated.
Inverted Dropout    There is one complication in the basic dropout approach above, namely, the expected output of the hidden layer neurons is different during training and testing, since dropout is not applied during the testing phase (after all, we do not want the predictions to be randomly varying on a given test input). With r as the probability of retaining a hidden neuron, its expected output value is

E[z̃_i] = r·z_i + (1 − r)·0 = r·z_i
On the other hand, since there is no dropout at test time, the outputs of the hidden neurons will be higher at testing time. So one idea is to scale the hidden neuron values by r at testing time. On the other hand, there is a simpler approach called inverted dropout that does not need a change at testing time. The idea is to rescale the hidden neurons after the dropout step during the training phase, as follows:
z = f_h(bh + Wh^T·x)
z̃ = (1/r)·(u ⊙ z)
o = f_o(bo + Wo^T·z̃)
With the scaling factor 1/r, the expected value of each neuron remains the same as without dropout, since

E[z̃_i] = (1/r)·(r·z_i + (1 − r)·0) = z_i

Dropout in Deep MLPs
Dropout regularization for deep MLPs is done in a similar manner. Let r_l ∈ [0, 1], for l = 1, 2, ..., h, denote the probability of retaining a hidden neuron for layer l, so that 1 − r_l is the dropout probability. One can also use a single rate r for all the layers by setting r_l = r. Define the masking vector for hidden layer l, u^l ∈ {0,1}^{nl}, as follows:

u^l_i = 0 with probability 1 − r_l, and u^l_i = 1 with probability r_l

The feedforward step between layer l and l+1 is then given as

z̃^l = (1/r_l)·(u^l ⊙ z^l)
z^{l+1} = f(bl + Wl^T·z̃^l)    (26.22)

using inverted dropout. Usually, no masking is done for the input and output layers, so we can set r_0 = 1 and r_{h+1} = 1. Also note that there is no dropout at testing time. The dropout rates are hyperparameters of the model that have to be tuned on a separate validation dataset.
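A minimal sketch of inverted dropout as in Eq. (26.22): mask each neuron with probability 1 − r and rescale by 1/r during training only. The function signature is an assumption for this example.

```python
import numpy as np

def inverted_dropout(z, r, rng, training=True):
    """Inverted dropout: keep each neuron with probability r and rescale by 1/r
    during training; at test time the layer is returned unchanged."""
    if not training:
        return z
    u = (rng.random(z.shape) < r).astype(z.dtype)   # Bernoulli masking vector
    return (u * z) / r

rng = np.random.default_rng(5)
z = rng.normal(size=10)
z_tilde = inverted_dropout(z, r=0.8, rng=rng)
print(np.round(z_tilde, 2))   # roughly 20% of the entries are zeroed out
```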
26.5 FURTHER READING
The backpropagation through time algorithm was introduced by Werbos (1990). Bidirectional RNNs were proposed by Schuster and Paliwal (1997), and LSTMs were proposed by Hochreiter and Schmidhuber (1997), with the forget gate introduced by Gers, Schmidhuber, and Cummins (2000). Convolutional neural networks trained via backpropagation were proposed by LeCun, Bottou, Bengio, and Haffner (1998), with application to handwritten digit recognition. Dropout regularization was introduced by Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014).
Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12 (10), 2451-2471.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9 (8), 1735-1780.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 (11), 2278-2324.
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45 (11), 2673-2681.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15 (1), 1929-1958.
Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78 (10), 1550-1560.
26.6 EXERCISES
Q1. Consider the RNN architecture in Figure 26.26. Note that an edge labeled 1 indicates dependence on the previous time step, whereas 2 indicates dependence on values from two time steps back. Use f^o and f^h as the activation functions for the output and hidden layers. Let τ be the longest sequence length. Answer the following questions:
(a) Unfold the RNN for three steps and show all the connections.
(b) List all the parameters required in this model, including the weight matrices and bias vectors. Next write the forward propagation equations to compute o_t.
(c) Write the equation for the net gradient vector δ^o_t at the output neurons at time t.
(d) Write the equation for the net gradient vector δ^h_t at the hidden neurons at time t.
Q2. Derive the net gradient equations for backpropagation in an RNN with a forget gate using vector derivatives. That is, derive equations for the net gradients at the output (δ^o_t), update (δ^u_t), forget (δ^φ_t), and hidden (δ^h_t) layers by computing the derivative of the error function E_{x_t} with respect to net^o_t, net^u_t, net^φ_t and net^h_t, the net inputs at the output, update, forget and hidden layers, respectively, at time t.
Figure 26.26. RNN architecture for Q1.
Q3. Derive the net gradient equations for backpropagation in an LSTM using vector derivatives of the error function E_x with respect to net^a_t, where a ∈ {o, h, c, u, κ, φ, ω}, that is, for the output, hidden state, internal memory, and candidate update layers, as well as for the input, forget and output gate layers.
Q4. Show that with stride s, in the convolution of X ∈ R^{n×n×m} and W ∈ R^{k×k×m}, there are (t+1)×(t+1) possible windows, where t = ⌊(n−k)/s⌋.
Q5. Consider a CNN that takes as input substrings of length 8 over the alphabet {A, B, C, D}, and predicts whether the class is positive (P) or negative (N). Assume we use two 1D filters with window size k = 4 and stride s = 2, with weights w1 = (1, 0, 1, 1)^T and w2 = (1, 1, 0, 1)^T, respectively. Assume the bias is zero. The 1D convolution is followed by a max-pooling operation with window size k = 2 and stride s = 1. All of the convolutions use a linear activation function, and there is no padding used. Finally, all of the neurons after the max-pooling feed into a single output neuron that uses a sigmoid activation function with bias 15 and all weights set to 1. Given the input sequence ABACABCD, show the neuron values after the convolution and max-pooling layers, and the final output. Use one-hot encoding for the input alphabet in lexicographic order, and let P = 1 and N = 0. What is the final output on the given input sequence?
Q6. Given the 3D tensor in Figure 26.27(a), using padding p = 1 and stride s = 2, answer the following questions:
(a) Compute the convolution of X with the 3×3×3 mask W in Figure 26.27(b).
(b) Compute the max-pooling output using a 3×3×1 mask W of all ones.

Figure 26.27. 3D tensor for Q6: (a) the 5×5×3 input tensor X; (b) the 3×3×3 filter W.

(a) Input tensor X:

channel 1:        channel 2:        channel 3:
1 1 3 1 1         1 4 3 3 2         1 2 4 3 2
2 4 4 1 2         2 2 2 2 2         2 2 4 2 1
2 2 3 1 1         1 1 4 1 1         1 1 3 2 4
1 2 2 3 1         1 2 3 1 1         1 4 3 2 3
4 1 1 2 2         1 4 1 1 2         1 1 1 4 4

(b) 3D filter W:

channel 1:        channel 2:        channel 3:
1 0 0             0 0 1             1 1 1
0 1 0             0 1 0             1 1 1
0 0 1             1 0 0             1 1 1
CHAPTER 27 Regression Evaluation
Given a set of predictor attributes or independent variables X1, X2, ..., Xd, and given the response attribute Y, the goal of regression is to learn a function f, such that

Y = f(X1, X2, ..., Xd) + ε = f(X) + ε

where X = (X1, X2, ..., Xd)^T is the d-dimensional multivariate random variable comprised of the predictor variables. Here, the random variable ε denotes the inherent error in the response that is not explained by the linear model.

When estimating the regression function f, we make assumptions about the form of f, for example, whether f is a linear or some nonlinear function of the parameters of the model. For example, in linear regression, we assume that

f(X) = β + ω1·X1 + ... + ωd·Xd

where β is the bias term, and ωi is the regression coefficient for attribute Xi. The model is linear, since f(X) is a linear function of the parameters β, ω1, ..., ωd.

Once we have estimated the bias and coefficients, we need to formulate a probabilistic model of regression to evaluate the learned model in terms of goodness of fit, confidence intervals for the parameters, and to test for the regression effects, namely whether X really helps in predicting Y. In particular, we assume that even if the value of X has been fixed, there can still be uncertainty in the response Y. Further, we will assume that the error ε is independent of X and follows a normal or Gaussian distribution with mean 0 and variance σ², that is, we assume that the errors are independent and identically distributed with zero mean and fixed variance.

The probabilistic regression model comprises two components: the deterministic component comprising the observed predictor attributes, and the random error component comprising the error term, which is assumed to be independent of the predictor attributes. With the assumptions on the form of the regression function f and the error variable ε, we can answer several interesting questions, such as: How good of a fit is the estimated model to the input data? How close are the estimated bias and coefficients to the true but unknown parameters? We consider such questions in this chapter.
27.1 UNIVARIATE REGRESSION
In univariate regression we have one dependent attribute Y and one independent attribute X, and we assume that the true relationship can be modeled as a linear function

Y = f(X) + ε = β + ω·X + ε

where ω is the slope of the best fitting line and β is its intercept, and ε is the random error variable that follows a normal distribution with mean 0 and variance σ².
Mean and Variance of Response Variable
Consider a fixed value x for the independent variable X. The expected value of the response variable Y given x is

E[Y | X = x] = E[β + ωx + ε] = β + ωx + E[ε] = β + ωx

The last step follows from our assumption that E[ε] = 0. Also, since x is assumed to be fixed, and β and ω are constants, the expected value E[β + ωx] = β + ωx. Next, consider the variance of Y given X = x; we have

var(Y | X = x) = var(β + ωx + ε) = var(β + ωx) + var(ε) = 0 + σ² = σ²

Here var(β + ωx) = 0, since β, ω and x are all constants. Thus, given X = x, the response variable Y follows a normal distribution with mean E[Y | X = x] = β + ωx and variance var(Y | X = x) = σ².
Estimated Parameters
The true parameters β, ω and σ² are all unknown, and have to be estimated from the training data D comprising n points x_i and corresponding response values y_i, for i = 1, 2, ..., n. Let b and w denote the estimated bias and weight terms; we can then make predictions for any given value x_i as follows:

ŷ_i = b + w·x_i

The estimated bias b and weight w are obtained by minimizing the sum of squared errors (SSE), given as

SSE = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n (y_i − b − w·x_i)²

with the least squares estimates given as (see Eq. (23.11) and Eq. (23.8))

w = σ_XY / σ_X²    b = μ_Y − w·μ_X

27.1.1 Estimating Variance (σ²)

According to our model, the variance in prediction is entirely due to the random error term ε. We can estimate this variance by considering the predicted value ŷ_i and its deviation from the true response y_i, that is, by looking at the residual error

o_i = y_i − ŷ_i
One of the properties of the estimated values b and w is that the sum of residual errors is zero, since

Σ_{i=1}^n o_i = Σ_{i=1}^n (y_i − b − w·x_i)
= Σ_{i=1}^n (y_i − μ_Y + w·μ_X − w·x_i)
= Σ_{i=1}^n y_i − n·μ_Y − w·(n·μ_X − Σ_{i=1}^n x_i)
= n·μ_Y − n·μ_Y − w·(n·μ_X − n·μ_X) = 0    (27.1)

Thus, the expected value of o_i is zero, since E[o_i] = (1/n)·Σ_{i=1}^n o_i = 0. In other words, the sum of the errors above and below the regression line cancel each other.

Next, the estimated variance σ̂² is given as

σ̂² = var(o_i) = (1/(n−2))·Σ_{i=1}^n (o_i − E[o_i])² = (1/(n−2))·Σ_{i=1}^n o_i² = (1/(n−2))·Σ_{i=1}^n (y_i − ŷ_i)²

Thus, the estimated variance is

σ̂² = SSE/(n−2)    (27.2)

We divide by n−2 to get an unbiased estimate, since n−2 is the number of degrees of freedom for estimating SSE (see Figure 27.2). In other words, out of the n training points, we need to estimate two parameters, β and ω, with n−2 remaining degrees of freedom.

The square root of the variance is called the standard error of regression:

σ̂ = √(SSE/(n−2))    (27.3)

27.1.2 Goodness of Fit

The SSE value gives an indication of how much of the variation in Y cannot be explained by our linear model. We can compare this value with the total scatter, also called the total sum of squares, for the dependent variable Y, defined as

TSS = Σ_{i=1}^n (y_i − μ_Y)²

Notice that in TSS, we compute the squared deviations of the true response from the true mean for Y, whereas in SSE we compute the squared deviations of the true response from the predicted response.
Univariate Regression 723 The total scatter can be decomposed into two components by adding and
subtracting yi as follows
TSS

n
yi Y2 yi yi yi Y2
i1
n n n
yi yi2 yi Y2 2 yi yiyi Y i1 i1 i1
n i1
nn
22

whereweusethefactthat ni1yi yiyi Y0,and
yi yi yi Y SSERSS i1
i1
is a new term called regression sum of squares that measures the squared deviation of
the predictions from the true mean. TSS can thus be decomposed into two parts: SSE,
which is the amount of variation not explained by the model, and RSS, which is the
amount of variance explained by the model. Therefore, the fraction of the variation
left unexplained by the model is given by the ratio SSE . Conversely, the fraction of TSS
the variation that is explained by the model, called the coefficient of determination or simply the R2 statistic, is given as
R2 TSSSSE 1 SSE RSS 27.4 TSS TSS TSS
The higher the R2 statistic the better the estimated model, with R2 0, 1.
R S S
n i1
y i Y 2
Example 27.1 Variance and Goodness of Fit. Consider Example 23.1 that shows the regression of petal length X; the predictor variable on petal width Y; the response variable for the Iris dataset. Figure 27.1 shows the scatterplot between the two attributes. There are a total of n 150 data points. The least squares estimates for the bias and regression coefficients are as follows
w 0.4164 b 0.3665 The SSE value is given as
150 150
SSEoi2 yi yi2 6.343
i1 i1
Thus, the estimated variance and standard error of regression are given as
2 SSE 6.343 2 4.28610
n 2 1 4 8
SSE 4.2861020.207 n2

724
Regression Evaluation
2.5 2.0 1.5 1.0 0.5
0 0.5

0 1.0
Figure27.1. Scatterplot:petallengthXversuspetalwidthY.Meanpointshownasblackcircle.
2.0
3.0
X: petal length
6.0
7.0
Geometry of Goodness of Fit
Recall that Y can be decomposed into two orthogonal parts as illustrated in Figure 27.2a.
YYo
where Y is the projection of Y onto the subspace spanned by 1,X. Using the fact
that this subspace is the same as that spanned by the orthogonal vectors 1,X, with X X X 1, we can further decompose Y as follows
Y p r o j 1 Y 1 p r o j X Y X Y 1 Y T X X Y 1 w X 2 7 . 5 X TX
Likewise, the vectors Y and Y can be centered by subtracting their projections along the vector 1
Y Y Y 1 Y Y Y 1 w X
where the last step follows from Eq. 27.5. The centered vectorsY,Y andX all lie in the n 1 dimensional subspace orthogonal to the vector 1, as illustrated in Figure 27.2b.
4.0 5.0
For the bivariate Iris data, the values of TSS and RSS are given as TSS 86.78 RSS 80.436
We can observe that TSS SSE RSS. The fraction of variance explained by the model, that is, the R2 value, is given as
R2 RSS 80.436 0.927 TSS 86.78
This indicates a very good fit of the linear model.
Y: petal width

Univariate Regression
725
x1 x1Y Y
o Y Y
1
Y
o

x2
xn
1
x2
xn

Y YY
XX XX
a Uncentered b Centered
Figure 27.2. Geometry of univariate regression: uncentered and centered vectors. The vector space that is the complement of 1 has dimensionality n 1. The error space containing the vector o is orthogonal to X , and has dimensionality n 2, which also specifies the degrees of freedom for the estimated variance 2 .
In this subspace, the centered vectors Y and Y, and the error vector o form a right triangle, since Y is the orthogonal projection of Y onto the vector X . Noting that o Y Y Y Y , by the Pythagoras theorem, we have
2 2 2 2 2 Y Y oY YY
This equation is equivalent to the decomposition of the total scatter, TSS, into sum of squared errors, SSE, and residual sum of squares, RSS. To see this, note that the total scatter, TSS, is defined as follows:
n
2 22 TSS yiY YY1 Y
i1
The residual sum of squares, RSS, is defined as
n
2 2 2 RSS yiYYY1Y
i1
Finally, the sum of squared errors, SSE, is defined as
S S E o 2 Y Y 2
Thus, the geometry of univariate regression makes it evident that
2 2
Y Y YY
2
YY 12 YY 12 YY2 TSSRSSSSE

726 Regression Evaluation

Notice further that since Y , Y and o form a right triangle, the cosine of the angle
betweenY andY is given as the ratio of the base to the hypotenuse. On the other hand, via Eq. 2.30, the cosine of the angle is also the correlation between Y and Y, denoted
YY. Thus, we have:

Y Y Y c o s
Y
Y Y Y Y
Notethat, whereas YY 1, due to the projection operation, the angle between Y and Y is always less than or equal to 90, which means that YY 0,1 for univariate regression. Thus, the predicted response vector Y is smaller than the true response vector Y by an amount equal to the correlation between them. Furthermore, by Eq. 27.4, the coefficient of determination is the same as the squared correlation between Y and Y
We can observe that
2 R2RSSY 2
TSS2 YY Y
Example 27.2 Geometry of Goodness of Fit. Continuing Example 27.1, the corre
lation between Y and Y is given as

Y Y c o s 9 . 3 1 6 0 . 9 6 3
Y Y

80.436 8.969 86.78
RSS TSS
The square of the correlation is equivalent to R2, since 2 0.9632 0.927
YY The angle between Y and Y is given as
cos10.963 15.7 The relatively small angle indicates a good linear fit.
27.1.3 Inference about Regression Coefficient and Bias Term
The estimated values of the bias and regression coefficient, b and w, are only point estimates for the true parameters and . To obtain confidence intervals for these parameters, we treat each yi as a random variable for the response given the corresponding fixed value xi. These random variables are all independent and identicallydistributedasY,withexpectedvaluexi andvariance2.Ontheother hand, the xi values are fixed a priori and therefore X and X2 are also fixed values.

Univariate Regression
727
We can now treat b and w as random variables, with bw
YX
ni1xi Xyi Y 1 n
n
where ci is a constant since xi is fixed, given as ci xi X
sX
w n x 2 s xiXyi i1iX Xi1 i1
ciyi
27.6
and sX ni1xi X2 is the total scatter for X, defined as the sum of squared deviations of xi from its mean X. We also use the fact that
n
i1
Note that
n 1n
n
xi XY Y xi X0
i1 X i1
i1
ci s xi X0
Mean and Variance of Regression Coefficient
The expected value of w is given as
nn n
EwE ciyi ci Eyi cixi
i1 i1 i1
nnn
ci ci xi sX xi Xxi sX sX i1 i1 i1
which follows from the observation that ni1 ci 0, and further n nn
sX xi X2
i1 i1 i1
varwvar
i1
27.7
27.8
sinceci isaconstant,varyi2,andfurther
n
n i1
2 ci2 s
X
xi2 n2X xi Xxi
Thus, w is an unbiased estimator for the true parameter .
Using the fact that the variables yi are independent and identically distributed as
Y, we can compute the variance of w as follows
n
ci2 varyi2 n1n sX1
ci yi
ci2sX2 xiX2sX2 sX
i1
i1 i1
The standard deviation of w, also called the standard error of w, is given as
sew varw
s
X

728 Regression Evaluation Mean and Variance of Bias Term
The expected value of b is given as
1 n
EbEYwXE n yiwX 1ni11n
n Eyi XEw n xi X i1 i1
X X
Thus, b is an unbiased estimator for the true parameter .
Using the observation that all yi are independent, the variance of the bias term can
be computed as follows
varb varY w X
1n
var n yi varXw
i1
1 n22Xvarw122X2
n2 nsX 1 2X 2
n sX
where we used the fact that for any two random variables A and B, we have varA B varA varB. That is, variances of A and B add, even though we are computing the variance of A B.
The standard deviation of b, also called the standard error of b, is given as
sebvarb 1 2X n sX
Covariance of Regression Coefficient and Bias
We can also compute the covariance of w and b, as follows covw,bEwbEwEbEY wXw
27.9
Y EwX Ew2 Y X varwEw2
2
Y X
sX
X 2 sX

2 X 2 Y X

sX
where we use the fact that varw Ew2 Ew2, which implies Ew2 varw Ew2,andfurtherthatYX .

Univariate Regression 729
Confidence Intervals
Since the yi variables are all normally distributed, their linear combination also follows a normal distribution. Thus, w follows a normal distribution with mean and variance 2sX. Likewise, b follows a normal distribution with mean and variance 1n2XsX2.
Since the true variance 2 is unknown, we use the estimated variance 2 from Eq. 27.2, to define the standardized variables Zw and Zb as follows
Zw wEw w Zb bEb b 27.10
sew
These variables follow the Students t distribution with n 2 degrees of freedom. Let Tn2 denote the cumulative t distribution with n 2 degrees of freedom, and let t2 denote the critical value of Tn2 that encompasses 2 of the probability mass in the right tail. That is,
PZt2 orequivalentlyTn2t21 22
s X s e b 1 n 2X s X
Since the t distribution is symmetric, we have
PZt21 orequivalentlyTn2t2
22
Given confidence level 1 , i.e., significance level 0,1, the 1001
confidence interval for the true values, and , are therefore as follows Pwt2sew wt2sew1
Pbt2seb bt2seb1
Example 27.3 Confidence Intervals. Continuing with Example 27.1, we consider the variance of the bias and regression coefficient, and their covariance. However, since we do not know the true variance 2, we use the estimated variance and the standard error for the Iris data
2SSE 2 4.28610
n2
4.286 102 0.207
Furthermore, we have
X 3.7587 sX 463.864
Therefore, the estimated variance and standard error of w is given as 2 4.286102 5
varw sX 463.864 9.24 10
sew varw 9.24 105 9.613 103

730 Regression Evaluation
The estimated variance and standard error of b is v a r b 1 2X 2
n sX
1 3.75924.286102
150 463.864
3.712 102 4.286 102 1.591 103
seb varb 1.591 103 3.989 102 and the covariance between b and w is
X 2 3.7587 4.286 102 4 covw,b sX 463.864 3.47310
For the confidence interval, we use a confidence level of 1 0.95 or 0.05. The critical value of the tdistribution, with n 2 148 degrees of freedom, that encompasses 2 0.025 fraction of the probability mass in the right tail is t2 1.976. We have
t2 sew 1.976 9.613 103 0.019
Therefore, the 95 confidence interval for the true value, , of the regression
coefficient is given as
wt2sew, wt2sew0.41640.019,0.41640.019
0.397, 0.435
Likewise, we have:
t2 seb 1.976 3.989 102 0.079 Therefore, the 95 confidence interval for the true bias term, , is
bt2seb, bt2seb0.36650.079,0.36650.079 0.446, 0.288
27.1.4 Hypothesis Testing for Regression Effects
One of the key questions in regression is whether X predicts the response Y. In the regression model, Y depends on X through the parameter , therefore, we can check for the regression effect by assuming the null hypothesis H0 that 0, with the alternative hypothesis Ha being 0:
H0 : 0 Ha : 0
When 0, the response Y depends only on the bias and the random error . In other words, X provides no information about the response variable Y.

Univariate Regression 731 Now consider the standardized variable Zw from Eq. 27.10. Under the null
hypothesis we have Ew 0. Thus,
wEw w
We can therefore compute the pvalue for the Zw statistic using a twotailed test via the t distribution with n 2 degrees of freedom. Given significance level e.g., 0.01, we reject the null hypothesis if the pvalue is below . In this case, we accept the alternative hypothesis that the estimated value of the slope parameter is significantly different from zero.
We can also define the f statistic, which is the ratio of the regression sum of squares, RSS, to the estimated variance, given as
Zw sew
s 27.11
X
R S S ni 1 y i Y 2
f2 n 2 i1yi yi
Under the null hypothesis, one can show that ERSS2
Further, it is also true that
E22
Thus, under the null hypothesis the f statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then ERSS 2, resulting in a larger f value. In fact, the f statistic follows a F distribution with 1, n 2 degrees of freedom for the numerator and denominator, respectively; therefore, we can reject the null hypothesis that w 0 if the pvalue of f is less than the significance level , say 0.01.
Interestingly the f test is equivalent to the t test since Zw2 f . We can see this as follows:
27.12
1 n
f 2
i1
1 n
2
i1
1 n
yi Y2 2 bwxi Y2
i1
Y wX wxi Y2 2
n2
1 n w 2 s X 2w2 xiX2 2
i1 w2 2 2 s X Z w
Test for Bias Term
1 n i1
2
wxi X
Note that we can also test if the bias value is statistically significant or not by setting up the null hypothesis, H0 : 0, versus the alternative hypothesis Ha : 0. We then

732 Regression Evaluation evaluate the Zb statistic see Eq. 27.10 under the null hypothesis:
Zb b Eb b 27.13 s e b 1 n 2X s X
since, under the null hypothesis Eb 0. Using a twotailed ttest with n 2 degrees of freedom, we can compute the pvalue of Zb. We reject the null hypothesis if this value is smaller than the significance level .
Example 27.4 Hypothesis Testing. We continue with Example 27.3, but now we test for the regression effect. Under the null hypothesis we have 0, further Ew 0. Therefore, the standardized variable Zw is given as
Zw wEw w 0.4164 43.32 sew sew 9.613 103
Using a twotailed t test with n 2 degrees of freedom, we find that pvalue43.32 0
Since this value is much less than the significance level 0.01, we conclude that observing such an extreme value of Zw is unlikely under the null hypothesis. Therefore, we reject the null hypothesis and accept the alternative hypothesis that 0.
Now consider the f statistic, we have
f RSS 80.436 1876.71
2 4.286 102
Using the F distribution with 1, n 2 degrees of freedom, we have
pvalue1876.71 0
In other words, such a large value of the f statistic is extremely rare, and we can reject the null hypothesis. We conclude that Y does indeed depend on X, since 0. Finally, we test whether the bias term is significant or not. Under the null
hypothesis H0 : 0, we have
Zb b 0.3665 9.188
seb 3.989 102 Using the twotailed ttest, we find
pvalue9.188 3.35 1016
It is clear that such an extreme Zb value is highly unlikely under the null hypothesis. Therefore, we accept the alternative hypothesis that Ha : 0.

Univariate Regression 733 27.1.5 StandardizedResiduals
Our assumption about the true errors i is that they are normally distributed with mean 0 and fixed variance 2. After fitting the linear model, we can examine how well the residual errors oi yi yi satisfy the normality assumption. For this, we need to compute the mean and variance of oi , by treating it as a random variable.
The mean of oi is given as
EoiEyi yiEyiEyi
xi Ebwxixi xi0
which follows from the fact that Eb and Ew .
To compute the variance of oi , we will express it as a linear combination of the yj
variables, by noting that
1 n 1 n n
n x j X
w s xj yj n X Y s xj yj X yj s yj Xj1 Xj1j1 j1X
n1
b Y w X j 1 n yj w X
Therefore, we can express oi , as follows
oi yi yi yi bwxi yi n 1yj wX wxi
j1 n yin 1yjxiXw
j1 n
yi n 1yj n xi Xxj X yj
j1 n j1 sX
1 1 xi X 2 yi 1 xi X xj X yj
27.14 where we have separated yi from the rest of the yj s, so that all terms in the summation
nsX jinsX are independent. Define aj as follows:
aj 1xiXxjX n sX
Rewriting Eq. 27.14 in terms of aj , we get
varoivar 1aiyi aj yj
j i
1ai2 varyi aj2 varyj j i
27.15

734
Regression Evaluation
2 12ai ai2 aj2, j i
n 2 12ai aj2
j1 Consider the term nj1 aj2, we have
since varyivaryj2
27.16
n 1 xi Xxj X2 aj2 n sX
n
j1 j1
n 1 2xiXxjXxiX2xjX2 j 1 n 2 n s X s X2
1 2xiXn xiX2n
n ns xj X s2 xj X2
X j1 X j1 sincenj1xj X0andnj1xj X2 sX,weget
n 1 xiX2 aj2 n s
j1 X
Plugging Equations 27.15 and 27.17 into Eq. 27.16, we get
2 2 2xi X2 1 xi X2 varoi 1n sX n sX
2 1 1 xi X 2 n sX
27.17
We can now define the standardized residual oi by dividing oi by its standard deviation after replacing 2 by its estimated value 2 . That is,
oi oi oi 27.18 varoi 1 xi X2
These standardized residuals should follow a standard normal distribution. We can thus plot the standardized residuals against the quantiles of a standard normal distribution, and check if the normality assumption holds. Significant deviations would indicate that our model assumptions may not be correct.
1n sX
Example 27.5 Standardized Residuals. Consider the Iris dataset from Exam ple 27.1, with the predictor variable petal length and response variable petal width, and n 150. Figure 27.3a shows the quantilequantile QQ plot. The yaxis is the list of standardized residuals sorted from the smallest to the largest. The xaxis

Multiple Regression
735
2
0 2
2
0
2
246
X b X vs. o
2 0 2 Q
a QQplot
Figure 27.3. Residual plots: a QuantileQuantile scatter plot of normal quantiles versus standardized residuals. b Independent variable versus standardized residuals.
is the list of the quantiles of the standard normal distribution for a sample of size n, defined as
Q q1,q2,…,qnT qi F1i0.5
n
where F is the cumulative distribution function CDF, and F 1 is the inverse CDF or quantile function see Eq. 2.2 for the normal distribution. Thus, the Q values are also sorted in increasing order from smallest to largest. If the standardized residuals follow a normal distribution, then the QQ plot should follow a straight line. Figure 27.3a plots this perfect line for comparison. We can observe that the residuals are essentially normally distributed.
The plot of the independent variable X versus the standardized residuals is also instructive. We can see in Figure 27.3b that there is no particular trend or pattern to the residuals, and the residual values are concentrated along the mean value of 0, with the majority of the points being within two standard deviations of the mean, as expected if they were sampled from a normal distribution.
27.2 MULTIPLE REGRESSION
In multiple regression there are multiple independent attributes X1,X2, ,Xd and a single dependent or response attribute Y, and we assume that the true relationship can be modeled as a linear function
Y1 X1 2 X2 …d Xd
whereistheinterceptorbiastermandi istheregressioncoefficientforattributeXi. Recallthati denotestheexpectedincreaseinYwithaunitincreaseinthevalueofXi, assuming all other variables are held fixed. We assume that is a random variable that
o
o

736 Regression Evaluation
is normally distributed with mean 0 and variance 2 . Further, we assume that the errors for different observations are all independent of each other, and consequently the observed responses are also independent.
Mean and Variance of Response Variable
Let X X1,X2, ,XdT Rd denote the multivariate random variable comprising the independent attributes. Let x x1,x2, ,xdT be some fixed value of X, and let 1,2, ,d T. The expected response value is then given as
d EYXxE1 x1 …d xd E i xi E
i1 1×1…d xd Tx
which follows from the assumption that E 0. The variance of the response variable is given as
d d
varYXxvar i xi var i xi var02 2
i1 i1
which follows from the assumption that all xi are fixed a priori. Thus, we conclude that YalsofollowsanormaldistributionwithmeanEYx di1i xi Txand variance varYx 2 .
Estimated Parameters
The true parameters , 1,2, ,d and 2 are all unknown, and have to be estimated from the training data D comprising n points xi and corresponding response values yi , for i 1,2, ,n. We augment the data matrix by adding a new column X0 with all values fixed at 1, that is, X0 1. Thus, the augmented data D Rnd1 comprises the d 1 attributes X0,X1,X2, ,Xd, and each augmented point is given as x i 1,xi1,xi2, ,xid T.
Let b w0 denote the estimated bias term, and let wi denote the estimated regression weights. The augmented vector of estimated weights, including the bias term, is
w w0,w1,,wdT
We then make predictions for any given point xi as follows:
y i b 1 w 1 x i 1 w d x i d w T x i
Recall that these estimates are obtained by minimizing the sum of squared errors
SSE, given as
n nd2 S S E y i y i 2 y i b w j x i j
i1 i1 j1

Multiple Regression
737
with the least squares estimate Eq. 23.20 given as T 1 T
w D D D Y The estimated variance 2 is then given as
SSE 1 n
y i y i 2
We divide by nd 1 to get an unbiased estimate, since nd 1 is the number of degrees of freedom for estimating SSE see Figure 27.5. In other words, out of the n training points, we need to estimate d 1 parameters, and the i s, with n d 1 remaining degrees of freedom.
Estimated Variance is Unbiased We now show that 2 is an unbiased estimator of the true but unknown variance 2. Recall from Eq. 23.18 that
YD w D D TD 1D TYHY
where H is the n n hat matrix assuming that D TD 1 exists. Note that H is an orthogonal projection matrix, since it is symmetric HT H and idempotent H2 H. The hat matrix H is symmetric since
HT D D TD 1D TT D TTD TD T1D T H and it is idempotent since
H2 D D TD 1D TD D TD 1D T D D TD 1D T H Furthermore, the trace of the hat matrix is given as
trH trD D TD 1D T trD TD D TD 1 trId1 d 1
where Id 1 is the d 1 d 1 identity matrix, and we used the fact that the trace of a product of matrices is invariant under cyclic permutations.
Finally, note that the matrix I H is also symmetric and idempotent, where I is the n n identity matrix, since
2 n d 1 n d 1
2 7 . 1 9
IHT IT HT IH
IH2 IHIHIHHH2 IH
Now consider the squared error; we have
SSEYY2 YHY2 IHY2
YTI HI HY YTI HY However, note that the response vector Y is given as
Y D
27.20
i1

738 Regression Evaluation
where 0 , 1 , , d T is the true augmented vector of parameters of the model, and 1,2, ,nT is the true error vector, which is assumed to be normally distributed with mean E 0 and with fixed variance i 2 for each point, so that cov 2I. Plugging the expression of Y into Eq. 27.20, we get
S S E Y T I H Y D T I H D D T I H D I H
0
IHTD TIHD T I H D T I H T I H
0

where we use the observation that
I H D D H D D D D T D 1 D T D D D 0
Let us consider the expected value of SSE; we have ESSEETIH
n nn n nn 2 2
E i hijij Ei hijEij i1 i1 j1 i1 i1 j1
n
1hiiEi2,sincei areindependent,andthereforeEij0 n 2 2 2

n hii ntrH nd1
i1
i1
whereweusedthefactthat2 variEi2Ei2 Ei2,sinceEi0.It follows that
2 E SSE 1 ESSE 1 nd12 2 27.21 nd 1 nd 1 nd 1
27.2.1 Goodness of Fit
Following the derivation in Section 27.1.2, the decomposition of the total sum of squares, TSS, into the sum of squared errors, SSE, and the residual sum of squares, RSS, holds true for multiple regression as well:
TSSSSERSS
n n n
yi Y2 yi yi2 yi Y2 i1 i1 i1

Multiple Regression
739
X1
Y

Figure27.4. Multipleregression:sepallengthX1andpetallengthX2withresponseattributepetal width Y. The vertical bars show the residual errors for the points. Points in white are above the plane, whereas points in gray are below the plane.
The coefficient of multiple determination, R2, gives the goodness of fit, measured as the fraction of the variation explained by the linear model:
R2 1 SSE TSSSSE RSS 27.22 TSS TSS TSS
One of the potential problems with the R2 measure is that it is susceptible to increase as the number of attributes increase, even though the additional attributes may be uninformative. To counter this, we can consider the adjusted coefficient of determination, which takes into account the degrees of freedom in both TSS and SSE
2 SSEnd 1 n1SSE
Ra 1 TSSn1 1nd1TSS 27.23
We can observe that the adjusted R2a measure is always less than R2, since the ratio n1 1. If there is too much of a difference between R2 and R2 , it might indicate that
nd1 a
there are potentially many, possibly irrelevant, attributes being used to fit the model.
Example 27.6 Multiple Regression: Goodness of Fit. Continuing with multiple regression from Example 23.3, Figure 27.4 shows the multiple regression of sepal length X1 and petal length X2 on the response attribute petal width Y for the Iris dataset with n 150 points. We also add an ext ra attribute X0 1150, which is a vector of all ones in R150. The augmented dataset D R1503 comprises n 150 points, and three attributes X0, X1 and X2.

X2

740 Regression Evaluation
Theuncentered33scattermatrixD TD anditsinversearegivenas
150.0 876.50 563.80 0.793 0.176 0.064 D TD 876.5 5223.85 3484.25 D TD 1 0.176 0.041 0.017
563.8 3484.25 2583.00 0.064 0.017 0.009 The augmented estimated weight vector w is given as
w0 0.014 w w 1 D T D 1 D T Y 0 . 0 8 2
w2 0.45 The bias term is therefore b w0 0.014, and the fitted model is
Y0.0140.082X10.45X2
Figure 27.4 shows the fitted hyperplane. It also shows the residual error for each point. The white colored points have positive residuals i.e., oi 0 or y i yi , whereas the gray points have negative residual values i.e., oi 0 or yi y.
The SSE value is given as
SSEoi2 yi yi2 6.179
i1 i1
Thus, the estimated variance and standard error of regression are given as
2 SSE

6.179 2
150 150
n d 1 S S E
4.20310
4 . 2 0 3 1 0 2 0 . 2 0 5
1 4 7
nd1
The values of total and residual sum of squares are given as
TSS 86.78 RSS 80.60
We can observe that TSS SSE RSS. The fraction of variance explained by the
model, that is the R2 value, is given as
R2 RSS 80.60 0.929
TSS 86.78
This indicates a very good fit of the multiple linear regression model. Nevertheless, it
makes sense to also consider the adjusted R2a value
R2a 1 n1SSE 1 1496.179 0.928
nd 1TSS 14786.78 The adjusted value is almost the same as the R2 value.

Multiple Regression
741
x1
Y

X2
o
X1 Y
U1
xn
U2
x2
Figure 27.5. Geometry of multiple regression. The figure shows two centered predictor variables X1 and X2, along with the corresponding orthogonal basis vectors U1 and U2. The subspace 1 is not shown. The dimensionality of the error space, containing the vector o, is n d 1, which also specifies the degrees of
freedom for the estimated variance 2 .
Geometry of Goodness of Fit
In multiple regression there are d predictor attributes X1,X2, ,Xd. We can center them by subtracting their projection along the vector 1 to obtain the centered predictor vectors Xi . Likewise, we can center the response vector Y and the predicted vector Y. Thus, we have
X i X i X i 1 Y Y Y 1 Y Y Y 1
Once Y, Y and Xi s have been centered, they all lie in the n 1 dimensional subspace orthogonal to the vector 1. Figure 27.5 shows this n 1 dimensional subspace. In this subspace, we first extract an orthogonal basis U1,U2, ,Ud via the GramSchmidt orthogonalization process outlined in Section 23.3.1, and the predicted response vector is the sum of the projections of Y onto each of the new basis vectors see Eq. 23.23.
The centered vectors Y and Y, and the error vector o form a right triangle, and thus, by the Pythagoras theorem, we have
2 2 2 2 2 Y Y oY YY
27.24
TSSRSSSSE
The correlation between Y and Y is the cosine of the angle betweenY andY, which
is also given as the ratio of the base to the hypotenuse
Y
Y Y Y c o s

742
Regression Evaluation
Furthermore, by Eq. 27.4, the coefficient of multiple determination is given as
2 R2RSSY 2
TSS2 YY Y
Example 27.7 Geometry of Goodness of Fit. Continuing Example 27.6, the corre
lation between Y and Y is given as

80.60
86.78
The angle between Y and Y is given as
cos10.964 15.5
The relatively small angle indicates a good linear fit.

Y RSS
Y Y c o s
Y TSS
0 . 9 6 4
27.2.2 Inferences about Regression Coefficients LetYbetheresponsevectoroverallobservations.Letw w0,w1,w2,,wdT bethe
estimated vector of regression coefficients, computed as T 1 T
w D D D Y The expected value of w is given as follows:
E w E D T D 1 D T Y D T D 1 D T E Y D TD 1D T ED D TD 1D TD
since E 0. Thus, w is an unbiased estimator for the true regression coefficients vector .
Next, we compute the covariance matrix for w , as follows
covw covD TD 1D TY, letting A D TD 1D T, we get
covAY A covYAT
A 2 I AT
D T D 1 D T 2 I D D T D 1
2 D T D 1 D T D D T D 1 2 D T D 1
2 7 . 2 5
Here, we made use of the fact that A D TD 1D T is a matrix of fixed values, and therefore covAY A covYAT . Also, we have covY 2 I, where I is the n n identity matrix. This follows from the fact that the observed responses yis are all independent and have the same variance 2.

Multiple Regression 743 Note that D TD Rd1d1 is the uncentered scatter matrix for the augmented
data. Let C denote the inverse of D TD . That is
D T D 1 C
Therefore, the covariance matrix for w can be written as covw 2C
In particular, the diagonal entries 2 cii give the variance for each of the regression coefficient estimates including for b w0 , and their squared roots specify the standard errors. That is
varwi2 cii sewivarwi cii
We can now define the standardized variable Zwi that can be used to derive the
confidence intervals for i as follows
wi Ewi wi i
c 27.26
Zwi sew
i ii
where we have replaced the unknown true variance 2 by 2 . Each of the variables Zwi follows a t distribution with n d 1 degrees of freedom, from which we can obtain the 1001 confidence interval of the true value i as follows:
Pwit2sewii wit2sewi1
Here, t2 is the critical value of the t distribution, with n d 1 degrees of freedom,
that encompasses 2 fraction of the probability mass in the right tail, given as
PZt2 orequivalentlyTnd1t21 22
Example 27.8 Confidence Intervals. Continuing with multiple regression from Example 27.6, we have
2 4 . 2 0 3 1 0 2
0.793 0.176 0.064 C D TD 1 0.176 0.041 0.017
0.064 0.017 0.009
Therefore, the covariance matrix of the estimated regression parameters, including
the bias term, is given as
covw 2 C7.379103
3.333 102 2.678 103
7.379 103 1.714103 7.012 104
2.678 103 7.012104
3.775 104

744 Regression Evaluation
The diagonal entries specify the variances and standard errors of each of the estimated parameters
varb 3.333 102 varw1 1.714 103 varw2 3.775 104
seb 3.333 102 0.183 sew1 1.714 103 0.0414 sew2 3.775 104 0.0194
where b w0.
Using confidence level 1 0.95 or significance level 0.05, the critical
value of the tdistribution that encompasses 0.025 fraction of the probability 2
mass in the right tail is given as t2 1.976. Thus, the 95 confidence intervals for the true bias term , and the true regression coefficients 1 and 2, are:
b t2 seb 0.014 0.074, 0.014 0.074 0.088, 0.06
1 w1 t2 sew1 0.0820.0168,0.0820.0168 0.099, 0.065
2 w2 t2 sew2 0.450.00787,0.450.00787 0.442, 0.458
27.2.3 Hypothesis Testing
Once the parameters have been estimated, it is beneficial to test whether the regression coefficients are close to zero or substantially different. For this we set up the null hypothesis that all the true weights are zero, except for the bias term 0. We contrast the null hypothesis with the alternative hypothesis that at least one of the weights is not zero
H0: 1 0,2 0,…,d 0 Ha: i, suchthati 0
The null hypothesis can also be written as H0 : 0, where 1,2, ,d T.
We use the Ftest that compares the ratio of the adjusted RSS value to the
estimated variance 2 , defined via the f statistic RSSd RSSd
f 2 S S E n d 1 2 7 . 2 7
Under the null hypothesis, we have ERSSd2

Multiple Regression 745
To see this, let us examine the regression equations in vector terms, namely Y b 1 w 1 X 1 . . . w d X d
YYw1X1 …wdXd1w1X1…wdXd Y Y 1 w 1 X 1 X 1 1 . . . w d X d X d 1 , w h i c h i m p l i e s
d i1
Yw1X1w2X2…wdXd Let us consider the RSS value; we have
wiXi
2 2 T RSSYY1 Y YY
d
Td j1
d d
wjXj
wiwjXiTXj wTDTDw

i1
wiXi
where w w1 , w2 , , wd T is the
without the bias term, and D Rnd is the centered data matrix without augmen tation by the X0 1 attribute. The expected value of RSS is thus given as
ERSS EwTDT Dw
trEwTDT Dw, since, EwTDT Dw is a scalar
EtrwTDT Dw
EtrDT DwwT, since trace is invariant under cyclic permutation
t r DT D E w w T
t r DT D c o v w E w E w T 2 7 . 2 8 tr DT D covw , since under the null hypothesis Ew 0
t r DT D 2 DT D 1 2 t r I d d 2
where Id is the d d identity matrix. We also used the fact that
covw EwwT Ew EwT, and therefore EwwT covw Ew EwT
Notice that from Eq. 27.25, the covariance matrix for the augmented weight vector w , t h a t i n c l u d e s t h e b i a s t e r m , i s g i v e n a s 2 D T D 1 . H o w e v e r , s i n c e w e a r e i g n o r i n g the bias b w0 in the hypothesis test, we are interested only in the lower right d d submatrix of D TD 1, which excludes the values related to w0. It can be shown that this submatrix is precisely the inverse of the centered scatter matrix DT D1 for the unaugmented data. We used this fact in the derivation above. Therefore, it follows that
E RSS 1 ERSS 1 d 2 2 ddd
i1 j1
d dimensional vector of regression coefficients

746 Regression Evaluation
Further, as per Eq. 27.21, the estimated variance is an unbiased estimator, so that E22
Thus, under the null hypothesis the f statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then ERSSd 2, resulting in a larger f value.
The ratio f follows a F distribution with d , n d 1 degrees of freedom for the numerator and denominator, respectively. Therefore, we can reject the null hypothesis if the pvalue is less than the chosen significance level, say 0.01.
Noticethat,sinceR21SSE RSS,wehave TSS TSS
SSE1R2TSS RSSR2 TSS Therefore, we can rewrite the f ratio as follows
RSSd nd1 R2
f SSEn d 1 d 1 R2
27.29
In other words, the F test compares the adjusted fraction of explained variation to the unexplained variation. If R2 is high, it means the model can fit the data well, and that is more evidence to reject the null hypothesis.
Hypothesis Testing for Individual Parameters
We can also test whether each independent attribute Xi, contributes significantly for the prediction of Y or not, assuming that all the attributes are still retained in the model.
For attribute Xi , we set up the null hypothesis H0 : i 0 and contrast it with the alternative hypothesis Ha : i 0. Using Eq. 27.26, the standardized variable Zwi under the null hypothesis is given as
wi Ewi wi wi
c 27.30
Zwi sew sew
i i ii
since Ewi i 0. Next, using a twotailed t test with n d 1 degrees of freedom, we compute pvalueZwi . If this probability is smaller than the significance level say 0.01, we can reject the null hypothesis. Otherwise, we accept the null hypothesis, which would imply that Xi does not add significant value in predicting the response in light of other attributes already used to fit the model. The ttest can also be used to test whether the bias term is significantly different from 0 or not.
Example 27.9 Hypothesis Testing. We continue with Example 27.8, but now we test for the regression effect. Under the null hypothesis that 1 2 0, the expected value of RSS is 2 . Thus, we expect the f statistic to be close to 1. Let us check if that

Multiple Regression 747
is the case; we have
f RSSd 80.602 958.8 2 4.203 102
Using the F distribution with d , n d 1 2, 147 degrees of freedom, we have pvalue958.8 0
In other words, such a large value of the f statistic is extremely rare, and therefore, we can reject the null hypothesis. We conclude that Y does indeed depend on at least one of the predictor attributes X1 or X2.
We can also test for each of the regression coefficients individually using the t test. For example, for 1 , let the null hypothesis be H0 : 1 0, so that the alternative hypothesis is Ha : 1 0. Assuming that the model still has both X1 and X2 as the predictor variables, we can compute the tstatistic using Eq. 27.26:
Zw1 w1 0.082 1.98 sew1 0.0414
Using a twotailed t test with n d 1 147 degrees of freedom, we find that pvalue1.98 0.0496
Since the pvalue is only marginally less than a significance level of 0.05 i.e., a 95 confidence level, this means that X1 is only weakly relevant for predicting Y in the presence of X2. In fact, if we use the more stringent significance level 0.01, we would conclude that X1 is not significantly predictive of Y, given X2.
On the other hand, individually for 2, if we test whether H0 : 2 0 versus Ha :2 0,wehave:
Zw2 w2 0.45 23.2 sew2 0.0194
Using a twotailed t test with n d 1 147 degrees of freedom, we find that pvalue23.2 0
This means, that individually X2 is significantly predictive of Y even in the presence of X1.
Using the ttest, we can also compute the pvalue for the bias term: Zb b 0.014 0.077
which has a pvalue 0.94 for a twotailed test. This means, we accept the null hypothesis that 0, and reject the alternative hypothesis that 0.
seb 0.183
Example 27.10 Centered Scatter Matrix. Here we show that the inverse of the centered scatter matrix DTD1 for the unaugmented data is the lower right

748 Regression Evaluation
submatrix of the inverse of the uncentered scatter matrix D TD 1 for the augmented data.
For the augmented data comprising X0, X1 and X2, from Example 27.6, the uncentered33scattermatrixD TD anditsinverse,aregivenas
150.0 876.50 563.80 0.793 0.176 0.064 D TD 876.5 5223.85 3484.25 D TD 1 0.176 0.041 0.017
563.8 3484.25 2583.00 0.064 0.017 0.009 For the unaugmented data comprising only X1 and X2, the centered scatter
matrix and its inverse are given as:
DT D 102.17 189.78 DT D1 0.041 0.017
189.78 463.86 0.017 0.009 WecanobservethatDTD1 isexactlythelowerright22submatrixofD TD 1.
27.2.4 Geometric Approach to Statistical Testing
The geometry of multiple regression provides further insight into the hypothesis testing approach for the regression effect. Let Xi Xi Xi 1 denote the centered attributevector,andletXX1,X2,,XdT denotethemultivariatecenteredvector of predictor variables. The ndimensional space over the points is divided into three mutually orthogonal subspaces, namely the 1dimensional mean space S span1, the d dimensional centered variable space SX spanX, and the n d 1 dimensional error space So, which contains the error vector o. The response vector Y can thus be decomposed into three components see Figures 27.2 and 27.5
Y Y 1 Y o
Recall that the degrees of freedom of a random vector is defined as the dimensionality of its enclosing subspace. Since the original dimensionality of the point space is n, we have a total of n degrees of freedom. The mean space has dimensionality dimS 1, thecenteredvariablespacehasdimSXd,andtheerrorspacehasdimSond 1, so that we have
dimSdimSXdimSo1dnd1n
Population Regression Model
Recall that the regression model posits that for a fixed value xi xi1,xi2, ,xid T, the true response yi is given as
y i 1 x i 1 . . . d x i d i
where the systematic part of the model dj 1 j xij is fixed, and the error term i varies randomly, with the assumption that i follows a normal distribution with mean 0 and variance 2. We also assume that the i values are all independent of each other.

Multiple Regression
749
So
Y
E Y X

SX
1
Figure 27.6. Population regression model. The n 1 dimensional subspace, comprising SX So , orthogonal to 1 is shown as a hyperrectangle. The true error vector is a random vector of length ; the n 1dimensional hypersphere shows the possible orientations of for a fixed length value.
Plugging Y dj 1 j Xj into the equation above, we obtain yi Y 1 xi1 X1…d xid Xdi
Y 1 xi1 …d xid i
where x ij xij Xj is the centered value for attribute Xj . Across all the points, we
can rewrite the above equation in vector form
YY 11 X1 …d Xd
We can also center the vector Y, so that we obtain a regression model over the centered response and predictor variables, as illustrated in Figure 27.6:
Y YY 11 X1 2 X2 …d Xd EYX
In this equation, di1i Xi is a fixed vector that denotes the expected value EYX and is an ndimensional random vector that is distributed according to a ndimensional multivariate normal vector with mean 0, and a fixed variance 2 along all dimensions, so that its covariance matrix is 2 I. The distribution of is therefore given as
f 1 expT1 1 exp2 2n 2 2nn 22
which follows from the fact that det det2I 2n and 1 1 I. 2
The density of is thus a function of its squared length 2, independent of its angle. In other words, the vector is distributed uniformly over all angles and is equally likely to point in any direction. Figure 27.6 illustrates the population regression model. The fixed vector EYX is shown, and one orientation of is shown. It is important

750 Regression Evaluation
to note that the n 1 dimensional hypersphere denotes the fact that the random vector can be in any orientation in this hypersphere of radius . Notice how the population regression model differs from the fitted model. The residual error vector o is orthogonal to the predicted mean response vector Y, which acts as the estimate for EYX. On the other hand, in the population regression model, the random error vector can be in any orientation compared to EYX.
Hypothesis Testing
Consider the population regression model
Y Y 1 1 X 1 . . . d X d Y 1 E Y X
To test whether X1,X2, ,Xd are useful in predicting Y, we consider what would happen if all of the regression coefficients were zero, which forms our null hypothesis
H0: 1 0,2 0,…,d 0 YY 1 YY 1 Y
Since is normally distributed with mean 0 and covariance matrix 2 I, under the null hypothesis, the variation in Y for a given value of x will therefore be centered around the origin 0.
On the other hand, under the alternative hypothesis Ha that at least one of the i is nonzero, we have
Y E Y X
Thus, the variation inY is shifted away from the origin 0 in the direction EYX.
In practice, we obviously do not know the true value of EYX, but we can estimate it by projecting the centered observation vectorY onto the subspaces SX and
So , as follows
Y w 1 X 1 w 2 X 2 . . . w d X d o Y o
Now, under the null hypothesis, the true centered response vector is Y , and therefore, Y and o are simply the projections of the random error vector onto the subspaces SX and So, as shown in Figure 27.7a. In this case, we also expect the length of o and Y to be roughly comparable. On the other hand, under the alternative hypothesis, we have Y EY X , and so Y will be relatively much longer compared to o, as shown in Figure 27.7b.
Given that we expect to see a difference between Y under the null and alternative hypotheses, this suggests a geometric test based on the relative lengths of Y and o since we do not know the true EYX or. However, there is one difficulty; we cannot compare their lengths directly, since Y lies in a d dimensional space, whereas o lies in a n d 1 dimensional space. Instead, we can compare their lengths after normalizing
In this case, we have

Multiple Regression
751
So

o
Y 1
SX
So
a Null hypothesis: Y
Y Y
SX
by the number of dimensions. Define the mean squared length of per dimension for
the two vectorsY and o, as follows
2 2 Y Y
1
b Alternative hypothesis
Figure 27.7. Geometric test for regression effect: Null versus alternative hypothesis.
MY dimSX d
o 2 o 2
Mo dimSo nd1
Thus, the geometric test for the regression effect is the ratio of the normalized mean
squared length ofY to o, given as
2
MYYd Mo o2nd1

o

752
Regression Evaluation
2 It is interesting to note that from Eq. 27.24 we have Y
2
RSS and o
Y Y2 SSE, and therefore, the geometric ratio test is identical to the Ftest in Eq. 27.27, since
2 MY Yd RSSd
Mo o2nd1 SSEnd1 f
The geometric approach, illustrated in Figure 27.7, makes it clear that if f 1 then the null hypothesis holds, and we conclude that Y does not depend on the predictor variables X1,X2, ,Xd. On the other hand, if f is large, with a pvalue less than the significance level say, 0.01, then we can reject the null hypothesis and accept the alternative hypothesis that Y depends on at least one predictor variable Xi .
27.3 FURTHER READING
For a geometrical approach to multivariate statistics see Wickens 2014, and Saville and Wood 2012. For an excellent treatment of modern statistical inference in the context of regression see Devore and Berk 2012.
Devore, J. and Berk, K. 2012. Modern Mathematical Statistics with Applications. 2nd ed. New York: Springer ScienceBusiness Media.
Saville, D. J. and Wood, G. R. 2012. Statistical Methods: The Geometric Approach. New York: Springer Science Business Media.
Wickens, T. D. 2014. The Geometry of Multivariate Statistics. New York: Psychology Press, Taylor Francis Group.
27.4 EXERCISES
Q1. Showthatforbivariateregression,wehaveYX,whereandarethe
multiple regression.
true model parameters.
Q2. Showthatn yi yiyi Y0.
i1
Q3. ProvethatoisorthogonaltoYandX.Showthisforbivariateregressionandthenfor
Q4. Show that for bivariate regression, the R2 statistic is equivalent to the squared correlation between the independent attribute vector X and response vector Y. That is, show that R2 X2 Y.
Q5. ShowthatYYY1.
Q6. Showthato YY YY .
Q7. Inbivariateregression,showthatunderthenullhypothesis,ERSS2.
Q8. Inbivariateregression,showthatE22.
Q9. ShowthatERSSd2.

Exercises 753
Q10. Treating each residual oi yi yi as a random variable, show that varoi21hii
where hii is the ith diagonal of the d 1 d 1 hat matrix H for the augmented data D . Next, using the above expression for varoi, show that for bivariate regression, the variance of the ith residual is given as
varoi211 1 xiX2 n sX
Q11. Given data matrix D, let D the centered data matrix, and D the augmented data matrix with the extra column X0 1. Let D TD be the uncentered scatter matrix for the augmented data, and let DT Dbe the scatter matrix for the centered data. Show that the lower right d d submatrix of D TD 1 is DT D1.