
CHAPTER 23 Linear Regression
Given a set of attributes or variables X1,X2,··· ,Xd, called the predictor, explanatory, or independent variables, and given a real-valued attribute of interest Y, called the response or dependent variable, the aim of regression is to predict the response variable based on the independent variables. That is, the goal is to learn a regression function f , such that
Y = f(X1, X2, ··· , Xd) + ε = f(X) + ε
where X = (X1, X2, ··· , Xd)ᵀ is the multivariate random variable comprising the predictor attributes, and ε is a random error term that is assumed to be independent of X. In other words, Y is composed of two components, one dependent on the observed predictor attributes, and the other, coming from the error term, independent of the predictor attributes. The error term encapsulates the inherent uncertainty in Y, as well as, possibly, the effect of unobserved, hidden, or latent variables.
In this chapter we discuss linear regression, where the regression function f is assumed to be a linear function of the parameters of the model. We also discuss regularized linear regression models considering both L2 (ridge regression) and L1 (Lasso) regularization. Finally, we use the kernel trick to perform kernel ridge regression that can handle non-linear models.
23.1 LINEAR REGRESSION MODEL
In linear regression the function f is assumed to be linear in its parameters, that is
f(X) = β + ω1·X1 + ω2·X2 + ··· + ωd·Xd = β + Σ_{i=1}^d ωi·Xi = β + ωᵀX    (23.1)

Here, the parameter β is the true (unknown) bias term, the parameter ωi is the true (unknown) regression coefficient or weight for attribute Xi, and ω = (ω1, ω2, ··· , ωd)ᵀ is the true d-dimensional weight vector. Observe that f specifies a hyperplane in R^{d+1}, where ω is the weight vector that is normal or orthogonal to the hyperplane, and β
is the intercept or offset term (see Section 6.1). We can see that f is completely specified by the d + 1 parameters comprising β and ωi , for i = 1, · · · , d .
The true bias and regression coefficients are unknown. Therefore, we have to estimate them from the training dataset D comprising n points xi ∈ Rd in a d-dimensional space, and the corresponding response values yi ∈ R, for i = 1,2,··· ,n. Let b denote the estimated value for the true bias β, and let wi denote the estimated value for the true regression coefficient ωi , for i = 1, 2, · · · , d . Let w = (w1 , w2 , · · · , wd )T denote the vector of estimated weights. Given the estimated bias and weight values, we can predict the response for any given input or test point x = (x1,x2,··· ,xd)T, as follows
ŷ = b + w1·x1 + ··· + wd·xd = b + wᵀx    (23.2)
Due to the random error term, the predicted value yˆ will not in general match the observed response y for the given input x. This is true even for the training data. The difference between the observed and predicted response, called the residual error, is given as
ǫ = y − ŷ = y − b − wᵀx    (23.3)
Note that the residual error ǫ is different from the random statistical error ε, which measures the difference between the observed and the (unknown) true response. The residual error ǫ is an estimator of the random error term ε.
A common approach to predicting the bias and regression coefficients is to use the method of least squares. That is, given the training data D with points xi and response values yi (for i = 1,··· ,n), we seek values b and w, so as to minimize the sum of squared residual errors (SSE)
SSE = Σ_{i=1}^n ǫi² = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − b − wᵀxi)²    (23.4)
In the following sections, we will estimate the unknown parameters, by first considering the case of a single predictor variable, and then looking at the general case of multiple predictors.
23.2 BIVARIATE REGRESSION
Let us first consider the case where the input data D comprises a single predictor attribute, X = (x1,x2,··· ,xn)T, along with the response variable, Y = (y1,y2,··· ,yn)T. Since f is linear, we have
ŷi = f(xi) = b + w·xi    (23.5)
Thus, we seek the straight line f(x) with slope w and intercept b that best fits the data. The residual error, which is the difference between the predicted value (also

called fitted value) and the observed value of the response variable, is given as
ǫi = yi − ŷi    (23.6)
Note that |ǫi | denotes the vertical distance between the fitted and observed response. The best fitting line minimizes the sum of squared errors
min_{b,w} SSE = Σ_{i=1}^n ǫi² = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − b − w·xi)²    (23.7)
To solve this objective, we differentiate it with respect to b and set the result to 0,
to obtain
∂/∂b SSE = −2 Σ_{i=1}^n (yi − b − w·xi) = 0
⟹ Σ_{i=1}^n b = Σ_{i=1}^n yi − w Σ_{i=1}^n xi
⟹ b = (1/n) Σ_{i=1}^n yi − w·(1/n) Σ_{i=1}^n xi

Therefore, we have

b = μY − w·μX    (23.8)

where μY is the sample mean for the response and μX is the sample mean for the predictor attribute. Similarly, differentiating with respect to w, we obtain

∂/∂w SSE = −2 Σ_{i=1}^n xi·(yi − b − w·xi) = 0
⟹ Σ_{i=1}^n xi·yi − b Σ_{i=1}^n xi − w Σ_{i=1}^n xi² = 0    (23.9)

Substituting b from Eq. (23.8), we have

⟹ Σ_{i=1}^n xi·yi − μY Σ_{i=1}^n xi + w·μX Σ_{i=1}^n xi − w Σ_{i=1}^n xi² = 0
⟹ w·(Σ_{i=1}^n xi² − n·μX²) = (Σ_{i=1}^n xi·yi) − n·μX·μY
⟹ w = ((Σ_{i=1}^n xi·yi) − n·μX·μY) / ((Σ_{i=1}^n xi²) − n·μX²)    (23.10)
The regression coefficient w can also be written as
w = (Σ_{i=1}^n (xi − μX)(yi − μY)) / (Σ_{i=1}^n (xi − μX)²) = σXY/σX² = cov(X, Y)/var(X)    (23.11)

where σX² is the variance of X and σXY is the covariance between X and Y. Noting that
the correlation between X and Y is given as ρXY = σXY/(σX·σY), we can also write w as

w = ρXY·(σY/σX)    (23.12)
Observe that the fitted line must pass through the mean value of Y and X; plugging in the optimal value of b from Eq. (23.8) into the regression equation [Eq. (23.5)], we have
ŷi = b + w·xi = μY − w·μX + w·xi = μY + w·(xi − μX)
That is, when xi = μX, we have ŷi = μY. Thus, the point (μX, μY) lies on the regression line.
Example 23.1 (Bivariate Regression). Figure 23.1 shows the scatterplot between the two attributes petal length (X; the predictor variable) and petal width (Y; the response variable) in the Iris dataset. There are a total of n = 150 data points. The mean values for these two variables are
μX = (1/150)·Σ_{i=1}^{150} xi = 563.8/150 = 3.7587
μY = (1/150)·Σ_{i=1}^{150} yi = 179.8/150 = 1.1987
Figure 23.1. Scatterplot: petal length (X) versus petal width (Y). Solid circle (black) shows the mean point; residual error is shown for two sample points: x9 and x35.
The variance and covariance are given as

σX² = (1/150)·Σ_{i=1}^{150} (xi − μX)² = 3.0924
σY² = (1/150)·Σ_{i=1}^{150} (yi − μY)² = 0.5785
σXY = (1/150)·Σ_{i=1}^{150} (xi − μX)·(yi − μY) = 1.2877

Assuming a linear relationship between the response and predictor variables, we use Eq. (23.8) and Eq. (23.10) to obtain the slope and intercept terms as follows
w = σXY/σX² = 1.2877/3.0924 = 0.4164
b = μY − w·μX = 1.1987 − 0.4164·3.7587 = −0.3665

Thus, the fitted regression line is

ŷ = −0.3665 + 0.4164·x
Figure 23.1 plots the best-fitting or regression line; we can observe that the mean point (μX,μY) = (3.759,1.199) lies on the line. The figure also shows the residual errors, ǫ9 and ǫ35, for the points x9 and x35, respectively.
Finally, we can compute the SSE value (see Eq. (23.4)) as follows:
SSE = Σ_{i=1}^{150} ǫi² = Σ_{i=1}^{150} (yi − ŷi)² = 6.343
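For concreteness, here is a minimal NumPy sketch of this computation, assuming the hypothetical arrays x and y stand in for the petal length and petal width columns (the few values shown are illustrative, not the full Iris data):

import numpy as np

def bivariate_fit(x, y):
    """Least squares fit y = b + w*x via sample means, variance and covariance."""
    mu_x, mu_y = x.mean(), y.mean()
    # w = cov(X, Y) / var(X), Eq. (23.11); b = mu_Y - w*mu_X, Eq. (23.8)
    w = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)
    b = mu_y - w * mu_x
    return b, w

x = np.array([1.4, 4.7, 5.9, 1.3, 4.4])
y = np.array([0.2, 1.4, 2.1, 0.2, 1.3])
b, w = bivariate_fit(x, y)
y_hat = b + w * x
sse = np.sum((y - y_hat) ** 2)
print(b, w, sse)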
23.2.1 Geometry of Bivariate Regression
We now turn to the attribute-centric view, which provides important geometric insight for bivariate regression. Recall that we are given n equations in two unknowns, namely ŷi = b + w·xi, for i = 1, ··· , n. Let X = (x1, x2, ··· , xn)ᵀ be the n-dimensional vector denoting the training data sample, Y = (y1, y2, ··· , yn)ᵀ the sample vector for the response variable, and Ŷ = (ŷ1, ŷ2, ··· , ŷn)ᵀ the vector of predicted values; then we can express the n equations, ŷi = b + w·xi for i = 1, 2, ··· , n, as a single vector equation:
Ŷ = b·1 + w·X    (23.13)

where 1 ∈ Rⁿ is the n-dimensional vector of all ones. This equation indicates that the predicted vector Ŷ is a linear combination of 1 and X, i.e., it must lie in the column space spanned by 1 and X, given as span({1, X}). On the other hand, the response vector Y will not usually lie in the same column space. In fact, the residual error vector

ǫ = (ǫ1, ǫ2, ··· , ǫn)ᵀ captures the deviation between the response and predicted vectors

ǫ = Y − Ŷ

Figure 23.2. Geometry of bivariate regression: non-orthogonal basis. All the vectors conceptually lie in the n-dimensional space spanned by the n data points. The plane illustrates the subspace spanned by 1 and X.

Figure 23.3. Orthogonal decomposition of X into X̄ and μX·1.
The geometry of the problem, shown in Figure 23.2, makes it clear that the optimal Ŷ that minimizes the error is the orthogonal projection of Y onto the subspace spanned by 1 and X. The residual error vector ǫ is thus orthogonal to the subspace spanned by 1 and X, and its squared length (or magnitude) equals the SSE value (see Eq. (23.4)), since

∥ǫ∥² = ∥Y − Ŷ∥² = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n ǫi² = SSE
At this point it is worth noting that even though 1 and X are linearly independent and form a basis for the column space, they need not be orthogonal (see Figure 23.2). We can create an orthogonal basis by decomposing X into a component along 1 and a component orthogonal to 1 as shown in Figure 23.3. Recall that the scalar projection
of a vector b onto vector a (see Eq. (1.12)) is given as

proj_a(b) = (bᵀa)/(aᵀa)

and the orthogonal projection of b on a (see Eq. (1.11)) is given as

proj_a(b)·a = ((bᵀa)/(aᵀa))·a

Now, consider the projection of X onto 1; we have

proj_1(X)·1 = ((Xᵀ1)/(1ᵀ1))·1 = ((Σ_{i=1}^n xi)/n)·1 = μX·1

Thus, we can rewrite X as

X = μX·1 + (X − μX·1) = μX·1 + X̄

where X̄ = X − μX·1 is the centered attribute vector, obtained by subtracting the mean μX from all points.

Figure 23.4. Geometry of bivariate regression: orthogonal basis. X̄ = X − μX·1 is the centered attribute vector. The plane illustrates the subspace spanned by the orthogonal vectors 1 and X̄.

The two vectors 1 and X̄ form an orthogonal basis for the subspace. We can thus obtain the predicted vector Ŷ by projecting Y onto 1 and X̄, and summing up these two components, as shown in Figure 23.4. That is,

Ŷ = proj_1(Y)·1 + proj_X̄(Y)·X̄ = ((Yᵀ1)/(1ᵀ1))·1 + ((YᵀX̄)/(X̄ᵀX̄))·X̄ = μY·1 + ((YᵀX̄)/(X̄ᵀX̄))·X̄    (23.14)

On the other hand, from Eq. (23.13), we know that

Ŷ = b·1 + w·X = b·1 + w·(μX·1 + X̄) = (b + w·μX)·1 + w·X̄    (23.15)

Since both Eq. (23.14) and Eq. (23.15) are expressions for Ŷ, we can equate them to obtain

μY = b + w·μX  or  b = μY − w·μX        and        w = (YᵀX̄)/(X̄ᵀX̄)

where the bias term b = μY − w·μX matches Eq. (23.8), and the weight w also matches Eq. (23.10), since

w = (YᵀX̄)/(X̄ᵀX̄) = (YᵀX̄)/∥X̄∥² = (Yᵀ(X − μX·1))/∥X − μX·1∥² = ((Σ_{i=1}^n yi·xi) − n·μX·μY)/((Σ_{i=1}^n xi²) − n·μX²)
Example 23.2 (Geometry of Regression). Let us consider the regression of petal length (X) on petal width (Y) for the Iris dataset, with n = 150. First, we center X by subtracting the mean μX = 3.759. Next, we compute the scalar projections of Y onto 1 and X̄, to obtain
μY = proj_1(Y) = (Yᵀ1)/(1ᵀ1) = 179.8/150 = 1.1987
w = proj_X̄(Y) = (YᵀX̄)/(X̄ᵀX̄) = 193.16/463.86 = 0.4164

Thus, the bias term b is given as

b = μY − w·μX = 1.1987 − 0.4164·3.7587 = −0.3665

These values for b and w match those in Example 23.1. Finally, we can compute the SSE value (see Eq. (23.4)) as the squared length of the residual error vector

SSE = ∥ǫ∥² = ∥Y − Ŷ∥² = (Y − Ŷ)ᵀ(Y − Ŷ) = 6.343
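The projection view translates just as directly into code; a minimal sketch, assuming the same hypothetical arrays x and y as in the earlier sketch, recovers b and w from the scalar projections of Y onto 1 and the centered X̄:

import numpy as np

def bivariate_fit_projection(x, y):
    """Bivariate regression via orthogonal projections onto 1 and the centered X."""
    n = len(x)
    ones = np.ones(n)
    x_c = x - x.mean() * ones          # centered attribute vector X - mu_X * 1
    mu_y = (y @ ones) / (ones @ ones)  # scalar projection of Y onto 1
    w = (y @ x_c) / (x_c @ x_c)        # scalar projection of Y onto centered X
    b = mu_y - w * x.mean()
    return b, w

x = np.array([1.4, 4.7, 5.9, 1.3, 4.4])
y = np.array([0.2, 1.4, 2.1, 0.2, 1.3])
print(bivariate_fit_projection(x, y))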
23.3 MULTIPLE REGRESSION
We now consider the more general case called multiple regression1 where we have multiple predictor attributes X1,X2,··· ,Xd and a single response attribute Y. The training data sample D ∈ Rn×d comprises n points xi = (xi1,xi2,··· ,xid )T in a d-dimensional space, along with the corresponding observed response value yi. The vector Y = (y1 , y2 , · · · , yn )T denotes the observed response vector. The predicted response value for input xi is given as
ŷi = b + w1·xi1 + w2·xi2 + ··· + wd·xid = b + wᵀxi    (23.16)

where w = (w1, w2, ··· , wd)ᵀ is the weight vector comprising the regression coefficients or weights wj along each attribute Xj. Eq. (23.16) defines a hyperplane in R^{d+1} with bias term b and normal vector w.
Instead of dealing with the bias b separately from the weights wi for each attribute, we can introduce a new “constant” valued attribute X0 whose value is always fixed at 1, so that each input point xi = (xi1, xi2, ··· , xid)ᵀ ∈ Rᵈ is mapped to an augmented point x̃i = (xi0, xi1, xi2, ··· , xid)ᵀ ∈ R^{d+1}, where xi0 = 1. Likewise, the weight vector w = (w1, w2, ··· , wd)ᵀ is mapped to an augmented weight vector w̃ = (w0, w1, w2, ··· , wd)ᵀ, where w0 = b. The predicted response value for an augmented (d + 1)-dimensional point x̃i can be written as

ŷi = w0·xi0 + w1·xi1 + w2·xi2 + ··· + wd·xid = w̃ᵀx̃i    (23.17)

¹ We follow the usual terminology and reserve the term multivariate regression for the case when there are multiple response attributes Y1, Y2, ··· , Yq and multiple predictor attributes X1, X2, ··· , Xd.
Since there are n points, in fact we have n such equations, one per point, and there are (d + 1) unknowns, namely the elements of the augmented weight vector w ̃ . We can compactly write all these n equations as a single matrix equation, given as
Ŷ = D̃w̃    (23.18)

where D̃ ∈ R^{n×(d+1)} is the augmented data matrix, which includes the constant attribute X0 in addition to the predictor attributes X1, X2, ··· , Xd, and Ŷ = (ŷ1, ŷ2, ··· , ŷn)ᵀ is the vector of predicted responses.
The multiple regression task can now be stated as finding the best fitting hyperplane defined by the weight vector w ̃ that minimizes the sum of squared errors
min_{w̃} SSE = Σ_{i=1}^n ǫi² = ∥ǫ∥² = ∥Y − Ŷ∥²    (23.19)
            = (Y − Ŷ)ᵀ(Y − Ŷ) = YᵀY − 2YᵀŶ + ŶᵀŶ
            = YᵀY − 2Yᵀ(D̃w̃) + (D̃w̃)ᵀ(D̃w̃)
            = YᵀY − 2w̃ᵀ(D̃ᵀY) + w̃ᵀ(D̃ᵀD̃)w̃

where we substituted Ŷ = D̃w̃ from Eq. (23.18), and we used the fact that Yᵀ(D̃w̃) = (D̃w̃)ᵀY = w̃ᵀ(D̃ᵀY).
To solve the objective, we differentiate the expression in Eq. (23.19) with respect to w ̃ and set the result to 0, to obtain
∂/∂w̃ SSE = −2D̃ᵀY + 2(D̃ᵀD̃)w̃ = 0
⟹ (D̃ᵀD̃)w̃ = D̃ᵀY

Therefore, the optimal weight vector is given as

w̃ = (D̃ᵀD̃)⁻¹D̃ᵀY    (23.20)

Substituting the optimal value of w̃ into Eq. (23.18), we get

Ŷ = D̃w̃ = D̃(D̃ᵀD̃)⁻¹D̃ᵀY = HY

where H = D̃(D̃ᵀD̃)⁻¹D̃ᵀ is called the hat matrix, since it puts the “hat” on Y! Notice also that D̃ᵀD̃ is the uncentered scatter matrix for the training data.
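A minimal NumPy sketch of Eq. (23.20), assuming D and y hold a hypothetical unaugmented data matrix and response vector, solves the normal equations with np.linalg.solve rather than forming the inverse explicitly:

import numpy as np

def multiple_regression(D, y):
    """Solve (D~^T D~) w~ = D~^T y, i.e. Eq. (23.20), for the augmented weight vector."""
    n = D.shape[0]
    D_aug = np.column_stack([np.ones(n), D])   # add constant attribute X0 = 1
    A = D_aug.T @ D_aug                        # uncentered scatter matrix
    w_aug = np.linalg.solve(A, D_aug.T @ y)    # avoids computing the inverse
    y_hat = D_aug @ w_aug                      # predicted response vector
    return w_aug, y_hat

D = np.array([[5.1, 1.4], [7.0, 4.7], [6.3, 6.0], [4.9, 1.5]])
y = np.array([0.2, 1.4, 2.5, 0.1])
w_aug, y_hat = multiple_regression(D, y)
print(w_aug)                       # [b, w1, w2]
print(np.sum((y - y_hat) ** 2))    # SSE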

Figure 23.5. Multiple regression: sepal length (X1) and petal length (X2) with response attribute petal width (Y). The vertical bars show the residual error for the points. Points in white are above the plane, whereas points in gray are below the plane.

Example 23.3 (Multiple Regression). Figure 23.5 shows the multiple regression of sepal length (X1) and petal length (X2) on the response attribute petal width (Y) for the Iris dataset with n = 150 points. We first add an extra attribute X0 = 1_{150}, which is a vector of all ones in R^{150}. The augmented dataset D̃ ∈ R^{150×3} comprises n = 150 points along three attributes X0, X1, and X2.

Next, we compute the uncentered 3×3 scatter matrix D̃ᵀD̃ and its inverse

D̃ᵀD̃ = [ 150.0   876.50   563.80
         876.5  5223.85  3484.25
         563.8  3484.25  2583.00 ]

(D̃ᵀD̃)⁻¹ = [  0.793  −0.176   0.064
             −0.176   0.041  −0.017
              0.064  −0.017   0.009 ]

We also compute D̃ᵀY, given as

D̃ᵀY = ( 179.80, 1127.65, 868.97 )ᵀ

The augmented weight vector w̃ is then given as

w̃ = ( w0, w1, w2 )ᵀ = (D̃ᵀD̃)⁻¹·(D̃ᵀY) = ( −0.014, −0.082, 0.45 )ᵀ

The bias term is therefore b = w0 = −0.014, and the fitted model is

Ŷ = −0.014 − 0.082·X1 + 0.45·X2

Figure 23.5 shows the fitted hyperplane. It also shows the residual error for each point. The white colored points have positive residuals (i.e., ǫi > 0 or ŷi > yi), whereas
the gray points have negative residual values (i.e., ǫi < 0 or ŷi < yi). The SSE value for the model is 6.18.

23.3.1 Geometry of Multiple Regression

Let D̃ be the augmented data matrix comprising the d independent attributes Xi, along with the new constant attribute X0 = 1 ∈ Rⁿ, given as

D̃ = ( X0  X1  X2  ···  Xd )

Let w̃ = (w0, w1, ··· , wd)ᵀ ∈ R^{d+1} be the augmented weight vector that incorporates the bias term b = w0. Recall that the predicted response vector is given as

Ŷ = b·1 + w1·X1 + w2·X2 + ··· + wd·Xd = Σ_{i=0}^d wi·Xi = D̃w̃

This equation makes it clear that the predicted vector must lie in the column space of the augmented data matrix D̃, denoted col(D̃), i.e., it must be a linear combination of the attribute vectors Xi, i = 0, ··· , d.

To minimize the error in prediction, Ŷ must be the orthogonal projection of Y onto the subspace col(D̃). The residual error vector ǫ = Y − Ŷ is thus orthogonal to the subspace col(D̃), which means that it is orthogonal to each attribute vector Xi. That is,

Xiᵀǫ = 0
⟹ Xiᵀ(Y − Ŷ) = 0
⟹ XiᵀŶ = XiᵀY
⟹ Xiᵀ(D̃w̃) = XiᵀY
⟹ w0·XiᵀX0 + w1·XiᵀX1 + ··· + wd·XiᵀXd = XiᵀY

We thus have (d + 1) equations, called the normal equations, in (d + 1) unknowns, namely the regression coefficients or weights wi (including the bias term w0). The solution to these simultaneous equations yields the weight vector w̃ = (w0, w1, ··· , wd)ᵀ. The (d + 1) normal equations are

w0·X0ᵀX0 + w1·X0ᵀX1 + ··· + wd·X0ᵀXd = X0ᵀY
w0·X1ᵀX0 + w1·X1ᵀX1 + ··· + wd·X1ᵀXd = X1ᵀY
        ⋮                                  ⋮
w0·XdᵀX0 + w1·XdᵀX1 + ··· + wd·XdᵀXd = XdᵀY    (23.21)

which can be written compactly in matrix notation to solve for w̃ as follows

[ X0ᵀX0  X0ᵀX1  ···  X0ᵀXd
  X1ᵀX0  X1ᵀX1  ···  X1ᵀXd
    ⋮       ⋮     ⋱     ⋮
  XdᵀX0  XdᵀX1  ···  XdᵀXd ] · w̃ = D̃ᵀY

(D̃ᵀD̃)w̃ = D̃ᵀY    (23.22)
w̃ = (D̃ᵀD̃)⁻¹(D̃ᵀY)

This matches the expression in Eq. (23.20).

More insight can be obtained by noting that the attribute vectors comprising the column space of D̃ are not necessarily orthogonal, even if we assume they are linearly independent. To obtain the projected vector Ŷ, we first need to construct an orthogonal basis for col(D̃). Let U0, U1, ··· , Ud denote the set of orthogonal basis vectors for col(D̃). We can construct these vectors in a step-wise manner via Gram–Schmidt orthogonalization, as follows

U0 = X0
U1 = X1 − p10·U0
U2 = X2 − p20·U0 − p21·U1
     ⋮
Ud = Xd − pd0·U0 − pd1·U1 − ··· − p_{d,d−1}·U_{d−1}

where

p_{ji} = proj_Ui(Xj) = (XjᵀUi)/∥Ui∥²

denotes the scalar projection of attribute Xj onto the basis vector Ui. Essentially, to obtain Uj, we subtract from vector Xj its scalar projections along all previous basis vectors U0, U1, ··· , U_{j−1}. Rearranging the equations above, so that Xj is on the left hand side, we get

X0 = U0
X1 = p10·U0 + U1
X2 = p20·U0 + p21·U1 + U2
     ⋮
Xd = pd0·U0 + pd1·U1 + ··· + p_{d,d−1}·U_{d−1} + Ud
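The step-wise construction above can be mirrored in a short routine; a minimal sketch, assuming a small hypothetical augmented matrix, returns the un-normalized basis vectors Ui together with the projection coefficients p_{ji}:

import numpy as np

def gram_schmidt_unnormalized(D_aug):
    """Columns of D_aug -> orthogonal (not orthonormal) basis U and coefficients
    P[i, j] = p_{ji} = proj_{U_i}(X_j), so that D_aug = U @ P."""
    n, m = D_aug.shape
    U = np.zeros((n, m))
    P = np.eye(m)
    for j in range(m):
        U[:, j] = D_aug[:, j]
        for i in range(j):
            p_ji = (D_aug[:, j] @ U[:, i]) / (U[:, i] @ U[:, i])
            P[i, j] = p_ji
            U[:, j] -= p_ji * U[:, i]
    return U, P

D_aug = np.array([[1., 5.1, 1.4], [1., 7.0, 4.7], [1., 6.3, 6.0], [1., 4.9, 1.5]])
U, R = gram_schmidt_unnormalized(D_aug)
print(np.allclose(D_aug, U @ R))   # True: D_aug = Q R with unnormalized Q
print(np.round(U.T @ U, 6))        # diagonal matrix of squared norms of the U_i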
The Gram–Schmidt method thus results in the so-called QR-factorization² of the data matrix, namely D̃ = QR, where by construction Q is an n × (d + 1) matrix with orthogonal columns

Q = ( U0  U1  U2  ···  Ud )

and R is the (d + 1) × (d + 1) upper-triangular matrix

R = [ 1  p10  p20  ···     pd0
      0   1   p21  ···     pd1
      0   0    1   ···     pd2
      ⋮   ⋮    ⋮    ⋱       ⋮
      0   0    0    1   p_{d,d−1}
      0   0    0    0       1   ]

So that, in the column view, the QR-factorization of the augmented data matrix is given as

( X0  X1  X2  ···  Xd ) = ( U0  U1  U2  ···  Ud ) · R
          D̃                          Q

Since the new basis vectors U0, U1, ··· , Ud form an orthogonal basis for the column space of D̃, we can obtain the predicted response vector as a sum of the projections of Y along each new basis vector, given as

Ŷ = proj_U0(Y)·U0 + proj_U1(Y)·U1 + ··· + proj_Ud(Y)·Ud    (23.23)

Bias Term
The geometric approach makes it easy to derive an expression for the bias term b. Note that each of the predictor attributes can be centered by removing its projection along the vector 1. Define X̄i to be the centered attribute vector

X̄i = Xi − μXi·1

All the centered vectors X̄i lie in the (n − 1)-dimensional space comprising the orthogonal complement of the span of 1. From the expression of Ŷ, we have

Ŷ = b·1 + w1·X1 + w2·X2 + ··· + wd·Xd
  = b·1 + w1·(X̄1 + μX1·1) + ··· + wd·(X̄d + μXd·1)
  = (b + w1·μX1 + ... + wd·μXd)·1 + w1·X̄1 + ... + wd·X̄d    (23.24)

On the other hand, since 1 is orthogonal to all X̄i, we can obtain another expression for Ŷ in terms of the projection of Y onto the subspace spanned by the vectors {1, X̄1, ··· , X̄d}. Let the new orthogonal basis for these centered attribute vectors be {U0, U1, ··· , Ud}, where U0 = 1. Thus, Ŷ can also be written as

Ŷ = proj_U0(Y)·U0 + Σ_{i=1}^d proj_Ui(Y)·Ui = proj_1(Y)·1 + Σ_{i=1}^d proj_Ui(Y)·Ui    (23.25)

In particular, equating the scalar projections along 1 in Eq. (23.24) and Eq. (23.25), we get:

proj_1(Y) = μY = (b + w1·μX1 + ... + wd·μXd), which implies

b = μY − w1·μX1 − ... − wd·μXd = μY − Σ_{i=1}^d wi·μXi    (23.26)

where we use the fact that the scalar projection of any attribute vector onto 1 yields the mean value of that attribute. For example,

proj_1(Y) = (Yᵀ1)/(1ᵀ1) = (1/n)·Σ_{i=1}^n yi = μY

23.3.2 Multiple Regression Algorithm

The pseudo-code for multiple regression is shown in Algorithm 23.1. The algorithm starts by the QR-decomposition of D̃ = QR, where Q is a matrix with orthogonal columns that make up an orthogonal basis, and R is an upper triangular matrix, which can be obtained via Gram–Schmidt orthogonalization. Note that, by construction, the matrix QᵀQ is given as

QᵀQ = [ ∥U0∥²    0    ···    0
           0   ∥U1∥²  ···    0
           ⋮      ⋮     ⋱     ⋮
           0      0    ··· ∥Ud∥² ] = Δ

Algorithm 23.1: Multiple Regression Algorithm

MULTIPLE-REGRESSION (D, Y):
1  D̃ ← (1  D) // augmented data with X0 = 1 ∈ Rⁿ
2  {Q, R} ← QR-factorization(D̃) // Q = (U0  U1  ···  Ud)
3  Δ⁻¹ ← diag(1/∥U0∥², 1/∥U1∥², ··· , 1/∥Ud∥²) // reciprocal squared norms
4  R w̃ ← Δ⁻¹ QᵀY // solve for w̃ by back-substitution
5  Ŷ ← Q Δ⁻¹ QᵀY

² In QR-factorization, the matrix Q is orthogonal, with orthonormal columns, i.e., with orthogonal columns that are normalized to be unit length. However, we keep the basis vectors un-normalized for ease of presentation.
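A minimal NumPy sketch of Algorithm 23.1 is given below, assuming np.linalg.qr, whose Q has orthonormal (rather than merely orthogonal) columns; in that case Δ reduces to the identity and the back-substitution step solves R w̃ = QᵀY directly:

import numpy as np

def multiple_regression_qr(D, y):
    """QR-based multiple regression with an orthonormal Q (Delta = Q^T Q = I)."""
    n = D.shape[0]
    D_aug = np.column_stack([np.ones(n), D])
    Q, R = np.linalg.qr(D_aug)          # D_aug = Q R
    rhs = Q.T @ y                        # plays the role of Delta^{-1} Q^T y
    m = R.shape[1]
    w = np.zeros(m)
    for k in range(m - 1, -1, -1):       # back-substitution on upper-triangular R
        w[k] = (rhs[k] - R[k, k + 1:] @ w[k + 1:]) / R[k, k]
    y_hat = Q @ rhs                      # predicted response vector
    return w, y_hat

D = np.array([[5.1, 1.4], [7.0, 4.7], [6.3, 6.0], [4.9, 1.5]])
y = np.array([0.2, 1.4, 2.5, 0.1])
w, y_hat = multiple_regression_qr(D, y)
print(np.round(w, 4))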
We can observe that the matrix QᵀQ, denoted Δ, is a diagonal matrix that contains the squared norms of the new orthogonal basis vectors U0, U1, ··· , Ud. Recall that the solution to multiple regression is given via the normal equations [Eq. (23.21)], which can be compactly written as (D̃ᵀD̃)w̃ = D̃ᵀY [Eq. (23.22)]; plugging in the QR-decomposition, we get

(D̃ᵀD̃)w̃ = D̃ᵀY
(QR)ᵀ(QR)w̃ = (QR)ᵀY
Rᵀ(QᵀQ)Rw̃ = RᵀQᵀY
RᵀΔRw̃ = RᵀQᵀY
ΔRw̃ = QᵀY
Rw̃ = Δ⁻¹QᵀY

Note that Δ⁻¹ is a diagonal matrix that records the reciprocal of the squared norms of the new basis vectors U0, U1, ··· , Ud. Furthermore, since R is upper-triangular, it is straightforward to solve for w̃ by back-substitution. Note also that we can obtain the predicted vector Ŷ as follows

Ŷ = D̃w̃ = QRR⁻¹Δ⁻¹QᵀY = Q(Δ⁻¹QᵀY)

It is interesting to note that Δ⁻¹QᵀY gives the vector of scalar projections of Y onto each of the orthogonal basis vectors

Δ⁻¹QᵀY = ( proj_U0(Y), proj_U1(Y), ··· , proj_Ud(Y) )ᵀ

Therefore, Q(Δ⁻¹QᵀY) yields the projection formula in Eq. (23.23)

Ŷ = Q·( proj_U0(Y), proj_U1(Y), ··· , proj_Ud(Y) )ᵀ = proj_U0(Y)·U0 + proj_U1(Y)·U1 + ··· + proj_Ud(Y)·Ud

Example 23.4 (Multiple Regression: QR-Factorization and Geometric Approach). Consider the multiple regression of sepal length (X1) and petal length (X2) on the response attribute petal width (Y) for the Iris dataset with n = 150 points, as shown in Figure 23.5. The augmented dataset D̃ ∈ R^{150×3} comprises n = 150 points along three attributes X0, X1, and X2, where X0 = 1. The Gram–Schmidt orthogonalization results in the following QR-factorization:

( X0  X1  X2 ) = ( U0  U1  U2 ) · [ 1  5.843  3.759
                                    0    1    1.858
                                    0    0      1   ]
       D̃               Q                    R

Note that Q ∈ R^{150×3} and therefore we do not show the matrix. The matrix Δ, which records the squared norms of the basis vectors, and its inverse matrix, are given as

Δ = [ 150     0      0          Δ⁻¹ = [ 0.00667     0        0
        0  102.17    0                     0      0.00979     0
        0     0   111.35 ]                 0         0     0.00898 ]

We can use back-substitution to solve for w̃, as follows

R w̃ = Δ⁻¹QᵀY
[ 1  5.843  3.759     [ w0       [ 1.1987
  0    1    1.858   ·   w1   =     0.7538
  0    0      1   ]     w2 ]       0.4499 ]

In back-substitution, we start with w2, which is easy to compute from the equation above; it is simply

w2 = 0.4499

Next, w1 is given as:

w1 + 1.858·w2 = 0.7538 ⟹ w1 = 0.7538 − 0.8358 = −0.082

Finally, w0 can be computed as

w0 + 5.843·w1 + 3.759·w2 = 1.1987 ⟹ w0 = 1.1987 + 0.4786 − 1.6911 = −0.0139

Thus, the multiple regression model is given as

Ŷ = −0.014·X0 − 0.082·X1 + 0.45·X2    (23.27)

which matches the model in Example 23.3.

It is also instructive to construct the new basis vectors U0, U1, ··· , Ud in terms of the original attributes X0, X1, ··· , Xd. Since D̃ = QR, we have Q = D̃R⁻¹. The inverse of R is also upper-triangular, and is given as

R⁻¹ = [ 1  −5.843   7.095
        0     1    −1.858
        0     0       1   ]

Therefore, we can write Q in terms of the original attributes as

( U0  U1  U2 ) = ( X0  X1  X2 ) · [ 1  −5.843   7.095
                                    0     1    −1.858
                                    0     0       1   ]
       Q               D̃                    R⁻¹

which results in

U0 = X0
U1 = −5.843·X0 + X1
U2 = 7.095·X0 − 1.858·X1 + X2

The scalar projection of the response vector Y onto each of the new basis vectors yields:

proj_U0(Y) = 1.199    proj_U1(Y) = 0.754    proj_U2(Y) = 0.45

Finally, the fitted response vector is given as:

Ŷ = proj_U0(Y)·U0 + proj_U1(Y)·U1 + proj_U2(Y)·U2
  = 1.199·X0 + 0.754·(−5.843·X0 + X1) + 0.45·(7.095·X0 − 1.858·X1 + X2)
  = (1.199 − 4.406 + 3.193)·X0 + (0.754 − 0.836)·X1 + 0.45·X2
  = −0.014·X0 − 0.082·X1 + 0.45·X2

which matches Eq. (23.27).
23.3.3 Multiple Regression: Stochastic Gradient Descent

Instead of using the QR-factorization approach to exactly solve the multiple regression problem, we can also employ the simpler stochastic gradient algorithm. Consider the SSE objective given in Eq. (23.19) (multiplied by 1/2):

min_{w̃} SSE = (1/2)·( YᵀY − 2w̃ᵀ(D̃ᵀY) + w̃ᵀ(D̃ᵀD̃)w̃ )

The gradient of the SSE objective is given as

∇w̃ = ∂/∂w̃ SSE = −D̃ᵀY + (D̃ᵀD̃)w̃    (23.28)

Using gradient descent, starting from an initial weight vector estimate w̃^0, we can iteratively update w̃ as follows

w̃^{t+1} = w̃^t − η·∇w̃ = w̃^t + η·D̃ᵀ(Y − D̃·w̃^t)

where w̃^t is the estimate of the weight vector at step t. In stochastic gradient descent (SGD), we update the weight vector by considering only one (random) point at each time. Restricting Eq. (23.28) to a single point x̃k in the training data D̃, the gradient at the point x̃k is given as

∇w̃(x̃k) = −x̃k·yk + x̃k·x̃kᵀw̃ = −(yk − x̃kᵀw̃)·x̃k

Therefore, the stochastic gradient update rule is given as

w̃^{t+1} = w̃^t − η·∇w̃(x̃k) = w̃^t + η·(yk − x̃kᵀw̃^t)·x̃k

Algorithm 23.2: Multiple Regression: Stochastic Gradient Descent

MULTIPLE REGRESSION: SGD (D, Y, η, ǫ):
1  D̃ ← (1  D) // augment data
2  t ← 0 // step/iteration counter
3  w̃^t ← random vector in R^{d+1} // initial weight vector
4  repeat
5      foreach k = 1, 2, ··· , n (in random order) do
6          ∇w̃(x̃k) ← −(yk − x̃kᵀw̃^t)·x̃k // compute gradient at x̃k
7          w̃^{t+1} ← w̃^t − η·∇w̃(x̃k) // update estimate for w̃
8      t ← t + 1
9  until ∥w̃^t − w̃^{t−1}∥ ≤ ǫ

Algorithm 23.2 shows the pseudo-code for the stochastic gradient descent algorithm for multiple regression. After augmenting the data matrix, in each iteration it updates the weight vector by considering the gradient at each point in random order. The method stops when the weight vector converges based on the tolerance ǫ.

Example 23.5 (Multiple Regression: SGD). We continue Example 23.4 for multiple regression of sepal length (X1) and petal length (X2) on the response attribute petal width (Y) for the Iris dataset with n = 150 points. Using the exact approach the multiple regression model was given as

Ŷ = −0.014·X0 − 0.082·X1 + 0.45·X2

Using stochastic gradient descent we obtain the following model with η = 0.001 and ǫ = 0.0001:

Ŷ = −0.031·X0 − 0.078·X1 + 0.45·X2

The results from the SGD approach are essentially the same as the exact method, with a slight difference in the bias term. The SSE value for the exact method is 6.179, whereas for SGD it is 6.181.
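The SGD update of Algorithm 23.2 is equally compact in code; a minimal sketch, assuming a fixed number of epochs in place of the convergence test and a small hypothetical dataset, is:

import numpy as np

def multiple_regression_sgd(D, y, eta=0.001, epochs=1000, seed=0):
    """Stochastic gradient descent for linear regression (Algorithm 23.2 style)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    D_aug = np.column_stack([np.ones(n), D])
    w = rng.normal(size=D_aug.shape[1])                 # random initial weight vector
    for _ in range(epochs):
        for k in rng.permutation(n):                    # points in random order
            grad = -(y[k] - D_aug[k] @ w) * D_aug[k]    # gradient at point x~_k
            w = w - eta * grad
    return w

D = np.array([[5.1, 1.4], [7.0, 4.7], [6.3, 6.0], [4.9, 1.5]])
y = np.array([0.2, 1.4, 2.5, 0.1])
print(np.round(multiple_regression_sgd(D, y), 3))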
23.4 RIDGE REGRESSION

We have seen that for linear regression, the predicted response vector Ŷ lies in the span of the column vectors comprising the augmented data matrix D̃. However, often the data is noisy and uncertain, and therefore instead of fitting the model to the data exactly, it may be better to fit a more robust model. One way to achieve this is via regularization, where we constrain the solution vector w̃ to have a small norm. In other words, instead of trying to simply minimize the squared residual error ∥Y − Ŷ∥², we add a regularization term involving the squared norm of the weight vector (∥w̃∥²) as follows:

min_{w̃} J(w̃) = ∥Y − Ŷ∥² + α·∥w̃∥² = ∥Y − D̃w̃∥² + α·∥w̃∥²    (23.29)

Here α ≥ 0 is a regularization constant that controls the tradeoff between minimizing the squared norm of the weight vector and the squared error. Recall that ∥w̃∥² = Σ_{i=1}^d wi² is the L2-norm of w̃. For this reason ridge regression is also called L2 regularized regression. When α = 0, there is no regularization, but as α increases there is more emphasis on minimizing the regression coefficients.

To solve the new regularized objective we differentiate Eq. (23.29) with respect to w̃ and set the result to 0 to obtain

∂/∂w̃ J(w̃) = ∂/∂w̃ { ∥Y − D̃w̃∥² + α·∥w̃∥² } = 0
⟹ ∂/∂w̃ { YᵀY − 2w̃ᵀ(D̃ᵀY) + w̃ᵀ(D̃ᵀD̃)w̃ + α·w̃ᵀw̃ } = 0
⟹ −2D̃ᵀY + 2(D̃ᵀD̃)w̃ + 2α·w̃ = 0    (23.30)
⟹ (D̃ᵀD̃ + α·I)w̃ = D̃ᵀY    (23.31)

Therefore, the optimal solution is

w̃ = (D̃ᵀD̃ + α·I)⁻¹D̃ᵀY    (23.32)

where I ∈ R^{(d+1)×(d+1)} is the identity matrix. The matrix (D̃ᵀD̃ + α·I) is always invertible (or non-singular) for α > 0 even if D̃ᵀD̃ is not invertible (or singular). This is because if λi is an eigenvalue of D̃ᵀD̃, then λi + α is an eigenvalue of (D̃ᵀD̃ + α·I). Since D̃ᵀD̃ is positive semi-definite it has non-negative eigenvalues. Thus, even if an eigenvalue of D̃ᵀD̃ is zero, e.g., λi = 0, the corresponding eigenvalue of (D̃ᵀD̃ + α·I) is λi + α = α > 0.
Regularized regression is also called ridge regression since we add a “ridge” along the main diagonal of the D ̃ TD ̃ matrix, i.e., the optimal solution depends on the regularized matrix (D ̃ TD ̃ + α · I). Another advantage of regularization is that if we choose a small positive α we are always guaranteed a solution.
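The closed-form ridge solution of Eq. (23.32) can be sketched as follows, with the bias penalized as in the objective above and a small hypothetical data array used for illustration:

import numpy as np

def ridge_regression(D, y, alpha):
    """Ridge regression, Eq. (23.32): w~ = (D~^T D~ + alpha I)^{-1} D~^T y."""
    n = D.shape[0]
    D_aug = np.column_stack([np.ones(n), D])
    d1 = D_aug.shape[1]
    A = D_aug.T @ D_aug + alpha * np.eye(d1)   # add a "ridge" along the diagonal
    return np.linalg.solve(A, D_aug.T @ y)

D = np.array([[1.4], [4.7], [6.0], [1.5]])
y = np.array([0.2, 1.4, 2.5, 0.1])
for alpha in (0.0, 10.0, 100.0):
    print(alpha, np.round(ridge_regression(D, y, alpha), 3))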
Example 23.6 (Ridge Regression). Figure 23.6 shows the scatterplot between the two attributes petal length (X; the predictor variable) and petal width (Y; the response variable) in the Iris dataset. There are a total of n = 150 data points. The uncentered scatter matrix is given as
D̃ᵀD̃ = [ 150.0   563.8
         563.8  2583.0 ]

Figure 23.6. Scatterplot: petal length (X) versus petal width (Y). Ridge regression lines for α = 0, 10, 100.
Using Eq. (23.32) we obtain different lines of best fit for different values of the regularization constant α:
α = 0:   Ŷ = −0.367 + 0.416·X,   ∥w̃∥² = ∥(−0.367, 0.416)ᵀ∥² = 0.308,   SSE = 6.34
α = 10:  Ŷ = −0.244 + 0.388·X,   ∥w̃∥² = ∥(−0.244, 0.388)ᵀ∥² = 0.210,   SSE = 6.75
α = 100: Ŷ = −0.021 + 0.328·X,   ∥w̃∥² = ∥(−0.021, 0.328)ᵀ∥² = 0.108,   SSE = 9.97
Figure 23.6 shows these regularized regression lines. We can see that as α increases there is more emphasis on minimizing the squared norm of w ̃ . However, since ∥w ̃ ∥2 is more constrained as α increases, the fit of the model decreases, as seen from the increase in SSE values.
Unpenalized Bias Term
Often in L2 regularized regression we do not want to penalize the bias term w0, since it simply provides the intercept information, and there is no real justification to minimize it. To avoid penalizing the bias term, consider the new regularized objective with w = (w1, w2, ··· , wd)ᵀ, and without w0, given as
min_w J(w) = ∥Y − w0·1 − Dw∥² + α·∥w∥²    (23.33)
           = ∥Y − w0·1 − Σ_{i=1}^d wi·Xi∥² + α·(Σ_{i=1}^d wi²)

Recall from Eq. (23.26) that the bias w0 = b is given as
w0 = b = μY − Σ_{i=1}^d wi·μXi = μY − μᵀw

where μ = (μX1, μX2, ··· , μXd)ᵀ is the multivariate mean of the (unaugmented) data D. Substituting w0 into the new L2 objective in Eq. (23.33), we get

min_w J(w) = ∥Y − w0·1 − Dw∥² + α·∥w∥²
           = ∥Y − (μY − μᵀw)·1 − Dw∥² + α·∥w∥²
           = ∥(Y − μY·1) − (D − 1μᵀ)w∥² + α·∥w∥²

Therefore, we have

min_w J(w) = ∥Ȳ − D̄w∥² + α·∥w∥²    (23.34)

where Ȳ = Y − μY·1 is the centered response vector, and D̄ = D − 1μᵀ is the centered data matrix. In other words, we can exclude w0 from the L2 regularization objective by simply centering the response vector and the unaugmented data matrix.
Example 23.7 (Ridge Regression: Unpenalized Bias). We continue from Example 23.6. When we do not penalize w0, we obtain the following lines of best fit for different values of the regularization constant α:
α = 0:   Ŷ = −0.365 + 0.416·X   w0² + w1² = 0.307   SSE = 6.34
α = 10:  Ŷ = −0.333 + 0.408·X   w0² + w1² = 0.277   SSE = 6.38
α = 100: Ŷ = −0.089 + 0.343·X   w0² + w1² = 0.125   SSE = 8.87

From Example 23.6, we observe that for α = 10, when we penalize w0, we obtain the following model:

α = 10:  Ŷ = −0.244 + 0.388·X   w0² + w1² = 0.210   SSE = 6.75

As expected, we obtain a higher bias term when we do not penalize w0.
23.4.1 Ridge Regression: Stochastic Gradient Descent
Instead of inverting the matrix (D ̃ T D ̃ + α · I) as called for in the exact ridge regression solution in Eq. (23.32), we can employ the stochastic gradient descent algorithm. Consider the gradient of the regularized objective given in Eq. (23.30), multiplied by 1/2 for convenience; we get:
∇w̃ = ∂/∂w̃ J(w̃) = −D̃ᵀY + (D̃ᵀD̃)w̃ + α·w̃    (23.35)

Algorithm 23.3: Ridge Regression: Stochastic Gradient Descent

RIDGE REGRESSION: SGD (D, Y, η, ǫ):
1  D̃ ← (1  D) // augment data
2  t ← 0 // step/iteration counter
3  w̃^t ← random vector in R^{d+1} // initial weight vector
4  repeat
5      foreach k = 1, 2, ··· , n (in random order) do
6          ∇w̃(x̃k) ← −(yk − x̃kᵀw̃^t)·x̃k + (α/n)·w̃ // compute gradient at x̃k
7          w̃^{t+1} ← w̃^t − η·∇w̃(x̃k) // update estimate for w̃
8      t ← t + 1
9  until ∥w̃^t − w̃^{t−1}∥ ≤ ǫ
Using (batch) gradient descent, we can iteratively compute w̃ as follows

w̃^{t+1} = w̃^t − η·∇w̃ = (1 − η·α)·w̃^t + η·D̃ᵀ(Y − D̃·w̃^t)
In stochastic gradient descent (SGD), we update the weight vector by considering only one (random) point at each time. Restricting Eq. (23.35) to a single point x ̃k, the gradient at the point x ̃k is given as
∇w̃(x̃k) = −x̃k·yk + x̃k·x̃kᵀw̃ + (α/n)·w̃ = −(yk − x̃kᵀw̃)·x̃k + (α/n)·w̃    (23.36)
Here, we scale the regularization constant α by dividing it by n, the number of points in the training data, since the original ridge value α is for all the n points, whereas we are now considering only one point at a time. Therefore, the stochastic gradient update rule is given as
w̃^{t+1} = w̃^t − η·∇w̃(x̃k) = (1 − (η·α)/n)·w̃^t + η·(yk − x̃kᵀw̃^t)·x̃k
Algorithm 23.3 shows the pseudo-code for the stochastic gradient descent algorithm for ridge regression. After augmenting the data matrix, in each iteration it updates the weight vector by considering the gradient at each point in random order. The method stops when the weight vector converges based on the tolerance ǫ. The code is for the penalized bias case. It is easy to adapt it for unpenalized bias by centering the unaugmented data matrix and the response variable.
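A minimal sketch of Algorithm 23.3 in NumPy (penalized-bias version, with a fixed epoch budget standing in for the tolerance test and a small hypothetical dataset) is:

import numpy as np

def ridge_regression_sgd(D, y, alpha, eta=0.001, epochs=1000, seed=0):
    """SGD for ridge regression (Algorithm 23.3 style, penalized bias)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    D_aug = np.column_stack([np.ones(n), D])
    w = rng.normal(size=D_aug.shape[1])
    for _ in range(epochs):
        for k in rng.permutation(n):
            # Eq. (23.36): squared-error part plus the scaled ridge term alpha/n
            grad = -(y[k] - D_aug[k] @ w) * D_aug[k] + (alpha / n) * w
            w = w - eta * grad
    return w

D = np.array([[1.4], [4.7], [6.0], [1.5]])
y = np.array([0.2, 1.4, 2.5, 0.1])
print(np.round(ridge_regression_sgd(D, y, alpha=10.0), 3))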
Example 23.8 (Ridge Regression: SGD). We apply ridge regression on the Iris dataset (n = 150), using petal length (X) as the independent attribute, and petal width (Y) as the response variable. Using SGD (with η = 0.001 and ǫ = 0.0001) we obtain different lines of best fit for different values of the regularization constant α, which essentially match the results from the exact method in Example 23.6:
α = 0:   Ŷ = −0.366 + 0.413·X   SSE = 6.37
α = 10:  Ŷ = −0.244 + 0.387·X   SSE = 6.76
α = 100: Ŷ = −0.022 + 0.327·X   SSE = 10.04

23.5 KERNEL REGRESSION
We now consider how to generalize linear regression to the non-linear case, i.e., finding a non-linear fit to the data to minimize the squared error, along with regularization. For this we will adopt the kernel trick, i.e., we will show that all relevant operations can be carried out via the kernel matrix in input space.
Let φ correspond to a mapping from the input space to the feature space, so that each point φ(xi) in feature space is a mapping for the input point xi. To avoid explicitly dealing with the bias term, we add the fixed value 1 as the first element of φ(xi) to obtain the augmented transformed point φ̃(xi)ᵀ = (1  φ(xi)ᵀ). Let D̃φ denote the augmented dataset in feature space, comprising the transformed points φ̃(xi) for i = 1, 2, ··· , n. The augmented kernel function in feature space is given as

K̃(xi, xj) = φ̃(xi)ᵀφ̃(xj) = 1 + φ(xi)ᵀφ(xj) = 1 + K(xi, xj)
where K(xi , xj ) is the standard, unaugmented kernel function.
Let Y denote the observed response vector. Following Eq. (23.18), we model the
predicted response as
Ŷ = D̃φ w̃    (23.37)
where w ̃ is the augmented weight vector in feature space. The first element of w ̃ denotes the bias in feature space.
For regularized regression, we have to solve the following objective in feature space:
min_{w̃} J(w̃) = ∥Y − Ŷ∥² + α·∥w̃∥² = ∥Y − D̃φ w̃∥² + α·∥w̃∥²    (23.38)
where α ≥ 0 is the regularization constant.
Taking the derivative of J(w ̃ ) with respect to w ̃ and setting it to the zero vector, we get
∂/∂w̃ J(w̃) = ∂/∂w̃ { ∥Y − D̃φ w̃∥² + α·∥w̃∥² } = 0
⟹ ∂/∂w̃ { YᵀY − 2w̃ᵀ(D̃φᵀY) + w̃ᵀ(D̃φᵀD̃φ)w̃ + α·w̃ᵀw̃ } = 0
⟹ −2D̃φᵀY + 2(D̃φᵀD̃φ)w̃ + 2α·w̃ = 0
⟹ α·w̃ = D̃φᵀY − (D̃φᵀD̃φ)w̃
⟹ w̃ = D̃φᵀ( (1/α)·(Y − D̃φ w̃) )
⟹ w̃ = D̃φᵀc = Σ_{i=1}^n ci·φ̃(xi)    (23.39)

where c = (c1, c2, ··· , cn)ᵀ = (1/α)·(Y − D̃φ w̃). Eq. (23.39) indicates that the weight vector w̃ is a linear combination of the feature points, with c specifying the mixture coefficients for the points.

Rearranging the terms in the expression for c, we have

c = (1/α)·(Y − D̃φ w̃)
α·c = Y − D̃φ w̃

Now, plugging in the form of w̃ from Eq. (23.39), we get

α·c = Y − D̃φ(D̃φᵀc)
(D̃φ D̃φᵀ + α·I)c = Y
c = (D̃φ D̃φᵀ + α·I)⁻¹Y    (23.40)

The optimal solution is therefore given as

c = (K̃ + α·I)⁻¹Y    (23.41)

where I ∈ R^{n×n} is the n × n identity matrix, and D̃φ D̃φᵀ is the augmented kernel matrix K̃, since

D̃φ D̃φᵀ = { φ̃(xi)ᵀφ̃(xj) }_{i,j=1,2,···,n} = { K̃(xi, xj) }_{i,j=1,2,···,n} = K̃
Putting it all together, we can substitute Eq. (23.41) and Eq. (23.39) into Eq. (23.37) to obtain the expression for the predicted response
Ŷ = D̃φ w̃ = D̃φ D̃φᵀc = (D̃φ D̃φᵀ)(K̃ + α·I)⁻¹Y = K̃(K̃ + α·I)⁻¹Y    (23.42)
where K ̃ (K ̃ + α · I)−1 is the kernel hat matrix. Notice that using α > 0 ensures that the inverse always exists, which is another advantage of using (kernel) ridge regression, in addition to the regularization.
Algorithm 23.4 shows the pseudo-code for kernel regression. The main step is to compute the augmented kernel matrix K̃ ∈ R^{n×n}, and the vector of mixture coefficients c ∈ Rⁿ. The predicted response on the training data is then given as Ŷ = K̃c. As for prediction for a new test point z, we use Eq. (23.39) to predict the response, which is given as:
ŷ = φ̃(z)ᵀw̃ = φ̃(z)ᵀ(D̃φᵀc) = φ̃(z)ᵀ( Σ_{i=1}^n ci·φ̃(xi) )
  = Σ_{i=1}^n ci·φ̃(z)ᵀφ̃(xi) = Σ_{i=1}^n ci·K̃(z, xi) = cᵀK̃z

Algorithm 23.4: Kernel Regression Algorithm

KERNEL-REGRESSION (D, Y, K, α):
1  K ← { K(xi, xj) }_{i,j=1,...,n} // standard kernel matrix
2  K̃ ← K + 1 // augmented kernel matrix
3  c ← (K̃ + α·I)⁻¹Y // compute mixture coefficients
4  Ŷ ← K̃c

TESTING (z, D, K, c):
5  K̃z ← { 1 + K(z, xi) } ∀xi ∈ D
6  ŷ ← cᵀK̃z
That is, we compute the vector K ̃ z comprising the augmented kernel values of z with respect to all of the data points in D, and take its dot product with the mixture coefficient vector c to obtain the predicted response.
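The training and testing steps of Algorithm 23.4 fit in a few lines of NumPy; in the minimal sketch below the kernel function, the ridge value, and the one-dimensional dataset are all assumptions made for the example:

import numpy as np

def kernel_ridge_fit(X, y, kernel, alpha):
    """Kernel ridge regression (Algorithm 23.4): returns the mixture coefficients c."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    K_aug = K + 1.0                                  # augmented kernel: 1 + K(xi, xj)
    return np.linalg.solve(K_aug + alpha * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, c, kernel, z):
    """Predict the response for a test point z via y^ = c^T K~_z."""
    K_z = np.array([1.0 + kernel(z, xi) for xi in X_train])
    return c @ K_z

quad = lambda a, b: (1.0 + np.dot(a, b)) ** 2        # inhomogeneous quadratic kernel
X = np.array([[-1.0], [-0.5], [0.0], [0.5], [1.0]])
y = np.array([1.1, 0.3, 0.0, 0.2, 0.9])
c = kernel_ridge_fit(X, y, quad, alpha=0.1)
print(kernel_ridge_predict(X, c, quad, np.array([0.25])))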
Example 23.9. Consider the nonlinear Iris dataset shown in Figure 23.7, obtained via a nonlinear transformation applied to the centered Iris data. In particular, the sepal length (A1) and sepal width attributes (A2) were transformed as follows:

X = A2
Y = 0.2·A1² + A2² + 0.1·A1·A2
We treat Y as the response variable and X is the independent attribute. The points show a clear quadratic (nonlinear) relationship between the two variables.
Figure 23.7. Kernel regression on nonlinear Iris dataset.

We find the lines of best fit using both a linear and an inhomogeneous quadratic
kernel, with regularization constant α = 0.1. The linear kernel yields the following fit
Ŷ = 0.168·X

On the other hand, using the quadratic (inhomogeneous) kernel over X comprising constant (1), linear (X), and quadratic terms (X²), yields the fit

Ŷ = −0.086 + 0.026·X + 0.922·X²

The linear (in gray) and quadratic (in black) fit are both shown in Figure 23.7. The SSE error, ∥Y − Ŷ∥², is 13.82 for the linear kernel and 4.33 for the quadratic kernel. It is clear that the quadratic kernel (as expected) gives a much better fit to the data.
Example 23.10 (Kernel Ridge Regression). Consider the Iris principal components dataset shown in Figure 23.8. Here X1 and X2 denote the first two principal components. The response variable Y is binary, with value 1 corresponding to Iris-virginica (points on the top right, with Y value 1) and 0 corresponding to Iris-setosa and Iris-versicolor (other two groups of points, with Y value 0).
Figure 23.8. Kernel ridge regression: linear and (inhomogeneous) quadratic kernels. (a) Linear Kernel. (b) Inhomogeneous Quadratic Kernel.
Figure 23.8(a) shows the fitted regression plane using a linear kernel with ridge value α = 0.01:
Ŷ = 0.333 − 0.167·X1 + 0.074·X2
Figure 23.8(b) shows the fitted model when we use an inhomogeneous quadratic
kernel with α = 0.01:
Ŷ = −0.03 − 0.167·X1 − 0.186·X2 + 0.092·X1² + 0.1·X1·X2 + 0.029·X2²
The SSE error for the linear model is 15.47, whereas for the quadratic kernel it is 8.44, indicating a better fit for the training data.
23.6 L1 REGRESSION: LASSO
The Lasso, which stands for least absolute selection and shrinkage operator, is a regularization method that aims to sparsify the regression weights. Instead of using the L2 or Euclidean norm for weight regularization as in ridge regression (see Eq. (23.34)), the Lasso formulation uses the L1 norm for regularization
min_w J(w) = (1/2)·∥Y − Dw∥² + α·∥w∥₁    (23.43)
where α ≥ 0 is the regularization constant and
∥w∥₁ = Σ_{i=1}^d |wi|
Note that we have added the factor 1/2 for convenience; it does not change the objective.
Furthermore, we assume that the data comprising the d independent attributes X1, X2,
. . . , Xd, and the response attribute Y, have all been centered. That is, we assume that the data matrix and the response vector have been replaced by their centered versions

D ← D − 1·μᵀ        Y ← Y − μY·1

where 1 ∈ Rⁿ is the vector of all ones, μ = (μX1, μX2, ··· , μXd)ᵀ is the multivariate mean for the data, and μY is the mean response value. One benefit of centering is that we do not have to explicitly deal with the bias term b = w0, which is important since we usually do not want to penalize b. Once the regression coefficients have been estimated, we can obtain the bias term via Eq. (23.26), as follows:
b = w0 = μY − Σ_{j=1}^d wj·μXj

The main advantage of using the L1 norm is that it leads to sparsity in the solution vector. That is, whereas ridge regression reduces the value of the regression coefficients wi, they may remain small but still non-zero. On the other hand, L1 regression can drive the coefficients to zero, resulting in a more interpretable model, especially when there are many predictor attributes.

Figure 23.9. Absolute value function (in black) and two of its subgradients (in gray).
The Lasso objective comprises two parts, the squared error term ∥Y − Dw∥², which is convex and differentiable, and the L1 penalty term α·∥w∥₁ = α·Σ_{i=1}^d |wi|, which is convex but unfortunately non-differentiable at wi = 0. This means that we cannot simply compute the gradient and set it to zero, as we did in the case of ridge regression. However, these kinds of problems can be solved via the generalized approach of subgradients.
23.6.1 Subgradients and Subdifferential

Consider the absolute value function f : R → R
f(w)=|w|
which is plotted in Figure 23.9 (black line).
When w > 0, we can see that its derivative is f′(w) = +1, and when w < 0, its derivative is f′(w) = −1. On the other hand, there is a discontinuity at w = 0 where the derivative does not exist. However, we use the concept of a subgradient that generalizes the notion of a derivative. For the absolute value function, the slope m of any line that passes through w = 0 that remains below or touches the graph of f is called a subgradient of f at w = 0. Figure 23.9 shows two such subgradients, namely the slopes m = −0.5 and m = 0.25. The corresponding lines are shown in gray. The set of all the subgradients at w is called the subdifferential, denoted as ∂|w|. Thus, the subdifferential at w = 0 is given as

∂|w| = [−1, 1]

since only lines with slope between −1 and +1 remain below or (partially) coincide with the absolute value graph.

Considering all the cases, the subdifferential for the absolute value function is given as:

∂|w| = { 1        iff w > 0
        −1        iff w < 0        (23.44)
        [−1, 1]   iff w = 0 }

We can see that when the derivative exists, the subdifferential is unique and corresponds to the derivative (or gradient), and when the derivative does not exist the subdifferential corresponds to a set of subgradients.

23.6.2 Bivariate L1 Regression

We first consider bivariate L1 regression, where we have a single independent attribute X and a response attribute Y (both centered). The bivariate regression model is given as

ŷi = w·xi

The Lasso objective from Eq. (23.43) can then be written as

min_w J(w) = (1/2)·Σ_{i=1}^n (yi − w·xi)² + α·|w|    (23.45)

We can compute the subdifferential of this objective as follows:

∂J(w) = (1/2)·Σ_{i=1}^n 2·(yi − w·xi)·(−xi) + α·∂|w|
      = −Σ_{i=1}^n xi·yi + w·Σ_{i=1}^n xi² + α·∂|w|
      = −XᵀY + w·∥X∥² + α·∂|w|

We can solve for w by setting the subdifferential to zero; we get

∂J(w) = 0
⟹ w·∥X∥² + α·∂|w| = XᵀY
⟹ w + η·α·∂|w| = η·XᵀY    (23.46)

where η = 1/∥X∥² > 0 is a scaling constant.
Corresponding to the three cases for the subdifferential of the absolute value function in Eq. (23.44), we have three cases to consider:

Case I (w > 0 and ∂|w| = 1): In this case we get

w = η·XᵀY − η·α

Since w > 0, this implies η·XᵀY > η·α or |η·XᵀY| > η·α.

Case II (w < 0 and ∂|w| = −1): In this case we have

w = η·XᵀY + η·α

Since w < 0, this implies η·XᵀY < −η·α or |η·XᵀY| > η·α.

Case III (w = 0 and ∂|w| ∈ [−1, 1]): In this case, we get

w ∈ [ η·XᵀY − η·α,  η·XᵀY + η·α ]

However, since w = 0, this implies η·XᵀY − η·α ≤ 0 and η·XᵀY + η·α ≥ 0. In other words, η·XᵀY ≤ η·α and η·XᵀY ≥ −η·α. Therefore, |η·XᵀY| ≤ η·α.

Let τ ≥ 0 be some fixed value. Define the soft-threshold function Sτ : R → R as follows:

Sτ(z) = sign(z)·max{0, (|z| − τ)}    (23.47)

Then the above three cases can be written compactly as:

w = S_{η·α}(η·XᵀY)    (23.48)

with τ = η·α, where w is the optimal solution to the bivariate L1 regression problem in Eq. (23.45).
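The soft-threshold operator and the closed-form bivariate solution of Eq. (23.48) can be sketched as follows, assuming hypothetical arrays x and y that are centered before the fit, as the text requires:

import numpy as np

def soft_threshold(z, tau):
    """S_tau(z) = sign(z) * max(0, |z| - tau), Eq. (23.47)."""
    return np.sign(z) * np.maximum(0.0, np.abs(z) - tau)

def bivariate_lasso(x, y, alpha):
    """Closed-form bivariate L1 solution w = S_{eta*alpha}(eta * X^T Y), Eq. (23.48).
    Both x and y are assumed to be centered already."""
    eta = 1.0 / (x @ x)
    return soft_threshold(eta * (x @ y), eta * alpha)

x = np.array([1.4, 4.7, 6.0, 1.5, 5.1])
y = np.array([0.2, 1.4, 2.5, 0.1, 1.8])
x_c, y_c = x - x.mean(), y - y.mean()
for alpha in (0.0, 1.0, 5.0):
    print(alpha, round(float(bivariate_lasso(x_c, y_c, alpha)), 4))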
23.6.3 Multiple L1 Regression
Consider the L1 regression objective from Eq. (23.43)

min_w J(w) = (1/2)·∥Y − Σ_{i=1}^d wi·Xi∥² + α·∥w∥₁
           = (1/2)·( YᵀY − 2 Σ_{i=1}^d wi·XiᵀY + Σ_{i=1}^d Σ_{j=1}^d wi·wj·XiᵀXj ) + α·Σ_{i=1}^d |wi|
We generalize the bivariate solution to the multiple L1 formulation by optimizing for each wk individually, via the approach of cyclical coordinate descent. We rewrite the L1 objective by focusing only on the wk terms, and ignoring all terms not involving wk , which are assumed to be fixed:
min_{wk} J(wk) = −wk·XkᵀY + (1/2)·wk²·∥Xk∥² + wk·Σ_{j≠k} wj·XkᵀXj + α·|wk|    (23.49)

Setting the subdifferential of J(wk) to zero, we get
∂J(wk) = 0
⟹ wk·∥Xk∥² + α·∂|wk| = XkᵀY − Σ_{j≠k} wj·XkᵀXj
⟹ wk·∥Xk∥² + α·∂|wk| = XkᵀY − Σ_{j=1}^d wj·XkᵀXj + wk·XkᵀXk
⟹ wk·∥Xk∥² + α·∂|wk| = wk·∥Xk∥² + Xkᵀ(Y − Dw)

We can interpret the above equation as specifying an iterative solution for wk. In essence, we let the wk on the left hand side be the new estimate for wk, whereas we treat the wk on the right hand side as the previous estimate. More concretely, let w^t represent the weight vector at step t, with wk^t denoting the estimate for wk at time t. The new estimate for wk at step t + 1 is then given as

wk^{t+1} + (1/∥Xk∥²)·α·∂|wk^{t+1}| = wk^t + (1/∥Xk∥²)·Xkᵀ(Y − Dw^t)
wk^{t+1} + η·α·∂|wk^{t+1}| = wk^t + η·Xkᵀ(Y − Dw^t)    (23.50)

where η = 1/∥Xk∥² > 0 is just a scaling constant. Based on the three cases for wk^{t+1} and the subdifferential ∂|wk^{t+1}|, following a similar approach as in the bivariate case [Eq. (23.48)], the new estimate for wk can be written compactly as

wk^{t+1} = S_{η·α}( wk^t + η·Xkᵀ(Y − Dw^t) )    (23.51)

The pseudo-code for L1 regression is shown in Algorithm 23.5. The algorithm starts with a random estimate for w at step t = 0, and then cycles through each dimension to estimate wk until convergence. Interestingly, the term −Xkᵀ(Y − Dw^t) is in fact the gradient at wk of the squared error term in the Lasso objective, and thus the update equation is the same as gradient descent with step size η, followed by the soft-threshold operator. Note also that since η is just a positive scaling constant, we make it a parameter of the algorithm, denoting the step size for gradient descent.
Example 23.11 (L1 Regression). We apply L1 regression to the full Iris dataset with n = 150 points, and four independent attributes, namely sepal-width (X1), sepal-length (X2), petal-width (X3), and petal-length (X4). The Iris type attribute comprises the response variable Y. There are three Iris types, namely Iris-setosa, Iris-versicolor, and Iris-virginica, which are coded as 0, 1 and 2, respectively.

Algorithm 23.5: L1 Regression Algorithm: Lasso

L1-REGRESSION (D, Y, α, η, ǫ):
1   μ ← mean(D) // compute mean
2   D ← D − 1·μᵀ // center the data
3   Y ← Y − μY·1 // center the response
4   t ← 0 // step/iteration counter
5   w^t ← random vector in Rᵈ // initial weight vector
6   repeat
7       foreach k = 1, 2, ··· , d do
8           ∇(wk^t) ← −Xkᵀ(Y − Dw^t) // compute gradient at wk
9           wk^{t+1} ← wk^t − η·∇(wk^t) // update estimate for wk
10          wk^{t+1} ← S_{η·α}(wk^{t+1}) // apply soft-threshold function
11      t ← t + 1
12  until ∥w^t − w^{t−1}∥ ≤ ǫ
13  b ← μY − (w^t)ᵀμ // compute the bias term
The L1 regression estimates for different values of α (with η = 0.0001) are shown below:
α = 0:  Ŷ = 0.192 − 0.109·X1 − 0.045·X2 + 0.226·X3 + 0.612·X4    SSE = 6.96
α = 1:  Ŷ = −0.077 − 0.076·X1 − 0.015·X2 + 0.253·X3 + 0.516·X4    SSE = 7.09
α = 5:  Ŷ = −0.553 + 0.0·X1 + 0.0·X2 + 0.359·X3 + 0.170·X4    SSE = 8.82
α = 10: Ŷ = −0.575 + 0.0·X1 + 0.0·X2 + 0.419·X3 + 0.0·X4    SSE = 10.15

The L1 norm values for the weight vectors (excluding the bias term) are 0.992, 0.86, 0.529, and 0.419, respectively. It is interesting to note the sparsity inducing effect of Lasso, as observed for α = 5 and α = 10, which drives some of the regression coefficients to zero.

We can contrast the coefficients for L2 (ridge) and L1 (Lasso) regression by comparing models with the same level of squared error. For example, for α = 5, the L1 model has SSE = 8.82. We adjust the ridge value in L2 regression, with α = 35 resulting in a similar SSE value. The two models are given as follows:

L1: Ŷ = −0.553 + 0.0·X1 + 0.0·X2 + 0.359·X3 + 0.170·X4    ∥w∥₁ = 0.529
L2: Ŷ = −0.394 + 0.019·X1 − 0.051·X2 + 0.316·X3 + 0.212·X4    ∥w∥₁ = 0.598

where we exclude the bias term when computing the L1 norm for the weights. We can observe that for L2 regression the coefficients for X1 and X2 are small, and therefore less important, but they are not zero. On the other hand, for L1 regression, the coefficients for attributes X1 and X2 are exactly zero, leaving only X3 and X4; Lasso can thus act as an automatic feature selection approach.
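A compact coordinate-descent sketch in the spirit of Algorithm 23.5 is given below; by default it uses the per-coordinate step η = 1/∥Xk∥² from Eq. (23.51), a fixed iteration budget in place of the tolerance test, and a small hypothetical data matrix:

import numpy as np

def lasso_cd(D, y, alpha, eta=None, iters=1000):
    """Cyclical coordinate descent for the Lasso.
    D and y are centered internally; the bias is recovered via Eq. (23.26)."""
    mu, mu_y = D.mean(axis=0), y.mean()
    Dc, yc = D - mu, y - mu_y
    n, d = Dc.shape
    w = np.zeros(d)
    for _ in range(iters):
        for k in range(d):
            step = eta if eta is not None else 1.0 / (Dc[:, k] @ Dc[:, k])
            grad = -Dc[:, k] @ (yc - Dc @ w)                      # gradient at w_k
            z = w[k] - step * grad
            w[k] = np.sign(z) * max(0.0, abs(z) - step * alpha)   # soft-threshold
    b = mu_y - w @ mu
    return b, w

D = np.array([[3.5, 5.1, 0.2, 1.4], [3.2, 7.0, 1.4, 4.7],
              [3.3, 6.3, 2.5, 6.0], [3.1, 4.9, 0.1, 1.5]])
y = np.array([0.0, 1.0, 2.0, 0.0])
print(lasso_cd(D, y, alpha=1.0))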

23.7 FURTHER READING
For a geometrical approach to multivariate statistics see Wickens (2014), and Saville and Wood (2012). For a description of the class of generalized linear models, of which linear regression is a special case, see Agresti (2015). An excellent overview of Lasso and sparsity based methods is given in Hastie, Tibshirani, and Wainwright (2015). For a description of cyclical coordinate descent for L1 regression, and also other approaches for sparse statistical models see Hastie, Tibshirani, and Wainwright (2015).
Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Hoboken, NJ: John Wiley & Sons.
Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton, FL: CRC press.
Saville, D. J. and Wood, G. R. (2012). Statistical Methods: The Geometric Approach. New York: Springer Science + Business Media.
Wickens, T. D. (2014). The Geometry of Multivariate Statistics. New York: Psychology Press, Taylor & Francis Group.
23.8 EXERCISES
Q1. Consider the data in Table 23.1, with Y as the response variable and X as the independent variable. Answer the following questions:
(a) Compute the predicted response vector Ŷ for least square regression using the geometric approach.
(b) Based on the geometric approach extract the value of the bias and slope, and give the equation of the best fitting regression line.

Table 23.1. Data for Q1
X   5   0   2   1   2
Y   2   1   1   1   0
Q2. Given data in Table 23.2, let α = 0.5 be the regularization constant. Compute the equation for ridge regression of Y on X, where both the bias and slope are penalized. Use the fact that the inverse of the matrix

A = [ a  b
      c  d ]

is given as A⁻¹ = (1/det(A))·[  d  −b
                               −c   a ], with det(A) = ad − bc.

Table 23.2. Data for Q2
X   1   2   4   6
Y   1   3   4   3
Q3. Show that Eq. (23.11) holds, that is
w = [ (Σ_{i=1}^n xi·yi) − n·μX·μY ] / [ (Σ_{i=1}^n xi²) − n·μX² ] = Σ_{i=1}^n (xi − μX)(yi − μY) / Σ_{i=1}^n (xi − μX)²
Q4. Derive an expression for the bias term b and the weights w = (w1,w2,··· ,wd)T in
multiple regression using Eq. (23.16), without adding the augmented column.
Q5. Show analytically (i.e., without using the geometry) that the bias term in multiple regression, as shown in Eq. (23.26), is given as
w0=μY−w1·μX1 −w2·μX2 −···−wd·μXd
Q6. Show that Ŷᵀǫ = 0.
Q7. Prove that ∥ǫ∥² = ∥Y∥² − ∥Ŷ∥².
Q8. Show that if λi is an eigenvalue of DᵀD corresponding to the eigenvector ui, then λi + α is an eigenvalue of (DᵀD + α·I) for the same eigenvector ui.
Q9. Show that Δ^{−1}QᵀY gives the vector of scalar projections of Y onto each of the orthogonal basis vectors in multiple regression:
Δ^{−1}QᵀY = ( proj_{U0}(Y), proj_{U1}(Y), ···, proj_{Ud}(Y) )ᵀ
Q10. Formulate a solution to the ridge regression problem via QR-factorization.
Q11. Derive the solution for the weight vector w = (w1,w2,··· ,wd)T and bias b = w0 in ridge regression without subtracting the means from Y and the independent attributes X1 , X2 , · · · , Xd , and without adding the augmented column.
Q12. Show that the solutions for ridge regression and kernel ridge regression are exactly the same when we use a linear kernel.
Q13. Show that w = S_{η·α}(η·XᵀY) [Eq. (23.48)] is the solution for bivariate L1 regression.
Q14. Derive the three cases for the subdifferential in Eq. (23.50), and show that they
correspond to the soft-threshold update in Eq. (23.51).
Q15. Show that the gradient at wk of the SSE term in the L1 formulation is given as
∇(wk) = ∂/∂wk [ (1/2)·∥Y − Dw∥² ] = −Xkᵀ(Y − Dw)
where Y is the response vector, and Xk is the kth predictor vector.

CHAPTER 24 Logistic Regression
Given a set of predictor attributes or independent variables X1, X2, ···, Xd, and given a categorical response or dependent variable Y, the aim of logistic regression is to predict the probability of the response variable values based on the independent variables. Logistic regression is in fact a classification technique that, given a point xj ∈ Rᵈ, predicts P(ci|xj) for each class ci in the domain of Y (the set of possible classes or values for the response variable). In this chapter, we first consider the binary class problem, where the response variable takes on one of two classes (0 and 1, or positive and negative, and so on). Next, we consider the multiclass case, where there are K classes for the response variable.
24.1 BINARY LOGISTIC REGRESSION
In logistic regression, we are given a set of d predictor or independent variables X1, X2, ···, Xd, and a binary or Bernoulli response variable Y that takes on only two values, namely, 0 and 1. Thus, we are given a training dataset D comprising n points xi ∈ Rᵈ and the corresponding observed values yi ∈ {0, 1}. As done in Chapter 23, we augment the data matrix D by adding a new attribute X0 that is always fixed at the value 1 for each point, so that x̃i = (1, x1, x2, ···, xd)ᵀ ∈ Rᵈ⁺¹ denotes the augmented point, and the multivariate random vector X̃, comprising all the independent attributes, is given as X̃ = (X0, X1, ···, Xd)ᵀ. The augmented training dataset is given as D̃, comprising the n augmented points x̃i along with the class labels yi, for i = 1, 2, ···, n.
Since there are only two outcomes for the response variable Y, its probability mass function for X̃ = x̃ is given as:
P(Y = 1 | X̃ = x̃) = π(x̃)      P(Y = 0 | X̃ = x̃) = 1 − π(x̃)
where π(x̃) is the unknown true parameter value, denoting the probability of Y = 1 given X̃ = x̃. Further, note that the expected value of Y given X̃ = x̃ is
E[Y | X̃ = x̃] = 1·P(Y = 1 | X̃ = x̃) + 0·P(Y = 0 | X̃ = x̃) = P(Y = 1 | X̃ = x̃) = π(x̃)

Therefore, in logistic regression, instead of directly predicting the response value, the goal is to learn the probability P(Y = 1 | X̃ = x̃), which is also the expected value of Y given X̃ = x̃.
Since P (Y = 1|X ̃ = x ̃ ) is a probability, it is not appropriate to directly use the linear regression model
f(x̃) = ω0·x0 + ω1·x1 + ω2·x2 + ··· + ωd·xd = ω̃ᵀx̃
where ω̃ = (ω0, ω1, ···, ωd)ᵀ ∈ Rᵈ⁺¹ is the true augmented weight vector, with ω0 = β the true unknown bias term, and ωi the true unknown regression coefficient or weight for attribute Xi. The reason we cannot simply use P(Y = 1 | X̃ = x̃) = f(x̃) is due to the fact that f(x̃) can be arbitrarily large or arbitrarily small, whereas for logistic regression, we require that the output represents a probability value, and thus we need a model that results in an output that lies in the interval [0, 1]. The name "logistic regression" comes from the logistic function (also called the sigmoid function) that meets this requirement. It is defined as follows:
θ(z) = 1 / (1 + exp{−z}) = exp{z} / (1 + exp{z})      (24.1)
The logistic function “squashes” the output to be between 0 and 1 for any scalar input z (see Figure 24.1). The output θ (z) can therefore be interpreted as a probability.
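A short sketch of the logistic function, which also verifies numerically the symmetry 1 − θ(z) = θ(−z) discussed below (illustrative code, not part of the text):

import numpy as np

def logistic(z):
    # theta(z) = 1 / (1 + exp(-z)) = exp(z) / (1 + exp(z))
    return 1.0 / (1.0 + np.exp(-z))

for z in (-5.0, -0.24, 0.0, 0.24, 5.0):
    # check that 1 - theta(z) equals theta(-z)
    print(z, logistic(z), 1.0 - logistic(z), logistic(-z))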
Example 24.1 (Logistic Function). Figure 24.1 shows the plot for the logistic function for z ranging from −∞ to +∞. In particular consider what happens when z is −∞, +∞ and 0; we have
θ(−∞) = 1 / (1 + exp{∞}) = 1/∞ = 0
θ(+∞) = 1 / (1 + exp{−∞}) = 1/1 = 1

Figure 24.1. Logistic function (θ(z) plotted against z).

θ(0) = 1 / (1 + exp{0}) = 1/2 = 0.5
As desired, θ(z) lies in the range [0, 1], and z = 0 is the "threshold" value in the sense that for z > 0 we have θ(z) > 0.5, and for z < 0, we have θ(z) < 0.5. Thus, interpreting θ(z) as a probability, the larger the z value, the higher the probability.
Another interesting property of the logistic function is that
1 − θ(z) = 1 − exp{z}/(1 + exp{z}) = (1 + exp{z} − exp{z}) / (1 + exp{z}) = 1 / (1 + exp{z}) = θ(−z)      (24.2)
Using the logistic function, we define the logistic regression model as follows:
P(Y = 1 | X̃ = x̃) = π(x̃) = θ(f(x̃)) = θ(ω̃ᵀx̃) = exp{ω̃ᵀx̃} / (1 + exp{ω̃ᵀx̃})      (24.3)
Thus, the probability that the response is Y = 1 is the output of the logistic function for the input ω̃ᵀx̃. On the other hand, the probability for Y = 0 is given as
P(Y = 0 | X̃ = x̃) = 1 − P(Y = 1 | X̃ = x̃) = θ(−ω̃ᵀx̃) = 1 / (1 + exp{ω̃ᵀx̃})
where we used Eq. (24.2), that is, 1 − θ(z) = θ(−z) for z = ω̃ᵀx̃.
Combining these two cases the full logistic regression model is given as
P(Y | X̃ = x̃) = θ(ω̃ᵀx̃)^Y · θ(−ω̃ᵀx̃)^(1−Y)      (24.4)
since Y is a Bernoulli random variable that takes on either the value 1 or 0. We can observe that P(Y | X̃ = x̃) = θ(ω̃ᵀx̃) when Y = 1 and P(Y | X̃ = x̃) = θ(−ω̃ᵀx̃) when Y = 0, as desired.
Log-Odds Ratio
Define the odds ratio for the occurrence of Y = 1 as follows:
odds(Y = 1 | X̃ = x̃) = P(Y = 1 | X̃ = x̃) / P(Y = 0 | X̃ = x̃) = θ(ω̃ᵀx̃) / θ(−ω̃ᵀx̃)
    = ( exp{ω̃ᵀx̃} / (1 + exp{ω̃ᵀx̃}) ) · (1 + exp{ω̃ᵀx̃}) = exp{ω̃ᵀx̃}      (24.5)
The logarithm of the odds ratio, called the log-odds ratio, is therefore given as:
ln( odds(Y = 1 | X̃ = x̃) ) = ln[ P(Y = 1 | X̃ = x̃) / (1 − P(Y = 1 | X̃ = x̃)) ] = ln( exp{ω̃ᵀx̃} ) = ω̃ᵀx̃
    = ω0·x0 + ω1·x1 + ··· + ωd·xd      (24.6)
The log-odds ratio function is also called the logit function, defined as
logit(z) = ln( z / (1 − z) )
It is the inverse of the logistic function. We can see that
ln( odds(Y = 1 | X̃ = x̃) ) = logit( P(Y = 1 | X̃ = x̃) )
The logistic regression model is therefore based on the assumption that the log-odds ratio for Y = 1 given X̃ = x̃ is a linear function (or a weighted sum) of the independent attributes. In particular, let us consider the effect of attribute Xi by fixing the values for all other attributes in Eq. (24.6); we get
ln( odds(Y = 1 | X̃ = x̃) ) = ωi·xi + C
⟹ odds(Y = 1 | X̃ = x̃) = exp{ωi·xi + C} = exp{ωi·xi}·exp{C} ∝ exp{ωi·xi}
where C is a constant comprising the fixed attributes. The regression coefficient ωi can therefore be interpreted as the change in the log-odds ratio for Y = 1 for a unit change in Xi, or equivalently the odds ratio for Y = 1 increases exponentially per unit change in Xi.
24.1.1 Maximum Likelihood Estimation
Let D̃ be the augmented training dataset comprising the n augmented points x̃i along with their labels yi. Let w̃ = (w0, w1, ···, wd)ᵀ be the augmented weight vector for estimating ω̃. Note that w0 = b denotes the estimated bias term, and wi the estimated weight for attribute Xi. We will use the maximum likelihood approach to learn the weight vector w̃. Likelihood is defined as the probability of the observed data given the estimated parameters w̃. We assume that the binary response variables yi are all independent.
Therefore, the likelihood of the observed responses is given as
L(w̃) = P(Y | w̃) = Π_{i=1}^n P(yi | x̃i) = Π_{i=1}^n θ(w̃ᵀx̃i)^{yi} · θ(−w̃ᵀx̃i)^{1−yi}
Instead of trying to maximize the likelihood, we can maximize the logarithm of the likelihood, called log-likelihood, to convert the product into a summation as follows:
ln L(w̃) = Σ_{i=1}^n [ yi·ln(θ(w̃ᵀx̃i)) + (1 − yi)·ln(θ(−w̃ᵀx̃i)) ]      (24.7)
The negative of the log-likelihood can also be considered as an error function, the cross-entropy error function, given as follows:
E(w̃) = −ln L(w̃) = Σ_{i=1}^n [ yi·ln( 1/θ(w̃ᵀx̃i) ) + (1 − yi)·ln( 1/(1 − θ(w̃ᵀx̃i)) ) ]      (24.8)
The task of maximizing the log-likelihood is therefore equivalent to minimizing the cross-entropy error.
Typically, to obtain the optimal weight vector w̃, we would differentiate the log-likelihood function with respect to w̃, set the result to 0, and then solve for w̃. However, for the log-likelihood formulation in Eq. (24.7) there is no closed form solution to compute the weight vector w̃. Instead, we use an iterative gradient ascent method to compute the optimal value, since Eq. (24.7) is a concave function, and has a unique global optimal. The gradient ascent method relies on the gradient of the log-likelihood function, which can be obtained by taking its partial derivative with respect to w̃, as follows:
∇(w̃) = ∂/∂w̃ [ ln L(w̃) ] = ∂/∂w̃ [ Σ_{i=1}^n yi·ln(θ(zi)) + (1 − yi)·ln(θ(−zi)) ]      (24.9)
where zi = w̃ᵀx̃i. We use the chain rule to obtain the derivative of ln(θ(zi)) with respect to w̃. We note the following facts:
∂/∂θ(zi) [ ln(θ(zi)) ] = 1/θ(zi)
∂/∂θ(zi) [ ln(θ(−zi)) ] = ∂/∂θ(zi) [ ln(1 − θ(zi)) ] = −1/(1 − θ(zi))
∂θ(zi)/∂zi = θ(zi)·(1 − θ(zi)) = θ(zi)·θ(−zi)
∂zi/∂w̃ = ∂(w̃ᵀx̃i)/∂w̃ = x̃i
As per the chain rule, we have
∂ln(θ(zi))/∂w̃ = ( ∂ln(θ(zi))/∂θ(zi) )·( ∂θ(zi)/∂zi )·( ∂zi/∂w̃ ) = (1/θ(zi))·( θ(zi)·θ(−zi) )·x̃i = θ(−zi)·x̃i
Likewise, using the chain rule, we have
∂ln(θ(−zi))/∂w̃ = ( ∂ln(θ(−zi))/∂θ(zi) )·( ∂θ(zi)/∂zi )·( ∂zi/∂w̃ ) = ( −1/(1 − θ(zi)) )·( θ(zi)·(1 − θ(zi)) )·x̃i = −θ(zi)·x̃i      (24.10)
Substituting the above into Eq. (24.9) we get
∇(w̃) = Σ_{i=1}^n [ yi·θ(−zi)·x̃i − (1 − yi)·θ(zi)·x̃i ]
      = Σ_{i=1}^n [ yi·( θ(−zi) + θ(zi) ) − θ(zi) ]·x̃i
      = Σ_{i=1}^n ( yi − θ(zi) )·x̃i,      since θ(−zi) + θ(zi) = 1
      = Σ_{i=1}^n ( yi − θ(w̃ᵀx̃i) )·x̃i      (24.11)
The gradient ascent method starts at some initial estimate for w̃, denoted w̃^0. At each step t, the method moves in the direction of steepest ascent, which is given by the gradient vector. Thus, given the current estimate w̃^t, we can obtain the next estimate as follows:
w̃^{t+1} = w̃^t + η·∇(w̃^t)      (24.12)
Here, η > 0 is a user-specified parameter called the learning rate. It should not be too large, otherwise the estimates will vary wildly from one iteration to the next, and it should not be too small, otherwise it will take a long time to converge. At the optimal value of w̃, the gradient will be zero, i.e., ∇(w̃) = 0, as desired.

Algorithm 24.1: Logistic Regression: Stochastic Gradient Ascent
LOGISTICREGRESSION-SGA (D, η, ε):
1   foreach xi ∈ D do x̃iᵀ ← (1  xiᵀ) // map to R^{d+1}
2   t ← 0 // step/iteration counter
3   w̃^0 ← (0, ..., 0)ᵀ ∈ R^{d+1} // initial weight vector
4   repeat
5       w̃ ← w̃^t // make a copy of w̃^t
6       foreach x̃i ∈ D̃ in random order do
7           ∇(w̃, x̃i) ← ( yi − θ(w̃ᵀx̃i) )·x̃i // compute gradient at x̃i
8           w̃ ← w̃ + η·∇(w̃, x̃i) // update estimate for w̃
9       w̃^{t+1} ← w̃ // update w̃^{t+1}
10      t ← t + 1
11  until ∥w̃^t − w̃^{t−1}∥ ≤ ε
Stochastic Gradient Ascent
The gradient ascent method computes the gradient by considering all the data points, and it is therefore called batch gradient ascent. For large datasets, it is typically much faster to compute the gradient by considering only one (randomly chosen) point at a time. The weight vector is updated after each such partial gradient step, giving rise to stochastic gradient ascent (SGA) for computing the optimal weight vector w ̃ .
The pseudo-code for SGA based logistic regression is shown in Algorithm 24.1. Given a randomly chosen point x ̃ i , the point-specific gradient (see Eq. (24.11)) is given as
∇(w̃, x̃i) = ( yi − θ(w̃ᵀx̃i) )·x̃i      (24.13)
Unlike batch gradient ascent that updates w ̃ by considering all the points, in stochastic gradient ascent the weight vector is updated after observing each point, and the updated values are used immediately in the next update. Computing the full gradient in the batch approach can be very expensive. In contrast, computing the partial gradient at each point is very fast, and due to the stochastic updates to w ̃ , typically SGA is much faster than the batch approach for very large datasets.
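A minimal NumPy sketch of this stochastic gradient ascent procedure (Algorithm 24.1) is given below; the function and variable names, learning rate, and stopping constants are illustrative choices, not prescribed by the text.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sga(X, y, eta=0.01, eps=1e-4, max_epochs=1000):
    # X is n x d, y in {0,1}^n; returns the augmented weight vector w_tilde
    n, d = X.shape
    Xt = np.hstack([np.ones((n, 1)), X])        # augment with x0 = 1
    w = np.zeros(d + 1)                         # initial weight vector
    rng = np.random.default_rng(0)
    for _ in range(max_epochs):
        w_prev = w.copy()
        for i in rng.permutation(n):            # points in random order
            grad_i = (y[i] - logistic(w @ Xt[i])) * Xt[i]   # point-specific gradient, Eq. (24.13)
            w = w + eta * grad_i                # gradient ascent step
        if np.linalg.norm(w - w_prev) <= eps:
            break
    return w

# prediction for an augmented test point z_tilde: y_hat = 1 if logistic(w @ z_tilde) >= 0.5 else 0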

Figure 24.2. Logistic versus linear regression: Iris principal components data. Misclassified points are shown in dark gray color. Circles denote Iris-virginica and triangles denote the other two Iris types. Panels: (a) Logistic Regression and (b) Linear Regression, each plotted over the attributes X1, X2 with response Y.
Once the model has been trained, we can predict the response for any new
augmented test point z ̃ as follows:
ŷ = { 1 if θ(w̃ᵀz̃) ≥ 0.5
      0 if θ(w̃ᵀz̃) < 0.5 }      (24.14)

Example 24.2 (Logistic Regression). Figure 24.2(a) shows the output of logistic regression modeling on the Iris principal components data, where the independent attributes X1 and X2 represent the first two principal components, and the binary response variable Y represents the type of Iris flower; Y = 1 corresponds to Iris-virginica, whereas Y = 0 corresponds to the two other Iris types, namely Iris-setosa and Iris-versicolor. The fitted logistic model is given as
w̃ = (w0, w1, w2)ᵀ = (−6.79, −5.07, −3.29)ᵀ
P(Y = 1 | x̃) = θ(w̃ᵀx̃) = 1 / (1 + exp{6.79 + 5.07·x1 + 3.29·x2})
Figure 24.2(a) plots P(Y = 1 | x̃) for various values of x̃. Given x̃, if P(Y = 1 | x̃) ≥ 0.5, then we predict ŷ = 1, otherwise we predict ŷ = 0. Figure 24.2(a) shows that five points (shown in dark gray) are misclassified. For example, for x̃ = (1, −0.52, −1.19)ᵀ we have:
P(Y = 1 | x̃) = θ(w̃ᵀx̃) = θ(−0.24) = 0.44      P(Y = 0 | x̃) = 1 − P(Y = 1 | x̃) = 0.56
Thus, the predicted response for x̃ is ŷ = 0, whereas the true class is y = 1.
Figure 24.2 also contrasts logistic versus linear regression. The plane of best fit in linear regression is shown in Figure 24.2(b), with the weight vector:
w̃ = (0.333, −0.167, 0.074)ᵀ      ŷ = f(x̃) = 0.333 − 0.167·x1 + 0.074·x2
Since the response vector Y is binary, we predict the response class as y = 1 if f(x̃) ≥ 0.5, and y = 0 otherwise. The linear regression model results in 17 points being misclassified (dark gray points), as shown in Figure 24.2(b). Since there are n = 150 points in total, this results in a training set or in-sample accuracy of 88.7% for linear regression. On the other hand, logistic regression misclassifies only 5 points, for an in-sample accuracy of 96.7%, which is a much better fit, as is also apparent in Figure 24.2.

24.2 MULTICLASS LOGISTIC REGRESSION
We now generalize logistic regression to the case when the response variable Y can take on K distinct nominal categorical values called classes, i.e., Y ∈ {c1, c2, ···, cK}. We model Y as a K-dimensional multivariate Bernoulli random variable (see Section 3.1.2). Since Y can assume only one of the K values, we use the one-hot encoding approach to map each categorical value ci to the K-dimensional binary vector
ei = (0, ..., 0, 1, 0, ..., 0)ᵀ
whose ith element eii = 1, and all other elements eij = 0, so that Σ_{j=1}^K eij = 1. Henceforth, we assume that the categorical response variable Y is a multivariate Bernoulli variable Y ∈ {e1, e2, ···, eK}, with Yj referring to the jth component of Y.
The probability mass function for Y given X̃ = x̃ is
P(Y = ei | X̃ = x̃) = πi(x̃), for i = 1, 2, ..., K
where πi(x̃) is the (unknown) probability of observing class ci given X̃ = x̃. Thus, there are K unknown parameters, which must satisfy the following constraint:
Σ_{i=1}^K πi(x̃) = Σ_{i=1}^K P(Y = ei | X̃ = x̃) = 1
Given that only one element of Y is 1, the probability mass function of Y can be written compactly as
P(Y | X̃ = x̃) = Π_{j=1}^K ( πj(x̃) )^{Yj}      (24.15)
Note that if Y = ei, only Yi = 1 and the rest of the elements Yj = 0 for j ≠ i.
In multiclass logistic regression, we select one of the values, say cK, as a reference or base class, and consider the log-odds ratio of the other classes with respect to cK; we assume that each of these log-odds ratios is linear in X̃, but with a different augmented weight vector ω̃i, for class ci.
That is, the log-odds ratio of class ci with respect to class cK is assumed to satisfy
ln( odds(Y = ei | X̃ = x̃) ) = ln[ P(Y = ei | X̃ = x̃) / P(Y = eK | X̃ = x̃) ] = ln( πi(x̃) / πK(x̃) ) = ω̃iᵀx̃
    = ωi0·x0 + ωi1·x1 + ··· + ωid·xd
where ωi0 = βi is the true bias value for class ci. We can rewrite the above set of equations as follows:
πi(x̃) / πK(x̃) = exp{ω̃iᵀx̃}   ⟹   πi(x̃) = exp{ω̃iᵀx̃}·πK(x̃)      (24.16)
Given that Σ_{j=1}^K πj(x̃) = 1, we have
Σ_{j=1}^K πj(x̃) = 1
⟹ Σ_{j≠K} ( exp{ω̃jᵀx̃}·πK(x̃) ) + πK(x̃) = 1
⟹ πK(x̃) = 1 / ( 1 + Σ_{j≠K} exp{ω̃jᵀx̃} )
Substituting the above into Eq. (24.16), we have
πi(x̃) = exp{ω̃iᵀx̃}·πK(x̃) = exp{ω̃iᵀx̃} / ( 1 + Σ_{j≠K} exp{ω̃jᵀx̃} )
Finally, setting ω̃K = 0, we have exp{ω̃Kᵀx̃} = 1, and thus we can write the full model for multiclass logistic regression as follows:
πi(x̃) = exp{ω̃iᵀx̃} / Σ_{j=1}^K exp{ω̃jᵀx̃},   for all i = 1, 2, ···, K      (24.17)
This function is also called the softmax function. When K = 2, this formulation yields exactly the same model as in binary logistic regression. It is also interesting to note that the choice of the reference class is not important, since we can derive the log-odds ratio for any two classes ci and cj as follows:
ln( πi(x̃) / πj(x̃) ) = ln( (πi(x̃)/πK(x̃)) · (πK(x̃)/πj(x̃)) )
    = ln( πi(x̃)/πK(x̃) ) + ln( πK(x̃)/πj(x̃) )
    = ln( πi(x̃)/πK(x̃) ) − ln( πj(x̃)/πK(x̃) )
    = ω̃iᵀx̃ − ω̃jᵀx̃ = (ω̃i − ω̃j)ᵀx̃
That is, the log-odds ratio between any two classes can be computed from the difference of the corresponding weight vectors.
24.2.1 Maximum Likelihood Estimation
Let D̃ be the augmented dataset comprising n points x̃i and their labels yi. We assume that yi is a one-hot encoded (multivariate Bernoulli) response vector, so that yij denotes the jth element of yi. For example, if yi = ea, then yij = 1 for j = a, and yij = 0 for all j ≠ a. We assume that all the yi's are independent. Let w̃i ∈ R^{d+1} denote the estimated augmented weight vector for class ci, with wi0 = bi denoting the bias term.
To find the K sets of regression weight vectors w̃i, for i = 1, 2, ···, K, we use the gradient ascent approach to maximize the log-likelihood function. The likelihood of the data is given as
L(W̃) = P(Y | W̃) = Π_{i=1}^n P(yi | x̃i) = Π_{i=1}^n Π_{j=1}^K ( πj(x̃i) )^{yij}
where W̃ = {w̃1, w̃2, ···, w̃K} is the set of K weight vectors. The log-likelihood is then given as:
ln( L(W̃) ) = Σ_{i=1}^n Σ_{j=1}^K yij·ln( πj(x̃i) ) = Σ_{i=1}^n Σ_{j=1}^K yij·ln( exp{w̃jᵀx̃i} / Σ_{a=1}^K exp{w̃aᵀx̃i} )      (24.18)
Note that the negative of the log-likelihood function can be regarded as an error function, commonly known as cross-entropy error.
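As an illustration of the softmax model in Eq. (24.17), the following sketch computes the class probabilities for a single augmented point and predicts the class with the highest probability. The weight values are hypothetical placeholders (loosely patterned on Example 24.3 below), and the max-subtraction is a standard numerical-stability device that is not part of the formula itself.

import numpy as np

def softmax_probs(W, x_aug):
    # W is K x (d+1), one augmented weight vector per row (last row fixed at 0 for the base class);
    # x_aug is the augmented point (1, x1, ..., xd)
    scores = W @ x_aug
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

# hypothetical weights for K = 3 classes over d = 2 attributes
W = np.array([[-3.5, 3.6, 2.6],
              [-7.0, -5.2, -3.4],
              [0.0, 0.0, 0.0]])         # base class has zero weights
x_aug = np.array([1.0, -0.52, -1.19])
pi = softmax_probs(W, x_aug)
print(pi, "predicted class index:", pi.argmax() + 1)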
We note the following facts:
∂ln(πj(x̃i)) / ∂πj(x̃i) = 1 / πj(x̃i)
∂πj(x̃i) / ∂w̃a = { πa(x̃i)·(1 − πa(x̃i))·x̃i   if j = a
                   −πa(x̃i)·πj(x̃i)·x̃i          if j ≠ a }
Let us consider the gradient of the log-likelihood function with respect to the weight vector w̃a:
∇(w̃a) = ∂/∂w̃a [ ln(L(W̃)) ] = Σ_{i=1}^n Σ_{j=1}^K yij·( ∂ln(πj(x̃i))/∂πj(x̃i) )·( ∂πj(x̃i)/∂w̃a )
    = Σ_{i=1}^n [ yia·(1/πa(x̃i))·πa(x̃i)·(1 − πa(x̃i))·x̃i + Σ_{j≠a} yij·(1/πj(x̃i))·( −πa(x̃i)·πj(x̃i) )·x̃i ]
    = Σ_{i=1}^n [ yia − yia·πa(x̃i) − Σ_{j≠a} yij·πa(x̃i) ]·x̃i
    = Σ_{i=1}^n [ yia − Σ_{j=1}^K yij·πa(x̃i) ]·x̃i
    = Σ_{i=1}^n ( yia − πa(x̃i) )·x̃i
The last step follows from the fact that Σ_{j=1}^K yij = 1, since only one element of yi can be 1.
For stochastic gradient ascent, we update the weight vectors by considering only one point at a time. The gradient of the log-likelihood function with respect to w̃j at a given point x̃i is given as
∇(w̃j, x̃i) = ( yij − πj(x̃i) )·x̃i      (24.19)
which results in the following update rule for the jth weight vector:
w̃j^{t+1} = w̃j^t + η·∇(w̃j^t, x̃i)      (24.20)
where w̃j^t denotes the estimate of w̃j at step t, and η is the learning rate. The pseudo-code for the stochastic gradient ascent method for multiclass logistic regression is shown in Algorithm 24.2. Notice that the weight vector for the base class cK is never updated; it remains w̃K = 0, as required.
Once the model has been trained, we can predict the class for any new augmented test point z̃ as follows:
ŷ = argmax_{ci} { πi(z̃) } = argmax_{ci} { exp{w̃iᵀz̃} / Σ_{j=1}^K exp{w̃jᵀz̃} }      (24.21)
That is, we evaluate the softmax function, and then predict the class of z̃ as the one with the highest probability.

Algorithm 24.2: Multiclass Logistic Regression Algorithm
LOGISTICREGRESSION-MULTICLASS (D, η, ε):
1   foreach (xiᵀ, yi) ∈ D do
2       x̃iᵀ ← (1  xiᵀ) // map to R^{d+1}
3       yi ← ej if yi = cj // map yi to K-dimensional Bernoulli vector
4   t ← 0 // step/iteration counter
5   foreach j = 1, 2, ···, K do
6       w̃j^t ← (0, ..., 0)ᵀ ∈ R^{d+1} // initial weight vector
7   repeat
8       foreach j = 1, 2, ···, K − 1 do
9           w̃j ← w̃j^t // make a copy of w̃j^t
10      foreach x̃i ∈ D̃ in random order do
11          foreach j = 1, 2, ···, K − 1 do
12              πj(x̃i) ← exp{w̃jᵀx̃i} / Σ_{a=1}^K exp{w̃aᵀx̃i}
13              ∇(w̃j, x̃i) ← ( yij − πj(x̃i) )·x̃i // compute gradient at w̃j
14              w̃j ← w̃j + η·∇(w̃j, x̃i) // update estimate for w̃j
15      foreach j = 1, 2, ···, K − 1 do
16          w̃j^{t+1} ← w̃j // update w̃j^{t+1}
17      t ← t + 1
18  until Σ_{j=1}^{K−1} ∥w̃j^t − w̃j^{t−1}∥ ≤ ε

Figure 24.3. Multiclass logistic regression: Iris principal components data (softmax surfaces π1(x̃), π2(x̃), π3(x̃) over X1, X2). Misclassified points are shown in dark gray color. All the points actually lie in the (X1, X2) plane, but c1 and c2 are shown displaced along Y with respect to the base class c3 purely for illustration purposes.

Example 24.3. Consider the Iris dataset, with n = 150 points in a 2D space spanned by the first two principal components, as shown in Figure 24.3.
Here, the response variable takes on three values: Y = c1 corresponds to Iris-setosa (shown as squares), Y = c2 corresponds to Iris-versicolor (as circles) and Y = c3 corresponds to Iris-virginica (as triangles). Thus, we map Y = c1 to e1 = (1, 0, 0)ᵀ, Y = c2 to e2 = (0, 1, 0)ᵀ and Y = c3 to e3 = (0, 0, 1)ᵀ. The multiclass logistic model uses Y = c3 (Iris-virginica; triangles) as the reference or base class. The fitted model is given as:
w̃1 = (−3.52, 3.62, 2.61)ᵀ      w̃2 = (−6.95, −5.18, −3.40)ᵀ      w̃3 = (0, 0, 0)ᵀ
Figure 24.3 plots the decision surfaces corresponding to the softmax functions:
π1(x̃) = exp{w̃1ᵀx̃} / ( 1 + exp{w̃1ᵀx̃} + exp{w̃2ᵀx̃} )
π2(x̃) = exp{w̃2ᵀx̃} / ( 1 + exp{w̃1ᵀx̃} + exp{w̃2ᵀx̃} )
π3(x̃) = 1 / ( 1 + exp{w̃1ᵀx̃} + exp{w̃2ᵀx̃} )
The surfaces indicate regions where one class dominates over the others. It is important to note that the points for c1 and c2 are shown displaced along Y to emphasize the contrast with c3, which is the reference class. Overall, the training set accuracy for the multiclass logistic classifier is 96.7%, since it misclassifies only five points (shown in dark gray). For example, for the point x̃ = (1, −0.52, −1.19)ᵀ, we have:
π1(x̃) = 0      π2(x̃) = 0.448      π3(x̃) = 0.552
Thus, the predicted class is ŷ = argmax_{ci}{πi(x̃)} = c3, whereas the true class is y = c2.

24.3 FURTHER READING
For a description of the class of generalized linear models, of which logistic regression is a special case, see Agresti (2015).
Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Hoboken, NJ: John Wiley & Sons.

24.4 EXERCISES
Q1. Show that ∂θ(z)/∂z = θ(z)·θ(−z), where θ(·) is the logistic function.
Q2. Show that the logit function is the inverse of the logistic function.
Q3. Given the softmax function:
    πj(x̃) = exp{w̃jᵀx̃} / Σ_{i=1}^K exp{w̃iᵀx̃}
    Show that
    ∂πj(x̃)/∂w̃a = { πa(x̃)·(1 − πa(x̃))·x̃   if j = a
                    −πa(x̃)·πj(x̃)·x̃         if j ≠ a }

CHAPTER 25 Neural Networks
Artificial neural networks or simply neural networks are inspired by biological neuronal networks. A real biological neuron, or a nerve cell, comprises dendrites, a cell body, and an axon that leads to synaptic terminals. A neuron transmits information via electrochemical signals. When there is enough concentration of ions at the dendrites of a neuron it generates an electric pulse along its axon called an action potential, which in turn activates the synaptic terminals, releasing more ions and thus causing the information to flow to dendrites of other neurons. A human brain has on the order of 100 billion neurons, with each neuron having between 1,000 and 10,000 connections to other neurons. Thus, a human brain is a neuronal network with 100 trillion to a quadrillion (10^15) interconnections! Interestingly, as far as we know, learning happens by adjusting the synaptic strengths, since synaptic signals can be either excitatory or inhibitory, making the post-synaptic neuron either more or less likely to generate an action potential, respectively.
Artificial neural networks are comprised of abstract neurons that try to mimic real neurons at a very high level. They can be described via a weighted directed graph G = (V, E), with each node vi ∈ V representing a neuron, and each directed edge (vi, vj) ∈ E representing a synaptic to dendritic connection from vi to vj. The weight of the edge wij denotes the synaptic strength.
Neural networks are characterized by the type of activation function used to generate an output, and by the architecture of the network in terms of how the nodes are interconnected. For example, the graph may be a directed acyclic graph or have cycles, it may be layered or not, and so on. It is important to note that a neural network is designed to represent and learn information by adjusting the synaptic weights.

25.1 ARTIFICIAL NEURON: ACTIVATION FUNCTIONS
An artificial neuron acts as a processing unit that first aggregates the incoming signals via a weighted sum, and then applies some function to generate an output. For example, a binary neuron will output a 1 whenever the combined signal exceeds a threshold, or 0 otherwise.

Figure 25.1. Artificial neuron: aggregation and activation (inputs x0 = 1, x1, ···, xd with weights bk, w1k, ···, wdk feeding neuron zk).

Figure 25.1 shows the schematic of a neuron zk that has incoming edges from neurons x1, ···, xd. For simplicity, both the name and the (output) value of a neuron are denoted by the same symbol. Thus, xi denotes neuron i, and also the value of that neuron. The net input at zk, denoted netk, is given as the weighted sum
netk = bk + Σ_{i=1}^d wik·xi = bk + wᵀx      (25.1)
where wk = (w1k, w2k, ···, wdk)ᵀ ∈ Rᵈ and x = (x1, x2, ···, xd)ᵀ ∈ Rᵈ is an input point. Notice that x0 is a special bias neuron whose value is always fixed at 1, and the weight from x0 to zk is bk, which specifies the bias term for the neuron. Finally, the output value of zk is given as some activation function, f(·), applied to the net input at zk
zk = f(netk)
The value zk is then passed along the outgoing edges from zk to other neurons.
Neurons differ based on the type of activation function used. Some commonly used activation functions, illustrated in Figure 25.2, are:
Linear/Identity Function: Identity is the simplest activation function; it simply returns its argument, and is also called a linear activation function:
f(netk) = netk      (25.2)
Figure 25.2(a) plots the identity function. To examine the role of the bias term, note that netk > 0 is equivalent to wᵀx > −bk. That is, the output transitions from negative to positive when the weighted sum of the inputs exceeds −bk, as shown in Figure 25.2(b).
Step Function: This is a binary activation function, where the neuron outputs a 0 if the net value is negative (or zero), and 1 if the net value is positive (see Figure 25.2(c)).
f(netk) = { 0 if netk ≤ 0
            1 if netk > 0 }      (25.3)

Figure 25.2. Neuron activation functions; also illustrating the effect of bias. Panels: (a) Linear (zk versus netk); (b) Linear (zk versus wᵀx); (c) Step (zk versus netk); (d) Step (zk versus wᵀx); (e) Rectified Linear (zk versus netk); (f) Rectified Linear (zk versus wᵀx); (g) Sigmoid (zk versus netk); (h) Sigmoid (zk versus wᵀx); (i) Hyperbolic Tangent (zk versus netk); (j) Hyperbolic Tangent (zk versus wᵀx).
It is interesting to note that the transition from 0 to 1 happens when the weighted
sum of the inputs exceeds −bk, as shown in Figure 25.2(d).
Rectified Linear Unit (ReLU): Here, the neuron remains inactive if the net input is less than or equal to zero, and then increases linearly with netk, as shown in Figure 25.2(e).
f(netk) = { 0    if netk ≤ 0
            netk if netk > 0 }      (25.4)
An alternative expression for the ReLU activation is given as f (netk ) = max{0, netk }. The transition from zero to linear output happens when the weighted sum of the inputs exceeds −bk (see Figure 25.2(f)).
Sigmoid: The sigmoid function, illustrated in Figure 25.2(g) squashes its input so that the output ranges between 0 and 1
f(netk) = 1 / (1 + exp{−netk})      (25.5)
When netk = 0, we have f (netk ) = 0.5, which implies that the transition point where the output crosses 0.5 happens when the weighted sum of the inputs exceeds −bk (see Figure 25.2(h)).
Hyperbolic Tangent (tanh): The hyperbolic tangent or tanh function is similar to the sigmoid, but its output ranges between −1 and +1 (see Figure 25.2(i)).
f(netk) = ( exp{netk} − exp{−netk} ) / ( exp{netk} + exp{−netk} ) = ( exp{2·netk} − 1 ) / ( exp{2·netk} + 1 )      (25.6)
When netk = 0, we have f (netk ) = 0, which implies that the output transitions from negative to positive when the weighted sum of the inputs exceeds −bk, as shown in Figure 25.2(j).
Softmax: Softmax is a generalization of the sigmoid or logistic activation function. Softmax is mainly used at the output layer in a neural network, and unlike the other functions it depends not only on the net input at neuron k, but also on the net signal at all other neurons in the output layer. Thus, given the net input vector net = (net1, net2, ···, netp)ᵀ, for all the p output neurons, the output of the softmax function for the kth neuron is given as
f(netk | net) = exp{netk} / Σ_{i=1}^p exp{neti}      (25.7)
Figure 25.3 plots the softmax activation for netk versus the net signal netj, with all other net values fixed at zero. The output behaves similar to a sigmoid curve for any given value of netj.

Figure 25.3. Softmax (netk versus netj).
25.1.1 Derivatives for Activation Functions
For learning using a neural network, we need to consider the derivative of an activation function with respect to its argument. The derivatives for the activation functions are given as follows:
Identity/Linear: The identity (or linear) activation function has a derivative of 1 with respect to its argument, giving us:
∂f(netj)/∂netj = 1      (25.8)
Step: The step function has a derivative of 0 everywhere except for the discontinuity at 0, where the derivative is ∞.
ReLU: The ReLU function [Eq. (25.4)] is non-differentiable at 0, nevertheless for other values its derivative is 0 if netj < 0 and 1 if netj > 0. At 0, we can set the derivative to be any value in the range [0, 1], a simple choice being 0. Putting it all together, we have
∂f(netj)/∂netj = { 0 if netj ≤ 0
                   1 if netj > 0 }      (25.9)
Even though ReLU has a discontinuity at 0 it is a popular choice for training deep neural networks.
Sigmoid: The derivative of the sigmoid function [Eq. (25.5)] is given as
∂f(netj)/∂netj = f(netj)·(1 − f(netj))      (25.10)
Hyperbolic Tangent: The derivative of the tanh function [Eq. (25.6)] is given as
∂f(netj)/∂netj = 1 − f(netj)²      (25.11)
Softmax: The softmax activation function [Eq. (25.7)] is a vector valued function, which maps a vector input net = (net1 , net2 , · · · , netp )T to a vector of probability

values. Softmax is typically used only for the output layer. The partial derivative of f(netj) with respect to netj is given as
∂f(netj | net)/∂netj = f(netj)·(1 − f(netj))
whereas the partial derivative of f(netj) with respect to netk, with k ≠ j, is given as
∂f(netj | net)/∂netk = −f(netk)·f(netj)
Since softmax is used at the output layer, if we denote the ith output neuron as oi, then f(neti) = oi, and we can write the derivative as:
∂f(netj | net)/∂netk = ∂oj/∂netk = { oj·(1 − oj) if k = j
                                     −ok·oj       if k ≠ j }      (25.12)
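The activation functions and the derivatives above can be collected into a small sketch as follows (illustrative Python; names and the treatment of the ReLU derivative at 0 follow the conventions stated above):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_deriv(z):
    return (z > 0).astype(float)          # derivative taken as 0 at z = 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # Eq. (25.10)

def tanh_deriv(z):
    return 1.0 - np.tanh(z) ** 2          # Eq. (25.11)

def softmax(net):
    e = np.exp(net - net.max())           # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(net):
    # Eq. (25.12): J[j, k] = o_j*(1 - o_j) if k == j else -o_k*o_j
    o = softmax(net)
    return np.diag(o) - np.outer(o, o)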
25.2 NEURAL NETWORKS: REGRESSION AND CLASSIFICATION
Networks of (artificial) neurons are capable of representing and learning arbitrarily
complex functions for both regression and classification tasks.
25.2.1 Regression
Consider the multiple (linear) regression problem, where given an input xi ∈ Rd , the
goal is to predict the response as follows
yˆi =b+w1xi1 +w2xi2 +···+wdxid
Here, b is the bias term, and wj is the regression coefficient or weight for attribute Xj. Given a training dataset D comprising n points xi in a d-dimensional space, along with their corresponding true response value yi, the bias and weights for linear regression are chosen so as to minimize the sum of squared errors between the true and predicted response over all data points
SSE = Σ_{i=1}^n ( yi − ŷi )²
As shown in Figure 25.4(a), a neural network with d + 1 input neurons x0, x1, ···, xd, including the bias neuron x0 = 1, and a single output neuron o, all with identity activation functions and with ŷ = o, represents the exact same model as multiple linear regression. Whereas the multiple regression problem has a closed form solution, neural networks learn the bias and weights via a gradient descent approach that minimizes the squared error.

Figure 25.4. Linear and logistic regression via neural networks: (a) Single Output; (b) Multiple Outputs.

Neural networks can just as easily model the multivariate (linear) regression task, where we have a p-dimensional response vector yi ∈ Rᵖ instead of a single value yi. That is, the training data D comprises n points xi ∈ Rᵈ and their true response vectors yi ∈ Rᵖ. As shown in Figure 25.4(b), multivariate regression can be modeled by a neural network with d + 1 input neurons, and p output neurons o1, o2, ···, op, with all input and output neurons using the identity activation function. A neural network learns the weights by comparing its predicted output ŷ = o = (o1, o2, ···, op)ᵀ with the true response vector y = (y1, y2, ···, yp)ᵀ. That is, training happens by first computing the error function or loss function between o and y. Recall that a loss function assigns a score or penalty for predicting the output to be o when the desired output is y. When the prediction matches the true output the loss should be zero. The most common loss function for regression is the squared error function
Ex = (1/2)·∥y − o∥² = (1/2)·Σ_{j=1}^p ( yj − oj )²
where Ex denotes the error on input x. Across all the points in a dataset, the total sum of squared errors is
E = Σ_{i=1}^n E_{xi} = (1/2)·Σ_{i=1}^n ∥yi − oi∥²
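As a sketch of how such a single-output linear neuron could be trained by gradient descent on the squared error (illustrative names; the detailed gradient derivation for general networks appears in Section 25.3):

import numpy as np

def train_linear_neuron(X, y, eta=0.01, epochs=500):
    # single output neuron with identity activation: o = b + w^T x
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        o = X @ w + b                 # feed-forward over all points
        err = o - y                   # dE/do for squared error
        w -= eta * (X.T @ err) / n    # average gradient step for the weights
        b -= eta * err.mean()         # gradient step for the bias
    return b, w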
Example 25.1 (Neural Networks for Multiple and Multivariate Regression).
Consider the multiple regression of sepal length and petal length on the dependent attribute petal width for the Iris dataset with n = 150 points. From Example 23.3 we find that the solution is given as
yˆ =−0.014−0.082·x1+0.45·x2
The squared error for this optimal solution is 6.179 on the training data.
Using the neural network in Figure 25.4(a), with linear activation for the output and minimizing the squared error via gradient descent, results in the following learned parameters, b = 0.0096, w1 = −0.087 and w2 = 0.452, yielding the regression
model
o=0.0096−0.087·x1+0.452·x2
with a squared error of 6.18, which is very close to the optimal solution.

Multivariate Linear Regression For multivariate regression, we use the neural network architecture in Figure 25.4(b) to learn the weights and bias for the Iris dataset, where we use sepal length and sepal width as the independent attributes, and petal length and petal width as the response or dependent attributes. Therefore, each input point xi is 2-dimensional, and the true response vector yi is also 2-dimensional. That is, d = 2 and p = 2 specify the size of the input and output layers. Minimizing the squared error via gradient descent yields the following parameters:
b = (b1, b2)ᵀ = (−1.83, −1.47)ᵀ      W = [ w11 w12 ; w21 w22 ] = [ 1.72 0.72 ; −1.46 −0.50 ]
which correspond to the model
o1 = −1.83 + 1.72·x1 − 1.46·x2
o2 = −1.47 + 0.72·x1 − 0.50·x2
The squared error on the training set is 84.9. Optimal least squares multivariate regression yields a squared error of 84.16 with the following parameters:
ŷ1 = −2.56 + 1.78·x1 − 1.34·x2
ŷ2 = −1.59 + 0.73·x1 − 0.48·x2
25.2.2 Classification
Networks of artificial neurons can also learn to classify the inputs. Consider the binary classification problem, where y = 1 denotes that the point belongs to the positive class, and y = 0 means that it belongs to the negative class. Recall that in logistic regression, we model the probability of the positive class via the logistic (or sigmoid) function:
π(x) = P(y = 1 | x) = 1 / (1 + exp{−(b + wᵀx)})
where b is the bias term and w = (w1,w2,··· ,wd )T is the vector of estimated weights or regression coefficients. On the other hand
P(y =0|x)=1−P(y =1|x)=1−π(x)
A simple change to the neural network shown in Figure 25.4(a) allows it to solve the logistic regression problem. All we have to do is use a sigmoid activation function at the output neuron o, and use the cross-entropy error instead of squared error. Given input x, true response y, and predicted response o, recall that the cross-entropy error (see Eq. (24.8)) is defined as
Ex =−􏰄y·ln(o)+(1−y)·ln(1−o)􏰅
Thus, with sigmoid activation, the output of the neural network in Figure 25.4(a) is
given as
o = f(neto) = sigmoid(b + wᵀx) = 1 / (1 + exp{−(b + wᵀx)}) = π(x)
which is the same as the logistic regression model.

Multiclass Logistic Regression In a similar manner, the multiple output neural network architecture shown in Figure 25.4(b) can be used for multiclass or nominal logistic regression. For the general classification problem with K classes {c1,c2,··· ,cK}, the true response y is encoded as a one-hot vector. Thus, class c1 is encoded as e1 = (1,0,··· ,0)T, class c2 is encoded as e2 = (0,1,··· ,0)T, and so on, with ei ∈ {0,1}K for i = 1,2,··· ,K. Thus, we encode y as a multivariate vector y ∈ {e1,e2,··· ,eK}. Recall that in multiclass logistic regression (see Section 24.2) the task is to estimate the per class bias bi and weight vector wi ∈ Rd , with the last class cK used as the base class with fixed bias bK = 0 and fixed weight vector wK = (0,0,··· ,0)T ∈ Rd. The probability vector across all K classes is modeled via the softmax function (see Eq. (24.17)):
πi(x) = exp{bi + wiᵀx} / Σ_{j=1}^K exp{bj + wjᵀx},   for all i = 1, 2, ···, K
Therefore, the neural network shown in Figure 25.4(b) (with p = K) can solve the multiclass logistic regression task, provided we use a softmax activation at the outputs, and use the K-way cross-entropy error (see Eq. (24.18)), defined as
Ex =−􏰓y1 ·ln(o1)+···+yK ·ln(oK)􏰔
where x is an input vector, o = (o1,o2,··· ,oK)T is the predicted response vector, and y = (y1,y2,··· ,yK)T is the true response vector. Note that only one element of y is 1, and the rest are 0, due to the one-hot encoding.
With softmax activation, the output of the neural network in Figure 25.4(b) (with p = K) is given as
oi = P(y = ei | x) = f(neti | net) = exp{neti} / Σ_{j=1}^p exp{netj} = πi(x)
which matches the multiclass logistic regression task. The only restriction we have to impose on the neural network is that the weights on edges into the last output neuron should be zero to model the base class weights wK. However, in practice, we can relax this restriction, and just learn a regular weight vector for class cK.
Example 25.2 (Logistic Regression: Binary and Multiclass). We applied the neural network in Figure 25.4(a), with logistic activation at the output neuron and cross-entropy error function, on the Iris principal components dataset. The output is a binary response indicating Iris-virginica (Y = 1) or one of the other Iris types (Y = 0). As expected, the neural network learns an identical set of weights and bias as shown for the logistic regression model in Example 24.2, namely:
o = −6.79 − 5.07 · x1 − 3.29 · x2
Figure 25.5. Neural networks for multiclass logistic regression: Iris principal components data (decision surfaces π1(x̃), π2(x̃), π3(x̃) over X1, X2). Misclassified points are shown in dark gray color. Points in class c1 and c2 are shown displaced with respect to the base class c3 only for illustration.

Next, we applied the neural network in Figure 25.4(b), using a softmax activation and cross-entropy error function, to the Iris principal components data with three classes: Iris-setosa (Y = 1), Iris-versicolor (Y = 2) and Iris-virginica (Y = 3). Thus, we need K = 3 output neurons, o1, o2, and o3. Further, to obtain the same model as in the multiclass logistic regression from Example 24.3, we fix the incoming weights and bias for output neuron o3 to be zero. The model is given as
o1 = −3.49 + 3.61·x1 + 2.65·x2
o2 = −6.95 − 5.18·x1 − 3.40·x2
o3 = 0 + 0·x1 + 0·x2
which is essentially the same as in Example 24.3.
If we do not constrain the weights and bias for o3 we obtain the following model:
o1 = −0.89 + 4.54·x1 + 1.96·x2
o2 = −3.38 − 5.11·x1 − 2.88·x2
o3 = 4.24 + 0.52·x1 + 0.92·x2
The classification decision surface for each class is illustrated in Figure 25.5. The points in class c1 are shown as squares, c2 as circles, and c3 as triangles. This figure should be contrasted with the decision boundaries shown for multiclass logistic regression in Figure 24.3, which has the weights and bias set to 0 for the base class c3.
25.2.3 Error Functions
Typically, for a regression task, we use squared error as the loss function, whereas for classification, we use the cross-entropy loss function. Furthermore, when learning from neural networks, we will require the partial derivatives of the error function with respect to the output neurons. Thus, the commonly used error functions and their derivatives are listed below:

Squared Error: Given an input vector x ∈ Rᵈ, the squared error loss function measures the squared deviation between the predicted output vector o ∈ Rᵖ and the true response y ∈ Rᵖ, defined as follows:
Ex = (1/2)·∥y − o∥² = (1/2)·Σ_{j=1}^p ( yj − oj )²      (25.13)
where Ex denotes the error on input x. The partial derivative of the squared error function with respect to a particular output neuron oj is
∂Ex/∂oj = (1/2)·2·(yj − oj)·(−1) = oj − yj      (25.14)
Across all the output neurons, we can write this as
∂Ex/∂o = o − y      (25.15)
Cross-Entropy Error: For classification tasks, with K classes {c1, c2, ···, cK}, we usually set the number of output neurons p = K, with one output neuron per class. Furthermore, each of the classes is coded as a one-hot vector, with class ci encoded as the ith standard basis vector ei = (ei1, ei2, ···, eiK)ᵀ ∈ {0, 1}^K, with eii = 1 and eij = 0 for all j ≠ i. Thus, given input x ∈ Rᵈ, with the true response y = (y1, y2, ···, yK)ᵀ, where y ∈ {e1, e2, ···, eK}, the cross-entropy loss is defined as
Ex = −Σ_{i=1}^K yi·ln(oi) = −( y1·ln(o1) + ··· + yK·ln(oK) )      (25.16)
Note that only one element of y is 1 and the rest are 0 due to the one-hot encoding. That is, if y = ei, then only yi = 1, and the other elements yj = 0 for j ≠ i.
The partial derivative of the cross-entropy error function with respect to a particular output neuron oj is
∂Ex/∂oj = −yj/oj      (25.17)
The vector of partial derivatives of the error function with respect to the output neurons is therefore given as
∂Ex/∂o = ( ∂Ex/∂o1, ∂Ex/∂o2, ···, ∂Ex/∂oK )ᵀ = ( −y1/o1, −y2/o2, ···, −yK/oK )ᵀ      (25.18)
Binary Cross-Entropy Error: For classification tasks with binary classes, it is typical to encode the positive class as 1 and the negative class as 0, as opposed to using a one-hot encoding as in the general K-class case. Given an input x ∈ Rᵈ, with true response y ∈ {0, 1}, there is only one output neuron o. Therefore, the binary cross-entropy error is defined as
Ex = −( y·ln(o) + (1 − y)·ln(1 − o) )      (25.19)
Here y is either 1 or 0. The partial derivative of the binary cross-entropy error function with respect to the output neuron o is
∂Ex/∂o = ∂/∂o [ −y·ln(o) − (1 − y)·ln(1 − o) ]
    = −( y/o + ((1 − y)/(1 − o))·(−1) )
    = ( −y·(1 − o) + (1 − y)·o ) / ( o·(1 − o) )
    = ( o − y ) / ( o·(1 − o) )      (25.20)
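A compact sketch of these error functions and their derivatives with respect to the outputs is given below (illustrative; the small eps constants guard against division by zero and log(0) and are not part of the formulas above):

import numpy as np

def squared_error(y, o):
    return 0.5 * np.sum((y - o) ** 2)          # Eq. (25.13)

def squared_error_grad(y, o):
    return o - y                               # Eq. (25.15)

def cross_entropy(y, o, eps=1e-12):
    return -np.sum(y * np.log(o + eps))        # Eq. (25.16), y is one-hot

def cross_entropy_grad(y, o, eps=1e-12):
    return -y / (o + eps)                      # Eq. (25.18)

def binary_cross_entropy(y, o, eps=1e-12):
    return -(y * np.log(o + eps) + (1 - y) * np.log(1 - o + eps))   # Eq. (25.19)

def binary_cross_entropy_grad(y, o, eps=1e-12):
    return (o - y) / (o * (1 - o) + eps)       # Eq. (25.20)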
25.3 MULTILAYER PERCEPTRON: ONE HIDDEN LAYER
A multilayer perceptron (MLP) is a neural network that has distinct layers of neurons. The inputs to the neural network comprise the input layer, and the final outputs from the MLP comprise the output layer. Any intermediate layer is called a hidden layer, and an MLP can have one or many hidden layers. Networks with many hidden layers are called deep neural networks. An MLP is also a feed-forward network. That is, information flows in the forward direction, and from a layer only to the subsequent layer. Thus, information flows from the input to the first hidden layer, from the first to the second hidden layer, and so on, until it reaches the output layer from the last hidden layer. Typically, MLPs are fully connected between layers. That is, each neuron in the input layer is connected to all the neurons in the first hidden layer, and each neuron in the first hidden layer is connected to all neurons in the second hidden layer, and so on, and finally, each neuron in the last hidden layer is connected to all neurons in the output layer.
For ease of explanation, in this section, we will consider an MLP with only one hidden layer, and we will later generalize the discussion to deep MLPs. For example, Figure 25.6 shows an MLP with one hidden layer. The input layer has d neurons, x1,x2,···,xd, and an additional neuron x0 that specifies the biases for the hidden layer. The hidden layer has m neurons, z1,z2,··· ,zm, and an additional neuron z0 that specifies the biases for the output neurons. Finally, the output layer has p neurons, o1,o2,··· ,op. The bias neurons have no incoming edges, since their value is always fixed at 1. Thus, in total there are d × m + m × p weight parameters (wij ) and a further m + p bias parameters (bi ) that need to be learned by the neural network. These parameters also correspond to the total number of edges in the MLP.
25.3.1 Feed-forward Phase
Let D denote the training dataset, comprising n input points xi ∈ Rd and corresponding true response vectors yi ∈ Rp . For each pair (x, y) in the data, in the feed-forward phase,

the point x = (x1, x2, ···, xd)ᵀ ∈ Rᵈ is supplied as an input to the MLP. The input neurons do not use any activation function, and simply pass along the supplied input values as their output value. This is equivalent to saying that the net at input neuron k is netk = xk, and the activation function is the identity function f(netk) = netk, so that the output value of neuron k is simply xk.

Figure 25.6. Multilayer perceptron with one hidden layer (input layer x0, x1, ···, xd; hidden layer z0, z1, ···, zm; output layer o1, ···, op). Input and output links for neuron zk are shown in bold. Neurons x0 and z0 are bias neurons.
Given the input neuron values, we compute the output value for each hidden neuron zk as follows:
zk = f(netk) = f( bk + Σ_{i=1}^d wik·xi )
where f is some activation function, and wik denotes the weight between input neuron xi and hidden neuron zk . Next, given the hidden neuron values, we compute the value for each output neuron oj as follows:
oj = f(netj) = f( bj + Σ_{i=1}^m wij·zi )
where wij denotes the weight between hidden neuron zi and output neuron oj .
We can write the feed-forward phase as a series of matrix-vector operations. For this we define the d × m matrix Wh comprising the weights between input and hidden layer neurons, and vector bh ∈ Rm comprising the bias terms for hidden layer neurons,
given as
Wh = [ w11 w12 ··· w1m
       w21 w22 ··· w2m
        ⋮    ⋮  ···  ⋮
       wd1 wd2 ··· wdm ]      bh = (b1, b2, ···, bm)ᵀ      (25.21)
where wij denotes the weight on the edge between input neuron xi and hidden neuron zj , and bi denotes the bias weight from x0 to zi . The net input and the output for all the hidden layer neurons can be computed via a matrix-vector multiplication operation, as follows:
neth = bh + Whᵀx      (25.22)
z = f(neth) = f( bh + Whᵀx )      (25.23)
Here, neth = (net1,··· ,netm)T denotes the net input at each hidden neuron (excluding the bias neuron z0 whose value is always fixed at z0 = 1), and z = (z1,z2,··· ,zm)T denotes the vector of hidden neuron values. The activation function f (•) applies to, or distributes over, each element of neth, i.e., f (neth) = (f (net1),··· ,f (netm))T ∈ Rm. Typically, all neurons in a given layer use the same activation function, but they can also be different if desired.
Likewise, let Wo ∈ Rm×p denote the weight matrix between the hidden and output layers, and let bo ∈ Rp be the bias vector for output neurons, given as
Wo = [ w11 w12 ··· w1p
       w21 w22 ··· w2p
        ⋮    ⋮  ···  ⋮
       wm1 wm2 ··· wmp ]      bo = (b1, b2, ···, bp)ᵀ      (25.24)
where wij denotes the weight on the edge between hidden neuron zi and output neuron oj , and bi the bias weight between z0 and output neuron oi . The output vector can then be computed as follows:
neto = bo + Woᵀz      (25.25)
o = f(neto) = f( bo + Woᵀz )      (25.26)
To summarize, for a given input x ∈ D with desired response y, an MLP computes the output vector via the feed-forward process, as follows:
o = f( bo + Woᵀz ) = f( bo + Woᵀ·f( bh + Whᵀx ) )      (25.27)
where o = (o1, o2, ···, op)ᵀ is the vector of predicted outputs from the single hidden
layer MLP.
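The feed-forward computation of Eq. (25.27) can be sketched directly in terms of the matrices defined above (illustrative names; sigmoid activations are assumed for both layers):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, Wh, bh, Wo, bo):
    # Wh is d x m, bh has length m, Wo is m x p, bo has length p
    net_h = bh + Wh.T @ x          # Eq. (25.22)
    z = sigmoid(net_h)             # Eq. (25.23)
    net_o = bo + Wo.T @ z          # Eq. (25.25)
    o = sigmoid(net_o)             # Eq. (25.26)
    return z, o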
25.3.2 Backpropagation Phase
Backpropagation is the algorithm used to learn the weights between successive layers in an MLP. The name comes from the manner in which the error gradient is propagated

backwards from the output to input layers via the hidden layers. For simplicity of exposition, we will consider backpropagation for an MLP with a single hidden layer with m neurons, with squared error function, and with sigmoid activations for all neurons. We will later generalize to multiple hidden layers, and other error and activation functions.
Let D denote the training dataset, comprising n input points xi = (xi1, xi2, ···, xid)ᵀ ∈ Rᵈ and corresponding true response vectors yi ∈ Rᵖ. Let Wh ∈ R^{d×m} denote the weight matrix between the input and hidden layer, and bh ∈ Rᵐ the vector of bias terms for the hidden neurons from Eq. (25.21). Likewise, let Wo ∈ R^{m×p} denote the weight matrix between the hidden and output layer, and bo ∈ Rᵖ the bias vector for output neurons from Eq. (25.24).
For a given input pair (x, y) in the training data, the MLP first computes the output vector o via the feed-forward step in Eq. (25.27). Next, it computes the error in the predicted output vis-a-vis the true response y using the squared error function
Ex = (1/2)·∥y − o∥² = (1/2)·Σ_{j=1}^p ( yj − oj )²      (25.28)
The basic idea behind backpropagation is to examine the extent to which an output neuron, say oj , deviates from the corresponding target response yj , and to modify the weights wij between each hidden neuron zi and oj as some function of the error – large error should cause a correspondingly large change in the weight, and small error should result in smaller changes. Likewise, the weights between all input and hidden neurons should also be updated as some function of the error at the output, as well as changes already computed for the weights between the hidden and output layers. That is, the error propagates backwards.
The weight update is done via a gradient descent approach to minimize the error. Let ∇wij be the gradient of the error function with respect to wij , or simply the weight gradient at wij . Given the previous weight estimate wij , a new weight is computed by taking a small step η in a direction that is opposite to the weight gradient at wij
wij = wij − η·∇wij      (25.29)
In a similar manner, the bias term bj is also updated via gradient descent
bj = bj − η·∇bj      (25.30)
where ∇bj is the gradient of the error function with respect to bj, which we call the bias
gradient at bj .
Updating Parameters Between Hidden and Output Layer
Consider the weight wij between hidden neuron zi and output neuron oj , and the bias term bj between z0 and oj . Using the chain rule of differentiation, we compute the weightgradientatwij andbiasgradientatbj,asfollows:
∇wij = ∂Ex/∂wij = ( ∂Ex/∂netj )·( ∂netj/∂wij ) = δj·zi
∇bj = ∂Ex/∂bj = ( ∂Ex/∂netj )·( ∂netj/∂bj ) = δj      (25.31)

where we use the symbol δj to denote the partial derivative of the error with respect to
net signal at oj , which we also call the net gradient at oj
δj = ∂Ex/∂netj      (25.32)
Furthermore, the partial derivative of netj with respect to wij and bj is given as
∂netj/∂wij = ∂/∂wij ( bj + Σ_{k=1}^m wkj·zk ) = zi      ∂netj/∂bj = ∂/∂bj ( bj + Σ_{k=1}^m wkj·zk ) = 1
where we used the fact that bj and all wkj for k ≠ i are constants with respect to wij. Next, we need to compute δj, the net gradient at oj. This can also be computed via
the chain rule
δj = ∂Ex/∂netj = ( ∂Ex/∂f(netj) )·( ∂f(netj)/∂netj )      (25.33)
Note that f(netj) = oj. Thus, δj is composed of two terms, namely the partial derivative of the error term with respect to the output or activation function applied to the net signal, and the derivative of the activation function with respect to its argument. Using the squared error function and from Eq. (25.14), for the former, we have
∂Ex/∂f(netj) = ∂Ex/∂oj = ∂/∂oj [ (1/2)·Σ_{k=1}^p ( yk − ok )² ] = oj − yj
where we used the observation that all ok for k ̸= j are constants with respect to oj .
Since we assume a sigmoid activation function, for the latter, we have via Eq. (25.10)
∂f(netj)/∂netj = oj·(1 − oj)
Putting it all together, we get
δj = ( oj − yj )·oj·( 1 − oj )
Let δo = (δ1,δ2,…,δp)T denote the vector of net gradients at each output neuron, which we call the net gradient vector for the output layer. We can write δo as
δo =o ⊙ (1−o) ⊙ (o−y) (25.34)
where ⊙ denotes the element-wise product (also called the Hadamard product) between the vectors, and where o = (o1,o2,··· ,op)T is the predicted output vector, y = (y1,y2,··· ,yp)T is the (true) response vector, and 1 = (1,··· ,1)T ∈ Rp is the p-dimensional vector of all ones.
Let z = (z1, z2, ···, zm)^T denote the vector comprising the values of all hidden layer neurons (after applying the activation function). Based on Eq. (25.31), we can compute the gradients ∇wij for all hidden to output neuron connections via the outer product of z and δo:

∇Wo = [ ∇w11  ∇w12  ···  ∇w1p
        ∇w21  ∇w22  ···  ∇w2p
          ⋮      ⋮    ···    ⋮
        ∇wm1  ∇wm2  ···  ∇wmp ]  = z · δo^T    (25.35)

where ∇Wo ∈ Rm×p is the matrix of weight gradients. The vector of bias gradients is given as:

∇bo = (∇b1, ∇b2, ···, ∇bp)^T = δo    (25.36)

where ∇bo ∈ Rp. Once the gradients have been computed, we can update all the weights and biases as follows

Wo = Wo − η · ∇Wo        bo = bo − η · ∇bo    (25.37)

where η is the step size (also called the learning rate) for gradient descent.

Updating Parameters Between Input and Hidden Layer
Consider the weight wij between input neuron xi and hidden neuron zj, and the bias term between x0 and zj. The weight gradient at wij and bias gradient at bj are computed similarly to Eq. (25.31)
∇wij = ∂Ex/∂wij = (∂Ex/∂netj) · (∂netj/∂wij) = δj · xi

∇bj = ∂Ex/∂bj = (∂Ex/∂netj) · (∂netj/∂bj) = δj    (25.38)

which follows from

∂netj/∂wij = ∂/∂wij ( bj + Σ_{k=1}^d wkj · xk ) = xi        ∂netj/∂bj = ∂/∂bj ( bj + Σ_{k=1}^d wkj · xk ) = 1
To compute the net gradient δj at the hidden neuron zj we have to consider the error gradients that flow back from all the output neurons to zj . Applying the chain rule, we get:
δj = ∂Ex/∂netj = Σ_{k=1}^p (∂Ex/∂netk) · (∂netk/∂zj) · (∂zj/∂netj) = (∂zj/∂netj) · Σ_{k=1}^p (∂Ex/∂netk) · (∂netk/∂zj)

   = zj · (1 − zj) · Σ_{k=1}^p δk · wjk

where ∂zj/∂netj = zj · (1 − zj), since we assume a sigmoid activation function for the hidden neurons. The chain rule leads to a natural interpretation for backpropagation, namely, to find the net gradient at zj we have to consider the net gradients at each of the output neurons δk but weighted by the strength of the connection wjk between zj and ok, as illustrated in Figure 25.7. That is, we compute the weighted sum of gradients Σ_{k=1}^p δk · wjk, which is used to compute δj, the net gradient at hidden neuron zj.

Figure 25.7. Backpropagation of gradients from output to hidden layer.
Let δo = (δ1,δ2,…,δp)T denote the vector of net gradients at the output neurons, and δh = (δ1,δ2,…,δm)T the net gradients at the hidden layer neurons. We can write δh compactly as
δh =z ⊙ (1−z) ⊙ (Wo ·δo) (25.39)
where ⊙ is the element-wise product, 1 = (1,1,··· ,1) ∈ Rm is the vector of all ones and z = (z1,z2,··· ,zm)T is the vector of hidden layer outputs. Furthermore, Wo · δo ∈ Rm is the vector of weighted gradients at each hidden neuron, since
Wo · δo = ( Σ_{k=1}^p δk · w1k,  Σ_{k=1}^p δk · w2k,  ···,  Σ_{k=1}^p δk · wmk )^T
Let x = (x1,x2,··· ,xd)T denote the input vector, then based on Eq. (25.38) we can compute the gradients ∇wij for all input to hidden layer connections via the outer product:
∇Wh = [ ∇w11  ···  ∇w1m
        ∇w21  ···  ∇w2m
          ⋮    ···    ⋮
        ∇wd1  ···  ∇wdm ]  = x · δh^T    (25.40)

where ∇Wh ∈ Rd×m is the matrix of weight gradients. The vector of bias gradients is given as:

∇bh = (∇b1, ∇b2, ···, ∇bm)^T = δh    (25.41)
where ∇bh ∈ Rm.
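As a concrete illustration of Eqs. (25.31)–(25.41), the following Python/NumPy sketch computes the net gradients and the weight and bias gradients for a single training pair, assuming sigmoid activations at both layers; the layer sizes d, m, p, the random initialization, and the sample input are arbitrary choices for the example, not values from the text.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    d, m, p = 4, 3, 2                        # arbitrary layer sizes for illustration
    rng = np.random.default_rng(0)
    Wh, bh = rng.uniform(-0.01, 0.01, (d, m)), np.zeros(m)
    Wo, bo = rng.uniform(-0.01, 0.01, (m, p)), np.zeros(p)
    x, y = rng.normal(size=d), np.array([1.0, 0.0])

    z = sigmoid(bh + Wh.T @ x)               # hidden layer values
    o = sigmoid(bo + Wo.T @ z)               # output layer values

    delta_o = o * (1 - o) * (o - y)          # Eq. (25.34)
    delta_h = z * (1 - z) * (Wo @ delta_o)   # Eq. (25.39)

    grad_Wo = np.outer(z, delta_o)           # Eq. (25.35): z · δo^T
    grad_bo = delta_o                        # Eq. (25.36)
    grad_Wh = np.outer(x, delta_h)           # Eq. (25.40): x · δh^T
    grad_bh = delta_h                        # Eq. (25.41)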

Algorithm 25.1: MLP Training: Stochastic Gradient Descent

MLP-TRAINING (D, m, η, maxiter):
   // Initialize bias vectors
1  bh ← random m-dimensional vector with small values
2  bo ← random p-dimensional vector with small values
   // Initialize weight matrices
3  Wh ← random d × m matrix with small values
4  Wo ← random m × p matrix with small values
5  t ← 0 // iteration counter
6  repeat
7      foreach (xi, yi) ∈ D in random order do
           // Feed-forward phase
8          zi ← f(bh + Wh^T xi)
9          oi ← f(bo + Wo^T zi)
           // Backpropagation phase: net gradients
10         δo ← oi ⊙ (1 − oi) ⊙ (oi − yi)
11         δh ← zi ⊙ (1 − zi) ⊙ (Wo · δo)
           // Gradient descent for bias vectors
12         ∇bo ← δo;  bo ← bo − η · ∇bo
13         ∇bh ← δh;  bh ← bh − η · ∇bh
           // Gradient descent for weight matrices
14         ∇Wo ← zi · δo^T;  Wo ← Wo − η · ∇Wo
15         ∇Wh ← xi · δh^T;  Wh ← Wh − η · ∇Wh
16     t ← t + 1
17 until t ≥ maxiter
Once the gradients have been computed, we can update all the weights and biases as follows

Wh = Wh − η · ∇Wh        bh = bh − η · ∇bh    (25.42)

where η is the step size (or learning rate).

25.3.3 MLP Training
Algorithm 25.1 shows the pseudo-code for learning the weights considering all of the input points in D via stochastic gradient descent. The code is shown for an MLP with a single hidden layer, using a squared error function, and sigmoid activations for all hidden and output neurons. The approach is called stochastic gradient descent since we compute the weight and bias gradients after observing each training point (in random order).
The MLP algorithm takes as input the dataset D (with points xi and desired responses yi for i = 1, 2, ···, n), the number of hidden layer neurons m, the learning rate η, and an integer threshold maxiter that specifies the maximum number of iterations. The size of the input (d) and output (p) layers is determined automatically from D. The MLP first initializes the d × m input to hidden layer weight matrix Wh, and the m × p hidden to output layer matrix Wo to small values, for example, uniformly random in the range [−0.01, 0.01]. It is important to note that weights should not be set to 0, otherwise, all hidden neurons will be identical in their values, and so will be the output neurons.
The MLP training takes multiple iterations over the input points. For each input xi, the MLP computes the output vector oi via the feed-forward step. In the backpropagation phase, we compute the error gradient vector δo with respect to the net at output neurons, followed by δh for hidden neurons. In the stochastic gradient descent step, we compute the error gradients with respect to the weights and biases, which are used to update the weight matrices and bias vectors. Thus, for every input vector xi, all the weights and biases are updated based on the error incurred between the predicted output oi and the true response yi. After each input has been processed, that completes one iteration of training, called an epoch. Training stops when the maximum number of iterations, maxiter, has been reached. On the other hand, during testing, for any input x, we apply the feed-forward steps and print the predicted output o.
In terms of computational complexity, each iteration of the MLP training algorithm takes O(dm + mp) time for the feed-forward phase, p + mp + m = O(mp) time for the backpropagation of error gradients, and O(dm + mp) time for updating the weight matrices and bias vectors. Thus, the total training time per iteration is O(dm+mp).
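To make the training procedure concrete, the following is a minimal NumPy sketch of Algorithm 25.1, assuming sigmoid activations for both the hidden and output layers and squared error; the function name mlp_train and its arguments (data arrays X of shape n × d and Y of shape n × p, the number of hidden neurons m, the step size eta, and maxiter) are illustrative choices, not part of the original pseudo-code.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def mlp_train(X, Y, m, eta=0.1, maxiter=1000, seed=0):
        n, d = X.shape
        p = Y.shape[1]
        rng = np.random.default_rng(seed)
        # initialize weights and biases with small random values
        Wh = rng.uniform(-0.01, 0.01, (d, m)); bh = rng.uniform(-0.01, 0.01, m)
        Wo = rng.uniform(-0.01, 0.01, (m, p)); bo = rng.uniform(-0.01, 0.01, p)
        for t in range(maxiter):
            for i in rng.permutation(n):             # points in random order
                x, y = X[i], Y[i]
                # feed-forward phase
                z = sigmoid(bh + Wh.T @ x)
                o = sigmoid(bo + Wo.T @ z)
                # backpropagation phase: net gradients
                delta_o = o * (1 - o) * (o - y)
                delta_h = z * (1 - z) * (Wo @ delta_o)
                # gradient descent step for biases and weights
                bo -= eta * delta_o;  bh -= eta * delta_h
                Wo -= eta * np.outer(z, delta_o)
                Wh -= eta * np.outer(x, delta_h)
        return Wh, bh, Wo, bo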
Example 25.3 (MLP with one hidden layer). We now illustrate an MLP with a hidden layer using a non-linear activation function to learn the sine curve. Figure 25.8(a) shows the training data (the gray points on the curve), which comprises n = 25 points xi sampled randomly in the range [−10, 10], with yi = sin(xi). The testing data comprises 1000 points sampled uniformly from the same range. The figure also shows the desired output curve (thin line). We used an MLP with one input neuron (d = 1), ten hidden neurons (m = 10) and one output neuron (p = 1). The hidden neurons use tanh activations, whereas the output unit uses an identity activation. The step size is η = 0.005.
The input to hidden weight matrix Wh ∈ R1×10 and the corresponding bias vector bh ∈ R10×1 are given as:
Wh = (−0.68, 0.77, −0.42, −0.72, −0.93, −0.42, −0.66, −0.70, −0.62, −0.50)
bh = (−4.36, 2.43, −0.52, 2.35, −1.64, 3.98, 0.31, 4.45, 1.03, −4.77)^T

The hidden to output weight matrix Wo ∈ R10×1 and the bias term bo ∈ R are given as:

Wo = (−1.82, −1.69, −0.82, 1.37, 0.14, 2.37, −1.64, −1.92, 0.78, 2.17)^T
bo = −0.16
Figure 25.8(a) shows the output of the MLP on the test set, after the first iteration of training (t = 1). We can see that initially the predicted response deviates significantly from the true sine response. Figure 25.8(a)–(f) show the output from the MLP after different number of training iterations. By t = 15000 iterations the output on the test set comes close to the sine curve, but it takes another 15000 iterations to get a closer fit. The final SSE is 1.45 over the 1000 test points.

Figure 25.8. MLP for sine curve: 10 hidden neurons with hyperbolic tangent activation functions. The gray dots represent the training data. The bold line is the predicted response, whereas the thin line is the true response. (a) t = 1, (b) t = 1000, (c) t = 5000, (d) t = 10000, (e) t = 15000, (f) t = 30000: predictions after different number of iterations. (g) Testing outside the training range, over [−20, 20]. Good fit within the training range [−10, 10] shown in the box.
We can observe that, even with a very small training data of 25 points sampled randomly from the sine curve, the MLP is able to learn the desired function. However, it is also important to recognize that the MLP model has not really learned the sine function; rather, it has learned to approximate it only in the specified range [−10, 10]. We can see in Figure 25.8(g) that when we try to predict values outside this range, the MLP does not yield a good fit.
Example 25.4 (MLP for handwritten digit classification). In this example, we apply an MLP with one hidden layer for the task of predicting the correct label for a hand-written digit from the MNIST database, which contains 60,000 training images that span the 10 digit labels, from 0 to 9. Each (grayscale) image is a 28 × 28 matrix of pixels, with values between 0 and 255. Each pixel is converted to a value in the interval [0, 1] by dividing by 255. Figure 25.9 shows an example of each digit from the MNIST dataset.
Since images are 2-dimensional matrices, we first flatten them into a vector x ∈ R784 with dimensionality d = 28 × 28 = 784. This is done by simply concatenating all of the rows of the images to obtain one long vector. Next, since the output labels are categorical values that denote the digits from 0 to 9, we need to convert them into binary (numerical) vectors, using one-hot encoding. Thus, the label 0 is encoded as e1 = (1,0,0,0,0,0,0,0,0,0)^T ∈ R10, the label 1 as e2 = (0,1,0,0,0,0,0,0,0,0)^T ∈ R10, and so on, and finally the label 9 is encoded as e10 = (0,0,0,0,0,0,0,0,0,1)^T ∈ R10. That is, each input image vector x has a corresponding target response vector y ∈ {e1, e2, ···, e10}. Thus, the input layer for the MLP has d = 784 neurons, and the output layer has p = 10 neurons.
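As a sketch of this preprocessing step, assuming the raw images are available as a NumPy array of shape (n, 28, 28) with pixel values in [0, 255] and the labels as an integer array of length n (how the data is loaded is left open), the flattening, scaling, and one-hot encoding can be done as follows:

    import numpy as np

    def preprocess(images, labels, num_classes=10):
        # images: (n, 28, 28) array with values in [0, 255]; labels: (n,) integers in 0..9
        X = images.reshape(len(images), -1) / 255.0      # flatten to (n, 784), scale to [0, 1]
        Y = np.zeros((len(labels), num_classes))
        Y[np.arange(len(labels)), labels] = 1.0          # one-hot encode: label k maps to e_{k+1}
        return X, Y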
For the hidden layer, we consider several MLP models, each with a different number of hidden neurons m. We try m = 0, 7, 10, 49, 98, 196, 392, to study the effect of increasing the number of hidden neurons, from small to large. For the hidden layer, we use the ReLU activation function, and for the output layer, we use softmax activation, since the target response vector has only one neuron with value 1, with the rest being 0. Note that m = 0 means that there is no hidden layer – the input layer is directly connected to the output layer, which is equivalent to a multiclass logistic regression model. We train each MLP for t = 15 epochs, using step size η = 0.25.
During training, we plot the number of misclassified images after each epoch, on the separate MNIST test set comprising 10,000 images. Figure 25.10 shows the number of errors from each of the models (with a different number of hidden neurons m), after each epoch. The final test error at the end of training is given as
m         0      7      10     49     98     196    392
errors    1677   901    792    546    495    470    454

Figure 25.9. MNIST dataset: Sample handwritten digits, with labels (a) 0, (b) 1, (c) 2, (d) 3, (e) 4, (f) 5, (g) 6, (h) 7, (i) 8, (j) 9.
We can observe that adding a hidden layer significantly improves the prediction accuracy. Using even a small number of hidden neurons helps, compared to the logistic regression model (m = 0). For example, using m = 7 results in 901 errors (or error rate 9.01%) compared to using m = 0, which results in 1677 errors (or error rate 16.77%). On the other hand, as we increase the number of hidden neurons, the error rate decreases, though with diminishing returns. Using m = 196, the error rate is 4.70%, but even after doubling the number of hidden neurons (m = 392), the error rate goes down to only 4.54%. Further increasing m does not reduce the error rate.
Figure 25.10. MNIST: Prediction error as a function of epochs, for m = 0, 7, 10, 49, 98, 196, 392 hidden neurons.

25.4 DEEP MULTILAYER PERCEPTRONS
We now generalize the feed-forward and backpropagation steps for many hidden layers, as well as arbitrary error and neuron activation functions.
Consider an MLP with h hidden layers as shown in Figure 25.11. We assume that the input to the MLP comprises n points xi ∈ Rd with the corresponding true response vector yi ∈ Rp . We denote the input neurons as layer l = 0, the first hidden layer as l = 1, the last hidden layer as l = h, and finally the output layer as layer l = h+1. We use nl to denote the number of neurons in layer l. Since the input points are d-dimensional, this implies n0 = d, and since the true response vector is p-dimensional, we have nh+1 = p. The hidden layers have n1 neurons for the first hidden layer, n2 for the second layer, and nh for the last hidden layer. The vector of neuron values for layer l (for l = 0,··· ,h+1) is denoted as
z^l = (z_1^l, ···, z_{nl}^l)^T

Each layer except the output layer has one extra bias neuron, which is the neuron at index 0. Thus, the bias neuron for layer l is denoted z_0^l and its value is fixed at z_0^l = 1.
Figure 25.11. Deep multilayer perceptron, with h hidden layers: (a) the detail view shows the individual neurons in each layer, including the bias neurons; (b) the layer view shows the layers x, z1, z2, ···, zh, o as nodes, with edges labeled by the parameters W0, b0 through Wh, bh.
Figure 25.11(a) displays a detailed view of an MLP with h hidden layers, showing the individual neurons in each layer including the bias neuron. Note that the vector of input neuron values is also written as
x = (x1, x2, ···, xd)^T = (z_1^0, z_2^0, ···, z_d^0)^T = z^0

and the vector of output neuron values is also denoted as

o = (o1, o2, ···, op)^T = (z_1^{h+1}, z_2^{h+1}, ···, z_p^{h+1})^T = z^{h+1}

The weight matrix between layer l and layer l + 1 neurons is denoted Wl ∈ R^{nl×nl+1}, and the vector of bias terms from the bias neuron z_0^l to neurons in layer l + 1 is denoted bl ∈ R^{nl+1}, for l = 0, 1, ···, h. Thus, W0 ∈ R^{d×n1} is the weight matrix between the input and first hidden layer, W1 ∈ R^{n1×n2} is the weight matrix between the first and second hidden layer, and so on; finally, Wh ∈ R^{nh×p} is the weight matrix between the last hidden layer and the output layer. For the bias vectors, b0 ∈ R^{n1} specifies the biases for neurons in the first hidden layer, b1 ∈ R^{n2} the biases for neurons in the second hidden layer, and so on. Thus, bh ∈ R^p specifies the biases for the output neurons. Figure 25.11(b) shows a layer view for an MLP with h hidden layers. This is a more compact representation that clearly specifies the architecture or topology of the MLP. Each layer l is represented as a rectangular node, and is marked with the vector of neuron values, zl. The bias neurons are not shown, but are assumed to be present in each layer except the output. An edge between layer l and l + 1 is labeled with the weight matrix Wl and bias vector bl that specify the parameters between those layers.
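The following sketch shows one way the weight matrices Wl ∈ R^{nl×nl+1} and bias vectors bl ∈ R^{nl+1} could be set up in NumPy for a given list of layer sizes [d, n1, ···, nh, p]; the uniform range [−0.01, 0.01] follows the initialization described in Section 25.4.5, and the function name init_params is an illustrative choice.

    import numpy as np

    def init_params(layer_sizes, seed=0):
        # layer_sizes = [d, n1, ..., nh, p]
        rng = np.random.default_rng(seed)
        W = [rng.uniform(-0.01, 0.01, (layer_sizes[l], layer_sizes[l + 1]))
             for l in range(len(layer_sizes) - 1)]
        b = [rng.uniform(-0.01, 0.01, layer_sizes[l + 1])
             for l in range(len(layer_sizes) - 1)]
        return W, b   # W[l] is Wl and b[l] is bl, for l = 0, ..., h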
For training deep MLPs, we will refer to several partial derivative vectors, described next. Define δ_i^l as the net gradient, i.e., the partial derivative of the error function with respect to the net value at z_i^l

δ_i^l = ∂Ex/∂neti    (25.43)

and let δ^l denote the net gradient vector at layer l, for l = 1, 2, ···, h + 1

δ^l = (δ_1^l, ···, δ_{nl}^l)^T    (25.44)

Let f^l denote the activation function for layer l, for l = 0, 1, ···, h + 1, and further let ∂f^l denote the vector of the derivatives of the activation function with respect to neti for all neurons z_i^l in layer l:

∂f^l = ( ∂f^l(net1)/∂net1, ···, ∂f^l(net_{nl})/∂net_{nl} )^T    (25.45)

Finally, let ∂Ex denote the vector of partial derivatives of the error function with respect to the values oi for all output neurons:

∂Ex = ( ∂Ex/∂o1, ∂Ex/∂o2, ···, ∂Ex/∂op )^T    (25.46)

25.4.1 Feed-forward Phase

Typically in a deep MLP, the same activation function f^l is used for all neurons in a given layer l. The input layer always uses the identity activation, so f^0 is the identity function. All bias neurons also use the identity function, with a fixed value of 1. The output layer typically uses sigmoid or softmax activations for classification tasks, or identity activations for regression tasks. The hidden layers typically use sigmoid, tanh, or ReLU activations. In our discussion, we assume a fixed activation function f^l for all neurons in a given layer. However, it is easy to generalize to a neuron-specific activation function f_i^l for neuron z_i^l in layer l.

For a given input pair (x, y) ∈ D, the deep MLP computes the output vector via the feed-forward process:

o = f^{h+1}( bh + Wh^T · z^h )
  = f^{h+1}( bh + Wh^T · f^h( b_{h−1} + W_{h−1}^T · z^{h−1} ) )
  = ⋮
  = f^{h+1}( bh + Wh^T · f^h( b_{h−1} + W_{h−1}^T · f^{h−1}( ··· f^2( b1 + W1^T · f^1( b0 + W0^T · x ) ) ··· ) ) )

Note that each f^l distributes over its argument. That is,

f^l( b_{l−1} + W_{l−1}^T · x ) = ( f^l(net1), f^l(net2), ···, f^l(net_{nl}) )^T
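A minimal NumPy sketch of this feed-forward process is given below; it assumes the per-layer parameters are stored in lists W and b indexed by l (for example, as produced by the initialization sketch above), and that activations is a list of element-wise functions f^1, ···, f^{h+1}.

    import numpy as np

    def feed_forward(x, W, b, activations):
        # returns the list of layer vectors [z^0, z^1, ..., z^{h+1}], with z^0 = x
        z = [x]
        for l in range(len(W)):                  # l = 0, 1, ..., h
            net = b[l] + W[l].T @ z[l]           # net input to layer l + 1
            z.append(activations[l](net))        # z^{l+1} = f^{l+1}(net)
        return z

    # example activations: ReLU for the hidden layers, identity for the output layer
    relu = lambda a: np.maximum(a, 0)
    identity = lambda a: a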
25.4.2 Backpropagation Phase

Consider the weight update between a given layer and another, including between the input and hidden layer, or between two hidden layers, or between the last hidden layer and the output layer. Let z_i^l be a neuron in layer l, and z_j^{l+1} a neuron in the next layer l + 1. Let w_ij^l be the weight between z_i^l and z_j^{l+1}, and let b_j^l denote the bias term between z_0^l and z_j^{l+1}. The weight and bias are updated using the gradient descent approach

w_ij^l = w_ij^l − η · ∇_{w_ij^l}        b_j^l = b_j^l − η · ∇_{b_j^l}

where ∇_{w_ij^l} is the weight gradient and ∇_{b_j^l} is the bias gradient, i.e., the partial derivative of the error function with respect to the weight and bias, respectively:

∇_{w_ij^l} = ∂Ex/∂w_ij^l        ∇_{b_j^l} = ∂Ex/∂b_j^l

As noted earlier in Eq. (25.31), we can use the chain rule to write the weight and bias gradient, as follows

∇_{w_ij^l} = ∂Ex/∂w_ij^l = (∂Ex/∂netj) · (∂netj/∂w_ij^l) = δ_j^{l+1} · z_i^l = z_i^l · δ_j^{l+1}

∇_{b_j^l} = ∂Ex/∂b_j^l = (∂Ex/∂netj) · (∂netj/∂b_j^l) = δ_j^{l+1}

where δ_j^{l+1} is the net gradient [Eq. (25.43)], i.e., the partial derivative of the error function with respect to the net value at z_j^{l+1}, and we have

∂netj/∂w_ij^l = ∂/∂w_ij^l ( b_j^l + Σ_{k=1}^{nl} w_kj^l · z_k^l ) = z_i^l        ∂netj/∂b_j^l = ∂/∂b_j^l ( b_j^l + Σ_{k=1}^{nl} w_kj^l · z_k^l ) = 1

Given the vector of neuron values at layer l, namely z^l = (z_1^l, ···, z_{nl}^l)^T, we can compute the entire weight gradient matrix via an outer product operation

∇Wl = z^l · (δ^{l+1})^T    (25.47)

and the bias gradient vector as:

∇bl = δ^{l+1}    (25.48)

with l = 0, 1, ···, h. Here δ^{l+1} is the net gradient vector at layer l + 1 [Eq. (25.44)]. This also allows us to update all the weights and biases as follows

Wl = Wl − η · ∇Wl        bl = bl − η · ∇bl    (25.49)

where η is the step size. However, we observe that to compute the weight and bias gradients for layer l we need to compute the net gradients δ^{l+1} at layer l + 1.

25.4.3 Net Gradients at Output Layer

Let us consider how to compute the net gradients at the output layer h + 1. If all of the output neurons are independent (for example, when using linear or sigmoid activations), the net gradient is obtained by differentiating the error function with respect to the net signal at the output neurons. That is,

δ_j^{h+1} = ∂Ex/∂netj = (∂Ex/∂f^{h+1}(netj)) · (∂f^{h+1}(netj)/∂netj) = (∂Ex/∂oj) · (∂f^{h+1}(netj)/∂netj)

Thus, the gradient depends on two terms, the partial derivative of the error function with respect to the output neuron value, and the derivative of the activation function with respect to its argument. The net gradient vector across all output neurons is given as

δ^{h+1} = ∂f^{h+1} ⊙ ∂Ex    (25.50)

where ⊙ is the element-wise or Hadamard product, ∂f^{h+1} is the vector of derivatives of the activation function with respect to its argument [Eq. (25.45)] at the output layer l = h + 1, and ∂Ex is the vector of error derivatives with respect to the output neuron values [Eq. (25.46)].

On the other hand, if the output neurons are not independent (for example, when using a softmax activation), then we have to modify the computation of the net gradient at each output neuron as follows:
δ_j^{h+1} = ∂Ex/∂netj = Σ_{i=1}^p (∂Ex/∂f^{h+1}(neti)) · (∂f^{h+1}(neti)/∂netj)

Across all output neurons, we can write this compactly as follows:

δ^{h+1} = ∂F^{h+1} · ∂Ex    (25.51)

where ∂F^{h+1} is the matrix of derivatives of oi = f^{h+1}(neti) with respect to netj for all i, j = 1, 2, ···, p, given as

∂F^{h+1} = [ ∂o1/∂net1  ∂o1/∂net2  ···  ∂o1/∂netp
             ∂o2/∂net1  ∂o2/∂net2  ···  ∂o2/∂netp
                 ⋮          ⋮       ···      ⋮
             ∂op/∂net1  ∂op/∂net2  ···  ∂op/∂netp ]

Typically, for regression tasks, we use the squared error function with linear activation function at the output neurons, whereas for logistic regression and classification, we use the cross-entropy error function with a sigmoid activation for binary classes, and softmax activation for multiclass problems. For these common cases, the net gradient vector at the output layer is given as follows:

Squared Error: From Eq. (25.15), the error gradient is given as

∂Ex = ∂Ex/∂o = o − y

The net gradient at the output layer is given as

δ^{h+1} = ∂f^{h+1} ⊙ ∂Ex

where ∂f^{h+1} depends on the activation function at the output. Typically, for regression tasks, we use a linear activation at the output neurons. In that case, we have ∂f^{h+1} = 1 (see Eq. (25.8)).
Cross-Entropy Error (binary output, sigmoid activation): Consider the binary case first, with a single output neuron o with sigmoid activation. Recall that the binary cross-entropy error [Eq. (25.19)] is given as

Ex = −( y · ln(o) + (1 − y) · ln(1 − o) )

From Eq. (25.20) we have

∂Ex = ∂Ex/∂o = (o − y) / ( o · (1 − o) )

Further, for sigmoid activation, we have

∂f^{h+1} = ∂f(neto)/∂neto = o · (1 − o)

Therefore, the net gradient at the output neuron is

δ^{h+1} = ∂Ex · ∂f^{h+1} = (o − y)/( o · (1 − o) ) · o · (1 − o) = o − y

Cross-Entropy Error (K outputs, softmax activation): Recall that the cross-entropy error function [Eq. (25.16)] is given as

Ex = − Σ_{i=1}^K yi · ln(oi) = −( y1 · ln(o1) + ··· + yK · ln(oK) )

Using Eq. (25.17), the vector of error derivatives with respect to the output neurons is given as

∂Ex = ( ∂Ex/∂o1, ∂Ex/∂o2, ···, ∂Ex/∂oK )^T = ( −y1/o1, −y2/o2, ···, −yK/oK )^T

where p = K is the number of output neurons. Cross-entropy error is typically used with the softmax activation so that we get a (normalized) probability value for each class. That is,

oj = softmax(netj) = exp{netj} / Σ_{i=1}^K exp{neti}

so that the output neuron values sum to one, Σ_{j=1}^K oj = 1. Since an output neuron depends on all other output neurons, we need to compute the matrix of derivatives of each output with respect to each of the net signals at the output neurons [see Equations (25.12) and (25.18)]:
∂F^{h+1} = [ o1·(1−o1)   −o1·o2      ···   −o1·oK
             −o1·o2      o2·(1−o2)   ···   −o2·oK
                ⋮            ⋮        ···      ⋮
             −o1·oK      −o2·oK      ···   oK·(1−oK) ]

Therefore, the net gradient vector at the output layer is

δ^{h+1} = ∂F^{h+1} · ∂Ex    (25.52)

        = ( −y1 + y1·o1 + Σ_{i≠1} yi·o1,  −y2 + y2·o2 + Σ_{i≠2} yi·o2,  ···,  −yK + yK·oK + Σ_{i≠K} yi·oK )^T

        = ( −y1 + o1 · Σ_{i=1}^K yi,  −y2 + o2 · Σ_{i=1}^K yi,  ···,  −yK + oK · Σ_{i=1}^K yi )^T

        = ( o1 − y1,  o2 − y2,  ···,  oK − yK )^T,  since Σ_{i=1}^K yi = 1

        = o − y
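A quick numerical check of this result (a sketch with arbitrary net values and a one-hot target chosen purely for illustration) confirms that the softmax/cross-entropy net gradient reduces to o − y:

    import numpy as np

    net = np.array([0.5, -1.0, 2.0])                  # arbitrary net values at the output
    y = np.array([0.0, 1.0, 0.0])                     # one-hot target
    o = np.exp(net) / np.exp(net).sum()               # softmax outputs

    dEx = -y / o                                      # error derivatives, Eq. (25.46)
    dF = np.diag(o) - np.outer(o, o)                  # softmax derivative matrix ∂F^{h+1}
    delta = dF @ dEx                                  # Eq. (25.52)
    print(np.allclose(delta, o - y))                  # True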
25.4.4 Net Gradients at Hidden Layers
Let us assume that we have already computed the net gradients at layer l + 1, namely δ^{l+1}. Since neuron z_j^l in layer l is connected to all of the neurons in layer l + 1 (except for the bias neuron z_0^{l+1}), to compute the net gradient at z_j^l, we have to account for the error from each neuron in layer l + 1, as follows:

δ_j^l = ∂Ex/∂netj = Σ_{k=1}^{n_{l+1}} (∂Ex/∂netk) · (∂netk/∂f^l(netj)) · (∂f^l(netj)/∂netj) = (∂f^l(netj)/∂netj) · Σ_{k=1}^{n_{l+1}} δ_k^{l+1} · w_jk^l

So the net gradient at z_j^l in layer l depends on the derivative of the activation function with respect to its netj, and the weighted sum of the net gradients from all the neurons z_k^{l+1} at the next layer l + 1.

We can compute the net gradients for all the neurons in level l in one step, as follows:

δ^l = ∂f^l ⊙ ( Wl · δ^{l+1} )    (25.53)

where ⊙ is the element-wise product, and ∂f^l is the vector of derivatives of the activation function with respect to its argument [Eq. (25.45)] at layer l.

For the commonly used activation functions at the hidden layer, using the derivatives from Section 25.1.1, we have

∂f^l =  1                    for linear
        z^l ⊙ (1 − z^l)       for sigmoid
        (1 − z^l ⊙ z^l)       for tanh

For ReLU, we have to apply Eq. (25.9) to each neuron. Note that softmax is generally not used for hidden layers.

The net gradients are computed recursively, starting from the output layer h + 1, then hidden layer h, and so on, until we finally compute the net gradients at the first hidden layer l = 1. That is,

δ^h = ∂f^h ⊙ ( Wh · δ^{h+1} )
δ^{h−1} = ∂f^{h−1} ⊙ ( W_{h−1} · δ^h ) = ∂f^{h−1} ⊙ ( W_{h−1} · ( ∂f^h ⊙ ( Wh · δ^{h+1} ) ) )
  ⋮
δ^1 = ∂f^1 ⊙ ( W1 · ( ∂f^2 ⊙ ( W2 · ··· ( ∂f^h ⊙ ( Wh · δ^{h+1} ) ) ··· ) ) )
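This recursion can be written compactly in NumPy as sketched below; it assumes the layer values z come from a feed-forward pass, W is the list of weight matrices W0, ···, Wh, the output net gradient delta_out = o − y has already been computed, and the hidden layers use sigmoid activations (other activations would only change the df line).

    import numpy as np

    def backprop_gradients(z, W, delta_out):
        # z[l] are the layer values; W[l] = Wl; returns per-layer gradients
        h = len(W) - 1                                  # number of hidden layers
        deltas = [None] * (h + 2)
        deltas[h + 1] = delta_out                       # net gradient at the output layer
        for l in range(h, 0, -1):                       # l = h, h-1, ..., 1
            df = z[l] * (1 - z[l])                      # ∂f^l for sigmoid hidden layers
            deltas[l] = df * (W[l] @ deltas[l + 1])     # Eq. (25.53)
        grad_W = [np.outer(z[l], deltas[l + 1]) for l in range(h + 1)]   # Eq. (25.47)
        grad_b = [deltas[l + 1] for l in range(h + 1)]                   # Eq. (25.48)
        return grad_W, grad_b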
25.4.5 Training Deep MLPs
Algorithm 25.2 shows the pseudo-code for learning the weights and biases for a deep MLP. The inputs comprise the dataset D, the number of hidden layers h, the step size or learning rate for gradient descent η, an integer threshold maxiter denoting the number of iterations for training, parameters n1, n2, ···, nh that denote the number of neurons (excluding bias, which will be added automatically) for each of the hidden layers l = 1, 2, ···, h, and the type of activation functions f^1, f^2, ···, f^{h+1} for each of the layers (other than the input layer that uses identity activations). The size of the input (d) and output (p) layers is determined directly from D.
The MLP first initializes the (nl × nl+1) weight matrices Wl between layers l and l + 1 with small values chosen uniformly at random, e.g., in the range [−0.01, 0.01]. The MLP considers each input pair (xi, yi) ∈ D, and computes the predicted response oi via the feed-forward process. The backpropagation phase begins by computing the error between oi and true response yi, and computing the net gradient vector δ^{h+1} at the output layer. These net gradients are backpropagated from layer h + 1 to layer h, from h to h − 1, and so on until we obtain the net gradients at the first hidden layer l = 1. These net gradients are used to compute the weight gradient matrix ∇Wl at layer l, which can in turn be used to update the weight matrix Wl. Likewise, the net gradients specify the bias gradient vector ∇bl at layer l, which is used to update bl. After each point has been used to update the weights, that completes one iteration or epoch of training. The training stops when maxiter epochs have been reached. On the other hand, during testing, for any input x, we apply the feed-forward steps and print the predicted output o.
Algorithm 25.2: Deep MLP Training: Stochastic Gradient Descent

DEEP-MLP-TRAINING (D, h, η, maxiter, n1, n2, ···, nh, f^1, f^2, ···, f^{h+1}):
1  n0 ← d // input layer size
2  nh+1 ← p // output layer size
   // Initialize weight matrices and bias vectors
3  for l = 0, 1, 2, ···, h do
4      bl ← random nl+1 vector with small values
5      Wl ← random nl × nl+1 matrix with small values
6  t ← 0 // iteration counter
7  repeat
8      foreach (xi, yi) ∈ D in random order do
           // Feed-Forward Phase
9          z0 ← xi
10         for l = 0, 1, 2, ···, h do
11             zl+1 ← f^{l+1}(bl + Wl^T · zl)
12         oi ← zh+1
           // Backpropagation Phase
13         if independent outputs then
14             δ^{h+1} ← ∂f^{h+1} ⊙ ∂Exi // net gradients at output
15         else
16             δ^{h+1} ← ∂F^{h+1} · ∂Exi // net gradients at output
17         for l = h, h − 1, ···, 1 do
18             δ^l ← ∂f^l ⊙ (Wl · δ^{l+1}) // net gradients at layer l
           // Gradient Descent Step
19         for l = 0, 1, ···, h do
20             ∇Wl ← zl · (δ^{l+1})^T // weight gradient matrix at layer l
21             ∇bl ← δ^{l+1} // bias gradient vector at layer l
22         for l = 0, 1, ···, h do
23             Wl ← Wl − η · ∇Wl // update Wl
24             bl ← bl − η · ∇bl // update bl
25     t ← t + 1
26 until t ≥ maxiter
It is important to note that Algorithm 25.2 follows a stochastic gradient descent approach, since the points are considered in random order, and the weight and bias gradients are computed after observing each training point. In practice, it is common to update the gradients by considering a fixed-size subset of the training points called a minibatch instead of using single points. That is, the training data is divided into minibatches using an additional parameter called batch size, and a gradient descent step is performed after computing the bias and weight gradient from each minibatch. This helps better estimate the gradients, and also allows vectorized matrix operations over the minibatch of points, which can lead to faster convergence and substantial speedups in the learning.
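For instance, a single minibatch gradient descent step might be sketched as follows; here Z[l] stacks the layer-l values for the B points of the batch (one row per point), Deltas[l+1] holds the corresponding net gradients, and the outer products of Eqs. (25.47) and (25.48) are averaged over the batch. The variable names are illustrative choices, not part of the pseudo-code.

    import numpy as np

    def minibatch_step(Z, Deltas, W, b, eta):
        # Z[l]: (B, n_l) layer values; Deltas[l+1]: (B, n_{l+1}) net gradients
        B = Z[0].shape[0]
        for l in range(len(W)):
            grad_W = Z[l].T @ Deltas[l + 1] / B     # average of the outer products z^l (δ^{l+1})^T
            grad_b = Deltas[l + 1].mean(axis=0)     # average bias gradient
            W[l] -= eta * grad_W
            b[l] -= eta * grad_b
        return W, b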

One caveat while training very deep MLPs is the problem of vanishing and exploding gradients. In the vanishing gradient problem, the norm of the net gradient can decay exponentially with the distance from the output layer, that is, as we backpropagate the gradients from the output layer to the input layer. In this case the network will learn extremely slowly, if at all, since the gradient descent method will make minuscule changes to the weights and biases. On the other hand, in the exploding gradient problem, the norm of the net gradient can grow exponentially with the distance from the output layer. In this case, the weights and biases will become exponentially large, resulting in a failure to learn. The gradient explosion problem can be mitigated to some extent by gradient thresholding, that is, by resetting the value if it exceeds an upper bound. The vanishing gradients problem is more difficult to address. Typically sigmoid activations are more susceptible to this problem, and one solution is to use alternative activation functions such as ReLU. In general, recurrent neural networks, which are deep neural networks with feedback connections, are more prone to vanishing and exploding gradients; we will revisit these issues in Section 26.2.
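Gradient thresholding can be implemented by rescaling a gradient whenever its norm exceeds a chosen bound; a simple sketch, with the threshold value chosen arbitrarily, is:

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        # rescale grad so that its L2 norm does not exceed max_norm
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad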
Example 25.5 (Deep MLP). We now examine deep MLPs for predicting the labels for the MNIST handwritten images dataset that we considered in Example 25.4. Recall that this dataset has n = 60000 grayscale images of size 28 × 28 that we treat as d = 784 dimensional vectors. The pixel values between 0 and 255 are converted to the range 0 and 1 by dividing each value by 255. The target response vector is a one-hot encoded vector for class labels {0,1,…,9}. Thus, the input to the MLP xi has dimensionality d = 784, and the output layer has dimensionality p = 10. We use softmax activation for the output layer. We use ReLU activation for the hidden layers, and consider several deep models with different number and sizes of the hidden layers. We use step size η = 0.3 and train for t = 15 epochs. Training was done using minibatches, using batch size of 1000.
During the training of each of the deep MLPs, we evaluate its performance on the separate MNIST test dataset that contains 10,000 images. Figure 25.12 plots the number of errors after each epoch for the different deep MLP models.

Figure 25.12. MNIST: Deep MLPs; prediction error as a function of epochs.

The final test error at the end of training is given as
hidden layers                                errors
n1 = 392                                     396
n1 = 196, n2 = 49                            303
n1 = 392, n2 = 196, n3 = 49                  290
n1 = 392, n2 = 196, n3 = 98, n4 = 49         278

We can observe that as we increase the number of layers, we do get performance improvements. The deep MLP with four hidden layers of sizes n1 = 392, n2 = 196, n3 = 98, n4 = 49 results in an error rate of 2.78% on the test set, whereas the MLP with a single hidden layer of size n1 = 392 has an error rate of 3.96%. Thus, the deeper MLP significantly improves the prediction accuracy. However, adding more layers does not reduce the error rate, and can also lead to performance degradation.
25.5 FURTHER READING
Artificial neural networks have their origin in the work of McCulloch and Pitts (1943). The first application of a single neuron, called a perceptron, to supervised learning was by Rosenblatt (1958). Minsky and Papert (1969) pointed out limitations of perceptrons, which were not overcome until the development of the backpropagation algorithm, which was introduced by Rumelhart, Hinton, and Williams (1986) to train general multilayer perceptrons.
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5 (4), 115–133.
Minsky, M. and Papert, S. (1969). Perceptron: An Introduction to Computational Geometry. Cambridge, MA: The MIT Press.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65 (6), 386.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323 (6088), 533.
25.6 EXERCISES
Q1. Consider the neural network in Figure 25.13. Let bias values be fixed at 0, and let the weight matrices between the input and hidden, and hidden and output layers, respectively, be:
W = (w1, w2, w3) = (1, 1, −1)        W′ = (w1′, w2′, w3′)^T = (0.5, 1, 2)^T
Assume that the hidden layer uses ReLU, whereas the output layer uses sigmoid activation. Assume SSE error. Answer the following questions, when the input is x = 4 and the true response is y = 0:

Figure 25.13. Neural network for Q1: input x is connected to hidden neurons z1, z2, z3 via weights w1, w2, w3, which are connected to output o via weights w1′, w2′, w3′.

(a) Use forward propagation to compute the predicted output.
(b) What is the loss or error value?
(c) Compute the net gradient vector δo for the output layer.
(d) Compute the net gradient vector δh for the hidden layer.
(e) Compute the weight gradient matrix ∇W′ between the hidden and output layers.
(f) Compute the weight gradient matrix ∇W between the input and hidden layers.
Q2. Show that the derivative of the sigmoid function [Eq. (25.5)] with respect to its argument is given as

∂f(z)/∂z = f(z) · (1 − f(z))

Q3. Show that the derivative of the hyperbolic tangent function [Eq. (25.6)] with respect to its argument is given as

∂f(z)/∂z = 1 − f(z)²

Q4. Show that the derivative of the softmax function is given as

∂f(zi | z)/∂zj =  f(zi) · (1 − f(zi))   if j = i
                  −f(zi) · f(zj)        if j ≠ i

where z = {z1, z2, ···, zp}.

Q5. Derive an expression for the net gradient vector at the output neurons when using softmax activation with the squared error function.
Q6. Show that if the weight matrix and bias vectors are initialized to zero, then all neurons in a given layer will have identical values in each iteration.

Q7. Prove that with linear activation functions, a multilayered network is equivalent to a single-layered neural network.

Q8. Compute the expression for the net gradient vector at the output layer, δ^{h+1}, assuming cross-entropy error, Ex = −Σ_{i=1}^K yi · ln(oi), with K independent binary output neurons that use sigmoid activation, that is, oi = sigmoid(neti).

Q9. Given an MLP with one hidden layer, derive the equations for δh and δo using vector derivatives, i.e., by computing ∂Ex/∂neth and ∂Ex/∂neto, where neth and neto are the net input vectors at the hidden and output layers.

CHAPTER 26 Deep Learning
In this chapter, we first look at deep neural networks that include feedback from one layer to another. Such networks are called recurrent neural networks (RNNs), and they can typically be trained by unfolding (or unrolling) the recurrent connections, resulting in deep networks whose parameters can be learned via the backpropagation algorithm. Since RNNs are prone to the vanishing and exploding gradients problem, we next consider gated RNNs that introduce a new type of layer that endows the network with the ability to selectively read, store, and write the hidden state via an internal memory layer. Gated RNNs are highly effective for prediction tasks on sequence inputs. Finally, we consider convolutional neural networks (CNNs) that are deep MLPs that exploit spatial or temporal relationships between different elements of each layer to construct a hierarchy of features that can be used for regression or classification tasks. Unlike regular MLPs whose layers are fully connected, CNNs have layers that are localized and sparse. In particular, CNNs are highly effective for image inputs.
26.1 RECURRENT NEURAL NETWORKS
Multilayer perceptrons are feed-forward networks in which the information flows in only one direction, namely from the input layer to the output layer via the hidden layers. In contrast, recurrent neural networks (RNNs) are dynamically driven (e.g., temporal), with a feedback loop between two (or more) layers, which makes such networks ideal for learning from sequence data. For example, Figure 26.1 shows a simple RNN where there is a feedback loop from the hidden layer ht to itself via a temporal delay of one time unit, denoted by the −1 on the loop.
Let X = ⟨x1,x2,··· ,xτ ⟩ denote a sequence of vectors, where xt ∈ Rd is a d-dimensional vector (t = 1,2,··· ,τ). Thus, X is an input sequence of length τ, with xt denoting the input at time step t. Let Y = ⟨y1,y2,··· ,yτ ⟩ denote a sequence of vectors, with yt ∈ Rp a p-dimensional vector. Here, Y is the desired target or response sequence, with yt denoting the response vector at time t. Finally, let O = ⟨o1,o2,··· ,oτ ⟩ denote the predicted or output sequence from the RNN. Here ot ∈ Rp is also a p-dimensional vector to match the corresponding true response yt . The task of an RNN is to learn a function that predicts the target sequence Y given the input sequence X . That is, the
predicted output o_t on input x_t should be similar or close to the target response y_t, for each time point t.

Figure 26.1. Recurrent neural network. Recurrent connection shown in gray.
To learn dependencies between elements of the input sequence, an RNN maintains a sequence of m-dimensional hidden state vectors ht ∈ Rm , where ht captures the essential features of the input sequences up to time t, as illustrated in Figure 26.1. The hidden vector ht at time t depends on the input vector xt at time t and the previous hidden state vector ht −1 from time t − 1, and it is computed as follows:
h_t = f^h(W_i^T x_t + W_h^T h_{t−1} + b_h)        (26.1)
Here, f h is the hidden state activation function, typically tanh or ReLU. Also, we need an initial hidden state vector h0 that serves as the prior state to compute h1. This is usually set to the zero vector, or seeded from a prior RNN prediction step. The matrix Wi ∈ Rd×m specifies the weights between the input vectors and the hidden state vectors. The matrix Wh ∈ Rm×m specifies the weight matrix between the hidden state vectors at time t − 1 and t , with bh ∈ Rm specifying the bias terms associated with the hidden states. Note that we need only one bias vector bh associated with the hidden state neurons; we do not need a separate bias vector between the input and hidden neurons.
Given the hidden state vector at time t, the output vector ot at time t is computed as follows:
o_t = f^o(W_o^T h_t + b_o)        (26.2)
Here, Wo ∈ Rm×p specifies the weights between the hidden state and output vectors, with bias vector bo . The output activation function f o typically uses linear or identity activation, or a softmax activation for one-hot encoded categorical output values.
It is important to note that all the weight matrices and bias vectors are independent of the time t. For example, for the hidden layer, the same weight matrix Wh and bias vector bh is used and updated while training the model, over all time steps t. This is an example of parameter sharing or weight tying between different layers or components of a neural network. Likewise, the input weight matrix Wi , the output weight matrix Wo and the bias vector bo are all shared across time. This greatly reduces the number of parameters that need to be learned by the RNN, but it also relies on the assumption that all relevant sequential features can be captured by the shared parameters.
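To make the shapes in Eq. (26.1) and Eq. (26.2) concrete, the following is a minimal NumPy sketch (not from the text) of a forward pass over one input sequence; it assumes tanh for f^h and identity output activation, and the names rnn_forward, Wi, Wh, Wo, bh, bo simply mirror the notation above.

import numpy as np

def rnn_forward(X, Wi, Wh, Wo, bh, bo, h0=None):
    """Run an RNN forward over a sequence X of shape (tau, d).

    Wi: (d, m), Wh: (m, m), Wo: (m, p), bh: (m,), bo: (p,).
    Returns the hidden states H (tau, m) and outputs O (tau, p).
    """
    tau, d = X.shape
    m = Wh.shape[0]
    h = np.zeros(m) if h0 is None else h0          # initial hidden state h_0
    H, O = [], []
    for t in range(tau):
        # Eq. (26.1): h_t = f^h(Wi^T x_t + Wh^T h_{t-1} + b_h), with f^h = tanh
        h = np.tanh(Wi.T @ X[t] + Wh.T @ h + bh)
        # Eq. (26.2): o_t = f^o(Wo^T h_t + b_o), with identity f^o
        o = Wo.T @ h + bo
        H.append(h)
        O.append(o)
    return np.array(H), np.array(O)

# Example usage with random parameters
rng = np.random.default_rng(0)
d, m, p, tau = 4, 3, 2, 5
X = rng.normal(size=(tau, d))
H, O = rnn_forward(X, rng.normal(size=(d, m)), rng.normal(size=(m, m)),
                   rng.normal(size=(m, p)), np.zeros(m), np.zeros(p))
print(H.shape, O.shape)   # (5, 3) (5, 2)

Note how the same Wi, Wh, Wo, bh, bo are reused at every time step, which is exactly the parameter sharing described above.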
Figure 26.1 shows an illustration of an RNN, with a recurrent hidden layer ht , denoted by the feed-back loop with a time delay of −1 noted on the loop. The figure also shows the shared parameters between the layers. Figure 26.2(a) shows the same
RNN in vertical format, where the layers are stacked vertically, and Figure 26.2(b) shows the RNN with the layers unfolded in time; it shows the input (x_t), hidden (h_t), and output (o_t) layers at each time step t. We can observe that the feed-back loop has been unfolded for τ time steps, starting with t = 1 and ending at t = τ. The time step t = 0 is used to denote the previous or initial hidden state h_0. We can also explicitly observe the parameter sharing across time, since all weight matrices and bias vectors are independent of t. Figure 26.2(b) makes it clear that an RNN is a deep neural network with τ layers, where τ is the maximum input sequence length.

Figure 26.2. RNN unfolded in time: (a) the RNN in vertical format; (b) the RNN unfolded in time.
The training data for the RNN is given as D = {(X_i, Y_i)}_{i=1}^{n}, comprising n input sequences X_i and the corresponding target response sequences Y_i, with sequence length τ_i. Given each pair (X, Y) ∈ D, with X = ⟨x_1, x_2, ···, x_τ⟩ and Y = ⟨y_1, y_2, ···, y_τ⟩, the RNN has to update the model parameters W_i, W_h, b_h, W_o, b_o for the input, hidden and output layers, to learn the corresponding output sequence O = ⟨o_1, o_2, ···, o_τ⟩. For training the network, we compute the error or loss between the predicted and response vectors over all time steps. For example, the squared error loss is given as

E_X = Σ_{t=1}^{τ} E_{x_t} = (1/2) · Σ_{t=1}^{τ} ∥y_t − o_t∥²
On the other hand, if we use a softmax activation at the output layer, then we use the
cross-entropy loss, given as
E_X = Σ_{t=1}^{τ} E_{x_t} = − Σ_{t=1}^{τ} Σ_{i=1}^{p} y_{ti} · ln(o_{ti})
where yt = (yt1,yt2,··· ,ytp)T ∈ Rp and ot = (ot1,ot2,··· ,otp)T ∈ Rp. On training input of length τ we first unfold the RNN for τ steps, following which the parameters can be learned via the standard feed-forward and backpropagation steps, keeping in mind the connections between the layers.

26.1.1 Feed-forward in Time
The feed-forward process starts at time t = 0, taking as input the initial hidden state vector h0, which is usually set to 0 or it can be user-specified, say from a previous prediction step. Given the current set of parameters, we predict the output ot at each time step t = 1,2,··· ,τ via Eq. (26.1) and Eq. (26.2):
o_t = f^o( W_o^T h_t + b_o )
    = f^o( W_o^T f^h(W_i^T x_t + W_h^T h_{t−1} + b_h) + b_o )
    ⋮
    = f^o( W_o^T f^h( W_i^T x_t + W_h^T f^h( ··· f^h(W_i^T x_1 + W_h^T h_0 + b_h) + ··· ) + b_h ) + b_o )

We can observe that the RNN implicitly makes a prediction for every prefix of the input sequence, since o_t depends on all the previous input vectors x_1, x_2, ···, x_t, but not on any future inputs x_{t+1}, ···, x_τ.
26.1.2 Backpropagation in Time
Once the feed-forward phase computes the output sequence O = ⟨o1 , o2 , · · · , oτ ⟩, we can compute the error in the predictions using the squared error (or cross-entropy) loss function, which can in turn be used to compute the net gradient vectors that are backpropagated from the output layers to the input layers for each time step.
For the backpropagation step it is easier to view the RNN in terms of the distinct layers based on the dependencies, as opposed to unfolding in time. Figure 26.3(a) shows the RNN unfolded using this layer view as opposed to the time view shown in Figure 26.2(b). The first layer is l = 0, comprising the hidden state h_0 and the input vector x_1, both of which are required to compute h_1 in layer l = 1, which also includes the input vector x_2. In turn, both h_1 and x_2 are required to compute h_2 in layer l = 2, and so on for other layers. Note also that o_1 is not output until layer l = 2, since we can compute the output only once h_1 has been computed in the previous layer. The layer view is essentially indexed by the hidden state vector index, except the final layer l = τ + 1 that outputs o_τ, which depends on h_τ from layer l = τ.

Let E_{x_t} denote the loss on input vector x_t from the input sequence X = ⟨x_1, x_2, ···, x_τ⟩. The unfolded feed-forward RNN for X has l = τ + 1 layers, as shown in Figure 26.3(a). Define δ^o_t as the net gradient vector for the output vector o_t, i.e., the derivative of the error function E_{x_t} with respect to the net value at each neuron in o_t, given as

δ^o_t = ( ∂E_{x_t}/∂net^o_{t1}, ∂E_{x_t}/∂net^o_{t2}, ···, ∂E_{x_t}/∂net^o_{tp} )^T

where o_t = (o_{t1}, o_{t2}, ···, o_{tp})^T ∈ R^p is the p-dimensional output vector at time t, and net^o_{ti} is the net value at output neuron o_{ti} at time t. Likewise, let δ^h_t denote the net gradient

vector for the hidden state neurons h_t at time t:

δ^h_t = ( ∂E_{x_t}/∂net^h_{t1}, ∂E_{x_t}/∂net^h_{t2}, ···, ∂E_{x_t}/∂net^h_{tm} )^T

where h_t = (h_{t1}, h_{t2}, ···, h_{tm})^T ∈ R^m is the m-dimensional hidden state vector at time t, and net^h_{ti} is the net value at hidden neuron h_{ti} at time t. Let f^h and f^o denote the activation functions for the hidden state and output neurons, and let ∂f^h_t and ∂f^o_t denote the vectors of the derivatives of the activation function with respect to the net signal (that is, its argument) for the hidden and output neurons at time t, given as

∂f^h_t = ( ∂f^h(net^h_{t1})/∂net^h_{t1}, ∂f^h(net^h_{t2})/∂net^h_{t2}, ···, ∂f^h(net^h_{tm})/∂net^h_{tm} )^T

∂f^o_t = ( ∂f^o(net^o_{t1})/∂net^o_{t1}, ∂f^o(net^o_{t2})/∂net^o_{t2}, ···, ∂f^o(net^o_{tp})/∂net^o_{tp} )^T

Finally, let ∂E_{x_t} denote the vector of partial derivatives of the error function with respect to o_t:

∂E_{x_t} = ( ∂E_{x_t}/∂o_{t1}, ∂E_{x_t}/∂o_{t2}, ···, ∂E_{x_t}/∂o_{tp} )^T

Figure 26.3. RNN unfolded as layers: (a) feed-forward step; (b) backpropagation step.
Computing Net Gradients
The key step in backpropagation is to compute the net gradients in reverse order, starting from the output neurons to the input neurons via the hidden neurons. Given the layer view in Figure 26.3(a), the backpropagation step reverses the flow direction for computing the net gradients δot and δht , as shown in the backpropagation graph in Figure 26.3(b).
In particular, the net gradient vector at the output o_t can be computed as follows:

δ^o_t = ∂f^o_t ⊙ ∂E_{x_t}        (26.3)

where ⊙ is the element-wise or Hadamard product. For example, if E_{x_t} is the squared error function, and the output layer uses the identity function, then via Eq. (25.8) and Eq. (25.15) we have

δ^o_t = 1 ⊙ (o_t − y_t)

On the other hand, the net gradients at each of the hidden layers need to account for the incoming net gradients from o_t and from h_{t+1}, as seen in Figure 26.3(b). Thus, generalizing Eq. (25.53), the net gradient vector for h_t (for t = 1, 2, ..., τ − 1) is given as

δ^h_t = ∂f^h_t ⊙ ( (W_o · δ^o_t) + (W_h · δ^h_{t+1}) )        (26.4)

Note that h_τ depends only on o_τ (see Figure 26.3(b)), therefore

δ^h_τ = ∂f^h_τ ⊙ (W_o · δ^o_τ)

For the tanh activation, which is commonly used in RNNs, the derivative of the activation function (see Eq. (25.11)) with respect to the net values at h_t is given as

∂f^h_t = (1 − h_t ⊙ h_t)
Finally, note that the net gradients do not have to be computed for h0 or for any of the input neurons xt , since these are leaf nodes in the backpropagation graph, and thus do not backpropagate the gradients beyond those neurons.
Stochastic Gradient Descent
The net gradients for the output δot and hidden neurons δht at time t can be used to compute the gradients for the weight matrices and bias vectors at each time point.

However, since an RNN uses parameter sharing across time, the gradients are obtained by summing up all of the contributions from each time step t. Define ∇_t W_o and ∇_t b_o as the gradients of the weights and biases between the hidden neurons h_t and output neurons o_t for time t. Using the backpropagation equations [Eq. (25.47)] and [Eq. (25.48)] for deep multilayer perceptrons, these gradients are computed as follows:

∇_{b_o} = Σ_{t=1}^{τ} ∇_t b_o = Σ_{t=1}^{τ} δ^o_t        ∇_{W_o} = Σ_{t=1}^{τ} ∇_t W_o = Σ_{t=1}^{τ} h_t · (δ^o_t)^T

Likewise, the gradients of the other shared parameters between hidden layers h_{t−1} and h_t, and between the input layer x_t and hidden layer h_t, are obtained as follows:

∇_{b_h} = Σ_{t=1}^{τ} ∇_t b_h = Σ_{t=1}^{τ} δ^h_t        ∇_{W_h} = Σ_{t=1}^{τ} ∇_t W_h = Σ_{t=1}^{τ} h_{t−1} · (δ^h_t)^T        ∇_{W_i} = Σ_{t=1}^{τ} ∇_t W_i = Σ_{t=1}^{τ} x_t · (δ^h_t)^T
where ∇tWh and ∇tbh are the gradient contributions from time t for the weights and biases for the hidden neurons, and ∇tWi the gradient contributions for the weights for the input neurons. Finally, we update all the weight matrices and bias vectors as follows
W_i = W_i − η·∇_{W_i}        W_h = W_h − η·∇_{W_h}        b_h = b_h − η·∇_{b_h}
W_o = W_o − η·∇_{W_o}        b_o = b_o − η·∇_{b_o}                (26.5)

where η is the gradient step size (or learning rate).

26.1.3 Training RNNs

Algorithm 26.1 shows the pseudo-code for learning the weights and biases for an RNN. The inputs comprise the dataset D = {(X_i, Y_i)}_{i=1,···,n}, the step size for gradient descent η, an integer threshold maxiter denoting the number of iterations for training, the size of the hidden state vectors m, and the activation functions f^o and f^h for the output and hidden layers. The size of the input (d) and output (p) layers is determined directly from D. For simplicity we assume that all inputs X_i have the same length τ, which determines the number of layers in the RNN. It is relatively easy to handle variable length input sequences by unrolling the RNN for different input lengths τ_i.

The RNN first initializes the weight matrices and bias vectors with random values drawn uniformly from a small range, e.g., [−0.01, 0.01]. The RNN considers each input pair (X, Y) ∈ D, and computes the predicted output o_t for each time step via the feed-forward process. The backpropagation phase begins by computing the error between o_t and the true response y_t, and then the net gradient vector δ^o_t at the output layer for each time step t. These net gradients are backpropagated from the output layers at time t to the hidden layers at time t, which are in turn used to compute the net gradients δ^h_t at the hidden layers for all time steps t = 1, 2, ···, τ. Note that Line 15 shows the case where the output layer neurons are independent; if they are not independent we can replace it by ∂F^o · ∂E_{x_t} (see Eq. (25.51)). Next, we compute the weight gradient

matrices, ∇_{W_i}, ∇_{W_h}, and ∇_{W_o}, and bias gradient vectors, ∇_{b_h} and ∇_{b_o}. These gradients are used to update the weights and biases via stochastic gradient descent. After each point has been used to update the weights, that completes one epoch of training. The training stops when maxiter epochs have been reached.

Algorithm 26.1: RNN Training: Stochastic Gradient Descent

RNN-TRAINING (D, η, maxiter, m, f^o, f^h):
// Initialize bias vectors
1  b_h ← random m-dimensional vector with small values
2  b_o ← random p-dimensional vector with small values
// Initialize weight matrices
3  W_i ← random d × m matrix with small values
4  W_h ← random m × m matrix with small values
5  W_o ← random m × p matrix with small values
6  r ← 0 // iteration counter
7  repeat
8    foreach (X, Y) ∈ D in random order do
9      τ ← |X| // length of training sequence
       // Feed-Forward Phase
10     h_0 ← 0 ∈ R^m // initialize hidden state
11     for t = 1, 2, ..., τ do
12       h_t ← f^h(W_i^T x_t + W_h^T h_{t−1} + b_h)
13       o_t ← f^o(W_o^T h_t + b_o)
       // Backpropagation Phase
14     for t = τ, τ − 1, ..., 1 do
15       δ^o_t ← ∂f^o_t ⊙ ∂E_{x_t} // net gradients at output
16     δ^h_τ ← ∂f^h_τ ⊙ (W_o · δ^o_τ) // net gradients at h_τ
17     for t = τ − 1, τ − 2, ..., 1 do
18       δ^h_t ← ∂f^h_t ⊙ ( (W_o · δ^o_t) + (W_h · δ^h_{t+1}) ) // net gradients at h_t
       // Gradients of weight matrices and bias vectors
19     ∇_{b_o} ← Σ_{t=1}^{τ} δ^o_t;   ∇_{W_o} ← Σ_{t=1}^{τ} h_t · (δ^o_t)^T
20     ∇_{b_h} ← Σ_{t=1}^{τ} δ^h_t;   ∇_{W_h} ← Σ_{t=1}^{τ} h_{t−1} · (δ^h_t)^T;   ∇_{W_i} ← Σ_{t=1}^{τ} x_t · (δ^h_t)^T
       // Gradient Descent Step
21     b_o ← b_o − η·∇_{b_o};   W_o ← W_o − η·∇_{W_o}
22     b_h ← b_h − η·∇_{b_h};   W_h ← W_h − η·∇_{W_h};   W_i ← W_i − η·∇_{W_i}
23   r ← r + 1
24 until r ≥ maxiter
Note that, whereas Algorithm 26.1 shows the pseudo-code for stochastic gradient descent, in practice, RNNs are trained using subsets or minibatches of input sequences instead of single sequences. This helps to speed up the computation and convergence of gradient descent, since minibatches provide better estimates of the bias and weight gradients and allow the use of vectorized operations.
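As a concrete illustration of Algorithm 26.1, the following is a minimal NumPy sketch (an assumption of this presentation, not code from the text) of one stochastic gradient step on a single sequence, using tanh hidden activation, identity output activation, and squared error, so that δ^o_t = o_t − y_t.

import numpy as np

def rnn_sgd_step(X, Y, Wi, Wh, Wo, bh, bo, eta=0.1):
    """One SGD step of Algorithm 26.1 on a single (X, Y) pair.

    X: (tau, d) inputs, Y: (tau, p) targets; tanh hidden units,
    identity outputs, squared error, so delta_o[t] = o_t - y_t.
    Parameters are float arrays and are updated in place.
    """
    tau, d = X.shape
    m = Wh.shape[0]
    # Feed-forward phase (lines 10-13)
    H = np.zeros((tau + 1, m))            # H[0] is h_0 = 0
    O = np.zeros((tau, Y.shape[1]))
    for t in range(tau):
        H[t + 1] = np.tanh(Wi.T @ X[t] + Wh.T @ H[t] + bh)
        O[t] = Wo.T @ H[t + 1] + bo
    # Backpropagation phase (lines 14-18)
    dO = O - Y                            # net gradients at outputs, Eq. (26.3)
    dH = np.zeros((tau, m))
    dH[tau - 1] = (1 - H[tau] ** 2) * (Wo @ dO[tau - 1])
    for t in range(tau - 2, -1, -1):      # Eq. (26.4)
        dH[t] = (1 - H[t + 1] ** 2) * (Wo @ dO[t] + Wh @ dH[t + 1])
    # Gradients of weights and biases (lines 19-20)
    grad_bo = dO.sum(axis=0)
    grad_Wo = sum(np.outer(H[t + 1], dO[t]) for t in range(tau))
    grad_bh = dH.sum(axis=0)
    grad_Wh = sum(np.outer(H[t], dH[t]) for t in range(tau))
    grad_Wi = sum(np.outer(X[t], dH[t]) for t in range(tau))
    # Gradient descent step (lines 21-22)
    Wi -= eta * grad_Wi; Wh -= eta * grad_Wh; bh -= eta * grad_bh
    Wo -= eta * grad_Wo; bo -= eta * grad_bo
    return Wi, Wh, Wo, bh, bo

In practice this step would be looped over epochs, and over minibatches of sequences rather than single sequences, as noted above.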

Figure 26.4. Reber grammar automata.
Example 26.1 (RNN). We use an RNN to learn the Reber grammar, which is generated according to the automata shown in Figure 26.4. Let Σ = {B, E, P, S, T, V, X} denote the alphabet comprising the seven symbols, and let $ denote a terminal symbol. Starting from the initial node, we can generate strings that follow the Reber grammar by emitting the symbols on the edges. If there are two transitions out of a node, each one can be chosen with equal probability. The sequence ⟨B,T,S,S,X,X,T,V,V,E⟩ is a valid Reber sequence (with the corresponding state sequence ⟨0,1,2,2,2,4,3,3,5,6,7⟩). On the other hand, the sequence ⟨B,P,T,X,S,E⟩ is not a valid Reber sequence, since there is no edge out of state 3 with the symbol X.

The task of the RNN is to learn to predict the next symbol for each of the positions in a given Reber sequence. For training, we generate Reber sequences from the automata. Let S_X = ⟨s_1, s_2, ···, s_τ⟩ be a Reber sequence. The corresponding true output Y is then given as the set of next symbols from each of the edges leaving the state corresponding to each position in S_X. For example, consider the Reber sequence S_X = ⟨B,P,T,V,V,E⟩, with the state sequence π = ⟨0,1,3,3,5,6,7⟩. The desired output sequence is then given as S_Y = {P|T, T|V, T|V, P|V, E, $}, where $ is the terminal symbol. Here, P|T denotes that the next symbol can be either P or T. We can see that S_Y comprises the sequence of possible next symbols from each of the states in π (excluding the start state 0).

To generate the training data for the RNN, we have to convert the symbolic Reber strings into numeric vectors. We do this via a binary encoding of the symbols, as follows:

B: (1,0,0,0,0,0,0)^T    E: (0,1,0,0,0,0,0)^T    P: (0,0,1,0,0,0,0)^T    S: (0,0,0,1,0,0,0)^T
T: (0,0,0,0,1,0,0)^T    V: (0,0,0,0,0,1,0)^T    X: (0,0,0,0,0,0,1)^T    $: (0,0,0,0,0,0,0)^T

That is, each symbol is encoded by a 7-dimensional binary vector, with a 1 in the column corresponding to its position in the ordering of symbols in Σ. The terminal symbol $ is not part of the alphabet, and therefore its encoding is all 0's. Finally, to encode the possible next symbols, we follow a similar binary encoding with a 1 in the column corresponding to the allowed symbols. For example, the choice P|T is encoded as (0,0,1,0,1,0,0)^T. Thus, the Reber sequence S_X and the desired output sequence S_Y are encoded as:
            X                                 Y
Σ     x_1 x_2 x_3 x_4 x_5 x_6        y_1  y_2  y_3  y_4  y_5  y_6
      B   P   T   V   V   E          P|T  T|V  T|V  P|V  E    $
B     1   0   0   0   0   0          0    0    0    0    0    0
E     0   0   0   0   0   1          0    0    0    0    1    0
P     0   1   0   0   0   0          1    0    0    1    0    0
S     0   0   0   0   0   0          0    0    0    0    0    0
T     0   0   1   0   0   0          1    1    1    0    0    0
V     0   0   0   1   1   0          0    1    1    1    0    0
X     0   0   0   0   0   0          0    0    0    0    0    0
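As an aside, the binary encoding above is straightforward to generate programmatically; the following is a small sketch (with hypothetical helper names SIGMA and encode, not from the text) that reproduces the two tables for the sequence B P T V V E. The terminal symbol is encoded by the empty set of allowed symbols, which maps to the all-zeros vector.

import numpy as np

SIGMA = ["B", "E", "P", "S", "T", "V", "X"]        # alphabet ordering used above

def encode(symbols):
    """Binary encoding of a set of allowed symbols as a 7-dimensional vector;
    the empty set encodes the terminal symbol as all zeros."""
    v = np.zeros(len(SIGMA), dtype=int)
    for s in symbols:
        v[SIGMA.index(s)] = 1
    return v

# Input encoding of the Reber string B P T V V E and its target symbol sets
X = np.array([encode([s]) for s in "BPTVVE"])
Y = np.array([encode(s) for s in (["P", "T"], ["T", "V"], ["T", "V"], ["P", "V"], ["E"], [])])
print(X.T)   # columns x_1..x_6 of the table above
print(Y.T)   # columns y_1..y_6 of the table above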
For training, we generate n = 400 Reber sequences with a minimum length of 30. The maximum sequence length is τ = 52. Each of these Reber sequences is used to create a training pair (X, Y) as described above. Next, we train an RNN with m = 4 hidden neurons using tanh activation. The input and output layer sizes are determined by the dimensionality of the encoding, namely d = 7 and p = 7. We use a sigmoid activation at the output layer, treating each neuron as independent. We use the binary cross entropy error function. The RNN is trained for r = 10000 epochs, using gradient step size η = 1 and the entire set of 400 input sequences as the batch size. The RNN model learns the training data perfectly, making no errors in the prediction of the set of possible next symbols.
We test the RNN model on 100 previously unseen Reber sequences (with minimum length 30, as before). The RNN makes no errors on the test sequences. On the other hand, we also trained an MLP with a single hidden layer, with size m varying between 4 and 100. Even after r = 10000 epochs, the MLP is not able to correctly predict any of the output sequences perfectly. It makes 2.62 mistakes on average per sequence for both the training and testing data. Increasing the number of epochs or the number of hidden layers does not improve the MLP performance.
26.1.4 Bidirectional RNNs
An RNN makes use of a hidden state h_t that depends on the previous hidden state h_{t−1} and the current input x_t at time t. In other words, it only looks at information from the past. A bidirectional RNN (BRNN), as shown in Figure 26.5, extends the RNN model to also include information from the future. In particular, a BRNN maintains a backward hidden state vector b_t ∈ R^m that depends on the next backward hidden state b_{t+1} and the current input x_t. The output at time t is a function of both h_t and b_t. In

particular, we compute the forward and backward hidden state vectors as follows:

h_t = f^h(W_{ih}^T x_t + W_h^T h_{t−1} + b_h)
b_t = f^b(W_{ib}^T x_t + W_b^T b_{t+1} + b_b)        (26.6)

Figure 26.5. Bidirectional RNN: Unfolded in time.
Also, a BRNN needs two initial state vectors h0 and bτ+1 to compute h1 and bτ, respectively. These are usually set to 0 ∈ Rm. The forward and backward hidden states are computed independently, with the forward hidden states computed by considering the input sequence in the forward direction x1,x2,··· ,xτ , and with the backward hidden states computed by considering the sequence in reverse order xτ,xτ−1,···,x1. The output at time t is computed only when both ht and bt are available, and is given as
o_t = f^o(W_{ho}^T h_t + W_{bo}^T b_t + b_o)

It is clear that BRNNs need the complete input before they can compute the output. We can also view a BRNN as having two sets of input sequences, namely the forward input sequence X = ⟨x_1, x_2, ···, x_τ⟩ and the reversed input sequence X_r = ⟨x_τ, x_{τ−1}, ···, x_1⟩, with the corresponding hidden states h_t and b_t, which together determine the output o_t. Thus, a BRNN is comprised of two “stacked” RNNs with independent hidden layers that jointly determine the output.
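The following is a minimal NumPy sketch (not from the text) of the BRNN forward pass in Eq. (26.6), assuming tanh for both hidden activations and identity output activation; the weight names mirror the notation above.

import numpy as np

def brnn_forward(X, Wih, Wh, bh, Wib, Wb, bb, Who, Wbo, bo):
    """Forward pass of a bidirectional RNN on X of shape (tau, d), per Eq. (26.6).

    Forward states use (Wih, Wh, bh), backward states use (Wib, Wb, bb);
    outputs combine both via (Who, Wbo, bo) with identity activation.
    """
    tau = X.shape[0]
    m = Wh.shape[0]
    H = np.zeros((tau + 1, m))               # H[t] = h_t, with H[0] = h_0 = 0
    B = np.zeros((tau + 2, m))               # B[t] = b_t, with B[tau+1] = 0
    for t in range(1, tau + 1):              # forward direction x_1, ..., x_tau
        H[t] = np.tanh(Wih.T @ X[t - 1] + Wh.T @ H[t - 1] + bh)
    for t in range(tau, 0, -1):              # reverse direction x_tau, ..., x_1
        B[t] = np.tanh(Wib.T @ X[t - 1] + Wb.T @ B[t + 1] + bb)
    # o_t = f^o(Who^T h_t + Wbo^T b_t + b_o), computed once both states exist
    O = np.array([Who.T @ H[t] + Wbo.T @ B[t] + bo for t in range(1, tau + 1)])
    return O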
26.2 GATED RNNS: LONG SHORT-TERM MEMORY NETWORKS
One of the problems in training RNNs is their susceptibility to either the vanishing gradient or the exploding gradient problem. For example, consider the task of computing the net gradient vector δht for the hidden layer at time t, given as
δ^h_t = ∂f^h_t ⊙ ( (W_o · δ^o_t) + (W_h · δ^h_{t+1}) )

Assume for simplicity that we use a linear activation function, i.e., ∂fth = 1, and let us ignore the net gradient vector for the output layer, focusing only on the dependence on the hidden layers. Then for an input sequence of length τ , we have
δ^h_t = W_h · δ^h_{t+1} = W_h (W_h · δ^h_{t+2}) = W_h² · δ^h_{t+2} = ··· = W_h^{τ−t} · δ^h_τ

In particular, at time t = 1, we have δ^h_1 = W_h^{τ−1} · δ^h_τ. We can observe that the net gradient from time τ affects the net gradient vector at time t as a function of W_h^{τ−t}, i.e., as powers of the hidden weight matrix W_h. Let the spectral radius of W_h, defined as the absolute value of its largest eigenvalue, be given as |λ_1|. It turns out that if |λ_1| < 1, then ∥W_h^k∥ → 0 as k → ∞, that is, the gradients vanish as we train on long sequences. On the other hand, if |λ_1| > 1, then at least one element of W_h^k becomes unbounded and thus ∥W_h^k∥ → ∞ as k → ∞, that is, the gradients explode as we train on long sequences. To see this more clearly, let the eigendecomposition of the square m × m matrix W_h be given as
W_h = U · diag(λ_1, λ_2, ···, λ_m) · U^{−1}

where |λ_1| ≥ |λ_2| ≥ ··· ≥ |λ_m| are the eigenvalues of W_h, diag(λ_1, ···, λ_m) is the diagonal matrix with these values on its diagonal, and U is the matrix comprising the corresponding eigenvectors, u_1, u_2, ···, u_m, as columns. Thus, we have

W_h^k = U · diag(λ_1^k, λ_2^k, ···, λ_m^k) · U^{−1}
It is clear that the net gradients scale according to the eigenvalues of W_h. Therefore, if |λ_1| < 1, then |λ_1|^k → 0 as k → ∞, and since |λ_1| ≥ |λ_i| for all i = 1, 2, ···, m, then necessarily |λ_i|^k → 0 as well. That is, the gradients vanish. On the other hand, if |λ_1| > 1, then |λ_1|^k → ∞ as k → ∞, and the gradients explode. Therefore, for the error to neither vanish nor explode, the spectral radius of W_h should remain 1 or very close to it.
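A quick numeric check (a sketch with an arbitrary random matrix, not from the text) illustrates this behavior: rescaling W_h so that its spectral radius is just below or just above 1 makes the backpropagated gradient norm shrink toward zero or blow up as the number of time steps grows.

import numpy as np

rng = np.random.default_rng(1)
m, k = 10, 50                                    # hidden size, number of time steps
W = rng.normal(size=(m, m))
rho = np.max(np.abs(np.linalg.eigvals(W)))       # spectral radius |lambda_1|

for target in (0.9, 1.1):                        # just below vs. just above 1
    Wh = W * (target / rho)                      # rescale so |lambda_1| = target
    delta = np.ones(m)                           # stand-in for the gradient at time tau
    norms = [np.linalg.norm(np.linalg.matrix_power(Wh, j) @ delta) for j in (1, k)]
    print(f"|lambda_1|={target}: ||Wh^1 d||={norms[0]:.3f}, ||Wh^{k} d||={norms[1]:.3e}")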
Long short-term memory (LSTM) networks alleviate the vanishing gradients problem by using gate neurons to control access to the hidden states. Consider the m-dimensional hidden state vector ht ∈ Rm at time t. In a regular RNN, we update the hidden state as follows (as per Eq. (26.1)):
h t = f h ( W i T x t + W hT h t − 1 + b h )
Let g ∈ {0, 1}m be a binary vector. If we take the element-wise product of g and ht , namely, g ⊙ ht , then elements of g act as gates that either allow the corresponding element of ht to be retained or set to zero. The vector g thus acts as logical gate that allows selected elements of ht to be remembered or forgotten. However, for backpropagation we need differentiable gates, for which we use sigmoid activation on the gate neurons so that their value lies in the range [0,1]. Like a logical gate, such neurons allow the inputs to be completely remembered if the value is 1, or

forgotten if the value is 0. In addition, they allow a weighted memory, allowing partial
remembrance of the elements of ht , for values between 0 and 1.
Example 26.2 (Differentiable Gates). As an example, consider a hidden state vector

h_t = (−0.94, 1.05, 0.39, 0.97, 0.90)^T
First consider a logical gate vector
g = (0, 1, 1, 0, 1)^T
Their element-wise product gives
g ⊙ h_t = (0, 1.05, 0.39, 0, 0.90)^T
We can see that the first and fourth elements have been “forgotten.” Now consider a differentiable gate vector
g = (0.1, 0, 1, 0.9, 0.5)^T

The element-wise product of g and h_t gives

g ⊙ h_t = (−0.094, 0, 0.39, 0.873, 0.45)^T
Now, only a fraction specified by an element of g is retained as a memory after the element-wise product.
26.2.1 Forget Gate
To see how gated neurons work, we consider an RNN with a forget gate. Let ht ∈ Rm be the hidden state vector, and let φt ∈ Rm be a forget gate vector. Both these vectors have the same number of neurons, m.
In a regular RNN, assuming tanh activation, the hidden state vector is updated unconditionally, as follows:
h_t = tanh(W_i^T x_t + W_h^T h_{t−1} + b_h)
Instead of directly updating ht , we will employ the forget gate neurons to control how much of the previous hidden state vector to forget when computing its new value, and also to control how to update it in light of the new input xt .
Figure 26.6 shows the architecture of an RNN with a forget gate. Given input x_t and previous hidden state h_{t−1}, we first compute a candidate update vector u_t, as follows:

u_t = tanh(W_u^T x_t + W_{hu}^T h_{t−1} + b_u)        (26.7)
The candidate update vector ut is essentially the unmodified hidden state vector, as in a regular RNN.

Figure 26.6. RNN with a forget gate φ_t. Recurrent connections shown in gray, forget gate shown double-lined. ⊙ denotes element-wise product.
Using the forget gate, we can compute the new hidden state vector as follows:
ht =φt ⊙ht−1 +(1−φt)⊙ut (26.8)
Here ⊙ is the element-wise product operation. We can see that the new hidden state vector retains a fraction of the previous hidden state values, and a (complementary) fraction of the candidate update values. Observe that if φ_t = 0, i.e., if we want to entirely forget the previous hidden state, then 1 − φ_t = 1, which means that the hidden state will be updated completely at each time step just like in a regular RNN. Finally, given the hidden state h_t we can compute the output vector o_t as follows

o_t = f^o(W_o^T h_t + b_o)

How should we compute the forget gate vector φ_t? It makes sense to base it on both the previous hidden state and the new input value, so we compute it as follows:

φ_t = σ(W_φ^T x_t + W_{hφ}^T h_{t−1} + b_φ)        (26.9)
where we use a sigmoid activation function, denoted σ , to ensure that all the neuron values are in the range [0, 1], denoting the extent to which the corresponding previous hidden state values should be forgotten.
To summarize, a forget gate vector φt is a layer that depends on the previous hidden state layer ht−1 and the current input layer xt; these connections are fully connected, and are specified by the corresponding weight matrices Whφ and Wφ , and the bias vector bφ. On the other hand, the output of the forget gate layer φt needs to modify the previous hidden state layer ht−1, and therefore, both φt and ht−1 feed into what is essentially a new element-wise product layer, denoted by ⊙ in Figure 26.6.

Finally, the output of this element-wise product layer is used as input to the new hidden layer h_t that also takes input from another element-wise gate that computes the output from the candidate update vector u_t and the complemented forget gate, 1 − φ_t. Thus, unlike regular layers that are fully connected and have a weight matrix and bias vector between the layers, the connections between φ_t and h_t via the element-wise layer are all one-to-one, and the weights are fixed at the value 1 with bias 0. Likewise, the connections between u_t and h_t via the other element-wise layer are also one-to-one, with weights fixed at 1 and bias at 0.
Example 26.3. Let m = 5. Assume that the previous hidden state vector and the candidate update vector are given as follows:
h_{t−1} = (−0.94, 1.05, 0.39, 0.97, 0.9)^T        u_t = (0.5, 2.5, −1.0, −0.5, 0.8)^T

Let the forget gate and its complement be given as follows:

φ_t = (0.9, 1, 0, 0.1, 0.5)^T        1 − φ_t = (0.1, 0, 1, 0.9, 0.5)^T
The new hidden state vector is then computed as the weighted sum of the previous hidden state vector and the candidate update vector:
h_t = φ_t ⊙ h_{t−1} + (1 − φ_t) ⊙ u_t
    = (0.9, 1, 0, 0.1, 0.5)^T ⊙ (−0.94, 1.05, 0.39, 0.97, 0.9)^T + (0.1, 0, 1, 0.9, 0.5)^T ⊙ (0.5, 2.5, −1.0, −0.5, 0.8)^T
    = (−0.846, 1.05, 0, 0.097, 0.45)^T + (0.05, 0, −1.0, −0.45, 0.40)^T
    = (−0.796, 1.05, −1.0, −0.353, 0.85)^T
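The gated update of Eq. (26.8) is a single line of array code; the following minimal sketch (not from the text) reproduces the numbers of this example.

import numpy as np

h_prev = np.array([-0.94, 1.05, 0.39, 0.97, 0.9])   # h_{t-1}
u_t    = np.array([0.5, 2.5, -1.0, -0.5, 0.8])      # candidate update u_t
phi_t  = np.array([0.9, 1.0, 0.0, 0.1, 0.5])        # forget gate phi_t

# Eq. (26.8): h_t = phi_t (element-wise *) h_{t-1} + (1 - phi_t) (element-wise *) u_t
h_t = phi_t * h_prev + (1 - phi_t) * u_t
print(h_t)   # approximately [-0.796  1.05  -1.    -0.353  0.85], matching Example 26.3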
Computing Net Gradients
It is instructive to compute the net gradients for an RNN with a forget gate, since a similar approach is used to compute the net gradients when training an LSTM network. An RNN with a forget gate has the following parameters it needs to learn, namely the weight matrices Wu, Whu, Wφ, Whφ, and Wo, and the bias vectors bu, bφ and bo. The computation of the hidden state vector ht adds together the inputs from the new element-wise layer that multiplies its incoming edges to compute the net input as opposed to computing a weighted sum. We will look at how to account for the element-wise layers during backpropagation.
Figure 26.7 shows a forget gate RNN unfolded in time for two time steps. Let δot , δht , δφt , and δut denote the net gradient vectors at the output, hidden, forget gate, and candidate update layers, respectively. During backpropagation, we need to compute the net gradients at each layer. The net gradients at the outputs are computed by

considering the partial derivatives of the activation function (∂f^o_t) and the error function (∂E_{x_t}):
δ^o_t = ∂f^o_t ⊙ ∂E_{x_t}

Figure 26.7. RNN with a forget gate unfolded in time (recurrent connections in gray).
For the other layers, we can reverse all the arrows to determine the dependencies
between the layers. Therefore, to compute the net gradient for the update layer
δut , notice that in backpropagation it has only one incoming edge from ht via the
element-wise product (1 − φ_t) ⊙ u_t. The net gradient δ^u_{ti} at update layer neuron i at time t is given as

δ^u_{ti} = ∂E_{x_t}/∂net^u_{ti} = ∂E_{x_t}/∂net^h_{ti} · ∂net^h_{ti}/∂u_{ti} · ∂u_{ti}/∂net^u_{ti} = δ^h_{ti} · (1 − φ_{ti}) · (1 − u_{ti}²)

where ∂net^h_{ti}/∂u_{ti} = ∂/∂u_{ti} { φ_{ti} · h_{t−1,i} + (1 − φ_{ti}) · u_{ti} } = 1 − φ_{ti}, and we use the fact that the update layer uses a tanh activation function. Across all neurons, we obtain the net gradient at u_t as follows:
δut =δht ⊙(1−φt)⊙(1−ut ⊙ut)
To compute the net gradient vector for the forget gate, we observe from
Figure 26.7 that there are two incoming flows into φt during backpropagation — one
from ht via the element-wise product φt ⊙ ht−1, and the other also from ht via the
element-wise product (1 − φ_t) ⊙ u_t. Therefore, the net gradient δ^φ_{ti} at forget gate neuron i at time t is given as

δ^φ_{ti} = ∂E_{x_t}/∂net^φ_{ti} = ∂E_{x_t}/∂net^h_{ti} · ∂net^h_{ti}/∂φ_{ti} · ∂φ_{ti}/∂net^φ_{ti} = δ^h_{ti} · (h_{t−1,i} − u_{ti}) · φ_{ti}(1 − φ_{ti})

where ∂net^h_{ti}/∂φ_{ti} = ∂/∂φ_{ti} { φ_{ti} · h_{t−1,i} + (1 − φ_{ti}) · u_{ti} } = h_{t−1,i} − u_{ti}, and we use the fact that the forget gate uses a sigmoid activation function. Across all neurons, we obtain the net gradient at φ_t as follows:
δφt =δht ⊙(ht−1 −ut)⊙φt ⊙(1−φt)
Finally, let us consider how to compute δht , the net gradient at the hidden layer at time t. In Figure 26.7 we can observe that if we reverse the arrows, δht depends on the gradients at the output layer ot, the forget gate layer φt+1, the update layer ut+1, and on the hidden layer ht+1 via the element-wise product ht+1 ⊙ φt+1. The output, forget and update layers are treated as in a regular RNN. However, due to the element-wise layer, the flow from ht+1 is handled as follows:
∂E_{x_t}/∂net^h_{t+1,i} · ∂net^h_{t+1,i}/∂h_{ti} · ∂h_{ti}/∂net^h_{ti} = δ^h_{t+1,i} · φ_{t+1,i} · 1 = δ^h_{t+1,i} · φ_{t+1,i}

where ∂net^h_{t+1,i}/∂h_{ti} = ∂/∂h_{ti} { φ_{t+1,i} · h_{ti} + (1 − φ_{t+1,i}) · u_{t+1,i} } = φ_{t+1,i}, and we used the fact that
ht implicitly uses an identity activation function. Across all the hidden neurons at time t, the net gradient vector component from ht+1 is given as δht+1 ⊙ φt+1. Considering all the layers, including the output, forget, update and element-wise layers, the complete net gradient vector at the hidden layer at time t is given as:
δ^h_t = W_o δ^o_t + W_{hφ} δ^φ_{t+1} + W_{hu} δ^u_{t+1} + ( δ^h_{t+1} ⊙ φ_{t+1} )
Given the net gradients, we can compute the gradients for all the weight matrices and bias vectors in a manner similar to that outlined for a regular RNN in Section 26.1.2. Likewise, stochastic gradient descent can be used to train the network.
26.2.2 Long Short-Term Memory (LSTM) Networks
We now describe LSTMs, which use differentiable gate vectors to control the hidden state vector ht , as well as another vector ct ∈ Rm called the internal memory vector. In particular, LSTMs utilize three gate vectors: an input gate vector κt ∈ Rm, a forget gate vector φt ∈ Rm, and an output gate vector ωt ∈ Rm, as illustrated in Figure 26.8, which shows the architecture of an LSTM network. Like a regular RNN, an LSTM also maintains a hidden state vector for each time step. However, the content of the hidden vector is selectively copied from the internal memory vector via the output gate, with the internal memory being updated via the input gate and parts of it forgotten via the forget gate.
Let X = ⟨x1,x2,··· ,xτ ⟩ denote a sequence of d-dimensional input vectors of length τ, Y = ⟨y1,y2,··· ,yτ ⟩ the sequence of p-dimensional response vectors, and O = ⟨o1 , o2 , · · · , oτ ⟩ the p-dimensional output sequence from an LSTM. At each time step t, the three gate vectors are updated as follows:
κ_t = σ(W_κ^T x_t + W_{hκ}^T h_{t−1} + b_κ)
φ_t = σ(W_φ^T x_t + W_{hφ}^T h_{t−1} + b_φ)
ω_t = σ(W_ω^T x_t + W_{hω}^T h_{t−1} + b_ω)        (26.10)

Figure 26.8. LSTM neural network. Recurrent connections shown in gray, gate layers shown double-lined.
Here σ(·) denotes the sigmoid activation function. We can observe that each gate is a function of the input vector x_t at time t, as well as the hidden state h_{t−1} from the previous time step. Each gate vector has a corresponding weight matrix from the input neurons to the gate neurons, and from the hidden state neurons to the gate neurons, as well as a corresponding bias vector. Each of the gate vectors conceptually plays a different role in an LSTM network. The input gate vector κ_t controls how much of the input vector, via the candidate update vector u_t, is allowed to influence the memory vector c_t. The forget gate vector φ_t controls how much of the previous memory vector to forget, and finally the output gate vector ω_t controls how much of the memory state is retained for the hidden state.

Given the current input x_t and the previous hidden state h_{t−1}, an LSTM first computes a candidate update vector u_t after applying the tanh activation:

u_t = tanh(W_u^T x_t + W_{hu}^T h_{t−1} + b_u)        (26.11)

It then applies the different gates to compute the internal memory and hidden state vectors:

c_t = κ_t ⊙ u_t + φ_t ⊙ c_{t−1}
h_t = ω_t ⊙ tanh(c_t)        (26.12)

The memory vector c_t at time t depends on the current update vector u_t and the previous memory c_{t−1}. However, the input gate κ_t controls the extent to which u_t influences c_t, and the forget gate φ_t controls how much of the previous memory is forgotten. On the other hand, the hidden state h_t depends on a tanh-activated internal memory vector c_t, but the output gate ω_t controls how much of the internal memory is reflected in the hidden state. Besides the input vectors x_t, the LSTM also needs an initial hidden state vector h_0 and an initial memory state vector c_0, both typically set to 0 ∈ R^m.
Finally, the output of the network ot is obtained by applying the output activation function f o to an affine combination of the hidden state neuron values:
o_t = f^o(W_o^T h_t + b_o)
LSTMs can typically handle long sequences since the net gradients for the internal memory states do not vanish over long time steps. This is because, by design, the memory state ct−1 at time t − 1 is linked to the memory state ct at time t via implicit weights fixed at 1 and biases fixed at 0, with linear activation. This allows the error to flow across time steps without vanishing or exploding.
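The following is a minimal NumPy sketch (not from the text) of one LSTM time step implementing Eq. (26.10) through Eq. (26.12); the parameter names are chosen only to mirror the gate notation above and are otherwise arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step following Eqs. (26.10)-(26.12).

    params holds (Wk, Whk, bk), (Wf, Whf, bf), (Ww, Whw, bw), (Wu, Whu, bu),
    named after the input gate kappa, forget gate phi, output gate omega,
    and candidate update u; each W* has shape (d, m), each Wh* shape (m, m).
    """
    (Wk, Whk, bk), (Wf, Whf, bf), (Ww, Whw, bw), (Wu, Whu, bu) = params
    kappa = sigmoid(Wk.T @ x + Whk.T @ h_prev + bk)      # input gate,   Eq. (26.10)
    phi   = sigmoid(Wf.T @ x + Whf.T @ h_prev + bf)      # forget gate
    omega = sigmoid(Ww.T @ x + Whw.T @ h_prev + bw)      # output gate
    u     = np.tanh(Wu.T @ x + Whu.T @ h_prev + bu)      # candidate update, Eq. (26.11)
    c = kappa * u + phi * c_prev                         # internal memory,  Eq. (26.12)
    h = omega * np.tanh(c)                               # hidden state
    return h, c

Unrolling this step over t = 1, ..., τ, starting from h_0 = c_0 = 0, gives the full forward pass of the LSTM.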
LSTMs can be trained just like regular RNNs by unfolding the layers in time, as illustrated in Figure 26.9, which shows the unfolded layers for two time steps. The first step in training is to use the feed-forward steps to compute the error, followed by the backpropagation of the gradients as a second step. The latter has to be modified to accommodate the element-wise operations used to update the memory state c_t and the hidden state h_t. The connections from c_{t−1} to c_t, starting from c_0 to c_τ, which can be thought of as using unit weight matrices and zero biases, appear as a straight line in the figure, indicating that the internal memory state can flow across longer periods of time without the gradients vanishing or exploding.
26.2.3 Training LSTMs
Consider the unfolded LSTM in Figure 26.9. During backpropagation the net gradient vector at the output layer at time t is computed by considering the partial derivatives of the activation function, ∂fto, and the error function, ∂Ext , as follows:
δ^o_t = ∂f^o_t ⊙ ∂E_{x_t}
where we assume that the output neurons are independent.
In backpropagation there are two incoming connections to the internal memory
vector c_t, one from h_t and the other from c_{t+1}. Therefore, the net gradient δ^c_{ti} at the internal memory neuron i at time t is given as

δ^c_{ti} = ∂E_{x_t}/∂net^c_{ti} = ∂E_{x_t}/∂net^h_{ti} · ∂net^h_{ti}/∂c_{ti} · ∂c_{ti}/∂net^c_{ti} + ∂E_{x_t}/∂net^c_{t+1,i} · ∂net^c_{t+1,i}/∂c_{ti} · ∂c_{ti}/∂net^c_{ti}
         = δ^h_{ti} · ω_{ti}(1 − c_{ti}²) + δ^c_{t+1,i} · φ_{t+1,i}

where we use the fact that the internal memory vector implicitly uses an identity activation function, and furthermore
∂net^h_{ti}/∂c_{ti} = ∂/∂c_{ti} { ω_{ti} · tanh(c_{ti}) } = ω_{ti}(1 − c_{ti}²)

∂net^c_{t+1,i}/∂c_{ti} = ∂/∂c_{ti} { κ_{t+1,i} · u_{t+1,i} + φ_{t+1,i} · c_{ti} } = φ_{t+1,i}
The net gradient vector δct at ct is therefore given as:
δ^c_t = δ^h_t ⊙ ω_t ⊙ (1 − c_t ⊙ c_t) + δ^c_{t+1} ⊙ φ_{t+1}

Figure 26.9. LSTM neural network unfolded in time (recurrent connections in gray).
The forget gate has only one incoming edge in backpropagation, from ct, via the element-wise multiplication φt ⊙ ct−1, with sigmoid activation, therefore the net gradient is:
δ^φ_{ti} = ∂E_{x_t}/∂net^φ_{ti} = ∂E_{x_t}/∂net^c_{ti} · ∂net^c_{ti}/∂φ_{ti} · ∂φ_{ti}/∂net^φ_{ti} = δ^c_{ti} · c_{t−1,i} · φ_{ti}(1 − φ_{ti})
where we used the fact that the forget gate uses sigmoid activation and
∂net^c_{ti}/∂φ_{ti} = ∂/∂φ_{ti} { κ_{ti} · u_{ti} + φ_{ti} · c_{t−1,i} } = c_{t−1,i}

Across all forget gate neurons, the net gradient vector is therefore given as

δ^φ_t = δ^c_t ⊙ c_{t−1} ⊙ (1 − φ_t) ⊙ φ_t

The input gate also has only one incoming edge in backpropagation, from ct , via the element-wise multiplication κt ⊙ ut , with sigmoid activation. In a similar manner, as outlined above for δφt , the net gradient δκt at the input gate κt is given as:
δκt =δct ⊙ut ⊙(1−κt)⊙κt
The same reasoning applies to the update candidate ut , which also has an incoming edge from ct via κt ⊙ ut and tanh activation, so the net gradient vector δut at the update layer is
δ^u_t = δ^c_t ⊙ κ_t ⊙ (1 − u_t ⊙ u_t)
Likewise, in backpropagation, there is one incoming connection to the output gate
from h_t via ω_t ⊙ tanh(c_t) with sigmoid activation, therefore

δ^ω_t = δ^h_t ⊙ tanh(c_t) ⊙ (1 − ω_t) ⊙ ω_t
Finally, to compute the net gradients at the hidden layer, note that gradients flow back to ht from the following layers: ut+1,κt+1,φt+1,ωt+1 and ot. Therefore, the net gradient vector at the hidden state vector δht is given as
δ^h_t = W_o δ^o_t + W_{hκ} δ^κ_{t+1} + W_{hφ} δ^φ_{t+1} + W_{hω} δ^ω_{t+1} + W_{hu} δ^u_{t+1}
The gradients for the weight matrix and bias vector at the output layer are given as:

∇_{b_o} = Σ_{t=1}^{τ} δ^o_t        ∇_{W_o} = Σ_{t=1}^{τ} h_t · (δ^o_t)^T

Likewise, the gradients for the weight matrices and bias vectors for the other layers are given as follows:

∇_{b_κ} = Σ_{t=1}^{τ} δ^κ_t        ∇_{W_κ} = Σ_{t=1}^{τ} x_t · (δ^κ_t)^T        ∇_{W_{hκ}} = Σ_{t=1}^{τ} h_{t−1} · (δ^κ_t)^T
∇_{b_φ} = Σ_{t=1}^{τ} δ^φ_t        ∇_{W_φ} = Σ_{t=1}^{τ} x_t · (δ^φ_t)^T        ∇_{W_{hφ}} = Σ_{t=1}^{τ} h_{t−1} · (δ^φ_t)^T
∇_{b_ω} = Σ_{t=1}^{τ} δ^ω_t        ∇_{W_ω} = Σ_{t=1}^{τ} x_t · (δ^ω_t)^T        ∇_{W_{hω}} = Σ_{t=1}^{τ} h_{t−1} · (δ^ω_t)^T
∇_{b_u} = Σ_{t=1}^{τ} δ^u_t        ∇_{W_u} = Σ_{t=1}^{τ} x_t · (δ^u_t)^T        ∇_{W_{hu}} = Σ_{t=1}^{τ} h_{t−1} · (δ^u_t)^T

Given these gradients, we can use the stochastic gradient descent approach to train the network.
Example 26.4 (LSTM). We use an LSTM to learn the embedded Reber grammar, which is generated according to the automata shown in Figure 26.10. This automata has two copies of the Reber automata from Example 26.1. From the state s1, the top automata is reached by following the edge labeled T, whereas the bottom automata is reached via the edge labeled P. The states of the top automata are labeled as t0,t1,··· ,t7, whereas the states of the bottom automata are labeled as p0,p1,··· ,p7. Finally, note that the state e0 can be reached from either the top or the bottom

automata by following the edges labeled T and P, respectively. The first symbol is always B and the last symbol is always E. However, the important point is that the second symbol is always the same as the second last symbol, and thus any sequence learning model has to learn this long range dependency. For example, the following is a valid embedded Reber sequence: S_X = ⟨B,T,B,T,S,S,X,X,T,V,V,E,T,E⟩.

Figure 26.10. Embedded Reber grammar automata.
The task of the LSTM is to learn to predict the next symbol for each of the positions in a given embedded Reber sequence. For training, we generate n = 400 embedded Reber sequences with a minimum length of 40, and convert them into training pairs (X,Y) using the binary encoding described in Example 26.1. The maximum sequence length is τ = 64.
Given the long range dependency, we used an LSTM with m = 20 hidden neurons (smaller values of m either need more epochs to learn, or have trouble learning the grammar). The input and output layer sizes are determined by the dimensionality of encoding, namely d = 7 and p = 7. We use sigmoid activation at the output layer, treating each neuron as independent. Finally, we use the binary cross entropy error function. The LSTM is trained for r = 10000 epochs (using step size η = 1 and batch size 400); it learns the training data perfectly, making no errors in the prediction of the set of possible next symbols.
We test the LSTM model on 100 previously unseen embedded Reber sequences (with minimum length 40, as before). The trained LSTM makes no errors on the test sequences. In particular, it is able to learn the long range dependency between the second symbol and the second last symbol, which must always match.
The embedded Reber grammar was chosen since an RNN has trouble learning the long range dependency. Using an RNN with m = 60 hidden neurons, using r = 25000 epochs with a step size of η = 1, the RNN can perfectly learn the training sequences. That is, it makes no errors on any of the 400 training sequences. However,

on the test data, this RNN makes a mistake in 40 out of the 100 test sequences. In fact, in each of these test sequences it makes exactly one error; it fails to correctly predict the second last symbol. These results suggest that while the RNN is able to “memorize” the long range dependency in the training data, it is not able to generalize completely on unseen test sequences.
26.3 CONVOLUTIONAL NEURAL NETWORKS
A convolutional neural network (CNN) is essentially a localized and sparse feedforward MLP that is designed to exploit spatial and/or temporal structure in the input data. In a regular MLP all of the neurons in layer l are connected to all of the neurons in layer l + 1. In contrast, a CNN connects a contiguous or adjacent subset of neurons in layer l to a single neuron in the next layer l + 1. Different sliding windows comprising contiguous subsets of neurons in layer l connect to different neurons in layer l + 1. Furthermore, all of these sliding windows use parameter sharing, that is, the same set of weights, called a filter, is used for all sliding windows. Finally, different filters are used to automatically extract features from layer l for use by layer l + 1.
26.3.1 Convolutions
We begin by defining the convolution operation for one-way, two-way and three-way inputs. By one-way we mean data in the form of a single vector, by two-way we mean data in the form of a matrix, and by three-way we mean data in the form of a tensor. We also call them 1D, 2D or 3D inputs where the dimensionality refers to the number of axes in the input data. We will discuss the convolution operation in the context of the input layer and the first hidden layer, but the same approach can be applied to subsequent layers in the network.
1D Convolution
Let x = (x1,x2,··· ,xn)T be an input vector (a one-way or 1D input) with n points. It is assumed that the input points xi are not independent, but rather, there are dependencies between successive points. Let w = (w1 , w2 , · · · , wk )T be a vector of weights, called a 1D filter, with k ≤ n. Here k is also called the window size. Let xk(i) denote the window of x of length k starting at position i, given as
x_k(i) = (x_i, x_{i+1}, x_{i+2}, ···, x_{i+k−1})^T
with 1 ≤ i ≤ n − k + 1. Given a vector a ∈ R^k, define the summation operator as one that adds all the elements of the vector. That is,

sum(a) = Σ_{i=1}^{k} a_i

A 1D convolution between x and w, denoted by the asterisk symbol ∗, is defined as

x ∗ w = ( sum(x_k(1) ⊙ w)   ···   sum(x_k(n−k+1) ⊙ w) )^T

where ⊙ is the element-wise product, so that

sum(x_k(i) ⊙ w) = Σ_{j=1}^{k} x_{i+j−1} · w_j        (26.13)

for i = 1, 2, ···, n − k + 1. We can see that the convolution of x ∈ R^n and w ∈ R^k results in a vector of length n − k + 1.

Figure 26.11. 1D Convolution: (a)–(e) show the convolution between different sliding windows of x and the filter w (with window size k = 3). The final convolution output is shown in (e).
Example 26.5 (1D Convolution). Figure 26.11 shows a vector x with n = 7 and a filter w = (1,0,2)T with window size k = 3. The first window of x of size 3 is x3(1) = (1,3,−1)T. Therefore, as seen in Figure 26.11(a), we have
sum(x_3(1) ⊙ w) = sum( (1, 3, −1)^T ⊙ (1, 0, 2)^T ) = sum( (1, 0, −2)^T ) = −1
The convolution steps for different sliding windows of x with the filter w are shown in Figure 26.11(a)–(e). The convolution x ∗ w has size n − k + 1 = 7 − 3 + 1 = 5, and is given as
x ∗ w = (−1,7,5,4,−1)T
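The 1D convolution of Eq. (26.13) is easy to express directly; the following sketch (not from the text) reproduces this example. Note that, as defined here, the filter is not flipped, so this is what signal-processing texts would call cross-correlation.

import numpy as np

def conv1d(x, w):
    """1D convolution of Eq. (26.13): no filter flipping, valid windows only."""
    n, k = len(x), len(w)
    return np.array([np.sum(x[i:i + k] * w) for i in range(n - k + 1)])

x = np.array([1, 3, -1, 2, 3, 1, -2])
w = np.array([1, 0, 2])
print(conv1d(x, w))          # [-1  7  5  4 -1], as in Example 26.5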
2D Convolution
We can extend the convolution operation to matrix input, for example for images. Let X be an n × n input matrix, and let W be a k × k matrix of weights, called a 2D filter, with k ≤ n. Here k is called the window size. Let X_k(i,j) denote the k × k submatrix of X starting at row i and column j, defined as follows:

X_k(i,j) = ( x_{i,j}        x_{i,j+1}        ···   x_{i,j+k−1}
             x_{i+1,j}      x_{i+1,j+1}      ···   x_{i+1,j+k−1}
             ⋮              ⋮                ···   ⋮
             x_{i+k−1,j}    x_{i+k−1,j+1}    ···   x_{i+k−1,j+k−1} )

with 1 ≤ i, j ≤ n − k + 1. Given a k × k matrix A ∈ R^{k×k}, define the summation operator as one that adds all the elements of the matrix. That is,

sum(A) = Σ_{i=1}^{k} Σ_{j=1}^{k} a_{i,j}

where a_{i,j} is the element of A at row i and column j. The 2D convolution of X and W, denoted X ∗ W, is defined as:

X ∗ W = ( sum(X_k(1,1) ⊙ W)         ···   sum(X_k(1, n−k+1) ⊙ W)
          sum(X_k(2,1) ⊙ W)         ···   sum(X_k(2, n−k+1) ⊙ W)
          ⋮                         ···   ⋮
          sum(X_k(n−k+1,1) ⊙ W)     ···   sum(X_k(n−k+1, n−k+1) ⊙ W) )

where ⊙ is the element-wise product of X_k(i,j) and W, so that

sum(X_k(i,j) ⊙ W) = Σ_{a=1}^{k} Σ_{b=1}^{k} x_{i+a−1, j+b−1} · w_{a,b}        (26.14)

for i, j = 1, 2, ···, n − k + 1. The convolution of X ∈ R^{n×n} and W ∈ R^{k×k} results in an (n − k + 1) × (n − k + 1) matrix.
Example 26.6 (2D Convolution). Figure 26.12 shows a matrix X with n = 4 and a filter W with window size k = 2. The convolution of the first window of X, namely X2(1,1), with W is given as (see Figure 26.12(a))
sum(X_2(1,1) ⊙ W) = sum( [1 2; 3 1] ⊙ [1 0; 0 1] ) = sum( [1 0; 0 1] ) = 2
The convolution steps for different 2 × 2 sliding windows of X with the filter W are shown in Figure 26.12(a)–(i). The convolution X ∗ W has size 3 × 3, since n − k + 1 =
4 − 2 + 1 = 3, and is given as

X ∗ W = ( 2 6 4
          4 4 8
          4 4 4 )
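A direct implementation of Eq. (26.14) is equally short; the following sketch (not from the text) reproduces the matrices of this example.

import numpy as np

def conv2d(X, W):
    """2D convolution of Eq. (26.14): slide a k x k filter over all valid positions."""
    n, k = X.shape[0], W.shape[0]
    out = np.empty((n - k + 1, n - k + 1))
    for i in range(n - k + 1):
        for j in range(n - k + 1):
            out[i, j] = np.sum(X[i:i + k, j:j + k] * W)
    return out

X = np.array([[1, 2, 2, 1],
              [3, 1, 4, 2],
              [2, 1, 3, 4],
              [1, 2, 3, 1]])
W = np.array([[1, 0],
              [0, 1]])
print(conv2d(X, W))   # [[2. 6. 4.], [4. 4. 8.], [4. 4. 4.]], as in Example 26.6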
3D Convolution
We now extend the convolution operation to a three-dimensional matrix, which is also called a 3D tensor. The first dimension comprises the rows, the second the columns, and the third the channels. Let X be an n × n × m tensor, with n rows, n columns and m channels. The assumption is that the input X is a collection of n × n matrices obtained by applying m filters, which specify the m channels. For example, for n × n image inputs, each channel may correspond to a different color filter — red, green or blue.

Figure 26.12. 2D Convolution: (a)–(i) show the 2D convolution between different 2 × 2 sliding windows of X and the filter W. The final 2D convolution output is shown in (i).

Figure 26.13. 3D subtensor X_k(i,j,q): the k × k × r subtensor of X starting at row i, column j, and channel q.
Let W be a k × k × r tensor of weights, called a 3D filter, with k ≤ n and r ≤ m. Let Xk(i,j,q) denote the k × k × r subtensor of X starting at row i, column j and channel q , as illustrated in Figure 26.13, with 1 ≤ i, j ≤ n − k + 1, and 1 ≤ q ≤ m − r + 1.
Given a k × k × r tensor A ∈ R^{k×k×r}, define the summation operator as one that adds all the elements of the tensor. That is,

sum(A) = Σ_{i=1}^{k} Σ_{j=1}^{k} Σ_{q=1}^{r} a_{i,j,q}

where a_{i,j,q} is the element of A at row i, column j, and channel q. The 3D convolution of X and W, denoted X ∗ W, is defined as the (n − k + 1) × (n − k + 1) × (m − r + 1) tensor whose channel q, for q = 1, 2, ···, m − r + 1, is the (n − k + 1) × (n − k + 1) matrix

( sum(X_k(1,1,q) ⊙ W)         ···   sum(X_k(1, n−k+1, q) ⊙ W)
  sum(X_k(2,1,q) ⊙ W)         ···   sum(X_k(2, n−k+1, q) ⊙ W)
  ⋮                           ···   ⋮
  sum(X_k(n−k+1,1,q) ⊙ W)     ···   sum(X_k(n−k+1, n−k+1, q) ⊙ W) )

where ⊙ is the element-wise product of X_k(i,j,q) and W, so that

sum(X_k(i,j,q) ⊙ W) = Σ_{a=1}^{k} Σ_{b=1}^{k} Σ_{c=1}^{r} x_{i+a−1, j+b−1, q+c−1} · w_{a,b,c}        (26.15)

for i, j = 1, 2, ···, n − k + 1 and q = 1, 2, ···, m − r + 1. We can see that the convolution of X ∈ R^{n×n×m} and W ∈ R^{k×k×r} results in an (n − k + 1) × (n − k + 1) × (m − r + 1) tensor.
3D Convolutions in CNNs
Typically in CNNs, we use a 3D filter W of size k × k × m, with the number of channels r = m, the same as the number of channels in X ∈ R^{n×n×m}. Let X_k(i,j) be the k × k × m subtensor of X starting at row i and column j. Then the 3D convolution of X and W is given as:

X ∗ W = ( sum(X_k(1,1) ⊙ W)         ···   sum(X_k(1, n−k+1) ⊙ W)
          sum(X_k(2,1) ⊙ W)         ···   sum(X_k(2, n−k+1) ⊙ W)
          ⋮                         ···   ⋮
          sum(X_k(n−k+1,1) ⊙ W)     ···   sum(X_k(n−k+1, n−k+1) ⊙ W) )
We can see that when W ∈ Rk×k×m, its 3D convolution with X ∈ Rn×n×m results in a (n − k + 1) × (n − k + 1) matrix, since there is no freedom to move in the third dimension. Henceforth, we will always assume that a 3D filter W ∈ Rk×k×m has the same number of channels as the tensor X on which it is applied. Since the number of channels is fixed based on X, the only parameter needed to fully specify W is the window size k.
Figure 26.14. 3D Convolution: (a)–(d) Convolution between different 2 × 2 × 3 sliding windows of X and the filter W. The final 3D convolution output is shown in (d).

Example 26.7 (3D Convolution). Figure 26.14 shows a 3 × 3 × 3 tensor X with n = 3 and m = 3, and a 2 × 2 × 3 filter W with window size k = 2 and r = 3. The convolution
of the first window of X, namely X_2(1,1), with W is given as (see Figure 26.14(a))

sum(X_2(1,1) ⊙ W) = sum( [1 −1; 2 1 | 2 1; 3 −1 | 1 −2; 2 1] ⊙ [1 1; 2 0 | 1 0; 0 1 | 0 1; 1 0] )
                  = sum( [1 −1; 4 0 | 2 0; 0 −1 | 0 −2; 2 0] ) = 5
where we stack the different channels horizontally. The convolution steps for differ- ent 2 × 2 × 3 sliding windows of X with the filter W are shown in Figure 26.14(a)–(d). The convolution X ∗ W has size 2 × 2, since n − k + 1 = 3 − 2 + 1 = 2 and r = m = 3; it is given as
X ∗ W = (  5  11
          15   5 )
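Eq. (26.15) generalizes in the obvious way; the following sketch (not from the text, using arbitrary random data) implements the 3D convolution and illustrates that with r = m the channel axis of the output has size 1.

import numpy as np

def conv3d(X, W):
    """3D convolution of Eq. (26.15): slide a k x k x r filter over the rows,
    columns, and channels of an n x n x m tensor."""
    n, _, m = X.shape
    k, _, r = W.shape
    out = np.empty((n - k + 1, n - k + 1, m - r + 1))
    for i in range(n - k + 1):
        for j in range(n - k + 1):
            for q in range(m - r + 1):
                out[i, j, q] = np.sum(X[i:i + k, j:j + k, q:q + r] * W)
    return out

# When r = m (as in CNNs), the channel axis of the output collapses to size 1
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(3, 3, 3))
W = rng.integers(0, 2, size=(2, 2, 3))
print(conv3d(X, W).shape)    # (2, 2, 1)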
26.3.2 Bias and Activation Functions
We discuss the role of bias neurons and activation functions in the context of a tensor
of neurons at layer l. Let Z^l be an n_l × n_l × m_l tensor of neurons at layer l, so that z^l_{i,j,q} denotes the value of the neuron at row i, column j, and channel q for layer l, with 1 ≤ i, j ≤ n_l and 1 ≤ q ≤ m_l.
Filter Bias
Let W be a k × k × m_l 3D filter. Recall that when we convolve Z^l and W, we get an (n_l − k + 1) × (n_l − k + 1) matrix at layer l + 1. However, so far, we have ignored the role of the bias term in the convolution. Let b ∈ R be a scalar bias value for W, and let Z^l_k(i,j) denote the k × k × m_l subtensor of Z^l at position (i,j). Then, the net signal at neuron z^{l+1}_{i,j} in layer l + 1 is given as

net^{l+1}_{i,j} = sum(Z^l_k(i,j) ⊙ W) + b

and the value of the neuron z^{l+1}_{i,j} is obtained by applying some activation function f to the net signal

z^{l+1}_{i,j} = f( sum(Z^l_k(i,j) ⊙ W) + b )

The activation function can be any of the ones typically used in neural networks, for example, identity, sigmoid, tanh, ReLU, and so on. In the language of convolutions, the values of the neurons in layer l + 1 are given as follows

Z^{l+1} = f( (Z^l ∗ W) ⊕ b )

where ⊕ indicates that the bias term b is added to each element of the (n_l − k + 1) × (n_l − k + 1) matrix Z^l ∗ W.

Figure 26.15. Multiple 3D filters.
Multiple 3D Filters
We can observe that one 3D filter W with a corresponding bias term b results in an (n_l − k + 1) × (n_l − k + 1) matrix of neurons in layer l + 1. Therefore, if we desire m_{l+1} channels in layer l + 1, then we need m_{l+1} different k × k × m_l filters W_q with corresponding bias terms b_q, to obtain the (n_l − k + 1) × (n_l − k + 1) × m_{l+1} tensor of neuron values at layer l + 1, given as

Z^{l+1} = { z^{l+1}_{i,j,q} = f( sum(Z^l_k(i,j) ⊙ W_q) + b_q ) }    for i, j = 1, 2, ..., n_l − k + 1 and q = 1, 2, ..., m_{l+1}

which can be written more compactly as

Z^{l+1} = f( (Z^l ∗ W_1) ⊕ b_1, (Z^l ∗ W_2) ⊕ b_2, ···, (Z^l ∗ W_{m_{l+1}}) ⊕ b_{m_{l+1}} )

where the activation function f distributes over all of its arguments.

In summary, a convolution layer takes as input the n_l × n_l × m_l tensor Z^l of neurons from layer l, and then computes the n_{l+1} × n_{l+1} × m_{l+1} tensor Z^{l+1} of neurons for the next layer l + 1 via the convolution of Z^l with a set of m_{l+1} different 3D filters of size k × k × m_l, followed by adding the bias and applying some non-linear activation function f. Note that each 3D filter applied to Z^l results in a new channel in layer l + 1. Therefore, m_{l+1} filters are used to yield m_{l+1} channels at layer l + 1.
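Putting the pieces together, a convolution layer is simply a loop over filters; the following is a minimal sketch (not from the text) that applies m_{l+1} filters with biases and a ReLU activation, the latter assumed here only for illustration.

import numpy as np

def conv_layer(Z, filters, biases, f=lambda a: np.maximum(a, 0)):
    """One CNN convolution layer: for each k x k x m_l filter W_q with bias b_q,
    compute f((Z * W_q) (+) b_q) and stack the results as channels of the next layer."""
    n, _, m = Z.shape
    k = filters[0].shape[0]
    out = np.empty((n - k + 1, n - k + 1, len(filters)))
    for q, (W, b) in enumerate(zip(filters, biases)):
        for i in range(n - k + 1):
            for j in range(n - k + 1):
                out[i, j, q] = np.sum(Z[i:i + k, j:j + k, :] * W) + b
    return f(out)

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 4, 2))                            # n_l = 4, m_l = 2
filters = [rng.normal(size=(2, 2, 2)) for _ in range(3)]  # m_{l+1} = 3 filters, k = 2
print(conv_layer(Z, filters, biases=np.zeros(3)).shape)   # (3, 3, 3)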
Example 26.8 (Multiple 3D Filters). Figure 26.15 shows how applying different filters yields the channels for the next layer. It shows a 4 × 4 × 2 tensor Z^l with n = 4 and m = 2. It also shows two different 2 × 2 × 2 filters W_1 and W_2 with k = 2 and r = 2. Since r = m = 2, the convolution of Z^l and W_i (for i = 1, 2) results in a 3 × 3 matrix, since n − k + 1 = 4 − 2 + 1 = 3. However, W_1 yields one channel and W_2 yields a second channel, so that the tensor for the next layer Z^{l+1} has size 3 × 3 × 2, with two channels (one per filter).

26.3.3 Padding and Striding

One of the issues with the convolution operation is that the size of the tensors will necessarily decrease in each successive CNN layer. If layer l has size n_l × n_l × m_l, and we use filters of size k × k × m_l, then each channel in layer l + 1 will have size (n_l − k + 1) × (n_l − k + 1). That is, the number of rows and columns of each successive tensor will shrink by k − 1, and that will limit the number of layers the CNN can have.

Figure 26.16. Padding: 2D convolution (a) without padding, and (b) with padding.
Padding
To get around this limitation, a simple solution is to pad each tensor along both the rows and columns in each channel by some default value, typically zero. For uniformity, we always pad by adding the same number of rows at the top and at the bottom, and likewise the same number of columns on the left and on the right. That is, assume that we add p rows both on top and bottom, and p columns both on the left and right. With padding p, the implicit size of the layer l tensor is then (n_l + 2p) × (n_l + 2p) × m_l. Assume that each filter is of size k × k × m_l, and that there are m_{l+1} filters; then the size of the layer l + 1 tensor will be (n_l + 2p − k + 1) × (n_l + 2p − k + 1) × m_{l+1}. Since we want to preserve the size of the resulting tensor, we need to have

$$n_l + 2p - k + 1 \;\ge\; n_l, \quad\text{which implies}\quad p = \Big\lceil \frac{k-1}{2} \Big\rceil$$
With padding, we can have arbitrarily deep convolutional layers in a CNN.
Example 26.9 (Padding). Figure 26.16 shows a 2D convolution without and with padding. Figure 26.16(a) shows the convolution of a 5 × 5 matrix Zl (n = 5) with a 3 × 3 filter W (k = 3), which results in a 3 × 3 matrix since n − k + 1 = 5 − 3 + 1 = 3. Thus, the size of the next layer Zl+1 has decreased.
On the other hand, zero padding Z^l using p = 1 results in a 7 × 7 matrix, as shown in Figure 26.16(b). Since p = 1, we have an extra row of zeros on the top and bottom, and an extra column of zeros on the left and right. The convolution of the zero-padded Z^l with W now results in a 5 × 5 matrix Z^{l+1} (since 7 − 3 + 1 = 5), which preserves the size.

If we wanted to apply another convolution layer, we could zero pad the resulting matrix Z^{l+1} with p = 1, which would again yield a 5 × 5 matrix for the next layer, using a 3 × 3 filter. This way, we can chain together as many convolution layers as desired, without decrease in the size of the layers.
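The size-preserving effect of padding can be checked with a small NumPy sketch (np.pad with zero padding; the function below and its test values are illustrative):

    import numpy as np

    def pad_and_convolve(Z, W, p):
        # zero-pad p rows/columns on every side of each channel, then convolve
        Zp = np.pad(Z, ((p, p), (p, p), (0, 0)))   # default mode is constant zeros
        n, k = Zp.shape[0], W.shape[0]
        out = np.zeros((n - k + 1, n - k + 1))
        for i in range(n - k + 1):
            for j in range(n - k + 1):
                out[i, j] = np.sum(Zp[i:i+k, j:j+k, :] * W)
        return out

    Z = np.ones((5, 5, 1))          # n = 5, one channel
    W = np.ones((3, 3, 1))          # k = 3
    print(pad_and_convolve(Z, W, p=1).shape)   # (5, 5): padding p = 1 preserves the size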
Striding
Striding is often used to sparsify the number of sliding windows used in the convolutions. That is, instead of considering all possible windows, we increment the index along both rows and columns by an integer value s ≥ 1 called the stride. A 3D convolution of Z^l of size n_l × n_l × m_l with a filter W of size k × k × m_l, using stride s, is given as:

$$\mathbf{Z}^l * \mathbf{W} = \begin{pmatrix}
\mathrm{sum}\big(\mathbf{Z}^l_k(1,1)\odot\mathbf{W}\big) & \cdots & \mathrm{sum}\big(\mathbf{Z}^l_k(1,1+t\cdot s)\odot\mathbf{W}\big)\\
\mathrm{sum}\big(\mathbf{Z}^l_k(1+s,1)\odot\mathbf{W}\big) & \cdots & \mathrm{sum}\big(\mathbf{Z}^l_k(1+s,1+t\cdot s)\odot\mathbf{W}\big)\\
\vdots & \ddots & \vdots\\
\mathrm{sum}\big(\mathbf{Z}^l_k(1+t\cdot s,1)\odot\mathbf{W}\big) & \cdots & \mathrm{sum}\big(\mathbf{Z}^l_k(1+t\cdot s,1+t\cdot s)\odot\mathbf{W}\big)
\end{pmatrix}$$

where t = ⌊(n_l − k)/s⌋. We can observe that using stride s, the convolution of Z^l ∈ R^{n_l × n_l × m_l} with W ∈ R^{k × k × m_l} results in a (t + 1) × (t + 1) matrix.

Figure 26.17. Striding: 2D convolution with stride s = 2: (a) sum(Z^l_3(1,1) ⊙ W), (b) sum(Z^l_3(1,3) ⊙ W), (c) sum(Z^l_3(3,1) ⊙ W), (d) sum(Z^l_3(3,3) ⊙ W).
Example 26.10 (Striding). Figure 26.17 shows 2D convolution using stride s = 2 on a 5 × 5 matrix Z^l (n_l = 5) with a filter W of size 3 × 3 (k = 3). Instead of the default stride of one, which would result in a 3 × 3 matrix, we get a (t + 1) × (t + 1) = 2 × 2 matrix Z^{l+1}, since

$$t = \Big\lfloor\frac{n_l - k}{s}\Big\rfloor = \Big\lfloor\frac{5-3}{2}\Big\rfloor = 1$$

We can see that the next window index increases by s along the rows and columns. For example, the first window is Z^l_3(1,1) and thus the second window is Z^l_3(1,1+s) = Z^l_3(1,3) (see Figures 26.17(a) and 26.17(b)). Next, we move down by a stride of s = 2, so that the third window is Z^l_3(1+s,1) = Z^l_3(3,1), and the final window is Z^l_3(3,1+s) = Z^l_3(3,3) (see Figures 26.17(c) and 26.17(d)).
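A short sketch of strided convolution, assuming the same sliding-window sum as before (names are ours):

    import numpy as np

    def conv3d_strided(X, W, s):
        n, k = X.shape[0], W.shape[0]
        t = (n - k) // s                      # t = floor((n - k) / s)
        out = np.zeros((t + 1, t + 1))
        for a in range(t + 1):
            for b in range(t + 1):
                i, j = a * s, b * s           # window indices advance by the stride s
                out[a, b] = np.sum(X[i:i+k, j:j+k, :] * W)
        return out                            # a (t + 1) x (t + 1) matrix

    X = np.ones((5, 5, 1))
    W = np.ones((3, 3, 1))
    print(conv3d_strided(X, W, s=2).shape)    # (2, 2), matching Example 26.10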
26.3.4 Generalized Aggregation Functions: Pooling
Let Zl be a nl × nl × ml tensor at layer l. Our discussion of convolutions so far assumes that we sum together all of the elements in the element-wise product of the k × k × r subtensor Zlk(i,j,q) and the filter W. In fact, CNNs also use other types of aggregation functions in addition to summation, such as average and maximum.
Avg-Pooling
If we replace the summation with the average value over the element-wise product of Zlk(i,j,q) and W, we get
$$\mathrm{avg}\big(\mathbf{Z}^l_k(i,j,q)\odot\mathbf{W}\big) = \mathop{\mathrm{avg}}_{\substack{a=1,2,\cdots,k\\ b=1,2,\cdots,k\\ c=1,2,\cdots,r}}\big\{\, z^l_{i+a-1,\,j+b-1,\,q+c-1}\cdot w_{a,b,c}\,\big\} = \frac{1}{k^2\cdot r}\cdot\mathrm{sum}\big(\mathbf{Z}^l_k(i,j,q)\odot\mathbf{W}\big)$$
Max-Pooling

If we replace the summation with the maximum value over the element-wise product of Z^l_k(i,j,q) and W, we get

$$\max\big(\mathbf{Z}^l_k(i,j,q)\odot\mathbf{W}\big) = \max_{\substack{a=1,2,\cdots,k\\ b=1,2,\cdots,k\\ c=1,2,\cdots,r}}\big\{\, z^l_{i+a-1,\,j+b-1,\,q+c-1}\cdot w_{a,b,c}\,\big\} \tag{26.16}$$

The 3D convolution of Z^l ∈ R^{n_l × n_l × m_l} with filter W ∈ R^{k × k × r} using max-pooling, denoted Z^l ∗_max W, results in a (n_l − k + 1) × (n_l − k + 1) × (m_l − r + 1) tensor, given as:

$$\mathbf{Z}^l *_{\max} \mathbf{W} = \begin{pmatrix}
\max\big(\mathbf{Z}^l_k(1,1,q)\odot\mathbf{W}\big) & \cdots & \max\big(\mathbf{Z}^l_k(1,n_l-k+1,q)\odot\mathbf{W}\big)\\
\vdots & \ddots & \vdots\\
\max\big(\mathbf{Z}^l_k(n_l-k+1,1,q)\odot\mathbf{W}\big) & \cdots & \max\big(\mathbf{Z}^l_k(n_l-k+1,n_l-k+1,q)\odot\mathbf{W}\big)
\end{pmatrix}_{q\,=\,1,2,\ldots,\,m_l-r+1}$$

with one such (n_l − k + 1) × (n_l − k + 1) matrix per output channel q.
Max-pooling in CNNs. Typically, max-pooling is used more often than avg-pooling. Also, for pooling it is very common to set the stride equal to the filter size (s = k), so that the aggregation function is applied over disjoint k × k windows in each channel of Z^l. More importantly, in pooling, the filter W is by default taken to be a k × k × 1 tensor all of whose weights are fixed at 1, so that W = 1_{k×k×1}. In other words, the filter weights are fixed at 1 and are not updated during backpropagation. Further, the filter uses a fixed zero bias (that is, b = 0). Finally, note that pooling implicitly uses an identity activation function. As such, the convolution of Z^l ∈ R^{n_l × n_l × m_l} with W ∈ R^{k × k × 1}, using stride s = k, results in a tensor Z^{l+1} of size ⌊n_l/s⌋ × ⌊n_l/s⌋ × m_l.
Figure 26.18. Max-pooling with stride s = 2: (a) max(Z^l_2(1,1) ⊙ W), (b) max(Z^l_2(1,3) ⊙ W), (c) max(Z^l_2(3,1) ⊙ W), (d) max(Z^l_2(3,3) ⊙ W).
Example 26.11 (Max-pooling). Figure 26.18 shows max-pooling on a 4 × 4 matrix Z^l (n_l = 4), using window size k = 2 and stride s = 2 that equals the window size. The resulting layer Z^{l+1} thus has size 2 × 2, since ⌊n_l/s⌋ = ⌊4/2⌋ = 2. We can see that the filter W has fixed weights equal to 1.

The convolution of the first window of Z^l, namely Z^l_2(1,1), with W is given as (see Figure 26.18(a))

$$\max\big(\mathbf{Z}^l_2(1,1)\odot\mathbf{W}\big) = \max\left(\begin{pmatrix}1 & 2\\ 3 & 1\end{pmatrix}\odot\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix}\right) = \max\begin{pmatrix}1 & 2\\ 3 & 1\end{pmatrix} = 3$$

The other convolution steps are shown in Figures 26.18(b) to 26.18(d).
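A minimal max-pooling sketch with stride equal to the window size; the 4 × 4 test input is meant to mirror this example (its values are read from Figure 26.18, so treat them as illustrative):

    import numpy as np

    def max_pool(Z, k):
        # disjoint k x k windows per channel, stride s = k, weights fixed at 1
        n, _, m = Z.shape
        t = n // k
        out = np.zeros((t, t, m))
        for q in range(m):
            for a in range(t):
                for b in range(t):
                    out[a, b, q] = Z[a*k:(a+1)*k, b*k:(b+1)*k, q].max()
        return out

    Z = np.array([[1, 2, 2, 1],
                  [3, 1, 4, 2],
                  [2, 1, 3, 4],
                  [1, 2, 3, 1]], dtype=float)[:, :, None]   # one channel
    print(max_pool(Z, 2)[:, :, 0])   # first window [[1,2],[3,1]] pools to 3, as in the example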
26.3.5 Deep CNNs
In a typical CNN architecture, one alternates between a convolution layer (with summation as the aggregation function, and learnable filter weights and bias term) and a pooling layer (say, with max-pooling and fixed filter of ones). The intuition is that, whereas the convolution layer learns the filters to extract informative features, the pooling layer applies an aggregation function like max (or avg) to extract the most important neuron value (or the mean of the neuron values) within each sliding window, in each of the channels.
Starting from the input layer, a deep CNN is comprised of multiple, typically alternating, convolution and pooling layers, followed by one or more fully connected layers, and then the final output layer. For each convolution and pooling layer we need to choose the window size k as well as the stride value s, and whether to use padding p or not. We also have to choose the non-linear activation functions for the convolution layers, and also the number of layers to consider.
26.3.6 Training CNNs
To see how to train CNNs we will consider a network with a single convolution layer
and a max-pooling layer, followed by a fully connected layer as shown in Figure 26.19.
For simplicity, we assume that there is only one channel for the input X, and further,
we use only one filter. Thus, X denotes the input matrix of size n0 × n0. The filter W0,
with bias b0, for the convolution layer l = 1, has size k1 × k1, which yields the matrix of
neurons Z^1 of size n_1 × n_1, where n_1 = n_0 − k_1 + 1 using stride s_1 = 1. Since we use only
one filter, this results in a single channel at layer l = 1, i.e., m1 = 1. The next layer l = 2
is a max-pooling layer Z2 of size n2 × n2 obtained by applying a k2 × k2 filter with stride
s2 = k2. Implicitly, the max-pooling layer uses a filter of weights fixed at 1, with 0 as the
bias, that is W_1 = 1_{k_2 × k_2} and b_1 = 0. We assume that n_1 is a multiple of k_2, so that n_2 = n_1/k_2.
The output of the max-pooling layer Z2 is recast as a vector z2 of length (n2)2, since it is fully connected to the layer l = 3. That is, all of the (n2)2 neurons in z2 are connected to each of the n3 neurons in z3 with the weights specified by the matrix W2, which has size (n2 )2 × n3 , and the bias vector b2 ∈ Rn3 . Finally, all neurons in z3 are

connected to the p output layer neurons o, with the weight matrix W_3 ∈ R^{n_3 × p} and bias vector b_3 ∈ R^p.

Figure 26.19. Training: Convolutional neural network, with input X (l = 0, n_0 × n_0), convolution layer Z^1 (l = 1, n_1 × n_1, parameters W_0, b_0), max-pooling layer Z^2 (l = 2, n_2 × n_2, k_2 = s_2), fully connected layer z^3 (l = 3, n_3 neurons, W_2, b_2), and output o (l = 4, p neurons, W_3, b_3).
Feed-forward Phase
Let D = {X_i, y_i}_{i=1}^{n} denote the training data, comprising n tensors X_i ∈ R^{n_0 × n_0 × m_0} (with m_0 = 1 for ease of explanation) and the corresponding response vector y_i ∈ R^p. Given a training pair (X, y) ∈ D, in the feed-forward phase, the predicted output o is given via the following equations:

$$\mathbf{Z}^1 = f^1\big((\mathbf{X} * \mathbf{W}_0) + b_0\big) \qquad
\mathbf{Z}^2 = \mathbf{Z}^1 *_{s_2,\max} \mathbf{1}_{k_2 \times k_2} \qquad
\mathbf{z}^3 = f^3\big(\mathbf{W}_2^T\,\mathbf{z}^2 + \mathbf{b}_2\big) \qquad
\mathbf{o} = f^o\big(\mathbf{W}_o^T\,\mathbf{z}^3 + \mathbf{b}_o\big)$$

where ∗_{s_2, max} denotes max-pooling with stride s_2.
Backpropagation Phase
Given the true response y and predicted output o, we can use any loss function EX to evaluate the discrepancy between them. The weights and biases are updated by computing the net gradient vector at the output layer, and then backpropagating the net gradients from layer l = 4 to l = 1. Let δ1, δ2, and δ3 denote the net gradient vectors at layers l = 1, 2, 3, respectively, and let δo denote the net gradient vector at the output layer. The output net gradient vector is obtained in the regular manner by computing the partial derivatives of the loss function (∂EX) and the activation function (∂fo):
δo =∂fo ⊙∂EX
assuming that the output neurons are independent.
Since layer l = 3 is fully connected to the output layer, and likewise the
max-pooling layer l = 2 is fully connected to Z3, the net gradients at these layers are computed as in a regular MLP
$$\boldsymbol{\delta}^3 = \partial\mathbf{f}^3 \odot \big(\mathbf{W}_o\cdot\boldsymbol{\delta}^o\big) \qquad
\boldsymbol{\delta}^2 = \partial\mathbf{f}^2 \odot \big(\mathbf{W}_2\cdot\boldsymbol{\delta}^3\big) = \mathbf{W}_2\cdot\boldsymbol{\delta}^3$$
The last step follows from the fact that ∂f^2 = 1, since max-pooling implicitly uses an identity activation function. Note that we also implicitly reshape the net gradient vector δ^2, so that its size is ((n_2)^2 × n_3) · (n_3 × 1) = (n_2)^2 × 1, which is reshaped to n_2 × n_2, as desired.
Consider the net gradient δ^1_{ij} at neuron z^1_{ij} in layer l = 1, where i,j = 1,2,···,n_1. Since we assume that the stride s_2 equals the filter size k_2 for the max-pooling layer, each sliding window in the convolution layer contributes only to one neuron at the max-pooling layer. Given stride s_2 = k_2, the k_2 × k_2 sliding window that contains z^1_{ij} is given as Z^1_{k_2}(a,b), where

$$a = \Big\lceil\frac{i}{s_2}\Big\rceil \qquad b = \Big\lceil\frac{j}{s_2}\Big\rceil$$

Due to the max aggregation function, the maximum valued element in Z^1_{k_2}(a,b) specifies the value of neuron z^2_{ab} in the max-pooling layer l = 2. That is,

$$z^2_{ab} = \max_{i,j=1,2,\ldots,k_2}\big\{ z^1_{(a-1)\cdot k_2 + i,\,(b-1)\cdot k_2 + j}\big\} \qquad
i^*,j^* = \mathop{\mathrm{argmax}}_{i,j=1,2,\ldots,k_2}\big\{ z^1_{(a-1)\cdot k_2 + i,\,(b-1)\cdot k_2 + j}\big\}$$

where i^*, j^* is the index of the maximum valued neuron in the window Z^1_{k_2}(a,b). The net gradient δ^1_{ij} at neuron z^1_{ij} is therefore given as

$$\delta^1_{ij} = \frac{\partial E_X}{\partial\,\mathrm{net}^1_{ij}} = \frac{\partial E_X}{\partial\,\mathrm{net}^2_{ab}}\cdot\frac{\partial\,\mathrm{net}^2_{ab}}{\partial z^1_{ij}}\cdot\frac{\partial z^1_{ij}}{\partial\,\mathrm{net}^1_{ij}} = \delta^2_{ab}\cdot\frac{\partial\,\mathrm{net}^2_{ab}}{\partial z^1_{ij}}\cdot\partial f^1_{ij}$$

where net^l_{ij} denotes the net input at neuron z^l_{ij} in layer l. However, since net^2_{ab} = z^1_{i^*,j^*}, the partial derivative ∂net^2_{ab}/∂z^1_{ij} is either 1 or 0, depending on whether z^1_{ij} is the maximum element in the window Z^1_{k_2}(a,b) or not. Putting it all together, we have

$$\delta^1_{ij} = \begin{cases} \delta^2_{ab}\cdot\partial f^1_{ij} & \text{if } i = i^* \text{ and } j = j^*\\[2pt] 0 & \text{otherwise}\end{cases}$$

In other words, the net gradient at neuron z^1_{ij} in the convolution layer is zero if this neuron does not have the maximum value in its window. Otherwise, if it is the maximum, the net gradient backpropagates from the max-pooling layer to this neuron and is then multiplied by the partial derivative of the activation function. The n_1 × n_1 matrix of net gradients δ^1 comprises the net gradients δ^1_{ij} for all i,j = 1,2,···,n_1.
From the net gradients, we can compute the gradients of the weight matrices and
bias parameters. For the fully connected layers, that is, between l = 2 and l = 3, and l = 3 and l = 4, we have
$$\nabla_{\mathbf{W}_3} = \mathbf{Z}^3\cdot(\boldsymbol{\delta}^o)^T \qquad \nabla_{\mathbf{b}_3} = \boldsymbol{\delta}^o \qquad
\nabla_{\mathbf{W}_2} = \mathbf{Z}^2\cdot(\boldsymbol{\delta}^3)^T \qquad \nabla_{\mathbf{b}_2} = \boldsymbol{\delta}^3$$

where we treat Z^2 as a (n_2)^2 × 1 vector.

Note that the weight matrix W1 is fixed at 1k2×k2 and the bias term b1 is also fixed at 0, so there are no parameters to learn between the convolution and max-pooling layers. Finally, we compute the weight and bias gradients between the input and convolution layer as follows:
$$\nabla_{\mathbf{W}_0} = \sum_{i=1}^{n_1}\sum_{j=1}^{n_1} \mathbf{X}_{k_1}(i,j)\cdot\delta^1_{ij} \qquad
\nabla_{b_0} = \sum_{i=1}^{n_1}\sum_{j=1}^{n_1} \delta^1_{ij}$$
where we used the fact that the stride is s1 = 1, and that W0 is a shared filter for all k1 × k1 windows of X, with the shared bias value b0 for all windows. There are n1 × n1 such windows, where n1 = n0 − k1 + 1, therefore, to compute the weight and bias gradients, we sum over all the windows. Note that if there were multiple filters (that is, if m1 > 1), then the bias and weight gradients for the jth filter would be learned from the corresponding channel j in layer l = 1.
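The two CNN-specific gradient computations above can be sketched as follows, assuming a single channel, a single filter, stride-1 convolution and stride-k_2 max-pooling; the function names are ours, and multiplication by the activation derivative ∂f^1 is left to the caller:

    import numpy as np

    def maxpool_backward(Z1, delta2, k2):
        # route each pooled net gradient back to the argmax position of its window
        n1 = Z1.shape[0]
        delta1 = np.zeros_like(Z1)
        for a in range(n1 // k2):
            for b in range(n1 // k2):
                win = Z1[a*k2:(a+1)*k2, b*k2:(b+1)*k2]
                i_star, j_star = np.unravel_index(np.argmax(win), win.shape)
                delta1[a*k2 + i_star, b*k2 + j_star] = delta2[a, b]
        return delta1   # still to be multiplied element-wise by the activation derivative

    def filter_gradients(X, delta1, k1):
        # accumulate gradients over all k1 x k1 windows that share the filter W0 and bias b0
        n1 = delta1.shape[0]
        dW0 = np.zeros((k1, k1))
        db0 = 0.0
        for i in range(n1):
            for j in range(n1):
                dW0 += X[i:i+k1, j:j+k1] * delta1[i, j]
                db0 += delta1[i, j]
        return dW0, db0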
Example 26.12 (CNN). Figure 26.20 shows a CNN for handwritten digit recognition. This CNN is trained and tested on the MNIST dataset, which contains 60,000 training images and 10,000 test images. Some examples of handwritten digits from MNIST are shown in Figure 26.21. Each input image is a 28 × 28 matrix of pixel values between 0 and 255, which are divided by 255, so that each pixel lies in the interval [0,1]. The corresponding (true) output y_i is a one-hot encoded binary vector that denotes a digit from 0 to 9; the digit 0 is encoded as e_1 = (1,0,0,0,0,0,0,0,0,0)^T, the digit 1 as e_2 = (0,1,0,0,0,0,0,0,0,0)^T, and so on.

In our CNN model, all the convolution layers use stride equal to one, and do not use any padding, whereas all of the max-pooling layers use stride equal to the window size. Since each input is a 28 × 28 pixel image of a digit with 1 channel (grayscale), we have n_0 = 28 and m_0 = 1, and therefore, the input X = Z^0 is an n_0 × n_0 × m_0 = 28 × 28 × 1 tensor. The first convolution layer uses m_1 = 6 filters, with k_1 = 5 and stride s_1 = 1, without padding. Thus, each filter is a 5 × 5 × 1 tensor of weights, and across the six filters the resulting layer l = 1 tensor Z^1 has size 24 × 24 × 6, with n_1 = n_0 − k_1 + 1 = 28 − 5 + 1 = 24 and m_1 = 6. The second hidden layer is a max-pooling layer that
uses k_2 = 2 with a stride of s_2 = 2. Since max-pooling by default uses a fixed filter W = 1_{k_2 × k_2 × 1}, the resulting tensor Z^2 has size 12 × 12 × 6, with n_2 = ⌊n_1/k_2⌋ = ⌊24/2⌋ = 12, and m_2 = 6. The third layer is a convolution layer with m_3 = 16 channels, with a window size of k_3 = 5 (and stride s_3 = 1), resulting in the tensor Z^3 of size 8 × 8 × 16, where n_3 = n_2 − k_3 + 1 = 12 − 5 + 1 = 8. This is followed by another max-pooling layer that uses k_4 = 2 and s_4 = 2, which yields the tensor Z^4 that is 4 × 4 × 16, where n_4 = ⌊n_3/k_4⌋ = ⌊8/2⌋ = 4, and m_4 = 16.

Figure 26.20. Convolutional neural network: input X (28 × 28 × 1), convolution Z^1 (24 × 24 × 6, k_1 = 5), max-pooling Z^2 (12 × 12 × 6, k_2 = s_2 = 2), convolution Z^3 (8 × 8 × 16, k_3 = 5), max-pooling Z^4 (4 × 4 × 16, k_4 = s_4 = 2), followed by fully connected layers Z^5 (120), Z^6 (84) and output o (10).

Figure 26.21. MNIST dataset: Sample handwritten digits (labels 0 through 9).
The next three layers are fully connected as in a regular MLP. All of the 4 × 4 ×
16 = 256 neurons in layer l = 4 are connected to layer l = 5, which has 120 neurons. Thus, Z5 is simply a vector of length 120, or it can be considered a degenerate tensor of size 120 × 1 × 1. Layer l = 5 is also fully connected to layer l = 6 with 84 neurons, which is the last hidden layer. Since there are 10 digits, the output layer o comprises 10 neurons, with softmax activation function. The convolution layers Z1 and Z3, and the fully connected layers Z5 and Z6, all use ReLU activation.
We train the CNN model on n = 60000 training images from the MNIST dataset; we train for 15 epochs using step size η = 0.2 and using cross-entropy error (since there are 10 classes). Training was done using minibatches, using batch size of 1000. After training the CNN model, we evaluate it on the test dataset of 10,000 images. The CNN model makes 147 errors on the test set, resulting in an error rate of 1.47%. Figure 26.22 shows examples of images that are misclassified by the CNN. We show the true label y for each image and the predicted label o (converted back from the one-hot encoding to the digit label). We show three examples for each of the labels. For example, the first three images on the first row are for the case when the true label is y = 0, and the next three examples are for true label y = 1, and so on. We can see that several of the misclassified images are noisy, incomplete or erroneous, and would be hard to classify correctly even by a human.
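The text does not tie this example to a particular framework; as one possible (assumed) rendering, the architecture of Figure 26.20 could be sketched in PyTorch roughly as follows:

    import torch
    import torch.nn as nn

    class DigitCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 6, kernel_size=5)    # 28x28x1 -> 24x24x6
            self.conv2 = nn.Conv2d(6, 16, kernel_size=5)   # 12x12x6 -> 8x8x16
            self.pool = nn.MaxPool2d(2)                    # stride defaults to the window size
            self.fc1 = nn.Linear(4 * 4 * 16, 120)
            self.fc2 = nn.Linear(120, 84)
            self.out = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(torch.relu(self.conv1(x)))       # -> 12x12x6
            x = self.pool(torch.relu(self.conv2(x)))       # -> 4x4x16
            x = x.flatten(1)
            x = torch.relu(self.fc1(x))
            x = torch.relu(self.fc2(x))
            return self.out(x)                             # softmax is folded into the loss below

    model = DigitCNN()
    loss_fn = nn.CrossEntropyLoss()                        # applies log-softmax internally
    opt = torch.optim.SGD(model.parameters(), lr=0.2)      # step size as in the example

The exact error counts reported in the example depend on initialization and minibatch order, so a re-run of such a sketch will not reproduce them exactly.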
For comparison, we also train a deep MLP with two (fully connected) hidden layers with the same sizes as the two fully connected layers before the output layer in the CNN shown in Figure 26.20. Therefore, the MLP comprises the layers X, Z5, Z6, and o, with the input 28 × 28 images viewed as a vector of size d = 784. The first hidden layer has size n1 = 120, the second hidden layer has size n2 = 84, and the output layer has size p = 10. We use ReLU activation function for all layers, except the output, which uses softmax. We train the MLP model for 15 epochs on the training dataset with n = 60000 images, using step size η = 0.5. On the test dataset, the MLP made 264 errors, for an error rate of 2.64%. Figure 26.23 shows the number of errors on

the test set after each epoch of training for both the CNN and MLP model; the CNN model achieves significantly better accuracy than the MLP.

Figure 26.22. MNIST: Incorrect predictions by the CNN model; y is the true label, o is the predicted label.

Figure 26.23. MNIST: CNN versus Deep MLP; prediction error as a function of epochs.

26.4 REGULARIZATION
Consider the squared error loss function, given as

$$L(y, \hat{y}) = (y - \hat{y})^2$$

where y is the true response and ŷ = o is the predicted response on a given input x. The goal of learning is to minimize the expected loss E[L(y, ŷ)] = E[(y − ŷ)²].

As described in Section 22.3, the expected loss can be decomposed into three terms: noise, bias, and variance, given as

$$E\big[(y-\hat{y})^2\big] = \underbrace{E\big[(y - E[y])^2\big]}_{\text{noise}} + \underbrace{E\big[(\hat{y} - E[\hat{y}])^2\big]}_{\text{average variance}} + \underbrace{E\big[(E[\hat{y}] - E[y])^2\big]}_{\text{average bias}} \tag{26.17}$$
The noise term is the expected variance of y, since E[(y − E[y])2] = var(y). It contributes a fixed cost to the loss independent of the model. Since it is the inherent uncertainty or variability of y it can be ignored when comparing different models. The average bias term indicates how much the model deviates from the true (but unknown) function relating the response to the predictor variables. For example, if the response is a non-linear function of the predictor variables, and we fit a simple linear model, then this model will have a high bias. We can try to fit a more complex non-linear model to reduce the bias, but then we run into the issue of overfitting and high variance, captured by the average variance term, which quantifies the variance of the predicted response yˆ, since E[(yˆ − E[yˆ])2] = var(yˆ). That is, our model may try to fit noise and other artifacts in the data, and will therefore be highly susceptible to small changes in the data, resulting in high variance. In general, there is always a trade-off between reducing bias and reducing variance.
Regularization is an approach whereby we constrain the model parameters to reduce overfitting, by reducing the variance at the cost of increasing the bias slightly. For example, in ridge regression [Eq. (23.29)], we add a constraint on the L2-norm of the weight parameters as follows
$$\min_{\mathbf{w}}\; J(\mathbf{w}) = \|\mathbf{Y} - \widehat{\mathbf{Y}}\|^2 + \alpha\cdot\|\mathbf{w}\|^2 = \|\mathbf{Y} - \mathbf{D}\mathbf{w}\|^2 + \alpha\cdot\|\mathbf{w}\|^2 \tag{26.18}$$
where Y is the true response vector and Y􏰭 is the predicted response vector over all training instances. The goal here is to drive the weights to be small, depending on the hyperparameter α ≥ 0 called the regularization constant. If α = 0, then there is no regularization, and we get a low bias, but possibly high variance model. If α → ∞ then the effect is to drive all weights to be nearly zero, which results in a low variance, but high bias model. An intermediate value of α tries to achieve the right balance between these two conflicting objectives. As another example, in L1 regression or Lasso [Eq. (23.43)], we minimize the L1 norm of the weights
$$\min_{\mathbf{w}}\; J(\mathbf{w}) = \|\mathbf{Y} - \mathbf{D}\mathbf{w}\|^2 + \alpha\cdot\|\mathbf{w}\|_1$$

where $\|\mathbf{w}\|_1 = \sum_{i=1}^d |w_i|$. Compared to the L2 norm, which merely makes the weights small, the use of the L1 norm sparsifies the model by forcing many of the weights to zero, acting as a feature subset selection method.

In general, for any learning model M, let L(y, ŷ) be some loss function for a given input x, and let Θ denote all the model parameters, where ŷ = M(x|Θ). The learning objective is to find the parameters that minimize the loss over all instances:

$$\min_{\Theta}\; J(\Theta) = \sum_{i=1}^n L(y_i, \hat{y}_i) = \sum_{i=1}^n L\big(y_i, M(\mathbf{x}_i|\Theta)\big)$$

With regularization, we add a penalty on the parameters Θ, to obtain the regularized objective:

$$\min_{\Theta}\; J(\Theta) = \sum_{i=1}^n L(y_i, \hat{y}_i) + \alpha\, R(\Theta) \tag{26.19}$$

where α ≥ 0 is the regularization constant.
Let θ ∈ Θ be a parameter of the regression model. Typical regularization functions include the L2 norm, the L1 norm, or even a combination of these, called the elastic-net:

$$R_{L_2}(\theta) = \|\theta\|_2^2 \qquad R_{L_1}(\theta) = \|\theta\|_1 \qquad R_{\text{elastic}}(\theta) = \lambda\cdot\|\theta\|_2^2 + (1-\lambda)\cdot\|\theta\|_1$$
with λ ∈ [0,1].
26.4.1 L2 Regularization for Deep Learning
We now consider approaches for regularizing the deep learning models. We first consider the case of a multilayer perceptron with one hidden layer, and then generalize it to multiple hidden layers. Note that while our discussion of regularization is in the context of MLPs, since RNNs are trained via unfolding, and CNNs are essentially sparse MLPs, the methods described here can easily be generalized for any deep neural network.
MLP with One Hidden Layer
Consider regularization in the context of a feed-forward MLP with a single hidden layer, as shown in Figure 26.24. Let the input be x ∈ R^d, the hidden layer z ∈ R^m, and let ŷ = o ∈ R^p. The set of all the parameters of the model is

$$\Theta = \{\mathbf{W}_h, \mathbf{b}_h, \mathbf{W}_o, \mathbf{b}_o\}$$

Figure 26.24. Multilayer perceptron with one hidden layer: input x (l = 0), hidden layer z (l = 1) with parameters W_h, b_h, and output o (l = 2) with parameters W_o, b_o.

Whereas it makes sense to penalize large weights, we usually do not penalize the bias terms, since they are just thresholds that shift the activation function and there is no need to force them to be small values. The L2 regularized objective is therefore given as

$$\min_{\Theta}\; J(\Theta) = E_{\mathbf{x}} + \frac{\alpha}{2}\cdot R_{L_2}(\mathbf{W}_o, \mathbf{W}_h) = E_{\mathbf{x}} + \frac{\alpha}{2}\cdot\big(\|\mathbf{W}_h\|_F^2 + \|\mathbf{W}_o\|_F^2\big)$$
Here we added the factor 1/2 to the regularization term for convenience. For the L2 norm of the weight matrices, we use the Frobenius norm, which has the usual sense of L2-norm, since for an n × m matrix A, it is defined as

$$\|\mathbf{A}\|_F^2 = \sum_{i=1}^n\sum_{j=1}^m a_{ij}^2$$
where aij is the (i,j)th element of A. The regularized objective tries to minimize the individual weights for pairs of neurons between the input and hidden, and hidden and output layers. This has the effect of adding some bias to the model, but possibly reducing variance, since small weights are more robust to changes in the input data in terms of the predicted output values.
The regularized objective has two separate terms, one for the loss and the other for the L2 norm of the weight matrices. Recall that we have to compute the weight gradients ∇wij and the bias gradients ∇bj by computing
$$\nabla_{w_{ij}} = \frac{\partial J(\Theta)}{\partial w_{ij}} = \frac{\partial E_{\mathbf{x}}}{\partial w_{ij}} + \frac{\alpha}{2}\cdot\frac{\partial R_{L_2}(\mathbf{W}_o,\mathbf{W}_h)}{\partial w_{ij}} = \delta_j\cdot z_i + \alpha\cdot w_{ij}$$

$$\nabla_{b_j} = \frac{\partial J(\Theta)}{\partial b_j} + \frac{\alpha}{2}\cdot\frac{\partial R_{L_2}(\mathbf{W}_o,\mathbf{W}_h)}{\partial b_j} = \frac{\partial E_{\mathbf{x}}}{\partial b_j} = \delta_j$$

where we use Eq. (25.31) to note that $\frac{\partial E_{\mathbf{x}}}{\partial w_{ij}} = \delta_j\cdot z_i$ and $\frac{\partial E_{\mathbf{x}}}{\partial b_j} = \delta_j$, and where $\delta_j = \frac{\partial E_{\mathbf{x}}}{\partial\,\mathrm{net}_j}$ is the net gradient. Further, since the squared L2 norm of a weight matrix is simply the sum of the squared weights, only the term $w_{ij}^2$ matters, and all other elements are just constant with respect to the weight $w_{ij}$ between neurons i and j (in W_h or W_o). Across all the neuron pairs between the hidden and output layer, we can write the update rule compactly as follows:
$$\nabla_{\mathbf{W}_o} = \mathbf{z}\cdot\boldsymbol{\delta}_o^T + \alpha\cdot\mathbf{W}_o \qquad \nabla_{\mathbf{b}_o} = \boldsymbol{\delta}_o$$
where δo is the net gradient vector for the output neurons, and z is the vector of hidden layer neuron values. The gradient update rule using the regularized weight gradient matrix is given as
$$\mathbf{W}_o = \mathbf{W}_o - \eta\cdot\nabla_{\mathbf{W}_o} = \mathbf{W}_o - \eta\cdot\big(\mathbf{z}\cdot\boldsymbol{\delta}_o^T + \alpha\cdot\mathbf{W}_o\big) = (1 - \eta\cdot\alpha)\cdot\mathbf{W}_o - \eta\cdot\big(\mathbf{z}\cdot\boldsymbol{\delta}_o^T\big)$$
L2 regularization is also called weight decay, since the updated weight matrix uses decayed weights from the previous step, using the decay factor 1 − η · α.
In a similar manner we get the weight and bias gradients between the input and hidden layers:
$$\nabla_{\mathbf{W}_h} = \mathbf{x}\cdot\boldsymbol{\delta}_h^T + \alpha\cdot\mathbf{W}_h \qquad \nabla_{\mathbf{b}_h} = \boldsymbol{\delta}_h$$

The update rule for the weight matrix between the input and hidden layers is therefore given as

$$\mathbf{W}_h = \mathbf{W}_h - \eta\cdot\nabla_{\mathbf{W}_h} = (1 - \eta\cdot\alpha)\cdot\mathbf{W}_h - \eta\cdot\big(\mathbf{x}\cdot\boldsymbol{\delta}_h^T\big)$$

where δ_h is the net gradient vector for the hidden neurons, and x is the input vector.

Figure 26.25. Deep multilayer perceptron: input x (l = 0), hidden layers z_1 through z_h, and output o (l = h + 1), with parameters W_0, b_0 through W_h, b_h.
Deep MLPs
Consider the deep MLP shown in Figure 26.25. We denote the input neurons as layer l = 0, the first hidden layer as l = 1, the last hidden layer as l = h, and the final output layer as l = h+1. The vector of neuron values for layer l (for l = 0,··· ,h+1) is denoted as
$$\mathbf{z}_l = \big(z_1^l, \cdots, z_{n_l}^l\big)^T$$
where nl is the number of neurons in layer l. Thus, x = z0 and o = zh+1. The weight matrix between neurons in layer l and layer l +1 is denoted Wl ∈ Rnl ×nl+1 , and the vector of bias terms from the bias neuron z0l to neurons in layer l + 1 is denoted bl ∈ Rnl+1 , for l = 1, · · · , h + 1.
Given the error function E_x, the L2 regularized objective function is

$$\min_{\Theta}\; J(\Theta) = E_{\mathbf{x}} + \frac{\alpha}{2}\cdot R_{L_2}(\mathbf{W}_0, \mathbf{W}_1, \ldots, \mathbf{W}_h) = E_{\mathbf{x}} + \frac{\alpha}{2}\cdot\Big(\sum_{l=0}^{h}\|\mathbf{W}_l\|_F^2\Big)$$
where the set of all the parameters of the model is Θ = {W_0, b_0, W_1, b_1, ···, W_h, b_h}. Based on the derivation for the one hidden layer MLP from above, the regularized gradient is given as:
$$\nabla_{\mathbf{W}_l} = \mathbf{z}_l\cdot(\boldsymbol{\delta}_{l+1})^T + \alpha\cdot\mathbf{W}_l \tag{26.20}$$

and the update rule for the weight matrices is

$$\mathbf{W}_l = \mathbf{W}_l - \eta\cdot\nabla_{\mathbf{W}_l} = (1 - \eta\cdot\alpha)\cdot\mathbf{W}_l - \eta\cdot\big(\mathbf{z}_l\cdot(\boldsymbol{\delta}_{l+1})^T\big) \tag{26.21}$$

for l = 0,1,···,h, where δ_{l+1} is the net gradient vector for the neurons in layer l + 1. We can thus observe that incorporating L2 regularization within deep MLPs is relatively straightforward. Likewise, it is easy to incorporate L2 regularization in other models like RNNs, CNNs, and so on. For L1 regularization, we can apply the subgradient approach outlined for L1 regression or Lasso in Section 23.6.
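A minimal sketch of one weight-decay update per Eqs. (26.20)–(26.21), assuming NumPy arrays of compatible shapes (the function name is ours):

    import numpy as np

    def l2_update(Wl, zl, delta_next, eta, alpha):
        # Wl: (n_l x n_{l+1}) weight matrix; zl: layer-l activations; delta_next: layer-(l+1) net gradients
        grad = np.outer(zl, delta_next) + alpha * Wl          # Eq. (26.20)
        return Wl - eta * grad                                # equals (1 - eta*alpha)*Wl - eta*outer(zl, delta_next)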
26.4.2 Dropout Regularization
The idea behind dropout regularization is to randomly set a certain fraction of the neuron values in a layer to zero during training time. The aim is to make the network more robust and to avoid overfitting at the same time. By dropping random neurons for each training point, the network is forced to not rely on any specific set of edges. From the perspective of a given neuron, since it cannot rely on all its incoming edges to be present, it has the effect of not concentrating the weight on specific input edges, but rather the weight is spread out among the incoming edges. The net effect is similar to L2 regularization since weight spreading leads to smaller weights on the edges. The resulting model with dropout is therefore more resilient to small perturbations in the input, which can reduce overfitting at a small price in increased bias. However, note that while L2 regularization directly changes the objective function, dropout regularization is a form of structural regularization that does not change the objective function, but instead changes the network topology in terms of which connections are currently active or inactive.
MLP with One Hidden Layer
Consider the one hidden layer MLP in Figure 26.24. Let the input be x ∈ R^d, the hidden layer z ∈ R^m, and let ŷ = o ∈ R^p. During the training phase, for each input x, we create a random mask vector to drop a fraction of the hidden neurons. Formally, let r ∈ [0,1] be the probability of keeping a neuron, so that the dropout probability is 1 − r. We create an m-dimensional multivariate Bernoulli vector u ∈ {0,1}^m, called the masking vector, each of whose entries is 0 with dropout probability 1 − r, and 1 with probability r. Let u = (u_1, u_2, ···, u_m)^T, where

$$u_i = \begin{cases} 0 & \text{with probability } 1-r\\ 1 & \text{with probability } r\end{cases}$$
The feed-forward step is then given as
$$\mathbf{z} = f^h\big(\mathbf{b}_h + \mathbf{W}_h^T\mathbf{x}\big) \qquad
\tilde{\mathbf{z}} = \mathbf{u}\odot\mathbf{z} \qquad
\mathbf{o} = f^o\big(\mathbf{b}_o + \mathbf{W}_o^T\tilde{\mathbf{z}}\big)$$
where ⊙ is the element-wise multiplication. The net effect is that the masking vector zeros out the ith hidden neuron in z ̃ if ui = 0. Zeroing out also has the effect that during the backpropagation phase the error gradients do not flow back from the zeroed out neurons in the hidden layer. The effect is that any weights on edges adjacent to zeroed out hidden neurons are not updated.
Inverted Dropout There is one complication in the basic dropout approach above, namely, the expected output of hidden layer neurons is different during training and testing, since dropout is not applied during the testing phase (after all, we do not want the predictions to be randomly varying on a given test input). With r as the probability of retaining a hidden neuron, its expected output value is
$$E[z_i] = r\cdot z_i + (1-r)\cdot 0 = r\cdot z_i$$

On the other hand, since there is no dropout at test time, the outputs of the hidden neurons will be higher at testing time. So one idea is to scale the hidden neuron values by r at testing time. On the other hand, there is a simpler approach called inverted dropout that does not need a change at testing time. The idea is to rescale the hidden neurons after the dropout step during the training phase, as follows:
$$\mathbf{z} = f\big(\mathbf{b}_h + \mathbf{W}_h^T\mathbf{x}\big) \qquad
\tilde{\mathbf{z}} = \frac{1}{r}\cdot\big(\mathbf{u}\odot\mathbf{z}\big) \qquad
\mathbf{o} = f\big(\mathbf{b}_o + \mathbf{W}_o^T\tilde{\mathbf{z}}\big)$$
With the scaling factor 1/r, the expected value of each neuron remains the same as without dropout, since
$$E[z_i] = \frac{1}{r}\cdot\big(r\cdot z_i + (1-r)\cdot 0\big) = z_i$$

Dropout in Deep MLPs
Dropout regularization for deep MLPs is done in a similar manner. Let r_l ∈ [0,1], for l = 1,2,···,h, denote the probability of retaining a hidden neuron for layer l, so that 1 − r_l is the dropout probability. One can also use a single rate r for all the layers by setting r_l = r. Define the masking vector for hidden layer l, u_l ∈ {0,1}^{n_l}, as follows:

$$u_i^l = \begin{cases} 0 & \text{with probability } 1-r_l\\ 1 & \text{with probability } r_l\end{cases}$$

The feed-forward step between layer l and l + 1 is then given as

$$\mathbf{z}_l = f\big(\mathbf{b}_l + \mathbf{W}_l^T\,\tilde{\mathbf{z}}_{l-1}\big) \qquad
\tilde{\mathbf{z}}_l = \frac{1}{r_l}\cdot\big(\mathbf{u}_l\odot\mathbf{z}_l\big) \tag{26.22}$$
using inverted dropout. Usually, no masking is done for the input and output layers, so we can set r0 = 1 and rh+1 = 1. Also note that there is no dropout at testing time. The dropout rates are hyperparameters of the model that have to be tuned on a separate validation dataset.
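A small sketch of inverted dropout for one hidden layer, assuming a NumPy random generator (function name is ours):

    import numpy as np

    def inverted_dropout(z, r, rng):
        # z: hidden-layer activations; r: probability of keeping a neuron
        u = (rng.random(z.shape) < r).astype(z.dtype)   # Bernoulli masking vector
        return (u * z) / r                              # rescale so the expected value matches no-dropout

    rng = np.random.default_rng(1)
    z = np.array([0.5, 1.2, -0.3, 0.8])
    print(inverted_dropout(z, r=0.8, rng=rng))

Because of the 1/r rescaling during training, no change is needed at test time, where the mask is simply omitted.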
26.5 FURTHER READING
The backpropagation through time algorithm was introduced by Werbos (1990). Bidirectional RNNs were proposed by Schuster and Paliwal (1997), and LSTMs were proposed by Hochreiter and Schmidhuber (1997), with the forget gate introduced by Gers, Schmidhuber, and Cummins (2000). Convolutional neural networks trained via backpropagation were proposed by LeCun, Bottou, Bengio, and Haffner (1998), with application to handwritten digit recognition. Dropout regularization was introduced by Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014).

Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12 (10), 2451–2471.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Compu- tation, 9 (8), 1735–1780.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 (11), 2278–2324.
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45 (11), 2673–2681.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15 (1), 1929–1958.
Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78 (10), 1550–1560.
26.6 EXERCISES
Q1. Consider the RNN architecture in Figure 26.26. Note that an edge labeled −1 indicates dependence on the previous time step, whereas −2 indicates dependence on values from two time steps back. Use f o and f h as the activation functions for output and hidden layers. Let τ be the longest sequence length. Answer the following questions:
(a) Unfold the RNN for three steps and show all the connections.
(b) List all the parameters required in this model, including the weight matrices and bias vectors. Next, write the forward propagation equations to compute o_t.
(c) Write the equation for the net gradient vector δ_{o_t} at the output neurons at time t.
(d) Write the equation for the net gradient vector δ_{h_t} at the hidden neurons at time t.
Q2. Derive the net gradient equations for backpropagation in an RNN with a forget gate using vector derivatives. That is, derive equations for the net gradients at the output δ_{o_t}, update δ_{u_t}, forget δ_{φ_t}, and hidden δ_{h_t} layers by computing the derivative of the error function E_{x_t} with respect to net_{o_t}, net_{u_t}, net_{φ_t} and net_{h_t}, the net inputs at the output, update, forget and hidden layers, respectively, at time t.
Figure 26.26. RNN architecture for Q1.

Q3. Derive the net gradient equations δat for backpropagation in an LSTM using vector derivatives of the error function Ex with respect to netta, where a ∈ {o,h,c,u,κ,φ,ω}. That is, for the output, hidden state, internal memory, and candidate update layers, as well as for the input, forget and output gate layers.
Q4. Show that with stride s, in the convolution of X ∈ R^{n×n×m} and W ∈ R^{k×k×m}, there are (t + 1) × (t + 1) possible windows, where t = ⌊(n − k)/s⌋.
Q5. Consider a CNN that takes as input substrings of length 8 over the alphabet 􏰂 = {A,B,C,D}, and predicts whether the class is positive (P) or negative (N). Assume we use two 1D filters with window size k = 4 and stride s = 2, with weights w1 = (1, 0, 1, 1)T and w2 = (1, 1, 0, 1)T , respectively. Assume bias is zero. The 1D convolution is followed by a max-pooling operation with window size k = 2 and stride s = 1. All of the convolutions use linear activation function, and there is no padding used. Finally, all of the neurons after the max-pooling feed into a single output neuron that uses a sigmoid activation function with bias −15 and all weights set to 1. Given input sequence ABACABCD, show the neuron values after the convolution and max-pooling layer, and the final output. Use one-hot encoding for the input alphabet in lexicographic order, and let P = 1 and N = 0. What is the final output on the given input sequence.
Q6. Given the 3D tensor X in Figure 26.27(a), and using padding p = 1 and stride s = 2, answer the following questions:
(a) Compute the convolution of X with the 3 × 3 × 3 mask W in Figure 26.27(b).
(b) Compute the max-pooling output using a 3 × 3 × 1 mask W of all ones.
Figure 26.27. 3D tensor for Q6: (a) input tensor X (5 × 5 × 3) and (b) 3D filter W (3 × 3 × 3), with channels listed separately.

(a) Input tensor X:

    channel 1        channel 2        channel 3
    1 1 3 1 1        1 4 3 3 2        1 2 4 3 2
    2 4 4 1 2        2 2 2 2 2        2 2 4 2 1
    2 2 3 1 1        1 1 4 1 1        1 1 3 2 4
    1 2 2 3 1        1 2 3 1 1        1 4 3 2 3
    4 1 1 2 2        1 4 1 1 2        1 1 1 4 4

(b) 3D filter W:

    channel 1    channel 2    channel 3
    1 0 0        0 0 1        1 1 1
    0 1 0        0 1 0        1 1 1
    0 0 1        1 0 0        1 1 1

CHAPTER 27 Regression Evaluation
Given a set of predictor attributes or independent variables X_1, X_2, ···, X_d, and given the response attribute Y, the goal of regression is to learn a function f, such that

$$Y = f(X_1, X_2, \cdots, X_d) + \varepsilon = f(\mathbf{X}) + \varepsilon$$
where X = (X1,X2,··· ,Xd)T is the d-dimensional multivariate random variable comprised of the predictor variables. Here, the random variable ε denotes the inherent error in the response that is not explained by the linear model.
When estimating the regression function f , we make assumptions about the form of f , for example, whether f is a linear or some non-linear function of the parameters of the model. For example, in linear regression, we assume that
f (X) = β + ω1 · X1 + . . . + ωd · Xd
where β is the bias term, and ω_i is the regression coefficient for attribute X_i. The model is linear, since f(X) is a linear function of the parameters β, ω_1, ···, ω_d.
Once we have estimated the bias and coefficients, we need to formulate a probabilistic model of regression to evaluate the learned model in terms of goodness of fit, confidence intervals for the parameters, and to test for the regression effects, namely whether X really helps in predicting Y. In particular, we assume that even if the value of X has been fixed, there can still be uncertainty in the response Y. Further, we will assume that the error ε is independent of X and follows a normal (or Gaussian) distribution with mean μ = 0 and variance σ2, that is, we assume that the errors are independent and identically distributed with zero mean and fixed variance.
The probabilistic regression model comprises two components — the deterministic component comprising the observed predictor attributes, and the random error component comprising the error term, which is assumed to be independent of the predictor attributes. With the assumptions on the form of the regression function f , and the error variable ε, we can answer several interesting questions such as how good of a fit is the estimated model to the input data? How close are the estimated bias and coefficients to the true but unknown parameters, and so on. We consider such questions in this chapter.
27.1 UNIVARIATE REGRESSION
In univariate regression we have one dependent attribute Y and one independent attribute X, and we assume that the true relationship can be modeled as a linear function
Y=f(X)+ε=β+ω·X+ε
where ω is the slope of the best fitting line and β is its intercept, and ε is the random
error variable that follows a normal distribution with mean μ = 0 and variance σ 2 .
Mean and Variance of Response Variable
Consider a fixed value x for the independent variable X. The expected value of the response variable Y given x is
E[Y|X=x]=E[β+ω·x+ε]=β+ω·x+E[ε]=β+ω·x
The last step follows from our assumption that E[ε] = μ = 0. Also, since x is assumed to be fixed, and β and ω are constants, the expected value E[β + ω · x] = β + ω · x. Next, consider the variance of Y given X = x, we have
var(Y|X=x)=var(β+ω·x+ε)=var(β+ω·x)+var(ε)=0+σ2 =σ2
Here var(β + ω · x) = 0, since β, ω and x are all constants. Thus, given X = x, the response variable Y follows a normal distribution with mean E[Y|X = x] = β + ω · x, and variance var(Y|X = x) = σ2.
Estimated Parameters
The true parameters β, ω and σ2 are all unknown, and have to be estimated from the training data D comprising n points xi and corresponding response values yi, for i = 1,2,··· ,n. Let b and w denote the estimated bias and weight terms; we can then make predictions for any given value xi as follows:
yˆ i = b + w · x i
The estimated bias b and weight w are obtained by minimizing the sum of squared
errors (SSE), given as

$$\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - b - w\cdot x_i)^2$$

with the least squares estimates given as (see Eq. (23.11) and Eq. (23.8))

$$w = \frac{\sigma_{XY}}{\sigma_X^2} \qquad b = \mu_Y - w\cdot\mu_X$$

27.1.1 Estimating Variance (σ²)

According to our model, the variance in prediction is entirely due to the random error term ε. We can estimate this variance by considering the predicted value ŷ_i and its deviation from the true response y_i, that is, by looking at the residual error

$$\epsilon_i = y_i - \hat{y}_i$$

One of the properties of the estimated values b and w is that the sum of residual errors is zero, since

$$\sum_{i=1}^n \epsilon_i = \sum_{i=1}^n\big(y_i - b - w\cdot x_i\big) = \sum_{i=1}^n\big(y_i - \mu_Y + w\cdot\mu_X - w\cdot x_i\big)
= \Big(\sum_{i=1}^n y_i\Big) - n\cdot\mu_Y + w\cdot\Big(n\,\mu_X - \sum_{i=1}^n x_i\Big)
= n\cdot\mu_Y - n\cdot\mu_Y + w\cdot(n\cdot\mu_X - n\cdot\mu_X) = 0 \tag{27.1}$$

Thus, the expected value of ε_i is zero, since $E[\epsilon_i] = \frac{1}{n}\sum_{i=1}^n \epsilon_i = 0$. In other words, the sum of the errors above and below the regression line cancel each other.
Next, the estimated variance σ̂² is given as

$$\hat{\sigma}^2 = \mathrm{var}(\epsilon_i) = \frac{1}{n-2}\cdot\sum_{i=1}^n\big(\epsilon_i - E[\epsilon_i]\big)^2 = \frac{1}{n-2}\cdot\sum_{i=1}^n \epsilon_i^2 = \frac{1}{n-2}\cdot\sum_{i=1}^n (y_i - \hat{y}_i)^2 \tag{27.2}$$

Thus, the estimated variance is

$$\hat{\sigma}^2 = \frac{\mathrm{SSE}}{n-2}$$

We divide by n − 2 to get an unbiased estimate, since n − 2 is the number of degrees of freedom for estimating SSE (see Figure 27.2). In other words, out of the n training points, we need to estimate two parameters, ω and β, with n − 2 remaining degrees of freedom.

The square root of the variance is called the standard error of regression

$$\hat{\sigma} = \sqrt{\frac{\mathrm{SSE}}{n-2}} \tag{27.3}$$

27.1.2 Goodness of Fit

The SSE value gives an indication of how much of the variation in Y cannot be explained by our linear model. We can compare this value with the total scatter, also called total sum of squares, for the dependent variable Y, defined as

$$\mathrm{TSS} = \sum_{i=1}^n (y_i - \mu_Y)^2$$

Notice that in TSS, we compute the squared deviations of the true response from the true mean for Y, whereas, in SSE, we compute the squared deviations of the true response from the predicted response.

The total scatter can be decomposed into two components by adding and subtracting ŷ_i as follows

$$\mathrm{TSS} = \sum_{i=1}^n (y_i - \mu_Y)^2 = \sum_{i=1}^n (y_i - \hat{y}_i + \hat{y}_i - \mu_Y)^2
= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n(\hat{y}_i - \mu_Y)^2 + 2\sum_{i=1}^n (y_i - \hat{y}_i)\cdot(\hat{y}_i - \mu_Y)
= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n(\hat{y}_i - \mu_Y)^2 = \mathrm{SSE} + \mathrm{RSS}$$

where we use the fact that $\sum_{i=1}^n (y_i - \hat{y}_i)\cdot(\hat{y}_i - \mu_Y) = 0$, and

$$\mathrm{RSS} = \sum_{i=1}^n (\hat{y}_i - \mu_Y)^2$$

is a new term called the regression sum of squares that measures the squared deviation of the predictions from the true mean. TSS can thus be decomposed into two parts: SSE, which is the amount of variation not explained by the model, and RSS, which is the amount of variance explained by the model. Therefore, the fraction of the variation left unexplained by the model is given by the ratio SSE/TSS. Conversely, the fraction of the variation that is explained by the model, called the coefficient of determination or simply the R² statistic, is given as

$$R^2 = \frac{\mathrm{TSS} - \mathrm{SSE}}{\mathrm{TSS}} = 1 - \frac{\mathrm{SSE}}{\mathrm{TSS}} = \frac{\mathrm{RSS}}{\mathrm{TSS}} \tag{27.4}$$

The higher the R² statistic, the better the estimated model, with R² ∈ [0, 1].
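These quantities are straightforward to compute; the following is a hedged NumPy sketch for the univariate case (the function name is ours):

    import numpy as np

    def univariate_fit_stats(x, y):
        # least squares fit and goodness-of-fit summary for y = b + w*x
        w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
        b = y.mean() - w * x.mean()
        yhat = b + w * x
        sse = np.sum((y - yhat)**2)
        tss = np.sum((y - y.mean())**2)
        rss = np.sum((yhat - y.mean())**2)
        r2 = 1.0 - sse / tss            # equals rss / tss, Eq. (27.4)
        sigma2 = sse / (len(x) - 2)     # estimated variance, Eq. (27.2)
        return w, b, sse, tss, rss, r2, sigma2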
Example 27.1 (Variance and Goodness of Fit). Consider Example 23.1 that shows the regression of petal length (X; the predictor variable) on petal width (Y; the response variable) for the Iris dataset. Figure 27.1 shows the scatterplot between the two attributes. There are a total of n = 150 data points. The least squares estimates for the bias and regression coefficients are as follows
$$w = 0.4164 \qquad b = -0.3665$$

The SSE value is given as

$$\mathrm{SSE} = \sum_{i=1}^{150}\epsilon_i^2 = \sum_{i=1}^{150}(y_i - \hat{y}_i)^2 = 6.343$$

Thus, the estimated variance and standard error of regression are given as

$$\hat{\sigma}^2 = \frac{\mathrm{SSE}}{n-2} = \frac{6.343}{148} = 4.286\times 10^{-2} \qquad
\hat{\sigma} = \sqrt{\frac{\mathrm{SSE}}{n-2}} = \sqrt{4.286\times 10^{-2}} = 0.207$$

For the bivariate Iris data, the values of TSS and RSS are given as

$$\mathrm{TSS} = 86.78 \qquad \mathrm{RSS} = 80.436$$

We can observe that TSS = SSE + RSS. The fraction of variance explained by the model, that is, the R² value, is given as

$$R^2 = \frac{\mathrm{RSS}}{\mathrm{TSS}} = \frac{80.436}{86.78} = 0.927$$

This indicates a very good fit of the linear model.

Figure 27.1. Scatterplot: petal length (X) versus petal width (Y). Mean point shown as a black circle.

Geometry of Goodness of Fit

Recall that Y can be decomposed into two orthogonal parts, as illustrated in Figure 27.2(a):

$$\mathbf{Y} = \widehat{\mathbf{Y}} + \boldsymbol{\epsilon}$$

where Ŷ is the projection of Y onto the subspace spanned by {1, X}. Using the fact that this subspace is the same as that spanned by the orthogonal vectors {1, X̄}, with X̄ = X − μ_X · 1, we can further decompose Ŷ as follows

$$\widehat{\mathbf{Y}} = \mathrm{proj}_{\mathbf{1}}(\mathbf{Y})\cdot\mathbf{1} + \mathrm{proj}_{\overline{\mathbf{X}}}(\mathbf{Y})\cdot\overline{\mathbf{X}} = \mu_Y\cdot\mathbf{1} + \Big(\frac{\mathbf{Y}^T\overline{\mathbf{X}}}{\overline{\mathbf{X}}^T\overline{\mathbf{X}}}\Big)\cdot\overline{\mathbf{X}} = \mu_Y\cdot\mathbf{1} + w\cdot\overline{\mathbf{X}} \tag{27.5}$$

Likewise, the vectors Y and Ŷ can be centered by subtracting their projections along the vector 1

$$\overline{\mathbf{Y}} = \mathbf{Y} - \mu_Y\cdot\mathbf{1} \qquad \overline{\widehat{\mathbf{Y}}} = \widehat{\mathbf{Y}} - \mu_Y\cdot\mathbf{1} = w\cdot\overline{\mathbf{X}}$$

where the last step follows from Eq. (27.5). The centered vectors Ȳ, Ȳ̂ and X̄ all lie in the (n − 1)-dimensional subspace orthogonal to the vector 1, as illustrated in Figure 27.2(b).
Figure 27.2. Geometry of univariate regression: (a) uncentered and (b) centered vectors. The vector space that is the complement of 1 has dimensionality n − 1. The error space (containing the vector ε) is orthogonal to X̄, and has dimensionality n − 2, which also specifies the degrees of freedom for the estimated variance σ̂².
In this subspace, the centered vectors Ȳ and Ȳ̂, and the error vector ε, form a right triangle, since Ȳ̂ is the orthogonal projection of Ȳ onto the vector X̄. Noting that ε = Y − Ŷ = Ȳ − Ȳ̂, by the Pythagoras theorem, we have

$$\big\|\overline{\mathbf{Y}}\big\|^2 = \big\|\overline{\widehat{\mathbf{Y}}}\big\|^2 + \|\boldsymbol{\epsilon}\|^2 = \big\|\overline{\widehat{\mathbf{Y}}}\big\|^2 + \big\|\overline{\mathbf{Y}} - \overline{\widehat{\mathbf{Y}}}\big\|^2$$

This equation is equivalent to the decomposition of the total scatter, TSS, into the sum of squared errors, SSE, and the regression sum of squares, RSS. To see this, note that the total scatter, TSS, is defined as follows:

$$\mathrm{TSS} = \sum_{i=1}^n (y_i - \mu_Y)^2 = \|\mathbf{Y} - \mu_Y\cdot\mathbf{1}\|^2 = \big\|\overline{\mathbf{Y}}\big\|^2$$

The regression sum of squares, RSS, is defined as

$$\mathrm{RSS} = \sum_{i=1}^n (\hat{y}_i - \mu_Y)^2 = \big\|\widehat{\mathbf{Y}} - \mu_Y\cdot\mathbf{1}\big\|^2 = \big\|\overline{\widehat{\mathbf{Y}}}\big\|^2$$

Finally, the sum of squared errors, SSE, is defined as

$$\mathrm{SSE} = \|\boldsymbol{\epsilon}\|^2 = \big\|\mathbf{Y} - \widehat{\mathbf{Y}}\big\|^2$$

Thus, the geometry of univariate regression makes it evident that

$$\big\|\overline{\mathbf{Y}}\big\|^2 = \big\|\overline{\widehat{\mathbf{Y}}}\big\|^2 + \big\|\overline{\mathbf{Y}} - \overline{\widehat{\mathbf{Y}}}\big\|^2
\;\iff\; \|\mathbf{Y} - \mu_Y\cdot\mathbf{1}\|^2 = \big\|\widehat{\mathbf{Y}} - \mu_Y\cdot\mathbf{1}\big\|^2 + \big\|\mathbf{Y} - \widehat{\mathbf{Y}}\big\|^2
\;\iff\; \mathrm{TSS} = \mathrm{RSS} + \mathrm{SSE}$$

Notice further that since Ȳ, Ȳ̂ and ε form a right triangle, the cosine of the angle between Ȳ and Ȳ̂ is given as the ratio of the base to the hypotenuse. On the other hand, via Eq. (2.30), the cosine of the angle is also the correlation between Y and Ŷ, denoted ρ_{YŶ}. Thus, we have:

$$\rho_{Y\widehat{Y}} = \cos\theta = \frac{\big\|\overline{\widehat{\mathbf{Y}}}\big\|}{\big\|\overline{\mathbf{Y}}\big\|} \qquad\text{or equivalently}\qquad \big\|\overline{\widehat{\mathbf{Y}}}\big\| = \rho_{Y\widehat{Y}}\cdot\big\|\overline{\mathbf{Y}}\big\|$$

Note that, whereas |ρ_{YŶ}| ≤ 1, due to the projection operation, the angle between Ȳ and Ȳ̂ is always less than or equal to 90°, which means that ρ_{YŶ} ∈ [0,1] for univariate regression. Thus, the predicted response vector Ȳ̂ is smaller than the true response vector Ȳ by an amount equal to the correlation between them. Furthermore, by Eq. (27.4), the coefficient of determination is the same as the squared correlation between Y and Ŷ. We can observe that

$$R^2 = \frac{\mathrm{RSS}}{\mathrm{TSS}} = \frac{\big\|\overline{\widehat{\mathbf{Y}}}\big\|^2}{\big\|\overline{\mathbf{Y}}\big\|^2} = \rho^2_{Y\widehat{Y}}$$
Example 27.2 (Geometry of Goodness of Fit). Continuing Example 27.1, the correlation between Y and Ŷ is given as

$$\rho_{Y\widehat{Y}} = \cos\theta = \frac{\big\|\overline{\widehat{\mathbf{Y}}}\big\|}{\big\|\overline{\mathbf{Y}}\big\|} = \frac{\sqrt{\mathrm{RSS}}}{\sqrt{\mathrm{TSS}}} = \frac{\sqrt{80.436}}{\sqrt{86.78}} = \frac{8.969}{9.316} = 0.963$$

The square of the correlation is equivalent to R², since $\rho^2_{Y\widehat{Y}} = (0.963)^2 = 0.927$. The angle between Ȳ and Ȳ̂ is given as

$$\theta = \cos^{-1}(0.963) = 15.7°$$

The relatively small angle indicates a good linear fit.
27.1.3 Inference about Regression Coefficient and Bias Term
The estimated values of the bias and regression coefficient, b and w, are only point estimates for the true parameters β and ω. To obtain confidence intervals for these parameters, we treat each y_i as a random variable for the response given the corresponding fixed value x_i. These random variables are all independent and identically distributed as Y, with expected value β + ω·x_i and variance σ². On the other hand, the x_i values are fixed a priori and therefore μ_X and σ_X² are also fixed values.

We can now treat b and w as random variables, with

$$b = \mu_Y - w\cdot\mu_X \qquad
w = \frac{\sum_{i=1}^n (x_i - \mu_X)(y_i - \mu_Y)}{\sum_{i=1}^n (x_i - \mu_X)^2} = \frac{1}{s_X}\sum_{i=1}^n (x_i - \mu_X)\cdot y_i = \sum_{i=1}^n c_i\cdot y_i \tag{27.6}$$

where c_i is a constant (since x_i is fixed), given as

$$c_i = \frac{x_i - \mu_X}{s_X}$$
and $s_X = \sum_{i=1}^n (x_i - \mu_X)^2$ is the total scatter for X, defined as the sum of squared deviations of x_i from its mean μ_X. We also use the fact that

$$\sum_{i=1}^n (x_i - \mu_X)\cdot\mu_Y = \mu_Y\cdot\sum_{i=1}^n (x_i - \mu_X) = 0$$

Note that

$$\sum_{i=1}^n c_i = \frac{1}{s_X}\sum_{i=1}^n (x_i - \mu_X) = 0$$
Mean and Variance of Regression Coefficient
The expected value of w is given as

$$E[w] = E\Big[\sum_{i=1}^n c_i\,y_i\Big] = \sum_{i=1}^n c_i\cdot E[y_i] = \sum_{i=1}^n c_i(\beta + \omega\cdot x_i)
= \beta\sum_{i=1}^n c_i + \omega\cdot\sum_{i=1}^n c_i\cdot x_i = \frac{\omega}{s_X}\cdot\sum_{i=1}^n(x_i - \mu_X)\cdot x_i = \frac{\omega}{s_X}\cdot s_X = \omega$$
which follows from the observation that $\sum_{i=1}^n c_i = 0$, and further that

$$s_X = \sum_{i=1}^n (x_i - \mu_X)^2 = \Big(\sum_{i=1}^n x_i^2\Big) - n\cdot\mu_X^2 = \sum_{i=1}^n (x_i - \mu_X)\cdot x_i \tag{27.7}$$

Thus, w is an unbiased estimator for the true parameter ω.

Using the fact that the variables y_i are independent and identically distributed as Y, we can compute the variance of w as follows

$$\mathrm{var}(w) = \mathrm{var}\Big(\sum_{i=1}^n c_i\cdot y_i\Big) = \sum_{i=1}^n c_i^2\cdot\mathrm{var}(y_i) = \sigma^2\cdot\sum_{i=1}^n c_i^2 = \frac{\sigma^2}{s_X} \tag{27.8}$$

since c_i is a constant, var(y_i) = σ², and further

$$\sum_{i=1}^n c_i^2 = \frac{1}{s_X^2}\cdot\sum_{i=1}^n (x_i - \mu_X)^2 = \frac{s_X}{s_X^2} = \frac{1}{s_X}$$
The standard deviation of w, also called the standard error of w, is given as

$$\mathrm{se}(w) = \sqrt{\mathrm{var}(w)} = \frac{\sigma}{\sqrt{s_X}}$$

Mean and Variance of Bias Term

The expected value of b is given as

$$E[b] = E[\mu_Y - w\cdot\mu_X] = E\Big[\frac{1}{n}\sum_{i=1}^n y_i - w\cdot\mu_X\Big]
= \frac{1}{n}\cdot\Big(\sum_{i=1}^n E[y_i]\Big) - \mu_X\cdot E[w] = \frac{1}{n}\Big(\sum_{i=1}^n(\beta + \omega\cdot x_i)\Big) - \omega\cdot\mu_X
= \beta + \omega\cdot\mu_X - \omega\cdot\mu_X = \beta$$
Thus, b is an unbiased estimator for the true parameter β.
Using the observation that all yi are independent, the variance of the bias term can
be computed as follows
$$\mathrm{var}(b) = \mathrm{var}(\mu_Y - w\cdot\mu_X) = \mathrm{var}\Big(\frac{1}{n}\sum_{i=1}^n y_i\Big) + \mathrm{var}(\mu_X\cdot w)
= \frac{1}{n^2}\cdot n\,\sigma^2 + \mu_X^2\cdot\mathrm{var}(w) = \frac{1}{n}\cdot\sigma^2 + \mu_X^2\cdot\frac{\sigma^2}{s_X}
= \Big(\frac{1}{n} + \frac{\mu_X^2}{s_X}\Big)\cdot\sigma^2$$
where we used the fact that for any two random variables A and B, we have var(A − B) = var(A) + var(B). That is, variances of A and B add, even though we are computing the variance of A − B.
The standard deviation of b, also called the standard error of b, is given as
$$\mathrm{se}(b) = \sqrt{\mathrm{var}(b)} = \sigma\cdot\sqrt{\frac{1}{n} + \frac{\mu_X^2}{s_X}}$$
Covariance of Regression Coefficient and Bias
We can also compute the covariance of w and b, as follows

$$\begin{aligned}
\mathrm{cov}(w,b) &= E[w\cdot b] - E[w]\cdot E[b] = E[(\mu_Y - w\cdot\mu_X)\cdot w] - \omega\cdot\beta\\
&= \mu_Y\cdot E[w] - \mu_X\cdot E[w^2] - \omega\cdot\beta = \mu_Y\cdot\omega - \mu_X\cdot\big(\mathrm{var}(w) + E[w]^2\big) - \omega\cdot\beta\\
&= \mu_Y\cdot\omega - \mu_X\cdot\Big(\frac{\sigma^2}{s_X} + \omega^2\Big) - \omega\cdot\beta = \omega\cdot\underbrace{(\mu_Y - \omega\cdot\mu_X)}_{\beta} - \frac{\mu_X\cdot\sigma^2}{s_X} - \omega\cdot\beta\\
&= -\frac{\mu_X\cdot\sigma^2}{s_X}
\end{aligned} \tag{27.9}$$

where we use the fact that var(w) = E[w²] − E[w]², which implies E[w²] = var(w) + E[w]², and further that μ_Y − ω·μ_X = β.

Confidence Intervals
Since the yi variables are all normally distributed, their linear combination also follows a normal distribution. Thus, w follows a normal distribution with mean ω and variance σ2/sX. Likewise, b follows a normal distribution with mean β and variance (1/n+μ2X/sX)·σ2.
Since the true variance σ 2 is unknown, we use the estimated variance σˆ 2 (from Eq. (27.2)), to define the standardized variables Zw and Zb as follows
$$Z_w = \frac{w - E[w]}{\mathrm{se}(w)} = \frac{w - \omega}{\hat{\sigma}/\sqrt{s_X}} \qquad
Z_b = \frac{b - E[b]}{\mathrm{se}(b)} = \frac{b - \beta}{\hat{\sigma}\,\sqrt{1/n + \mu_X^2/s_X}} \tag{27.10}$$

These variables follow the Student's t distribution with n − 2 degrees of freedom. Let T_{n−2} denote the cumulative t distribution with n − 2 degrees of freedom, and let t_{α/2} denote the critical value of T_{n−2} that encompasses α/2 of the probability mass in the right tail. That is,

$$P\big(Z \ge t_{\alpha/2}\big) = \frac{\alpha}{2} \quad\text{or equivalently}\quad T_{n-2}(t_{\alpha/2}) = 1 - \frac{\alpha}{2}$$
Since the t distribution is symmetric, we have
$$P\big(Z \ge -t_{\alpha/2}\big) = 1 - \frac{\alpha}{2} \quad\text{or equivalently}\quad T_{n-2}(-t_{\alpha/2}) = \frac{\alpha}{2}$$
Given confidence level 1 − α, i.e., significance level α ∈ (0,1), the 100(1 − α)%
confidence intervals for the true values, ω and β, are therefore as follows

$$P\big(w - t_{\alpha/2}\cdot\mathrm{se}(w) \;\le\; \omega \;\le\; w + t_{\alpha/2}\cdot\mathrm{se}(w)\big) = 1 - \alpha$$
$$P\big(b - t_{\alpha/2}\cdot\mathrm{se}(b) \;\le\; \beta \;\le\; b + t_{\alpha/2}\cdot\mathrm{se}(b)\big) = 1 - \alpha$$
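Assuming SciPy is available, these confidence intervals can be computed as in the following sketch (the function name is ours):

    import numpy as np
    from scipy.stats import t

    def confidence_intervals(x, y, alpha=0.05):
        n = len(x)
        sX = np.sum((x - x.mean())**2)
        w = np.sum((x - x.mean()) * (y - y.mean())) / sX
        b = y.mean() - w * x.mean()
        sse = np.sum((y - (b + w * x))**2)
        sigma = np.sqrt(sse / (n - 2))                   # standard error of regression
        se_w = sigma / np.sqrt(sX)
        se_b = sigma * np.sqrt(1.0 / n + x.mean()**2 / sX)
        t_crit = t.ppf(1 - alpha / 2, df=n - 2)          # critical value t_{alpha/2}
        ci_w = (w - t_crit * se_w, w + t_crit * se_w)
        ci_b = (b - t_crit * se_b, b + t_crit * se_b)
        return ci_w, ci_b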
Example 27.3 (Confidence Intervals). Continuing with Example 27.1, we consider the variance of the bias and regression coefficient, and their covariance. However, since we do not know the true variance σ2, we use the estimated variance and the standard error for the Iris data
$$\hat{\sigma}^2 = \frac{\mathrm{SSE}}{n-2} = 4.286\times 10^{-2} \qquad \hat{\sigma} = \sqrt{4.286\times 10^{-2}} = 0.207$$

Furthermore, we have

$$\mu_X = 3.7587 \qquad s_X = 463.864$$

Therefore, the estimated variance and standard error of w are given as

$$\mathrm{var}(w) = \frac{\hat{\sigma}^2}{s_X} = \frac{4.286\times 10^{-2}}{463.864} = 9.24\times 10^{-5} \qquad
\mathrm{se}(w) = \sqrt{\mathrm{var}(w)} = \sqrt{9.24\times 10^{-5}} = 9.613\times 10^{-3}$$

The estimated variance and standard error of b are

$$\mathrm{var}(b) = \Big(\frac{1}{n} + \frac{\mu_X^2}{s_X}\Big)\cdot\hat{\sigma}^2 = \Big(\frac{1}{150} + \frac{(3.759)^2}{463.864}\Big)\cdot\big(4.286\times 10^{-2}\big) = \big(3.712\times 10^{-2}\big)\cdot\big(4.286\times 10^{-2}\big) = 1.591\times 10^{-3}$$

$$\mathrm{se}(b) = \sqrt{\mathrm{var}(b)} = \sqrt{1.591\times 10^{-3}} = 3.989\times 10^{-2}$$

and the covariance between b and w is

$$\mathrm{cov}(w,b) = -\frac{\mu_X\cdot\hat{\sigma}^2}{s_X} = -\frac{3.7587\cdot(4.286\times 10^{-2})}{463.864} = -3.473\times 10^{-4}$$

For the confidence interval, we use a confidence level of 1 − α = 0.95 (or α = 0.05). The critical value of the t-distribution, with n − 2 = 148 degrees of freedom, that encompasses α/2 = 0.025 fraction of the probability mass in the right tail is t_{α/2} = 1.976. We have

$$t_{\alpha/2}\cdot\mathrm{se}(w) = 1.976\cdot(9.613\times 10^{-3}) = 0.019$$

Therefore, the 95% confidence interval for the true value, ω, of the regression coefficient is given as

$$\big(w - t_{\alpha/2}\cdot\mathrm{se}(w),\; w + t_{\alpha/2}\cdot\mathrm{se}(w)\big) = (0.4164 - 0.019,\; 0.4164 + 0.019) = (0.397,\; 0.435)$$

Likewise, we have

$$t_{\alpha/2}\cdot\mathrm{se}(b) = 1.976\cdot(3.989\times 10^{-2}) = 0.079$$

Therefore, the 95% confidence interval for the true bias term, β, is

$$\big(b - t_{\alpha/2}\cdot\mathrm{se}(b),\; b + t_{\alpha/2}\cdot\mathrm{se}(b)\big) = (-0.3665 - 0.079,\; -0.3665 + 0.079) = (-0.446,\; -0.288)$$
27.1.4 Hypothesis Testing for Regression Effects
One of the key questions in regression is whether X predicts the response Y. In the regression model, Y depends on X through the parameter ω, therefore, we can check for the regression effect by assuming the null hypothesis H0 that ω = 0, with the alternative hypothesis Ha being ω ̸= 0:
H0 : ω = 0 Ha : ω ̸= 0
When ω = 0, the response Y depends only on the bias β and the random error ε. In other words, X provides no information about the response variable Y.

Now consider the standardized variable Z_w from Eq. (27.10). Under the null
hypothesis we have E[w] = ω = 0. Thus,
$$Z_w = \frac{w - E[w]}{\mathrm{se}(w)} = \frac{w}{\hat{\sigma}/\sqrt{s_X}} \tag{27.11}$$

We can therefore compute the p-value for the Z_w statistic using a two-tailed test via the t distribution with n − 2 degrees of freedom. Given significance level α (e.g., α = 0.01), we reject the null hypothesis if the p-value is below α. In this case, we accept the alternative hypothesis that the estimated value of the slope parameter is significantly different from zero.

We can also define the f-statistic, which is the ratio of the regression sum of squares, RSS, to the estimated variance, given as

$$f = \frac{\mathrm{RSS}}{\hat{\sigma}^2} = \frac{\sum_{i=1}^n(\hat{y}_i - \mu_Y)^2}{\sum_{i=1}^n(y_i - \hat{y}_i)^2\big/(n-2)} \tag{27.12}$$

Under the null hypothesis, one can show that E[RSS] = σ². Further, it is also true that E[σ̂²] = σ².
Thus, under the null hypothesis the f -statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then E[RSS] ≥ σ2, resulting in a larger f value. In fact, the f -statistic follows a F -distribution with 1, (n − 2) degrees of freedom (for the numerator and denominator, respectively); therefore, we can reject the null hypothesis that w = 0 if the p-value of f is less than the significance level α, say 0.01.
Interestingly the f -test is equivalent to the t -test since Zw2 = f . We can see this as follows:
(27.12)
1 􏰪n
f = σˆ2 ·
i=1
1 􏰪n
= σˆ2 ·
i=1
1 􏰪n
(yˆi −μY)2 = σˆ2 · (b+w·xi −μY)2
i=1
(μY −w·μX +w·xi −μY)2 = σˆ2 ·
(n−2)
1 􏰪n w 2 · s X =σˆ2·w2· (xi−μX)2= σˆ2
i=1 w2 2 = σˆ 2 / s X = Z w
Test for Bias Term
1 􏰪n 􏰄 i=1
􏰅 2
w·(xi −μX)
Note that we can also test if the bias value is statistically significant or not by setting up the null hypothesis, H0 : β = 0, versus the alternative hypothesis Ha : β ̸= 0. We then

evaluate the Z_b statistic (see Eq. (27.10)) under the null hypothesis:

$$Z_b = \frac{b - E[b]}{se(b)} = \frac{b}{\hat{\sigma}\cdot\sqrt{1/n + \mu_X^2/s_X}} \tag{27.13}$$
since, under the null hypothesis E[b] = β = 0. Using a two-tailed t-test with n − 2 degrees of freedom, we can compute the p-value of Zb. We reject the null hypothesis if this value is smaller than the significance level α.
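The t-tests and the equivalent F-test can be sketched in code as follows. This is a minimal illustration under the same assumptions as before (x and y are NumPy arrays with the predictor and response values); the function name and returned dictionary keys are illustrative, not part of the text.

```python
import numpy as np
from scipy import stats

def regression_effect_tests(x, y):
    """t-tests for slope and bias, plus the F-test; note that Z_w**2 equals f."""
    n = len(x)
    mu_X, mu_Y = x.mean(), y.mean()
    s_X = np.sum((x - mu_X) ** 2)
    w = np.sum((x - mu_X) * (y - mu_Y)) / s_X
    b = mu_Y - w * mu_X
    y_hat = b + w * x
    sse = np.sum((y - y_hat) ** 2)
    rss = np.sum((y_hat - mu_Y) ** 2)
    var_hat = sse / (n - 2)                      # estimated variance
    z_w = w / np.sqrt(var_hat / s_X)             # Eq. (27.11)
    z_b = b / np.sqrt(var_hat * (1.0 / n + mu_X ** 2 / s_X))   # Eq. (27.13)
    f = rss / var_hat                            # Eq. (27.12)
    p_w = 2 * stats.t.sf(abs(z_w), df=n - 2)     # two-tailed p-values
    p_b = 2 * stats.t.sf(abs(z_b), df=n - 2)
    p_f = stats.f.sf(f, 1, n - 2)                # F-test with (1, n-2) d.o.f.
    return {"Z_w": z_w, "p_w": p_w, "Z_b": z_b, "p_b": p_b, "f": f, "p_f": p_f}
```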
Example 27.4 (Hypothesis Testing). We continue with Example 27.3, but now we test for the regression effect. Under the null hypothesis we have ω = 0, further E[w] = ω = 0. Therefore, the standardized variable Zw is given as
$$Z_w = \frac{w - E[w]}{se(w)} = \frac{w}{se(w)} = \frac{0.4164}{9.613\times 10^{-3}} = 43.32$$
Using a two-tailed t -test with n − 2 degrees of freedom, we find that p-value(43.32) ≃ 0
Since this value is much less than the significance level α = 0.01, we conclude that observing such an extreme value of Zw is unlikely under the null hypothesis. Therefore, we reject the null hypothesis and accept the alternative hypothesis that ω̸=0.
Now consider the f-statistic; we have

$$f = \frac{\text{RSS}}{\hat{\sigma}^2} = \frac{80.436}{4.286\times 10^{-2}} = 1876.71$$
Using the F -distribution with (1, n − 2) degrees of freedom, we have
p-value(1876.71) ≃ 0
In other words, such a large value of the f -statistic is extremely rare, and we can reject the null hypothesis. We conclude that Y does indeed depend on X, since ω ̸= 0. Finally, we test whether the bias term is significant or not. Under the null
hypothesis H0 : β = 0, we have
$$Z_b = \frac{b}{se(b)} = \frac{-0.3665}{3.989\times 10^{-2}} = -9.188$$

Using the two-tailed t-test, we find
p-value(−9.188) = 3.35 × 10−16
It is clear that such an extreme Zb value is highly unlikely under the null hypothesis. Therefore, we accept the alternative hypothesis that Ha : β ̸= 0.

27.1.5 Standardized Residuals
Our assumption about the true errors εi is that they are normally distributed with mean μ = 0 and fixed variance σ2. After fitting the linear model, we can examine how well the residual errors ǫi = yi − yˆi satisfy the normality assumption. For this, we need to compute the mean and variance of ǫi , by treating it as a random variable.
The mean of ǫi is given as
$$E[\epsilon_i] = E[y_i - \hat{y}_i] = E[y_i] - E[\hat{y}_i] = \beta + \omega\cdot x_i - E[b + w\cdot x_i] = \beta + \omega\cdot x_i - (\beta + \omega\cdot x_i) = 0$$
which follows from the fact that E[b] = β and E[w] = ω.
To compute the variance of ǫᵢ, we will express it as a linear combination of the yⱼ variables, by noting that

$$w = \frac{1}{s_X}\Bigl(\sum_{j=1}^n x_j y_j - n\cdot\mu_X\cdot\mu_Y\Bigr) = \frac{1}{s_X}\Bigl(\sum_{j=1}^n x_j y_j - \mu_X\cdot\sum_{j=1}^n y_j\Bigr) = \sum_{j=1}^n \frac{(x_j - \mu_X)}{s_X}\cdot y_j$$

$$b = \mu_Y - w\cdot\mu_X = \Bigl(\sum_{j=1}^n \frac{1}{n}\cdot y_j\Bigr) - w\cdot\mu_X$$

Therefore, we can express ǫᵢ as follows:

$$\begin{aligned}
\epsilon_i = y_i - \hat{y}_i = y_i - b - w\cdot x_i
&= y_i - \sum_{j=1}^n \frac{1}{n}\,y_j + w\cdot\mu_X - w\cdot x_i\\
&= y_i - \sum_{j=1}^n \frac{1}{n}\,y_j - (x_i - \mu_X)\cdot w\\
&= y_i - \sum_{j=1}^n \frac{1}{n}\,y_j - \sum_{j=1}^n \frac{(x_i - \mu_X)\cdot(x_j - \mu_X)}{s_X}\cdot y_j\\
&= \Bigl(1 - \frac{1}{n} - \frac{(x_i - \mu_X)^2}{s_X}\Bigr)\cdot y_i - \sum_{j\ne i}\Bigl(\frac{1}{n} + \frac{(x_i - \mu_X)\cdot(x_j - \mu_X)}{s_X}\Bigr)\cdot y_j
\end{aligned} \tag{27.14}$$

where we have separated yᵢ from the rest of the yⱼ's, so that all terms in the summation are independent. Define aⱼ as follows:

$$a_j = \Bigl(\frac{1}{n} + \frac{(x_i - \mu_X)\cdot(x_j - \mu_X)}{s_X}\Bigr)$$
Rewriting Eq. (27.14) in terms of aⱼ, we get

$$\begin{aligned}
var(\epsilon_i) &= var\Bigl((1-a_i)\cdot y_i - \sum_{j\ne i} a_j\cdot y_j\Bigr)\\
&= (1-a_i)^2\cdot var(y_i) + \sum_{j\ne i} a_j^2\cdot var(y_j) \qquad\qquad (27.15)\\
&= \sigma^2\cdot\Bigl(1 - 2a_i + a_i^2 + \sum_{j\ne i} a_j^2\Bigr), \quad\text{since } var(y_i) = var(y_j) = \sigma^2\\
&= \sigma^2\cdot\Bigl(1 - 2a_i + \sum_{j=1}^n a_j^2\Bigr) \qquad\qquad (27.16)
\end{aligned}$$

Consider the term $\sum_{j=1}^n a_j^2$; we have

$$\begin{aligned}
\sum_{j=1}^n a_j^2 &= \sum_{j=1}^n \Bigl(\frac{1}{n} + \frac{(x_i - \mu_X)\cdot(x_j - \mu_X)}{s_X}\Bigr)^2\\
&= \sum_{j=1}^n \Bigl(\frac{1}{n^2} + \frac{2\cdot(x_i-\mu_X)\cdot(x_j-\mu_X)}{n\cdot s_X} + \frac{(x_i-\mu_X)^2\cdot(x_j-\mu_X)^2}{s_X^2}\Bigr)\\
&= \frac{1}{n} + \frac{2\cdot(x_i-\mu_X)}{n\cdot s_X}\sum_{j=1}^n (x_j-\mu_X) + \frac{(x_i-\mu_X)^2}{s_X^2}\sum_{j=1}^n (x_j-\mu_X)^2
\end{aligned}$$

Since $\sum_{j=1}^n (x_j - \mu_X) = 0$ and $\sum_{j=1}^n (x_j - \mu_X)^2 = s_X$, we get

$$\sum_{j=1}^n a_j^2 = \frac{1}{n} + \frac{(x_i-\mu_X)^2}{s_X} \tag{27.17}$$

Plugging Equations (27.15) and (27.17) into Eq. (27.16), we get

$$var(\epsilon_i) = \sigma^2\cdot\Bigl(1 - \frac{2}{n} - \frac{2\cdot(x_i-\mu_X)^2}{s_X} + \frac{1}{n} + \frac{(x_i-\mu_X)^2}{s_X}\Bigr) = \sigma^2\cdot\Bigl(1 - \frac{1}{n} - \frac{(x_i-\mu_X)^2}{s_X}\Bigr)$$

We can now define the standardized residual ǫᵢ* by dividing ǫᵢ by its standard deviation after replacing σ² by its estimated value σ̂². That is,

$$\epsilon_i^* = \frac{\epsilon_i}{\sqrt{var(\epsilon_i)}} = \frac{\epsilon_i}{\hat{\sigma}\cdot\sqrt{1 - \frac{1}{n} - \frac{(x_i-\mu_X)^2}{s_X}}} \tag{27.18}$$

These standardized residuals should follow a standard normal distribution. We can thus plot the standardized residuals against the quantiles of a standard normal distribution, and check if the normality assumption holds. Significant deviations would indicate that our model assumptions may not be correct.
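A minimal sketch of this check in code is given below (not from the text); it assumes x and y are NumPy arrays of predictor and response values, and returns the sorted standardized residuals together with the matching normal quantiles used on the QQ plot. The function name is illustrative.

```python
import numpy as np
from scipy import stats

def standardized_residuals(x, y):
    """Standardized residuals (Eq. 27.18) and normal quantiles for a QQ plot."""
    n = len(x)
    mu_X, mu_Y = x.mean(), y.mean()
    s_X = np.sum((x - mu_X) ** 2)
    w = np.sum((x - mu_X) * (y - mu_Y)) / s_X
    b = mu_Y - w * mu_X
    resid = y - (b + w * x)                      # residual errors
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))
    scale = sigma_hat * np.sqrt(1 - 1.0 / n - (x - mu_X) ** 2 / s_X)
    eps_star = resid / scale                     # standardized residuals
    # normal quantiles q_i = F^{-1}((i - 0.5)/n), compared against sorted residuals
    q = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    return np.sort(eps_star), q
```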
Figure 27.3. Residual plots: (a) Quantile-Quantile scatter plot of normal quantiles versus standardized residuals. (b) Independent variable versus standardized residuals.

Example 27.5 (Standardized Residuals). Consider the Iris dataset from Example 27.1, with the predictor variable (petal length) and response variable (petal width), and n = 150. Figure 27.3(a) shows the quantile-quantile (QQ) plot. The y-axis is the list of standardized residuals sorted from the smallest to the largest. The x-axis is the list of the quantiles of the standard normal distribution for a sample of size n, defined as

$$Q = (q_1, q_2, \ldots, q_n)^T \qquad q_i = F^{-1}\Bigl(\frac{i - 0.5}{n}\Bigr)$$
where F is the cumulative distribution function (CDF), and F −1 is the inverse CDF or quantile function (see Eq. (2.2)) for the normal distribution. Thus, the Q values are also sorted in increasing order from smallest to largest. If the standardized residuals follow a normal distribution, then the QQ plot should follow a straight line. Figure 27.3(a) plots this perfect line for comparison. We can observe that the residuals are essentially normally distributed.
The plot of the independent variable X versus the standardized residuals is also instructive. We can see in Figure 27.3(b) that there is no particular trend or pattern to the residuals, and the residual values are concentrated along the mean value of 0, with the majority of the points being within two standard deviations of the mean, as expected if they were sampled from a normal distribution.
27.2 MULTIPLE REGRESSION
In multiple regression there are multiple independent attributes X1,X2,··· ,Xd and a single dependent or response attribute Y, and we assume that the true relationship can be modeled as a linear function
Y=β+ω1 ·X1 +ω2 ·X2 +…+ωd ·Xd +ε
where β is the intercept or bias term and ωᵢ is the regression coefficient for attribute Xᵢ. Recall that ωᵢ denotes the expected increase in Y with a unit increase in the value of Xᵢ, assuming all other variables are held fixed. We assume that ε is a random variable that
is normally distributed with mean μ = 0 and variance σ 2 . Further, we assume that the errors for different observations are all independent of each other, and consequently the observed responses are also independent.
Mean and Variance of Response Variable
Let X = (X1,X2,··· ,Xd)T ∈ Rd denote the multivariate random variable comprising the independent attributes. Let x = (x1,x2,··· ,xd)T be some fixed value of X, and let ω = (ω1,ω2,··· ,ωd)T. The expected response value is then given as

$$E[Y\,|\,X=x] = E[\beta + \omega_1 x_1 + \cdots + \omega_d x_d + \varepsilon] = E\Bigl[\beta + \sum_{i=1}^d \omega_i x_i\Bigr] + E[\varepsilon] = \beta + \omega_1 x_1 + \cdots + \omega_d x_d = \beta + \omega^T x$$

which follows from the assumption that E[ε] = 0. The variance of the response variable is given as

$$var(Y\,|\,X=x) = var\Bigl(\beta + \sum_{i=1}^d \omega_i x_i + \varepsilon\Bigr) = var\Bigl(\beta + \sum_{i=1}^d \omega_i x_i\Bigr) + var(\varepsilon) = 0 + \sigma^2 = \sigma^2$$

which follows from the assumption that all xᵢ are fixed a priori. Thus, we conclude that Y also follows a normal distribution with mean E[Y|x] = β + Σᵈᵢ₌₁ ωᵢ·xᵢ = β + ωᵀx and variance var(Y|x) = σ².
Estimated Parameters
The true parameters β, ω1,ω2,··· ,ωd and σ2 are all unknown, and have to be estimated from the training data D comprising n points xi and corresponding response values yi , for i = 1,2,··· ,n. We augment the data matrix by adding a new column X0 with all values fixed at 1, that is, X0 = 1. Thus, the augmented data D ̃ ∈ Rn×(d+1) comprises the (d + 1) attributes X0,X1,X2,··· ,Xd, and each augmented point is given as x ̃i = (1,xi1,xi2,··· ,xid )T.
Let b = w0 denote the estimated bias term, and let wi denote the estimated regression weights. The augmented vector of estimated weights, including the bias term, is
w ̃ =(w0,w1,···,wd)T
We then make predictions for any given point xᵢ as follows:

$$\hat{y}_i = b\cdot 1 + w_1\cdot x_{i1} + \cdots + w_d\cdot x_{id} = \tilde{w}^T\tilde{x}_i$$

Recall that these estimates are obtained by minimizing the sum of squared errors (SSE), given as

$$\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \Bigl(y_i - b - \sum_{j=1}^d w_j\cdot x_{ij}\Bigr)^2$$

with the least squares estimate [Eq. (23.20)] given as

$$\tilde{w} = \bigl(\tilde{D}^T\tilde{D}\bigr)^{-1}\tilde{D}^T Y$$

The estimated variance σ̂² is then given as

$$\hat{\sigma}^2 = \frac{\text{SSE}}{n-(d+1)} = \frac{1}{n-d-1}\cdot\sum_{i=1}^n (y_i - \hat{y}_i)^2 \tag{27.19}$$

We divide by n − (d + 1) to get an unbiased estimate, since n − (d + 1) is the number of degrees of freedom for estimating SSE (see Figure 27.5). In other words, out of the n training points, we need to estimate d + 1 parameters, β and the ωᵢ's, with n − (d + 1) remaining degrees of freedom.
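A minimal sketch of this fit in code is given below (not from the text); it assumes the unaugmented data matrix D and response vector y are NumPy arrays, and the function name is illustrative. A least-squares solver is used in place of the explicit matrix inverse, which is numerically safer but computes the same estimate.

```python
import numpy as np

def fit_multiple_regression(D, y):
    """Least-squares fit on the augmented data, plus the unbiased variance estimate."""
    n, d = D.shape
    D_aug = np.column_stack([np.ones(n), D])     # add the X0 = 1 column
    # w_tilde = (D~^T D~)^{-1} D~^T y; lstsq solves the same normal equations
    w_tilde, *_ = np.linalg.lstsq(D_aug, y, rcond=None)
    y_hat = D_aug @ w_tilde
    sse = np.sum((y - y_hat) ** 2)
    var_hat = sse / (n - d - 1)                  # Eq. (27.19): n - (d+1) d.o.f.
    return w_tilde, var_hat
```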
Estimated Variance is Unbiased
We now show that σ̂² is an unbiased estimator of the true (but unknown) variance σ². Recall from Eq. (23.18) that

$$\hat{Y} = \tilde{D}\tilde{w} = \tilde{D}(\tilde{D}^T\tilde{D})^{-1}\tilde{D}^T Y = HY$$

where H is the n × n hat matrix (assuming that (D̃ᵀD̃)⁻¹ exists). Note that H is an orthogonal projection matrix, since it is symmetric (Hᵀ = H) and idempotent (H² = H). The hat matrix H is symmetric since

$$H^T = \bigl(\tilde{D}(\tilde{D}^T\tilde{D})^{-1}\tilde{D}^T\bigr)^T = \bigl(\tilde{D}^T\bigr)^T\bigl((\tilde{D}^T\tilde{D})^T\bigr)^{-1}\tilde{D}^T = H$$

and it is idempotent since

$$H^2 = \tilde{D}(\tilde{D}^T\tilde{D})^{-1}\tilde{D}^T\tilde{D}(\tilde{D}^T\tilde{D})^{-1}\tilde{D}^T = \tilde{D}(\tilde{D}^T\tilde{D})^{-1}\tilde{D}^T = H$$

Furthermore, the trace of the hat matrix is given as

$$tr(H) = tr\bigl(\tilde{D}(\tilde{D}^T\tilde{D})^{-1}\tilde{D}^T\bigr) = tr\bigl(\tilde{D}^T\tilde{D}(\tilde{D}^T\tilde{D})^{-1}\bigr) = tr\bigl(I_{(d+1)}\bigr) = d+1$$

where I₍d+1₎ is the (d + 1) × (d + 1) identity matrix, and we used the fact that the trace of a product of matrices is invariant under cyclic permutations.
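These three properties are easy to verify numerically; the short sketch below (not from the text) assumes the augmented data matrix is available as a NumPy array D_aug, and the function name is illustrative.

```python
import numpy as np

def hat_matrix_checks(D_aug):
    """Check numerically that H is symmetric, idempotent, and tr(H) = d + 1."""
    H = D_aug @ np.linalg.inv(D_aug.T @ D_aug) @ D_aug.T
    symmetric = np.allclose(H, H.T)
    idempotent = np.allclose(H @ H, H)
    trace = np.trace(H)      # should equal the number of columns of D_aug, i.e., d + 1
    return symmetric, idempotent, trace
```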
Finally, note that the matrix I − H is also symmetric and idempotent, where I is the n × n identity matrix, since

$$(I-H)^T = I^T - H^T = I - H \qquad (I-H)^2 = (I-H)(I-H) = I - H - H + H^2 = I - H$$
Now consider the squared error; we have

$$\text{SSE} = \bigl\|Y - \hat{Y}\bigr\|^2 = \bigl\|Y - HY\bigr\|^2 = \bigl\|(I-H)Y\bigr\|^2 = Y^T(I-H)(I-H)Y = Y^T(I-H)Y \tag{27.20}$$

However, note that the response vector Y is given as

$$Y = \tilde{D}\tilde{\omega} + \varepsilon$$

where ω̃ = (ω₀, ω₁, ···, ω_d)ᵀ is the true (augmented) vector of parameters of the model, and ε = (ε₁, ε₂, ···, εₙ)ᵀ is the true error vector, which is assumed to be normally distributed with mean E[ε] = 0 and with fixed variance var(εᵢ) = σ² for each point, so that cov(ε) = σ²I. Plugging the expression of Y into Eq. (27.20), we get

$$\begin{aligned}
\text{SSE} = Y^T(I-H)Y &= (\tilde{D}\tilde{\omega} + \varepsilon)^T(I-H)(\tilde{D}\tilde{\omega} + \varepsilon)
= (\tilde{D}\tilde{\omega} + \varepsilon)^T\bigl(\underbrace{(I-H)\tilde{D}\tilde{\omega}}_{0} + (I-H)\varepsilon\bigr)\\
&= \bigl((I-H)\varepsilon\bigr)^T(\tilde{D}\tilde{\omega} + \varepsilon)
= \varepsilon^T(I-H)(\tilde{D}\tilde{\omega} + \varepsilon)
= \underbrace{\varepsilon^T(I-H)\tilde{D}\tilde{\omega}}_{0} + \varepsilon^T(I-H)\varepsilon
= \varepsilon^T(I-H)\varepsilon
\end{aligned}$$

where we use the observation that

$$(I-H)\tilde{D}\tilde{\omega} = \tilde{D}\tilde{\omega} - H\tilde{D}\tilde{\omega} = \tilde{D}\tilde{\omega} - \bigl(\tilde{D}(\tilde{D}^T\tilde{D})^{-1}\tilde{D}^T\bigr)\tilde{D}\tilde{\omega} = \tilde{D}\tilde{\omega} - \tilde{D}\tilde{\omega} = 0$$
Let us consider the expected value of SSE; we have

$$\begin{aligned}
E[\text{SSE}] &= E\bigl[\varepsilon^T(I-H)\varepsilon\bigr]
= E\Bigl[\sum_{i=1}^n \varepsilon_i^2 - \sum_{i=1}^n\sum_{j=1}^n h_{ij}\,\varepsilon_i\varepsilon_j\Bigr]
= \sum_{i=1}^n E[\varepsilon_i^2] - \sum_{i=1}^n\sum_{j=1}^n h_{ij}\,E[\varepsilon_i\varepsilon_j]\\
&= \sum_{i=1}^n (1 - h_{ii})\,E[\varepsilon_i^2], \quad\text{since the } \varepsilon_i \text{ are independent, and therefore } E[\varepsilon_i\varepsilon_j] = 0 \text{ for } i\ne j\\
&= \Bigl(n - \sum_{i=1}^n h_{ii}\Bigr)\sigma^2 = \bigl(n - tr(H)\bigr)\sigma^2 = (n-d-1)\cdot\sigma^2
\end{aligned}$$

where we used the fact that σ² = var(εᵢ) = E[εᵢ²] − (E[εᵢ])² = E[εᵢ²], since E[εᵢ] = 0. It follows that

$$E[\hat{\sigma}^2] = E\Bigl[\frac{\text{SSE}}{n-d-1}\Bigr] = \frac{1}{n-d-1}\,E[\text{SSE}] = \frac{1}{n-d-1}\cdot(n-d-1)\cdot\sigma^2 = \sigma^2 \tag{27.21}$$
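Unbiasedness can also be checked by simulation. The sketch below (not from the text) repeatedly generates data from an assumed linear model with known σ and averages the resulting σ̂² estimates; the function name and parameter choices are arbitrary illustrations.

```python
import numpy as np

def simulate_sigma_hat(n=100, d=3, sigma=0.5, trials=2000, seed=0):
    """Monte Carlo check that sigma^2-hat averages out to the true sigma^2."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(trials):
        D = rng.normal(size=(n, d))
        omega = rng.normal(size=d + 1)                 # assumed true (augmented) weights
        D_aug = np.column_stack([np.ones(n), D])
        y = D_aug @ omega + rng.normal(scale=sigma, size=n)
        w_tilde, *_ = np.linalg.lstsq(D_aug, y, rcond=None)
        sse = np.sum((y - D_aug @ w_tilde) ** 2)
        estimates.append(sse / (n - d - 1))            # sigma^2-hat for this trial
    return np.mean(estimates)                          # should be close to sigma**2
```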
27.2.1 Goodness of Fit
Following the derivation in Section 27.1.2, the decomposition of the total sum of squares, TSS, into the sum of squared errors, SSE, and the residual sum of squares, RSS, holds true for multiple regression as well:
$$\text{TSS} = \text{SSE} + \text{RSS}$$
$$\sum_{i=1}^n (y_i - \mu_Y)^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \mu_Y)^2$$

Figure 27.4. Multiple regression: sepal length (X1) and petal length (X2), with response attribute petal width (Y). The vertical bars show the residual errors for the points. Points in white are above the plane, whereas points in gray are below the plane.
The coefficient of multiple determination, R2, gives the goodness of fit, measured as the fraction of the variation explained by the linear model:
$$R^2 = 1 - \frac{\text{SSE}}{\text{TSS}} = \frac{\text{TSS} - \text{SSE}}{\text{TSS}} = \frac{\text{RSS}}{\text{TSS}} \tag{27.22}$$
One of the potential problems with the R2 measure is that it is susceptible to increase as the number of attributes increase, even though the additional attributes may be uninformative. To counter this, we can consider the adjusted coefficient of determination, which takes into account the degrees of freedom in both TSS and SSE
$$R_a^2 = 1 - \frac{\text{SSE}/(n-d-1)}{\text{TSS}/(n-1)} = 1 - \frac{(n-1)\cdot\text{SSE}}{(n-d-1)\cdot\text{TSS}} \tag{27.23}$$

We can observe that the adjusted measure R2a is always less than R2, since the ratio (n−1)/(n−d−1) > 1. If there is too much of a difference between R2 and R2a, it might indicate that there are potentially many, possibly irrelevant, attributes being used to fit the model.
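Both measures are straightforward to compute once a model has been fit; the following is a minimal sketch (not from the text), assuming y and the fitted values y_hat are NumPy arrays and d is the number of predictor attributes. The function name is illustrative.

```python
import numpy as np

def r_squared(y, y_hat, d):
    """Coefficient of multiple determination (Eq. 27.22) and adjusted R^2 (Eq. 27.23)."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
    sse = np.sum((y - y_hat) ** 2)           # sum of squared errors
    r2 = 1 - sse / tss
    r2_adj = 1 - (n - 1) * sse / ((n - d - 1) * tss)
    return r2, r2_adj
```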
Example 27.6 (Multiple Regression: Goodness of Fit). Continuing with multiple regression from Example 23.3, Figure 27.4 shows the multiple regression of sepal length (X1) and petal length (X2) on the response attribute petal width (Y) for the Iris dataset with n = 150 points. We also add an extra attribute X0 = 1_{150}, which is a vector of all ones in R^150. The augmented dataset D̃ ∈ R^{150×3} comprises n = 150 points and three attributes X0, X1, and X2.
The uncentered 3×3 scatter matrix D̃ᵀD̃ and its inverse are given as

$$\tilde{D}^T\tilde{D} = \begin{pmatrix} 150.0 & 876.50 & 563.80\\ 876.5 & 5223.85 & 3484.25\\ 563.8 & 3484.25 & 2583.00 \end{pmatrix} \qquad (\tilde{D}^T\tilde{D})^{-1} = \begin{pmatrix} 0.793 & -0.176 & 0.064\\ -0.176 & 0.041 & -0.017\\ 0.064 & -0.017 & 0.009 \end{pmatrix}$$

The augmented estimated weight vector w̃ is given as

$$\tilde{w} = \begin{pmatrix} w_0\\ w_1\\ w_2 \end{pmatrix} = (\tilde{D}^T\tilde{D})^{-1}\cdot(\tilde{D}^T Y) = \begin{pmatrix} -0.014\\ -0.082\\ 0.45 \end{pmatrix}$$

The bias term is therefore b = w₀ = −0.014, and the fitted model is

$$\hat{Y} = -0.014 - 0.082\cdot X_1 + 0.45\cdot X_2$$
Figure 27.4 shows the fitted hyperplane. It also shows the residual error for each point. The white colored points have positive residuals (i.e., ǫi > 0 or yˆ i > yi ), whereas the gray points have negative residual values (i.e., ǫi < 0 or yˆi < y). The SSE value is given as SSE=􏰪ǫi2 =􏰪(yi −yˆi)2 =6.179 i=1 i=1 Thus, the estimated variance and standard error of regression are given as 2 SSE σˆ =􏰺 = 6.179 −2 150 150 n − d − 1 σˆ = S S E =4.203×10 = 4 . 2 0 3 × 1 0 − 2 = 0 . 2 0 5 1 􏰷4 7 n−d−1 The values of total and residual sum of squares are given as TSS = 86.78 RSS = 80.60 We can observe that TSS = SSE + RSS. The fraction of variance explained by the model, that is the R2 value, is given as R2 = RSS = 80.60 = 0.929 TSS 86.78 This indicates a very good fit of the multiple linear regression model. Nevertheless, it makes sense to also consider the adjusted R2a value R2a =1− (n−1)·SSE =1− 149×6.179 =0.928 (n−d −1)·TSS 147×86.78 The adjusted value is almost the same as the R2 value. Multiple Regression 741 x1 Y θ X2 ǫ 􏰭 X1 Y U1 xn U2 x2 Figure 27.5. Geometry of multiple regression. The figure shows two centered predictor variables X1 and X2, along with the corresponding orthogonal basis vectors U1 and U2. The subspace 1 is not shown. The dimensionality of the error space, containing the vector ǫ, is n − d − 1, which also specifies the degrees of freedom for the estimated variance σˆ 2 . Geometry of Goodness of Fit In multiple regression there are d predictor attributes X1,X2,··· ,Xd. We can center them by subtracting their projection along the vector 1 to obtain the centered predictor vectors Xi . Likewise, we can center the response vector Y and the predicted vector Y􏰭. Thus, we have X i = X i − μ X i · 1 Y = Y − μ Y · 1 Y􏰭 = Y􏰭 − μ Y · 1 Once Y, Y􏰭 and Xi ’s have been centered, they all lie in the n − 1 dimensional subspace orthogonal to the vector 1. Figure 27.5 shows this n − 1 dimensional subspace. In this subspace, we first extract an orthogonal basis {U1,U2,··· ,Ud} via the Gram-Schmidt orthogonalization process outlined in Section 23.3.1, and the predicted response vector is the sum of the projections of Y onto each of the new basis vectors (see Eq. (23.23)). The centered vectors Y and Y􏰭, and the error vector ǫ form a right triangle, and thus, by the Pythagoras theorem, we have 􏰑􏰑 􏰑􏰑2 􏰑􏰑􏰭􏰑􏰑2 2 􏰑􏰑􏰭􏰑􏰑2 􏰑􏰑 􏰭􏰑􏰑2 Y =Y +∥ǫ∥=Y +Y−Y (27.24) TSS=RSS+SSE The correlation between Y and Y􏰭 is the cosine of the angle betweenY andY􏰭, which is also given as the ratio of the base to the hypotenuse Y Y ρ Y Y􏰭 = c o s θ = 􏰑􏰑 􏰑􏰑 􏰑􏰑􏰭􏰑􏰑 742 Regression Evaluation Furthermore, by Eq. (27.4), the coefficient of multiple determination is given as 􏰑􏰑􏰭􏰑􏰑2 R2=RSS=Y =ρ2 TSS􏰑􏰑􏰑􏰑2 YY􏰭 Y Example 27.7 (Geometry of Goodness of Fit). Continuing Example 27.6, the corre- lation between Y and Y􏰭 is given as √ 80.60 = √ 86.78 The angle between Y and Y is given as θ = cos−1(0.964) = 15.5◦ The relatively small angle indicates a good linear fit. 􏰑􏰑􏰭􏰑􏰑 √ Y RSS ρ Y Y􏰭 = c 􏰭o s θ = 􏰑􏰑 􏰑􏰑 = √ Y TSS = 0 . 9 6 4 27.2.2 Inferences about Regression Coefficients LetYbetheresponsevectoroverallobservations.Letw ̃ =(w0,w1,w2,···,wd)T bethe estimated vector of regression coefficients, computed as 􏰓 ̃ T ̃ 􏰔−1 ̃ T w ̃ = D D D Y The expected value of w ̃ is given as follows: E [ w ̃ ] = E 􏰯 ( D ̃ T D ̃ ) − 1 D ̃ T Y 􏰰 = ( D ̃ T D ̃ ) − 1 D ̃ T · E [ Y ] =(D ̃TD ̃)−1D ̃T ·E[D ̃ω ̃ +ε]=(D ̃TD ̃)−1(D ̃TD ̃)ω ̃ =ω ̃ since E[ε] = 0. Thus, w ̃ is an unbiased estimator for the true regression coefficients vector ω ̃ . 
Next, we compute the covariance matrix for w ̃ , as follows cov(w ̃ ) = cov􏰓(D ̃ TD ̃ )−1D ̃ TY􏰔, letting A = (D ̃ TD ̃ )−1D ̃ T, we get = cov(AY) = A cov(Y)AT = A · 􏰄σ 2 · I􏰅 · AT = ( D ̃ T D ̃ ) − 1 D ̃ T 􏰄 σ 2 · I 􏰅 D ̃ ( D ̃ T D ̃ ) − 1 = σ 2 · ( D ̃ T D ̃ ) − 1 ( D ̃ T D ̃ ) ( D ̃ T D ̃ ) − 1 = σ 2 ( D ̃ T D ̃ ) − 1 ( 2 7 . 2 5 ) Here, we made use of the fact that A = (D ̃ TD ̃ )−1D ̃ T is a matrix of fixed values, and therefore cov(AY) = A cov(Y)AT . Also, we have cov(Y) = σ 2 · I, where I is the n × n identity matrix. This follows from the fact that the observed responses yi’s are all independent and have the same variance σ2. Multiple Regression 743 Note that D ̃ TD ̃ ∈ R(d+1)×(d+1) is the uncentered scatter matrix for the augmented data. Let C denote the inverse of D ̃ TD ̃ . That is ( D ̃ T D ̃ ) − 1 = C Therefore, the covariance matrix for w ̃ can be written as cov(w ̃)=σ2C In particular, the diagonal entries σ2 ·cii give the variance for each of the regression coefficient estimates (including for b = w0 ), and their squared roots specify the standard errors. That is var(wi)=σ2 ·cii se(wi)=􏰷var(wi)=σ ·√cii We can now define the standardized variable Zwi that can be used to derive the confidence intervals for ωi as follows wi −E[wi] wi −ωi c (27.26) Zwi= se(w) =σˆ√ i ii where we have replaced the unknown true variance σ 2 by σˆ 2 . Each of the variables Zwi follows a t -distribution with n − d − 1 degrees of freedom, from which we can obtain the 100(1 − α)% confidence interval of the true value ωi as follows: P􏰄wi−tα/2·se(wi)≤ωi ≤wi+tα/2·se(wi)􏰅=1−α Here, tα/2 is the critical value of the t distribution, with n − d − 1 degrees of freedom, that encompasses α/2 fraction of the probability mass in the right tail, given as P(Z≥tα/2)=α orequivalentlyTn−d−1(tα/2)=1−α 22 Example 27.8 (Confidence Intervals). Continuing with multiple regression from Example 27.6, we have σˆ 2 = 4 . 2 0 3 × 1 0 − 2  0.793 −0.176 0.064 C = (D ̃ TD ̃ )−1 = −0.176 0.041 −0.017 0.064 −0.017 0.009 Therefore, the covariance matrix of the estimated regression parameters, including the bias term, is given as cov(w ̃)=σˆ2 ·C=−7.379×10−3  3.333 × 10−2 2.678 × 10−3 −7.379 × 10−3 1.714×10−3 −7.012 × 10−4 2.678 × 10−3 −7.012×10−4 3.775 × 10−4 744 Regression Evaluation The diagonal entries specify the variances and standard errors of each of the estimated parameters var(b) = 3.333 × 10−2 var(w1 ) = 1.714 × 10−3 var(w2 ) = 3.775 × 10−4 se(b) = 􏰷􏰷3.333 × 10−2 = 0.183 se(w1 ) = 􏰷1.714 × 10−3 = 0.0414 se(w2 ) = 3.775 × 10−4 = 0.0194 where b = w0. Using confidence level 1 − α = 0.95 (or significance level α = 0.05), the critical value of the t-distribution that encompasses α = 0.025 fraction of the probability 2 mass in the right tail is given as tα/2 = 1.976. Thus, the 95% confidence intervals for the true bias term β, and the true regression coefficients ω1 and ω2, are: β ∈ 􏰄b ± tα/2 · se(b)􏰅 = (−0.014 − 0.074, −0.014 + 0.074) 􏰄 􏰅 = (−0.088, 0.06) ω1 ∈ w1 ±tα/2 ·se(w1) =(−0.082−0.0168,−0.082+0.0168) 􏰄 􏰅 = (−0.099, −0.065) ω2 ∈ w2 ±tα/2 ·se(w2) =(0.45−0.00787,0.45+0.00787) = (0.442, 0.458) 27.2.3 Hypothesis Testing Once the parameters have been estimated, it is beneficial to test whether the regression coefficients are close to zero or substantially different. For this we set up the null hypothesis that all the true weights are zero, except for the bias term (β = ω0). 
We contrast the null hypothesis with the alternative hypothesis that at least one of the weights is not zero H0: ω1 =0,ω2 =0,...,ωd =0 Ha: ∃i, suchthatωi ̸=0 The null hypothesis can also be written as H0 : ω = 0, where ω = (ω1,ω2,··· ,ωd )T. We use the F-test that compares the ratio of the adjusted RSS value to the estimated variance σˆ 2 , defined via the f -statistic RSS/d RSS􏰒d f = σˆ 2 = S S E 􏰒 ( n − d − 1 ) ( 2 7 . 2 7 ) Under the null hypothesis, we have E[RSS/d]=σ2 Multiple Regression 745 To see this, let us examine the regression equations in vector terms, namely Y􏰭􏰭 = b · 1 + w 1 · X 1 + . . . + w d · X d Y=(μY−w1μX1 −...−wdμXd)·1+w1·X1+...+wd·Xd Y􏰭 − μ Y · 1 = w 1 ( X 1 − μ X 1 · 1 ) + . . . + w d ( X d − μ X d · 1 ) , w h i c h i m p l i e s 􏰪d i=1 Y􏰭=w1X1+w2X2+...+wdXd = Let us consider the RSS value; we have wiXi 􏰑􏰑2 􏰑􏰑􏰭􏰑􏰑2 􏰭T􏰭 RSS=Y−μY·1 =Y =YY 􏰑􏰑􏰭 􏰓􏰪d 􏰔T􏰓􏰪d j=1 􏰔 􏰪d 􏰪d wjXj = wiwjXiTXj =wT(DTD)w = i=1 wiXi where w = (w1 , w2 , · · · , wd )T is the (without the bias term), and D ∈ Rn×d is the centered data matrix (without augmen- tation by the X0 = 1 attribute). The expected value of RSS is thus given as E[RSS] = E􏰯wT(DT D)w􏰰 = tr􏰓E􏰯wT(DT D)w􏰰􏰔, since, E􏰯wT(DT D)w􏰰 is a scalar = E􏰯tr􏰓wT(DT D)w􏰔􏰰 = E􏰯tr􏰓(DT D)wwT􏰔􏰰, since trace is invariant under cyclic permutation = t r 􏰓 ( DT D) · E [ w w T ] 􏰔 = t r 􏰓 ( DT D) · 􏰄 c o v ( w ) + E [ w ] · E [ w ] T 􏰅 􏰔 ( 2 7 . 2 8 ) = tr 􏰓( DT D) · cov(w)􏰔 , since under the null hypothesis E[w] = ω = 0 = t r 􏰓 ( DT D) · σ 2 ( DT D) − 1 􏰔 = σ 2 t r ( I d ) = d · σ 2 where Id is the d × d identity matrix. We also used the fact that cov(w) = E[wwT] − E[w] · E[w]T, and therefore E[wwT ] = cov(w) + E[w] · E[w]T Notice that from Eq. (27.25), the covariance matrix for the augmented weight vector w ̃ , t h a t i n c l u d e s t h e b i a s t e r m , i s g i v e n a s σ 2 ( D ̃ T D ̃ ) − 1 . H o w e v e r , s i n c e w e a r e i g n o r i n g the bias b = w0 in the hypothesis test, we are interested only in the lower right d × d submatrix of (D ̃ TD ̃ )−1, which excludes the values related to w0. It can be shown that this submatrix is precisely the inverse of the centered scatter matrix (DT D)−1 for the unaugmented data. We used this fact in the derivation above. Therefore, it follows that E 􏰗 RSS 􏰘 = 1 E[RSS] = 1 · d · σ 2 = σ 2 ddd i=1 j=1 d -dimensional vector of regression coefficients 746 Regression Evaluation Further, as per Eq. (27.21), the estimated variance is an unbiased estimator, so that E[σˆ2]=σ2 Thus, under the null hypothesis the f -statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then E[RSS/d] ≥ σ2, resulting in a larger f value. The ratio f follows a F -distribution with d , (n − d − 1) degrees of freedom for the numerator and denominator, respectively. Therefore, we can reject the null hypothesis if the p-value is less than the chosen significance level, say α = 0.01. Noticethat,sinceR2=1−SSE =RSS,wehave TSS TSS SSE=(1−R2)·TSS RSS=R2 ·TSS Therefore, we can rewrite the f ratio as follows RSS􏰒d n−d−1 R2 f = SSE􏰒(n − d − 1) = d · 1 − R2 (27.29) In other words, the F -test compares the adjusted fraction of explained variation to the unexplained variation. If R2 is high, it means the model can fit the data well, and that is more evidence to reject the null hypothesis. 
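A minimal sketch of these tests in code is given below (not from the text); it assumes the unaugmented data matrix D and response vector y are NumPy arrays, and the function name and returned keys are illustrative. It computes the overall F-test of Eq. (27.27) together with the per-coefficient statistics Z_{w_i} from Eq. (27.26), which are examined individually in the next subsection.

```python
import numpy as np
from scipy import stats

def multiple_regression_tests(D, y, alpha=0.05):
    """Overall F-test for the regression effect and per-coefficient t-tests."""
    n, d = D.shape
    D_aug = np.column_stack([np.ones(n), D])      # augment with X0 = 1
    C = np.linalg.inv(D_aug.T @ D_aug)            # C = (D~^T D~)^{-1}
    w_tilde = C @ D_aug.T @ y                     # least-squares estimates
    y_hat = D_aug @ w_tilde
    sse = np.sum((y - y_hat) ** 2)
    rss = np.sum((y_hat - y.mean()) ** 2)
    var_hat = sse / (n - d - 1)                   # estimated variance
    f = (rss / d) / var_hat                       # f-statistic, Eq. (27.27)
    p_f = stats.f.sf(f, d, n - d - 1)             # F-distribution with (d, n-d-1) d.o.f.
    se = np.sqrt(var_hat * np.diag(C))            # standard errors, incl. the bias term
    z = w_tilde / se                              # Z_{w_i} under H0: omega_i = 0
    p_t = 2 * stats.t.sf(np.abs(z), df=n - d - 1)
    return {"f": f, "p_f": p_f, "z": z, "p_t": p_t, "reject_H0": p_f < alpha}
```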
Hypothesis Testing for Individual Parameters We can also test whether each independent attribute Xi, contributes significantly for the prediction of Y or not, assuming that all the attributes are still retained in the model. For attribute Xi , we set up the null hypothesis H0 : ωi = 0 and contrast it with the alternative hypothesis Ha : ωi ̸= 0. Using Eq. (27.26), the standardized variable Zwi under the null hypothesis is given as wi −E[wi] wi wi c (27.30) Zwi = se(w) =se(w)=σˆ√ i i ii since E[wi ] = ωi = 0]. Next, using a two-tailed t -test with n − d − 1 degrees of freedom, we compute p-value(Zwi ). If this probability is smaller than the significance level α (say 0.01), we can reject the null hypothesis. Otherwise, we accept the null hypothesis, which would imply that Xi does not add significant value in predicting the response in light of other attributes already used to fit the model. The t-test can also be used to test whether the bias term is significantly different from 0 or not. Example 27.9 (Hypothesis Testing). We continue with Example 27.8, but now we test for the regression effect. Under the null hypothesis that ω1 = ω2 = 0, the expected value of RSS is σ 2 . Thus, we expect the f -statistic to be close to 1. Let us check if that Multiple Regression 747 is the case; we have f = RSS/d = 80.60/2 = 958.8 σˆ 2 4.203 × 10−2 Using the F -distribution with (d , n − d − 1) = (2, 147) degrees of freedom, we have p-value(958.8) ≃ 0 In other words, such a large value of the f -statistic is extremely rare, and therefore, we can reject the null hypothesis. We conclude that Y does indeed depend on at least one of the predictor attributes X1 or X2. We can also test for each of the regression coefficients individually using the t -test. For example, for ω1 , let the null hypothesis be H0 : ω1 = 0, so that the alternative hypothesis is Ha : ω1 ̸= 0. Assuming that the model still has both X1 and X2 as the predictor variables, we can compute the t-statistic using Eq. (27.26): Zw1 = w1 = −0.082 = −1.98 se(w1 ) 0.0414 Using a two-tailed t -test with n − d − 1 = 147 degrees of freedom, we find that p-value(−1.98) = 0.0496 Since the p-value is only marginally less than a significance level of α = 0.05 (i.e., a 95% confidence level), this means that X1 is only weakly relevant for predicting Y in the presence of X2. In fact, if we use the more stringent significance level α = 0.01, we would conclude that X1 is not significantly predictive of Y, given X2. On the other hand, individually for ω2, if we test whether H0 : ω2 = 0 versus Ha :ω2 ̸=0,wehave: Zw2 = w2 = 0.45 =23.2 se(w2 ) 0.0194 Using a two-tailed t -test with n − d − 1 = 147 degrees of freedom, we find that p-value(23.2) ≃ 0 This means, that individually X2 is significantly predictive of Y even in the presence of X1. Using the t-test, we can also compute the p-value for the bias term: Zb = b = −0.014 = −0.077 which has a p-value = 0.94 for a two-tailed test. This means, we accept the null hypothesis that β = 0, and reject the alternative hypothesis that β ̸= 0. se(b) 0.183 Example 27.10 (Centered Scatter Matrix). Here we show that the inverse of the centered scatter matrix (DTD)−1 for the unaugmented data is the lower right 748 Regression Evaluation submatrix of the inverse of the uncentered scatter matrix (D ̃ TD ̃ )−1 for the augmented data. 
For the augmented data comprising X0, X1 and X2, from Example 27.6, the uncentered3×3scattermatrixD ̃TD ̃ anditsinverse,aregivenas 150.0 876.50 563.80   0.793 −0.176 0.064 D ̃ TD ̃ = 876.5 5223.85 3484.25 (D ̃ TD ̃ )−1 = −0.176 0.041 −0.017 563.8 3484.25 2583.00 0.064 −0.017 0.009 For the unaugmented data comprising only X1 and X2, the centered scatter matrix and its inverse are given as: DT D= 􏰕102.17 189.78􏰖 (DT D)−1 = 􏰕 0.041 −0.017􏰖 189.78 463.86 −0.017 0.009 Wecanobservethat(DTD)−1 isexactlythelowerright2×2submatrixof(D ̃TD ̃)−1. 27.2.4 Geometric Approach to Statistical Testing The geometry of multiple regression provides further insight into the hypothesis testing approach for the regression effect. Let Xi = Xi − μXi · 1 denote the centered attributevector,andletX=(X1,X2,···,Xd)T denotethemultivariatecenteredvector of predictor variables. The n-dimensional space over the points is divided into three mutually orthogonal subspaces, namely the 1-dimensional mean space Sμ = span(1), the d dimensional centered variable space SX = span(X), and the n − d − 1 dimensional error space Sǫ, which contains the error vector ǫ. The response vector Y can thus be decomposed into three components (see Figures 27.2 and 27.5) Y = μ Y · 1 + Y􏰭 + ǫ Recall that the degrees of freedom of a random vector is defined as the dimensionality of its enclosing subspace. Since the original dimensionality of the point space is n, we have a total of n degrees of freedom. The mean space has dimensionality dim(Sμ) = 1, thecenteredvariablespacehasdim(SX)=d,andtheerrorspacehasdim(Sǫ)=n−d− 1, so that we have dim(Sμ)+dim(SX)+dim(Sǫ)=1+d+(n−d−1)=n Population Regression Model Recall that the regression model posits that for a fixed value xi = (xi1,xi2,··· ,xid )T, the true response yi is given as y i = β + ω 1 · x i 1 + . 􏰥. . + ω d · x i d + ε i where the systematic part of the model β + dj =1 ωj · xij is fixed, and the error term εi varies randomly, with the assumption that εi follows a normal distribution with mean μ = 0 and variance σ2. We also assume that the εi values are all independent of each other. Multiple Regression 749 Sǫ Y E [Y | X ] 􏱆􏱃 ε SX 1 Figure 27.6. Population regression model. The n − 1 dimensional subspace, comprising SX + Sǫ , orthogonal to 1 is shown as a hyperrectangle. The true error vector is a random vector of length ∥ε∥; the (n − 1)-dimensional hypersphere shows the possible orientations of ε for a fixed length value. Plugging β = μY − 􏰥dj =1 ωj · μXj into the equation above, we obtain yi =μY +ω1 ·(xi1 −μX1)+...+ωd ·(xid −μXd)+εi =μY +ω1 ·xi1 +...+ωd ·xid +εi where x ij = xij − μXj is the centered value for attribute Xj . Across all the points, we can rewrite the above equation in vector form Y=μY ·1+ω1 ·X1 +...+ωd ·Xd +ε We can also center the vector Y, so that we obtain a regression model over the centered response and predictor variables, as illustrated in Figure 27.6: Y =Y−μY ·1=ω1 ·X1 +ω2 ·X2 +...+ωd ·Xd +ε=E[Y|X]+ε In this equation, 􏰥di=1ωi ·Xi is a fixed vector that denotes the expected value E[Y|X] and ε is an n-dimensional random vector that is distributed according to a n-dimensional multivariate normal vector with mean μ = 0, and a fixed variance σ 2 along all dimensions, so that its covariance matrix is 􏰂 = σ 2 · I. The distribution of ε is therefore given as f(ε)= √ 1 √ ·exp􏰢−εT􏰂−1ε􏰣= √ 1 ·exp􏰛−∥ε∥2 􏰜 ( 2π)n· |􏰂| 2 ( 2π)n·σn 2·σ2 which follows from the fact that |􏰂| = det(􏰂) = det(σ2I) = 􏰄σ2􏰅n and 􏰂−1 = 1 I. 
σ2 The density of ε is thus a function of its squared length ∥ε∥2, independent of its angle. In other words, the vector ε is distributed uniformly over all angles and is equally likely to point in any direction. Figure 27.6 illustrates the population regression model. The fixed vector E[Y|X] is shown, and one orientation of ε is shown. It is important 750 Regression Evaluation to note that the n − 1 dimensional hypersphere denotes the fact that the random vector ε can be in any orientation in this hypersphere of radius ∥ε∥. Notice how the population regression model differs from the fitted model. The residual error vector ǫ is orthogonal to the predicted mean response vector Y􏰭, which acts as the estimate for E[Y|X]. On the other hand, in the population regression model, the random error vector ε can be in any orientation compared to E[Y|X]. Hypothesis Testing Consider the population regression model Y = μ Y · 1 + ω 1 · X 1 + . . . + ω d · X d + ε = μ Y · 1 + E [Y | X ] + ε To test whether X1,X2,··· ,Xd are useful in predicting Y, we consider what would happen if all of the regression coefficients were zero, which forms our null hypothesis H0: ω1 =0,ω2 =0,...,ωd =0 Y=μY ·1+ε =⇒ Y−μY ·1=ε =⇒ Y =ε Since ε is normally distributed with mean 0 and covariance matrix σ 2 · I, under the null hypothesis, the variation in Y for a given value of x will therefore be centered around the origin 0. On the other hand, under the alternative hypothesis Ha that at least one of the ωi is non-zero, we have Y = E [Y | X ] + ε Thus, the variation inY is shifted away from the origin 0 in the direction E[Y|X]. In practice, we obviously do not know the true value of E[Y|X], but we can estimate it by projecting the centered observation vectorY onto the subspaces SX and Sǫ , as follows Y = w 1 · X 1 + w 2 · X 2 + . . . + w d · X d + ǫ = Y􏰭 + ǫ Now, u􏰭nder the null hypothesis, the true centered response vector is Y = ε, and therefore, Y and ǫ are simply the projections of the random error vector ε onto the subspaces SX a􏰭nd Sǫ, as shown in Figure 27.7(a). In this case, we also expect the length of ǫ and Y to be roughly comparable􏰭. On the other hand, under the alternative hypothesis, we have Y = E[Y |X] + ε, and so Y will be relatively much longer compared to ǫ, as shown in Figure 27.7(b). 􏰭 Given that we expect to see a difference between Y under the null an􏰭d alternative hypotheses, this suggests a geometric test based on the relative lengths of Y and ǫ (since we do not know the true E[Y|X] or􏰭ε). However, there is one difficulty; we cannot compare their lengths directly, since Y lies in a d dimensional space, whereas ǫ lies in a n − d − 1 dimensional space. Instead, we can compare their lengths after normalizing In this case, we have Multiple Regression 751 Sǫ 􏱆􏱃 ε ǫ Y􏰭 1 SX Sǫ (a) Null hypothesis: Y = ε Y Y􏰭 SX by the number of dimensions. Define the mean squared length of per dimension for the two vectorsY􏰭 and ǫ, as follows 􏰑􏰑􏰭􏰑􏰑2 􏰑􏰑􏰭􏰑􏰑2 􏰄􏰭􏰅 Y Y 1 (b) Alternative hypothesis Figure 27.7. Geometric test for regression effect: Null versus alternative hypothesis. MY =dim(SX)= d ∥ǫ ∥2 ∥ǫ ∥2 M(ǫ)= dim(Sǫ) = n−d−1 Thus, the geometric test for the regression effect is the ratio of the normalized mean squared length ofY􏰭 to ǫ, given as 􏰄 􏰭 􏰅 􏰑􏰑 􏰭 􏰑􏰑 2 􏰒 MYYd M(ǫ) =∥ǫ∥2􏰒(n−d−1) 􏱆􏱃 􏱆􏱃 ε ǫ 752 Regression Evaluation 􏰑􏰑􏰭􏰑􏰑2 It is interesting to note that from Eq. (27.24) we have Y 2 = RSS and ∥ǫ∥ = 􏰑􏰑Y − Y􏰭􏰑􏰑2 = SSE, and therefore, the geometric ratio test is identical to the F-test in Eq. 
(27.27), since 􏰄 􏰭 􏰅 􏰑􏰑 􏰭 􏰑􏰑 2 􏰒 􏰒 MY Yd RSSd M(ǫ) = ∥ǫ∥2􏰒(n−d−1) = SSE􏰒(n−d−1) =f The geometric approach, illustrated in Figure 27.7, makes it clear that if f ≃ 1 then the null hypothesis holds, and we conclude that Y does not depend on the predictor variables X1,X2,··· ,Xd. On the other hand, if f is large, with a p-value less than the significance level (say, α = 0.01), then we can reject the null hypothesis and accept the alternative hypothesis that Y depends on at least one predictor variable Xi . 27.3 FURTHER READING For a geometrical approach to multivariate statistics see Wickens (2014), and Saville and Wood (2012). For an excellent treatment of modern statistical inference in the context of regression see Devore and Berk (2012). Devore, J. and Berk, K. (2012). Modern Mathematical Statistics with Applications. 2nd ed. New York: Springer Science+Business Media. Saville, D. J. and Wood, G. R. (2012). Statistical Methods: The Geometric Approach. New York: Springer Science + Business Media. Wickens, T. D. (2014). The Geometry of Multivariate Statistics. New York: Psychology Press, Taylor & Francis Group. 27.4 EXERCISES Q1. Showthatforbivariateregression,wehaveβ=μY−ω·μX,whereβandωarethe multiple regression. true model parameters. Q2. Showthat􏰥n (yi −yˆi)·(yˆi −μY)=0. i=1 􏰭 Q3. ProvethatǫisorthogonaltoYandX.Showthisforbivariateregressionandthenfor Q4. Show that for bivariate regression, the R2 statistic is equivalent to the squared correlation between the independent attribute vector X and response vector Y. That is, show that R2 = ρX2 Y. Q5. ShowthatY􏰭=Y􏰭−μY·1. 􏰭 􏰑􏰑 􏰭􏰑􏰑 􏰑􏰑 􏰑􏰑 Q6. Showthat∥ǫ∥= Y−Y = Y−Y . Q7. Inbivariateregression,showthatunderthenullhypothesis,E[RSS]=σ2. Q8. Inbivariateregression,showthatE[σˆ2]=σ2. Q9. ShowthatE[RSS/d]≥σ2. Exercises 753 Q10. Treating each residual ǫi = yi − yˆi as a random variable, show that var(ǫi)=σ2(1−hii) where hii is the ith diagonal of the (d + 1) × (d + 1) hat matrix H for the augmented data D ̃. Next, using the above expression for var(ǫi), show that for bivariate regression, the variance of the ith residual is given as var(ǫi)=σ2·􏰕1−1− 1 ·(xi−μX)2􏰖 n sX Q11. Given data matrix D, let D the centered data matrix, and D ̃ the augmented data matrix (with the extra column X0 = 1). Let (D ̃ TD ̃ ) be the uncentered scatter matrix for the augmented data, and let DT Dbe the scatter matrix for the centered data. Show that the lower right d × d submatrix of (D ̃ TD ̃ )−1 is (DT D)−1.