Perceptrons and Support Vector Machines

Outline
- Preliminaries
- Perceptrons
- SVM
- Kernel trick
- SVR
- Comparisons with others
- Applications
Preliminaries: Separating Hyperplane

A hyperplane in R^p:  L = {x : f(x) = β0 + β^T x = 0}.
e.g. p = 2:  f(x) = β0 + β1 x1 + β2 x2 = 0 is a line in the plane.

Data: (xi, yi), i = 1, ..., n, with xi ∈ R^p and yi ∈ {−1, +1} (class 1 vs. class 2).
A hyperplane L separates the data if all points of one class have f(xi) > 0 and all points of the other class have f(xi) < 0; the classifier is then ŷi = sign(f(xi)).

[Sketch: two classes (x's and o's) in the plane with a separating line L; f(x) > 0 on one side, f(x) < 0 on the other.]
Assign ŷi = sign(f(xi)).

Linearly separable case: there exists a hyperplane with yi f(xi) > 0 for all i; every point is correctly classified, and the size of yi f(xi) measures the degree of how correct the classification is.

Linearly non-separable case: no such hyperplane exists; for any f some points have yi f(xi) > 0 (correctly classified) and some have yi f(xi) < 0 (wrongly classified), and again the size of yi f(xi) measures the degree of how correct or how wrong each point is.

[Sketch: a separable configuration of x's and o's, and a non-separable one where a few points fall on the wrong side of the line f(x) = 0.]
Properties of L

1. If x1, x2 ∈ L then β^T(x1 − x2) = 0, so the unit vector β* = β/||β|| is normal to L.
2. For any x0 ∈ L, β^T x0 = −β0.
3. For any x ∈ R^p, the signed distance from x to L is
      β*^T (x − x0) = (β^T x + β0)/||β|| = f(x)/||β||,
   so f(x) is proportional to the signed distance from x to the hyperplane f(x) = 0.

Assign ŷi = sign(f(xi)). For a correctly classified point, the distance from xi to L is
      d(xi) = yi f(xi)/||β||.

[Sketch: the hyperplane L, its normal direction β, and a point x with its perpendicular distance to L; f(x) > 0 on one side of L, f(x) < 0 on the other.]
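A quick numerical check of property 3, as a minimal R sketch; the points, coefficients, and the helper name signed.dist are made up for illustration.

# Signed distance from the rows of X to the hyperplane f(x) = beta0 + x' beta = 0
signed.dist <- function(X, beta0, beta) (beta0 + X %*% beta) / sqrt(sum(beta^2))

X <- rbind(c(1, 1), c(-1, -1), c(2, 0))          # three points in R^2
signed.dist(X, beta0 = -1, beta = c(1, 1))       # f(x)/||beta||: 0.707, -2.121, 0.707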
Classification: Perceptrons and SVM

Perceptrons (Rosenblatt, 1958) — the foundation for neural networks in the 1980s and 1990s.

Objective: minimize the total distance of the misclassified points to L,
      D(β0, β) = − Σ_{i ∈ M} yi (β0 + β^T xi),
where M is the misclassified set (for i ∈ M, −yi f(xi) > 0 is proportional to the distance of xi from L). This is minimized by stochastic gradient descent (SGD): cycle through the misclassified points in random order and update (β0, β) ← (β0, β) + ρ (yi, yi xi).

Problems with the perceptron:
- When the data are separable there are many solutions, and which one is found depends on the starting point.
- Convergence can take a long time.
- In the non-separable case the algorithm fails to converge.
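A minimal sketch of the perceptron updates described above, assuming labels coded −1/+1 and a learning rate ρ; the function name and defaults are my own, not a reference implementation.

# Perceptron via stochastic updates: for a misclassified point (y_i * f(x_i) <= 0),
# move (beta0, beta) in the direction (rho * y_i, rho * y_i * x_i).
perceptron <- function(X, y, rho = 1, epochs = 50) {
  beta0 <- 0
  beta  <- rep(0, ncol(X))
  for (e in seq_len(epochs)) {
    for (i in sample(nrow(X))) {                       # random order each epoch
      if (y[i] * (beta0 + sum(beta * X[i, ])) <= 0) {  # misclassified (or on L)
        beta0 <- beta0 + rho * y[i]
        beta  <- beta  + rho * y[i] * X[i, ]
      }
    }
  }
  list(beta0 = beta0, beta = beta)
}

On separable data the loop stops changing once every point is on the right side; on non-separable data it keeps cycling, which is exactly the convergence problem noted above.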
Solution: maximize the margin M between the two classes — a unique, more robust separating hyperplane.

[Sketch: separable x's and o's with the maximum-margin line and the margin band of width 2M.]
FIGURE 9.2. Left: There are two classes of observations, shown in blue and in purple, each of which has measurements on two variables. Three separating hyperplanes, out of many possible, are shown in black. Right: A separating hy- perplane is shown in black. The blue and purple grid indicates the decision rule made by a classifier based on this separating hyperplane: a test observation that falls in the blue portion of the grid will be assigned to the blue class, and a test observation that falls into the purple portion of the grid will be assigned to the purple class.
those from the purple class as yi = −1. Then a separating hyperplane has the property that
β0 + β1 xi1 + β2 xi2 + … + βp xip > 0 if yi = 1,      (9.6)
and
β0 + β1 xi1 + β2 xi2 + … + βp xip < 0 if yi = −1.     (9.7)
Equivalently, a separating hyperplane has the property that
yi(β0 + β1 xi1 + β2 xi2 + … + βp xip) > 0              (9.8)
for all i = 1, …, n.
If a separating hyperplane exists, we can use it to construct a very natural
classifier: a test observation is assigned a class depending on which side of the hyperplane it is located. The right-hand panel of Figure 9.2 shows an example of such a classifier. That is, we classify the test observation x∗ based on the sign of f(x∗) = β0+β1x∗1+β2x∗2+. . .+βpx∗p. If f(x∗) is positive, then we assign the test observation to class 1, and if f(x∗) is negative, then we assign it to class −1. We can also make use of the magnitude of f(x∗). If f(x∗) is far from zero, then this means that x∗ lies far from the hyperplane, and so we can be confident about our class assignment for x∗. On the other
Maximum Margin Classifier (Vapnik)

Non-overlapping (separable) case: choose the hyperplane that maximizes the margin,
      max_{β0, β} M   subject to   yi (β0 + β^T xi)/||β|| ≥ M,  i = 1, …, n,
i.e. every point is on the correct side of L and at distance at least M from it.

Note: if (β0, β) satisfies the constraints then so does (cβ0, cβ) for any c > 0, so there are infinitely many equivalent solutions. The scale of β is NOT important, only its direction; we may fix the scale in any mathematically convenient way. Choosing ||β|| = 1/M gives the reformulation

      min_{β0, β} (1/2)||β||^2   subject to   yi (β0 + β^T xi) ≥ 1,  i = 1, …, n,

a convex (quadratic programming) problem with a unique solution.

[Sketch: separable x's and o's, the maximum-margin hyperplane, and the two margin boundaries at distance M on either side.]
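In practice this quadratic program is solved by standard software. Below is a hedged sketch using the e1071 package (the same package used in the digit example later), where a very large cost approximates the hard-margin classifier; the toy data and variable names are invented for illustration.

library(e1071)

set.seed(1)                                             # toy separable data
n <- 20
X <- rbind(matrix(rnorm(2 * n, mean = -2), ncol = 2),
           matrix(rnorm(2 * n, mean =  2), ncol = 2))
y <- factor(rep(c(-1, 1), each = n))

fit <- svm(X, y, kernel = "linear", cost = 1e5, scale = FALSE)  # large cost ~ hard margin
fit$index                    # indices of the support vectors
w <- t(fit$coefs) %*% fit$SV # beta = sum_i alpha_i y_i x_i
b <- -fit$rho                # beta0
2 / sqrt(sum(w^2))           # width of the margin band, 2M = 2/||beta||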
s
Support Proof
vs
vectors 2,4,6 are Nonsupport vectors
Nonsupport Points i
yfn
Points it 3,5 are
xi 二 0 vectors
Proof
Xi 20
Support
i gift 1 1M 1
刪M 1
r
true for i 1,3 D o
x
fy
ˊ
3011811 g
峾
0
Min
0
Overlapping (non-separable) case

Introduce slack variables ξi ≥ 0 and relax the constraints to yi f(xi) ≥ 1 − ξi. Again β = Σ_i αi yi xi with αi [yi f(xi) − (1 − ξi)] = 0, and the points fall into two groups:

- Non-support vectors (points 2, 4, 6 in the sketch): correctly labelled and strictly outside the margin, so yi f(xi) > 1, ξi = 0, αi = 0.
- Support vectors (points 1, 3, 5, 7, 8 in the sketch): everything else, i.e. yi f(xi) ≤ 1. These lie on the margin boundary (yi f(xi) = 1), inside the margin but still correctly labelled (0 < yi f(xi) < 1), or are wrongly labelled (yi f(xi) < 0). Only these points can have αi > 0.

Interpretation of yi f(xi): the value 1 marks the margin boundary (distance M = 1/||β|| from L) and 0 marks the decision boundary itself; the sign says whether the label is correct and the size says by how much.

[Sketch: overlapping x's and o's with the decision boundary, the two margin lines, points labelled 1-8, and annotations "correctly labelled" / "wrongly labelled".]
Soft-margin classifier: maximize the margin while allowing some points to violate it, with violations measured by slack variables ξi ≥ 0 (yi f(xi) ≥ 1 − ξi) and controlled through a cost/budget parameter. At the solution ξi = [1 − yi f(xi)]+, the hinge loss.

[Sketch: the hinge loss [1 − yf]+ as a function of the margin yf — zero for yf ≥ 1, rising linearly for yf < 1.]
Kernel trick

From the dual, the solution depends on the inputs only through inner products:
      f(x) = β0 + Σ_{i=1}^N αi yi ⟨x, xi⟩.
Replace the inner product by a kernel K:
      f(x) = β0 + Σ_{i=1}^N αi yi K(x, xi),
where K(x, x') = ⟨h(x), h(x')⟩ for some feature map h.

Original optimal separating hyperplane (linear kernel K(x, xi) = x^T xi):
- Pros: simple; f is linear in the features x ∈ R^p.
- Cons: linear kernels are too simple, too restrictive; we often need non-linear decision boundaries.

Kernel trick ≈ feature-space engineering: map x ∈ R^p onto a higher-dimensional feature space, h(x) ∈ R^q with q ≫ p, and work with h(x) instead of x — but only through K, so h never has to be computed explicitly. For choices of kernels (polynomial, radial/Gaussian, …) see e.g. PRML.

Example: an SVM with the Gaussian (radial) kernel K(x, x') = exp(−γ ||x − x'||^2) can separate data that are not linearly separable in the original coordinates.

[Sketch: a non-separable two-class configuration (one class surrounding the other) separated by the non-linear boundary of a Gaussian-kernel SVM.]
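A hedged illustration of the Gaussian-kernel SVM on data that no line can separate (two concentric rings); the simulated data, γ, and cost values are arbitrary choices, not taken from the notes.

library(e1071)

set.seed(2)                                   # two rings: inner class +1, outer class -1
n  <- 100
r  <- c(runif(n, 0, 1), runif(n, 2, 3))
th <- runif(2 * n, 0, 2 * pi)
X  <- cbind(r * cos(th), r * sin(th))
y  <- factor(rep(c(1, -1), each = n))

fit.lin <- svm(X, y, kernel = "linear")
fit.rbf <- svm(X, y, kernel = "radial", gamma = 1, cost = 1)  # K(x,x') = exp(-gamma ||x-x'||^2)
mean(predict(fit.lin, X) == y)   # poor: no separating hyperplane exists in R^2
mean(predict(fit.rbf, X) == y)   # near 1: the rings separate in the kernel feature space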
Binary to multi-class classification

SVMs are inherently binary. For K > 2 classes use either one-versus-all (K classifiers) or one-versus-one (one classifier per pair of classes); pros and cons are as discussed earlier for other binary classifiers (a quick count for the digit data is sketched below).
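For example, with the K = 10 digit classes used later, one-versus-one needs 45 pairwise SVMs while one-versus-all needs 10:

choose(10, 2)   # 45 pairwise classifiers for one-versus-one with K = 10 classes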
SVM formulation

      min_{β0, β, ξ} (1/2)||β||^2 + C Σ_{i=1}^N ξi
      subject to  yi f(xi) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, …, N.

Support vs. non-support vectors. The solution satisfies β = Σ_i αi yi xi with
      αi [yi f(xi) − (1 − ξi)] = 0.
- If yi f(xi) > 1, then ξi = 0 and αi = 0: these points play no role in the solution (non-support vectors).
- If yi f(xi) ≤ 1, then ξi = 1 − yi f(xi) ≥ 0 and αi can be positive: these are the support vectors, and f depends only on them — the solution is a linear combination of the support vectors alone.

At the solution the slack is ξi = max(0, 1 − yi f(xi)) = [1 − yi f(xi)]+.
Hinge loss: L(y, f) = [1 − yf]+. Substituting ξi = [1 − yi f(xi)]+ turns the SVM into a penalized loss problem,
      min_{β0, β} Σ_{i=1}^N [1 − yi f(xi)]+ + (λ/2) ||β||^2,
which with λ = 1/C has the same solution as the constrained formulation (this is (12.25) below).

[Sketch: the hinge loss as a function of yf — zero for yf ≥ 1, slope −1 for yf < 1.]
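A small sketch of the hinge loss and the three regimes above; the margin values are made up.

hinge <- function(yf) pmax(0, 1 - yf)      # slack xi_i = [1 - y_i f(x_i)]_+

yf <- c(2.0, 1.0, 0.5, 0.0, -1.5)          # example margins y_i * f(x_i)
hinge(yf)                                  # 0.0 0.0 0.5 1.0 2.5
# yf >  1     : correct, outside the margin  -> loss 0, non-support vector
# 0 < yf <= 1 : correct, inside the margin   -> support vector
# yf <= 0     : misclassified                -> support vector, loss >= 1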
SVM vs. Logistic Regression

Claim: with yi ∈ {−1, +1}, the logistic-regression negative log-likelihood is Σ_{i=1}^N ln(1 + e^{−yi f(xi)}) — a smooth, margin-based loss comparable to the hinge loss.

Review: binary logistic regression. Observations (xi, yi), with yi = 1 if the event Ai occurs and yi = 0 otherwise. Link function:
      pi = Pr(yi = 1 | xi) = e^{xi^T β} / (1 + e^{xi^T β}),   i.e.   ln[pi/(1 − pi)] = xi^T β = f(xi).
Likelihood and log-likelihood:
      L(β) = Π_i pi^{yi} (1 − pi)^{1 − yi},
      ℓ(β) = Σ_i [ yi xi^T β − ln(1 + e^{xi^T β}) ].
Now recode the labels as yi ∈ {−1, +1} (the old 0 becomes −1). For yi = +1 the contribution to −ℓ is ln(1 + e^{−f(xi)}), and for yi = −1 it is ln(1 + e^{f(xi)}); in both cases it equals ln(1 + e^{−yi f(xi)}). Hence
      −ℓ(β) = Σ_{i=1}^N ln(1 + e^{−yi f(xi)}),
the binomial deviance written as a function of the margin yi f(xi).
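A numerical check of the recoding step, as a sketch (all values simulated):

set.seed(4)
f   <- rnorm(5)                                # linear-predictor values x_i' beta
y01 <- rbinom(5, 1, 0.5)                       # labels coded 0/1
ypm <- 2 * y01 - 1                             # the same labels coded -1/+1

nll.01 <- -sum(y01 * f - log(1 + exp(f)))      # usual binomial negative log-likelihood
nll.pm <-  sum(log(1 + exp(-ypm * f)))         # margin form used above
all.equal(nll.01, nll.pm)                      # TRUE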
Aside: AdaBoost uses the exponential loss e^{−y f(x)}, another monotone decreasing function of the margin y f(x).
SVM vs. logistic, in loss terms: with f(x) = β0 + β^T x,
      SVM:       Σ_i [1 − yi f(xi)]+             (hinge loss),
      Logistic:  Σ_i ln(1 + e^{−yi f(xi)})        (binomial deviance).
Both penalize negative margins roughly linearly and give little or no penalty to large positive margins, so the two methods often behave similarly; the hinge loss is exactly zero beyond the margin, while the deviance is smooth and never exactly zero.
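A sketch reproducing the comparison in the figure below (losses on their natural scales, not rescaled as in ESL Figure 12.4):

yf <- seq(-3, 3, length.out = 200)
losses <- cbind(hinge    = pmax(0, 1 - yf),         # SVM
                deviance = log(1 + exp(-yf)),       # logistic / binomial deviance
                expo     = exp(-yf))                # exponential (AdaBoost)
matplot(yf, losses, type = "l", lty = 1, col = 1:3,
        xlab = "y f(x)", ylab = "loss")
legend("topright", colnames(losses), col = 1:3, lty = 1)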
FIGURE 12.4. The support vector loss function (hinge loss), compared to the negative log-likelihood loss (binomial deviance) for logistic regression, squared-er- ror loss, and a “Huberized” version of the squared hinge loss. All are shown as a function of yf rather than f, because of the symmetry between the y = +1 and y = −1 case. The deviance and Huber have the same asymptotes as the SVM loss, but are rounded in the interior. All are scaled to have the limiting left-tail slope of −1.
12.3.2 The SVM as a Penalization Method
With f(x) = h(x)^T β + β0, consider the optimization problem
      min_{β0, β} Σ_{i=1}^N [1 − yi f(xi)]+ + (λ/2) ∥β∥^2,        (12.25)
where the subscript "+" indicates positive part. This has the form loss + penalty, which is a familiar paradigm in function estimation. It is easy to show (Exercise 12.1) that the solution to (12.25), with λ = 1/C, is the same as that for (12.8).
Examination of the "hinge" loss function L(y, f) = [1 − yf]+ shows that it is reasonable for two-class classification, when compared to other more traditional loss functions. Figure 12.4 compares it to the log-likelihood loss for logistic regression, as well as squared-error loss and a variant thereof. The (negative) log-likelihood or binomial deviance has similar tails as the SVM loss, giving zero penalty to points well inside their margin, and a
TABLE 12.1. The population minimizers for the different loss functions in Fig- ure 12.4. Logistic regression uses the binomial log-likelihood or deviance. Linear discriminant analysis (Exercise 4.2) uses squared-error loss. The SVM hinge loss estimates the mode of the posterior class probabilities, whereas the others estimate a linear transformation of these probabilities.
Loss Function                      L[y, f(x)]                                          Minimizing Function
Binomial Deviance                  log[1 + e^{−yf(x)}]                                 f(x) = log[ Pr(Y = +1|x) / Pr(Y = −1|x) ]
SVM Hinge Loss                     [1 − yf(x)]+                                        f(x) = sign[ Pr(Y = +1|x) − 1/2 ]
Squared Error                      [y − f(x)]^2 = [1 − yf(x)]^2                        f(x) = 2 Pr(Y = +1|x) − 1
"Huberised" Square Hinge Loss      −4yf(x) if yf(x) < −1; [1 − yf(x)]^2_+ otherwise    f(x) = 2 Pr(Y = +1|x) − 1
linear penalty to points on the wrong side and far away. Squared-error, on the other hand gives a quadratic penalty, and points well inside their own margin have a strong influence on the model as well. The squared hinge loss L(y, f ) = [1 − yf ]2+ is like the quadratic, except it is zero for points inside their margin. It still rises quadratically in the left tail, and will be less robust than hinge or deviance to misclassified observations. Recently Rosset and Zhu (2007) proposed a “Huberized” version of the squared hinge loss, which converts smoothly to a linear loss at yf = −1.
We can characterize these loss functions in terms of what they are es- timating at the population level. We consider minimizing EL(Y,f(X)). Table 12.1 summarizes the results. Whereas the hinge loss estimates the classifier G(x) itself, all the others estimate a transformation of the class posterior probabilities. The “Huberized” square hinge loss shares attractive properties of logistic regression (smooth loss function, estimates probabili- ties), as well as the SVM hinge loss (support points).
Formulation (12.25) casts the SVM as a regularized function estimation problem, where the coefficients of the linear expansion f (x) = β0 + h(x)T β are shrunk toward zero (excluding the constant). If h(x) represents a hierar- chical basis having some ordered structure (such as ordered in roughness),
FIGURE 10.4. Loss functions for two-class classification. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are misclassification: I(sign(f) ̸= y); exponential: exp(−yf); binomial deviance: log(1 + exp(−2yf)); squared error: (y − f)2; and support vector: (1 − yf)+ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).
f(x) = 0. The goal of the classification algorithm is to produce positive
margins as frequently as possible. Any loss criterion used for classification
should penalize negative margins more heavily than positive ones since positive margin observations are already correctly classified.
Figure 10.4 shows both the exponential (10.8) and binomial deviance criteria as a function of the margin y · f (x). Also shown is misclassification loss L(y,f(x)) = I(y·f(x) < 0), which gives unit penalty for negative mar- gin values, and no penalty at all for positive ones. Both the exponential and deviance loss can be viewed as monotone continuous approximations to misclassification loss. They continuously penalize increasingly negative margin values more heavily than they reward increasingly positive ones. The difference between them is in degree. The penalty associated with bi- nomial deviance increases linearly for large increasingly negative margin, whereas the exponential criterion increases the influence of such observa- tions exponentially.
At any point in the training process the exponential criterion concen- trates much more influence on observations with large negative margins. Binomial deviance concentrates relatively less influence on such observa-
Reflection

SVM was the dominant classifier for roughly 20 years (vs. artificial neural networks). Reasons it was popular:
- Powerful, e.g. with the kernel trick for digit recognition.
- Not a black box: SVM ≈ Logistic + Kernel + Penalty, so its behaviour is well understood rather than mysterious.
It became less popular after AlexNet (2012), when deep neural networks — powerful but a more mysterious black box — took over.
SVR (Support Vector Regression)

Model y = f(x) + ε with f(x) = β0 + β^T x (or a kernel expansion). Same loss + penalty template as the SVM, but with an ε-insensitive loss in place of the hinge (or of the squared-error L2 loss):
      min_{β0, β} Σ_i V_ε(yi − f(xi)) + (λ/2) ||β||^2,   where V_ε(r) = max(0, |r| − ε).
Errors smaller than ε are ignored (the ε-insensitive zone), just as the hinge loss ignores points beyond the margin; only points outside the ε-tube become support vectors.
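A hedged one-dimensional SVR sketch using e1071 (type "eps-regression"); the simulated data, ε, and cost are arbitrary.

library(e1071)

set.seed(3)
x <- seq(0, 4 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

fit <- svm(y ~ x, type = "eps-regression", kernel = "radial",
           epsilon = 0.1, cost = 1)               # |error| < epsilon carries no loss
plot(x, y, col = "grey")
lines(x, predict(fit, data.frame(x = x)), lwd = 2)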
Real Examples

Example 1: Handwritten Digit Recognition (slides dated February 1, 2015)

Dataset ZipCode: http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info
Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been normalized, resulting in 16 x 16 grayscale images (Le Cun et al., 1990).
The task is to predict, from the 16 × 16 matrix of pixel intensities, the identity of each image (0, 1, ..., 9) quickly and accurately.
Content
- Exploratory Data Analysis
- Classification Algorithm
Look at the TRAINING data set
head(DATASET.train[,1:6])
## Y X.1 X.2 X.3 X.4 X.5
## 1 6 -1 -1 -1 -1.000 -1.000
## 2 5 -1 -1 -1 -0.813 -0.671
## 3 4 -1 -1 -1 -1.000 -1.000
## 4 7 -1 -1 -1 -1.000 -1.000
## 5 3 -1 -1 -1 -1.000 -1.000
## 6 6 -1 -1 -1 -1.000 -1.000
dim(DATASET.train)
## [1] 7291 257
(257 columns = 1 label y + 256 = 16 × 16 pixel values x; 7291 training images.)
Display digits of the training data set (the average image of each digit)

[Figure: the average 16 × 16 image of each digit, 0 through 9.]
[Bar chart: total number of digits in the training set, by class. Counts shown: 1194, 1005, 731, 664, 658, 652, 645, 644, 556, 542; digit 0 is the most frequent ("more 0s").]
Classification. Predictive Model. SVM (Support Vector Machine) Algorithm
##
## Call:
## svm(formula = DATASET.train$Y ~ ., data = DATASET.train,
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.00390625
##
## Number of Support Vectors: 2606
##
## (21332631923528563256262401246) ##
##
## Number of Classes: 10
Confusion Matrix (SVM)
##           Predicted Class
## Actual Class   0   1   2   3   4   5   6   7   8   9
##            0 351   0   6   0   1   0   0   0   1   0
##            1   0 253   1   0   5   1   3   1   0   0
##            2   2   0 183   4   3   0   1   1   4   0
##            3   0   0   5 146   0  11   0   1   3   0
##            4   0   1   3   0 186   1   2   3   1   3
##            5   3   0   2   3   1 147   0   0   1   3
##            6   4   0   4   0   2   1 158   0   1   0
##            7   0   0   2   0   5   0   0 138   0   2
##            8   4   0   2   3   0   2   1   0 151   3
##            9   0   0   0   0   4   1   0   0   2 170
## [1] "Accuary (Precision): 0.938216243148979"
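The exact code behind this output is not shown in the notes; the following is a sketch of the kind of call that produces it with e1071, assuming the label column Y has been converted to a factor and that a DATASET.test data frame with the same columns exists.

library(e1071)

DATASET.train$Y <- factor(DATASET.train$Y)        # classification, not regression
DATASET.test$Y  <- factor(DATASET.test$Y)
model.svm <- svm(Y ~ ., data = DATASET.train,     # radial kernel, cost = 1;
                 kernel = "radial", cost = 1)     # gamma defaults to 1/256 = 0.0039...

pred <- predict(model.svm, newdata = DATASET.test)
table(Actual = DATASET.test$Y, Predicted = pred)  # confusion matrix as above
mean(pred == DATASET.test$Y)                      # accuracy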
Predict Digit for Example 1 (SVM)

## [1] "Current Digit: 9"
## [1] "Predicted Digit: 9"

[Image: the 16 × 16 test image, a handwritten 9.]
Errors with Support Vector Machine (SVM)

## [1] "Error Numbers: 4"

[Images of the four misclassified digits, labelled current digit (C.D.) vs. predicted digit (Pr.D.): C.D.:9 - Pr.D.:2, C.D.:6 - Pr.D.:5, C.D.:3 - Pr.D.:5, C.D.:6 - Pr.D.:0.]
Example 2: ZIP Code Data (ESL Chapter 11)
FIGURE 11.9. Examples of training cases from ZIP code data. Each image is a 16 × 16 8-bit grayscale representation of a handwritten digit.
11.7 Example: ZIP Code Data
This example is a character recognition task: classification of handwritten numerals. This problem captured the attention of the machine learning and neural network community for many years, and has remained a benchmark problem in the field. Figure 11.9 shows some examples of normalized hand- written digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images shown here have been deslanted and size normal- ized, resulting in 16 × 16 grayscale images (Le Cun et al., 1990). These 256 pixel values are used as inputs to the neural network classifier.
A black box neural network is not ideally suited to this pattern recognition task, partly because the pixel representation of the images lack certain invariances (such as small rotations of the image). Consequently early attempts with neural networks yielded misclassification rates around 4.5% on various examples of the problem. In this section we show some of the pioneering efforts to handcraft the neural network to overcome some of these deficiencies (Le Cun, 1989), which ultimately led to the state of the art in neural network performance (Le Cun et al., 1998)1.
Although current digit datasets have tens of thousands of training and test examples, the sample size here is deliberately modest in order to em-
1The figures and tables in this example were recreated from Le Cun (1989).
[Figure 11.10 residue; handwritten annotations mark Net-1 as (multinomial) logistic regression, Net-2 as a plain ANN, and Net-3/Net-4/Net-5 (local connectivity, shared weights) as CNN-style architectures. Panel labels show the 16x16 input, hidden layers such as 12, 8x8, 8x8x2, 4x4, 4x4x4, and 10 outputs.]
FIGURE 11.10. Architecture of the five networks used in the ZIP code example.
phasize the effects. The examples were obtained by scanning some actual hand-drawn digits, and then generating additional images by random hor- izontal shifts. Details may be found in Le Cun (1989). There are 320 digits in the training set, and 160 in the test set.
Five different networks were fit to the data:
Net-1: No hidden layer, equivalent to multinomial logistic regression. Net-2: One hidden layer, 12 hidden units fully connected.
Net-3: Two hidden layers locally connected.
Net-4: Two hidden layers, locally connected with weight sharing.
Net-5: Two hidden layers, locally connected, two levels of weight sharing.
These are depicted in Figure 11.10. Net-1 for example has 256 inputs, one
each for the 16 × 16 input pixels, and ten output units for each of the digits
0–9. The predicted value f̂k(x) represents the estimated probability that an image x has digit class k, for k = 0, 1, 2, . . . , 9.
FIGURE 11.11. Test performance curves, as a function of the number of train- ing epochs, for the five networks of Table 11.1 applied to the ZIP code data. (Le Cun, 1989)
The networks all have sigmoidal output units, and were all fit with the sum-of-squares error function. The first network has no hidden layer, and hence is nearly equivalent to a linear multinomial regression model (Exer- cise 11.4). Net-2 is a single hidden layer network with 12 hidden units, of the kind described above.
The training set error for all of the networks was 0%, since in all cases there are more parameters than training observations. The evolution of the test error during the training epochs is shown in Figure 11.11. The linear network (Net-1) starts to overfit fairly quickly, while test performance of the others level off at successively superior values.
The other three networks have additional features which demonstrate the power and flexibility of the neural network paradigm. They introduce constraints on the network, natural for the problem at hand, which allow for more complex connectivity but fewer parameters.
Net-3 uses local connectivity: this means that each hidden unit is con- nected to only a small patch of units in the layer below. In the first hidden layer (an 8×8 array), each unit takes inputs from a 3×3 patch of the input layer; for units in the first hidden layer that are one unit apart, their recep- tive fields overlap by one row or column, and hence are two pixels apart. In the second hidden layer, inputs are from a 5 × 5 patch, and again units that are one unit apart have receptive fields that are two units apart. The weights for all other connections are set to zero. Local connectivity makes each unit responsible for extracting local features from the layer below, and
reduces considerably the total number of weights. With many more hidden units than Net-2, Net-3 has fewer links and hence weights (1226 vs. 3214), and achieves similar performance.
TABLE 11.1. Test set performance of five different neural networks on a handwritten digit classification example (Le Cun, 1989).
Network Architecture             Links   Weights   % Correct
Net-1: Single layer network       2570      2570       80.0%
Net-2: Two layer network          3214      3214       87.0%
Net-3: Locally connected          1226      1226       88.5%
Net-4: Constrained network 1      2266      1132       94.0%
Net-5: Constrained network 2      5194      1060       98.4%
Net-4 and Net-5 have local connectivity with shared weights. All units in a local feature map perform the same operation on different parts of the image, achieved by sharing the same weights. The first hidden layer of Net- 4 has two 8×8 arrays, and each unit takes input from a 3×3 patch just like in Net-3. However, each of the units in a single 8 × 8 feature map share the same set of nine weights (but have their own bias parameter). This forces the extracted features in different parts of the image to be computed by the same linear functional, and consequently these networks are sometimes known as convolutional networks. The second hidden layer of Net-4 has no weight sharing, and is the same as in Net-3. The gradient of the error function R with respect to a shared weight is the sum of the gradients of R with respect to each connection controlled by the weights in question.
Table 11.1 gives the number of links, the number of weights and the optimal test performance for each of the networks. We see that Net-4 has more links but fewer weights than Net-3, and superior test performance. Net-5 has four 4 × 4 feature maps in the second hidden layer, each unit connected to a 5 × 5 local patch in the layer below. Weights are shared in each of these feature maps. We see that Net-5 does the best, having errors of only 1.6%, compared to 13% for the “vanilla” network Net-2. The clever design of network Net-5, motivated by the fact that features of handwriting style should appear in more than one part of a digit, was the result of many person years of experimentation. This and similar networks gave better performance on ZIP code problems than any other learning method at that time (early 1990s). This example also shows that neural networks are not a fully automatic tool, as they are sometimes advertised. As with all statistical models, subject matter knowledge can and should be used to improve their performance.
This network was later outperformed by the tangent distance approach (Simard et al., 1993) described in Section 13.3.3, which explicitly incorpo- rates natural affine invariances. At this point the digit recognition datasets become test beds for every new learning procedure, and researchers worked
hard to drive down the error rates. As of this writing, the best error rates on a large database (60, 000 training, 10, 000 test observations), derived from standard NIST2 databases, were reported to be the following: (Le Cun et al., 1998):
• 1.1% for tangent distance with a 1-nearest neighbor classifier (Sec- tion 13.3.3);
• 0.8% for a degree-9 polynomial SVM (Section 12.3);
• 0.8% for LeNet-5, a more complex version of the convolutional net-
work described here;
• 0.7% for boosted LeNet-4. Boosting is described in Chapter 8. LeNet-
4 is a predecessor of LeNet-5.
Le Cun et al. (1998) report a much larger table of performance results, and it is evident that many groups have been working very hard to bring these test error rates down. They report a standard error of 0.1% on the error estimates, which is based on a binomial average with N = 10,000 and p ≈ 0.01. This implies that error rates within 0.1—0.2% of one another are statistically equivalent. Realistically the standard error is even higher, since the test data has been implicitly used in the tuning of the various procedures.
11.8 Discussion
Both projection pursuit regression and neural networks take nonlinear func- tions of linear combinations (“derived features”) of the inputs. This is a powerful and very general approach for regression and classification, and has been shown to compete well with the best learning methods on many problems.
These tools are especially effective in problems with a high signal-to-noise ratio and settings where prediction without interpretation is the goal. They are less effective for problems where the goal is to describe the physical pro- cess that generated the data and the roles of individual inputs. Each input enters into the model in many places, in a nonlinear fashion. Some authors (Hinton, 1989) plot a diagram of the estimated weights into each hidden unit, to try to understand the feature that each unit is extracting. This is limited however by the lack of identifiability of the parameter vectors αm, m = 1, . . . , M. Often there are solutions with αm spanning the same linear space as the ones found during training, giving predicted values that
2The National Institute of Standards and Technology maintain large databases, in- cluding handwritten character databases; http://www.nist.gov/srd/.
Computational Considerations
With N training cases, p predictors, and m support vectors, the support vector machine requires m3 + mN + mpN operations, assuming m ≈ N. They do not scale well with N, although computational shortcuts are avail- able (Platt, 1999). Since these are evolving rapidly, the reader is urged to search the web for the latest technology.
LDA requires Np2 + p3 operations, as does PDA. The complexity of FDA depends on the regression method used. Many techniques are linear in N, such as additive models and MARS. General splines and kernel-based regression methods will typically require N3 operations.
Software is available for fitting FDA, PDA and MDA models in the R package mda, which is also available in S-PLUS.
Bibliographic Notes
The theory behind support vector machines is due to Vapnik and is de- scribed in Vapnik (1996). There is a burgeoning literature on SVMs; an online bibliography, created and maintained by Alex Smola and Bernhard Sch ̈olkopf, can be found at:
http://www.kernel-machines.org.
Our treatment is based on Wahba et al. (2000) and Evgeniou et al. (2000),
and the tutorial by Burges (Burges, 1998).
Linear discriminant analysis is due to Fisher (1936) and Rao (1973). The
connection with optimal scoring dates back at least to Breiman and Ihaka (1984), and in a simple form to Fisher (1936). There are strong connections with correspondence analysis (Greenacre, 1984). The description of flexible, penalized and mixture discriminant analysis is taken from Hastie et al. (1994), Hastie et al. (1995) and Hastie and Tibshirani (1996b), and all three are summarized in Hastie et al. (1998); see also Ripley (1996).
Exercises
Ex. 12.1 Show that the criteria (12.25) and (12.8) are equivalent.
Ex. 12.2 Show that the solution to (12.29) is the same as the solution to
(12.25) for a particular kernel.
Ex. 12.3 Consider a modification to (12.43) where you do not penalize the constant. Formulate the problem, and characterize its solution.
Ex. 12.4 Suppose you perform a reduced-subspace linear discriminant anal- ysis for a K-group problem. You compute the canonical variables of di-
Example 3: Heart Data (ISLR Chapter 9)

Support vector classifier (linear SVM) vs. LDA: the two give similar ROC curves and are hard to distinguish on these data.
[Figure 9.10 (handwritten note: training ROC curves).]
FIGURE 9.10. ROC curves for the Heart data training set. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10−3, 10−2, and 10−1.
9.3.3 An Application to the Heart Disease Data
In Chapter 8 we apply decision trees and related methods to the Heart data.
The aim is to use 13 predictors such as Age, Sex, and Chol in order to predict whether an individual has heart disease. We now investigate how an SVM compares to LDA on this data. After removing 6 missing observations, the data consist of 297 subjects, which we randomly split into 207 training and 90 test observations.
We first fit LDA and the support vector classifier to the training data.
Note that the support vector classifier is equivalent to a SVM using a poly-
nomial kernel of degree d = 1. The left-hand panel of Figure 9.10 displays
ROC curves (described in Section 4.4.3) for the training set predictions for
both LDA and the support vector classifier. Both classifiers compute scores of the form f̂(X) = β̂0 + β̂1X1 + β̂2X2 + ... + β̂pXp for each observation. For any given cutoff t, we classify observations into the heart disease or no heart disease categories depending on whether f̂(X) < t or f̂(X) ≥ t.
The ROC curve is obtained by forming these predictions and computing the false positive and true positive rates for a range of values of t. An opti- mal classifier will hug the top left corner of the ROC plot. In this instance LDA and the support vector classifier both perform well, though there is a suggestion that the support vector classifier may be slightly superior.
The right-hand panel of Figure 9.10 displays ROC curves for SVMs using a radial kernel, with various values of γ. As γ increases and the fit becomes more non-linear, the ROC curves improve. Using γ = 10−1 appears to give an almost perfect ROC curve. However, these curves represent training error rates, which can be misleading in terms of performance on new test data. Figure 9.11 displays ROC curves computed on the 90 test observa-
[Handwritten note: on the test set the differences are small.]
[Figure 9.11 (handwritten notes: on the test data the SVM is only slightly better than LDA, and the γ that was best on the training set is the worst here).]
FIGURE 9.11. ROC curves for the test set of the Heart data. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10−3, 10−2, and 10−1.
tions. We observe some differences from the training ROC curves. In the left-hand panel of Figure 9.11, the support vector classifier appears to have a small advantage over LDA (although these differences are not statisti- cally significant). In the right-hand panel, the SVM using γ = 10−1, which showed the best results on the training data, produces the worst estimates on the test data. This is once again evidence that while a more flexible method will often produce lower training error rates, this does not neces- sarily lead to improved performance on test data. The SVMs with γ = 10−2 and γ = 10−3 perform comparably to the support vector classifier, and all three outperform the SVM with γ = 10−1.
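A hedged sketch of how such ROC curves can be produced with e1071 plus the ROCR package, mirroring the ISLR lab approach; the object names Heart, train, and the outcome AHD are assumptions about how the data are stored, and the sign of the decision values may need flipping depending on the factor level ordering.

library(e1071)
library(ROCR)

rocplot <- function(scores, truth, ...) {          # helper: draw one ROC curve
  perf <- performance(prediction(scores, truth), "tpr", "fpr")
  plot(perf, ...)
}
dv <- function(fit, newdata)                       # fitted scores f-hat(X)
  attributes(predict(fit, newdata, decision.values = TRUE))$decision.values

svc <- svm(AHD ~ ., data = Heart[train, ], kernel = "linear",
           decision.values = TRUE)
rbf <- svm(AHD ~ ., data = Heart[train, ], kernel = "radial",
           gamma = 1e-1, decision.values = TRUE)

rocplot(dv(svc, Heart[-train, ]), Heart$AHD[-train])                  # test-set ROC
rocplot(dv(rbf, Heart[-train, ]), Heart$AHD[-train], add = TRUE, col = "red")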
9.4 SVMs with More than Two Classes
So far, our discussion has been limited to the case of binary classification: that is, classification in the two-class setting. How can we extend SVMs to the more general case where we have some arbitrary number of classes? It turns out that the concept of separating hyperplanes upon which SVMs are based does not lend itself naturally to more than two classes. Though a number of proposals for extending SVMs to the K-class case have been made, the two most popular are the one-versus-one and one-versus-all approaches. We briefly discuss those two approaches here.
9.4.1 One-Versus-One Classification
Suppose that we would like to perform classification using SVMs, and there
are K > 2 classes. A one-versus-one or all-pairs approach constructs K(K − 1)/2 SVMs, each of which compares a pair of classes.