Statistical Science
2006, Vol. 21, No. 1, 1–15
DOI: 10.1214/088342306000000060
© Institute of Mathematical Statistics, 2006
Classifier Technology and the Illusion of Progress
David J. Hand
David J. Hand is Professor, Department of Mathematics and Institute for Mathematical Science, Imperial College, Huxley Building, 180 Queen's Gate, London SW7 2AZ, United Kingdom (e-mail: d.j.hand@imperial.ac.uk).
Abstract. A great many tools have been developed for supervised classification, ranging from early methods such as linear discriminant analysis through to modern developments such as neural networks and support vector machines. A large number of comparative studies have been conducted in attempts to establish the relative superiority of these methods. This paper argues that these comparisons often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion. In particular, simple methods typically yield performance almost as good as more sophisticated methods, to the extent that the difference in performance may be swamped by other sources of uncertainty that generally are not considered in the classical supervised classification paradigm.
Key words and phrases: Supervised classification, error rate, misclassification rate, simplicity, principle of parsimony, population drift, selectivity bias, flat maximum effect, problem uncertainty, empirical comparisons.
1. INTRODUCTION
In supervised classification, one seeks to construct a rule which will allow one to assign objects to one of a prespecified set of classes based solely on a vector of measurements taken on those objects. Construction of the rule is based on a "design set" or "training set" of objects with known measurement vectors and for which the true class is also known: one essentially tries to extract from the design set the information which is relevant to distinguishing between the classes in terms of the given measurements. It is because the classes are known for the members of this initial data set that the term "supervised" is used: it is as if a "supervisor" has provided these class labels.
Such problems are ubiquitous and, as a consequence, have been tackled in several different research areas, including statistics, machine learning, pattern recognition, computational learning theory and data mining. As a result, a tremendous variety of algorithms and models has been developed for the construction of such rules. A partial list includes linear discriminant analysis, quadratic discriminant analysis, regularized discriminant analysis, the naive Bayes method, logistic discriminant analysis, perceptrons, neural networks, radial basis function methods, vector quantization methods, nearest neighbor and kernel nonparametric methods, tree classifiers such as CART and C4.5, support vector machines and rule-based methods. New methods, new variants on existing methods and new algorithms for existing methods are being developed all the time. In addition, different methods for variable selection, handling missing values and other aspects of data preprocessing multiply the number
of tools yet further. General theoretical advances have also been made which have resulted in improved performance at predicting the class of new objects. These include ideas such as bagging, boosting and more general ensemble classifiers. Furthermore, apart from the straightforward development of new rules, theory and practice have been developed for performance assessment. A variety of criteria have been investigated, including measures based on the receiver operating characteristic (ROC) and Brier score, as well as the standard measure of misclassification rate. Subtle estimators of these have been developed, such as jackknife, cross-validation and a variety of bootstrap methods, to overcome the potential optimistic bias which results from simply reclassifying the design set.
An examination of recent conference proceedings and journal articles shows that such developments are continuing. In part this is because of new computational developments that permit the exploration of new ideas, and in part it is because of the emergence of new application domains which present new twists on the standard problem. For example, in bioinformatics there are often relatively few cases but many thousands of variables. In such situations the risk of overfitting is substantial and new classes of tools are required. General references to work on supervised classification include [11, 13, 33, 38, 44].
The situation to date thus appears to be one of very substantial theoretical progress, leading to deep theoretical developments and to increased predictive power in practical applications. While all of these things are true, it is the contention of this paper that the practical impact of the developments has been inflated; that although progress has been made, it may well not be as great as has been suggested. The arguments for this assertion are described in the following sections. They develop ideas introduced by Hand [12, 14, 15, 18, 19] and Jamain and Hand [24]. The essence of the argument is that the improvements attributed to the more advanced and recent developments are small, and that aspects of real practical problems often render such small differences irrelevant, or even unreal, so that the gains reported on theoretical grounds, or on empirical comparisons from simulated or even real data sets, do not translate into real advantages in practice. That is, progress is far less than it appears.
These ideas are described in four steps.
First, model-fitting is a sequential process of progressive refinement, which begins by describing the largest and most striking aspects of the data structure, and then turns to progressively smaller aspects (stopping, one hopes, before the process begins to model idiosyncrasies of the observed sample of data rather than aspects of the true underlying distribution). In Section 2 we show that this means that the large gains in predictive accuracy in classification are won using relatively simple models at the start of the process, leaving potential gains which decrease in size as the modeling process is taken further. All of this means that the extra accuracy of the more sophisticated approaches, beyond that attained by simple models, is achieved from "minor" aspects of the distributions and classification problems.
Second, in Section 3 we argue that in many, perhaps most, real classification problems the data points in the design set are not, in fact, randomly drawn from the same distribution as the data points to which the classifier will be applied. There are many reasons for this discrepancy, and some are illustrated. It goes without saying that statements about classifier accuracy based on a false assumption about the identity of the design set distribution and the distribution of future points may well be inaccurate.
Third, when constructing classification rules, various other assumptions and choices are often made which may not be appropriate and which may give misleading impressions of future classifier performance. For example, it is typically assumed that the classes are objectively defined, with no arbitrariness or uncertainty about the class labels, but this is sometimes not the case. Likewise, parameters are often estimated by optimizing criteria which are not relevant to the real aim of classification accuracy. Such issues are described in Section 4 and, once again, it is obvious that these introduce doubts about how the claimed classifier performance will generalize to real problems.
The phenomena with which we are concerned in Sections 3 and 4 are related to the phenomenon of overfitting. A model overfits when it models the design sample too closely rather than modeling the distribution from which this sample is drawn. In Sections 3 and 4 we are concerned with situations in which the models may accurately reflect the design distributions (so they do not underfit or overfit), but where they fail to recognize that these distributions, and the apparent classification problems described, are in fact merely a single such problem drawn from a notional distribution of problems. The real aim might be to solve a rather different problem. One
might thus describe the issue as one of problem uncertainty. To take a familiar example, which we do not explore in detail in this paper because it has been explored elsewhere, the relative costs of different kinds of misclassification may differ and may be unknown. A very common resolution is to assume equal costs (Jamain and Hand [24] found that most comparative studies of classification rules made this assumption) and to use straightforward error rate as the performance criterion. However, equality is but one choice, and an arbitrary one at that, and one which we suspect is in fact rarely appropriate. In assuming equal costs, one is adopting a particular problem which may not be the one which is really to be solved. Indeed, things are even worse than this might suggest, because relative misclassification costs may change over time. Provost and Fawcett [36] have described such situations: "Comparison often is difficult in real-world environments because key parameters of the target environment are not known. The optimal cost/benefit tradeoffs and the target class priors seldom are known precisely, and often are subject to change (Zahavi and Levin [47]; Friedman and Wyatt [8]; Klinkenberg and Thorsten [29]). For example, in fraud detection we cannot ignore misclassification costs or the skewed class distribution, nor can we assume that our estimates are precise or static (Fawcett and Provost [6])."
Moving on, our fourth argument is that classification methods are typically evaluated by reporting their performance on a variety of real data sets. However, such empirical comparisons, while superficially attractive, have major problems which are often not acknowledged. In general, we suggest in Section 5 that no method will be universally superior to other methods: relative superiority will depend on the type of data used in the comparisons, the particular data sets used, the performance criterion and a host of other factors. Moreover, the relative performance will depend on the experience the person making the comparison has in using the methods, and this experience may differ between methods: researcher A may find that his favorite method is best, merely because he knows how to squeeze the best performance from this method.
These various arguments together suggest that an apparent superiority in classification accuracy, obtained in "laboratory conditions," may not translate to a superiority in real-world conditions and, in particular, the apparent superiority of highly sophisticated methods may be illusory, with simple methods often being equally effective or even superior in classifying new data points.
2. MARGINAL IMPROVEMENTS
This section demonstrates that the extra performance to be achieved by more sophisticated classification rules, beyond that attained by simple methods, is small. It follows that if aspects of the classification problem are not accurately described (e.g., if incorrect distributions have been used, incorrect class definitions have been adopted, inappropriate performance comparison criteria have been applied, etc.), then the reported advantage of the more sophisticated methods may be incorrect. Later sections illustrate how some inaccuracies in the classification problem description can arise.
2.1 A Simple Example
Statistical modeling is a sequential process in which one gradually refines the model to provide a better and better fit to the distributions from which the data were drawn. In general, the earlier stages in this process yield greater improvement in model fit than later stages. Furthermore, if one looks at the historical development of classification methods, then the earlier approaches involve relatively simple structures (e.g., the linear forms of linear or logistic discriminant analysis), while more recent approaches involve more complicated structures (e.g., the decision surfaces of neural networks or support vector machines). It follows that the simple approaches will have led to greater improvement in predictive performance than the later approaches which are necessarily trying to improve on the predictive performance obtained by the simpler earlier methods. Put another way, there is a law of diminishing returns.
Although this paper is concerned with supervised classification problems, it is illuminating to examine a simple regression case. Suppose that we have a single response variable y which is to be predicted from d variables (x1, ..., xd)^T = x. Suppose also that the correlation matrix of (x^T, y)^T has the form

(2.1)  \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \begin{pmatrix} (1-\rho)I + \rho\mathbf{1}\mathbf{1}^T & \tau \\ \tau^T & 1 \end{pmatrix},

with Σ11 = (1 − ρ)I + ρ11^T, Σ12 = Σ21^T = τ and Σ22 = 1, where I is the d × d identity matrix, 1 = (1, ..., 1)^T of length d and τ = (τ, ..., τ)^T of length d. That is, the correlation between each pair of predictor variables is ρ, and the correlation between each predictor variable and the response variable is τ. Suppose also that ρ, τ ≥ 0. This condition is not necessary for the argument which follows; it merely allows us to avoid some detail.
Let V(d) be the conditional variance of y given the values of the d predictor variables x, as above. Standard results give this conditional variance as

(2.2)  V(d) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.

Using the result that

(2.3)  \Sigma_{11}^{-1} = [(1-\rho)I + \rho\mathbf{1}\mathbf{1}^T]^{-1} = \frac{1}{1-\rho}\Bigl[I - \frac{\rho\mathbf{1}\mathbf{1}^T}{1+(d-1)\rho}\Bigr]

[with −(d − 1)^(-1) < ρ < 1, so that Σ11 is positive definite], leads to

(2.4)  V(d) = 1 - \tau^T\,\frac{1}{1-\rho}\Bigl[I - \frac{\rho\mathbf{1}\mathbf{1}^T}{1+(d-1)\rho}\Bigr]\tau = 1 - \frac{d\tau^2}{1-\rho} + \frac{\rho d^2\tau^2}{(1+(d-1)\rho)(1-\rho)}.

From this it follows that the reduction in conditional variance due to adding an extra predictor variable, x_{d+1} (also correlated ρ with the other predictors and τ with the response variable), is

(2.5)  X(d+1) = V(d) - V(d+1) = \frac{\tau^2}{1-\rho} + \frac{\rho\tau^2}{1-\rho}\Bigl[\frac{d^2}{1+(d-1)\rho} - \frac{(d+1)^2}{1+d\rho}\Bigr].

Note that the condition −(d − 1)^(-1) < ρ < 1 must still be satisfied when d is increased.

Now consider two cases:

Case 1. When the predictor variables are uncorrelated, ρ = 0. From (2.5), we obtain X(d + 1) = τ^2. That is, if the predictor variables are mutually uncorrelated and each has correlation τ with the response variable, then each additional predictor reduces the conditional variance of y given the predictors by τ^2. [Of course, by setting ρ = 0 in (2.4) we see that this is only possible up to d = τ^(-2) predictors. With this many predictors the conditional variance of y given x has been reduced to zero.]

Case 2. ρ > 0. Plots of V(d) for τ = 0.5 and for a range of ρ values are shown in Figure 1. When there is reasonably strong mutual correlation between the predictor variables, the earliest ones contribute substantially more to the reduction in variance remaining unexplained than do the later ones. The case ρ = 0 consists of a diagonal straight line running from 1 down to zero. In the case ρ = 0.9, almost all of the variance in the response variable is explained by the first chosen predictor.

Fig. 1. Conditional variance of response variable as additional predictors are added for τ = 0.5. A range of values of ρ is shown.

This example shows that the reduction in conditional variance of the response variable decreases with each additional predictor we add, even though each predictor has an identical correlation with the response variable (provided this correlation is greater than 0). The reason for the reduction is, of course, the mutual correlation between the predictors: much of the predictive power of a new predictor has already been accounted for by the existing predictors.

In real applications, the situation is generally even more pronounced than in this illustration. Usually, in real applications, the predictor variables are not identically correlated with the response, and the predictors are selected sequentially, beginning with those which maximally reduce the conditional variance. In a sense, then, the example above provides a lower bound on the phenomenon: in real applications the proportion of the gains attributable to the early steps is even greater.

2.2 Decreasing Bounds on Possible Improvement

We now return to supervised classification. For illustrative purposes, suppose that misclassification rate is the performance criterion, although similar arguments apply with other criteria. Ignoring issues of overfitting, adding additional predictor variables can only lead to a decrease in misclassification rate.
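The law of diminishing returns in this illustration is easy to check numerically. The short Python sketch below simply evaluates (2.4) and the successive reductions (2.5); the particular values of ρ and τ are arbitrary choices for illustration, kept in a range where the implied correlation matrix remains valid.

    import numpy as np

    def cond_var(d, rho, tau):
        # Conditional variance of y given d equicorrelated predictors, equation (2.4).
        return 1 - d * tau**2 / (1 - rho) \
                 + rho * d**2 * tau**2 / ((1 + (d - 1) * rho) * (1 - rho))

    tau = 0.5
    for rho in (0.3, 0.6, 0.9):
        V = np.array([cond_var(d, rho, tau) for d in range(11)])
        X = -np.diff(V)  # successive reductions X(d+1), equation (2.5)
        print("rho =", rho)
        print("  V(d), d = 0..10:  ", np.round(V, 3))
        print("  reductions X(d+1):", np.round(X, 3))

In each case the first predictor removes τ^2 = 0.25 of the unit variance, while subsequent predictors remove progressively less, and the stronger the intercorrelation ρ, the more extreme the drop.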
The simplest model is that which uses no predictors, leading, in the two-class case, to a misclassification rate of m0 = π0, where π0 is the prior probability of the smaller class. Suppose that a predictor variable is now introduced which has the effect of reducing the misclassification rate to m1 < m0. Then the scope for further improvement is only m1, which is less than the original scope m0. Furthermore, if m1 < m0 − m1, then all future additions necessarily improve things by less than the first predictor variable. In fact, things are even more extreme than this: one cannot further reduce the misclassification rate by more than m1 − mb, where mb is the Bayes error rate. To put it another way, at each step the maximum possible increase in predictive power decreases, so it is not surprising that, in general, at each step the additional contribution to predictive power decreases.

2.3 Effectiveness of Simple Classifiers

Although the literature contains examples of artificial data which simple models cannot separate (e.g., intertwined spirals or checkerboard patterns), such data sets are exceedingly rare in real life. Conversely, in the two-class case, although few real data sets have exactly linear decision surfaces, it is common to find that the centroids of the predictor variable distributions of the classes are different, so that a simple linear surface can do surprisingly well as an estimate of the true decision surface. This may not be the same as "can do surprisingly well in classifying the points," since in many problems the Bayes error rate is high, meaning that no decision surface can separate the distributions of such problems very well. However, it means that the dramatic steps in improvement in classifier accuracy are made in the simple first steps. This is a phenomenon which has been noticed by others (e.g., Rendell and Seshu [37]; Shavlik, Mooney and Towell [41]; Mingers [34]; Weiss, Galen and Tadepalli [45]; Holte [22]). Holte [22], in particular, carried out an investigation of this phenomenon. His "simple classifier" (called 1R) consists of a partition of a single variable, with each cell of the partition possibly being assigned to a different class: it is a multiple-split single-level tree classifier. A search through the variables is used to find that which yields the best predictive accuracy. Holte compared this simple rule with C4.5, a more sophisticated tree algorithm, finding that "on most of the datasets studied, 1R's accuracy is about 3 percentage points lower than C4's."

We carried out a similar analysis. Perhaps the earliest classification method formally developed is Fisher's linear discriminant analysis [7]. Table 1 shows misclassification rates for this method and for the best performing method we could find in a search of the literature (these data were abstracted from the data accumulated by Jamain [23] and Jamain and Hand [24]) for a randomly selected sample of ten data sets. The first numerical column shows the misclassification rate of the best method we found (mT), the second shows that of linear discriminant analysis (mL), the third shows the default rule of assigning every point to the majority class (m0) and the final column shows the proportion of the difference between the default rule and the best rule which is achieved by linear discriminant analysis [(m0 − mL)/(m0 − mT)]. It is likely that the best rules, being the best of rules which many researchers have applied, are producing results near the Bayes error rate.
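The final column of Table 1 (below) is simply this proportion applied to the tabulated error rates. A trivial sketch, using the Segmentation row as an example; any small discrepancy with the tabulated value presumably reflects rounding of the published rates.

    def proportion_achieved(m0, mL, mT):
        # Share of the gap between the default rule (m0) and the best method (mT)
        # that is closed by linear discriminant analysis (mL).
        return (m0 - mL) / (m0 - mT)

    print(proportion_achieved(m0=0.760, mL=0.083, mT=0.0140))  # roughly 0.91 for Segmentation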
Table 1
Performance of linear discriminant analysis and the best result we found on ten randomly selected data sets

Data set            Best method e.r.   Lindisc e.r.   Default rule   Prop linear
Segmentation              0.0140           0.083          0.760          0.907
Pima                      0.1979           0.221          0.350          0.848
House-votes16             0.0270           0.046          0.386          0.948
Vehicle                   0.1450           0.216          0.750          0.883
Satimage                  0.0850           0.160          0.758          0.889
Heart Cleveland           0.1410           0.141          0.560          1.000
Splice                    0.0330           0.057          0.475          0.945
Waveform21                0.0035           0.004          0.667          0.999
Led7                      0.2650           0.265          0.900          1.000
Breast Wisconsin          0.0260           0.038          0.345          0.963

The striking thing about Table 1 is the large values of the percentages of classification accuracy gained by simple linear discriminant analysis. The lowest percentage is 85% and in most cases over 90% of the achievable improvement in predictive accuracy, over the simple baseline model, is achieved by the simple linear classifier.

I am grateful to Willi Sauerbrei for pointing out that when the error rates of both the best method and the linear method are small, the large proportion in achievable accuracy which can be obtained by the linear method corresponds to the error rate of the linear method being a large multiple of that of the best method. For example, in the most extreme case in Table 1, the results for the segmentation data show that the linear discrimination error rate is nearly six times that of the best method. On the other hand, when the error rates are small, this large difference will correspond to only a small proportion of new data points. Small differences in error rate are susceptible to the issues raised in Sections 3 and 4: they may vanish when problem uncertainties are taken into account.

2.4 The Flat Maximum Effect

Even within the context of classifiers defined in terms of simple linear combinations of the predictor variables, it has often been observed that the major gains are made by (for example) weighting the variables equally, with only little further gains to be had by careful optimization of the weights. This phenomenon has been termed the flat maximum effect [13, 43]: in general, often quite large deviations from the optimal set of weights will yield predictive performance not substantially worse than the optimal weights. An informal argument that shows why this is often the case is as follows.

Let the predictor variables be (x1, ..., xd)^T = x and, for simplicity, assume that E(xi) = 0 and V(xi) = 1 for i = 1, ..., d. Let Σ = {rij} be the correlation matrix between these variables. Now define two weighted sums

w = \sum_{i=1}^d w_i x_i \quad\text{and}\quad v = \sum_{i=1}^d v_i x_i,

using respective weight vectors (w1, ..., wd) and (v1, ..., vd). In general, r(w,v), the correlation between w and v, can take extreme values of +1 and −1, but suppose we restrict the weights to be nonnegative, wi, vi ≥ 0 for i = 1, ..., d, and also require Σi wi = 1 and Σi vi = 1. Using these conditions, a little algebra shows that

r(v,w) \ge \sum_{ij} v_i w_j r(x_i, x_j).

Now, with equal weights, vi = 1/d, i = 1, ..., d, we obtain

r(v,w) \ge \frac{1}{d}\sum_{ij} w_j r(x_i, x_j) \ge \frac{1}{d}\sum_{ij} w_j r(x_i, x_k),

where k = arg min_j r(xi, xj). From this,

r(v,w) \ge \frac{1}{d}\sum_{ij} w_j r(x_i, x_k) = \frac{1}{d}\sum_i r(x_i, x_k).

In words, the correlation between an arbitrary weighted sum of the x variables (with weights summing to 1) and the simple combination using equal weights is bounded below by the smallest row average of the entries in the correlation matrix of the x variables. Hence if the correlations are all high, the simple average will be highly correlated with any other weighted sum: the choice of weights will make little difference to the scores. The gain to be made by the extra effort of optimizing the weights may not be worth the effort.
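A small simulation sketch makes the flat maximum effect concrete. It uses synthetic equicorrelated Gaussian predictors (an assumption made purely for illustration, with d = 10 and ρ = 0.7) and compares the equally weighted sum with several randomly drawn nonnegative weight vectors summing to one.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, rho = 10, 100_000, 0.7

    # Equicorrelated predictors, standardized to zero mean and unit variance.
    cov = (1 - rho) * np.eye(d) + rho * np.ones((d, d))
    x = rng.multivariate_normal(np.zeros(d), cov, size=n)

    equal = x @ (np.ones(d) / d)               # equal weights summing to 1
    for _ in range(5):
        w = rng.dirichlet(np.ones(d))          # random nonnegative weights summing to 1
        print(round(np.corrcoef(equal, x @ w)[0, 1], 3))

The printed correlations are bounded below by the common row average of the correlation matrix, here (1 + 9ρ)/10 = 0.73, and in practice sit far higher, so any of these weighted sums would order new cases almost identically to the simple average.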
2.5 An Example

As a simple illustration of how increasing model complexity leads to a decreasing rate of improvement, we fitted models to the sonar data from the University of California, Irvine (UCI) data base. This data set consists of 208 observations, 111 of which belong to the class "metal" and 97 of which belong to the class "rock." There are 60 predictor variables. The data were randomly divided into two parts, and a succession of neural networks with increasing numbers of hidden nodes was fitted to half of the data, with the other half being used as a test set. The error rates are shown in Figure 2. The left-hand point, corresponding to 0 nodes, is the baseline misclassification rate achieved by assigning everyone in the test set to the larger class. The error bars are 95% confidence intervals calculated from 100 networks in each case. Figure 3 shows a similar plot, but this time for a recursive partitioning tree classifier applied to the same data. The horizontal axis shows increasing numbers of leaf nodes. Standard methods of tree construction were used, in which a large tree is pruned back to the requisite number of nodes. In both of these figures we see the dramatic improvement arising from fitting the first nontrivial model. This far exceeds the subsequent improvement obtained in any later step.

Fig. 2. Effect on misclassification rate of increasing the number of hidden nodes in a neural network to predict the class of the sonar data.

Fig. 3. Effect on misclassification rate of increasing the number of leaves in a tree classifier to predict the class of the sonar data.

3. DESIGN SAMPLE SELECTION

Intrinsic to the classical supervised classification paradigm is the assumption that the data in the design set are randomly drawn from the same distribution as the points to be classified in the future. Sometimes slight variants of the sampling scheme are used, for example, drawing samples separately from each class, but the assumption that future points to be classified are drawn from the same distributions as the design set is always made. Unfortunately, as we illustrate in this section, there are several reasons why this assumption may not be justified. In fact, as with our suggestion that the common choice of equal misclassification costs may be more often inappropriate than appropriate, we suspect that the assumption that the design distribution is representative of the distribution from which future points will be drawn is perhaps more often incorrect than correct.

If the distribution underlying the design data and that underlying future points to be classified do differ, then elaborate optimization of the classifier using the design data may be wasted effort: the performance difference between two classifiers may be irrelevant in the context of the differences arising between the design and future distributions. In particular, we suggest, more sophisticated classifiers, which almost by definition model small idiosyncrasies of the distribution underlying the design set, will be more susceptible to wasting effort in this way: the grosser features of the distributions (modeled by simpler methods) are more likely to persist than the smaller features (modeled by the more elaborate methods).
3.1 Population Drift

A fundamental assumption of the classical paradigm is that the various distributions involved do not change over time. In fact, in many applications this is unrealistic and the population distributions are nonstationary. For example, it is unrealistic in most commercial applications concerned with human behavior: customers will change their behavior with price changes, with changes to products, with changing competition and with changing economic conditions. Hoadley [21] remarked "the test sample is supposed to represent the population to be encountered in the future. But in reality, it is usually a random sample of the current population. High performance on the test sample does not guarantee high performance on future samples, things do change" and "there is always a chance that a variable and its relationships will change in the future. After that, you still want the model to work. So don't make any variable dominant." He is cautioning against making the model fit the design distribution too well. The last point about not making any variable dominant is related to the flat maximum effect, described above.

Among the most important reasons for changes to the distribution of applicants are changes in marketing and advertising practices. Changes to the distributions that describe the customers explain why, in the credit scoring and banking industries [16, 20, 39, 42], the classification rules used to predict which applicants are likely to default on loans are updated every few months: their performance degrades, not because the rules themselves change, but because the distributions to which they are being applied change [27].

An example of this is given in Figure 4. The available data consisted of the true classes ("bad" or "good") and the values of 17 predictor variables for 92,258 customers taking out unsecured personal loans with a 24-month term given by a major UK bank during the period 1 January 1993 to 30 November 1997; 8.86% of the customers belonged to the bad class. The figure shows how the misclassification rate for a classification rule built on data just preceding the start of the displayed period changed over time. Since the coefficients of the classifier were not changing, the deterioration in performance must be due to shifts in the distributions of customers over time.

Fig. 4. Evolution of misclassification rate of a classifier built at the start of the period.

An illustration of how this "population drift" phenomenon affects different classifiers differentially is given in Figure 5. For the purposes of this illustration we used a linear discriminant analysis (LDA) as a simple classifier and a tree model as a more complicated classifier. For the design set we used customers 1, 3, 5, 7, ..., 4999. We then applied the classifiers to alternate customers, beginning with the second, up to the 60,000th customer. This meant that different customers were used for designing and testing, even during the initial period, so that there would be no overfitting in the reported results. Figure 5 shows lowess smooths of the misclassification cost [i.e., misclassification rate, with customers from each class weighted so that c0/c1 = π1/π0, where ci is the cost of misclassifying a customer from class i and πi is the prior (class size) of class i]. As can be seen from the figure, the tree classifier (the lower curve) is initially superior (has smaller loss), but after a time its superiority begins to fade.
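The bank data cannot be reproduced here, but the structure of the comparison just described is easy to sketch. The following fragment uses entirely synthetic data with an assumed form of drift, and scikit-learn classifiers stand in for the two models: both are fitted once, on data from the start of the period, and then applied to later and later samples. How quickly the more flexible model's initial advantage erodes depends, of course, on how the distributions actually change.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)

    def sample(n, t):
        # Synthetic two-class data whose class means drift as time t increases (assumed drift).
        y = rng.integers(0, 2, n)
        mean0 = np.array([0.0, 0.0]) + t * np.array([0.3, -0.2])
        mean1 = np.array([1.0, 1.0]) + t * np.array([0.6, 0.1])
        x = np.where(y[:, None] == 1, mean1, mean0) + rng.normal(size=(n, 2))
        return x, y

    x0, y0 = sample(5000, t=0.0)                       # "design set", frozen at time 0
    simple = LinearDiscriminantAnalysis().fit(x0, y0)
    flexible = DecisionTreeClassifier(max_leaf_nodes=20, random_state=0).fit(x0, y0)

    for t in (0.0, 0.5, 1.0, 1.5):                     # later and later application periods
        xt, yt = sample(20000, t)
        print(t, round(1 - simple.score(xt, yt), 3), round(1 - flexible.score(xt, yt), 3))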
Superficial examination of the figure might suggest that the effect takes a long time to become apparent, not really manifesting itself until around the 40,000th customer, but consider that, in an application such as this, the data are always retrospective. In the present case, one cannot determine the true class until the entire 24 month loan term has elapsed. [In fact, of course, this is not quite true: if a customer defaults before the end of the term, then their class (bad) is known, but otherwise their true (good or bad) class is not known until the end, so that to obtain an unbiased sample, one has to wait until the end. Survival analysis models can be constructed to allow for this, but that is leading us away from the point.] For our problem, to accumulate an unbiased sample of 5000 customers with known true outcome, one would have to wait until two years after the 5000th customer had been accepted. In terms of the horizontal axis in Figure 5, this means that the model would be built, and would be initially used at around the time that the 40,000th customer was being considered. The figure shows that this is just when the model degrades. The changes in population structure which occurred during the two years which elapsed while we waited for the true classes of the 5000 design set customers to become known have reduced any advantage that the more sophisticated tree model may have.

Fig. 5. Lowess smooths of cost-weighted misclassification rate for a tree model and LDA applied to customers 2, 4, 6, ..., 60,000.

In summary, the apparent superiority of the more sophisticated tree classifier over the very simple linear discriminant classifier is seen to fade when we take into account the fact that the classifiers must necessarily be applied in the future to distributions which are likely to have changed from those which produced the design set. Since, as demonstrated in Section 2, the simple linear classifier captures most of the separation between the classes, the additional distributional subtleties captured by the tree method become less and less relevant when the distributions drift. Only the major aspects are still likely to hold. The impact of population drift on supervised classification rules is nicely described by the American philosopher Eric Hoffer, who said, "In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists."

3.2 Sample Selectivity Bias

The previous subsection considered the impact on classification rules of distributions which changed over time. There is little point in optimizing the rule to the extent that it models aspects of the distributions and decision surface which are likely to have changed by the time the rule is applied. Similar futility applies if a selection process means that the design sample is drawn from a distribution distorted in some way from that to which the classification rule is to be applied. In fact, I suspect that this may be common. Consider, for example, a classification rule aimed at differential medical diagnosis or medical screening. The rule will have been developed on a sample of cases (including members of each class). Perhaps these cases will be drawn from a particular hospital, clinic or health district. Now all sorts of demographic, social, economic and other factors influence who seeks and is accepted for treatment, how severe the cases being treated are, how old they are and so on.
In general, it would be risky to assume that these selection criteria are the same for all hospitals, clinics or health districts. This means that the fine points of the classification rule are unlikely to hold. One might expect its coarser features to be true across different such sets of cases, but the detailed aspects will reflect particular properties of the population from which the design data were drawn. In fact, there are some subtleties here. Suppose that the classification rule follows the diagnostic paradigm [directly modeling p(c|x), the probability of class membership, c, given the descriptor vector x], rather than the sampling paradigm [which models p(c|x) indirectly from the p(x|c) using Bayes' theorem]. Then if x spans the space of all predictors of class membership and if the model form chosen for p(c|x) includes the "true" model, then sampling distortions based on x alone will not adversely influence the classifier: the classifier built in one clinic will also apply elsewhere. Of course, it would be a brave person who could confidently assert that these two conditions held. Such subtleties aside, what this means, again, is that effort spent on overrefining the classification model is probably wasted effort and, in particular, that fine differences between different classification rules should not be regarded as carrying much weight.

This problem of sample selection and how it might be tackled has been the subject of intensive research, especially by the medical statistics and econometrics communities, but appears not to have been of great concern to researchers on classification methods. Having said that, one area that involves sample selectivity in classification problems which has attracted research interest arises in the retail financial services industry, as in the previous section. Here, as in that section, the aim is to predict, for example, on the basis of application and other background variables, whether or not an applicant is likely to be a good customer. Those expected to be good are accepted, and those expected to be bad are rejected. For those that have been accepted, we subsequently discover their true good or bad class. For the rejected applicants, however, we never know whether they are good or bad. The consequence is that the resulting sample is distorted as a sample from the population of applicants, which is our real interest for the future. Measuring the performance or attempting to build an improved classification rule using those individuals for which we do know the true class (which is needed for supervised classification) has the potential to be highly misleading for the overall applicant population. In particular, it means that using highly sophisticated methods to squeeze subtle information from the design data is pointless. This problem is so ubiquitous in the personal financial services sector that it has been given its own name—reject inference [17].

4. PROBLEM UNCERTAINTY

Section 3 looked at mismatches between the distributions modeled by the classification rule and the distributions to which it was applied. This is an obvious way in which things may go awry, but there are many others, perhaps not so obvious. This section illustrates just three.

4.1 Errors in Class Labels

The classical supervised classification paradigm is based on the assumption that there are no errors in the true class labels. If one expects errors in the class labels, then one can attempt to build models which explicitly allow for this, and there has been work to develop such models. Difficulties arise, however, when one does not expect such errors, but they nevertheless occur. Suppose that, with two classes, the true posterior class probabilities are p(1|x) and p(2|x), and that a (small) proportion δ of each class is incorrectly believed to come from the other class at each x. Denoting the apparent posterior probability of class 1 by p∗(1|x), we have p∗(1|x) = (1 − δ)p(1|x) + δp(2|x). It follows that if we let r(x) = p(1|x)/p(2|x) denote the true odds and let r∗(x) = p∗(1|x)/p∗(2|x) denote the apparent odds, then

(4.1)  r^*(x) = \frac{r(x) + \varepsilon}{\varepsilon r(x) + 1}

with ε = δ/(1 − δ). With small ε, (4.1) is monotonic increasing in r(x), so that contours of r(x) map to corresponding contours of r∗(x). In particular, if the true optimal decision surface is r(x) = k (k is determined by the relative misclassification costs), then the optimal decision surface when errors are present is given by r∗(x) = k∗, with k∗ = (k + ε)/(εk + 1). Unfortunately, if the occurrence of mislabeling is unsuspected, then r∗(x) will be compared with k rather than k∗. In the case of equal misclassification costs, so that k = 1, we have k∗ = k = 1, so that no problems arise from the misclassification. (Indeed, advantages can even arise: see [9].) However, what happens if k ≠ 1? It is easy to show that r∗(x) > r(x) whenever r(x) < 1 and that r∗(x) < r(x) whenever r(x) > 1. That is, the effect of the errors in class labels is to shrink the posterior class odds toward 1, so that comparing r∗(x) with k rather than k∗ is likely to lead to worse performance. There is also a secondary issue, that the shrinkage of r(x) will make it less easy to estimate the decision surface accurately because it is a flatter surface: the variance of the estimated decision surface, from sample to sample, will be greater when there is mislabeling of classes. In such circumstances it is better to stick to simpler models, since the higher order terms of the more complicated models will be very inaccurately estimated.
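As a purely numerical illustration of (4.1), with arbitrarily chosen values δ = 0.1 (so ε = 1/9) and a cost-determined threshold k = 3:

    def apparent_odds(r, delta):
        # Apparent odds r*(x) under a label-error rate delta, equation (4.1).
        eps = delta / (1 - delta)
        return (r + eps) / (eps * r + 1)

    delta, k = 0.1, 3.0                       # illustrative values only
    print(apparent_odds(k, delta))            # k* is about 2.33, the threshold that should be used
    print(apparent_odds(5.0, delta))          # true odds of 5 appear as only about 3.3
    print(apparent_odds(4.0, delta))          # true odds of 4 appear as about 2.8, below k = 3

A point with true odds of 4 lies on the class 1 side of the true boundary, yet its apparent odds fall below k, so comparing r∗(x) with k rather than k∗ reverses the decision for such points.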
4.2 Arbitrariness in the Class Definition

The classical supervised classification paradigm also takes as fundamental the fact that the classes are well defined. That is, that there is some fixed clear external criterion which is used to produce the class labels. In many situations, however, this is not the case. In particular, when the classes are defined by thresholding a continuous variable, then there is always the possibility that the defining threshold might be changed. Once again, this situation arises in consumer credit, where it is common to define a customer as "defaulting" if they fall three months in arrears with repayments. This definition, however, is not a qualitative one (contrast has a tumor/does not have a tumor) but is very much a quantitative one. It is entirely reasonable that alternative definitions (e.g., four months in arrears) might be more useful if economic conditions were to change. This is a simple example, but in many situations much more complex class definitions based on logical combinations of numerical attributes, split at fairly arbitrary thresholds, are used. For example, student grades are often based on levels of performance in continuous assessment and examinations. In detecting vertebral deformities in studies of osteoporosis, the ranges of the anterior, posterior and mid heights of the vertebra, as well as functions of these, such as ratios, are combined in quite complicated Boolean conditions to provide the definition (e.g., [10]). Definitions formed in this sort of way are particularly common in situations that involve customer management. For example, Lewis [31] defined a good account in a revolving credit operation (such as a credit card) as someone whose billing account shows (a) on the books for a minimum of 10 months, (b) activity in 6 of the most recent 10 months, (c) purchases of more than $50 in at least 3 of the past 24 months and (d) not more than once 30 days delinquent in the past 24 months. A bad account is defined as (a) delinquent for 90 days or more at any time with an outstanding undisputed balance of $50 or more, (b) delinquent three times for 60 days in the past 12 months with an outstanding undisputed balance on each occasion of $50 or more or (c) bankrupt while the account was open. Li and Hand [32] gave an even more complicated example from retail banking.
Our concern with these complicated definitions is that they are fairly arbitrary: the thresholds used to partition the various continua are not natural thresholds, but are imposed by humans. It is entirely possible that, retrospectively, one might decide that other thresholds would have been better. Ideally, under such circumstances, one would go back to the design data, redefine the classes and recompute the classification rule. However, this requires that the raw data have been retained at the level of the underlying continua used in the definitions. This is often not the case. The term concept drift is sometimes used to describe changes to the definitions of the classes. See, for example, the special issue of Machine Learning (1998, Vol. 32, No. 2), Widmer and Kubat [46] and Lane and Brodley [30]. The problem of changing class definitions has been examined in [25, 26] and [28].
If the very definitions of the classes may change between designing the classification rule and applying it, then clearly there is little point in developing an overrefined model for the class definition which is no longer appropriate. Such models fail to take into account all sources of uncertainty in the problem. Of course, this does not necessarily imply that simple models will yield better classification results: this will depend on the nature of the difference between the design and application class definitions. However, there are similarities to the overfitting issue. Overfitting arises when a complicated model faithfully reflects aspects of the design data to the extent that idiosyncrasies of that data, rather than merely of the distribution from which the data arose, are included in the model. Then simple models, which fit the design data less well, lead to superior classification. Likewise, in the present context, a model optimized on the design data class definition is reflecting idiosyncrasies of the design data which may not occur in application data, not because of random variation, but because of the different definitions of the classes. Thus it is possible that models which fit the design data less well will do better in future classification tasks.
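A tiny synthetic sketch of the thresholding point (the arrears distribution and both cut-offs are invented purely for illustration): moving the threshold that defines "default" relabels a nontrivial fraction of the design set, so a rule finely tuned to one definition is tuned to a target that may not be the one eventually of interest.

    import numpy as np

    rng = np.random.default_rng(2)
    months_in_arrears = rng.poisson(1.2, size=100_000)   # hypothetical underlying continuum

    bad_3 = months_in_arrears >= 3    # class defined by "three months in arrears"
    bad_4 = months_in_arrears >= 4    # alternative definition
    print(bad_3.mean(), bad_4.mean(), (bad_3 != bad_4).mean())   # class rates and relabeled fraction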
The possibility of arbitrariness in the class definition discussed in this section is quite distinct from the possibility of class priors or relative misclassification costs being changed—referred to in the quote from Provost and Fawcett [36] above—but the possibility of these changes, also, casts doubt on the wisdom of modeling the problem too precisely, that is, of using models which are too sophisticated.
4.3 Optimization Criteria and Performance Assessment
When fitting a model to a design set, one optimizes some criterion of goodness of fit (perhaps modified by a penalization term to avoid overfitting) or of classification performance. Many such measures are in use, including likelihood, misclassification rate, cost-weighted misclassification rate, Brier score, log score and area under the ROC curve. Unfortunately, it is not difficult to contrive data sets for which different optimization criteria lead to (e.g.) linear decision surfaces with very different orientations (even to the extent of being orthogonal). Benton [[2], Chap. 4] illustrated this for several real data sets. Clearly, then, it is important to specify the criterion to be used when building a classification rule. If the use to which the model will be put is well specified to the extent that a measure of performance can be precisely defined, then this measure should determine the criterion of goodness of fit. All too often, however, there is a mismatch between the criterion used to choose the model, the criterion used to evaluate its performance, and the criterion which actually matters in real application. For example, a common approach might be to use likelihood to estimate a model's parameters, use misclassification rate to assess its performance and use some cost-weighted misclassification rate in practice (e.g., some combination of specificity and sensitivity). In circumstances such as these, it would clearly be pointless to refine the model to a high degree of accuracy from a likelihood perspective, when this may be only weakly related to the real performance objective.
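The mismatch is easy to exhibit. The sketch below uses synthetic data from scikit-learn and an arbitrarily assumed 5:1 cost ratio; it fits a logistic model by (penalized) likelihood and then scores it in three different ways. The three numbers answer different questions, and tuning a model to improve one of them need not improve the others.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, log_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=4000, n_features=10, weights=[0.8, 0.2],
                               random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)   # fitted by penalized likelihood
    p = model.predict_proba(Xte)[:, 1]
    pred = (p > 0.5).astype(int)

    print("log score (log-loss):", log_loss(yte, p))
    print("misclassification rate:", 1 - accuracy_score(yte, pred))
    cost = np.where(yte == 1, 5.0, 1.0)        # assumed: missing a class 1 case costs 5 times more
    print("cost-weighted misclassification rate:", np.average(pred != yte, weights=cost))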
Having said that, one must acknowledge that often precise details of how performance is to be measured in the future cannot be given. For example, in most applications it is difficult to give more than general statements about the relative costs of different kinds of misclassifications. In such cases it might
be worthwhile to choose a criterion that is equivalent to averaging over a range of possible costs: likelihood, the area under a receiver operating characteristic curve and the weighted version of the latter described in [1] can all be regarded as attempts to do that.
5. INTERPRETING EMPIRICAL COMPARISONS
There have been a great many empirical comparisons of the performance of different kinds of classification rules. Some of these are in the context of a new method having been developed and the effort to gain some understanding of how it performs relative to existing methods. Other comparisons are purely comparative studies, seeking to make disinterested comparative statements about the relative merits of different methods. At first glance, such comparative studies are useful in shedding light on the different methods, on which generally yield superior performance or on which are to be preferred for particular kinds of data or in particular domains. However, on closer examination, such comparisons have major weaknesses and can even be seriously misleading. Various authors have drawn attention to these problems, including Duin [4], Salzberg [40], Hand [13], Hoadley [21] and Efron [5], so we will only briefly mention some of the main points here; in particular, only those points relative to classification accuracy, rather than other aspects of performance. Jamain and Hand [24] also gave a more detailed review of comparative studies of classification rules.
Different categories of users might be expected to obtain different rankings of classification methods in comparative studies. For example, we can contrast an expert user, who will be able to fine-tune methods, with an inexperienced user, perhaps someone who has simply pulled some standard public-domain software from the web. It would probably be surprising if their rankings did not differ. Moreover, experts will tend to have particular expertise with particular classes of method. Someone expert in neural networks may well achieve superior results with those methods than with support vector machines and vice versa. Taken to an extreme, of course, many comparative studies are made to establish the performance and properties of newly invented methods—by their inventors. One might expect substantial bias in favor of the new methods,
compared to what others might be able to achieve, in such studies. Duin [4] pointed out the difficulty of comparing, "in a fair and objective way," classifiers which require substantial input of expertise (so that domain knowledge can be taken advantage of) and classifiers which can be applied automatically with little external input of expertise. The two extremes (of what is really a continuum, of course) are appropriate in different circumstances.
The principle of comparing methods by applying them to a collection of disparate real data sets is useful, but has its weaknesses. An obvious one is that different studies use different collections of data sets, so making comparisons difficult. Furthermore, the collection will not be representative of real data sets in any formal sense. Moreover, a potential user is not really interested in some "average performance" over distinct types of data, but really wants to know what will be good for his or her problem, and different people have different problems, with data arising from different domains. A given method may be very poor on most kinds of data, but very good for certain problems.
The widespread use of standard collections of data sets (such as the UCI repository [35]) has clear merits: new methods can be compared with earlier ones on a level playing field. However, this also means that there will be some overfitting both to the individual data sets in the collection and to the collection as a whole. That is, some methods will do well on data sets in the collection purely by chance. Indeed, the more successful the collection is in the sense that more and more people use it for comparative assessments, the more serious this problem will become.
Jamain and Hand [24] pointed out the difficulty of saying exactly what a classification "method" is. Is a neural network with a single hidden node to be regarded as from the same family as one with an arbitrary number of hidden nodes? It is clearly not exactly the same method. Comparative evaluations using the two models may well yield very different classification results. It is this sort of phenomenon which explains why the comparative performance literature contains many different results for "the same" methods applied to given public data sets. Can one then draw general conclusions about the effectiveness of the method of neural networks? Furthermore, to what extent is preprocessing the data to be regarded as part of the method? Linear discriminant analysis on raw data may yield very different results from the same model applied to data
which has been processed to remove skewness. Is, then, linear discriminant analysis good or bad on these data? Likewise, is a data set in which missing values have been replaced by imputed values the same as a data set in which incomplete records have been dropped? Applying the same method to the two variants of the data is likely to yield different results.
We have already commented that the "accuracy" of a classification rule can be measured in a wide variety of ways, and that different measures are likely to yield different performance rankings of classifiers.
Given all of the above points, it is not surprising that different authors have drawn different conclusions about the relative accuracy of different classifiers. Other commentators have taken things even further. In the discussion that accompanies [3], Efron suggested that new methods always look better than older ones and that complicated methods are harder to criticize than simpler ones. He also noted that it is difficult to make fair comparisons by making the same effort in applying different methods—a point made above. Hoadley, in the same discussion, "coined a phrase called the 'ping-pong theorem.' This theorem says that if we revealed to Professor Breiman the performance of our best model and gave him our data, then he could develop an algorithmic model using random forests, which would outperform our model. But if he revealed to us the performance of his model, then we could develop a segmented scorecard, which would outperform his model."
With so many difficulties in ranking and comparing classifiers, one might naturally have reservations about small differences in performance—of the kind generally asserted for the more complicated and sophisticated methods over the older and simpler models.
6. CONCLUSION
In Section 2 we demonstrated that, when building predictive models of increasing complexity, the marginal gain from complicated models is typically small compared to the predictive power of the simple models. In many cases, the simple models accounted for over 90% of the predictive power that could be achieved by "the best" model we could find. Now, in the idealized classical supervised classification paradigm, certain assumptions are implicit: it is assumed that the distributions from which the design points and the new points are drawn are the same, that the classes are well defined and the definitions will not change, that the costs of different kinds of misclassification are known accurately, and so on. In real applications, however, these additional assumptions will often not hold. This means that apparent small (laboratory) gains in performance might not be realized in practice—they may well be swamped by uncertainties arising from mismatches between the apparent problem and the real problem. In particular, many of the comparative studies in the literature are based on brief descriptions of data sets, containing no background information at all on such possible additional sources of variation due to breakdown of implicit assumptions of the kind illustrated above. This must cast doubt on the validity of their conclusions. In general, it means that deeper critical assessment of the context of the problem and data should be made if useful practical conclusions are to be drawn. If enough is known about likely additional sources of variability, beyond the classical sources of sampling variability and model uncertainty, then more sophisticated models can be built. However, if insufficient information is known about these additional sources, which we speculate will very often be the case, then the principle of parsimony suggests that it is better to stick to simple models.
We should note, parenthetically, that there are also other reasons to favor simple models. Interpretability, in particular, is often an important requirement of a classification rule. Indeed, sometimes it is even a legal requirement (e.g., in credit scoring). This leads us to the observation that what one regards as "simple" may vary from user to user: some might favor weighted sums of predictor values, others might prefer (small) tree structures and yet others might regard nearest neighbor methods as being simple.
Perhaps it is appropriate to conclude with the comment that, by arguing that simple models are often more appropriate than complex ones and that the claims of superior performance of the more complex models may be misleading, I am not suggesting that no major advances in classification methods will ever be made. Such a claim would be absurd in the face of developments such as the bootstrap and other resampling approaches, which have led to significant advances in classification and other statistical models. All I am saying is that much of the purported advance may well be illusory. Furthermore, although (almost by definition) one cannot
predict where the next step-change will come from, one might venture a guess as to its general area. Resampling methods are children of the computer revolution, as indeed are most other recent developments in classifier technology [e.g., classification trees, neural networks, support vector machines, random forests, multivariate adaptive regression splines (MARS) and practical Bayesian methods]. Since progress in computer hardware is continuing, one might reasonably expect that the advances will arise from more powerful data storage and processing ability.
ACKNOWLEDGMENTS
I have given several presentations based on the ideas in this paper. An earlier version of this paper was presented at the 2004 Conference of the International Federation of Classification Societies in Chicago and appeared in the proceedings [18]. I would like to thank all those who commented on the material. In particular, I am grateful to Svante Wolde, Willi Sauerbrei, Foster Provost, Jerome Friedman and Leo Breiman. Of course, merely because they had valuable and interesting things to say about the ideas does not necessarily mean they agree with them.
REFERENCES
[1] Adams, N. M. and Hand, D. J. (1999). Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition 32 1139–1147.
[2] Benton, T. C. (2002). Theoretical and empirical models. Ph.D. dissertation, Dept. Mathematics, Imperial College London.
[3] Breiman, L. (2001). Statistical modeling: The two cultures (with discussion). Statist. Sci. 16 199–231. MR1874152
[4] Duin, R. P. W. (1996). A note on comparing classifiers. Pattern Recognition Letters 17 529–536.
[5] Efron, B. (2001). Comment on “Statistical modeling: The two cultures,” by L. Breiman. Statist. Sci. 16 218–219. MR1874152
[6] Fawcett, T. and Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery 1 291–316.
[7] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 179–188.
[8] Friedman, C. P. and Wyatt, J. C. (1997). Evaluation Methods in Medical Informatics. Springer, New York.
[9] Friedman, J. H. (1997). On bias, variance, 0/1 loss, and the curse of dimensionality. Data Mining and Knowledge Discovery 1 55–77.
[10] Gallagher, J. C., Hedlund, L. R., Stoner, S. and Meeger, C. (1988). Vertebral morphometry: Normative data. Bone and Mineral 4 189–196.
[11] Hand, D. J. (1981). Discrimination and Classification. Wiley, Chichester. MR0634676
[12] Hand, D. J. (1996). Classification and computers: Shifting the focus. In COMPSTAT-96: Proceedings in Computational Statistics (A. Prat, ed.) 77–88. Physica, Berlin.
[13] Hand, D. J. (1997). Construction and Assessment of Classification Rules. Wiley, Chichester.
[14] Hand, D. J. (1998). Strategy, methods, and solving the right problems. Comput. Statist. 13 5–14.
[15] Hand, D. J. (1999). Intelligent data analysis and deep understanding. In Causal Models and Intelligent Data Management (A. Gammerman, ed.) 67–80. Springer, Berlin. MR1722705
[16] Hand, D. J. (2001). Modelling consumer credit risk. IMA J. Management Mathematics 12 139–155.
[17] Hand, D. J. (2001). Reject inference in credit operations. In Handbook of Credit Scoring (E. Mays, ed.) 225–240. Glenlake, Chicago.
[18] Hand, D. J. (2004). Academic obsessions and classification realities: Ignoring practicalities in supervised classification. In Classification, Clustering and Data Mining Applications (D. Banks, L. House, F. R. McMorris, P. Arabie and W. Gaul, eds.) 209–232. Springer, Berlin. MR2113611
[19] Hand, D. J. (2005). Supervised classification and tunnel vision. Applied Stochastic Models in Business and Industry 21 97–109. MR2137544
[20] Hand, D. J. and Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: A review. J. Roy. Statist. Soc. Ser. A 160 523–541.
[21] Hoadley, B. (2001). Comment on “Statistical modeling: The two cultures,” by L. Breiman. Statist. Sci. 16 220–224. MR1874152
[22] Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11 63–90.
[23] Jamain, A. (2004). Meta-analysis of classification methods. Ph.D. dissertation, Dept. Mathematics, Imperial College London.
[24] Jamain, A. and Hand, D. J. (2005). Mining supervised classification performance studies: A meta-analytic investigation. Technical report, Dept. Mathematics, Imperial College London.
[25] Kelly, M. G. and Hand, D. J. (1999). Credit scoring with uncertain class definitions. IMA J. Mathematics Management 10 331–345.
[26] Kelly, M. G., Hand, D. J. and Adams, N. M. (1998). Defining the goals to optimise data mining performance. In Proc. Fourth International Conference on Knowledge Discovery and Data Mining (R. Agrawal, P. Stolorz and G. Piatetsky-Shapiro, eds.) 234–238. AAAI Press, Menlo Park, CA.
[27] Kelly, M. G., Hand, D. J. and Adams, N. M. (1999). The impact of changing populations on classifier performance. In Proc. Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (S. Chaudhuri and D. Madigan, eds.) 367–371. ACM, New York.
[28] Kelly, M. G., Hand, D. J. and Adams, N. M. (1999). Supervised classification problems: How to be both judge and jury. In Advances in Intelligent Data Analysis. Lecture Notes in Comput. Sci. 1642 235–244. Springer, Berlin. MR1723394
[29] Klinkenberg, R. and Joachims, T. (2000). Detecting concept drift with support vector machines. In Proc. 17th International Conference on Machine Learning (P. Langley, ed.) 487–494. Morgan Kaufmann, San Francisco.
[30] Lane, T. and Brodley, C. E. (1998). Approaches to online learning and concept drift for user identification in computer security. In Proc. Fourth International Conference on Knowledge Discovery and Data Mining (R. Agrawal, P. Stolorz and G. Piatetsky-Shapiro, eds.) 259–263. AAAI Press, Menlo Park, CA.
[31] Lewis, E. M. (1990). An Introduction to Credit Scoring. Athena, San Rafael, CA.
[32] Li, H. G. and Hand, D. J. (2002). Direct versus indirect credit scoring classifications. J. Operational Research Society 53 647–654.
[33] McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York. MR1190469
[34] Mingers, J. (1989). An empirical comparison of pruning methods for decision tree induction. Machine Learning 4 227–243.
[35] Newman, D. J., Hettich, S., Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. Dept. Information and Computer Sciences, Univ. California, Irvine. Available at www.ics.uci.edu/~mlearn/MLRepository.html.
[36] Provost, F. and Fawcett, T. (2001). Robust classification for imprecise environments. Machine Learning 42 203–231.
[37] Rendell, A. L. and Seshu, R. (1990). Learning hard concepts through constructive induction: Framework and rationale. Computational Intelligence 6 247–270.
[38] Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press. MR1438788
[39] Rosenberg, E. and Gleit, A. (1994). Quantitative methods in credit management: A survey. Oper. Res. 42 589–613.
[40] Salzberg, S. L. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery 1 317–328.
[41] Shavlik, J., Mooney, R. J. and Towell, G. (1991). Symbolic and neural learning algorithms: An experimental comparison. Machine Learning 6 111–143.
[42] Thomas, L. C. (2000). A survey of credit and behavioural scoring: Forecasting financial risk of lending to consumers. International J. Forecasting 16 149–172.
[43] von Winterfeldt, D. and Edwards, W. (1982). Costs and payoffs in perceptual research. Psychological Bulletin 91 609–622.
[44] Webb, A. R. (2002). Statistical Pattern Recognition, 2nd ed. Wiley, Chichester. MR2191640
[45] Weiss, S. M., Galen, R. S. and Tadepalli, P. V. (1990). Maximizing the predictive value of production rules. Artificial Intelligence 45 47–71.
[46] Widmer, G. and Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning 23 69–101.
[47] Zahavi, J. and Levin, N. (1997). Issues and problems in applying neural computing to target marketing. J. Direct Marketing 11(4) 63–75.