Machine Learning, 15, 201-221 (1994)
© 1994 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Improving Generalization with Active Learning*
DAVID COHN (COHN@PSYCHE.MIT.EDU)
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139
LES ATLAS
Department of Electrical Engineering, University of Washington, Seattle, WA 98195
RICHARD LADNER
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195
Editor: Alex Waibel
Abstract. Active learning differs from “learning from examples” in that the learning algorithm assumes at least some control over what part of the input domain it receives information about. In some situations, active learning is provably more powerful than learning from examples alone, giving better generalization for a fixed number of training examples.
In this article, we consider the problem of learning a binary concept in the absence of noise. We describe a formalism for active concept learning called selective sampling and show how it may be approximately implemented by a neural network. In selective sampling, a learner receives distribution information from the environment and queries an oracle on parts of the domain it considers “useful.” We test our implementation, called an SG-network, on three domains and observe significant improvement in generalization.
Keywords: queries, active learning, generalization, version space, neural networks
1. Introduction: Random sampling vs. active learning
Most neural network generalization problems are studied only with respect to random sampling: the training examples are chosen at random, and the network is simply a passive learner. This approach is generally referred to as “learning from examples.” Baum and Haussler (1989) examine the problem analytically for neural networks; Cohn and Tesauro (1992) provide an empirical study of neural network generalization when learning from examples. There have also been a number of empirical efforts, such as those of Le Cun et al. (1990), aimed at improving neural network generalization when learning from examples.
Learning from examples is not, however, a universally applicable paradigm. Many natural learning systems are not simply passive, but instead make use of at least some form of active learning to examine the problem domain. By active learning, we mean any form of learning in which the learning program has some control over the inputs on which it trains. In natural systems (such as humans), this phenomenon is exhibited at both high levels (e.g., active examination of objects) and low, subconscious levels (e.g., Fernald and Kuhl’s (1987) work on infant reactions to “Motherese” speech).
*A preliminary version of this article appears as Cohn et al. (1990).
Within the broad definition of active learning, we will restrict our attention to the simple and intuitive form of concept learning via membership queries. In a membership query, the learner queries a point in the input domain and an oracle returns the classification of that point. Much work in formal learning theory has been directed to the study of queries (see, e.g., Angluin, 1986; Valiant, 1984), but only very recently have queries been examined with respect to their role in improving generalization behavior.
In many formal problems, active learning is provably more powerful than passive learning from randomly given examples. A simple example is that of locating a boundary on the unit line interval. In order to achieve an expected position error of less than ε, one would need to draw O((1/ε) ln(1/ε)) random training examples. If one is allowed to make membership queries sequentially, then binary search is possible and, assuming a uniform distribution, a position error of ε may be reached with O(ln(1/ε)) queries.
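To make the contrast concrete, here is a minimal sketch of both strategies for the boundary-location problem. The two learner functions and their names are our own illustrative choices, not anything from the original; the comparison with `target` inside each loop stands in for a membership query to the oracle.

```python
import random

def random_sample_learner(target, n):
    # Passive learning: n random labeled examples; estimate the boundary
    # as the midpoint of the tightest bracketing pair seen so far.
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = random.random()
        if x < target:               # oracle labels x "left of boundary"
            lo = max(lo, x)
        else:                        # oracle labels x "right of boundary"
            hi = min(hi, x)
    return (lo + hi) / 2

def binary_search_learner(target, n):
    # Active learning: n sequential membership queries halve the
    # uncertain interval each time, so the error shrinks like 2^-n.
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2
        if mid < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

target = 0.3721
print(abs(random_sample_learner(target, 20) - target))   # roughly O(1/n)
print(abs(binary_search_learner(target, 20) - target))   # roughly 2^-20
```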
One can imagine any number of algorithms for employing membership queries to do active learning. We have been studying the problem of learning binary concepts in an error-free environment. For such problems, a learner may proceed by examining the information already given and determining a region of uncertainty, an area in the domain where it believes misclassification is still possible. The learner then asks for examples exclusively from that region. This article discusses a formalization of this simple approach, which we call selective sampling.
In section 2, we describe the concept-learning problem in detail and give a formal defini- tion of selective sampling, describing the conditions necessary for the approach to be useful. In section 3 we describe the SG-network, a neural network implementation of this tech- nique inspired by version-space search (Mitchell, 1982). Section 4 contains the results of testing this implementation on several different problem domains, and section 5 discusses some of the limitations of the selective sampling approach. Sections 6 and 7 contain reference to related work in the field and a concluding discussion.
2. Concept learning and selective sampling
Given an arbitrary domain X, we define a concept c to be some subset of points in the domain. For example, X might be a two-dimensional space, and c might be the set of all points lying inside a fixed rectangle in the plane. We classify a point x ∈ X by its membership in concept c: we write c(x) = 1 if x ∈ c, and c(x) = 0 otherwise. A popular use of artificial neural networks is as concept classifiers: x is presented as the input to an appropriately trained network, which then activates a designated output node above some threshold if and only if x ∈ c, that is, if x is an instance of concept c. Formally, a concept class C is a set of concepts, usually described by some description language. In the above example, our class C may be the set of all two-dimensional, axis-parallel rectangles (see figure 1). In the case of neural networks, the concept class is usually the set of all concepts that the network may be trained to classify.
Figure 1. A concept class defined as the set of all axis-parallel rectangles in two dimensions. Several positive and negative examples are depicted, as are several consistent concepts in the class.
2.1. Generalization
For target concept t, a training example is a pair (x, t(x)) consisting of a point x (usually drawn from some distribution 𝒫), and the point’s classification t(x). If x ∈ t, then t(x) = 1, and we say that (x, t(x)) is a positive example. Otherwise, t(x) = 0, and (x, t(x)) is a negative example. A concept c is consistent with an example (x, t(x)) if c(x) = t(x), that is, if the concept produces the same classification of point x as the target. The error of c, with respect to t and distribution 𝒫, is the probability that c and t will disagree on a random example drawn from 𝒫. We write this as

e(c, t, 𝒫) = Pr[c(x) ≠ t(x)], for x drawn randomly according to 𝒫.
The generalization problem is posed as follows: for a given concept class C, an unknown target t, an arbitrary error rate ε, and confidence δ, how many examples do we have to draw and classify from an arbitrary distribution 𝒫 in order to find a concept c ∈ C consistent with the examples such that e(c, t, 𝒫) ≤ ε with confidence at least 1 − δ? This problem was formalized by Valiant (1984) and has been studied for neural networks by Baum and Haussler (1989) and by Haussler (1992).
2.2. The region of uncertainty
If we consider a concept class C and a set S^m of m examples, the classification of some regions of the domain may be implicitly determined (figure 2): all concepts in C that are consistent with all the examples must agree in these parts. What we are interested in here are the areas that are not determined by available information--what we define to be the region of uncertainty:

ℛ(S^m) = {x : ∃c₁, c₂ ∈ C such that c₁ and c₂ are consistent with all s ∈ S^m, and c₁(x) ≠ c₂(x)}.
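For the axis-parallel rectangle class, this definition can be tested exactly. Assuming noise-free, realizable data, a point is uncertain if and only if it lies outside the bounding box of the positive examples (so the most specific consistent rectangle labels it negative) and the box grown to include it still excludes every negative example (so some consistent rectangle labels it positive). A sketch, with helper names of our own choosing:

```python
def bbox(points):
    # Smallest axis-parallel rectangle containing the given points.
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def contains(rect, p):
    x0, y0, x1, y1 = rect
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

def in_region_of_uncertainty(x, positives, negatives):
    # Assumes noise-free, realizable examples of an axis-parallel rectangle.
    s = bbox(positives)              # most specific consistent rectangle
    if contains(s, x):
        return False                 # every consistent rectangle labels x positive
    # s itself labels x negative; is there a consistent rectangle labeling it
    # positive? The smallest candidate is the box grown to include x.
    grown = bbox(positives + [x])
    return not any(contains(grown, n) for n in negatives)

positives = [(0.4, 0.4), (0.6, 0.5)]
negatives = [(0.1, 0.9), (0.9, 0.1)]
print(in_region_of_uncertainty((0.5, 0.7), positives, negatives))  # True
```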
Figure 2. The region of uncertainty, ℛ(S^m), is the set of all points x in the domain such that there are two concepts that are consistent with all training examples in S^m and yet disagree on the classification of x. Here, ℛ(S^m) is shaded.
For an arbitrary distribution 𝒫, we can define the size of this region as α = Pr[x ∈ ℛ(S^m)]. Ideally, in an incremental learning procedure, as we classify and train on more examples, α will be monotonically non-increasing. A point that falls outside ℛ(S^m) will leave it unchanged; a point inside will further restrict the region. Thus, α is the probability that a new, random point from 𝒫 will reduce our uncertainty.
As such, ℛ(S^m) serves as an envelope for consistent concepts; any disagreement between those concepts must lie within ℛ(S^m). Because of this, ℛ(S^m) also bounds the potential error of any consistent hypothesis we choose. If the error of our current hypothesis is e, then e ≤ α. Since we have no basis for changing our current hypothesis without a contradicting point, α is also a bound on the probability of an additional point reducing our error.
2.3. Selective sampling is active learning
Let us consider learning as a sequential process, drawing examples one after another, and determine how much information each successive example gives us. If we draw at random over the whole domain, then the probability that an individual sample will reduce our error is α, as defined above, which decreases to zero as we draw more and more examples. This means that the efficiency of the learning process also approaches zero; eventually, most examples we draw will provide us with no information about the concept we are trying to learn.
Now consider what happens if we recalculate ℛ(S^m), the region of uncertainty, after each new example, and draw examples only from within ℛ(S^m). Then each example will reduce ℛ(S^m), and will reduce our uncertainty, with no decrease in efficiency as we draw more and more examples. We call this process selective sampling.
If the distribution 𝒫 is known (e.g., 𝒫 is uniform), then we perform selective sampling directly by randomly querying points (according to 𝒫) that lie strictly inside ℛ(S^m). Frequently, however, the sample distribution, as well as the target concept, is unknown. In this case, we cannot choose points from our domain with impunity or we risk assuming
a distribution that differs greatly from the actual underlying 𝒫. In many problems, though, we can still make use of distribution information without having to pay the full cost of drawing and classifying an example. Rather than assuming that the drawing of a classified example is an atomic operation (see Valiant, 1984; Blumer et al., 1988), we may divide the operation into two steps: first, drawing an unclassified example from the distribution, and second, querying the classification of that point. If the cost of drawing a point from our distribution is small compared to the cost of finding the point's proper classification, we can "filter" points drawn from our distribution, drawing at random, but only selecting, classifying, and training on those that fall in ℛ(S^m). This approach is well suited to problems such as speech recognition, where unlabeled speech data are plentiful, but the classifying (labeling) of speech segments is a laborious process.
Since calculating ℛ(S^m) can be computationally expensive, we may want to perform selective sampling in batches. On the first pass, we draw an initial batch of training examples S_1^m from 𝒫, train on it, and determine the initial ℛ(S_1^m). We then define a new distribution 𝒫′ to sample from that is zero outside ℛ(S_1^m), but that maintains the relative distribution of 𝒫 inside ℛ(S_1^m). We can then make a second pass, drawing a second batch of training examples S_2^m, adding it to the first, and determining a new, smaller ℛ(S_1^m ∪ S_2^m). The smaller the batch size is, and the more passes are made, the more efficiently the algorithm will draw training examples (see figure 3). However, since ℛ(S^m) is recalculated on each pass, this advantage must be weighed against the added computational cost incurred in this calculation.
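In outline, the batch form of selective sampling is a simple filter loop. The function names below are placeholders for whatever learner and region test an implementation provides; this sketch assumes a membership test for ℛ*(S^m) is available.

```python
def selective_sampling(draw_unlabeled, oracle, fit, in_uncertain_region,
                       batch_size=10, num_passes=15):
    # draw_unlabeled()                -- cheap draw of an unlabeled x from P
    # oracle(x)                       -- expensive query returning t(x) in {0, 1}
    # fit(examples)                   -- trains a learner on (x, t(x)) pairs
    # in_uncertain_region(learner, x) -- membership test for R*(S^m)
    examples = [(x, oracle(x))
                for x in (draw_unlabeled() for _ in range(batch_size))]
    learner = fit(examples)          # first pass: a plain random batch
    for _ in range(num_passes - 1):
        batch = []
        while len(batch) < batch_size:
            x = draw_unlabeled()
            if in_uncertain_region(learner, x):  # filter: only query inside R*
                batch.append((x, oracle(x)))
        examples += batch
        learner = fit(examples)      # retrain, implicitly shrinking R*
    return learner
```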
2.4. Approximations to selective sampling
Even for simple concept classes, such as the set of all axis-parallel rectangles in two dimensions, it may be difficult or computationally expensive to represent the region of uncertainty exactly. For the class of rectangles, negative examples that lie along the corners of the region can add complexity by causing "nicks" in the outer corners of ℛ(S^m) (as in figure 2). With more realistic, complicated classes, representing ℛ(S^m) exactly can easily become a difficult, if not impossible, task. Using a good approximation of ℛ(S^m) may, however, be sufficient to allow selective sampling. Practical implementations of selective sampling are possible with a number of approximations to the process, including maintaining a close superset or subset of ℛ(S^m).
Assume we are able to maintain a superset ℛ⁺(S^m) ⊇ ℛ(S^m). Since any point in ℛ(S^m) will also be in the superset, we can selectively sample inside ℛ⁺(S^m) and be assured that we will not exclude any part of the domain of interest. The penalty we pay is that of efficiency: we may also train on some points that are not of interest. The efficiency of this approach, as compared with pure selective sampling, can be measured as the ratio Pr[x ∈ ℛ(S^m)] / Pr[x ∈ ℛ⁺(S^m)].
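This ratio can be estimated by simple Monte Carlo sampling from 𝒫, given membership tests for the two regions. A sketch with hypothetical function arguments:

```python
def superset_efficiency(in_R, in_R_plus, draw, n=100_000):
    # Monte Carlo estimate of Pr[x in R(S^m)] / Pr[x in R+(S^m)] under P.
    # in_R and in_R_plus are membership tests; draw() samples from P.
    hits_R = hits_R_plus = 0
    for _ in range(n):
        x = draw()
        if in_R_plus(x):
            hits_R_plus += 1
            if in_R(x):              # R(S^m) is contained in R+(S^m)
                hits_R += 1
    return hits_R / max(hits_R_plus, 1)
```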
If we are only able to maintain a subset ℛ⁻(S^m) ⊂ ℛ(S^m), our sampling and training algorithm must take additional precautions. On any given iteration, some part of ℛ(S^m) will be excluded from sampling. Because of this, we will need to ensure that on successive iterations we will choose subsets that cover the entire region of uncertainty (an example of this technique will be discussed in the next section). We will also need to keep the number of examples on each iteration small to prevent oversampling of one part of the domain.
Figure 3. As the batch size in selective sampling approaches one, the process yields diminishing improvements for the added computational costs. This figure plots error vs. training set size for selective sampling, using different batch sizes for learning an axis-parallel rectangle in two dimensions.
For the remainder of this article, we will denote an arbitrary algorithm's approximation to the true region of uncertainty as ℛ*(S^m).
3. Neural networks for selective sampling
The selective sampling approach holds promise for improved generalization in many trainable classifiers. The remainder of this article is concerned with demonstrating how an approximation of selective sampling may be implemented using a feedforward neural network trained with error backpropagation.
Figure 4. A simple feedforward neural network. Each node computes the weighted sum of its inputs, passes that sum through a sigmoidal "squashing" function, and passes the result on as its output.

The backpropagation algorithm (Rumelhart et al., 1986) is a supervised neural network learning technique, in that the network is presented with a training set of input/output pairs (x, t(x)) and learns to output t(x) when given input x. To train a neural network using standard backpropagation, we take the training example (x, t(x)) and copy x into the input nodes of the network (as in figure 4).¹ We then calculate the individual neuron outputs layer by layer, beginning at the first "hidden" layer and proceeding through the output layer. The output of neuron j is computed as

o_j(x) = σ(Σ_i o_i(x) · w_{j,i}),

where w_{j,i} is the connection weight to neuron j from neuron i, and σ(a) = 1/(1 + e^(−a)) is a sigmoidal "squashing" function that produces neuron outputs in the range [0, 1]. We define the error of the output node n as δ_n(x) = (o_n(x) − t(x))². This error value is propagated back through the network (see Rumelhart et al. (1986) for details), so that each neuron j has an error term δ_j(x). The connection weights w_{j,i} are then adjusted by adding

Δw_{j,i}(x) = η δ_j(x) o_i(x),    (1)

where η is a constant "learning rate."
This adjustment incrementally decreases the error of the network on example (x, t(x)).
By presenting each training example in turn, a sufficiently large network will generally converge to a set of weights where the network has acceptably small error on each training example. In the concept-learning model, the target values of training examples are 1 or 0, depending on whether or not the input is an instance of the concept being learned. Patterns are trained on until their error is less than some threshold.
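To make the update concrete, here is a minimal numpy sketch of a one-hidden-layer sigmoid network trained with per-example updates in the spirit of equation (1). The class, its size, and the learning rate are illustrative assumptions, not the architecture used in the experiments below; the δ terms fold in the derivative of the sigmoid, which the text's δ_j notation absorbs into the backpropagated error.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TwoLayerNet:
    """A one-hidden-layer sigmoid network trained with per-example
    weight updates in the spirit of equation (1)."""

    def __init__(self, n_in, n_hidden, eta=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_hidden, n_in + 1))  # +1: bias input
        self.W2 = rng.normal(0.0, 0.5, (1, n_hidden + 1))
        self.eta = eta

    def forward(self, x):
        h = sigmoid(self.W1 @ np.append(x, 1.0))
        y = sigmoid(self.W2 @ np.append(h, 1.0))
        return h, y[0]

    def train_one(self, x, t, eta=None):
        eta = self.eta if eta is None else eta
        h, y = self.forward(x)
        # Backpropagated error terms; the sigmoid derivative y(1-y) is
        # folded in, and the constant factor of 2 is absorbed into eta.
        d_out = (y - t) * y * (1.0 - y)
        d_hid = (self.W2[0, :-1] * d_out) * h * (1.0 - h)
        self.W2 -= eta * d_out * np.append(h, 1.0)
        self.W1 -= eta * np.outer(d_hid, np.append(x, 1.0))
        return (y - t) ** 2          # squared error on this example
```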
At this point, we need to draw attention to the distinction between a neural network's architecture and its configuration.² The architecture of a neural network refers to those parameters of the network that do not change during training; in our case, these will be the network's topology and transfer functions. The configuration of a network refers to the network parameters that do change during training: in this case, the weights given to each of the connections between the neurons. Although there are network-training algorithms that involve changing a network's topology during training (e.g., Ash, 1989), we consider here only those with fixed topologies that train by weight adjustment. The theory and methods described here should, with some modification, be equally applicable to other trainable classifiers.
For a neural network architecture with a single output node, the concept class C is specified by the set of all configurations that the network can take on. Each of these configurations implements a mapping from an input x to an output in [0, 1], and many configurations
may implement the same mapping. If we set a threshold (such as 0.5) on the output, then we may say that a particular configuration ĉ represents a concept c such that x ∈ c if and only if ĉ(x) > 0.5 (see figure 5). Having trained on training set S^m, we can say that the network configuration ĉ implements a concept c that is consistent with training set S^m. We will use c to denote both the concept c and the network ĉ that implements it.
Below, we consider a naive algorithm for selective sampling with neural networks and examine its shortcomings. We then describe the SG-net, based on the version-space paradigm (Mitchell, 1982), which overcomes these difficulties.
3.1. A naive neural network querying algorithm
The observation that a neural network implementation of a concept learner may produce a real-valued output that is thresholded suggests a naive algorithm for defining a region of uncertainty. When a network is trained to such tolerances (say, outputs above 0.9 on positive examples and below 0.1 on negative ones), we can divide all points in the domain into one of three classifications: "1" (0.9 or greater), "0" (0.1 or less), and "uncertain" (between 0.1 and 0.9). We may say that this last category corresponds to a region where the network is uncertain, and may thus define it to be ℛ*(S^m), our approximation to the region of uncertainty (figure 6).
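As a sketch, the naive region test amounts to nothing more than thresholding a single trained network's real-valued output (here reusing the hypothetical TwoLayerNet from the sketch above):

```python
def naive_label(output):
    # Three-way split of the domain by a single network's real-valued output.
    if output >= 0.9:
        return 1                     # confidently positive
    if output <= 0.1:
        return 0                     # confidently negative
    return None                      # "uncertain": the naive R*(S^m)

def in_naive_uncertainty_region(net, x):
    _, y = net.forward(x)            # e.g., the TwoLayerNet sketch above
    return naive_label(y) is None
```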
The problem with applying this approach is that it measures only the uncertainty of that particular configuration, not the uncertainty among configurations possible given the architecture. While it is in fact a part of ℛ(S^m), the full region is composed of the differences between all possible consistent network configurations.
Figure 5. The thresholded output of the trained neural network ĉ serves as a classifier representing a concept c that is (hopefully) similar to the unknown target concept.
Figure 6. A naive approach to representing the region of uncertainty: we can use the network's transition area between 0 and 1 to represent the part of the domain where the network is "uncertain."
This limitation is exacerbated by the inductive bias of some learning algorithms, including backpropagation. The backpropagation algorithm, when it attempts to classify a set of points, tends to draw sharp distinctions and become "overly confident" in regions that are still unknown. As a result, the ℛ*(S^m) chosen by this method will in general be a very small subset of the true region of uncertainty.
A pathological example of this behavior is exhibited in figures 7a and 7b. In figure 7a, the initial random sampling has failed to yield any positive examples in the triangle on the right. Training by backpropagation on the examples yields a region of uncertainty (between the two contours) that concentrates on the left half of the domain, completely to the exclusion of the right. The final result of 10 iterations of querying and learning is shown in figure 7b. This strategy (and related ones) is prone to failure of this form whenever there are regions of detail in the target concept that are not discovered in the initial random-sampling stage.
3.2. Version space
Mitchell (1982) describes a learning procedure based on the partial ordering in generality of the concepts being learned. Some concept c₁ is "more general" than another c₂ if and only if c₂ ⊆ c₁. If c₁ ⊄ c₂ and c₂ ⊄ c₁, then the two concepts are incomparable. For a concept class C and a set of examples S^m, the version space is the subset C_{S^m} = {c ∈ C : c is consistent with all s ∈ S^m}. To bound the concepts in the version space, we can maintain two subsets, S, G ⊆ C_{S^m}.
Figure 7. A pathological example of naive network querying. (a) An initial random sample has failed to detect the second, disjoint region of the target concept; (b) after 10 successive iterations, the naive querying algorithm has ignored that region and concentrated on the region where it has seen examples. The dotted line denotes the true boundary of the unknown target concept.
S is the set of all "most specific" consistent concepts, that is, S = {c ∈ C_{S^m} : ∄c′ ∈ C_{S^m} with c′ ⊂ c}. Similarly, G = {c ∈ C_{S^m} : ∄c′ ∈ C_{S^m} with c′ ⊃ c} is the set of "most general" concepts. For any consistent concept c, it must be the case that s ⊆ c ⊆ g for some s ∈ S and g ∈ G.
One may do active learning with a version space by examining instances that fall in the "difference" of S and G, that is, the region S∆G = ∪{s∆g : s ∈ S, g ∈ G} (where ∆ is the symmetric difference operator). If an instance in this region proves positive, then some s in S will have to generalize to accommodate the new information; if it proves negative, some g in G will have to be modified to exclude it. In either case, the version space, the space of plausible hypotheses, is reduced with every query.
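For a small finite concept class, the version space and the S∆G query region can be computed by brute force. The sketch below represents concepts as sets of domain points, an illustrative choice of our own rather than anything from the original:

```python
def version_space(concepts, examples):
    # Concepts are frozensets of domain points; keep those consistent
    # with every labeled example (x, label).
    return [c for c in concepts
            if all((x in c) == bool(label) for x, label in examples)]

def s_and_g(vs):
    # Most specific / most general members under set inclusion.
    S = [c for c in vs if not any(d < c for d in vs)]
    G = [c for c in vs if not any(d > c for d in vs)]
    return S, G

def in_SG_difference(x, S, G):
    # x lies in the union of symmetric differences s (delta) g.
    return any((x in s) != (x in g) for s in S for g in G)

concepts = [frozenset(c) for c in
            ([], [0], [0, 1], [0, 1, 2], [0, 1, 2, 3], [1, 2])]
vs = version_space(concepts, [(0, 1)])        # point 0 is positive
S, G = s_and_g(vs)
print(in_SG_difference(2, S, G))              # True: querying 2 is informative
```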
3.3. Implementing an active version-space search
Since an entire neural network configuration represents a single concept, a complete version space cannot be directly represented by any single neural network. In fact, Haussler (1987) pointed out that the size of the S and G sets could grow exponentially in the size of the training set. Representing these sets completely would require keeping track of and manipulating an exponential number of network configurations.
We can, however, modify the version-space search to make the problem tractable if we impose, according to the distribution 𝒫, a strict ordering on all concepts in the class. We will define a concept c₁ to be "more general" than concept c₂ if and only if, for a random point x drawn from 𝒫, Pr[x ∈ c₁] > Pr[x ∈ c₂]. Under this definition, the generality of all concepts in the class is comparable, and it makes sense to speak of an ordering in which we can represent a single "most general" concept g and a single "most specific" concept s. There may still be many concepts with the same generality, but this is no impediment. We need only know that there are no concepts, in the "most general" case, with a greater generality than the concept g we have chosen.
By maintaining these two concepts, we have a window into our version space: ℛ*(S^m) = s∆g will be a subset of S∆G. Thus, a point x ∈ ℛ*(S^m) is guaranteed to reduce the size of our version space. If positive, it will invalidate s and leave us with another s, either a more general one or an equally specific one that includes the new point. Similarly, if the new point is classified as negative, it will invalidate g. By proceeding in this fashion, we can approximate a step-by-step traversal of the S and G sets using a fixed representation size.
3.4. The SG-net: a neural network version-space search algorithm
Since we are interested in selecting examples that improve the generalization behavior of some given neural network architecture N, we define the concept class in question to be the set of concepts learnable by N with its learning algorithm. If we can manage to obtain network configurations that represent the s and g concepts described above, then it is a simple matter to implement the modified version-space search. In the following two subsections, we first describe how one may learn a "most specific" or "most general" concept associated with a network and then describe how these two networks may be used to selectively sample the ℛ*(S^m) defined by regions where they disagree.
3.4.1. Implementing a “most specific/general” network
Below, we describe how one may learn s, a “most specific” concept consistent with some given data. The case for learning g, a “most general” concept, is analogous.
A most specific network for a set of examples S^m (according to distribution 𝒫) is one that classifies as positive those example points that are in fact positive and classifies as negative as much as possible of the rest of the domain. This requirement amounts to choosing a c consistent with S^m that minimizes Pr[x ∈ c]. Such a network may be arrived at by employing an inductive bias. An inductive bias is a predisposition of a learning algorithm for some solutions over others. Most learning algorithms inherently have at least some form of an inductive bias, whether it is a preference for simple solutions over complex ones, or a tendency to choose solutions where the absolute values of the parameters remain small.³
What we will do is explicitly add a new inductive bias to the backpropagation algorithm: by penalizing the network for any part of the domain that it classifies as positive, we add a bias that prefers specific concepts over general ones. The weight of this penalty must be carefully adjusted: if it is large enough to outweigh the training examples, the network will not converge on the training data. The penalty must, however, be large enough to outweigh any other inductive bias in the learning algorithm and to force the algorithm to find a most specific configuration consistent with Sm.
This "negative" bias may be implemented by drawing unclassified points from 𝒫 (or creating them in the case where 𝒫 is known), and arbitrarily labeling them as negative examples. We then add these "background" examples to the training set (figure 8).
Figure 8. Training on a large number of "background" points in addition to the regular training data forces the network into a "most specific" configuration.
This creates a background bias over the domain that is weighted by the input distribution 𝒫: the networks that have the least error on these background patterns will be the ones that are the most specific according to 𝒫.
In order to allow the network to converge on the actual training examples in spite of the background examples, we must balance the influence of background examples against that of the training data. As the network learns training example x, the error term δ_j(x) in equation (1) will approach zero, while the error term of an arbitrary background example y may remain constant. Unless the "push" that the random background example exerts on the network weights (Δw_{j,i}(y)) is decreased to match that of the normal training examples (Δw_{j,i}(x)), the background examples will dominate, and the network will not converge on a solution.
We achieve balance by using different learning rates for training examples and background examples. We dynamically decrease the background learning rate as a function of the network's error on the training set. Each time we present a training example x, we calculate a new background learning rate η′ = γ δ(x) η, where δ(x) is the error of the network on x, and 0 < γ < 1 is a constant. We then train on a single background example using this value of η′ and repeat. Formally, the algorithm is as follows:
1. Initialize the network to a random configuration c.
2. If for all actual training examples (x, t(x)) ∈ S^m, (c(x) − t(x))² < threshold, stop.
3. Otherwise, select the next actual training example (x, t(x)).
4. Calculate the output error of network c on input x, δ(x) = (c(x) − t(x))², and backpropagate through the network, adjusting weights according to Δw_{j,i}(x) = η δ_j(x) o_i(x).
5. Calculate the new background learning rate η′ = γ δ(x) η.
6. Draw a point y from 𝒫 and create background example (y, 0).
7. Calculate the output error δ(y) = (c(y))², and backpropagate through the network, adjusting weights according to the modified equation Δw_{j,i}(y) = η′ δ_j(y) o_i(y).
8. Go to step 2.
Optimally, γ should be set such that the weight update on the background patterns is always infinitesimally smaller than the weight update on the actual training patterns, allowing the network to "anneal" to a most specific configuration. This, however, requires a prohibitive amount of training time. Empirically, we have found that setting γ = 0.75 provides an adequate bias and still allows convergence in a reasonable number of iterations.
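A minimal sketch of steps 1-8, again reusing the hypothetical TwoLayerNet from the earlier sketch; the stopping test is applied once per epoch, which is a simplification of step 2, and all parameter values are illustrative:

```python
def train_most_specific(net, examples, draw_from_P, eta=0.5, gamma=0.75,
                        threshold=0.01, max_epochs=5000):
    # Interleave real examples with "background" points from P labeled 0,
    # using a background learning rate that decays with the training error.
    for _ in range(max_epochs):
        worst = 0.0
        for x, t in examples:
            err = net.train_one(x, t, eta=eta)     # step 4: train on (x, t(x))
            worst = max(worst, err)
            eta_bg = gamma * err * eta             # step 5: eta' = gamma*delta(x)*eta
            y = draw_from_P()                      # step 6: background point
            net.train_one(y, 0.0, eta=eta_bg)      # step 7: train toward label 0
        if worst < threshold:                      # step 2: all examples learned
            break
    return net
```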
A similar procedure can be used to produce a "most general" network, adding a positive inductive bias by classifying all background points drawn from (P as positive.
3.4.2. Implementing active learning with an SG-net
Once we can represent concepts s and g, it is a simple matter to test a point x for membership in ℛ*(S^m) by determining whether s(x) ≠ g(x). Selective sampling may then be implemented as follows: if a point drawn from the distribution is not in s∆g (if the two networks agree on their classification of it), then the point is discarded. If the point is in s∆g, then its true classification is queried, and it is added to the training set. In practice, we can merge the inputs of the s and g networks, as illustrated in figure 9, and train both together.
It is important to note that this technique is somewhat robust, in that its failure modes degrade the efficiency of a single sampling iteration rather than causing overall failure of the learning process. If either the s or g network fails to converge on the training data, the points that failed to converge will be contained in s∆g, and that region will be eligible for additional sampling on the next iteration. In most cases, we have found that these additional examples will suffice to "push" the network out of its local minimum.
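In sketch form, the filtering step described above reduces to querying exactly those candidate points on which the two networks disagree; the names and the 0.5 output threshold are assumptions of this sketch:

```python
def sg_query_batch(s_net, g_net, draw_unlabeled, oracle, batch_size):
    # Query exactly those candidate points where the "most specific" and
    # "most general" networks disagree, i.e., points in s (delta) g.
    batch = []
    while len(batch) < batch_size:
        x = draw_unlabeled()
        s_says = s_net.forward(x)[1] > 0.5
        g_says = g_net.forward(x)[1] > 0.5
        if s_says != g_says:                   # x is in R*(S^m): worth querying
            batch.append((x, oracle(x)))
    return batch
```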
If the network does converge on the training set, but settles on solutions that are not near the most specific/general networks consistent with the data, the examples gleaned in the next iteration are still useful. Since they were chosen by virtue of lying in areas where the two networks disagreed, the points will settle discrepancies between the two. This may lead to some oversampling of the region, but will not, in and of itself, cause the technique to fail.
Figure 9. Construction of an SG-network equivalent to the original: a typical network architecture is split into separate s and g networks.
The effects of these two failure modes can be minimized by keeping the number of examples taken on each iteration small. This increases the efficiency of the learning process in terms of the number of examples classified, but as we have observed, there is a tradeoff in the computational resources required. Each time new data are added to the training set, the network may have to completely readjust itself to incorporate the new information. We have found that in practice, with large training set sizes, it is often most efficient simply to retrain the entire network from scratch when new examples are added. Recent work by Pratt (1993) offers hope that this retraining may be made more efficient by use of "information transfer" strategies between iterations.
4. Experimental results
Experiments using selective sampling were run on three types of problems: solving a simple boundary-recognition problem in two dimensions, learning a 25-input real-valued threshold function, and recognizing the secure region of a small power system.
4.1. The triangle learner
A two-input network with two hidden layers of eight and three units and a single output was trained on a uniform distribution of examples that were positive inside a pair of triangles and negative elsewhere. This task was chosen because of its intuitive visual appeal and because it requires learning a non-connected concept--a task that demands more from the training algorithm (and sample selection scheme) than a simple convex shape.
The baseline case consisted of 12 networks trained on randomly drawn examples with training set sizes from 10 to 150 points in increments of 10 examples. Eight test cases were run on the same architecture with data selected by four runs of an SG-network using 15 selective sampling iterations of 10 examples each (figures 10a and 10b). Additionally, 12 runs of the naive querying algorithm described in section 3.1 were run for comparison.
The networks trained on the selectively sampled data showed marked, consistent improvement over both the randomly sampled networks and the ones trained with naive querying (figure 11). The naive querying algorithm displayed much more erratic performance than the other two algorithms, possibly due to the pathological nature of its failure modes.
Figure 10. The triangle learner problem: (a) when learned by 150 random examples, and (b) when learned by 150 examples drawn in 15 passes of selective sampling. The dotted line denotes the true boundary of the unknown target concept.
Figure 11. Generalization error vs. training set size for random sampling, naive querying, and selective sampling. The irregularity of the naive querying algorithm's error may be due to its intermittent failure to find both triangles in the initial random sample.
4.2. Real-valued threshold function
We use the 25-input real-valued threshold problem as a quantitative measure of network performance on a simple but higher-dimensional problem. Six runs of selective sampling, using iterations of 10 examples each, were trained on the problem and compared to 12 identical networks trained with randomly sampled data. The results (figure 12) indicate a much steeper learning curve for selective sampling.
Plotting the generalization error against the number of training examples m, networks trained on the randomly sampled data exhibited a roughly polynomial curve, as would be expected following Blumer et al. (1988). Using simple linear regression on 1/e, the error data fit e = (a·m + b)⁻¹ (for a = 0.0514 and b = −0.076) with a coefficient of determination (r²) of 0.987. Networks trained on the selectively sampled data, by comparison, fit e = (a·m + b)⁻¹ with r² = 0.937, indicating that the fit to the polynomial was not as good.
Figure 12. Generalization error vs. training set size for random sampling and selective sampling. Standard deviation of error averages 0.00265 for the random case and 0.00116 for the selectively sampled case.

Visually, the selectively sampled networks exhibited a steeper drop in the generalization error, as would be expected from an active learning method. Using linear regression on the natural logarithm of the errors, the selectively sampled networks exhibited a decrease in generalization error matching e = e^(a·m+b) for a = −0.0218 and b = 0.071, with r² = 0.995 (up until the error drops below 1.5%), indicating a good fit to the exponential curve. By comparison, the randomly sampled networks' fit to e = e^(a·m+b) achieved only r² = 0.981.
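Fits of this kind can be reproduced with ordinary least squares on the transformed errors. The sketch below uses numpy and made-up learning-curve numbers purely for illustration, not the paper's data:

```python
import numpy as np

def learning_curve_fits(m, err):
    # Fit 1/err = a*m + b (polynomial decay) and ln(err) = a*m + b
    # (exponential decay) by least squares; return (a, b, r^2) for each.
    def linfit(x, y):
        a, b = np.polyfit(x, y, 1)
        r2 = 1.0 - ((y - (a * x + b)) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        return a, b, r2
    m, err = np.asarray(m, float), np.asarray(err, float)
    return linfit(m, 1.0 / err), linfit(m, np.log(err))

m = [50, 100, 150, 200, 250, 300]                 # hypothetical curve
err = [0.30, 0.14, 0.09, 0.07, 0.055, 0.045]
print(learning_curve_fits(m, err))
```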
In this domain, the SG-network appears to provide an almost exponential improvement in generalization with increasing training-set size, much as one would expect from an active learning algorithm. This suggests that the SG-network represents a good approximation to the region of uncertainty (in this domain) and thus implements a good approximation of selective sampling.
Additional experiments, run using 2, 3, 4, and 20 iterations, indicate that the error decreases as the sampling process is broken up into smaller, more frequent iterations. This observation is consistent with an increased efficiency of sampling as new information is incorporated earlier into the sampling process.
4.3. Power system security analysis
If various load parameters of an electrical power system are within a certain range, the system is secure. Otherwise it risks thermal overload and brownout. Previous research (Aggoune et al., 1989) determined that this problem was amenable to neural network learning, but that random sampling of the problem domain was inefficient in terms of examples needed. The range of parameters over which the system will be run is known, so distribution information is readily available. For each set of parameters (each point in the domain), one can analytically determine whether the system is secure, but this must be done by solving a time-consuming system of equations. Thus, since the classification of a point is much more expensive than the determination of an input distribution, the problem is amenable to solution by selective sampling.
The baseline case of random sampling in four dimensions studied by Hwang et al. (1990) was used for comparison. In our experiments, we ran six sets of networks on the initial, random training sets (with 500 data points) and added a single iteration of selective sampling. Networks were trained on a small second iteration of 300 points (for a total of 800) as well as a large second iteration of 2000 (for a total of 2500 points). These results were compared to the baseline cases of 800 and 2500 points of randomly sampled data.
We estimated the network errors by testing on 14,979 randomly drawn test points. The improvement that the single extra iteration of selective sampling yielded for the small set was over 5% of the total error (5.17% instead of 5.47%), while on the large set it resulted in an improvement of 12.6% of the total (4.21% instead of 4.82%). This difference is significant with greater than 90% confidence.
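One way to check a claim of this kind is a two-proportion z-test on the measured error rates. The sketch below uses the large-set numbers and treats the two test runs as independent (a paired comparison on the same test points, which the authors may well have used, would be more powerful); it is our own illustration, not the paper's analysis:

```python
from math import erf, sqrt

def two_proportion_z(err1, err2, n):
    # Normal-approximation test that two error rates, each measured on n
    # independent test points, differ; returns (z, two-sided p-value).
    p = (err1 + err2) / 2.0                      # pooled error rate
    se = sqrt(2.0 * p * (1.0 - p) / n)
    z = abs(err1 - err2) / se
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))
    return z, p_value

print(two_proportion_z(0.0482, 0.0421, 14979))   # z ~ 2.5: significant
```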
5. Limitations of the selective sampling approach
There are a number of limitations to the selective sampling approach: some are practical, as mentioned in the previous section discussing implementations of the technique, while others are more theoretical.
5.1. Practical limitations
As discussed earlier in this article, an exact implementation of selective sampling is practical only for relatively simple concept classes. As the class becomes more complex, it becomes difficult to compute and maintain an accurate approximation of ℛ(S^m).
In the case of maintaining a superset, increased concept complexity seems to lead to cases where ℛ⁺(S^m) effectively contains the entire domain, reducing the efficiency of selective sampling to that of random sampling. The example in section 2.4 illustrates this nicely: while a bounding box suffices as an approximation for rectangles in two dimensions, the "nicks" in such a box bounding a 20-dimensional figure could conceivably require the approximation to contain most of the input space.
In the case of maintaining a subset, increased concept complexity leads to an extreme where ℛ⁻(S^m) contains only a very small subset of ℛ(S^m). In these cases, oversampling of regions becomes a critical problem, and due to the inductive bias of the training algorithm, even a training-set size of only one may omit large regions of the domain.
5.2. Theoretical limitations
Selective sampling draws its power from the ability to differentiate a region of uncertainty from the bulk of the domain. In cases where the representational complexity of the concept is large (as in a neural network with many hidden units), however, ℛ(S^m) can extend over the whole domain until the concept is already well learned. In other words, even though the maximum error may be small, due to the number of places that this error may arise, the total uncertainty may remain large. Thus, depending on the desired final error rate, selective sampling may not come into effect until it is no longer needed. Similarly, if the input dimension is very large, the bulk of the domain may be uncertain, even for simple concepts. One method for avoiding this problem is the use of Bayesian probabilities to measure the degree of utility in querying various parts of the region of uncertainty. This approach has recently been studied by David MacKay (1992) and is discussed briefly in the following section.
6. Related work
The work described in this article is an extension of results published by Cohn et al. (1990). Prior to that work, and since then, there have been many related results in active learning. There is a large body of work studying the effects of queries from the strict learning-theory viewpoint, primarily with respect to learning formal concepts such as Boolean expressions and finite-state automata. Angluin (1986) showed that while minimal finite-state automata were not polynomially learnable (in the Valiant sense) from examples alone, they could be learned using a polynomial number of queries to an oracle that provides counterexamples. Valiant (1984) considers various classes that are learnable using a variety of forms of directed learning. Work by Eisenberg and Rivest (1990) puts bounds on the degree to which membership query examples can help generalization when the underlying distribution is unknown. Additionally, given certain smoothness constraints on the distribution, these authors describe how queries may be used to learn the class of initial segments on the unit line.
Seung et al. (1992) independently proposed an approach to selecting queries similar to ours, basing it on a lack of consensus in a "committee" of learners. Freund et al. (1993) showed that as the size of the committee increases beyond the two learners used in selective sampling, the accuracy of one's "utility" estimate increases sharply.
Actual implementations of querying systems for learning have only recently been explored. Work done by Hwang et al. (1990) implements querying for neural networks by means of inverting the activation of a trained network to determine where it is uncertain. This approach shows promise for concept learning in cases with relatively compact, connected concepts and has already produced impressive results on the power system static security problem. It is, however, susceptible to the pathology discussed in section 3.1. An algorithm due to Baum and Lang (1991) uses queries to reduce the computational costs of training a single-hidden-layer neural network. Their algorithm makes queries that allow the network to efficiently determine the connection weights from the input layer to the hidden layer.
Work by David MacKay (1992) pursues a related approach to data selection using Bayesian analysis. By assigning prior probabilities to each concept (or each network configuration), one can determine the utility of querying various parts of ℛ(S^m). The fact that a point lies within ℛ(S^m) means that there are consistent configurations that disagree on the classification of that point. If that point is on the edge of ℛ(S^m), however, it may be that only a very few configurations disagree, and querying the point will only decrease the size of ℛ(S^m) by an infinitesimally small amount. Using Bayesian analysis, one may, in effect, determine the "number" of configurations that disagree on a given point, and thus determine which parts of ℛ(S^m) are "most" uncertain.
7. Conclusion
In this article, we have presented a theory of selective sampling, described a neural network implementation of the theory, and examined the performance of the resulting system in several domains.
Selective sampling is a very rudimentary form of active learning, but it has the benefit of a formal grounding in learning theory. In the neural network implementation tested, selective sampling demonstrates significant improvement over passive, random sampling techniques on a number of simple problems.
The paradigm is suited for concept-learning problems where the relevant input distribution is known or where the cost of obtaining an unlabeled example from the input distribution is small compared with the cost of labeling that example. While the limitations of selective sampling become apparent on more complex problem domains, the approach opens the door to the study of more sophisticated techniques for querying and learning by the natural and intuitive means of active learning.
Acknowledgments
This work was supported by National Science Foundation grant number CCR-9108314, the Washington Technology Center, and the IBM Corporation. The majority of this work
220 D. COHN, L. ATLASAND R. LADNER
was done while David Cohn was with the Department of Computer Science and Engineering, University of Washington. The remainder was done while David Cohn was at IBM T.J. Watson Research Center, Yorktown Heights, NY 10598. We would like to thank Jai Choi and Siri Weerasooriya for their work in running simulation data for the power system problem. We would also like to thank two anonymous referees for their suggestions on an earlier version of this article.
Notes
1. We assume that all inputs have been normalized to the range [0, 1].
2. The terminology is that of Judd (1988).
3. The inductive biases inherent in backpropagation have not been well studied, but there appears to be a tendency to fit the data using the smallest number of units possible.
References
Aggoune, M., Atlas, L., Cohn, D., Damborg, M., El-Sharkawi, M., & Marks, R. II. (1989). Artificial neural networks for power system static security assessment. Proceedings, International Symposium on Circuits and Systems. IEEE.
Angluin, D. (1986). Learning regular sets from queries and counter-examples (Technical Report YALEU/DCS/TR-64). Dept. of Computer Science, Yale University, New Haven, CT.
Ash, T. (1989). Dynamic node creation in backpropagation networks (ICS Report 8901). Institute for Cognitive Science, University of California, San Diego, CA.
Baum, E., & Haussler, D. (1989). What size net gives valid generalization? In D. Touretzky (Ed.), Advances in neural information processing systems (Vol. 1). San Francisco, CA: Morgan Kaufmann.
Baum, E., & Lang, K. (1991). Constructing hidden units using examples and queries. In R. Lippmann et al. (Eds.), Advances in neural information processing systems (Vol. 3). San Francisco, CA: Morgan Kaufmann.
Blum, A., & Rivest, R. (1989). Training a 3-node neural network is NP-complete. In D. Touretzky (Ed.), Advances in neural information processing systems (Vol. 1). San Francisco, CA: Morgan Kaufmann.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. JACM, 36(4), 929-965.
Cohn, D., Atlas, L., & Ladner, R. (1990). Training connectionist networks with queries and selective sampling. In D. Touretzky (Ed.), Advances in neural information processing systems (Vol. 2). San Francisco, CA: Morgan Kaufmann.
Cohn, D., & Tesauro, G. (1992). How tight are the Vapnik-Chervonenkis bounds? Neural Computation, 4(2), 249-269.
Eisenberg, B., & Rivest, R. (1990). On the sample complexity of pac-learning using random and chosen examples. In M. Fulk & J. Case (Eds.), ACM 3rd Annual Workshop on Computational Learning Theory. San Francisco, CA: Morgan Kaufmann.
Fernald, A., & Kuhl, P. (1987). Acoustic determinants of infant preference for Motherese speech. Infant Behavior and Development, 10, 279-293.
Freund, Y., Seung, H.S., Shamir, E., & Tishby, N. (1993). Information, prediction, and query by committee. In S. Hanson et al. (Eds.), Advances in neural information processing systems (Vol. 5). San Francisco, CA: Morgan Kaufmann.
Haussler, D. (1987). Learning conjunctive concepts in structural domains. Proceedings, AAAI-87 (pp. 466-470). San Francisco, CA: Morgan Kaufmann.
Haussler, D. (1992). Decision-theoretic generalizations of the PAC model for neural net and other applications. Information and Computation, 100(1), 78-150.
Hwang, J.-N., Choi, J., Oh, S., & Marks, R. (1990). Query learning based on boundary search and gradient computation of trained multilayer perceptrons. IJCNN 90. San Diego, CA.
Judd, S. (1988). On the complexity of loading shallow neural networks. Journal of Complexity, 4, 177-192.
Le Cun, Y., Denker, J., & Solla, S. (1990). Optimal brain damage. In D. Touretzky (Ed.), Advances in neural information processing systems (Vol. 2). San Francisco, CA: Morgan Kaufmann.
MacKay, D. (1992). Information-based objective functions for active data selection. Neural Computation, 4(4), 590-604.
Mitchell, T. (1982). Generalization as search. Artificial Intelligence, 18, 203-226.
Pratt, L.Y. (1993). Discriminability-based transfer between neural networks. In C.L. Giles et al. (Eds.), Advances in neural information processing systems (Vol. 5). San Francisco, CA: Morgan Kaufmann.
Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing. Cambridge, MA: MIT Press.
Seung, H.S., Opper, M., & Sompolinsky, H. (1992). Query by committee. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory (pp. 287-294). New York: ACM.
Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27, 1134-1142.
Received September 24, 1990 Accepted February 4, 1992
Final Manuscript December 15, 1992