Two faces of active learning
Sanjoy Dasgupta
dasgupta@cs.ucsd.edu
Abstract
An active learner has a collection of data points, each with a label that is initially hidden but can be obtained at some cost. Without spending too much, it wishes to find a classifier that will accurately map points to labels. There are two common intuitions about how this learning process should be organized: (i) by choosing query points that shrink the space of candidate classifiers as rapidly as possible; and (ii) by exploiting natural clusters in the (unlabeled) data set. Recent research has yielded learning algorithms for both paradigms that are efficient, work with generic hypothesis classes, and have rigorously characterized labeling requirements. Here we survey these advances by focusing on two representative algorithms and discussing their mathematical properties and empirical performance.
1 Introduction
As digital storage gets cheaper, and sensing devices proliferate, and the web grows ever larger, it gets easier to amass vast quantities of unlabeled data – raw speech, images, text documents, and so on. But to build classifiers from these data, labels are needed, and obtaining them can be costly and time consuming. When building a speech recognizer for instance, the speech signal comes cheap but thereafter a human must examine the waveform and label the beginning and end of each phoneme within it. This is tedious and painstaking, and requires expertise.
We will consider situations in which we are given a large set of unlabeled points from some domain X, each of which has a hidden label, from a finite set Y, that can be queried. The idea is to find a good classifier, a mapping h : X → Y from a pre-specified set H, without making too many queries. For instance, each x ∈ X might be the description of a molecule, with its label y ∈ {+1, −1} denoting whether or not it binds to a particular target of interest. If the x’s are vectors, a possible choice of H is the class of linear separators.
In this setting, a supervised learner would query a random subset of the unlabeled data and ignore the rest. A semisupervised learner would do the same, but would keep around the unlabeled points and use them to constrain the choice of classifier. Most ambitious of all, an active learner would try to get the most out of a limited budget by choosing its query points in an intelligent and adaptive manner (Figure 1).
1.1 A model for analyzing sampling strategies
The practical necessity of active learning has resulted in a glut of different querying strategies over the past decade. These differ in detail, but very often conform to the following basic paradigm:
Start with a pool of unlabeled data S ⊂ X
Pick a few points from S at random and get their labels
Repeat:
    Fit a classifier h ∈ H to the labels seen so far
    Query the unlabeled point in S closest to the boundary of h
    (or most uncertain, or most likely to decrease overall uncertainty, ...)
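For concreteness, here is a minimal sketch of this paradigm for binary labels and a linear separator. It is an illustration only, not any particular published method; the oracle get_label, the seed size, and the choice of logistic regression are placeholder assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def uncertainty_sampling(pool, get_label, seed_size=10, budget=100):
        # pool: (n, d) array of unlabeled points; get_label(i) is the oracle
        # revealing the hidden +1/-1 label of pool[i] (assumed to return both
        # classes among the random seed queries).
        rng = np.random.default_rng(0)
        queried = list(rng.choice(len(pool), size=seed_size, replace=False))
        labels = [get_label(i) for i in queried]
        for _ in range(budget - seed_size):
            clf = LogisticRegression().fit(pool[queried], labels)
            margin = np.abs(clf.decision_function(pool))  # uncertainty score
            margin[queried] = np.inf                      # never re-query a point
            i = int(np.argmin(margin))                    # closest to the boundary
            queried.append(i)
            labels.append(get_label(i))
        return clf, queried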
This high level scheme (Figure 2) has a ready and intuitive appeal. But how can it be analyzed?
Figure 1: Each circle represents an unlabeled point, while + and − denote points of known label. Left: raw and cheap – a large reservoir of unlabeled data. Middle: supervised learning picks a few points to label and ignores the rest. Right: Semisupervised and active learning get more use out of the unlabeled pool, by using it to constrain the choice of classifier, or by choosing informative points to label.
Figure 2: A typical active learning strategy chooses the next query point near the decision boundary obtained from the current set of labeled points. Here the boundary is a linear separator, and there are several unlabeled points close to it that would be candidates for querying.
[Figure: data points labeled + and − on the line, with the target threshold w∗ and a current estimate w; here h_w(x) = +1 if x ≥ w and −1 if x < w.]
In the case of DHM, the distribution P might not be separable, in which case we need to take into account the best achievable error:
    ν = inf_{h∈H} err(h).
L_DHM(ǫ, δ) is then the smallest t₀ such that

    P[some h ∈ H_t has err(h) > ν + ǫ] ≤ δ

for all t ≥ t₀. In typical supervised learning bounds, and here as well, the dependence of L(ǫ, δ) upon δ is modest, at most polylog(1/δ). To avoid clutter, we will henceforth ignore δ and speak only of L(ǫ).
Theorem 1 [16] Suppose H has finite VC dimension d, and the learning problem is separable, with disagreement coefficient θ. Then

    L_CAL(ǫ) ≤ Õ(θ d log(1/ǫ)),

where the Õ notation suppresses terms logarithmic in d, θ, and log(1/ǫ).
A supervised learner would need Ω(d/ǫ) examples to achieve this guarantee, so active learning yields an exponential improvement when θ is finite: its label requirement scales as log 1/ǫ rather than 1/ǫ. And this is without any effort at finding maximally informative points!
In the nonseparable case, the label complexity also depends on the minimum achievable error within the hypothesis class.
Theorem 2 [12] With parameters as defined above,

    L_DHM(ǫ) ≤ Õ(θ (d log²(1/ǫ) + dν²/ǫ²)),

where ν = inf_{h∈H} err(h).
In this same setting, a supervised learner would require Ω(d/ǫ + dν/ǫ²) samples. If ν is small relative to ǫ, we again see an exponential improvement from active learning; otherwise, the improvement is only by a constant factor, roughly ν (the active term dν²/ǫ² is ν times the supervised term dν/ǫ²).
The second term in the label complexity is inevitable for nonseparable data.
Theorem 3 [5] Pick any hypothesis class with finite VC dimension d. Then there exists a distribution P over X × Y for which any active learner must incur a label complexity
    L(ǫ, 1/2) ≥ Ω(dν²/ǫ²),

where ν = inf_{h∈H} err(h).
The corresponding lower bound for supervised learning is dν/ǫ2.
2.5 The disagreement coefficient
We now define the leading constant in both label complexity upper bounds: the disagreement coefficient. To start with, the data distribution P induces a natural metric on the hypothesis class H: the distance between any two hypotheses is simply the P-mass of points on which they disagree. Formally, for any
h, h′ ∈ H, we define

    d(h, h′) = P[h(X) ≠ h′(X)],

and correspondingly, the closed ball of radius r around h is

    B(h, r) = {h′ ∈ H : d(h, h′) ≤ r}.
Now, suppose we are running either CAL or DHM, and that the current version space is some V ⊂ H.
Then the only points that will be queried are those that lie within the disagreement region

    DIS(V) = {x ∈ X : there exist h, h′ ∈ V with h(x) ≠ h′(x)}.
Figure 8: Left: Suppose the data lie in the plane, and that the hypothesis class consists of linear separators. The distance between any two hypotheses h∗ and h is the probability mass (under P) of the region on which they disagree. Middle: The thick line is h∗. The thinner lines are examples of hypotheses in B(h∗, r). Right: DIS(B(h∗, r)) might look something like this.
Figure 8 illustrates these notions.
If the minimum-error hypothesis is h∗, then after a certain amount of querying we would hope that the
version space is contained within B(h∗,r) for small-ish r. In which case, the probability that a random point from P would get queried is at most P(DIS(B(h∗, r))). The disagreement coefficient measures how this probability scales with r: it is defined to be
    θ = sup_{r>0} P[DIS(B(h∗, r))] / r.
Let’s work through an example. Suppose X = R and H consists of thresholds. For any two thresholds h < h′, the distance d(h,h′) is simply the P-mass of the interval [h,h′). If h∗ is the best threshold, then B(h∗,r) consists exactly of the interval I that contains: h∗; the segment to the immediate left of h∗ of P-mass r; and the segment to the immediate right of h∗ of P-mass r. The disagreement region DIS(B(h∗, r)) is this same interval I; and since it has mass 2r, it follows that the disagreement coefficient is 2.
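This calculation is easy to check numerically. The sketch below is an illustration under the same assumptions (P uniform over [0, 1], target threshold 0.5), with a finite grid of thresholds standing in for H:

    import numpy as np

    # P uniform over [0, 1]; H = thresholds on the line; target h* = 0.5
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 200_000)              # a large sample from P
    H = np.linspace(0, 1, 1001)                 # finite grid standing in for H
    dist = np.abs(H - 0.5)                      # d(h, h*) = P-mass between thresholds

    ratios = []
    for r in np.linspace(0.01, 0.5, 50):
        ball = H[dist <= r]                     # B(h*, r)
        # two thresholds lo < hi disagree exactly on [lo, hi)
        dis_mass = np.mean((x >= ball.min()) & (x < ball.max()))
        ratios.append(dis_mass / r)

    print(f"estimated disagreement coefficient: {max(ratios):.2f}")   # about 2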
Disagreement coefficients have been derived for various concept classes and data distributions of interest, including:
• Thresholds in R: θ = 2, as explained above.
• Homogeneous (through-the-origin) linear separators in R^d, with a data distribution P that is uniform
over the surface of the unit sphere [15]: θ ≤ √d.
• Linear separators in R^d, with a smooth data density bounded away from zero [14]: θ = c(h∗) · d, where
c(h∗) is some constant depending on the target hypothesis h∗.
2.6 Further work on mellow active learning
A more refined disagreement coefficient
When the disagreement coefficient θ is bounded, CAL and DHM offer better label complexity than supervised learning. But there are simple instances in which θ is arbitrarily large. For instance, suppose again that X = R but that the hypotheses consist of intervals:

    H = {h_{a,b} : a, b ∈ R},   where h_{a,b}(x) = +1 if a ≤ x ≤ b, and −1 otherwise.

Then the distance between any two hypotheses h_{a,b} and h_{a′,b′} is

    d(h_{a,b}, h_{a′,b′}) = P{x : x ∈ [a, b] ∪ [a′, b′], x ∉ [a, b] ∩ [a′, b′]} = P([a, b] ∆ [a′, b′]),
where S ∆ T denotes the symmetric set difference (S ∪ T) \ (S ∩ T). Now suppose the target hypothesis is some h_{α,β} with α ≤ β. If r > P[α, β] then B(h_{α,β}, r) includes all intervals of probability mass ≤ r − P[α, β]. Thus, if P is a density, the disagreement region of B(h_{α,β}, r) is all of X! Letting r approach P[α, β] from above, we see that θ is at least 1/P[α, β], which is unbounded as β gets closer to α.
A saving grace is that for smaller values r ≤ P[α, β], the hypotheses in B(h_{α,β}, r) are intervals intersecting h_{α,β}, and consequently the disagreement region has mass at most 4r. Thus there are two regimes in the active learning process for H: an initial phase in which the radius of uncertainty r is brought down to P[α, β], and a subsequent phase in which r is further decreased to O(ǫ). The first phase might be slow, but the second should behave as if θ = 4. Moreover, the dependence of the label complexity upon ǫ should arise entirely from the second phase. A series of recent papers [4, 14] analyzes such cases by loosening the definition of the disagreement coefficient from
    sup_{r>0} P[DIS(B(h∗, r))] / r     to     limsup_{r→0} P[DIS(B(h∗, r))] / r.
In the example above, the revised disagreement coefficient is 4.
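The two regimes are also easy to see numerically. The following sketch is an illustration, assuming P uniform over [0, 1] and a finite grid of intervals standing in for H, with a target interval of mass 0.04:

    import numpy as np

    # P uniform over [0, 1]; H = intervals; target h* = [0.48, 0.52], mass 0.04
    grid = np.linspace(0, 1, 101)
    hyps = [(a, b) for a in grid for b in grid if a < b]        # grid standing in for H
    x = np.linspace(0.0005, 0.9995, 1000)                       # dense probe of P

    labels = np.array([(x >= a) & (x <= b) for a, b in hyps])   # each hypothesis's +1 region
    target = (x >= 0.48) & (x <= 0.52)
    dist = np.mean(labels != target, axis=1)                    # d(h, h*) under uniform P

    for r in [0.01, 0.02, 0.03, 0.05, 0.10]:
        ball = labels[dist <= r]                                # predictions of B(h*, r)
        dis_mass = np.mean(ball.any(axis=0) & ~ball.all(axis=0))
        print(f"r = {r:.2f}   P[DIS]/r = {dis_mass / r:.1f}")
    # below the target's mass (0.04) the ratio hovers near 4; just above it,
    # tiny intervals far from h* enter the ball and the ratio jumps toward 1/r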
Other loss functions
DHM uses a supervised learner as a black box, and assumes that this subroutine truly returns a hypothesis minimizing the empirical error. However, in many cases of interest, such as high-dimensional linear separators, this minimization is NP-hard and is typically solved in practice by substituting a convex loss function in place of the 0–1 loss. Some recent work [5] develops a mellow active learning scheme for general loss functions, using importance weighting.
2.7 Illustrative experiments
We now show DHM at work in a few toy cases. The first is our recurring example of threshold functions on the line. Figure 9 shows the results of an experiment in which the (unlabeled) data is distributed uniformly over the interval [0, 1] and the target threshold is 0.5. Three types of noise distribution are considered:
• No noise: every point above 0.5 gets a label of +1 and the rest get a label of −1.
• Random noise: each data point’s label is flipped with a certain probability, 10% in one experiment and
20% in another. This is a benign form of noise.
• Boundary noise: the noisy labels are concentrated near the boundary (for more details, see [12]). The fraction of corrupted labels is 10% in one experiment and 20% in another. This kind of noise is challenging for active learning, because the learner’s region of disagreement gets progressively more noisy as it shrinks.
Each experiment has a stream of 10,000 unlabeled points; the figure shows how many of these are queried during the course of learning.
Figures 10 and 11 show similar results for the hypothesis class of intervals. The distribution of queries is initially all over the place, but eventually concentrates at the decision boundary of the target hypothesis. Although the behavior of this active learner seems qualitatively similar to that of the heuristics we described in the introduction, it differs in one crucial respect: here, all points get labeled to avoid bias in the training set.
The experiments so far are for one-dimensional data. There are two significant hurdles in scaling up the DHM algorithm to real data sets:
1. The version space H_t is defined using a generalization bound. Current bounds are tight only in a few special cases, such as small finite hypothesis classes and thresholds on the line. Otherwise they can be extremely loose, with the result that H_t ends up being larger than necessary, and far too many points get queried.
2. Its querying policy is not at all aggressive.
Figure 9: Here the data distribution is uniform over X = [0, 1] and H consists of thresholds on the line. The target threshold is at 0.5. We test five different noise models for the conditional distribution of labels.
Figure 10: The data distribution is uniform over X = [0, 1], and H consists of intervals. We vary the width of the target interval and the noise model for the conditional distribution of labels.
[Curves: target interval widths 0.1 and 0.2, each with no noise and with random noise 0.1; axes: number of data points seen versus number of queries.]
Figure 11: The distribution of queries for the experiment of Figure 10, with target interval [0.4, 0.6] and random noise of 0.1. The initial distribution of queries is shown above and the eventual distribution below.
The first problem is perhaps eroding with time, as better and better bounds are found. How well would CAL/DHM perform if this problem vanished altogether? That is, what are the limits of mellow active learning?
One way to investigate this question is to run DHM with the best possible bound. Recall the definition of the version space:
    H_{t+1} = {h ∈ H_t : err_t(h) ≤ err_t(h_t) + ∆_t},

where err_t(·) is the empirical error on the first t points, and h_t is the minimizer of err_t(·). Instead of obtaining ∆_t from large deviation theory, we can simply set it to the smallest value that retains the target hypothesis h∗ within the version space; that is, ∆_t = err_t(h∗) − err_t(h_t). This cannot be used in practice because err_t(h∗) is unknown; here we use it in a synthetic example to explore how effective mellow active learning could be if the problem of loose generalization bounds were to vanish.
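Here is a toy version of this idea for thresholds on the line, simplified in two ways: H is a finite grid, and, since the data is synthetic, empirical errors are tracked with the true labels rather than with DHM's queried-plus-inferred bookkeeping.

    import numpy as np

    rng = np.random.default_rng(1)
    H = np.linspace(0, 1, 501)            # grid of threshold hypotheses
    star = np.argmin(np.abs(H - 0.5))     # index of the target h* = 0.5
    noise = 0.1
    mistakes = np.zeros(len(H))           # running empirical errors err_t(h)
    queries = 0

    for t in range(10_000):
        x = rng.uniform()
        y = 1 if x >= H[star] else -1
        if rng.uniform() < noise:         # random classification noise
            y = -y

        # optimistic bound Delta_t = err_t(h*) - err_t(h_t): the version space
        # is exactly the set of h empirically no worse than the target
        V = H[mistakes <= mistakes[star]]
        if V.min() <= x < V.max():        # x lies in the disagreement region,
            queries += 1                  # so its label must be queried

        mistakes += (np.where(x >= H, 1, -1) != y)   # update err_t for all h

    print(f"{queries} queries out of 10000 points")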
Figure 12 shows the result of applying this optimistic bound to a ten-dimensional learning problem with two Gaussian classes. The hypotheses are linear separators and the best separator has an error of 5%. A stream of 500 data points is processed, and as before, the label complexity is impressive: less than 70 of these points get queried. The figure on the right presents a different view of the same experiment. It shows how the test error decreases over the first 50 queries of the active learner, as compared to a supervised learner (which asks for every point’s label). There is an initial phase where active learning offers no advantage; this is the time during which a reasonably good classifier, with about 8% error, is learned. Thereafter, active learning kicks in and after 50 queries reaches a level of error that the supervised learner will not attain until after about 500 queries. However, this final error rate is still 5%, and so the benefit of active learning is realized only during the last 3% decrease in error rate.
2.8 Where next
The single biggest open problem is to develop active learning algorithms with more aggressive querying strategies. This would appear to significantly complicate problems of sampling bias, and has thus proved tricky to analyze mathematically. Such schemes might be able to exploit their querying ability earlier in the learning process, rather than having to wait until a reasonably good classifier has already been obtained.
Next, we’ll turn to an entirely different view of active learning and present a scheme that is, in fact, able to realize early benefits.
Figure 12: Here X = R^10 and H consists of linear separators. Each class is Gaussian and the best separator has 5% error. Left: Queries made over a stream of 500 examples. Right: Test error for active versus supervised sampling.
3 Exploiting cluster structure in data
Figure 4, right, immediately brings to mind an active learning strategy that starts with a pool of unlabeled data, clusters it, asks for one label per cluster, and then takes these labels as representative of their entire respective clusters. Such a scheme is fraught with problems: (i) there might not be an obvious clustering, or alternatively, (ii) good clusterings might exist at many granularities (for instance, there might be a good partition into ten clusters but also into twenty clusters), or, worst of all, (iii) the labels themselves might not be aligned with the clusters.
What is needed is a scheme that does not make assumptions about the distribution of data and labels, but is able to exploit situations where there exist clusters that are fairly homogeneous in their labels. An elegant example of this is due to Zhu, Lafferty, and Ghahramani [20]. Their algorithm begins by imposing a neighborhood graph on an unlabeled data set S, and then works by locally propagating any labels it obtains, roughly as follows:
Pick a few points from S at random and get their labels
Repeat:
    Propagate labels to "nearby" unlabeled points in the graph
    Query in an "unknown" part of the graph
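A minimal sketch of the propagation step, in the spirit of [20] though not their exact algorithm (it assumes a kNN graph, clamps the queried labels, and repeatedly averages over neighbors):

    import numpy as np

    def propagate_labels(X, y, n_neighbors=10, n_iters=200):
        # X: (n, d) points; y: +1/-1 for queried points, 0 for unqueried ones
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise distances
        np.fill_diagonal(d2, np.inf)
        nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]       # kNN graph
        f = y.astype(float)
        queried = y != 0
        for _ in range(n_iters):
            f = f[nbrs].mean(axis=1)     # each node <- mean of neighbors' scores
            f[queried] = y[queried]      # clamp the queried labels
        return np.sign(f)                # propagated labels (0 if never reached)

Querying in an "unknown" part of the graph could then, for instance, target the point whose propagated score is closest to zero.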
This kind of nonparametric learner appears to have the usual problems of sampling bias, but differs from the approaches of the previous section, and has been studied far less. One recently-proposed algorithm, which we will call DH after its authors [11], attempts to capture the spirit of local propagation schemes while maintaining sound statistics on just how “unknown” different regions are; we now turn to it.
3.1 Three rules for querying
In the previous section the unlabeled points arrived in a stream, one at a time. Here we’ll assume we have all of them at the outset: a pool S ⊂ X . The entire learning process will then consist of requesting labels of points from S.
In the DH algorithm, the only guide in deciding where to query is cluster structure. But the clusters in use may change over time. Let C_t be the clustering operative at time t, while the tth query is being chosen. This C_t is some partition of S into groups; more formally,
    ∪_{C∈C_t} C = S, and any C, C′ ∈ C_t are either identical or disjoint.
To avoid biases induced by sampling, DH follows a simple rule for querying:
Rule 1. At any time t, the learner specifies a cluster C ∈ C_t and receives the label of a point
chosen uniformly at random from C (along with the identity of that point).
In other words, the learner is not allowed to specify exactly which point to label, but merely the cluster from which it will be randomly drawn. In the case where there are binary labels {−1, +1}, it helps to think of each cluster as a biased coin whose heads probability is simply the fraction of points within it with label +1. The querying process is then a toss of this biased coin. The scheme also works if the set of labels Y is some finite set, in which case samples from the cluster are modeled by a multinomial.
If the labeling budget runs out at time T, the current clustering C_T is used to assign labels to all of S: each point gets the majority label of its cluster.
[Diagram: a pool of points is partitioned into two clusters; the queried labels in the left cluster are all −, while the right cluster's queried labels are a mix of + and −.]
To control the error induced by this process, it is important to find clusters that are as homogeneous as possible in their labels. In the above example, we can be fairly sure of this. We have five random labels from the left cluster, all of which are −. Using a tail bound for the binomial distribution, we can obtain an interval (such as [0.8, 1.0]) in which the true bias of this cluster is very likely to lie. The DH algorithm makes heavy use of such confidence intervals.
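For instance, a textbook binomial tail bound (Clopper–Pearson here, standing in for the sharper bounds detailed in [11]) converts the queries seen in a cluster into a confidence interval for its bias:

    from scipy.stats import beta

    def bias_interval(k, n, delta=0.05):
        # Clopper-Pearson interval for a cluster's bias (fraction of +1 labels),
        # given k positives among n labels drawn uniformly from the cluster
        lo = 0.0 if k == 0 else beta.ppf(delta / 2, k, n - k + 1)
        hi = 1.0 if k == n else beta.ppf(1 - delta / 2, k + 1, n - k)
        return lo, hi

    # five random draws, all negative: with 95% confidence the fraction of
    # negative labels in the cluster is at least about 0.48
    print(bias_interval(0, 5))   # interval for the +1 bias: (0.0, ~0.52)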
If the current clustering C has a cluster that is very mixed in its labels, then this cluster needs to be split further, to get a new clustering C′:
[Diagram: the mixed cluster on the right is split into two new clusters, one almost entirely − and one almost entirely +.]
The lefthand cluster of C can be left alone (for the time being), since the one on the right is clearly more troublesome. A fortunate consequence of Rule 1 is that the queries made to C can be reused for C′: a random label in the righthand cluster of C is also a random label for the new cluster in which it falls.
Thus the clustering of the data changes only by splitting clusters.
Rule 2. Pick any two times t′ > t. Then C_{t′} must be a refinement of C_t, that is,
    for all C′ ∈ C_{t′}, there exists some C ∈ C_t such that C′ ⊂ C.
[Tree diagram: root node 1 with children 2 and 3; below them, nodes 4 through 9; the displayed cluster masses are 45%, 5%, 5%, and 45%.]
Figure 13: The top few nodes of a hierarchical clustering.
As in the example above, the nested structure of clusters makes it possible to re-use labels when the clustering changes. If at time t the querying process yields (x, y) for some x ∈ C ∈ C_t, then later, at t′ > t, this same (x, y) is reusable as a random draw from the C′ ∈ C_{t′} to which x belongs.
The final rule imposes a constraint on the manner in which a clustering is refined.
Rule 3. When a cluster is split to obtain a new clustering, Ct → Ct+1, the manner of split cannot depend upon the labels seen.
This avoids complicated dependencies. The upshot of it is that we might as well start off with a hierarchical clustering of S, set C1 to the root of the clustering, and gradually move down the hierarchy, as needed, during the querying process.
3.2 An illustrative example
Given the three rules for querying, the specification of the DH cluster-based learner is quite intuitive. Let’s get a sense of its behavior by revisiting the example in Figure 3 that presented difficulties for many active learning heuristics.
DH would start with a hierarchical clustering of this data. Figure 13 shows how it might look: only the top few nodes of the hierarchy are depicted, and their numbering is arbitrary. At any given time, the learner works with a particular partition of the data set, given by a pruning of the tree. Initially, this is just {1}, a single cluster containing everything. Random points are drawn from this cluster and their labels are queried. Suppose one of these points, x, lies in the rightmost group. Then it is a random sample from node 1 of the hierarchy, but also from nodes 3 and 9. Based on such random samples, each node of the tree maintains counts of the positive and negative instances seen within it. A few samples reveal that the top node 1 is very mixed while nodes 2 and 3 are substantially more pure. Once this transpires, the partition {1} is replaced by {2, 3}. Subsequent random samples are chosen from either 2 or 3, according to a sampling strategy favoring the less-pure node. A few more queries down the line, the pruning is refined to {2, 4, 9}. This is when the benefits of the partitioning scheme become most obvious; based on the samples seen, it is concluded that cluster 9 is (almost) pure, and thus (almost) no more queries are made from it until the rest of the space has been partitioned into regions that are similarly pure.
The querying can be stopped at any stage; then, each cluster in the current partition is assigned the majority label of the points queried from it. In this way, DH labels the entire data set, while trying to keep the number of induced erroneous labels to a minimum. If desired, the labels can be used for a subsequent round of supervised learning, with any learning algorithm and hypothesis class.
P ← {root}  (current pruning of tree)
L(root) ← 1  (arbitrary starting label for root)
For t = 1, 2, ... (until the budget runs out):
    Repeat B times:
        v ← select(P)
        Pick a random point z from subtree T_v
        Query z's label
        Update counts for all nodes u on path from z to v
    In a bottom-up pass of T, compute bound(u) for all nodes u ∈ T
    For each (selected) v ∈ P:
        Let (P′, L′) be the pruning and labeling of T_v minimizing bound(v)
        P ← (P \ {v}) ∪ P′
        L(u) ← L′(u) for all u ∈ P′
For each cluster v ∈ P:
    Assign each point in T_v the label L(v)
Figure 14: The DH cluster-adaptive active learning algorithm.
3.3 The learning algorithm
Figure 14 shows the DH active learning algorithm. Its input consists of a hierarchical clustering T whose leaves are the n data points, as well as a batch size B. This latter quantity specifies how many queries will be asked per iteration: it is frequently more convenient to make small batches of queries at a time rather than single queries. At the end of any iteration, the algorithm can be made to stop and output a labeling of all the data.
The pseudocode needs some explaining. The subtree rooted at a node u ∈ T is denoted T_u. At any given time, the algorithm works with a particular pruning P ⊂ T, a set of nodes such that the subtrees (clusters) {T_u : u ∈ P} are disjoint and contain all the leaves. Each node u of the pruning is assigned a label L(u), with the intention that if the algorithm were to be abruptly stopped, all of the data points in T_u would be given this label. The resulting mislabeling error within T_u (that is, the fraction of T_u whose true label isn't L(u)) can be upper-bounded using the samples seen so far; bound(u) is a conservative such estimate. These bounds can be computed directly from Hoeffding's inequality, or better still, from the tails of the binomial or multinomial; details are in [11]. Computationally, what matters is that after each batch of queries, all the bounds can be updated in a linear-time bottom-up pass through the tree, at which point the pruning P might be refined.
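Here is a sketch of that bottom-up pass, under simplifying assumptions: binary labels, a Hoeffding-style deviation term in place of the bounds of [11], and a tree encoded by a children array in which every child's index exceeds its parent's.

    import math

    def bottom_up_bounds(children, weight, pos, n, delta=0.05):
        # children[u]: list of u's children (empty at leaves); weight[u]: number
        # of data points in T_u; pos[u], n[u]: positives and total among the
        # labels queried so far from T_u.
        m = len(children)
        bound = [0.0] * m   # points mislabeled if T_u gets its majority label
        best = [0.0] * m    # best bound achievable by any pruning of T_u
        for u in reversed(range(m)):          # children before parents
            if n[u] == 0:
                bound[u] = weight[u]          # nothing known: all could be wrong
            else:
                p = pos[u] / n[u]             # observed fraction of +1 labels
                dev = math.sqrt(math.log(2 / delta) / (2 * n[u]))  # Hoeffding slack
                bound[u] = weight[u] * min(1.0, min(p, 1 - p) + dev)
            best[u] = bound[u]
            if children[u]:                   # splitting can only lower the bound
                best[u] = min(best[u], sum(best[c] for c in children[u]))
        return bound, best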
Two things remain unspecified: the manner in which the hierarchical clustering is built and the procedure select, which picks a cluster to query. We discuss these next, but regardless of how these decisions are made, the DH algorithm has a basic statistical soundness: it always maintains valid estimates of the error induced by its current pruning, and it refines prunings to drive down the error. This leaves a lot of flexibility to explore different clustering and sampling strategies.
The select procedure
The select(P ) procedure controls the selective sampling. There are many choices for how to do this:
1. Choose v ∈ P with probability ∝ w_v, where w_v denotes the fraction of the data lying in T_v. This is similar to random sampling.
2. Choose v ∈ P with probability ∝ w_v · bound(v). This is an active learning rule that reduces sampling in regions of the space that have already been observed to be fairly pure in their labels.
3. For each subtree (T_z, z ∈ P), find the observed majority label, and assign this label to all points in the subtree; fit a classifier h to this data; and choose v ∈ P with probability ∝ min{|{x ∈ T_v : h(x) = +1}|, |{x ∈ T_v : h(x) = −1}|}.
This biases sampling towards regions close to the current decision boundary.
Innumerable variations of the third strategy are possible. Such schemes have traditionally suffered from consistency problems (recall Figure 3), for instance because entire regions of space are overconfidently over- looked. The DH framework relieves such concerns because there is always an accurate bound on the error induced by the current labeling.
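For concreteness, a sketch of select under the second strategy (w and bound are assumed to be arrays indexed by tree nodes, as in the previous sketch):

    import numpy as np

    def select(pruning, w, bound, rng=None):
        # Choose a cluster v from the current pruning with probability
        # proportional to w[v] * bound[v] (strategy 2 above).
        rng = rng or np.random.default_rng()
        scores = np.array([w[v] * bound[v] for v in pruning], dtype=float)
        if scores.sum() == 0:                 # every cluster already looks pure
            scores += 1.0                     # fall back to a uniform choice
        return pruning[rng.choice(len(pruning), p=scores / scores.sum())]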
Building a hierarchical clustering
DH works best when there is a pruning P of the tree such that |P| is small and a significant fraction of its constituent clusters are almost-pure. There are several ways in which one might try to generate such a tree. The first option is simply to run a standard hierarchical agglomerative clustering algorithm, such as Ward's method. If domain knowledge is available in the form of a specialized distance function, this can be used for the clustering; for instance, if the data is believed to lie on a low-dimensional manifold, a distance function can be generated by running Dijkstra's algorithm on a neighborhood graph, as in Isomap [18]. A third option is to use a small set of labeled data to guide the construction of the hierarchical clustering, by providing soft constraints.
Label complexity
A rudimentary label complexity result for this model is proved in [11]: if the provided hierarchical clustering contains a pruning P whose clusters are ǫ-pure in their labels, then the learner will find a labeling that is O(ǫ)-pure with O(|P|d(P)/ǫ) labels, where d(P) is the maximum depth of a node in P.
3.4 An illustrative experiment
The MNIST data set (http://yann.lecun.com/exdb/mnist/) is a widely-used benchmark for the multiclass problem of handwritten digit recognition. It contains over 60,000 images of digits, each a vector in R^784. We began by extracting 10,000 training images and 2,000 test images, divided equally between the digits.
The first step in the DH active learning scheme is to hierarchically cluster the training data, bearing in mind that the efficacy of the subsequent querying process depends heavily on how well the clusters align with true classes. Previous work has found that the use of specialized distance functions such as tangent distance or shape context dramatically improves nearest-neighbor classification performance on this data set; the most sensible thing to do would therefore be to cluster using one of these distance measures. Instead, we tried our luck with Euclidean distance and a standard hierarchical agglomerative clustering procedure called Ward’s algorithm [19]. (Briefly, this algorithm starts with each data point in its own cluster, and then repeatedly merges pairs of clusters until there is a single cluster containing all the data. The merger chosen at each point in time is that which occasions the smallest increase in k-means cost.)
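With scipy, building such a tree is a single call; the sketch below uses random data as a stand-in for the 10,000 MNIST training vectors.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # random stand-in for the (10000, 784) array of MNIST training vectors
    X = np.random.default_rng(0).normal(size=(1000, 784))

    # Ward's method: repeatedly merge the pair of clusters whose union causes
    # the smallest increase in k-means cost. Z is an (n-1, 4) merge table
    # encoding the binary tree: leaves are data points, rows internal nodes.
    Z = linkage(X, method="ward")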
It is useful to look closely at the resulting hierarchical clustering, a tree with 10,000 leaves. Since there are ten classes (digits), the best possible scenario would be one in which there were a pruning consisting of ten nodes (clusters), each pure in its label. The worst possible scenario would be if purity were achieved only at the leaves – a pruning with 10,000 nodes. Reality thankfully falls much closer to the former extreme. Figure 15, left, relates the size of a pruning to the induced mislabeling error, that is, the fraction of points misclassified when each cluster is assigned its most-frequent label. For instance, the entry (50, 0.12) means that there exists a pruning with 50 nodes such that the induced label error is just 12%. This isn’t too bad given the number of classes, and bodes well for active learning.
A good pruning exists relatively high in the tree; but does the querying process find it or something comparable? The analysis shows that it must, using a number of queries roughly proportional to the number
[Left panel: number of clusters (x-axis) versus fraction of labels incorrect (y-axis). Right panel: number of labels (x-axis) versus test error (y-axis), with curves for random and active sampling.]
Figure 15: Results on OCR data. Left: Errors of the best prunings in the OCR digits tree. Right: Test error curves on classification task.
of clusters. As empirical corroboration, we ran the hierarchical sampler ten times, and on average 400 queries were needed to discover a pruning of error rate 12% or less.
So far we have only talked about error rates on the training set. We can complete the picture by using the final labeling from the sampling scheme as input to a supervised learner (logistic regression with l2 regularization, the trade-off parameter chosen by 10-fold cross validation). A good baseline for comparison is the same experiment but with random instead of active sampling. Figure 15, right, shows the resulting learning curves: the tradeoff between the number of labels and the error rate on the held-out test set. The initial advantage of cluster-adaptive sampling reflects its ability to discover and subsequently ignore relatively pure clusters at the onset of sampling. Later on, it is left sampling from clusters of easily confused digits (the prime culprits being 3’s, 5’s, and 8’s).
3.5 Where next
The DH algorithm attempts to capture the spirit of nonparametric local-propagation active learning, while maintaining reliable statistics about what it does and doesn’t know. In the process, however, it loses some flexibility in its sampling strategy – it would be useful to be able to relax this.
On the theory front, the label complexity of DH needs to be better understood. At present there is no clean characterization of when it does better than straight random sampling, and by how much. Among other things, this would help in choosing the querying rule (that is, the procedure select).
A very general way to describe this cluster-based active learner is to say it uses unlabeled data to create a data-dependent hypothesis class that is particularly amenable to adaptive sampling. This broad approach to active learning might be worth studying systematically.
Acknowledgements
The author is grateful to his collaborators – Alina Beygelzimer, Daniel Hsu, Adam Kalai, John Langford, and Claire Monteleoni – and for the support of the National Science Foundation under grant IIS-0713540. Two anonymous reviewers gave very helpful feedback on the first draft of this paper.
S = ∅ (points with inferred labels)
T = ∅ (points with queried labels)
For t = 1, 2, . . .:
    Receive x_t
    If (h_{+1} = learn(S ∪ {(x_t, +1)}, T)) fails:
        Add (x_t, −1) to S and break
    If (h_{−1} = learn(S ∪ {(x_t, −1)}, T)) fails:
        Add (x_t, +1) to S and break
    If err(h_{−1}, S ∪ T) − err(h_{+1}, S ∪ T) > ∆_t:
        Add (x_t, +1) to S and break
    If err(h_{+1}, S ∪ T) − err(h_{−1}, S ∪ T) > ∆_t:
        Add (x_t, −1) to S and break
    Request y_t and add (x_t, y_t) to T
Figure 16: The DHM selective sampling algorithm. Here, err(h, A) = (1/|A|) Σ_{(x,y)∈A} 1(h(x) ≠ y). A possible setting for ∆_t is shown in Equation 1. At any time, the current hypothesis is learn(S, T).
4 Appendix: the DHM algorithm
For technical reasons, the DHM algorithm (Figure 16) makes black-box calls to a special type of supervised learner: for A, B ⊂ X × {±1},
learn(A,B) returns a hypothesis h ∈ H consistent with A, and with minimum error on B. If there is no hypothesis consistent with A, a failure flag is returned.
For some simple hypothesis classes like intervals on the line, or rectangles in R2, it is easy to construct such a learner. For more complex classes like linear separators, the main bottleneck is the hardness of minimizing the 0 − 1 loss on B (that is, the hardness of agnostic supervised learning).
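For thresholds on the line, such a learner takes only a few lines; the sketch below (with a finite grid standing in for H) returns None as the failure flag.

    import numpy as np

    def learn(A, B):
        # A, B: lists of (x, y) pairs, y in {-1, +1}; hypotheses are
        # h_w(x) = +1 iff x >= w, for w on a fixed grid standing in for H
        grid = np.linspace(0, 1, 1001)
        def err(w, data):
            return sum((1 if x >= w else -1) != y for x, y in data)
        consistent = [w for w in grid if err(w, A) == 0]
        if not consistent:
            return None                       # the failure flag
        return min(consistent, key=lambda w: err(w, B))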
During the learning process, DHM labels every point but divides them into two groups: those whose labels are queried (denoted T in the figure) and those whose labels are inferred (denoted S). Points in S are subsequently used as hard constraints; with high probability, all these labels are consistent with the target hypothesis h∗.
The generalization bound ∆_t can be set to

    ∆_t = β_t² + β_t (√err(h_{+1}, S ∪ T) + √err(h_{−1}, S ∪ T)),   with β_t = √(C (d log t + log(1/δ)) / t),   (1)

where C is a universal constant, d is the VC dimension of the class H, and δ is the overall permissible failure probability.
References
[1] D. Angluin. Queries revisited. In Proceedings of the Twelfth International Conference on Algorithmic Learning Theory, pages 12–31, 2001.
[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In International Conference on Machine Learning, 2006.
[3] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Conference on Learning Theory, 2007.
[4] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In Proceedings of the 21st Annual Conference on Learning Theory, 2008.
[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In International Conference on Machine Learning, 2009.
[6] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture Notes in Artificial Intelligence, 3176:169–207, 2004.
[7] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.
[8] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Worst-case analysis of selective sampling for linear-threshold algorithms. In Advances in Neural Information Processing Systems, 2004.
[9] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[10] S. Dasgupta. Coarse sample complexity bounds for active learning. In Neural Information Processing Systems, 2005.
[11] S. Dasgupta and D.J. Hsu. Hierarchical sampling for active learning. In International Conference on Machine Learning, 2008.
[12] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Neural Information Processing Systems, 2007.
[13] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133–168, 1997.
[14] E. Friedman. Active learning for smooth problems. In Conference on Learning Theory, 2009.
[15] S. Hanneke. A bound on the label complexity of agnostic active learning. In International Conference
on Machine Learning, 2007.
[16] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, CMU Machine Learning Department, 2009.
[17] H. Schütze, E. Velipasaoglu, and J. Pedersen. Performance thresholding in practical text classification.
In ACM International Conference on Information and Knowledge Management, 2006.
[18] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality
reduction. Science, 290(5500):2319–2323, 2000.
[19] J.H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical
Association, 58:236–244, 1963.
[20] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML Workshop on the Continuum from Labeled to Unlabeled Data, 2003.