Knows What It Knows: A Framework For Self-Aware Learning
Lihong Li lihong@cs.rutgers.edu
Michael L. Littman mlittman@cs.rutgers.edu
Thomas J. Walsh thomaswa@cs.rutgers.edu
Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA
Abstract
We introduce a learning framework that
combines elements of the well-known PAC
and mistake-bound models. The KWIK
(knows what it knows) framework was de-
signed particularly for its utility in learning
settings where active exploration can impact
the training examples the learner is exposed
to, as is true in reinforcement-learning and
active-learning problems. We catalog several
KWIK-learnable classes and open problems.
1. Motivation
At the core of recent reinforcement-learning algo-
rithms that have polynomial sample complexity guar-
antees (Kearns & Singh, 2002; Kearns & Koller, 1999;
Kakade et al., 2003; Strehl et al., 2007) lies the idea
of distinguishing between instances that have been
learned with sufficient accuracy and those whose out-
puts are still unknown.
The Rmax algorithm (Brafman & Tennenholtz, 2002),
for example, estimates transition probabilities for each
state–action–next-state triple of a Markov decision
process (MDP). The estimates are made separately,
as licensed by the Markov property, and the accuracy
of the estimate is bounded using Hoeffding bounds.
The algorithm explicitly distinguishes between proba-
bilities that have been estimated accurately (known)
and those for which more experience will be needed
(unknown). By encouraging the agent to gather more
experience in the unknown states, Rmax can guaran-
tee a polynomial bound on the number of timesteps in
which it has a non-near-optimal policy (Kakade, 2003).
In this paper, we make explicit the properties that are
sufficient for a learning algorithm to be used in efficient
Figure 1. A cost-vector navigation graph.
exploration algorithms like Rmax. Roughly, the learn-
ing algorithm needs to make only accurate predictions,
although it can opt out of predictions by saying “I
don’t know” (⊥). However, there must be a (polyno-
mial) bound on the number of times the algorithm can
respond ⊥. We call such a learning algorithm KWIK ("knows what it knows").
Section 2 provides a motivating example and sketches
possible uses for KWIK algorithms. Section 3 defines
the KWIK conditions more precisely and relates them
to established models from learning theory. Sections 4
and 5 survey a set of hypothesis classes for which
KWIK algorithms can be created.
2. A KWIK Example
Consider the simple navigation task in Figure 1. There
is a set of nodes connected by edges, with the node on
the left as the source and the dark one on the right
as the sink. Each edge in the graph is associated with
a binary cost vector of dimension d = 3, indicated in
the figure. The cost of traversing an edge is the dot
product of its cost vector with a fixed weight vector
w = [1, 2, 0]. Assume that w is not known to the agent,
but the graph topology and all cost vectors are. In each
episode, the agent starts from the source and moves
along some path to the sink. Each time it crosses an
edge, the agent observes its true cost. Once the sink
is reached, the next episode begins. The learning task is to take a non-cheapest path in as few episodes as possible; that is, the agent should minimize the number of episodes in which it fails to take the cheapest path. There are 3 distinct paths in this example.
Given the w above, the top has a cost of 12, the middle
13, and the bottom 14.
A simple approach for this task is for the agent to
assume edge costs are uniform and walk the shortest
(middle) path to collect data. It would gather four exam-
ples of [1, 1, 1] → 3 and one of [1, 0, 0] → 1. Standard
regression algorithms could use this dataset to find a
ŵ that fits this data. Here, ŵ = [1, 1, 1] is a natural
choice. The learned weight vector could then be used
to estimate costs for the three paths: 14 for the top,
13 for the middle, 14 for the bottom. Using these es-
timates, an agent would continue to take the middle
path forever, never realizing it is not optimal.
In contrast, consider a learning algorithm that “knows
what it knows”. Instead of creating an approximate
weight vector ŵ, it reasons about whether the costs
for each edge can be obtained from the available data.
The middle path, since we’ve seen all its edge costs,
is definitely 13. The last edge of the bottom path has
cost vector [0, 0, 0], so its cost must be zero, but the
penultimate edge of this path has cost vector [0, 1, 1].
This vector is a linear combination of the two observed
cost vectors, so, regardless of w,
w·[0, 1, 1] = w·([1, 1, 1]−[1, 0, 0]) = w·[1, 1, 1]−w·[1, 0, 0],
which is just 3 − 1 = 2. Thus, we know the bottom
path’s cost is 14—worse than the middle path.
The vector [0, 0, 1] on the top path is linearly inde-
pendent of the cost vectors we’ve seen, so its cost is
unconstrained. We know we don’t know. A safe thing
to assume provisionally is that it’s zero, encouraging
the agent to try the top path in the second episode.
Now, it observes [0, 0, 1] → 0, allowing it to solve for
w and accurately predict the cost for any vector (the training data spans ℝ³).
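To make this reasoning concrete, below is a minimal Python sketch of a KWIK learner for deterministic linear functions, in the spirit of this example. The class name and the predict/learn interface are our own illustrative choices, with None standing in for ⊥: the learner answers only when the query lies in the span of the inputs observed so far.

import numpy as np

class DeterministicLinearRegression:
    # Sketch of the "knows what it knows" reasoning from this section:
    # answer w.x only when x is determined by the observed data.
    def __init__(self):
        self.X, self.y = [], []          # observed cost vectors and costs

    def predict(self, x):
        x = np.asarray(x, float)
        if self.X:
            A = np.array(self.X)
            # x is a combination of observed inputs iff appending it to A
            # does not increase the rank
            if (np.linalg.matrix_rank(np.vstack([A, x]))
                    == np.linalg.matrix_rank(A)):
                coeffs = np.linalg.lstsq(A.T, x, rcond=None)[0]
                return float(coeffs @ self.y)   # same combination of outputs
        return None                      # bot: the data do not pin down w.x

    def learn(self, x, y):
        self.X.append(np.asarray(x, float))
        self.y.append(float(y))

On the data above, learn([1,1,1], 3) and learn([1,0,0], 1) let the learner return 2.0 for [0,1,1] while still answering None (⊥) for the linearly independent [0,0,1].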
If |L̂| > 1, two hypotheses in the version space disagree. In this case, the algorithm returns ⊥ and receives the true label y. It then computes an updated version space
Ĥ′ = {h | h ∈ Ĥ ∧ h(x) = y}.
Because |L̂| > 1, there must be some h ∈ Ĥ such that h(x) ≠ y. Therefore, the new version space must be smaller: |Ĥ′| ≤ |Ĥ| − 1. Before the next input is received, the version space is updated: Ĥ := Ĥ′.
If |Ĥ| = 1 at any point, |L̂| = 1, and the algorithm will
no longer return ⊥. Therefore, |H|−1 is the maximum
number of ⊥s the algorithm can return.
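As an illustration, a minimal Python sketch of the enumeration algorithm might look as follows; the class name and the predict/learn interface are illustrative assumptions, with None standing in for ⊥ and hypotheses represented as callables.

class EnumerationLearner:
    # Sketch of the enumeration algorithm for a finite hypothesis class
    # with deterministic labels.
    def __init__(self, hypotheses):
        self.version_space = list(hypotheses)  # H-hat: still-consistent hypotheses

    def predict(self, x):
        labels = {h(x) for h in self.version_space}
        if len(labels) == 1:
            return labels.pop()          # every surviving hypothesis agrees
        return None                      # bot: the version space disagrees

    def learn(self, x, y):
        # called after a bot; keep only hypotheses consistent with y
        self.version_space = [h for h in self.version_space if h(x) == y]

Each ⊥ is followed by an update that removes at least one hypothesis, which is exactly the |H| − 1 bound argued above.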
Example 1 You own a bar that is frequented by a
group of n patrons P . There is one patron f ∈ P who
is an instigator—whenever a group of patrons G ⊆ P is in the bar, if f ∈ G, a fight will break out. However,
there is another patron p ∈ P , who is a peacemaker.
If p is in the group, p will prevent a fight, even if f is present.
You want to predict whether a fight will break out among a subset of patrons, initially without knowing the identities of f and p. The input space is X = 2^P and the output space is Y = {fight, no fight}.
The memorization algorithm achieves a KWIK bound of 2^n for this problem, since it may have to see each possible subset of patrons. However, the enumeration algorithm can KWIK learn this hypothesis class with a bound of n(n − 1) since there is one hypothesis for each possible assignment of patrons to the roles f and p. Each time it reports ⊥, it is able to rule out at least one possible instigator–peacemaker combination.
4.2. Real-valued Functions
The previous two examples exploited the finiteness of
the hypothesis class and input space. KWIK bounds
can also be achieved when these sets are infinite.
Algorithm 3 Define X = ℝ², Y = ℝ, and H = {f | f : X → Y, f(x) = ‖x − c‖₂, c ∈ ℝ²}.
That is, there is an unknown point and the target func-
tion maps input points to the distance from the un-
known point. The planar-distance algorithm can learn
in this hypothesis class with a KWIK bound of 3.
Figure 3. Schematic of behavior of the planar-distance algorithm after the first (a), second (b), and third (c) time it returns ⊥.
The algorithm proceeds as follows, illustrated in Fig-
ure 3. First, given initial input x, the algorithm says
⊥ and receives output y. Since y is the distance be-
tween x and some unknown point c, we know c must
lie on the circle illustrated in Figure 3(a). (If y = 0,
then c = x.) Let’s call this input–output pair x1, y1.
The algorithm will return y1 for any future input that
matches x1. Otherwise, it will need to return ⊥ and
will obtain a new input–output pair x, y, as shown in
Figure 3(b). They become x2 and y2.
Now, the algorithm can narrow down the location of c
to the two hatch-marked points. In spite of this ambi-
guity, for any input on the dark diagonal line the algo-
rithm will be able to return the correct distance—all
these points are equidistant from the two possibilities.
The two circles must intersect, assuming the target hypothesis is in H. (They can also intersect at exactly one point, if the circles are tangent, in which case the algorithm can identify c unambiguously.)
Once an input x is received that is not collinear with
x1 and x2, the algorithm returns ⊥ and obtains an-
other x, y pair, as illustrated in Figure 3(c). Finally,
since the three circles share at most one common point,
the algorithm can identify the location of c and use
it to correctly answer any future query. Thus, three
⊥s suffice for KWIK learning in this setting. The al-
gorithm generalizes to d-dimensional versions of the
setting with a KWIK bound of d + 1.
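The following Python sketch implements the planar-distance reasoning under the assumptions above (noise-free observations, target in H); the names and interface are illustrative, with None for ⊥. It answers on repeated inputs, on inputs collinear with two stored samples (the mirror-image candidate centers agree there), and everywhere once the center is pinned down.

import numpy as np

class PlanarDistanceLearner:
    # Sketch of the planar-distance algorithm: the target is f(x) = ||x - c||
    # for an unknown point c in the plane.
    def __init__(self, tol=1e-9):
        self.samples = []                # observed (x_i, y_i) pairs
        self.tol = tol

    def _candidate_centers(self):
        # Centers consistent with the samples; None while still infinite.
        if len(self.samples) < 2:
            return None
        x1, y1 = self.samples[0]
        # each extra sample gives a linear equation 2(x1 - xi).c = b_i
        A = np.array([2 * (x1 - xi) for xi, _ in self.samples[1:]])
        b = np.array([x1 @ x1 - xi @ xi - y1 ** 2 + yi ** 2
                      for xi, yi in self.samples[1:]])
        if np.linalg.matrix_rank(A) == 2:    # non-collinear samples: unique c
            return [np.linalg.lstsq(A, b, rcond=None)[0]]
        # rank 1: intersect the line a.c = b0 with the circle ||c - x1|| = y1,
        # giving (up to) two mirror-image candidates
        a, b0 = A[0], b[0]
        p = a * b0 / (a @ a)                 # a point on the line
        d = np.array([-a[1], a[0]]) / np.linalg.norm(a)   # line direction
        q = p - x1
        disc = (q @ d) ** 2 - (q @ q - y1 ** 2)
        if disc < 0:
            return []                        # cannot happen if the target is in H
        r = np.sqrt(disc)
        return [p + (-(q @ d) + r) * d, p + (-(q @ d) - r) * d]

    def predict(self, x):
        x = np.asarray(x, float)
        for xi, yi in self.samples:
            if np.allclose(x, xi):
                return yi                    # repeated input: answer is known
        cands = self._candidate_centers()
        if not cands:
            return None                      # bot
        dists = [np.linalg.norm(x - c) for c in cands]
        if max(dists) - min(dists) <= self.tol:
            return dists[0]                  # all candidate centers agree at x
        return None                          # bot: candidates disagree at x

    def learn(self, x, y):
        self.samples.append((np.asarray(x, float), float(y)))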
Algorithm 3 illustrates a number of important points.
First, since learners have no control over their inputs
in the KWIK setting, they must be robust to degen-
erate inputs such as inputs that lie precisely on a line.
Second, they can often return valid answers for some
inputs even before they have learned the target func-
tion over the entire input space.
4.3. Noisy Observations
Up to this point, observations have been noise free.
Next, we consider the simplest noisy KWIK learning problem: the Bernoulli case.
Algorithm 4 The coin-learning algorithm can accurately predict the probability that a biased coin will come up heads given Bernoulli observations with a KWIK bound of B(ε, δ) = (1/(2ε²)) ln(2/δ) = O((1/ε²) ln(1/δ)).
We have a biased coin whose unknown probability of heads is p. In the notation of this paper, |X| = 1, Y = [0, 1], and Z = {0, 1}. We want to learn an estimate p̂ that is accurate (|p̂ − p| ≤ ε) with high probability (1 − δ).
If we could observe p, then this problem would be triv-
ial: Say ⊥ once, observe p, and let p̂ = p. The KWIK
bound is thus 1. Now, however, observations are noisy.
Instead of observing p, we see either 1 (with probabil-
ity p) or 0 (with probability 1− p).
Each time the algorithm says ⊥, it gets an independent trial that it can use to compute p̂ = (1/T) Σ_{t=1}^{T} zt, where zt ∈ Z is the t-th observation in T trials. The number of trials needed before we are 1 − δ certain our estimate is within ε can be computed using a Hoeffding bound:
T ≥ (1/(2ε²)) ln(2/δ) = O((1/ε²) ln(1/δ)).
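A direct sketch of the coin-learning algorithm follows; names are illustrative, None stands in for ⊥, and the input argument is ignored since |X| = 1.

import math

class CoinLearner:
    # Sketch of Algorithm 4: estimate Pr(heads) from 0/1 observations.
    def __init__(self, epsilon, delta):
        # Hoeffding: T >= (1/(2 eps^2)) ln(2/delta) samples suffice
        self.T = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
        self.flips = []                  # observations gathered after each bot

    def predict(self, x=None):
        if len(self.flips) < self.T:
            return None                  # bot: fewer than T observations so far
        return sum(self.flips) / len(self.flips)

    def learn(self, x=None, z=0):
        self.flips.append(z)

For example, ε = 0.1 and δ = 0.05 give T = ⌈ln(40)/0.02⌉ = 185 ⊥s before the first numeric prediction.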
Algorithm 5 Define X = [...]

[...] When the predictions of the surviving subalgorithms disagree, the union algorithm reports ⊥ because at least one of the algorithms has learned the wrong hypothesis and it needs to know which.
Any algorithm that made a prediction other than y or
⊥ is “killed”—removed from consideration. That is,
Â′ = {i|i ∈ Â ∧ (ŷi = ⊥ ∨ ŷi = y)}.
On each input for which the union algorithm reports ⊥, either one of the subalgorithms reported ⊥ (at most Σᵢ Bᵢ(ε, δ) times) or two algorithms disagreed and at least one was removed from Â (at most k − 1 times). The KWIK bound follows from these facts.
Example 2 Let X = Y = ℝ. Now, define H1 = {f | f(x) = |x − c|, c ∈ ℝ}. That is, each function in H1 maps x to its distance from some unknown point c. We can learn H1 with a KWIK bound of 2 using a 1-dimensional version of Algorithm 3. Next, define H2 = {f | f(x) = mx + b, m ∈ ℝ, b ∈ ℝ}. That is, H2 is the set of lines. We can learn H2 with a KWIK bound of 2 using the regression algorithm in Section 2.
Finally, define H = H1 ∪ H2, the union of these two
classes. We can use Algorithm 6 to KWIK learn H.
Assume the first input is x1 = 2. The union algorithm asks the learners for H1 and H2 for the output for x1; neither has any idea, so the union algorithm returns ⊥ and receives the feedback y1 = 2, which it passes to the subalgorithms. The next input is x2 = 8. The learners for H1 and H2 still don't have enough information, so the union algorithm returns ⊥ and sees y2 = 4, which it passes to the subalgorithms.
Next, x3 = 1. Now, the learner for H1 unambiguously
computes c = 4, because that’s the only interpretation
consistent with the first two examples (|2 − 4| = 2,
|8 − 4| = 4), so it returns |1 − 4| = 3. On the other
hand, the learner for H2 unambiguously computes m =
1/3 and b = 4/3, because that’s the only interpretation
consistent with the first two examples (2×1/3+4/3 =
2, 8 × 1/3 + 4/3 = 4), so it returns 1 × 1/3 + 4/3 =
5/3. Since the two subalgorithms disagree, the union
algorithm returns ⊥ one last time and finds out that
y3 = 3. It makes all future predictions (accurately)
using the algorithm for H1.
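A sketch of the union algorithm in Python follows; it assumes each sublearner exposes predict(x) returning a value or None (⊥) and learn(x, y), the interface carried over from the earlier sketches.

class UnionLearner:
    # Sketch of the deterministic union algorithm over k sublearners.
    def __init__(self, learners):
        self.active = list(learners)     # A-hat: not-yet-killed sublearners

    def predict(self, x):
        preds = [l.predict(x) for l in self.active]
        if None in preds:
            return None                  # bot: some sublearner needs this label
        if len(set(preds)) == 1:
            return preds[0]              # all survivors agree
        return None                      # bot: disagreement, label will kill one

    def learn(self, x, y):
        for l in self.active:
            if l.predict(x) is None:
                l.learn(x, y)
        # kill sublearners whose committed prediction was wrong
        self.active = [l for l in self.active if l.predict(x) in (None, y)]

On Example 2, the first two inputs produce ⊥s from both sublearners; on x3 = 1 the sublearners commit to 3 and 5/3, the union reports ⊥ once more, and the observation y3 = 3 kills the line learner.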
Next, we consider a variant of Algorithm 1 that com-
bines learners across disjoint input spaces.
Algorithm 7 Let X1, . . . , Xk be a set of disjoint input spaces (Xi ∩ Xj = ∅ if i ≠ j) and Y be an output space. Let H1, . . . , Hk be a set of KWIK learnable hypothesis classes with bounds of B1(ε, δ), . . . , Bk(ε, δ) where Hi ∈ (Xi → Y ). The input-partition algorithm can learn the hypothesis class H ∈ (X1 ∪ · · · ∪ Xk → Y ) with a KWIK bound of B(ε, δ) = Σᵢ Bᵢ(ε, δ/k).
The input-partition algorithm runs the learning algorithm for each subclass Hi. When it receives an input x ∈ Xi, it queries the learning algorithm for class Hi and returns its response, which is ε-accurate, as required. To achieve 1 − δ certainty, it insists on 1 − δ/k certainty from each of the subalgorithms. By the union bound, the overall failure probability must be less than the sum of the failure probabilities for the subalgorithms.
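The input-partition combiner is only a few lines; the sketch below assumes a partition function mapping each input to the key of its subspace, with the sublearners already run at confidence δ/k.

class InputPartitionLearner:
    # Sketch of Algorithm 7: route each input to the learner for its subspace.
    def __init__(self, learners, partition):
        self.learners = learners         # dict: subspace key -> KWIK learner
        self.partition = partition       # function: input -> subspace key

    def predict(self, x):
        return self.learners[self.partition(x)].predict(x)

    def learn(self, x, y):
        self.learners[self.partition(x)].learn(x, y)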
Example 3 An MDP consists of n states and m ac-
tions. For each combination of state, action, and next state, the transition function returns a probability.
As the reinforcement-learning agent moves around in
the state space, it observes state–action–state transi-
tions and must predict the probabilities for transitions
it has not yet observed. In the model-based setting, an algorithm learns a mapping from the size-n²m input space of state–action–state combinations to probabilities via Bernoulli observations. Thus, the problem can be solved via the input-partition algorithm (Algorithm 7) over a set of individual probabilities learned via Algorithm 4. The resulting KWIK bound is B(ε, δ) = O((n²m/ε²) ln(nm/δ)).
Note that this approach is precisely what is found in
most efficient RL algorithms in the literature (Kearns
& Singh, 2002; Brafman & Tennenholtz, 2002).
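A hypothetical instantiation of Example 3, reusing the CoinLearner and InputPartitionLearner sketches above with illustrative sizes, allocates one coin per state–action–state triple:

n, m = 10, 4                             # illustrative sizes, not from the paper
epsilon, delta = 0.1, 0.05
k = n * n * m                            # number of disjoint input spaces
mdp_model = InputPartitionLearner(
    learners={(s, a, s2): CoinLearner(epsilon, delta / k)
              for s in range(n) for a in range(m) for s2 in range(n)},
    partition=lambda triple: triple)     # each triple is its own subspace

print(mdp_model.predict((0, 1, 2)))      # None: nothing observed yet
mdp_model.learn((0, 1, 2), 1)            # a 0/1 transition indicator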
Algorithm 7 combines hypotheses by partitioning the
input space. In contrast, the next example concerns
combinations in input and output space.
Algorithm 8 Let X1, . . . , Xk and Y1, . . . , Yk be a set of input and output spaces and H1, . . . , Hk be a set of KWIK learnable hypothesis classes with bounds of B1(ε, δ), . . . , Bk(ε, δ) on these spaces. That is, Hi ∈ (Xi → Yi). The cross-product algorithm can learn the hypothesis class H ∈ ((X1 × · · · × Xk) → (Y1 × · · · × Yk)) with a KWIK bound of B(ε, δ) = Σᵢ Bᵢ(ε, δ/k).
Here, each input consists of a vector of inputs from
each of the spaces X1, . . . , Xk and outputs are vectors
of outputs from Y1, . . . , Yk. Like Algorithm 7, each
component of this vector can be learned independently
via the corresponding algorithm. Each is learned to within an accuracy of ε and confidence 1 − δ/k. Any time any component returns ⊥, the cross-product algorithm returns ⊥. Since each ⊥ returned can be traced to one of the subalgorithms, the total is bounded as described above. By the union bound, total failure probability is no more than k × δ/k = δ.
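A sketch of the cross-product combiner follows, under the same interface assumptions as the earlier sketches; inputs and labels are vectors, handled componentwise.

class CrossProductLearner:
    # Sketch of Algorithm 8: learn a vector-valued function componentwise.
    def __init__(self, learners):
        self.learners = learners         # learners[i] handles component i

    def predict(self, xs):
        preds = [l.predict(x) for l, x in zip(self.learners, xs)]
        if any(p is None for p in preds):
            return None                  # bot if any component is unknown
        return tuple(preds)

    def learn(self, xs, ys):
        for l, x, y in zip(self.learners, xs, ys):
            if l.predict(x) is None:     # only bot components need the label
                l.learn(x, y)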
Example 4 Transitions in a factored-state MDP can be
thought of as mappings from vectors to vectors. Given
known dependencies, the cross-product algorithm can
be used to learn each component of the transition func-
tion. Each component is, itself, an instance of Algo-
rithm 7 applied to the coin-learning algorithm. This
three-level KWIK algorithm provides an approach to
learn the transition function of a factored-state MDP
with a polynomial KWIK bound. This insight can be
used to derive the factored-state-MDP learning algo-
rithm used by Kearns and Koller (1999).
The previous two algorithms apply to both determin-
istic and noisy observations. We next provide a pow-
erful algorithm that generalizes the union algorithm
(Algorithm 6) to work with noisy observations as well.
Algorithm 9 Let F : X → Y be the set of functions mapping input set X to output set Y = [0, 1]. Let Z = {0, 1} be a binary observation set. Let H1, . . . , Hk be a set of KWIK learnable hypothesis classes with bounds of B1(ε, δ), . . . , Bk(ε, δ) where Hi ⊆ F for all 1 ≤ i ≤ k. That is, all the hypothesis classes share the same input/output sets. The noisy union algorithm can learn the joint hypothesis class H = ∪ᵢ Hᵢ with a KWIK bound of B(ε, δ) = O((k/ε²) ln(k/δ)) + Σ_{i=1}^{k} Bᵢ(ε/4, δ/(k + 1)).
For simplicity, we sketch the special case of k = 2.
The general case will be briefly discussed at the end.
The noisy union algorithm is similar to the union al-
gorithm (Algorithm 6), except that it has to deal with
noisy observations. The algorithm proceeds by running the KWIK algorithms, using parameters (ε0, δ0), as subalgorithms for each of the Hi hypothesis classes, where ε0 = ε/4 and δ0 = δ/3. Given an input xt in trial t, it queries each algorithm i to obtain a prediction ŷti. Let L̂t be the set of responses.
If ⊥ ∈ L̂t, the noisy union algorithm reports ⊥, ob-
tains an observation zt ∈ Z, and sends it to all subal-
gorithms i with ŷti = ⊥ to allow them to learn. In the
following, we focus on the other case where ⊥ ∉ L̂t.
If |ŷt1 − ŷt2| ≤ 4ε0, then these two predictions are sufficiently consistent, and we claim that, with high probability, the prediction p̂t = (ŷt1 + ŷt2)/2 is ε-close to yt = Pr(zt = 1). This claim follows because, by assumption, one of the predictions, say ŷt1, deviates from yt by at most ε0 with probability at least 1 − δ/3, and hence |p̂t − yt| = |p̂t − ŷt1 + ŷt1 − yt| ≤ |p̂t − ŷt1| + |ŷt1 − yt| = |ŷt1 − ŷt2|/2 + |ŷt1 − yt| ≤ 2ε0 + ε0 < ε.
If |ŷt1 − ŷt2| > 4ε0, then the individual predictions are not consistent enough for the noisy union algorithm to make an ε-accurate prediction. Thus, the noisy union algorithm reports ⊥ and needs to know which subalgorithm provided an inaccurate response. But, since the observations are noisy in this problem, it cannot eliminate hi on the basis of a single observation. Instead, it maintains the total squared prediction error for every subalgorithm i: ℓi = Σ_{t∈I} (ŷti − zt)², where I = {t : |ŷt1 − ŷt2| > 4ε0} is the set of trials in which the subalgorithms gave inconsistent predictions. We observe that |I| is the number of ⊥s returned by the noisy union algorithm alone (not counting those returned by the subalgorithms). Our last step is to show that ℓi provides a robust measure for eliminating invalid predictors when |I| is sufficiently large.
Applying the Hoeffding bound and some algebra (taking h1, without loss of generality, to be the subalgorithm whose predictions are ε0-accurate), we find
Pr(ℓ1 > ℓ2) ≤ exp(−(Σ_{t∈I} |ŷt1 − ŷt2|²)/8) ≤ exp(−2ε0² |I|).
Setting the righthand side to be δ/3 and solving for |I|, we have |I| = (1/(2ε0²)) ln(3/δ) = O((1/ε²) ln(1/δ)).
Since each hi succeeds with probability 1 − δ/3, and the comparison of ℓ1 and ℓ2 also succeeds with probability 1 − δ/3, a union bound implies that the noisy union algorithm succeeds with probability at least 1 − δ. All ⊥s are either from a subalgorithm (at most Σᵢ Bᵢ(ε0, δ0)) or from the noisy union algorithm itself (O((1/ε²) ln(1/δ))).
The general case where k > 2 can be reduced to the k = 2 case by pairing the k learners and running the noisy union algorithm described above on each pair. Here, each subalgorithm is run with parameters ε/4 and δ/(k + 1). Although there are (k choose 2) = O(k²) pairs, a slightly improved reduction and analysis can reduce the dependence of |I| on k from quadratic to linearithmic, leading to the bound given in the statement.
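For concreteness, here is a sketch of the k = 2 case in Python, assuming the two sublearners were constructed with parameters (ε/4, δ/3) and expose the predict/learn interface used in the earlier sketches, with observations z ∈ {0, 1}.

import math

class NoisyUnionLearner:
    # Sketch of Algorithm 9 for k = 2 sublearners with outputs in [0, 1].
    def __init__(self, learner1, learner2, epsilon, delta):
        self.learners = [learner1, learner2]
        self.eps0 = epsilon / 4
        # |I| = (1/(2 eps0^2)) ln(3/delta) disagreements decide the comparison
        self.max_disagreements = math.ceil(
            math.log(3 / delta) / (2 * self.eps0 ** 2))
        self.sq_err = [0.0, 0.0]         # total squared error on disagreements
        self.disagreements = 0
        self.loser = None                # index of the eliminated sublearner

    def predict(self, x):
        preds = [l.predict(x) for l in self.learners]
        if self.loser is not None:
            return preds[1 - self.loser]
        if None in preds:
            return None                  # bot: a sublearner must still learn
        if abs(preds[0] - preds[1]) <= 4 * self.eps0:
            return (preds[0] + preds[1]) / 2   # consistent: average is eps-close
        return None                      # bot: predictions too far apart

    def learn(self, x, z):
        preds = [l.predict(x) for l in self.learners]
        for l, p in zip(self.learners, preds):
            if p is None:
                l.learn(x, z)
        if None not in preds and abs(preds[0] - preds[1]) > 4 * self.eps0:
            for i in (0, 1):
                self.sq_err[i] += (preds[i] - z) ** 2
            self.disagreements += 1
            if self.disagreements >= self.max_disagreements and self.loser is None:
                self.loser = 0 if self.sq_err[0] > self.sq_err[1] else 1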
Example 5 Without known dependencies, learning
a factored-state MDP is more challenging. Strehl
et al. (2007) showed that each possible dependence
structure can be viewed as a separate hypothesis and
provided an algorithm for learning the dependencies
in a factored-state MDP while learning the transition
probabilities. The algorithm can be viewed as a four-
level KWIK algorithm with a noisy union algorithm
at the top (to discover the dependence structure), a
cross-product algorithm beneath it (to decompose the
transitions for the separate components of the factored-
state representation), an input-partition algorithm be-
low that (to handle the different combinations of state
component and action), and a coin-learning algorithm
at the very bottom (to learn the transition probabili-
ties themselves). Note that Algorithm 9 is conceptu-
ally simpler, significantly more efficient (k log k vs. k2
dependence on k), and more generally applicable than
the one due to Strehl et al. (2007).
6. Conclusion and Future Work
We described the KWIK (“knows what it knows”)
model of supervised learning, which identifies and gen-
eralizes a key step common to a class of algorithms
for efficient exploration. We provided algorithms for a
set of basic hypothesis classes given deterministic and
noisy observations as well as methods for composing
hypothesis classes to create more complex algorithms.
One example algorithm consisted of a four-level de-
composition of an existing learning algorithm from the
reinforcement-learning literature.
By providing a set of example algorithms and compo-
sition rules, we hope to encourage the use of KWIK
algorithms as a component in machine-learning appli-
cations as well as spur the development of novel algo-
rithms. We leave as an open problem one concern of particular interest in applying the KWIK framework to real-life data.
Open Problem 4 How can KWIK be adapted to ap-
ply in the unrealizable setting in which the target hy-
pothesis can be chosen from outside the hypothesis
class H?
References
Angluin, D. (2004). Queries revisited. Theoretical Computer Science, 313, 175–194.
Bagnell, J., Ng, A. Y., & Schneider, J. (2001). Solving
uncertain Markov decision problems (Technical Report
CMU-RI-TR-01-25). Robotics Institute, Carnegie Mel-
lon University, Pittsburgh, PA.
Blum, A. (1994). Separating distribution-free and mistake-
bound learning models over the Boolean domain. SIAM
Journal on Computing, 23, 990–1000.
Brafman, R. I., & Tennenholtz, M. (2002). R-MAX—a
general polynomial time algorithm for near-optimal re-
inforcement learning. Journal of Machine Learning Re-
search, 3, 213–231.
Cesa-Bianchi, N., Gentile, C., & Zaniboni, L. (2006).
Worst-case analysis of selective sampling for linear clas-
sification. Journal of Machine Learning Research, 7,
1205–1230.
Cesa-Bianchi, N., Lugosi, G., & Stoltz, G. (2005). Minimiz-
ing regret with label efficient prediction. IEEE Transac-
tions on Information Theory, 51, 2152–2162.
Cohn, D. A., Atlas, L., & Ladner, R. E. (1994). Improving
generalization with active learning. Machine Learning,
15, 201–221.
Fong, P. W. L. (1995). A quantitative study of hypothesis
selection. Proceedings of the Twelfth International Con-
ference on Machine Learning (ICML-95) (pp. 226–234).
Freund, Y., Schapire, R. E., Singer, Y., & Warmuth, M. K.
(1997). Using and combining predictors that specialize.
STOC ’97: Proceedings of the twenty-ninth annual ACM
symposium on Theory of computing (pp. 334–343).
Helmbold, D. P., Littlestone, N., & Long, P. M. (2000).
Apple tasting. Information and Computation, 161, 85–
139.
Kakade, S., Kearns, M., & Langford, J. (2003). Explo-
ration in metric state spaces. Proceedings of the 20th
International Conference on Machine Learning.
Kakade, S. M. (2003). On the sample complexity of rein-
forcement learning. Doctoral dissertation, Gatsby Com-
putational Neuroscience Unit, University College Lon-
don.
Kearns, M. J., & Koller, D. (1999). Efficient reinforce-
ment learning in factored MDPs. Proceedings of the 16th
International Joint Conference on Artificial Intelligence
(IJCAI) (pp. 740–747).
Kearns, M. J., & Singh, S. P. (2002). Near-optimal rein-
forcement learning in polynomial time. Machine Learn-
ing, 49, 209–232.
Lane, T., & Brodley, C. E. (2003). An empirical study of
two approaches to sequence learning for anomaly detec-
tion. Machine Learning, 51, 73–107.
Littlestone, N. (1987). Learning quickly when irrelevant
attributes abound: A new linear-threshold algorithm.
Machine Learning, 2, 285–318.
Strehl, A. L., Diuk, C., & Littman, M. L. (2007). Efficient
structure learning in factored-state MDPs. Proceedings
of the Twenty-Second National Conference on Artificial
Intelligence (AAAI-07).
Strehl, A. L., & Littman, M. L. (2008). Online linear re-
gression and its application to model-based reinforce-
ment learning. Advances in Neural Information Process-
ing Systems 20.
Strehl, A. L., Mesterharm, C., Littman, M. L., & Hirsh,
H. (2006). Experience-efficient learning in associative
bandit problems. Proceedings of the Twenty-third Inter-
national Conference on Machine Learning (ICML-06).
Valiant, L. G. (1984). A theory of the learnable. Commu-
nications of the ACM, 27, 1134–1142.
Weiss, G. M., & Tian, Y. (2006). Maximizing classifier util-
ity when training data is costly. SIGKDD Explorations,
8, 31–38.
Acknowledgments
Support provided by NSF IIS-0325281 and DARPA TL.