Active Learning
Maria-Florina Balcan
04/01/2015
Logistics
• HWK #6 due on Friday.
• Midway Project Review due on Monday. Make sure to talk to your mentor TA!
Classic Fully Supervised Learning Paradigm Insufficient Nowadays
Modern applications: massive amounts of raw data. Only a tiny fraction can be annotated by human experts.
Examples: protein sequences, billions of webpages, images.
Modern ML: New Learning Approaches
Modern applications: massive amounts of raw data. Techniques that best utilize data, minimizing the need for expert/human intervention.
Paradigms where there has been great progress.
• Semi-supervised Learning, (Inter)active Learning.
(Diagram: the Learning Algorithm receives unlabeled examples and interacts with an Expert / Oracle.)
Semi-Supervised Learning
(Diagram: Data Source → unlabeled examples → Learning Algorithm; Expert / Oracle supplies labels.)
Labeled examples Sl = {(x1,y1), …, (xml,yml)}, xi drawn i.i.d. from D, yi = c∗(xi).
Unlabeled examples Su = {x1, …, xmu} drawn i.i.d. from D.
Algorithm outputs a classifier h.
Goal: h has small error over D, where errD(h) = Pr_{x∼D}[h(x) ≠ c∗(x)].
E.g., “large margin separator” [Joachims ’99]
Similarity based (“small cut”) [B&C01], [ZGL03]
“Self-consistent rules” (co-training) [Blum & Mitchell ’98]: x = ⟨x1, x2⟩, want h1(x1) = h2(x2).
E.g., webpage classification: x1 – text info, x2 – link info (e.g., “Prof. Avrim”, “My Advisor”).
Semi-supervised Learning
Key Insight/Underlying Fundamental Principle
Unlabeled data useful if we have a bias/belief not only about the form of the target, but also about its relationship with the underlying data distribution.
Unlabeled data can help reduce the search space, or re-order the functions in the search space according to our belief, biasing the search towards functions satisfying the belief (which becomes concrete once we see unlabeled data).
• How much unlabeled data is needed: depends both on the complexity of H and on the compatibility notion.
• Ability of unlabeled data to reduce the number of labeled examples: depends on the compatibility of the target and the helpfulness of the distribution.
Survey on “Semi-Supervised Learning” (Jerry Zhu, 2010) explains the SSL techniques from this point of view.
Note: the mixture method that Tom talked about on Feb 25th can be explained from this point of view too. See the Zhu survey.
A General Discriminative Model for SSL
[BalcanBlum, COLT 2005; JACM 2010]
As in PAC/SLT, we can discuss algorithmic and sample complexity issues, and analyze the fundamental sample complexity aspects above.
Active Learning
Additional resources:
• Two faces of active learning. Sanjoy Dasgupta. 2011.
• Active Learning. Burr Settles. 2012.
• Active Learning. Balcan-Urner. Encyclopedia of Algorithms. 2015
Batch Active Learning
(Diagram: Data Source, with underlying data distribution D, provides unlabeled examples to the Learning Algorithm; the algorithm repeatedly sends the Expert a request for the label of an example and receives a label for that example; finally it outputs a classifier w.r.t. D.)
• Learner can choose specific examples to be labeled.
• Goal: use fewer labeled examples [pick informative examples to be labeled].
Selective Sampling Active Learning
(Diagram: Data Source, with underlying data distribution D, streams unlabeled examples xi to the Learning Algorithm; for each one the algorithm either requests its label yi from the Expert or lets it go; finally it outputs a classifier w.r.t. D.)
• Selective sampling AL (online AL): stream of unlabeled examples; when each arrives, decide whether to ask for its label or let it go.
• Goal: use fewer labeled examples [pick informative examples to be labeled].
What Makes a Good Active Learning Algorithm?
• Guaranteed to output a relatively good classifier for most learning problems.
• Doesn’t make too many label requests: hopefully far fewer than passive learning and SSL.
• Need to choose the label requests carefully, to get informative labels.
Can adaptive querying really do better than passive/random sampling?
• YES! (sometimes)
• We often need far fewer labels for active learning than for passive.
• This is predicted by theory and has been observed in practice.
Can adaptive querying help? [CAL92, Dasgupta04]
• Threshold functions on the real line: hw(x) = 1(x ≥ w), C = {hw : w ∈ R}.
Active Algorithm:
• Get N unlabeled examples.
• How can we recover the correct labels with ≪ N queries?
• Do binary search! Just need O(log N) labels!
• Output a classifier consistent with the N inferred labels.
• With N = O(1/ε) we are guaranteed to get a classifier of error ≤ ε.
Passive supervised: Ω(1/ε) labels to find an ε-accurate threshold.
Active: only O(log 1/ε) labels. Exponential improvement.
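The binary-search argument can be sketched in a few lines of Python (the pool size, hidden threshold, and oracle below are made up for illustration):

```python
def active_learn_threshold(xs, query_label):
    """Recover the labels of N sorted points with O(log N) label queries,
    assuming the target is a threshold function: label(x) = 1(x >= w)."""
    xs = sorted(xs)
    lo, hi = 0, len(xs)                  # boundary index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(xs[mid]) == 1:    # one call to the labeling oracle
            hi = mid                     # boundary is at or left of mid
        else:
            lo = mid + 1                 # boundary is right of mid
    # infer all N labels from the single boundary index found
    return [0] * lo + [1] * (len(xs) - lo)

# Hypothetical oracle with hidden threshold w = 37; count its calls.
calls = []
def oracle(x):
    calls.append(x)
    return 1 if x >= 37 else 0

labels = active_learn_threshold(range(100), oracle)
```

Here N = 100 points get labeled with only about log2(100) ≈ 7 oracle calls, which is the source of the exponential gap between the passive Ω(1/ε) and active O(log 1/ε) label complexities.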
Common Technique in Practice
Uncertainty sampling in SVMs is common and quite useful in practice.
Active SVM Algorithm
• At any time during the alg., we have a “current guess” wt of the separator: the max-margin separator of all labeled points so far.
• Request the label of the example closest to the current separator.
E.g., [Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010; Schohn & Cohn, ICML 2000]
Common Technique in Practice
Active SVM seems to be quite useful in practice. [Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010]
Algorithm (batch version)
Input: Su = {x1, …, xmu} drawn i.i.d. from the underlying source D.
Start: query the labels of a few random xi’s.
For t = 1, …:
• Find wt, the max-margin separator of all labeled points so far.
• Request the label of the example closest to the current separator: the xi minimizing |xi · wt| (highest uncertainty).
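A toy sketch of this loop in Python, in one dimension so that the “max-margin separator” is simply the midpoint between the closest oppositely labeled points; the pool, hidden threshold, and query budget are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = 0.3                                  # hidden true threshold
pool = rng.uniform(-1, 1, 1000)               # unlabeled pool drawn from D
oracle = lambda x: 1.0 if x >= w_star else -1.0

# Start: seed with two queries (the extreme points, so both classes appear).
labeled = {x: oracle(x) for x in (pool.min(), pool.max())}

for t in range(15):
    neg = max(x for x, y in labeled.items() if y < 0)
    pos = min(x for x, y in labeled.items() if y > 0)
    w_t = (neg + pos) / 2                     # 1-D "max-margin separator"
    # Uncertainty sampling: query the unlabeled point closest to w_t.
    remaining = [x for x in pool if x not in labeled]
    x_q = min(remaining, key=lambda x: abs(x - w_t))
    labeled[x_q] = oracle(x_q)

neg = max(x for x, y in labeled.items() if y < 0)
pos = min(x for x, y in labeled.items() if y > 0)
w_hat = (neg + pos) / 2
err = sum((1.0 if x >= w_hat else -1.0) != oracle(x) for x in pool) / len(pool)
```

With only 17 labels the learned threshold essentially pins down w_star; note how the queries concentrate near the decision boundary, which is exactly the behavior (and the sampling-bias risk) discussed below.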
Common Technique in Practice
Active SVM seems to be quite useful in practice; e.g., [Jain, Vijayanarasimhan & Grauman, NIPS 2010] on the Newsgroups dataset (20,000 documents from 20 categories) and the CIFAR-10 image dataset (60,000 images from 10 categories).
Active SVM/Uncertainty Sampling
• Works sometimes…
• However, we need to be very, very careful!
• Myopic, greedy techniques can suffer from sampling bias: a bias created by the querying strategy; as time goes on, the sample is less and less representative of the true data source. [Dasgupta10]
• Observed in practice too!
Main tension: we want to choose informative points, but we also want to guarantee that the classifier we output does well on true random examples from the underlying distribution.
Safe Active Learning Schemes
Disagreement Based Active Learning (Hypothesis Space Search) [CAL92] [BBL06]
[Hanneke’07, DHM’07, Wang’09, Fridman’09, Koltchinskii’10, BHW’08, BHLZ’10, Hsu’10, Ailon’12, …]
Version Spaces
• X – feature/instance space; distribution D over X; c∗ target function.
• Fix a hypothesis space H.
Definition (Mitchell’82)
Assume the realizable case: c∗ ∈ H. Given a set of labeled examples (x1,y1), …, (xml,yml) with yi = c∗(xi), the version space VS(H) is the part of H consistent with the labels so far; i.e., h ∈ VS(H) iff h(xi) = c∗(xi) for all i ∈ {1, …, ml}.
E.g., data lies on a circle in R² and H = homogeneous linear separators. (Figure: current version space; region of disagreement in data space.)
Version Spaces. Region of Disagreement
Definition (CAL’92)
Version space: part of H consistent with labels so far.
Region of disagreement = part of data space about which there is still some uncertainty (i.e., disagreement within the version space):
x ∈ DIS(VS(H)) iff ∃ h1, h2 ∈ VS(H) with h1(x) ≠ h2(x).
E.g., data lies on a circle in R² and H = homogeneous linear separators. (Figure: current version space; region of disagreement in data space.)
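A toy Monte-Carlo illustration of this definition (data on the circle, homogeneous linear separators; the target, sample sizes, and number of sampled hypotheses are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data on the unit circle in R^2; H = homogeneous linear separators sign(w . x).
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)]
w_star = np.array([1.0, 0.5])                 # hypothetical target
y = np.sign(X @ w_star)

idx = rng.choice(len(X), 10, replace=False)   # indices of labeled examples

# Approximate the version space: sample random hypotheses and keep
# those consistent with all labels seen so far.
W = rng.standard_normal((5000, 2))
vs = np.array([w for w in W if np.all(np.sign(X[idx] @ w) == y[idx])])

# DIS(VS(H)): points on which some two version-space hypotheses disagree.
preds = np.sign(X @ vs.T)                     # (n_points, |VS|) predictions
dis = np.where(preds.max(axis=1) != preds.min(axis=1))[0]
```

The labeled points never land in the disagreement region: every consistent hypothesis already agrees on them, which is why querying inside DIS never wastes a label.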
Disagreement Based Active Learning [CAL92]
(Figure: current version space; region of uncertainty.)
Pick a few points at random from the current region of uncertainty and query their labels. Stop when the region of uncertainty is small.
Note: it is active since we do not waste labels by querying in regions of the space where we are already certain about the labels.
Disagreement Based Active Learning [CAL92]
Algorithm:
Query the labels of a few random xi’s. Let H1 be the current version space.
For t = 1, …:
• Pick a few points at random from the current region of disagreement DIS(Ht) and query their labels.
• Let Ht+1 be the new version space.
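For one-dimensional thresholds the loop above is easy to simulate: the region of disagreement is just the interval between the rightmost queried negative and the leftmost queried positive. (The pool size, hidden threshold, and batch size below are made up.)

```python
import random

random.seed(1)
w_star = 0.62                                  # hidden target threshold
label = lambda x: 1 if x >= w_star else 0      # labeling oracle

pool = sorted(random.random() for _ in range(10_000))
lo, hi = 0.0, 1.0        # version space = thresholds in (lo, hi);
queries = 0              # DIS = pool points strictly inside (lo, hi)

while True:
    dis = [x for x in pool if lo < x < hi]     # current disagreement region
    if len(dis) < 5:                           # stop: region is small
        break
    for x in random.sample(dis, 3):            # query a few random points
        queries += 1
        if label(x) == 1:
            hi = min(hi, x)                    # shrink version space from above
        else:
            lo = max(lo, x)                    # ... or from below
```

Each queried point leaves the disagreement region, and the region shrinks geometrically in expectation, so only a few dozen of the 10,000 labels are ever requested.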
Region of uncertainty [CAL92]
• Current version space: part of C consistent with labels so far.
• “Region of uncertainty” = part of data space about which there is still some uncertainty (i.e., disagreement within the version space).
(Figure: current version space; region of uncertainty in data space.)
Region of uncertainty [CAL92]
As labels arrive, the version space shrinks.
(Figure: new version space; new, smaller region of disagreement in data space.)
How about the agnostic case, where the target might not belong to H?
A2 Agnostic Active Learner [BBL’06]
(Figure: current version space; region of disagreement.)
Careful use of generalization bounds; avoids the sampling bias!
Algorithm:
Let H1 = H.
For t = 1, …:
• Pick a few points at random from the current region of disagreement DIS(Ht) and query their labels.
• Throw out a hypothesis if you are statistically confident it is suboptimal.
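A much-simplified sketch of the idea (not the actual A2 bookkeeping): candidate thresholds on a grid, sampling from their disagreement region, and Hoeffding-style elimination of statistically suboptimal hypotheses. The target, noise rate, grid, number of rounds, and confidence-bound constant are all made up.

```python
import math, random

random.seed(0)
w_star, noise = 0.6, 0.05                    # hypothetical noisy threshold target

def label(x):                                # oracle flips the label w.p. 5%
    y = 1 if x >= w_star else 0
    return 1 - y if random.random() < noise else y

H = [i / 200 for i in range(201)]            # grid of candidate thresholds
mistakes = {w: 0 for w in H}                 # cumulative mistakes per hypothesis

for t in range(1, 2001):
    # Sample from the disagreement region of the surviving thresholds.
    x = random.uniform(min(H), max(H))
    y = label(x)
    for w in H:
        mistakes[w] += int((1 if x >= w else 0) != y)
    # Hoeffding-style elimination: drop hypotheses whose empirical error
    # exceeds the best one by more than the confidence slack.
    slack = math.sqrt(math.log(40 * t) / t)
    means = {w: mistakes[w] / t for w in H}
    best = min(means.values())
    H = [w for w in H if means[w] <= best + 2 * slack]
    mistakes = {w: mistakes[w] for w in H}
```

Because hypotheses are discarded only when statistically suboptimal, the procedure stays “safe” even under label noise, unlike greedy uncertainty sampling.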
When Active Learning Helps. Agnostic case
A2 is the first algorithm that is robust to noise.
[Balcan, Beygelzimer, Langford, ICML’06] [Balcan, Beygelzimer, Langford, JCSS’08]
“Region of disagreement” style: Pick a few points at random from the current region of disagreement, query their labels, throw out hypothesis if you are statistically confident they are suboptimal.
Guarantees for A2 [BBL’06,’08]:
• It is safe (never worse than passive learning) & gives exponential improvements:
• C – thresholds, low noise: exponential improvement.
• C – homogeneous linear separators in R^d, D – uniform, low noise: only d² log(1/ε) labels.
A lot of subsequent work.
[Hanneke’07, DHM’07, Wang’09 , Fridman’09, Kolt10, BHW’08, BHLZ’10, H’10, Ailon’12, …]
General guarantees for A2 Agnostic Active Learner
“Disagreement based”: pick a few points at random from the current region of uncertainty, query their labels, throw out a hypothesis if you are statistically confident it is suboptimal. [BBL’06]
Guarantees for A2 [Hanneke’07]: label complexity bounds in terms of the disagreement coefficient, which measures how quickly the region of disagreement collapses as we get closer and closer to the optimal classifier.
(Figure: realizable case; linear separators under the uniform distribution.)
Disagreement Based Active Learning
“Disagreement based” algos: query points from the current region of disagreement, throw out hypotheses when statistically confident they are suboptimal.
• Generic (works for any class), handles adversarial label noise.
• Computationally efficient for classes of small VC-dimension.
• Still, could be suboptimal in label complexity & computationally inefficient in general.
Lots of subsequent work trying to make it more efficient computationally and more aggressive too: [Hanneke07, DasguptaHsuMonteleoni’07, Wang’09, Fridman’09, Koltchinskii10, BHW’08, BeygelzimerHsuLangfordZhang’10, Hsu’10, Ailon’12, …]
Other Interesting AL Techniques Used in Practice
Interesting open question to analyze under what conditions they are successful.
Density-Based Sampling: centroid of the largest unsampled cluster.
Uncertainty Sampling: closest to the decision boundary (Active SVM).
Maximal Diversity Sampling: maximally distant from labeled x’s.
Ensemble-Based Possibilities: uncertainty + diversity criteria; density + uncertainty criteria.
[Jaime G. Carbonell]
Graph-based Active and Semi-Supervised Methods
Graph-based Methods
Assume we are given a pairwise similarity fnc and that very similar examples probably have the same label.
If we have a lot of labeled data, this suggests a Nearest-Neighbor type of algorithm.
If we have a lot of unlabeled data, perhaps we can use it as “stepping stones”.
E.g., handwritten digits [Zhu07].
Graph-based Methods
Idea: construct a graph with edges between very similar examples.
Unlabeled data can help “glue” the objects of the same class together.
Graph-based Methods
Often a transductive approach (given L + U, output predictions on U); we are allowed to output any labeling of L ∪ U.
Main Idea:
• Construct a graph G with edges between very similar examples.
• We might also have glued together in G examples of different classes.
• Run a graph partitioning algorithm to separate the graph into pieces.
Several methods:
– Minimum/multiway cut [Blum & Chawla ’01]
– Minimum “soft cut” [Zhu, Ghahramani & Lafferty ’03]
– Spectral partitioning
– …
SSL using soft cuts
[ZhuGhahramaniLafferty’03]
Solve for a label function f(x) ∈ [0,1] to minimize:
J(f) = Σ_{edges (i,j)} w_ij (f(x_i) − f(x_j))² + λ Σ_{x_i ∈ L} (f(x_i) − y_i)²
The first term makes similar nodes get similar labels (weighted by similarity); the second term encourages agreement with the observed labels (agreement is not strictly enforced).
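Setting the gradient of J to zero gives a linear system, so the soft-cut solution can be computed in closed form. A minimal sketch on a made-up 6-node chain graph (λ and the edge weights are arbitrary):

```python
import numpy as np

# Tiny made-up graph: a 6-node chain; only the endpoints are labeled.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0      # similarity edges along the chain

# Graph Laplacian: f' L f equals the sum of w_ij (f(x_i) - f(x_j))^2 over edges.
L = np.diag(W.sum(axis=1)) - W
lam = 10.0                               # weight on fitting observed labels
S = np.diag([1.0, 0, 0, 0, 0, 1.0])     # indicator of labeled nodes
y = np.array([0.0, 0, 0, 0, 0, 1.0])    # observed labels (0 elsewhere)

# dJ/df = 0  =>  (L + lam * S) f = lam * S y
f = np.linalg.solve(L + lam * S, lam * (S @ y))
```

The solution interpolates smoothly along the chain from ≈0 at one labeled end to ≈1 at the other; thresholding f at 1/2 gives the soft-cut labeling.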
Active learning with label propagation
(using soft-cuts)
How to choose which node to query?
Active learning with label propagation
One natural idea: query the most uncertain point.
But the most uncertain node might have only one edge, so the query won’t have much impact! (Even worse: it could be a completely isolated node.)
Active learning with label propagation
Instead, use a 1-step-lookahead heuristic:
• For a node with soft label p, assume that querying it returns answer 1 with probability p and answer 0 with probability 1 − p.
• Compute the “average confidence” after re-running the soft-cut in each case:
p · (1/n) Σ_{x_i} max(f¹(x_i), 1 − f¹(x_i)) + (1 − p) · (1/n) Σ_{x_i} max(f⁰(x_i), 1 − f⁰(x_i)),
where f¹ (resp. f⁰) is the soft-cut solution after adding answer 1 (resp. 0).
• Query the node for which this quantity is highest (you want to be more confident on average).
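A toy sketch of the lookahead rule, reusing the closed-form soft-cut solver on a made-up 6-node chain graph with its endpoints labeled (the graph, λ, and sizes are invented for illustration):

```python
import numpy as np

def soft_cut(W, labeled, lam=10.0):
    """Closed-form soft-cut solution: solve (L + lam*S) f = lam*S y."""
    n = len(W)
    L = np.diag(W.sum(axis=1)) - W
    S, y = np.zeros((n, n)), np.zeros(n)
    for i, yi in labeled.items():
        S[i, i] = 1.0
        y[i] = yi
    return np.linalg.solve(L + lam * S, lam * (S @ y))

n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0          # chain graph
labeled = {0: 0.0, 5: 1.0}
f = soft_cut(W, labeled)

def expected_confidence(q):
    """1-step lookahead: average confidence after querying node q,
    assuming answer 1 w.p. f[q] and answer 0 w.p. 1 - f[q]."""
    p = f[q]
    f1 = soft_cut(W, {**labeled, q: 1.0})    # soft-cut if the answer is 1
    f0 = soft_cut(W, {**labeled, q: 0.0})    # soft-cut if the answer is 0
    conf = lambda g: np.mean(np.maximum(g, 1 - g))
    return p * conf(f1) + (1 - p) * conf(f0)

candidates = [i for i in range(n) if i not in labeled]
best = max(candidates, key=expected_confidence)
```

On this chain the rule picks a middle node rather than a node next to an already-labeled endpoint: resolving the middle is what raises the average confidence most.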
Active Learning with Label Propagation in Practice
• Does well for video segmentation (Fathi, Balcan, Ren & Rehg, BMVC 2011).
What You Should Know
• Active learning could be really helpful, could provide exponential improvements in label complexity (both theoretically and practically)!
• Common heuristics (e.g., those based on uncertainty sampling) need to be used very carefully due to sampling bias.
• Safe Disagreement Based Active Learning Schemes.
• Understand how they operate precisely in noise-free scenarios.