
Active Learning
Maria-Florina Balcan
04/01/2015

Logistics
• HWK #6 due on Friday.
• Midway Project Review due on Monday. Make sure to talk to your mentor TA!

Classic Fully Supervised Learning Paradigm Insufficient Nowadays
Modern applications: massive amounts of raw data. Only a tiny fraction can be annotated by human experts.
Examples: protein sequences, billions of webpages, images.

Modern ML: New Learning Approaches
Modern applications: massive amounts of raw data. Techniques that best utilize data, minimizing need for
expert/human intervention.
Paradigms where there has been great progress.
• Semi-supervised Learning, (Inter)active Learning.

Semi-Supervised Learning

The Data Source provides unlabeled examples; the Expert/Oracle provides labels; the Learning Algorithm outputs a classifier.

Labeled examples: Sl = {(x1,y1), …, (xml,yml)}, with xi drawn i.i.d. from D and yi = c∗(xi).
Unlabeled examples: Su = {x1, …, xmu} drawn i.i.d. from D.

errD(h) = Pr_{x∼D}[h(x) ≠ c∗(x)]

Goal: the output classifier h has small error over D.

E.g., “large margin separator” [Joachims ’99]; similarity based (“small cut”) [B&C01], [ZGL03]; “self-consistent rules” [Blum & Mitchell ’98].

Co-training example [Blum & Mitchell ’98]: x = ⟨x1, x2⟩ and we want h1(x1) = h2(x2). E.g., a faculty webpage: x1 = text info (“Prof. Avrim”), x2 = link info (“My Advisor”).
Semi-supervised Learning
Key Insight/Underlying Fundamental Principle
Unlabeled data useful if we have a bias/belief not only about the form of the target, but also about its relationship with the underlying data distribution.

Unlabeled data can help reduce search space or re-order the fns in the search space according to our belief, biasing the search towards fns satisfying the belief (which becomes concrete once we see unlabeled data).

• How much unlabeled data is needed?
  • Depends both on the complexity of H and of the compatibility notion.
• Ability of unlabeled data to reduce the number of labeled examples.
  • Depends on the compatibility of the target and helpfulness of the distribution.
Survey on “Semi-Supervised Learning” (Jerry Zhu, 2010) explains the SSL techniques from this point of view.
Note: the mixture method that Tom talked about on Feb 25th can be explained from this point of view too. See the Zhu survey.
A General Discriminative Model for SSL
[BalcanBlum, COLT 2005; JACM 2010]
As in PAC/SLT, discuss algorithmic and sample complexity issues. Analyze fundamental sample complexity aspects.

Active Learning
Additional resources:
• Two faces of active learning. Sanjoy Dasgupta. 2011.
• Active Learning. Burr Settles. 2012.
• Active Learning. Balcan-Urner. Encyclopedia of Algorithms. 2015.

Batch Active Learning

The Data Source (underlying data distribution D) provides unlabeled examples to the Learning Algorithm; the algorithm repeatedly sends the Expert a request for the label of an example and receives a label for that example. The algorithm outputs a classifier w.r.t. D.

• Learner can choose specific examples to be labeled.
• Goal: use fewer labeled examples [pick informative examples to be labeled].

Selective Sampling Active Learning

The Data Source (underlying data distribution D) presents a stream of unlabeled examples xi to the Learning Algorithm; for each one, the algorithm either requests its label yi from the Expert or lets it go. The algorithm outputs a classifier w.r.t. D.

Selective sampling AL (online AL): stream of unlabeled examples; when each arrives, make a decision: request its label or let it go?
Goal: use fewer labeled examples [pick informative examples to be labeled].

What Makes a Good Active Learning Algorithm?
• Guaranteed to output a relatively good classifier for most learning problems.
• Doesn’t make too many label requests; hopefully far fewer than passive learning and SSL.
• Need to choose the label requests carefully, to get informative labels.

Can adaptive querying really do better than passive/random sampling?
• YES! (sometimes)
• We often need far fewer labels for active learning than for passive.
• This is predicted by theory and has been observed in practice.

Can adaptive querying help? [CAL92, Dasgupta04]
• Threshold fns on the real line: hw(x) = 1(x ≥ w), C = {hw : w ∈ R}
Active Algorithm
• Get N unlabeled examples.
• How can we recover the correct labels with ≪ N queries?
• Do binary search! Just need O(log N) labels!
• Output a classifier consistent with the N inferred labels.
• With N = O(1/ε) we are guaranteed to get a classifier of error ≤ ε.
Passive supervised: Ω(1/ε) labels to find an ε-accurate threshold.
Active: only O(log 1/ε) labels. Exponential improvement.
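The binary-search idea above can be sketched in a few lines (an illustrative Python sketch; `oracle` is a hypothetical stand-in for the labeling expert):

```python
import math
import random

def active_threshold_labels(points, oracle):
    """Infer the labels of N points on the real line for a threshold
    target using only O(log N) label queries.

    Assumes the realizable case: there is some w with
    label(x) = 1 iff x >= w.  Returns (labels, num_queries)."""
    xs = sorted(points)
    lo, hi = 0, len(xs)      # index of first positive lies in [lo, hi]
    queries = 0
    while lo < hi:           # binary search over the sorted points
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == 1:
            hi = mid         # boundary is at mid or to its left
        else:
            lo = mid + 1     # boundary is strictly to the right
    # Everything left of index lo is negative, the rest positive.
    labels = {x: (1 if i >= lo else 0) for i, x in enumerate(xs)}
    return labels, queries

# Hypothetical usage: N = 1000 unlabeled points, target threshold 0.5.
random.seed(0)
pts = [random.random() for _ in range(1000)]
labels, q = active_threshold_labels(pts, lambda x: 1 if x >= 0.5 else 0)
```

With N = 1000 points this makes about log2(N) ≈ 10 queries, versus 1000 labels for passive learning.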

Common Technique in Practice
Uncertainty sampling in SVMs common and quite useful
in practice.
Active SVM Algorithm
• At any time during the alg., we have a “current guess” wt of the separator: the max-margin separator of all labeled points so far.
• Request the label of the example closest to the current separator.
E.g., [Tong & Koller, ICML 2000; Schohn & Cohn, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010]

Common Technique in Practice
Active SVM seems to be quite useful in practice. [Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010]
Algorithm (batch version)
Input: Su = {x1, …, xmu} drawn i.i.d. from the underlying source D.
Start: query for the labels of a few random xi’s.
For t = 1, …,
• Find wt, the max-margin separator of all labeled points so far.
• Request the label of the example closest to the current separator: the xi minimizing |xi ⋅ wt| (highest uncertainty).
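A minimal pool-based sketch of this loop (illustrative Python; a simple perceptron stands in for the max-margin SVM so the code needs nothing beyond NumPy, and `oracle` is a hypothetical labeling function):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Simple linear separator fit; a stand-in for the max-margin SVM."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:   # mistake-driven update
                w = w + yi * xi
    return w

def uncertainty_sampling(pool, oracle, n_seed=5, n_queries=20, seed=0):
    """Pool-based uncertainty sampling: repeatedly label the point
    closest to the current separator, i.e. minimizing |x . w_t|."""
    rng = np.random.default_rng(seed)
    idx = [int(i) for i in rng.choice(len(pool), size=n_seed, replace=False)]
    labels = {i: oracle(pool[i]) for i in idx}
    for _ in range(n_queries):
        w = perceptron(pool[idx], np.array([labels[i] for i in idx]))
        rest = [i for i in range(len(pool)) if i not in labels]
        i_star = min(rest, key=lambda i: abs(pool[i] @ w))  # most uncertain
        labels[i_star] = oracle(pool[i_star])
        idx.append(i_star)
    return perceptron(pool[idx], np.array([labels[i] for i in idx]))

# Hypothetical usage: separable 2-D data with target w* = (1, -1).
gen = np.random.default_rng(1)
pool = gen.normal(size=(500, 2))
w_star = np.array([1.0, -1.0])
oracle = lambda x: 1 if x @ w_star >= 0 else -1
w_hat = uncertainty_sampling(pool, oracle)
acc = float(np.mean([(1 if x @ w_hat >= 0 else -1) == oracle(x)
                     for x in pool]))
```

Only 25 labels are requested; because the queried points cluster near the decision boundary, they constrain the separator's direction tightly.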

Common Technique in Practice
Active SVM seems to be quite useful in practice. E.g., Jain, Vijayanarasimhan & Grauman, NIPS 2010
Newsgroups dataset (20,000 documents from 20 categories)

Common Technique in Practice
Active SVM seems to be quite useful in practice. E.g., Jain, Vijayanarasimhan & Grauman, NIPS 2010
CIFAR-10 image dataset (60,000 images from 10 categories)

Active SVM/Uncertainty Sampling
• Works sometimes….
• However, we need to be very very very careful!!!
• Myopic, greedy technique can suffer from sampling bias.
• A bias created because of the querying strategy; as time goes on the sample is less and less representative of the true data source.
[Dasgupta10]


• Observed in practice too!!!!
Main tension: want to choose informative points, but also want to guarantee that the classifier we output does well on true random examples from the underlying distribution.

Safe Active Learning Schemes
Disagreement Based Active Learning (hypothesis space search)
[CAL92], [BBL06]
[Hanneke’07, DHM’07, Wang’09 , Fridman’09, Kolt10, BHW’08, BHLZ’10, H’10, Ailon’12, …]

Version Spaces
• X – feature/instance space; distr. D over X; 𝑐∗ target fnc
• Fix hypothesis space H.
Definition (Mitchell’82)
Assume realizable case: c∗ ∈ H.
Given a set of labeled examples (x1,y1), …, (xml,yml), with yi = c∗(xi).
Version space of H: part of H consistent with labels so far, i.e., h ∈ VS(H) iff h(xi) = c∗(xi) for all i ∈ {1, …, ml}.

E.g., data lies on a circle in R2, H = homogeneous linear separators. (Figure: current version space; region of disagreement in data space.)

Version Spaces. Region of Disagreement
Definition (CAL’92)
Version space: part of H consistent with labels so far.
Region of disagreement = part of data space about which there is still some uncertainty (i.e., disagreement within the version space):
x ∈ DIS(VS(H)) iff ∃ h1, h2 ∈ VS(H) with h1(x) ≠ h2(x).
E.g., data lies on a circle in R2, H = homogeneous linear separators. (Figure: current version space; region of disagreement in data space.)

Disagreement Based Active Learning [CAL92]
Algorithm:
Pick a few points at random from the current region of uncertainty and query their labels.
Stop when the region of uncertainty is small.
Note: it is active since we do not waste labels by querying in regions of space where we are certain about the labels.

Disagreement Based Active Learning [CAL92]
Algorithm:
Query for the labels of a few random xi’s.
Let H1 be the current version space.
For t = 1, …,
• Pick a few points at random from the current region of disagreement DIS(Ht) and query their labels.
• Let Ht+1 be the new version space.
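For intuition, here is a minimal sketch of this loop for the threshold class on [0, 1], where the version space is an interval of thresholds and the region of disagreement is exactly the set of unlabeled points inside it (illustrative Python; `oracle` is a hypothetical labeling function, realizable case):

```python
import random

def cal_threshold(pool, oracle, batch=3, eps=1e-3, seed=0):
    """CAL-style disagreement-based active learning for thresholds
    h_w(x) = 1(x >= w) on [0, 1] (realizable case).

    The version space is an interval of thresholds (lo, hi]; the
    region of disagreement is the set of pool points strictly
    inside (lo, hi)."""
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0
    queries = 0
    while hi - lo > eps:
        # region of disagreement: points whose label is still uncertain
        dis = [x for x in pool if lo < x < hi]
        if not dis:
            break
        # query a few random points from the region of disagreement
        for x in rng.sample(dis, min(batch, len(dis))):
            queries += 1
            if oracle(x) == 1:
                hi = min(hi, x)   # target threshold w must be <= x
            else:
                lo = max(lo, x)   # target threshold w must be > x
    return (lo + hi) / 2, queries

# Hypothetical usage: 2000 unlabeled points, target threshold 0.37.
gen = random.Random(2)
pool = [gen.random() for _ in range(2000)]
w_hat, q = cal_threshold(pool, lambda x: 1 if x >= 0.37 else 0)
```

Labels are never requested for points outside the disagreement interval, so the total number of queries stays far below the pool size.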

Region of uncertainty [CAL92]
• Current version space: part of C consistent with labels so far.
• “Region of uncertainty” = part of data space about which there is still some uncertainty (i.e., disagreement within the version space).
(Figure: current version space; region of uncertainty in data space.)

Region of uncertainty [CAL92]
After more label queries: a new (smaller) version space, and a new region of disagreement in data space.

How about the agnostic case, where the target might not belong to H?

A2 Agnostic Active Learner [BBL’06]
Algorithm:
Let H1 = H.
For t = 1, …,
• Pick a few points at random from the current region of disagreement DIS(Ht) and query their labels.
• Throw out hypotheses if you are statistically confident they are suboptimal.
Careful use of generalization bounds avoids the sampling bias!
(Figure: current version space; region of disagreement.)
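A rough sketch of the A2 idea for thresholds on [0, 1] over a finite grid of hypotheses (illustrative Python, not label-optimal; the Hoeffding-style slack, the grid, and the noisy `oracle` are assumptions made for this sketch):

```python
import math
import random

def a2_thresholds(oracle, grid_size=200, delta=0.05, batch=250,
                  max_rounds=30, seed=0):
    """A2-style sketch for thresholds h_w(x) = 1(x >= w) on [0, 1]
    with label noise.

    Repeatedly: sample labels only inside the current region of
    disagreement, then discard hypotheses whose empirical error is
    larger than the best one's by more than a Hoeffding-style slack
    (so we are statistically confident they are suboptimal)."""
    rng = random.Random(seed)
    H = [i / grid_size for i in range(grid_size + 1)]  # candidate thresholds
    for _ in range(max_rounds):
        lo, hi = min(H), max(H)        # region of disagreement = (lo, hi)
        if hi - lo < 1.0 / grid_size:
            break
        S = [(x, oracle(x)) for x in (rng.uniform(lo, hi) for _ in range(batch))]
        slack = math.sqrt(math.log(2 * len(H) / delta) / (2 * len(S)))
        err = {w: sum((x >= w) != y for x, y in S) / len(S) for w in H}
        best = min(err.values())
        # keep only hypotheses not confidently suboptimal
        H = [w for w in H if err[w] - slack <= best + slack]
    return sum(H) / len(H)

# Hypothetical usage: target threshold 0.37, labels flipped w.p. 0.1.
noise = random.Random(1)
def noisy_oracle(x):
    y = 1 if x >= 0.37 else 0
    return 1 - y if noise.random() < 0.1 else y

w_hat = a2_thresholds(noisy_oracle)
```

The key safety property illustrated here: hypotheses are eliminated only with statistical confidence, so the near-optimal threshold survives despite the noisy labels.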

When Active Learning Helps. Agnostic case
A2: the first active learning algorithm robust to noise.
[Balcan, Beygelzimer, Langford, ICML’06] [Balcan, Beygelzimer, Langford, JCSS’08]
“Region of disagreement” style: pick a few points at random from the current region of disagreement, query their labels, throw out hypotheses if you are statistically confident they are suboptimal.
Guarantees for A2 [BBL’06,’08]:
• It is safe (never worse than passive learning) & gives exponential improvements.
• C = thresholds, low noise: exponential improvement.
• C = homogeneous linear separators in Rd, D = uniform, low noise: only d² log(1/ε) labels.
A lot of subsequent work.
[Hanneke’07, DHM’07, Wang’09 , Fridman’09, Kolt10, BHW’08, BHLZ’10, H’10, Ailon’12, …]

General guarantees for A2 Agnostic Active Learner
“Disagreement based”: pick a few points at random from the current region of uncertainty, query their labels, throw out hypotheses if you are statistically confident they are suboptimal. [BBL’06]
Guarantees for A2 [Hanneke’07], stated in terms of the disagreement coefficient: how quickly the region of disagreement collapses as we get closer and closer to the optimal classifier.
(The slide gave label-complexity bounds for the realizable case and for linear separators under the uniform distribution.)

Disagreement Based Active Learning
“Disagreement based” algos: query points from the current region of disagreement, throw out hypotheses when statistically confident they are suboptimal.
• Generic (any class), adversarial label noise.
• Computationally efficient for classes of small VC-dimension.
• Still, can be suboptimal in label complexity & computationally inefficient in general.
Lots of subsequent work trying to make it more efficient computationally and more aggressive too: [Hanneke07, DasguptaHsuMonteleoni’07, Wang’09, Fridman’09, Koltchinskii10, BHW’08, BeygelzimerHsuLangfordZhang’10, Hsu’10, Ailon’12, …]

Other Interesting AL Techniques Used in Practice
Interesting open question to analyze under what conditions they are successful.

Density-Based Sampling
Centroid of largest unsampled cluster
[Jaime G. Carbonell]

Uncertainty Sampling
Closest to decision boundary (Active SVM)
[Jaime G. Carbonell]

Maximal Diversity Sampling
Maximally distant from labeled x’s
[Jaime G. Carbonell]
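One way to read this criterion is as a farthest-point heuristic: repeatedly pick the pool point whose distance to the nearest labeled (or already selected) point is largest. A small NumPy sketch (the example points are hypothetical):

```python
import numpy as np

def diversity_sample(pool, labeled_idx, k):
    """Maximal diversity sampling (farthest-point heuristic): pick k
    points, each maximally distant from all labeled/selected x's."""
    chosen = list(labeled_idx)
    for _ in range(k):
        # distance from every pool point to its nearest chosen point
        diffs = pool[:, None, :] - pool[chosen][None, :, :]
        d = np.min(np.linalg.norm(diffs, axis=2), axis=1)
        d[chosen] = -1.0              # never re-pick a chosen point
        chosen.append(int(np.argmax(d)))
    return chosen[len(labeled_idx):]

# Hypothetical pool: three nearby points and one far outlier.
pool = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
picked = diversity_sample(pool, labeled_idx=[0], k=2)
```

The far outlier is selected first, then the heuristic falls back to the remaining points, spreading the label budget across the space.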

Ensemble-Based Possibilities
Uncertainty + Diversity criteria
Density + uncertainty criteria
[Jaime G. Carbonell]

Graph-based Active and Semi-Supervised Methods


Graph-based Methods
Assume we are given a pairwise similarity fnc and that very similar examples probably have the same label.
If we have a lot of labeled data, this suggests a Nearest-Neighbor type of algorithm.
If we have a lot of unlabeled data, perhaps we can use them as “stepping stones”.
E.g., handwritten digits [Zhu07].

Graph-based Methods
Idea: construct a graph with edges between very similar examples.
Unlabeled data can help “glue” the objects of the same class together.

Graph-based Methods
Often a transductive approach (given L + U, output predictions on U); allowed to output any labeling of L ∪ U.
Main Idea:
• Construct a graph G with edges between very similar examples.
• (Might have also glued together in G examples of different classes.)
• Run a graph partitioning algorithm to separate the graph into pieces.
Several methods:
– Minimum/Multiway cut [Blum & Chawla ’01]
– Minimum “soft-cut” [Zhu, Ghahramani & Lafferty ’03]
– Spectral partitioning
– …

SSL using soft cuts
[ZhuGhahramaniLafferty’03]
Solve for a label function f(x) ∈ [0,1] to minimize:
J(f) = Σ_{edges (i,j)} wij (f(xi) − f(xj))² + λ Σ_{xi ∈ L} (f(xi) − yi)²
The first term makes similar nodes get similar labels (weighted by similarity); the second encourages agreement with the given labels (agreement is not strictly enforced).
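Because J(f) is quadratic, its minimizer can be computed in closed form: setting the gradient to zero gives the linear system (L + λS) f = λSy, where L = D − W is the graph Laplacian and S is the diagonal indicator of labeled nodes. A small NumPy sketch (the path-graph example is hypothetical):

```python
import numpy as np

def soft_cut_labels(W, labeled, y, lam=10.0):
    """Minimize J(f) = sum_{i<j} W[i,j] (f_i - f_j)^2
                       + lam * sum_{i in labeled} (f_i - y_i)^2.

    Setting the gradient to zero gives the linear system
    (L + lam*S) f = lam*S y, where L = D - W is the graph
    Laplacian and S is the diagonal indicator of labeled nodes."""
    n = W.shape[0]
    Lap = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    s = np.zeros(n)                            # labeled-node indicator
    Y = np.zeros(n)                            # observed labels
    for i, yi in zip(labeled, y):
        s[i] = 1.0
        Y[i] = yi
    return np.linalg.solve(Lap + lam * np.diag(s), lam * s * Y)

# Hypothetical example: a path graph 0-1-2-3-4 with the endpoints
# labeled 1 and 0; the soft labels interpolate between them.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
f = soft_cut_labels(W, labeled=[0, 4], y=[1.0, 0.0])
```

Note the soft clamp: the labeled endpoints end up near, but not exactly at, their observed labels, since agreement is only encouraged via λ.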

Active learning with label propagation (using soft-cuts)
How to choose which node to query?

Active learning with label propagation (using soft-cuts)
One natural idea: query the most uncertain point.
But this point has only one edge, so the query won't have much impact!
(Even worse: a completely isolated node.)

Active learning with label propagation (using soft-cuts)
Instead, use a 1-step-lookahead heuristic:
• For a node with current soft label p, assume that querying it will return answer 1 with probability p and answer 0 with probability 1 − p.
• Compute the “average confidence” after re-running the soft-cut in each case:
p ⋅ (1/n) Σ_i max(f¹(xi), 1 − f¹(xi)) + (1 − p) ⋅ (1/n) Σ_i max(f⁰(xi), 1 − f⁰(xi)),
where f¹ (resp. f⁰) is the solution after adding the queried node with label 1 (resp. 0).
• Query the node for which this quantity is highest (you want to be more confident on average).
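This heuristic can be sketched directly (illustrative Python; a small soft-cut solver is included inline so the snippet is self-contained, and the two-triangle graph is a hypothetical example):

```python
import numpy as np

def soft_cut(W, labeled, y, lam=10.0):
    """Soft-cut label propagation: solve (L + lam*S) f = lam*S y."""
    n = W.shape[0]
    Lap = np.diag(W.sum(axis=1)) - W
    s = np.zeros(n)
    Y = np.zeros(n)
    for i, yi in zip(labeled, y):
        s[i] = 1.0
        Y[i] = yi
    return np.linalg.solve(Lap + lam * np.diag(s), lam * s * Y)

def lookahead_query(W, labeled, y, lam=10.0):
    """1-step lookahead: pick the unlabeled node maximizing the
    expected average confidence after the (hypothetical) query."""
    n = W.shape[0]
    f = soft_cut(W, labeled, y, lam)
    best, best_score = None, -1.0
    for i in range(n):
        if i in labeled:
            continue
        p = min(max(f[i], 0.0), 1.0)          # current soft label of i
        f1 = soft_cut(W, labeled + [i], y + [1.0], lam)  # if answer is 1
        f0 = soft_cut(W, labeled + [i], y + [0.0], lam)  # if answer is 0
        score = (p * np.mean(np.maximum(f1, 1 - f1))
                 + (1 - p) * np.mean(np.maximum(f0, 1 - f0)))
        if score > best_score:
            best, best_score = i, score
    return best

# Hypothetical example: two triangles joined by a bridge edge, with one
# node of each triangle already labeled.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
best = lookahead_query(W, labeled=[0, 5], y=[1.0, 0.0])
```

Each candidate query costs one extra linear solve per possible answer, which is why this is a 1-step (rather than deeper) lookahead.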

Active Learning with Label Propagation in Practice
• Does well for Video Segmentation (Fathi-Balcan-Ren-Rehg, BMVC 11).

What You Should Know
• Active learning could be really helpful, could provide exponential improvements in label complexity (both theoretically and practically)!
• Common heuristics (e.g., those based on uncertainty sampling) need to be applied very carefully due to sampling bias.
• Safe Disagreement Based Active Learning Schemes.
• Understand how they operate precisely in noise-free scenarios.