Semi-Supervised Learning
Maria-Florina Balcan
03/30/2015

Readings:
• Semi-Supervised Learning. Jerry Zhu. Encyclopedia of Machine Learning, 2010.
• Combining Labeled and Unlabeled Data with Co-Training. Avrim Blum, Tom Mitchell. COLT 1998.
Fully Supervised Learning

• Data Source: distribution D on X.
• Expert / Oracle labels examples according to the target c* : X → Y.
• Labeled examples: (x1, c*(x1)), …, (xm, c*(xm)), i.e., Sl = {(x1,y1), …, (xml,yml)} with xi drawn i.i.d. from D and yi = c*(xi).
• Learning Algorithm: outputs a hypothesis h : X → Y (e.g., a decision tree splitting on x1 > 5, then x6 > 2, predicting +1/−1 at the leaves).
• Goal: h has small error over D,
  errD(h) = Pr_{x∼D}[h(x) ≠ c*(x)].
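Not part of the original slides: as a quick worked illustration of the definition of errD(h), the true error can be estimated by the empirical error on a fresh i.i.d. sample from D. The functions h and c_star below are arbitrary stand-ins.

```python
import random

def empirical_error(h, c_star, sample):
    """Estimate err_D(h) = Pr_{x~D}[h(x) != c*(x)] on an i.i.d. sample drawn from D."""
    return sum(1 for x in sample if h(x) != c_star(x)) / len(sample)

# Toy usage: D uniform on [0, 1], target is a threshold at 0.5, hypothesis at 0.6.
c_star = lambda x: 1 if x >= 0.5 else -1
h = lambda x: 1 if x >= 0.6 else -1
sample = [random.random() for _ in range(100_000)]
print(empirical_error(h, c_star, sample))  # roughly 0.1, the probability mass of [0.5, 0.6)
```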
Two Core Aspects of Supervised Learning

• Algorithm Design (Computation). How to optimize? Automatically generate rules that do well on the observed data.
  – E.g.: Naïve Bayes, logistic regression, SVM, Adaboost, etc.
• Confidence Bounds, Generalization (from labeled data). Confidence that a rule will be effective on future data.
  – E.g.: VC-dimension, Rademacher complexity, margin-based bounds, etc.
Classic Paradigm Insufficient Nowadays

Modern applications: massive amounts of raw data (protein sequences, billions of webpages, images). Only a tiny fraction can be annotated by human experts.
Modern ML: New Learning Approaches

Modern applications: massive amounts of raw data. We want techniques that best utilize the data while minimizing the need for expert/human intervention.

Paradigms where there has been great progress:
• Semi-supervised Learning, (Inter)active Learning.
Semi-Supervised Learning

• Data Source: distribution D on X.
• Unlabeled examples: Su = {x1, …, xmu}, drawn i.i.d. from D.
• Expert / Oracle labels only a small subset: Sl = {(x1,y1), …, (xml,yml)}, with xi drawn i.i.d. from D and yi = c*(xi).
• Algorithm outputs a classifier h.
• Goal: h has small error over D, errD(h) = Pr_{x∼D}[h(x) ≠ c*(x)].
Semi-supervised Learning

• Major topic of research in ML.
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
  – Transductive SVM [Joachims '99]
  – Co-training [Blum & Mitchell '98]
  – Graph-based methods [Blum & Chawla '01], [Zhu, Ghahramani & Lafferty '03]
• Both widespread applications and solid foundational understanding: test of time awards at ICML, workshops [ICML '03, ICML '05, …].
• Books:
  – Semi-Supervised Learning. O. Chapelle, B. Schölkopf, A. Zien (eds). MIT Press, 2006.
  – Introduction to Semi-Supervised Learning. X. Zhu, A. Goldberg. Morgan & Claypool, 2009.
• Today: discuss these methods. They all exploit unlabeled data in different, very creative ways.
Semi-supervised learning: no querying, just lots of additional unlabeled data.
A bit puzzling at first: unlabeled data is missing the most important information, the labels. How can it help us in substantial ways?
Key Insight
Unlabeled data useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
Semi-supervised SVM [Joachims '99]

Margin-based regularity: the target goes through low-density regions (large margin).
• Assume we are looking for a linear separator.
• Belief: there should exist one with large separation.

(Figure: the same labeled +/− points admit many separators; "SVM, labeled data only" picks one of them, while the "Transductive SVM" uses the unlabeled data to pick the large-margin, low-density separator.)
Transductive Support Vector Machines [Joachims '99]

Optimize for the separator with large margin w.r.t. labeled and unlabeled data.

Input: Sl = {(x1,y1), …, (xml,yml)}, Su = {x1, …, xmu}

argmin_w ‖w‖² s.t.:
• yi (w · xi) ≥ 1, for all i ∈ {1, …, ml}
• yu (w · xu) ≥ 1, for all u ∈ {1, …, mu}
• yu ∈ {−1, 1} for all u ∈ {1, …, mu}

Find a labeling of the unlabeled sample and w s.t. w separates both labeled and unlabeled data with maximum margin.

(Figure: labeled + and − points plus many unlabeled points, marked 0, with the separator w and margin boundaries w · x = 1 and w · x = −1 passing through a low-density region.)
Transductive Support Vector Machines [Joachims '99]

Optimize for the separator with large margin w.r.t. labeled and unlabeled data, allowing slack:

Input: Sl = {(x1,y1), …, (xml,yml)}, Su = {x1, …, xmu}

argmin_w ‖w‖² + C Σi ξi + C Σu ξu s.t.:
• yi (w · xi) ≥ 1 − ξi, for all i ∈ {1, …, ml}
• yu (w · xu) ≥ 1 − ξu, for all u ∈ {1, …, mu}
• yu ∈ {−1, 1} for all u ∈ {1, …, mu}

Find a labeling of the unlabeled sample and w s.t. w separates both labeled and unlabeled data with maximum margin.

NP-hard… The problem is convex only after you have guessed the labels yu, and there are too many possible guesses.
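To make the combinatorial difficulty concrete, here is a brute-force sketch (my own illustration, not from the slides): for each of the 2^mu guesses of the unlabeled labels, the inner problem is an ordinary convex SVM, so the overall cost blows up in the number of unlabeled points. It assumes scikit-learn's SVC as the inner solver and is only usable for a tiny Su.

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC  # ordinary soft-margin SVM used as the inner convex solver

def tsvm_brute_force(X_l, y_l, X_u, C=1.0):
    """Enumerate every labeling of the unlabeled points (2^|X_u| guesses); for each guess,
    fit a linear SVM on labeled + guessed data and keep the labeling with the smallest
    ||w||^2 + C * sum of slacks. Assumes y_l is in {-1, +1} and contains both classes."""
    best_obj, best_clf, best_yu = np.inf, None, None
    for y_u in product([-1, 1], repeat=len(X_u)):
        X = np.vstack([X_l, X_u])
        y = np.concatenate([y_l, np.array(y_u)])
        clf = SVC(kernel="linear", C=C).fit(X, y)
        slacks = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
        w = clf.coef_.ravel()
        obj = w @ w + C * slacks.sum()
        if obj < best_obj:
            best_obj, best_clf, best_yu = obj, clf, np.array(y_u)
    return best_obj, best_clf, best_yu
```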
Transductive Support Vector Machines

Optimize for the separator with large margin w.r.t. labeled and unlabeled data.

Heuristic (Joachims), high-level idea:
• First maximize the margin over the labeled points only.
• Use the resulting separator to give initial labels to the unlabeled points.
• Try flipping labels of unlabeled points to see if doing so can increase the margin.
Keep going until no more improvements. Finds a locally-optimal solution (a simplified sketch follows below).
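A minimal local-search sketch in the same spirit (my own simplification; Joachims' actual TSVM implementation is more sophisticated, e.g., it anneals the weight on the unlabeled slack terms and swaps pairs of labels): train on the labeled points, label the unlabeled points, then greedily flip single unlabeled labels whenever that lowers the joint objective.

```python
import numpy as np
from sklearn.svm import SVC  # linear soft-margin SVM as the inner solver

def _objective(clf, X, y, C):
    # ||w||^2 + C * sum of hinge slacks, the transductive objective from the slides
    slacks = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
    w = clf.coef_.ravel()
    return w @ w + C * slacks.sum()

def tsvm_local_search(X_l, y_l, X_u, C=1.0):
    """Assumes y_l is in {-1, +1} and contains both classes."""
    clf = SVC(kernel="linear", C=C).fit(X_l, y_l)            # maximize margin on labeled points
    y_u = np.where(clf.decision_function(X_u) >= 0, 1, -1)   # initial labels for unlabeled points
    X = np.vstack([X_l, X_u])
    improved = True
    while improved:                                           # greedy single-label flips
        improved = False
        y = np.concatenate([y_l, y_u])
        best = _objective(SVC(kernel="linear", C=C).fit(X, y), X, y, C)
        for j in range(len(y_u)):
            y_u[j] = -y_u[j]
            y = np.concatenate([y_l, y_u])
            cand = _objective(SVC(kernel="linear", C=C).fit(X, y), X, y, C)
            if cand < best:
                best, improved = cand, True                   # keep the flip
            else:
                y_u[j] = -y_u[j]                              # undo the flip
    final = SVC(kernel="linear", C=C).fit(X, np.concatenate([y_l, y_u]))
    return final, y_u
```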
Experiments [Joachims99]
Transductive Support Vector Machines

(Figure: a helpful distribution, where the large-margin belief is satisfied and highly compatible with the target, vs. non-helpful distributions, where the margin is not satisfied or where there are 1/γ² clusters with all partitions separable by large margin.)
Co-training [Blum & Mitchell '98]

Different type of underlying regularity assumption: consistency or agreement between parts.

Co-training: Self-consistency

Agreement between two parts: co-training [Blum & Mitchell '98].
– Examples contain two sufficient sets of features: x = ⟨x1, x2⟩.
– Belief: the parts are consistent, i.e., ∃ c1, c2 s.t. c1(x1) = c2(x2) = c*(x).

For example, if we want to classify web pages x = ⟨x1, x2⟩ as faculty member homepage or not:
– x1: text info on the page itself (e.g., "Prof. Avrim Blum").
– x2: link info, i.e., text on hyperlinks pointing to the page (e.g., "My Advisor").
Iterative Co-Training

Idea: use the small labeled sample to learn initial rules.
• E.g., "my advisor" pointing to a page is a good indicator that it is a faculty home page.
• E.g., "I am teaching" on a page is a good indicator that it is a faculty home page.

Idea: use unlabeled data to propagate the learned information. Look for unlabeled examples ⟨x1, x2⟩ where one rule is confident and the other is not; have it label the example for the other.

Training 2 classifiers, one on each type of info, and using each to help train the other.
Iterative Co-Training

Works by using unlabeled data to propagate learned information (a minimal code sketch follows below).
• Have learning algorithms A1, A2, one on each of the two views X1, X2.
• Use the labeled data to learn two initial hypotheses h1, h2.
• Repeat:
  – Look through the unlabeled data to find examples where one of the hi is confident but the other is not.
  – Have the confident hi label the example for algorithm A_{3−i}.
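A minimal co-training sketch (my own illustrative code, not the Blum-Mitchell experimental setup): two naive Bayes classifiers, one per view; each round, unlabeled examples on which one view is confident and the other is not get labeled by the confident view and added to the shared training pool. The confidence_threshold parameter and the choice of GaussianNB are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB  # stand-ins for the learning algorithms A1, A2

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, confidence_threshold=0.95):
    """Each example has two views (rows of X1, X2). Returns the two trained classifiers."""
    X1_tr, X2_tr, y_tr = X1_l.copy(), X2_l.copy(), y_l.copy()
    pool = np.arange(len(X1_u))                        # indices of still-unlabeled examples
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        h1.fit(X1_tr, y_tr)
        h2.fit(X2_tr, y_tr)
        if len(pool) == 0:
            break
        p1 = h1.predict_proba(X1_u[pool]).max(axis=1)   # confidence of each view
        p2 = h2.predict_proba(X2_u[pool]).max(axis=1)
        # one view confident, the other not: move the example, labeled by the confident view
        move = (np.maximum(p1, p2) >= confidence_threshold) & (np.minimum(p1, p2) < confidence_threshold)
        if not move.any():
            break
        idx = pool[move]
        use_h1 = p1[move] >= p2[move]
        new_y = np.where(use_h1, h1.predict(X1_u[idx]), h2.predict(X2_u[idx]))
        X1_tr = np.vstack([X1_tr, X1_u[idx]])
        X2_tr = np.vstack([X2_tr, X2_u[idx]])
        y_tr = np.concatenate([y_tr, new_y])
        pool = pool[~move]
    return h1, h2
```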
Original Application: Webpage classification

(Figure: sample run with 12 labeled examples and 1,000 unlabeled examples.)
Iterative Co-Training: A Simple Example (Learning Intervals)

(Figure: two interval targets c1, c2, one per view, with a few labeled examples (+/−) and many unlabeled examples. Use the labeled data to learn the initial intervals h11 and h21; then use the unlabeled data to bootstrap, expanding them to h12 and h22 as each view labels points for the other.)
Expansion, Examples: Learning Intervals

Consistency: zero probability mass in the regions where the two interval targets would disagree.

(Figure: an expanding distribution D+ vs. a non-expanding (non-helpful) distribution D+, shown via regions S1, S2 under the interval targets c1 and c2.)
Co-training: Theoretical Guarantees

What properties do we need for co-training to work well? We need assumptions about:
1. the underlying data distribution,
2. the learning algorithms on the two sides.

[Blum & Mitchell, COLT '98]
1. Independence given the label.
2. Algorithm for learning from random classification noise.

[Balcan, Blum, Yang, NIPS 2004]
1. Distributional expansion.
2. Algorithm for learning from positive data only.

(Figure: the two views, View 1 and View 2, with their positive and negative regions D1+, D1−, D2+, D2−.)
Co-training [BM'98]

Say that h1 is a weakly-useful predictor if

Pr[h1(x) = 1 | c1(x) = 1] > Pr[h1(x) = 1 | c1(x) = 0] + γ,

i.e., it has higher probability of saying positive on a true positive than on a true negative, by at least some gap γ.

Say we have enough labeled data to produce such a starting point.

Theorem: if C is learnable from random classification noise, then we can use a weakly-useful h1 plus unlabeled data to create a strong learner, under independence given the label.
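Not from the slides: the gap in the weak-usefulness condition can be checked empirically; h1 and c1 below are assumed to be 0/1-valued functions of the view-1 part of an example.

```python
def weak_usefulness_gap(h1, c1, view1_examples):
    """Estimate Pr[h1(x1)=1 | c1(x1)=1] - Pr[h1(x1)=1 | c1(x1)=0];
    h1 is weakly useful if this gap exceeds some gamma > 0."""
    positives = [x for x in view1_examples if c1(x) == 1]
    negatives = [x for x in view1_examples if c1(x) == 0]
    p_pos = sum(h1(x) for x in positives) / len(positives)
    p_neg = sum(h1(x) for x in negatives) / len(negatives)
    return p_pos - p_neg
```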
Co-training: Benefits in Principle

[BB'05]: Under independence given the label, any pair ⟨h1, h2⟩ with high agreement over the unlabeled data must be close to one of:

⟨c1, c2⟩, ⟨¬c1, ¬c2⟩, ⟨true, true⟩, or ⟨false, false⟩.

(Figure: e.g., a pair h1, h2 that each cut their view differently from the target; because of independence given the label, we will see a lot of disagreement between them on unlabeled data.)
Co-training / Multi-view SSL: Direct Optimization of Agreement

Input: Sl = {(x1,y1), …, (xml,yml)}, Su = {x1, …, xmu}

argmin_{h1,h2}  Σ_{v=1}^{2} Σ_{i=1}^{ml} ℓ(hv(xi), yi)  +  C Σ_{i=1}^{mu} d(h1(xi), h2(xi))

• The first term asks that each of h1, h2 has small labeled error.
• The second term is a regularizer that encourages agreement over the unlabeled data (d penalizes pairs of predictions that disagree).
• ℓ(h(xi), yi) is a loss function, e.g., square loss ℓ(h(xi), yi) = (yi − h(xi))², or 0/1 loss ℓ(h(xi), yi) = 1[yi ≠ h(xi)].

E.g., P. Bartlett, D. Rosenberg, AISTATS 2007; K. Sridharan, S. Kakade, COLT 2008.
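A minimal sketch of this objective (my own, not from the cited papers): two linear predictors, square loss on the labeled data in each view, plus a squared-difference regularizer that penalizes disagreement on the unlabeled data, minimized by plain gradient descent. Step size and number of steps are arbitrary.

```python
import numpy as np

def multiview_agreement_fit(X1_l, X2_l, y, X1_u, X2_u, C=1.0, lr=1e-3, steps=5000):
    """Minimize sum_i (w1.x1_i - y_i)^2 + sum_i (w2.x2_i - y_i)^2
             + C * sum_u (w1.x1_u - w2.x2_u)^2   by gradient descent."""
    w1 = np.zeros(X1_l.shape[1])
    w2 = np.zeros(X2_l.shape[1])
    for _ in range(steps):
        # gradients of the labeled square-loss terms
        g1 = 2 * X1_l.T @ (X1_l @ w1 - y)
        g2 = 2 * X2_l.T @ (X2_l @ w2 - y)
        # gradients of the disagreement penalty on unlabeled data
        diff = X1_u @ w1 - X2_u @ w2
        g1 += 2 * C * X1_u.T @ diff
        g2 -= 2 * C * X2_u.T @ diff
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2
```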
Original Application: Webpage classification

(Figure: sample run with 12 labeled examples and 1,000 unlabeled examples.)
Many Other Applications

• E.g., [Levin, Viola, Freund '03]: identifying objects in images; two different kinds of preprocessing give the two views.
• E.g., [Collins & Singer '99]: named-entity extraction ("I arrived in London yesterday").
• Central to NELL (CMU's Never-Ending Language Learner)!
Similarity-Based Regularity
[Blum & Chawla '01], [Zhu, Ghahramani & Lafferty '03]

Graph-based Methods
• Assume we are given a pairwise similarity function and that very similar examples probably have the same label.
• If we have a lot of labeled data, this suggests a nearest-neighbor type of algorithm.
• If we have a lot of unlabeled data, perhaps we can use them as "stepping stones".
• E.g., handwritten digits [Zhu '07].
Graph-based Methods

Idea: construct a graph with edges between very similar examples. Unlabeled data can help "glue" the objects of the same class together.

E.g., Person Identification in Webcam Images: An Application of Semi-Supervised Learning. Balcan, Blum, Choi, Lafferty, Pantano, Rwebangira, Zhu. ICML 2005 Workshop on Learning with Partially Classified Training Data.
Graph-based Methods

Often a transductive approach (given L + U, output predictions on U): we are allowed to output any labeling of L ∪ U.

Main idea:
• Construct a graph G with edges between very similar examples.
• Unlabeled data can help "glue" objects of the same class together, but G might have also glued together examples of different classes.
• Run a graph partitioning algorithm to separate the graph into pieces.

Several methods:
– Minimum/multiway cut [Blum & Chawla '01]
– Minimum "soft-cut" [Zhu, Ghahramani & Lafferty '03]
– Spectral partitioning
– …
What You Should Know

• Unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
• Different types of algorithms (based on different beliefs):
  – Transductive SVM [Joachims '99]
  – Co-training [Blum & Mitchell '98]
  – Graph-based methods [Blum & Chawla '01], [Zhu, Ghahramani & Lafferty '03]
Additional Material on Graph-Based Methods

Minimum/Multiway Cut [Blum & Chawla '01]

Objective: solve for labels on the unlabeled points that minimize the total weight of edges whose endpoints have different labels (i.e., the total weight of "bad" edges).

• If there are just two labels, this can be solved efficiently using max-flow / min-cut algorithms (a code sketch follows below):
  – Create a super-source s connected by edges of weight ∞ to all + labeled points.
  – Create a super-sink t connected by edges of weight ∞ to all − labeled points.
  – Find a minimum-weight s-t cut.
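A code sketch of the two-label case (mine; the graph format, an edge list of (u, v, weight) triples, is an assumption). It uses networkx's minimum_cut; edges from the super-source and to the super-sink carry no capacity attribute, which networkx treats as infinite capacity.

```python
import networkx as nx

def mincut_labels(edges, pos_labeled, neg_labeled):
    """edges: (u, v, weight) similarity edges over all labeled + unlabeled points.
    Returns the set of points that end up labeled + by the minimum s-t cut."""
    G = nx.DiGraph()
    for u, v, w in edges:                 # similarity edges are symmetric
        G.add_edge(u, v, capacity=w)
        G.add_edge(v, u, capacity=w)
    for u in pos_labeled:
        G.add_edge("s", u)                # super-source -> + labeled points, infinite capacity
    for v in neg_labeled:
        G.add_edge(v, "t")                # - labeled points -> super-sink, infinite capacity
    cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
    return source_side - {"s"}

# Toy usage: path a - b - c - d with a labeled + and d labeled -.
print(mincut_labels([("a", "b", 2.0), ("b", "c", 0.5), ("c", "d", 2.0)], ["a"], ["d"]))
# -> {'a', 'b'}: the cut severs the cheapest edge (b, c)
```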
Minimum "Soft Cut" [Zhu, Ghahramani & Lafferty '03]

Objective: solve for a probability vector over labels, fi, on each unlabeled point i (labeled points get coordinate vectors in the direction of their known label, e.g., (0 1 0 0 0 0 0 0 0 0)).

• Minimize Σ_{e=(i,j)} we ‖fi − fj‖², where ‖fi − fj‖ is Euclidean distance.
• Can be done efficiently by solving a set of linear equations (a code sketch follows below).
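A sketch of the linear-algebra solution (mine, using the standard harmonic-function form of this objective): with the graph Laplacian L = D − W split into labeled and unlabeled blocks, the minimizer with labeled points clamped satisfies L_uu F_u = W_ul Y_l. The matrix names follow that convention, not the slides.

```python
import numpy as np

def soft_cut_labels(W, labeled_idx, Y_l):
    """W: (n, n) symmetric similarity matrix; labeled_idx: indices of the labeled points;
    Y_l: (n_labeled, k) one-hot label vectors for those points.
    Returns (n_unlabeled, k) vectors f_i minimizing sum_{(i,j)} w_ij ||f_i - f_j||^2."""
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W                        # graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    return np.linalg.solve(L_uu, W_ul @ Y_l)              # harmonic solution

# Toy usage: 4 points on a path, endpoints labeled with classes 0 and 1.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(soft_cut_labels(W, np.array([0, 3]), np.eye(2)))    # interior points interpolate: (2/3, 1/3), (1/3, 2/3)
```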
How to Create the Graph

• Empirically, the following works well (a code sketch follows below):
  1. Compute the distance between i and j.
  2. For each i, connect it to its k nearest neighbors; k very small, but still large enough to connect the graph.
  3. Optionally put weights on (only) those edges.
  4. Tune.
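A sketch of steps 1-3 (mine): pairwise Euclidean distances, a symmetrized kNN adjacency, and, optionally, Gaussian weights on exactly those edges. The Gaussian weighting with bandwidth sigma is a common choice I am assuming; the slide only says to tune.

```python
import numpy as np
from scipy.spatial.distance import cdist  # step 1: pairwise distances

def knn_graph(X, k=5, sigma=None):
    """Return a symmetric (n, n) weight matrix with w_ij > 0 iff i is among j's kNN or vice versa."""
    D = cdist(X, X)                                  # step 1: distance between every i, j
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]             # step 2: k nearest neighbors, skipping i itself
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                           # keep the edge if either endpoint chose it
    if sigma is not None:                            # step 3 (optional): weights on only those edges
        W = np.where(W > 0, np.exp(-D**2 / sigma**2), 0.0)
    return W
```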