Semi-Supervised Learning
Xiaojin Zhu, University of Wisconsin-Madison
Synonyms: Learning from labeled and unlabeled data, transductive learning
Definition
Semi-supervised learning uses both labeled and unlabeled data to perform an otherwise supervised learning or unsupervised learning task.
In the former case, there is a distinction between inductive semi-supervised learning and transductive learning. In inductive semi-supervised learning, the learner has both labeled training data {(x_i, y_i)}_{i=1}^{l} ~iid p(x, y) and unlabeled training data {x_i}_{i=l+1}^{l+u} ~iid p(x), and learns a predictor f : X → Y, f ∈ F, where F is the hypothesis space. Here x ∈ X is an input instance, y ∈ Y its target label (discrete for classification or continuous for regression), p(x, y) the unknown joint distribution and p(x) its marginal, and typically l ≪ u. The goal is to learn a predictor that predicts future test data better than the predictor learned from the labeled training data alone. In transductive learning, the setting is the same except that one is solely interested in the predictions on the unlabeled training data {x_i}_{i=l+1}^{l+u}, without any intention to generalize to future test data.
In the latter case, an unsupervised learning task is enhanced by labeled
data. For example, in semi-supervised clustering (a.k.a. constrained clustering) one may have a few must-links (two instances must be in the same cluster) and cannot-links (two instances cannot be in the same cluster) in addition to the unlabeled instances to be clustered; in semi-supervised dimensionality reduction one might have the target low-dimensional coordinates on a few instances.
This entry will focus on the former case of learning a predictor.
Motivation and Background
Semi-supervised learning is initially motivated by its practical value in learning faster, better, and cheaper. In many real-world applications, it is relatively easy to acquire a large amount of unlabeled data {x}. For example, documents can be crawled from the Web, images can be obtained from surveillance cameras, and speech can be collected from broadcasts. However, the corresponding labels {y} for the prediction task, such as sentiment orientation, intrusion detection, and phonetic transcripts, often require slow human annotation or expensive laboratory experiments. This labeling bottleneck results in a scarcity of labeled data and a surplus of unlabeled data. Therefore, being able to utilize the surplus unlabeled data is desirable.
Recently, semi-supervised learning has also found application in cognitive psychology as a computational model for human learning. In human categorization and concept formation, the environment provides unsupervised data (e.g., a child watching surrounding objects by herself) in addition to labeled data from a teacher (e.g., Dad points to an object and says “bird!”). There is evidence that human beings can combine labeled and unlabeled data to facilitate learning.
The history of semi-supervised learning goes back at least to the 1970s, when self-training, transduction, and Gaussian mixtures with the EM algorithm first emerged. The field has enjoyed an explosion of interest since the 1990s, with the development of new algorithms like co-training and transductive support vector machines, new applications in natural language processing and computer vision, and new theoretical analyses. More discussion can be found in Section 1.1.3 of [7].
Theory
It is obvious that unlabeled data {x_i}_{i=l+1}^{l+u} by itself does not carry any
information on the mapping X → Y. How can it help us learn a better predictor f : X → Y? Balcan and Blum pointed out in [2] that the key lies in an implicit ordering of f ∈ F induced by the unlabeled data. Informally, if the implicit ordering happens to rank the target predictor f∗ near the top, then one needs less labeled data to learn f∗. This idea will be formalized later on using PAC learning bounds. In other contexts, the implicit ordering is interpreted as a prior over F or as a regularizer.
A semi-supervised learning method must address two questions: what implicit ordering is induced by the unlabeled data, and how to algorithmically find a predictor that is near the top of this implicit ordering and fits the labeled data well. Many semi-supervised learning methods have been proposed, with different answers to these two questions [15, 7, 1, 10]. It is impossible to enumerate all methods in this entry. Instead, we present a few representative methods.
Generative Models
This semi-supervised learning method assumes the form of the joint probability p(x, y | θ) = p(y | θ) p(x | y, θ). For example, the class prior distribution p(y | θ) can be a multinomial over Y, while the class conditional distribution p(x | y, θ) can be a multivariate Gaussian in X [6, 9]. We use θ ∈ Θ to denote the parameters of the joint probability. Each θ corresponds to a predictor f_θ via the Bayes rule:
f_θ(x) ≡ argmax_y p(y | x, θ) = argmax_y p(x, y | θ) / Σ_{y′∈Y} p(x, y′ | θ).
Therefore, F = {f_θ : θ ∈ Θ}.
What is the implicit ordering of f_θ induced by the unlabeled training data {x_i}_{i=l+1}^{l+u}? It is the large-to-small ordering of the log likelihood of θ on the unlabeled data:

log p({x_i}_{i=l+1}^{l+u} | θ) = Σ_{i=l+1}^{l+u} log Σ_{y∈Y} p(x_i, y | θ).
The top-ranked f_θ is the one whose θ (or rather the generative model with parameters θ) best fits the unlabeled data. Therefore, this method assumes that the form of the joint probability is correct for the task.
To identify the f_θ that both fits the labeled data well and ranks high, one maximizes the log likelihood of θ on both labeled and unlabeled data:

argmax_θ log p({(x_i, y_i)}_{i=1}^{l} | θ) + λ log p({x_i}_{i=l+1}^{l+u} | θ),
where λ is a balancing weight. This is a non-concave problem. A local maximum can be found with the Expectation-Maximization (EM) algorithm, or with other numerical optimization methods.
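As an illustration, the EM procedure above can be sketched for a two-class, one-dimensional Gaussian mixture. This is a minimal sketch, not code from the references: the one-Gaussian-per-class model, the initialization from the labeled data alone, the function name, and the implicit choice λ = 1 (labeled and unlabeled likelihoods weighted equally) are all simplifying assumptions.

```python
import numpy as np

def semisup_gmm_em(x_l, y_l, x_u, n_iter=50):
    """EM for a two-class, 1-D Gaussian mixture from labeled and
    unlabeled data. Labeled points keep hard (one-hot) responsibilities;
    unlabeled points get soft posteriors p(y | x, theta)."""
    # Initialize theta = (pi, mu, var) from the labeled data alone.
    mu = np.array([x_l[y_l == k].mean() for k in (0, 1)])
    var = np.array([x_l[y_l == k].var() + 1e-3 for k in (0, 1)])
    pi = np.array([np.mean(y_l == k) for k in (0, 1)])

    def gauss(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    for _ in range(n_iter):
        # E-step: responsibilities.
        r_l = np.eye(2)[y_l]  # labeled: fixed one-hot
        lik = np.stack([pi[k] * gauss(x_u, mu[k], var[k]) for k in (0, 1)], axis=1)
        r_u = lik / lik.sum(axis=1, keepdims=True)  # unlabeled: posterior
        r = np.vstack([r_l, r_u])
        x = np.concatenate([x_l, x_u])
        # M-step: weighted maximum-likelihood updates of pi, mu, var.
        nk = r.sum(axis=0)
        pi = nk / nk.sum()
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var
```

With a few labeled points per class and many unlabeled points drawn from the two components, the updated means and mixing weights track the unlabeled data rather than the labeled sample alone.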
Semi-Supervised Support Vector Machines
This semi-supervised learning method assumes that the decision boundary f(x) = 0 is situated in a low-density region (in terms of unlabeled data)
between the two classes y ∈ {−1, 1} [12, 8]. Consider the following hat loss function on an unlabeled instance x:
max(1 − |f(x)|, 0),
which is positive when −1 < f(x) < 1, and zero outside. The hat loss thus measures the violation in (unlabeled) large margin separation between f and x. Averaging over all unlabeled training instances, it induces an implicit ordering from small to large over f ∈ F:
(1/u) Σ_{i=l+1}^{l+u} max(1 − |f(x_i)|, 0).
The top ranked f is one whose decision boundary avoids most unlabeled instances by a large margin.
To find the f that both fits the labeled data well and ranks high, one typically minimizes the following objective:
argmin_f (1/l) Σ_{i=1}^{l} max(1 − y_i f(x_i), 0) + λ1 ∥f∥² + λ2 (1/u) Σ_{i=l+1}^{l+u} max(1 − |f(x_i)|, 0),
which is a combination of the objective for supervised support vector machines and the average hat loss. Algorithmically, the optimization problem is difficult because the hat loss is non-convex. Existing solutions include semi-definite programming relaxation, deterministic annealing, the continuation method, the concave-convex procedure (CCCP), stochastic gradient descent, and branch and bound.
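To make the hat loss concrete, the following sketch evaluates (rather than minimizes) the objective above for a linear predictor f(x) = w·x + b. The function name and the linear form of f are illustrative assumptions; as noted in the text, actually minimizing this objective is the hard, non-convex part.

```python
import numpy as np

def s3vm_objective(w, b, X_l, y_l, X_u, lam1=0.1, lam2=1.0):
    """Evaluate the S3VM objective for a linear f(x) = w.x + b:
    average hinge loss on labeled data + lam1 * ||w||^2
    + lam2 * average hat loss on unlabeled data."""
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    hinge = np.maximum(1 - y_l * f_l, 0).mean()  # labeled margin violations
    hat = np.maximum(1 - np.abs(f_u), 0).mean()  # penalizes boundaries in dense regions
    return hinge + lam1 * (w @ w) + lam2 * hat
```

On two well-separated clusters, a decision boundary placed in the low-density gap yields a smaller objective value than one cutting through a cluster, which is exactly the preference the hat loss encodes.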
Graph-Based Models
This semi-supervised learning method assumes that there is a graph G = (V, E) such that the vertices V are the labeled and unlabeled training instances, and the undirected edges E connect instances i, j with weight w_ij [4, 14, 3]. The graph is sometimes assumed to be a random instantiation of an underlying manifold structure that supports p(x). Typically, w_ij reflects the proximity of x_i and x_j. For example, the Gaussian edge weight function defines w_ij = exp(−∥x_i − x_j∥² / σ²). As another example, the kNN edge weight function defines w_ij = 1 if x_i is among the k nearest neighbors of x_j or vice versa, and w_ij = 0 otherwise. Other commonly used edge weight functions include ε-radius neighbors, b-matching, and combinations of the above.
A large w_ij implies a preference for the predictions f(x_i) and f(x_j) to be the same. This can be formalized by the graph energy of a function f:

Σ_{i,j=1}^{l+u} w_ij (f(x_i) − f(x_j))².

The graph energy induces an implicit ordering of f ∈ F from small to large. The top-ranked function is the smoothest with respect to the graph (in fact, it is any constant function). The graph energy can be equivalently expressed using the so-called unnormalized graph Laplacian matrix. Variants include the normalized Laplacian and powers of these matrices.
To find the f that both fits the labeled data well and ranks high (i.e., is smooth on the graph or manifold), one typically minimizes the following objective:
argmin_f (1/l) Σ_{i=1}^{l} c(f(x_i), y_i) + λ1 ∥f∥² + λ2 Σ_{i,j=1}^{l+u} w_ij (f(x_i) − f(x_j))²,
where c(f(x), y) is a convex loss function such as the hinge loss or the squared loss. This is a convex optimization problem with efficient solvers.
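With the squared loss c(f(x), y) = (f(x) − y)², the objective above is quadratic in the vector of predictions at the graph vertices, so the minimizer is the solution of a linear system. The following sketch (the helper name is an assumption, and a tiny ridge stands in for the λ1∥f∥² term for numerical stability) builds a Gaussian-weighted graph, forms the unnormalized Laplacian, and solves that system.

```python
import numpy as np

def graph_ssl(X, y_l, l, sigma=1.0, lam=1.0, eps=1e-8):
    """Graph-based SSL with the squared loss. The first l rows of X are
    labeled with y_l in {-1, +1}; the remaining rows are unlabeled."""
    n = len(X)
    # Gaussian edge weights w_ij = exp(-||x_i - x_j||^2 / sigma^2).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / sigma**2)
    np.fill_diagonal(W, 0)
    L = np.diag(W.sum(axis=1)) - W      # unnormalized graph Laplacian
    J = np.zeros((n, n))
    J[:l, :l] = np.eye(l)               # selects the labeled vertices
    yy = np.zeros(n)
    yy[:l] = y_l
    # Setting the gradient of (1/l) * sum_{i<=l} (f_i - y_i)^2 + lam * f'Lf
    # to zero gives a linear system; eps keeps it nonsingular.
    f = np.linalg.solve(J / l + lam * L + eps * np.eye(n), J @ yy / l)
    return f
```

On two clusters with one label each, the label effectively propagates along the strong within-cluster edges, so the sign of f on each unlabeled vertex matches its cluster's labeled vertex.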
Co-training and Multiview Models
This semi-supervised learning method assumes that there are multiple, different learners trained on the same labeled data, and that these learners agree on the unlabeled data. A classic algorithm is co-training [5]. Take the example of web page classification, where each web page x is represented by two subsets of features, or “views”: x = ⟨x^(1), x^(2)⟩. For instance, x^(1) can represent the words on the page itself, and x^(2) the words on the hyperlinks (on other web pages) pointing to this page. The co-training algorithm trains two predictors: f^(1) on x^(1) (ignoring the x^(2) portion of the features) and f^(2) on x^(2), both initially from the labeled data. If f^(1) confidently predicts the label of an unlabeled instance x, then the instance-label pair (x, f^(1)(x)) is added to f^(2)'s labeled training data, and vice versa. Note that this promotes f^(1) and f^(2) to predict the same on x. This repeats so that each view teaches the other. Multiview models generalize co-training by utilizing more than two predictors and relaxing the requirement of having separate views [11]. In either case, the final prediction is obtained from a (confidence-weighted) average or vote among the predictors.
To define the implicit ordering on the hypothesis space, we need a slight extension. In general, let there be m predictors f^(1), . . . , f^(m). Now let a hypothesis be an m-tuple of predictors ⟨f^(1), . . . , f^(m)⟩. The disagreement of a tuple on the unlabeled data can be defined as

Σ_{i=l+1}^{l+u} Σ_{u,v=1}^{m} c(f^(u)(x_i), f^(v)(x_i)),
where c() is a loss function. Typical choices of c() are the 0-1 loss for classification, and the squared loss for regression. Then the disagreement induces an implicit ordering on tuples from small to large.
It is important for these m predictors to be of diverse types and have different inductive biases. In general, each predictor f^(u), u = 1, . . . , m, may be evaluated by its individual loss function c^(u) and regularizer Ω^(u). To find a hypothesis (i.e., m predictors) that fits the labeled data well and ranks high, one can minimize the following objective:
argmin_{⟨f^(1),...,f^(m)⟩} Σ_{u=1}^{m} [ (1/l) Σ_{i=1}^{l} c^(u)(f^(u)(x_i), y_i) + λ1 Ω^(u)(f^(u)) ] + λ2 Σ_{i=l+1}^{l+u} Σ_{u,v=1}^{m} c(f^(u)(x_i), f^(v)(x_i)).
Multiview learning typically optimizes this objective directly. When the loss functions and regularizers are convex, a numerical solution is relatively easy to obtain. In the special case where the loss functions are the squared loss and the regularizers are squared ℓ2 norms, there is a closed-form solution. On the other hand, the co-training algorithm, as presented earlier, optimizes the objective indirectly with an iterative procedure. One advantage of co-training is that it is a wrapper method: it can use any “blackbox” learners f^(1) and f^(2) without the need to modify them.
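The co-training loop can be sketched as follows. The nearest-centroid base learner, the one-instance-per-round labeling schedule, and the tie-breaking in the final vote are illustrative simplifications introduced here; the original algorithm works with any base learner and typically moves several confident instances per round.

```python
import numpy as np

class Centroid:
    """Nearest-centroid classifier, used here as an illustrative
    'blackbox' base learner (co-training works with any learner)."""
    def fit(self, X, y):
        self.c = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
        return self
    def predict(self, X):
        d = ((X[:, None] - self.c[None]) ** 2).sum(axis=-1)
        return d.argmin(axis=1)
    def confidence(self, X):
        d = np.sqrt(((X[:, None] - self.c[None]) ** 2).sum(axis=-1))
        return np.abs(d[:, 0] - d[:, 1])   # margin between the two centroids

def co_train(X1, X2, y, labeled, rounds=10):
    """Co-training sketch. X1, X2 are the two views; only entries of y
    where `labeled` is True are ever read as ground truth."""
    lab1, lab2 = labeled.copy(), labeled.copy()
    y1, y2 = y.copy(), y.copy()
    for _ in range(rounds):
        f1 = Centroid().fit(X1[lab1], y1[lab1])
        f2 = Centroid().fit(X2[lab2], y2[lab2])
        # Each view labels one confident instance for the *other* view.
        for f, X, dst_lab, dst_y in ((f1, X1, lab2, y2), (f2, X2, lab1, y1)):
            unl = np.where(~dst_lab)[0]
            if len(unl) == 0:
                continue
            i = unl[f.confidence(X[unl]).argmax()]
            dst_y[i] = f.predict(X[i:i + 1])[0]   # pseudo-label
            dst_lab[i] = True
    # Final prediction: a simple vote (ties go to class 1 in this sketch;
    # a confidence-weighted average is common in practice).
    p1 = Centroid().fit(X1[lab1], y1[lab1]).predict(X1)
    p2 = Centroid().fit(X2[lab2], y2[lab2]).predict(X2)
    return ((p1 + p2) >= 1).astype(int)
```

Starting from a single labeled instance per class, each round grows both views' training sets with the other view's confident pseudo-labels, so the final vote classifies well-separated clusters accurately.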
A PAC Bound for Semi-Supervised Learning
Previously, we presented several semi-supervised learning methods, each of which induces an implicit ordering on the hypothesis space using the unlabeled training data, and each of which attempts to find a hypothesis that fits the labeled training data well and also ranks high in that implicit ordering. We now present a theoretical justification for why this is a good idea. In particular, we present a uniform convergence bound by Balcan and Blum (Theorem 11 in [2]). Alternative theoretical analyses of semi-supervised learning can be found by following the recommended reading.
First, we introduce some notation. Consider the 0-1 loss for classification. Let c* : X → {0, 1} be the unknown target function, which may not be in F. Let err(f) = E_{x∼p}[f(x) ≠ c*(x)] be the true error rate of a hypothesis f, and êrr(f) = (1/l) Σ_{i=1}^{l} 1[f(x_i) ≠ c*(x_i)] be the empirical error rate of f on the labeled training sample. To characterize the implicit ordering, we define an “unlabeled error rate” err_unl(f) = 1 − E_{x∼p}[χ(f, x)], where the compatibility function χ : F × X → [0, 1] measures how “compatible” f is with an unlabeled instance x. As an example, in semi-supervised support vector machines, if x is far away from the decision boundary produced by f, then χ(f, x) is large; but if x is close to the decision boundary, χ(f, x) is small. In this example, a large err_unl(f) means that the decision boundary of f cuts through dense unlabeled data regions, and thus f is undesirable for semi-supervised learning. In contrast, a small err_unl(f) means that the decision boundary of f lies in a low-density gap, which is more desirable. In theory, the implicit ordering on f ∈ F is to sort err_unl(f) from small to large. In practice, we use the empirical unlabeled error rate êrr_unl(f) = 1 − (1/u) Σ_{i=l+1}^{l+u} χ(f, x_i).
Our goal is to show that if an f ∈ F “fits the labeled data well and ranks high,” then f is almost as good as the best hypothesis in F. Let t ∈ [0, 1]. We first consider the best hypothesis f*_t in the subset of F consisting of hypotheses whose unlabeled error rate is no worse than t: f*_t = argmin_{f′∈F, err_unl(f′)≤t} err(f′). Obviously, t = 1 gives the best hypothesis in the whole of F. However, the guarantee has the form err(f) ≤ err(f*_t) + EstimationError(t) + c, where the EstimationError term increases with t. Thus, with t = 1 the bound can be loose. On the other hand, if t is close to 0, EstimationError(t) is small, but err(f*_t) can be much worse than err(f*_{t=1}). The bound will account for the optimal t.
We introduce a few more definitions. Let F(f) = {f′ ∈ F : êrr_unl(f′) ≤ êrr_unl(f)} be the subset of F whose empirical unlabeled error is no worse than that of f. As a complexity measure, let [F(f)] be the number of different partitions of the first l unlabeled instances x_{l+1}, . . . , x_{2l} induced by functions in F(f). Finally, let ε̂(f) = √((24/l) log(8[F(f)])). Then we have the following agnostic bound (meaning that c* may not be in F, and err_unl(f) may not be zero for any f ∈ F):

Theorem 1. Given l labeled instances and sufficiently many unlabeled instances, with probability at least 1 − δ, the function

f = argmin_{f′∈F} ( êrr(f′) + ε̂(f′) )

satisfies the guarantee that

err(f) ≤ min_t ( err(f*_t) + ε̂(f*_t) ) + 5 √(log(8/δ) / l).
If a function f fits the labeled data well, it has a small êrr(f). If it ranks high, then F(f) will be a small set, and consequently ε̂(f) is small. The argmin operator identifies the best such function during training. The bound accounts for the minimum over all possible t tradeoffs. Therefore, we see that the “lucky” case is when the implicit ordering is good, such that f*_{t=1}, the best hypothesis in F, is near the top of the ranking. This is when semi-supervised learning is expected to perform well. Balcan and Blum also give results addressing the key issue of how much unlabeled data is needed for err_unl(f) and êrr_unl(f) to be close for all f ∈ F.
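The tradeoff over t can be illustrated numerically. The values of err(f*_t) and of the partition count [F(f_t)] below are purely hypothetical, chosen only to show the two opposing terms; l = 1000 labeled examples and the reconstructed form ε̂(f) = √((24/l) log(8[F(f)])) are assumed.

```python
import math

# Purely hypothetical numbers: err(f*_t) decreases with t while the
# partition count [F(f_t)] (hence the estimation error) grows.
l = 1000
scenarios = {            # t : (err(f*_t), [F(f_t)])
    0.1: (0.30, 4),
    0.5: (0.12, 64),
    1.0: (0.10, 4096),
}

def eps_hat(num_partitions):
    # estimation-error term sqrt((24/l) * log(8 * [F(f)]))
    return math.sqrt(24 / l * math.log(8 * num_partitions))

bounds = {t: err + eps_hat(k) for t, (err, k) in scenarios.items()}
best_t = min(bounds, key=bounds.get)  # the bound picks the best tradeoff
```

With these numbers the intermediate threshold t = 0.5 gives the tightest bound: t = 1 pays too much estimation error, while t = 0.1 restricts attention to hypotheses with high true error.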
Applications
Because the type of semi-supervised learning discussed in this entry has the same goal of producing a predictor as supervised learning, it is applicable to essentially any problem where supervised learning can be applied. For example, semi-supervised learning has been applied to natural language processing (word sense disambiguation [13], document categorization, named entity classification, sentiment analysis, machine translation), computer vision (object recognition, image segmentation), bioinformatics (protein function prediction), and cognitive psychology. See the recommended reading for individual papers.
Future Directions
There are several directions to further enhance the value of semi-supervised learning. First, we need guarantees that it will outperform supervised learning. Currently, the practitioner has to manually choose a particular semi-supervised learning method, and often manually set learning parameters. Sometimes, a bad choice that does not match the task (e.g., modeling each class with a Gaussian when the data does not have this distribution) can make semi-supervised learning worse than supervised learning. Second, we need methods that benefit from unlabeled data when l, the size of the labeled data, is large. It has been widely observed that the gain over supervised learning is largest when l is small, but diminishes as l increases. Third, we need good ways to combine semi-supervised learning and active learning. In natural learning systems such as humans, we routinely observe unlabeled input, which often naturally leads to questions. And finally, we need methods that can efficiently process massive unlabeled data, especially in an online learning setting.
Cross References
active learning, classification, constrained clustering, dimensionality reduc- tion, online learning, regression, supervised learning, unsupervised learning
Recommended Reading
[1] S. Abney. Semisupervised Learning for Computational Linguistics. Chapman & Hall/CRC, 2007.
[2] M.-F. Balcan and A. Blum. A discriminative model for semi-supervised learning. Journal of the ACM, 2009.
[3] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, November 2006.
[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th International Conf. on Machine Learning, 2001.
[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, 1998.
[6] V. Castelli and T. Cover. The exponential value of labeled samples. Pattern Recognition Letters, 16(1):105–111, 1995.
[7] O. Chapelle, A. Zien, and B. Schölkopf, editors. Semi-Supervised Learning. MIT Press, 2006.
[8] T. Joachims. Transductive inference for text classification using support vector machines. In Proc. 16th International Conf. on Machine Learning, pages 200–209. Morgan Kaufmann, San Francisco, CA, 1999.
[9] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.
[10] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2001.
[11] V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularized approach to semi-supervised learning with multiple views. In Proc. of the 22nd ICML Workshop on Learning with Multiple Views, August 2005.
[12] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[13] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, 1995.
[14] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In The 20th International Conference on Machine Learning (ICML), 2003.
[15] X. Zhu and A. B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.