Semi-supervised Learning
Dong Gong
University of Adelaide
Slides by Lingqiao Liu and Dong Gong
Outline
• Overview of Semi-supervised Learning
• Some commonly used semi-supervised learning approaches
– Self-training or Pseudo labeling
– Co-training
– S3VM
– Graph-based approach
• Deep semi-supervised learning
– Why deep semi-supervised learning
– Example: Consistency-based approaches
Semi-supervised learning
• What is semi-supervised learning?
– Learn a model from two types of training data: one with labels and one without
• Why bother?
– Labeled data can be hard to get
• Human annotation is boring
• Labeling may require experts or special devices
– Unlabeled data is cheap
Example
Image classification: image categorization of “eclipse”
It may not be difficult to label many instances for some tasks, but there is always more unlabeled data than labeled data.
Semi-supervised learning
• Goal:
– Use both labeled and unlabeled data to build better models than using either type of data alone
• Notation:
– Labeled data (X_l, Y_l) = {(x_i, y_i)}, i = 1, …, l
– Unlabeled data X_u = {x_j}, j = l + 1, …, l + u, usually with many more unlabeled samples than labeled ones
How can unlabeled data ever help?
• We usually make some assumptions about the data distribution of each class in the unlabeled dataset
• Example assumption: each class forms a coherent group (e.g., a Gaussian cluster)
– With and without the unlabeled data, the decision boundary shifts
How can unlabeled data ever help?
• With and without unlabeled data: the decision boundary shifts.
– The unlabeled data points shift the decision boundary towards a more accurate one.
• Example assumption: each class forms a coherent group (e.g., a Gaussian cluster)
• This is just one of many ways to use unlabelled data; different semi-supervised learning approaches make different assumptions
Does unlabeled data always help?
• Unfortunately, not always.
– The simple assumption may not hold, and the unlabeled data can then mislead the model
Semi-supervised learning approach
• Many of them:
– Self-training or Pseudo-label-based approaches
– Co-training
– Tri-training
– Semi-supervised Support Vector Machine (S3VM)
– …
• An active research direction
• This lecture: some classic and simple methods that can still be very useful in practice.
Self-training
• Also called the pseudo-labeling approach
– Assign pseudo labels to unlabeled samples during the learning process
• Algorithm: train on the labeled data, predict labels for the unlabeled data, add high-confidence predictions as pseudo-labeled training samples, retrain, and repeat (see the sketch below)
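A minimal sketch of this loop, assuming a scikit-learn-style classifier with a predict_proba method (the function and parameter names here are illustrative, not part of the lecture):

import numpy as np

def self_training(model, X_l, y_l, X_u, threshold=0.9, max_rounds=10):
    """Self-training wrapper: repeatedly pseudo-label confident unlabeled
    samples and retrain the model on the enlarged training set."""
    X_train, y_train = X_l.copy(), y_l.copy()
    X_rest = X_u.copy()
    for _ in range(max_rounds):
        model.fit(X_train, y_train)               # train on the current labeled set
        if len(X_rest) == 0:
            break
        probs = model.predict_proba(X_rest)       # class probabilities per sample
        conf = probs.max(axis=1)                  # confidence = highest probability
        keep = conf >= threshold                  # keep only confident predictions
        if not keep.any():
            break                                 # nothing confident enough: stop
        pseudo = model.classes_[probs.argmax(axis=1)]   # pseudo labels
        X_train = np.vstack([X_train, X_rest[keep]])
        y_train = np.concatenate([y_train, pseudo[keep]])
        X_rest = X_rest[~keep]                    # remove newly pseudo-labeled samples
    return model

In practice the confidence threshold matters: if it is too low, wrong pseudo labels pollute the training set; if it is too high, few unlabeled samples are ever used.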
Prediction confidence
• How to calculate the confidence of a prediction?
– There are many possible ways
• The classification probability from the softmax can be used to measure the confidence in the predicted label.
– For example, with three classes, a sample classified as the 2nd class with probability
• 0.85 ([0.03, 0.85, 0.12]) → high confidence
• 0.4 ([0.3, 0.4, 0.3]) → low confidence
• A threshold may be needed to define “high confidence”.
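As a small illustration (the logits below are made up), the softmax probabilities and a confidence threshold can be computed as:

import numpy as np

def softmax(logits):
    z = logits - logits.max()      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([0.1, 3.5, 1.2]))  # hypothetical logits for 3 classes
confidence = probs.max()                     # probability of the predicted class (about 0.88 here)
is_confident = confidence >= 0.8             # user-chosen "high confidence" threshold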
Self-training
• Assumption:
– High confidence predictions are correct
• Why does it help?
– A correct prediction does not necessarily mean a zero loss value.
– Optimizing on the confident pseudo-labeled samples further lifts the performance on these samples.
– More samples → better model → more accurate predictions → more samples
Self-training: Example
Training results on the labeled samples.
Self-training: Example
Assigning pseudo labels.
Self-training: Example
Re-train the model on the labeled data and the data with pseudo labels: the old decision boundary is updated/shifted.
Self-training
• Advantages:
– Simple wrapper approach: can be applied to any model flexibly.
– Easy to implement
• Disadvantages
– If pseudo labels are incorrect, there is no way to correct them.
– Early mistakes can reinforce themselves.
– Sensitive to prediction error
Co-training
• Idea: many problems have two different sources of information that can be used to determine the label
• E.g., for classifying webpages, one can use the words/images/content on the page, or the words on links pointing to the page
Co-training
• E.g., a link labelled “colleagues” pointing to a page is a good indicator that it is a faculty home page.
• E.g., “I am teaching an ML course” on a page is a good indicator that it is a faculty home page.
Co-training
• Then look for unlabelled examples where one rule is confident and the other is not, and have the confident one label the example for the other.
• For example, if the prediction from one classifier is sufficiently confident, it generates a pseudo label for the other classifier
Co-training
• Feature split
– Each instance is represented by two sets of features x = [x^(1); x^(2)]
• x^(1) = image features
• x^(2) = web page text
– This is a natural feature split (or multiple views)
• Co-training idea: train an image classifier and a text classifier; the two classifiers teach each other
• Co-training algorithm
1. Train two classifiers: f^(1) from (X_l^(1), Y_l) and f^(2) from (X_l^(2), Y_l).
2. Classify X_u with f^(1) and f^(2) separately.
3. Add f^(1)'s k most confident (x, f^(1)(x)) to f^(2)'s labeled data.
4. Add f^(2)'s k most confident (x, f^(2)(x)) to f^(1)'s labeled data.
5. Repeat.
(Source: Xiaojin Zhu, Semi-Supervised Learning Tutorial, ICML 2007)
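A minimal sketch of this loop in Python, assuming two scikit-learn-style classifiers and two feature views stored as NumPy arrays (all names and the simple top-k selection are illustrative):

import numpy as np

def co_training(f1, f2, X1_l, X2_l, y_l, X1_u, X2_u, k=5, rounds=10):
    """Co-training: each view's classifier pseudo-labels its k most confident
    unlabeled samples for the other view's classifier, then both retrain."""
    X1_lab, X2_lab = X1_l.copy(), X2_l.copy()
    y1, y2 = y_l.copy(), y_l.copy()             # labeled sets seen by f1 and f2
    for _ in range(rounds):
        f1.fit(X1_lab, y1)
        f2.fit(X2_lab, y2)
        if len(X1_u) == 0:
            break
        p1, p2 = f1.predict_proba(X1_u), f2.predict_proba(X2_u)
        top1 = np.argsort(-p1.max(axis=1))[:k]  # f1's k most confident samples
        top2 = np.argsort(-p2.max(axis=1))[:k]  # f2's k most confident samples
        # f1 teaches f2, and f2 teaches f1
        X2_lab = np.vstack([X2_lab, X2_u[top1]])
        y2 = np.concatenate([y2, f1.classes_[p1.argmax(axis=1)[top1]]])
        X1_lab = np.vstack([X1_lab, X1_u[top2]])
        y1 = np.concatenate([y1, f2.classes_[p2.argmax(axis=1)[top2]]])
        # drop samples that either classifier has pseudo-labeled
        keep = np.setdiff1d(np.arange(len(X1_u)), np.union1d(top1, top2))
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return f1, f2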
Co-training
• Assumptions for feature-split-based co-training:
– A feature split x = [x^(1); x^(2)] exists
– x^(1) or x^(2) alone is sufficient to train a good classifier
– x^(1) and x^(2) are conditionally independent given the class
[Figure: labeled (+/−) and unlabeled samples plotted in the X1 view and the X2 view]
(Source: Xiaojin Zhu, Semi-Supervised Learning Tutorial, ICML 2007)
Co-training
• How to apply co-training if there is only one source?
– Generate two independent classifiers: e.g., SVMs with different kernels, or two different neural networks
– Key: make sure those classifiers make independent decisions.
Jizong Peng, Guillermo Estrada, Marco Pedersoli, Christian Desrosiers. Deep Co-Training for Semi-Supervised Image Segmentation. European Conference on Computer Vision 2018.
Co-training
• Advantages:
– Simple wrapper approach: applies to any model f
– Less sensitive to prediction mistakes than self-training
• Disadvantages
– Natural split of feature sources can be hard to obtain
– Models using BOTH features should do better.
S3VM: Semi-supervised Support Vector
Machine
• Semi-supervised SVMs (S3VMs) = Transductive SVMs
(TSVMs)
• Maximizes “unlabeled data margin”
• Assumption: we believe the target separator goes through low-density regions of the space (large margin).
• Aim for a separator with a large margin with respect to both the labelled and the unlabelled data (L + U).
S3VM: Semi-supervised Support Vector Machine
[Figure: the separator is placed in the low-density region between the two classes]
SVM Recap
• How should we understand the slack (hinge-loss) term in the soft-margin SVM objective?
– It measures how much a sample violates the hard-margin constraints. Recall that in the hard-margin case, we expect y_i (w^T x_i + b) ≥ 1 for every labeled sample.
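For reference, the usual soft-margin objective with this hinge/slack term can be written as follows (standard notation; the exact symbols on the original slide may differ):

\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert^{2} \;+\; C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\,(w^{\top} x_i + b)\bigr)

Each hinge term is zero exactly when the hard-margin constraint y_i (w^T x_i + b) ≥ 1 is satisfied, and grows linearly with the amount of violation.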
Geometrically speaking
S3VM: Encourage w to go through low-density regions
• S3VM repurposes the interpretation of the slack/hinge term to enforce its key idea: encouraging the decision boundary defined by w to go through low-density regions
Solution
• We want the separator to go through low-density regions of the space (large margin).
• Assume each unlabelled sample is class +1 and then class −1, calculate the hinge loss under each assumption, and take the smaller of the two.
• We want to minimize this unlabeled loss together with the usual supervised SVM loss.
• Under this scheme, any sample falling inside the margin (between the two margin boundaries) is penalized, that is, it incurs a nonzero loss. The optimization will therefore seek to place the decision boundary in the region which leads to minimal loss, equivalently, a low-density region.
Formulation of S3VM [optional]
• The unlabeled loss assumes each of y = +1 and y = −1 for an unlabeled sample and takes the minimum of the two resulting hinge losses.
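A standard way to write the S3VM objective (the common textbook form; the constants and symbols here may differ from the original slide):

\min_{w,\,b}\; \frac{1}{2}\lVert w\rVert^{2} + C \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i\,(w^{\top} x_i + b)\bigr) + C' \sum_{j=l+1}^{l+u} \min_{y \in \{+1,-1\}} \max\bigl(0,\, 1 - y\,(w^{\top} x_j + b)\bigr)

For each unlabeled sample, the inner minimum over the two assumed labels equals max(0, 1 − |w^T x_j + b|), which is exactly the term that penalizes unlabeled points lying inside the margin.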
S3VM: Semi-supervised Support Vector
Machine
• It is not a convex problem. More advanced optimization
algorithms are needed.
• The key idea and formulation also apply to other semi-supervised learning approaches
S3VM: Semi-supervised Support Vector
Machine
• Advantages:
– Applicable wherever SVMs are applicable
– Clear mathematical formulation
• Disadvantage:
– Optimization can be difficult
– Can be trapped in bad local minima
Graph-based semi-supervised learning
• Assumption: we believe that very similar examples
probably have the same label.
• We can use unlabelled data as “stepping stones” to propagate similarity and labels
Graph-based semi-supervised learning
• Idea: construct a graph over the labeled and unlabeled data. Instances connected by a heavy (high-weight) edge tend to have the same label.
• Unlabelled data can help “glue” the objects of the same
class together.
Graph-based semi-supervised learning
[Figure: the graph constructed over the labeled and unlabeled samples, with edges connecting similar instances]
Graph-based semi-supervised learning
• Key idea: if two samples are neighbours on the graph, their prediction values should be similar
Graph-based semi-supervised learning
• Formulation:
s(x, x') represents the strength of the connection (or “similarity”) on the graph.
Different values of s(x, x') lead to different regularization strengths.
Graph-based semi-supervised learning
• Formulation:
First two terms: the same as in the SVM; you can replace them with any linear classifier
Last term: encourages similar samples to have similar predictions
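A common form of such an objective (a sketch; the weights λ₁, λ₂ and the exact supervised loss are illustrative):

\min_{w,\,b}\; \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i f(x_i)\bigr) \;+\; \lambda_{1} \lVert w\rVert^{2} \;+\; \lambda_{2} \sum_{x,\,x'} s(x, x')\,\bigl(f(x) - f(x')\bigr)^{2}, \qquad f(x) = w^{\top} x + b

The last sum runs over pairs of connected samples, labeled and unlabeled; a large s(x, x') forces f(x) and f(x') to be close.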
Graph-based semi-supervised learning
• Advantages:
– Good performance if the graph fits the task
– Clear mathematical formulation
• Disadvantage:
– Performance is bad if the graph is bad
– Not scalable to large datasets
Deep semi-supervised learning
• What has changed in the context of deep learning?
– Training methods: from convex optimization to SGD
– Scale of data: large scale
– The structure of the predictive model: from simple to complex, layer-wise models
– Feature representation learning comes with deep learning
Deep Semi-supervised Learning
• A very active research field
• Again, many methods
• Current research progress
Consistency-based deep semi-supervised
learning
• Require the network to be insensitive to input perturbations
– The supervision signal does not rely on the labels
• How to create a perturbation (see the example sketch below)
– Add noise to the input (e.g., images)
– Data augmentation for images: shifting, mirroring, color jittering, etc.
– Data augmentation for text: back translation
[1] Samuli Laine, Timo Aila. Temporal Ensembling for Semi-Supervised Learning. ICLR 2017
[2] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le. Unsupervised Data Augmentation for Consistency Training. arXiv, 29 Apr 2019.
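For example, a simple image-perturbation pipeline could look like the following sketch (using torchvision; the specific transforms and parameters are illustrative, assuming 32×32 input images):

import torchvision.transforms as T

# Random perturbations that create different "views" of the same image
augment = T.Compose([
    T.RandomHorizontalFlip(),                      # mirroring
    T.RandomCrop(32, padding=4),                   # shifting via a padded random crop
    T.ColorJitter(brightness=0.4, contrast=0.4),   # color jittering
    T.ToTensor(),
])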
Consistency-based deep semi-supervised
learning
• Loss function: a supervised loss on the labeled data plus a consistency loss between the predictions for differently perturbed versions of the same input
• Problem:
– At the beginning of training, f may not generate meaningful outputs, so enforcing consistency may result in trivial solutions
– Solution: ramp up the weight of the consistency term during training
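A minimal PyTorch-style sketch of such a loss and ramp-up schedule (the exact consistency measure and schedule differ between methods such as [1] and [2]; all names here are illustrative):

import math
import torch
import torch.nn.functional as F

def consistency_loss(model, x_lab, y_lab, x_unlab, augment, w_t):
    """Supervised cross-entropy on labeled data plus a consistency penalty
    between predictions for two randomly perturbed views of the same inputs."""
    sup = F.cross_entropy(model(augment(x_lab)), y_lab)   # supervised term
    p1 = F.softmax(model(augment(x_unlab)), dim=1)        # first perturbed view
    p2 = F.softmax(model(augment(x_unlab)), dim=1)        # second perturbed view
    cons = F.mse_loss(p1, p2)                             # consistency term
    return sup + w_t * cons                               # w_t ramps up from ~0

def consistency_weight(epoch, ramp_epochs=80, max_weight=1.0):
    """Ramp-up schedule: the consistency term is almost ignored early in training."""
    t = min(epoch, ramp_epochs) / ramp_epochs
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)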
Conclusion
• Semi-supervised learning is an area of increasing importance in machine learning.
• Automatic methods of collecting data make it more important than ever to develop methods that make use of unlabelled data.