
Semi-supervised Learning

Dong Gong
University of Adelaide

Slides by Lingqiao Liu and Dong Gong

Outline


• Overview of Semi-supervised Learning
• Some commonly used semi-supervised learning approaches

– Self-training or Pseudo labeling
– Co-training
– S3VM (Semi-supervised SVM)
– Graph-based approach

• Deep semi-supervised learning
– Why deep semi-supervised learning
– Example: Consistency-based approaches

Semi-supervised learning


• What is semi-supervised learning?
– Learn a model from two types of training data: one with labels and one without

• Why bother?
– Labeled data can be hard to get

• Human annotation is boring
• Labeling may require experts or special devices

– Unlabeled data is cheap

Example

Image classification
Image categorization of “eclipse”

It may not be difficult to label many instances for some tasks.
But there is always more unlabeled data.

Semi-supervised learning


• Goal:
– Use both labeled and unlabeled data to build better models than
using either alone
• Notations:
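A common notation, consistent with the co-training algorithm later in this lecture:

Labeled data: $(X_l, Y_l) = \{(x_i, y_i)\}_{i=1}^{l}$
Unlabeled data: $X_u = \{x_j\}_{j=l+1}^{l+u}$, typically with $u \gg l$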

How can unlabeled data ever help?


• We usually have some assumptions about the data distribution of
each class in the unlabeled dataset

• Example assumption: each class is a coherent group (e.g. Gaussian)
– In the figure, comparing with and without unlabeled data: the decision
boundary shifts

How can unlabeled data ever help?


• With and without unlabeled data: the decision boundary shifts.
– The unlabeled data points shift the decision boundary towards a more accurate one.

• Example assumption: each class is a coherent group (e.g. Gaussian)

This is just one of many ways to use unlabelled data; different
semi-supervised learning approaches may make different
assumptions.

Does unlabeled data always help?


• Unfortunately, not always.
– If the assumption about the data distribution does not hold, unlabeled data can even hurt performance

Semi-supervised learning approaches


• Many of them:
– Self-training or Pseudo-label-based approaches
– Co-training
– Tri-training
– Semi-supervised Support Vector Machine (S3VM)
– …

• An active research direction
• This lecture: some classic and simple methods, which can
nevertheless be very useful in practice.

Self-training


• Also called the pseudo-labeling approach
– Assigns pseudo labels to unlabeled samples during the learning process

• Algorithm (see the sketch below)
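A minimal sketch of the pseudo-labeling loop (illustrative function and parameter names, assuming a scikit-learn-style classifier with fit / predict_proba):

# Minimal self-training / pseudo-labeling sketch (illustrative names).
import numpy as np

def self_train(model, X_labeled, y_labeled, X_unlabeled,
               threshold=0.9, max_rounds=10):
    X, y = X_labeled.copy(), y_labeled.copy()
    pool = X_unlabeled.copy()
    for _ in range(max_rounds):
        model.fit(X, y)                          # 1. train on the current labeled set
        if len(pool) == 0:
            break
        probs = model.predict_proba(pool)        # 2. predict on the unlabeled pool
        conf = probs.max(axis=1)                 #    confidence = max class probability
        pseudo = probs.argmax(axis=1)            #    pseudo label = predicted class
        keep = conf >= threshold                 # 3. keep only confident predictions
        if not keep.any():
            break
        X = np.vstack([X, pool[keep]])           # 4. add pseudo-labeled samples
        y = np.concatenate([y, pseudo[keep]])
        pool = pool[~keep]                       # 5. repeat with the remaining pool
    return model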

Prediction confidence

• How to calculate the confidence of a prediction?
– Many possible ways

• The classification probability from softmax can be used
to measure the confidence on the predicted labels.
– For example:

• Three classes
• A sample is classified as the 2nd class with probability

– 0.85 ([0.03, 0.85, 0.12]) → high confidence
– 0.4 ([0.3, 0.4, 0.3]) → low confidence

• A threshold may be needed to define “high confidence” (see the snippet below).
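A small illustrative snippet of this confidence computation (the 0.8 threshold is an assumed example value):

import numpy as np

def softmax(logits):
    z = logits - logits.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Recovers the slide's example distribution [0.03, 0.85, 0.12]
probs = softmax(np.log(np.array([0.03, 0.85, 0.12])))
confidence = probs.max()                   # confidence of the predicted class (0.85)
predicted_class = probs.argmax()           # class index 1, i.e. the 2nd class
is_confident = confidence >= 0.8           # 0.8 is an assumed example threshold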


Self-training


• Assumption:
– High confidence predictions are correct

• Why does it help?
– A correct prediction does not mean a zero loss value.
– Optimizing on the confident pseudo-labeled samples further lifts the
performance on these samples.
– More samples → better model → more accurate predictions →
more samples

Self-training: Example

1. Training results on the labeled samples.
2. Assigning pseudo labels.
3. Re-train the model on the labeled data and the data with pseudo labels;
   the old decision boundary is updated/shifted.

Self-training


• Advantages:
– Simple wrapper approach: applies to any model flexibly
– Easy to implement

• Disadvantages:
– If pseudo labels are incorrect, there is no way to correct them
– Early mistakes can reinforce themselves
– Sensitive to prediction errors

Co-training


• Idea: many problems have two different sources of information
you can use to determine the label

• E.g., classifying webpages: you can use the
words/images/contents on the page, or the words on links
pointing to the page

Co-training


• e.g., a link labeled “colleagues” pointing to a page is a good
indicator that it is a faculty home page.

• e.g., “I am teaching an ML course” on a page is a good
indicator that it is a faculty home page.

Co-training


• Then look for unlabelled examples where one classifier is
confident and the other is not, and have the confident one
label the example for the other.

• For example, if the prediction from one classifier is
sufficiently confident, it generates a pseudo label for the
other classifier.

Co-training


• Feature split
• Co-training algorithm
(both adapted from Xiaojin Zhu, Semi-Supervised Learning Tutorial, ICML 2007)

Feature split

– Each instance is represented by two sets of features: x = [x(1); x(2)]
– x(1) = image features
– x(2) = web page text
– This is a natural feature split (or multiple views)

Co-training idea:
– Train an image classifier and a text classifier
– The two classifiers teach each other

Co-training algorithm

1. Train two classifiers: f(1) from (Xl(1), Yl), and f(2) from (Xl(2), Yl).
2. Classify Xu with f(1) and f(2) separately.
3. Add f(1)’s k most confident predictions (x, f(1)(x)) to f(2)’s labeled data.
4. Add f(2)’s k most confident predictions (x, f(2)(x)) to f(1)’s labeled data.
5. Repeat.

(Xiaojin Zhu, Semi-Supervised Learning Tutorial, ICML 2007)
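A minimal sketch of this loop (illustrative interface, assuming two scikit-learn-style classifiers and two aligned feature views of the same instances):

# Minimal co-training sketch following the algorithm above (illustrative names).
import numpy as np

def co_train(f1, f2, X1_l, X2_l, y_l, X1_u, X2_u, k=5, rounds=10):
    y1, y2 = y_l.copy(), y_l.copy()
    L1, L2 = X1_l.copy(), X2_l.copy()
    for _ in range(rounds):
        f1.fit(L1, y1)                                    # train each view's classifier
        f2.fit(L2, y2)
        if len(X1_u) == 0:
            break
        p1, p2 = f1.predict_proba(X1_u), f2.predict_proba(X2_u)
        top1 = np.argsort(p1.max(axis=1))[-k:]            # f1's k most confident samples
        top2 = np.argsort(p2.max(axis=1))[-k:]            # f2's k most confident samples
        L2 = np.vstack([L2, X2_u[top1]])                  # f1 teaches f2 ...
        y2 = np.concatenate([y2, p1.argmax(axis=1)[top1]])
        L1 = np.vstack([L1, X1_u[top2]])                  # ... and f2 teaches f1
        y1 = np.concatenate([y1, p2.argmax(axis=1)[top2]])
        used = np.union1d(top1, top2)                     # drop used samples from the pool
        keep = np.setdiff1d(np.arange(len(X1_u)), used)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return f1, f2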

Co-training


• Assumptions for feature-split-based co-training
– A feature split x = [x(1); x(2)] exists
– x(1) or x(2) alone is sufficient to train a good classifier
– x(1) and x(2) are conditionally independent given the class

(Figure: labeled and unlabeled points shown in the X1 view and the X2 view;
Xiaojin Zhu, Semi-Supervised Learning Tutorial, ICML 2007)

Co-training


• How to apply co-training if there is only one feature source?
– Generate two independent classifiers: e.g. SVMs with different
kernels, or two different neural networks
– Key: make sure those classifiers make independent decisions

Jizong Peng, Guillermo Estrada, Marco Pedersoli, Christian Desrosiers. Deep Co-Training for Semi-
Supervised Image Segmentation. European Conference on Computer Vision 2018.

Co-training


• Advantages:
– Simple wrapper approach: applies to any model f
– Less sensitive to prediction mistakes than self-training

• Disadvantages:
– A natural split of feature sources can be hard to obtain
– Models using BOTH features should do better

S3VM: Semi-supervised Support Vector
Machine


• Semi-supervised SVMs (S3VMs) = Transductive SVMs
(TSVMs)

• Maximizes the “unlabeled data margin”
• Assumption: we believe the target separator goes through
low-density regions of the space (large margin).
• Aim for a separator with a large margin with respect to both the
labelled and the unlabelled data (L + U).

S3VM: Semi-supervised Support Vector
Machine


Low density region

SVM Recap


How to understand this term?
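As a reference point, the standard soft-margin SVM objective, written with the hinge loss, is:

\min_{w, b} \;\; \tfrac{1}{2}\|w\|^2 \; + \; C \sum_{i=1}^{l} \max\!\left(0,\; 1 - y_i\,(w^\top x_i + b)\right)

The term in question is the second one, the hinge loss (equivalently, the slack variable ξ_i in the constrained form), discussed on the next slide.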

SVM Recap


How to understand this term?

It measures how much the sample violates the hard-margin constraints. Recall that in the
hard-margin case, we expect that y_i (w^T x_i + b) ≥ 1 for every training sample.

Geometrically speaking


S3VM: Encourage w going through low
density regions

• S3VM repurposes the interpretation of the hinge-loss term to enforce
its key idea: encourage the decision boundary defined by w to go
through low-density regions

Solution


• We want the separator to go through
low-density regions of the
space (large margin).

• Assume each unlabelled sample is
class 1 or -1, and calculate the loss
under each assumed label.

• For each unlabelled sample, we then
minimize the smaller of the two losses.


According to this scheme, any sample falling between the two red dashed lines (the
margin boundaries) will be penalized, that is, it incurs a nonzero loss. The optimization
process will therefore seek to place the decision boundary in the region that leads to
minimal loss, equivalently a low-density region.

Formulation of S3VM: [optional]

Assume y = 1: hinge loss max(0, 1 - (w^T x + b))

Assume y = -1: hinge loss max(0, 1 + (w^T x + b))
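Taking, for each unlabelled point, the smaller of the two losses gives the "hat loss" \max(0,\, 1 - |w^\top x + b|), and a standard way to write the full S3VM objective (a sketch, not necessarily the exact form on the slide) is:

\min_{w, b} \;\; \tfrac{1}{2}\|w\|^2
  \; + \; C_1 \sum_{i \,\in\, \text{labeled}} \max\!\left(0,\; 1 - y_i\,(w^\top x_i + b)\right)
  \; + \; C_2 \sum_{j \,\in\, \text{unlabeled}} \max\!\left(0,\; 1 - |w^\top x_j + b|\right)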

S3VM: Semi-supervised Support Vector
Machine


• It is not a convex problem; more advanced optimization
algorithms are needed.

• The key idea and formulation also apply to other semi-
supervised learning approaches.

S3VM: Semi-supervised Support Vector
Machine


• Advantages:
– Applicable wherever SVMs are applicable
– Clear mathematical formulation

• Disadvantages:
– Optimization can be difficult
– Can be trapped in bad local minima

Graph-based semi-supervised learning


• Assumption: we believe that very similar examples
probably have the same label.

• We can use unlabelled data as “stepping stones” to
propagate similarity and labels.

Graph-based semi-supervised learning


• Idea: Construct a graph over the labeled and unlabeled
data. Instances connected by a heavy (high-weight) edge tend to have
the same label (see the construction sketch below).

• Unlabelled data can help “glue” objects of the same
class together.
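A typical way to construct such a graph is a k-nearest-neighbour graph with Gaussian edge weights; a minimal sketch (k and sigma are illustrative hyper-parameters, and the O(n^2) distance computation is only suitable for small datasets):

# Minimal kNN similarity-graph sketch with Gaussian edge weights.
import numpy as np

def build_knn_graph(X, k=5, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros_like(d2)
    for i in range(len(X)):
        nn = np.argsort(d2[i])[1:k + 1]                    # k nearest neighbours, skip self
        W[i, nn] = np.exp(-d2[i, nn] / (2 * sigma ** 2))   # heavier edge = more similar
    return np.maximum(W, W.T)                              # symmetrize the graph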

Graph-based semi-supervised learning


The Graph

Graph-based semi-supervised learning


• Key idea: if two samples are neighbours, their prediction values
should be similar

Graph-based semi-supervised learning


• Formulation:

s(x, x’) represents the strength of the connection (or “similarity”) on the graph.
Different values of s(x, x’) lead to different regularization strengths.

Graph-based semi-supervised learning


• Formulation:

First two terms: the same as in the SVM; you can replace them with any linear classifier.
Last term: encourages similar samples to have similar predictions (see the sketch below).
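A sketch of an objective matching this description (assuming a linear predictor f(x) = w^\top x + b and the hinge loss from the SVM recap, not necessarily the exact formula on the slide):

\min_{w, b} \;\; \tfrac{1}{2}\|w\|^2
  \; + \; C \sum_{i=1}^{l} \max\!\left(0,\; 1 - y_i\, f(x_i)\right)
  \; + \; \lambda \sum_{x, x'} s(x, x')\,\bigl(f(x) - f(x')\bigr)^2

where the last sum runs over pairs of (labeled and unlabeled) samples connected on the graph.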

Graph-based semi-supervised learning


• Advantages:
– Good performance if the graph fits the task
– Clear mathematical formulation

• Disadvantages:
– Performance is poor if the graph is poor
– Not scalable to large datasets

Deep semi-supervised learning


• What has changed in the context of deep learning
– Training methods: from convex optimization to SGD
– Scale of data: large scale
– The structure of the predictive model: simple vs. complex, layer-wise
– Feature representation learning with deep learning

Deep Semi-supervised Learning

• A very active research field
• Again, many methods
• Current research progress


Consistency-based deep semi-supervised
learning
• Require the network to be insensitive to input
perturbations
– A supervision signal that does not rely on the labels

• How to make a perturbation
– Add noise to the input (e.g., images)
– Data augmentation for images: shifting, mirroring, color
jittering, etc.
– Data augmentation for text: back translation


[1] Samuli Laine, Timo Aila. Temporal Ensembling for Semi-Supervised Learning. ICLR 2017.
[2] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le. Unsupervised Data
Augmentation for Consistency Training. arXiv preprint, 2019.

Consistency-based deep semi-supervised
learning
• Loss function (see the sketch below)

• Problem:
– At the beginning of training, f may not generate meaningful
outputs, so enforcing consistency may result in trivial solutions
– Solution: ramp up the consistency weight over training (see the sketch below)
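A typical consistency-based loss, in the style of the Pi-model in [1] (a sketch, not necessarily the exact loss on the slide):

L \; = \; \sum_{(x_i, y_i)\,\in\,\text{labeled}} \mathrm{CE}\!\left(f(x_i),\, y_i\right)
  \; + \; w(t) \sum_{x_j\,\in\,\text{labeled + unlabeled}} \bigl\| f(\hat{x}_j) - f(\tilde{x}_j) \bigr\|^2

where \hat{x}_j and \tilde{x}_j are two random perturbations (augmentations) of the same input and w(t) is the consistency weight at training step t.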

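For the ramp-up, [1] increases the consistency weight following a Gaussian schedule over the early epochs; a small sketch (w_max and ramp_epochs are illustrative values, not taken from the slides):

# Gaussian ramp-up for the consistency weight, as used in [1].
import numpy as np

def consistency_weight(epoch, ramp_epochs=80, w_max=1.0):
    t = np.clip(epoch / ramp_epochs, 0.0, 1.0)
    return w_max * np.exp(-5.0 * (1.0 - t) ** 2)   # ~0 at the start, w_max after ramp-up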

Conclusion

• Semi-supervised learning is an area of increasing
importance in Machine Learning.

• Automatic methods of collecting data make it more
important than ever to develop methods that make use of
unlabelled data.
