COMP9444
Neural Networks and Deep Learning
Outline
COMP9444 © Alan Blair, 2017-20
COMP9444 20T2 Hidden Unit Dynamics
3a. Hidden Unit Dynamics
geometry of hidden unit activations (8.2)
limitations of 2-layer networks
vanishing/exploding gradients
alternative activation functions (6.3)
ways to avoid overfitting in neural networks (5.2-5.3)
dropout (7.11-7.12)

Textbook, Sections 5.2-5.3, 6.3, 7.11-7.12, 8.2
Encoder Networks
N–2–N Encoder
identity mapping through a bottleneck
also called N–M–N task
used to investigate hidden unit representations
Inputs    Outputs
10000     10000
01000     01000
00100     00100
00010     00010
00001     00001
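As an illustrative sketch (not course code), a 5-2-5 version of this task can be trained by plain gradient descent in numpy; the learning rate, epoch count and weight scale below are arbitrary choices:

```python
import numpy as np

# 5-2-5 encoder: learn the identity mapping on one-hot inputs
# through a 2-unit bottleneck.
rng = np.random.default_rng(0)
X = np.eye(5)                            # inputs = targets = one-hot patterns

W1 = rng.normal(0, 0.5, (5, 2)); b1 = np.zeros(2)   # input-to-hidden
W2 = rng.normal(0, 0.5, (2, 5)); b2 = np.zeros(5)   # hidden-to-output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    H = np.tanh(X @ W1 + b1)             # hidden unit activations
    return H, sigmoid(H @ W2 + b2)

def loss(Y):                             # cross-entropy, clipped for numerical safety
    Yc = np.clip(Y, 1e-12, 1 - 1e-12)
    return -np.mean(X * np.log(Yc) + (1 - X) * np.log(1 - Yc))

H, Y = forward(X)
initial_loss = loss(Y)

eta = 0.2
for epoch in range(10000):
    H, Y = forward(X)
    dY = Y - X                           # cross-entropy grad w.r.t. output pre-activation
    dH = (dY @ W2.T) * (1 - H ** 2)      # backprop through tanh
    W2 -= eta * H.T @ dY; b2 -= eta * dY.sum(0)
    W1 -= eta * X.T @ dH; b1 -= eta * dH.sum(0)

H, Y = forward(X)
final_loss = loss(Y)
print(initial_loss, final_loss)          # the loss should decrease
```

After training, the five rows of H give each pattern's location in the 2-dimensional hidden unit space.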
HU Space: [plot of the trained encoder's hidden unit activations]
8–3–8 Encoder

Exercise:
Draw the hidden unit space for 2-2-2, 3-2-3, 4-2-4 and 5-2-5 encoders. Represent the input-to-hidden weights for each input unit by a point, and the hidden-to-output weights for each output unit by a line.

Now consider the 8–3–8 encoder with its 3-dimensional hidden unit space. What shape would be formed by the 8 points representing the input-to-hidden weights for the 8 input units? What shape would be formed by the planes representing the hidden-to-output weights for each output unit?

Hint: think of two platonic solids, which are “dual” to each other.

Hinton Diagrams

used to visualize higher dimensions
white = positive, black = negative
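True Hinton diagrams are drawn as grids of white and black squares whose area encodes each weight's magnitude. As a rough text-mode approximation (the glyph scheme here is invented purely for illustration):

```python
import numpy as np

def hinton_ascii(W, max_w=None):
    """Render a weight matrix as a crude text Hinton diagram:
    the symbol's sign gives the weight's sign, its 'size' the magnitude."""
    if max_w is None:
        max_w = np.abs(W).max()
    glyphs_pos = [' ', '+', 'P', '#']   # small -> large positive
    glyphs_neg = [' ', '-', 'N', '@']   # small -> large negative
    rows = []
    for row in W:
        cells = []
        for w in row:
            level = min(3, int(3 * abs(w) / max_w + 1e-9)) if max_w > 0 else 0
            cells.append((glyphs_pos if w >= 0 else glyphs_neg)[level])
        rows.append(' '.join(cells))
    return '\n'.join(rows)

W = np.array([[ 0.9, -0.1],
              [-0.8,  0.5]])
print(hinton_ascii(W))
```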
Learning Face Direction

[Figure: face-direction network, with a 30×32 Sensor Input Retina feeding 4 Hidden Units and 30 Output Units; outputs include Sharp Left, Straight Ahead and Sharp Right]
Weight Space Symmetry (8.2)

swap any pair of hidden nodes, and the overall function will be the same
on any hidden node, reverse the sign of all incoming and outgoing weights (assuming a symmetric transfer function)
hidden nodes with identical input-to-hidden weights would in theory never separate; so they all have to begin with different (small) random weights
in practice, all hidden nodes try to do a similar job at first, then gradually specialize

Controlled Nonlinearity

for small weights, each layer implements an approximately linear function, so multiple layers also implement an approximately linear function
for large weights, the transfer function approximates a step function, so computation becomes digital and learning becomes very slow
with typical weight values, a two-layer neural network implements a function which is close to linear, but takes advantage of a limited degree of nonlinearity
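The two weight-space symmetries can be checked numerically. A small sketch (sizes and weights are arbitrary) showing that swapping two hidden nodes, or flipping the sign of one node's incoming and outgoing weights, leaves a 2-layer tanh network's output unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 4)); b1 = rng.normal(size=4)   # 3 inputs, 4 hidden
W2 = rng.normal(size=(4, 2)); b2 = rng.normal(size=2)   # 2 outputs

def net(x, W1, b1, W2, b2):
    return np.tanh(x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(5, 3))
y = net(x, W1, b1, W2, b2)

# 1. swap hidden nodes 0 and 1 (permute columns of W1/b1 and rows of W2)
perm = [1, 0, 2, 3]
y_swap = net(x, W1[:, perm], b1[perm], W2[perm, :], b2)

# 2. flip the sign of all weights into and out of hidden node 2
#    (valid because tanh is odd: tanh(-z) = -tanh(z))
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[:, 2] *= -1; b1f[2] *= -1; W2f[2, :] *= -1
y_flip = net(x, W1f, b1f, W2f, b2)

print(np.allclose(y, y_swap), np.allclose(y, y_flip))   # True True
```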
Limitations of Two-Layer Neural Networks

Some functions cannot be learned with a 2-layer sigmoidal network. For example, the Twin Spirals problem cannot be learned with a 2-layer network, but it can be learned using a 3-layer network if we include shortcut connections between non-consecutive layers.

[Figure: Twin Spirals data, two interleaved spirals plotted on axes from −6 to 6]

Adding Hidden Layers

Twin Spirals can be learned by a 3-layer network with shortcut connections
first hidden layer learns linearly separable features
second hidden layer learns “convex” features
output layer combines these to produce “concave” features
training the 3-layer network is delicate
learning rate and initial weight values must be very small
otherwise, the network will converge to a local optimum
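One plausible way to wire such shortcut connections is to feed the raw input directly into every later layer as well; the forward pass below is an illustrative sketch (the sizes and initialization are made up, and no training is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 2, 10, 10, 1

W1  = rng.normal(0, 0.1, (n_in, n_h1))
W2  = rng.normal(0, 0.1, (n_h1, n_h2))
W2s = rng.normal(0, 0.1, (n_in, n_h2))   # shortcut: input -> second hidden layer
W3  = rng.normal(0, 0.1, (n_h2, n_out))
W3s = rng.normal(0, 0.1, (n_in, n_out))  # shortcut: input -> output layer

def forward(x):
    h1 = np.tanh(x @ W1)
    h2 = np.tanh(h1 @ W2 + x @ W2s)      # second hidden layer sees h1 AND x
    return np.tanh(h2 @ W3 + x @ W3s)    # output sees h2 AND x

x = rng.normal(size=(4, n_in))           # e.g. four 2-D twin-spiral points
print(forward(x).shape)
```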
Vanishing / Exploding Gradients

Training by backpropagation in networks with many layers is difficult.

When the weights are small, the differentials become smaller and smaller as we backpropagate through the layers, and end up having no effect.

When the weights are large, the activations in the higher layers will saturate to extreme values. As a result, the gradients at those layers will become very small, and will not be propagated to the earlier layers.

When the weights have intermediate values, the differentials will sometimes get multiplied many times in places where the transfer function is steep, causing them to blow up to large values.

Ways to avoid vanishing/exploding gradients:

new activation functions
weight initialization (Week 5)
batch normalization (Week 5)
long short term memory (LSTM) (Week 6)
layerwise unsupervised pre-training (Week 9)
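The vanishing case can be seen in a toy experiment: backpropagate through a stack of sigmoid layers and watch the gradient norm shrink. The depth, width and weight scale below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 20, 10
Ws = [rng.normal(0, 0.5, (width, width)) for _ in range(depth)]

# forward pass, caching the activation of each layer
a = rng.normal(size=width)
acts = []
for W in Ws:
    a = sigmoid(W @ a)
    acts.append(a)

# backward pass: delta_{l-1} = W_l^T (delta_l * s'(z_l)),  s'(z) = a(1-a)
delta = np.ones(width)
norms = []
for W, a in zip(reversed(Ws), reversed(acts)):
    delta = W.T @ (delta * a * (1 - a))
    norms.append(np.linalg.norm(delta))

print(norms[0], norms[-1])   # gradient norm near the output vs. near the input
```

Because the sigmoid derivative is at most 0.25, each layer multiplies the gradient by a factor well below 1, so after 20 layers almost nothing reaches the input weights.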
Activation Functions (6.3)

Sigmoid and hyperbolic tangent were traditionally used for 2-layer networks, but suffer from the vanishing gradient problem in deeper networks.

Rectified Linear Units (ReLUs) are popular for deep networks, including convolutional networks. Their gradients don’t vanish, but their highly linear nature may cause other problems.

Scaled Exponential Linear Units (SELUs) are a recent innovation which seems to work well for very deep networks.

[Plots: Sigmoid, Hyperbolic Tangent, Rectified Linear Unit (ReLU), and Scaled Exponential Linear Unit (SELU), each shown over −4 ≤ z ≤ 4]
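The four functions plotted above can be written directly in numpy; the SELU constants are the commonly quoted values λ ≈ 1.0507, α ≈ 1.6733 from the original SELU paper (Klambauer et al., “Self-Normalizing Neural Networks”):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # output in (0, 1)

def tanh(z):
    return np.tanh(z)                      # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)              # zero for negative inputs

def selu(z, lam=1.0507, alpha=1.6733):
    # scaled linear for z > 0, scaled exponential for z <= 0
    return lam * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z), selu(z))
```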
Ways to Avoid Overfitting in Neural Networks

limit the number of hidden nodes or connections
limit the number of training epochs (weight updates)
weight decay
dropout

Training, Validation and Test Error

[Plot: training and validation set error versus number of weight updates, example 1, up to 20000 updates]

Choose the number of hidden nodes, or the number of weight updates, to minimize the validation set error; the network will then likely also perform well on the test set.

Overfitting in Neural Networks

[Plot: training and validation set error versus number of weight updates, example 2, up to 6000 updates]

In this example, the validation set error is much larger than the training set error, but it is still decreasing at epoch 6000.

Dropout (7.12)

For each minibatch, randomly choose a subset of nodes not to be used. Each node is chosen with some fixed probability (usually one half).
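Dropout's train-time masking, and the matching test-time rescaling, can be sketched as follows (the keep probability p and the activation sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
h = rng.normal(size=(10000, 20))   # hidden activations over many minibatches

mask = rng.random(h.shape) < p     # keep each node with probability p
h_train = h * mask                 # training time: drop the masked nodes
h_test = h * p                     # test time: keep all nodes, scale by p

# on average, the training-time activation matches the test-time one,
# so each downstream unit receives the same expected input in both modes
print(np.abs(h_train.mean(axis=0) - h_test.mean(axis=0)).max())
```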
Dropout (7.12)

When training is finished and the network is deployed, all nodes are used, but their activations are multiplied by the same probability that was used in dropout. Thus, the activation received by each unit is the average value of what it would have received during training.

Dropout forces the network to achieve redundancy, because it must deal with situations where some features are missing.

Another way to view dropout is that it implicitly (and efficiently) simulates an ensemble of different architectures.

Ensembling (7.11)

Ensembling is a method where a number of different classifiers are trained on the same task, and the final class is decided by “voting” among them.

In order to benefit from ensembling, we need diversity among the different classifiers.

For example, we could train three neural networks with different architectures, three Support Vector Machines with different dimensions and kernels, as well as two other classifiers, and ensemble all of them to produce a final result. (Kaggle Competition entries are often done this way.)
Bagging

Diversity can also be achieved by training on different subsets of the data. Suppose we are given N training items. Each time we train a new classifier, we choose N items from the training set with replacement. This means that some items will not be chosen, while others are chosen two or three times.

There will be diversity among the resulting classifiers because they have each been trained on a different subset of the data. They can be ensembled to produce a more accurate result than a single classifier.

Dropout as an Implicit Ensemble

In the case of dropout, the same data are used each time, but a different architecture is created by removing the nodes that are dropped out. The trick of multiplying the output of each node by the dropout probability implicitly averages the output over all of these different models.
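The bootstrap sampling used in bagging is one line of numpy; the example below (N is arbitrary) also shows a well-known consequence: each bootstrap sample covers only about 1 − 1/e ≈ 63% of the distinct training items, which is what creates the diversity.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
indices = rng.integers(0, N, size=N)   # draw N item indices WITH replacement

unique = np.unique(indices)
frac_included = len(unique) / N        # fraction of items appearing at least once
print(frac_included)                   # close to 1 - 1/e ~ 0.632
```

A classifier trained on `data[indices]` sees some items twice or three times and misses the rest entirely; repeating this draw for each classifier yields the diverse ensemble described above.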