Lecture 8. Deep Learning. Convolutional ANNs. Autoencoders
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Deep learning
∗ Representation capacity
∗ Deep models and representation learning
• Convolutional Neural Networks
∗ Convolution operator
∗ Elements of a convolution-based network
• Autoencoders
∗ Learning efficient coding
Deep Learning and Representation Learning
Hidden layers viewed as feature space transformation
Representational capacity
• ANNs with a single hidden layer are universal approximators
• For example, such ANNs can represent any Boolean function
$\mathrm{OR}(x_1, x_2)$: $u = g(x_1 + x_2 - 0.5)$
$\mathrm{AND}(x_1, x_2)$: $u = g(x_1 + x_2 - 1.5)$
$\mathrm{NOT}(x_1)$: $u = g(-x_1)$
where $g(r) = 1$ if $r \ge 0$ and $g(r) = 0$ otherwise
• Any Boolean function over $m$ variables can be implemented using a hidden layer with up to $2^m$ elements
• More efficient to stack several hidden layers
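To make this concrete, here is a minimal Python sketch of these three gates as single units with the step activation $g$; the XOR composition at the end is my own illustration of why stacking units into layers helps.

```python
def g(r):
    """Step activation: g(r) = 1 if r >= 0, else 0."""
    return 1 if r >= 0 else 0

def OR(x1, x2):  return g(x1 + x2 - 0.5)
def AND(x1, x2): return g(x1 + x2 - 1.5)
def NOT(x1):     return g(-x1)

# XOR is not representable by a single unit, but composing the
# gates above (i.e. adding a hidden layer) suffices:
def XOR(x1, x2): return AND(OR(x1, x2), NOT(AND(x1, x2)))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", OR(x1, x2), AND(x1, x2), XOR(x1, x2))
```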
Deep networks
"Depth" refers to the number of hidden layers.
[Figure: a deep feed-forward network with input layer $\boldsymbol{x}$, hidden layers 1–3 producing $\boldsymbol{s}$, $\boldsymbol{t}$, $\boldsymbol{u}$ via weight matrices $\boldsymbol{A}$, $\boldsymbol{B}$, $\boldsymbol{C}$, and output layer $\boldsymbol{z}$ via $\boldsymbol{D}$.]
$\boldsymbol{s} = \tanh(\boldsymbol{A}'\boldsymbol{x})$, $\boldsymbol{t} = \tanh(\boldsymbol{B}'\boldsymbol{s})$, $\boldsymbol{u} = \tanh(\boldsymbol{C}'\boldsymbol{t})$, $\boldsymbol{z} = \tanh(\boldsymbol{D}'\boldsymbol{u})$
Deep ANNs as representation learning
• Consecutive layers form representations of the input of increasing complexity
• An ANN can have a simple linear output layer, operating on a complex non-linear representation of the input:
$$\boldsymbol{z} = \tanh\Big(\boldsymbol{D}' \tanh\big(\boldsymbol{C}' \tanh\big(\boldsymbol{B}' \tanh(\boldsymbol{A}'\boldsymbol{x})\big)\big)\Big)$$
• Equivalently, a hidden layer can be thought of as the transformed feature space, e.g., $\boldsymbol{u} = \varphi(\boldsymbol{x})$
• Parameters of such a transformation are learned from data
(Bias terms are omitted for simplicity)
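A minimal numpy sketch of this forward pass, with made-up layer widths and random untrained weights (biases omitted, as on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up layer widths: input 4 -> hidden 5, 3, 3 -> output 2
A = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 3))
C = rng.normal(size=(3, 3))
D = rng.normal(size=(3, 2))

x = rng.normal(size=4)
s = np.tanh(A.T @ x)  # hidden layer 1
t = np.tanh(B.T @ s)  # hidden layer 2
u = np.tanh(C.T @ t)  # hidden layer 3: the learned representation phi(x)
z = np.tanh(D.T @ u)  # output
print(z)
```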
ANN layers as data transformation
[Figure: the network drawn with a boundary between "input data" and "the model": input $\boldsymbol{x}$ feeds hidden layers producing $\boldsymbol{s}$, $\boldsymbol{t}$, $\boldsymbol{u}$ and output $\boldsymbol{z}$. Shifting the boundary one layer at a time, the output of each hidden layer can equally be viewed as pre-processed data fed into the remainder of the model.]
Depth vs width
• A single infinitely wide layer in theory gives a universal approximator
• However, (empirically) depth yields more accurate models
• Biological inspiration from the eye:
∗ first detect small edges and color patches;
∗ compose these into smaller shapes;
∗ building to more complex detectors of, e.g., textures, faces, etc.
• Seek to mimic layered complexity in a network
• However, the vanishing gradient problem affects learning with very deep models
Animals in the zoo
[Figure: a taxonomy of ANN architectures]
• Artificial Neural Networks (ANNs)
∗ Recurrent neural networks
∗ Feed-forward networks, including perceptrons, multilayer perceptrons and convolutional neural networks
• Recurrent neural networks are not covered in this subject
• An autoencoder is an ANN trained in a specific way.
∗ E.g., a multilayer perceptron can be trained as an autoencoder, or a recurrent neural network can be trained as an autoencoder.
Convolutional Neural Networks (CNN)
Based on repeated application of small filters to patches of a 2D image or ranges of a 1D input
Convolution
[Figure: a sliding window of weights $w_1, w_2, w_3$ moves along the input vector $\boldsymbol{a}$; at each position, the weighted sum over the window produces one element of the output vector $\boldsymbol{b}$.]
$$b_i = \sum_{\delta=-C}^{C} a_{i+\delta}\, w_{\delta+C+1}$$
Here the window has size $2C + 1 = 3$, so valid outputs start at $i \ge 2$. In vector form $\boldsymbol{b} = \boldsymbol{a} * \boldsymbol{w}$, where $\boldsymbol{w} = (w_1, w_2, w_3)^T$ is called a kernel* or filter.
*Later in the subject, we will also use an unrelated definition of kernel as a function representing a dot product
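A numpy sketch of this formula (0-based indexing, so the kernel index becomes $\delta + C$; note that this sliding-window operation does not flip the kernel, which is what numpy calls correlation):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input vector a
w = np.array([0.25, 0.5, 0.25])          # kernel w of size 2C + 1, with C = 1
C = (len(w) - 1) // 2

# Direct translation of b_i = sum_{delta=-C..C} a_{i+delta} w_{delta+C+1}
b = np.array([sum(a[i + d] * w[d + C] for d in range(-C, C + 1))
              for i in range(C, len(a) - C)])

# Same result via numpy: correlate slides the kernel without flipping it
assert np.allclose(b, np.correlate(a, w, mode="valid"))
print(b)  # [2. 3. 4.]
```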
Convolution on 2D images
[Figure: a filter $\boldsymbol{W}$ slides over the input image $\boldsymbol{A}$; each placement produces one output pixel of $\boldsymbol{B}$ as the weighted sum over the window.]
$$\boldsymbol{B} = \boldsymbol{A} * \boldsymbol{W}, \qquad B_{ij} = \sum_{\delta_i=-C}^{C} \sum_{\delta_j=-D}^{D} A_{i+\delta_i,\, j+\delta_j}\, W_{\delta_i+C+1,\, \delta_j+D+1}$$
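A minimal numpy sketch of this formula as a hand-rolled helper (my own function, computed over the "valid" region only, i.e. without padding):

```python
import numpy as np

def conv2d(A, W):
    """Slide filter W over image A (no flipping, 'valid' region only);
    each output pixel is the weighted sum over one window."""
    h, w = W.shape
    H, Wd = A.shape
    B = np.zeros((H - h + 1, Wd - w + 1))
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            B[i, j] = np.sum(A[i:i + h, j:j + w] * W)  # one output pixel
    return B
```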
Filters as feature detectors
Convolve the input image $\boldsymbol{A}$ with a vertical edge filter, then apply the activation function:
$$\boldsymbol{W} = \begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix}$$
[Figure: the filtered image responds strongly at vertical edges of the input.]
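Reusing the conv2d helper sketched above on a made-up toy image containing one vertical edge:

```python
import numpy as np

# Toy 5x6 image: dark left half, bright right half (a vertical edge)
A = np.array([[0, 0, 0, 9, 9, 9]] * 5, dtype=float)

W = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]], dtype=float)  # vertical edge filter

B = conv2d(A, W)  # conv2d from the sketch above
print(B)
# [[ 0. 27. 27.  0.]
#  [ 0. 27. 27.  0.]
#  [ 0. 27. 27.  0.]]  -- large responses only around the edge
```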
Filters as feature detectors
Convolve the input image $\boldsymbol{A}$ with a horizontal edge filter, then apply the activation function:
$$\boldsymbol{W} = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{pmatrix}$$
[Figure: the filtered image responds strongly at horizontal edges of the input.]
Stacking convolutions
[Figure: several small filters ($3 \times 3$ grids of weights) applied to an image, followed by downsampling and further convolutions.]
• Develop complex representations at different scales and complexity
• Filters are learned from training data!
CNN for computer vision
[Figure: a CNN for images, implemented by Jizhizi Li based on LeNet5 (http://deeplearning.net/tutorial/lenet.html): patches of $48 \times 48$ → 2D convolution → $48 \times 48 \times 5$ → downsampling → $24 \times 24 \times 5$ → 3D convolution → $24 \times 24 \times 10$ → downsampling → $12 \times 12 \times 10$ → flattening → $1 \times 1440$ → fully connected → $1 \times 720$ → linear regression.]
Components of a CNN
• Convolutional layers
∗ Complex input representations based on convolution operation
∗ Filter weights are learned from training data
• Downsampling, usually via max pooling
∗ Re-scales to smaller resolution, limits parameter explosion
• Fully connected parts and output layer
∗ Merges representations together
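Putting these components together, a sketch in Keras (assuming TensorFlow/Keras is available; the shapes and filter counts loosely follow the LeNet5-style pipeline above and are illustrative, not the lecture's exact model):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(48, 48, 1)),               # grayscale 48x48 patches
    keras.layers.Conv2D(5, 5, padding="same", activation="relu"),
    keras.layers.MaxPooling2D(2),                 # downsample 48 -> 24
    keras.layers.Conv2D(10, 5, padding="same", activation="relu"),
    keras.layers.MaxPooling2D(2),                 # downsample 24 -> 12
    keras.layers.Flatten(),                       # 12 * 12 * 10 = 1440
    keras.layers.Dense(720, activation="relu"),   # fully connected part
    keras.layers.Dense(10, activation="softmax"), # output layer
])
model.summary()
```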
Downsampling via max pooling
• Special type of processing layer. For an $m \times m$ patch:
$$v = \max(u_{11}, u_{12}, \ldots, u_{mm})$$
• Strictly speaking, not everywhere differentiable. Instead, the gradient is defined according to a "sub-gradient":
∗ Tiny changes in values of $u_{ij}$ that are not the maximum do not change $v$
∗ If $u_{ij}$ is the maximum value, tiny changes in that value change $v$ linearly
∗ Use $\partial v / \partial u_{ij} = 1$ if $u_{ij} = v$, and $\partial v / \partial u_{ij} = 0$ otherwise
• Forward pass records the maximising element, which is then used in the backward pass during back-propagation
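A tiny numpy illustration of the forward pass and this sub-gradient on one $2 \times 2$ patch:

```python
import numpy as np

U = np.array([[1.0, 3.0],
              [2.0, 7.0]])  # one 2x2 patch of values u_ij

v = U.max()                 # forward pass: v = 7.0, maximiser recorded

# Sub-gradient for the backward pass: 1 at the maximising element,
# 0 everywhere else
dv_dU = (U == v).astype(float)
print(dv_dU)
# [[0. 0.]
#  [0. 1.]]
```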
Convolution as a regulariser
[Figure: a fully connected, unrestricted layer compared with one under the restriction "same color – same weight": tying weights in this way restricts the fully connected layer to a convolution.]
Document classification (Kalchbrenner et al, 2014)
Structure of text is important for classifying documents
Capture patterns of nearby words using 1D convolutions
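A sketch of this idea in Keras (assuming TensorFlow/Keras; the vocabulary size, embedding width, filter count and class count are all made up for illustration, not taken from the paper):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(None,)),                    # a document as word ids
    keras.layers.Embedding(input_dim=10000, output_dim=50),
    keras.layers.Conv1D(64, 3, activation="relu"), # patterns of 3 nearby words
    keras.layers.GlobalMaxPooling1D(),             # strongest response per filter
    keras.layers.Dense(2, activation="softmax"),   # e.g. two document classes
])
```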
Autoencoder
An ANN training setup that can be used for unsupervised learning, initialisation, or just efficient coding
Autoencoding idea
• Supervised learning:
∗ Univariate regression: predict $y$ from $\boldsymbol{x}$
∗ Multivariate regression: predict $\boldsymbol{y}$ from $\boldsymbol{x}$
• Unsupervised learning: explore data $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$
∗ No response variable
• For each $\boldsymbol{x}_i$ set $\boldsymbol{y}_i \equiv \boldsymbol{x}_i$
• Train an ANN to predict $\boldsymbol{y}_i$ from $\boldsymbol{x}_i$
• Pointless?
Autoencoder topology
• Given data without labels $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$, set $\boldsymbol{y}_i \equiv \boldsymbol{x}_i$ and train an ANN to predict $\boldsymbol{z}(\boldsymbol{x}_i) \approx \boldsymbol{x}_i$
• Set the bottleneck layer $\boldsymbol{u}$ in the middle "thinner" than the input
[Figure: an hourglass-shaped network mapping input $\boldsymbol{x}$ through bottleneck $\boldsymbol{u}$ to reconstruction $\boldsymbol{z}$; adapted from Chervinskii at Wikimedia Commons (CC4).]
Introducing the bottleneck
• Suppose you managed to train a network that gives a good restoration of the original signal, $\boldsymbol{z}(\boldsymbol{x}_i) \approx \boldsymbol{x}_i$
• This means that the data structure can be effectively described (encoded) by the lower dimensional representation $\boldsymbol{u}$
[Figure: input $\boldsymbol{x}$, bottleneck $\boldsymbol{u}$, reconstruction $\boldsymbol{z}$; adapted from Chervinskii at Wikimedia Commons (CC4).]
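A minimal Keras sketch of this setup (assuming TensorFlow/Keras; the 64-dimensional input and 8-dimensional bottleneck are made-up numbers):

```python
from tensorflow import keras

x = keras.Input(shape=(64,))
u = keras.layers.Dense(8, activation="tanh")(x)     # encoder / bottleneck u
z = keras.layers.Dense(64, activation="linear")(u)  # decoder / reconstruction z

autoencoder = keras.Model(x, z)
autoencoder.compile(optimizer="adam", loss="mse")
# Train with the inputs as their own targets:
# autoencoder.fit(X_train, X_train, epochs=10)

# The trained encoder alone yields the low-dimensional codes u(x)
encoder = keras.Model(x, u)
```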
Dimensionality reduction
• Autoencoders can be used for compression and dimensionality reduction via a non-linear transformation
• If you use linear activation functions and only one hidden layer, then the setup becomes almost that of Principal Component Analysis (stay tuned!)
∗ ANN might find a different solution, doesn't use eigenvalues (directly)
Tools
• Tensorflow, Theano, Torch
∗ python / lua toolkits for deep learning
∗ symbolic or automatic differentiation
∗ GPU support for fast compilation
∗ Theano tutorials at http://deeplearning.net/tutorial/
• Various others
∗ Caffe
∗ CNTK
∗ deeplearning4j …
• Keras: high-level Python API. Can run on top of TensorFlow, CNTK, or Theano
This lecture
• Deep learning
∗ Representation capacity
∗ Deep models and representation learning
• Convolutional Neural Networks
∗ Convolution operator
∗ Elements of a convolution-based network
• Autoencoders
∗ Learning efficient coding
• Workshops Week #5: Neural net topics
• Next lectures: Kernel methods