Lecture 9: Neural Networks
COMP90049
Introduction to Machine Learning Semester 1, 2020
Lea Frermann, CIS
Roadmap
So far … Classification and Evaluation
• Naive Bayes, Logistic Regression, Perceptron
• Probabilistic models
• Loss functions and estimation
• Evaluation
Today… Neural Networks
• Multilayer Perceptron
• Motivation and architecture
• Linear vs. non-linear classifiers
Introduction
Classifier Recap
Perceptron

$$\hat{y} = f(\theta \cdot x) = \begin{cases} 1 & \text{if } \theta \cdot x \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

• Single processing ‘unit’
• Inspired by neurons in the brain
• Activation: step function (discrete, non-differentiable)

Logistic Regression

$$P(y = 1 \mid x; \theta) = \frac{1}{1 + \exp\left(-\sum_{f=0}^{F} \theta_f x_f\right)}$$

• View 1: Model of P(y = 1|x), maximizing the data log likelihood
• View 2: Single processing ‘unit’
• Activation: sigmoid (continuous, differentiable)
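A minimal sketch of the two prediction rules above (assuming NumPy, with the bias θ0 folded in via a constant feature x0 = 1):

```python
import numpy as np

def perceptron_predict(theta, x):
    """Step activation: +1 if theta . x >= 0, else -1."""
    return 1 if np.dot(theta, x) >= 0 else -1

def logistic_predict(theta, x):
    """Sigmoid activation: P(y = 1 | x; theta)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

theta = np.array([-0.5, 1.0, 1.0])     # [theta_0 (bias), theta_1, theta_2]
x = np.array([1.0, 0.2, 0.7])          # [x_0 = 1, x_1, x_2]
print(perceptron_predict(theta, x))    # -> 1
print(logistic_predict(theta, x))      # -> ~0.60
```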
Neural Networks and Deep Learning
Neural Networks
• Connected sets of many such units
• Units must have continuous activation functions
• Connected into many layers → Deep Learning
Multi-layer Perceptron
• This lecture!
• One specific type of neural network
• Feed-forward
• Fully connected
• Supervised learner
Other types of neural networks
• Convolutional neural networks
• Recurrent neural networks
• Autoencoder (unsupervised)
Perceptron Unit (recap)
A single processing unit

[Figure: a single unit with inputs 1, x1, x2, …, xF, weights θ0, θ1, θ2, …, θF, an activation function f(x;θ), and output y = f(θᵀx)]

A neural network is a combination of lots of these units.
Multi-layer Perceptron (schematic)
Three types of layers
• Input layer with input units x: the first layer, takes features x as inputs
• Output layer with output units y: the last layer, has one unit per possible output (e.g., 1 unit for binary classification)
• Hidden layers with hidden units h: all layers in between.

[Figures: the same schematic drawn three times: (i) an input layer x1, …, xF connected directly to a single output unit f(x;θ) producing y; (ii) the same network with one hidden layer of units f(x;θ) in between; (iii) the same network with two hidden layers in between]
Why the Hype? I
Linear classification
• The perceptron, naive Bayes, and logistic regression are linear classifiers
• Decision boundary is a linear combination of features $\sum_i \theta_i x_i$
• Cannot learn ‘feature interactions’ naturally
• Perceptron can solve only linearly separable problems
Non-linear classification
• Neural networks with at least 1 hidden layer and non-linear activations are non-linear classifiers
• Decision boundary is a non-linear function of the inputs
• Capture ‘feature interactions’
Why the Hype? II
Feature Engineering
• (more next week!)
• The perceptron, naive Bayes and logistic regression require a fixed set of informative features
• e.g., outlook ∈ {overcast, sunny, rainy}, wind ∈ {high, low}, etc.
• Requiring domain knowledge
Feature learning
• Neural networks take as input ‘raw’ data
• They learn features themselves as intermediate representations
• They learn features as part of their target task (e.g., classification)
• ‘Representation learning’: learning representations (or features) of the data that are useful for the target task
• Note: often feature engineering is replaced at the cost of additional parameter tuning (layers, activations, learning rates, …)
Multilayer Perceptron: Motivation I
Example classification dataset

Date         Outlook    Temperature   Humidity   Windy   True Label
01/03/1966   sunny      hot           high       FALSE   no
01/04/1966   sunny      hot           high       TRUE    no
01/05/1966   overcast   hot           high       FALSE   yes
01/06/1966   rainy      mild          high       FALSE   yes
…

We really observe raw data (measurements):

[Table: the same four instances as raw numeric measurements (0.4, 4.7, 3.4, −0.7, 0.3, 8.7, 5.5, 5.7, 1.5, 12.7, 3.8, 18.7, 136.9, 17, 65.5, 2.7, …) with the same true labels no, no, yes, yes]
Multilayer Perceptron: Motivation II
Example problem: Weather dataset

[Figures: a sequence of diagrams building up the network: (i) raw measurements m1, m2, m3, m4, …, mF feed directly into the prediction Play?; (ii) the measurements form an input layer x1, …, xF connected to a single output unit f(x;θ), i.e., a perceptron; (iii) ideally, intermediate features such as Outlook, Temperature, Humidity, and Windy would summarize the measurements before predicting Play?, but in general these intermediate concepts are unknown (‘???’); (iv) a multilayer perceptron learns them: input layer, one hidden layer of units f(x;θ), output layer; and finally input layer, two hidden layers, output layer]
Another Example: Face Recognition
• The hidden layers learn increasingly high-level feature representations
• e.g., given an image, predict the person:
[Figure: learned feature hierarchy for face images]
Source: https://devblogs.nvidia.com/accelerate-machine-learning-cudnn-deep-neural-network-library/
Multilayer Perceptron
Terminology
• Input units x_j, one per feature j
• Multiple layers l = 1, …, L of nodes. L is the depth of the network.
• Each layer l has a number of units K_l. K_l is the width of layer l.
• The width can vary from layer to layer
• Output unit y
• Each layer l is fully connected to its neighboring layers l − 1 and l + 1
• One weight θ_ij^(l) for each connection ij (including the ‘bias’ θ_0)
• A non-linear activation function for layer l, written φ^(l)
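A small sketch of how depth and width translate into parameter counts (the layer widths below are hypothetical, chosen only for illustration):

```python
# Hypothetical layer widths: F = 4 input features, two hidden layers, 1 output unit.
widths = [4, 8, 8, 1]    # K_0 (input), K_1, K_2, K_3 (output); depth L = 3

# Full connectivity between neighbouring layers: K_{l-1} * K_l weights plus K_l biases.
n_params = sum(k_in * k_out + k_out for k_in, k_out in zip(widths[:-1], widths[1:]))
print(n_params)          # (4*8 + 8) + (8*8 + 8) + (8*1 + 1) = 121
```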
Prediction with a feedforward Network
Passing an input through a neural network with 2 hidden layers:

$$h_i^{(1)} = \phi^{(1)}\Big(\sum_j \theta_{ij}^{(1)} x_j\Big)$$

$$h_i^{(2)} = \phi^{(2)}\Big(\sum_j \theta_{ij}^{(2)} h_j^{(1)}\Big)$$

$$y_i = \phi^{(3)}\Big(\sum_j \theta_{ij}^{(3)} h_j^{(2)}\Big)$$

Or in vectorized form:

$$h^{(1)} = \phi^{(1)}\big(\theta^{(1)T} x\big) \qquad h^{(2)} = \phi^{(2)}\big(\theta^{(2)T} h^{(1)}\big) \qquad y = \phi^{(3)}\big(\theta^{(3)T} h^{(2)}\big)$$

where the activation functions φ^(l) are applied element-wise to all entries.
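A minimal NumPy sketch of this forward pass (assuming ReLU activations for the hidden layers, a sigmoid output for binary classification, and biases folded into the weights):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta1, theta2, theta3):
    """Forward pass through a 2-hidden-layer MLP.
    Each theta has shape (n_inputs_to_layer, n_units_in_layer)."""
    h1 = relu(theta1.T @ x)       # hidden layer 1
    h2 = relu(theta2.T @ h1)      # hidden layer 2
    y = sigmoid(theta3.T @ h2)    # output layer
    return y

# Toy example: 3 input features, hidden widths 4 and 3, a single output.
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 4))
theta2 = rng.normal(size=(4, 3))
theta3 = rng.normal(size=(3, 1))
x = np.array([0.5, -1.2, 0.3])
print(forward(x, theta1, theta2, theta3))
```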
Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?

x1   x2   y
1    1    1
1    0    0
0    1    0
0    0    0

Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?

x1   x2   y
1    1    0
1    0    1
0    1    1
0    0    1

Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?

x1   x2   y
1    1    0
1    0    1
0    1    1
0    0    0
A Multilayer Perceptron for XOR
With the hard-threshold activation, and recalling the per-layer computation:

$$\phi(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} \qquad h_i^{(l)} = \phi^{(l)}\Big(\sum_j \theta_{ij}^{(l)} h_j^{(l-1)} + b_i^{(l)}\Big)$$

[Figure: a small MLP with hard-threshold units that computes XOR]
Source: https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L05%20Multilayer%20Perceptrons.pdf
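One well-known construction (a sketch; not necessarily the exact weights used in the figure from the linked notes) uses two hard-threshold hidden units computing OR and AND of the inputs:

```python
import numpy as np

def step(z):
    """Hard threshold: 1 if z >= 0, else 0 (element-wise)."""
    return (np.asarray(z) >= 0).astype(float)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h[0] fires for x1 OR x2, h[1] fires for x1 AND x2.
    h = step(np.array([[1.0, 1.0],
                       [1.0, 1.0]]) @ x + np.array([-0.5, -1.5]))
    # Output fires when OR is true but AND is false, i.e. XOR.
    return step(np.array([1.0, -1.0]) @ h - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_mlp(a, b)))   # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```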
Designing Neural Networks I: Inputs
Inputs and feature functions
• x could be a patient with features {blood pressure, height, age, weight, ...}
• x could be a text, i.e., a sequence of words
• x could be an image, i.e., a matrix of pixels
Non-numerical features need to be mapped to numerical ones
• For language, it is typical to map words to pre-trained embedding vectors
  • for embeddings: dim(x) = k, the dimensionality of the embedding vectors
• Alternative: 1-hot encoding
  • for 1-hot: dim(x) = V (the number of words in the vocabulary)
• For pixels, map to RGB, or other visual features
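As a quick illustration of the 1-hot option (the toy vocabulary below is made up for the example):

```python
import numpy as np

# A toy vocabulary; in practice V is tens of thousands of words.
vocab = ["sunny", "rainy", "hot", "mild", "windy"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Map a word to a 1-hot vector of dimension V = len(vocab)."""
    x = np.zeros(len(vocab))
    x[word_to_index[word]] = 1.0
    return x

print(one_hot("hot"))   # [0. 0. 1. 0. 0.]
```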
Designing Neural Networks II: Activation Functions
• Each layer has an associated activation function (e.g., sigmoid, ReLU, ...)
• Represents the extent to which a neuron is ‘activated’ given an input
• Each hidden layer performs a non-linear transformation of the input
• Popular choices include
1. logistic (aka sigmoid) (“σ”):
$$f(x) = \frac{1}{1 + e^{-x}}$$
2. hyperbolic tan (“tanh”):
$$f(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$
3. rectified linear unit (“ReLU”):
$$f(x) = \max(0, x)$$
   note: not differentiable at x = 0
Designing Neural Networks II: Activation Functions
[Figure: plots of the function values and the derivatives of the sigmoid, tanh, and ReLU activations]
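A minimal sketch of these activations and their derivatives (assuming NumPy; treating the ReLU derivative at 0 as 0, which is a common convention rather than something the slides prescribe):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # sigma(x) * (1 - sigma(x))

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # equivalent to np.tanh(x)

def tanh_grad(x):
    return 1.0 - tanh(x) ** 2            # 1 - tanh(x)^2

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)         # undefined at 0; 0 used here by convention

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```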
Designing Neural Networks III: Structure
Network Structure
• Sequence of hidden layers l1, …, lL for a network of depth L
• Each layer l has Kl parallel neurons (breadth)
• Many layers (depth) vs. many neurons per layer (breadth)? Empirical question, theoretically poorly understood.
Advanced tricks exploit structure in the data
• convolutions (convolutional neural networks; CNNs), e.g., computer vision
• recurrences (recurrent neural networks; RNNs), e.g., natural language processing
• attention (an efficient alternative to recurrence)
• ...
Beyond the scope of this class.
Designing Neural Networks IV: Output Function
Neural networks can learn different concepts: classification, regression, ...
The output function depends on the concept of interest.
• Binary classification:
  • one neuron, with step function (as in the perceptron)
• Multiclass classification:
  • typically softmax to normalize the K outputs z from the pre-final layer into a probability distribution over classes
$$p(y_i = j \mid x_i; \theta) = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}$$
• Regression:
  • identity function
  • possibly other continuous functions such as sigmoid or tanh
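A minimal softmax sketch (the max-subtraction is a standard numerical-stability trick, not something required by the formula itself):

```python
import numpy as np

def softmax(z):
    """Normalize raw scores z into a probability distribution over K classes."""
    z = z - np.max(z)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # pre-final layer outputs for K = 3 classes
print(softmax(scores))               # -> approx [0.66, 0.24, 0.10]
```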
Designing Neural Networks V: Loss Functions
Classification loss: typically the negative conditional log-likelihood (cross-entropy)

$$\mathcal{L}_i = -\log p(y^{(i)} \mid x^{(i)}; \theta) \quad \text{for a single instance } i$$

$$\mathcal{L} = -\sum_i \log p(y^{(i)} \mid x^{(i)}; \theta) \quad \text{for all instances}$$

• Binary classification loss, with $\hat{y}^{(i)} = p(y^{(i)} = 1 \mid x^{(i)}; \theta)$:

$$\mathcal{L} = -\sum_i \big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \big]$$

• Multiclass classification, with $\hat{y}_j^{(i)} = p(y^{(i)} = j \mid x^{(i)}; \theta)$ for the j possible labels, and $y_j^{(i)} = 1$ if j is the true label for instance i, else 0:

$$\mathcal{L} = -\sum_i \sum_j y_j^{(i)} \log(\hat{y}_j^{(i)})$$
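A minimal sketch of these two losses (assuming NumPy, labels given as 0/1 values or one-hot vectors, and predicted probabilities strictly between 0 and 1):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """y: true labels in {0, 1}; y_hat: predicted P(y = 1 | x)."""
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_cross_entropy(Y, Y_hat):
    """Y: one-hot true labels, shape (N, K); Y_hat: predicted class probabilities."""
    return -np.sum(Y * np.log(Y_hat))

y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, y_hat))   # ~0.84
```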
Designing Neural Networks V: Loss Functions
Regression loss: typically the mean squared error (MSE)
• Here, the output as well as the target are real-valued numbers

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \big( y^{(i)} - \hat{y}^{(i)} \big)^2$$
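And the corresponding regression loss, as a one-line sketch:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between real-valued targets y and predictions y_hat."""
    return np.mean((y - y_hat) ** 2)

print(mse(np.array([3.0, -0.5, 2.0]), np.array([2.5, 0.0, 2.0])))   # ~0.167
```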
Representational Power of Neural Nets
• The universal approximation theorem states that a feed-forward neural network with a single hidden layer (and finitely many neurons) is able to approximate any continuous function on R^n
• Note that the activation functions must be non-linear, as without this, the model is simply a (complex) linear model
How to Train a NN with Hidden Layers
• Unfortunately, the perceptron algorithm can’t be used to train neural nets with hidden layers, as we can’t directly observe the correct outputs (‘labels’) of the hidden units
• Instead, train neural nets with backpropagation. Intuitively:
  • compute errors at the output layer w.r.t. each weight using partial differentiation
  • propagate those errors back to each of the earlier layers
• Essentially just gradient descent, but using the chain rule to make the calculations more efficient
Next lecture: Backpropagation for training neural networks
Reflections
When is linear classification enough?
• If we know our classes are (approximately) linearly separable
• If the feature space is (very) high-dimensional, i.e., the number of features exceeds the number of training instances
• If the training set is small
• If interpretability is important, i.e., understanding how (combinations of) features explain different predictions
Pros and Cons of Neural Networks
Pros
• Powerful tool!
• Neural networks with at least 1 hidden layer can approximate any (continuous) function; they are universal approximators
• Automatic feature learning
• Empirically, very good performance for many diverse tasks
Cons
• Powerful model increases the danger of ‘overfitting’
• Requires large training data sets
• Often requires powerful compute resources (GPUs)
• Lack of interpretability
Summary
Today
• From perceptrons to neural networks
• Multilayer perceptron
• Some examples
• Features and limitations
Next Lecture
• Learning parameters of neural networks
• The Backpropagation algorithm
References
Jacob Eisenstein (2019). Natural Language Processing. MIT Press. Chapters 3 (intro), 3.1, 3.2. https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf
Dan Jurafsky and James H. Martin. Speech and Language Processing. Chapters 7.2, 7.3. Online Draft V3.0. https://web.stanford.edu/~jurafsky/slp3/
Multilayer Perceptron: Motivation II
Another example problem: Sentiment analysis of movie reviews

[Figures: a sequence of diagrams building up the network: (i) binary features of a review (Contains ‘good’, Contains ‘awful’, Length > 3, Contains ‘actor’, Contains ‘story’, Contains ‘plot’, …, Contains ‘ending’) feed directly into the prediction Good?; (ii) the features form an input layer x1, …, xF connected to a single output unit f(x;θ;b); (iii) ideally, intermediate concepts such as word sentiment, reasonably long, reviews actors, and summarizes plot would mediate between the features and Good?, but in general these intermediate concepts are unknown (‘???’); (iv) a multilayer perceptron learns them: input layer, one hidden layer of units f(x;θ;b), output layer; and finally input layer, two hidden layers, output layer]