Lecture 9: Neural Networks
COMP90049
Introduction to Machine Learning
Semester 1, 2020
Lea Frermann, CIS
1
Roadmap
So far … Classification and Evaluation
• Naive Bayes, Logistic Regression, Perceptron
• Probabilistic models
• Loss functions, and estimation
• Evaluation
Today… Neural Networks
• Multilayer Perceptron
• Motivation and architecture
• Linear vs. non-linear classifiers
2
Introduction
Classifier Recap
Perceptron
ŷ = f(θ · x) = 1 if θ · x ≥ 0, −1 otherwise
• Single processing ‘unit’
• Inspired by neurons in the brain
• Activation: step-function (discrete, non-differentiable)
Logistic Regression
P(y = 1 | x; θ) = 1 / (1 + exp(−∑_{f=0}^{F} θ_f x_f))
• View 1: Model of P(y = 1|x), maximizing the data log likelihood
• View 2: Single processing ‘unit’
• Activation: sigmoid (continuous, differentiable)
3
Neural Networks and Deep Learning
Neural Networks
• Connected sets of many such units
• Units must have continuous activation functions
• Connected into many layers → Deep Learning
Multi-layer Perceptron
• This lecture!
• One specific type of neural network
• Feed-forward
• Fully connected
• Supervised learner
Other types of neural networks
• Convolutional neural networks
• Recurrent neural networks
• Autoencoder (unsupervised)
4
Perceptron Unit (recap)
A single processing unit
[Figure: a single processing unit with inputs x1 … xF (plus a bias input 1), weights θ0, θ1, …, θF, activation f(x;θ), and output y = f(θ^T x)]
A neural network is a combination of lots of these units.
5
Multi-layer Perceptron (schematic)
Three Types of layers
• Input layer with input units x : the first layer, takes features x as inputs
• Output layer with output units y : the last layer, has one unit per possible
output (e.g., 1 unit for binary classification)
• Hidden layers with hidden units h: all layers in between.
[Figure: input units x1 … xF (plus a bias input 1), fully connected to Hidden layer 1 and Hidden layer 2 of units f(x;θ), and a single output unit y: Input layer, Hidden layer 1, Hidden layer 2, Output layer]
6
Why the Hype? I
Linear classification
• The perceptron, naive Bayes, and logistic regression are linear classifiers
• The decision boundary is a linear combination of the features, ∑_i θ_i x_i
• Cannot learn ‘feature interactions’ naturally
• Perceptron can solve only linearly separable problems
Non-linear classification
• Neural networks with at least 1 hidden layer and non-linear activations
are non-linear classifiers
• Decision boundary is a non-linear function of the inputs
• Capture ‘feature interactions’
7
Why the Hype? II
Feature Engineering
• (more next week!)
• The perceptron, naive Bayes and logistic regression require a fixed set
of informative features
• e.g., outlook ∈ {overcast , sunny , rainy}, wind ∈ {high, low} etc
• Defining such features requires domain knowledge
Feature learning
• Neural networks take as input ‘raw’ data
• They learn features themselves as intermediate representations
• They learn features as part of their target task (e.g., classification)
• ‘Representation learning’: learning representations (or features) of the
data that are useful for the target task
• Note: removing feature engineering often comes at the cost of additional hyperparameter tuning (layers, activations, learning rates, …)
8
Multilayer Perceptron: Motivation I
Example Classification dataset
Outlook Temperature Humidity Windy True Label
sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
. . .
What we actually observe is raw data
Date measurements True Label
01/03/1966 0.4 4.7 1.5 12.7 … no
01/04/1966 3.4 -0.7 3.8 18.7 … no
01/05/1966 0.3 8.7 136.9 17 … yes
01/06/1966 5.5 5.7 65.5 2.7 … yes
9
Multilayer Perceptron: Motivation II
Example Problem: Weather Dataset
[Figure sequence, built up over several slides: the raw measurements m1, m2, m3, m4, …, mF are fed to a network that predicts ‘Play?’:
(1) input layer x1 … xF (plus bias 1), a single unit f(x;θ), output layer;
(2) the same prediction via hand-crafted intermediate features Outlook, Temperature, Humidity, Windy;
(3) the same network with unknown intermediate features (‘???’) to be learned;
(4) input layer, 1 hidden layer of units f(x;θ), output layer;
(5) input layer, 2 hidden layers, output layer]
10
Another Example: Face Recognition
• the hidden layers learn increasingly high-level feature representations
• e.g., given an image, predict the person:
Source: https://devblogs.nvidia.com/accelerate-machine-learning-cudnn-deep-neural-network-library/
11
Multilayer Perceptron
Terminology
• input units x_j, one per feature j
• Multiple layers l = 1 … L of nodes. L is the depth of the network.
• Each layer l has a number of units K_l. K_l is the width of layer l.
• The width can vary from layer to layer
• output unit y
• Each layer l is fully connected to its neighboring layers l − 1 and l + 1
• one weight θ^(l)_ij for each connection ij (including the ‘bias’ θ_0)
• a non-linear activation function φ^(l) for each layer l
12
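To make the terminology concrete, here is a minimal sketch (not from the slides; NumPy, with made-up widths) of how depth, width, and the weights θ^(l)_ij translate into parameter shapes:

```python
import numpy as np

# A hypothetical network (made-up numbers): F = 4 input features,
# two hidden layers of width K1 = 3 and K2 = 2, and one output unit.
widths = [4, 3, 2, 1]

rng = np.random.default_rng(0)
thetas = []
for l in range(1, len(widths)):
    # theta^(l) connects layer l-1 to layer l; the extra row holds the bias weights theta_0.
    thetas.append(rng.normal(scale=0.1, size=(widths[l - 1] + 1, widths[l])))

for l, theta in enumerate(thetas, start=1):
    print(f"theta^({l}) shape: {theta.shape}")  # (K_(l-1) + 1, K_l)
```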
Prediction with a feedforward Network
Passing an input through a neural network with 2 hidden layers
h^(1)_i = φ^(1)( ∑_j θ^(1)_ij x_j )

h^(2)_i = φ^(2)( ∑_j θ^(2)_ij h^(1)_j )

y_i = φ^(3)( ∑_j θ^(3)_ij h^(2)_j )
Or in vectorized form
h^(1) = φ^(1)( θ^(1)T x )

h^(2) = φ^(2)( θ^(2)T h^(1) )

y = φ^(3)( θ^(3)T h^(2) )
where the activation functions φ^(l) are applied element-wise to all entries
13
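The vectorized equations above map almost directly onto code. A minimal sketch (not from the slides; NumPy, with hypothetical dimensions and randomly initialized weights) of the forward pass:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas, activations):
    """Pass input x through the layers defined by (theta, activation) pairs."""
    h = x
    for theta, phi in zip(thetas, activations):
        h = np.append(h, 1.0)        # append a constant 1 for the bias weight theta_0
        h = phi(theta.T @ h)         # h^(l) = phi^(l)(theta^(l)T h^(l-1))
    return h

# Hypothetical dimensions: 4 features, hidden widths 3 and 2, one output unit.
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(5, 3)),   # (4 inputs + bias) -> 3 hidden units
          rng.normal(size=(4, 2)),   # (3 hidden + bias) -> 2 hidden units
          rng.normal(size=(3, 1))]   # (2 hidden + bias) -> 1 output unit
activations = [np.tanh, np.tanh, sigmoid]

x = np.array([0.5, -1.2, 3.0, 0.0])
print(forward(x, thetas, activations))   # a single value in (0, 1)
```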
Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?
x1 x2 y
1 1 1
1 0 0
0 1 0
0 0 0
14
Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 1
14
Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?
x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0
14
A Multilayer Perceptron for XOR
φ(x) = 1 if x ≥ 0, 0 if x < 0
and recall: h^(l)_i = φ^(l)( ∑_j θ^(l)_ij h^(l−1)_j + b^(l)_i )
Source: https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L05%20Multilayer%20Perceptrons.pdf
15
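As an illustrative sketch (following the construction in the Grosse notes linked above, with one standard choice of weights), here is the XOR network with the step activation φ, written out in NumPy:

```python
import numpy as np

def step(z):
    # phi(x) = 1 if x >= 0, 0 otherwise (applied element-wise)
    return (np.asarray(z) >= 0).astype(int)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h1 fires for OR(x1, x2), h2 fires for AND(x1, x2)
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output: fires when OR is true but AND is not, i.e. XOR
    w2 = np.array([1.0, -1.0])
    b2 = -0.5
    return int(step(w2 @ h + b2))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_mlp(a, b))   # prints 0, 1, 1, 0
```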
Designing Neural Networks I: Inputs
Inputs and feature functions
• x could be a patient with features {blood pressure, height, age, weight,
...}
• x could be a text, i.e., a sequence of words
• x could be an image, i.e., a matrix of pixels
Non-numerical features need to be mapped to numerical
• For language, it is typical to map words to pre-trained embedding vectors
• for embeddings: dim(x) = k, the dimensionality of the embedding vectors
• Alternative: 1-hot encoding
• for 1-hot: dim(x) = V (the number of words in the vocabulary)
• For pixels, map to RGB, or other visual features
16
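A small sketch of the two encodings mentioned above, assuming a toy vocabulary and a made-up embedding matrix E:

```python
import numpy as np

vocab = {"sunny": 0, "rainy": 1, "overcast": 2, "windy": 3}   # toy vocabulary, V = 4
V, k = len(vocab), 2                                          # k = embedding dimension

def one_hot(word):
    x = np.zeros(V)            # dim(x) = V
    x[vocab[word]] = 1.0
    return x

# A made-up "pre-trained" embedding matrix, one k-dimensional row per word.
E = np.array([[0.2, -0.1],
              [-0.4, 0.3],
              [0.1, 0.1],
              [0.5, -0.6]])

def embed(word):
    return E[vocab[word]]      # dim(x) = k

print(one_hot("rainy"))        # [0. 1. 0. 0.]
print(embed("rainy"))          # [-0.4  0.3]
```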
Designing Neural Networks II: Activation Functions
• Each layer has an associated activation function (e.g., sigmoid, ReLU, ...)
• Represents the extent to which a neuron is ‘activated’ given an input
• Each hidden layer performs a non-linear transformation of the input
• Popular choices include
1. logistic (aka sigmoid) (“σ”):
   f(x) = 1 / (1 + e^(−x))
2. hyperbolic tan (“tanh”):
   f(x) = (e^(2x) − 1) / (e^(2x) + 1)
3. rectified linear unit (“ReLU”):
   f(x) = max(0, x)
   note: not differentiable at x = 0
17
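The three activations are one-liners in code; a minimal NumPy sketch (illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                     # 1 / (1 + e^-x)

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)    # equivalent to np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)                           # max(0, x), element-wise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```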
Designing Neural Networks II: Activation Functions
[Figure: function values and derivatives of the sigmoid, tanh and ReLU activations]
18
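The derivatives shown in the figure have simple closed forms (σ′ = σ(1 − σ), tanh′ = 1 − tanh², ReLU′ = 1 for x > 0 and 0 for x < 0); a small illustrative sketch:

```python
import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)              # sigma'(x) = sigma(x) * (1 - sigma(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2      # tanh'(x) = 1 - tanh(x)^2

def d_relu(x):
    return (x > 0).astype(float)      # 1 for x > 0, else 0 (undefined at 0; 0 by convention here)

z = np.array([-2.0, 0.0, 2.0])
print(d_sigmoid(z), d_tanh(z), d_relu(z))
```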
Designing Neural Networks III: Structure
Network Structure
• Sequence of hidden layers l_1, …, l_L for a network of depth L
• Each layer l has Kl parallel neurons (breadth)
• Many layers (depth) vs. many neurons per layer (breadth)? Empirical
question, theoretically poorly understood.
Advanced architectures exploit structure in the data, including
• convolutions (convolutional neural networks; CNNs), common in Computer Vision
• recurrence (recurrent neural networks; RNNs), common in Natural Language Processing
• attention (an efficient alternative to recurrence)
• . . .
Beyond the scope of this class.
19
Designing Neural Networks IV: Output Function
Neural networks can learn different concepts: classification, regression, ...
The output function depends on the concept of interest.
• Binary classification:
• one neuron, with step function (as in the perceptron)
• Multiclass classification:
• typically softmax to normalize K outputs from the pre-final layer into
a probability distribution over classes
p(y_i = j | x_i; θ) = exp(z_j) / ∑_{k=1}^{K} exp(z_k)
• Regression:
• identity function
• possibly other continuous functions such as sigmoid or tanh
20
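A minimal sketch of the softmax output function above (illustrative; the max-subtraction is a standard trick for numerical stability and does not change the result):

```python
import numpy as np

def softmax(z):
    """Normalize a vector of K pre-final-layer outputs into a probability distribution."""
    z = z - np.max(z)                # numerical stability; leaves the result unchanged
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])        # hypothetical outputs z_1 ... z_K
p = softmax(z)
print(p, p.sum())                    # probabilities over the K classes, summing to 1
```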
Designing Neural Networks V: Loss Functions
Classification Loss: typically negative conditional log-likelihood (cross-entropy)

L_i = − log p(y^(i) | x^(i); θ)          for a single instance i
L = − ∑_i log p(y^(i) | x^(i); θ)        for all instances

• Binary classification loss
  ŷ^(i)_1 = p(y^(i) = 1 | x^(i); θ)
  L = ∑_i −[ y^(i) log(ŷ^(i)_1) + (1 − y^(i)) log(1 − ŷ^(i)_1) ]

• Multiclass classification
  ŷ^(i)_j = p(y^(i) = j | x^(i); θ)
  L = − ∑_i ∑_j y^(i)_j log(ŷ^(i)_j)
  for j possible labels; y^(i)_j = 1 if j is the true label for instance i, else 0.
21
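A small sketch of the two classification losses above (illustrative only, assuming predicted probabilities are already available, e.g. from a sigmoid or softmax output):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """y: true labels in {0, 1}; y_hat: predicted p(y = 1 | x) per instance."""
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_cross_entropy(Y, Y_hat):
    """Y: one-hot true labels (N x K); Y_hat: predicted class probabilities (N x K)."""
    return -np.sum(Y * np.log(Y_hat))

y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, y_hat))

Y = np.array([[1, 0, 0], [0, 0, 1]])
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(multiclass_cross_entropy(Y, Y_hat))
```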
Designing Neural Networks V: Loss Functions
Regression Loss: typically mean-squared error (MSE)
• Here, the output, as well as the target are real-valued numbers
L = (1/N) ∑_{i=1}^{N} (y^(i) − ŷ^(i))^2
21
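And a corresponding sketch of the MSE loss (illustrative only):

```python
import numpy as np

def mse(y, y_hat):
    # L = (1/N) * sum_i (y^(i) - y_hat^(i))^2
    return np.mean((y - y_hat) ** 2)

y = np.array([3.0, -0.5, 2.0])
y_hat = np.array([2.5, 0.0, 2.0])
print(mse(y, y_hat))   # 0.1666...
```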
Representational Power of Neural Nets
• The universal approximation theorem states that a feed-forward neural network with a single hidden layer (and a finite number of neurons) can approximate any continuous function on R^n
• Note that the activation functions must be non-linear, as without this,
the model is simply a (complex) linear model
22
How to Train a NN with Hidden Layers
• Unfortunately, the perceptron algorithm can’t be used to train neural nets with hidden layers, as we can’t directly observe the target values (‘labels’) of the hidden units
• Instead, we train neural nets with backpropagation. Intuitively:
• compute the errors at the output layer w.r.t. each weight using partial differentiation
• propagate those errors back through the hidden layers towards the input
• Essentially just gradient descent, but using the chain rule to make the
calculations more efficient
Next lecture: Backpropagation for training neural networks
23
Reflections
When is Linear Classification Enough?
• If we know our classes are linearly (approximately) separable
• If the feature space is (very) high-dimensional
...i.e., the number of features exceeds the number of training instances
• If the training set is small
• If interpretability is important, i.e., understanding how (combinations of)
features explain different predictions
24
Pros and Cons of Neural Networks
Pros
• Powerful tool!
• Neural networks with at least 1 hidden layer can approximate any
(continuous) function. They are universal approximators
• Automatic feature learning
• Empirically, very good performance for many diverse tasks
Cons
• Powerful model increases the danger of ‘overfitting’
• Requires large training data sets
• Often requires powerful compute resources (GPUs)
• Lack of interpretability
25
Summary
Today
• From perceptrons to neural networks
• multilayer perceptron
• some examples
• features and limitations
Next Lecture
• Learning parameters of neural networks
• The Backpropagation algorithm
26
References
Jacob Eisenstein (2019). Natural Language Processing. MIT Press.
Chapters 3 (intro), 3.1, 3.2. https://github.com/jacobeisenstein/
gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf
Dan Jurafsky and James H. Martin. Speech and Language Processing.
Chapter 7.2, 7.3. Online Draft V3.0.
https://web.stanford.edu/~jurafsky/slp3/
27
Multilayer Perceptron: Motivation II
Another Example Problem: Sentiment analysis of movie reviews
[Figure sequence, built up over several slides: binary input features such as Contains ‘good’, Contains ‘awful’, Length > 3, Contains ‘actor’, Contains ‘story’, Contains ‘plot’, …, Contains ‘ending’ are fed to a network that predicts ‘Good?’:
(1) input layer x1 … xF (plus bias 1), a single unit f(x;θ;b), output layer;
(2) the same prediction via hand-crafted intermediate concepts word sentiment, reasonably long, reviews actors, summarizes plot;
(3) the same network with unknown intermediate features (‘???’) to be learned;
(4) input layer, 1 hidden layer of units f(x;θ;b), output layer;
(5) input layer, 2 hidden layers, output layer]
28
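A small sketch (with hypothetical feature functions, not taken from the slides) of how such binary review features could be computed before being fed to the input layer:

```python
def review_features(review: str) -> list:
    """Map a raw review to binary input features like the ones above (illustrative only)."""
    tokens = review.lower().split()
    return [
        int("good" in tokens),      # Contains 'good'
        int("awful" in tokens),     # Contains 'awful'
        int(len(tokens) > 3),       # Length > 3
        int("actor" in tokens),     # Contains 'actor'
        int("story" in tokens),     # Contains 'story'
        int("plot" in tokens),      # Contains 'plot'
        int("ending" in tokens),    # Contains 'ending'
    ]

print(review_features("A good story with a satisfying ending"))  # [1, 0, 1, 0, 1, 0, 1]
```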