
Lecture 9: Neural Networks

COMP90049
Introduction to Machine Learning
Semester 1, 2020

Lea Frermann, CIS

1

Roadmap

So far … Classification and Evaluation

• Naive Bayes, Logistic Regression, Perceptron

• Probabilistic models

• Loss functions, and estimation

• Evaluation

Today… Neural Networks

• Multilayer Perceptron

• Motivation and architecture

• Linear vs. non-linear classifiers

2


Introduction

Classifier Recap

Perceptron

ŷ = f(θ · x) = 1 if θ · x ≥ 0, −1 otherwise

• Single processing ‘unit’
• Inspired by neurons in the brain
• Activation: step-function (discrete, non-differentiable)

Logistic Regression

P(y = 1|x; θ) = 1 / (1 + exp(−∑_{f=0}^{F} θ_f x_f))

• View 1: Model of P(y = 1|x), maximizing the data log likelihood
• View 2: Single processing ‘unit’
• Activation: sigmoid (continuous, differentiable)

3
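To make the two single-unit views above concrete, here is a minimal Python/NumPy sketch (not part of the slides; the weights θ and the instance x are made up for illustration, with a leading bias feature x_0 = 1). It shows that both classifiers share the computation θ · x and differ only in the activation.

```python
import numpy as np

def perceptron_predict(theta, x):
    """Step activation: returns +1 if theta . x >= 0, else -1."""
    return 1 if np.dot(theta, x) >= 0 else -1

def logistic_predict(theta, x):
    """Sigmoid activation: returns P(y = 1 | x; theta)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

# Hypothetical weights and a single instance; x[0] = 1 is the bias feature.
theta = np.array([-0.5, 1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])

print(perceptron_predict(theta, x))   # a hard decision: +1 or -1
print(logistic_predict(theta, x))     # a probability in (0, 1)
```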



Neural Networks and Deep Learning

Neural Networks

• Connected sets of many such units
• Units must have continuous activation functions
• Connected into many layers → Deep Learning

Multi-layer Perceptron

• This lecture!
• One specific type of neural network
• Feed-forward
• Fully connected
• Supervised learner

Other types of neural networks

• Convolutional neural networks
• Recurrent neural networks
• Autoencoder (unsupervised)

4

Perceptron Unit (recap)

A single processing unit

[Figure: a single unit. The inputs x_1, x_2, …, x_F and a constant bias input 1 are weighted by θ_1, θ_2, …, θ_F and θ_0, summed, and passed through the activation function f to produce the output y = f(θ^T x).]

A neural network is a combination of lots of these units.

5

Multi-layer Perceptron (schematic)

Three Types of layers

• Input layer with input units x: the first layer, takes features x as inputs
• Output layer with output units y: the last layer, has one unit per possible output (e.g., 1 unit for binary classification)
• Hidden layers with hidden units h: all layers in between.

[Figure: schematic of a fully connected feed-forward network, shown in three stages: (1) an input layer x_1 … x_F (plus a constant bias input 1) connected directly to a single output unit f(x;θ) producing y; (2) the same with one hidden layer of units f(x;θ) between the input and output layers; (3) the same with two hidden layers.]

6

Why the Hype? I

Linear classification

• The perceptron, naive Bayes, and logistic regression are linear classifiers

• Decision boundary is a linear combination of the features, ∑_i θ_i x_i

• Cannot learn ‘feature interactions’ naturally

• Perceptron can solve only linearly separable problems

Non-linear classification

• Neural networks with at least 1 hidden layer and non-linear activations
are non-linear classifiers

• Decision boundary is a non-linear function of the inputs

• Capture ‘feature interactions’

7

Why the Hype? II

Feature Engineering

• (more next week!)
• The perceptron, naive Bayes and logistic regression require a fixed set of informative features

• e.g., outlook ∈ {overcast, sunny, rainy}, wind ∈ {high, low}, etc.
• This requires domain knowledge

Feature learning

• Neural networks take as input ‘raw’ data
• They learn features themselves as intermediate representations
• They learn features as part of their target task (e.g., classification)
• ‘Representation learning’: learning representations (or features) of the

data that are useful for the target task

• Note: often feature engineering is replaced at the cost of additional hyperparameter tuning (number of layers, activations, learning rates, …)

8

Multilayer Perceptron: Motivation I

Example Classification dataset

Outlook Temperature Humidity Windy True Label

sunny hot high FALSE no
sunny hot high TRUE no

overcast hot high FALSE yes
rainy mild high FALSE yes
. . .

But what we really observe is raw data:

Date measurements True Label

01/03/1966 0.4 4.7 1.5 12.7 … no
01/04/1966 3.4 -0.7 3.8 18.7 … no
01/05/1966 0.3 8.7 136.9 17 … yes
01/06/1966 5.5 5.7 65.5 2.7 … yes

9

Multilayer Perceptron: Motivation II

Example Problem: Weather Dataset

[Figure sequence: we want to predict Play? from the raw measurements m_1, m_2, m_3, m_4, …, m_F.
(1) Input layer, 1 unit, output layer: the measurements x_1 … x_F (plus a bias input 1) feed directly into a single unit f(x;θ) — a perceptron.
(2) Conceptually, useful intermediate features such as Outlook, Temperature, Humidity and Windy sit between the raw measurements and the prediction — but we do not know them in advance (???).
(3) Input layer, 1 hidden layer, output layer: a hidden layer of units f(x;θ) can learn such intermediate features.
(4) Input layer, 2 hidden layers, output layer: deeper networks learn increasingly abstract intermediate features.]

10

Another Example: Face Recognition

• the hidden layers learn increasingly high-level feature representations

• e.g., given an image, predict the person:

Source: https://devblogs.nvidia.com/accelerate-machine-learning-cudnn-deep-neural-network-library/

11

Multilayer Perceptron

Terminology

• input units x_j, one per feature j

• Multiple layers l = 1 … L of nodes. L is the depth of the network.

• Each layer l has a number of units K_l. K_l is the width of layer l.

• The width can vary from layer to layer

• output unit y

• Each layer l is fully connected to its neighboring layers l − 1 and l + 1

• one weight θ_{ij}^{(l)} for each connection ij (including the 'bias' weight θ_0)

• non-linear activation function φ^{(l)} for each layer l

12
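As an illustration of this terminology (not from the slides), the sketch below allocates one weight matrix θ^(l) per layer of a small fully connected network, including a bias weight per unit; the depth, layer widths and random initialisation are assumptions made for the example.

```python
import numpy as np

F = 5                      # number of input features (assumed)
widths = [F, 3, 4, 1]      # layer widths: input, hidden layer 1, hidden layer 2, output
L = len(widths) - 1        # depth of the network (number of weighted layers)

rng = np.random.default_rng(0)

# theta[l] has shape (widths[l] + 1, widths[l + 1]): one row per incoming unit,
# plus one extra row for the bias weight theta_0 of each unit in layer l + 1.
theta = [rng.normal(scale=0.1, size=(widths[l] + 1, widths[l + 1])) for l in range(L)]

for l, W in enumerate(theta, start=1):
    print(f"layer {l}: weight matrix shape {W.shape}")
```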


Prediction with a Feedforward Network

Passing an input through a neural network with 2 hidden layers

h_i^{(1)} = φ^{(1)} ( ∑_j θ_{ij}^{(1)} x_j )

h_i^{(2)} = φ^{(2)} ( ∑_j θ_{ij}^{(2)} h_j^{(1)} )

y_i = φ^{(3)} ( ∑_j θ_{ij}^{(3)} h_j^{(2)} )

Or in vectorized form

h^{(1)} = φ^{(1)} ( θ^{(1)T} x )

h^{(2)} = φ^{(2)} ( θ^{(2)T} h^{(1)} )

y = φ^{(3)} ( θ^{(3)T} h^{(2)} )

where the activation functions φ^{(l)} are applied element-wise to all entries

13
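Here is a minimal Python/NumPy sketch of the vectorized forward pass (not the lecture's code): the layer widths, the random weights and the choice of sigmoid for every φ^(l) are assumptions, and biases are handled by appending a constant 1 to each layer's input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas, activations):
    """Pass input x through the layers: h = phi(theta^T h_prev) at each layer."""
    h = x
    for theta, phi in zip(thetas, activations):
        h = np.append(h, 1.0)          # constant bias input for this layer
        h = phi(theta.T @ h)           # theta^(l)T h^(l-1), then element-wise activation
    return h

rng = np.random.default_rng(0)
F, K1, K2 = 4, 3, 2                    # hypothetical layer widths
thetas = [rng.normal(size=(F + 1, K1)),
          rng.normal(size=(K1 + 1, K2)),
          rng.normal(size=(K2 + 1, 1))]
activations = [sigmoid, sigmoid, sigmoid]

x = rng.normal(size=F)
print(forward(x, thetas, activations))  # network output y
```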

Boolean Functions

1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?

x1 x2 y
1 1 1
1 0 0
0 1 0
0 0 0

14

Boolean Functions

1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?

x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 1

14

Boolean Functions

1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?

x1 x2 y
1 1 0
1 0 1
0 1 1
0 0 0

14

A Multilayer Perceptron for XOR

φ(x) = 1 if x ≥ 0, 0 if x < 0

and recall: h_i^{(l)} = φ^{(l)} ( ∑_j θ_{ij}^{(l)} h_j^{(l−1)} + b_i^{(l)} )

[Figure: a multilayer perceptron with hand-chosen weights that computes XOR; see source.]

Source: https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L05%20Multilayer%20Perceptrons.pdf

15

Designing Neural Networks I: Inputs

Inputs and feature functions

• x could be a patient with features {blood pressure, height, age, weight, ...}
• x could be a text, i.e., a sequence of words
• x could be an image, i.e., a matrix of pixels

Non-numerical features need to be mapped to numerical values

• For language, it is typical to map words to pre-trained embedding vectors
• for 1-hot: dim(x) = V (the number of words in the vocabulary)
• for embeddings: dim(x) = k, the dimensionality of the embedding vectors
• Alternative: 1-hot encoding
• For pixels, map to RGB values, or other visual features

16

Designing Neural Networks II: Activation Functions

• Each layer has an associated activation function (e.g., sigmoid, ReLU, ...)
• It represents the extent to which a neuron is 'activated' given an input
• Each hidden layer performs a non-linear transformation of the input
• Popular choices include

1. logistic (aka sigmoid) ("σ"): f(x) = 1 / (1 + e^{−x})
2. hyperbolic tan ("tanh"): f(x) = (e^{2x} − 1) / (e^{2x} + 1)
3. rectified linear unit ("ReLU"): f(x) = max(0, x)
   note: not differentiable at x = 0

17

Designing Neural Networks II: Activation Functions

[Figure: plots of the function values and derivatives of the sigmoid, tanh and ReLU activations.]

18
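The following Python/NumPy sketch (not from the slides) evaluates the three activation functions and their derivatives element-wise; treating the ReLU derivative at x = 0 as 0 is a convention assumed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return (np.exp(2 * x) - 1.0) / (np.exp(2 * x) + 1.0)   # same as np.tanh(x)

def d_tanh(x):
    return 1.0 - tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    # ReLU is not differentiable at x = 0; using 0 there is a common convention (assumed here).
    return (x > 0).astype(float)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f, df in [("sigmoid", sigmoid, d_sigmoid), ("tanh", tanh, d_tanh), ("ReLU", relu, d_relu)]:
    print(name, f(xs), df(xs))
```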
Designing Neural Networks III: Structure

Network Structure

• Sequence of hidden layers l_1, . . . , l_L for a network of depth L
• Each layer l has K_l parallel neurons (breadth)
• Many layers (depth) vs. many neurons per layer (breadth)? Empirical question, theoretically poorly understood.

Advanced tricks include architectures that exploit structure in the data

• convolutions (convolutional neural networks; CNN), Computer Vision
• recurrencies (recurrent neural networks; RNN), Natural Language Processing
• attention (efficient alternative to recurrencies)
• . . .

Beyond the scope of this class.

19

Designing Neural Networks IV: Output Function

Neural networks can learn different concepts: classification, regression, ...
The output function depends on the concept of interest.

• Binary classification:
  • one neuron, with step function (as in the perceptron)
• Multiclass classification:
  • typically softmax to normalize the K outputs of the pre-final layer into a probability distribution over classes

p(y_i = j | x_i; θ) = exp(z_j) / ∑_{k=1}^{K} exp(z_k)

• Regression:
  • identity function
  • possibly other continuous functions such as sigmoid or tanh

20

Designing Neural Networks V: Loss Functions

Classification Loss: typically negative conditional log-likelihood (cross-entropy)

L_i = − log p(y^{(i)} | x^{(i)}; θ)   for a single instance i

L = − ∑_i log p(y^{(i)} | x^{(i)}; θ)   for all instances

• Binary classification loss, with ŷ_1^{(i)} = p(y^{(i)} = 1 | x^{(i)}; θ):

L = ∑_i −[ y^{(i)} log(ŷ_1^{(i)}) + (1 − y^{(i)}) log(1 − ŷ_1^{(i)}) ]

• Multiclass classification, with ŷ_j^{(i)} = p(y^{(i)} = j | x^{(i)}; θ):

L = − ∑_i ∑_j y_j^{(i)} log(ŷ_j^{(i)})

for j possible labels; y_j^{(i)} = 1 if j is the true label for instance i, else 0.

21

Designing Neural Networks V: Loss Functions

Regression Loss: typically mean-squared error (MSE)

• Here, the output as well as the target are real-valued numbers

L = (1/N) ∑_{i=1}^{N} (y^{(i)} − ŷ^{(i)})^2

21
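A minimal Python/NumPy sketch of these losses (not the lecture's code), evaluated on made-up predictions and labels; it assumes the multiclass true labels are one-hot encoded and that the predicted probabilities lie strictly between 0 and 1 so the logs are finite.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """L = sum_i -[ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]"""
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_cross_entropy(Y, Y_hat):
    """L = - sum_i sum_j Y_ij log(Y_hat_ij), with Y the one-hot true labels."""
    return -np.sum(Y * np.log(Y_hat))

def mean_squared_error(y, y_hat):
    """L = (1/N) sum_i (y_i - y_hat_i)^2"""
    return np.mean((y - y_hat) ** 2)

# Hypothetical values for illustration.
y_bin = np.array([1, 0, 1])
y_hat_bin = np.array([0.9, 0.2, 0.6])                 # predicted P(y = 1 | x)
print(binary_cross_entropy(y_bin, y_hat_bin))

Y = np.array([[1, 0, 0], [0, 0, 1]])                  # one-hot true labels, 3 classes
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])  # predicted class distributions
print(multiclass_cross_entropy(Y, Y_hat))

y_reg = np.array([1.5, -0.3, 2.0])
y_hat_reg = np.array([1.2, 0.0, 2.5])
print(mean_squared_error(y_reg, y_hat_reg))
```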
Representational Power of Neural Nets

• The universal approximation theorem states that a feed-forward neural network with a single hidden layer (and finitely many neurons) is able to approximate any continuous function on R^n

• Note that the activation functions must be non-linear; without this, the model is simply a (complex) linear model

22

How to Train a NN with Hidden Layers

• Unfortunately, the perceptron algorithm can't be used to train neural nets with hidden layers, as we can't directly observe the correct outputs of the hidden units

• Instead, train neural nets with back propagation. Intuitively:
  • compute errors at the output layer w.r.t. each weight using partial differentiation
  • propagate those errors back towards the input layer

• Essentially just gradient descent, but using the chain rule to make the calculations more efficient

Next lecture: Backpropagation for training neural networks

23

Reflections

When is Linear Classification Enough?

• If we know our classes are (approximately) linearly separable

• If the feature space is (very) high-dimensional, i.e., the number of features exceeds the number of training instances

• If the training set is small

• If interpretability is important, i.e., understanding how (combinations of) features explain different predictions

24

Pros and Cons of Neural Networks

Pros

• Powerful tool!
• Neural networks with at least 1 hidden layer can approximate any (continuous) function; they are universal approximators
• Automatic feature learning
• Empirically, very good performance for many diverse tasks

Cons

• Powerful model increases the danger of 'overfitting'
• Requires large training data sets
• Often requires powerful compute resources (GPUs)
• Lack of interpretability

25

Summary

Today

• From perceptrons to neural networks
• multilayer perceptron
• some examples
• features and limitations

Next Lecture

• Learning parameters of neural networks
• The Backpropagation algorithm

26

References

Jacob Eisenstein (2019). Natural Language Processing. MIT Press. Chapters 3 (intro), 3.1, 3.2. https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf

Dan Jurafsky and James H. Martin. Speech and Language Processing. Chapters 7.2, 7.3. Online Draft V3.0. https://web.stanford.edu/~jurafsky/slp3/

27

Multilayer Perceptron: Motivation II

Another Example Problem: Sentiment analysis of movie reviews

[Figure sequence: predict Good? from binary input features of a review, such as Contains 'good', Contains 'awful', Length > 3, Contains 'actor', Contains 'story', Contains 'plot', Contains 'ending', ….
(1) Input layer, 1 unit, output layer: the features x_1 … x_F (plus a bias input 1) feed directly into a single unit f(x;θ;b).
(2) Conceptually, useful intermediate features — word sentiment, reasonably long, reviews actors, summarizes plot — sit between the inputs and the prediction, but we do not know them in advance (???).
(3) Input layer, 1 hidden layer, output layer: a hidden layer of units f(x;θ;b) can learn such intermediate features.
(4) Input layer, 2 hidden layers, output layer: two hidden layers of units f(x;θ;b) learn increasingly abstract features before the output unit predicts Good?.]

28
