
Lecture 9: Neural Networks
COMP90049
Introduction to Machine Learning
Semester 1, 2020
Lea Frermann, CIS
1

Roadmap
So far … Classification and Evaluation
• Naive Bayes, Logistic Regression, Perceptron
• Probabilistic models
• Loss functions, and estimation
• Evaluation
Today… Neural Networks
• Multilayer Perceptron
• Motivation and architecture
• Linear vs. non-linear classifiers
2

Introduction

Classifier Recap
Perceptron

$$\hat{y} = f(\theta \cdot x) = \begin{cases} 1 & \text{if } \theta \cdot x \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
• Single processing ‘unit’
• Inspired by neurons in the brain
• Activation: step-function (discrete, non-differentiable)
Logistic Regression
$$P(y = 1 \mid x; \theta) = \frac{1}{1 + \exp\left(-\sum_{f=0}^{F} \theta_f x_f\right)}$$
• View 1: Model of P(y = 1|x), maximizing the data log likelihood
• View 2: Single processing ‘unit’
• Activation: sigmoid (continuous, differentiable)
3
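As a side-by-side illustration of the two units just recapped, here is a minimal numpy sketch (not part of the slides); the weights and inputs are made-up values.

```python
import numpy as np

# Toy parameters: theta_0 is the bias weight, x_0 = 1 is the bias input.
theta = np.array([-1.0, 2.0, 0.5])   # [theta_0, theta_1, theta_2]
x = np.array([1.0, 0.3, 0.8])        # [1, x_1, x_2]

z = theta @ x                         # theta . x

# Perceptron: discrete step activation, outputs +1 or -1
y_perceptron = 1 if z >= 0 else -1

# Logistic regression: sigmoid activation, outputs P(y = 1 | x)
p_logistic = 1.0 / (1.0 + np.exp(-z))

print(y_perceptron, p_logistic)
```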

Neural Networks and Deep Learning
Neural Networks
• Connected sets of many such units
• Units must have continuous activation functions
• Connected into many layers → Deep Learning
Multi-layer Perceptron
• This lecture!
• One specific type of neural network
• Feed-forward
• Fully connected
• Supervised learner
Other types of neural networks
• Convolutional neural networks
• Recurrent neural networks
• Autoencoder (unsupervised)
4

Perceptron Unit (recap)
A single processing unit
[Diagram: a single processing unit. The inputs 1, x1, x2, …, xF enter with weights θ0, θ1, θ2, …, θF; the unit computes the activation f(x;θ) and produces the output y = f(θᵀx). Labels: input, weights, activation, output.]
A neural network is a combination of lots of these units.
5

Multi-layer Perceptron (schematic)
Three Types of layers
• Input layer with input units x: the first layer, takes features x as inputs
• Output layer with output units y: the last layer, has one unit per possible output (e.g., 1 unit for binary classification)
• Hidden layers with hidden units h: all layers in between.
[Schematic, built up over three slides: (1) input units 1, x1, x2, …, xF feed a single output unit f(x;θ) that produces y; (2) the same inputs feed one hidden layer of units f(x;θ), which feeds the output unit; (3) two hidden layers sit between the input layer and the output unit.]
6
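For a concrete, hedged illustration of a feed-forward, fully connected network like the one sketched above, the snippet below uses scikit-learn's MLPClassifier on synthetic data; MLPClassifier and make_classification are scikit-learn utilities, and the data, layer widths, and settings are illustrative choices, not part of the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Toy binary classification data: 200 instances, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Feed-forward, fully connected network with two hidden layers
# (widths 8 and 4); the output layer is added automatically from the labels.
clf = MLPClassifier(hidden_layer_sizes=(8, 4), activation="relu",
                    max_iter=2000, random_state=0)
clf.fit(X, y)

print(clf.predict(X[:5]))   # predicted labels for the first 5 instances
print(clf.score(X, y))      # training accuracy
```

Here hidden_layer_sizes controls both the number of hidden layers and their widths.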

Why the Hype? I
Linear classification
• The perceptron, naive Bayes and logistic regression are linear classifiers
• Decision boundary is a linear combination of features $\sum_i \theta_i x_i$
• Cannot learn ‘feature interactions’ naturally
• Perceptron can solve only linearly separable problems
Non-linear classification
• Neural networks with at least 1 hidden layer and non-linear activations are non-linear classifiers
• Decision boundary is a non-linear function of the inputs
• Capture ‘feature interactions’
7

Why the Hype? II
Feature Engineering
• (more next week!)
• The perceptron, naive Bayes and logistic regression require a fixed set of informative features
• e.g., outlook ∈ {overcast, sunny, rainy}, wind ∈ {high, low}, etc.
• Requires domain knowledge
Feature learning
• Neural networks take as input ‘raw’ data
• They learn features themselves as intermediate representations
• They learn features as part of their target task (e.g., classification)
• ‘Representation learning’: learning representations (or features) of the data that are useful for the target task
• Note: often feature engineering is replaced at the cost of additional parameter tuning (layers, activations, learning rates, …)
8

Multilayer Perceptron: Motivation I
Example Classification dataset

Outlook   Temperature   Humidity   Windy   True Label
sunny     hot           high       FALSE   no
sunny     hot           high       TRUE    no
overcast  hot           high       FALSE   yes
rainy     mild          high       FALSE   yes

We really observe raw data: numeric measurements per date

Date         measurements                      True Label
01/03/1966   0.4    4.7    1.5    12.7   …     no
01/04/1966   3.4   −0.7    3.8    18.7   …     no
01/05/1966   0.3    8.7  136.9    17     …     yes
01/06/1966   5.5    5.7   65.5     2.7   …     yes
9

Multilayer Perceptron: Motivation II
Example Problem: Weather Dataset
[Schematic, built up over several slides: the raw measurements m1, m2, m3, m4, …, mF are the inputs and ‘Play?’ is the prediction target.
(1) The measurements feed a single unit f(x;θ) directly: input layer, 1 unit, output layer.
(2) The measurements are first mapped to hand-designed intermediate features (Outlook, Temperature, Humidity, Windy), which then predict ‘Play?’.
(3) The intermediate features are unknown (???): rather than designing them, we want to learn them.
(4) The inputs x1, x2, x3, x4, …, xF feed one hidden layer of units f(x;θ), which feeds the output: input layer, 1 hidden layer, output layer.
(5) The same with two hidden layers: input layer, 2 hidden layers, output layer.]
10

Another Example: Face Recognition
• The hidden layers learn increasingly high-level feature representations
• e.g., given an image, predict the person:
Source: https://devblogs.nvidia.com/accelerate-machine-learning-cudnn-deep-neural-network-library/
11

Multilayer Perceptron
Terminology
• input units $x_j$, one per feature j
• Multiple layers $l = 1 \ldots L$ of nodes. L is the depth of the network.
• Each layer l has a number of units $K_l$. $K_l$ is the width of layer l.
• The width can vary from layer to layer
• output unit y
• Each layer l is fully connected to its neighboring layers l − 1 and l + 1
• one weight $\theta_{ij}^{(l)}$ for each connection, including a ‘bias’ weight $\theta_0$
• a non-linear activation function $\phi^{(l)}$ for each layer l
12

Prediction with a feedforward Network
Passing an input through a neural network with 2 hidden layers
$$h_i^{(1)} = \phi^{(1)}\Big(\sum_j \theta_{ij}^{(1)} x_j\Big) \qquad h_i^{(2)} = \phi^{(2)}\Big(\sum_j \theta_{ij}^{(2)} h_j^{(1)}\Big) \qquad y_i = \phi^{(3)}\Big(\sum_j \theta_{ij}^{(3)} h_j^{(2)}\Big)$$
Or in vectorized form
$$h^{(1)} = \phi^{(1)}\big(\theta^{(1)T} x\big) \qquad h^{(2)} = \phi^{(2)}\big(\theta^{(2)T} h^{(1)}\big) \qquad y = \phi^{(3)}\big(\theta^{(3)T} h^{(2)}\big)$$
where the activation functions $\phi^{(l)}$ are applied element-wise to all entries
13
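A minimal numpy sketch of this forward pass (not from the slides), assuming made-up layer widths (depth 3, hidden widths K1 = 5 and K2 = 3), random weights, sigmoid activations at every layer, and no bias terms, matching the equations above:

```python
import numpy as np

rng = np.random.default_rng(0)

F, K1, K2 = 4, 5, 3                  # input features, widths of hidden layers 1 and 2
theta1 = rng.normal(size=(F, K1))    # weights theta^(1)
theta2 = rng.normal(size=(K1, K2))   # weights theta^(2)
theta3 = rng.normal(size=(K2, 1))    # weights theta^(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=F)               # one input instance

# Forward pass: h^(l) = phi^(l)(theta^(l)T h^(l-1)), applied element-wise
h1 = sigmoid(theta1.T @ x)           # hidden layer 1, shape (K1,)
h2 = sigmoid(theta2.T @ h1)          # hidden layer 2, shape (K2,)
y  = sigmoid(theta3.T @ h2)          # output, shape (1,)

print(y)
```

Omitting the bias keeps the sketch aligned with the equations above; in practice a separate bias vector per layer (or a constant input of 1) is common.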

Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?
x1  x2  y
 1   1  1
 1   0  0
 0   1  0
 0   0  0
14

Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?
x1  x2  y
 1   1  0
 1   0  1
 0   1  1
 0   0  1
14

Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?
x1  x2  y
 1   1  0
 1   0  1
 0   1  1
 0   0  0
14
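To check these answers empirically, here is a small numpy sketch (not part of the slides) that runs the perceptron learning rule on the three truth tables above, i.e. logical AND, NAND, and XOR; the epoch cap is an arbitrary choice.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Perceptron learning rule with labels in {-1, +1}. Returns (weights, converged)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])    # prepend bias input x_0 = 1
    theta = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            y_hat = 1 if theta @ xi >= 0 else -1
            if y_hat != yi:
                theta += yi * xi                 # update on mistakes only
                errors += 1
        if errors == 0:
            return theta, True                   # linearly separable: converged
    return theta, False                          # never separated the data

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
tables = {
    "AND":  np.array([1, -1, -1, -1]),
    "NAND": np.array([-1, 1, 1, 1]),
    "XOR":  np.array([-1, 1, 1, -1]),
}
for name, y in tables.items():
    _, converged = train_perceptron(X, y)
    print(name, "separable by a perceptron:", converged)
```

The perceptron converges only when a separating line exists, which is why the XOR table never reaches zero errors.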

A Multilayer Perceptron for XOR

φ(x) = 1 if x >= 0 and recall: h(l) = φ(l)􏰛 􏰀 θ(l)h(l−1) + b(l)􏰜
0 ifx<0 i j ij j j Source: https: //www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L05%20Multilayer%20Perceptrons.pdf 15 Designing Neural Networks I: Inputs Inputs and feature functions • x could be a patient with features {blood pressure, height, age, weight, ...} • x could be a texts, i.e., a sequence of words • x could be an image, i.e., a matrix of pixels Non-numerical features need to be mapped to numerical • For language, typical to map words to pre-trained embedding vectors • for 1-hot: dim(x) = V (words in the vocabulary) • for embedding: dim(x) = k, dimensionality of embedding vectors • Alternative: 1-hot encoding • For pixels, map to RGB, or other visual features 16 Designing Neural Networks II: Activation Functions • Each layer has an associated activation function (e.g., sigmode, RelU, ...) • Represents the extent to which a neuron is ‘activated’ given an input • Each hidden layer performs a non-linear transformation of the input • Popular choices include 17 Designing Neural Networks II: Activation Functions • Each layer has an associated activation function (e.g., sigmode, RelU, ...) • Represents the extent to which a neuron is ‘activated’ given an input • Each hidden layer performs a non-linear transformation of the input • Popular choices include 1. logistic (aka sigmoid) (“σ”): f(x) = 1 1+e−x 2. hyperbolic tan (“tanh”): f(x)= e2x −1 e2x +1 3. rectified linear unit (“ReLU”): f(x) = max(0,x) note not differentiable at x = 0 17 Designing Neural Networks II: Activation Functions function values: derivatives: 18 Designing Neural Networks III: Structure Network Structure • Sequence of hidden layers l1, . . . , lL for a netword of depth L • Each layer l has Kl parallel neurons (breadth) • Many layers (depth) vs. many neurons per layer (breadth)? Empirical question, theoretically poorly understood. Advanced tricks include allowing for exploiting data structure • convolutions (convolutional neural networks; CNN), Computer Vision • recurrencies (recurrent neural networks; RNN), Natural Language Processing • attention (efficient alternative to recurrencies) • ... Beyond the scope of this class. 19 Designing Neural Networks IV: Output Function Neural networks can learn different concepts: classification, regression, ... The output function depends on the concept of intereest. • Binary classification: • one neuron, with step function (as in the perceptron) • Multiclass classification: • typically softmax to normalize K outputs from the pre-final layer into a probability distribution over classes exp(zj ) p(yi = j|xi;θ) = 􏰀Kk=1 exp(zk) • Regression: • identity function • possibly other continuous functions such as sigmoid or tanh 20 Designing Neural Networks V: Loss Functions Classification Loss: typically negative conditional log-likelihood ( cross-entropy) Li =−logp(y(i)|x(i);θ) L = − 􏰁 log p(y(i)|x(i); θ) i • Binary classification loss yˆ(i) = p(y(i) = 1|x(i); θ) 1 forasingleinstancei for all instances L = 􏰁 −[y(i) log(yˆ(i)) + (1 − y(i)) log(1 − yˆ(i))] 11 i • Multiclass classification yˆ(i) = p(y(i) = j|x(i); θ) j for j possible labels; y(i) = 1 if j is the true label for instance i, else 0. 
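A minimal numpy sketch of such an XOR network (not taken from the reading): the weights below are one possible hand-chosen solution, with one hidden step unit acting as an OR detector, one as an AND detector, and the output unit firing when OR is on but AND is off.

```python
import numpy as np

def step(z):
    # Step activation: 1 if z >= 0 else 0 (element-wise)
    return (z >= 0).astype(float)

# One possible set of hand-chosen weights and biases for XOR.
# Column j of W1 holds the weights of hidden unit j (we use W1.T @ x).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])    # unit 1 fires if x1 + x2 >= 0.5 ("OR"),
                               # unit 2 fires if x1 + x2 >= 1.5 ("AND")
W2 = np.array([1.0, -1.0])     # output unit: OR minus AND
b2 = -0.5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2], dtype=float)
    h = step(W1.T @ x + b1)    # hidden layer
    y = step(W2 @ h + b2)      # output unit
    print(x1, x2, "->", int(y))   # prints the XOR truth table
```

The point is that the hidden layer builds a representation in which the two classes become linearly separable.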
Designing Neural Networks I: Inputs
Inputs and feature functions
• x could be a patient with features {blood pressure, height, age, weight, ...}
• x could be a text, i.e., a sequence of words
• x could be an image, i.e., a matrix of pixels
Non-numerical features need to be mapped to numerical ones
• For language, it is typical to map words to pre-trained embedding vectors
  • for 1-hot: dim(x) = V (words in the vocabulary)
  • for embedding: dim(x) = k, the dimensionality of the embedding vectors
• Alternative: 1-hot encoding
• For pixels, map to RGB, or other visual features
16

Designing Neural Networks II: Activation Functions
• Each layer has an associated activation function (e.g., sigmoid, ReLU, ...)
• Represents the extent to which a neuron is ‘activated’ given an input
• Each hidden layer performs a non-linear transformation of the input
• Popular choices include
  1. logistic (aka sigmoid) (“σ”): $f(x) = \frac{1}{1 + e^{-x}}$
  2. hyperbolic tan (“tanh”): $f(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$
  3. rectified linear unit (“ReLU”): $f(x) = \max(0, x)$; note: not differentiable at x = 0
17

Designing Neural Networks II: Activation Functions
[Figure: plots of the function values and of the derivatives of the activation functions above.]
18

Designing Neural Networks III: Structure
Network Structure
• Sequence of hidden layers $l_1, \ldots, l_L$ for a network of depth L
• Each layer l has $K_l$ parallel neurons (breadth)
• Many layers (depth) vs. many neurons per layer (breadth)? Empirical question, theoretically poorly understood.
Advanced tricks include allowing for exploiting data structure
• convolutions (convolutional neural networks; CNN), Computer Vision
• recurrences (recurrent neural networks; RNN), Natural Language Processing
• attention (an efficient alternative to recurrences)
• ...
Beyond the scope of this class.
19

Designing Neural Networks IV: Output Function
Neural networks can learn different concepts: classification, regression, ...
The output function depends on the concept of interest.
• Binary classification:
  • one neuron, with step function (as in the perceptron)
• Multiclass classification:
  • typically softmax, to normalize the K outputs z of the pre-final layer into a probability distribution over classes
    $$p(y_i = j \mid x_i; \theta) = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}$$
• Regression:
  • identity function
  • possibly other continuous functions such as sigmoid or tanh
20

Designing Neural Networks V: Loss Functions
Classification loss: typically the negative conditional log-likelihood (cross-entropy)
$$L_i = -\log p(y^{(i)} \mid x^{(i)}; \theta) \;\; \text{for a single instance } i \qquad L = -\sum_i \log p(y^{(i)} \mid x^{(i)}; \theta) \;\; \text{for all instances}$$
• Binary classification loss, with $\hat{y}_1^{(i)} = p(y^{(i)} = 1 \mid x^{(i)}; \theta)$:
$$L = \sum_i -\left[ y^{(i)} \log(\hat{y}_1^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}_1^{(i)}) \right]$$
• Multiclass classification, with $\hat{y}_j^{(i)} = p(y^{(i)} = j \mid x^{(i)}; \theta)$ for the j possible labels, and $y_j^{(i)} = 1$ if j is the true label for instance i, else 0:
$$L = -\sum_i \sum_j y_j^{(i)} \log(\hat{y}_j^{(i)})$$
21

Designing Neural Networks V: Loss Functions
Regression loss: typically the mean squared error (MSE)
• Here, the output as well as the target are real-valued numbers
$$L = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}^{(i)} \right)^2$$
21

Representational Power of Neural Nets
• The universal approximation theorem states that a feed-forward neural network with a single hidden layer (and finitely many neurons) is able to approximate any continuous function on $\mathbb{R}^n$
• Note that the activation functions must be non-linear, as without this, the model is simply a (complex) linear model
22

How to Train a NN with Hidden Layers
• Unfortunately, the perceptron algorithm can’t be used to train neural nets with hidden layers, as we can’t directly observe the labels (targets) of the hidden units
• Instead, train neural nets with back propagation. Intuitively:
  • compute errors at the output layer w.r.t. each weight using partial differentiation
  • propagate those errors back to each of the input layers
• Essentially just gradient descent, but using the chain rule to make the calculations more efficient
Next lecture: Backpropagation for training neural networks
23

Reflections
When is Linear Classification Enough?
• If we know our classes are linearly (approximately) separable
• If the feature space is (very) high-dimensional, i.e., the number of features exceeds the number of training instances
• If the training set is small
• If interpretability is important, i.e., understanding how (combinations of) features explain different predictions
24

Pros and Cons of Neural Networks
Pros
• Powerful tool!
• Neural networks with at least 1 hidden layer can approximate any (continuous) function; they are universal approximators
• Automatic feature learning
• Empirically, very good performance for many diverse tasks
Cons
• A powerful model increases the danger of ‘overfitting’
• Requires large training data sets
• Often requires powerful compute resources (GPUs)
• Lack of interpretability
25

Summary
Today
• From perceptrons to neural networks
• multilayer perceptron
• some examples
• features and limitations
Next Lecture
• Learning parameters of neural networks
• The Backpropagation algorithm
26

References
Jacob Eisenstein (2019). Natural Language Processing. MIT Press. Chapters 3 (intro), 3.1, 3.2. https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf
Dan Jurafsky and James H. Martin. Speech and Language Processing. Chapters 7.2 and 7.3. Online Draft V3.0. https://web.stanford.edu/~jurafsky/slp3/
27
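To make the activation, output, and loss functions from the ‘Designing Neural Networks’ slides above concrete, here is a small numpy sketch (not part of the lecture); the example values are made up, and the softmax includes a standard max-subtraction that is not in the slide formula.

```python
import numpy as np

# Activation functions ("Designing Neural Networks II")
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)

def relu(x):
    return np.maximum(0.0, x)

# Softmax output for multiclass classification ("Designing Neural Networks IV")
def softmax(z):
    z = z - np.max(z)                 # subtract the max for numerical stability
    return np.exp(z) / np.sum(np.exp(z))

# Cross-entropy loss for one instance whose true class is j ("Designing Neural Networks V")
def cross_entropy(y_hat, j):
    return -np.log(y_hat[j])

# Mean squared error for regression ("Designing Neural Networks V")
def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

z = np.array([2.0, -1.0, 0.5])        # made-up pre-final layer outputs for 3 classes
y_hat = softmax(z)
print(y_hat, cross_entropy(y_hat, j=0))
print(relu(np.array([-1.0, 0.0, 2.0])), tanh(0.5), mse(np.array([1.0, 2.0]), np.array([0.5, 2.5])))
```

The max-subtraction inside softmax changes nothing mathematically but avoids overflow for large values of z.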
Multilayer Perceptron: Motivation II
Another Example Problem: Sentiment analysis of movie reviews
[Schematic, built up over several slides: the inputs are indicator features such as Contains ‘good’, Contains ‘awful’, Length > 3, Contains ‘actor’, Contains ‘story’, Contains ‘plot’, …, Contains ‘ending’, and ‘Good?’ is the prediction target.
(1) The features x1, x2, …, xF feed a single unit f(x;θ;b) directly: input layer, 1 unit, output layer.
(2) The features are first mapped to hand-designed intermediate concepts (word sentiment, reviews actors, summarizes plot, reasonably long), which then predict ‘Good?’.
(3) The intermediate concepts are unknown (???): rather than designing them, we want to learn them.
(4) The features feed one hidden layer of units f(x;θ;b), which feeds the output: input layer, 1 hidden layer, output layer.
(5) The same with two hidden layers: input layer, 2 hidden layers, output layer.]
28