Lecture 15: Neural Networks
Introduction to Machine Learning Semester 1, 2022
Copyright © University of Melbourne 2022. All rights reserved. No part of this publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
So far … Classification and Evaluation
• KNN, Naive Bayes, Logistic Regression, Perceptron
• Probabilistic models
• Loss functions, and estimation
• Evaluation
Today… Neural Networks
• Motivation and architecture
• Linear vs. non-linear classifiers
Introduction
Classifier Recap
Perceptron
$\hat{y} = f(\theta \cdot x) = \begin{cases} 1 & \text{if } \theta \cdot x \ge 0 \\ -1 & \text{otherwise} \end{cases}$
• Single processing ‘unit’
• Inspired by neurons in the brain
• Activation: step-function (discrete, non-differentiable)
Logistic Regression
$P(y = 1 \mid x; \theta) = \dfrac{1}{1 + \exp\left(-\sum_{f=0}^{F} \theta_f x_f\right)}$
• View 1: Model of $P(y = 1 \mid x)$, maximizing the data log likelihood
• View 2: Single processing ‘unit’
• Activation: sigmoid (continuous, differentiable)
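To make the two decision rules concrete, here is a minimal sketch (not from the slides) of both units in NumPy, assuming the weight vector θ includes the bias as its first entry and x carries a leading 1:

```python
import numpy as np

def perceptron_predict(theta, x):
    """Step activation: returns +1 if theta . x >= 0, else -1."""
    return 1 if np.dot(theta, x) >= 0 else -1

def logistic_predict(theta, x):
    """Sigmoid activation: returns P(y = 1 | x; theta)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

# Toy example: x includes a leading 1 so theta[0] acts as the bias.
theta = np.array([-0.5, 1.0, 1.0])
x = np.array([1.0, 0.3, 0.4])
print(perceptron_predict(theta, x))   # a hard decision: -1 or +1
print(logistic_predict(theta, x))     # a probability in (0, 1)
```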
Neural Networks and Deep Learning
Neural Networks
• Connected sets of many such units
• Units must have continuous activation functions
• Connected into many layers → Deep Learning
Multi-layer Perceptron
• This lecture!
• One specific type of neural network
• Feed-forward
• Fully connected
• Supervised learner
Other types of neural networks
• Convolutional neural networks
• Recurrent neural networks
• Autoencoder (unsupervised)
Perceptron Unit (recap)
A single processing unit
[Diagram: inputs 1, x1, x2, … are combined with input weights θ and passed through an activation f to produce the output $y = f(\theta^T x)$]
A neural network is a combination of lots of these units.
Multi-layer Perceptron (schematic)
Three Types of layers
• Input layer with input units x: the first layer, takes features x as inputs
• Output layer with output units y: the last layer, has one unit per possible output (e.g., 1 unit for binary classification)
• Hidden layers with hidden units h: all layers in between.
[Schematic: Input layer → Hidden layer 1 → Hidden layer 2 → Output layer]
Why the Hype? I
Linear classification
• The perceptron, naive Bayes, and logistic regression are linear classifiers
• Decision boundary is a linear combination of features: $\sum_i \theta_i x_i$
• Cannot learn ‘feature interactions’ naturally
• Perceptron can solve only linearly separable problems
Non-linear classification
• Neural networks with at least 1 hidden layer and non-linear activations are non-linear classifiers
• Decision boundary is a non-linear function of the inputs
• Capture ‘feature interactions’
Why the Hype? II
Feature Engineering
• (more next week!)
• The perceptron, naive Bayes and logistic regression require a fixed set of informative features
• e.g., outlook ∈ {overcast, sunny, rainy}, wind ∈ {high, low}, etc.
• This requires domain knowledge
Feature learning
• Neural networks take as input ‘raw’ data
• They learn features themselves as intermediate representations
• They learn features as part of their target task (e.g., classification)
• ‘Representation learning’: learning representations (or features) of the data that are useful for the target task
• Note: feature engineering is often replaced at the cost of additional parameter tuning (layers, activations, learning rates, …)
Motivation I

Example Classification dataset

Outlook    Temperature  Humidity  Windy  True Label
sunny      hot          high      FALSE  no
overcast   hot          high      TRUE   yes
rainy      mild         high      FALSE  yes
…          …            …         …      …

We really observe raw data (measurements):

[Table: one row of numeric measurements per date (01/03/1966, 01/04/1966, 01/05/1966, 01/06/1966, …), e.g., 0.4, 4.7, 3.4, −0.7, 0.3, 8.7, 5.5, 5.7, …, each with a True Label (no, yes, yes, …)]
Motivation II

Example Problem: Weather Dataset

[Schematic build-up: the weather features (e.g., temperature) feed input units x1, x2, x3, x4, …, xF; the network is shown first with an input layer and output layer only, then with 1 hidden layer, then with 2 hidden layers.]
Another Example: Face Recognition
• the hidden layers learn increasingly high-level feature representations
• e.g., given an image, predict the person:
Source: https://devblogs.nvidia.com/accelerate-machine-learning-cudnn-deep-neural-network-library/
Terminology
• input units xj , one per feature j
• Multiple layers l = 1…L of nodes. L is the depth of the network.
• Each layer l has a number of units Kl . Kl is the width of layer l.
• The width can vary from layer to layer
• output unit y
• Each layer l is fully connected to its neighboring layers l − 1 and l + 1
• one weight $\theta_{ij}^{(l)}$ for each connection i → j (including a ‘bias’ $\theta_0$)
• a non-linear activation function $\phi^{(l)}$ for each layer l
Prediction with a Feedforward Network
Passing an input through a neural network with 2 hidden layers:
$h_i^{(1)} = \phi^{(1)}\left(\sum_j \theta_{ij}^{(1)} x_j\right)$
$h_i^{(2)} = \phi^{(2)}\left(\sum_j \theta_{ij}^{(2)} h_j^{(1)}\right)$
$y_i = \phi^{(3)}\left(\sum_j \theta_{ij}^{(3)} h_j^{(2)}\right)$
Or in vectorized form:
$h^{(1)} = \phi^{(1)}\left(\theta^{(1)T} x\right)$
$h^{(2)} = \phi^{(2)}\left(\theta^{(2)T} h^{(1)}\right)$
$y = \phi^{(3)}\left(\theta^{(3)T} h^{(2)}\right)$
where the activation functions $\phi^{(l)}$ are applied element-wise to all entries
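As an illustration of the vectorized form, a minimal NumPy sketch of the forward pass is given below. The layer sizes, random weights, and the choice of sigmoid for every $\phi^{(l)}$ are assumptions for the example, not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas, activations):
    """Pass input x through the layers: h = phi(theta^T h_prev) at each layer."""
    h = x
    for theta, phi in zip(thetas, activations):
        h = phi(theta.T @ h)
    return h

# Toy network: 4 inputs -> 3 hidden -> 2 hidden -> 1 output, with random weights.
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2)), rng.normal(size=(2, 1))]
activations = [sigmoid, sigmoid, sigmoid]
x = np.array([0.4, 4.7, 3.4, -0.7])
print(forward(x, thetas, activations))  # the network output, here a single value in (0, 1)
```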
Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?
x1  x2  y
1   1   1
1   0   0
0   1   0
0   0   0
Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?
x1  x2  y
1   1   1
1   0   1
0   1   1
0   0   0
Boolean Functions
1. Can the perceptron learn this function? Why (not)?
2. Can a multilayer perceptron learn this function? Why (not)?
x1  x2  y
1   1   0
1   0   1
0   1   1
0   0   0
A Multilayer Perceptron for XOR
$\phi(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$   and recall:   $h_i^{(l)} = \phi^{(l)}\left(\sum_j \theta_{ij}^{(l)} h_j^{(l-1)} + b_i^{(l)}\right)$
Source: https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L05%20Multilayer%20Perceptrons.pdf
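The following sketch spells out one such construction in NumPy: the hidden units compute OR and AND of the inputs with the step activation above, and the output unit subtracts them to obtain XOR. The specific weights and biases are one possible choice, assumed for illustration:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

# Hidden layer: h1 fires for OR(x1, x2), h2 fires for AND(x1, x2).
theta1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output layer: y fires when OR is on but AND is off, i.e. XOR.
theta2 = np.array([[1.0], [-1.0]])
b2 = np.array([-0.5])

for x1 in (0.0, 1.0):
    for x2 in (0.0, 1.0):
        x = np.array([x1, x2])
        h = step(theta1.T @ x + b1)
        y = step(theta2.T @ h + b2)
        print(x1, x2, "->", y[0])   # prints the XOR truth table
```

Running it reproduces the XOR truth table from the previous slide.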
Designing Neural Networks I: Inputs
Inputs and feature functions
• x could be a patient with features {blood pressure, height, age, weight, ...}
• x could be a text, i.e., a sequence of words
• x could be an image, i.e., a matrix of pixels
Non-numerical features need to be mapped to numerical values
• For language, it is typical to map words to pre-trained embedding vectors
• Alternative: 1-hot encoding
• for 1-hot: dim(x) = V (words in the vocabulary)
• for embedding: dim(x) = k, the dimensionality of the embedding vectors
• For pixels, map to RGB, or other visual features
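As a concrete illustration of 1-hot encoding for categorical features like those in the weather example, here is a minimal sketch; the feature vocabularies and helper name are assumptions for illustration:

```python
import numpy as np

# Hypothetical feature vocabularies for the weather example.
FEATURES = {
    "outlook": ["overcast", "sunny", "rainy"],
    "wind": ["high", "low"],
}

def one_hot(instance):
    """Concatenate one 1-hot block per categorical feature."""
    blocks = []
    for name, values in FEATURES.items():
        block = np.zeros(len(values))
        block[values.index(instance[name])] = 1.0
        blocks.append(block)
    return np.concatenate(blocks)

print(one_hot({"outlook": "sunny", "wind": "low"}))  # [0. 1. 0. 0. 1.]
```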
Designing Neural Networks II: Activation Functions
• Each layer has an associated activation function (e.g., sigmoid, ReLU, ...)
• Represents the extent to which a neuron is ‘activated’ given an input
• Each hidden layer performs a non-linear transformation of the input
• Popular choices include
1. logistic (aka sigmoid) (“σ”): $f(x) = \dfrac{1}{1 + e^{-x}}$
2. hyperbolic tan (“tanh”):
$f(x) = \dfrac{e^{2x} - 1}{e^{2x} + 1}$
3. rectified linear unit (“ReLU”):
f(x) = max(0,x)
note not differentiable at x = 0
[Plots: function values and derivatives of the sigmoid, tanh, and ReLU activations]
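A minimal NumPy sketch of the three activations and their derivatives (as plotted above); using 0 for the ReLU derivative at x = 0 is a common convention assumed here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # derivative: sigma(x) * (1 - sigma(x))

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # derivative: 1 - tanh(x)^2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)  # not defined at x = 0; we pick 0 by convention

xs = np.linspace(-3, 3, 7)
print(sigmoid(xs), d_sigmoid(xs))
print(tanh(xs), d_tanh(xs))
print(relu(xs), d_relu(xs))
```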
Designing Neural Networks III: Structure
Network Structure
• Sequence of hidden layers $l_1, \dots, l_L$ for a network of depth L
• Each layer l has Kl parallel neurons (breadth)
• Many layers (depth) vs. many neurons per layer (breadth)? Empirical question, theoretically poorly understood.
Advanced tricks allow for exploiting structure in the data
• convolutions (convolutional neural networks; CNN), Computer Vision
• recurrencies (recurrent neural networks; RNN), Natural Language Processing
• attention (efficient alternative to recurrencies)
Beyond the scope of this class.
Designing Neural Networks IV: Output Function
Neural networks can learn different concepts: classification, regression, ... The output function depends on the concept of interest.
• Binary classification:
• one neuron, with step function (as in the perceptron)
• Multiclass classification:
• typically softmax to normalize K outputs from the pre-final layer into a probability distribution over classes
$p(y_i = j \mid x_i; \theta) = \dfrac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}$
• Regression:
• identity function
• possibly other continuous functions such as sigmoid or tanh
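A minimal sketch of the softmax output function, assuming z holds the K pre-final-layer outputs; subtracting the maximum is a standard numerical-stability trick not shown on the slide:

```python
import numpy as np

def softmax(z):
    """Normalize K raw scores into a probability distribution over classes."""
    z = z - np.max(z)          # for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # hypothetical pre-final-layer outputs for K = 3 classes
print(softmax(z))              # sums to 1, roughly [0.66 0.24 0.10]
```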
Designing Neural Networks V: Loss Functions
Classification Loss: typically negative conditional log-likelihood (cross-entropy)

$\mathcal{L}_i = -\log p(y^{(i)} \mid x^{(i)}; \theta)$   for a single instance i
$\mathcal{L} = -\sum_i \log p(y^{(i)} \mid x^{(i)}; \theta)$   for all instances

• Binary classification loss
$\hat{y}^{(i)} = p(y^{(i)} = 1 \mid x^{(i)}; \theta)$
$\mathcal{L} = -\left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$

• Multiclass classification
$\hat{y}_j^{(i)} = p(y^{(i)} = j \mid x^{(i)}; \theta)$
$\mathcal{L} = -\sum_j y_j^{(i)} \log(\hat{y}_j^{(i)})$
for j possible labels; $y_j^{(i)} = 1$ if j is the true label for instance i, else 0.
Designing Neural Networks V: Loss Functions
Regression Loss: typically mean-squared error (MSE)
• Here, the output, as well as the target are real-valued numbers
$\mathcal{L} = \dfrac{1}{N}\sum_{i=1}^{N} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$
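A minimal sketch of these losses for single instances, assuming NumPy arrays; the small epsilon guarding against log(0) is an implementation assumption:

```python
import numpy as np

EPS = 1e-12  # guard against log(0)

def binary_cross_entropy(y, y_hat):
    """Binary loss: y in {0, 1}, y_hat = predicted P(y = 1 | x)."""
    return -(y * np.log(y_hat + EPS) + (1 - y) * np.log(1 - y_hat + EPS))

def cross_entropy(y_onehot, y_hat):
    """Multiclass loss: y_onehot has a 1 at the true class, y_hat is a distribution."""
    return -np.sum(y_onehot * np.log(y_hat + EPS))

def mse(y, y_hat):
    """Regression loss: mean-squared error over real-valued targets."""
    return np.mean((y - y_hat) ** 2)

print(binary_cross_entropy(1.0, 0.9))                             # small loss: confident and correct
print(cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))
print(mse(np.array([1.5, 2.0]), np.array([1.0, 2.5])))            # 0.25
```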
Representational Power of Neural Nets
• The universal approximation theorem states that a feed-forward neural network with a single hidden layer (and finite neurons) is able to approximate any continuous function on Rn
• Note that the activation functions must be non-linear, as without this, the model is simply a (complex) linear model
How to Train a NN with Hidden Layers
• Unfortunately, the perceptron algorithm can’t be used to train neural nets with hidden layers, as we can’t directly observe the labels
• Instead, train neural nets with back propagation. Intuitively:
• compute errors at the output layer wrt each weight using partial differentiation
• propagate those errors back through each of the hidden layers towards the input layer
• Essentially just gradient descent, but using the chain rule to make the calculations more efficient
Next lecture: Backpropagation for training neural networks
Reflections
When is Linear Classification Enough?
• If we know our classes are linearly (approximately) separable
• If the feature space is (very) high-dimensional
...i.e., the number of features exceeds the number of training instances
• If the training set is small
• If interpretability is important, i.e., understanding how (combinations of) features explain different predictions
Pros and Cons of Neural Networks
• Powerful tool!
• Neural networks with at least 1 hidden layer can approximate any (continuous) function. They are universal approximators
• Automatic feature learning
• Empirically, very good performance for many diverse tasks
• Powerful model increases the danger of ‘overfitting’
• Requires large training data sets
• Often requires powerful compute resources (GPUs)
• Lack of interpretability
Summary
• From perceptrons to neural networks
• multilayer perceptron
• some examples
• features and limitations
Next Lecture
• Learning parameters of neural networks
• The Backpropagation algorithm
References
Jacob Eisenstein (2019). Natural Language Processing. MIT Press. Chapters 3 (intro), 3.1, 3.2. https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf
Daniel Jurafsky and James H. Martin. Speech and Language Processing. Chapters 7.2, 7.3. Online Draft V3.0. https://web.stanford.edu/~jurafsky/slp3/
Motivation II

Another Example Problem: Sentiment analysis of movie reviews

Binary input features: Contains ‘good’, Contains ‘awful’, Length > 3, Contains ‘actor’, Contains ‘story’, Contains ‘plot’, Contains ‘ending’

[Schematic build-up: the features feed input units x1, x2, x3, x4, x5, x6, …, xF into an output unit computing f(x; θ; b); the network is shown first with an input layer, 1 unit, and output layer, then with 1 hidden layer, then with 2 hidden layers. The hidden units come to represent intermediate concepts such as ‘word sentiment’, ‘reasonably long’, ‘reviews actors’, ‘summarizes plot’.]
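A minimal sketch of mapping a raw review to the binary input features listed above; the whitespace tokenization and the reading of ‘Length > 3’ as ‘more than 3 tokens’ are assumptions for illustration:

```python
def review_features(text):
    """Map a raw review to the binary inputs x1..x7 used in the example."""
    tokens = text.lower().split()
    return [
        int("good" in tokens),      # Contains 'good'
        int("awful" in tokens),     # Contains 'awful'
        int(len(tokens) > 3),       # Length > 3 (here: more than 3 tokens)
        int("actor" in tokens),     # Contains 'actor'
        int("story" in tokens),     # Contains 'story'
        int("plot" in tokens),      # Contains 'plot'
        int("ending" in tokens),    # Contains 'ending'
    ]

print(review_features("good story but the ending was awful"))  # [1, 1, 1, 0, 1, 0, 1]
```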