Neural Networks CMPUT 366: Intelligent Systems
GBC §6.0-6.4.1
Lecture Outline
1. Recap
2. Nonlinear models
3. Feedforward neural networks
Recap: Calculus
• Partial derivatives are derivatives of a "frozen" function: ∂/∂x f(x, y) = d/dx f(x, y) with y held fixed
• Gradient of a function is the vector of all its partial derivatives:
  (∇f)(x, y) = [∂/∂x f(x, y), ∂/∂y f(x, y)]
• Derivatives can be used for optimization
  • Minimization: Increase x if the derivative is negative & vice versa
Linear Models
• Supervised models we have considered so far have been linear:
  y = f(x; w) = g(w^T x) = g(∑_{i=1}^n w_i x_i)
  (w: weights, x: inputs, g: activation function)
  • Linear classification / regression
  • Logistic regression
• Advantages: Efficient to fit (closed form sometimes!)
• Disadvantages: Can be really limited
• Question: What else could we do?
Example: XOR
• The function f(x1, x2) = (x1 XOR x2) is not linearly separable
• There is no way to draw a straight line with all of the 1's on one side and all of the 0's on the other
• This means that no linear model can represent XOR exactly; there will always be some errors
[Figure 6.1, left (Goodfellow 2017): the four XOR points plotted in the original x space. When x1 = 0 the model's output must increase with x2, but when x1 = 1 it must decrease with x2, so no linear model fits.]
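To see the failure concretely, here is a small sketch (using NumPy; not from the slides) that fits the best least-squares linear model w0 + w1·x1 + w2·x2 to the four XOR points. The optimum is the constant 0.5, which is wrong by 0.5 at every point:

```python
import numpy as np

# Least-squares linear fit y ~ w0 + w1*x1 + w2*x2 to the four XOR points.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)   # columns: bias, x1, x2
y = np.array([0, 1, 1, 0], dtype=float)  # targets: x1 XOR x2

w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)      # [0.5, 0.0, 0.0]: the best linear fit is the constant 0.5
print(X @ w)  # predicts 0.5 everywhere, wrong by 0.5 at every point
```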
Nonlinear Features
• So far: y = f(x; w) = g(w^T x) = g(∑_{i=1}^n w_i x_i)
• One option: Learn a linear model on richer inputs
  1. Define a feature mapping φ(x) that returns functions of the original inputs
  2. Learn a linear model of the features instead of the inputs:
     y = f(x; w) = g(w^T φ(x)) = g(∑_{i=1}^n w_i [φ(x)]_i)
Nonlinear Features for XOR
• Question: What additional features would help?
• The product of x1 and x2!
• φ(x1, x2) = [1, x1, x2, x1·x2]
• With w = [−0.2, 0.5, 0.5, −2]:
  • f(x; w) = w^T φ(x) > 0 for (0,1) and (1,0)
  • f(x; w) = w^T φ(x) < 0 for (1,1) and (0,0)
[Figure 6.1 (Goodfellow 2017): Solving the XOR problem by learning a representation. Left: a linear model applied directly to the original x space cannot implement XOR. Right: in the learned h space (h1 = x1 + x2, h2 = x1·x2), both x = [1, 0]^T and x = [0, 1]^T map to a single point, so a linear model can now solve the problem.]
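A quick check (plain Python, not from the slides) that the feature map and the weight vector above do separate XOR — the score is positive exactly when x1 XOR x2 = 1:

```python
# phi(x1, x2) = [1, x1, x2, x1*x2] with w = [-0.2, 0.5, 0.5, -2]
# should give w . phi(x) > 0 iff x1 XOR x2 = 1.

def phi(x1, x2):
    return [1, x1, x2, x1 * x2]

w = [-0.2, 0.5, 0.5, -2]

def score(x1, x2):
    return sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, score(x1, x2), score(x1, x2) > 0)
```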
Learning Nonlinear Features
• Manually constructing good features is extremely hard
• Manually constructed features are not transferable between domains
  • e.g., SIFT features were a revolution in computer vision, but are only for computer vision
• Deep learning aims to learn φ automatically from the data
Neural Units
• Deep learning learns φ by composing little functions
• These functions are called units:
  h(x; w, b) = g(b + w^T x) = g(b + ∑_{i=1}^n w_i x_i)
  (b: offset, w: weights, g: activation function)
• Question: How is this different from a linear model?
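A single unit, sketched in plain Python (the ReLU activation here is one choice; the slides introduce it later):

```python
# One unit: h(x; w, b) = g(b + w . x), with a ReLU activation by default.
def unit(x, w, b, g=lambda z: max(0.0, z)):
    return g(b + sum(wi * xi for wi, xi in zip(w, x)))

# b + w.x = 0.1 + (0.5*1.0 - 0.25*2.0) = 0.1, and ReLU leaves it unchanged.
print(unit([1.0, 2.0], [0.5, -0.25], b=0.1))
```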
Feedforward Neural Network
• A neural network is many units composed together
• Feedforward neural network: Units arranged into layers
• Each layer takes the outputs of the previous layer as its inputs
[Diagram: inputs x1, x2 feed hidden units h1, h2, which feed output y]
Example: XOR network
• Activation: g(z) = max{0, z} ("rectified linear unit")
• Weights:
  • [+1, −1] for h1; [−1, +1] for h2
  • [+1, +1] for y
[Diagram: x1, x2 connect to h1 with weights +1, −1 and to h2 with weights −1, +1; h1 and h2 connect to y with weights +1, +1]
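Reading the weights off the slide (no offsets are needed in this variant), the network computes y = g(x1 − x2) + g(x2 − x1), which is exactly XOR on {0, 1} inputs:

```python
# The XOR network from the slide: two ReLU hidden units and a linear output.
def g(z):  # rectified linear unit
    return max(0.0, z)

def xor_net(x1, x2):
    h1 = g(+1 * x1 + -1 * x2)   # weights [+1, -1]
    h2 = g(-1 * x1 + +1 * x2)   # weights [-1, +1]
    return +1 * h1 + +1 * h2    # weights [+1, +1]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))  # 0, 1, 1, 0
```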
Matrix Representation
• You can think of the outputs of each layer as a vector h
• The weights from all the outputs of a previous layer to each of the units of the layer can be collected into a matrix W
• The offset term for each unit can be collected into a vector b:
  h = g(Wx + b)
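A minimal sketch of one layer as h = g(Wx + b), using plain Python lists (one row of W per unit); the example weights are the XOR network's hidden layer with zero offsets:

```python
# One layer: h = g(Wx + b). W has one row of weights per unit,
# b one offset per unit; g is applied elementwise.
def layer(W, b, x, g):
    return [g(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

relu = lambda z: max(0.0, z)

# Hidden layer of the XOR network: rows [+1, -1] and [-1, +1], zero offsets.
W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
print(layer(W, b, [1.0, 0.0], relu))  # [1.0, 0.0]
```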
Architecture
Design decisions:
1. Depth: number of layers
2. Width: number of nodes in each layer
3. Fully connected?
Universal Approximation Theorem
Theorem (Hornik et al. 1989; Cybenko 1989; Leshno et al. 1993):
A feedforward network with one hidden layer with a "squashing" activation or rectified linear activation and a linear output layer can approximate any function to within any given error bound, given enough hidden units.
• So a wide but shallow feedforward network can represent any function we're trying to learn!
• Question: Why bother with multiple layers? (i.e., depth > 1)
Training
• Neural networks are trained using variants of gradient descent
  • e.g., stochastic gradient descent
• Back propagation is an algorithm that allows for efficient computation of the gradient
• Modern frameworks can compute the gradient in other ways (e.g., automatic differentiation) even for complicated units
Hidden Unit Activations
• Default choice: Rectified linear units (ReLU): g(z) = max{0, z}
• Other common types:
  • tanh(z)
  • 1 / (1 + e^{−z}) (sigmoid)
• Sigmoid suffers from vanishing gradients; ReLU does not
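The vanishing-gradient claim is easy to check numerically: the sigmoid's derivative σ'(z) = σ(z)(1 − σ(z)) shrinks toward zero for large |z|, while the ReLU's derivative stays at 1 for any positive input (a sketch, not from the slides):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_grad(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)); at most 0.25, tiny for large |z|
    s = sigmoid(z)
    return s * (1 - s)

def relu_grad(z):
    # Derivative of max{0, z}: 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0

for z in (0, 5, 10):
    print(z, sigmoid_grad(z), relu_grad(z))
```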
Summary
• Generalized linear models are insufficiently expressive
• Composing GLMs into a network is arbitrarily expressive
  • A neural network with a single hidden layer can approximate any function
  • But the network might need to be impractically large, prone to overfitting, or inefficient to train
• Neural networks are trained using variants of gradient descent
• Architectural choices can make a network easier to train, less prone to overfitting