Foundations of Machine Learning Neural Networks
Kate Farrahi
ECS Southampton
November 23, 2020
1/20
The Multilayer Perceptron
2/20
Multilayer Perceptron
[Figure: a fully connected MLP with an input layer (x1, x2, x3, x4), a hidden layer (h1, ..., h5), and an output layer of two neurons producing ŷ1 and ŷ2; first-layer weights are labelled w(1)_ji and second-layer weights w(2)_kj.]
3/20
The Multilayer Perceptron
MLPs are fully connected
MLPs consist of three or more layers of nodes
1 input layer, 1 output layer, 1 or more hidden layers
A 4 – 5 – 2 fully connected network is illustrated on the previous slide
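As a concrete (illustrative, not from the slides) view of what "4 – 5 – 2" means in terms of parameters, the sketch below builds the weight matrices and bias vectors of such a network with NumPy; the variable names (W1, b1, W2, b2) are ours.

```python
import numpy as np

# A 4-5-2 fully connected MLP: 4 inputs, 5 hidden units, 2 outputs.
rng = np.random.default_rng(0)

W1 = rng.standard_normal((5, 4))   # w(1): hidden-layer weights, shape (5, 4)
b1 = np.zeros(5)                   # b(1): hidden-layer biases
W2 = rng.standard_normal((2, 5))   # w(2): output-layer weights, shape (2, 5)
b2 = np.zeros(2)                   # b(2): output-layer biases

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 5*4 + 5 + 2*5 + 2 = 37 trainable parameters
```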
4/20
The Multilayer Perceptron – Input Layer
d-dimensional input x
no neurons at the input layer, simply input units
each input unit simply emits the input $x_i$
5/20
The Multilayer Perceptron – Output Layer
c neurons in the output layer
each neuron in the output layer also uses a non-linear activation function
c and the activation function at the output layer relate to the problem that is being solved – more details later
6/20
Multilayer Perceptrons
[Figure: a single unit with parameters w, b; one layer of units; and multiple layers of units with parameters w(1), b(1), w(2), b(2), w(3), b(3), mapping an input x to an output f(x; w, b).]
(Figure from EE-559 – Deep learning / 3.4. Multi-Layer Perceptrons)
7/20
Multilayer Perceptron (MLP)
We can define the MLP formally as,
$$\forall l = 1, 2, \ldots, L: \quad a^{(l)} = \sigma\left(w^{(l)} a^{(l-1)} + b^{(l)}\right)$$
where $a^{(0)} = x$, and $f(x; w, b) = a^{(L)} = \hat{y}$
Note, we will define the weighted input term,
$z^{(l)} = w^{(l)} a^{(l-1)} + b^{(l)}$, for the backpropagation derivation
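A minimal NumPy sketch of this forward recursion is given below, assuming a sigmoid activation at every layer and the 4 – 5 – 2 architecture from earlier; the helper names (sigmoid, forward) and the random parameter values are illustrative, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute a(l) = sigma(w(l) a(l-1) + b(l)) for l = 1..L, with a(0) = x."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b        # weighted input z(l)
        a = sigmoid(z)       # activation a(l)
    return a                 # a(L) = y_hat

rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((2, 5))]
biases = [np.zeros(5), np.zeros(2)]

x = np.array([0.5, -1.2, 3.0, 0.1])
y_hat = forward(x, weights, biases)
print(y_hat)  # two outputs, each in (0, 1)
```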
8/20
Multilayer Perceptron (MLP)
Define the following expressions:
1. $z^{(1)}$
2. $z^{(2)}$
3. $a^{(1)}$
4. $a^{(2)}$
5. $a^{(3)}$
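Applying the recursion from the previous slide to a three-layer network with parameters $w^{(1)}, b^{(1)}, w^{(2)}, b^{(2)}, w^{(3)}, b^{(3)}$ (and assuming the same activation $\sigma$ at every layer), one possible worked answer is:

$z^{(1)} = w^{(1)} x + b^{(1)}$
$z^{(2)} = w^{(2)} a^{(1)} + b^{(2)}$
$a^{(1)} = \sigma(z^{(1)})$
$a^{(2)} = \sigma(z^{(2)})$
$a^{(3)} = \sigma(w^{(3)} a^{(2)} + b^{(3)}) = f(x; w, b) = \hat{y}$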
9/20
Activation Functions
10/20
Activation Functions
The activation function in a neural network is a function used to transform the activation level of a unit (neuron) into an output signal.
The activation function essentially divides the original space into (typically) two partitions, having a "squashing" effect.
The activation function is usually required to be a non-linear function.
The input space is mapped to a different space in the output.
There have been many kinds of activation functions proposed over the years (640+); however, the most commonly used are the Sigmoid, Tanh, ReLU, and Softmax
11/20
The Logistic (or Sigmoid) Activation Function
The sigmoid function is a special case of the logistic function, given by $f(x) = \frac{1}{1 + e^{-x}}$
non-linear (slope varies)
continuously differentiable
monotonically increasing
NB: $e$ is the base of the natural logarithm (Euler's number)
12/20
Sigmoid Function – Derivative
The sigmoid function has an easily calculated derivative, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, which is used in the backpropagation algorithm
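As a quick sanity check of this derivative (our illustration, not from the slides), the sketch below compares $\sigma(x)(1-\sigma(x))$ against a finite-difference approximation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Analytic derivative: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)  # central difference
print(np.allclose(sigmoid_prime(x), numeric, atol=1e-8))       # True
```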
13/20
The Hyperbolic Tangent Activation Function
The tanh function is also "s"-shaped like the sigmoid function, but the output range is (-1, 1)
$$\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$$
$$\tanh'(x) = 1 - \tanh^2(x)$$
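A small illustrative check (ours, not from the slides) that these two expressions agree with NumPy's built-in tanh:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 13)

# tanh(x) = (1 - e^{-2x}) / (1 + e^{-2x})
tanh_manual = (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))
print(np.allclose(tanh_manual, np.tanh(x)))            # True

# tanh'(x) = 1 - tanh^2(x), checked against a central difference
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2.0 * eps)
print(np.allclose(1.0 - np.tanh(x) ** 2, numeric))     # True
```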
14/20
Rectified Linear Units (ReLU)
The ReLU (used for hidden layer neurons) is defined as $f(x) = \max(0, x)$
The range of the ReLU is $[0, \infty)$
15/20
Softmax
The softmax is an activation function used at the output layer of a neural network; it forces the outputs to sum to 1 so that they can represent a probability distribution over a set of discrete, mutually exclusive alternatives.
$$y_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad \text{for } j = 1, 2, \ldots, K$$
Note that $\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$
The output of a softmax layer is a set of positive numbers which sum to 1 and can be thought of as a probability distribution
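A minimal sketch of a softmax in NumPy, verifying that the outputs are positive and sum to 1. Subtracting max(z) before exponentiating is a standard numerical-stability trick and is our addition, not something stated on the slide:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y)              # approx [0.659, 0.242, 0.099] - all positive
print(np.sum(y))      # 1.0
```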
16/20
Question
Given a binary classification problem, can you have 1 neuron at the output layer of an MLP?
If so, what is the neuron’s activation function?
17/20
Question
Given a binary classification problem, can you have 2 neurons at the output layer of an MLP?
If so, what are the neurons' activation functions?
18/20
The Cost Function (measure of discrepancy)
Mean Squared Error (MSE) for $M$ data points is given by
$$\text{MSE} = \frac{1}{2M} \sum_{i=1}^{M} (\hat{y}_i - y_i)^2$$
$\frac{1}{2M}$ is just a constant, so it can be replaced by $\frac{1}{2}$ or $\frac{1}{M}$ (see the worked example below)
Other cost functions can be used as well, for example the KL divergence or Hellinger distance [1]
MSE can be slow to learn, especially if the predictions are very far off the targets [2]
The Cross-Entropy cost function is generally a better choice of cost function (discussed next)
[1] https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications
[2] http://neuralnetworksanddeeplearning.com/chap3.html
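A small worked example of the MSE cost (with the $\frac{1}{2M}$ convention above) on made-up predictions and targets; the numbers and the helper name mse are purely illustrative:

```python
import numpy as np

def mse(y_hat, y):
    # MSE = 1/(2M) * sum_i (y_hat_i - y_i)^2
    m = len(y)
    return np.sum((y_hat - y) ** 2) / (2.0 * m)

y = np.array([1.0, 0.0, 1.0, 1.0])       # targets
y_hat = np.array([0.9, 0.2, 0.8, 0.4])   # predictions
print(mse(y_hat, y))  # (0.01 + 0.04 + 0.04 + 0.36) / 8 = 0.05625
```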
19/20
Cross-Entropy Cost Function
$$J = -\frac{1}{M} \sum_{i=1}^{M} \left[ y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i) \right]$$
where M is the number of training examples
The cross-entropy cost function is non-negative, $J \geq 0$
$J \approx 0$ when the predictions and targets are equal (i.e. $y_i = 0$ and $\hat{y}_i \approx 0$, or $y_i = 1$ and $\hat{y}_i \approx 1$)
$\frac{\partial J}{\partial w_{ij}}$ is proportional to the error in the output $(\hat{y} - y)$; therefore, the larger the error, the faster the neuron will learn!
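A hedged sketch illustrating this point for a single sigmoid output neuron: with the cross-entropy cost, the gradient with respect to the neuron's weighted input works out to $\hat{y} - y$, so a badly wrong prediction gives a large gradient (consistent with the slide). The function names and numbers are ours, for demonstration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_hat, y):
    # J = -1/M * sum_i [ y_i ln(y_hat_i) + (1 - y_i) ln(1 - y_hat_i) ]
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# One training example with target y = 1 and a very wrong prediction.
y = 1.0
z = -4.0                 # weighted input of the output neuron
y_hat = sigmoid(z)       # ~0.018, far from the target

print(cross_entropy(y_hat, y))   # large cost (~4.02)
print(y_hat - y)                 # dJ/dz = y_hat - y ~ -0.982: large error, large gradient
```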
20/20