
Foundations of Machine Learning Neural Networks
Kate Farrahi
ECS Southampton
November 23, 2020
1/20

The Multilayer Perceptron
2/20

Multilayer Perceptron
[Figure: a fully connected 4-5-2 MLP. Input layer: units x1, x2, x3, x4. Hidden layer: units h1-h5. Output layer: neurons o1, o2 with outputs ŷ1, ŷ2. Weights w_ji^(1) connect the input layer to the hidden layer; weights w_kj^(2) connect the hidden layer to the output layer.]
3/20

The Multilayer Perceptron
- MLPs are fully connected
- MLPs consist of three or more layers of nodes
- 1 input layer, 1 output layer, 1 or more hidden layers
- A 4-5-2 fully connected network is illustrated on the previous slide
4/20

The Multilayer Perceptron – Input Layer
- d-dimensional input x
- no neurons at the input layer, simply input units
- each input unit simply emits the input x_i
5/20

The Multilayer Perceptron – Output Layer
- c neurons in the output layer
- each neuron in the output layer also uses a non-linear activation function
- c and the activation function at the output layer relate to the problem that is being solved – more details later
6/20

Multilayer Perceptrons
[Figure: a single unit f(x; w, b) with parameters w, b; one layer of units; and multiple layers of units with parameters w^(1), b^(1), w^(2), b^(2), w^(3), b^(3) applied to input x. Source: EE-559 – Deep learning, 3.4. Multi-Layer Perceptrons]
7/20

Multilayer Perceptron (MLP)
We can define the MLP formally as,
∀ l = 1, 2, …, L:  a^(l) = σ(w^(l) a^(l−1) + b^(l)),  where a^(0) = x, and f(x; w, b) = a^(L) = ŷ
Note, we will define the weighted input term,
z^(l) = w^(l) a^(l−1) + b^(l), for the backpropagation derivation
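
As a concrete illustration, here is a minimal NumPy sketch of this forward pass (assuming sigmoid activations at every layer and randomly initialised parameters; the 4-5-2 layer sizes mirror the earlier figure, and all names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mlp_forward(x, weights, biases):
        # Forward pass: a^(l) = sigma(w^(l) a^(l-1) + b^(l)) for l = 1..L
        a = x  # a^(0) = x
        for w, b in zip(weights, biases):
            z = w @ a + b       # weighted input z^(l)
            a = sigmoid(z)      # activation a^(l)
        return a                # a^(L) = y_hat

    # Illustrative 4-5-2 network with random parameters
    rng = np.random.default_rng(0)
    sizes = [4, 5, 2]
    weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [rng.standard_normal(m) for m in sizes[1:]]
    y_hat = mlp_forward(rng.standard_normal(4), weights, biases)
    print(y_hat)  # two outputs in (0, 1)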
8/20

Multilayer Perceptron (MLP)
Define the following expressions:
1. z^(1)
2. z^(2)
3. a^(1)
4. a^(2)
5. a^(3)
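
For reference, unrolling the recurrence a^(l) = σ(w^(l) a^(l−1) + b^(l)) with a^(0) = x (assuming a network with L = 3 layers of weights) gives:

    z^(1) = w^(1) x + b^(1)
    z^(2) = w^(2) a^(1) + b^(2)
    a^(1) = σ(z^(1))
    a^(2) = σ(z^(2))
    a^(3) = σ(w^(3) a^(2) + b^(3)) = ŷ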
9/20

Activation Functions
10/20

Activation Functions
- The activation function in a neural network is a function used to transform the activation level of a unit (neuron) into an output signal.
- The activation function essentially divides the original space into typically two partitions, having a "squashing" effect.
- The activation function is usually required to be a non-linear function.
- The input space is mapped to a different space in the output.
- There have been many kinds of activation functions proposed over the years (640+); however, the most commonly used are the Sigmoid, Tanh, ReLU, and Softmax.
11/20

The Logistic (or Sigmoid) Activation Function
- The sigmoid function is a special case of a logistic function given by f(x) = 1 / (1 + e^(−x))
- non-linear (slope varies)
- continuously differentiable
- monotonically increasing
- NB: e is the base of the natural logarithm
12/20

Sigmoid Function – Derivative
- The sigmoid function has an easily calculated derivative, σ′(x) = σ(x)(1 − σ(x)), which is used in the backpropagation algorithm
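
A small NumPy sketch of the sigmoid and its derivative (the function names are illustrative), as used when implementing backpropagation by hand:

    import numpy as np

    def sigmoid(x):
        # Logistic sigmoid: f(x) = 1 / (1 + e^(-x))
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_prime(x):
        # Derivative: sigma'(x) = sigma(x) * (1 - sigma(x))
        s = sigmoid(x)
        return s * (1.0 - s)

    # The derivative peaks at x = 0 (value 0.25) and shrinks as |x| grows
    print(sigmoid_prime(0.0), sigmoid_prime(5.0))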
13/20

The Hyperbolic Tangent Activation Function
- The tanh function is also "s"-shaped like the sigmoidal function, but the output range is (-1, 1)
- tanh(x) = (1 − e^(−2x)) / (1 + e^(−2x))
- tanh′(x) = 1 − tanh²(x)
14/20

Rectified Linear Units (ReLU)
- The ReLU (used for hidden layer neurons) is defined as: f(x) = max(0, x)
- The range of the ReLU is [0, ∞)
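
A short NumPy sketch of the tanh and ReLU activations from the last two slides (np.tanh could be used directly; the explicit form is shown to match the slide):

    import numpy as np

    def tanh(x):
        # tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x))
        return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

    def relu(x):
        # ReLU: f(x) = max(0, x), applied elementwise
        return np.maximum(0.0, x)

    x = np.array([-2.0, 0.0, 3.0])
    print(tanh(x))   # values in (-1, 1)
    print(relu(x))   # [0. 0. 3.]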
15/20

Softmax
The softmax is an activation function used at the output layer of a neural network that forces the outputs to sum to 1 so that they can represent a probability distribution across a set of discrete, mutually exclusive alternatives.
y_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k), for j = 1, 2, …, K
Note: ∂y_i/∂z_i = y_i(1 − y_i)
- The output of a softmax layer is a set of positive numbers which sum up to 1 and can be thought of as a probability distribution
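
A minimal NumPy sketch of the softmax (subtracting max(z) before exponentiating is an implementation detail added here for numerical stability; it does not change the result):

    import numpy as np

    def softmax(z):
        # y_j = e^(z_j) / sum_k e^(z_k)
        shifted = z - np.max(z)     # avoids overflow in np.exp
        exp_z = np.exp(shifted)
        return exp_z / np.sum(exp_z)

    z = np.array([2.0, 1.0, 0.1])
    y = softmax(z)
    print(y, y.sum())  # positive values summing to 1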
16/20

Question
- Given a binary classification problem, can you have 1 neuron at the output layer of an MLP?
- If so, what is the neuron's activation function?
17/20

Question
- Given a binary classification problem, can you have 2 neurons at the output layer of an MLP?
- If so, what are the neurons' activation functions?
18/20

The Cost Function (measure of discrepancy)
- Mean Squared Error (MSE) for M data points is given by MSE = (1 / 2M) Σ_{i=1}^{M} (ŷ_i − y_i)² (a short code sketch follows below)
- The 1/(2M) is just a constant scaling factor, so it can be replaced by, for example, 1/2 or 1/M
- Other cost functions can be used as well, for example the KL divergence or Hellinger distance [1]
- MSE can be slow to learn, especially if the predictions are very far off the targets [2]
- Cross-Entropy Cost function is generally a better choice of cost function (discussed next)
[1] https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications
[2] http://neuralnetworksanddeeplearning.com/chap3.html
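
A minimal NumPy sketch of the MSE cost as defined above (names are illustrative):

    import numpy as np

    def mse_cost(y_hat, y):
        # MSE = (1 / 2M) * sum_i (y_hat_i - y_i)^2 over M data points
        m = len(y)
        return np.sum((y_hat - y) ** 2) / (2 * m)

    y = np.array([1.0, 0.0, 1.0, 0.0])
    y_hat = np.array([0.9, 0.2, 0.8, 0.1])
    print(mse_cost(y_hat, y))  # 0.0125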
19/20

Cross-Entropy Cost Function
- J = −(1/M) Σ_{i=1}^{M} [y_i ln(ŷ_i) + (1 − y_i) ln(1 − ŷ_i)]
where M is the number of training examples
- The cross-entropy cost function is non-negative, J ≥ 0
- J ≈ 0 when the predictions and targets are equal (i.e. y = 0 and ŷ = 0, or when y = 1 and ŷ = 1)
- ∂J/∂w_ij is proportional to the error in the output (ŷ − y); therefore, the larger the error, the faster the neuron will learn!
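
A minimal NumPy sketch of this cost (the epsilon clamp is an implementation detail added here to avoid log(0); it is not part of the formula above):

    import numpy as np

    def cross_entropy_cost(y_hat, y, eps=1e-12):
        # J = -(1/M) * sum_i [y_i ln(y_hat_i) + (1 - y_i) ln(1 - y_hat_i)]
        y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
        m = len(y)
        return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m

    y = np.array([1.0, 0.0, 1.0, 0.0])
    close = np.array([0.9, 0.1, 0.9, 0.1])
    far = np.array([0.1, 0.9, 0.1, 0.9])
    print(cross_entropy_cost(close, y))  # small cost (~0.105)
    print(cross_entropy_cost(far, y))    # large cost (~2.3)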
20/20